|Previous||Table of Contents||Next|
|The Pentium, on the other hand, has two separate 8K caches, one for code and one for data, so code prefetches can never collide with data fetches; the prefetch queue can stall only when the code being fetched isnt in the internal code cache.|
(And yes, self-modifying code still works; as with all Pentium changes, the dual caches introduce no incompatibilities with 386/486 code.) Also, because the code and data caches are separate, code cant be driven out of the cache in a tight loop that accesses a lot of data, unlike the 486. In addition, the Pentium expands the 486s 32-byte prefetch queue to 128 bytes. In conjunction with the branch prediction feature (described next), which allows the Pentium to prefetch properly at most branches, this larger prefetch queue means that the Pentiums two pipes should be better fed than those of any previous x86 processor.
There are three other characteristics of the Pentium that make for a healthy supply of instruction bytes. One is that the Pentium can prefetch instructions across cache lines. Unlike the 486, where there is a 3-cycle penalty for branching to an instruction that spans a cache line, theres no such penalty on the Pentium. The second is that the cache line size (the number of bytes fetched from the external cache or main memory on a cache miss) on the Pentium is 32 bytes, twice the size of the 486s cache line, so a cache miss causes a longer run of instructions to be placed in the cache than on the 486. The third is that the Pentiums external bus is twice as wide as the 486s, at 64 bits, and runs twice as fast, at 66 MHz, so the Pentium can fetch both instruction and data bytes from the external cache four times as fast as the 486.
|Even when the Pentium is running flat-out with both pipes in use, it can generally consume only about twice as many bytes as the 486; so the ratio of external memory bandwidth to processing power is much improved, although real-world performance is heavily dependent on the size and speed of the external cache.|
The upshot of all this is that at the same clock speed, with code and data that are mostly in the internal caches, the Pentium maxes out somewhere around twice as fast as a 486. (When the caches are missed a lot, the Pentium can get as much as three to four times faster, due to the superior external bus and bigger caches.) Most of this wont affect how you program, but it is useful to know that you dont have to worry about instruction fetching. Its also useful to know the sizes of the caches because a high cache hit rate is crucial to Pentium performance. Cache misses are vastly slower than cache hits (anywhere from two to 50 or more times as slow, depending on the speed of the external cache and whether the external cache misses as well), and the Pentium cant use the V-pipe on code that hasnt already been executed out of the cache at least once. This means that it is very important to get the working sets of critical loops to fit in the internal caches.
One change in the Pentium that you definitely do have to worry about is superscalar execution. Utilization of the V-pipe can range from near zero percent to 100 percent, depending on the code being executed, and careful rearrangement of code can have amazing effects. Maxing out V-pipe use is not a trivial task; Ill spend all of the next chapter discussing it so as to have time to cover it properly. In the meantime, two good references for superscalar programming and other Pentium information are Intels Pentium Processor Users Manual: Volume 3: Architecture and Programming Manual (ISBN 1-55512-195-0; Intel order number 241430-001), and the article Optimizing Pentium Code by Mike Schmidt, in Dr. Dobbs Journal for January 1994.
There are two other interesting changes in the Pentiums cache organization. First, the cache is two-way set-associative, whereas the 486 is four-way set-associative. The details of this dont matter, but simply put, this, combined with the 32-byte cache line size, means that the Pentium has somewhat coarser granularity in both space and time than the 486 in terms of packing bytes into the cache, although the total cache space is now bigger. Theres nothing you can do about this, but it may make it a little harder to get a loops working set into the cache. Second, the internal cache can now be configured (by the BIOS or OS; you wont have to worry about it) for write-back rather than write-through operation. This means that writes to the internal data cache dont necessarily get propagated to the external bus until other demands for cache space force the data out of the cache, making repeated writes to memory variables such as loop counters cheaper on average than on the 486, although not as cheap as registers.
As a final note on Pentium architecture for this chapter, the pipeline stalls (what Intel calls AGIs, for Address Generation Interlocks) that I discussed earlier in this book (see Chapter 12) are still present in the Pentium. In fact, theyre there in spades on the Pentium; the two pipelines mean that an AGI can now slow down execution of an instruction thats three instructions away from the AGI (because four instructions can execute in two cycles). So, for example, the code sequence
add edx,4 ;U-pipe cycle 1 mov ecx,[ebx] ;V-pipe cycle 1 add ebx,4 ;U-pipe cycle 2 mov [edx],ecx ;V-pipe cycle 3 ; due to AGI ; (would have been ; V-pipe cycle 2)
takes three cycles rather than the two cycles it should take, because EDX was modified on cycle 1 and an attempt was made to use it on cycle two, before the AGI had time to cleareven though there are two instructions between the instructions that are actually involved in the AGI. Rearranging the code like
mov ecx,[ebx] ;U-pipe cycle 1 add ebx,4 ;V-pipe cycle 1 mov [edx+4],ecx ;U-pipe cycle 2 add edx,4 ;V-pipe cycle 2
makes it functionally identical, but cuts the cycles to 2a 50 percent improvement. Clearly, avoiding AGIs becomes a much more challenging and rewarding game in a superscalar world, one to which Ill return in the next chapter.
|Previous||Table of Contents||Next|