|Previous||Table of Contents||Next|
One brand-spanking-new feature of the Pentium is branch prediction, whereby the Pentium tries to guess, based on past history, which way (or, for conditional jumps, whether or not), your code will jump at each branch, and prefetches along the likelier path. If the guess is correct, the branch or fall-through takes only 1 cycle2 cycles less than a branch and the same as a fall-through on the 486; if the guess is wrong, the branch or fall-through takes 4 or 5 cycles (if it executes in the U- or V-pipe, respectively)1 or 2 cycles more than a branch and 3 or 4 cycles more than a fall-through on the 486.
|Branch prediction is unprecedented in the x86, and fundamentally alters the nature of pedal-to-the-metal optimization, for the simple reason that it renders unrolled loops largely obsolete. Rare indeed is the loop that cant afford to spare even 1 or 0 (yes, zero!) cycles per iteration for loop counting, and thats how low the cost can go for maintaining a loop on the Pentium.|
Also, unrolled loops are bigger than normal loops, so there are extra (and expensive) cache misses the first time through the loop if the entire loop isnt already in the cache; then, too, an unrolled loop will shoulder other code out of the internal and external caches. If in a critical loop you absolutely need the time taken by the loop control instructions, or if you need an extra register that can be freed by unrolling a loop, then by all means unroll the loop. Dont expect the sort of speed-up you get from this on the 486 or especially the 386, though, and watch out for the cache effects.
You may well wonder exactly when the Pentium correctly predicts branching. Alas, this is one area that Intel has declined to document, beyond saying that you should endeavor to fall through branches when you have a choice. Thats good advice on every other x86 processor, anyway, so its well worth following. Also, its a pretty safe bet that in a tight loop, the Pentium will start guessing the right branch direction at the bottom of the loop pretty quickly, so you can treat loop branches as one-cycle instructions.
Its an equally safe bet that its a bad move to have in a loop a conditional branch that goes both ways on a random basis; its hard to see how the Pentium could consistently predict such branches correctly, and mispredicted branches are more expensive than they might appear to be. Not only does a mispredicted branch take 4 or 5 cycles, but the Pentium can potentially execute as many as 8 or 10 instructions in that time3 times as many as the 486 can execute during its branch timeso correct branch prediction (or eliminating branch instructions, if possible) is very important in inner loops. Note that on the 486 you can count on a branch to take 1 cycle when it falls through, but on the Pentium you cant be sure whether it will take 1 or either 4 or 5 cycles on any given iteration.
|As things currently stand, branch prediction is an annoyance for assembly language optimization because its impossible to be certain exactly how code will perform until you measure it, and even then its difficult to be sure exactly where the cycles went. All I can say is try to fall through branches if possible, and try to be consistent in your branching if not.|
The Pentium has all the instructions of the 486, plus a few new ones. One much-needed instruction that has finally made it into the instruction set is CPUID, which allows your code to determine what processor its running on. CPUID is 15 years late, but at least its finally here. Another new instruction is CMPXCHG8B, which does a compare and conditional exchange on a qword. CMPXCHG8B doesnt seem to me to be a particularly useful instruction, but Im sure Intel wouldnt have added it without a reason; if you know of a use for it, please pass it along to me.
Many Pentium optimizations help, or at least dont hurt, on the 486. Many, but not alland many do hurt on the 386. As I discuss various Pentium optimizations, I will attempt to note the effects on the 486 as well, but doing this in complete detail would double the sizes of these discussions and make them hard to follow. In general, Id recommend reserving Pentium optimization for your most critical code, and even there, its a good idea to have at least two code paths, one for the 386 and one for the 486/Pentium. Its also a good idea to time your code on a 486 before and after Pentium-optimizing it, to make sure you havent hurt performance on what will be, after all, by far the most important processor over the next couple of years.
With that in mind, is optimizing for the Pentium even worthwhile today? That depends on your application and its marketbut if you want absolutely the best possible performance for your DOS and Windows apps on the fastest hardware, Pentium optimization can make your code scream.
In the next chapter, well look into the single biggest element of Pentium performance, cranking up the Pentiums second execution pipe. This is the area in which compiler technology is most touted for the Pentium, the two thoughts apparently being that (1) most existing code is in C, so recompiling to use the second pipe better is an automatic win, and (2) its so complicated to optimize Pentium code that only a compiler can do it well. The first point is a reasonable one, but it does suffer from one flaw for large programs, in that Pentium-optimized code is larger than 486- or 386-optimized code, for reasons that will become apparent in the next chapter. Larger code means more cache misses and more page faults; and while most of the code in any program is not critical to performance, compilers optimize code indiscriminately.
The result is that Pentium compiler optimization not only expands code, but can be less beneficial than expected or even slower in some cases. What makes more sense is enabling Pentium optimization only for key code. Better yet, you could hand-tune the most important codeand yes, you can absolutely do a better job with a small, critical loop than any PC compiler Ive ever seen, or expect to see. Sure, you keep hearing how great each new compiler generation is, and compilers certainly have improved; but they play by the same rules we do, and were more flexible and know more about what were doingand now we have the wonderfully complex and powerful Pentium upon which to loose our carbon-based optimizers.
A compiler that generates better code than a good assembly programmer? Thatll be the day.
|Previous||Table of Contents||Next|