Previous | Table of Contents | Next |
I can still remember the day I did my first 8088 programming. I had just moved over from the distantly related Z80, so the 8088 wasnt totally alien, but it was nonetheless an incredibly exciting processor. The 8088s instruction set was vastly more powerful and varied than the Z80s, and as someone who thrives on puzzles of all sorts, from crosswords to Freecell to jigsaws to assembly language optimization, I was delighted to find that the 8088 made the optimization universe an order of magnitude more complicatedand correspondingly more interesting.
Well, the years went by and the Z80 just died, and 8088 optimization got ever more complex and intriguing as I discovered the hazards of the 8088s cycle-eaters. By the time 1989 rolled around, I had written Zen of Assembly Language, in which I described all that I had learned about the 8088 and concluded that 8088 optimization was a black art of infinite subtlety. Unfortunately, by that time the 286 was the standard, with the 386 coming on strong, and if the 286 was less amenable to hand optimization than the 8088 (and it surely was), then the 386 was downright unfriendly. Sure, assembly optimization could buy some performance on the 386, but only 20, 30, 40 percent or soa far cry from the 100 to 400 percent of the 8088. At the same time, compiler technology was improving quickly, and the days of hand tuning seemed numbered.
Happily, the 486 traveled to the beat of a different drum. The 486 had some interesting internal pipeline hazards, as well as an internal cache that made cycle counting more meaningful than ever before, and careful code massaging sometimes yielded startling results. Nonetheless, the 486 was still too simple to mark a return to the golden age of optimization.
Then the Pentium came around, and filled our code with optimization hazards, and life was good again. The Pentium has two execution pipelines and enough rules and exceptions to those rules to bring joy to the heart of the hardest-core assembly junkie. For a change, Intel documented most of the Pentium optimization rules and spread the word about them, so we dont have to go through as much spelunking of the Pentium as with its predecessors. Theyve done this, I suspect, largely because more than any previous x86 processor, the Pentiums performance is highly dependent on properly optimized code.
In the worst case, where the second execution pipe is dormant most of the time, the Pentium wont perform all that much better than a 486 at the same clock speed. In the best case, where the second pipe is heavily used and the Pentiums other advantages (such as branch prediction, write-back cache, 64-bit full speed external bus, and dual 8K caches) can kick in, the Pentium can be more than twice as fast as a 486. In a critical inner loop, hand optimization can double or even triple performance over 486-optimized codeand thats on top of the sorts of algorithmic and design optimizations that are routinely performed on any processor. Good compilers can make a big difference on the Pentium, too, but there are some gotchas there, to which Ill return later.
Its been a long time coming, but hard-core, big-payoff assembly language optimization is back in style, and for the rest of this book Ill be delving into the Byzantine wonders of the Pentium. In this chapter, Ill do a quick overview, then cover a variety of smaller Pentium optimization topics. In the next chapter, Ill tackle the 900-pound gorilla of Pentium optimization: superscalar (dual execution pipe) programming. Trust me, thisll be fun.
Listen, do you want to know a secret? This lead-in has been brought to you with the help of classic rockanother way of saying music Baby Boomers listened to back when they cared more about music than 401Ks and regular flossing. There are so many of us Boomers that our music, even the worst of it, will never go away. When were 90 years old, propped up in our Kraftmatic adjustable beds and surfing the 5,000-channel information superhighway from one infomercial to the next, the sound system in the retirement community will be piping in a Muzak version of Louie, Louie, while on the holovid Country Joe McDonald and the Fish pitch Preparation H. I can hardly wait.
Gimme a P....
Architecturally, the Pentium is vastly different in many ways from the 486, but most of those differences are transparent to programmers. After all, the whole idea behind the Pentium is that it runs the same code as previous x86 processors, but faster; otherwise, Intel could have made a faster, cheaper RISC processor. Still, knowledge of the Pentiums architecture is useful for understanding exactly how code will perform, and a few of the architectural differences are most decidedly not transparent to performance programmers.
The Pentium is essentially one full 486 execution unit (EU), plus a second stripped-down 486 EU, on a single chip. The first EU is referred to as the U execution pipe, or U-pipe; the second, more limited one is called the V-pipe. The two pipes are capable of executing instructions simultaneously, have separate write buffers, and can even access the data cache simultaneously (although with certain limitations that Ill discuss in the next chapter), so on the Pentium it is possible to execute two instructions, even instructions that access memory, in a single clock. The cycle times for instruction execution in a given pipe (both pipes process instructions at the same speed) are comparable to those for the 486, although some instructionsnotably MUL, the repeated string instructions, and some of the shifts and rotateshave gotten faster.
My first thought upon hearing of the Pentiums dual pipes was to wonder how often the prefetch queue stalls for lack of instruction bytes, given that the demand for instruction bytes can be twice that of the 486. The answer is: rarely indeed, and then only because the code is not in the internal cache. The 486 has a single 8K cache that stores both code and data, and prefetching can stall if data fetching doesnt allow time for prefetching to occur (although this rarely happens in practice).
Previous | Table of Contents | Next |