Previous | Table of Contents | Next |
Its a sad but true fact that 84 percent of American schoolchildren are ignorant of 92 percent of American history. Not my daughter, though. We recently visited historical Revolutionary-War-vintage Fort Ticonderoga, and shes now 97 percent aware of a key element of our national heritage: that the basic uniform for soldiers in those days was what appears to be underwear, plus a hat so that no one could complain that they were undermining family values. Ha! Just kidding! Actually, what she learned was that in those days, it was pure coincidence if a cannonball actually hit anything it was aimed at, which isnt surprising considering the lack of rifling, precision parts, and ballistics. The guides at the fort shot off three cannons; the closest they came to the target was about 50 feet, and that was only because the wind helped. I think the idea in early wars was just to put so much lead in the air that some of it was bound to hit something; preferably, but not necessarily, the enemy.
Nowadays, of course, we have automatic weapons that allow a teenager to singlehandedly defeat the entire U.S. Army, not to mention so-called smart bombs, which are smart in the sense that they can seek out and empty a taxpayers wallet without being detected by radar. Theres an obvious lesson here about progress, which I leave you to deduce for yourselves.
Heres the same lesson, in another form. Ten years ago, we had a slow processor, the 8088, for which it was devilishly hard to optimize, and for which there was no good optimization documentation available. Now we have a processor, the 486, thats 50 to 100 times faster than the 8088and for which there is no good optimization documentation available. Sure, Intel provides a few tidbits on optimization in the back of the i486 Microprocessor Programmers Reference Manual, but, as I discussed in Chapter 12, that information is both incomplete and not entirely correct. Besides, most assembly language programmers dont bother to read Intels manuals (which are extremely informative and well done, but only slightly more fun to read than the phone book), and go right on programming the 486 using outdated 8088 optimization techniques, blissfully unaware of a new and heavily mutated generation of cycle-eaters that interact with their code in ways undreamt of even on the 386.
For example, consider how Terje Mathisen doubled the speed of his word-counting program on a 486 simply by shuffling a couple of instructions.
Ive mentioned Terje Mathisen in my writings before. Terje is an assembly language programmer extraordinaire, and author of the incredibly fast public-domain word-counting program WC (which comes complete with source code; well worth a look, if you want to see what really fast code looks like). Terjes a regular participant in the ibm.pc/fast.code topic on Bix. In a thread titled 486 Pipeline Optimization, or TANSTATFC (There Aint No Such Thing As The Fastest Code), he detailed the following optimization to WC, perhaps the best example of 486 pipeline optimization Ive yet seen.
Terjes inner loop originally looked something like the code in Listing 13.1. (Ive taken a few liberties for illustrative purposes.) Of course, Terje unrolls this loop a few times (128 times, to be exact). By the way, in Listing 13.1 youll notice that Terje counts not only words but also lines, at a rate of three instructions for every two characters!
LISTING 13.1 L13-1.ASM
mov di,[bp+OFFS] ;get the next pair of characters mov bl,[di] ;get the state value for the pair add dx,[bx+8000h] ;increment word and line count ; appropriately for the pair
Listing 13.1 looks as tight as it could be, with just two one-cycle instructions, one two-cycle instruction, and no branches. It is tight, but those three instructions actually take a minimum of 8 cycles to execute, as shown in Figure 13.1. The problem is that DI is loaded just before being used to address memory, and that costs 2 cycles because it interrupts the 486s internal instruction pipeline. Likewise, BX is loaded just before being used to address memory, costing another two cycles. Thus, this loop takes twice as long as cycle counts would seem to indicate, simply because two registers are loaded immediately before being used, disrupting the 486s pipeline.
Listing 13.2 shows Terjes immediate response to these pipelining problems; he simply swapped the instructions that load DI and BL. This one change cut execution time per character pair from eight cycles to five cycles! The load of BL is now separated by one instruction from the use of BX to address memory, so the pipeline penalty is reduced from two cycles to one cycle. The load of DI is also separated by one instruction from the use of DI to address memory (remember, the loop is unrolled, so the last instruction is followed by the first instruction), but because the intervening instruction takes two cycles, theres no penalty at all.
Figure 13.1 Cycle-eaters in the original WC.
Remember, pipeline penalties diminish with increasing number of cycles, not instructions, between the pipeline disrupter and the potentially affected instruction. |
LISTING 13.2 L13-2.ASM
mov bl,[di] ;get the state value for the pair mov di,[bp+OFFS] ;get the next pair of characters add dx,[bx+8000h] ;increment word and line count ; appropriately for the pair
At this point, Terje had nearly doubled the performance of this code simply by moving one instruction. (Note that swapping the instructions also made it necessary to preload DI at the start of the loop; Listing 13.2 is not exactly equivalent to Listing 13.1.) Ill let Terje describe his next optimization in his own words:
Previous | Table of Contents | Next |