|Previous||Table of Contents||Next|
You dont need to understand every corner of the 486 universe unless youre a diehard ASMhead who does this stuff for fun. Just learn enough to be able to speed up the key portions of your programs, and spend the rest of your time on a fast design and overall implementation.
Rule #4: Dont load any byte register exactly 2 cycles before using any register to address memory.
This, the last of this chapters rules, is the strangest of the lot. If any byte register is loaded, and then two cycles later any register is used to point to memory, one cycle is lost. So, for example, this code
mov al,bl mov cx,dx mov si,[di]
takes four rather than the expected three cycles to execute. Note that it is not required that the byte register be part of the register used to address memory; any byte register will do the trick.
Worse still, loading byte registers both one and two cycles before a register is used to address memory costs two cycles, as in
mov bl,al mov cl,3 mov bx,[si]
which takes five rather than three cycles to run. However, there is no penalty if a byte register is loaded one cycle but not two cycles before a register is used to address memory. Therefore,
mov cx,3 mov dl,al mov si,[bx]
runs in the expected three cycles.
In truth, I do not know why this happens. Clearly, it has something to do with interrupting the start of the addressing pipeline, and I have my theories about how this works, but at this point theyre pure speculation. Whatever the reason for this rule, ignorance of itand of its interaction with the other rulescould lead to considerable performance loss in seemingly air-tight code. For instance, a casual observer would expect the following code to run in 3 cycles:
mov bx,offset MemVar mov cl,al mov ax,[bx]
A more sophisticated programmer would expect to lose one cycle, because BX is loaded two cycles before being used to address memory. In fact, though, this code takes 5 cycles2 cycles, or 67 percent, longer than normal. Why? Well, under normal conditions, loading a byte registerCL in this caseone cycle before using a register to address memory produces no penalty; loading 2 cycles ahead is the only case that normally incurs a penalty. However, think of Rule #4 as meaning that loading a byte register disrupts the memory addressing pipeline as it starts up. Viewed that way, we can see that MOV BX,OFFSET MemVar interrupts the addressing pipeline, forcing it to start again, and then, presumably, MOV CL,AL interrupts the pipeline again because the pipeline is now on its first cycle: the one that loading a byte register can affect.
|I knowit seems awfully complicated. It isnt, really. Generally, try not to use byte destinations exactly two cycles before using a register to address memory, and try not to load a register either one or two cycles before using it to address memory, and youll be fine.|
In case you want to do some 486 performance analysis of your own, let me show you how I arrived at one of the above conclusions; at the same time, I can warn you of the timing hazards of the cache. Listings 12.1 and 12.2 show the code I ran through the Zen timer in order to establish the effects of loading a byte register before using a register to address memory. Listing 12.1 ran in 120 µs on a 33 MHz 486, or 4 cycles per repetition (120 µs/1000 repetitions = 120 ns per repetition; 120 ns per repetition/30 ns per cycle = 4 cycles per repetition); Listing 12.2 ran in 90 µs, or 3 cycles, establishing that loading a byte register costs a cycle only when its performed exactly 2 cycles before addressing memory.
LISTING 12.1 LST12-1.ASM
; Measures the effect of loading a byte register 2 cycles before ; using a register to address memory. mov bp,2 ;run the test code twice to make sure ; its cached sub bx,bx CacheFillLoop: call ZTimerOn ;start timing rept 1000 mov dl,cl nop mov ax,[bx] endm call ZTimerOff ;stop timing dec bp jz Done jmp CacheFillLoop Done:
LISTING 12.2 LST12-2.ASM
; Measures the effect of loading a byte register 1 cycle before ; using a register to address memory. mov bp,2 ;run the test code twice to make sure ; its cached sub bx,bx CacheFillLoop: call ZTimerOn ;start timing rept 1000 nop mov dl,cl mov ax,[bx] endm call ZTimerOff ;stop timing dec bp jz Done jmp CacheFillLoop Done:
Note that Listings 12.1 and 12.2 each repeat the timing of the code under test a second time, to make sure that the instructions are in the cache on the second pass, the one for which results are displayed. Also note that the code is less than 8K in size, so that it can all fit in the 486s 8K internal cache. If I double the REPT value in Listing 12.2 to 2,000, making the test code larger than 8K, the execution time more than doubles to 224 µs, or 3.7 cycles per repetition; the extra seven-tenths of a cycle comes from fetching non-cached instruction bytes.
|Whenever you see non-integral timing results of this sort, its a good bet that the test code or data isnt cached.|
Theres certainly plenty more 486 lore to explore, including the 486s unique prefetch queue, more optimization rules, branching optimizations, performance implications of the cache, the cost of cache misses for reads, and the implications of cache write-through for writes. Nonetheless, weve covered quite a bit of ground in this chapter, and I trust youve gotten a feel for the considerable extent to which 486 optimization differs from what youre used to. Odd as 486 optimization is, though, its well worth mastering, for the 486 is, at its best, so staggeringly fast that carefully crafted 486 code can do more than twice as much per cycle as the best 386 codewhich makes it perhaps 50 times as fast as optimized code for the original PC.
Sometimes it is hard to believe were still in Kansas!
|Previous||Table of Contents||Next|