Previous Table of Contents Next

Finally, it’s possible in real mode to use the 386’s new addressing modes, in which any 32-bit general-purpose register or pair of registers can be used to address memory. What’s more, multiplication of memory-addressing registers by 2, 4, or 8 for look-ups in word, doubleword, or quadword tables can be built right into the memory addressing mode. (The 32-bit addressing modes are discussed further in later chapters.) In protected mode, these new addressing modes allow you to address a full 4 gigabytes per segment, but in real mode you’re still limited to 64K, even with 32-bit registers and the new addressing modes, unless you play some unorthodox tricks with the segment registers.

Note well: Those tricks don’t necessarily work with system software such as Windows, so I’d recommend against using them. If you want 4-gigabyte segments, use a 32-bit environment such as Win32.

Optimization Rules: The More Things Change...

Let’s see what we’ve learned about 286/386 optimization. Mostly what we’ve learned is that our familiar PC cycle-eaters still apply, although in somewhat different forms, and that the major optimization rules for the PC hold true on ATs and 386-based computers. You won’t go wrong on any of these computers if you keep your instructions short, use the registers heavily and avoid memory, don’t branch, and avoid accessing display memory like the plague.

Although we haven’t touched on them, repeated string instructions are still desirable on the 286 and 386 since they provide a great deal of functionality per instruction byte and eliminate both the prefetch queue cycle-eater and branching. However, string instructions are not quite so spectacularly superior on the 286 and 386 as they are on the 8088 since non-string memory-accessing instructions have been speeded up considerably on the newer processors.

There’s one cycle-eater with new implications on the 286 and 386, and that’s the data alignment cycle-eater. From the data alignment cycle-eater we get a new rule: Word-align your word-sized variables, and start your subroutines at even addresses.

Detailed Optimization

While the major 8088 optimization rules hold true on computers built around the 286 and 386, many of the instruction-specific optimizations no longer hold, for the execution times of most instructions are quite different on the 286 and 386 than on the 8088. We have already seen one such example of the sometimes vast difference between 8088 and 286/386 instruction execution times: MOV [WordVar],0, which has an Execution Unit execution time of 20 cycles on the 8088, has an EU execution time of just 3 cycles on the 286 and 2 cycles on the 386.

In fact, the performance of virtually all memory-accessing instructions has been improved enormously on the 286 and 386. The key to this improvement is the near elimination of effective address (EA) calculation time. Where an 8088 takes from 5 to 12 cycles to calculate an EA, a 286 or 386 usually takes no time whatsoever to perform the calculation. If a base+index+displacement addressing mode, such as MOV AX,[WordArray+bx+si], is used on a 286 or 386, 1 cycle is taken to perform the EA calculation, but that’s both the worst case and the only case in which there’s any EA overhead at all.

The elimination of EA calculation time means that the EU execution time of memory-addressing instructions is much closer to the EU execution time of register-only instructions. For instance, on the 8088 ADD [WordVar],100H is a 31-cycle instruction, while ADD DX,100H is a 4-cycle instruction—a ratio of nearly 8 to 1. By contrast, on the 286 ADD [WordVar],100H is a 7-cycle instruction, while ADD DX,100H is a 3-cycle instruction—a ratio of just 2.3 to 1.

It would seem, then, that it’s less necessary to use the registers on the 286 than it was on the 8088, but that’s simply not the case, for reasons we’ve already seen. The key is this: The 286 can execute memory-addressing instructions so fast that there’s no spare instruction prefetching time during those instructions, so the prefetch queue runs dry, especially on the AT, with its one-wait-state memory. On the AT, the 6-byte instruction ADD [WordVar],100H is effectively at least a 15-cycle instruction, because 3 cycles are needed to fetch each of the three instruction words and 6 more cycles are needed to read WordVar and write the result back to memory.

Granted, the register-only instruction ADD DX,100H also slows down—to 6 cycles—because of instruction prefetching, leaving a ratio of 2.5 to 1. Now, however, let’s look at the performance of the same code on an 8088. The register-only code would run in 16 cycles (4 instruction bytes at 4 cycles per byte), while the memory-accessing code would run in 40 cycles (6 instruction bytes at 4 cycles per byte, plus 2 word-sized memory accesses at 8 cycles per word). That’s a ratio of 2.5 to 1, exactly the same as on the 286.

This is all theoretical. We put our trust not in theory but in actual performance, so let’s run this code through the Zen timer. On a PC, Listing 11.4, which performs register-only addition, runs in 3.62 ms, while Listing 11.5, which performs addition to a memory variable, runs in 10.05 ms. On a 10 MHz AT clone, Listing 11.4 runs in 0.64 ms, while Listing 11.5 runs in 1.80 ms. Obviously, the AT is much faster...but the ratio of Listing 11.5 to Listing 11.4 is virtually identical on both computers, at 2.78 for the PC and 2.81 for the AT. If anything, the register-only form of ADD has a slightly larger advantage on the AT than it does on the PC in this case.

Theory confirmed.

LISTING 11.4 L11-4.ASM

; *** Listing 11.4 ***
; Measures the performance of adding an immediate value
; to a register, for comparison with Listing 11.5, which
; adds an immediate value to a memory variable.
        call    ZTimerOn
        rept    1000
        add     dx,100h
        call    ZTimerOff

Previous Table of Contents Next

Graphics Programming Black Book © 2001 Michael Abrash