Michael Abrash's Graphics Programming Black Book Special Edition: Pushing the 486

Table of Contents

There is, of course, no guarantee that I’m entirely correct about the optimizations discussed in this chapter. Without knowing the internals of the 486, all I can do is time code and make inferences from the results; I invite you to deduce your own rules and cross-check them against mine. Also, most likely there are other optimizations that I’m unaware of. If you have further information on these or any other undocumented optimizations, please write and let me know. And, of course, if anyone from Intel is reading this and wants to give us the gospel truth, please do!

Stack Addressing and Address Pipelining

Rule #2A: Rule #2 sometimes, but not always, applies to the stack pointer when it is implicitly used to point to memory.

Intel states that the stack pointer is an implied destination register for CALL, ENTER, LEAVE, RET, PUSH, and POP (which alter (E)SP), and that it is the implied base addressing register for PUSH, POP, and RET (which use (E)SP to address memory). Intel then implies that the aforementioned addressing pipeline penalty is incurred whenever the stack pointer is used as a destination by one of the first set of instructions and is then immediately used to address memory by one of the second set. This raises the specter of unpleasant programming contortions such as intermixing PUSHes and POPs with other instructions to avoid interrupting the addressing pipeline. Fortunately, matters are actually not so grim as Intel’s documentation would indicate; my tests indicate that the addressing pipeline penalty pops up only spottily when the stack pointer is involved.

For example, you’d certainly expect a sequence such as

:
pop    ax
ret
pop    ax
et
:

to exhibit the addressing pipeline interruption phenomenon (SP is both destination and addressing register for both instructions, according to Intel), but this code runs in six cycles per POP/RET pair, matching the official execution times exactly. Likewise, a sequence like

pop    dx
pop    cx
pop    bx
pop    ax

runs in one cycle per instruction, just as it should.

On the other hand, performing arithmetic directly on SP as an explicit destination—for example, to deallocate local variables—and then using PUSH, POP, or RET, definitely can interrupt the addressing pipeline. For example

add    sp,10h
ret

loses two cycles because SP is the explicit destination of one instruction and then the implied addressing register for the next, and the sequence

add    sp,10h
pop    ax

loses two cycles for the same reason.

I certainly haven’t tried all possible combinations, but the results so far indicate that the stack pointer incurs the addressing pipeline penalty only if (E)SP is the explicit destination of one instruction and is then used by one of the two following instructions to address memory. So, for instance, SP isn’t the explicit operand of POP AX—AX is—and no cycles are lost if POP AX is followed by POP or RET. Happily, then, we need not worry about the sequence in which we use PUSH and POP. However, adding to, moving to, or subtracting from the stack pointer should ideally be done at least two cycles before PUSH, POP, RET, or any other instruction that uses the stack pointer to address memory.

Problems with Byte Registers

There are two ways to lose cycles by using byte registers, and neither of them is documented by Intel, so far as I know. Let’s start with the lesser and simpler of the two.

Rule #3: Do not load a byte portion of a register during one instruction, then use that register in its entirety as a source register during the next instruction.

So, for example, it would be a bad idea to do this

mov    ah,o
            :
mov    cx,[MemVar1]
mov    al,[MemVar2]
add    cx,ax

because AL is loaded by one instruction, then AX is used as the source register for the next instruction. A cycle can be saved simply by rearranging the instructions so that the byte register load isn’t immediately followed by the word register usage, like so:

mov    ah,o
            :
mov    al,[MemVar2]
mov    cx,[MemVar1]
add    cx,ax

Strange as it may seem, this rule is neither arbitrary nor nonsensical. Basically, when a byte destination register is part of a word source register for the next instruction, the 486 is unable to directly use the result from the first instruction as the source for the second instruction, because only part of the register required by the second instruction is contained in the first instruction’s result. The full, updated register value must be read from the register file, and that value can’t be read out until the result from the first instruction has been written into the register file, a process that takes an extra cycle. I’m not going to explain this in great detail because it’s not important that you understand why this rule exists (only that it does in fact exist), but it is an interesting window on the way the 486 works.

In case you’re curious, there’s no such penalty for the typical XLAT sequence like

mov    bx,offset MemTable
       :
mov    al,[si]
xlat

even though AL must be converted to a word by XLAT before it can be added to BX and used to address memory. In fact, none of the penalties mentioned in this chapter apply to XLAT, apparently because XLAT is so slow—4 cycles—that it gives the 486 time to perform addressing calculations during the course of the instruction.

While it’s nice that XLAT doesn’t suffer from the various 486 addressing penalties, the reason for that is basically that XLAT is slow, so there’s still no compelling reason to use XLAT on the 486.

In general, penalties for interrupting the 486’s pipeline apply primarily to the fast core instructions of the 486, most notably register-only instructions and MOV, although arithmetic and logical operations that access memory are also often affected. I don’t know all the performance dependencies, and I don’t plan to; figuring all of them out would be a big, boring job of little value. Basically, on the 486 you should concentrate on using those fast core instructions when performance matters, and all the rules I’ll discuss do indeed apply to those instructions.

Table of Contents