|Previous||Table of Contents||Next|
There is, of course, no guarantee that Im entirely correct about the optimizations discussed in this chapter. Without knowing the internals of the 486, all I can do is time code and make inferences from the results; I invite you to deduce your own rules and cross-check them against mine. Also, most likely there are other optimizations that Im unaware of. If you have further information on these or any other undocumented optimizations, please write and let me know. And, of course, if anyone from Intel is reading this and wants to give us the gospel truth, please do!
Rule #2A: Rule #2 sometimes, but not always, applies to the stack pointer when it is implicitly used to point to memory.
Intel states that the stack pointer is an implied destination register for CALL, ENTER, LEAVE, RET, PUSH, and POP (which alter (E)SP), and that it is the implied base addressing register for PUSH, POP, and RET (which use (E)SP to address memory). Intel then implies that the aforementioned addressing pipeline penalty is incurred whenever the stack pointer is used as a destination by one of the first set of instructions and is then immediately used to address memory by one of the second set. This raises the specter of unpleasant programming contortions such as intermixing PUSHes and POPs with other instructions to avoid interrupting the addressing pipeline. Fortunately, matters are actually not so grim as Intels documentation would indicate; my tests indicate that the addressing pipeline penalty pops up only spottily when the stack pointer is involved.
For example, youd certainly expect a sequence such as
: pop ax ret pop ax et :
to exhibit the addressing pipeline interruption phenomenon (SP is both destination and addressing register for both instructions, according to Intel), but this code runs in six cycles per POP/RET pair, matching the official execution times exactly. Likewise, a sequence like
pop dx pop cx pop bx pop ax
runs in one cycle per instruction, just as it should.
On the other hand, performing arithmetic directly on SP as an explicit destinationfor example, to deallocate local variablesand then using PUSH, POP, or RET, definitely can interrupt the addressing pipeline. For example
add sp,10h ret
loses two cycles because SP is the explicit destination of one instruction and then the implied addressing register for the next, and the sequence
add sp,10h pop ax
loses two cycles for the same reason.
I certainly havent tried all possible combinations, but the results so far indicate that the stack pointer incurs the addressing pipeline penalty only if (E)SP is the explicit destination of one instruction and is then used by one of the two following instructions to address memory. So, for instance, SP isnt the explicit operand of POP AXAX isand no cycles are lost if POP AX is followed by POP or RET. Happily, then, we need not worry about the sequence in which we use PUSH and POP. However, adding to, moving to, or subtracting from the stack pointer should ideally be done at least two cycles before PUSH, POP, RET, or any other instruction that uses the stack pointer to address memory.
There are two ways to lose cycles by using byte registers, and neither of them is documented by Intel, so far as I know. Lets start with the lesser and simpler of the two.
Rule #3: Do not load a byte portion of a register during one instruction, then use that register in its entirety as a source register during the next instruction.
So, for example, it would be a bad idea to do this
mov ah,o : mov cx,[MemVar1] mov al,[MemVar2] add cx,ax
because AL is loaded by one instruction, then AX is used as the source register for the next instruction. A cycle can be saved simply by rearranging the instructions so that the byte register load isnt immediately followed by the word register usage, like so:
mov ah,o : mov al,[MemVar2] mov cx,[MemVar1] add cx,ax
Strange as it may seem, this rule is neither arbitrary nor nonsensical. Basically, when a byte destination register is part of a word source register for the next instruction, the 486 is unable to directly use the result from the first instruction as the source for the second instruction, because only part of the register required by the second instruction is contained in the first instructions result. The full, updated register value must be read from the register file, and that value cant be read out until the result from the first instruction has been written into the register file, a process that takes an extra cycle. Im not going to explain this in great detail because its not important that you understand why this rule exists (only that it does in fact exist), but it is an interesting window on the way the 486 works.
In case youre curious, theres no such penalty for the typical XLAT sequence like
mov bx,offset MemTable : mov al,[si] xlat
even though AL must be converted to a word by XLAT before it can be added to BX and used to address memory. In fact, none of the penalties mentioned in this chapter apply to XLAT, apparently because XLAT is so slow4 cyclesthat it gives the 486 time to perform addressing calculations during the course of the instruction.
|While its nice that XLAT doesnt suffer from the various 486 addressing penalties, the reason for that is basically that XLAT is slow, so theres still no compelling reason to use XLAT on the 486.|
In general, penalties for interrupting the 486s pipeline apply primarily to the fast core instructions of the 486, most notably register-only instructions and MOV, although arithmetic and logical operations that access memory are also often affected. I dont know all the performance dependencies, and I dont plan to; figuring all of them out would be a big, boring job of little value. Basically, on the 486 you should concentrate on using those fast core instructions when performance matters, and all the rules Ill discuss do indeed apply to those instructions.
|Previous||Table of Contents||Next|