On the 386, ROR was the only way to split a 32-bit register into two 16-bit registers. On the 486, however, BSWAP can not only do the job, but can do it better, because BSWAP executes in just one cycle. BSWAP has the added benefit of not affecting any flags, unlike ROR. With BSWAP-based code like that in Listing 13.6, the upper 16 bits of a register can be accessed with only 2 cycles of overhead and without altering any flags, making the technique of packing two 16-bit registers into one 32-bit register much more useful.
LISTING 13.6 L13-6.ASM
        mov   cx,[initialskip]
        bswap ecx            ;put skip value in upper half of ECX
        mov   cx,100         ;put loop count in CX
looptop:
        :
        bswap ecx            ;make skip value word accessible in CX
        add   bx,cx          ;skip BX ahead
        inc   cx             ;set next skip value
        bswap ecx            ;put loop count in CX
        dec   cx             ;count down loop
        jnz   looptop
Pushing or popping a memory location, as in PUSH WORD PTR [BX] or POP [MemVar], is a compact, easy way to get a value onto or off of the stack, especially when pushing parameters for calling a C-compatible function. However, on a 486, these are unattractive instructions from a performance perspective. Pushing a memory location takes four cycles; by contrast, loading a memory location into a register takes only one cycle, and pushing a register takes just 1 more cycle, for a total of two cycles. Therefore,
        mov   ax,[bx]
        push  ax
is twice as fast as
push word ptr [bx]
and the only cost is that the previous contents of AX are destroyed.
Likewise, popping a memory location takes six cycles, but popping a register and writing it to memory takes only two cycles combined. The i486 Microprocessor Programmer's Reference Manual lists a 4-cycle execution time for popping a register, but pay that no mind; popping a register takes only 1 cycle.
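The pop-side substitution can be sketched the same way as the push-side one. This is a minimal illustration, assuming AX is free to be overwritten and that MemVar is a word-sized variable, as in the POP [MemVar] example mentioned earlier:

```asm
        pop   ax             ;pop the top of the stack into a register
        mov   [MemVar],ax    ;store it to memory; old contents of AX are lost
```

This does the same job as POP [MemVar], and again the only cost is that the previous contents of AX are destroyed.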
Why is it that such a convenient operation as pushing or popping memory is so slow? The rule on the 486 is that simple operations, which can be executed in a single cycle by the 486's RISC core, are fast; whereas complex operations, which must be carried out in microcode just as they were on the 386, are almost all relatively slow. Slow, complex operations include all the string instructions except REP MOVS, as well as XLAT, LOOP, and, of course, PUSH mem and POP mem.
Whenever possible, try to use the 486's 1-cycle instructions, including MOV, ADD, SUB, CMP, ADC, SBB, XOR, AND, OR, TEST, LEA, and PUSH reg and POP reg. These instructions have an added benefit in that it's often possible to rearrange them for maximum pipeline efficiency, as is the case with Terje's optimization described earlier in this chapter.
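As one illustration of trading a complex instruction for simple ones, LOOP can be replaced with the DEC/JNZ pair that Listing 13.6 already uses for its loop control. This is only a sketch: the loop count of 100 is an arbitrary example value, and it assumes the code following the loop does not depend on the flags being preserved, since DEC modifies them and LOOP does not.

```asm
        mov   cx,100         ;loop count (arbitrary example value)
looptop:
        :                    ;loop body goes here
        dec   cx             ;count down; unlike LOOP, this changes the flags
        jnz   looptop        ;repeat until CX reaches zero
```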
On a 486, the n-bit forms of the shift and rotate instructions (as in ROR AX,2 and SHL BX,9) are 2-cycle instructions, but the 1-bit forms (as in ROR AX,1 and SHL BX,1) are 3-cycle instructions. Go figure.
Assemblers default to the 1-bit instruction for 1-bit shifts and rotates. That's not unreasonable since the 1-bit form is a byte shorter and is just as fast as the n-bit forms on a 386 and faster on a 286, and the n-bit form doesn't even exist on an 8088. In a really critical loop, however, it might be worth hand-assembling the n-bit form of a single-bit shift or rotate in order to save that cycle. The easiest way to do this is to assemble a 2-bit form of the desired instruction, as in SHL AX,2, then look at the hex codes that the assembler generates and use DB to insert them in your program code, with the value two replaced with the value one. For example, you could determine that SHL AX,2 assembles to the bytes 0C1H 0E0H 002H, either by looking at the disassembly in a debugger or by having the assembler generate a listing file. You could then insert the n-bit version of SHL AX,1 in your code as follows:
        mov   ax,1
        db    0c1h, 0e0h, 001h
        mov   dx,ax
At the end of this sequence, DX will contain 2, and the fast n-bit version of SHL AX,1 will have executed. If you use this approach, I'd recommend using a macro, rather than sticking DBs in the middle of your code.
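Such a macro might look something like the following sketch. The macro name SHL_AX_1_NBIT is made up for this illustration, the syntax is MASM/TASM style, and the bytes are assumed to sit in a 16-bit code segment (in a 32-bit segment the same bytes would operate on EAX instead):

```asm
;Hypothetical macro: emits the n-bit encoding of SHL AX,1.
;Assumes a 16-bit code segment.
SHL_AX_1_NBIT macro
        db    0c1h, 0e0h, 001h
        endm

        mov   ax,1
        SHL_AX_1_NBIT        ;n-bit form of SHL AX,1; AX is now 2
        mov   dx,ax          ;DX is now 2
```

Wrapping the bytes this way keeps the intent readable at each point of use and leaves only one place to change if you later target a processor where the 1-bit form is the better choice.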
Again, this technique is advantageous only on a 486. It also doesn't apply to RCL and RCR, where you definitely want to use the 1-bit versions whenever you can, because the n-bit versions are horrendously slow. But if you're optimizing for the 486, these tidbits can save a few critical cycles, and Lord knows that if you're optimizing for the 486 (that is, if you need even more performance than you get from unoptimized code on a 486) you almost certainly need all the speed you can get.