|Previous||Table of Contents||Next|
Next, we come to an item that cycle counters will love, especially since it involves apparently incorrect documentation on Intels part. According to Intels documents, all RCR and RCL instructions, which perform rotations through the Carry flag, as shown in Figure 9.4, take 9 cycles on the 386 when working with a register operand. My measurements indicate that the 9-cycle execution time almost holds true for multibit rotate-through-carries, which Ive timed at 8 cycles apiece; for example, RCR AX,CL takes 8 cycles on my 386, as does RCL DX,2. Contrast that with ROR and ROL, which can rotate the contents of a register any number of bits in just 3 cycles.
However, rotating by one bit through the Carry flag does not take 9 cycles, contrary to Intels 80386 Programmers Reference Manual, or even 8 cycles. In fact, RCR reg,1 and RCL reg,1 take 3 cycles, just like ROR, ROL, SHR, and SHL. At least, thats how fast they run on my 386, and I very much doubt that youll find different execution times on other 386s. (Please let me know if you do, though!)
Figure 9.4 Performing rotate instructions using the Carry flag.
Interestingly, according to Intels i486 Microprocessor Programmers Reference Manual, the 486 can RCR or RCL a register by one bit in 3 cycles, but takes between 8 and 30 cycles to perform a multibit register RCR or RCL!
No great lesson here, just a caution to be leery of multibit RCR and RCL when performance mattersand to take cycle-time documentation with a grain of salt.
Did you ever wonder how to code a far jump to an absolute address in assembly language? Probably not, but if you ever do, youre going to be glad for this next item, because the obvious solution doesnt work. You might think all it would take to jump to, say, 1000:5 would be JMP FAR PTR 1000:5, but youd be wrong. That wont even assemble. You might then think to construct in memory a far pointer containing 1000:5, as in the following:
Ptr dd ? : mov word ptr [Ptr],5 mov word ptr [Ptr+2],1000h jmp [Ptr]
That will work, but at a price in performance. On an 8088, JMP DWORD PTR [mem] (an indirect far jump) takes at least 37 cycles; JMP DWORD PTR label (a direct far jump) takes only 15 cycles (plus, almost certainly, some cycles for instruction fetching). On a 386, an indirect far jump is documented to take at least 43 cycles in real mode (31 in protected mode); a direct far jump is documented to take at least 12 cycles, about three times faster. In truth, the difference between those two is nowhere near that big; the fastest Ive measured for a direct far jump is 21 cycles, and Ive measured indirect far jumps as fast as 30 cycles, so direct is still faster, but not by so much. (Oh, those cycle-time documentation blues!) Also, a direct far jump is documented to take at least 27 cycles in protected mode; why the big difference in protected mode, I have no idea.
At any rate, to return to our original problem of jumping to 1000:5: Although an indirect far jump will work, a direct far jump is still preferable.
Listing 9.7 shows a short program that performs a direct far call to 1000:5. (Dont run it, unless you want to crash your system!) It does this by creating a dummy segment at 1000H, so that the label FarLabel can be created with the desired far attribute at the proper location. (Segments created with AT dont cause the generation of any actual bytes or the allocation of any memory; theyre just templates.) Its a little kludgey, but at least it does work. There may be a better solution; if you have one, pass it along.
LISTING 9.7 L9-7.ASM
; Program to perform a direct far jump to address 1000:5. ; *** Do not run this program! Its just an example of how *** ; *** to build a direct far jump to an absolute address *** ; ; Tested with TASM 2 and MASM 5. FarSeg segment at 01000h org 5 FarLabel label far FarSeg ends .model small .code start: jmp FarLabel end start
By the way, if youre wondering how I figured this out, I merely applied my good friend Dan Illowskys long-standing rule for dealing with MASM:
If the obvious doesnt work (and it usually doesnt), just try everything you can think of, no matter how ridiculous, until you find something that doesa rule with plenty of history on its side.
To finish up this chapter, consider these two items. First, in 32-bit protected mode,
sub eax,eax inc eax
takes 4 cycles to execute, but is only 3 bytes long, while
takes only 2 cycles to execute, but is 5 bytes long (because native mode constants are dwords and the MOV instruction doesnt sign-extend). Both code fragments are ways to set EAX to 1 (although the first affects the flags and the second doesnt); this is a classic trade-off of speed for space. Second,
takes 2 cycles to execute and is 3 bytes long, while
takes 2 cycles to execute and is 5 bytes long. Both instructions set EBX to -1; this is a classic trade-off ofgee, its not a trade-off at all, is it? OR is a better way to set a 32-bit register to all 1-bits, just as SUB or XOR is a better way to set a register to all 0-bits. Who woulda thunk it? Just goes to show how the 32-bit displacements and constants of 386 native mode change the familiar landscape of 80×86 optimization.
Be warned, though, that Ive found OR, AND, ADD, and the like to be a cycle slower than MOV when working with immediate operands on the 386 under some circumstances, for reasons that thus far escape me. This just reinforces the first rule of optimization: Measure your code in action, and place not your trust in documented cycle times.
|Previous||Table of Contents||Next|