|Previous||Table of Contents||Next|
As another example of local optimization, consider the matter of rotating or shifting a mask into position. First, lets look at the simple task of setting bit N of AX to 1.
The obvious way to do this is to place N in CL, rotate the bit into position, and OR it with AX, as follows:
MOV BX,1 SHL BX,CL OR AX,BX
This solution is obvious because it takes good advantage of the special ability of the x86 family to shift or rotate by the variable number of bits specified by CL. However, it takes an average of about 45 cycles on an 8088. Its actually far faster to precalculate the results, pass the bit number in BX, and look the shifted bit up, as shown in Listing 7.3.
LISTING 7.3 L7-3.ASM
SHL BX,1 ;prepare for word sized look up OR AX,ShiftTable[BX] ;look up the bit and OR it in : ShiftTable LABEL WORD BIT_PATTERN=0001H REPT 16 DW BIT_PATTERN BIT_PATTERN=BIT_PATTERN SHL 1 ENDM
Even though it accesses memory, this approach takes only 20 cyclesmore than twice as fast as the variable shift. Once again, we were able to improve performance considerablynot by knowing the fastest instructions, but by selecting the fastest sequence of instructions.
In the particular example above, we once again run into the difficulty of optimizing across the x86 family. The table lookup is faster on the 8088 and 286, but its slightly slower on the 386 and no faster on the 486. However, 386/486-specific code could use enhanced addressing to accomplish the whole job in just one instruction, along the lines of the code snippet in Listing 7.4.
LISTING 7.4 L7-4.ASM
OR EAX,ShiftTable[EBX*4] ;look up the bit and OR it in : ShiftTable LABEL DWORD BIT_PATTERN=0001H REPT 32 DD BIT_PATTERN BIT_PATTERN=BIT_PATTERN SHL 1 ENDM
|Besides illustrating the advantages of local optimization, this example also shows that it generally pays to precalculate results; this is often done at or before assembly time, but precalculated tables can also be built at run time. This is merely one aspect of a fundamental optimization rule: Move as much work as possible out of your critical code by whatever means necessary.|
The NOT instruction flips all the bits in the operand, from 0 to 1 or from 1 to 0. Thats as simple as could be, but NOT nonetheless has a minor but interesting talent: It doesnt affect the flags. That can be irritating; I once spent a good hour tracking down a bug caused by my unconscious assumption that NOT does set the flags. After all, every other arithmetic and logical instruction sets the flags; why not NOT? Probably because NOT isnt considered to be an arithmetic or logical instruction at all; rather, its a data manipulation instruction, like MOV and the various rotates. (These are RCR, RCL, ROR, and ROL, which affect only the Carry and Overflow flags.) NOT is often used for tasks, such as flipping masks, where theres no reason to test the state of the result, and in that context it can be handy to keep the flags unmodified for later testing.
|Besides, if you want to NOT an operand and set the flags in the process, you can just XOR it with -1. Put another way, the only functional difference between NOT AX and XOR AX,0FFFFH is that XOR modifies the flags and NOT doesnt.|
The x86 instruction set offers many ways to accomplish almost any task. Understanding the subtle distinctions between the instructionswhether and which flags are set, for examplecan be critical when youre trying to optimize a code sequence and youre running out of registers, or when youre trying to minimize branching.
Another case in which there are two slightly different ways to perform a task involves adding 1 to an operand. You can do this with INC, as in INC AX, or you can do it with ADD, as in ADD AX,1. Whats the difference? The obvious difference is that INC is usually a byte or two shorter (the exception being ADD AL,1, which at two bytes is the same length as INC AL), and is faster on some processors. Less obvious, but no less important, is that ADD sets the Carry flag while INC leaves the Carry flag untouched.
Why is that important? Because it allows INC to function as a data pointer manipulation instruction for multi-word arithmetic. You can use INC to advance the pointers in code like that shown in Listing 7.5 without having to do any work to preserve the Carry status from one addition to the next.
LISTING 7.5 L7-5.ASM
CLC ;clear the Carry for the initial addition LOOP_TOP: MOV AX,[SI];get next source operand word ADC [DI],AX;add with Carry to dest operand word INC SI ;point to next source operand word INC SI INC DI ;point to next dest operand word INC DI LOOP LOOP_TOP
If ADD were used, the Carry flag would have to be saved between additions, with code along the lines shown in Listing 7.6.
LISTING 7.6 L7-6.ASM
CLC ;clear the carry for the initial addition LOOP_TOP: MOV AX,[SI] ;get next source operand word ADC [DI],AX ;add with carry to dest operand word LAHF ;set aside the carry flag ADD SI,2 ;point to next source operand word ADD DI,2 ;point to next dest operand word SAHF ;restore the carry flag LOOP LOOP_TOP
Its not that the Listing 7.6 approach is necessarily better or worse; that depends on the processor and the situation. The Listing 7.6 approach is different, and if you understand the differences, youll be able to choose the best approach for whatever code you happen to write. (DEC has the same property of preserving the Carry flag, by the way.)
There are a couple of interesting aspects to the last example. First, note that LOOP doesnt affect any flags at all; this allows the Carry flag to remain unchanged from one addition to the next. Not altering the arithmetic flags is a common characteristic of program control instructions (as opposed to arithmetic and logical instructions like SUB and AND, which do alter the flags).
|The rule is not that the arithmetic flags change whenever the CPU performs a calculation; rather, the flags change whenever you execute an arithmetic, logical, or flag control (such as CLC to clear the Carry flag) instruction.|
Not only do LOOP and JCXZ not alter the flags, but REP MOVS, which counts down CX to 0, doesnt affect the flags either.
The other interesting point about the last example is the use of LAHF and SAHF, which transfer the low byte of the FLAGS register to and from AH, respectively. These instructions were created to help provide compatibility with the 8080s (thats 8080, not 8088) PUSH PSW and POP PSW instructions, but turn out to be compact (one byte) instructions for saving and restoring the arithmetic flags. A word of caution, however: SAHF restores the Carry, Zero, Sign, Auxiliary Carry, and Parity flagsbut not the Overflow flag, which resides in the high byte of the FLAGS register. Also, be aware that LAHF and SAHF provide a fast way to preserve the flags on an 8088 but are relatively slow instructions on the 486 and Pentium.
There are times when its a clear liability that INC doesnt set the Carry flag. For instance
INC AX ADC DX,0
does not increment the 32-bit value in DX:AX. To do that, youd need the following:
ADD AX,1 ADC DX,0
As always, pay attention!
|Previous||Table of Contents||Next|