Michael Abrash's Graphics Programming Black Book Special Edition: Local Optimization

Rotating and Shifting with Tables

As another example of local optimization, consider the matter of rotating or shifting a mask into position. First, let’s look at the simple task of setting bit N of AX to 1.

The obvious way to do this is to place N in CL, rotate the bit into position, and OR it with AX, as follows:

MOV  BX,1
SHL  BX,CL
OR   AX,BX

This solution is obvious because it takes good advantage of the special ability of the x86 family to shift or rotate by the variable number of bits specified by CL. However, it takes an average of about 45 cycles on an 8088. It’s actually far faster to precalculate the results, pass the bit number in BX, and look the shifted bit up, as shown in Listing 7.3.

LISTING 7.3 L7-3.ASM

     SHL  BX,1                ;prepare for word sized look up
     OR   AX,ShiftTable[BX]   ;look up the bit and OR it in
          :
ShiftTable     LABEL     WORD
BIT_PATTERN=0001H
     REPT 16
     DW   BIT_PATTERN
BIT_PATTERN=BIT_PATTERN SHL 1
     ENDM

Even though it accesses memory, this approach takes only 20 cycles—more than twice as fast as the variable shift. Once again, we were able to improve performance considerably—not by knowing the fastest instructions, but by selecting the fastest sequence of instructions.

In the particular example above, we once again run into the difficulty of optimizing across the x86 family. The table lookup is faster on the 8088 and 286, but it’s slightly slower on the 386 and no faster on the 486. However, 386/486-specific code could use enhanced addressing to accomplish the whole job in just one instruction, along the lines of the code snippet in Listing 7.4.

LISTING 7.4 L7-4.ASM

     OR   EAX,ShiftTable[EBX*4]    ;look up the bit and OR it in
          :
ShiftTable     LABEL     DWORD
BIT_PATTERN=0001H
     REPT 32
     DD   BIT_PATTERN
BIT_PATTERN=BIT_PATTERN SHL 1
     ENDM

Besides illustrating the advantages of local optimization, this example also shows that it generally pays to precalculate results; this is often done at or before assembly time, but precalculated tables can also be built at run time. This is merely one aspect of a fundamental optimization rule: Move as much work as possible out of your critical code by whatever means necessary.

NOT Flips Bits—Not Flags

The NOT instruction flips all the bits in the operand, from 0 to 1 or from 1 to 0. That’s as simple as could be, but NOT nonetheless has a minor but interesting talent: It doesn’t affect the flags. That can be irritating; I once spent a good hour tracking down a bug caused by my unconscious assumption that NOT does set the flags. After all, every other arithmetic and logical instruction sets the flags; why not NOT? Probably because NOT isn’t considered to be an arithmetic or logical instruction at all; rather, it’s a data manipulation instruction, like MOV and the various rotates. (These are RCR, RCL, ROR, and ROL, which affect only the Carry and Overflow flags.) NOT is often used for tasks, such as flipping masks, where there’s no reason to test the state of the result, and in that context it can be handy to keep the flags unmodified for later testing.

Besides, if you want to NOT an operand and set the flags in the process, you can just XOR it with -1. Put another way, the only functional difference between NOT AX and XOR AX,0FFFFH is that XOR modifies the flags and NOT doesn’t.

The x86 instruction set offers many ways to accomplish almost any task. Understanding the subtle distinctions between the instructions—whether and which flags are set, for example—can be critical when you’re trying to optimize a code sequence and you’re running out of registers, or when you’re trying to minimize branching.

Incrementing with and without Carry

Another case in which there are two slightly different ways to perform a task involves adding 1 to an operand. You can do this with INC, as in INC AX, or you can do it with ADD, as in ADD AX,1. What’s the difference? The obvious difference is that INC is usually a byte or two shorter (the exception being ADD AL,1, which at two bytes is the same length as INC AL), and is faster on some processors. Less obvious, but no less important, is that ADD sets the Carry flag while INC leaves the Carry flag untouched.

Why is that important? Because it allows INC to function as a data pointer manipulation instruction for multi-word arithmetic. You can use INC to advance the pointers in code like that shown in Listing 7.5 without having to do any work to preserve the Carry status from one addition to the next.

LISTING 7.5 L7-5.ASM

        CLC                  ;clear the Carry for the initial addition
LOOP_TOP:
        MOV    AX,[SI];get next source operand word
        ADC    [DI],AX;add with Carry to dest operand word
        INC    SI            ;point to next source operand word
        INC    SI
        INC    DI            ;point to next dest operand word
        INC    DI
        LOOP   LOOP_TOP

If ADD were used, the Carry flag would have to be saved between additions, with code along the lines shown in Listing 7.6.

LISTING 7.6 L7-6.ASM

     CLC            ;clear the carry for the initial addition
LOOP_TOP:
     MOV  AX,[SI]   ;get next source operand word
     ADC  [DI],AX   ;add with carry to dest operand word
     LAHF           ;set aside the carry flag
     ADD  SI,2      ;point to next source operand word
     ADD  DI,2      ;point to next dest operand word
     SAHF           ;restore the carry flag
     LOOP LOOP_TOP

It’s not that the Listing 7.6 approach is necessarily better or worse; that depends on the processor and the situation. The Listing 7.6 approach is different, and if you understand the differences, you’ll be able to choose the best approach for whatever code you happen to write. (DEC has the same property of preserving the Carry flag, by the way.)

There are a couple of interesting aspects to the last example. First, note that LOOP doesn’t affect any flags at all; this allows the Carry flag to remain unchanged from one addition to the next. Not altering the arithmetic flags is a common characteristic of program control instructions (as opposed to arithmetic and logical instructions like SUB and AND, which do alter the flags).

The rule is not that the arithmetic flags change whenever the CPU performs a calculation; rather, the flags change whenever you execute an arithmetic, logical, or flag control (such as CLC to clear the Carry flag) instruction.

Not only do LOOP and JCXZ not alter the flags, but REP MOVS, which counts down CX to 0, doesn’t affect the flags either.

The other interesting point about the last example is the use of LAHF and SAHF, which transfer the low byte of the FLAGS register to and from AH, respectively. These instructions were created to help provide compatibility with the 8080’s (that’s 8080, not 8088) PUSH PSW and POP PSW instructions, but turn out to be compact (one byte) instructions for saving and restoring the arithmetic flags. A word of caution, however: SAHF restores the Carry, Zero, Sign, Auxiliary Carry, and Parity flags—but not the Overflow flag, which resides in the high byte of the FLAGS register. Also, be aware that LAHF and SAHF provide a fast way to preserve the flags on an 8088 but are relatively slow instructions on the 486 and Pentium.

There are times when it’s a clear liability that INC doesn’t set the Carry flag. For instance

INC   AX
ADC   DX,0

does not increment the 32-bit value in DX:AX. To do that, you’d need the following:

ADD   AX,1
ADC   DX,0

As always, pay attention!

Table of Contents