Previous Table of Contents Next

Faster Addressing and More

I’ll spend the rest of this chapter covering a variety of Pentium optimization tips. For starters, effective address calculations (that is, the addition and scaling required to calculate a memory operand’s address, as for example in MOV EAX,[EBX+ECX*2+4]) never take any extra cycles on the Pentium (other than possibly an AGI cycle), even for the use of base+index addressing (as in MOV [ESI+EDI],EAX) or scaling (*2, *4, or *8, as in INC ARRAY[ESI*4]). On the 486, both of the latter cases cause a 1-cycle penalty. The faster effective address calculations have the side effect of making LEA very attractive as an arithmetic instruction. LEA can add any two registers, one of which can be multiplied by one, two, four, or eight, plus a constant value, and can store the result in any register—all in one cycle, apart from AGIs. Not only that, but as we’ll see in the next chapter, LEA can go through either pipe, whereas SHL can only go through the U-pipe, so LEA is often a superior choice for multiplication by three, four, five, eight, or nine. (ADD is the best choice for multiplication by two.) If you use LEA for arithmetic, do remember that unlike ADD and SHL, it doesn’t modify any flags.

As on the 486, memory operands should not cross any more alignment boundaries than absolutely necessary. Word operands should be word-aligned, dword operands should be dword-aligned, and qword operands (double-precision variables) should be qword-aligned. Spanning a dword boundary, as in

mov ebx,3
mov eax,[ebx]

costs three cycles. On the other hand, as noted above, branch targets can now span cache lines with impunity, so on the Pentium there’s no good argument for the paragraph (that is, 16-byte) alignment that Intel recommends for 486 jump targets. The 32-byte alignment might make for slightly more efficient Pentium cache usage, but would make code much bigger overall.

In fact, given that most jump targets aren’t in performance-critical code, it’s hard to make a compelling argument for aligning branch targets even on the 486. I’d say that no alignment (except possibly where you know a branch target lies in a key loop), or at most dword alignment (for the 386) is plenty, and can shrink code size considerably.

Instruction prefixes are awfully expensive; avoid them if you can. (These include size and addressing prefixes, segment overrides, LOCK, and the 0FH prefixes that extend the instruction set with instructions such as MOVSX. The exceptions are conditional jumps, a fast special case.) At a minimum, a prefix byte generally takes an extra cycle and shuts down the V-pipe for that cycle, effectively costing as much as two normal instructions (although prefix cycles can overlap with previous multicycle instructions, or AGIs, as on the 486). This means that using 32-bit addressing or 32-bit operands in a 16-bit segment, or vice versa, makes for bigger code that’s significantly slower. So, for example, you should generally avoid 16-bit variables (shorts, in C) in 32-bit code, although if using 32-bit variables where they’re not needed makes your data space get a lot bigger, you may want to stick with shorts, especially since longs use the cache less efficiently than shorts. The trade-off depends on the amount of data and the number of instructions that reference that data. (eight-bit variables, such as chars, have no extra overhead and can be used freely, although they may be less desirable than longs for compilers that tend to promote variables to longs when performing calculations.) Likewise, you should if possible avoid putting data in the code segment and referring to it with a CS: prefix, or otherwise using segment overrides.

LOCK is a particularly costly instruction, especially on multiprocessor machines, because it locks the bus and requires that the hardware be brought into a synchronized state. The cost varies depending on the processor and system, but LOCK can make an INC [mem] instruction (which normally takes 3 cycles) 5, 10, or more cycles slower. Most programmers will never use LOCK on purpose—it’s primarily an operating system instruction—but there’s a hidden gotcha here because the XCHG instruction always locks the bus when used with a memory operand.

XCHG is a tempting instruction that’s often used in assembly language; for example, exchanging with video memory is a popular way to read and write VGA memory in a single instruction—but it’s now a bad idea. As it happens, on the 486 and Pentium, using MOVs to read and write memory is faster, anyway; and even on the 486, my measurements indicate a five-cycle tax for LOCK in general, and a nine-cycle execution time for XCHG with memory. Avoid XCHG with memory if you possibly can.

As with the 486, don’t use ENTER or LEAVE, which are slower than the equivalent discrete instructions. Also, start using TEST reg,reg instead of AND reg,reg or OR reg,reg to test whether a register is zero. The reason, as we’ll see in Chapter 21, is that TEST, unlike AND and OR, never modifies the target register. Although in this particular case AND and OR don’t modify the target register either, the Pentium has no way of knowing that ahead of time, so if AND or OR goes through the U-pipe, the Pentium may have to shut down the V-pipe for a cycle to avoid potential dependencies on the result of the AND or OR. TEST suffers from no such potential dependencies.

Previous Table of Contents Next

Graphics Programming Black Book © 2001 Michael Abrash