Previous Table of Contents Next

FDIV is a painfully slow instruction, taking 39 cycles at full precision and 33 cycles at double precision, which is the default precision for Visual C++ 2.0. While FDIV executes, the FPU is occupied, and can’t process subsequent FP instructions until FDIV finishes. However, during the cycles while FDIV is executing (with the exception of the one cycle during which FDIV starts), the integer unit can simultaneously execute instructions other than IMUL. (IMUL uses the FPU, and can only overlap with FDIV for a few cycles.) Since the integer unit can execute two instructions per cycle, this means it’s possible to have three instructions, an FDIV and two integer instructions, executing at the same time. That’s exactly what happens, for example, during the second cycle of this code:

FDIV ST(0),ST(1)

There’s an important limitation, though; if the instruction stream following the FDIV reaches a FP instruction (or an IMUL), then that instruction and all subsequent instructions, both integer and FP, must wait to execute until FDIV has finished.

When a FADD, FSUB, or FMUL instruction is executed, it is 3 cycles before the result can be used by another instruction. (There’s an exception: If the instruction that attempts to use the result is an FST to memory, there’s an extra cycle lost, so it’s 4 cycles from the start of an arithmetic instruction until an FST of that value can begin, so

FMUL ST(0),ST(1)
FST  [temp]

takes 6 cycles in all.) Again, it’s possible to execute integer-unit instructions during the 2 (or 3, for FST) cycles after one of these FP instructions starts. There’s a more exciting possibility here, though: Given properly structured code, the FPU is capable of averaging 1 cycle per FADD, FSUB, or FMUL. The secret is pipelining.

Pipelining, Latency, and Throughput

The Pentium’s FPU is the first pipelined x86 FPU. Pipelining means that the FPU is capable of starting an instruction every cycle, and can simultaneously handle several instructions in various stages of completion. Only certain x86 FP instructions allow another instruction to start on the next cycle, though: FADD, FSUB, and FMUL are pipelined, but FST and FDIV are not. (FLD executes in a single cycle, so pipelining is not an issue.) Thus, in the code sequence


FADD1 can start on cycle N, FSUB can start on cycle N+1, FADD2 can start on cycle N+2, and FMUL can start on cycle N+3. At the start of cycle N+3, the result of FADD1 is available in the destination operand, because it’s been 3 cycles since the instruction started; FSUB is starting the final cycle of calculation; FADD2 is starting its second cycle, with one cycle yet to go after this; and FMUL is about to be issued. Each of the instructions takes 3 cycles to produce a result from the time it starts, but because they’re simultaneously processed at different pipeline stages, one instruction is issued and one instruction completes every cycle. Thus, the latency of these instructions—that is, the time until the result is available—is 3 cycles, but the throughput—the rate at which the FPU can start new instructions—is 1 cycle. An exception is that the FPU is capable of starting an FMUL only every 2 cycles, so between these two instructions

FMUL ST(1),ST(0)
FMUL ST(2),ST(0)

there’s a 1-cycle stall, and the following three instructions execute just as fast as the above pair:

FMUL ST(1),ST(0)
FLD  ST(4)
FMUL ST(0),ST(1)

There’s a caveat here, though: A FP instruction can’t be issued until its operands are available. The FPU can reach a throughput of 1 cycle per instruction on this code

FADD ST(1),ST(0)
FLD  [temp]
FSUB ST(1),ST(0)

because neither the FLD nor the FSUB needs the result from the FADD. Consider, however

FADD ST(0),ST(2)
FSUB ST(0),ST(1)

where the ST(0) operand to FSUB is calculated by FADD. Here, FSUB can’t start until FADD has completed, so there are 2 stall cycles between the two instructions. When dependencies like this occur, the FPU runs at latency rather than throughput speeds, and performance can drop by as much as two-thirds.


One piece of the puzzle is still missing. Clearly, to get maximum throughput, we need to interleave FP instructions, such that at any one time ideally three instructions are in the pipeline at once. Further, these instructions must not depend on one another for operands. But ST(0) must always be one of the operands; worse, FLD can only push into ST(0), and FST can only store from ST(0). How, then, can we keep three independent instructions going?

The easy answer would be for Intel to change the FP registers from a stack to a set of independent registers. Since they couldn’t do that, thanks to compatibility issues, they did the next best thing: They made the FXCH instruction, which swaps ST(0) and any other FP register, virtually free. In general, if FXCH is both preceded and followed by FP instructions, then it takes no cycles to execute. (Application Note 500, “Optimizations for Intel’s 32-bit Processors,” February 1994, available from, describes all .the conditions under which FXCH is free.) This allows you to move the target of a pending operation from ST(0) to another register, at the same time bringing another register into ST(0) where it can be used, all at no cost. So, for example, we can start three multiplications, then use FXCH to swap back to start adding the results of the first two multiplications, without incurring any stalls, as shown in Listing 63.1.

Listing 63.1 L63-1.ASM

; use of fxch to allow addition of first two; products to start while third : multiplication finishes
        fld     [vec0+0]        ;starts & ends on cycle 0
        fmul    [vec1+0]        ;starts on cycle 1
        fld     [vec0+4]        ;starts & ends on cycle 2
        fmul    [vec1+4]        ;starts on cycle 3
        fld     [vec0+8]        ;starts & ends on cycle 4
        fmul    [vec1+8]        ;starts on cycle 5
        fxch    st(1)           ;no cost
        faddp   st(2),st(0)     ;starts on cycle 6

The Dot Product

Now we’re ready to look at fast FP for common 3-D operations; we’ll start by looking at how to speed up the dot product. As discussed in Chapter 30, the dot product is heavily used in 3-D to calculate cosines and to project points along vectors. The dot product is calculated as d = u1v1 + u2v2 + u3v3; with three loads, three multiplies, two adds, and a store, the theoretical minimum time for this calculation is 10 cycles.

Listing 63.2 shows a straightforward dot product implementation. This version loses 7 cycles to stalls. Listing 63.3 cuts the loss to 5 cycles by doing all three FMULs first, then using FXCH to set the third FXCH aside to complete while the results of the first two FMULs, which have completed, are added. Listing 43.3 still loses 50 percent to stalls, but unless some other code is available to be interleaved with the dot product code, that’s all we can do to speed things up. Fortunately, dot products are often used in contexts where there’s plenty of interleaving potential, as we’ll see when we discuss transformation.

Previous Table of Contents Next

Graphics Programming Black Book © 2001 Michael Abrash