Previous Table of Contents Next

When it comes to implementation, however, transformation is quite different from three separate dot products and additions, because once again the magic number three is involved. Three separate dot products and additions would take 60 cycles if each were calculated using the unoptimized dot-product code of Listing 63.2, and would take 54 cycles if done one after the other using the faster dot-product code of Listing 63.3, in each case followed by the a final addition per dot product.

When fully interleaved, however, only a single cycle is lost (again to the extra cycle of FST latency), and the cycle count drops to 34, as shown in Listing 63.6. This means that on a 100 MHz Pentium, it’s theoretically possible to do nearly 3,000,000 transforms per second, although that’s a purely hypothetical number, due to cache effects and set-up costs. Still, more than 1,000,000 transforms per second is certainly feasible; at a frame rate of 30 Hz, that’s an impressive 30,000 transforms per frame.

Listing 63.6 L63-6.ASM

;optimized transformation: 34 cycles
      fld      [vec0+0]           ;starts & ends on cycle 0
      fmul     [matrix+0]         ;starts on cycle 1
      fld      [vec0+0]           ;starts & ends on cycle 2
      fmul     [matrix+16]        ;starts on cycle 3
      fld      [vec0+0]           ;starts & ends on cycle 4
      fmul     [matrix+32]        ;starts on cycle 5
      fld      [vec0+4]           ;starts & ends on cycle 6
      fmul     [matrix+4]         ;starts on cycle 7
      fld      [vec0+4]           ;starts & ends on cycle 8
      fmul     [matrix+20]        ;starts on cycle 9
      fld      [vec0+4]           ;starts & ends on cycle 10
      fmul     [matrix+36]        ;starts on cycle 11
      fxch     st(2)              ;no cost
      faddp    st(5),st(0)        ;starts on cycle 12
      faddp    st(3),st(0)        ;starts on cycle 13
      faddp    st(1),st(0)        ;starts on cycle 14
      fld      [vec0+8]           ;starts & ends on cycle 15
      fmul     [matrix+8]         ;starts on cycle 16
      fld      [vec0+8]           ;starts & ends on cycle 17
      fmul     [matrix+24]        ;starts on cycle 18
      fld      [vec0+8]           ;starts & ends on cycle 19
      fmul     [matrix+40]        ;starts on cycle 20
      fxch     st(2)              ;no cost
      faddp    st(5),st(0)        ;starts on cycle 21
      faddp    st(3),st(0)        ;starts on cycle 22
      faddp    st(1),st(0)        ;starts on cycle 23
      fxch     st(2)              ;no cost
      fadd     [matrix+12]        ;starts on cycle 24
      fxch     st(1)              ;starts on cycle 25
      fadd     [matrix+28]        ;starts on cycle 26
      fxch     st(2)              ;no cost
      fadd     [matrix+44]        ;starts on cycle 27
      fxch     st(1)              ;no cost
      fstp     [vec1+0]           ;starts on cycle 28,
                                  ; ends on cycle 29
      fstp     [vec1+8]           ;starts on cycle 30,
                                  ; ends on cycle 31
      fstp     [vec1+4]           ;starts on cycle 32,
                                  ; ends on cycle 33


The final optimization we’ll look at is projection to screenspace. Projection itself is basically nothing more than a divide (to get 1/z), followed by two multiplies (to get x/z and y/z), so there wouldn’t seem to be much in the way of FP optimization possibilities there. However, remember that although FDIV has a latency of up to 39 cycles, it can overlap with integer instructions for all but one of those cycles. That means that if we can find enough independent integer work to do before we need the 1/z result, we can effectively reduce the cost of the FDIV to one cycle. Projection by itself doesn’t offer much with which to overlap, but other work such as clamping, window-relative adjustments, or 2-D clipping could be interleaved with the FDIV for the next point.

Another dramatic speed-up is possible by setting the precision of the FPU down to single precision via FLDCW, thereby cutting the time FDIV takes to a mere 19 cycles. I don’t have the space to discuss reduced precision in detail in this book, but be aware that along with potentially greater performance, it carries certain risks, as well. The reduced precision, which affects FADD, FSUB, FMUL, FDIV, and FSQRT, can cause subtle differences from the results you’d get using compiler defaults. If you use reduced precision, you should be on the alert for precision-related problems, such as clipped values that vary more than you’d expect from the precise clip point, or the need for using larger epsilons in comparisons for point-on-plane tests.

Rounding Control

Another useful area that I can note only in passing here is that of leaving the FPU in a particular rounding mode while performing bulk operations of some sort. For example, conversion to int via the FIST instruction requires that the FPU be in chop mode. Unfortunately, the FLDCW instruction must be used to get the FPU into and out of chop mode, and each FLDCW takes 7 cycles, meaning that compilers often take at least 14 cycles for each float->int conversion. In assembly, you can just set the rounding state (or, likewise, the precision, for faster FDIVs) once at the start of the loop, and save all those FLDCW cycles each time through the loop. This is even more true for ceil(), which many compilers implement as horrendously inefficient subroutines, even though there are rounding modes for both ceil() and floor(). Again, though, be aware that results of FP calculations will be subtly different from compiler default behavior while chop, ceil, or floor mode is in effect.

A final note: There are some speed-ups to be had by manipulating FP variables with integer instructions. Check out Chris Hecker’s column in the February/March 1996 issue of Game Developer for details.

A Farewell to 3-D Fixed-Point

As with most optimizations, there are both benefits and hazards to floating-point acceleration, especially pedal-to-the-metal optimizations such as the last few I’ve mentioned. Nonetheless, I’ve found floating-point to be generally both more robust and easier to use than fixed-point even with those maximum optimizations. Now that floating-point is fast enough for real time, I don’t expect to be doing a whole lot of fixed-point 3-D math from here on out.

And I won’t miss it a bit.

Previous Table of Contents Next

Graphics Programming Black Book © 2001 Michael Abrash