
When it comes to implementation, however, transformation is quite different from three separate dot products and additions, because once again the magic number *three* is involved. Three separate dot products and additions would take 60 cycles if each were calculated using the unoptimized dot-product code of Listing 63.2, and would take 54 cycles if done one after the other using the faster dot-product code of Listing 63.3, in each case followed by a final addition per dot product.

When fully interleaved, however, only a single cycle is lost (again to the extra cycle of FST latency), and the cycle count drops to 34, as shown in Listing 63.6. This means that on a 100 MHz Pentium, it’s theoretically possible to do nearly 3,000,000 transforms per second, although that’s a purely hypothetical number, due to cache effects and set-up costs. Still, more than 1,000,000 transforms per second is certainly feasible; at a frame rate of 30 Hz, that’s an impressive 30,000 transforms per frame.
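The arithmetic behind those figures is easy to check. The following is only a back-of-the-envelope sketch; the function names are mine, and it deliberately ignores the cache effects and set-up costs just mentioned.

```c
/* Peak transforms per second for a given clock rate and per-transform
   cycle count. Ignores cache misses and set-up costs. */
static double transforms_per_second(double clock_hz, double cycles_per_transform)
{
    return clock_hz / cycles_per_transform;
}

/* Transforms available in each frame at a given frame rate. */
static double transforms_per_frame(double per_second, double frames_per_second)
{
    return per_second / frames_per_second;
}
```

Plugging in 100 MHz and 34 cycles gives a little over 2.9 million transforms per second, and 1,000,000 transforms per second at 30 Hz works out to roughly 33,000 per frame.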

**Listing 63.6 L63-6.ASM**

    ;optimized transformation: 34 cycles
        fld     [vec0+0]        ;starts & ends on cycle 0
        fmul    [matrix+0]      ;starts on cycle 1
        fld     [vec0+0]        ;starts & ends on cycle 2
        fmul    [matrix+16]     ;starts on cycle 3
        fld     [vec0+0]        ;starts & ends on cycle 4
        fmul    [matrix+32]     ;starts on cycle 5
        fld     [vec0+4]        ;starts & ends on cycle 6
        fmul    [matrix+4]      ;starts on cycle 7
        fld     [vec0+4]        ;starts & ends on cycle 8
        fmul    [matrix+20]     ;starts on cycle 9
        fld     [vec0+4]        ;starts & ends on cycle 10
        fmul    [matrix+36]     ;starts on cycle 11
        fxch    st(2)           ;no cost
        faddp   st(5),st(0)     ;starts on cycle 12
        faddp   st(3),st(0)     ;starts on cycle 13
        faddp   st(1),st(0)     ;starts on cycle 14
        fld     [vec0+8]        ;starts & ends on cycle 15
        fmul    [matrix+8]      ;starts on cycle 16
        fld     [vec0+8]        ;starts & ends on cycle 17
        fmul    [matrix+24]     ;starts on cycle 18
        fld     [vec0+8]        ;starts & ends on cycle 19
        fmul    [matrix+40]     ;starts on cycle 20
        fxch    st(2)           ;no cost
        faddp   st(5),st(0)     ;starts on cycle 21
        faddp   st(3),st(0)     ;starts on cycle 22
        faddp   st(1),st(0)     ;starts on cycle 23
        fxch    st(2)           ;no cost
        fadd    [matrix+12]     ;starts on cycle 24
        fxch    st(1)           ;starts on cycle 25
        fadd    [matrix+28]     ;starts on cycle 26
        fxch    st(2)           ;no cost
        fadd    [matrix+44]     ;starts on cycle 27
        fxch    st(1)           ;no cost
        fstp    [vec1+0]        ;starts on cycle 28,
                                ; ends on cycle 29
        fstp    [vec1+8]        ;starts on cycle 30,
                                ; ends on cycle 31
        fstp    [vec1+4]        ;starts on cycle 32,
                                ; ends on cycle 33
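For reference, here is roughly what Listing 63.6 computes, written as plain C. Treat it as a sketch for checking results, not a replacement for the assembly. The function name and the 3x4 array type are my own assumptions, chosen to match the byte offsets in the listing: rows 16 bytes apart, with the translation term in the fourth element of each row.

```c
/* Reference for the transform in Listing 63.6: three dot products, each
   followed by the addition of that row's translation term.
   m[r][0..2] correspond to [matrix + 16*r + 0/4/8] in the listing,
   and m[r][3] corresponds to [matrix + 16*r + 12]. */
static void transform_point(const float m[3][4], const float in[3], float out[3])
{
    int r;
    for (r = 0; r < 3; r++) {
        out[r] = m[r][0] * in[0]
               + m[r][1] * in[1]
               + m[r][2] * in[2]
               + m[r][3];
    }
}
```

A version like this is handy for testing a hand-scheduled routine, because its results should agree with the assembly to within floating-point rounding.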

The final optimization we’ll look at is projection to screenspace. Projection itself is basically nothing more than a divide (to get 1/z), followed by two multiplies (to get x/z and y/z), so there wouldn’t seem to be much in the way of FP optimization possibilities there. However, remember that although FDIV has a latency of up to 39 cycles, it can overlap with integer instructions for all but one of those cycles. That means that if we can find enough independent integer work to do before we need the 1/z result, we can effectively reduce the cost of the FDIV to one cycle. Projection by itself doesn’t offer much with which to overlap, but other work such as clamping, window-relative adjustments, or 2-D clipping could be interleaved with the FDIV for the next point.
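The shape of that computation, one divide followed by two multiplies, looks like this in C. This is a minimal sketch: the `Point3` and `Point2` types and the function name are hypothetical, z is assumed to be nonzero, and in C the compiler decides the final instruction order, so the comment only marks where overlapped work would conceptually go.

```c
typedef struct { float x, y, z; } Point3;   /* hypothetical view-space point */
typedef struct { float x, y; } Point2;      /* hypothetical projected point */

/* Perspective projection with a single divide: compute 1/z once,
   then scale x and y by it. Assumes p.z is not zero. */
static Point2 project_point(Point3 p)
{
    Point2 s;
    float recip_z = 1.0f / p.z;   /* the one expensive operation */
    /* Independent integer work (clamping or clipping of the previous
       point, say) would ideally sit here, before recip_z is needed. */
    s.x = p.x * recip_z;          /* x/z */
    s.y = p.y * recip_z;          /* y/z */
    return s;
}
```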

Another dramatic speed-up is possible by setting the precision of the FPU down to single precision via FLDCW, thereby cutting the time FDIV takes to a mere 19 cycles. I don’t have the space to discuss reduced precision in detail in this book, but be aware that along with potentially greater performance, it carries certain risks, as well. The reduced precision, which affects FADD, FSUB, FMUL, FDIV, and FSQRT, can cause subtle differences from the results you’d get using compiler defaults. If you use reduced precision, you should be on the alert for precision-related problems, such as clipped values that vary more than you’d expect from the precise clip point, or the need for using larger epsilons in comparisons for point-on-plane tests.
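To make the epsilon point concrete, here is a sketch of a point-on-plane test with the tolerance passed in as a parameter, so it can be loosened when precision is reduced. The function name and parameter layout are my own, and any epsilon values you try with it are illustrative, not recommendations.

```c
/* Point-on-plane test: the point counts as on the plane when its
   distance from the plane is within epsilon. The normal is assumed
   to be unit length, and dist is the plane's distance from the origin. */
static int point_on_plane(const float normal[3], float dist,
                          const float p[3], float epsilon)
{
    float d = normal[0] * p[0] + normal[1] * p[1] + normal[2] * p[2] - dist;
    if (d < 0.0f) {
        d = -d;                   /* absolute distance from the plane */
    }
    return d <= epsilon;
}
```

A point that sits about 0.00001 off the plane passes this test with an epsilon of 0.0001 but fails with 0.000001, which is exactly the sort of borderline case that can change when results are computed at lower precision.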

Another useful area that I can note only in passing here is that of leaving the FPU in a particular rounding mode while performing bulk operations of some sort. For example, conversion to int via the FIST instruction requires that the FPU be in chop mode. Unfortunately, the FLDCW instruction must be used to get the FPU into and out of chop mode, and each FLDCW takes 7 cycles, meaning that compilers often take at least 14 cycles for each float->int conversion. In assembly, you can just set the rounding state (or, likewise, the precision, for faster FDIVs) once at the start of the loop, and save all those FLDCW cycles each time through the loop. This is even more true for **ceil()**, which many compilers implement as horrendously inefficient subroutines, even though there are rounding modes for both **ceil()** and **floor()**. Again, though, be aware that results of FP calculations will be subtly different from compiler default behavior while chop, ceil, or floor mode is in effect.

A final note: There are some speed-ups to be had by manipulating FP variables with integer instructions. Check out Chris Hecker’s column in the February/March 1996 issue of *Game Developer* for details.
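I can't reproduce that column here, so the following is a generic illustration of the idea and is not taken from it. It assumes `float` is a 32-bit IEEE-754 single, as it is on x86, and the helper names are my own. The sign of such a value lives in the top bit, so a sign test or an absolute value can be done entirely with integer operations.

```c
#include <stdint.h>
#include <string.h>

/* Copy the raw bits of a float into an integer.
   Assumes a 32-bit IEEE-754 single-precision float. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Sign test using only the top bit; also reports true for -0.0f. */
static int float_is_negative(float f)
{
    return (float_bits(f) >> 31) != 0;
}

/* Absolute value by clearing the sign bit with an integer AND. */
static float float_abs(float f)
{
    uint32_t u = float_bits(f) & 0x7FFFFFFFu;
    float r;
    memcpy(&r, &u, sizeof r);
    return r;
}
```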

As with most optimizations, there are both benefits and hazards to floating-point acceleration, especially pedal-to-the-metal optimizations such as the last few I’ve mentioned. Nonetheless, I’ve found floating-point to be generally both more robust and easier to use than fixed-point even with those maximum optimizations. Now that floating-point is fast enough for real time, I don’t expect to be doing a whole lot of fixed-point 3-D math from here on out.

And I won’t miss it a bit.


Graphics Programming Black Book © 2001 Michael Abrash