Previous | Table of Contents | Next |

**Listing 63.2 1 L63-2.ASM**

; unoptimized dot product; 17 cycles fld [vec0+0] ;starts & ends on cycle 0 fmul [vec1+0] ;starts on cycle 1 fld [vec0+4] ;starts & ends on cycle 2 fmul [vec1+4] ;starts on cycle 3 fld [vec0+8] ;starts & ends on cycle 4 fmul [vec1+8] ;starts on cycle 5 ;stalls for cycles 6-7 faddp st(1),st(0) ;starts on cycle 8 ;stalls for cycles 9-10 faddp st(1),st(0) ;starts on cycle 11 ;stalls for cycles 12-14 fstp [dot] ;starts on cycle 15, ; ends on cycle 16

**Listing 63.3 L63-3.ASM**

;optimized dot product; 15 cycles fld [vec0+0] ;starts & ends on cycle 0 fmul [vec1+0] ;starts on cycle 1 fld [vec0+4] ;starts & ends on cycle 2 fmul [vec1+4] ;starts on cycle 3 fld [vec0+8] ;starts & ends on cycle 4 fmul [vec1+8] ;starts on cycle 5 fxch st(1) ;no cost faddp st(2),st(0) ;starts on cycle 6 ;stalls for cycles 7-8 faddp st(1),st(0) ;starts on cycle 9 ;stalls for cycles 10-12 fstp [dot] ;starts on cycle 13, ; ends on cycle 14

When last we looked at the cross product, we found that it’s handy for generating a vector that’s normal to two other vectors. The cross product is calculated as [u_{2}v_{3}-u_{3}v_{2} u_{3}v_{1}-u_{1}v_{3} u_{1}v_{2}-u_{2}v_{1}]. The theoretical minimum cycle count for the cross product is 21 cycles. Listing 63.4 shows a straightforward implementation that calculates each component of the result separately, losing 15 cycles to stalls.

**Listing 63.4 L63-4.ASM**

;unoptimized cross product; 36 cycles fld [vec0+4] ;starts & ends on cycle 0 fmul [vec1+8] ;starts on cycle 1 fld [vec0+8] ;starts & ends on cycle 2 fmul [vec1+4] ;starts on cycle 3 ;stalls for cycles 4-5 fsubrp st(1),st(0) ;starts on cycle 6 ;stalls for cycles 7-9 fstp [vec2+0] ;starts on cycle 10, ; ends on cycle 11 fld [vec0+8] ;starts & ends on cycle 12 fmul [vec1+0] ;starts on cycle 13 fld [vec0+0] ;starts & ends on cycle 14 fmul [vec1+8] ;starts on cycle 15 ;stalls for cycles 16-17 fsubrp st(1),st(0) ;starts on cycle 18 ;stalls for cycles 19-21 fstp [vec2+4] ;starts on cycle 22, ; ends on cycle 23 fld [vec0+0] ;starts & ends on cycle 24 fmul [vec1+4] ;starts on cycle 25 fld [vec0+4] ;starts & ends on cycle 26 fmul [vec1+0] ;starts on cycle 27 ;stalls for cycles 28-29 fsubrp st(1),st(0) ;starts on cycle 30 ;stalls for cycles 31-33 fstp [vec2+8] ;starts on cycle 34, ; ends on cycle 35

We couldn’t get rid of many of the stalls in the dot product code because with six inputs and one output, it was impossible to interleave all the operations. However, the cross product, with three outputs, is much more amenable to optimization. In fact, three is the magic number; because we have three calculation streams and the latency of FADD, FSUB, and FMUL is 3 cycles, we can eliminate almost every single stall in the cross-product calculation, as shown in Listing 63.5. Listing 63.5 loses only one cycle to a stall, the cycle before the first FST; the relevant FSUB has just finished on the preceding cycle, so we run into the extra cycle of latency associated with FST. Listing 63.5 is more than 60 percent faster than Listing 63.4, a striking illustration of the power of properly managing the Pentium’s FP pipeline.

**Listing 63.5 L63-5.ASM**

;optimized cross product; 22 cycles fld [vec0+4] ;starts & ends on cycle 0 fmul [vec1+8] ;starts on cycle 1 fld [vec0+8] ;starts & ends on cycle 2 fmul [vec1+0] ;starts on cycle 3 fld [vec0+0] ;starts & ends on cycle 4 fmul [vec1+4] ;starts on cycle 5 fld [vec0+8] ;starts & ends on cycle 6 fmul [vec1+4] ;starts on cycle 7 fld [vec0+0] ;starts & ends on cycle 8 fmul [vec1+8] ;starts on cycle 9 fld [vec0+4] ;starts & ends on cycle 10 fmul [vec1+0] ;starts on cycle 11 fxch st(2) ;no cost fsubrp st(5),st(0) ;starts on cycle 12 fsubrp st(3),st(0) ;starts on cycle 13 fsubrp st(1),st(0) ;starts on cycle 14 fxch st(2) ;no cost ;stalls for cycle 15 fstp [vec2+0] ;starts on cycle 16, ; ends on cycle 17 fstp [vec2+4] ;starts on cycle 18, ; ends on cycle 19 fstp [vec2+8] ;starts on cycle 20, ; ends on cycle 21

Transforming a point, for example from worldspace to viewspace, is one of the most heavily used FP operations in realtime 3-D. Conceptually, transformation is nothing more than three dot products and three additions, as I will discuss in Chapter 61. (Note that I’m talking about a subset of a general 4×4 transformation matrix, where the fourth row is always implicitly [0 0 0 1]. This limited form suffices for common transformations, and does 25 percent less work than a full 4×4 transformation.)

Transformation is calculated as:

- - - - - - v_{1}m_{11}m_{12}m_{13}m_{14}u_{1}v_{2}= m_{21}m_{22}m_{23}m_{24}u_{2}v_{3}m_{31}m_{32}m_{33}m_{34}u_{3}1 0 0 0 1 1 - - - - - -

or

v_{1}= m_{11}u_{1}+ m_{12}u_{2}+ m_{13}u_{3}+ m_{14}v_{2}= m_{21}u_{1}+ m_{22}u_{2}+ m_{23}u_{3}+ m_{24}v_{3}= m_{31}u_{1}+ m_{32}u_{2}+ m_{33}u_{3}+ m_{34.}

Previous | Table of Contents | Next |

Graphics Programming Black Book © 2001 Michael Abrash