Previous Table of Contents Next

And indeed it does. When the test program is modified to draw to a local buffer, both the C and assembly language versions get 0.29 seconds faster, that being a measure of the time taken by display memory wait states. With those wait states factored out, the assembly language version of DrawHorizontalLineList becomes almost three times as fast as the C code.

There is a lesson here. An optimization has no fixed payoff; its value fluctuates according to the context in which it is used. There’s relatively little benefit to further optimizing code that already spends half its time waiting for display memory; no matter how good your optimizations, you’ll get only a two-times speedup at best, and generally much less than that. There is, on the other hand, potential for tremendous improvement when drawing to system memory, so if that’s where most of your drawing will occur, optimizations such as Listing 39.3 are well worth the effort.
Know the environments in which your code will run, and know where the cycles go in those environments.

LISTING 39.3 L39-3.ASM

; Draws all pixels in the list of horizontal lines passed in, in
; mode 13h, the VGA’s 320×200 256-color mode. Uses REP STOS to fill
; each line.
; C near-callable as:
;     void DrawHorizontalLineList(struct HLineList * HLineListPtr,
;          int Color);
; All assembly code tested with TASM and MASM

SCREEN_WIDTH        equ   320
SCREEN_SEGMENT      equ   0a000h

XStart        dw   ?          ;X coordinate of leftmost pixel in line
XEnd          dw   ?          ;X coordinate of rightmost pixel in line
HLine         ends

HLineList struc
Lngth         dw   ?          ;# of horizontal lines
YStart        dw   ?          ;Y coordinate of topmost line
HLinePtr      dw   ?          ;pointer to list of horz lines
HLineList     ends

Parms struc
              dw   2 dup(?)   ;return address & pushed BP
HLineListPtr  dw   ?          ;pointer to HLineList structure
Color         dw   ?          ;color with which to fill
Parms         ends

      .model small
      public _DrawHorizontalLineList
      align   2
_DrawHorizontalLineList   proc
   push bp                    ;preserve caller’s stack frame
   mov  bp,sp                 ;point to our stack frame
   push   si                  ;preserve caller’s register variables
   push   di
   cld                        ;make string instructions inc pointers

   mov   ax,SCREEN_SEGMENT
   mov   es,ax                ;point ES to display memory for REP STOS

   mov   si,[bp+HLineListPtr] ;point to the line list
   mov   ax,SCREEN_WIDTH      ;point to the start of the first scan
   mul   [si+YStart]          ; line in which to draw
   mov   dx,ax                ;ES:DX points to first scan line to
                              ; draw
   mov   bx,[si+HLinePtr]     ;point to the XStart/XEnd descriptor
                              ; for the first (top) horizontal line
   mov   si,[si+Lngth]        ;# of scan lines to draw
   and   si,si                ;are there any lines to draw?
   jz    FillDone             ;no, so we’re done
   mov   al,byte ptr [bp+Color];color with which to fill
   mov   ah,al                ;duplicate color for STOSW
   mov   di,[bx+XStart]       ;left edge of fill on this line
   mov   cx,[bx+XEnd]         ;right edge of fill
   sub   cx,di
   js    LineFillDone         ;skip if negative width
   inc   cx                   ;width of fill on this line
   add   di,dx                ;offset of left edge of fill
   test  di,1                 ;does fill start at an odd address?
   jz    MainFill             ;no
   stosb                      ;yes, draw the odd leading byte to
                              ; word-align the rest of the fill
   dec   cx                   ;count off the odd leading byte
   jz    LineFillDone         ;done if that was the only byte
   shr   cx,1                 ;# of words in fill
   rep   stosw                ;fill as many words as possible
   adc   cx,cx                ;1 if there’s an odd trailing byte to
                              ; do, 0 otherwise
   rep   stosb                ;fill any odd trailing byte
   add   bx,size HLine        ;point to the next line descriptor
   add   dx,SCREEN_WIDTH      ;point to the next scan line
   dec   si                   ;count off lines to fill
   jnz   FillLoop
   pop   di                   ;restore caller’s register variables
   pop   si
   pop   bp                   ;restore caller’s stack frame
_DrawHorizontalLineList   endp

Maximizing REP STOS

Listing 39.3 doesn’t take the easy way out and use REP STOSB to fill each scan line; instead, it uses REP STOSW to fill as many pixel pairs as possible via word-sized accesses, using STOSB only to do odd bytes. Word accesses to odd addresses are always split by the processor into 2-byte accesses. Such word accesses take twice as long as word accesses to even addresses, so Listing 39.3 makes sure that all word accesses occur at even addresses, by performing a leading STOSB first if necessary.

Listing 39.3 is another case in which it’s worth knowing the environment in which your code will run. Extra code is required to perform aligned word-at-a-time filling, resulting in extra overhead. For very small or narrow polygons, that overhead might overwhelm the advantage of drawing a word at a time, making plain old REP STOSB faster.

Faster Edge Tracing

Finally, Listing 39.4 is an assembly language version of ScanEdge. Listing 39.4 is a relatively straightforward translation from C to assembly, but is nonetheless about twice as fast as Listing 39.2.

The version of ScanEdge in Listing 39.4 could certainly be sped up still further by unrolling the loops. FillConvexPolygon, the overall coordination routine, hasn’t even been converted to assembly language, so that could be sped up as well. I haven’t bothered with these optimizations because all code other than DrawHorizontalLineList takes only 14 percent of the overall polygon filling time when drawing to display memory; the potential return on optimizing nondrawing code simply isn’t great enough to justify the effort. Part of the value of a profiler is being able to tell when to stop optimizing; with Listings 39.3 and 39.4 in use, more than two-thirds of the time taken to draw polygons is spent waiting for display memory, so optimization is pretty much maxed out. However, further optimization might be worthwhile when drawing to system memory, where wait states are out of the picture and the nondrawing code takes a significant portion (46 percent) of the overall time.

Again, know where the cycles go .

By the way, note that all the versions of ScanEdge and FillConvexPolygon that we’ve looked at are adapter-independent, and that the C code is also machine-independent; all adapter-specific code is isolated in DrawHorizontalLineList. This makes it easy to add support for other graphics systems, such as the 8514/A, the XGA, or, for that matter, a completely non-PC system.

Previous Table of Contents Next

Graphics Programming Black Book © 2001 Michael Abrash