Previous Table of Contents Next

Listing 16.7 L16-7.ASM

         lodsw           ;get the next 2 bytes (AL = first, AH = 2nd)
         xlat            ;look up first’s char/not status
         xor     dl,al   ;see if there’s a new char/not status
         add     di,dx   ;we add 1 for each char/not transition
         mov     dl,al
         mov     al,ah   ;look at the second byte
         xlat            ;look up its char/not status
         xor     dl,al   ;see if there’s a new char/not status
         add     di,dx   ;we add 1 for each char/not transition
         mov     dl,al
         dec     dx
         jnz     ScanLoop

John later divides the transition count by two to get the word count. (Food for thought: It’s also possible to use CMP and ADC to detect words without branching.)

John’s approach makes it clear that word-counting is nothing more than a fairly simple state machine. The interesting part, of course, is building the fastest state machine.

Level 3: Breakthrough

The boundaries between the levels of optimization are not sharply defined. In a sense, level 3 optimization is just like levels 1 and 2, but more so. At level 3, one takes whatever level 2 perspective seems most promising, and implements it as efficiently as possible on the x86. Even more than at level 2, at level 3 this means breaking out of familiar patterns of thinking.

In the case of word counting, level 3 means building a table-driven state machine dedicated to processing a buffer of bytes into a count of words with a minimum of branching. This level of optimization strips away many of the abstractions we usually use in coding, such as loops, tests, and named variables—look back to Listing 16.5, and you’ll see what I mean. Only a few people reached this level, and I don’t think any of them did it without long, hard thinking; David Stafford’s final entry (that is, the one I present as Listing 16.5) was at least the fifth entry he sent me.

The key concept at level 3 is the use of a massive (64K) lookup table that processes byte sequences directly into word-count actions. With such a table, it’s possible to look up the appropriate action for two bytes simultaneously in just a few instructions; next, I’m going to look at the inspired and highly unusual way that David’s code, shown in Listing 16.5, does exactly that. (Before assembling Listing 16.5, you must run the C code in Listing 16.8, to generate an include file defining the 64K lookup table. When you assemble Listing 16.5, TASM will report a “location counter overflow” warning; ignore it.)


 #include <stdio.h>
 #include <ctype.h>
 #define ChType( c )  (((c) & 0x7f) == ‘\’’ || isalnum((c) & 0x7f))
 int NoCarry[ 4 ] = { 0, 0x80, 1, 0x80 };
 int Carry[ 4 ]   = { 1, 0x81, 1, 0x80 };
 void main( void )
   int ahChar, alChar, i;
   FILE *t = fopen( “QSCAN3.INC”, “wt” );
   printf( “Building table.  Please wait...” );
   for( ahChar = 0; ahChar < 128; ahChar++ )
     for( alChar = 0; alChar < 256; alChar++ )
       i = ChType( alChar ) * 2 + ChType( ahChar );
       if( alChar % 8 == 0 )  fprintf( t, “\ndb %02Xh”, NoCarry[ i ] );
       else                   fprintf( t, “,%02Xh”, NoCarry[ i ] );
       fprintf( t, “,%02Xh”, Carry[ i ] );
   fclose( t );

David’s approach is simplicity itself, although his implementation arguably is not. Consider any three sequential bytes in the buffer. Those three bytes define two potential places where a word might be counted, as shown in Figure 16.1. Given the separator/non-separator states of the three bytes, you can instantly determine whether to count a word or not; you count a word if and only if somewhere in the sequence there is a non-separator followed by a separator. Note that a maximum of one word can be counted per three-byte sequence.

The trick, then, is to identify the separator/not statuses of each set of three bytes and turn them into a 1 (count word) or 0 (don’t count word), as quickly as possible. Assuming that the separator/not status for the first byte is in the Carry flag, this is easily accomplished by a lookup in a 64K table, based on the Carry flag and the other two bytes, as shown in Figure 16.2. (Remember that we’re counting 7-bit ASCII here, so the high bit is ignored.) Thus, David is able to add the word/not status for each pair of bytes to the main word count simply by getting the two bytes, working in the carry status from the last byte, and using the resulting value to index into the 64K table, adding in the 1 or 0 value found in that table. A sequence of MOV/ADC/ADD suffices to perform all word-counting tasks for a pair of bytes. Three instructions, no branches—pretty nearly perfect code.

Figure 16.1
  The two potential word count locations.

One detail remains to be attended to: setting the Carry flag for next time if the last byte was a non-separator. David does this in a bizarre and incredibly effective way: He presets the high bit of the count, and sets the high bit in the lookup table for those entries looked up by non-separators. When a non-separator’s lookup entry is added to the count, it will produce a carry, as desired. The high bit of the count is masked off before being added to the total count, so David is essentially using different parts of the count variables for different purposes (counting, and setting the Carry flag).

Figure 16.2
  Looking up a word count status.

There are a number of other interesting details in David’s code, including the unrolling of the loop 64 times, so that 256 bytes in a row are processed without a single branch. Unfortunately, I lack the space to discuss Listing 16.5 any further. Perhaps that’s not so unfortunate, after all; I’d hate to deny you the pleasure of discovering the wonders of this rather remarkable code yourself. I will say one more thing, though. The cycle count for David’s inner loop is 6.5 cycles per byte processed, and the actual measured time for his routine, overhead and all, is 7.9 cycles/byte. The original C code clocked in at around 100 cycles/byte.

Enough said, I trust.

Enough Word Counting Already!

Before I finish up this chapter, I’d like to mention that Terje Mathisen’s WC word-counting program, which I’ve mentioned previously and which is available, with source, on Bix, is in the ballpark with David’s code for performance. What’s more, Terje’s program handles 8-bit ASCII, counts lines as well as words, and supports user-definable separator sets. It’s wonderful code, well worth a look; it also happens to be a great word-counting utility. By the way, Terje builds his 64K table on the fly, at program initialization; this allows for customized tables, shrinks the size of the EXE, and, according to Terje’s calculations, takes less time than loading the table off disk as part of the EXE.

So, has David written the fastest possible word-counting code? Well, maybe—but I have a letter from Terry Holmes, of San Rafael, California, that calculates the theoretical maximum performance of native 386 word-counting code at 5.5 cycles/byte, which would be significantly faster than David’s code. Terry, alas, didn’t bother to implement his design, but maybe I’ll take a shot at it someday. It’d be fun, for sure—but jeez, I’ve got real work to do!

Previous Table of Contents Next

Graphics Programming Black Book © 2001 Michael Abrash