|Previous||Table of Contents||Next|
LISTING 16.4 L16-4.ASM
; Assembly subroutine for Listing 16.2. Scans through Buffer, of ; length BufferLength, counting words and updating WordCount as ; appropriate, using a lookup table-based approach. BufferLength ; must be > 0. *CharFlag and *WordCount should equal 0 on the ; first call. Tested with TASM. ; C near-callable as: ; void ScanBuffer(char *Buffer, unsigned int BufferLength, ; char *CharFlag, unsigned long *WordCount); parms struc dw 2 dup(?) ;pushed return address & BP Buffer dw ? ;buffer to scan BufferLength dw ? ;length of buffer to scan CharFlag dw ? ;pointer to flag for state of last ;char processed on entry (0 on ;initial call). Updated on exit WordCount dw ? ;pointer to 32-bit count of words ; found (0 on initial call) parms ends .model small .data ; Table of char/not statuses for byte values 0-255 (128-255 are ; duplicates of 0-127 to effectively mask off bit 7, which some ; word processors set as an internal flag). CharStatusTable label byte REPT 2 db 39 dup(0) db 1 ;apostrophe db 8 dup(0) db 10 dup(1) ;0-9 db 7 dup(0) db 26 dup(1) ;A-Z db 6 dup(0) db 26 dup(1) ;a-z db 5 dup(0) ENDM .code public _ScanBuffer _ScanBuffer proc near push bp ;preserve callers stack frame mov bp,sp ;set up local stack frame push si ;preserve callers register vars push di mov si,[bp+Buffer] ;point to buffer to scan mov bx,[bp+WordCount] mov di,[bx] ;get current 32-bit word count mov dx,[bx+2] mov bx,[bp+CharFlag] mov al,[bx] ;get current CharFlag mov cx,[bp+BufferLength] ;get # of bytes to scan mov bx,offset CharStatusTable ScanLoop: and al,al ;ZF=0 if last byte was a char, ; ZF=1 if not lodsb ;get the next byte ;***doesnt change flags*** xlat ;look up its char/not status ;***doesnt change flags*** jz ScanLoopBottom ;dont count a word if last byte was ; not a character and al,al ;last byte was a character; is the ; current byte a character? jz CountWord ;no, so count a word ScanLoopBottom: dec cx ;count down buffer length jnz ScanLoop Done: mov si,[bp+CharFlag] mov [si],al ;set new CharFlag mov bx,[bp+WordCount] mov [bx],di ;set new word count mov [bx+2],dx pop di ;restore callers register vars pop si pop bp ;restore callers stack frame ret align 2 CountWord: add di,1 ;increment the word count adc dx,0 dec cx ;count down buffer length jnz ScanLoop jmp Done _ScanBuffer endp end
Listing 16.4 features several interesting tricks. First, it uses LODSB and XLAT in succession, a very neat way to get a pointed-to byte, advance the pointer, and look up the value indexed by the byte in a table, all with just two instruction bytes. (Interestingly, Listing 16.4 would probably run quite a bit better still on an 8088, where LODSB and XLAT have a greater advantage over conventional instructions. On the 486 and Pentium, however, LODSB and XLAT lose much of their appeal, and should be replaced with MOV instructions.) Better yet, LODSB and XLAT dont alter the flags, so the Zero flag status set before LODSB is still around to be tested after XLAT .
Finally, if you look closely, you will see that Listing 16.4 jumps out of the loop to increment the word count in the case where a word is actually found, with a duplicate of the loop-bottom code placed after the code that increments the word count, to avoid an extra branch back into the loop; this replaces the more intuitive approach of jumping around the incrementing code to the loop bottom when a word isnt found. Although this incurs a branch every time a word is found, a word is typically found only once every 5 or 6 bytes; on average, then, a branch is saved about two-thirds of the time. This is an excellent example of how understanding the nature of the data youre processing allows you to optimize in ways the compiler cant. Know your data!
So, gosh, Listing 16.4 is the best word-counting code in the universe, right? Not hardly. If theres one thing my years of toil in this vale of silicon have taught me, its that theres never a lack of potential for further optimization. Never! Off the top of my head, I can think of at least three ways to speed up Listing 16.4; and, since Turbo Profiler reports that even in Listing 16.4, 88 percent of the time is spent scanning the buffer (as opposed to reading the file), theres potential for those further optimizations to improve performance significantly. (However, it is true that when access is performed to a hard rather than RAM disk, disk access jumps to about half of overall execution time.) One possible optimization is unrolling the loop, although that is truly a last resort because it tends to make further changes extremely difficult.
|Exhaust all other optimizations before unrolling loops.|
The challenge I put to the readers of PC TECHNIQUES was to write a faster module to replace Listing 16.4. The author of the code that counted the words in my secret test file fastest on my 20 MHz cached 386 would be the winner and receive Numerous Valuable Prizes.
No listings were to be longer than 200 lines. No complete programs were to be accepted; submissions had to be plug-compatible with Listing 16.4. (This was to encourage people not to waste time optimizing outside the inner loop.) Finally, the code had to produce the same results as Listing 16.4; I didnt want to see functions that approximated the word count by dividing the number of characters by six instead of counting actual words!
So how did the entrants in this particular challenge stack up? More than one claimed a speed-up over my assembly word-counting code of more than three times. On top of the three-times speedup over the original C code that I had already realized, were almost up to an order of magnitude faster. You are, of course, entitled to your own opinion, but I consider an order of magnitude to be significant.
|Previous||Table of Contents||Next|