Michael Abrash's Graphics Programming Black Book Special Edition: There Ain't No Such Thing as the Fastest Code

Levels of Optimization

Three levels of optimization were evident in the word-counting entries I received in response to my challenge. I’d briefly describe them as “fine-tuning,” “new perspective,” and “table-driven state machine.” The latter categories produce faster code, but, by the same token, they are harder to design, harder to implement, and more difficult to understand, so they’re suitable for only the most demanding applications. (Heck, I don’t even guarantee that David Stafford’s entry works perfectly, although, knowing him, it probably does; the more complex and cryptic the code, the greater the chance for obscure bugs.)

Remember, optimize only when needed, and stop when further optimization will not be noticed. Optimization that’s not perceptible to the user is like buying Telly Savalas a comb; it’s not going to do any harm, but it’s nonetheless a waste of time.

Optimization Level 1: Good Code

The first level of optimization involves fine-tuning and clever use of the instruction set. The basic framework is still the same as my code (which in turn is basically the same as that of the original C code), but that framework is implemented more efficiently.

One obvious level 1 optimization is using a word rather than dword counter. ScanBuffer can never be called upon to handle more than 64K bytes at a time, so no more than 32K words can ever be found. Given that, it’s a logical step to use INC rather than ADD/ADC to keep count, adding the tally into the full 32-bit count only upon exiting the function. Another useful optimization is aligning loop tops and other branch destinations to word , or better yet dword , boundaries.

Eliminating branches was very popular, as it should be on x86 processors. Branches were eliminated in a remarkable variety of ways. Many of you unrolled the loop, a technique that does pay off nicely. A word of caution: Some of you unrolled the loop by simply stacking repetitions of the inner loop one after the other, with DEC CX/JZ appearing after each repetition to detect the end of the buffer. Part of the point of unrolling a loop is to reduce the number of times you have to check for the end of the buffer! The trick to this is to set CX to the number of repetitions of the unrolled loop and count down only once each time through the unrolled loop. In order to handle repetition counts that aren’t exact multiples of the unrolling factor, you must enter the loop by branching into the middle of it to perform whatever fraction of the number of unrolled repetitions is required to make the whole thing come out right. Listing 16.5 (QSCAN3.ASM) illustrates this technique.

Another effective optimization is the use of LODSW rather than LODSB , thereby processing two bytes per memory access. This has the effect of unrolling the loop one time, since with LODSW , looping is performed at most only once every two bytes.

Cutting down the branches used to loop is only part of the branching story. More often than not, my original code also branched in the process of checking whether it was time to count a word. There are many ways to reduce this sort of branching; in fact, it is quite possible to eliminate it entirely. The most straightforward way to reduce such branching is to employ two loops. One loop is used to look for the end of a word when the last byte was a non-separator, and one loop is used to look for the start of a word when the last byte was a separator. This way, it’s no longer necessary to maintain a flag to indicate the state of the last byte; that state is implied by whichever loop is currently executing. This considerably simplifies and streamlines the inner loop code.

Listing 16.6, contributed by Willem Clements, of Granada, Spain, illustrates a variety of level 1 optimizations: the two-loop approach, the use of a 16- rather than 32-bit counter, and the use of LODSW . Together, these optimizations made Willem’s code nearly twice as fast as mine in Listing 16.4. A few details could stand improvement; for example, AND AX,AX is a shorter way to test for zero than CMP AX,0 , and ALIGN 2 could be used. Nonetheless, this is good code, and it’s also fairly compact and reasonably easy to understand. In short, this is an excellent example of how an hour or so of hand-optimization might accomplish significantly improved performance at a reasonable cost in complexity and time. This level of optimization is adequate for most purposes (and, in truth, is beyond the abilities of most programmers).

Table of Contents