By the way, don't fall victim to the lures of JCXZ and do something like this:
        and     cx,0fh          ;Isolate the desired field
        jcxz    SkipLoop        ;If field is 0, don't bother
The AND instruction has already set the Zero flag, so this
        and     cx,0fh          ;Isolate the desired field
        jz      SkipLoop        ;If field is 0, don't bother
will do just fine and is faster on all processors. Use JCXZ only when the Zero flag isn't already set to reflect the status of CX.
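To make the distinction concrete, here's a minimal sketch of a case where JCXZ is the right tool. The names Count, Buffer, ClearLoop, and SkipLoop are hypothetical, chosen only for illustration, and this is a fragment rather than a complete program. The point is simply that MOV doesn't touch the flags, so there's no Zero flag result to piggyback on.

```asm
; Hypothetical fragment: Count, Buffer, ClearLoop, and SkipLoop are
; illustrative names, not taken from any listing.
        mov     cx,[Count]      ;Load the count; MOV leaves the flags alone,
                                ; so the Zero flag says nothing about CX
        jcxz    SkipLoop        ;If count is 0, don't bother
        mov     bx,offset Buffer ;Point to the bytes to clear
ClearLoop:
        mov     byte ptr [bx],0 ;Clear one byte
        inc     bx              ;Point to the next byte
        dec     cx              ;Count down
        jnz     ClearLoop       ;Repeat until the count reaches 0
SkipLoop:
```

Without the JCXZ, a count of 0 would wrap to 0FFFFh on the first DEC, and the loop would run 65,536 times.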
What can we learn from LOOP and JCXZ? First, that a single instruction that is intended to do a complex task is not necessarily faster than several instructions that together do the same thing. Second, that the relative merits of instructions and optimization rules vary to a surprisingly large degree across the x86 family.
In particular, if you're going to write 386 protected mode code, which will run only on the 386, 486, and Pentium, you'd be well advised to rethink your use of the more esoteric members of the x86 instruction set. LOOP, JCXZ, the various accumulator-specific instructions, and even the string instructions in many circumstances no longer offer the advantages they did on the 8088. Sometimes they're just not any faster than more general instructions, so they're not worth going out of your way to use; sometimes, as with LOOP, they're actually slower, and you'd do well to avoid them altogether in the 386/486 world. Reviewing the instruction cycle times in the MASM or TASM manuals, or looking over the cycle times in Intel's literature, is a good place to start; published cycle times are closer to actual execution times on the 386 and 486 than on the 8088, and are reasonably reliable indicators of the relative performance levels of x86 instructions.
Cycle counting and directly substituting instructions (DEC CX/JNZ for LOOP, for example) are techniques that belong at the lowest level of optimization. It's an important level, but it's fairly mechanical; once you've learned the capabilities and relative performance levels of the various instructions, you should be able to select the best instructions fairly easily. What's more, this is a task at which compilers excel. What I'm saying is that you shouldn't get too caught up in counting cycles because that's a small (albeit important) part of the optimization picture, and not the area in which your greatest advantage lies.
One level at which assembly language programming pays off handsomely is that of local optimization; that is, selecting the best sequence of instructions for a task. The key to local optimization is viewing the 80x86 instruction set as a set of building blocks, each with unique characteristics. Your job is to sequence those blocks so that they perform well. It doesn't matter what the instructions are intended to do or what their names are; all that matters is what they do.
Our discussion of LOOP versus DEC/JNZ is an excellent example of optimization by cycle counting. It's worth knowing, but once you've learned it, you just routinely use DEC/JNZ at the bottom of loops in 386/486-specific code, and that's that. Besides, you'll save at most a few cycles each time, and while that helps a little, it's not going to make all that much difference.
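For reference, the substitution looks something like the following. This is a sketch with a made-up loop body, not code from any listing: SumLoop is a hypothetical label, and SI is assumed to point to CX bytes, with CX nonzero.

```asm
; Hypothetical fragment: sums CX bytes starting at DS:SI into AL.
; Assumes CX is nonzero on entry.
        sub     al,al           ;Running 8-bit sum starts at 0
SumLoop:
        add     al,[si]         ;Accumulate the next byte
        inc     si              ;Point to the next byte
        dec     cx              ;Count down...
        jnz     SumLoop         ;...and repeat until the count is 0
                                ; (these two replace: loop SumLoop)
```

One behavioral difference is worth keeping in mind: LOOP leaves the flags unchanged, while DEC modifies all the arithmetic flags except Carry. That matters only if code after the loop depends on flags set by the loop body.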
Now let's step back for a moment, and with no preconceptions consider what the x86 instruction set can do for us. The bulk of the time with both LOOP and DEC/JNZ is taken up by branching, which just happens to be one of the slowest aspects of every processor in the x86 family, and the rest is taken up by decrementing the count register and checking whether it's zero. There may be ways to perform those tasks a little faster by selecting different instructions, but they can get only so fast, and branching can't even get all that fast.
The trick, then, is not to find the fastest way to decrement a count and branch conditionally, but rather to figure out how to accomplish the same result without decrementing or branching as often. Remember the Kobayashi Maru problem in Star Trek? The same principle applies here: Redefine the problem to one that offers better solutions.
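One generic way to decrement and branch less often is to do more work per pass through the loop. The sketch below handles two bytes per iteration, so the DEC/JNZ pair executes half as many times. It's only an illustration of the principle under stated assumptions, not necessarily the approach the listings in this chapter take: SumLoop is a hypothetical label, SI is assumed to point to the data, and CX is assumed to hold an even, nonzero byte count.

```asm
; Hypothetical fragment: sums CX bytes starting at DS:SI into AL,
; two bytes per pass. Assumes CX is even and nonzero on entry.
        sub     al,al           ;Running 8-bit sum starts at 0
        shr     cx,1            ;Two bytes per pass, so halve the count
SumLoop:
        add     al,[si]         ;Accumulate the first byte of the pair
        add     al,[si+1]       ;Accumulate the second byte of the pair
        add     si,2            ;Point to the next pair
        dec     cx              ;Count down pairs...
        jnz     SumLoop         ;...and branch half as often as before
```

An odd count would need one extra byte handled outside the loop; that case is left out here to keep the sketch short.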
Consider Listing 7.1, which searches a buffer until either the specified byte is found, a zero byte is found, or the specified number of characters has been checked. Such a function would be useful for scanning up to a maximum number of characters in a zero-terminated buffer. Listing 7.1, which uses LOOP in the main loop, performs a search of the sample string for a period (.) in 170 µs on a 20 MHz cached 386.
When the LOOP in Listing 7.1 is replaced with DEC CX/JNZ, performance improves to 168 µs, less than 2 percent faster than Listing 7.1. Actually, instruction fetching, instruction alignment, cache characteristics, or something similar is affecting these results; I'd expect a slightly larger improvement (around 7 percent), but that's the most that counting cycles could buy us in this case. (All right, already; LOOPNZ could be used at the bottom of the loop, and other optimizations are surely possible, but all that won't add up to anywhere near the benefits we're about to see from local optimization, and that's the whole point.)