|Previous||Table of Contents||Next|
Truth to tell, I didnt expect a three-times speedup; around two times was what I had in mind. Which just goes to show that any code can be made faster than youd expect, if you think about it long enough and from many different perspectives. (The most potent word-counting technique seems to be a 64K lookup table that allows handling two bytes simultaneously. This is not the sort of technique one comes up with by brute-force optimization.) Thinking (or, worse yet, boasting) that your code is the fastest possible is rollescating on a tightrope in a hurricane; youre due for a fall, if you catch my drift. Case in point: Terje Mathisens word-counting program.
Not so long ago, Terje Mathisen, who I introduced earlier in this book, wrote a very fast word-counting program, and posted it on Bix. When I say it was fast, I mean fast; this code was optimized like nobodys business. Were talking top-quality code here.
When the topic of optimizing came up in one of the Bix conferences, Terjes program was mentioned, and he posted the following message: I challenge BIXens (and especially mabrash!) to speed it up significantly. I would consider 5 percent a good result. The clear implication was, That code is as fast as it can possibly be.
Naturally, it wasnt; there aint no such thing as the fastest code (TANSTATFC? I agree, it doesnt have the ring of TANSTAAFL). I pored over Terjes 386 native-mode code, and found the critical inner loop, which was indeed as tight as one could imagine, consisting of just a few 386 native-mode instructions. However, one of the instructions was this:
Harmless enough, save for two things. First, EBX happened to be zero at this point (a leftover from an earlier version of the code, as it turned out), so it was superfluous as a memory-addressing component; this made it possible to use base-only addressing ([EAX]) rather than base+index addressing ([EBX+EAX]), which saves a cycle on the 386. Second: Changing the instruction to CMP [EAX],DH saved 2 cyclesjust enough, by good fortune, to speed up the whole program by 5 percent.
|CMP reg,[mem] takes 6 cycles on the 386, but CMP [ mem ],reg takes only 5 cycles; you should always performCMP with the memory operand on the left on the 386.|
(Granted, CMP [mem],reg is 1 cycle slower than CMP reg,[mem] on the 286, and theyre both the same on the 8088; in this case, though, the code was specific to the 386. In case youre curious, both forms take 2 cycles on the 486; quite a lot faster, eh?)
The first lesson to be learned here is not to lug assumptions that may no longer be valid from the 8088/286 world into the wonderful new world of 386 native-mode programming. The second lesson is that after youve slaved over your code for a while, youre in no shape to see its flaws, or to be able to get the new perspectives needed to speed it up. Ill bet Terje looked at that [EBX+EAX] addressing a hundred times while trying to speed up his code, but he didnt really see what it did; instead, he saw what it was supposed to do. Mental shortcuts like this are what enable us to deal with the complexities of assembly language without overloading after about 20 instructions, but they can be a major problem when looking over familiar code.
The third, and most interesting, lesson is that a far more fruitful optimization came of all this, one that nicely illustrates that cycle counting is not the key to happiness, riches, and wondrous performance. After getting my 5 percent speedup, I mentioned to Terje the possibility of using a 64K lookup table. (This predated the arrival of entries for the optimization contest.) He said that he had considered it, but it didnt seem to him to be worthwhile. He couldnt shake the thought, though, and started to poke around, and one day, voila, he posted a new version of his word count program, WC50, that was much faster than the old version. I dont have exact numbers, but Terjes preliminary estimate was 80 percent faster, and word countingincluding disk cache access timeproceeds at more than 3 MB per second on a 33 MHz 486. Even allowing for the speed of the 486, those are very impressive numbers indeed.
The point I want to make, though, is that the biggest optimization barrier that Terje faced was that he thought he had the fastest code possible. Once he opened up the possibility that there were faster approaches, and looked beyond the specific approach that he had so carefully optimized, he was able to come up with code that was a lot faster. Consider the incongruity of Terjes willingness to consider a 5 percent speedup significant in light of his later near-doubling of performance.
|Dont get stuck in the rut of instruction-by-instruction optimization. Its useful in key loops, but very often, a change in approach will work far greater wonders than any amount of cycle counting can.|
By the way, Terjes WC50 program is a full-fledged counting program; it counts characters, words, and lines, can handle multiple files, and lets you specify the characters that separate words, should you so desire. Source code is provided as part of the archive WC50 comes in. All in all, its a nice piece of work, and you might want to take a look at it if youre interested in really fast assembly code. I wouldnt call it the fastest word-counting code, though, because I would of course never be so foolish as to call anything the fastest.
As it happened, the challenge I issued to my PC TECHNIQUES readers was a smashing success, with dozens of good entries. I certainly enjoyed it, even though I did have to look at a lot of tricky assembly code that I didnt writehard work under the best of circumstances. It was worth the trouble, though. The winning entry was an astonishing example of what assembly language can do in the right hands; on my 386, it was four times faster at word counting than the nice, tight assembly code I provided as a starting pointand about 13 times faster than the original C implementation. Attention, high-level language chauvinists: Is the speedup getting significant yet? Okay, maybe word counting isnt the most critical application, but how would you like to have that kind of improvement in your compression software, or in your real-time gamesor in Windows graphics?
The winner was David Stafford, who at the time was working for Borland International; his entry is shown in Listing 16.5. Dave Methvin, whom some of you may recall as a tech editor of the late, lamented PC Tech Journal, was a close second, and Mick Brown, about whom I know nothing more than that he is obviously an extremely good assembly language programmer, was a close third, as shown in Table 16.2, which precedes Listing 16.5. Those three were out ahead of the pack; the fourth-place entry, good as it was (twice as fast as my original code), was twice as slow as Davids winning entry, so you can see that David, Dave, and Mick attained a rarefied level of optimization indeed.
|Previous||Table of Contents||Next|