When you're pushing the envelope in writing optimized PC code, you're likely to become more than a little compulsive about finding approaches that let you wring more speed from your computer. In the process, you're bound to make mistakes, which is fine, as long as you watch for those mistakes and learn from them.
A case in point: A few years back, I came across an article about 8088 assembly language called "Optimizing for Speed." Now, "optimize" is not a word to be used lightly; Webster's Ninth New Collegiate Dictionary defines optimize as "to make as perfect, effective, or functional as possible," which certainly leaves little room for error. The author had, however, chosen a small, well-defined 8088 assembly language routine to refine, consisting of about 30 instructions that did nothing more than expand 8 bits to 16 bits by duplicating each bit.
The author of "Optimizing" had clearly fine-tuned the code with care, examining alternative instruction sequences and adding up cycles until he arrived at an implementation he calculated to be nearly 50 percent faster than the original routine. In short, he had used all the information at his disposal to improve his code, and had, as a result, saved cycles by the bushel. There was, in fact, only one slight problem with the "optimized" version of the routine....
It ran slower than the original version!
As diligent as the author had been, he had nonetheless committed a cardinal sin of x86 assembly language programming: He had assumed that the information available to him was both correct and complete. While the execution times provided by Intel for its processors are indeed correct, they are incomplete; the other, and often more important, part of code performance is instruction fetch time, a topic to which I will return in later chapters.
Had the author taken the time to measure the true performance of his code, he wouldn't have put his reputation on the line with relatively low-performance code. What's more, had he actually measured the performance of his code and found it to be unexpectedly slow, curiosity might well have led him to experiment further and thereby add to his store of reliable information about the CPU.
There you have an important tenet of assembly language optimization: After crafting the best code possible, check it in action to see if it's really doing what you think it is. If it's not behaving as expected, that's all to the good, since solving mysteries is the path to knowledge. You'll learn more in this way, I assure you, than from any manual or book on assembly language.
Assume nothing. I cannot emphasize this strongly enough: when you care about performance, do your best to improve the code and then measure the improvement. If you don't measure performance, you're just guessing, and if you're guessing, you're not very likely to write top-notch code.
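As a rough illustration of "improve, then measure," here is a minimal sketch in portable C rather than 8088 assembly. It is not the Zen timer; the helper `time_routine`, the two candidate routines, and the `sink` variable are all hypothetical names invented for this example, and the standard `clock()` function stands in for a real hardware timer.

```c
#include <time.h>

/* Hypothetical shared state; volatile so the compiler cannot
   optimize the work in the candidate routines away. */
volatile unsigned sink;

/* Candidate A (assumed for illustration): multiply by 10 with shifts and adds. */
void times_ten_shift_add(void)
{
    unsigned x = sink;
    sink = (x << 3) + (x << 1);   /* 8x + 2x = 10x */
}

/* Candidate B (assumed for illustration): multiply by 10 directly. */
void times_ten_multiply(void)
{
    sink = sink * 10u;
}

/* Run fn() reps times and return the elapsed CPU time in seconds,
   as reported by the standard C clock() function. */
double time_routine(void (*fn)(void), long reps)
{
    clock_t start = clock();
    for (long i = 0; i < reps; i++)
        fn();
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_routine(times_ten_shift_add, n)` and `time_routine(times_ten_multiply, n)` with the same large `n` gives two numbers to compare; whichever routine you expected to win, the measurement is what counts. Because `clock()` is coarse, `n` usually has to be large for the difference to show.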
Ignorance about true performance can be costly. When I wrote video games for a living, I spent days at a time trying to wring more performance from my graphics drivers. I rewrote whole sections of code just to save a few cycles, juggled registers, and relied heavily on blurry-fast register-to-register shifts and adds. As I was writing my last game, I discovered that the program ran perceptibly faster if I used look-up tables instead of shifts and adds for my calculations. It shouldn't have run faster, according to my cycle counting, but it did. In truth, instruction fetching was rearing its head again, as it often does, and the fetching of the shifts and adds was taking as much as four times the nominal execution time of those instructions.
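To make the shifts-versus-tables choice concrete, here is a sketch in C that borrows the bit-doubling task from the article described earlier (expand 8 bits to 16 by duplicating each bit). It is not the code from that article or from my games; the function and table names are hypothetical. One version computes the result with shifts and ORs, and the other simply looks it up.

```c
#include <stdint.h>

/* Expand 8 bits to 16 by duplicating each bit, using shifts and ORs.
   Bit i of the input lands in bits 2i and 2i+1 of the result. */
uint16_t double_bits_shift(uint8_t b)
{
    uint16_t out = 0;
    for (int i = 0; i < 8; i++) {
        uint16_t bit = (uint16_t)((b >> i) & 1u);
        out |= (uint16_t)(bit << (2 * i));
        out |= (uint16_t)(bit << (2 * i + 1));
    }
    return out;
}

/* Hypothetical 256-entry table holding the answer for every input byte. */
static uint16_t double_bits_table[256];

/* Fill the table once, using the shift-based version as the reference. */
void init_double_bits_table(void)
{
    for (int v = 0; v < 256; v++)
        double_bits_table[v] = double_bits_shift((uint8_t)v);
}

/* Table-driven version: a single indexed memory read per call. */
uint16_t double_bits_lookup(uint8_t b)
{
    return double_bits_table[b];
}
```

For example, an input of 0xA5 (binary 10100101) yields 0xCC33 from either version. Which one is faster on a given machine is exactly the kind of question that cycle counting alone cannot settle and that timing the code can.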
Ignorance can also be responsible for considerable wasted effort. I recall a debate in the letters column of one computer magazine about exactly how quickly text can be drawn on a Color/Graphics Adapter (CGA) screen without causing snow. The letter-writers counted every cycle in their timing loops, just as the author in the story that started this chapter had. Like that author, the letter-writers had failed to take the prefetch queue into account. In fact, they had neglected the effects of video wait states as well, so the code they discussed was actually much slower than their estimates. The proper test would, of course, have been to run the code to see if snow resulted, since the only true measure of code performance is observing it in action.
Clearly, one key to mastering Zen-class optimization is a tool with which to measure code performance. The most accurate way to measure performance is with expensive hardware, but reasonable measurements at no cost can be made with the PC's 8253 timer chip, which counts at a rate of slightly over 1,000,000 times per second. The 8253 can be started at the beginning of a block of code of interest and stopped at the end of that code, with the resulting count indicating how long the code took to execute with an accuracy of about 1 microsecond. (A microsecond is one millionth of a second, and is abbreviated µs.) To be precise, the 8253 counts once every 838.1 nanoseconds. (A nanosecond is one billionth of a second, and is abbreviated ns.)
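The arithmetic for turning an 8253 count into elapsed time can be sketched in C as follows. This is only an illustration of the conversion, using the 838.1 ns figure given above; the function name `counts_to_us` is hypothetical, and the count is taken as a 16-bit value on the assumption that a single 16-bit 8253 counter is being read.

```c
#include <stdint.h>

/* Convert elapsed 8253 counts to microseconds, rounded to the nearest
   whole microsecond. One count is 838.1 ns, so the time in microseconds
   is counts * 838.1 / 1000, computed here in integers as
   counts * 8381 / 10000. The largest intermediate value,
   65535 * 8381 + 5000, fits comfortably in 32 bits. */
uint32_t counts_to_us(uint16_t counts)
{
    return ((uint32_t)counts * 8381u + 5000u) / 10000u;
}
```

With this conversion, 1,000 counts come to 838 µs and 10,000 counts to 8,381 µs. A full 16-bit count of 65,535 works out to 54,925 µs, or roughly 55 milliseconds.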
Listing 3.1 shows 8253-based timer software, consisting of three subroutines: ZTimerOn, ZTimerOff, and ZTimerReport. For the remainder of this book, I'll refer to these routines collectively as the Zen timer. C-callable versions of the two precision Zen timers are presented in Chapter K on the companion CD-ROM.