|Previous||Table of Contents||Next|
As I showed in the previous chapter, optimization is by no means always a matter of dropping into assembly. In fact, in performance tuning high-level language code, assembly should be used rarely, and then only after youve made sure a badly chosen or clumsily implemented algorithm isnt eating you alive. Certainly if you use assembly at all, make absolutely sure you use it right. The potential of assembly code to run slowly is poorly understood by a lot of people, but that potential is great, especially in the hands of the ignorant.
Truly great optimization, however, happens only at the assembly level, and it happens in response to a set of dynamics that is totally different from that governing C/C++ or Pascal optimization. Ill be speaking of assembly-level optimization time and again in this book, but when I do, I think it will be helpful if you have a grasp of those assembly specific dynamics.
As usual, the best way to wade in is to present a real-world example.
Some time ago, I was asked to work over a critical assembly subroutine in order to make it run as fast as possible. The task of the subroutine was to construct a nibble out of four bits read from different bytes, rotating and combining the bits so that they ultimately ended up neatly aligned in bits 3-0 of a single byte. (In case youre curious, the object was to construct a 16-color pixel from bits scattered over 4 bytes.) I examined the subroutine line by line, saving a cycle here and a cycle there, until the code truly seemed to be optimized. When I was done, the key part of the code looked something like this:
LoopTop: lodsb ;get the next byte to extract a bit from and al,ah ;isolate the bit we want rol al,cl ;rotate the bit into the desired position or bl,al ;insert the bit into the final nibble dec cx ;the next bit goes 1 place to the right dec dx ;count down the number of bits jnz LoopTop ;process the next bit, if any
Now, its hard to write code thats much faster than seven instructions, only one of which accesses memory, and most programmers would have called it a day at this point. Still, something bothered me, so I spent a bit of time going over the code again. Suddenly, the answer struck methe code was rotating each bit into place separately, so that a multibit rotation was being performed every time through the loop, for a total of four separate time-consuming multibit rotations!
|While the instructions themselves were individually optimized, the overall approach did not make the best possible use of the instructions.|
I changed the code to the following:
LoopTop: lodsb ;get the next byte to extract a bit from and al,ah ;isolate the bit we want or bl,al ;insert the bit into the final nibble rol bl,1 ;make room for the next bit dec dx ;count down the number of bits jnz LoopTop ;process the next bit, if any rol bl,cl ;rotate all four bits into their final ; positions at the same time
This moved the costly multibit rotation out of the loop so that it was performed just once, rather than four times. While the code may not look much different from the original, and in fact still contains exactly the same number of instructions, the performance of the entire subroutine improved by about 10 percent from just this one change. (Incidentally, that wasnt the end of the optimization; I eliminated the DEC and JNJ instructions by expanding the four iterations of the loopbut thats a tale for another chapter.)
The point is this: To write truly superior assembly programs, you need to know what the various instructions do and which instructions execute fastest...and more. You must also learn to look at your programming problems from a variety of perspectives so that you can put those fast instructions to work in the most effective ways.
Is it really so hard as all that to write good assembly code for the PC? Yes! Thanks to the decidedly quirky nature of the x86 family CPUs, assembly language differs fundamentally from other languages, and is undeniably harder to work with. On the other hand, the potential of assembly code is much greater than that of other languages, as well.
To understand why this is so, consider how a program gets written. A programmer examines the requirements of an application, designs a solution at some level of abstraction, and then makes that design come alive in a code implementation. If not handled properly, the transformation that takes place between conception and implementation can reduce performance tremendously; for example, a programmer who implements a routine to search a list of 100,000 sorted items with a linear rather than binary search will end up with a disappointingly slow program.
|Previous||Table of Contents||Next|