|Previous||Table of Contents||Next|
Your first question is likely to be the following: Just how fast is Listing 37.1? Is it optimized to the hilt or just pretty fast? The quick answer is: Its fast. Listing 37.1 draws lines at a rate of nearly 1 million pixels per second on my 486/33, and is capable of still faster drawing, as Ill discuss shortly. (The heavily optimized AutoCAD line-drawing code that I mentioned in the last chapter drew 150,000 pixels per second on an EGA in a 386/16, and I thought I had died and gone to Heaven. Such is progress.) The full answer is a more complicated one, and ties in to the principle that if it is broken, maybe thats okayand to the principle of looking before you leap, also known as profiling before you optimize.
When I went to speed up run-length slice lines, I initially manually converted the last chapters C code into assembly. Then I streamlined the register usage and used REP STOS wherever possible. Listing 37.1 is that code. At that point, line drawing was surely faster, although I didnt know exactly how much faster. Equally surely, there were significant optimizations yet to be made, and I was itching to get on to them, for they were bound to be a lot more interesting than a basic C-to-assembly port.
Ego intervened at this point, however. I wanted to know how much of a speed-up I had already gotten, so I timed the performance of the C code and compared it to the assembly code. To my horror, I found that I had not gotten even a two-times improvement! I couldnt understand how that could bethe C code was decidedly unoptimizeduntil I hit on the idea of measuring the maximum memory speed of the VGA to which I was drawing.
Bingo. The Paradise VGA in my 486/33 is fast for a single display-memory write, because it buffers the data, lets the CPU go on its merry way, and finishes the write when display memory is ready. However, the maximum rate at which data can be written to the adapter turns out to be no more than one byte every microsecond. Put another way, you can only write one byte to this adapter every 33 clock cycles on a 486/33. Therefore, no matter how fast I made the line-drawing code, it could never draw more than 1,000,000 pixels per second in 256-color mode in my system. The C code was already drawing at about half that rate, so the potential speed-up for the assembly code was limited to a maximum of two times, which is pretty close to what Listing 37.1 did, in fact, achieve. When I compared the C and assembly implementations drawing to normal system (nondisplay) memory, I found that the assembly code was actually four times as fast as the C code.
|In fact, Listing 37.1 draws VGA lines at about 92 percent of the maximum possible rate in my systemthat is, it draws very nearly as fast as the VGA hardware will allow. All the optimization in the world would get me less than 10 percent faster line drawingand only if I eliminated all overhead, an unlikely proposition at best. The code isnt fully optimized, but so what?|
Now its true that faster line-drawing code would likely be more beneficial on faster VGAs, especially local-bus VGAs, and in slower systems. For that reason, Ill list a variety of potential optimizations to Listing 37.1. On the other hand, its also true that Listing 37.1 is capable of drawing lines at a rate of 2.2 million pixels per second on a 486/ 33, given fast enough VGA memory, so it should be able to drive almost any non-local-bus VGA at nearly full speed. In short, Listing 37.1 is very fast, and, in many systems, further optimization is basically a waste of time.
Profile before you optimize.
Following is a quick tour of some of the many possible further optimizations to Listing 37.1.
The run-handling loops could be unrolled more than the current two times. However, bear in mind that a two-times unrolling gets more than half the maximum unrolling benefit with less overhead than a more heavily unrolled loop.
BX could be freed up in the Y-major code by breaking out separate loops for X advances of 1 and 1. DX could be freed up by using AH as the counter for the run loops, although this would limit the maximum line length that could be handled. The freed registers could be used to keep more of the whole-step and error variables in registers. Alternatively, the freed registers could be used to implement more esoteric approaches like unrolling the Y-major inner loop; such unrolling could take advantage of the knowledge that only two run lengths are possible for any given line. Strangely enough, on the 486 it might also be worth unrolling the X-major inner loop, which consists of REP STOSB, because of the slow start-up time of REP relative to the speed of branching on that processor.
Special code could be implemented for lines with integral slopes, because all runs are exactly the same length in such lines. Also, the X-major code could try to write an aligned word at a time to display memory whenever possible; this would improve the maximum possible performance on some 16-bit VGAs.
One weakness of Listing 37.1 is that for lines with slopes between 0.5 and 2, the average run length is less than two, rendering run-length slicing ineffective. This can be remedied by viewing lines in that range as being composed of diagonal, rather than horizontal or vertical runs. I havent space to take this idea any further in this book, but its not very complicated, and it guarantees a minimum run length of 2, which renders run drawing considerably more efficient, and makes techniques such as unrolling the inner run-drawing loops more attractive.
Finally, be aware that run-length slice drawing is best for long lines, because it has more and slower setup than a standard Bresenhams line draw, including a divide. Run-length slice is great for 100-pixel lines, but not necessarily for 20-pixel lines, and its a sure thing that its not terrific for 3-pixel lines. Both approaches will work, but if line-drawing performance is critical, whether youll want to use run-length slice or standard Bresenhams depends on the typical lengths of the lines youll be drawing. For lines of widely varying lengths, you might want to implement both approaches, and choose the best one for each line, depending on the line lengthassuming, of course, that your display memory is fast enough and your application demanding enough to make that level of optimization worthwhile.
If your code looks broken from a performance perspective, think before you fix it; that particular cat may be dead for a perfectly good reason. Ill say it again: Profile before you optimize.
|Previous||Table of Contents||Next|