## Chapter 7Local Optimization

### Optimizing Halfway between Algorithms and Cycle Counting

You might not think it, but there’s much to learn about performance programming from the Great Buffalo Sauna Fiasco. To wit:

The scene is Buffalo, New York, in the dead of winter, with the snow piled several feet deep. Four college students, living in typical student housing, are frozen to the bone. The third floor of their house, uninsulated and so cold that it’s uninhabitable, has an ancient bathroom. One fabulously cold day, inspiration strikes:

“Hey—we could make that bathroom into a sauna!

Pandemonium ensues. Someone rushes out and buys a gas heater, and at considerable risk to life and limb hooks it up to an abandoned but still live gas pipe that once fed a stove on the third floor. Someone else gets sheets of plastic and lines the walls of the bathroom to keep the moisture in, and yet another student gets a bucket full of rocks. The remaining chap brings up some old wooden chairs and sets them up to make benches along the sides of the bathroom. Voila—instant sauna!

They crank up the gas heater, put the bucket of rocks in front of it, close the door, take off their clothes, and sit down to steam themselves. Mind you, it’s not yet 50 degrees Fahrenheit in this room, but the gas heater is roaring. Surely warmer times await.

Indeed they do. The temperature climbs to 55 degrees, then 60, then 63, then 65, and finally creeps up to 68 degrees.

And there it stops.

68 degrees is warm for an uninsulated third floor in Buffalo in the dead of winter. Damn warm. It is not, however, particularly warm for a sauna. Eventually someone acknowledges the obvious and allows that it might have been a stupid idea after all, and everyone agrees, and they shut off the heater and leave, each no doubt offering silent thanks that they had gotten out of this without any incidents requiring major surgery.

And so we see that the best idea in the world can fail for lack of either proper design or adequate horsepower. The primary cause of the Great Buffalo Sauna Fiasco was a lack of horsepower; the gas heater was flat-out undersized. This is analogous to trying to write programs that incorporate features like bitmapped text and searching of multisegment buffers without using high-performance assembly language. Any PC language can perform just about any function you can think of—eventually. That heater would eventually have heated the room to 110 degrees, too—along about the first of June or so.

The Great Buffalo Sauna Fiasco also suffered from fundamental design flaws. A more powerful heater would indeed have made the room hotter—and might well have burned the house down in the process. Likewise, proper algorithm selection and good design are fundamental to performance. The extra horsepower a superb assembly language implementation gives a program is worth bothering with only in the context of a good design.

 Assembly language optimization is a small but crucial corner of the PC programming world. Use it sparingly and only within the framework of a good design—but ignore it and you may find various portions of your anatomy out in the cold.

So, drawing fortitude from the knowledge that our quest is a pure and worthy one, let’s resume our exploration of assembly language instructions with hidden talents and instructions with well-known talents that are less than they appear to be. In the process, we’ll come to see that there is another, very important optimization level between the algorithm/design level and the cycle-counting/individual instruction level. I’ll call this middle level local optimization; it involves focusing on optimizing sequences of instructions rather than individual instructions, all with an eye to implementing designs as efficiently as possible given the capabilities of the x86 family instruction set.

And yes, in case you’re wondering, the above story is indeed true. Was I there? Let me put it this way: If I were, I’d never admit it!

#### When LOOP Is a Bad Idea

Let’s examine first an instruction that is less than it appears to be: LOOP. There’s no mystery about what LOOP does; it decrements CX and branches if CX doesn’t decrement to zero. It’s so beautifully suited to the task of counting down loops that any experienced x86 programmer instinctively stuffs the loop count in CX and reaches for LOOP when setting up a loop. That’s fine—LOOP does, of course, work as advertised—but there is one problem:

 On half of the processors in the x86 family, LOOP is slower than DEC CX followed by JNZ. (Granted, DEC CX/JNZ isn’t precisely equivalent to LOOP, because DEC alters the flags and LOOP doesn’t, but in most situations they’re comparable.)

How can this be? Don’t ask me, ask Intel. On the 8088 and 80286, LOOP is indeed faster than DEC CX/JNZ by a cycle, and LOOP is generally a little faster still because it’s a byte shorter and so can be fetched faster. On the 386, however, things change; LOOP is two cycles slower than DEC/JNZ, and the fetch time for one extra byte on even an uncached 386 generally isn’t significant. (Remember that the 386 fetches four instruction bytes at a pop.) LOOP is three cycles slower than DEC/JNZ on the 486, and the 486 executes instructions in so few cycles that those three cycles mean that DEC/JNZ is nearly twice as fast as LOOP. Then, too, unlike LOOP, DEC doesn’t require that CX be used, so the DEC/JNZ solution is both faster and more flexible on the 386 and 486, and on the Pentium as well. (By the way, all this is not just theory; I’ve timed the relative performances of LOOP and DEC CX/JNZ on a cached 386, and LOOP really is slower.)

 Things are stranger still for LOOP’s relative JCXZ, which branches if and only if CX is zero. JCXZ is faster than AND CX,CX/JZ on the 8088 and 80286, and equivalent on the 80386—but is about twice as slow on the 486!

Graphics Programming Black Book © 2001 Michael Abrash