Michael Abrash's Graphics Programming Black Book Special Edition: Pushing the 486

It’s Not Just a Bigger 386

So this traveling salesman is walking down a road, and he sees a group of men digging a ditch with their bare hands. “Whoa, there!” he says. “What you guys need is a Model 8088 ditch digger!” And he whips out a trowel and sells it to them.

A few days later, he stops back around. They’re happy with the trowel, but he sells them the latest ditch-digging technology, the Model 80286 spade. That keeps them content until he stops by again with a Model 80386 shovel (a full 32 inches wide, with a narrow point to emulate the trowel), and that holds them until he comes back around with what they really need: a Model 80486 bulldozer.

Having reached the top of the line, the salesman doesn’t pay them a call for a while. When he does, not only are they none too friendly, but they’re digging with the 80386 shovel; the bulldozer is sitting off to one side. “Why on earth are you using that shovel?” the salesman asks. “Why aren’t you digging with the bulldozer?”

“Well, Lord knows we tried,” says the foreman, “but it was all we could do just to lift the damn thing!”

Substitute “processor” for the various digging implements, and you get an idea of just how different the optimization rules for the 486 are from what you’re used to. Okay, it’s not quite that bad—but upon encountering a processor where string instructions are often to be avoided and memory-to-register MOVs are frequently as fast as register-to-register MOVs, Dorothy was heard to exclaim (before she sank out of sight in a swirl of hopelessly mixed metaphors), “I don’t think we’re in Kansas anymore, Toto.”

Enter the 486

No chip that is a direct, fully compatible descendant of the 8088, 286, and 386 could ever be called a RISC chip, but the 486 certainly contains RISC elements, and it’s those elements that are most responsible for making 486 optimization unique. Simple, common instructions are executed in a single cycle by a RISC-like core processor, but other instructions are executed pretty much as they were on the 386, where every instruction takes at least 2 cycles. For example, MOV AL, [TestChar] takes only 1 cycle on the 486, assuming both instruction and data are in the cache—3 cycles faster than the 386—but STOSB takes 5 cycles, 1 cycle slower than on the 386. The floating-point execution unit inside the 486 is also much faster than the 387 math coprocessor, largely because, being in the same silicon as the CPU (the 486 has a math coprocessor built in), it is more tightly coupled. The results are sometimes startling: FMUL (floating point multiply) is usually faster on the 486 than IMUL (integer multiply)!

An encyclopedic approach to 486 optimization would take a book all by itself, so in this chapter I’m only going to hit the highlights of 486 optimization, touching on several optimization rules, some documented, some not. You might also want to check out the following sources of 486 information: i486 Microprocessor Programmer’s Reference Manual, from Intel; “8086 Optimization: Aim Down the Middle and Pray,” in the March, 1991 Dr. Dobb’s Journal; and “Peak Performance: On to the 486,” in the November, 1990 Programmer’s Journal.

Rules to Optimize By

In Appendix G of the i486 Microprocessor Programmer’s Reference Manual, Intel lists a number of optimization techniques for the 486. While neither exhaustive (we’ll look at two undocumented optimizations shortly) nor entirely accurate (we’ll correct two of the rules here), Intel’s list is certainly a good starting point. In particular, the list conveys the extent to which 486 optimization differs from optimization for earlier x86 processors. Generally, I’ll be discussing optimization for real mode (it being the most widely used mode at the moment), although many of the rules should apply to protected mode as well.

486 optimization is generally more precise and less frustrating than optimization for other x86 processors because every 486 has an identical internal cache. Whenever both the instructions being executed and the data the instructions access are in the cache, those instructions will run in a consistent and calculatable number of cycles on all 486s, with little chance of interference from the prefetch queue and without regard to the speed of external memory.

In other words, for cached code (which time-critical code almost always is), performance is predictable and can be calculated with good precision, and those calculations will apply on any 486. However, “predictable” doesn’t mean “trivial”; the cycle times printed for the various instructions are not the whole story. You must be aware of all the rules, documented and undocumented, that go into calculating actual execution times—and uncovering some of those rules is exactly what this chapter is about.

The Hazards of Indexed Addressing

Rule #1: Avoid indexed addressing (that is, try not to use either two registers or scaled addressing to point to memory).

Intel cautions against using indexing to address memory because there’s a one-cycle penalty for indexed addressing. True enough—but “indexed addressing” might not mean what you expect.

Traditionally, SI and DI are considered the index registers of the x86 CPUs. That is not the sense in which “indexed addressing” is meant here, however. In real mode, indexed addressing means that two registers, rather than one or none, are used to point to memory. (In this context, the use of one register to address memory is “base addressing,” no matter what register is used.) MOV AX, [BX+DI] and MOV CL, [BP+SI+10] perform indexed addressing; MOV AX,[BX] and MOV DL, [SI+1] do not.

Therefore, in real mode, the rule is to avoid using two registers to point to memory whenever possible. Often, this simply means adding the two registers together outside a loop before memory is actually addressed.

As an example, you might adhere to this rule by replacing the code

LoopTop:
    add  ax,[bx+si]
    add  si,2
    dec  cx
    jnz  LoopTop

with this

    add  si,bx
LoopTop:
    add  ax,[si]
    add  si,2
    dec  cx
    jnz  LoopTop
    sub  si,bx

Table of Contents

Chapter 12Pushing the 486

It’s Not Just a Bigger 386

Enter the 486

Rules to Optimize By

The Hazards of Indexed Addressing

Chapter 12
Pushing the 486