|Previous||Table of Contents||Next|
So this traveling salesman is walking down a road, and he sees a group of men digging a ditch with their bare hands. Whoa, there! he says. What you guys need is a Model 8088 ditch digger! And he whips out a trowel and sells it to them.
A few days later, he stops back around. Theyre happy with the trowel, but he sells them the latest ditch-digging technology, the Model 80286 spade. That keeps them content until he stops by again with a Model 80386 shovel (a full 32 inches wide, with a narrow point to emulate the trowel), and that holds them until he comes back around with what they really need: a Model 80486 bulldozer.
Having reached the top of the line, the salesman doesnt pay them a call for a while. When he does, not only are they none too friendly, but theyre digging with the 80386 shovel; the bulldozer is sitting off to one side. Why on earth are you using that shovel? the salesman asks. Why arent you digging with the bulldozer?
Well, Lord knows we tried, says the foreman, but it was all we could do just to lift the damn thing!
Substitute processor for the various digging implements, and you get an idea of just how different the optimization rules for the 486 are from what youre used to. Okay, its not quite that badbut upon encountering a processor where string instructions are often to be avoided and memory-to-register MOVs are frequently as fast as register-to-register MOVs, Dorothy was heard to exclaim (before she sank out of sight in a swirl of hopelessly mixed metaphors), I dont think were in Kansas anymore, Toto.
No chip that is a direct, fully compatible descendant of the 8088, 286, and 386 could ever be called a RISC chip, but the 486 certainly contains RISC elements, and its those elements that are most responsible for making 486 optimization unique. Simple, common instructions are executed in a single cycle by a RISC-like core processor, but other instructions are executed pretty much as they were on the 386, where every instruction takes at least 2 cycles. For example, MOV AL, [TestChar] takes only 1 cycle on the 486, assuming both instruction and data are in the cache3 cycles faster than the 386but STOSB takes 5 cycles, 1 cycle slower than on the 386. The floating-point execution unit inside the 486 is also much faster than the 387 math coprocessor, largely because, being in the same silicon as the CPU (the 486 has a math coprocessor built in), it is more tightly coupled. The results are sometimes startling: FMUL (floating point multiply) is usually faster on the 486 than IMUL (integer multiply)!
An encyclopedic approach to 486 optimization would take a book all by itself, so in this chapter Im only going to hit the highlights of 486 optimization, touching on several optimization rules, some documented, some not. You might also want to check out the following sources of 486 information: i486 Microprocessor Programmers Reference Manual, from Intel; 8086 Optimization: Aim Down the Middle and Pray, in the March, 1991 Dr. Dobbs Journal; and Peak Performance: On to the 486, in the November, 1990 Programmers Journal.
In Appendix G of the i486 Microprocessor Programmers Reference Manual, Intel lists a number of optimization techniques for the 486. While neither exhaustive (well look at two undocumented optimizations shortly) nor entirely accurate (well correct two of the rules here), Intels list is certainly a good starting point. In particular, the list conveys the extent to which 486 optimization differs from optimization for earlier x86 processors. Generally, Ill be discussing optimization for real mode (it being the most widely used mode at the moment), although many of the rules should apply to protected mode as well.
|486 optimization is generally more precise and less frustrating than optimization for other x86 processors because every 486 has an identical internal cache. Whenever both the instructions being executed and the data the instructions access are in the cache, those instructions will run in a consistent and calculatable number of cycles on all 486s, with little chance of interference from the prefetch queue and without regard to the speed of external memory.|
In other words, for cached code (which time-critical code almost always is), performance is predictable and can be calculated with good precision, and those calculations will apply on any 486. However, predictable doesnt mean trivial; the cycle times printed for the various instructions are not the whole story. You must be aware of all the rules, documented and undocumented, that go into calculating actual execution timesand uncovering some of those rules is exactly what this chapter is about.
Rule #1: Avoid indexed addressing (that is, try not to use either two registers or scaled addressing to point to memory).
Intel cautions against using indexing to address memory because theres a one-cycle penalty for indexed addressing. True enoughbut indexed addressing might not mean what you expect.
Traditionally, SI and DI are considered the index registers of the x86 CPUs. That is not the sense in which indexed addressing is meant here, however. In real mode, indexed addressing means that two registers, rather than one or none, are used to point to memory. (In this context, the use of one register to address memory is base addressing, no matter what register is used.) MOV AX, [BX+DI] and MOV CL, [BP+SI+10] perform indexed addressing; MOV AX,[BX] and MOV DL, [SI+1] do not.
|Therefore, in real mode, the rule is to avoid using two registers to point to memory whenever possible. Often, this simply means adding the two registers together outside a loop before memory is actually addressed.|
As an example, you might adhere to this rule by replacing the code
LoopTop: add ax,[bx+si] add si,2 dec cx jnz LoopTop
add si,bx LoopTop: add ax,[si] add si,2 dec cx jnz LoopTop sub si,bx
|Previous||Table of Contents||Next|