Michael Abrash's Graphics Programming Black Book Special Edition: In the Lair of the Cycle-Eaters

Table of Contents

Back to the 8-bit bus cycle-eater. As I’ve said, in 8088 work you should strive to use byte-sized memory variables whenever possible. That does not mean that you should use 2 byte-sized memory accesses to manipulate a word-sized memory variable in preference to 1 word-sized memory access, as, for instance,

mov  dl,byte ptr [MemVar]
mov  dh,byte ptr [MemVar+1]

versus:

mov  dx,word ptr [MemVar]

Recall that every access to a memory byte takes at least 4 cycles; that limitation is built right into the 8088. The 8088 is also built so that the second byte-sized memory access to a 16-bit memory variable takes just those 4 cycles and no more. There’s no way you can manipulate the second byte of a word-sized memory variable faster with a second separate byte-sized instruction in less than 4 cycles. As a matter of fact, you’re bound to access that second byte much more slowly with a separate instruction, thanks to the overhead of instruction fetching and execution, address calculation, and the like.

For example, consider Listing 4.3, which performs 1,000 word-sized reads from memory. This code runs in 3.77 µs per word read on a 4.77 MHz 8088. That’s 45 percent faster than the 5.49 µs per word read of Listing 4.4, which reads the same 1,000 words as Listing 4.3 but does so with 2,000 byte-sized reads. Both listings perform exactly the same number of memory accesses—2,000 accesses, each byte-sized, as all 8088 memory accesses must be. (Remember that the Bus Interface Unit must perform two byte-sized memory accesses in order to handle a word-sized memory operand.) However, Listing 4.3 is considerably faster because it expends only 4 additional cycles to read the second byte of each word, while Listing 4.4 performs a second LODSB, requiring 13 cycles, to read the second byte of each word.

LISTING 4.3 LST4-3.ASM

; Measures the performance of reading 1,000 words
; from memory with 1,000 word-sized accesses.
;
     sub  si,si
     mov  cx,1000
     call ZTimerOn
     rep  lodsw
     call ZTimerOff

LISTING 4.4 LST4-4.ASM

; Measures the performance of reading 1000 words
; from memory with 2,000 byte-sized accesses.
;
     sub  si,si
     mov  cx,2000
     call ZTimerOn
     rep  lodsb
     call ZTimerOff

In short, if you must perform a 16-bit memory access, let the 8088 break the access into two byte-sized accesses for you. The 8088 is more efficient at that task than your code can possibly be.

Word-sized variables should be stored in registers to the greatest feasible extent, since registers are inside the 8088, where 16-bit operations are just as fast as 8-bit operations because the 8-bit cycle-eater can’t get at them. In fact, it’s a good idea to keep as many variables of all sorts in registers as you can. Instructions with register-only operands execute very rapidly, partially because they avoid both the time-consuming memory accesses and the lengthy address calculations associated with memory operands.

There is yet another reason why register operands are preferable to memory operands, and it’s an unexpected effect of the 8-bit bus cycle-eater. Instructions with only register operands tend to be shorter (in terms of bytes) than instructions with memory operands, and when it comes to performance, shorter is usually better. In order to explain why that is true and how it relates to the 8-bit bus cycle-eater, I must diverge for a moment.

For the last few pages, you may well have been thinking that the 8-bit bus cycle-eater, while a nuisance, doesn’t seem particularly subtle or difficult to quantify. After all, any instruction reference tells us exactly how many cycles each instruction loses to the 8-bit bus cycle-eater, doesn’t it?

Yes and no. It’s true that in general we know approximately how much longer a given instruction will take to execute with a word-sized memory operand than with a byte-sized operand, although the dynamic RAM refresh and wait state cycle-eaters (which I’ll cover a little later) can raise the cost of the 8-bit bus cycle-eater considerably. However, all word-sized memory accesses lose 4 cycles to the 8-bit bus cycle-eater, and there’s one sort of word-sized memory access we haven’t discussed yet: instruction fetching. The ugliest manifestation of the 8-bit bus cycle-eater is in fact the prefetch queue cycle-eater.

The Prefetch Queue Cycle-Eater

In an 8088 context, here’s the prefetch queue cycle-eater in a nutshell: The 8088’s 8-bit external data bus keeps the Bus Interface Unit from fetching instruction bytes as fast as the 16-bit Execution Unit can execute them, so the Execution Unit often lies idle while waiting for the next instruction byte to be fetched.

Exactly why does this happen? Recall that the 8088 is an 8086 internally, but accesses word-sized memory data at only one-half the maximum rate of the 8086 due to the 8088’s 8-bit external data bus. Unfortunately, instructions are among the word-sized data the 8086 fetches, meaning that the 8088 can fetch instructions at only one-half the speed of the 8086. On the other hand, the 8086-equivalent Execution Unit of the 8088 can execute instructions every bit as fast as the 8086. The net result is that the Execution Unit burns up instruction bytes much faster than the Bus Interface Unit can fetch them, and ends up idling while waiting for instructions bytes to arrive.

The BIU can fetch instruction bytes at a maximum rate of one byte every 4 cycles—and that 4-cycle per instruction byte rate is the ultimate limit on overall instruction execution time, regardless of EU speed. While the EU may execute a given instruction that’s already in the prefetch queue in less than 4 cycles per byte, over time the EU can’t execute instructions any faster than they can arrive—and they can’t arrive faster than 1 byte every 4 cycles.

Clearly, then, the prefetch queue cycle-eater is nothing more than one aspect of the 8-bit bus cycle-eater. 8088 code often runs at less than the Execution Unit’s maximum speed because the 8-bit data bus can’t keep up with the demand for instruction bytes. That’s straightforward enough—so why all the fuss about the prefetch queue cycle-eater?

What makes the prefetch queue cycle-eater tricky is that it’s undocumented and unpredictable. That is, with a word-sized memory access, such as

mov  [bx],ax

it’s well-documented that an extra 4 cycles will always be required to write the upper byte of AX to memory. Not so with the prefetch queue cycle-eater lurking nearby. For instance, the instructions

shr  ax,1
shr  ax,1
shr  ax,1
shr  ax,1
shr  ax,1

should execute in 10 cycles, since each SHR takes 2 cycles to execute, according to Intel’s specifications. Those specifications contain Intel’s official instruction execution times, but in this case—and in many others—the specifications are drastically wrong. Why? Because they describe execution time once an instruction reaches the prefetch queue. They say nothing about whether a given instruction will be in the prefetch queue when it’s time for that instruction to run, or how long it will take that instruction to reach the prefetch queue if it’s not there already. Thanks to the low performance of the 8088’s external data bus, that’s a glaring omission—but, alas, an unavoidable one. Let’s look at why the official execution times are wrong, and why that can’t be helped.

Table of Contents