|Previous||Table of Contents||Next|
Back to the 8-bit bus cycle-eater. As Ive said, in 8088 work you should strive to use byte-sized memory variables whenever possible. That does not mean that you should use 2 byte-sized memory accesses to manipulate a word-sized memory variable in preference to 1 word-sized memory access, as, for instance,
mov dl,byte ptr [MemVar] mov dh,byte ptr [MemVar+1]
mov dx,word ptr [MemVar]
Recall that every access to a memory byte takes at least 4 cycles; that limitation is built right into the 8088. The 8088 is also built so that the second byte-sized memory access to a 16-bit memory variable takes just those 4 cycles and no more. Theres no way you can manipulate the second byte of a word-sized memory variable faster with a second separate byte-sized instruction in less than 4 cycles. As a matter of fact, youre bound to access that second byte much more slowly with a separate instruction, thanks to the overhead of instruction fetching and execution, address calculation, and the like.
For example, consider Listing 4.3, which performs 1,000 word-sized reads from memory. This code runs in 3.77 µs per word read on a 4.77 MHz 8088. Thats 45 percent faster than the 5.49 µs per word read of Listing 4.4, which reads the same 1,000 words as Listing 4.3 but does so with 2,000 byte-sized reads. Both listings perform exactly the same number of memory accesses2,000 accesses, each byte-sized, as all 8088 memory accesses must be. (Remember that the Bus Interface Unit must perform two byte-sized memory accesses in order to handle a word-sized memory operand.) However, Listing 4.3 is considerably faster because it expends only 4 additional cycles to read the second byte of each word, while Listing 4.4 performs a second LODSB, requiring 13 cycles, to read the second byte of each word.
LISTING 4.3 LST4-3.ASM
; Measures the performance of reading 1,000 words ; from memory with 1,000 word-sized accesses. ; sub si,si mov cx,1000 call ZTimerOn rep lodsw call ZTimerOff
LISTING 4.4 LST4-4.ASM
; Measures the performance of reading 1000 words ; from memory with 2,000 byte-sized accesses. ; sub si,si mov cx,2000 call ZTimerOn rep lodsb call ZTimerOff
In short, if you must perform a 16-bit memory access, let the 8088 break the access into two byte-sized accesses for you. The 8088 is more efficient at that task than your code can possibly be.
Word-sized variables should be stored in registers to the greatest feasible extent, since registers are inside the 8088, where 16-bit operations are just as fast as 8-bit operations because the 8-bit cycle-eater cant get at them. In fact, its a good idea to keep as many variables of all sorts in registers as you can. Instructions with register-only operands execute very rapidly, partially because they avoid both the time-consuming memory accesses and the lengthy address calculations associated with memory operands.
There is yet another reason why register operands are preferable to memory operands, and its an unexpected effect of the 8-bit bus cycle-eater. Instructions with only register operands tend to be shorter (in terms of bytes) than instructions with memory operands, and when it comes to performance, shorter is usually better. In order to explain why that is true and how it relates to the 8-bit bus cycle-eater, I must diverge for a moment.
For the last few pages, you may well have been thinking that the 8-bit bus cycle-eater, while a nuisance, doesnt seem particularly subtle or difficult to quantify. After all, any instruction reference tells us exactly how many cycles each instruction loses to the 8-bit bus cycle-eater, doesnt it?
Yes and no. Its true that in general we know approximately how much longer a given instruction will take to execute with a word-sized memory operand than with a byte-sized operand, although the dynamic RAM refresh and wait state cycle-eaters (which Ill cover a little later) can raise the cost of the 8-bit bus cycle-eater considerably. However, all word-sized memory accesses lose 4 cycles to the 8-bit bus cycle-eater, and theres one sort of word-sized memory access we havent discussed yet: instruction fetching. The ugliest manifestation of the 8-bit bus cycle-eater is in fact the prefetch queue cycle-eater.
In an 8088 context, heres the prefetch queue cycle-eater in a nutshell: The 8088s 8-bit external data bus keeps the Bus Interface Unit from fetching instruction bytes as fast as the 16-bit Execution Unit can execute them, so the Execution Unit often lies idle while waiting for the next instruction byte to be fetched.
Exactly why does this happen? Recall that the 8088 is an 8086 internally, but accesses word-sized memory data at only one-half the maximum rate of the 8086 due to the 8088s 8-bit external data bus. Unfortunately, instructions are among the word-sized data the 8086 fetches, meaning that the 8088 can fetch instructions at only one-half the speed of the 8086. On the other hand, the 8086-equivalent Execution Unit of the 8088 can execute instructions every bit as fast as the 8086. The net result is that the Execution Unit burns up instruction bytes much faster than the Bus Interface Unit can fetch them, and ends up idling while waiting for instructions bytes to arrive.
The BIU can fetch instruction bytes at a maximum rate of one byte every 4 cyclesand that 4-cycle per instruction byte rate is the ultimate limit on overall instruction execution time, regardless of EU speed. While the EU may execute a given instruction thats already in the prefetch queue in less than 4 cycles per byte, over time the EU cant execute instructions any faster than they can arriveand they cant arrive faster than 1 byte every 4 cycles.
Clearly, then, the prefetch queue cycle-eater is nothing more than one aspect of the 8-bit bus cycle-eater. 8088 code often runs at less than the Execution Units maximum speed because the 8-bit data bus cant keep up with the demand for instruction bytes. Thats straightforward enoughso why all the fuss about the prefetch queue cycle-eater?
What makes the prefetch queue cycle-eater tricky is that its undocumented and unpredictable. That is, with a word-sized memory access, such as
its well-documented that an extra 4 cycles will always be required to write the upper byte of AX to memory. Not so with the prefetch queue cycle-eater lurking nearby. For instance, the instructions
shr ax,1 shr ax,1 shr ax,1 shr ax,1 shr ax,1
should execute in 10 cycles, since each SHR takes 2 cycles to execute, according to Intels specifications. Those specifications contain Intels official instruction execution times, but in this caseand in many othersthe specifications are drastically wrong. Why? Because they describe execution time once an instruction reaches the prefetch queue. They say nothing about whether a given instruction will be in the prefetch queue when its time for that instruction to run, or how long it will take that instruction to reach the prefetch queue if its not there already. Thanks to the low performance of the 8088s external data bus, thats a glaring omissionbut, alas, an unavoidable one. Lets look at why the official execution times are wrong, and why that cant be helped.
|Previous||Table of Contents||Next|