|Previous||Table of Contents||Next|
Lets look at examples from opposite ends of the spectrum in terms of the impact of DRAM refresh on code performance. First, consider the series of MUL instructions in Listing 4.9. Since a 16-bit MUL on the 8088 executes in between 118 and 133 cycles and is only 2 bytes long, there should be plenty of time for the prefetch queue to fill after each instruction, even after DRAM refresh has taken its slice of memory access time. Consequently, the prefetch queue should be able to keep the Execution Unit well-supplied with instruction bytes at all times. Since Listing 4.9 uses no memory operands, the Execution Unit should never have to wait for data from memory, and DRAM refresh should have no impact on performance. (Remember that the Execution Unit can operate normally during DRAM refreshes so long as it doesnt need to request a memory access from the Bus Interface Unit.)
LISTING 4.9 LST4-9.ASM
; Measures the performance of repeated MUL instructions, ; which allow the prefetch queue to be full at all times, ; to demonstrate a case in which DRAM refresh has no impact ; on code performance. ; sub ax,ax call ZTimerOn rept 1000 mul ax endm call ZTimerOff
Running Listing 4.9, we find that each MUL executes in 24.72 µs, or exactly 118 cycles. Since thats the shortest time in which MUL can execute, we can see that no performance is lost to DRAM refresh. Listing 4.9 clearly illustrates that DRAM refresh only affects code performance when a DRAM refresh forces the Execution Unit of the 8088 to wait for a memory access.
Now lets look at the series of SHR instructions shown in Listing 4.10. Since SHR executes in 2 cycles but is 2 bytes long, the prefetch queue should be empty while Listing 4.10 executes, with the 8088 prefetching instruction bytes non-stop. As a result, the time per instruction of Listing 4.10 should precisely reflect the time required to fetch the instruction bytes.
LISTING 4.10 LST4-10.ASM
; Measures the performance of repeated SHR instructions, ; which empty the prefetch queue, to demonstrate the ; worst-case impact of DRAM refresh on code performance. ; call ZTimerOn rept 1000 shr ax,1 endm call ZTimerOff
Since 4 cycles are required to read each instruction byte, wed expect each SHR to execute in 8 cycles, or 1.676 µs, if there were no DRAM refresh. In fact, each SHR in Listing 4.10 executes in 1.81 µs, indicating that DRAM refresh is taking 7.4 percent of the programs execution time. Thats nearly 2 percent more than our worst-case estimate of the loss to DRAM refresh overhead! In fact, the result indicates that DRAM refresh is stealing not 4, but 5.33 cycles out of every 72 cycles. How can this be?
The answer is that a given DRAM refresh can actually hold up CPU memory accesses for as many as 6 cycles, depending on the timing of the DRAM refreshs DMA request relative to the 8088s internal instruction execution state. When the code in Listing 4.10 runs, each DRAM refresh holds up the CPU for either 5 or 6 cycles, depending on where the 8088 is in executing the current SHR instruction when the refresh request occurs. Now we see that things can get even worse than we thought: DRAM refresh can steal as much as 8.33 percent of available memory access time6 out of every 72 cyclesfrom the 8088.
Which of the two cases weve examined reflects reality? While either case can happen, the latter casesignificant performance reduction, ranging as high as 8.33 percentis far more likely to occur. This is especially true for high-performance assembly code, which uses fast instructions that tend to cause non-stop instruction fetching.
Hmmm. When we discovered the prefetch queue cycle-eater, we learned to use short instructions. When we discovered the 8-bit bus cycle-eater, we learned to use byte-sized memory operands whenever possible, and to keep word-sized variables in registers. What can we do to work around the DRAM refresh cycle-eater?
As Ive said before, DRAM refresh is an act of God. DRAM refresh is a fundamental, unchanging part of the PCs operation, and theres nothing you or I can do about it. If refresh were any less frequent, the reliability of the PC would be compromised, so tinkering with either timer 1 or DMA channel 0 to reduce DRAM refresh overhead is out. Nor is there any way to structure code to minimize the impact of DRAM refresh. Sure, some instructions are affected less by DRAM refresh than others, but how many multiplies and divides in a row can you really use? I suppose that code could conceivably be structured to leave a free memory access every 72 cycles, so DRAM refresh wouldnt have any effect. In the old days when code size was measured in bytes, not K bytes, and processors were less powerfuland complexprogrammers did in fact use similar tricks to eke every last bit of performance from their code. When programming the PC, however, the prefetch queue cycle-eater would make such careful code synchronization a difficult task indeed, and any modest performance improvement that did result could never justify the increase in programming complexity and the limits on creative programming that such an approach would entail. Besides, all that effort goes to waste on faster 8088s, 286s, and other computers with different execution speeds and refresh characteristics. Theres no way around it: Useful code accesses memory frequently and at irregular intervals, and over the long haul DRAM refresh always exacts its price.
If youre still harboring thoughts of reducing the overhead of DRAM refresh, consider this. Instructions that tend not to suffer very much from DRAM refresh are those that have a high ratio of execution time to instruction fetch time, and those arent the fastest instructions of the PC. It certainly wouldnt make sense to use slower instructions just to reduce DRAM refresh overhead, for its total execution timeDRAM refresh, instruction fetching, and allthat matters.
The important thing to understand about DRAM refresh is that it generally slows your code down, and that the extent of that performance reduction can vary considerably and unpredictably, depending on how the DRAM refreshes interact with your codes pattern of memory accesses. When you use the Zen timer and get a fractional cycle count for the execution time of an instruction, thats often the DRAM refresh cycle-eater at work. (The display adapter cycleis another possible culprit, and, on 386s and later processors, cache misses and pipeline execution hazards produce this sort of effect as well.) Whenever you get two timing results that differ less or more than they seemingly should, thats usually DRAM refresh too. Thanks to DRAM refresh, variations of up to 8.33 percent in PC code performance are par for the course.
Wait states are cycles during which a bus access by the CPU to a device on the PCs bus is temporarily halted by that device while the device gets ready to complete the read or write. Wait states are well and truly the lowest level of code performance. Everything we have discussed (and will discuss)even DMA accessescan be affected by wait states.
|Previous||Table of Contents||Next|