This chapter, adapted from my earlier book, Zen of Assembly Language (located on the companion CD-ROM), goes right to the heart of my philosophy of optimization: Understand where the time really goes when your code runs. That may sound ridiculously simple, but, as this chapter makes clear, it turns out to be a challenging task indeed, one that at times verges on black magic. This chapter is a long-time favorite of mine because it was the first (and to a large extent only) work that I know of that discussed this material, thereby introducing a generation of PC programmers to pedal-to-the-metal optimization.
This chapter focuses almost entirely on the first popular x86-family processor, the 8088. Some of the specific features and results that I cite in this chapter are no longer applicable to modern x86-family processors such as the 486 and Pentium, as I'll point out later on when we discuss those processors. Nonetheless, the overall theme of this chapter (that understanding dimly-seen and poorly-documented code gremlins called cycle-eaters that lurk in your system is essential to performance programming) is every bit as valid today. Also, later chapters often refer back to the basic cycle-eaters described in this chapter, so this chapter is the foundation for the discussions of x86-family optimization to come. What's more, the Zen timer remains an excellent tool with which to flush out and examine cycle-eaters, as we'll see in later chapters, and this chapter is as good an illustration of how to use the Zen timer as you're likely to find.
So, don't take either the absolute or the relative execution times presented in this chapter as gospel for newer processors, and read on to later chapters to see how the cycle-eaters and optimization rules have changed over time, but do take the time to at least skim through this chapter to give yourself a good start on the material in the rest of this book.
Programming has many levels, ranging from the familiar (high-level languages, DOS calls, and the like) down to the esoteric things that lie on the shadowy edge of hardware-land. I call these cycle-eaters because, like the monsters in a bad '50s horror movie, they lurk in those shadows, taking their share of your program's performance without regard to the forces of goodness or the U.S. Army. In this chapter, we're going to jump right in at the lowest level by examining the cycle-eaters that live beneath the programming interface; that is, beneath your application, DOS, and BIOS (in fact, beneath the instruction set itself).
Why start at the lowest level? Simply because cycle-eaters affect the performance of all assembler code, and yet are almost unknown to most programmers. A full understanding of code optimization requires an understanding of cycle-eaters and their implications. That's no simple task, and in fact it is in precisely that area that most books and articles about assembly programming fall short.
Nearly all literature on assembly programming discusses only the programming interface: the instruction set, the registers, the flags, and the BIOS and DOS calls. Those topics cover the functionality of assembly programs most thoroughly, but it's performance above all else that we're after. No one ever tells you about the raw stuff of performance, which lies beneath the programming interface, in the dimly-seen realm (populated by instruction prefetching, dynamic RAM refresh, and wait states) where software meets hardware. This area is the domain of hardware engineers, and is almost never discussed as it relates to code performance. And yet it is only by understanding the mechanisms operating at this level that we can fully understand and properly improve the performance of our code.
Which brings us to cycle-eaters.
Cycle-eaters are gremlins that live on the bus or in peripherals (and sometimes within the CPU itself), slowing the performance of PC code so that it doesn't execute at full speed. Most cycle-eaters (and all of those haunting the older Intel processors) live outside the CPU's Execution Unit, where they can only affect the CPU when the CPU performs a bus access (a memory or I/O read or write). Once your code and data are already inside the CPU, those cycle-eaters can no longer be a problem. Only on the 486 and Pentium CPUs will you find cycle-eaters inside the chip, as we'll see in later chapters.
The nature and severity of the cycle-eaters vary enormously from processor to processor, and (especially) from memory architecture to memory architecture. In order to understand them all, we need first to understand the simplest among them, those that haunted the original 8088-based IBM PC. Later on in this book, I'll be better able to explain the newer generation of cycle-eaters in terms of those ancestral cycle-eaters, but we have to get the groundwork down first.
Internally, the 8088 is a 16-bit processor, capable of running at full speed at all times, unless external data is required. External data must traverse the 8088's external data bus and the PC's data bus one byte at a time to and from peripherals, with cycle-eaters lurking along every step of the way. What's more, external data includes not only memory operands but also instruction bytes, so even instructions with no memory operands can suffer from cycle-eaters. Since some of the 8088's fastest instructions are register-only instructions, that's important indeed.
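To put rough numbers on why instruction bytes matter, here is a back-of-the-envelope sketch in Python. It is not a measurement and not a substitute for the Zen timer. It assumes the 8088's minimum bus access time of 4 cycles per byte and Intel's official figures for `SHR AX,1` (2 bytes long, 2 cycles of execution time). The function name `sustained_cycles` is made up for illustration, and the model ignores every other cycle-eater.

```python
# Rough model of a long run of identical register-only instructions on an 8088.
# Assumption: each instruction byte costs one minimum-length bus access.
BUS_CYCLES_PER_BYTE = 4  # minimum 8088 bus access: 4 clock cycles per byte


def sustained_cycles(instr_bytes, eu_cycles):
    """Approximate cycles per instruction in a long run of such instructions.

    Whichever is slower, fetching the instruction bytes or executing them,
    sets the pace. Everything else (refresh, wait states) is ignored.
    """
    fetch_cycles = instr_bytes * BUS_CYCLES_PER_BYTE
    return max(fetch_cycles, eu_cycles)


# SHR AX,1 is 2 bytes long with an official execution time of 2 cycles.
shr_cost = sustained_cycles(instr_bytes=2, eu_cycles=2)
print(shr_cost)  # 8: fetching, not execution, dominates
```

Under these assumptions, a register-only instruction that "should" take 2 cycles is paced by the 8 cycles needed just to bring its bytes across the bus.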
The major cycle-eaters are:
The locations of these cycle-eaters in the primordial 8088-based PC are shown in Figure 4.1. We'll cover each of the cycle-eaters in turn in this chapter. The material won't be easy since cycle-eaters are among the most subtle aspects of assembly programming. By the same token, however, this will be one of the most important and rewarding chapters in this book. Don't worry if you don't catch everything in this chapter, but do read it all even if the going gets a bit tough. Cycle-eaters play a key role in later chapters, so some familiarity with them is highly desirable.
Look! Down on the motherboard! It's a 16-bit processor! It's an 8-bit processor! It's...
Fans of the 8088 call it a 16-bit processor. Fans of other 16-bit processors call the 8088 an 8-bit processor. The truth of the matter is that the 8088 is a 16-bit processor that often performs like an 8-bit processor.
The 8088 is internally a full 16-bit processor, equivalent to an 8086. (In fact, the 8086 is identical to the 8088, except that it has a full 16-bit bus. The 8088 is basically the poor man's 8086, because it allows a cheaper (albeit slower) system to be built, thanks to the half-sized bus.) In terms of the instruction set, the 8088 is clearly a 16-bit processor, capable of performing any given 16-bit operation (addition, subtraction, even multiplication or division) with a single instruction. Externally, however, the 8088 is unequivocally an 8-bit processor, since the external data bus is only 8 bits wide. In other words, the programming interface is 16 bits wide, but the hardware interface is only 8 bits wide, as shown in Figure 4.2. The result of this mismatch is simple: Word-sized data can be transferred between the 8088 and memory or peripherals at only one-half the maximum rate of the 8086, which is to say one-half the maximum rate for which the Execution Unit of the 8088 was designed.
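The one-half figure falls straight out of counting bus accesses, as this minimal Python sketch shows. It assumes that a bus access takes at least 4 clock cycles on both chips and that the 8086 moves a word in a single access only when the word starts at an even address. The helper name `transfer_cycles` is hypothetical.

```python
# Count the minimum bus cycles needed to move data over buses of two widths.
# Assumption: one bus access takes at least 4 clock cycles on 8088 and 8086.
BUS_CYCLES_PER_ACCESS = 4


def transfer_cycles(num_bytes, bus_width_bytes):
    """Minimum cycles to move num_bytes over a bus bus_width_bytes wide."""
    accesses = -(-num_bytes // bus_width_bytes)  # ceiling division
    return accesses * BUS_CYCLES_PER_ACCESS


word_on_8088 = transfer_cycles(num_bytes=2, bus_width_bytes=1)  # two byte accesses
word_on_8086 = transfer_cycles(num_bytes=2, bus_width_bytes=2)  # one word access

print(word_on_8088, word_on_8086)   # 8 4
print(word_on_8086 / word_on_8088)  # 0.5: the 8088 moves words at half the rate
```

Byte-sized transfers cost the same 4 cycles in this model on either bus, so the penalty applies only to word-sized accesses.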