Previous Table of Contents Next

Listing Borland Microsoft Borland Microsoft Assembly Optimization

(no opt) (no opt) (opt) (opt)
1 166.9 166.8 167.0 165.8 155.1 1.08
4 13.5 13.6 13.5 13.5 ... 1.01
5 4.7 5.5 3.8 3.4 2.7 2.04
Ratio best
to worst
35.51 30.33 43.95 48.76 57.44
Note: The execution times (in seconds) for this chapter’s listings were timed when the compiled listings were run on the WordPerfect 4.2 thesaurus file TH.WP (362,293 bytes in size), as compiled in the small model with Borland and Microsoft compilers with optimization on (opt) and off (no opt). All times were measured with Paradigm Systems’ TIMER program on a 10 MHz 1-wait-state AT clone with a 28-ms hard disk, with disk caching turned off.

Table 1.1 Execution Times for WordPerfect Checksum.

LISTING 1.2 L1-2.C

* Program to calculate the 16-bit checksum of the stream of bytes
* from the specified file. Obtains the bytes one at a time in
* assembler, via direct calls to DOS.

#include <stdio.h>
#include <fcntl.h>

main(int argc, char *argv[]) {
      int Handle;
      unsigned char Byte;
      unsigned int Checksum;
      int ReadLength;

      if ( argc != 2 ) {
            printf(“usage: checksum filename\n”);
      if ( (Handle = open(argv[1], O_RDONLY | O_BINARY)) == -1 ) {
            printf(“Can’t open file: %s\n”, argv[1]);
      if ( !ChecksumFile(Handle, &Checksum) ) {
            printf(“Error reading file %s\n”, argv[1]);

      /* Report the result */
      printf(“The checksum is: %u\n”, Checksum);


; Assembler subroutine to perform a 16-bit checksum on the file
; opened on the passed-in handle. Stores the result in the
; passed-in checksum variable. Returns 1 for success, 0 for error.
; Call as:
;           int ChecksumFile(unsigned int Handle, unsigned int *Checksum);
; where:
;           Handle = handle # under which file to checksum is open
;           Checksum = pointer to unsigned int variable checksum is
;           to be stored in
; Parameter structure:
Parms      struc
                 dw        ?       ;pushed BP
                 dw        ?       ;return address
Handle           dw        ?
Checksum         dw        ?
Parms      ends
                 .model small
TempWord label   word
TempByte         db        ?       ;each byte read by DOS will be stored here
                 db        0       ;high byte of TempWord is always 0
                                   ;for 16-bit adds
                 public _ChecksumFile
_ChecksumFile    proc near
                 push      bp
                 mov       bp,sp
                 push      si                  ;save C’s register variable
                 mov       bx,[bp+Handle]       ;get file handle
                 sub       si,si                ;zero the checksum ;accumulator
                 mov       cx,1                 ;request one byte on each ;read
                 mov       dx,offset TempByte   ;point DX to the byte in
                                                ;which DOS should store
                                                ;each byte read
                 mov       ah,3fh               ;DOS read file function #
                 int       21h                  ;read the byte
jcErrorEnd;an error occurred
                 and       ax,ax                ;any bytes read?
                 jz        Success              ;no-end of file reached-we’re done
                 add       si,[TempWord]        ;add the byte into the
                                                ;checksum total
                 sub       ax,ax                ;error
                 jmp       short Done
                 mov       bx,[bp+Checksum] ;point to the checksum variable
                 mov       [bx],si              ;save the new checksum
                 mov       ax,1                 ;success
                 pop       si                   ;restore C’s register variable
                 pop       bp

The lesson is clear: Optimization makes code faster, but without proper design, optimization just creates fast slow code.

Well, then, how are we going to improve our design? Before we can do that, we have to understand what’s wrong with the current design.

Know the Territory

Just why is Listing 1.1 so slow? In a word: overhead. The C library implements the read() function by calling DOS to read the desired number of bytes. (I figured this out by watching the code execute with a debugger, but you can buy library source code from both Microsoft and Borland.) That means that Listing 1.1 (and Listing 1.3 as well) executes one DOS function per byte processed—and DOS functions, especially this one, come with a lot of overhead.

For starters, DOS functions are invoked with interrupts, and interrupts are among the slowest instructions of the x86 family CPUs. Then, DOS has to set up internally and branch to the desired function, expending more cycles in the process. Finally, DOS has to search its own buffers to see if the desired byte has already been read, read it from the disk if not, store the byte in the specified location, and return. All of that takes a long time—far, far longer than the rest of the main loop in Listing 1.1. In short, Listing 1.1 spends virtually all of its time executing read(), and most of that time is spent somewhere down in DOS.

You can verify this for yourself by watching the code with a debugger or using a code profiler, but take my word for it: There’s a great deal of overhead to DOS calls, and that’s what’s draining the life out of Listing 1.1.

How can we speed up Listing 1.1? It should be clear that we must somehow avoid invoking DOS for every byte in the file, and that means reading more than one byte at a time, then buffering the data and parceling it out for examination one byte at a time. By gosh, that’s a description of C’s stream I/O feature, whereby C reads files in chunks and buffers the bytes internally, doling them out to the application as needed by reading them from memory rather than calling DOS. Let’s try using stream I/O and see what happens.

Listing 1.4 is similar to Listing 1.1, but uses fopen() and getc() (rather than open() and read()) to access the file being checksummed. The results confirm our theories splendidly, and validate our new design. As shown in Table 1.1, Listing 1.4 runs more than an order of magnitude faster than even the assembly version of Listing 1.1, even though Listing 1.1 and Listing 1.4 look almost the same. To the casual observer, read() and getc() would seem slightly different but pretty much interchangeable, and yet in this application the performance difference between the two is about the same as that between a 4.77 MHz PC and a 16 MHz 386.

Previous Table of Contents Next

Graphics Programming Black Book © 2001 Michael Abrash