Michael Abrash's Graphics Programming Black Book Special Edition: Crossing the Border

Searching Files with Restartable Blocks

We just moved. Those three little words should strike terror into the heart of anyone who owns more than a sleeping bag and a toothbrush. Our last move was the usual zoo—and then some. Because the distance from the old house to the new was only five miles, we used cars to move everything smaller than a washing machine. We have a sizable household—cats, dogs, kids, com, you name it—so the moving process took a number of car trips. A large number—33, to be exact. I personally spent about 15 hours just driving back and forth between the two houses. The move took days to complete.

Never again.

You’re probably wondering two things: What does this have to do with high-performance programming, and why on earth didn’t I rent a truck and get the move over in one or two trips, saving hours of driving? As it happens, the second question answers the first. I didn’t rent a truck because it seemed easier and cheaper to use cars—no big truck to drive, no rentals, spread the work out more manageably, and so on.

It wasn’t easier, and wasn’t even much cheaper. (It costs quite a bit to drive a car 330 miles, to say nothing of the value of 15 hours of my time.) But, at the time, it seemed as though my approach would be easier and cheaper. In fact, I didn’t realize just how much time I had wasted driving back and forth until I sat down to write this chapter.

In Chapter 1, I briefly discussed using restartable blocks. This, you might remember, is the process of handling in chunks data sets too large to fit in memory so that they can be processed just about as fast as if they did fit in memory. The restartable block approach is very fast but is relatively difficult to program.

At the opposite end of the spectrum lies byte-by-byte processing, whereby DOS (or, in less extreme cases, a group of library functions) is allowed to do all the hard work, so that you only have to deal with one byte at a time. Byte-by-byte processing is easy to program but can be extremely slow, due to the vast overhead that results from invoking DOS each time a byte must be processed.

Sound familiar? It should. I moved via the byte-by-byte approach, and the overhead of driving back and forth made for miserable performance. Renting a truck (the restartable block approach) would have required more effort and forethought, but would have paid off handsomely.

The easy, familiar approach often has nothing in its favor except that it requires less thinking; not a great virtue when writing high-performance code—or when moving.

And with that, let’s look at a fairly complex application of restartable blocks.

Searching for Text

The application we’re going to examine searches a file for a specified string. We’ll develop a program that will search the file specified on the command line for a string (also specified on the comline), then report whether the string was found or not. (Because the searched-for string is obtained via argv, it can’t contain any whitespace characters.)

This is a very limited subset of what search utilities such as grep can do, and isn’t really intended to be a generally useful application; the purpose is to provide insight into restartable blocks in particular and optimization in general in the course of developing a search engine. That search engine will, however, be easy to plug into any program, and there’s nothing preventing you from using it in a more fruitful context, like searching through a user-selectable file set.

The first point to address in designing our program involves the appropriate text-search approach to use. Literally dozens of workable ways exist to search a file. We can immediately discard all approaches that involve reading any byte of the file more than once, because disk access time is orders of magnitude slower than any data handling performed by our own code. Based on our experience in Chapter 1, we can also discard all approaches that get bytes either one at a time or in small sets from DOS. We want to read big “buffers-full” of bytes at a pop from the searched file, and the bigger the buffer the better—in order to minimize DOS’s overhead. A good rough cut is a buffer that will be between 16K and 64K, depending on the exact search approach, 64K being the maximum size because near pointers make for superior performance.

So we know we want to work with a large buffer, filling it as infrequently as possible. Now we have to figure out how to search through a file by loading it into that large buffer in chunks. To accomplish this, we have to know how we want to do our searching, and that’s not immediately obvious. Where do we begin?

Well, it might be instructive to consider how we would search if our search involved only one buffer, already resident in memory. In other words, suppose we don’t have to bother with file handling at all, and further suppose that we don’t have to deal with searching through multiple blocks. After all, that’s a good description of the all-important inner loop of our searching program, where the program will spend virtually all of its time (aside from the unavoidable disk access overhead).

Avoiding the String Trap

The easiest approach would be to use a C/C++ library function. The closest match to what we need is strstr(), which searches one string for the first occurrence of a second string. However, while strstr() would work, it isn’t ideal for our purposes. The problem is this: Where we want to search a fixed-length buffer for the first occurrence of a string, strstr() searches a string for the first occurrence of another string.

Table of Contents

Chapter 5Crossing the Border

Searching Files with Restartable Blocks

Searching for Text

Avoiding the String Trap

Chapter 5
Crossing the Border