Previous Table of Contents Next

Chapter 14
Boyer-Moore String Searching

Optimizing a Pretty Optimum Search Algorithm

When you seem to be stumped, stop for a minute and think. All the information you need may be right in front of your nose if you just look at things a little differently. Here’s a case in point:

When I was in college, I used to stay around campus for the summer. Oh, I’d take a course or two, but mostly it was an excuse to hang out and have fun. In that spirit, my girlfriend, Adrian (not my future wife, partly for reasons that will soon become apparent), bussed in to spend a week, sharing a less-than-elegant $150 per month apartment with me and, by necessity, my roommate.

Our apartment was pretty much standard issue for two male college students; maybe even a cut above. The dishes were usually washed, there was generally food in the refrigerator, and nothing larger than a small dog had taken up permanent residence in the bathroom. However, there was one sticking point (literally): the kitchen floor. This floor—standard tile, with a nice pattern of black lines on an off-white background (or so we thought)—had never been cleaned. By which I mean that I know for a certainty that we had never cleaned it, but I suspect that it had in fact not been cleaned since the Late Jurassic, or possibly earlier. Our feet tended to stick to it; had the apartment suddenly turned upside-down, I think we’d all have been hanging from the ceiling.

One day, my roommate and I returned from a pick-up basketball game. Adrian, having been left to her own devices for a couple of hours, had apparently kept herself busy. “Notice anything?” she asked, with an edge to her voice that suggested we had damned well better.

“Uh, you cooked dinner?” I guessed. “Washed the dishes? Had your hair done?” My roommate was equally without a clue.

She stamped her foot (really; the only time I’ve ever seen it happen), and said, “No, you jerks! The kitchen floor! Look at the floor! I cleaned it!”

The floor really did look amazing. It was actually all white; the black lines had been grooves filled with dirt. We assured her that it looked terrific, it just wasn’t that obvious until you knew to look for it; anyone would tell you that it wasn’t the kind of thing that jumped out at you, but it really was great, no kidding. We had almost smoothed things over, when a friend walked in, looked around with a start, and said, “Hey! Did you guys put in a new floor?”

As I said, sometimes everything you need to know is right in front of your nose. Which brings us to Boyer-Moore string searching.

String Searching Refresher

I’ve discussed string searching earlier in this book, in Chapters 5 and 9. You may want to refer back to these chapters for some background on string searching in general. I’m also going to use some of the code from that chapter as part of this chapter’s test suite. For further information, you may want to refer to the discussion of string searching in the excellent Algorithms in C, by Robert Sedgewick (Addison-Wesley), which served as the primary reference for this chapter. (If you look at Sedgewick, be aware that in the Boyer-Moore listing on page 288, there is a mistake: “j > 0” in the for loop should be “j >= 0,” unless I’m missing something.)

String searching is the simple matter of finding the first occurrence of a particular sequence of bytes (the pattern) within another sequence of bytes (the buffer). The obvious, brute-force approach is to try every possible match location, starting at the beginning of the buffer and advancing one position after each mismatch, until either a match is found or the buffer is exhausted. There’s even a nifty string instruction, REPZ CMPS, that’s perfect for comparing the pattern to the contents of the buffer at each location. What could be simpler?

We have some important information that we’re not yet using, though. Typically, the buffer will contain a wide variety of bytes. Let’s assume that the buffer contains text, in which case there will be dozens of different characters; and although the distribution of characters won’t usually be even, neither will any one character constitute half the buffer, or anything close. A reasonable conclusion is that the first character of the pattern will rarely match the first character of the buffer location currently being checked. This allows us to use the speedy REPNZ SCASB to whiz through the buffer, eliminating most potential match locations with single repetitions of SCASB. Only when that first character does (infrequently) match must we drop back to the slower REPZ CMPS approach.

It’s important to understand that we’re assuming that the buffer is typical text. That’s what I meant at the outset, when I said that the information you need may be under your nose.

Formally, you don’t know a blessed thing about the search buffer, but experience, common sense, and your knowledge of the application give you a great deal of useful, if somewhat imprecise, information.

If the buffer contains the letter ‘A’ repeated 1,000 times, followed by the letter ‘B,’ then the REPNZ SCASB/REPZ CMPS approach will be much slower than the brute-force REPZ CMPS approach when searching for the pattern “AB,” because REPNZ SCASB would match at every buffer location. You could construct a horrendous worst-case scenario for almost any good optimization; the key is understanding the usual conditions under which your code will work.

As discussed in Chapter 9, we also know that certain characters have lower probabilities of matching than others. In a normal buffer, ‘T’ will match far more often than ‘X.’ Therefore, if we use REPNZ SCASB to scan for the least common letter in the search string, rather than the first letter, we’ll greatly decrease the number of times we have to drop back to REPZ CMPS, and the search time will become very close to the time it takes REPNZ SCASB to go from the start of the buffer to the match location. If the distance to the first match is N bytes, the least-common REPNZ SCASB approach will take about as long as N repetitions of REPNZ SCASB.

At this point, we’re pretty much searching at the speed of REPNZ SCASB. On the x86, there simply is no faster way to test each character in turn. In order to get any faster, we’d have to check fewer characters—but we can’t do that and still be sure of finding all matches. Can we?

Actually, yes, we can.

Previous Table of Contents Next

Graphics Programming Black Book © 2001 Michael Abrash