|Previous||Table of Contents||Next|
All our a priori knowledge of string searching is stated above, but theres another sort of knowledgeknowledge thats generated dynamically. As we search through the buffer, we acquire information each time we check for a match. One sort of information that we acquire is based on partial matches; we can often skip ahead after partial matches because (take a deep breath!) by partially matching, we have already implicitly done a comparison of the partially matched buffer characters with all possible pattern start locations that overlap those partially-matched bytes.
If that makes your head hurt, it shouldand dont worry. This line of thinking, which is the basis of the Knuth-Morris-Pratt algorithm and half the basis of the Boyer-Moore algorithm, is what gives Boyer-Moore its reputation for inscrutability. That reputation is well deserved for this aspect (which I will not discuss further in this book), but theres another part of Boyer-Moore thats easily understood, easily implemented, and highly effective.
Consider this: Were searching for the pattern ABC, beginning the search at the start (offset 0) of a buffer containing ABZABC. We match on A, we match on B, and we mismatch on C; the buffer contains a Z in this position. What have we learned? Why, weve learned not only that the pattern doesnt match the buffer starting at offset 0, but also that it cant possibly match starting at offset 1 or offset 2, either! After all, theres a Z in the buffer at offset 2; since the pattern doesnt contain a single Z, theres no way that the pattern can match starting at any location from which it would span the Z at offset 2. We can just skip straight from offset 0 to offset 3 and continue, saving ourselves two comparisons.
Unfortunately, this approach only pays off big when a near-complete partial match is found; if the comparison fails on the first pattern character, as often happens, we can only skip ahead 1 byte, as usual. Look at it differently, though: What if we compare the pattern starting with the last (rightmost) byte, rather than the first (leftmost) byte? In other words, what if we compare from high memory toward low, in the direction in which string instructions go after the STD instruction? After all, were comparing one set of bytes (the pattern) to another set of bytes (a portion of the buffer); it doesnt matter in the least in what order we compare them, so long as all the bytes in one set are compared to the corresponding bytes in the other set.
|Why on earth would we want to start with the rightmost character? Because a mismatch on the rightmost character tells us a great deal more than a mismatch on the leftmost character.|
We learn nothing new from a mismatch on the leftmost character, except that the pattern cant match starting at that location. A mismatch on the rightmost character, however, tells us about the possibilities of the pattern matching starting at every buffer location from which the pattern spans the mismatch location. If the mismatched character in the buffer doesnt appear in the pattern, then weve just eliminated not one potential match, but as many potential matches as there are characters in the pattern; thats how many locations there are in the buffer that might have matched, but have just been shown not to, because they overlap the mismatched character that doesnt belong in the pattern. In this case, we can skip ahead by the full pattern length in the buffer! This is how we can outperform even REPNZ SCASB; REPNZ SCASB has to check every byte in the buffer, but Boyer-Moore doesnt.
Figure 14.1 illustrates the operation of a Boyer-Moore search when the rightcharacter of the search pattern (which is the first character thats compared at each location because were comparing backwards) mismatches with a buffer character that appears nowhere in the pattern. Figure 14.2 illustrates the operation of a partial match when the mismatch occurs with a character thats not a pattern member. In this case, we can only skip ahead past the mismatch location, resulting in an advance of fewer bytes than the pattern length, and potentially as little as the same single byte distance by which the standard search approach advances.
Figure 14.1 Mismatch on first character checked.
What if the mismatch occurs with a buffer character that does occur in the pattern? Then we cant skip past the mismatch location, but we can skip to whatever location aligns the rightmost occurrence of that character in the pattern with the mismatch location, as shown in Figure 14.3.
Basically, we exercise our right as members of a free society to compare strings in whichever direction we choose, and we choose to do so right to left, rather than the more intuitive left to right. Whenever we find a mismatch, we see what we can learn from the buffer character that failed to match the pattern. Imagine that we move the pattern to the right across the mismatch location until we find a start location that the mismatch does not eliminate as a possible match for the pattern. If the mismatch character doesnt appear in the pattern, the pattern can move clear past the mismatch location. Otherwise, the pattern moves until a matching pattern byte lies atop the mismatch. Thats all there is to it!
Figure 14.2 Mismatch on third character checked.
The worst case for this version of Boyer-Moore is that the pattern mismatches on the leftmost characterthe last character comparedevery time. Again, not very likely, but it is true that this version of Boyer-Moore performs better as there are fewer and shorter partial matches; ideally, the rightmost character would never match until the full match location was reached. Longer patterns, which make for longer skips, help Boyer-Moore, as does a long distance to the match location, which helps diffuse the overhead of building the table of distances to skip ahead on all the possible mismatch values.
Figure 14.3 Mismatch on character that appears in pattern.
The best case for Boyer-Moore is good indeed: About N/M comparisons are required, where N is the buffer length and M is the pattern length. This reflects the ability of Boyer-Moore to skip ahead by a full pattern length on a complete mismatch.
|Previous||Table of Contents||Next|