Overview
Being able to search for a string in a larger string is very important in computer science. There are a number of applications, not the least of which is the Ctrl+F
command that everyone is used to. Boyer-Moore is an extremely fast algorithm for string search and it runs in O(n) time. The problem is that Boyer-Moore’s implementation is not trivial.
All is not lost, there is a derivation of Boyer-Moore which is easier to implement. It is called the Boyer-Moore-Horspool algorithm. I will be going over the code for this algorithm in this post.
Content
The algorithm goes from the beginning to the end of the long text comparing from the back of the sub.
Let’s see an example.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
/*
I wish I had more time to learn algorithms
learn
|
I wish I had more time to learn algorithms
learn
|
I wish I had more time to learn algorithms
learn
|
I wish I had more time to learn algorithms
learn
|
I wish I had more time to learn algorithms
learn
|
I wish I had more time to learn algorithms
learn
|
I wish I had more time to learn algorithms
learn
|||||
*/
Beautiful, isn’t it? How does it work? Well, it creates a dictionary mapping the characters in the substring to how much to skip ahead so that the characters align. At the penultimate step, it sees that r and n do not match, so it looks up r in the dictionary. It finds r which says to move over by 1. We then try to match the substring backwards again. This time, we succeed.
Here is the full implementation.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
public static int BoyerMooreHorspool(string text, string sub) {
int length = sub.Length;
// This is our step dictionary
Dictionary<char, int> badTable = new Dictionary<char, int>();
for (int i = 0; i < length; ++i) {
// this is how much we have to skip to get from the last char
// to each other char
badTable[sub[i]] = length - i - 1;
}
// the last character is always the length of the substring
badTable[sub[length - 1]] = length;
int mainIdx = length - 1; // pointer which will go over text
int subIdx = length - 1; // pointer which will go over sub
// while the text pointer isn't at the end
while (mainIdx < text.Length) {
int tempMain = mainIdx;
// we found a match!
if (text[mainIdx] == sub[subIdx]) {
int tempSub = subIdx;
// now try to go backwards and see
// if the other characters match too
while ((tempMain > 0 && mainIdx - tempMain) < (length - 1)
&& text[tempMain]) == sub[tempSub]) {
--tempMain;
--tempSub;
}
// all characters match, return the beginning of the match
if ((mainIdx - tempMain) == (length - 1)) {
return tempMain;
}
}
// not all characters match, look if the
// non matching character is in our table
if (badTable.ContainsKey(text[tempMain])) {
// it's in our table, move the mainIdx
// forward by that number
mainIdx += badTable[text[tempMain]];
} else {
// otherwise, just move forward
// by the length of sub
mainIdx += length;
}
}
return -1;
}