The Task
Implementing wildcard search functionality is not easy. The user expects asterisks (*) and question marks (?) to work, while SQL demands percents (%) and underscores (_), regular expressions need the dot and an asterisk (.*) or just a dot. And then there is the rare case when someone wants to search for an asterisk and needs escaping of the usual wildcards to work. How can we find escaped and unescaped wildcards?
To get an impression of what we are trying to achieve, here are some examples:
User's Query Expression | Regular Expression | SQL Like Expression | Comment |
---|---|---|---|
Foo* | Foo.* | Foo% | Unescaped wildcard (asterisk) |
Foo?ar | Foo.ar | Foo_ar | Unescaped wildcard (question mark) |
Foo\*bar\? | Foo\*bar\? | Foo*bar? | Escaped wildcards |
Foo\\bar\\\? | Foo\\bar\\\? | Foo\\bar\\? | Escaped escape symbol and escaped wildcard |
.Foo%bar_ | \.Foo%bar_ | .Foo\%bar\_ | Wildcards of the target language |
The last example might not be obvious at first sight. The user is searching for some text containing a dot, a percent sign, and an underscore. These are no wildcards or special characters of our query language, but they are in regular expressions and SQL respectively. So if the target language is SQL, we have to take care of them, because users typically don't know anything of SQL wildcards.
So how do we identify the special character sequences to handle? Of course, we are using regular expressions:
Use | Regular Expression (Java) | Regular Expression (C#) | All Matches (red colored and underlined) |
---|---|---|---|
Finds unescaped wildcards | (?<!\\)(?:\\\\)*(\*|\?) |
(?<!\\)(?:\\\\)*(\*|\?) | ab*dethis\*notx\\\?yz\\12_but\\\\*this |
Finds escaped wildcards | |||
Finds regular expression wildcards | |||
Finds SQL wildcards | (%|_|\[|\]|\^) |
Java developers, beware to add additional backslashes for the Java string coding.
I will just explain the first of these regular expressions in detail. The others are easy to grasp once that one is tamed.
The Third Group (\*|\?)
Let's begin with the third group of the regular expression: (\*|\?)
. It tells the regular expression automaton to look for an asterisk or a question mark. That's not very surprising given that we are searching for unescaped wildcards, and asterisks and question marks are wildcards. Because both, the asterisk and the question mark, are symbols of regular expressions themselves, we have to escape each with a backslash.
If we took just this group to search for unescaped wildcards, we would have the following matches:
ab*dethis\*notx\\\?yz\\12_but\\\\*this
ab*dethis\*notx\\\?yz\\12_but\\\\*this
ab*dethis\*notx\\\?yz\\12_but\\\\*this
ab*dethis\*notx\\\?yz\\12_but\\\\*this
Obviously this regular expression is not sufficient. The two matches with a flash are not intended, because the wildcards are escaped.
The First Group (?<!\\)
To prevent the regular expression automaton to accept a backslash before the matching wildcard, the first group is prepended: (?<!\\)
. The regular expression under test looks like this now: (?<!\\)(\*|\?)
. The symbols ?<!
introduce a negative look-behind. This negative look-behind tells the regular expression automaton to not accept a match of the following group if there is a backslash before it.
The test of this regular expression shows the following match:
ab*dethis\*notx\\\?yz\\12_but\\\\*this
The fourth item of the first match list is missing now. All wildcards following a backslash are not accepted by the negative look-behind, even if the backslash is escaped. This regular expression is too restrictive.
The Second Group (?:\\\\)*
So let's have a look at the second group completing our regular expression: (?:\\\\)*
. To begin with, we ignore the question mark followed by a colon (?:) and the asterisk after the parenthesis, we concentrate on the four backslashes: \\\\
. Two of those four backslashes are escape symbols, the first and the third. The rest is what we are really looking for. We tell the regular expression automaton to search for a backslash followed by a backslash. If we test this expression we get the following matches (again colored in red and underlined):
ab*dethis\*notx\\\?yz\\12_but\\\\*this
ab*dethis\*notx\\\?yz\\12_but\\\\*this
ab*dethis\*notx\\\?yz\\12_but\\\\*this
ab*dethis\*notx\\\?yz\\12_but\\\\*this
All these matches are escaped backslashes. If the user wants to search for a backslash she has to
Every even number of backslashes before a wildcard will not escape it. Also, if there is no backslash before the wildcard, it is unescaped. To tell the regular expression automaton that he can find this group zero or more times, we use the asterisk "(?:\\\\)*
"
These are the matches of the second and third group together:
Regular Expression (Java) | Regular Expression (C#) | Matches (red colored and underlined) |
---|---|---|
(\*|\?) |
ab*dethis\*notx\\\?yz\\12_but\\\\*this |
How do these two expressions work together? Perhaps they already suffice? Here are the matches of the regular expression:
So what does the first group do?