The .NET Regex does not have an explicit mechanism for excluding certain data in your search. For example, I ran across a problem where I needed to find all instances of The Word in some text but I needed to ignore any instance of The Word that may have been found inside of a tag (<>) or inside of an href pair (<a href="blahblah">The Word</a>).
Not a straightforward problem. In my search I saw some really strange solutions, including everything from walking the HTML DOM tree to custom parsers and worse. Most of the time, when you're writing code, if it makes you feel dirty then there's probably a better solution.
I eventually stumbled on this:
The idea is to attack this problem with match groups. The basic pseudo pattern is as follows:
string pattern = @"(PatternIWantToAvoid)|(MaybeAnotherPatternIWantToAvoid)|(ThePatternIWant)";
Notice the three match groups there, denoted by the parentheses and separated by the OR operator ("|"). The regex engine scans the text from left to right and, at each position, tries the alternatives in order, so every match comes from exactly one of the three patterns. Text that has been consumed by one of the patterns you want to avoid is never searched again, which means the pattern you want cannot match inside it. In other words, when you get your matches back, you only care about the ones where group #3 succeeded (in this case). Match group 0 is the entire match (as per usual). Match group 1 will contain text that matched "PatternIWantToAvoid", match group 2 will contain text that matched "MaybeAnotherPatternIWantToAvoid", and finally match group 3 will contain text that matched "ThePatternIWant".
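To make the group behavior concrete, here is a minimal sketch using toy patterns that are not from my original problem: it avoids anything in double quotes (group 1) and anything in parentheses (group 2), and keeps runs of digits (group 3). The class and method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AvoidGroupsDemo
{
    // Groups 1 and 2 are the patterns to avoid; group 3 is the pattern to keep.
    // As a regex this reads: ("[^"]*")|(\([^)]*\))|(\d+)
    private static readonly Regex Pattern =
        new Regex(@"(""[^""]*"")|(\([^)]*\))|(\d+)");

    // Returns only the digit runs that are not inside quotes or parentheses.
    public static List<string> WantedNumbers(string input)
    {
        var kept = new List<string>();
        foreach (Match m in Pattern.Matches(input))
        {
            // Every match fills exactly one group; we keep group 3 only.
            if (m.Groups[3].Success)
            {
                kept.Add(m.Groups[3].Value);
            }
        }
        return kept;
    }
}
```

Calling AvoidGroupsDemo.WantedNumbers("10 \"20\" (30) 40") should give back 10 and 40. The quoted 20 and the parenthesized 30 are still matched, but they land in groups 1 and 2 and get skipped.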
To restate the original problem I faced...
- I want to ignore everything that's a tag, i.e. anything that matches "<.[^>]*>".
- I also want to ignore everything that's an href pair, i.e. anything that matches "<a.*?</a>".
- I want to capture everything that matches "The Word", i.e. "\W(TheWord)\W" (this could be any pattern)
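My final regex pattern then looks like this. The href alternative comes first because the generic tag alternative would otherwise match the opening <a ...> on its own and leave the link text exposed:
string pattern = string.Format(@"(<a.*?</a>)|(<.[^>]*>)|\W({0})\W", targetKeyword);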
And here is the code to actually pull out the desired matches:
// Regex to find our keyword in the content string
// Requires: using System.Text.RegularExpressions;
string content = "target text to search";
string targetKeyword = "The Word";
// Looks for our keyword, but allows us to filter out everything in <tags>
// as well as everything already hyperlinked via <a href>
string pattern = string.Format(@"(<a.*?</a>)|(<.[^>]*>)|\W({0})\W", targetKeyword);
Regex targetWord = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Multiline);
MatchCollection matches = targetWord.Matches(content);
for (int i = 0; i < matches.Count; i++)
{
    Match m = matches[i];
    // Group 3 is where our lonely targetKeyword finds would happen.
    // Group 1 is <a href>s and group 2 is other <tags>.
    if (m.Groups[3].Success)
    {
        string target = m.Groups[3].Value;
        // Do something wondrous with your newfound treasure here
    }
}
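One thing you might do with the treasure is wrap each find in a link. The following is only a sketch under a few assumptions of mine: the keyword is literal text (so it is passed through Regex.Escape, which the pattern above does not do), the url is already safe to drop into an attribute, and the KeywordLinker class and LinkKeyword method are hypothetical names. It uses Regex.Replace with a MatchEvaluator, puts avoided text back unchanged, and keeps the \W characters that surround the keyword.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeywordLinker
{
    // Wraps each free-standing keyword in an <a> tag, skipping anything
    // inside a tag or inside an existing <a ...>...</a> pair.
    public static string LinkKeyword(string content, string keyword, string url)
    {
        // Assumption: keyword is literal text, so escape regex metacharacters.
        string pattern = string.Format(
            @"(<a.*?</a>)|(<.[^>]*>)|\W({0})\W", Regex.Escape(keyword));
        var regex = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);

        return regex.Replace(content, m =>
        {
            Group g = m.Groups[3];
            if (!g.Success)
            {
                return m.Value; // group 1 or 2 matched: leave the text as it was
            }
            // The match includes one \W character on each side of the keyword,
            // so rebuild the match around group 3 instead of replacing it whole.
            int start = g.Index - m.Index;
            string before = m.Value.Substring(0, start);
            string after = m.Value.Substring(start + g.Length);
            return before + "<a href=\"" + url + "\">" + g.Value + "</a>" + after;
        });
    }
}
```

Because the pattern needs a \W on both sides, a keyword sitting at the very start or very end of the content will not be found by this sketch. Padding the content with a space on each side before searching is one way around that.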
Tah-dah.