Using Regular Expressions to Match HTML
Mar 14th, 2005 by Dave
Here are some code snippets for parsing HTML and pulling out the useful content. I used them in a .NET forms-based screen-scraping application written in C#.
Strip out all tags except <a> tags:
s = Regex.Replace(s, "(</*[^aA](.|\n)+?>)+", "\n");
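One thing to watch with the pattern above: `[^aA]` only looks at the first character after `<`, so a closing `</a>` is stripped too (the `/` satisfies `[^aA]`), and one-letter tags such as `<b>` or `<p>` only match by running on to the next `>`, which swallows the text in between. Here is a rough sketch of a variant that uses a negative lookahead instead. The class and method names are my own for illustration, and it assumes reasonably well-formed markup with no `>` inside attribute values.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Matches runs of tags whose name is not exactly "a" (opening or closing).
    // (?!/?[aA]\b) rejects <a ...> and </a> but still allows <abbr>, <address>, etc.
    // Assumption: no '>' characters appear inside attribute values.
    private static readonly Regex NonAnchorTag =
        new Regex(@"(<(?!/?[aA]\b)[^>]+>)+");

    // Hypothetical helper: replaces each run of non-anchor tags with a newline.
    public static string StripNonAnchorTags(string html)
    {
        return NonAnchorTag.Replace(html, "\n");
    }

    public static void Main()
    {
        string s = "<p>See <a href=\"http://example.com\">this</a> <b>now</b></p>";
        Console.WriteLine(StripNonAnchorTags(s));
    }
}
```

Unlike the original pattern, this variant leaves `</a>` in place, so a later step has to deal with the closing tags.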
Pull the link from the href and discard the surrounding tags:
s = Regex.Replace(s, "<[aA]\\s*href=\"([^\"]*)\"\\s*>", "$1\n");
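To see what the link step does, here is a small sketch run against a made-up input string. The sample HTML and the `LinkExtractor` name are my own. The pattern follows the one above (with no space between `<` and `[aA]`), written as a verbatim string, and it only matches anchors whose sole attribute is a double-quoted `href`.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Replaces <a href="..."> with the URL plus a newline.
    // Only matches when href is the first and only attribute and is double-quoted.
    public static string PullLinks(string html)
    {
        return Regex.Replace(html, @"<[aA]\s*href=""([^""]*)""\s*>", "$1\n");
    }

    public static void Main()
    {
        string s = "<a href=\"http://example.com/page\">Example</a>";
        Console.WriteLine(PullLinks(s));
    }
}
```

Note that this replace only touches the opening tag. Any closing `</a>` still in the string is left as it is.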
Turn numbered sections into a match collection (specific to my implementation but you get the idea):
MatchCollection mc =
    Regex.Matches(s, "^(\\d+\\.)(\n(?!\\d+\\.).*){1,}", RegexOptions.Multiline);
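As a rough illustration of what that call returns, the sketch below runs the same pattern over a small invented input. It assumes each number sits alone on its own line and that lines end in `\n` only (a `\r\n` ending would stop the pattern from matching). The `SectionSplitter` name is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SectionSplitter
{
    // A line holding only "N." followed by one or more lines
    // that do not themselves start with "N.".
    public static MatchCollection SplitSections(string text)
    {
        return Regex.Matches(
            text, @"^(\d+\.)(\n(?!\d+\.).*){1,}", RegexOptions.Multiline);
    }

    public static void Main()
    {
        string s = "1.\nFirst title\nfirst body\n2.\nSecond title";
        foreach (Match m in SplitSections(s))
        {
            // Groups[1] is the section number, e.g. "1."
            Console.WriteLine("Section {0} has {1} characters",
                m.Groups[1].Value, m.Length);
        }
    }
}
```

Each match covers one whole section, from its number line up to (but not including) the next number line.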
You can then iterate through the match collection, split each match into columns, add it to a dataset as a new row, and update the result back to the database.
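A minimal sketch of that loop might look like the following. The table layout (Number, Title, Body) and the `SectionLoader` name are invented for the example, not taken from the real application, and the database update is left out because it depends entirely on your schema and data provider.

```csharp
using System;
using System.Data;
using System.Text.RegularExpressions;

public static class SectionLoader
{
    // Hypothetical layout: first line is the number, second the title,
    // any remaining lines are joined into a body column.
    public static DataTable ToTable(MatchCollection sections)
    {
        DataTable table = new DataTable("Sections");
        table.Columns.Add("Number", typeof(string));
        table.Columns.Add("Title", typeof(string));
        table.Columns.Add("Body", typeof(string));

        foreach (Match m in sections)
        {
            // The pattern guarantees at least two lines per match.
            string[] lines = m.Value.Split('\n');
            string body = lines.Length > 2
                ? string.Join("\n", lines, 2, lines.Length - 2)
                : string.Empty;
            table.Rows.Add(lines[0], lines[1], body);
        }
        return table;
    }

    public static void Main()
    {
        string s = "1.\nFirst title\nfirst body\n2.\nSecond title";
        MatchCollection mc = Regex.Matches(
            s, @"^(\d+\.)(\n(?!\d+\.).*){1,}", RegexOptions.Multiline);
        DataTable table = ToTable(mc);
        Console.WriteLine(table.Rows.Count);
    }
}
```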
Haacked shows a code listing for an exe that, when run, compiles your regular expressions into an assembly that you can then reference from any project:
Building a Regular Expression Library Source Listing