1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
private static Regex _namedtags = new Regex (@"</?(?<tagname>\w+)[^>]*(\s|$|>)", RegexOptions.Singleline | RegexOptions.ExplicitCapture); /// <summary> /// attempt to balance HTML tags in the html string /// by removing any unmatched opening or closing tags /// </summary> public static string BalanceTags(string html) { var tagname = ""; var tag = ""; var ignoredtags = "<p<img<br<li"; var match = 0; MatchCollection tags = _namedtags.Matches(html); bool[] tagpaired = new bool[tags.Count]; // loop through matched tags in reverse order for (int i = tags.Count - 1; i > -1; i--) { tagname = tags[i].Groups["tagname"].Value.ToLower(); tag = tags[i].Value; match = -1; // skip any tags in our ignore list; assume they're self-closed if (!tagpaired[i] && !ignoredtags.Contains("<" + tagname)) { if (tag.StartsWith("</")) { // if searching backwards, look for opening tags for (int j = i - 1; j > -1; j--) { if (!tagpaired[j] && tags[j].Value.ToLower().StartsWith("<" + tagname)) { match = j; break; } } } else { // if searching forwards, look for closing tags for (int j = i + 1; j < tags.Count; j++) { if (!tagpaired[j] && tags[j].Value.ToLower().StartsWith("</" + tagname)) { match = j; break; } } } if (match > -1) { // found matching opening/closing tag tagpaired[match] = true; tagpaired[i] = true; } else { // no matching opening/closing tag found -- remove this tag! html = html.Remove(tags[i].Index, tags[i].Length); tagpaired[i] = true; System.Diagnostics.Debug.WriteLine("unbalanced tag removed: " + tags[i]); } } } return html; }
Refactorings
No refactoring yet !
Konrad
July 11, 2008, July 11, 2008 08:58, permalink
Hi Jeff,
just two quick questions:
1. Why is 'p' part of the self-closing tags list? Mistake? Intent?
2. I wouldn't call 'Remove' repeatedly on a string. This method is good for quick'n'dirty text manipulation; for everything else, building the text incrementally with a StringBuilder is faster *and* cleaner. Plus, iterating over a list in reverse is plain ugly (because you can't use 'foreach').
Jeff Atwood
July 11, 2008, July 11, 2008 09:25, permalink
> p isn't self closing
Well, it should be. :) We have a lot of data with unclosed <p> tags for testing. Also, since I am using the ban-hammer of REMOVAL, it's not a fun tag to make mistakes with for people who enter a lot of unclosed <p> tags and have JavaScript disabled, so they can input anything in our editor textbox.
> iterating over a list in reverse is plain ugly (because you can't use 'foreach').
That's necessary so we can continue to manipulate the string while the size is changing -- otherwise it's a PITA when we go forward; the original tag match locations downstream of each replace are no longer valid and must be adjusted with an offset for every deletion.
Peter Hosey
July 11, 2008, July 11, 2008 09:42, permalink
> var ignoredtags = "<p<img<br";
What's wrong with a list or set here?
Duncan Smart
July 11, 2008, July 11, 2008 09:52, permalink
IMHO forget the Regex-ing. Load it up into HtmlAgilityPack.HtmlDocument which is IXPathNavigable and manipulate it that way. (http://www.codeplex.com/htmlagilitypack)
Ian Potter
July 11, 2008, July 11, 2008 10:02, permalink
I liked your article on flattening arrow code: http://www.codinghorror.com/blog/archives/000486.html
Try continue for ignoredtags.
1 2
if (!tagpaired[i] && !ignoredtags.Contains("<" + tagname)) continue;
Duncan Smart
July 11, 2008, July 11, 2008 10:06, permalink
Example that actually fixes up the tags.
Example using HtmlAgilityPack.HtmlDocument
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
HtmlDocument doc = new HtmlDocument(); doc.OptionFixNestedTags = true; doc.LoadHtml(@" <body> <ul> <li>Hello <li>World </ul> </body> "); StringWriter output = new StringWriter(); doc.Save(output); Debug.WriteLine(output.ToString()); /* Outputs: <body> <ul> <li>Hello </li><li>World </li></ul> </body> */
Jeff Atwood
July 11, 2008, July 11, 2008 10:44, permalink
> What's wrong with a list or set here?
Much, much, MUCH slower in my benchmarking. The simple String.Contains() test is extremely fast to evaluate, which is an issue when we're calling it in a loop.
> Example that actually fixes up the tags.
The agility pack is 4x slower than this routine in my testing -- result of 4,000 runs on medium sized input:
Regex based Balance
563
560
555
HtmlAgilityPack balance
2048
1990
1974
Also, HtmlAgilityPack may fix up orphaned <begin> tags but it strips out orphaned </end> tags just like I do..
I need to add the <li> elements to the ignore list as well; those don't need to be closed. I know I never close mine..
Peter Hosey
July 11, 2008, July 11, 2008 11:47, permalink
> The simple String.Contains() test is extremely fast to evaluate …
I would think that a list's Contains method should be at least as fast to evaluate, depending on the algorithm that String.Contains uses.
Another problem is the concatenation ("<" + tagname). Doing that every iteration adds up, too.
I'm not going to flat-out tell you you're wrong, because I don't have a C# runtime on my computer. I do, however, ask you to carefully check your results and make sure you didn't use a different order of magnitude or the wrong collection class or something. If your results are correct, I'd be very curious why.
Corbin March
July 11, 2008, July 11, 2008 21:07, permalink
Is leaving incorrect nesting a requirement? If not, this may be useful. It collects opening tags as they occur, searches backward through them for each closing tag, and then discards everything between the closing tag and its match.
Advantage: It only compares closing tags to opening tags, so it eliminates the need for the StartsWith("<" + tagname)/StartsWith("</" + tagname) in the comparison loop (saves a string concatenation). Enforces correct nesting.
Disadvantages: An intermediate collection and Sort. Not really any simpler. Enforces correct nesting.
To ignore correct nesting, remove lines 35 to 41.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
public static string BalanceTags(string html) { MatchCollection tags = _namedtags.Matches(html); List<Match> openingTags = new List<Match>(); List<Match> tagsToRemove = new List<Match>(); String tagname = ""; String ignoredtags = "<p<img<br<li"; foreach (Match tag in tags) { tagname = tag.Groups["tagname"].ToString(); if (!ignoredtags.Contains("<" + tagname)) { if (tag.ToString().StartsWith("</")) { //closing tag - work backward through opening tags for a match int matchIndex = -1; for (int i = openingTags.Count - 1; i >= 0; i--) { if (openingTags[i].Groups["tagname"].ToString().Equals(tagname, StringComparison.CurrentCultureIgnoreCase)) { matchIndex = i; break; } } if (matchIndex == -1) { //no matching opening tag tagsToRemove.Add(tag); } else { for (int i = openingTags.Count - 1; i > matchIndex; i--) { //remove opening tags between the closing tag and its match //(omit this to leave broken HTML nesting alone) tagsToRemove.Add(openingTags[i]); openingTags.RemoveAt(i); } openingTags.RemoveAt(matchIndex); } } else { //opening tag - add to our 'stack' openingTags.Add(tag); } } } //remaining opening tags have no match tagsToRemove.AddRange(openingTags); tagsToRemove.Sort(CompareTagMatch); foreach (Match tag in tagsToRemove) { html = html.Remove(tag.Index, tag.Length); } return html; } /// <summary> /// Comparision to sort in descending order by Match index /// </summary> private static int CompareTagMatch(Match x, Match y) { return -x.Index.CompareTo(y.Index); }
Jeff Atwood
July 12, 2008, July 12, 2008 00:30, permalink
Wow, Corbin, thank you. That's really nice -- it's even faster, and does more! Testing with 4,000 iterations on medium sized input:
my BalanceTags:
569 ms
549
549
your (improved) BalanceTags
566 ms
537
524
Ah, yes, a List of Matches eg List<Match> ! Why didn't I think of that..
Jeff Atwood
July 12, 2008, July 12, 2008 00:39, permalink
> I would think that a list's Contains method should be at least as fast to evaluate, depending on the algorithm that String.Contains uses.
And yet it is not. Result of 4,000 iterations:
use List<string> for ignored tags
650 ms
641
635
use String.Contains for ignored tags
585 ms
565
549
1 2 3 4
var ignoredtags = new List<String> { "p", "img", "br" }; if (!tagpaired[i] && !ignoredtags.Contains(tagname)) {
Jeff Atwood
July 12, 2008, July 12, 2008 00:46, permalink
> Another problem is the concatenation ("<" + tagname). Doing that every iteration adds up, too.
The difference is fairly small.. even if I (erroneously) use just the tagname, no concatenation, improvement is marginal. Result of 4,000 iterations:
tagname (broken!!)
543 ms
546 ms
534 ms
"<" + tagname
569 ms
558
551
String.Concat("<", tagname)
581 ms
555
551
1 2 3 4 5 6 7 8 9
// ** DO NOT USE ** // THIS IS BAD BROKEN CODE, ONLY DISPLAYED FOR BENCHMARKING PURPOSES! if (!tagpaired[i] && !ignoredtags.Contains(tagname)) // as coded if (!tagpaired[i] && !ignoredtags.Contains("<" + tagname)) // alternate if (!tagpaired[i] && !ignoredtags.Contains(String.Concat("<", tagname)))
Paul Celi
July 12, 2008, July 12, 2008 18:09, permalink
Hi jeff, maybe a stupid question, but why do you consider ignoring the "<" would break Corbin's code ??
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
tagname = tag.Groups["tagname"].ToString(); if (!ignoredtags.Contains("<" + tagname)) { whatever; } // vs tagname = tag.Groups["tagname"].ToString(); if (!ignoredtags.Contains(tagname)) { whatever; }
Jeff Atwood
July 13, 2008, July 13, 2008 07:37, permalink
It doesn't break Corbin's code, it breaks mine. The point was that a single concatenation is fairly cheap, to the tune of 15ms over 4,000 iterations.
In general, yes, concatenation is expensive. But one per loop isn't really significant in my benchmarking. As to why the List<>.Contains is 25% slower than String.Contains, I dunno. But it is..
Konrad
July 13, 2008, July 13, 2008 09:53, permalink
> As to why the List<>.Contains is 25% slower than String.Contains, I dunno. But it is..
Actually, why should 'List.Contains' be any faster? It has to do the same work, plus a bit more to navigate through the elements.
On the other hand, using a hash table (=> Dictionary<>) *will* be faster -- but only for longer lists. The overhead of managing a hash table outweights the cost of iterating all characters in such a small string.
Just one thought about Colin's code: 'RemoveAt' is really slow. Why not use a 'Stack<>' here, since the only operations take place at the end of the list?
Chris Jester-Young
July 14, 2008, July 14, 2008 10:45, permalink
Hats off to Corbin for an excellent implementation. I would add a few suggestions:
1. Because <li>, <p>, and the like are put on the ignore list, people who type in “correct XHTML” with closer tags are going to get those stripped off. This is not strictly a problem, however for people writing lots of list items and/or paragraphs, this search-back through the ignore list is O(n²) on the number of such closer tags. Although malicious users are going to be able to cause this O(n²) case no matter what, it'd be nice for the more typical XHTML case to actually not impose that sort of performance penalty. I'll post a possible solution once I come up with one, it's just a thought I have at the moment.
2. I would suggest downcasing all the incoming tags like Jeff's version does, just to reduce the amount of case-insensitive comparisons that have to be done. If this is not done, then the ignoredtags.Contains() test has to be modified to be a case-insensitive one, too, in order for <P>, <LI>, and the like to work correctly.
3. (This applies to Jeff's version too.) For consistency, <hr /> should get added to the no-closer-needed list.
Chris Jester-Young
July 14, 2008, July 14, 2008 11:52, permalink
@Konrad: on your suggestion of using a Stack, it's impossible to iterate through the stack (looking for the opening tag) without popping all the items in between. If we _do_ find the opening tag, then yes, we do want to pop all the items in between (adding them to the discard list), but if we don't find it, we are supposed to keep the opening tag stack intact, and just place the closing tag in the discard list. RemoveAt is O(1) if the back element is being removed (see the .NET Framework reference source if you want to verify), which Corbin's code does, consistently.
WRT string searching (this is more a diversion of curiosity than anything else): I'm given to believe that there are some optimised string-searching algorithms out there. The ones I've heard of (Knuth-Morris-Pratt and Boyer-Moore) require pre-studying of the “needle”, and thus isn't suitable for the current application. However, I've also chanced to look at a few different strstr() implementations, to see how they do their stuff:
1. Find first character of the needle in the haystack (using strchr() or equivalent), then (if found) use the result as a starting point to compare the rest of the needle (using memcmp() or equivalent). If comparison fails, advance the pointer and start again. This approach is used in: String.indexOf() in OpenJDK, strstr() in BSD libc, std::basic_string::find in {GNU libstdc++, STLport, and Microsoft's C++ library}, and (from what I can gather) String.IndexOf() in .NET Framework.
2. Find the first two characters of the needle in the haystack (using a custom-written loop), then (if found) use the result as a starting point to compare the rest of the needle (again, using a custom-written loop). This approach is used in: strstr() in Microsoft's libc, strstr() in procmail, and strstr() in GNU libc (which uses the procmail implementation).
Dinah
July 21, 2008, July 21, 2008 18:09, permalink
Has a final version of this been decided upon? If so, would you be willing to post the final source code?
Bealer
July 24, 2008, July 24, 2008 18:01, permalink
Hmmm nice solutions but I personally wouldn't go this route at all.
For correcting, formatting, cleaning etc... HTML, Tidy is the best tool and it's mature, and widely used. There's even a .Net port Tidy.Net.
It may be slightly slower, but it'll be more accurate with the results. Regex's are for matching, and it's very difficult to match something like HTML as there are so many ways of writing it. I would worry that you're using the wrong tool for the job, and focusing too much on micro performance increases.
Dinah
July 24, 2008, July 24, 2008 20:01, permalink
(I tried to submit this bug a few days ago but I guess something went wrong b/c I don't see it here).
Tags will not get balanced if 'ignore' tags and regular tags are incorrectly nested. Example:
1
<p><strong></p></strong>
Bealer
July 24, 2008, July 24, 2008 23:38, permalink
My point exactly. It's too hard to try and Regex match such things. There will be endless amounts of possible bugs.
Tidy will clean up that sort of thing for you.
Jeff Atwood
July 30, 2008, July 30, 2008 20:43, permalink
> Tags will not get balanced if 'ignore' tags and regular tags are incorrectly nested.
This is by design; I don't view incorrect nesting as a problem that would prevent page rendering, like an unclosed <div> or <td> would. Am I wrong?
Jeff Atwood
July 30, 2008, July 30, 2008 20:45, permalink
> I would worry that you're using the wrong tool for the job, and focusing too much on micro performance increases.
Well, to play Devil's Advocate: I would worry that you're writing far too much code, and overengineering a solution with too many complex, unnecessary external dependencies. :)
Bealer
July 30, 2008, July 30, 2008 20:58, permalink
Hmmm, point is I'm not writing too much code, the opposite in fact as Tidy does it for me. Your 60 lines of code I can do in about 5-10.
Tidy may be a dependency but it's only one dependency, and is a very mature, well written project. Much like using any other library like NHibernate, Castle Windsor, or NUnit.
Regex's are obviously great for matching, but when HTML can be written so badly and in so many ways, matching becomes very difficult and potentially less accurate.
JasonBunting
August 20, 2008, August 20, 2008 17:38, permalink
Jeff, apparently you suffer from a bit of NIH? The first thing I thought of when I saw this code was "Why isn't he using HTML Tidy? One of the biggest benefits of open source software is that it has been used and tested by large numbers of people so that bugs and problems get worked out.
Why are you reinventing the wheel?
For the subset of HTML tags we care about, this function ensures that all tags in the input HTML are balanced (but not correctly nested, necessarily) -- it does this by *removing* any extra opening or closing tags it finds in the HTML string. Note: this routine is NOT designed to be a XSS sanitizer! It assumes it is running on safe, pre-sanitized HTML.