51d623f33f8b83095db84ff35e15dbe8

For the subset of HTML tags we care about, this function ensures that all tags in the input HTML are balanced (but not correctly nested, necessarily) -- it does this by *removing* any extra opening or closing tags it finds in the HTML string. Note: this routine is NOT designed to be a XSS sanitizer! It assumes it is running on safe, pre-sanitized HTML.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
private static Regex _namedtags = new Regex
    (@"</?(?<tagname>\w+)[^>]*(\s|$|>)",
    RegexOptions.Singleline | RegexOptions.ExplicitCapture);

/// <summary>
/// attempt to balance HTML tags in the html string
/// by removing any unmatched opening or closing tags
/// </summary>
public static string BalanceTags(string html)
{
    var tagname = "";
    var tag = "";
    var ignoredtags = "<p<img<br<li";
    var match = 0;

    MatchCollection tags = _namedtags.Matches(html);
    bool[] tagpaired = new bool[tags.Count];

    // loop through matched tags in reverse order
    for (int i = tags.Count - 1; i > -1; i--)
    {
        tagname = tags[i].Groups["tagname"].Value.ToLower();
        tag = tags[i].Value;
        match = -1;

        // skip any tags in our ignore list; assume they're self-closed
        if (!tagpaired[i] && !ignoredtags.Contains("<" + tagname))
        {
            if (tag.StartsWith("</"))
            {
                // if searching backwards, look for opening tags
                for (int j = i - 1; j > -1; j--)
                {
                    if (!tagpaired[j] && tags[j].Value.ToLower().StartsWith("<" + tagname))
                    {
                        match = j;
                        break;
                    }
                }
            }
            else
            {
                // if searching forwards, look for closing tags
                for (int j = i + 1; j < tags.Count; j++)
                {
                    if (!tagpaired[j] && tags[j].Value.ToLower().StartsWith("</" + tagname))
                    {
                        match = j;
                        break;
                    }
                }
            }

            if (match > -1)
            {
                // found matching opening/closing tag
                tagpaired[match] = true;
                tagpaired[i] = true;
            }
            else
            {
                // no matching opening/closing tag found -- remove this tag!
                html = html.Remove(tags[i].Index, tags[i].Length);
                tagpaired[i] = true;
                System.Diagnostics.Debug.WriteLine("unbalanced tag removed: " + tags[i]);
            }
        }
    }

    return html;
}

Refactorings

No refactoring yet !

F9401202ab73e624cc82800b0fff1489

Konrad

July 11, 2008, July 11, 2008 08:58, permalink

No rating. Login to rate!

Hi Jeff,

just two quick questions:

1. Why is 'p' part of the self-closing tags list? Mistake? Intent?
2. I wouldn't call 'Remove' repeatedly on a string. This method is good for quick'n'dirty text manipulation; for everything else, building the text incrementally with a StringBuilder is faster *and* cleaner. Plus, iterating over a list in reverse is plain ugly (because you can't use 'foreach').

51d623f33f8b83095db84ff35e15dbe8

Jeff Atwood

July 11, 2008, July 11, 2008 09:25, permalink

No rating. Login to rate!

> p isn't self closing

Well, it should be. :) We have a lot of data with unclosed <p> tags for testing. Also, since I am using the ban-hammer of REMOVAL, it's not a fun tag to make mistakes with for people who enter a lot of unclosed <p> tags and have JavaScript disabled, so they can input anything in our editor textbox.

> iterating over a list in reverse is plain ugly (because you can't use 'foreach').

That's necessary so we can continue to manipulate the string while the size is changing -- otherwise it's a PITA when we go forward; the original tag match locations downstream of each replace are no longer valid and must be adjusted with an offset for every deletion.

63e8a7ad46b1e491d2e8dfd9efc042de

Peter Hosey

July 11, 2008, July 11, 2008 09:42, permalink

No rating. Login to rate!

> var ignoredtags = "<p<img<br";

What's wrong with a list or set here?

13ead10356c893aead42be91b5cdcc01

Duncan Smart

July 11, 2008, July 11, 2008 09:52, permalink

No rating. Login to rate!

IMHO forget the Regex-ing. Load it up into HtmlAgilityPack.HtmlDocument which is IXPathNavigable and manipulate it that way. (http://www.codeplex.com/htmlagilitypack)

Ce0162e9224181d6b9eadbe5c629b2ba

Ian Potter

July 11, 2008, July 11, 2008 10:02, permalink

No rating. Login to rate!

I liked your article on flattening arrow code: http://www.codinghorror.com/blog/archives/000486.html

Try continue for ignoredtags.

1
2
if (!tagpaired[i] && !ignoredtags.Contains("<" + tagname))
    continue;
13ead10356c893aead42be91b5cdcc01

Duncan Smart

July 11, 2008, July 11, 2008 10:06, permalink

1 rating. Login to rate!

Example that actually fixes up the tags.

Example using HtmlAgilityPack.HtmlDocument

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
HtmlDocument doc = new HtmlDocument();
doc.OptionFixNestedTags = true;
doc.LoadHtml(@"
<body>
	<ul>
		<li>Hello
		<li>World
	</ul>
</body>
");

StringWriter output = new StringWriter();
doc.Save(output);

Debug.WriteLine(output.ToString());
/* Outputs:
<body>
	<ul>
		<li>Hello
		</li><li>World
	</li></ul>
</body>
 */
51d623f33f8b83095db84ff35e15dbe8

Jeff Atwood

July 11, 2008, July 11, 2008 10:44, permalink

No rating. Login to rate!

> What's wrong with a list or set here?

Much, much, MUCH slower in my benchmarking. The simple String.Contains() test is extremely fast to evaluate, which is an issue when we're calling it in a loop.

> Example that actually fixes up the tags.

The agility pack is 4x slower than this routine in my testing -- result of 4,000 runs on medium sized input:

Regex based Balance
563
560
555

HtmlAgilityPack balance
2048
1990
1974

Also, HtmlAgilityPack may fix up orphaned <begin> tags but it strips out orphaned </end> tags just like I do..

I need to add the <li> elements to the ignore list as well; those don't need to be closed. I know I never close mine..

63e8a7ad46b1e491d2e8dfd9efc042de

Peter Hosey

July 11, 2008, July 11, 2008 11:47, permalink

No rating. Login to rate!

> The simple String.Contains() test is extremely fast to evaluate …

I would think that a list's Contains method should be at least as fast to evaluate, depending on the algorithm that String.Contains uses.

Another problem is the concatenation ("<" + tagname). Doing that every iteration adds up, too.

I'm not going to flat-out tell you you're wrong, because I don't have a C# runtime on my computer. I do, however, ask you to carefully check your results and make sure you didn't use a different order of magnitude or the wrong collection class or something. If your results are correct, I'd be very curious why.

F17131c7feaf9a92fe35323f6ec48429

Corbin March

July 11, 2008, July 11, 2008 21:07, permalink

4 ratings. Login to rate!

Is leaving incorrect nesting a requirement? If not, this may be useful. It collects opening tags as they occur, searches backward through them for each closing tag, and then discards everything between the closing tag and its match.

Advantage: It only compares closing tags to opening tags, so it eliminates the need for the StartsWith("<" + tagname)/StartsWith("</" + tagname) in the comparison loop (saves a string concatenation). Enforces correct nesting.

Disadvantages: An intermediate collection and Sort. Not really any simpler. Enforces correct nesting.

To ignore correct nesting, remove lines 35 to 41.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
public static string BalanceTags(string html)
{
	MatchCollection tags = _namedtags.Matches(html);
	List<Match> openingTags = new List<Match>();
	List<Match> tagsToRemove = new List<Match>();
	String tagname = "";
	String ignoredtags = "<p<img<br<li";

	foreach (Match tag in tags)
	{
		tagname = tag.Groups["tagname"].ToString();

		if (!ignoredtags.Contains("<" + tagname))
		{
			if (tag.ToString().StartsWith("</"))
			{
				//closing tag - work backward through opening tags for a match
				int matchIndex = -1;
				for (int i = openingTags.Count - 1; i >= 0; i--)
				{
					if (openingTags[i].Groups["tagname"].ToString().Equals(tagname, StringComparison.CurrentCultureIgnoreCase))
					{
						matchIndex = i;
						break;
					}
				}

				if (matchIndex == -1)
				{
					//no matching opening tag
					tagsToRemove.Add(tag);
				}
				else
				{
					for (int i = openingTags.Count - 1; i > matchIndex; i--)
					{
						//remove opening tags between the closing tag and its match 
						//(omit this to leave broken HTML nesting alone)
						tagsToRemove.Add(openingTags[i]);
						openingTags.RemoveAt(i);
					}
					openingTags.RemoveAt(matchIndex);
				}
			}
			else
			{
				//opening tag - add to our 'stack'
				openingTags.Add(tag);
			}
		}
	}

	//remaining opening tags have no match
	tagsToRemove.AddRange(openingTags);

	tagsToRemove.Sort(CompareTagMatch);
	foreach (Match tag in tagsToRemove)
	{
		html = html.Remove(tag.Index, tag.Length);
	}

	return html;
}

/// <summary>
/// Comparision to sort in descending order by Match index
/// </summary>
private static int CompareTagMatch(Match x, Match y)
{
	return -x.Index.CompareTo(y.Index);
}
51d623f33f8b83095db84ff35e15dbe8

Jeff Atwood

July 12, 2008, July 12, 2008 00:30, permalink

No rating. Login to rate!

Wow, Corbin, thank you. That's really nice -- it's even faster, and does more! Testing with 4,000 iterations on medium sized input:

my BalanceTags:

569 ms
549
549

your (improved) BalanceTags

566 ms
537
524

Ah, yes, a List of Matches eg List<Match> ! Why didn't I think of that..

51d623f33f8b83095db84ff35e15dbe8

Jeff Atwood

July 12, 2008, July 12, 2008 00:39, permalink

No rating. Login to rate!

> I would think that a list's Contains method should be at least as fast to evaluate, depending on the algorithm that String.Contains uses.

And yet it is not. Result of 4,000 iterations:

use List<string> for ignored tags

650 ms
641
635

use String.Contains for ignored tags

585 ms
565
549

1
2
3
4
var ignoredtags = new List<String> { "p", "img", "br" };

if (!tagpaired[i] && !ignoredtags.Contains(tagname))
{  
51d623f33f8b83095db84ff35e15dbe8

Jeff Atwood

July 12, 2008, July 12, 2008 00:46, permalink

No rating. Login to rate!

> Another problem is the concatenation ("<" + tagname). Doing that every iteration adds up, too.

The difference is fairly small.. even if I (erroneously) use just the tagname, no concatenation, improvement is marginal. Result of 4,000 iterations:

tagname (broken!!)

543 ms
546 ms
534 ms

"<" + tagname

569 ms
558
551

String.Concat("<", tagname)

581 ms
555
551

1
2
3
4
5
6
7
8
9
// ** DO NOT USE **
// THIS IS BAD BROKEN CODE, ONLY DISPLAYED FOR BENCHMARKING PURPOSES!
if (!tagpaired[i] && !ignoredtags.Contains(tagname))

// as coded
if (!tagpaired[i] && !ignoredtags.Contains("<" + tagname))

// alternate
if (!tagpaired[i] && !ignoredtags.Contains(String.Concat("<", tagname)))
Avatar

Paul Celi

July 12, 2008, July 12, 2008 18:09, permalink

No rating. Login to rate!

Hi jeff, maybe a stupid question, but why do you consider ignoring the "<" would break Corbin's code ??

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
tagname = tag.Groups["tagname"].ToString();

if (!ignoredtags.Contains("<" + tagname))
{
     whatever;
}

// vs

tagname = tag.Groups["tagname"].ToString();

if (!ignoredtags.Contains(tagname))
{
     whatever;
}
51d623f33f8b83095db84ff35e15dbe8

Jeff Atwood

July 13, 2008, July 13, 2008 07:37, permalink

No rating. Login to rate!

It doesn't break Corbin's code, it breaks mine. The point was that a single concatenation is fairly cheap, to the tune of 15ms over 4,000 iterations.

In general, yes, concatenation is expensive. But one per loop isn't really significant in my benchmarking. As to why the List<>.Contains is 25% slower than String.Contains, I dunno. But it is..

F9401202ab73e624cc82800b0fff1489

Konrad

July 13, 2008, July 13, 2008 09:53, permalink

No rating. Login to rate!

> As to why the List<>.Contains is 25% slower than String.Contains, I dunno. But it is..

Actually, why should 'List.Contains' be any faster? It has to do the same work, plus a bit more to navigate through the elements.

On the other hand, using a hash table (=> Dictionary<>) *will* be faster -- but only for longer lists. The overhead of managing a hash table outweights the cost of iterating all characters in such a small string.

Just one thought about Colin's code: 'RemoveAt' is really slow. Why not use a 'Stack<>' here, since the only operations take place at the end of the list?

729442eea8d8548842a6e0947e333c7b

Chris Jester-Young

July 14, 2008, July 14, 2008 10:45, permalink

No rating. Login to rate!

Hats off to Corbin for an excellent implementation. I would add a few suggestions:

1. Because <li>, <p>, and the like are put on the ignore list, people who type in “correct XHTML” with closer tags are going to get those stripped off. This is not strictly a problem, however for people writing lots of list items and/or paragraphs, this search-back through the ignore list is O(n²) on the number of such closer tags. Although malicious users are going to be able to cause this O(n²) case no matter what, it'd be nice for the more typical XHTML case to actually not impose that sort of performance penalty. I'll post a possible solution once I come up with one, it's just a thought I have at the moment.

2. I would suggest downcasing all the incoming tags like Jeff's version does, just to reduce the amount of case-insensitive comparisons that have to be done. If this is not done, then the ignoredtags.Contains() test has to be modified to be a case-insensitive one, too, in order for <P>, <LI>, and the like to work correctly.

3. (This applies to Jeff's version too.) For consistency, <hr /> should get added to the no-closer-needed list.

729442eea8d8548842a6e0947e333c7b

Chris Jester-Young

July 14, 2008, July 14, 2008 11:52, permalink

No rating. Login to rate!

@Konrad: on your suggestion of using a Stack, it's impossible to iterate through the stack (looking for the opening tag) without popping all the items in between. If we _do_ find the opening tag, then yes, we do want to pop all the items in between (adding them to the discard list), but if we don't find it, we are supposed to keep the opening tag stack intact, and just place the closing tag in the discard list. RemoveAt is O(1) if the back element is being removed (see the .NET Framework reference source if you want to verify), which Corbin's code does, consistently.

WRT string searching (this is more a diversion of curiosity than anything else): I'm given to believe that there are some optimised string-searching algorithms out there. The ones I've heard of (Knuth-Morris-Pratt and Boyer-Moore) require pre-studying of the “needle”, and thus isn't suitable for the current application. However, I've also chanced to look at a few different strstr() implementations, to see how they do their stuff:

1. Find first character of the needle in the haystack (using strchr() or equivalent), then (if found) use the result as a starting point to compare the rest of the needle (using memcmp() or equivalent). If comparison fails, advance the pointer and start again. This approach is used in: String.indexOf() in OpenJDK, strstr() in BSD libc, std::basic_string::find in {GNU libstdc++, STLport, and Microsoft's C++ library}, and (from what I can gather) String.IndexOf() in .NET Framework.

2. Find the first two characters of the needle in the haystack (using a custom-written loop), then (if found) use the result as a starting point to compare the rest of the needle (again, using a custom-written loop). This approach is used in: strstr() in Microsoft's libc, strstr() in procmail, and strstr() in GNU libc (which uses the procmail implementation).

D97c394b2cf92c49fca089d2484813f3

Dinah

July 21, 2008, July 21, 2008 18:09, permalink

No rating. Login to rate!

Has a final version of this been decided upon? If so, would you be willing to post the final source code?

D23cbc530a1c8431e886bd7535c20414

Bealer

July 24, 2008, July 24, 2008 18:01, permalink

1 rating. Login to rate!

Hmmm nice solutions but I personally wouldn't go this route at all.

For correcting, formatting, cleaning etc... HTML, Tidy is the best tool and it's mature, and widely used. There's even a .Net port Tidy.Net.

It may be slightly slower, but it'll be more accurate with the results. Regex's are for matching, and it's very difficult to match something like HTML as there are so many ways of writing it. I would worry that you're using the wrong tool for the job, and focusing too much on micro performance increases.

D97c394b2cf92c49fca089d2484813f3

Dinah

July 24, 2008, July 24, 2008 20:01, permalink

No rating. Login to rate!

(I tried to submit this bug a few days ago but I guess something went wrong b/c I don't see it here).

Tags will not get balanced if 'ignore' tags and regular tags are incorrectly nested. Example:

1
<p><strong></p></strong>
D23cbc530a1c8431e886bd7535c20414

Bealer

July 24, 2008, July 24, 2008 23:38, permalink

1 rating. Login to rate!

My point exactly. It's too hard to try and Regex match such things. There will be endless amounts of possible bugs.

Tidy will clean up that sort of thing for you.

51d623f33f8b83095db84ff35e15dbe8

Jeff Atwood

July 30, 2008, July 30, 2008 20:43, permalink

No rating. Login to rate!

> Tags will not get balanced if 'ignore' tags and regular tags are incorrectly nested.

This is by design; I don't view incorrect nesting as a problem that would prevent page rendering, like an unclosed <div> or <td> would. Am I wrong?

51d623f33f8b83095db84ff35e15dbe8

Jeff Atwood

July 30, 2008, July 30, 2008 20:45, permalink

1 rating. Login to rate!

> I would worry that you're using the wrong tool for the job, and focusing too much on micro performance increases.

Well, to play Devil's Advocate: I would worry that you're writing far too much code, and overengineering a solution with too many complex, unnecessary external dependencies. :)

D23cbc530a1c8431e886bd7535c20414

Bealer

July 30, 2008, July 30, 2008 20:58, permalink

2 ratings. Login to rate!

Hmmm, point is I'm not writing too much code, the opposite in fact as Tidy does it for me. Your 60 lines of code I can do in about 5-10.

Tidy may be a dependency but it's only one dependency, and is a very mature, well written project. Much like using any other library like NHibernate, Castle Windsor, or NUnit.

Regex's are obviously great for matching, but when HTML can be written so badly and in so many ways, matching becomes very difficult and potentially less accurate.

3d6bd00a348b2caab6108406d7dcee1c

JasonBunting

August 20, 2008, August 20, 2008 17:38, permalink

No rating. Login to rate!

Jeff, apparently you suffer from a bit of NIH? The first thing I thought of when I saw this code was "Why isn't he using HTML Tidy? One of the biggest benefits of open source software is that it has been used and tested by large numbers of people so that bugs and problems get worked out.

Why are you reinventing the wheel?

Your refactoring





Format Copy from initial code

or Cancel