How To: Make Sure User-Entered HTML Is Valid and Error Free

How many mistakes can you spot in the HTML below?

<h3>My HTML is well dodgy, innit.</h4>
<p>This is my first paragraph</P>
<p>The second one isn't closed properly!!
<div><p>The third P is in a div that's unclosed. <b><i>Oh no</b></i>!!!</p>

Lots, aren't there! Now imagine this is what a user has typed into a form on a site where you've allowed HTML input. Whether or not you trust your users, and whether or not they mean to do it, you can't expect them to get the syntactic rules of HTML right. Nor can you assume that, just because the form uses an editor such as TinyMCE, the HTML sent to the server will be valid.

What you need to do is have the server sanitise the HTML and make sure it's fit for use. If, as in the example above, a user has added a DIV element and not closed it, who knows what problems this might cause.

Luckily there's a Java library called XMLUnit that includes a class called TolerantSaxDocumentBuilder which you can pass really bad HTML to and it will, ever so tolerantly, fix all the errors and make sure it's valid.
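
To give a flavour of what "tolerant" fixing involves, here's a toy sketch in plain Java. It is not XMLUnit's code and makes no claim about how TolerantSaxDocumentBuilder works internally; the class name TagBalancer and its balance method are made up for illustration. It simply keeps a stack of open tags, closes inner tags before outer ones, drops closing tags that match nothing, and closes whatever is still open at the end. It knows nothing about HTML's own rules and doesn't escape stray ampersands, so treat it as an illustration of the idea rather than a sanitiser.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not part of XMLUnit).
class TagBalancer {

    // Matches simple tags such as <p>, </P>, <br/> or <div class="x">.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>");

    static String balance(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // stack of currently open tags
        Matcher m = TAG.matcher(html);
        int pos = 0;
        while (m.find()) {
            out.append(html, pos, m.start()); // copy the text between tags
            pos = m.end();
            String name = m.group(2).toLowerCase(); // </P> should close <p>
            String rest = m.group(3);
            if (m.group(1).isEmpty()) {
                // Opening tag: remember it unless it is self-closing.
                if (!rest.endsWith("/")) {
                    open.push(name);
                }
                out.append('<').append(name).append(rest).append('>');
            } else if (open.contains(name)) {
                // Closing tag: close any inner tags first, then this one.
                String top;
                do {
                    top = open.pop();
                    out.append("</").append(top).append('>');
                } while (!top.equals(name));
            }
            // A closing tag that matches nothing open is silently dropped.
        }
        out.append(html.substring(pos));
        // Close anything the user left open.
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }
}
```

Feeding it the last line of the example above, TagBalancer.balance("<div><p>The third P is in a div that's unclosed. <b><i>Oh no</b></i>!!!</p>") returns "<div><p>The third P is in a div that's unclosed. <b><i>Oh no</i></b>!!!</p></div>". The B and I tags are re-nested and the DIV is closed.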

Here's a demo form where the WQS agent fixes all the problems in the HTML thrown at it. Press the "submit" button and see what happens. Once submitted you can edit the document to check that it did indeed fix all the errors.

Try including your own examples of poor HTML. See if you can break the page layout. If you do then send me the URL of the broken page.

Notice also that the HTML is not only fixed but also altered by the WQS agent. A CSS class is added to the first paragraph if there is more than one paragraph, and to the last paragraph if there are more than three. This shows how you can add to the HTML in any way you choose. You can also remove HTML -- any script tags are removed from the demo form.
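
Once the HTML has been fixed and is well-formed, this kind of altering is ordinary DOM work. Below is a minimal sketch using only the JDK's built-in XML classes. It isn't the demo's actual code: the class name HtmlPostProcessor, the class values "first" and "last", and the thresholds (more than one paragraph, more than three paragraphs) are assumptions based on the description above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
class HtmlPostProcessor {

    // Parses markup that is already well-formed (i.e. after the fixing step).
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed", e);
        }
    }

    static void decorate(Document doc) {
        // Copy the script elements first, then remove them from the tree.
        NodeList found = doc.getElementsByTagName("script");
        List<Node> scripts = new ArrayList<>();
        for (int i = 0; i < found.getLength(); i++) {
            scripts.add(found.item(i));
        }
        for (Node script : scripts) {
            script.getParentNode().removeChild(script);
        }

        // Assumed rule: mark the first paragraph when there is more than one,
        // and the last when there are more than three. Note that setAttribute
        // replaces any class the paragraph already had.
        NodeList paragraphs = doc.getElementsByTagName("p");
        int count = paragraphs.getLength();
        if (count > 1) {
            ((Element) paragraphs.item(0)).setAttribute("class", "first");
        }
        if (count > 3) {
            ((Element) paragraphs.item(count - 1)).setAttribute("class", "last");
        }
    }

    // Turns the altered document back into a string, without an XML declaration.
    static String serialize(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialise document", e);
        }
    }
}
```

Typical use would be to call parse on the fixed markup, then decorate, then serialize before saving the result. Tag names are matched case-sensitively here, so the sketch assumes the fixing step has already lower-cased them.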

Code to be released as and when DEXT is ready for download. Scream if you want it sooner!


    • Dave S
    • Wed 17 Sep 2008 05:03 AM

    This looks fantastic. Please could you supply the code prior to the dext release.

    • Jake Howlett
    • Wed 17 Sep 2008 05:10 AM

    Sure Dave. Mail me offline or post your email address and you can be the first beta tester.

    • mark b
    • Wed 17 Sep 2008 09:20 AM

    There's no submit button on your demo form - not that I can see anyway. Fantastic idea though, I can think of many applications where this would be very handy. I wonder about the additional load it might put on the server but probably well worth it anyway.

    • Jake Howlett
    • Wed 17 Sep 2008 09:34 AM

    Hi Mark,

    Whoops. A bug in IE7 with the footer's CSS. Fixed now.

    I can't imagine the load on the server is too great. I hope not, as it's a part of the solution I just sold you ;o)


  1. Brilliant. If only I had this 2 weeks ago. I was looking for something like this as I was doing mass RichText->Mime/HTML conversions and found that Notes/Domino was converting RTF tables without closing the td tags. I got stuck trying to get HTMLTidy up and running properly.

    Maybe if time permits I will implement this code.

    • Jake Howlett
    • Mon 22 Sep 2008 01:53 AM

    Sorry it was too late, Zak. I've been sitting on it for about that long! Whoops.

    If you want the code it's in the Sandbox now (see tab bar).

  2. Nice. So XMLUnit would be the second "forgiving" DOM parser. Have you tried jTidy? It has a similar forgiving parser and you can specify what things need fixin.

    :-) stw

About This Page

Written by Jake Howlett on Wed 17 Sep 2008
