How many mistakes can you spot in the HTML below?
<h3>My HTML is well dodgy, innit.</h4> <p>This is my first paragraph</P> <p>The second one isn't closed properly!! <div><p>The third P is in a div that's unclosed. <b><i>Oh no</b></i>!!!</p>
Lots, aren't there? Now imagine this is what a user has typed into a form on a site where you've allowed HTML input. Whether or not you trust your users, and whether or not they mean to do it, you can't expect them to get the syntactic rules of HTML right. Nor can you assume that, just because the form uses an editor such as TinyMCE, the HTML sent to the server will be valid.
What you need to do is have the server sanitise the HTML and make sure it's fit for use. If, as in the example above, a user adds a DIV element and doesn't close it, who knows what problems it might cause on the page that displays it.
Luckily there's a Java library called XMLUnit that includes a class called TolerantSaxDocumentBuilder. You can pass it really bad HTML and it will, ever so tolerantly, fix the errors and hand back a well-formed document.
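To give a feel for what "tolerantly fixing" means, here is a minimal sketch of a tag balancer. It is not XMLUnit's code or algorithm, and `NaiveTagBalancer` is a hypothetical name made up for this illustration. It assumes simple input: no `>` characters inside attribute values, and only a handful of void elements. It closes tags left open, drops closing tags that match nothing, and untangles badly nested pairs.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; NOT part of XMLUnit.
class NaiveTagBalancer {
    // Matches an opening or closing tag: group 1 = "/", group 2 = name, group 3 = the rest.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>");
    // A few elements that never take a closing tag (assumed subset, not exhaustive).
    private static final Set<String> VOID = Set.of("br", "hr", "img", "input");

    static String balance(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // stack of currently open elements
        Matcher m = TAG.matcher(html);
        int pos = 0;
        while (m.find()) {
            out.append(html, pos, m.start()); // copy the text between tags unchanged
            pos = m.end();
            String name = m.group(2).toLowerCase();
            boolean closing = !m.group(1).isEmpty();
            String attrs = m.group(3);
            boolean selfClosing = attrs.endsWith("/");
            if (selfClosing) {
                attrs = attrs.substring(0, attrs.length() - 1);
            }
            if (closing) {
                if (open.contains(name)) {
                    // Close everything opened since the matching tag, innermost first.
                    String top;
                    do {
                        top = open.pop();
                        out.append("</").append(top).append('>');
                    } while (!top.equals(name));
                }
                // Otherwise it is a stray closing tag and is silently dropped.
            } else if (selfClosing || VOID.contains(name)) {
                out.append('<').append(name).append(attrs).append("/>");
            } else {
                open.push(name);
                out.append('<').append(name).append(attrs).append('>');
            }
        }
        out.append(html.substring(pos));
        // Close whatever the user forgot to close.
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }
}
```

For example, `NaiveTagBalancer.balance("<p>one <b><i>Oh no</b></i>!")` returns `<p>one <b><i>Oh no</i></b>!</p>`. A real tolerant parser knows far more about HTML than this (which elements may nest inside which, for a start), so treat it as a picture of the idea rather than something to put in production.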
Here's a demo form where the WQS agent fixes all the problems in the HTML thrown at it. Press the "submit" button and see what happens. Once submitted you can edit the document to check that it did indeed fix all the errors.
Try including your own examples of poor HTML. See if you can break the page layout. If you do, send me the URL of the broken page.
Notice also that the HTML is not only fixed but also altered by the WQS agent. A CSS class is added to the first paragraph if there is more than one paragraph, and to the last paragraph if there are more than three. This shows how you can add to the HTML in any way you choose. You can also remove HTML: any script tags are stripped from whatever is submitted through the demo form.
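Once the HTML is well formed, this kind of alteration is plain DOM work. The sketch below is not the WQS agent's code; it uses only the JDK's built-in XML APIs, and the class name `HtmlPostProcessor` and the CSS class names `first` and `last` are assumptions made up for the example. It also assumes the input has already been fixed, has a single root element and uses lowercase tag names. The thresholds follow my reading of the demo: the first paragraph is marked when there is more than one paragraph, the last when there are more than three.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; the names here are not from the demo.
class HtmlPostProcessor {
    static String postProcess(String wellFormedHtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // User input: refuse DOCTYPE declarations to avoid entity tricks.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedHtml)));

            // Remove every script element. The NodeList is live, so walk it backwards.
            NodeList scripts = doc.getElementsByTagName("script");
            for (int i = scripts.getLength() - 1; i >= 0; i--) {
                Node script = scripts.item(i);
                script.getParentNode().removeChild(script);
            }

            // Mark the first and last paragraphs, subject to the thresholds.
            NodeList paragraphs = doc.getElementsByTagName("p");
            int count = paragraphs.getLength();
            if (count > 1) {
                ((Element) paragraphs.item(0)).setAttribute("class", "first");
            }
            if (count > 3) {
                ((Element) paragraphs.item(count - 1)).setAttribute("class", "last");
            }

            // Serialise the DOM back to a string without an XML declaration.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input was not well-formed markup", e);
        }
    }
}
```

Calling `HtmlPostProcessor.postProcess("<div><p>a</p><script>alert(1)</script><p>b</p></div>")` gives back markup in which the script element is gone and the first paragraph carries `class="first"`. Note that `setAttribute` overwrites any class the paragraph already had; appending to the existing value would be kinder to users who style their own content.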
Code to be released as and when DEXT is ready for download. Scream if you want it sooner!