Details
Author: Jake Howlett
Date: Wed 17 Sep 2008


How To: Make Sure User-Entered HTML Is Valid and Error Free

How many mistakes can you spot in the HTML below?

<h3>My HTML is well dodgy, innit.</h4>
<p>This is my first paragraph</P>
<p>The second one isn't closed properly!!
<div><p>The third P is in a div that's unclosed. <b><i>Oh no</b></i>!!!</p>

Lots, aren't there! Now imagine this is what a user has typed into a form on a site where you've allowed HTML input. Whether or not you trust your users, and whether or not they mean to do it, you can't expect them to get the syntactic rules of HTML right. Nor can you assume that, just because the form uses an editor such as TinyMCE, the HTML sent to the server will be valid.

What you need to do is have the server sanitise the HTML and make sure it's fit for use. If, as in the example above, a user has added a DIV element and not closed it, everything that follows on the page ends up inside that DIV, and who knows what that does to the layout.

Luckily there's a Java library called XMLUnit that includes a class called TolerantSaxDocumentBuilder which you can pass really bad HTML to and it will, ever so tolerantly, fix all the errors and make sure it's valid.
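To give a feel for what "tolerant" fixing involves, here's a minimal sketch in plain Java. It is not XMLUnit's code and doesn't use its API. The TagBalancer class and its rules are made up for illustration, and it assumes the input has no comments, unquoted attributes or stray angle brackets. It keeps a stack of open elements, closes anything left open, drops end tags that match nothing and lower-cases tag names.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of XMLUnit): balances the tags in an HTML fragment. */
final class TagBalancer {
    // group 1: "/" for an end tag, group 2: tag name, group 3: attributes (passed through untouched)
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>");
    // Elements that never have content, so they are written as <br/> and never pushed on the stack
    private static final Set<String> VOID = new HashSet<>(Arrays.asList("br", "hr", "img", "input"));
    // Assumption: an open paragraph or heading is ended by the next block-level start tag
    private static final Set<String> AUTO_CLOSE =
            new HashSet<>(Arrays.asList("p", "h1", "h2", "h3", "h4", "h5", "h6"));
    private static final Set<String> BLOCK =
            new HashSet<>(Arrays.asList("p", "div", "h1", "h2", "h3", "h4", "h5", "h6"));

    static String balance(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // innermost open element is at the head
        Matcher m = TAG.matcher(html);
        int pos = 0;
        while (m.find()) {
            out.append(html, pos, m.start()); // copy the text between tags unchanged
            pos = m.end();
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (!m.group(1).isEmpty()) {
                // End tag: close everything down to the matching element, or drop it if nothing matches
                if (open.contains(name)) {
                    String top;
                    do {
                        top = open.pop();
                        out.append("</").append(top).append('>');
                    } while (!top.equals(name));
                }
                continue;
            }
            // Start tag: a new block ends a paragraph or heading that was left open
            if (BLOCK.contains(name) && AUTO_CLOSE.contains(open.peek())) {
                out.append("</").append(open.pop()).append('>');
            }
            String attrs = m.group(3).trim();
            boolean selfClosing = attrs.endsWith("/") || VOID.contains(name);
            if (attrs.endsWith("/")) {
                attrs = attrs.substring(0, attrs.length() - 1).trim();
            }
            out.append('<').append(name);
            if (!attrs.isEmpty()) {
                out.append(' ').append(attrs);
            }
            out.append(selfClosing ? "/>" : ">");
            if (!selfClosing) {
                open.push(name);
            }
        }
        out.append(html, pos, html.length()); // trailing text after the last tag
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>'); // close whatever is still open
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(balance("<div><p>Unclosed div. <b><i>Oh no</b></i>!!!</p>"));
    }
}
```

A real parser-backed builder copes with far more than this (entities, comments, attribute quoting), which is the reason to reach for a library rather than roll your own.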

Here's a demo form where the WQS agent fixes all the problems in the HTML thrown at it. Press the "submit" button and see what happens. Once submitted you can edit the document to check that it did indeed fix all the errors.

Try including your own examples of poor HTML. See if you can break the page layout. If you do then send me the URL of the broken page.

Notice also that the HTML is not only fixed but also altered by the WQS agent. A CSS class is added to the first paragraph if there is more than one paragraph, and to the last paragraph if there are more than three. This shows how you can add to the HTML in any way you choose. You can also remove HTML -- any script tags are removed from the demo form.
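Once the HTML is well-formed, this kind of alteration is ordinary DOM work. Here's a rough sketch using only the DOM classes that ship with the JDK. It isn't the demo's actual agent code: the HtmlPostProcessor class and the "first" and "last" class names are assumptions, and it expects a fragment that has already been cleaned up.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: strips scripts and tags the first and last paragraphs. */
final class HtmlPostProcessor {

    /** Expects an already well-formed fragment; returns it wrapped in a single div. */
    static String process(String wellFormedHtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // User input should never be able to pull in a DOCTYPE or external entities
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader("<div>" + wellFormedHtml + "</div>")));

            // Remove every script element, walking backwards so removals don't shift the index
            NodeList scripts = doc.getElementsByTagName("script");
            for (int i = scripts.getLength() - 1; i >= 0; i--) {
                Node script = scripts.item(i);
                script.getParentNode().removeChild(script);
            }

            // Class names are assumptions; setAttribute overwrites any class the user supplied
            NodeList paragraphs = doc.getElementsByTagName("p");
            int count = paragraphs.getLength();
            if (count > 1) {
                ((Element) paragraphs.item(0)).setAttribute("class", "first");
            }
            if (count > 3) {
                ((Element) paragraphs.item(count - 1)).setAttribute("class", "last");
            }

            // Serialise the wrapper element back to a string without an XML declaration
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(doc.getDocumentElement()), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not process HTML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(process("<p>One</p><script>alert(1)</script><p>Two</p>"));
    }
}
```

Working on the DOM rather than on the raw string means the additions and removals can't reintroduce the unbalanced tags you just went to the trouble of fixing.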

Code to be released as and when DEXT is ready for download. Scream if you want it sooner!

Comments

Dave S (Wed 17 Sep 2008 05:03 AM)

This looks fantastic. Please could you supply the code prior to the DEXT release?

Jake Howlett (Wed 17 Sep 2008 05:10 AM)

Sure Dave. Mail me offline or post your email address and you can be the first beta tester.

mark b (Wed 17 Sep 2008 09:20 AM)

There's no submit button on your demo form - not that I can see anyway. Fantastic idea though, I can think of many applications where this would be very handy. I wonder about the additional load it might put on the server but probably well worth it anyway.

Jake Howlett (Wed 17 Sep 2008 09:34 AM)

Hi Mark,

Whoops. A bug in the footer's CSS in IE7. Fixed now.

I can't imagine the load on the server is too great. I hope not, as it's a part of the solution I just sold you ;o)

Jake

zak karachiwala (Sun 21 Sep 2008 05:11 PM)

Brilliant. If only I had this 2 weeks ago. I was looking for something like this as I was doing mass RichText->Mime/HTML conversions and found that Notes/Domino was converting RTF tables without closing the td tags. I got stuck trying to get HTMLTidy up and running properly.

Maybe if time permits I will implement this code.

Jake Howlett (Mon 22 Sep 2008 01:53 AM)

Sorry it was too late, Zak. I've been sitting on it for about that long! Whoops.

If you want the code it's in the Sandbox now (see tab bar).

Stephan H. Wissel (Mon 22 Sep 2008 04:26 AM)

Nice. So XMLUnit would be the second "forgiving" DOM parser. Have you tried jTidy? It has a similar forgiving parser and you can specify what things need fixin.

:-) stw
