Working With Character Sets and Domino

Jake Howlett, 10 March 2005

Category: Forms; Keywords: character set utf-8 us-ascii encoding charset

Introduction

Just when you think you know all there is to know about web development you're brought crashing back down to earth. Until recently I hadn't really paid much attention to nor had any issues with character sets and Domino. Things had always just worked as I'd expected them to. Enter special characters, such as é, and Domino would store them correctly. Then, in response to a blog entry about Cookies, a thread began in which these special characters weren't being saved or displayed properly. I've since been spending a lot of time trying to solve this character-encoding enigma. What I found I think is worth sharing.

About Character Sets

What are character sets? Good question and one I don't feel qualified to answer. You can find everything you (n)ever wanted to know about Character Encoding on Wikipedia.

In layman's terms (i.e. my understanding of it) it's like this. There are standard characters, like 0-9, A-Z, a-z along with things like !"$%^&*() etc. This is the stuff you see on a normal "English" keyboard, and it can be represented using the basic character set called ASCII (strictly speaking, the £ sign on a UK keyboard is not part of ASCII, but you get the idea). If your site only ever needs to use this set of characters you're OK. If, however, you need to cater for other languages that use different characters you need to use another character set. Simple really, isn't it?
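To make the ASCII limit concrete, here is a minimal sketch in Python (my choice for illustration only, nothing to do with Domino; the helper name `fits_in_ascii` is made up). It simply asks whether a piece of text can be stored using ASCII alone.

```python
def fits_in_ascii(text: str) -> bool:
    """Return True if every character in text belongs to ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(fits_in_ascii("Hello, World! 0-9 A-Z a-z"))  # True: plain keyboard characters
print(fits_in_ascii("café"))                        # False: é is outside ASCII
print(fits_in_ascii("£5"))                          # False: so is the pound sign
```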

But, how do I change the character set used and which one should it be? Well, I'll get to the how later on. As for which, well, that depends. Why not use UTF-8, which covers just about everything? Or you could use ISO-8859-1, which covers Western European languages.
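The practical difference between the two is easy to see in a few lines of Python. This is only a sketch; the helper `encode_or_none` and the sample characters are my own picks. It shows the bytes each charset produces for a character, or `None` when the charset cannot represent it at all.

```python
def encode_or_none(char: str, charset: str):
    """Return the bytes for char in charset, or None if it is not representable."""
    try:
        return char.encode(charset)
    except UnicodeEncodeError:
        return None


for char in ["e", "é", "€"]:
    latin1 = encode_or_none(char, "iso-8859-1")
    utf8 = encode_or_none(char, "utf-8")
    print(repr(char), "ISO-8859-1:", latin1, "UTF-8:", utf8)
```

The é takes one byte in ISO-8859-1 and two in UTF-8, while the euro sign has no ISO-8859-1 byte at all.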

Which to use is a debate all in itself and one that I don't truly "get". Let's forget it for now and look at the issue I had with Domino that made me stop and think about how they are used on this site.

The Scenario

The problem was that certain characters, submitted in response to blog entries, weren't storing properly. They would be sent back to the browser as square blobs like this ■ (the character &#9632;, whatever that may be).
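For the curious, the blob's identity can be looked up from its number. A small Python sketch using the standard `unicodedata` module (the variable name `blob` is mine):

```python
import unicodedata

blob = chr(9632)                # the character behind the numeric reference 9632
print(hex(ord(blob)))           # the same code point written in hex
print(unicodedata.name(blob))   # its official Unicode name
```

It turns out to be U+25A0, "BLACK SQUARE", which suggests it is a stand-in for a character that couldn't be stored rather than anything anyone actually typed.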

Responses to blog entries on this site aren't posted using real Domino forms. Instead they are created with "in-line" forms.

The same was happening for blog entries and articles. In the new version of CodeStore responses to articles are now added in the same way as blog responses - using faked forms. You can see one at the bottom of this article. Until I fixed the issue these documents wouldn't store certain characters either.

The really weird thing was that it worked when adding a response directly using the actual Domino form. With this in mind I went about trying to find out what the difference between the two pages was. Why did one work and not the other?!

In terms of the HTML that made up the form and the fields contained within it, the two forms were almost identical. The only difference I could see was in the action attribute. For the in-line form the HTML is:

<form method="post" action="post?CreateDocument&ParentUNID=DOCID">

And, for the "direct" Domino Form, the HTML is:

<form method="post" action="post?OpenForm&ParentUNID=DOCID&Seq=1" name="_post">

The difference here is the URL by which Domino received the data. The actual Form used by Domino is the same. It's just a different way of creating the same document. There's no reason it should be a problem with charsets.

But, this was irrelevant. Even before Domino received the form's values, the browser had added another difference. Using an HTTP Sniffer I was able to intercept the data being sent to Domino by each form. Sending the words inliné and diréct in the respective forms actually sent inlin%E9 and dir%C3%A9ct. It appeared that each form was sending the data using a different encoding. Not only that, but one method seemed to be encoding twice. Something was very wrong. The mystery intensified until I happened to notice something else in the sniffer.
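Both wire formats can be reproduced outside the browser. Here is a minimal Python sketch, assuming one form encodes as ISO-8859-1 and the other as UTF-8 (which is where the rest of this article ends up); the helper name `percent_encode` is my own.

```python
from urllib.parse import quote


def percent_encode(text: str, charset: str) -> str:
    """Percent-encode text the way a browser would for the given charset."""
    return quote(text, encoding=charset)


print(percent_encode("inliné", "iso-8859-1"))  # inlin%E9
print(percent_encode("diréct", "utf-8"))       # dir%C3%A9ct
```

Note that %C3%A9 is not the character being encoded twice. UTF-8 simply needs two bytes for é, and each byte gets its own %XX escape.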

The Problem

What I had failed to notice for a long time was the glaringly obvious. In the screen-grab below we can see the content-type of each type of "form". The inline form first and then the direct one:

screengrab

Notice how the charsets in the two content-type headers are different. After some further reading I found that browsers send form data using the same encoding as that of the page in which the form is contained. Our direct form is using a different charset to the document in which our in-line form lives. By adding a fake form to a document we are using a different charset to that used by the Domino forms.
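If you'd rather check the header in code than squint at a sniffer, the charset can be pulled out of a content-type value with Python's standard library. A sketch only: `charset_of` is a made-up helper, and the header values are illustrative rather than copied from the screen-grab.

```python
from email.message import Message


def charset_of(content_type: str):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


print(charset_of("text/html; charset=UTF-8"))        # utf-8
print(charset_of("text/html; charset=ISO-8859-1"))   # iso-8859-1
print(charset_of("text/html"))                       # None: no charset declared
```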

So we know the root of the problem. What about the cause of the problem? Well, by default, Domino uses a different charset for documents in read-mode than it does for documents in edit-mode. As you can see from the screen-grab of the Server Document below, read-mode documents are considered "output" and so are not encoded as UTF-8. Edit-mode documents are "HTML forms" and so are encoded using UTF-8, rather than the default Western charset of ISO-8859-1. Hence the difference in the HTTP sniffer screen-grab above.

screengrab

The real underlying problem is that Domino expects forms to be sent to it using UTF-8 encoding, as per the server setting. It doesn't expect us to create our own forms and so doesn't expect data to be sent in any other charset. Sending the form data encoded as ISO-8859-1 is going to cause problems. Hence the square blobs.
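The mismatch is easy to simulate. The Python sketch below is a rough model of the situation, not Domino's actual decoding logic (which I haven't seen), and the variable names are mine. It takes the data as an ISO-8859-1 page would send it and reads it back both ways.

```python
from urllib.parse import unquote

sent_by_browser = "inlin%E9"  # é percent-encoded as a single ISO-8859-1 byte

# Read back as UTF-8, which is what the server is expecting
read_as_utf8 = unquote(sent_by_browser, encoding="utf-8", errors="replace")

# Read back with the charset the browser really used
read_as_latin1 = unquote(sent_by_browser, encoding="iso-8859-1")

print(repr(read_as_utf8))     # 'inlin\ufffd': the é is lost
print(repr(read_as_latin1))   # 'inliné'
```

The lone byte E9 is not valid UTF-8 on its own, so Python swaps in the replacement character U+FFFD. Domino showed a square blob instead, but the cause looks the same.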

The Solution

The solution is obvious. We need to force our "fake" form to send data encoded in the same charset as Domino uses for its Forms, which is UTF-8 in my case. The easiest way to do this is to add an accept-charset attribute to our in-line form, like so:

<form method="post" action="post?CreateDocument&ParentUNID=DOCID" accept-charset="UTF-8">

This solved the problem. Well, in Mozilla at least. Not in Internet Explorer, where it seemed to still use the page's charset as the overriding setting.

To get this to work in Internet Explorer and Firefox (et al) the quick way is to change the actual Domino Form, overriding the character set it uses. To do this, open the Form's Property Box and find the "Character set" setting on the propeller hat tab. Change this to UTF-8, as below.

screengrab

Domino now uses the same character set whether we are reading or editing with this Form. Problem solved!

However, solving it this way doesn't help with other forms that may have the same problem elsewhere. As an alternative we could change the server setting we talked about earlier, so that all output from the server is in the same character set. This would solve the problem on all future Forms we create using this technique.

Meta or Header?

If you're like me you might be wondering why the charset is being sent to the browser as part of the content-type response header. I've not seen many other servers doing this. Have you? Well, it turns out this is the W3C advised way of doing it. If you prefer to use HTML Meta tags there's a server setting that controls it. Here are the defaults:

screengrab

If you swapped these settings around then you'd see the character set as part of the HTML Head of the page, like so:

<meta http-equiv="content-type" content="text/html; charset=utf-8" />
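To confirm which charset a page declares in its Head, you can parse the meta tag rather than read the source by eye. A minimal Python sketch using only the standard library; the class and function names are hypothetical, and it only looks for the http-equiv style of tag shown above.

```python
from email.message import Message
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Remember the charset from the first http-equiv content-type meta tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        http_equiv = (attributes.get("http-equiv") or "").lower()
        content = attributes.get("content")
        if http_equiv == "content-type" and content:
            # Reuse the header parser to pick out the charset parameter
            msg = Message()
            msg["Content-Type"] = content
            self.charset = msg.get_content_charset()


def meta_charset(html: str):
    """Return the charset declared in a meta tag, or None if there isn't one."""
    finder = MetaCharsetFinder()
    finder.feed(html)
    return finder.charset


page = '<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8" /></head></html>'
print(meta_charset(page))  # utf-8
```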

Alternatively you could disable both these settings and add this meta tag yourself as part of the $$HTMLHead field's value. It's a brave person who does this though. I rarely say this but I think it might be better to leave this task to Domino.

As far as CodeStore goes I have made sure Domino uses the same character sets for all output (and input) across the server by editing the server document and restarting the HTTP task. No more blobs round here...

Summary

Character sets are essential in Domino applications where there's a chance that international characters are going to be used. For the most part you can rest assured that Domino will take care of their use well enough. The problem comes about when we think for ourselves and start to send data to the server in ways it wouldn't normally expect.

Even if you never see yourself doing any weirdness like this it's still worth knowing, as a Web Developer, that the issue exists. It's essential to have a full understanding of the behind-the-scenes communication between browser and server.

There's no need to spend forever becoming an expert on all this, but a good primer is the O'Reilly HTTP Pocket Reference. You can digest this pocket-sized book in a few hours and discover as much as you'll ever need to know. Next time you're in the computing section of a book shop search it out and add it to your shelf!

Addendum

It's worth noting that there's another problem with Domino. Yep, you heard me right, another problem.

Despite telling it to use UTF-8 for input and output it still insists on using US-ASCII for things like views on $$ViewTemplate forms. Others have had the same problem — here and here.

To make sure every page on CodeStore uses the same charset I changed the Character Set of all Forms. Nothing's ever easy with Domino, is it?