Ben learns the difference between "characters" and "bytes" the hard way

Ben Galbraith discovers a little snippet about XML encoding that is both subtle and evil:

A while back, I was working on a system feature that read in some XML from the filesystem, XSLT’d it into HTML, and served it up to a browser. The XML had a bunch of characters from the higher Unicode ranges (i.e., >255), and wouldn’t you know, when viewed in a browser, these characters showed up as garbled data. Not “The Box”–that ugly little placeholder used when a font doesn’t contain a character for a given code point–but usually one to three seemingly random characters that had nothing to do with the character that was supposed to be displayed.

And then, whilst reading through some of the backend code, I saw this innocuous little line:

Document document = new SAXBuilder().build(new FileReader(file));

See the problem? Look again. … This is the code I should have written:

Document document = new SAXBuilder().build(new FileInputStream(file));

If you hand an XML parser bytes, which is the currency of InputStreams, the parser handles converting those bytes to characters itself, and uses the encoding in the XML prolog to configure itself for that process. If you hand it characters… it’s stuck using those characters and can’t affect the decoding process one whit, since it occurs a level beneath it.
By the way, anybody working with Streams in .NET had best be aware of the same basic problem….
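
The difference between the two lines above can be reproduced without touching the filesystem. What follows is a minimal sketch, not Ben's actual code. It assumes the JDK's built-in DOM parser as a stand-in for JDOM's SAXBuilder, a hypothetical one-element document, and ISO-8859-1 as the wrong charset. A FileReader would pick its charset without ever looking at the XML prolog (historically the platform default), so the sketch hard-codes one to keep the result repeatable. The class and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class EncodingDemo {
    // Hypothetical document: U+03A9 (GREEK CAPITAL LETTER OMEGA) is above 255
    // and takes two bytes (CE A9) in UTF-8.
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?><symbol>\u03A9</symbol>";

    // Stands in for the bytes sitting in the file on disk.
    static byte[] fileBytes() {
        return XML.getBytes(StandardCharsets.UTF_8);
    }

    // The parser gets bytes, so it decodes them itself using the prolog.
    static String parseFromBytes(byte[] bytes) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(bytes))
                      .getDocumentElement().getTextContent();
    }

    // The parser gets characters that were already decoded with the wrong
    // charset (assumed here: ISO-8859-1); the prolog can no longer help.
    static String parseFromMisdecodedChars(byte[] bytes) throws Exception {
        Reader reader = new InputStreamReader(
            new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1);
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(reader))
                      .getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = fileBytes();
        // One omega in, one omega out.
        System.out.println(parseFromBytes(bytes).equals("\u03A9"));
        // One omega in, two unrelated Latin-1 characters out (U+00CE, U+00A9).
        System.out.println(
            parseFromMisdecodedChars(bytes).equals("\u00CE\u00A9"));
    }
}
```

Both comparisons should print true: the byte path returns the original character, while the reader path turns it into two unrelated characters, which matches the garbling described above.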

The lesson? Be aware of your encodings, at all levels of the translation and processing machine… And if you don’t know the difference between UTF-8, Unicode and ASCII, you’re falling into the trap that goes by the name of Leaky Abstractions….
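
For anyone unsure about that difference, here is a small sketch (the class, helper names and sample string are invented for illustration). Unicode assigns each character a code point; UTF-8 and ASCII are two different ways of turning those code points into bytes, and ASCII simply cannot represent most of them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CodePointsVsBytes {
    // 'A', GREEK CAPITAL LETTER OMEGA (U+03A9), EURO SIGN (U+20AC).
    static final String SAMPLE = "A\u03A9\u20AC";

    // How many bytes the string occupies in the given encoding.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    // Encode, then decode with the same charset; lossy if the charset
    // cannot represent every character in the string.
    static String roundTrip(String s, Charset charset) {
        return new String(s.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Three Unicode code points...
        System.out.println(SAMPLE.codePointCount(0, SAMPLE.length()));
        // ...become 1 + 2 + 3 = 6 bytes in UTF-8...
        System.out.println(encodedLength(SAMPLE, StandardCharsets.UTF_8));
        // ...and survive a UTF-8 round trip...
        System.out.println(
            roundTrip(SAMPLE, StandardCharsets.UTF_8).equals(SAMPLE));
        // ...but ASCII replaces the two characters it cannot encode with '?'.
        System.out.println(
            roundTrip(SAMPLE, StandardCharsets.US_ASCII).equals("A??"));
    }
}
```

The character count and the byte count only agree while the text stays within ASCII, which is why code that treats the two as interchangeable tends to work until the first non-English character shows up.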