Monday, August 22, 2005
When do you use XML, again?

So I'm flipping through some old weblog entries, and I run across this one:

So when is it a good idea to use XML for your data? The easy answer is that you should use XML when it is likely to be easier (in the long run) than creating your own parser. Using XML carries some cost. XML is verbose, and parsing is guaranteed to be slower than a custom parser. Why is XML such the rage then? Aside from the hype there is one very good reason to use XML. For many types of data, it is easier to just load the data into a DOM and extract the information from that, than it is to write a custom parser. That means less time spent debugging code, more time spent focusing on the problem at hand.
Gotta say, no way. XML isn't just a format designed to avoid having to write a parser; going down this path may seem like a good idea in the beginning, but over time it's eventually going to bite you in the ass in a big way--just ask James Duncan Davidson, the original creator of Ant, who later came to admit that
  1. He originally chose to use XML as the format for Ant scripts because he didn't want to write a parser, and
  2. He really regrets it and apologizes to the Java community at large for it.
(Source: paraphrasing from personal communication and Pragmatic Project Automation, by Mike Clark).

The problem, in the case of Ant--and its successors, like MSBuild--in using XML is that it is a strictly-hierarchical format, and not everything follows a strictly hierarchical format (even though it might seem to at first). More importantly, XML is a hideously verbose format, and the "self-descriptive" tags that everybody blathers on about are only self-descriptive to carbon-based life forms (and then only if semantically-rich terms are used for the tag names). For example, does this "self-descriptive" XML have any meaning to you?

It obviously avoids the verbosity that frequently plagues XML, but clearly surrenders a lot of the self-descriptiveness as a result.

So when is it a good idea to use XML for your data? My criteria are a bit more stringent:

  • When your data is naturally hierarchical to begin with
  • When exchange with foreign platforms (which is to say, platforms not native to what you're currently authoring in) is important
  • When pre-existing tool support (XSLT, XML viewers, import/export utilities, etc) is of paramount importance
Still wanting to use XML to avoid having to write a parser? Fine, do so, but make sure to set a timer somewhere in your code that will delete the data file in a year or so, or else risk making the same mistakes Ant did....

By the way, scripting languages (like Ruby, Python or JavaScript) make a terribly convenient way to do data storage/manipulation that still doesn't require writing a parser or custom data format--witness the astounding success of Ruby-on-Rails in this area to see what I mean. Combine this with the ability to embed them in your .NET, Java or C++ code, and you have a really strong argument against using XML to "avoid writing a parser". Look at what Groovy does for Ant scripts, for example (Pragmatic Project Automation).