So You Say You Want to Kill XML....

Google (or at least some part of it) has now weighed in on the whole XML discussion with the recent release of their "Protocol Buffers" implementation, and, quite naturally, the debates have begun, with all the carefully-weighed logic, respectful discourse, and reasoned analysis that we've come to expect and enjoy from this industry.

Yeah, right.

Anyway, without trying to take sides either way in this debate--yes, the punchline is that I believe in a world where both XML and Protocol Buffers are useful--I thought I'd weigh in on some of the aspects about PBs that are interesting/disturbing, but more importantly, try to frame some of the debate and discussions around these two topics in a vain attempt to wring some coherency and sanity out of what will likely turn into a large shouting match.

For starters, let's take a quick look at how PBs work.

Protocol Buffers 101

The idea behind PBs is pretty straightforward: given a PB definition file, a code-generator tool builds C++, Java or Python accessors/generators that know how to parse and produce files in the Protocol Buffer format. The generated classes follow a pretty standard format, using the traditional POJO/JavaBean style get/set for Java classes, and something similar for both the Python and C++ classes. (The Python implementation is a tad different from the C++ and Java versions, as it makes use of Python metaclasses to generate the class at runtime, rather than at generation-time.) So, for example, given the Google example of:

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phone = 4;
}

... using the corresponding generated C++ class would look something like

Person person;
person.set_name("John Doe");
person.set_id(1234);
fstream output("myfile", ios::out | ios::binary);
person.SerializeToOstream(&output);

... and the Java implementation would look somewhat similar, except using a Builder to create the Person.
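The spirit of the Python variant's runtime class generation can be shown with a toy sketch in plain Python. This is emphatically not the protobuf runtime--`make_message_class` and its descriptor dict are invented for illustration--but it shows the general idea of building a message class from a field specification at runtime rather than via generated accessor source:

```python
# Toy illustration (NOT the real protobuf implementation): build a
# message class at runtime from a field-name -> field-type descriptor,
# in the spirit of the metaclass approach described above.

def make_message_class(name, fields):
    """Dynamically create a class whose attributes come from a field spec."""
    def __init__(self, **kwargs):
        # Unset fields default to the type's zero value ("" for str, 0 for int).
        for fname, ftype in fields.items():
            setattr(self, fname, kwargs.get(fname, ftype()))

    def __repr__(self):
        body = ", ".join(f"{f}={getattr(self, f)!r}" for f in fields)
        return f"{name}({body})"

    return type(name, (object,), {"__init__": __init__, "__repr__": __repr__})

# Hypothetical descriptor mirroring the Person message above.
Person = make_message_class("Person", {"name": str, "id": int, "email": str})

p = Person(name="John Doe", id=1234)
```

The real implementation hangs serialization and wire-format parsing off the generated class as well; the point here is only that, per the description above, the Python side assembles the class at runtime from a descriptor rather than emitting accessor code directly.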

The Protocol Buffer interoperable definition language is relatively complete, with support for a full range of scalar types, enumerations, variable-length collections of these types, nested message types, and so on. Each field in the message is tagged with a unique field number, as you can see above, and the language provides the ability to "reserve" field IDs via the "extension" mechanism, presumably to allow for either side to have flexibility in extending the PB format.

There's certainly more depth to the PB format, including the "service" and "rpc" features of the language that will generate full RPC proxies, but this general overview serves to provide the necessary background to be able to engage in an analysis of PBs.
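For the curious, a "service" declaration in the .proto language looks like the following (the message and service names here are hypothetical, in the style of the language guide); the code generator then emits abstract service and stub classes that a concrete RPC transport plugs into:

```
service SearchService {
  rpc Search (SearchRequest) returns (SearchResponse);
}
```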

Protocol Buffers: An Analysis

When looking at Protocol Buffers, it's important to realize that this is an open-source implementation, and that Google has already issued a call to others to modify them and submit patches to the project. Anything that comes across as a criticism or inherent flaw in the implementation is thus correctable, though whether Google would apply those fixes to the version they use internally is an open question--there's a tension in any open-source project sponsored by a company, between "What we need the project for" and "What other people using the project need it for", and it's not always clear how that tension will play out in the long term.

So, without further ado ...

For starters, Protocol Buffers' claim to be language- and/or platform-neutral is hardly justifiable, given that they have zero support for the .NET platform out of the box. Now, before the Googlists and Google-fanbois react angrily, let me be the first to admit that, yes, nothing stops anybody from producing said capability and contributing it back to the project. In fact, there's even some verbiage to that effect on the Protocol Buffers' FAQ page. But without it coming out of the box, it's not fair to claim language- and platform-neutrality--unless, of course, they are willing to suggest that COM's Structured Storage was also language- and platform-neutral, and that it wasn't Microsoft's fault that nobody went off and built implementations for it under *nix or the Mac. Frankly, any binary format, regardless of how convoluted, could be claimed to be language- and platform-neutral under those conditions, which I think makes the claim spurious. The fact that Google doesn't care about the .NET platform doesn't mean that their implementation is incapable of running on it; the fact that Google wrote their code generator to support C++, Java and Python doesn't mean that Protocol Buffers are language- and platform-neutral, either. XML still holds the edge here, by a long shot--until we see implementations of Protocol Buffers for Perl/Parrot, C, D, .NET, Ruby, JavaScript, mainframes and others, PBs will have to take second-place status behind XML in terms of "reach" across the wide array of systems. Does that diminish their usefulness? Hardly. It just depends on how far a developer wants their data format to stretch.

Having gotten that little bit out of the way ...

Remember a time when binary formats were big? Back in the early- to mid-90's, binary formats were all the rage, and criticism of them abounded: they were hard to follow, hard to debug, bloated, inherently inflexible, tightly-coupled, and so on. Consider, for example, this article comparing XML against its predecessor information-exchange technologies:

A great example ... is a data feed of all orders taken electronically via a web site into an internal billing and shipping system. The main advantage of XML here is its flexibility - developers create their own tags and dictionaries, as they deem necessary. Therefore no matter what type of data is being transferred, the right XML representation of it will accurately describe the data.

Logically, each order can be described as one customer purchasing one or more items using their credit card and potentially entering different billing and shipping addresses. The contents of the file are very easy to read, even for a person who is not familiar with XML. The information within each of the order tags is well structured and organized. This enables developers to use parsing components and easily access any data within the document. Each item in the order is logically a unique entity, and is also represented with a separate tag. All item properties are defined as "child" nodes of the item tag.

XML is the language of choice for two major reasons. First of all, an XML formatted document can be easily processed under any OS, and in any language, as long as XML parsing components are available. On the other hand, XML files are still raw data, which enables merchants to format the data any way they want to. All in all, document structure and wide acceptance of this format has made it possible to enable customers to build more efficient internal order Tracking systems based on XML-formatted order files. Other online merchant sites are making similar functionality available on their Web sites as well.

And yet, now, the criticism goes the other way, complaining that XML is bloated, hard to follow, and so on:

Protocol buffers are a flexible, efficient, automated mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages. You can even update your data structure without breaking deployed programs that are compiled against the "old" format.

How do we reconcile these apparently contradictory positions?

First of all, I don't think anybody in the XML community has ever sought to argue that XML was designed for efficiency; in fact, quite the opposite, as the XML specification itself states in the abstract:

The Extensible Markup Language (XML) is a subset of SGML that is completely described in this document. Its goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. XML has been designed for ease of implementation and for interoperability with both SGML and HTML. ... This document specifies a syntax created by subsetting an existing, widely used international text processing standard (Standard Generalized Markup Language, ISO 8879:1986(E) as amended and corrected) for use on the World Wide Web.

The Introduction section is even more explicit:

XML documents are made up of storage units called entities, which contain either parsed or unparsed data. Parsed data is made up of characters, some of which form character data, and some of which form markup. Markup encodes a description of the document's storage layout and logical structure. XML provides a mechanism to impose constraints on the storage layout and logical structure.

And if that wasn't clear enough....

The design goals for XML are:

  1. XML shall be straightforwardly usable over the Internet.
  2. XML shall support a wide variety of applications.
  3. XML shall be compatible with SGML.
  4. It shall be easy to write programs which process XML documents.
  5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
  6. XML documents should be human-legible and reasonably clear.
  7. The XML design should be prepared quickly.
  8. The design of XML shall be formal and concise.
  9. XML documents shall be easy to create.
  10. Terseness in XML markup is of minimal importance.

In essence, then, the goal of XML was never to be small or fast, but it was still clearly meant to be simple. And, whatever your personal opinion of the ecosystem that has grown up around XML (SOAP, WS-*, and so on), it's still fairly easy to defend the idea that XML itself is a simple technology, particularly if we make some basic assumptions around the things that usually complicate text, like character sets and encodings.

Note: I am deliberately ignoring the various attempts at a binary Infoset specification, which has been on the TODO list of many XML working groups in the past and has yet to really make any sort of impact in the industry. Theoretically, yes, XML Infoset-compliant documents could be rendered into a binary format that would have a much smaller, more efficient footprint than the textual representation. If and when we ever get there, it would be interesting to see what the results look like. I'm not holding my breath.

Why, then, did XML take on a role as data-transfer format if, on the surface of things, using text here was such a bad idea?


"With Web services your accounting departments Win 2k servers billing system can connect with your IT suppliers UNIX server." --

"Conventional application development often means developing for multiple devices, and that form factors of the client devices can be dramatically different. If you base an application on a web browser display size of 800x600, it would never work on a device with a resolution of 4 lines by 20 characters. Conversely, if you took a lowest common denominator approach and sized for the smaller device, the user interface would be lost on an 800x600 device. Using a non-XML approach, this leaves us writing multiple clients speaking to the server, or writing multiple clients speaking to multiple servers ... Limitations of this approach include:

  • Tightly coupled to browser
  • Multiple code-bases
  • Difficult to adapt to new devices
  • Major development efforts
  • Slow time to market

"In this case, only one application exists. It runs against the back-end database, and produces an XML stream. A "translator" takes this XML stream, and applies an XSLT transformation to it. Every device could either use a generic XSLT, or have a specialized XSLT that would produce the required device-specific output. The transformation occurs on the server, meaning that no special client capabilities are required. This "hub and spoke" architecture yields tremendous flexibility. When a new device appears, a new spoke can be added to accommodate it. The application itself does not need to be changed, only the translator needs to be informed about the existence of the new device, and which XSLT to use for it." --

Certainly the interoperability argument doesn't require a text-based format; it was just always cited that way. In fact, both the CORBA supporters and the developers over at ZeroC will agree with Google in suggesting that a binary format can and will be an efficient and effective interoperability format. (I'll let the ZeroC folks talk to the pros and cons of their Ice format as compared to IIOP and SOAP.)

Some of the key inherent advantages in XML, however, are lost in this new binary, structured, tightly-coupled format; they center around the XML Infoset itself and the number of ancillary tools that have grown up around it:

  • XPath. The ability to selectively extract nodes from inside a document is incredibly powerful and highly underutilized by most developers.
  • XSLT. Although somewhat on the down-and-out in popularity with developers because of its complexity, XSLT stands as a shining example of how a general-purpose transformation tool can be built once the basic structure is well-understood and independent of the actual domain. (SQL is another.)
  • Structureless parsing. In order to use Protocol Buffers, each side must have a copy of the .proto file that generated the proxies. (While it's theoretically possible to build an API that picks through the structure element-by-element in the same way an XML API does, I haven't found such an API in the PB code yet, and doing so would mean spending some quality time with the Protocol Buffer binary encoding format. The same was always true of IIOP or the DCOM MEOW format, too.)

In essence, a Protocol Buffer format consumer is tightly-coupled against the .proto file that it was generated against, whereas in XML we can follow Noah Mendelsohn's advice of years ago ("Schema are relative") and parse XML in whatever manner suits the consumer, with or without schema, with or without schema-based-and-bound proxy code. The advantage to the XML approach, of course, is that it provides a degree of flexibility; the advantage of the Protocol Buffer approach is that the code to produce and consume the elements can be much simpler, and therefore, faster.

Note: it's only fair to point out at this point that the Protocol Buffer approach already contains a certain degree of flexibility that earlier binary formats lacked (if I remember correctly), via PB's use of "optional" vs "required" tags for the various fields. Whether this turns out to be sufficient over time is yet to be decided, though Google claims to be using Protocol Buffers throughout the company for its own internal purposes, which is certainly supporting evidence that cannot be discarded casually.

But there's a corollary effect here, as well: because XML documents are intended to be self-descriptive, the Protocol Buffer format can contain just the data, and leave the format and structure to be enforced by the code on either side of the producer/consumer relationship. Whether you consider this a Good Thing or a Bad Thing probably stands as a good indicator of whether you like the Protocol Buffer approach or the XML approach better.
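A rough byte count makes the trade-off concrete. The XML document below is invented for illustration, and the PB bytes are hand-encoded per the documented wire format for the Person message above:

```python
# Same two fields, with and without in-band structure.
xml_doc = b"<person><name>John Doe</name><id>1234</id></person>"
pb_doc = b"\x0a\x08John Doe\x10\xd2\x09"  # field 1: "John Doe", field 2: 1234

print(len(xml_doc), len(pb_doc))  # the XML is roughly four times the size
```

The XML version, of course, names its own fields; the PB version is meaningless without the .proto file or the code generated from it.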

Note, too, by the way, that many of the XML-based binding APIs can now parse objects out of part of an XML document, as opposed to having to read in the whole thing and then convert--JAXB, JiBX and XMLBeans can all pick up parsing an XML document from any node beyond the initial root element, for example--whereas, at least as of this point (as far as I can tell, and I'd love to have somebody at Google tell me I'm wrong here, because I think it's a major flaw), the Protocol Buffers approach assumes it will read in the entire object from the file. I don't see any way, short of putting developer-specified "separation markers" into the stream or some other kind of encoding or convention, of doing a "partial read" of an object model from a PB data file.

To see what I mean by this, consider the AddressBook example. Suppose the AddressBook holds several thousand records, and my processing system only cares about a select few (less than five, perhaps, who all have the last name "Neward"). In a Protocol Buffer scheme, I deserialize the entire AddressBook, then go through the persons item by item, looking for the ones I want. In an XML-based scheme, I can pull-mode parse the nodes (using StAX in Java, or the pull-mode XML parser in .NET, for example), throwing away nodes until I see one where the <lastName> node contains "Neward", and then JAXB-consume the next n nodes into a Person object before continuing through the remainder of the AddressBook.
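The XML side of that scenario can be sketched with nothing but the Python standard library (the AddressBook document and the names in it are invented; the StAX version in Java would look analogous):

```python
# Pull-mode parse of a hypothetical AddressBook document: each <person>
# is examined as it streams past and immediately discarded, so only the
# matching records are ever kept around.
import io
import xml.etree.ElementTree as ET

xml_doc = """<addressBook>
  <person><firstName>Jane</firstName><lastName>Smith</lastName></person>
  <person><firstName>Ted</firstName><lastName>Neward</lastName></person>
  <person><firstName>Bob</firstName><lastName>Jones</lastName></person>
</addressBook>"""

matches = []
for event, elem in ET.iterparse(io.StringIO(xml_doc), events=("end",)):
    if elem.tag == "person":
        if elem.findtext("lastName") == "Neward":
            matches.append(elem.findtext("firstName"))
        elem.clear()  # discard the node; memory stays roughly per-record
```

No equivalent is possible with the generated PB classes as they stand: the whole AddressBook comes off the wire, or none of it does.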

Let's also note that the Protocol Buffer scheme assumes working with a stream-based (which usually means file-based) storage style for when Protocol Buffers are used as a storage mechanism. But frankly, if I want to store objects, I'd rather use a system that understands objects a bit more deeply than PBs do, and gives me some additional support beyond just input and output. This gets us into the long and involved discussion around object databases, which is another one that's likely to devolve into a shouting match, so I'll leave it at that. Suffice it to say that for object storage, I can see using (for example) db4o-storing-to-a-file as a vastly more long-term solution than I can using PBs, at least for now. (Undoubtedly the benchmarks will be along soon to try and convince us one way or another.) One area that is of particular interest along these lines, though, will be the evolutionary capabilities of each--from my (limited) study of PBs thus far, I believe db4o has a more flexible evolutionary scheme, and I have to admit, I don't like the idea of having to run the codegen step before being able to store my objects, but that's a minor nit that's easily solved with tools like Ant and a global rule saying "Nobody touches the .proto file without checking with me first."

Which, by the way, brings up another problem, the same one that plagues CORBA, COM/DCOM, WSDL-based services, and anything that relies on a shared definition file that is used for code-generation purposes: what I often call The Myth of the One True Schema. Assuming a developer creates a working .proto/.idl/.wsdl definition, and two companies agree on it, what happens when one side wants to evolve or change that definition? Who gets to decide the evolutionary progress of that file? Who "owns" that definition, in effect? And this, of course, presumes that we can even get some kind of definition as to what a "Customer" looks like across the various departments of the company in the first place, much less across companies. Granted, the "optional" tag in PBs helps with this, but we're still stuck with an inherently unscalable problem as the number of participants in the system grows.

I'd give a great deal of money to see what the 12,000-odd .proto files look like inside Google, and then again at what they look like in five years, particularly if they are handed out to paying customers as APIs against which to compile and link. There's ways to manage this, of course, but they all look remarkably like the ways we managed them back in the COM/DCOM-vs-CORBA days, too.

Long story short, the Protocol Buffer approach looks like a good one, but let's not let the details get lost in the shouting: Protocol Buffers, as with any binary protocol format and/or RPC mechanism (and I'm not going to go there; the weaknesses of RPC are another debate for another day), are great for those situations where performance is critical and both ends of the system are well-known and controlled. If Google wants to open up their services such that third-parties can call into those systems using the Protocol Buffers approach, then more power to them... but let's not lose sight of the fact that it's yet another proprietary API, and that if Microsoft were to do this, the world would be screaming about "vendor lock-in" and "lack of standards compliance". (In fact, I heard exactly these complaints from Java developers during WCF Q&A when they were told that WCF-to-WCF endpoints could "negotiate up" to a faster, binary, protocol between them.)

In the end, if you want an endpoint that is loosely coupled and offers the maximum flexibility, stick with XML, either wrapped in a SOAP envelope or in a RESTful envelope as dictated by the underlying transport (which means HTTP, since REST over anything else has never really been defined clearly by the Restafarians). If you need a binary format, then Protocol Buffers are certainly one answer... but so is Ice, or even CORBA (though this is fast losing its appeal thanks to the slow decline of the players in this space). Don't lose sight of the technical advantages or disadvantages of each of those solutions just because something has the Google name on it.