JOB REFERRALS
    ON THIS PAGE
    ARCHIVES
    CATEGORIES
    BLOGROLL
    LINKS
    SEARCH
    MY BOOKS
    DISCLAIMER
 
 Friday, July 11, 2008
So You Say You Want to Kill XML....

Google (or at least some part of it) has now weighed in on the whole XML discussion with the recent release of their "Protocol Buffers" implementation, and, quite naturally, the debates have begun, with all the carefully-weighed logic, respectful discourse, and reasoned analysis that we've come to expect and enjoy from this industry.

Yeah, right.

Anyway, without trying to take sides either way in this debate--yes, the punchline is that I believe in a world where both XML and Protocol Buffers are useful--I thought I'd weigh in on some of the aspects about PBs that are interesting/disturbing, but more importantly, try to frame some of the debate and discussions around these two topics in a vain attempt to wring some coherency and sanity out of what will likely turn into a large shouting match.

For starters, let's take a quick look at how PBs work.

Protocol Buffers 101

The idea behind PBs is pretty straightforward: given a PB definition file, a code-generator tool builds C++, Java or Python accessors/generators that know how to parse and produce files in the Protocol Buffer format. The generated classes follow a pretty standard format, using the traditional POJO/JavaBean style get/set for Java classes, and something similar for both the Python and C++ classes. (The Python implementation is a tad different from the C++ and Java versions, as it makes use of Python metaclasses to generate the class at runtime, rather than at generation-time.) So, for example, given the Google example of:

message Person {
required string name = 1;
required int32 id = 2;
optional string email = 3;

enum PhoneType {
MOBILE = 0;
HOME = 1;
WORK = 2;
}

message PhoneNumber {
required string number = 1;
optional PhoneType type = 2 [default = HOME];
}

repeated PhoneNumber phone = 4;
}

... using the corresponding generated C++ class would look something like

Person person;
person.set_name("John Doe");
person.set_id(1234);
person.set_email("jdoe@example.com");
fstream output("myfile", ios::out | ios::binary);
person.SerializeToOstream(&output);

... and the Java implementation would look somewhat similar, except using a Builder to create the Person.

The Protocol Buffer interoperable definition language is relatively complete, with support for a full range of scalar types, enumerations, variable-length collections of these types, nested message types, and so on. Each field in the message is tagged with a unique field number, as you can see above, and the language provides the ability to "reserve" field IDs via the "extension" mechanism, presumably to allow for either side to have flexibility in extending the PB format.

There's certainly more depth to the PB format, including the "service" and "rpc" features of the language that will generate full RPC proxies, but this general overview serves to provide the necessary background to be able to engage in an analysis of PBs.

Protocol Buffers: An Analysis

When looking at Protocol Buffers, it's important to realize that this is an open-source implementation, and that Google has already issued a call to others to modify them and submit patches to the project. Anything that comes across as a criticism or inherent flaw in the implementation is thus correctable, though whether Google would apply those fixes to the version they use internally is an open question--there's a tension in any open-source project sponsored by a company, between "What we need the project for" and "What other people using the project need it for", and it's not always clear how that tension will play out in the long term.

So, without further ado ...

For starters, Protocol Buffers' claim to be language and/or platform-neutral is hardly justifiable, given that they have zero support for the .NET platform out of the box. Now, before the Googlists and Google-fanbois react angrily, let me be the first to admit, that yes, nothing stops anybody from producing said capability and contributing it back to the project. In fact, there's even some verbage to that effect on the Protocol Buffers' FAQ page. But without it coming out of the box, it's not fair to claim language- and platform-neutrality, unless, of course, they are willing to suggest that COM's Structured Storage was also language- and platform-neutral, and that it wasn't Microsoft's fault that nobody went off and built implementations for it under *nix or the Mac. Frankly, any binary format, regardless of how convoluted, could be claimed to be language- and platform-neutral under those conditions, which I think makes the claim spurious to make. The fact that Google doesn't care about the .NET platform doesn't mean that their implementation is incapable of running on it; the fact that Google wrote their code generator to support C++, Java and Python doesn't mean that Protocol Buffers are language- and platform-neutral, either. XML still holds the edge here, by a long shot--until we see implementations of Protocol Buffers for Perl/Parrot, C, D, .NET, Ruby, JavaScript, mainframes and others, PBs will have to take second-place status behind XML in terms of "reach" across the wide array of systems. Does that diminish their usefulness? Hardly. It just depends on how far a developer wants their data format to stretch.

Having gotten that little bit out of the way ...

Remember a time when binary formats were big? Back in the early- to mid-90's, binary formats were all the rage, and criticism of them abounded: they were hard to follow, hard to debug, bloated, inherently inflexible, tightly-coupled, and so on. Consider, for example, this article comparing XML against its predecessor information-exchange technologies:

A great example ... is a data feed of all orders taken electronically via a web site into an internal billing and shipping system. The main advantage of XML here is it's flexibility - developers create their own tags and dictionaries, as they deem necessary. Therefore no matter what type of data is being transferred, the right XML representation of it will accurately describe the data.

Logically, each order can be described as one customer purchasing one or more items using their credit card and potentially entering different billing and shipping addresses. The contents of the file are very easy to read, even for a person who is not familiar with XML. The information within each of the order tags is well structured and organized. This enables developers to use parsing components and easily access any data within the document. Each item in the order is logically a unique entity, and is also represented with a separate tag. All item properties are defined as "child" nodes of the item tag.

XML is the language of choice for two major reasons. First of all, an XML formatted document can be easily processed under any OS, and in any language, as long as XML parsing components are available. On the other hand, XML files are still raw data, which enables merchants to format the data any way they want to. All in all, document structure and wide acceptance of this format has made it possible to enable customers to build more efficient internal order Tracking systems based on XML-formatted order files. Other online merchant sites are making similar functionality available on their Web sites as well.

And yet, now, the criticism goes the other way, complaining that XML is bloated, hard to follow, and so on:

Protocol buffers are a flexible, efficient, automated mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages. You can even update your data structure without breaking deployed programs that are compiled against the "old" format.

How do we reconcile these apparently contradictory positions?

First of all, I don't think anybody in the XML community has ever sought to argue that XML was designed for efficiency; in fact, quite the opposite, as the XML specification itself states in the abstract:

The Extensible Markup Language (XML) is a subset of SGML that is completely described in this document. Its goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. XML has been designed for ease of implementation and for interoperability with both SGML and HTML. ... This document specifies a syntax created by subsetting an existing, widely used international text processing standard (Standard Generalized Markup Language, ISO 8879:1986(E) as amended and corrected) for use on the World Wide Web.

The Introduction section is even more explicit:

XML documents are made up of storage units called entities, which contain either parsed or unparsed data. Parsed data is made up of characters, some of which form character data, and some of which form markup. Markup encodes a description of the document's storage layout and logical structure. XML provides a mechanism to impose constraints on the storage layout and logical structure.

And if that wasn't clear enough....

The design goals for XML are:

  1. XML shall be straightforwardly usable over the Internet.
  2. XML shall support a wide variety of applications.
  3. XML shall be compatible with SGML.
  4. It shall be easy to write programs which process XML documents.
  5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
  6. XML documents should be human-legible and reasonably clear.
  7. The XML design should be prepared quickly.
  8. The design of XML shall be formal and concise.
  9. XML documents shall be easy to create.
  10. Terseness in XML markup is of minimal importance.

In essence, then, the goal of XML was never to be small or fast, but still clearly simple. And, despite your personal opinion about the ecosystem that has grown up around XML (SOAP, WS-*, and so on), it's still fairly easy to defend the idea that XML itself is a simple technology, particularly if we make some basic assumptions around things that usually complicate text like character sets and encoding and such.

Note: I am deliberately ignoring the various attempts at a binary Infoset specification, which has been on the TODO list of many XML working groups in the past and has yet to really make any sort of impact in the industry. Theoretically, yes, XML Infoset-compliant documents could be rendered into a binary format that would have a much smaller, more efficient footprint than the textual representation. If and when we ever get there, it would be interesting to see what the results look like. I'm not holding my breath.

Why, then, did XML take on a role as data-transfer format if, on the surface of things, using text here was such a bad idea?

Interoperability:

"With Web services your accounting departments Win 2k servers billing system can connect with your IT suppliers UNIX server." --http://www.w3schools.com/webservices/ws_why.asp

"Conventional application development often means developing for multiple devices, and that form factors of the client devices can be dramatically different. If you base an application on a web browser display size of 800x600, it would never work on a device with a resolution of 4 lines by 20 characters. Conversely, if you took a lowest common denominator approach and sized for the smaller device, the user interface would be lost on an 800x600 device.Using a non-XML approach, this leaves us writing multiple clients speaking to the server, or writing multiple clients speaking to multiple severs ... Limitations of this approach include:

  • Tightly coupled to browser
  • Multiple code-bases
  • Difficult to adapt to new devices
  • Major development efforts
  • Slow time to market

"In this case, only one application exists. It runs against the back-end database, and produces an XML stream. A "translator" takes this XML stream, and applies an XSLT transformation to it. Every device could either use a generic XSLT, or have a specialized XSLT that would produce the required device-specific output. The transformation occurs on the server, meaning that no special client capabilities are required.This "hub and spoke" architecture yields tremendous flexibility. When a new device appears, anew spoke can be added to accommodate it. The application itself does not need to be changed, only the translator needs to be informed about the existence of the new device, and which XSLT to use for it." --http://www.topxml.com/conference/wrox/2000_vegas/text/brianl_xml.pdf

Certainly the interoperability argument doesn't require a text-based format, it was just always cited that way. In fact, both the CORBA supporters and the developers over at ZeroC will both agree with Google in suggesting that a binary format can and will be an efficient and effective interoperability format. (I'll let the ZeroC folks talk to the pros and cons of their Ice format as compared to IIOP and SOAP.)

Some of the key inherent advantages in XML that are lost in this new binary format, however, center around the XML Infoset itself and the fact that it has a number of ancillary tools around it, which gets us into the next point; consider some of those inherent advantages in XML that are lost in this new binary, structured, tightly-coupled format:

  • XPath. The ability to selectively extract nodes from inside a document is incredibly powerful and highly underutilized by most developers.
  • XSLT. Although somewhat on the down-and-out in popularity with developers because of its complexity, XSLT stands as a shining example of how a general-purpose transformation tool can be built once the basic structure is well-understood and independent of the actual domain. (SQL is another.)
  • Structureless parsing. In order to use Protocol Buffers, each side must have a copy of the .proto file that generated the proxies. (While it's theoretically possible to build an API that picks through the structure element-by-element in the same way an XML API does, I haven't found such an API in the PB code yet, and doing so would mean spending some quality time with the Protocol Buffer binary encoding format. The same was always true of IIOP or the DCOM MEOW format, too.)

In essence, a Protocol Buffer format consumer is tightly-coupled against the .proto file that it was generated against, whereas in XML we can follow Noah Mendelsohn's advice of years ago ("Schema are relative") and parse XML in whatever manner suits the consumer, with or without schema, without or without schema-based-and-bound proxy code. The advantage to the XML approach, of course, is that it provides a degree of flexibility; the advantage of the Protocol Buffer approach is that the code to produce and consume the elements can be much simpler, and therefore, faster.

Note: it's only fair to point out at this point that the Protocol Buffer approach already contains a certain degree of flexibility that earlier binary formats lacked (if I remember correctly), via PB's use of "optional" vs "required" tags for the various fields. Whether this turns out to be sufficient over time is yet to be decided, though Google claims to be using Protocol Buffers throughout the company for its own internal purposes, which is certainly supporting evidence that cannot be discarded casually.

But there's a corollary effect here, as well: because XML documents are intended to be self-descriptive, the Protocol Buffer format can contain just the data, and leave the format and structure to be enforced by the code on either side of the producer/consumer relationship. Whether you consider this a Good Thing or a Bad Thing probably stands as a good indicator of whether you like the Protocol Buffer approach or the XML approach better.

Note, too, by the way, that many of the XML-based binding APIs can now parse objects out of part of an XML document, as opposed to having to read in the whole thing and then convert--JAXB, JiBX and XMLBeans can all pick up parsing an XML document from any node beyond the initial root element, for example--whereas, at least as of this point (as far as I can tell, and I'd love to have somebody at Google tell me I'm wrong here, because I think it's a major flaw), the Protocol Buffers approach assumes it will read in the entire object from the file. I don't see any way, short of putting developer-specified "separation markers" into the stream or some other kind of encoding or convention, of doing a "partial read" of an object model from a PB data file.

To see what I mean by this, consider the AddressBook example. Suppose the AddressBook holds several thousand records, and my processing system only cares about a select few (less than five, perhaps, who all have the last name "Neward"). In a Protocol Buffer scheme, I deserialize the entire AddressBook, then go through the persons item by item, looking for the one I want. In an XML-based scheme, I can pull-mode parse (StAX in Java, or using the pull-mode XML parser in .NET, for example) the nodes, throwing away nodes until I see one where the <lastName> node contains "Neward", and then JAXB-consume the next n number of nodes into a Person object before continuing through the remainder of the AddressBook.

Let's also note that the Protocol Buffer scheme assumes working with a stream-based (which usually means file-based) storage style for when Protocol Buffers are used as a storage mechanism. But frankly, if I want to store objects, I'd rather use a system that understands objects a bit more deeply than PBs do, and gives me some additional support beyond just input and output. This gets us into the long and involved discussion around object databases, which is another one that's likely to devolve into a shouting match, so I'll leave it at that. Suffice it to say that for object storage, I can see using (for example) db4o-storing-to-a-file as a vastly more long-term solution than I can using PBs, at least for now. (Undoubtedly the benchmarks will be along soon to try and convince us one way or another.) One area that is of particular interest along these lines, though, will be the evolutionary capabilities of each--from my (limited) study of PBs thus far, I believe db4o has a more flexible evolutionary scheme, and I have to admit, I don't like the idea of having to run the codegen step before being able to store my objects, but that's a minor nit that's easily solved with tools like Ant and a global rule saying "Nobody touches the .proto file without checking with me first."

Which, by the way, brings up another problem, the same one that plagues CORBA, COM/DCOM, WSDL-based services, and anything that relies on a shared definition file that is used for code-generation purposes, what I often call The Myth of the One True Schema. Assuming a developer creates a working .proto/.idl/.wsdl definition, and two companies agree on it, what happens when one side wants to evolve or change that definition? Who gets to decide the evolutionary progress of that file? Who "owns" that definition, in effect? And this, of course, presumes that we can even get some kind of definition as to what a "Customer" looks like across the various departments of the company in the first place, much less across companies. Granted, the "optional" tag in PBs help with this, but we're still stuck with an inherently unscalable problem as the number of participants in the system grows.

I'd give a great deal of money to see what the 12,000-odd .proto files look like inside Google, and then again at what they look like in five years, particularly if they are handed out to paying customers as APIs against which to compile and link. There's ways to manage this, of course, but they all look remarkably like the ways we managed them back in the COM/DCOM-vs-CORBA days, too.

Long story short, the Protocol Buffer approach looks like a good one, but let's not let the details get lost in the shouting: Protocol Buffers, as with any binary protocol format and/or RPC mechanism (and I'm not going to go there; the weaknesses of RPC are another debate for another day), are great for those situations where performance is critical and both ends of the system are well-known and controlled. If Google wants to open up their services such that third-parties can call into those systems using the Protocol Buffers approach, then more power to them... but let's not lose sight of the fact that it's yet another proprietary API, and that if Microsoft were to do this, the world would be screaming about "vendor lock-in" and "lack of standards compliance". (In fact, I heard exactly these complaints from Java developers during WCF Q&A when they were told that WCF-to-WCF endpoints could "negotiate up" to a faster, binary, protocol between them.)

In the end, if you want an endpoint that is loosely coupled and offers the maximum flexibility, stick with XML, either wrapped in a SOAP envelope or in a RESTful envelope as dictated by the underlying transport (which means HTTP, since REST over anything else has never really been defined clearly by the Restafarians). If you need a binary format, then Protocol Buffers are certainly one answer... but so is ICE, or even CORBA (though this is fast losing its appeal thanks to the slow decline of the players in this space). Don't lose sight of the technical advantages or disadvantages of each of those solutions just because something has the Google name on it.




Friday, July 11, 2008 2:02:47 AM (Pacific Daylight Time, UTC-07:00)
Comments [11]  | 
 Wednesday, July 02, 2008
Polyglot Plurality

The Pragmatic Programmer says, "Learn a new language every year". This is great advice, not just because it puts new tools into your mental toolbox that you can pull out on various occasions to get a job done, but also because it opens your mind to new ideas and new concepts that will filter their way into your code even without explicit language support. For example, suppose you've looked at (J/Iron)Ruby or Groovy, and come to like the "internal iterator" approach as a way of simplifying moving across a collection of objects in a uniform way; for political and cultural reasons, though, you can't write code in anything but Java. You're frustrated, because local anonymous functions (also commonly--and, I think, mistakenly--called closures) are not a first-class concept in Java. Then, you later look at Haskell/ML/Scala/F#, which makes heavy use of what Java programmers would call "static methods" to carry out operations, and realize that this could, in fact, be adapted to Java to give you the "internal iteration" concept over the Java Collections:

   1: package com.tedneward.util;
   2:  
   3: import java.util.*;
   4:  
   5: public interface Acceptor
   6: {
   7:   public void each(Object obj);
   8: }
   9:  
  10: public class Collection
  11: {
  12:   public static void each(List list, Acceptor acc)
  13:   {
  14:     for (Object o : list)
  15:       acc.each(o);
  16:   }
  17: }

Where using it would look like this:

   1: import com.tedneward.util.*;
   2:  
   3: List personList = ...;
   4: Collection.each(new Accpetor() {
   5:   public void each(Object person) {
   6:     System.out.println("Found person " + person + ", isn't that nice?");
   7:   }
   8: });

Is it quite as nice or as clean as using it from a language that has first-class support for anonymous local functions? No, but slowly migrating over to this style has a couple of definitive effects, most notably that you will start grooming the rest of your team (who may be reluctant to pick up these new languages) towards the new ideas that will be present in Groovy, and when they finally do see them (as they will, eventually, unless they hide under rocks on a daily basis), they will realize what's going on here that much more quickly, and start adding their voices to the call to start using (J/Iron)Ruby/Groovy for certain things in the codebase you support.

(By the way, this is so much easier to do in C# 2.0, thanks to generics, static classes and anonymous delegates...

   1: namespace TedNeward.Util
   2: {
   3:   public delegate void EachProc<T>(T obj);
   4:   public static class Collection
   5:   {
   6:     public static void each(ArrayList list, EachProc proc)
   7:     {
   8:       foreach (Object o in list)
   9:         proc(o);
  10:     }
  11:   }
  12: }
  13:  
  14: // ...
  15:  
  16: ArrayList personList = ...;
  17: Collection.each(list, delegate(Object person) {
  18:   System.Console.WriteLine("Found " + person + ", isn't that nice?");
  19: });

... though the collection classes in the .NET FCL are nowhere near as nicely designed as those in the Java Collections library, IMHO. C# programmers take note: spend at least a week studying the Java Collections API.)

 

This, then, opens the much harder question of, "Which language?" Without trying to infer any sort of order or importance, here's a list of languages to consider, with URLs where applicable; I invite your own suggestions, by the way, as I'm sure there's a lot of languages I don't know about, and quite frankly, would love to. The "current hotness" is to learn the languages marked in bold, so if you want to be daring and different, try one of those that isn't. (I've provided some links, but honestly it's kind of tiring to put all of them in; just remember that Google is your friend, and you should be OK. :-) )

  • Visual Basic. Yes, as in Visual Basic--if you haven't played with dynamic languages before, try turning "Option Strict Off", write some code, and see how interacting with the .NET FCL suddenly changes into a duck-typed scenario. If you're really curious, have a look at the generated code in Reflector or ILDasm, and notice how the generated code looks a lot like the generated JVM code from other dynamic languages on an execution environment, a la Groovy.
  • Ruby (JRuby, IronRuby):
  • Groovy: Some call this "javac 2.0"; I'm not sure it merits that title, or the assumption of the mantle of "King of the JVM" that would seem to go with that title, but the fact is, Groovy's a useful language.
  • Scala: A "SCAlable LAnguage" for the JVM (and CLR, though that feature has been left to the community to support), incorporating both object-oriented and functional concepts, plus a few new ideas, into a single package. I'm obviously bullish on Scala, given the talks and articles I've done on it.
  • F#: Originally OCaml-on-the-CLR, now F# is starting to take on a personality of its own as Microsoft productizes it. Like Scala and Erlang, F# will be immediately applicable in concurrency scenarios, I think. I'm obviously bullish on F#, given the talks, articles, and book I'm doing on it.
  • Erlang: Functional language with a strong emphasis on parallel processing, scalability, and concurrency.
  • Perl: People will perhaps be surprised I say this, given my public dislike of Perl's syntax, but I think every programmer should learn Perl, and decide for themselves what's right and what's wrong about Perl. Besides, there's clearly no argument that Perl is one of the power tools in every *nix sysadmin's toolbox.
  • Python: Again, given my dislike of Python's significant whitespace, my suggestion to learn it here may surprise some, but Python seems to be stepping into Perl's shoes as the sysadmin language tool of choice, and frankly, lots of people like the significant whitespace, since that's how they format their code anyway.
  • C++: The grandaddy of them all, in some ways; if you've never looked at C++ before, you should, particularly what they're doing with templates in the Boost library. As Scott Meyers once put it, "We're a long way from Stack<T>!"
  • D: Walter Bright's native-compiling garbage-collected successor to C++/Java/C#.
  • Objective-C (part of gcc): Great "other" object-oriented C-based language that never gathered the kind of attention C++ did, yet ended up making its mark on the industry thanks to Steve Jobs' love of the language and its incorporation into the NeXT (and later, Mac OS X) toolchain. Obj-C is a message-passing object language, which has some interesting implications in its own right.
  • Common Lisp (Steel Bank Common Lisp): What happens when you create a language that holds as a core principle that the language should hold no clear delineation between "code" and "data"? Or that the syntactic expression of the language should be accessible from within that langauge? You get Lisp, and if you're not sure what I'm talking about, pick up a Lisp or a Scheme implementation and start experimenting.
  • Scheme (PLT Scheme, SISC): Scheme is one of the earliest dialects of Lisp, and much of the same syntactic flexibility and power of Lisp is in Scheme, as well. While the syntaxes are usually not directly interchangeable, they're close enough that learning one is usually enough.
  • Clojure: Rich Hickey (who also built "dotLisp" for the CLR) has done an amazing job of bringing Lisp to the JVM, including a few new ideas, such as some functional concepts and an implementation of software transactional memory, among other things.
  • ECMAScript (E4X, Rhino, ES4): If you've never looking at JavaScript outside of the browser, you're in for a surprise--as Glenn Vanderburg put it during one of his NFJS talks, "There's a real programming language in there!". I'm particularly fond of E4X, which integrates XML as a native primitive type, and the Rhino implementation fully supports it, which makes it attractive to use as an XML services implementation language.
  • Haskell (Jaskell): One of the original functional languages. Learning this will give a programmer a leg up on the functional concepts that are creeping into other environments. Jaskell is an implementation of Haskell on the JVM, and they've taken the concept of functional further, creating a build system ("Neptune") on top of Jaskell + Ant, to yield a syntax that's... well... more Haskellian... for building Java projects. (Whether it's better/cleaner than Ant is debatable, but it certainly makes clear the functional nature of build scripts.)
  • ML: Another of the original functional languages. Probably don't need to learn this if you learn Haskell, but hey, it can't hurt.
  • Heron: Heron is interesting because it looks to take on more of the modeling aspects of programming directly into the language, such as state transitions, which is definitely a novel idea. I'm eagerly looking forward to future drops. (I'm not so interested in the graphical design mode, or the idea of "executable UML", but I think there's a vein of interesting ideas here that could be mined for other languages that aren't quite so lofty in scope.)
  • HaXe: A functional language that compiles to three different target platforms: its own (Neko), Flash, and/or Javascript (for use in Web DOMs).
  • CAL: A JVM-based statically-typed language from the folks who bring you Crystal Reports.
  • E: An interesting tack on distributed systems and security. Not sure if it's production-ready, but it's definitely an eye-opener to look at.
  • Prolog: A language built around the idea of logic and logical inference. Would love to see this in play as a "rules engine" in a production system.
  • Nemerle: A CLR-based language with functional syntax and semantics, and semantic macros, similar to what we see in Lisp/Scheme.
  • Nice: A JVM-based language that permits multi-dispatch methods, sometimes known as multimethods.
  • OCaml: An object-functional fusion that was the immediate predecessor of F#. The HaXe and MTASC compilers are both built in OCaml, and frankly, it's in a startlingly small number of lines of code, highlighting how appropriate functional languages are for building compilers and interpreters.
  • Smalltalk (Squeak, VisualWorks, Strongtalk): Smalltalk was widely-known as "the O-O language that all the C guys turned to in order to learn how to build object-oriented programs", but very few people at the time understood that Smalltalk was wildly different because of its message-passing and loosely/un-typed semantics. Now we know better (I hope). Have a look.
  • TCL (Jacl): Tool Command Language, a procedural scripting language that has some nice embedding capabilities. I'd be curious to try putting a TCL-based language in the hands of end users to see if it was a good DSL base. The Jacl implementation is built on top of the JVM.
  • Forth: The original (near as I can tell) stack-based language, in which all execution happens on an execution stack, not unlike what we see in the JVM or CLR. Given how much Lisp has made out of the "atoms and lists" concept, I'm curious if Forth's stack-based approach yields a similar payoff.
  • Lua: Dynamically-typed language that lives to be embedded; known for its biggest embedder's popularity: World of Warcraft, along with several other games/game engines. A great demonstration of the power of embedding a language into an engine/environment to allow users to create emergent behavior.
  • Fan: Another language that seeks to incorporate both static and dynamic typing, running on top of both the JVM or the CLR.
  • Factor: I'm curious about Factor because it's another stack-based language, with a lot of inspiration from some of the other languages on this list.
  • Boo: A Python-inspired CLR language that Ayende likes for domain-specific languages.
  • Cobra: A Python-inspired language that seeks to encompass both static and dynamic typing into one language. Fascinating stuff.
  • Slate: A "prototype-based object-oriented programming language based on Self, CLOS, and Smalltalk-80." Apparently on hold due to loss of interest from the founder, last release was 0.3.5 in August of 2005.
  • Joy: Factor's primary inspiration, another stack-based language.
  • Raven: A scripting language that "rips off" from Python, Forth, Perl, and the creator's own head.
  • Onyx: "Onyx is a powerful stack-based, multi-threaded, interpreted, general purpose programming language similar to PostScript. It can be embedded as an extension language similarly to ficl (Forth), guile (scheme), librep (lisp dialect), s-lang, Lua, and Tcl."
  • LOLCode: No, you won't use LOLcode on a project any time soon, but LOLCode has had so many different implementations of it built, it's a great practice tool towards building your own languages, a la DSLs. LOLcode has all the basic components a language would use, so if you can build a parser, AST and execution engine (either interpreter or compiler) for LOLcode, then you've got the basic skills in place to build an external DSL.

There's more, of course, but hopefully there's something in this list to keep you busy for a while. Remember to send me your favorite new-language links, and I'll add them to the list as seems appropriate.

Happy hacking!


.NET | C++ | F# | Flash | Java/J2EE | Languages | LLVM | Mac OS | Parrot | Ruby | Visual Basic | Windows

Wednesday, July 02, 2008 7:13:10 PM (Pacific Daylight Time, UTC-07:00)
Comments [2]  | 
The power of Office as a front-end

I recently had the pleasure of meeting Bruce Wilson, a principal with iLink, and we had a pleasant conversation about enterprise applications and trends and such. Last week, in the middle of my trip to Prague and Zurich, he sent me a link to a blog entry he'd written on using Office as a front-end, and it sort of underscored some ideas I've had around Office in general.

The interesting thing is, most of the ideas he talks about here could just as easily be implemented on top of a Java back-end, or a Ruby back-end, as a .NET back-end. Office is a tool that many end-users "get" right away (whether you agree with Microsoft's user interface metaphors or not, or even like the fact that Office is one of the most widely-installed software packages on the planet), and it has a lot of support infrastructure built in. "Mashup" doesn't have to mean YouTube on your website; in fact, I dislike the term "mashup" entirely, since it sounds like something done in the heat of the moment without any planning or thought (which is the antithesis of anything that goes--or should go--into the enterprise). Can we use the term "cardinality" instead? Please?


.NET | Java/J2EE | Windows | XML Services

Wednesday, July 02, 2008 6:17:23 PM (Pacific Daylight Time, UTC-07:00)
Comments [2]  | 
 Wednesday, June 25, 2008
The ultimate thin client

One of the things that I like about the idea of building a DSL is the idea of users being able to express, in fairly user-friendly terms, the actions they want to take. For example, Daniel Spiewak has a great example of a DSL built in Scala using Scala's parser combinators, and the resulting text, while certainly not English, is a very readable form. But in of itself, it seems it's been a hard sell to the general community, who look at GUIs as a far more intuitive way of doing things. (Note: I disagree with this; I don't think GUIs are more intuitive, I think GUIs are more self-explanatory, once you've learned a few basic principles, like moving the mouse, clicking the button, and recognizing which elements are clickable and which aren't.)

I think I've finally figured out where an English- (or other human spoken language) driven DSL can be far more powerful and intuitive than a GUI.

Voice. Or, specifically the world of telecommunication devices as a user interface device. Not as "putting-a-GUI-on-a-phone"; I think this is a red herring and ultimately unproductive line of research, iPhones notwithstanding. I mean, literally, talking to the computer.

Imagine a field repair agent, coming off of a repair call, calling back to the office to say the repair was done: "Ticket number 451123, status complete, note Mrs Johnson really needs to stop washing her clothes in the dishwasher." Hanging up, he moves on to the next ticket in the list.

Meanwhile, on the other end, voice-analysis software has done the basic job of transforming words into a line of text, which is fed to the DSL for processing.

Or, the field agent texts the message to a company account, which again passes it directly to the DSL for further processing.

I am firmly convinced that this style of user interface--one we use every day--is the way that mobile devices should interact with enterprise systems. Forget trying to do complex GUIs on a device, forget even trying to simplify down the complex GUI into a simple GUI--just use your voice and a well-understood shared protocol (the DSL itself).

It's the ultimate thin client.




Wednesday, June 25, 2008 3:48:19 AM (Pacific Daylight Time, UTC-07:00)
Comments [1]  | 
 Tuesday, June 24, 2008
Twittering

Freshly Twittering

Username is tedneward

Come follow my thoughts




Tuesday, June 24, 2008 7:46:13 PM (Pacific Daylight Time, UTC-07:00)
Comments [0]  | 
Let the Great Language Wars commence....

As Amanda notes, I'm riding with 46 other folks (and lots of beer) on a bus from Michigan to devLink in Tennessee, as part of sponsoring the show. (I think she got my language preferences just a teensy bit mixed up, though.)

Which brings up a related point, actually: Amanda (of "the great F# T-shirt" fame from TechEd this year) and I are teaming up to do F# In A Nutshell for O'Reilly. The goal is to have a Rough Cut ready (just the language parts) by the time F# goes CTP this summer or fall, so we're on an accelerated schedule. If you don't see much from me via the blog for a while, now you know why. :-) Once that's done, I'm going dark on a Scala book to follow--details to follow when that contract is nailed down.

Meanwhile.... As she suggests, the bus will likely be filled with lots of lively debate. The nice thing about having a technical debate with drunk geeks on a bus traveling down the highway at speed is that it's actually pretty easy to win the debate, if you really want to:

"You are such an idiot! Object-relashunal mappers are just... *burp* so cool! Why can't you see that?"

"Idiot, am I? I demand satisfaction! Step outside, sir!"

"Fine, you--" WHOOSH ... THUMP-THUMP....

"Next?"

I'm looking forward to this. :-)

Editor's note: (Contact Amanda if you're interested in participating on the devLink bus, not the book. Thanks for the interest, but we aren't soliciting co-authors. We think we have this one pretty well covered, but we're always interested in reviewers--for that, you can contact either of us.)


.NET | C++ | Conferences | F# | Java/J2EE |