The opinions expressed herein are my own personal opinions and do not represent
my employer's view in any way.
© Copyright 2008, Ted Neward
 Wednesday, July 16, 2008
Object.hashCode implementation
After the previous post, I just had to look. The implementation of Object.equals is, as was previously noted, just "return this == obj", but the implementation of Object.hashCode is far more complicated.
Taken straight from the latest hg-pulled OpenJDK sources, Object.hashCode is a native method registered from Object.c that calls into a Hotspot-exported function, JVM_IHashCode(), from hotspot\src\share\vm\prims\jvm.cpp:
    JVM_ENTRY(jint, JVM_IHashCode(JNIEnv* env, jobject handle))
      JVMWrapper("JVM_IHashCode");
      // as implemented in the classic virtual machine; return 0 if object is NULL
      return handle == NULL ? 0 : ObjectSynchronizer::FastHashCode (THREAD, JNIHandles::resolve_non_null(handle)) ;
    JVM_END
which in turn calls ObjectSynchronizer::FastHashCode, defined in hotspot\src\share\vm\runtime\synchronizer.cpp as:
    intptr_t ObjectSynchronizer::FastHashCode (Thread * Self, oop obj) {
      if (UseBiasedLocking) {
        // NOTE: many places throughout the JVM do not expect a safepoint
        // to be taken here, in particular most operations on perm gen
        // objects. However, we only ever bias Java instances and all of
        // the call sites of identity_hash that might revoke biases have
        // been checked to make sure they can handle a safepoint. The
        // added check of the bias pattern is to avoid useless calls to
        // thread-local storage.
        if (obj->mark()->has_bias_pattern()) {
          // Box and unbox the raw reference just in case we cause a STW safepoint.
          Handle hobj (Self, obj) ;
          // Relaxing assertion for bug 6320749.
          assert (Universe::verify_in_progress() ||
                  !SafepointSynchronize::is_at_safepoint(),
                  "biases should not be seen by VM thread here");
          BiasedLocking::revoke_and_rebias(hobj, false, JavaThread::current());
          obj = hobj() ;
          assert(!obj->mark()->has_bias_pattern(), "biases should be revoked by now");
        }
      }
      // hashCode() is a heap mutator ...
      // Relaxing assertion for bug 6320749.
      assert (Universe::verify_in_progress() ||
              !SafepointSynchronize::is_at_safepoint(), "invariant") ;
      assert (Universe::verify_in_progress() ||
              Self->is_Java_thread() , "invariant") ;
      assert (Universe::verify_in_progress() ||
              ((JavaThread *)Self)->thread_state() != _thread_blocked, "invariant") ;
      ObjectMonitor* monitor = NULL;
      markOop temp, test;
      intptr_t hash;
      markOop mark = ReadStableMark (obj);
      // object should remain ineligible for biased locking
      assert (!mark->has_bias_pattern(), "invariant") ;
      if (mark->is_neutral()) {
        hash = mark->hash();              // this is a normal header
        if (hash) {                       // if it has hash, just return it
          return hash;
        }
        hash = get_next_hash(Self, obj);  // allocate a new hash code
        temp = mark->copy_set_hash(hash); // merge the hash code into header
        // use (machine word version) atomic operation to install the hash
        test = (markOop) Atomic::cmpxchg_ptr(temp, obj->mark_addr(), mark);
        if (test == mark) {
          return hash;
        }
        // If atomic operation failed, we must inflate the header
        // into heavy weight monitor. We could add more code here
        // for fast path, but it does not worth the complexity.
      } else if (mark->has_monitor()) {
        monitor = mark->monitor();
        temp = monitor->header();
        assert (temp->is_neutral(), "invariant") ;
        hash = temp->hash();
        if (hash) {
          return hash;
        }
        // Skip to the following code to reduce code size
      } else if (Self->is_lock_owned((address)mark->locker())) {
        temp = mark->displaced_mark_helper(); // this is a lightweight monitor owned
        assert (temp->is_neutral(), "invariant") ;
        hash = temp->hash();              // by current thread, check if the displaced
        if (hash) {                       // header contains hash code
          return hash;
        }
        // WARNING:
        // The displaced header is strictly immutable.
        // It can NOT be changed in ANY cases. So we have
        // to inflate the header into heavyweight monitor
        // even the current thread owns the lock. The reason
        // is the BasicLock (stack slot) will be asynchronously
        // read by other threads during the inflate() function.
        // Any change to stack may not propagate to other threads
        // correctly.
      }
      // Inflate the monitor to set hash code
      monitor = ObjectSynchronizer::inflate(Self, obj);
      // Load displaced header and check it has hash code
      mark = monitor->header();
      assert (mark->is_neutral(), "invariant") ;
      hash = mark->hash();
      if (hash == 0) {
        hash = get_next_hash(Self, obj);
        temp = mark->copy_set_hash(hash); // merge hash code into header
        assert (temp->is_neutral(), "invariant") ;
        test = (markOop) Atomic::cmpxchg_ptr(temp, monitor, mark);
        if (test != mark) {
          // The only update to the header in the monitor (outside GC)
          // is install the hash code. If someone add new usage of
          // displaced header, please update this code
          hash = test->hash();
          assert (test->is_neutral(), "invariant") ;
          assert (hash != 0, "Trivial unexpected object/monitor header usage.");
        }
      }
      // We finally get the hash
      return hash;
    }
Hope this answers all the debates. 
Editor's note: Yes, I know it's a long quotation of code completely out of context; my goal here is simply to suggest that the hashCode() implementation is not just an integerification of the object's address in memory, as was suggested in other discussions. For whatever it's worth, the get_next_hash() implementation that's referenced in the FastHashCode() method looks like:
    // hashCode() generation :
    //
    // Possibilities:
    // * MD5Digest of {obj,stwRandom}
    // * CRC32 of {obj,stwRandom} or any linear-feedback shift register function.
    // * A DES- or AES-style SBox[] mechanism
    // * One of the Phi-based schemes, such as:
    //   2654435761 = 2^32 * Phi (golden ratio)
    //   HashCodeValue = ((uintptr_t(obj) >> 3) * 2654435761) ^ GVars.stwRandom ;
    // * A variation of Marsaglia's shift-xor RNG scheme.
    // * (obj ^ stwRandom) is appealing, but can result
    //   in undesirable regularity in the hashCode values of adjacent objects
    //   (objects allocated back-to-back, in particular). This could potentially
    //   result in hashtable collisions and reduced hashtable efficiency.
    //   There are simple ways to "diffuse" the middle address bits over the
    //   generated hashCode values:
    //
    static inline intptr_t get_next_hash(Thread * Self, oop obj) {
      intptr_t value = 0 ;
      if (hashCode == 0) {
        // This form uses an unguarded global Park-Miller RNG,
        // so it's possible for two threads to race and generate the same RNG.
        // On MP system we'll have lots of RW access to a global, so the
        // mechanism induces lots of coherency traffic.
        value = os::random() ;
      } else if (hashCode == 1) {
        // This variation has the property of being stable (idempotent)
        // between STW operations. This can be useful in some of the 1-0
        // synchronization schemes.
        intptr_t addrBits = intptr_t(obj) >> 3 ;
        value = addrBits ^ (addrBits >> 5) ^ GVars.stwRandom ;
      } else if (hashCode == 2) {
        value = 1 ; // for sensitivity testing
      } else if (hashCode == 3) {
        value = ++GVars.hcSequence ;
      } else if (hashCode == 4) {
        value = intptr_t(obj) ;
      } else {
        // Marsaglia's xor-shift scheme with thread-specific state
        // This is probably the best overall implementation -- we'll
        // likely make this the default in future releases.
        unsigned t = Self->_hashStateX ;
        t ^= (t << 11) ;
        Self->_hashStateX = Self->_hashStateY ;
        Self->_hashStateY = Self->_hashStateZ ;
        Self->_hashStateZ = Self->_hashStateW ;
        unsigned v = Self->_hashStateW ;
        v = (v ^ (v >> 19)) ^ (t ^ (t >> 8)) ;
        Self->_hashStateW = v ;
        value = v ;
      }
      value &= markOopDesc::hash_mask;
      if (value == 0) value = 0xBAD ;
      assert (value != markOopDesc::no_hash, "invariant") ;
      TEVENT (hashCode: GENERATE) ;
      return value;
    }
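Thus (hopefully) putting completely to rest the idea that the hash is always just the object's address in memory: in the code above, only the hashCode == 4 variant returns the raw address, and the hashCode == 1 variant merely mixes address bits with a random value.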
For the record, this is all from the OpenJDK source base--naturally, it's possible that earlier VM implementations did something entirely different.
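For anybody who'd rather poke at the final branch of get_next_hash() from the Java side, here is a minimal sketch of the same xor-shift step. It is only a port for illustration: the class name, the seed values, and the mask passed to toHeaderHash() are my own stand-ins (the real per-thread seeds and the value of markOopDesc::hash_mask are not shown in the code quoted above), and C++'s unsigned right shift becomes >>> in Java.

```java
// Illustrative port of the xor-shift branch of get_next_hash(); not HotSpot code.
final class XorShiftHashSketch {
    // Stand-ins for the per-thread _hashStateX.._hashStateW fields.
    private int x, y, z, w;

    XorShiftHashSketch(int x, int y, int z, int w) {
        this.x = x;
        this.y = y;
        this.z = z;
        this.w = w;
    }

    // One step of the generator, mirroring the quoted C++ line for line.
    int next() {
        int t = x;
        t ^= (t << 11);
        x = y;
        y = z;
        z = w;
        int v = w;
        v = (v ^ (v >>> 19)) ^ (t ^ (t >>> 8)); // >>> because the C++ operands are unsigned
        w = v;
        return v;
    }

    // Mirrors the tail of get_next_hash(): mask the value, and never hand out zero.
    // The mask is a caller-supplied assumption, not the real hash_mask.
    static int toHeaderHash(int raw, int mask) {
        int value = raw & mask;
        return value == 0 ? 0xBAD : value;
    }

    public static void main(String[] args) {
        XorShiftHashSketch state = new XorShiftHashSketch(1, 2, 3, 4); // arbitrary seeds
        for (int i = 0; i < 3; i++) {
            System.out.println(toHeaderHash(state.next(), 0x7FFFFFFF));
        }
    }
}
```

Notice that nothing in this branch looks at the object at all; the value depends only on the generator state, which is why the resulting hash has to be stored in the object header once it has been handed out.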
Java/J2EE
Wednesday, July 16, 2008 1:18:19 AM (Pacific Daylight Time, UTC-07:00)
 Tuesday, July 15, 2008
Of Zealotry, Idiocy, and Etiquette...
I'm not sure what it is about our industry that promotes the flame war, but for some reason exchanges like this one, unheard of in any other industry I've ever touched (even tangentially), are far too common, too easy to get into, and entirely too counterproductive.
I'm not going to weigh in on one side or the other here; frankly, I have a hard time following the debate and figuring out who's exactly arguing for what. I can see, however, that the entire debate follows some traditional patterns of the flame war:
- Citing yourself as the final authority. At no point during the debate does anybody reach for their copy of Effective Java, a widely-accepted source of Java guidance, for a potential resolution to the discussion. Instead, the various players simply say, "Fact A is true" or "Fact A is false", with zero supporting information, citations, or demonstrations either way. (A few people cite the Javadoc, but there is enough ambiguity there to merit further citation.)
- Refusal to accept the possibility of an alternative viewpoint. At no point, near as I can tell, did any of the participants bother to say, "You know, you could be right, but I remain unconvinced. Can you give me more information to support your point of view?" The entire time, everybody is arguing from "fact", and nobody even considers the possibility that different JVMs can have different implementations, despite the fact that the Javadoc being quoted says as much.
- Degeneration into personal attacks. I don't care who started it, I don't care who called who the worse name. Fact is, reasonable people can reasonably disagree, and nobody in that transcript seemed overly reasonable to me.
- Nobody ever really gets around to answering the question because they're too busy arguing their position or point. Poor "doub", the initiator of the question, tries valiantly to circle the conversation back on topic, but the various players are too busy whipping out their instruments of manhood onto the table so everybody can see how much bigger it is than the other guys'. When "doub" points out that writing some sample code "gave me a very loose but still usefull information about my object, and took less time than the conversation about my question", or in other words, "Hey, guys, I kinda already got my answer, can we move on now?", the conversation continues as if the comment never occurred--the question has turned into a "biggest-geek" argument by this point. "doub" even asks, at 10:12:12, "do i get bad karma points for being the initiator of a conflict?", and the image I get in my head is that of the poor kid, hiding in his bedroom while his parents yell and scream downstairs, feeling awful because the fight started over his backpack lying in the hallway where Mom told him to put it and Dad thought he left it instead of putting it away. ("doub", if you read this, no, you get no bad karma points, at least not in my universe.)
The interesting thing, though, is that this conversation has nothing to do with Scala. "dysinger" twitters:
Frankly, "dysinger", it's kinda hard to have much sympathy for somebody when they blame the language or tool for a conversation that's had around it; this would be like blaming Python, the language, for the community around it (which some people do, I understand). I can understand the frustration, on both sides, since everybody was essentially arguing past one another, but why is that Scala's fault, pray tell?
And frankly, I find the dig at the academics to be a tad disingenuous. Yes, academics have a reputation--duly earned in some cases--of being removed from reality and the slings and arrows of a life spent developing software for production environments, but name for me a language in the popular mainstream that doesn't owe a huge debt to the preliminary work laid down by academics before it. In every other industry, academics are revered and honored--it's only in this industry they are used as an example of degradation and insult. Way to bite the hand that makes your life easier, folks....
At the end of the day, these kinds of debates do nothing but harm the innocent, "doub", in this case.
"dysinger", "DrMacIver", "JamesIry", all of you, right or wrong, didn't exactly cover yourselves in glory, nor did you really convince anybody of anything. Instead, you shouted at each other really loudly, made lots of noise, got angry over nothing in particular, and really failed to achieve much of anything. Regardless of your intentions, now Scala, Java, the JVM and the entire ecosystem have seen their reputation tarnished just a touch more than it was when you started. Great job. Here's a tip for all of you: Try listening.
Java/J2EE
Tuesday, July 15, 2008 11:18:43 PM (Pacific Daylight Time, UTC-07:00)
 Friday, July 11, 2008
So You Say You Want to Kill XML....
Google (or at least some part of it) has now weighed in on the whole XML discussion with the recent release of their "Protocol Buffers" implementation, and, quite naturally, the debates have begun, with all the carefully-weighed logic, respectful discourse, and reasoned analysis that we've come to expect and enjoy from this industry. Yeah, right.
Anyway, without trying to take sides either way in this debate--yes, the punchline is that I believe in a world where both XML and Protocol Buffers are useful--I thought I'd weigh in on some of the aspects about PBs that are interesting/disturbing, but more importantly, try to frame some of the debate and discussions around these two topics in a vain attempt to wring some coherency and sanity out of what will likely turn into a large shouting match.
For starters, let's take a quick look at how PBs work.
Protocol Buffers 101
The idea behind PBs is pretty straightforward: given a PB definition file, a code-generator tool builds C++, Java or Python accessors/generators that know how to parse and produce files in the Protocol Buffer format. The generated classes follow a pretty standard format, using the traditional POJO/JavaBean style get/set for Java classes, and something similar for both the Python and C++ classes. (The Python implementation is a tad different from the C++ and Java versions, as it makes use of Python metaclasses to generate the class at runtime, rather than at generation-time.) So, for example, given the Google example of:
    message Person {
      required string name = 1;
      required int32 id = 2;
      optional string email = 3;
      enum PhoneType {
        MOBILE = 0;
        HOME = 1;
        WORK = 2;
      }
      message PhoneNumber {
        required string number = 1;
        optional PhoneType type = 2 [default = HOME];
      }
      repeated PhoneNumber phone = 4;
    }
... using the corresponding generated C++ class would look something like
    Person person;
    person.set_name("John Doe");
    person.set_id(1234);
    person.set_email("jdoe@example.com");
    fstream output("myfile", ios::out | ios::binary);
    person.SerializeToOstream(&output);
... and the Java implementation would look somewhat similar, except using a Builder to create the Person.
The Protocol Buffer interoperable definition language is relatively complete, with support for a full range of scalar types, enumerations, variable-length collections of these types, nested message types, and so on. Each field in the message is tagged with a unique field number, as you can see above, and the language provides the ability to "reserve" field IDs via the "extension" mechanism, presumably to allow for either side to have flexibility in extending the PB format.
There's certainly more depth to the PB format, including the "service" and "rpc" features of the language that will generate full RPC proxies, but this general overview serves to provide the necessary background to be able to engage in an analysis of PBs.
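To make the role of those field numbers a bit more concrete, here is a minimal hand-rolled sketch of how the name and id fields of the Person message above end up on the wire. It assumes the documented Protocol Buffers encoding (each field is prefixed by a varint key of (field_number << 3) | wire_type, with wire type 0 for varints and 2 for length-delimited data) and deliberately does not use the generated classes; the class and method names are hypothetical, chosen for this illustration only.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hand-rolled illustration of the PB wire format; not the generated Person class.
final class WireFormatSketch {
    static final int WIRETYPE_VARINT = 0;
    static final int WIRETYPE_LENGTH_DELIMITED = 2;

    // Base-128 varint: 7 payload bits per byte, high bit set on all but the last byte.
    static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    // The key in front of every field combines the field number and the wire type.
    static void writeKey(ByteArrayOutputStream out, int fieldNumber, int wireType) {
        writeVarint(out, ((long) fieldNumber << 3) | wireType);
    }

    static void writeString(ByteArrayOutputStream out, int fieldNumber, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeKey(out, fieldNumber, WIRETYPE_LENGTH_DELIMITED);
        writeVarint(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    static void writeInt32(ByteArrayOutputStream out, int fieldNumber, int value) {
        writeKey(out, fieldNumber, WIRETYPE_VARINT);
        writeVarint(out, value); // widened to long; negative values take ten bytes
    }

    // Encodes just the two required fields of Person: name = 1, id = 2.
    static byte[] encodePerson(String name, int id) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, 1, name);
        writeInt32(out, 2, id);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : encodePerson("John Doe", 1234)) {
            hex.append(String.format("%02X ", b));
        }
        System.out.println(hex.toString().trim());
    }
}
```

The point to take away is that the field names ("name", "id") never appear in the output; only the numbers 1 and 2 do, which is why both sides need the .proto file to make sense of the bytes.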
Protocol Buffers: An Analysis
When looking at Protocol Buffers, it's important to realize that this is an open-source implementation, and that Google has already issued a call to others to modify them and submit patches to the project. Anything that comes across as a criticism or inherent flaw in the implementation is thus correctable, though whether Google would apply those fixes to the version they use internally is an open question--there's a tension in any open-source project sponsored by a company, between "What we need the project for" and "What other people using the project need it for", and it's not always clear how that tension will play out in the long term.
So, without further ado ...
For starters, Protocol Buffers' claim to be language and/or platform-neutral is hardly justifiable, given that they have zero support for the .NET platform out of the box. Now, before the Googlists and Google-fanbois react angrily, let me be the first to admit, that yes, nothing stops anybody from producing said capability and contributing it back to the project. In fact, there's even some verbiage to that effect on the Protocol Buffers' FAQ page. But without it coming out of the box, it's not fair to claim language- and platform-neutrality, unless, of course, they are willing to suggest that COM's Structured Storage was also language- and platform-neutral, and that it wasn't Microsoft's fault that nobody went off and built implementations for it under *nix or the Mac. Frankly, any binary format, regardless of how convoluted, could be claimed to be language- and platform-neutral under those conditions, which I think makes the claim spurious to make. The fact that Google doesn't care about the .NET platform doesn't mean that their implementation is incapable of running on it; the fact that Google wrote their code generator to support C++, Java and Python doesn't mean that Protocol Buffers are language- and platform-neutral, either. XML still holds the edge here, by a long shot--until we see implementations of Protocol Buffers for Perl/Parrot, C, D, .NET, Ruby, JavaScript, mainframes and others, PBs will have to take second-place status behind XML in terms of "reach" across the wide array of systems. Does that diminish their usefulness? Hardly. It just depends on how far a developer wants their data format to stretch.
Having gotten that little bit out of the way ...
Remember a time when binary formats were big? Back in the early- to mid-90's, binary formats were all the rage, and criticism of them abounded: they were hard to follow, hard to debug, bloated, inherently inflexible, tightly-coupled, and so on. Consider, for example, this article comparing XML against its predecessor information-exchange technologies:
A great example ... is a data feed of all orders taken electronically via a web site into an internal billing and shipping system. The main advantage of XML here is it's flexibility - developers create their own tags and dictionaries, as they deem necessary. Therefore no matter what type of data is being transferred, the right XML representation of it will accurately describe the data.
Logically, each order can be described as one customer purchasing one or more items using their credit card and potentially entering different billing and shipping addresses. The contents of the file are very easy to read, even for a person who is not familiar with XML. The information within each of the order tags is well structured and organized. This enables developers to use parsing components and easily access any data within the document. Each item in the order is logically a unique entity, and is also represented with a separate tag. All item properties are defined as "child" nodes of the item tag.
XML is the language of choice for two major reasons. First of all, an XML formatted document can be easily processed under any OS, and in any language, as long as XML parsing components are available. On the other hand, XML files are still raw data, which enables merchants to format the data any way they want to. All in all, document structure and wide acceptance of this format has made it possible to enable customers to build more efficient internal order Tracking systems based on XML-formatted order files. Other online merchant sites are making similar functionality available on their Web sites as well.
And yet, now, the criticism goes the other way, complaining that XML is bloated, hard to follow, and so on:
Protocol buffers are a flexible, efficient, automated mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages. You can even update your data structure without breaking deployed programs that are compiled against the "old" format.
How do we reconcile these apparently contradictory positions?
First of all, I don't think anybody in the XML community has ever sought to argue that XML was designed for efficiency; in fact, quite the opposite, as the XML specification itself states in the abstract:
The Extensible Markup Language (XML) is a subset of SGML that is completely described in this document. Its goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. XML has been designed for ease of implementation and for interoperability with both SGML and HTML. ... This document specifies a syntax created by subsetting an existing, widely used international text processing standard (Standard Generalized Markup Language, ISO 8879:1986(E) as amended and corrected) for use on the World Wide Web.
The Introduction section is even more explicit:
XML documents are made up of storage units called entities, which contain either parsed or unparsed data. Parsed data is made up of characters, some of which form character data, and some of which form markup. Markup encodes a description of the document's storage layout and logical structure. XML provides a mechanism to impose constraints on the storage layout and logical structure.
And if that wasn't clear enough....
The design goals for XML are:
- XML shall be straightforwardly usable over the Internet.
- XML shall support a wide variety of applications.
- XML shall be compatible with SGML.
- It shall be easy to write programs which process XML documents.
- The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
- XML documents should be human-legible and reasonably clear.
- The XML design should be prepared quickly.
- The design of XML shall be formal and concise.
- XML documents shall be easy to create.
- Terseness in XML markup is of minimal importance.
In essence, then, the goal of XML was never to be small or fast, but it was clearly meant to be simple. And, whatever your personal opinion of the ecosystem that has grown up around XML (SOAP, WS-*, and so on), it's still fairly easy to defend the idea that XML itself is a simple technology, particularly if we make some basic assumptions around things that usually complicate text, like character sets and encodings.
Note: I am deliberately ignoring the various attempts at a binary Infoset specification, which has been on the TODO list of many XML working groups in the past and has yet to really make any sort of impact in the industry. Theoretically, yes, XML Infoset-compliant documents could be rendered into a binary format that would have a much smaller, more efficient footprint than the textual representation. If and when we ever get there, it would be interesting to see what the results look like. I'm not holding my breath.
Why, then, did XML take on a role as data-transfer format if, on the surface of things, using text here was such a bad idea?
Interoperability:
"With Web services your accounting departments Win 2k servers billing system can connect with your IT suppliers UNIX server." --http://www.w3schools.com/webservices/ws_why.asp
"Conventional application development often means developing for multiple devices, and that form factors of the client devices can be dramatically different. If you base an application on a web browser display size of 800x600, it would never work on a device with a resolution of 4 lines by 20 characters. Conversely, if you took a lowest common denominator approach and sized for the smaller device, the user interface would be lost on an 800x600 device.Using a non-XML approach, this leaves us writing multiple clients speaking to the server, or writing multiple clients speaking to multiple severs ... Limitations of this approach include:
- Tightly coupled to browser
- Multiple code-bases
- Difficult to adapt to new devices
- Major development efforts
- Slow time to market
"In this case, only one application exists. It runs against the back-end database, and produces an XML stream. A "translator" takes this XML stream, and applies an XSLT transformation to it. Every device could either use a generic XSLT, or have a specialized XSLT that would produce the required device-specific output. The transformation occurs on the server, meaning that no special client capabilities are required.This "hub and spoke" architecture yields tremendous flexibility. When a new device appears, anew spoke can be added to accommodate it. The application itself does not need to be changed, only the translator needs to be informed about the existence of the new device, and which XSLT to use for it." --http://www.topxml.com/conference/wrox/2000_vegas/text/brianl_xml.pdf
Certainly the interoperability argument doesn't require a text-based format; it was just always cited that way. In fact, the CORBA supporters and the developers over at ZeroC will both agree with Google in suggesting that a binary format can and will be an efficient and effective interoperability format. (I'll let the ZeroC folks talk to the pros and cons of their Ice format as compared to IIOP and SOAP.)
Some of the key inherent advantages of XML that are lost in this new binary, structured, tightly-coupled format, however, center around the XML Infoset itself and the number of ancillary tools that have grown up around it. Consider:
- XPath. The ability to selectively extract nodes from inside a document is incredibly powerful and highly underutilized by most developers.
- XSLT. Although somewhat on the down-and-out in popularity with developers because of its complexity, XSLT stands as a shining example of how a general-purpose transformation tool can be built once the basic structure is well-understood and independent of the actual domain. (SQL is another.)
- Structureless parsing. In order to use Protocol Buffers, each side must have a copy of the .proto file that generated the proxies. (While it's theoretically possible to build an API that picks through the structure element-by-element in the same way an XML API does, I haven't found such an API in the PB code yet, and doing so would mean spending some quality time with the Protocol Buffer binary encoding format. The same was always true of IIOP or the DCOM MEOW format, too.)
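As a rough idea of what "picking through the structure element-by-element" might involve, here is a minimal sketch of a schema-less walker over the binary encoding. It assumes the documented wire format (varint keys of (field_number << 3) | wire_type; wire types 0, 1, 2 and 5) and ignores the group wire types entirely. It is not an API from the PB code base; the class name and the hand-encoded SAMPLE bytes (name = "John Doe", id = 1234) are mine.

```java
import java.util.ArrayList;
import java.util.List;

// Schema-less walk over PB-encoded bytes; a sketch, not part of the PB library.
final class SchemaLessWalkerSketch {
    // Hand-encoded Person with name = "John Doe" (field 1) and id = 1234 (field 2).
    static final byte[] SAMPLE = {
        0x0A, 0x08, 'J', 'o', 'h', 'n', ' ', 'D', 'o', 'e',
        0x10, (byte) 0xD2, 0x09
    };

    private final byte[] buf;
    private int pos;

    private SchemaLessWalkerSketch(byte[] buf) {
        this.buf = buf;
    }

    private long readVarint() {
        long result = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            byte b = buf[pos++];
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IllegalStateException("varint too long");
    }

    // Reports "fieldNumber:kind" for each top-level field, with no .proto in sight.
    static List<String> walk(byte[] data) {
        SchemaLessWalkerSketch in = new SchemaLessWalkerSketch(data);
        List<String> fields = new ArrayList<>();
        while (in.pos < data.length) {
            long key = in.readVarint();
            int fieldNumber = (int) (key >>> 3);
            int wireType = (int) (key & 0x7);
            switch (wireType) {
                case 0: // varint
                    fields.add(fieldNumber + ":varint=" + in.readVarint());
                    break;
                case 1: // fixed 64-bit
                    in.pos += 8;
                    fields.add(fieldNumber + ":fixed64");
                    break;
                case 2: // length-delimited: string, bytes, or a nested message
                    int length = (int) in.readVarint();
                    in.pos += length;
                    fields.add(fieldNumber + ":bytes[" + length + "]");
                    break;
                case 5: // fixed 32-bit
                    in.pos += 4;
                    fields.add(fieldNumber + ":fixed32");
                    break;
                default:
                    throw new IllegalStateException("unsupported wire type " + wireType);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(walk(SAMPLE));
    }
}
```

Even in this form, the walker can only say "field 1 is eight bytes long"; whether those bytes are a string, raw bytes, or a nested message is exactly the information that lives in the .proto file.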
In essence, a Protocol Buffer format consumer is tightly-coupled against the .proto file that it was generated against, whereas in XML we can follow Noah Mendelsohn's advice of years ago ("Schema are relative") and parse XML in whatever manner suits the consumer, with or without schema, with or without schema-based-and-bound proxy code. The advantage to the XML approach, of course, is that it provides a degree of flexibility; the advantage of the Protocol Buffer approach is that the code to produce and consume the elements can be much simpler, and therefore, faster.
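For the XML side of that comparison, here is a minimal sketch of schema-free extraction using the XPath support that ships with the JDK (javax.xml.xpath). The document shape (addressBook/person/lastName/email) and the sample data are hypothetical, invented for this illustration; no schema or generated proxy code is involved.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Schema-free extraction with XPath; element names are made up for the example.
final class XPathSketch {
    static final String SAMPLE = "<addressBook>"
        + "<person><lastName>Neward</lastName><email>ted@example.com</email></person>"
        + "<person><lastName>Doe</lastName><email>jdoe@example.com</email></person>"
        + "</addressBook>";

    // Returns the text content of every node the expression selects.
    static List<String> select(String xml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList hits = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                values.add(hits.item(i).getTextContent());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("bad expression or document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(select(SAMPLE, "/addressBook/person[lastName='Neward']/email"));
    }
}
```

The consumer here knows only the two element names it cares about; anything else the producer adds to a person is simply ignored.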
Note: it's only fair to point out at this point that the Protocol Buffer approach already contains a certain degree of flexibility that earlier binary formats lacked (if I remember correctly), via PB's use of "optional" vs "required" tags for the various fields. Whether this turns out to be sufficient over time is yet to be decided, though Google claims to be using Protocol Buffers throughout the company for its own internal purposes, which is certainly supporting evidence that cannot be discarded casually.
But there's a corollary effect here, as well: where XML documents are intended to be self-descriptive, the Protocol Buffer format can contain just the data, and leave the format and structure to be enforced by the code on either side of the producer/consumer relationship. Whether you consider this a Good Thing or a Bad Thing probably stands as a good indicator of whether you like the Protocol Buffer approach or the XML approach better.
Note, too, by the way, that many of the XML-based binding APIs can now parse objects out of part of an XML document, as opposed to having to read in the whole thing and then convert--JAXB, JiBX and XMLBeans can all pick up parsing an XML document from any node beyond the initial root element, for example--whereas, at least as of this point (as far as I can tell, and I'd love to have somebody at Google tell me I'm wrong here, because I think it's a major flaw), the Protocol Buffers approach assumes it will read in the entire object from the file. I don't see any way, short of putting developer-specified "separation markers" into the stream or some other kind of encoding or convention, of doing a "partial read" of an object model from a PB data file.
To see what I mean by this, consider the AddressBook example. Suppose the AddressBook holds several thousand records, and my processing system only cares about a select few (fewer than five, perhaps, who all have the last name "Neward"). In a Protocol Buffer scheme, I deserialize the entire AddressBook, then go through the persons item by item, looking for the ones I want. In an XML-based scheme, I can pull-mode parse (StAX in Java, or using the pull-mode XML parser in .NET, for example) the nodes, throwing away nodes until I see one where the <lastName> node contains "Neward", and then JAXB-consume the next n nodes into a Person object before continuing through the remainder of the AddressBook.
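A minimal sketch of that pull-mode approach, using the StAX API in the JDK (javax.xml.stream), might look like the following. Two assumptions to flag: the XML layout (a person element whose first child is lastName) is hypothetical, and a plain Map stands in for the JAXB-bound Person object so the example stays self-contained.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Pull-mode filtering of an address book; a Map stands in for a JAXB-bound Person.
final class PullParseSketch {
    static final String SAMPLE = "<addressBook>"
        + "<person><lastName>Doe</lastName><firstName>John</firstName></person>"
        + "<person><lastName>Neward</lastName><firstName>Ted</firstName></person>"
        + "</addressBook>";

    static List<Map<String, String>> peopleNamed(String xml, String wanted) {
        List<Map<String, String>> matches = new ArrayList<>();
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                // Skip everything until a <lastName> whose text is the one we want.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "lastName".equals(reader.getLocalName())
                        && wanted.equals(reader.getElementText())) {
                    matches.add(restOfPerson(reader, wanted));
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("malformed XML", e);
        }
        return matches;
    }

    // Consumes the remaining children of the current <person> into a map.
    private static Map<String, String> restOfPerson(XMLStreamReader reader, String lastName)
            throws XMLStreamException {
        Map<String, String> person = new LinkedHashMap<>();
        person.put("lastName", lastName);
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                person.put(reader.getLocalName(), reader.getElementText());
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && "person".equals(reader.getLocalName())) {
                break;
            }
        }
        return person;
    }

    public static void main(String[] args) {
        System.out.println(peopleNamed(SAMPLE, "Neward"));
    }
}
```

Nothing is ever built for the non-matching records; they stream past and are discarded, which is the "partial read" I don't see a way to get from a PB data file.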
Let's also note that the Protocol Buffer scheme assumes working with a stream-based (which usually means file-based) storage style for when Protocol Buffers are used as a storage mechanism. But frankly, if I want to store objects, I'd rather use a system that understands objects a bit more deeply than PBs do, and gives me some additional support beyond just input and output. This gets us into the long and involved discussion around object databases, which is another one that's likely to devolve into a shouting match, so I'll leave it at that. Suffice it to say that for object storage, I can see using (for example) db4o-storing-to-a-file as a vastly more long-term solution than I can using PBs, at least for now. (Undoubtedly the benchmarks will be along soon to try and convince us one way or another.) One area that is of particular interest along these lines, though, will be the evolutionary capabilities of each--from my (limited) study of PBs thus far, I believe db4o has a more flexible evolutionary scheme, and I have to admit, I don't like the idea of having to run the codegen step before being able to store my objects, but that's a minor nit that's easily solved with tools like Ant and a global rule saying "Nobody touches the .proto file without checking with me first."
Which, by the way, brings up another problem, the same one that plagues CORBA, COM/DCOM, WSDL-based services, and anything that relies on a shared definition file that is used for code-generation purposes: what I often call The Myth of the One True Schema. Assuming a developer creates a working .proto/.idl/.wsdl definition, and two companies agree on it, what happens when one side wants to evolve or change that definition? Who gets to decide the evolutionary progress of that file? Who "owns" that definition, in effect? And this, of course, presumes that we can even get some kind of definition as to what a "Customer" looks like across the various departments of the company in the first place, much less across companies. Granted, the "optional" tag in PBs helps with this, but we're still stuck with an inherently unscalable problem as the number of participants in the system grows.
I'd give a great deal of money to see what the 12,000-odd .proto files look like inside Google, and then again at what they look like in five years, particularly if they are handed out to paying customers as APIs against which to compile and link. There are ways to manage this, of course, but they all look remarkably like the ways we managed them back in the COM/DCOM-vs-CORBA days, too.
Long story short, the Protocol Buffer approach looks like a good one, but let's not let the details get lost in the shouting: Protocol Buffers, as with any binary protocol format and/or RPC mechanism (and I'm not going to go there; the weaknesses of RPC are another debate for another day), are great for those situations where performance is critical and both ends of the system are well-known and controlled. If Google wants to open up their services such that third-parties can call into those systems using the Protocol Buffers approach, then more power to them... but let's not lose sight of the fact that it's yet another proprietary API, and that if Microsoft were to do this, the world would be screaming about "vendor lock-in" and "lack of standards compliance". (In fact, I heard exactly these complaints from Java developers during WCF Q&A when they were told that WCF-to-WCF endpoints could "negotiate up" to a faster, binary, protocol between them.)
In the end, if you want an endpoint that is loosely coupled and offers the maximum flexibility, stick with XML, either wrapped in a SOAP envelope or in a RESTful envelope as dictated by the underlying transport (which means HTTP, since REST over anything else has never really been defined clearly by the Restafarians). If you need a binary format, then Protocol Buffers are certainly one answer... but so is ICE, or even CORBA (though this is fast losing its appeal thanks to the slow decline of the players in this space). Don't lose sight of the technical advantages or disadvantages of each of those solutions just because something has the Google name on it.
Friday, July 11, 2008 2:02:47 AM (Pacific Daylight Time, UTC-07:00)
 Wednesday, July 02, 2008
Polyglot Plurality
The Pragmatic Programmer says, "Learn a new language every year". This is great advice, not just because it puts new tools into your mental toolbox that you can pull out on various occasions to get a job done, but also because it opens your mind to new ideas and new concepts that will filter their way into your code even without explicit language support. For example, suppose you've looked at (J/Iron)Ruby or Groovy, and come to like the "internal iterator" approach as a way of simplifying moving across a collection of objects in a uniform way; for political and cultural reasons, though, you can't write code in anything but Java. You're frustrated, because local anonymous functions (also commonly--and, I think, mistakenly--called closures) are not a first-class concept in Java. Then, you later look at Haskell/ML/Scala/F#, which make heavy use of what Java programmers would call "static methods" to carry out operations, and realize that this could, in fact, be adapted to Java to give you the "internal iteration" concept over the Java Collections:

    // Acceptor.java (each public type needs its own source file)
    package com.tedneward.util;

    public interface Acceptor
    {
        public void each(Object obj);
    }

    // Collection.java
    package com.tedneward.util;

    import java.util.*;

    public class Collection
    {
        // Calls acc.each() once for every element of the list
        public static void each(List list, Acceptor acc)
        {
            for (Object o : list)
                acc.each(o);
        }
    }
Where using it would look like this:
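    // Import List by name: importing java.util.* as well would make
    // "Collection" ambiguous with java.util.Collection
    import java.util.List;
    import com.tedneward.util.*;

    List personList = ...;
    Collection.each(personList, new Acceptor() {
        public void each(Object person) {
            System.out.println("Found person " + person + ", isn't that nice?");
        }
    });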
Is it quite as nice or as clean as using it from a language that has first-class support for anonymous local functions? No, but slowly migrating over to this style has a couple of definite effects. Most notably, you will start grooming the rest of your team (who may be reluctant to pick up these new languages) towards the ideas that are present in Groovy. When they finally do see those languages (as they will, eventually, unless they hide under rocks on a daily basis), they will realize what's going on that much more quickly, and start adding their voices to the call to use (J/Iron)Ruby/Groovy for certain things in the codebase you support.
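For a version that can be compiled and run as a single file, here is a hedged sketch of the same internal-iterator idea typed with Java 5 generics, so the callback receives its element type instead of Object. The names Iteration and EachDemo are invented for this illustration and are not part of the com.tedneward.util code above:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Generic callback: T is the element type handed to each().
interface Acceptor<T> {
    void each(T item);
}

// Hypothetical holder for the static "internal iterator" helper.
final class Iteration {
    private Iteration() {}

    // Invokes acc.each() once per element, in iteration order.
    static <T> void each(Iterable<? extends T> items, Acceptor<? super T> acc) {
        for (T item : items) {
            acc.each(item);
        }
    }
}

final class EachDemo {
    private EachDemo() {}

    // Builds one greeting per person by way of the internal iterator.
    static List<String> greetings(List<String> people) {
        final List<String> out = new ArrayList<String>();
        Iteration.each(people, new Acceptor<String>() {
            public void each(String person) {
                out.add("Found person " + person);
            }
        });
        return out;
    }

    public static void main(String[] args) {
        for (String line : greetings(Arrays.asList("Ted", "Charlotte"))) {
            System.out.println(line);
        }
    }
}
```

The anonymous inner class is still noisier than a Ruby or Groovy block, but the call site no longer needs a cast to get at the element.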
(By the way, this is so much easier to do in C# 2.0, thanks to generics, static classes and anonymous delegates...
    using System.Collections;

    namespace TedNeward.Util
    {
        public delegate void EachProc<T>(T obj);

        public static class Collection
        {
            // ArrayList is untyped, so the delegate is closed over object
            public static void each(ArrayList list, EachProc<object> proc)
            {
                foreach (object o in list)
                    proc(o);
            }
        }
    }

    // ...

    ArrayList personList = ...;
    Collection.each(personList, delegate(object person) {
        System.Console.WriteLine("Found " + person + ", isn't that nice?");
    });
... though the collection classes in the .NET FCL are nowhere near as nicely designed as those in the Java Collections library, IMHO. C# programmers take note: spend at least a week studying the Java Collections API.)
This, then, opens the much harder question of, "Which language?" Without trying to imply any sort of order or importance, here's a list of languages to consider, with URLs where applicable; I invite your own suggestions, by the way, as I'm sure there are a lot of languages I don't know about, and quite frankly, would love to. The "current hotness" is to learn the languages marked in bold, so if you want to be daring and different, try one of those that isn't. (I've provided some links, but honestly it's kind of tiring to put all of them in; just remember that Google is your friend, and you should be OK.)
- Visual Basic. Yes, as in Visual Basic--if you haven't played with dynamic languages before, try setting "Option Strict Off", write some code, and see how interacting with the .NET FCL suddenly changes into a duck-typed scenario. If you're really curious, have a look at the generated code in Reflector or ILDasm, and notice how much it resembles the code that other dynamic languages generate when they run on a managed execution environment, a la Groovy on the JVM.
- Ruby (JRuby, IronRuby):
- Groovy: Some call this "javac 2.0"; I'm not sure it merits that title, or the assumption of the mantle of "King of the JVM" that would seem to go with that title, but the fact is, Groovy's a useful language.
- Scala: A "SCAlable LAnguage" for the JVM (and CLR, though that feature has been left to the community to support), incorporating both object-oriented and functional concepts, plus a few new ideas, into a single package. I'm obviously bullish on Scala, given the talks and articles I've done on it.
- F#: Originally OCaml-on-the-CLR, now F# is starting to take on a personality of its own as Microsoft productizes it. Like Scala and Erlang, F# will be immediately applicable in concurrency scenarios, I think. I'm obviously bullish on F#, given the talks, articles, and book I'm doing on it.
- Erlang: Functional language with a strong emphasis on parallel processing, scalability, and concurrency.
- Perl: People will perhaps be surprised I say this, given my public dislike of Perl's syntax, but I think every programmer should learn Perl, and decide for themselves what's right and what's wrong about Perl. Besides, there's clearly no argument that Perl is one of the power tools in every *nix sysadmin's toolbox.
- Python: Again, given my dislike of Python's significant w