The Trouble with XML
by Christopher Fry
XML is supposed to be a
universal syntax for representing information. It is poor at this task because
of three fundamental limitations of the syntax. ConciseXML resolves these
problems in a backward-compatible way.
The Part-Whole vs. Type Problem
The worst problem with XML is
that the syntax does not help you distinguish between the name of a part and a
"type". Because of XML’s limitations, workarounds are convoluted and confusing.
Most people who have some experience with XML don’t understand this problem,
which compounds it even further.
First, some terminology.
Here’s a typical statement in XML:
<boat color="white" size="23">
<sail area="47"/>
</boat>
In XML terminology, this
is an XML element. "boat" is the tag name, "color" is an attribute name, and
"white" is an attribute value.
The content of the element is <sail area="47"/>.
In English terminology,
we have a boat with two properties [color
and size] and one part, a sail. From object-oriented experience we know that
the difference between a "property" and a "part" often isn’t so important.
Usually both are represented by the same construct, a field in an object.
Sometimes it is ambiguous whether something is a property or a part. For
example, is color really the "paint part" of the object? Happily, most of the
time this doesn’t really matter and our programs don’t need to delve into
Knowledge Representation theory. We just use them [a property or a part] how we
want to in the domain of our program, and everything works out OK.
So if we think the difference
between a part and a property doesn’t really matter, then we ought to use the
same syntactic construct to describe each. We might write our example above as:
<boat color="white" size="23" sail=<sail area="47"/> > </boat>
or we might have a complex
object for the color, instead of a string, say:
<boat color=<paint "white" "marine_finish"/> size="23" />
Unfortunately, XML permits
only strings as values of attributes. So you might hope we could write:
<boat color="white" size="23" sail="<sail area="47"/>" > </boat>
Even if we could "escape" the
double quotes around 47, we still can’t do this since attribute values must be
strings WHICH DO NOT CONTAIN ANGLE BRACKETS.
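To see the restriction in action, here is a minimal sketch using the XML parser in Python’s standard library (the choice of Python and the helper name `parses` are assumptions made for illustration, not part of the original discussion). Strictly, XML 1.0 forbids a literal "<" inside an attribute value; the inner double quotes are swapped for single quotes below so that the "<" is the only thing at issue.

```python
import xml.etree.ElementTree as ET


def parses(text: str) -> bool:
    """Return True if the parser accepts `text` as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# The original form: attributes for the properties, a child element for the part.
original = '<boat color="white" size="23"><sail area="47"/></boat>'

# The form we hoped for: the sail element as the value of a "sail" attribute.
hoped_for = '<boat color="white" size="23" sail="<sail area=\'47\'/>"></boat>'

# Escaping "<" as "&lt;" gets past the parser, but the value is only a string.
escaped = '<boat color="white" size="23" sail="&lt;sail area=\'47\'/>"></boat>'

print(parses(original))   # True
print(parses(hoped_for))  # False
value = ET.fromstring(escaped).get("sail")
print(type(value).__name__)  # str
```

Even the escaped variant does not give us a sail object as the value of the attribute. The parser hands back plain text that would have to be parsed a second time.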
OK, so we’re forced to use our
original syntax. Now let’s look at it a little more closely.
Observe that "<boat>"
makes us an instance of a "boat". So boat is really the name of a class, or a
type.
Well, using that same logic,
what’s "<sail/>"? It must be creating a sail object, so "sail" is the name
of a class. Well, if sail is the name of a class, then it can’t be the "part
name" of something within boat. But in fact people use just this syntax to
name parts in XML all the time, because they can’t use the foo="bar" syntax for
naming a part whose value is an element.
Now sometimes people try to
get around this by doing:
<boat color="white" size="23">
<sail name="mainsail" area="47"/>
</boat>
We’ve now got a clear "name
of part within boat", right? Well, not really, because we are using our attribute
syntax for the part name. Although we MIGHT want to have a name of something be
an attribute of the thing, for the most part it’s better to have the part name
OUTSIDE of the thing itself. After all, the thing might be the "mainsail" part
of the boat, but at night we use it as the "top sheet" part on the bed. It is
still the same thing, so we can’t change it just because we’re using it
somewhere else.
If you embed a "name" in a
part, then when you go to find the part of the boat named mainsail, you
essentially have to do a content-addressable lookup. That is slower, first of
all, but more important is the inability to share this object in other places,
as is normal in object-oriented programming.
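The difference can be sketched in a few lines of Python (an illustration only: the second sail named "jib", the `Sail` class, and the two dictionaries are assumptions invented for this example). With the name embedded in the part, finding the mainsail means scanning the children. With the name held outside the part, the lookup is by key and the same object can sit under a different name somewhere else.

```python
import xml.etree.ElementTree as ET

# Name embedded in the part: the boat's children must be searched by content.
boat = ET.fromstring(
    '<boat color="white" size="23">'
    '<sail name="jib" area="20"/>'
    '<sail name="mainsail" area="47"/>'
    '</boat>'
)


def find_part_by_embedded_name(parent, name):
    """Scan every child until one carries the wanted name attribute."""
    for child in parent:
        if child.get("name") == name:
            return child
    return None


mainsail = find_part_by_embedded_name(boat, "mainsail")
print(mainsail.get("area"))  # 47


# Name held outside the part: one object, two owners, two different part names.
class Sail:
    def __init__(self, area):
        self.area = area


cloth = Sail(area=47)
boat_parts = {"mainsail": cloth}
bed_parts = {"top sheet": cloth}
print(boat_parts["mainsail"] is bed_parts["top sheet"])  # True
```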
You may observe another
problem with our example. We’ve encoded the size attribute as a string, "23".
But we all know it is really a number. Well, what if we WANT the size to
sometimes actually be a string of digits, and other times actually be a number?
We can define our schema one way or the other, but we can’t have it both ways as
we can in an object system that permits not just fields that can hold several
different types of objects, but also a syntax like
"23" for strings and 23 for numbers.
In the XML community all this
hair doesn’t seem to bother anyone. They make up objects like:
and everyone’s happy since
"street" functions as both a type and a part name. Then someone wants to add a "cross street" part to our
element. Suppose we represent it like so:
Now we’ve got a problem, because the value of our part
named cross-street is an object of type "street", but we’ve got some other PART
that is also named "street".
XML fails to make clear the
distinction between part-names and types. This ambiguity is deadly for any kind
of reasonably complex information.
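For contrast, here is a minimal sketch, in Python rather than XML, of how an object system keeps part names and types apart. The `Street` and `Address` classes and the street names are hypothetical, made up for this illustration. Each field has a part name on the left and a type on the right, so two differently named parts can share one type without confusion.

```python
from dataclasses import dataclass, fields


@dataclass
class Street:
    name: str


@dataclass
class Address:
    # part name on the left, type on the right
    street: Street
    cross_street: Street


address = Address(street=Street("Main St"), cross_street=Street("Elm St"))

# Map each part name to the name of its type.
part_types = {f.name: f.type.__name__ for f in fields(Address)}
print(part_types)  # {'street': 'Street', 'cross_street': 'Street'}
```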
The Lack of "by-position" Arguments
The most common and most
important kind of computer data is code. So we might represent a
function call in XML like so:
<launch_boat when="3:14PM" where="LA"/>
OK, that’s not so bad [until
we have non-string arguments, that is; see the problem above].
But programmers like to be
terse and don’t like to have to type in keywords all the time, so they
generally prefer syntaxes that allow passing of arguments by position:
<launch_boat "3:14PM" "LA"/>
The arguments are
distinguished by their order. But XML doesn’t allow such syntax. You must put
in the name of the attribute. We could do something like:
but that’s even more verbose
than mentioning the keywords in the first place.
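One way to picture the trade-off is the following Python sketch. It assumes a hypothetical positional encoding in which each argument is wrapped in a child element called "arg", and a stand-in `launch_boat` function; both are invented for illustration and are not taken from any XML vocabulary. The positional form does work, but it comes out longer than the keyword form.

```python
import xml.etree.ElementTree as ET


def launch_boat(when, where):
    """Stand-in for the function the XML call is meant to invoke."""
    return f"launching at {when} from {where}"


keyword_xml = '<launch_boat when="3:14PM" where="LA"/>'
# Hypothetical workaround: one child element per positional argument.
positional_xml = '<launch_boat><arg>3:14PM</arg><arg>LA</arg></launch_boat>'

keyword_call = ET.fromstring(keyword_xml)
positional_call = ET.fromstring(positional_xml)

# Attributes map to keyword arguments; child order maps to argument position.
by_keyword = launch_boat(**keyword_call.attrib)
by_position = launch_boat(*[arg.text for arg in positional_call])

print(by_keyword == by_position)               # True
print(len(positional_xml) > len(keyword_xml))  # True
```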
Lack of by-position arguments
and the lack of elements as attribute values make XML a lousy syntax for the
most important kind of data on a computer.
Verbose Ending Tag
When we have an element that
contains content, you must end it with an ending tag that mentions the tag
name again, as in our top example:
<boat color="white" size="23">
<sail area="47" />
</boat>
Mentioning boat a second
time is redundant. Worse, if you want to change the name to, say, "sailboat",
now you’ve got to change it in two places, making it likely that you’ll get the
two out of sync and cause an error. In situations where an ending tag is a long
way from the beginning tag, it is often clearer to name the ending tag.
But for short expressions, like those typical of the most important kind
of computer data, long ending tags are more a burden than an aid.
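The out-of-sync failure is easy to reproduce. The sketch below uses the XML parser in Python’s standard library (an assumption of convenience; the helper name `parse_error` is made up for this example) to compare a rename applied to both tags with a rename applied to the start tag only.

```python
import xml.etree.ElementTree as ET


def parse_error(text):
    """Return the parser's error message, or None if `text` is well-formed."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return str(err)
    return None


# "boat" renamed to "sailboat" in both the start tag and the ending tag.
renamed_in_both = '<sailboat color="white" size="23"><sail area="47"/></sailboat>'

# "boat" renamed in the start tag only; the ending tag was forgotten.
renamed_in_one = '<sailboat color="white" size="23"><sail area="47"/></boat>'

print(parse_error(renamed_in_both) is None)  # True
print(parse_error(renamed_in_one) is None)   # False
```

A single forgotten ending tag is enough to make the parser reject the whole document.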
Long ending tags were
designed to make XML more readable. However, because you’ve got to use the
name of a tag twice every time you use it with content, developers will be
reluctant to give tags long descriptive names in the first place, thus
rendering their XML less readable.
This error-prone and verbose syntax conspires to make it
more difficult to produce syntactically correct information that is easily
readable by humans.
Water solves these problems with XML 1.0 by using a concise form of XML called ConciseXML.