Tuesday, February 8, 2011

Before XML became a standard and given all its shortcomings, what made XML so popular?

Yes, XML is human readable, but so are comma-delimited text and properties files.

XML is bloated, hard to parse, and hard to modify in code, plus there are a ton of other problems with it that I can think of.

My question is: what are XML's most attractive qualities that have made it so popular?

  • It's structured.

    Autobyte : Can we not come up with something a little less cumbersome that is also structured?
    Rich Bradshaw : Well, possibly, but there hasn't been much that has been as widely adopted. JSON is another data exchange format, but it's nowhere near as structured, which for things such as SOAP requests is important.
  • It's compatible with many languages.

    Autobyte : Why because it is text? There are other text formats that I am sure would have been less bloated and more easily parsed..
    Sara Chipps : yes, after my answer I reread your question and realized that it wasn't an argument against .csv
  • It's easier to write a parser for an XML dialect than for an arbitrary one because of tools that are available.

    Using a DOM parser, for example, is much simpler than lex and yacc, especially in Java where it was popularized.

    Autobyte : This is a chicken-and-egg answer: it's easier because there are tools. Why did people develop tools for it in the first place?
  • Do you remember the days before XML became popular? Data just wasn't easily interchangeable -- one program would take .csv files, the next .xls, the next EBCDIC-formatted files. XML has its weaknesses, but it's structured, which makes it parsable and transformable.

    As you point out, CSV files are pretty portable. However, there's no meaning to them. What does column(14) mean to me? As opposed to <customer id="14"/>?

    davetron5000 : This is almost entirely untrue; CSV and XLS are structured; how could they not be? Further, some arbitrary XML format isn't any more interchangeable. The source developer still needs to provide a format/schema/DTD and the destination developer still needs to write a parser.
    davetron5000 : It **is** easier to reverse engineer XML than CSV, tho.
    tloach : What's the equivalent of an XSD file for CSV or XLS then?
    From Danimal
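
The column(14) versus <customer id="14"/> point can be shown with a rough sketch. The sample record below is made up for illustration, and Python's standard csv and xml.etree.ElementTree modules stand in for whatever tooling you actually use:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data: the same record as a CSV row and as XML.
csv_text = "14,Ada Lovelace,London\n"
xml_text = '<customer id="14"><name>Ada Lovelace</name><city>London</city></customer>'

# CSV: the reader must already know what each position means.
row = next(csv.reader(io.StringIO(csv_text)))
customer_id_from_csv = row[0]  # position 0 is the id only by prior agreement

# XML: the names travel with the values.
customer = ET.fromstring(xml_text)
customer_id_from_xml = customer.get("id")
city_from_xml = customer.findtext("city")

print(customer_id_from_csv, customer_id_from_xml, city_from_xml)
```

Both readers recover the same id, but only the XML reader could have guessed what the value was without a separate description of the format.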
    1. Schema definition languages - you can describe the expected format of the XML
    2. It's a standard:) - it's definitely better than everybody using their own custom formats

    CSV is human readable but that's really the only good thing about it - it's so inflexible, and there are no meanings assigned to the values. If I started designing a system now I would definitely use YAML instead - it's less bloated and it's definitely gaining momentum.

    From Svet
  • XML is not hard to parse, in fact it's quite simple, given the volume of excellent APIs available for every language under the sun.

    XML itself is not bloated, it can be as concise as necessary, but it's up to your schema to keep it that way.

    XML handles hierarchical datasets in a way that comma-delimited text never could or should.

    XML is self-documenting/describing, and human readable. Why is it a standard? Well, first and foremost, because it can be standardized. CSV isn't (and can't be) a standard because there's an infinite amount of variation.

    Autobyte : The fact that there are APIs is because it has become so popular - I asked what qualities made it become popular. And yes, it is verbose: all those extra characters are not really required to understand the underlying data.
    Danimal : well, it is sorta bloated -- see Jeff's post on the "Angle Bracket tax" (http://www.codinghorror.com/blog/archives/001114.html). However, with bandwidth and RAM growing so quickly, I think it's pretty much a non-issue.
    David Hill : Bloat with a pay-off then. If you are concerned about size, then you won't want XML, but you won't want CSV either. You'll be serializing directly to binary, in a custom format of your own design.
    davetron5000 : I think we just need to be honest that XML is a binary format that is kinda easy to read and kinda easy to parse.
    tloach : if you're worried about size then NEVER use a human-readable file. Period. Lempel-Ziv compression (zip) is quick and easy, even if you roll your own. Since there are generally a few tags used many times, the tags will quickly be added to the dictionary and compressed quite well.
    David Hill : I think we're all violently in agreement then!
    Daniel James : RFC 4180 - Common Format and MIME Type for Comma-Separated Values (CSV) Files: http://www.rfc-editor.org/rfc/rfc4180.txt
    Robert Rossney : There's a CSV standard, but there's also a vast ecology of software that doesn't respect it.
    From David Hill
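
tloach's claim above, that repeated tags compress well, is easy to try out. This is only a sketch: the 1000-record data set is invented, zlib (a Lempel-Ziv based compressor in Python's standard library) stands in for zip, and real data may well behave differently:

```python
import zlib

# Hypothetical data set: 1000 records that repeat the same few tags.
records = "".join(
    f"<customer><id>{i}</id><name>Customer {i}</name></customer>"
    for i in range(1000)
)
xml_bytes = f"<customers>{records}</customers>".encode("utf-8")

# The same values as plain comma-delimited text, for comparison.
csv_bytes = "".join(f"{i},Customer {i}\n" for i in range(1000)).encode("utf-8")

xml_compressed = zlib.compress(xml_bytes)
csv_compressed = zlib.compress(csv_bytes)

print("XML:", len(xml_bytes), "->", len(xml_compressed))
print("CSV:", len(csv_bytes), "->", len(csv_compressed))
```

Run it against your own data before drawing conclusions; how close the two compressed sizes end up depends on how repetitive the markup is compared to the payload.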
  • It was the late 90s and the internet was hot hot hot, but companies had systems that couldn't get anywhere near the internet. They had spent countless hours dealing with CORBA and were plotting using Enterprise JavaBeans to get these older systems communicating with their newer systems.

    Along comes SGML, which is the precursor to almost all markup languages (I'm skipping GML). SGML was already used to define HTML, but HTML had particular tags that HAD to be used in order for Netscape to properly display a given webpage.

    But what if we had other data that needed to be explained? Ah ha!

    So given that XML is structured, and you can feel free to define that structure, it naturally allows you to build interfaces (in a non-OO point of view). It doesn't really do anything that other interface languages didn't already do, but it gave people the ability to design their own definitions.

    Interface languages like X12 and HL7 existed for sure, but with XML people could tailor it to their individual AIX or AS/400 systems.

    And with the predominance of tag language because of HTML, well it was just natural that XML would get pushed to the forefront because of its ease of use.

    Autobyte : Excellent answer!!!
    David Hill : Indeed, well done sir.
    kitsune : Excellent, this is how I remember it too
    From nathaniel
  • It has many advantages, and few shortcomings. The main problem is the increased size of file and slower processing. However, there are advantages:

    • it is structured, so you write a parser only once
    • it supports data with nested structure (hierarchies, trees, etc.)
    • you can embed multiple types of data structure in a single XML
    • you can describe the schema (data types, etc.) with a standard language (XSD...)
    tloach : Size of file is a non-issue since XML compresses quite well. Compressed XML should be comparable to just about any other method of storing the same data, size-wise.
    Milan Babuškov : Compressing XML means you also lose CPU to do it, and to decompress it each time you read anything from it.
    orip : @tloach - I've found the compressibility of XML vs. leaner data representations (e.g. JSON) to be dependent on the scenario.
    Milan Babuškov : @orip: I started using JSON recently, and I find it much more compelling than XML. It's easier and faster to parse - at least, by machines. Of course, with proper indentation it works for humans as well.
    • You can be given an xml file and have a chance at understanding what the data means by reading it without needing a separate specification of your pre-xml data format.
    • Tools can be used to work with xml generically. Before, when everybody used different file formats (comma separated, binary, etc.), you'd need to write a custom tool for each one.
    • You can extend it by adding a new tag to the schema with a default value. If done correctly, that doesn't break the old code that parses the xml but doesn't know about the tag. That usually isn't true with proprietary formats.
    • Probably the main thing that makes it popular is that it looks a bit like HTML, which lots of people already understood. So it became popular, and then because it was popular it became more popular, because it's nice to work with one standard that everybody knows.
    • A bad thing is that xml is usually a lot bigger than the formats that used to be used, because of all the tags and because it's text based. But as computers are bigger now, we can often handle that, and it's worth trading size for better self-describing data.
    • You can get off the shelf code/libraries that will parse/write xml.
    Milan Babuškov : BIG computers, yeah! ;)
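
The extensibility point in this answer (new tags that old code simply doesn't look at, plus a default for missing ones) might look something like this. It is a sketch only: read_customer and the element names are hypothetical, and Python's xml.etree.ElementTree is just one of many parsers that behave this way:

```python
import xml.etree.ElementTree as ET

def read_customer(xml_text):
    """Old reader: it only knows about <name> and <city>."""
    root = ET.fromstring(xml_text)
    return {
        "name": root.findtext("name"),
        # The default applies when the element is absent.
        "city": root.findtext("city", default="unknown"),
    }

# An old document without <city>, and a newer one that also carries
# <loyalty-tier>, a tag the old reader has never heard of.
old_doc = "<customer><name>Ada</name></customer>"
new_doc = (
    "<customer><name>Ada</name><city>London</city>"
    "<loyalty-tier>gold</loyalty-tier></customer>"
)

print(read_customer(old_doc))
print(read_customer(new_doc))
```

The old reader keeps working on both documents because it asks for elements by name instead of by position, which is the part that a positional format like CSV can't easily offer.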
  • One of the major advantages it has over things like CSV files is that it can represent hierarchical data easily. To do this you either need a self-describing tree structure like XML, or a pre-defined format such as SWIFT or EDI (and if you've ever dealt with either of those, then you'll realise that XML is trivial to parse in comparison).

    One of the reasons it's actually quite easy to parse is because it's 'bloated'. Those end tags mean that you can accurately match the end of elements to the start and work out when the tree has become unbalanced. You can't do that in the 'lightweight' alternatives such as JSON.

    Another reason it's easy to parse is because it has had full support for Unicode encodings from the start, so you don't have to worry about what the default code page is on the target system, or how to encode multi-byte characters, because that information is all contained within the document.

    And let's not forget about the other artefacts that came with it like the defined description and validation mechanism (XSD) and the powerful and declarative transformation mechanism (XSLT).

    Danimal : amen on the nightmare that is EDI
    Kibbee : Yes, EDI is truly terrible.
    Darrel Miller : Isn't it interesting how, when you have the perspective of "how we used to have to do it", some of today's technology looks incredibly easy.
    AnthonyWJones : Anyone sat on an EDI committee? What a nightmare, each organisation trying to get things their way, no wonder the results were monsters!
    Hamish Smith : +1 unicode support. try and use csv for anything other than ascii and you are not actually using csv anymore.
    Lucero : ...not to forget namespace support, which allows combining different XML data structures in one document in a meaningful and unambiguous way.
    From Greg Beech
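
Two of the points in Greg Beech's answer, end tags that let a parser notice an unbalanced tree and the encoding being declared inside the document, can be sketched as follows. The sample documents and the is_well_formed helper are made up for illustration; the parser is Python's xml.etree.ElementTree:

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

balanced = "<order><item>widget</item></order>"
unbalanced = "<order><item>widget</order></item>"  # end tags in the wrong order

print(is_well_formed(balanced), is_well_formed(unbalanced))

# The document itself says how its bytes are encoded, so the reader
# does not have to guess the code page of the system that wrote it.
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><name>M\xfcller</name>'
name = ET.fromstring(latin1_doc).text
print(name)
```

The mismatched document is rejected outright rather than being silently misread, which is the payoff for the extra characters in each end tag.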
  • How about the fact that it supports a standardized query language, XPath? That's pretty useful for me.

    Autobyte : XPath is relatively new and did not factor in making XML popular; instead it is a by-product of its current popularity. But yes, XPath is cool!
    Darrel Miller : XPath became a W3C recommendation in 1999. Considering XML only obtained this status a year earlier, I would say they are similarly ancient.
    Lucero : ...and XSL transformations are perfect for making XML data more interchangeable.
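
For anyone who hasn't used XPath, here is a small taste. Note the assumptions: the customer data is invented, and Python's xml.etree.ElementTree only supports a limited subset of XPath, so this is not a full XPath engine:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
doc = ET.fromstring(
    "<customers>"
    '<customer id="14"><name>Ada</name><city>London</city></customer>'
    '<customer id="15"><name>Grace</name><city>New York</city></customer>'
    "</customers>"
)

# Select by attribute value.
match = doc.find(".//customer[@id='14']")

# Select by the text of a child element.
names_in_london = [c.findtext("name") for c in doc.findall("./customer[city='London']")]

print(match.findtext("name"), names_in_london)
```

The queries are plain strings, so the same expressions can be reused from other languages and tools that implement XPath.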
  • Some inherent qualities of XML that make it so popular and useful:

    1. XML represents a tree, and tree-like structures are a very common pattern in programming. This is an evolutionary leap from record-based representations like CSV, made possible by today's cheap computing power and bandwidth.

    2. XML strikes a good balance between human factors (it is plain text, and fairly legible) and computing practicalities (terseness, ease in parsing, expressiveness, extensibility, etc).

    From Keeth
  • Straight from the horse's mouth, the design goals of XML were:

    1. XML shall be straightforwardly usable over the Internet.
    2. XML shall support a wide variety of applications.
    3. XML shall be compatible with SGML.
    4. It shall be easy to write programs which process XML documents.
    5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
    6. XML documents should be human-legible and reasonably clear.
    7. The XML design should be prepared quickly.
    8. The design of XML shall be formal and concise.
    9. XML documents shall be easy to create.
    10. Terseness in XML markup is of minimal importance.

    The reason why it became popular was because people needed a standard for a cross-platform data exchange format. XML may be a bit bloated, but it is a very simple way to delimit text data and it was backwards compatible with the large body of existing SGML systems.

    You really can't compare XML to CSV because CSV is an extremely limited way of representing data. CSV cannot handle anything outside of a basic row-column table and has no notion of hierarchy.

    XML is not that hard to parse and once you write or find a decent XML utility it's not difficult to deal with in code either.

    From 17 of 26
  • The primary advantage it bestows is a system independent representation of hierarchical data. Comma delimited text and properties files are more appropriate in many places where XML was used, but the ability to represent complex data structures and data types, character set awareness, and standards document allowed it to be used as a good inter application exchange format.

    My minor improvement suggestion for the language is to change the way end tags work. Just imagine how much bandwidth and disk space would be saved if you could end a tag with </>, like <my_tag>blah</> instead of <my_tag>blah</my_tag>. You aren't allowed to have overlapping tags, so I don't know why the standard insists on even more text than is needed. In fact, why use angle brackets at all?

    The ugliness of the angle brackets is a good show of what it could have been: JSON. JavaScript Object Notation achieves most of the goals of XML with a lot less typing. Another alternate syntax that makes XML bearable is the Builder syntax, as used by Groovy and Ruby. It's much more natural and readable.

  • XML provides a very straightforward way to represent data. Parsing is fairly easy - it's a very regular grammar and lends itself to straight forward recursive descent parsing. This makes it easy for data consumers and producers to exchange information without really having to know too much about their respective applications and internals.

    It is, however, an extremely inefficient way to represent data and lends itself to being abused horribly. An example of this is an object interface I worked with that, instead of exporting constructors and properties for particular objects, required me to author XML programmatically and pass in the resulting XML to the single constructor. Similarly, XML does not lend itself well to large data sets that may require random access without creating an added cataloging system (ie, if I have a thousand page document in XML, I will need to parse nearly the entire file to get to page 999, assuming the page data is ordered), whereas I'd be better off putting the actual page data in a separate file or files and use the XML to point to the correct file or position within a file.

    From plinth
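
plinth's page 999 example can be made concrete. In this sketch the 1000-page book is generated in memory, find_page is a hypothetical helper, and Python's ElementTree.iterparse is used as the streaming parser:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical 1000-page document, built in memory for the example.
pages = "".join(f'<page n="{i}">text of page {i}</page>' for i in range(1, 1001))
document = io.BytesIO(f"<book>{pages}</book>".encode("utf-8"))

def find_page(stream, wanted):
    """Scan sequentially; return (page text, number of pages visited)."""
    visited = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "page":
            continue
        visited += 1
        if elem.get("n") == str(wanted):
            return elem.text, visited
        elem.clear()  # drop the content of pages we have already passed
    return None, visited

text, visited = find_page(document, 999)
print(text, visited)
```

A streaming parser keeps memory use down, but it still has to read every page that comes before the one you want. That is why a separate index, or one file per page, is the usual workaround for random access.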
  • I'd guess that its popularity originally stemmed from the fact that it solved the right problems in a way that wasn't exceedingly bad for enough big players to gain their support, and thus gained widespread industry adoption. At this point, it's rather strongly embedded into the landscape since there's so much component development invested around XML. The HIPAA and other EDI XML schemas and adapters that ship with MS BizTalk Server (and BizTalk itself) are a great example of the mountain that's been gradually built on top of XML.

  • XML's popularity derives from other markup languages. HTML is the one people are most familiar with, but increasingly now we see "markdown" languages like that used by wikis and even the stackoverflow post form.

    HTML did an interesting job, of formatting text, but it was insufficient. It grew. Folks wanted to add tags for everything. <BLINK> anyone? Layouts, styles, and even data.

    XML is the extensible markup language (duh, right?), designed so that anyone could create their own tags, and so that your RECORD tag doesn't interfere with my RECORD tag, in case they have different meanings, and with sensitivity to the issues of encoding and tag-matching and escaping that HTML has.

    At the start, it was popular with people who already knew HTML, and liked the familiar concept of using markup to organize their data.

    From davenpcj
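
The "your RECORD tag doesn't interfere with my RECORD tag" idea is what XML namespaces provide. A minimal sketch, assuming two made-up namespace URIs under example.com and using Python's xml.etree.ElementTree:

```python
import xml.etree.ElementTree as ET

# Two vocabularies that both define a RECORD tag (the URIs are hypothetical).
doc = ET.fromstring(
    '<archive xmlns:music="http://example.com/music" '
    'xmlns:db="http://example.com/db">'
    "<music:RECORD>Kind of Blue</music:RECORD>"
    "<db:RECORD>row 42</db:RECORD>"
    "</archive>"
)

ns = {"music": "http://example.com/music", "db": "http://example.com/db"}

# Each query only sees the RECORD elements from its own namespace.
albums = [e.text for e in doc.findall("music:RECORD", ns)]
rows = [e.text for e in doc.findall("db:RECORD", ns)]

print(albums, rows)
```

The prefixes are only shorthand; it is the URI behind each prefix that keeps the two RECORD tags apart.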
  • It's cross platform. We use it to encode robot control programs and data that run in C under VxWorks, but our off-line programming is done under .NET. XML is easily parsed by both.

    From Jim C
  • Compared to some of the previous standards it's a dream. Try writing HDF (Hierarchical Data Format) files or FITS. FITS was standardised before the invention of the disc drive - you have to worry about padding the file into block sizes!
    Even CSV isn't as simple. Quick question, what's the separator in a German CSV file?

    A lot of the complaints about XML are from people who use it to transfer data directly between machines where the data only exists for milliseconds. In a lot of areas the data will have to last for 50-100 years and be far more valuable than the machine it ran on. It's worth paying a closing tag tax sometimes.
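
On the German CSV question: spreadsheet exports in German locales commonly use a semicolon, because the comma is already taken as the decimal separator. A sketch of what goes wrong when the reader assumes commas (the sample data is made up; the reader is Python's standard csv module):

```python
import csv
import io

# Hypothetical export from a German-locale spreadsheet.
german_csv = "Artikel;Preis\nApfel;1,50\nBirne;2,25\n"

# Reading with the right delimiter keeps the decimal comma inside the field.
rows_semicolon = list(csv.reader(io.StringIO(german_csv), delimiter=";"))

# Reading with the default comma splits the price in two instead.
rows_comma = list(csv.reader(io.StringIO(german_csv)))

print(rows_semicolon[1])
print(rows_comma[1])
```

Nothing in the file itself says which delimiter was used, so the reader has to know or guess.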

  • Something I haven't seen mentioned yet is that not only is XML structured, but the way that attributes and elements interact creates a somewhat unusual structure that is still easily understandable by humans.

    If you compare an XML tree with its nearest structural neighbor, the directed acyclic graph, you might note that the typical DAG carries only an ID and a value at each node. XML carries this as well (gi/tag corresponding with ID, and the text of the node corresponding with the value), but each node can then also carry an arbitrary amount of additional metadata: the attributes. This is very much like having an extra dimension. If you consider the DAG as spreading out flat in two dimensions with each branch, the XML document spreads out in three dimensions: flat, and then downwards to a subtree containing just the attributes.

    This is an optional bend to the structure. Walk a list of attributes like any list of child elements, and you're back to a two-dimensional tree. Ignore them completely, and you have a simplified node/value tree which may more purely represent the overall "shape" of contained data. But the extra dimension is there if you need the metadata.

    With decent indentation, this is something that a human being can pick up just by eyeballing the raw data, making XML a miniature visualization tool for a potentially complex structure — and having a visualization tool built into the data exchange of your application means that the programmers involved are more likely to build a structure that represents the way the data is likely to be used.

    From Zed
  • The two main things that made XML widely adopted are "human readability" and "Sun Microsystems". There were (and there still are) other cross-language, cross-platform data exchange formats that are more flexible, easier to parse, and less verbose than XML, such as ASN.1.

    From gizmo
  • It is a text format, and that is one of its major advantages. Binary formats are usually much smaller, but you always need tools to "read" them. You can simply open an editor and modify XML files to your liking. However, I'd argue it's still a bloated format, but you can compress it quite well.... If one looks at the specs for the Microsoft Office XML formats, one can just imagine how wonderful it is to be seemingly open....

    Regards Friedrich

    From Friedrich
  • Another benefit of XML vs. binary data is error resiliency.

    For binary data, if a single bit goes wrong, the data is most likely unusable. With XML, as a last resort, you can still open it up and make corrections.
