This quote is not about using XML as a storage format in general (for which it can be fine, depending on the requirements), but about using it for database-type storage.
When people talk about databases, they usually mean storage systems that hold huge quantities of data, often in the gigabyte or terabyte range. A database is potentially much larger than the available RAM on the server that stores it, and since no query ever needs all the data in a database at once, databases are optimized for fast retrieval of selective subsets of their data: this is what the SELECT statement is for, and relational databases as well as NoSQL solutions optimize their internal storage formats for fast retrieval of exactly such subsets.
XML, however, doesn't really fit these requirements. Because of its nested tag structure, there is no way to determine where in the file a given value is stored (in terms of a byte offset) without walking the document tree, at least up to the match. A relational database has indexes: looking up a value in an index, even with a primitive binary-search implementation, is a single O(log n) operation, and getting to the actual row is then nothing but a file seek (e.g. fseek(data_file_handle, row_index * row_size)), which is O(1). In an XML file, the most efficient approach is to run a SAX parser over the document, doing a great many reads before you reach your actual data; you can hardly do better than O(n), unless you use indexes; but then you'd have to rebuild the entire index on every insertion (see below).
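To make the file-seek point concrete, here is a minimal Python sketch of fixed-width row storage. The layout (a 4-byte id plus a 16-byte name field) is invented for illustration; a real engine adds pages, caching, and variable-width handling, but the O(1) lookup principle is the same:

```python
import struct
import tempfile

# Invented fixed-width row layout: 4-byte little-endian int id + 16-byte name.
ROW = struct.Struct("<i16s")

def write_rows(f, rows):
    for rid, name in rows:
        f.write(ROW.pack(rid, name.encode().ljust(16, b"\x00")))

def read_row(f, row_index):
    # O(1): compute the byte offset and jump straight to it, no scanning.
    f.seek(row_index * ROW.size)
    rid, raw = ROW.unpack(f.read(ROW.size))
    return rid, raw.rstrip(b"\x00").decode()

def demo():
    with tempfile.TemporaryFile() as f:
        write_rows(f, [(i, f"name{i}") for i in range(1000)])
        return read_row(f, 742)

print(demo())  # → (742, 'name742')
```

Because every row has the same size, the offset arithmetic replaces any kind of search entirely.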
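For contrast, here is the streaming-parse lookup on an XML version of the same invented data, using Python's ElementTree in iterparse (SAX-style) mode. Every element before the match still has to be read and parsed, so the cost grows linearly with how deep into the document the value sits:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document holding the same 1000 records as nested tags.
doc = ("<rows>"
       + "".join(f'<row id="{i}">name{i}</row>' for i in range(1000))
       + "</rows>").encode()

def find_row(xml_bytes, wanted_id):
    # Streaming parse: O(n) in document size, because everything that
    # precedes the match must still pass through the parser.
    scanned = 0
    for _, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "row":
            scanned += 1
            if elem.get("id") == str(wanted_id):
                return elem.text, scanned
            elem.clear()  # discard rows we had to parse but don't need
    return None, scanned

print(find_row(doc, 742))  # → ('name742', 743)
```

Note the second return value: to find record 742 we had to parse 743 elements, where the fixed-width table needed a single seek.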
Inserting is even worse. Relational databases do not guarantee row order, which means they can simply append new rows, or overwrite rows marked as 'deleted'. This is extremely fast: the DB can keep a pool of writable locations around; taking an entry from the pool is O(1) unless the pool is empty, and even in the worst case, when a new page has to be allocated, the operation is still O(1). By contrast, an XML-based database would have to move everything after the insertion point to make room, which is O(n). When indexes come into play, things get even more interesting: a typical relational-database index can be updated with relatively low complexity, say O(log n); but if you index your XML files, every insertion potentially changes the on-disk location of every value in the document, so you have to rebuild the entire index. The same goes for updates, because changing, say, an element's text content can change its size, which means all the subsequent XML has to shift. A relational database doesn't have to touch the index at all when you update a non-indexed column; an XML database would have to rebuild the entire index for every update that changes the size of the updated node.
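The difference between appending and order-preserving insertion can be sketched with plain byte buffers (Python; the offsets and tag names are invented, and a real XML store would additionally have to re-serialize and re-index):

```python
import io

def append_record(buf, record):
    # What a relational store can do: O(1), nothing that exists moves.
    buf.seek(0, io.SEEK_END)
    buf.write(record)

def insert_at(buf, offset, record):
    # What an order-preserving XML file forces: everything after the
    # insertion point must be read back and rewritten, which is O(n).
    buf.seek(offset)
    tail = buf.read()
    buf.seek(offset)
    buf.write(record + tail)

table = io.BytesIO(b"ROW0ROW1")
append_record(table, b"ROW2")          # cost independent of table size
print(table.getvalue())                # → b'ROW0ROW1ROW2'

doc = io.BytesIO(b"<rows><a/><c/></rows>")
insert_at(doc, 10, b"<b/>")            # insert between <a/> and <c/>
print(doc.getvalue())                  # → b'<rows><a/><b/><c/></rows>'
```

The append stays O(1) no matter how large the table grows; the mid-document insert rewrites the whole tail, and any byte-offset index built over the document is invalidated in the process.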
Those are the most important downsides, but there are more. XML is very verbose, which is good for server-to-server communication, because it adds safety: the receiving server can perform all sorts of integrity checks on the XML, and if anything went wrong in the transfer, the document is unlikely to validate. For mass storage, however, this is a killer: 100% or more overhead is common for XML data (overhead ratios in the 1000% range are not unusual for things like SOAP messages), while a typical relational storage scheme carries only a constant overhead for table metadata plus a small amount per row; most of the remaining overhead in relational databases comes from fixed column widths. If you have a terabyte of data, a 500% overhead is simply unacceptable, for many reasons.
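As a rough illustration of the overhead, here is the same invented 1000-row table serialized both ways (Python sketch; the exact ratio depends entirely on the schema and tag names, so treat the numbers as illustrative only):

```python
import struct

# The same hypothetical 1000 (id, name) rows, stored two ways.
rows = [(i, f"name{i}") for i in range(1000)]

ROW = struct.Struct("<i16s")  # fixed-width: 20 bytes per row
binary = b"".join(ROW.pack(i, n.encode().ljust(16, b"\x00")) for i, n in rows)

xml = ("<rows>"
       + "".join(f"<row><id>{i}</id><name>{n}</name></row>" for i, n in rows)
       + "</rows>").encode()

overhead = len(xml) / len(binary) - 1
print(f"binary: {len(binary)} B, xml: {len(xml)} B, overhead: {overhead:.0%}")
```

Even with short tag names and no attributes, the tags alone more than double the storage for this schema; add namespaces, envelopes, and indentation and the SOAP-style ratios mentioned above become plausible.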