This quote is not about using XML as a storage format in general (for which it can be fine, depending on the requirements), but about using it for database-type storage.
When people talk about databases, they usually mean storage systems that hold huge quantities of data, often in the gigabyte or terabyte range. A database is potentially much larger than the available RAM on the server that stores it, and since no query ever needs all the data in a database at once, databases are optimized for fast retrieval of selective subsets of their data: this is what the SELECT statement is for, and relational databases as well as NoSQL solutions optimize their internal storage formats for fast retrieval of exactly such subsets.
XML, however, doesn't really fit these requirements. Because of its nested tag structure, there is no way to determine where in the file a given value is stored (in terms of a byte offset) without walking the document tree, at least up to the match. A relational database has indexes: looking up a value in an index, even with a primitive binary-search implementation, is a single O(log n) operation, and getting to the actual row is then nothing but a file seek (e.g. fseek(data_file_handle, row_index * row_size)), which is O(1). In an XML file, the most efficient approach is to run a SAX parser over the document, doing a great many reads before you reach your actual data; you can hardly do better than O(n), unless you use indexes; but then you'd have to rebuild the entire index on every insertion (see below).
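To make the file-seek point concrete, here is a minimal Python sketch of fixed-width row storage. The layout (a 4-byte id plus a 16-byte name field) is invented for illustration; a real engine adds pages, caching, and variable-width handling, but the O(1) lookup principle is the same:

```python
import struct
import tempfile

# Invented fixed-width row layout: 4-byte little-endian int id + 16-byte name.
ROW = struct.Struct("<i16s")

def write_rows(f, rows):
    for rid, name in rows:
        f.write(ROW.pack(rid, name.encode().ljust(16, b"\x00")))

def read_row(f, row_index):
    # O(1): compute the byte offset and jump straight to it, no scanning.
    f.seek(row_index * ROW.size)
    rid, raw = ROW.unpack(f.read(ROW.size))
    return rid, raw.rstrip(b"\x00").decode()

def demo():
    with tempfile.TemporaryFile() as f:
        write_rows(f, [(i, f"name{i}") for i in range(1000)])
        return read_row(f, 742)

print(demo())  # → (742, 'name742')
```

Because every row has the same size, the offset arithmetic replaces any kind of search entirely.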
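For contrast, here is the streaming-parse lookup on an XML version of the same invented data, using Python's ElementTree in iterparse (SAX-style) mode. Every element before the match still has to be read and parsed, so the cost grows linearly with how deep into the document the value sits:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document holding the same 1000 records as nested tags.
doc = ("<rows>"
       + "".join(f'<row id="{i}">name{i}</row>' for i in range(1000))
       + "</rows>").encode()

def find_row(xml_bytes, wanted_id):
    # Streaming parse: O(n) in document size, because everything that
    # precedes the match must still pass through the parser.
    scanned = 0
    for _, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "row":
            scanned += 1
            if elem.get("id") == str(wanted_id):
                return elem.text, scanned
            elem.clear()  # discard rows we had to parse but don't need
    return None, scanned

print(find_row(doc, 742))  # → ('name742', 743)
```

Note the second return value: to find record 742 we had to parse 743 elements, where the fixed-width table needed a single seek.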
Inserting is even worse. Relational databases do not guarantee row order, which means they can simply append new rows, or overwrite rows marked as 'deleted'. This is extremely fast: the DB can keep a pool of writable locations around; taking an entry from the pool is O(1) unless the pool is empty, and even in the worst case, when a new page has to be allocated, the operation is still O(1). By contrast, an XML-based database would have to move everything after the insertion point to make room, which is O(n). When indexes come into play, things get even more interesting: a typical relational-database index can be updated with relatively low complexity, say O(log n); but if you index your XML files, every insertion potentially changes the on-disk location of every value in the document, so you have to rebuild the entire index. The same goes for updates, because changing, say, an element's text content can change its size, which means all the subsequent XML has to shift. A relational database doesn't have to touch the index at all when you update a non-indexed column; an XML database would have to rebuild the entire index for every update that changes the size of the updated node.
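The difference between appending and order-preserving insertion can be sketched with plain byte buffers (Python; the offsets and tag names are invented, and a real XML store would additionally have to re-serialize and re-index):

```python
import io

def append_record(buf, record):
    # What a relational store can do: O(1), nothing that exists moves.
    buf.seek(0, io.SEEK_END)
    buf.write(record)

def insert_at(buf, offset, record):
    # What an order-preserving XML file forces: everything after the
    # insertion point must be read back and rewritten, which is O(n).
    buf.seek(offset)
    tail = buf.read()
    buf.seek(offset)
    buf.write(record + tail)

table = io.BytesIO(b"ROW0ROW1")
append_record(table, b"ROW2")          # cost independent of table size
print(table.getvalue())                # → b'ROW0ROW1ROW2'

doc = io.BytesIO(b"<rows><a/><c/></rows>")
insert_at(doc, 10, b"<b/>")            # insert between <a/> and <c/>
print(doc.getvalue())                  # → b'<rows><a/><b/><c/></rows>'
```

The append stays O(1) no matter how large the table grows; the mid-document insert rewrites the whole tail, and any byte-offset index built over the document is invalidated in the process.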
Those are the most important downsides, but there are more. XML is very verbose, which is good for server-to-server communication, because it adds safety: the receiving server can perform all sorts of integrity checks on the XML, and if anything went wrong in the transfer, the document is unlikely to validate. For mass storage, however, this is a killer: 100% or more overhead is common for XML data (overhead ratios in the 1000% range are not unusual for things like SOAP messages), while a typical relational storage scheme carries only a constant overhead for table metadata plus a small amount per row; most of the remaining overhead in relational databases comes from fixed column widths. If you have a terabyte of data, a 500% overhead is simply unacceptable, for many reasons.
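As a rough illustration of the overhead, here is the same invented 1000-row table serialized both ways (Python sketch; the exact ratio depends entirely on the schema and tag names, so treat the numbers as illustrative only):

```python
import struct

# The same hypothetical 1000 (id, name) rows, stored two ways.
rows = [(i, f"name{i}") for i in range(1000)]

ROW = struct.Struct("<i16s")  # fixed-width: 20 bytes per row
binary = b"".join(ROW.pack(i, n.encode().ljust(16, b"\x00")) for i, n in rows)

xml = ("<rows>"
       + "".join(f"<row><id>{i}</id><name>{n}</name></row>" for i, n in rows)
       + "</rows>").encode()

overhead = len(xml) / len(binary) - 1
print(f"binary: {len(binary)} B, xml: {len(xml)} B, overhead: {overhead:.0%}")
```

Even with short tag names and no attributes, the tags alone more than double the storage for this schema; add namespaces, envelopes, and indentation and the SOAP-style ratios mentioned above become plausible.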