Programmers Stack Exchange is a question and answer site for professional programmers interested in conceptual questions about software development. It's 100% free.

I've been wondering why XML has an L in its name.

By itself, XML doesn't "do" anything. It's just a data storage format, not a language! Languages "do" things.

The way you get XML to "do" stuff, to turn it into a language proper, is to add xmlns attributes to its root element. Only then does it tell its environment what it's about.
One example is XHTML. It's active, it has links, hypertext, styles etc, all triggered by the xmlns. Without that, an XHTML file is just a bunch of data in markup nodes.

So why then is XML called a language? It doesn't describe anything, it doesn't interpret, it just is.

20  
On what do you base your requirement that a language has to "do" something? I don't see that in any of the definitions at dictionary.com. – kdgregory 14 hours ago
2  
@kdgregory Those are human languages, not computer languages. In any case, the page makes a big deal of the "communication" part. A language is meant to be understood by both parties, the sender and the receiver. XML does this only if it contains namespace information! – Mr Lister 14 hours ago
4  
Markup language is a common term en.wikipedia.org/wiki/Markup_language – Paparazzi 11 hours ago
4  
@MrLister: "Those are human languages, not computer languages" A language is a language. At its most extreme, even English requires contextual information (which dialect is being used) to understand unambiguously. Doesn't stop it from being a language. Your question simply has a false premise. – BarryTheHatchet 11 hours ago
3  
Languages don't do things, they express and communicate things – Hagen von Eitzen 6 hours ago

Because it is a language. A markup language, not a programming language.

Notice that natural human languages like English and Spanish don't "do" anything either. In fact, technically C++ and Java and the like don't "do" anything until they're fed into a compiler and the output gets executed. Doing stuff and being a language are largely orthogonal to each other.

3  
Substitute "interpreter" for "compiler". Being fed to a compiler doesn't make them "do" anything, either, it just translates them into a different language, which, again, doesn't "do" anything. All execution is interpretation. Sometimes, the interpreter might be extremely simple and implemented in silicon, in which case we call it an "execution unit", but it's still an interpreter. </nitpick> Anyway, good answer! – Jörg W Mittag 8 hours ago
2  
@JörgWMittag Good point. Since I randomly chose languages that are normally compiled, I added "and the output gets executed". – Ixrec 8 hours ago

Let Σ be a non-empty, finite set of symbols, called an alphabet. Then Σ* is the countably infinite set of finite words that can be formed by concatenating zero or more symbols from Σ. Any well-defined subset L ⊆ Σ* is a language.
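As a rough illustration of this definition, here is a minimal Python sketch. The two-letter alphabet `SIGMA` and the example language { aⁿbⁿ : n ≥ 0 } are assumptions chosen for brevity, and the names `sigma_star` and `in_language` are made up for this example.

```python
from itertools import count, islice, product

SIGMA = ("a", "b")  # example alphabet, chosen only for illustration


def sigma_star(alphabet):
    """Yield every finite word over the alphabet, shortest first."""
    for length in count(0):
        for letters in product(alphabet, repeat=length):
            yield "".join(letters)


def in_language(word):
    """Membership predicate for the example language { a^n b^n : n >= 0 }."""
    n = len(word) // 2
    return word == "a" * n + "b" * n


# The first few words of SIGMA*, starting with the empty word.
first_words = list(islice(sigma_star(SIGMA), 7))

# All words of length <= 4 (1 + 2 + 4 + 8 + 16 = 31 words) that are in the language.
members = [w for w in islice(sigma_star(SIGMA), 31) if in_language(w)]
```

The generator never terminates by itself, which mirrors Σ* being infinite; the language is simply whatever subset the predicate picks out.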

Let's apply this to XML. Its alphabet is the Unicode character set U, which is non-empty and finite. Not every concatenation of zero or more Unicode characters is a well-formed XML document; for example, the string

<tag> soup &; not <//good>

is clearly not. The subset XML ⊂ U* of well-formed XML documents is decidable (or “recursive”): there exists a machine (algorithm or computer program) that takes as input any word w ∈ U* and, after a finite amount of time, outputs 1 if w ∈ XML and 0 otherwise. Such an algorithm is a sub-routine of any XML processing software. Not all languages are decidable. For example, the set of valid C programs that terminate in a finite amount of time is not (this is known as the halting problem). When one designs a new language, an important decision to make is whether it should be as powerful as possible or whether the expressiveness would better be restricted in favor of decidability.
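Such a decider can be sketched with the parser from Python's standard library. This is only an approximation of the idea: the function name `is_well_formed` is made up here, and the answer is only as strict as the well-formedness checks of the underlying parser.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return 1 if text parses as a well-formed XML document, else 0."""
    try:
        # Encode to bytes so an encoding declaration in the document is honoured.
        ET.fromstring(text.encode("utf-8"))
    except ET.ParseError:
        return 0
    return 1


tag_soup = "<tag> soup &; not <//good>"
small_doc = '<?xml version="1.0" encoding="utf-8" ?>\n<root>all evil</root>'
```

Calling `is_well_formed(tag_soup)` gives 0 and `is_well_formed(small_doc)` gives 1, and the function halts on every input string it can encode.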

Schemata are an addition to XML that allow refining the set of well-formed documents. A well-formed document that follows a certain schema is called valid according to that schema. For example, the string

<?xml version="1.0" encoding="utf-8" ?>
<root>all evil</root>

is a well-formed XML document but not a valid XHTML document. There exist schemata for XHTML, SVG, XSLT and many other vocabularies. Schema validation can also be done by an algorithm that is guaranteed to halt after a finite number of steps for every input. Such a program is called a validator or a validating parser.
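The difference between well-formed and valid can be hinted at with a deliberately tiny stand-in for a validator. Python's standard library has no schema validator, so the sketch below checks only one necessary condition (the root element is `html` in the XHTML namespace) and is not the real XHTML schema; the function name `looks_like_xhtml` is hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def looks_like_xhtml(text):
    """Toy check: well-formed AND the root is <html> in the XHTML namespace.

    This is a necessary condition only, not a full schema validation.
    """
    try:
        root = ET.fromstring(text.encode("utf-8"))
    except ET.ParseError:
        return False  # not even well-formed
    return root.tag == "{%s}html" % XHTML_NS


well_formed_only = '<?xml version="1.0" encoding="utf-8" ?>\n<root>all evil</root>'
minimal_xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>t</title></head><body/></html>"
)
```

Here `looks_like_xhtml(well_formed_only)` is `False` even though the document is well-formed, while `looks_like_xhtml(minimal_xhtml)` is `True`. A real validator applies the whole schema in the same spirit.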

Because you can define your own schemata, XML is called an extensible language, which is the origin of the “X” in “XML”.

You can define a set of rules that gives XML documents an interpretation as descriptions of computer programs. XSLT, mentioned earlier, is an example of such a programming language built with XML. More generally, you can serialize the abstract syntax tree of almost any programming language quite naturally into XML, if this is what you want.
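As a sketch of that last point, Python's own abstract syntax tree can be dumped into XML with a few lines of standard-library code. The element and attribute layout below is an arbitrary choice made for this example, not an established format, and `ast_to_xml` is a made-up helper name.

```python
import ast
import xml.etree.ElementTree as ET


def ast_to_xml(node):
    """Convert a Python AST node into an XML element named after the node type."""
    elem = ET.Element(type(node).__name__)
    for name, value in ast.iter_fields(node):
        if isinstance(value, ast.AST):
            # A single child node becomes a wrapper element named after the field.
            ET.SubElement(elem, name).append(ast_to_xml(value))
        elif isinstance(value, list):
            wrapper = ET.SubElement(elem, name)
            for item in value:
                if isinstance(item, ast.AST):
                    wrapper.append(ast_to_xml(item))
                else:
                    ET.SubElement(wrapper, "item").text = repr(item)
        elif value is not None:
            # Plain values (names, constants) are stored as attributes.
            elem.set(name, repr(value))
    return elem


tree = ast_to_xml(ast.parse("x = 1 + 2"))
xml_text = ET.tostring(tree, encoding="unicode")
```

The resulting `xml_text` is a well-formed XML document whose root element is `Module`. Whether it counts as a "program" depends entirely on the rules a consumer applies to it.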

    
What do you mean by well-defined subset? And why do you require that a subset of Σ* be well-defined in order for it to be a language? Wouldn't any subset be called a language? – Giorgio 7 hours ago
    
@Giorgio: In mathematics, "well-defined" is largely just an intensifier: everything that mathematically exists is already well-defined. – Kevin 6 hours ago
2  
@Giorgio With “well-defined” I mean that there is a formal predicate that tells whether an item belongs to the set or not. This predicate will in general not be computable but it has to be clearly specified without contradiction. Otherwise, bad things might happen. “The pairs of strings (w, M) where M is the smallest description of a Turing machine that outputs w and then halts” is a well-defined but non-computable (see Kolmogorov complexity) predicate. … – 5gon12eder 6 hours ago
1  
… “The set containing all strings that are not contained in the set” is self-contradicting and not useful. – 5gon12eder 6 hours ago
1  
@5gon12eder: “The set containing all strings that are not contained in the set” does not exist. The term "well-defined" is ironically not well-defined. – Kevin 6 hours ago

In computer science, a formal language is just a set of strings, usually infinite and often described using rules (two common versions of those rules are regular expressions and formal grammars).

Note that this means that all a language needs is syntax. A language doesn't need to describe what each valid string means (that's called semantics).
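To make "a set of strings described by rules" concrete, here is a small Python sketch with two example languages picked only for illustration: one given by a regular expression and one given by the grammar S → "(" S ")" S | ε (balanced parentheses). The function names are hypothetical.

```python
import re

# Language 1, described by a regular expression: zero or more repetitions of "ab".
AB_STAR = re.compile(r"(?:ab)*")


def in_ab_star(word):
    """Return True if the whole word matches (ab)*."""
    return AB_STAR.fullmatch(word) is not None


def in_balanced(word):
    """Recognizer for the grammar S -> "(" S ")" S | empty (balanced parentheses)."""
    depth = 0
    for ch in word:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # a ")" closed something that was never opened
        else:
            return False  # symbol outside the alphabet { "(", ")" }
    return depth == 0
```

Neither function says anything about what a string means. Both only decide membership, which is all the syntax of a formal language requires.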

Programming languages, then, are formal languages that also have semantics, which describe some computation. XHTML, for example, is a formal language whose semantics describe (roughly and informally) how a hypertext document looks and behaves.

XML is still a language, even though it doesn't have semantics itself (but many languages derived from XML do, like XHTML and XAML).

Technically, binary formats are also languages, but they're not usually called that. The term "language" is reserved for human-readable formats.

    
Then why aren't other storage formats called languages too? The format used in BMP or JFIF files? Database files? – Mr Lister 14 hours ago
5  
@MrLister Because they're not human-readable. When they aren't human-readable we tend to call them formats or data formats instead. – Mason Wheeler 13 hours ago
    
@MrLister - It's worth pointing out that XML (like HTML) borrowed heavily from SGML which borrowed heavily from IBM's GML (generalized markup language) which dates to the 60's. Having the ML demonstrates that there is some relationship between those languages. Plus given the dates involved, the definition you use for what denotes a language in computing is something that has changed to a great extent since the use was coined. – James Snell 13 hours ago
2  
@JamesSnell Not to be confused with the other ML language family, of course. Yay for over-crowded acronyms! – Mason Wheeler 13 hours ago
    
Mason is correct, but it is important to note that formats such as BMP do have languages. The difference is solely in how we refer to them. – Snowman 9 hours ago

A language is a method of conveying information.

A programming language is a method of conveying algorithms.

A markup language like XML is a language for conveying data.


XML is a meta-language. You use it to define specific languages. Languages never do anything; they just allow us to express things. Also, it is not true that XML is a "storage language". Just the opposite, in fact: you can store XML docs however you please. XML is better thought of as a transfer language. PS. If you don't think XML "does" anything, you'll have to explain how it is that many systems (e.g. jetty) use XML as a (bad) programming language. It's a lamentable abuse of XML, but it exists in the wild, and that's just one of many examples.


There is no formal definition of the word language in computing. Any structured textual format which can be processed computationally can be called a language. A language doesn't "do" anything as such, but some software might process input in the language and "do" something based on it.

You note that XML is a "storage format", which is true, but a textual storage format can also be called a language. These terms are not mutually exclusive.

Programming languages are a subset of languages. E.g. HTML and CSS are languages but not programming languages, while JavaScript is a real programming language. That said, there is no formal definition of programming language either, and there is a large grey zone of languages which could be called either data formats or programming languages depending on point of view.

Given this, XML is clearly a language, just not a programming language, although it can be used to define programming languages like XSLT.

Your point about namespaces is irrelevant. Namespaces are an optional feature of XML and do not change the semantics of an XML format. They are only needed to disambiguate element names if the format may contain multiple vocabularies.
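A short Python sketch of that disambiguation role, using the standard library parser. The `http://example.com/books` namespace and the document itself are invented for this example.

```python
import xml.etree.ElementTree as ET

# Two vocabularies that both define a <title> element.
DOC = """\
<library xmlns:h="http://www.w3.org/1999/xhtml"
         xmlns:b="http://example.com/books">
  <b:title>Moby-Dick</b:title>
  <h:title>Search results</h:title>
</library>"""

root = ET.fromstring(DOC)
# The parser expands each prefix into its namespace URI, so the names differ.
tags = [child.tag for child in root]

# A document without any namespace declarations parses just as well.
plain = ET.fromstring("<library><title>Moby-Dick</title></library>")
plain_tags = [child.tag for child in plain]
```

After parsing, `tags` holds two distinct qualified names, while `plain_tags` holds the bare name `title`. The namespace only tells the two `title` elements apart; it does not make the parser "do" anything else.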

    
@Snowman: A "formal language" does not necessarily correspond to what is usually called a language in computing. For example a "formal language" does not need to be textual - machine code is a formal language, as are most binary formats and protocols. So I wouldn't say the term "formal language" covers the same meaning as "language" in computing. – JacquesB 9 hours ago
3  
I am not aware of any requirement that a language needs to be textual or not be textual. The idea of constructing a sentence from terminals has nothing to do with an arbitrary interpretation of the bits in those terminals, or which types of computers (silicon or carbon-based) are capable of reading them. – Snowman 9 hours ago
1  
@JacquesB: "For example a "formal language" does not need to be textual - machine code is a formal language, as are most binary formats and protocols." Are you saying that machine languages aren't languages just because they're encoded in binary for consumption by a CPU? – Nicol Bolas 6 hours ago
