When XML is transformed into something else, XSLT usually comes into play. One of the challenges of XSLT is to select just the nodes you are interested in. This task is done by XPath, “a query language for selecting nodes from a XML document.”
However, it can be tedious to create an XPath expression, run the transformation, and check whether you got the expected result. After hours of debugging you find out: it’s the wrong XPath expression!
To make it easier: Test your XPath expressions in the internal xmllint shell!
Using Easy XPath Expressions
xmllint is best known as a tool for validating XML. Its internal shell is far less well known. With this shell you can try out XPath expressions and check that they return exactly what you want. Let’s consider the following DocBook 4 document:
<book lang="en">
  <title>Dancing with Penguins</title>
  <bookinfo>
    <author>
      <firstname>Tux</firstname>
      <surname>Penguin</surname>
    </author>
  </bookinfo>
  <chapter id="know.penguins">
    <title>Getting to Know Penguins</title>
    <abstract>
      <para>Penguins are cute.</para>
    </abstract>
    <sect1>
      <title>The Head</title>
      <para>...</para>
    </sect1>
    <!-- A small comment -->
    <sect1 id="penguin.coat">
      <title>The Coat</title>
      <para>...</para>
    </sect1>
  </chapter>
</book>
The content matters less than the structure. To examine some XPath features of xmllint, we load the document into its shell using the --shell option:
xmllint --shell penguin-dance.xml
You first see the prompt:
/ >
The prompt shows the path to your current node. Right after loading, you are at the root node, indicated by /, much like in a Linux path.
Use help to list all available commands. For this little post, we focus on the xpath command. It evaluates an XPath expression in the context and prints the result. Let’s try an absolute XPath:
/ > xpath /book
Object is a Node Set :
Set contains 1 nodes:
1 ELEMENT book
ATTRIBUTE lang
TEXT
content=en
Well, that was to be expected. The interesting part is, you can change the context. For example, we could change it to the first chapter:
/ > cd book/chapter
chapter >
Surprised that we didn’t use an absolute XPath? Our context was already the root node, which contains the book node. In this case it doesn’t matter whether we use a relative or an absolute XPath; both lead to the same node. However, this is not always the case.
Let’s see what we have inside the chapter:
chapter > xpath *
1 ELEMENT title
2 ELEMENT abstract
3 ELEMENT sect1
4 ELEMENT sect1
Yes, that’s right. OK, now we want all sections in this chapter that don’t have an id attribute. This can be achieved with an XPath predicate and the XPath function not:
chapter > xpath sect1[not(@id)]
Object is a Node Set :
Set contains 1 nodes:
1 ELEMENT sect1
We need the title, so we just append /title after the previous expression:
chapter > xpath sect1[not(@id)]/title
1 ELEMENT title
We want the content, so we wrap the expression in the XPath function string:
chapter > xpath string(sect1[not(@id)]/title)
Object is a string : The Head
We could use many more expressions to get the previous or following nodes, the parent nodes, or the child nodes. For now this is enough, so let’s make things a bit more difficult.
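If you would rather script such a check than type it into the shell, the same steps can be sketched in Python with the standard library's xml.etree.ElementTree. This is only a rough equivalent under some assumptions: the document is a trimmed inline copy of the sample above, and ElementTree implements just a subset of XPath (it has no not() function), so the sect1[not(@id)] predicate is replaced by an ordinary Python filter.

```python
import xml.etree.ElementTree as ET

# Trimmed inline copy of the DocBook sample from this post.
DOC = """\
<book lang="en">
  <title>Dancing with Penguins</title>
  <chapter id="know.penguins">
    <title>Getting to Know Penguins</title>
    <abstract><para>Penguins are cute.</para></abstract>
    <sect1><title>The Head</title><para>...</para></sect1>
    <!-- A small comment -->
    <sect1 id="penguin.coat"><title>The Coat</title><para>...</para></sect1>
  </chapter>
</book>
"""

book = ET.fromstring(DOC)        # the root element, like /book
chapter = book.find("chapter")   # roughly "cd book/chapter"

# Roughly "xpath *": the default parser drops comments,
# so only element children are listed.
child_names = [child.tag for child in chapter]

# ElementTree has no not(), so sect1[not(@id)] becomes a Python filter.
no_id = [s for s in chapter.findall("sect1") if "id" not in s.attrib]

# Roughly string(sect1[not(@id)]/title): text of the first match's title.
head_title = no_id[0].findtext("title")

print(child_names)
print(head_title)
```

The xmllint shell remains the quicker tool for exploring; a script like this is more useful once you know which expression you want and need to repeat the check.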
Using Namespaces in XPath Expressions
When dealing with XML, it is not uncommon for documents to contain one or more XML namespaces. For such documents it is not enough to reuse the previous expressions; they will not work. Before you can work with a namespace, you have to define it first.
Let’s consider KIWI. Its configuration is an XML file based on a RELAX NG schema. A RELAX NG schema is bound to a namespace. Load the KIWI schema with the following command:
xmllint --shell http://gitorious.org/kiwi/kiwi/blobs/raw/master/modules/KIWISchema.rng
As the KIWI schema can (and probably will) change, your results might differ a little from mine. But the principle is the same.
Now we want to look at the root element. As we do not know its name yet, we use a wildcard:
/ > xpath *
Object is a Node Set :
Set contains 1 nodes:
1 ELEMENT grammar
namespace db href=http://docbook.org/ns/docbook
namespace a href=http://relaxng.org/ns/compatibility/anno...
namespace rng href=http://relaxng.org/ns/structure/1.0
namespace xsi href=http://www.w3.org/2001/XMLSchema-instanc...
default namespace href=http://relaxng.org/ns/structure/1.0
ATTRIBUTE datatypeLibrary
TEXT
content=http://www.w3.org/2001/XMLSchema-datatyp...
As you can see, the grammar element of the KIWI schema carries five namespace declarations. A RELAX NG schema uses the namespace http://relaxng.org/ns/structure/1.0, which here is both the default namespace and bound to the rng prefix. For convenience, we define it with the setns command as just r:
/ > setns r=http://relaxng.org/ns/structure/1.0
The prefix is unimportant; what matters is the namespace. We could use the prefix defined in the schema, it wouldn’t matter. But “r” is shorter than “rng”. 🙂 After you have defined the XML namespace, you can enter any XPath expression. However, you have to put the prefix in front of your element names. For example, we can count all elements defined in the KIWI schema. RELAX NG uses an element named element (surprise!) for this. To get the number of all defined elements, apply the XPath function count to the // expression:
/ > xpath count(//r:element)
Object is a number : 80
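The point that only the namespace matters, not the prefix, can be checked outside the shell as well. Here is a minimal sketch using Python's standard xml.etree.ElementTree. The grammar in it is a made-up miniature for illustration, not the real KIWI schema, so the counts have nothing to do with the 80 above.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature grammar for illustration, NOT the KIWI schema.
GRAMMAR = """\
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start><ref name="k.image"/></start>
  <define name="k.image">
    <element name="image"><ref name="k.users"/></element>
  </define>
  <define name="k.users">
    <element name="users"><empty/></element>
  </define>
</grammar>
"""

RNG = "http://relaxng.org/ns/structure/1.0"
root = ET.fromstring(GRAMMAR)

# Without a prefix, "element" is looked up in no namespace: no matches.
unprefixed = root.findall(".//element")

# The prefix is arbitrary; only the namespace URI has to match.
with_r = root.findall(".//r:element", {"r": RNG})
with_rng = root.findall(".//rng:element", {"rng": RNG})

print(len(unprefixed), len(with_r), len(with_rng))
```

The dictionary passed to findall plays the role of setns: it maps a prefix of your choice to the namespace URI.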
Generally, every RELAX NG schema contains a start element. What does it contain?
/ > xpath /r:grammar/r:start/*
Object is a Node Set :
Set contains 1 nodes:
1 ELEMENT ref
ATTRIBUTE name
TEXT
content=k.image
Aha, there is a ref element, and it has a name attribute. We can also address that attribute directly with a longer path. Let’s try it:
/ > xpath /r:grammar/r:start/r:ref/@name
Object is a Node Set :
Set contains 1 nodes:
1 ATTRIBUTE name
TEXT
content=k.image
In RELAX NG, every ref element has to point to a define element. Let’s see what we get when we ask for all of them, using the // expression again:
/ > xpath //r:define
Object is a Node Set :
Set contains 310 nodes:
1 ELEMENT define
ATTRIBUTE name
TEXT
content=k.image.name.attribute
...
310 ELEMENT define
ATTRIBUTE name
TEXT
content=k.users
Oh, that’s a bit too much. We only want the one named by /r:grammar/r:start/r:ref/@name. The good news is that you can combine both with a predicate:
/ > xpath //r:define[@name=/r:grammar/r:start/r:ref/@name ]
Object is a Node Set :
Set contains 1 nodes:
1 ELEMENT define
ATTRIBUTE name
TEXT
content=k.image
What’s inside?
/ > xpath //r:define[@name=/r:grammar/r:start/r:ref/@name ]/*
Object is a Node Set :
Set contains 1 nodes:
1 ELEMENT element
ATTRIBUTE name
TEXT
content=image
An image element! This element appears as the root element in every KIWI configuration (which you have probably guessed already). And what’s the definition?
/ > xpath //r:define[@name=/r:grammar/r:start/r:ref/@name ]/r:element/*
Object is a Node Set :
Set contains 3 nodes:
1 ELEMENT a:documentation
2 ELEMENT db:para
3 ELEMENT interleave
The first two elements (a:documentation, db:para) are just for documentation. The interesting part is interleave. I leave it up to you to investigate XPath and the KIWI schema further. 🙂
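The chain we just walked through (read the name from start/ref, find the define with that name, look inside its element) can also be sketched as a small Python script with the standard xml.etree.ElementTree. Again the grammar is a hypothetical miniature, not the KIWI schema. ElementTree predicates only compare against literal values, so the name is read first and then inserted into the second path, instead of nesting one XPath inside the other as xmllint allows.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature grammar for illustration, NOT the KIWI schema.
GRAMMAR = """\
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start><ref name="k.image"/></start>
  <define name="k.users">
    <element name="users"><empty/></element>
  </define>
  <define name="k.image">
    <element name="image">
      <interleave><ref name="k.users"/></interleave>
    </element>
  </define>
</grammar>
"""

NS = {"r": "http://relaxng.org/ns/structure/1.0"}
root = ET.fromstring(GRAMMAR)  # the grammar element

# Roughly /r:grammar/r:start/r:ref/@name
start_name = root.find("r:start/r:ref", NS).get("name")

# Roughly //r:define[@name=...], with the name filled in as a literal.
define = root.find(f".//r:define[@name='{start_name}']", NS)

# Roughly .../r:element/*: local names of the element's children.
element = define.find("r:element", NS)
children = [child.tag.split("}", 1)[1] for child in element]

print(start_name, element.get("name"), children)
```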
This was just an overview of the xmllint shell. It is very helpful for testing XPath expressions before you integrate them into XSLT or into programs.
Happy XPath-ing! 🙂