Playing With XPath Expressions in The xmllint Shell

November 23rd, 2009 by

When XML is transformed into something else, in most cases XSLT comes into play. One of the challenges of XSLT is to select just the nodes you are interested in. This task is done by XPath, “a query language for selecting nodes from an XML document.”

However, it can be tedious to create an XPath expression, run the transformation, and check if you got the expected result. After hours of debugging you find out: It’s the wrong XPath expression!

To make it easier: Test your XPath expressions in the internal xmllint shell!

Using Easy XPath Expressions

Generally, xmllint is known as a popular tool to validate your XML structure. Less well known is its internal shell. With this shell you can make some spiffy XPath tests and check whether an expression returns exactly what you want. Let’s consider the following DocBook 4 document:

<book lang="en">
  <title>Dancing with Penguins</title>
  <bookinfo>
    <author>
      <firstname>Tux</firstname>
      <surname>Penguin</surname>
    </author>
  </bookinfo>
  <chapter id="know.penguins">
    <title>Getting to Know Penguins</title>
    <abstract>
      <para>Penguins are cute.</para>
    </abstract>
    <sect1>
      <title>The Head</title>
      <para>...</para>
    </sect1>
    <!-- A small comment -->
    <sect1 id="penguin.coat">
      <title>The Coat</title>
      <para>...</para>
    </sect1>
  </chapter>
</book>

The content is less important than the structure. To examine some XPath features of xmllint, we load the document into its shell using the --shell option:

xmllint --shell penguin-dance.xml

You first see the prompt:

/ >

The prompt shows you the path to your current node. After loading you just see the root node, which is indicated as /. This is much like Linux path notation.

Use help to list all available commands. For this little post, we focus on the xpath command. It evaluates an XPath expression in the context and prints the result. Let’s try an absolute XPath:

/ > xpath /book
Object is a Node Set :
Set contains 1 nodes:
1  ELEMENT book
    ATTRIBUTE lang
      TEXT
        content=en

Well, that was to be expected. The interesting part is, you can change the context. For example, we could change it to the first chapter:

/ > cd book/chapter
  chapter >

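Surprised we didn’t use an absolute XPath? Well, our context was already the root node, which contains the book node. In this case, it doesn’t matter whether we use a relative or an absolute XPath. Both lead to the same node. However, this is not always the case: a relative XPath is evaluated from the current context node, so the same expression can select different nodes once the context has changed.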
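To make the context dependence concrete outside the xmllint shell, here is a minimal sketch in Python. It is only an illustration under assumptions: it uses the standard library module xml.etree.ElementTree (which supports just a subset of XPath) and a shortened copy of the penguin document. The variable names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the penguin document (illustration only).
DOC = """<book lang="en">
  <title>Dancing with Penguins</title>
  <chapter id="know.penguins">
    <title>Getting to Know Penguins</title>
  </chapter>
</book>"""

book = ET.fromstring(DOC)        # fromstring returns the root element, <book>
chapter = book.find("chapter")   # roughly what "cd book/chapter" does in the shell

# The same relative path "title" selects different nodes,
# depending on the element it is evaluated from.
book_title = book.findtext("title")
chapter_title = chapter.findtext("title")

print(book_title)
print(chapter_title)
```

The relative path title gives the book title when evaluated from book, and the chapter title when evaluated from chapter.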

Let’s see what we have inside the chapter:

chapter > xpath *
  1  ELEMENT title
  2  ELEMENT abstract
  3  ELEMENT sect1
  4  ELEMENT sect1

Yes, that’s right. Note that the comment does not show up, because * selects element nodes only. Ok, now we want all sections in this chapter that don’t have an id attribute. This can be achieved by using an XPath predicate and the XPath function not:

chapter > xpath sect1[not(@id)]
  Object is a Node Set :
  Set contains 1 nodes:
  1  ELEMENT sect1

We need the title, so we just append /title after the previous expression:

chapter > xpath sect1[not(@id)]/title
  1  ELEMENT title

And we want the content, so we wrap it in the XPath function string:

chapter > xpath string(sect1[not(@id)]/title)
  Object is a string : The Head
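If you later need the same selection in a program, the steps above can be sketched in Python. This is a hedged example, not a replacement for the shell: it assumes the standard library module xml.etree.ElementTree, which has no not() or string() functions, so the predicate is written as a plain Python filter instead. The document is a shortened copy of the chapter from above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the chapter from the penguin document.
DOC = """<chapter id="know.penguins">
  <title>Getting to Know Penguins</title>
  <abstract>
    <para>Penguins are cute.</para>
  </abstract>
  <sect1>
    <title>The Head</title>
    <para>...</para>
  </sect1>
  <!-- A small comment -->
  <sect1 id="penguin.coat">
    <title>The Coat</title>
    <para>...</para>
  </sect1>
</chapter>"""

chapter = ET.fromstring(DOC)

# Counterpart of "xpath *": all child elements (the comment is not an element).
child_tags = [child.tag for child in chapter.findall("*")]

# Counterpart of sect1[not(@id)]: ElementTree lacks not(),
# so we filter the sect1 children in Python.
sections_without_id = [s for s in chapter.findall("sect1") if "id" not in s.attrib]

# Counterpart of string(sect1[not(@id)]/title): text of the title child.
titles = [s.findtext("title") for s in sections_without_id]

print(child_tags)
print(titles)
```

A full XPath engine evaluates sect1[not(@id)]/title directly. The filter here is only a workaround for the limited XPath subset in ElementTree.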

We could use a lot more expressions to get the previous or following nodes, the parent nodes or the child nodes. For now, this section is enough. Next, I make it a bit more difficult.

Using Namespaces in XPath Expressions

When dealing with XML it is not uncommon that documents contain one or more XML namespaces. To work with such structures, it is not enough to reuse the previous expressions. They will not work, because an element name without a prefix in XPath only matches elements that are in no namespace. Before you can work with namespaces, you have to define them first.

Let’s consider KIWI. The configuration is an XML file, based on a RELAX NG schema. The elements of a RELAX NG schema are bound to a namespace. Load the KIWI schema with the following command:

xmllint --shell http://gitorious.org/kiwi/kiwi/blobs/raw/master/modules/KIWISchema.rng

As the KIWI schema can (and probably will) change, your results might be a little different from mine. But the principle is the same.

Now we want to know what the root element is. As we do not know the root element’s name (yet), we use a wildcard:

/ > xpath *
Object is a Node Set :
Set contains 1 nodes:
1  ELEMENT grammar
namespace db href=http://docbook.org/ns/docbook
namespace a href=http://relaxng.org/ns/compatibility/anno...
namespace rng href=http://relaxng.org/ns/structure/1.0
namespace xsi href=http://www.w3.org/2001/XMLSchema-instanc...
default namespace href=http://relaxng.org/ns/structure/1.0
ATTRIBUTE datatypeLibrary
TEXT
content=http://www.w3.org/2001/XMLSchema-datatyp...

As you can see, the KIWI schema has 5 namespace declarations in the grammar element: four prefixes plus the default namespace. A RELAX NG schema uses the namespace http://relaxng.org/ns/structure/1.0, which in our case is both the default namespace and bound to the rng prefix. For convenience, we bind it to the prefix r with the setns command:

/ > setns r=http://relaxng.org/ns/structure/1.0

The prefix is unimportant; what matters is the namespace. We could use the one which is defined in the schema, it wouldn’t matter. But “r” is shorter than “rng”. 🙂 After you have defined the XML namespace, you can enter all XPath expressions. However, you have to insert the prefix in front of your element names. For example, we can count all defined elements in the KIWI schema. RELAX NG uses an element named element (surprise!) for this. To get the number of all defined elements, apply the XPath function count to the // expression:

/ > xpath count(//r:element)
Object is a number : 80
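The same idea of binding a prefix of your own choice to the namespace shows up in most XPath APIs. As a minimal sketch, assuming Python’s standard library module xml.etree.ElementTree and a tiny made-up RELAX NG grammar (not the real KIWI schema), the prefix mapping plays the role of setns:

```python
import xml.etree.ElementTree as ET

# Tiny hypothetical grammar for illustration; NOT the KIWI schema.
SCHEMA = """<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <ref name="k.image"/>
  </start>
  <define name="k.image">
    <element name="image">
      <ref name="k.description"/>
    </element>
  </define>
  <define name="k.description">
    <element name="description">
      <text/>
    </element>
  </define>
</grammar>"""

# Counterpart of "setns r=...": the prefix "r" is our own choice,
# only the namespace URI has to match the document.
NS = {"r": "http://relaxng.org/ns/structure/1.0"}

grammar = ET.fromstring(SCHEMA)

# Without a prefix nothing matches, because the elements live in a namespace.
unprefixed = grammar.findall(".//element")

# With the prefix, this corresponds to count(//r:element).
prefixed = grammar.findall(".//r:element", NS)

print(len(unprefixed), len(prefixed))
```

The unprefixed search finds nothing, while the prefixed one finds both element definitions of the sample grammar.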

Generally, every RELAX NG schema contains a start element. What does it contain?

/ > xpath /r:grammar/r:start/*
Object is a Node Set :
Set contains 1 nodes:
1  ELEMENT ref
    ATTRIBUTE name
      TEXT
        content=k.image

Aha, there is a ref element. This element has a name attribute. We can also select this attribute directly by extending the absolute path. Let’s try it:

/ > xpath /r:grammar/r:start/r:ref/@name
Object is a Node Set :
Set contains 1 nodes:
1  ATTRIBUTE name
    TEXT
      content=k.image

In RELAX NG, every ref element has to point to a define element. Let’s see what we get when we ask for all of them, using the // expression again:

/ > xpath //r:define
Object is a Node Set :
Set contains 310 nodes:
1  ELEMENT define
    ATTRIBUTE name
      TEXT
        content=k.image.name.attribute
...
310  ELEMENT define
    ATTRIBUTE name
      TEXT
        content=k.users

Ohh, that’s a bit too much. We want to know just the one from /r:grammar/r:start/r:ref/@name. The good news is: you can combine both with a predicate:

/ > xpath //r:define[@name=/r:grammar/r:start/r:ref/@name ]
Object is a Node Set :
Set contains 1 nodes:
1  ELEMENT define
    ATTRIBUTE name
      TEXT
        content=k.image

What’s inside?

/ > xpath //r:define[@name=/r:grammar/r:start/r:ref/@name ]/*
Object is a Node Set :
Set contains 1 nodes:
1  ELEMENT element
    ATTRIBUTE name
      TEXT
        content=image

An image element! This element appears as root element in every KIWI configuration (which you guessed already.) And what’s the definition?

/ > xpath //r:define[@name=/r:grammar/r:start/r:ref/@name ]/r:element/*
Object is a Node Set :
Set contains 3 nodes:
1  ELEMENT a:documentation
2  ELEMENT db:para
3  ELEMENT interleave

The first two elements (a:documentation, db:para) are just for documentation. The interesting part is interleave. I leave it up to you to investigate XPath and the KIWI schema further. 🙂
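The walk from the start element to its definition can also be sketched in a program. The following example is an illustration under assumptions: it uses Python’s standard library module xml.etree.ElementTree and a tiny made-up grammar instead of the KIWI schema. As ElementTree cannot compare against a second path inside a predicate, the lookup is done in two steps.

```python
import xml.etree.ElementTree as ET

# Tiny hypothetical grammar for illustration; NOT the KIWI schema.
SCHEMA = """<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         xmlns:db="http://docbook.org/ns/docbook">
  <start>
    <ref name="k.image"/>
  </start>
  <define name="k.image">
    <element name="image">
      <db:para>The root element of the configuration.</db:para>
      <interleave>
        <text/>
      </interleave>
    </element>
  </define>
</grammar>"""

NS = {"r": "http://relaxng.org/ns/structure/1.0"}

grammar = ET.fromstring(SCHEMA)

# Step 1: /r:grammar/r:start/r:ref/@name
target = grammar.find("r:start/r:ref", NS).get("name")

# Step 2: //r:define[@name=...] with the value from step 1
defines = grammar.findall(f".//r:define[@name='{target}']", NS)
define_names = [d.get("name") for d in defines]

# Children of the element definition; ElementTree reports tags
# as {namespace}localname, so the prefix really does not matter.
element = defines[0].find("r:element", NS)
child_tags = [child.tag for child in element]

print(define_names)
print(child_tags)
```

In the xmllint shell the single expression with the predicate does both steps at once, which is one more reason to try expressions there first.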

This was just an overview of the xmllint shell. It is very helpful to test some XPath expressions before you integrate them in XSLT or in programs.

Happy XPath-ing! 🙂


2 Responses to “Playing With XPath Expressions in The xmllint Shell”

  1. You may want to look at vtd-xml, another XPath engine that offers a lot of cool features

    http://vtd-xml.sf.net

  2. Thanks, I’ll have a look.