Maybe you know this problem: You have a couple of XML files and you need a specific piece of information from them. Probably everybody would think of grep or similar tools first. But maybe your query is a bit more complicated than just a simple piece of text. What to do?
I recently found a very useful command line utility which is probably not widely known. It’s named xpathgrep.py and you can get it from the lxml repository (you need lxml too). Let’s assume we have the following DocBook file:
File db.xml
<?xml version="1.0"?>
<book>
<title>My Cooking Book</title>
<chapter>
<title>Ingredients</title>
<para>...</para>
</chapter>
<chapter id="howtocook">
<title>How to cook</title>
<para>...</para>
</chapter>
</book>
Now, if I want to get all the titles, I have to use an XPath expression (XPath is a path description language for XML, similar to Unix/Linux paths, but more powerful). To get all title elements, regardless of their level, all I have to write is //title:
$ xpathgrep.py //title db.xml
and I get this:
<title>My Cooking Book</title>
<title>Ingredients</title>
<title>How to cook</title>
Nice, isn’t it? You might say: “But hey, I can get this with grep too!” Yes, but if you want only the chapter titles, grep gets you into trouble. With XPath and xpathgrep.py I just modify my XPath expression a bit:
$ xpathgrep.py //chapter/title db.xml
This reduces the output above to just the chapter titles. And I can narrow my query further to only those chapters that don’t have an id attribute:
$ xpathgrep.py '//chapter[not(@id)]/title' db.xml
(You need the single quotes because the shell would otherwise interpret the brackets and parentheses.) The tool outputs this:
<title>Ingredients</title>
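To see how the three expressions narrow the result step by step, the same queries can be run from Python with lxml. This is only a sketch for comparison; the helper title_texts is a name invented here, and the sample document is inlined again:

```python
from lxml import etree

# Same sample document as db.xml above.
DB_XML = b"""<?xml version="1.0"?>
<book>
<title>My Cooking Book</title>
<chapter>
<title>Ingredients</title>
<para>...</para>
</chapter>
<chapter id="howtocook">
<title>How to cook</title>
<para>...</para>
</chapter>
</book>"""

root = etree.fromstring(DB_XML)


def title_texts(expression):
    """Hypothetical helper: text content of every element the XPath selects."""
    return [element.text for element in root.xpath(expression)]


# Every title, at any depth.
all_titles = title_texts("//title")
# Only titles that are direct children of a chapter.
chapter_titles = title_texts("//chapter/title")
# The predicate [not(@id)] keeps only chapters without an id attribute.
no_id_titles = title_texts("//chapter[not(@id)]/title")

for label, result in [
    ("//title", all_titles),
    ("//chapter/title", chapter_titles),
    ("//chapter[not(@id)]/title", no_id_titles),
]:
    print(label, "->", result)
```

The part in square brackets is a predicate: it filters the chapter elements before the path continues to their title children.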
That’s nice, isn’t it? There is a lot more to discover. A few hours ago I sent a small patch to the lxml-devel mailing list to support namespaces. Hopefully, it will be accepted. 🙂
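Why namespaces matter: in XPath 1.0 an unprefixed name like title only matches elements in no namespace, so a namespaced document needs a prefix mapping. The following sketch shows this with lxml's own xpath() method. It is not the patch mentioned above and says nothing about the command line options of xpathgrep.py; the namespaced variant of the sample file is an assumption made for the illustration:

```python
from lxml import etree

# Assumed variant of the sample: the same book, but in the DocBook 5 namespace.
NS_XML = b"""<book xmlns="http://docbook.org/ns/docbook" version="5.0">
<title>My Cooking Book</title>
<chapter>
<title>Ingredients</title>
</chapter>
</book>"""

root = etree.fromstring(NS_XML)

# An unprefixed name matches only elements without a namespace: no hits here.
plain = root.xpath("//title")

# Map a prefix of our own choosing ("db") to the namespace URI and use it.
ns = {"db": "http://docbook.org/ns/docbook"}
prefixed = [t.text for t in root.xpath("//db:title", namespaces=ns)]

print(len(plain), prefixed)
```

The prefix in the mapping is arbitrary; only the namespace URI has to match the one declared in the document.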