Coder Perfect

How do you parse XML and count the number of instances of a specific node attribute?

Problem

I’m trying to develop a Python script to count instances of a specific node attribute in a database with numerous rows containing XML.

My tree looks like this:

<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>

How can I use Python to access the attribute values “1” and “2” in this XML?

Asked by randombits

Solution #1

I recommend ElementTree. Other compatible implementations of the same API exist, such as lxml (a third-party package) and cElementTree (in the Python standard library itself); in this context, though, what they mainly add is speed. The ease of programming comes from the API, which ElementTree defines.

First build an Element instance root from the XML, for example with the XML function, or by parsing a file with something like:

import xml.etree.ElementTree as ET
root = ET.parse('thefile.xml').getroot()

Or use any of the other ways shown in the ElementTree documentation. Then do something like:

for type_tag in root.findall('bar/type'):
    value = type_tag.get('foobar')
    print(value)

Similar code patterns cover most other cases, and they are usually quite simple.
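To go from those values to the counts the question asks for, one option is to feed them into collections.Counter. The following is a minimal sketch and not part of the original answer: it parses the question's sample XML from a string with ET.fromstring instead of reading a file, and the helper name count_attribute_values is made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Sample document taken from the question.
SAMPLE_XML = """
<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>
"""


def count_attribute_values(xml_text, path, attribute):
    """Count how often each value of `attribute` occurs on elements matching `path`.

    Hypothetical helper for illustration; `path` is relative to the root element.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.findall(path):
        value = element.get(attribute)  # None when the attribute is missing
        if value is not None:
            counts[value] += 1
    return counts


counts = count_attribute_values(SAMPLE_XML, "bar/type", "foobar")
total = sum(counts.values())  # number of elements carrying the attribute
print(counts)
print(total)
```

For XML stored in database rows, the same helper could be called once per row and the resulting Counter objects added together, since Counter supports the + operator.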

Answered by Alex Martelli

Solution #2

The quickest and most straightforward method is minidom.

XML:

<data>
    <items>
        <item name="item1"></item>
        <item name="item2"></item>
        <item name="item3"></item>
        <item name="item4"></item>
    </items>
</data>

Python:

from xml.dom import minidom
xmldoc = minidom.parse('items.xml')
itemlist = xmldoc.getElementsByTagName('item')
print(len(itemlist))
print(itemlist[0].attributes['name'].value)
for s in itemlist:
    print(s.attributes['name'].value)

Output:

4
item1
item1
item2
item3
item4
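If the XML is already in memory (for example, read from a database column), a variant of the same idea can use minidom.parseString instead of minidom.parse. This is only a sketch under that assumption; the variable names are illustrative, and getAttribute/hasAttribute are used in place of the attributes mapping so that elements without a name attribute are skipped instead of raising KeyError.

```python
from xml.dom import minidom

# Same sample document as above, held in a string instead of items.xml.
ITEMS_XML = """<data>
    <items>
        <item name="item1"></item>
        <item name="item2"></item>
        <item name="item3"></item>
        <item name="item4"></item>
    </items>
</data>"""

xmldoc = minidom.parseString(ITEMS_XML)
itemlist = xmldoc.getElementsByTagName("item")

# Collect the attribute values, ignoring elements that lack the attribute.
names = [item.getAttribute("name") for item in itemlist if item.hasAttribute("name")]

print(len(names))
print(names)
```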

Answered by Ryan Christensen

Solution #3

You can also use BeautifulSoup:

from bs4 import BeautifulSoup

x="""<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

# Name the parser explicitly; bs4 warns when it has to guess one.
y = BeautifulSoup(x, "html.parser")
>>> y.foo.bar.type["foobar"]
'1'

>>> y.foo.bar.find_all("type")
[<type foobar="1"></type>, <type foobar="2"></type>]

>>> y.foo.bar.find_all("type")[0]["foobar"]
'1'
>>> y.foo.bar.find_all("type")[1]["foobar"]
'2'

Answered by YOU

Solution #4

There are numerous options available. If performance and memory utilization are important, cElementTree appears to be a good choice. When compared to merely reading in the file using readlines, it has a very low overhead.

The following table, copied from the cElementTree website, contains the pertinent metrics:

library                         time      space
xml.dom.minidom (Python 2.1)    6.3 s     80000K
gnosis.objectify                2.0 s     22000K
xml.dom.minidom (Python 2.4)    1.4 s     53000K
ElementTree 1.2                 1.6 s     14500K
ElementTree 1.2.4/1.3           1.1 s     14500K
cDomlette (C extension)         0.540 s   20500K
PyRXPU (C extension)            0.175 s   10850K
libxml2 (C extension)           0.098 s   16000K
readlines (read as utf-8)       0.093 s   8850K
cElementTree (C extension)  --> 0.047 s   4900K <--
readlines (read as ascii)       0.032 s   5050K

As @jfs pointed out, cElementTree comes bundled with Python.
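How to import it depends on the Python version. The snippet below is a hedged sketch, not taken from the original answer: it assumes the code may have to run on both old and new interpreters. The xml.etree.cElementTree module was deprecated in Python 3.3 and removed in Python 3.9; from 3.3 on, xml.etree.ElementTree picks up the C accelerator automatically when one is available, so on modern Python the plain import is all that is needed.

```python
try:
    # Python 2 and Python 3 before 3.9: explicit C implementation.
    import xml.etree.cElementTree as ET
except ImportError:
    # Python 3.9+: the cElementTree module is gone; the regular module
    # already uses the C accelerator when it is available.
    import xml.etree.ElementTree as ET

# Sample document from the question, inlined so the snippet is self-contained.
root = ET.fromstring('<foo><bar><type foobar="1"/><type foobar="2"/></bar></foo>')

# iter() walks the whole tree, so no path relative to the root is needed.
values = [node.get("foobar") for node in root.iter("type")]
print(values)
```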

Answered by Cyrus

Solution #5

I suggest xmltodict for its simplicity.

It parses your XML into an OrderedDict:

>>> e = """<foo>
...     <bar>
...         <type foobar="1"/>
...         <type foobar="2"/>
...     </bar>
... </foo>"""

>>> import xmltodict
>>> result = xmltodict.parse(e)
>>> result

OrderedDict([(u'foo', OrderedDict([(u'bar', OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])]))]))])

>>> result['foo']

OrderedDict([(u'bar', OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])]))])

>>> result['foo']['bar']

OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])])

Answered by myildirim

Post is based on https://stackoverflow.com/questions/1912434/how-to-parse-xml-and-count-instances-of-a-particular-node-attribute