lxml.objectify notes

Author: Dave Kuhlman
Address:
dkuhlman (at) davekuhlman (dot) org
Revision: 1.1a
Date: April 09, 2015
Copyright: Copyright (c) 2015 Dave Kuhlman. All Rights Reserved. This software is subject to the provisions of the MIT License http://www.opensource.org/licenses/mit-license.php.
Abstract: A document intended to help those getting started with lxml.objectify and, in particular, to help those attempting to transition from generateDS.py.

1   Introduction

This document offers some help to those starting out with lxml.objectify. It does not attempt to replace the official documentation, which you can find here: http://lxml.de/objectify.html.

Much of the code in this document assumes that you have done the following in your Python code (several examples also use etree and sys):

import sys
from lxml import etree, objectify

2   Migrating from generateDS.py to lxml.objectify

With lxml.objectify, unlike generateDS.py, there is no need to generate code before processing an XML instance document.

2.1   Parsing an XML instance document

Use something like the following:

def objectify_parse(infilename):
    doctree = objectify.parse(infilename)
    root = doctree.getroot()
    return doctree, root

Or, when you want to validate against a schema while parsing, use:

def objectify_parse_with_schema(schemaname, infilename):
    schema = etree.XMLSchema(file=schemaname)
    parser = objectify.makeparser(schema=schema)
    doctree = objectify.parse(infilename, parser=parser)
    root = doctree.getroot()
    return doctree, root
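
The following is a minimal, self-contained sketch of schema validation while parsing, written for Python 3. The schema and the two instance documents are hypothetical and are defined inline instead of being read from files. It relies on the parser raising etree.XMLSyntaxError when a document does not validate.

```python
from lxml import etree, objectify

# Hypothetical schema and documents, defined inline for illustration.
XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="age" type="xs:integer"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

VALID = b"<person><name>Alice</name><age>30</age></person>"
INVALID = b"<person><name>Alice</name><age>thirty</age></person>"

schema = etree.XMLSchema(etree.fromstring(XSD))
parser = objectify.makeparser(schema=schema)

# A valid document parses normally.
root = objectify.fromstring(VALID, parser)
age = root.age.pyval

# An invalid document raises XMLSyntaxError while parsing.
try:
    objectify.fromstring(INVALID, parser)
    validation_failed = False
except etree.XMLSyntaxError as exc:
    validation_failed = True
    error_message = str(exc)
```

Catching the exception at the point of parsing keeps invalid documents from reaching the rest of your code.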

And, if validation against a schema is one of your needs, don't forget the xmllint command line tool. For example:

$ xmllint --noout --schema my_schema.xsd my_instancedoc.xml

2.2   Exporting an XML document

There are several ways:

>>> print etree.tostring(doctree)
>>> print etree.tostring(root)
>>> doctree.write(sys.stdout)
>>> doctree.write(sys.stdout, pretty_print=True)

You can also export a sub-tree:

In [163]: person = root.person[1]
In [164]: print etree.tostring(person, pretty_print=True)

And, with optional pretty printing (indenting) and an XML declaration:

>>> doctree.write(my_output_file, pretty_print=True)
>>> doctree.write(my_output_file, xml_declaration=True)
>>> doctree.write(my_output_file, pretty_print=True, xml_declaration=True)

Yet more examples:

>>> a = objectify.fromstring('<aaa><bbb>111</bbb><bbb><ccc>222</ccc></bbb></aaa>')
>>> etree.tostring(a)
>>> print etree.tostring(a)
>>> print etree.tostring(a, pretty_print=True)
>>> print etree.tostring(a.bbb[1], pretty_print=True)    # pretty print a subtree

2.2.1   Exporting without "ignorable whitespace"

The export methods generated by generateDS.py support an optional argument, pretty_print (default True); passing pretty_print=False exports a document without ignorable whitespace. With lxml, you can get a similar result in either of these ways:

  1. Parse the document initially without the ignorable whitespace. Example:

    parser = etree.XMLParser(remove_blank_text=True)
    doc = etree.parse(filename, parser)
    root = doc.getroot()
    
  2. In some cases you might need to remove ignorable whitespace with something like the following:

    for element in root.iter():
        element.tail = None
    

The above code examples and more information on ignorable whitespace and formatting serialized output (also known as "export" in generateDS.py) can be found in the lxml FAQ: http://lxml.de/FAQ.html#why-doesn-t-the-pretty-print-option-reformat-my-xml-output
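
Here is a small sketch of both approaches, using plain lxml.etree elements as in the snippets above. The helper strip_ignorable_whitespace is a hypothetical name, and it assumes that whitespace-only text inside an element that has children, and any whitespace-only tail, can be discarded. That assumption does not hold for mixed-content documents.

```python
from lxml import etree

XML = b"<a>\n  <b>1</b>\n  <c>2</c>\n</a>"

# Option 1: drop whitespace-only text nodes while parsing.
parser = etree.XMLParser(remove_blank_text=True)
compact = etree.tostring(etree.fromstring(XML, parser))


def strip_ignorable_whitespace(root):
    """Remove whitespace-only text and tails from an already parsed tree.

    Hypothetical helper: it assumes that whitespace-only text inside an
    element that has children, and any whitespace-only tail, is ignorable.
    """
    for element in root.iter():
        has_children = len(element) > 0
        if has_children and element.text is not None \
                and not element.text.strip():
            element.text = None
        if element.tail is not None and not element.tail.strip():
            element.tail = None


# Option 2: clean up a tree that was parsed with its whitespace intact.
tree_root = etree.fromstring(XML)
strip_ignorable_whitespace(tree_root)
stripped = etree.tostring(tree_root)
```

Unlike the loop that sets every tail to None, this variant leaves non-blank tail text alone and also clears the leading whitespace stored in the parent's text.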

2.3   The lxml.objectify API -- access to children and attributes

Attributes -- The attributes of an lxml.objectify XML element are available in a dictionary-like object, element.attrib. But you can also access them directly through the element. Examples:

In [81]: element.attrib
Out[81]: {'ratio': '3.2', 'id': '1', 'value': 'abcd'}
In [82]:
In [82]: element.get('ratio')
Out[82]: '3.2'
In [83]: print element.get('ratio')
3.2
In [84]: print element.get('ratioxxx')
None

And, use element.set(key, value) to set an attribute's value.

Iterate over the attributes using the standard dictionary API on the element's el.attrib attribute. Example:

In [48]: link = root.Link[2]
In [49]: for key, value in link.attrib.items():
   ....:     print 'key: {}  value: {}'.format(key, value)
   ....:
key: rel  value: down
key: type  value: application/vnd.vmware.admin.vmwExtension+xml
key: href  value: https://vcloud.example.com/api/admin/extension
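
The attribute operations described above can be tried out with a short Python 3 sketch. The element and its attributes are made up for illustration.

```python
from lxml import objectify

root = objectify.fromstring('<item id="1" ratio="3.2" value="abcd"/>')

ratio = root.get('ratio')                 # attribute values are strings
missing = root.get('ratioxxx')            # None when the attribute is absent
fallback = root.get('ratioxxx', '1.0')    # optional default value

root.set('ratio', '4.5')                  # update an existing attribute
root.set('unit', 'cm')                    # or add a new one

attributes = dict(root.attrib)            # snapshot as a plain dict
for key, value in sorted(root.attrib.items()):
    print('key: {}  value: {}'.format(key, value))
```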

Children -- The children of an XML element are available by using the child's tag as an attribute. For example, if the element people has one or more children whose tag is person, then those children can be accessed as follows:

In [87]: people.person        # first person available without index
Out[87]: <Element person at 0x7fa0f1814ea8>
In [88]: people.person[0]     # same as previous
Out[88]: <Element person at 0x7fa0f1814ea8>
In [89]: people.person[1]
Out[89]: <Element person at 0x7fa0f1814e60>

You can also use getattr() to access child elements. You may need to do this when there are children from different namespaces within the same element. Examples:

In [50]: rootgroup = root.RootGroup
In [51]: rootgroup.Group
Out[51]: <Element {http://hdfgroup.org/HDF5/XML/schema/HDF5-File.xsd}Group at 0x7f8d34a05b48>
In [52]:
In [52]: getattr(rootgroup, 'Group')
Out[52]: <Element {http://hdfgroup.org/HDF5/XML/schema/HDF5-File.xsd}Group at 0x7f8d34a05b48>
In [53]:
In [53]: getattr(rootgroup, '{http://hdfgroup.org/HDF5/XML/schema/HDF5-File.xsd}Group')
Out[53]: <Element {http://hdfgroup.org/HDF5/XML/schema/HDF5-File.xsd}Group at 0x7f8d34a05b48>
In [54]:
In [54]: getattr(rootgroup, '{http://hdfgroup.org/HDF5/XML/schema/HDF5-File.xsd}Group')[1]
Out[54]: <Element {http://hdfgroup.org/HDF5/XML/schema/HDF5-File.xsd}Group at 0x7f8d34a05ab8>

Iterate over the children by using the element's el.iterchildren() method. Example:

In [47]: for child in root.iterchildren():
   ....:     print child.tag
   ....:
{http://www.vmware.com/vcloud/v1.5}Link
{http://www.vmware.com/vcloud/v1.5}Link
{http://www.vmware.com/vcloud/v1.5}Link
{http://www.vmware.com/vcloud/v1.5}Link
{http://www.vmware.com/vcloud/v1.5}Link
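
To experiment with child access without a large input file, a sketch along these lines may help. It is written for Python 3, and the people document is hypothetical.

```python
from lxml import objectify

people = objectify.fromstring(
    '<people>'
    '<person><name>Alice</name></person>'
    '<person><name>Bob</name></person>'
    '<group>staff</group>'
    '</people>')

first = people.person.name.text       # no index: the first <person>
second = people.person[1].name.text   # index selects among same-tag siblings
count = len(people.person)            # number of <person> siblings
names = [person.name.text for person in people.person]   # iterate siblings
tags = [child.tag for child in people.iterchildren()]    # all children
```

Note that len() and iteration on people.person cover only the person siblings, while iterchildren() visits every child regardless of tag.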

2.4   Manipulating and modifying the element tree

Modify text content -- You can assign to a leaf element to modify its text content, for example:

>>> dataset.datanode = 'a simple string'

However, you may want to use lxml.objectify data types. If you do not, lxml.objectify may put them in a different namespace. Here are examples that preserve the existing data types:

>>> dataset.datanode = objectify.StringElement('a simple string')
>>> dataset.datanode = objectify.IntElement('200')
>>> dataset.datanode = objectify.FloatElement('300.5')

See the following for more on how to work with Python data types: http://lxml.de/objectify.html#python-data-types
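
One way to see what happens on a plain assignment is the sketch below (Python 3, with a hypothetical dataset document). It assumes that the namespace remark above refers to the py:pytype annotation that lxml.objectify adds when you assign a Python value, and it shows objectify.deannotate() as one way to remove that annotation before export.

```python
from lxml import etree, objectify

dataset = objectify.fromstring('<dataset><datanode>100</datanode></dataset>')

# Plain Python values are accepted; objectify records the Python type
# in a py:pytype attribute that lives in its own namespace.
dataset.datanode = 200
value = dataset.datanode.pyval
is_int = isinstance(dataset.datanode, objectify.IntElement)
annotated = etree.tostring(dataset)

# deannotate() strips those type annotations before export.
objectify.deannotate(dataset, cleanup_namespaces=True)
clean = etree.tostring(dataset)
```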

Creating new elements -- See this for information on how to add elements to the XML element tree: http://lxml.de/objectify.html#creating-objectify-trees

You can also copy existing elements or sub-trees of elements, for example:

>>> import copy
>>> new_element = copy.deepcopy(old_element)
>>> parent_element.append(new_element)
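
As a slightly fuller sketch of the copy-and-append pattern (Python 3, hypothetical document), the copy is modified before it is attached so that the original element stays unchanged:

```python
import copy
from lxml import objectify

root = objectify.fromstring(
    '<people><person><name>Alice</name></person></people>')

# deepcopy() returns a detached copy of the whole sub-tree.
new_person = copy.deepcopy(root.person)
new_person.name = 'Bob'          # changes the copy only
root.append(new_person)          # attach the copy as a new last child

names = [person.name.text for person in root.person]
```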

3   Useful tips and hints

3.1   A mini-library of helpful functions

Some of the helper functions mentioned below are available here: objectify_helpers.py.

3.2   Printing a (more) readable representation

In order to get a picture of the API available at various elements, you can use objectify.dump(element). For example:

In [237]: print objectify.dump(root.programmer)
programmer = None [ObjectifiedElement]
  * id = '2'
  * language = 'python'
  * editor = 'xml'
    name = 'Charles Carlson' [StringElement]
    interest = 'programming' [StringElement]
    category = 2233 [IntElement]
    description = 'A very happy programmer' [StringElement]
    email = 'charles@happyprogrammers.com' [StringElement]
    elposint = 14 [IntElement]
    elnonposint = 0 [IntElement]
    elnegint = -12 [IntElement]
    elnonnegint = 4 [IntElement]
    eldate = '2005-04-26' [StringElement]
    eldatetime = '2005-04-26T10:11:12' [StringElement]
    eldatetime1 = '2006-05-27T10:11:12.40' [StringElement]
    eltoken = 'aa bb    cc\tdd\n            ee' [StringElement]
    elshort = 123 [IntElement]
    ellong = 1324123412 [IntElement]
    elparam = u'' [StringElement]
      * id = 'id001'
      * name = 'Davy'
      * semantic = 'a big    semantic'
      * type = 'abc'
    elparam = u'' [StringElement]
      * id = 'id002'
      * name = 'Davy'
      * semantic = 'a big    semantic'
      * type = 'int'
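
A reduced version of that session can be reproduced with a few lines of Python 3. The programmer document below is a hypothetical, shortened stand-in for the one dumped above.

```python
from lxml import objectify

root = objectify.fromstring(
    '<programmer id="2">'
    '<name>Charles Carlson</name>'
    '<category>2233</category>'
    '</programmer>')

# dump() returns a string; it does not print anything itself.
report = objectify.dump(root)
print(report)
```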

You can get a similar display with str(element), but you may first need to call objectify.enable_recursive_str(). For example:

In [238]: print str(root.programmer)
programmer = None [ObjectifiedElement]
  * id = '2'
  * language = 'python'
  * editor = 'xml'
    name = 'Charles Carlson' [StringElement]
    interest = 'programming' [StringElement]
    category = 2233 [IntElement]
    description = 'A very happy programmer' [StringElement]
    email = 'charles@happyprogrammers.com' [StringElement]
    elposint = 14 [IntElement]
    elnonposint = 0 [IntElement]
    elnegint = -12 [IntElement]
    elnonnegint = 4 [IntElement]
    eldate = '2005-04-26' [StringElement]
    eldatetime = '2005-04-26T10:11:12' [StringElement]
    eldatetime1 = '2006-05-27T10:11:12.40' [StringElement]
    eltoken = 'aa bb    cc\tdd\n            ee' [StringElement]
    elshort = 123 [IntElement]
    ellong = 1324123412 [IntElement]
    elparam = u'' [StringElement]
      * id = 'id001'
      * name = 'Davy'
      * semantic = 'a big    semantic'
      * type = 'abc'
    elparam = u'' [StringElement]
      * id = 'id002'
      * name = 'Davy'
      * semantic = 'a big    semantic'
      * type = 'int'

This behavior of str(o) can be turned on and off with the following:

In [75]: objectify.enable_recursive_str(True)
In [76]: objectify.enable_recursive_str(False)

And, here is an implementation that mimics objectify.dump(o) but has several additional features:

  • It enables you to limit the number of levels of nesting and display of children and their children etc. Imagine displaying the root node of a very large file containing many levels of nested children.
  • It writes to a file rather than accumulating a string. For some situations, this saves having to type print in order to format the output. And, again thinking about very large documents, it might save us from building up a huge string.

import sys

def swrite(element, maxlevels=None, outfile=sys.stdout):
    """Recursively write out a formatted, readable representation of element.
    Possibly do shallow recursion.
    Limit recursion to maxlevels (default is all levels).
    Write output to file outfile (default is sys.stdout).
    """
    wrt = outfile.write
    swrite_(element, 0, maxlevels, wrt)


def swrite_(element, indent, maxlevels, wrt):
    indentstr = '    ' * indent
    wrt('{}{}: {}\n'.format(indentstr, element.tag, repr(element), ))
    for name, value in element.attrib.items():
        wrt('  {}* {}: {}\n'.format(indentstr, name, value, ))
    indent += 1
    if maxlevels is not None and indent > maxlevels:
        return
    for child in element.iterchildren():
        swrite_(child, indent, maxlevels, wrt)

3.3   Exploring element-specific API

With lxml.objectify, inspecting objects to determine the API for that specific element type is a frequent task. You may find a function something like the following helpful:

Standard_attrs = set([ '__dict__', '__getattr__', 'addattr',
    'countchildren', 'descendantpaths', '__class__', '__contains__',
    '__copy__', '__deepcopy__', '__delattr__', '__delitem__',
    '__doc__', '__format__', '__getattribute__', '__getitem__',
    '__hash__', '__init__', '__iter__', '__len__', '__new__',
    '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__',
    '__reversed__', '__setattr__', '__setitem__', '__sizeof__',
    '__str__', '__subclasshook__', '_init', 'addnext',
    'addprevious', 'append', 'attrib', 'base', 'clear', 'extend',
    'find', 'findall', 'findtext', 'get', 'getchildren',
    'getiterator', 'getnext', 'getparent', 'getprevious',
    'getroottree', 'index', 'insert', 'items', 'iter',
    'iterancestors', 'iterchildren', 'iterdescendants', 'iterfind',
    'itersiblings', 'itertext', 'keys', 'makeelement', 'nsmap',
    'prefix', 'remove', 'replace', 'set', 'sourceline', 'tag',
    'tail', 'text', 'values', 'xpath', ])

def members(element):
    names = [attr for attr in dir(element) if attr not in Standard_attrs]
    return names

I obtained that list of Standard_attrs by doing print dir(element) on a standard element (and then modifying it a bit).

However, instead of calling that members(o) function (above), the following snippet is likely just as useful:

In [96]: [child.tag for child in element.iterchildren()]
Out[96]: ['example1', 'name', 'interest', 'interest', 'category', 'hot.agent']
In [97]:
In [97]: sorted([child.tag for child in element.iterchildren()])
Out[97]: ['category', 'example1', 'hot.agent', 'interest', 'interest', 'name']

And, to save typing, the following functions might be helpful:

def children(element, tag=None):
    """Return a list of children of an element.
    Optional argument tag can be a single string or list of strings
    to select only children with that tag name.
    """
    child_list = [child for child in element.iterchildren(tag=tag)]
    return child_list

def child_tags(element, tag=None):
    """Return a list of the tag names of the children of an element.
    Optional argument tag can be a single string or list of strings
    to select only children with that tag name.
    """
    tags = [child.tag for child in element.iterchildren(tag=tag)]
    return tags

Or, you may find this shallow dump function useful. It uses objectify.dump(o), but keeps only the lines indented by fewer than eight spaces, that is, the top-level element, its attributes, and its immediate children (with their attributes):

def sdump(element):
    content = objectify.dump(element)
    content = content.splitlines()
    prefix = '        '
    content = [line for line in content if not line.startswith(prefix)]
    content = '\n'.join(content)
    return content

3.4   Searching an XML document

lxml.objectify has its own XPath-like search capability with a (possibly) simpler form of the XPath/XQuery language. See this for information about ObjectPath: http://lxml.de/objectify.html#objectpath
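
As a brief illustration of ObjectPath, here is a Python 3 sketch over a hypothetical document. It shows an absolute path, a relative path, an existence test, and a default value for a path that matches nothing.

```python
from lxml import objectify

root = objectify.fromstring(
    '<aaa><bbb>111</bbb><bbb><ccc>222</ccc></bbb></aaa>')

# An absolute path starts with the root tag; [1] picks the second <bbb>.
find_ccc = objectify.ObjectPath('aaa.bbb[1].ccc')
ccc_value = find_ccc(root).pyval

# A leading dot makes the path relative to the element passed in.
find_first_bbb = objectify.ObjectPath('.bbb')
bbb_value = find_first_bbb(root).pyval

# hasattr() tests for a match; a second call argument is a default.
find_missing = objectify.ObjectPath('.zzz')
exists = find_missing.hasattr(root)
fallback = find_missing(root, 'no such element')
```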

You can also use lxml's XPath support on lxml.objectify elements. Example:

In [68]: root.xpath('.//@Name')
Out[68]:
['dataset1-1',
 'dataset1-2',
 'subgroup01',
 'dataset2-1',
 'dataset2-2',
 'subgroup02',
 'dataset3-1',
 'dataset3-2',
 'subgroup03',
 'dataset3-3']
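
When the document uses a default namespace, element steps in an XPath expression need a prefix. The following Python 3 sketch uses a hypothetical namespace URI and document to show both an attribute query, which needs no prefix, and an element query, which does.

```python
from lxml import objectify

NS = 'http://example.com/hdf'   # hypothetical namespace URI
root = objectify.fromstring(
    '<File xmlns="{ns}">'
    '<Group Name="group01"><Dataset Name="dataset1-1"/></Group>'
    '<Dataset Name="dataset0-1"/>'
    '</File>'.format(ns=NS))

# Attribute values come back as a list of strings, in document order.
all_names = root.xpath('.//@Name')

# XPath has no default namespace, so bind a prefix of your choosing.
dataset_names = root.xpath('.//h:Dataset/@Name', namespaces={'h': NS})
```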

See this for information about the lxml support for xpath: http://lxml.de/xpathxslt.html. And, see this for information about the XPath path language:

4   Sample applications with lxml.objectify

  1. Here is a sample application that parses and displays weather information from an XML document: weather_test.py.

  2. This sample application picks data out of an XML document that was generated with h5dump. For example:

    $ h5dump -x my_data.hdf5 > my_data.hdf5.xml
    

    This sample application attempts to create a new hdf5 data file from that XML document. The code is here: obj_hdf_xml.py

    Here is more information about HDF5:

  3. Here are several small applications that pick data out of files related to Vcloud. I've included the Python code, a sample XML file, and a dump of the XML file produced by objectify.dump(root). The code is here: vcloud_samples.zip. You can learn more about Vcloud here: http://pubs.vmware.com/vcd-51/topic/com.vmware.vcloud.api.reference.doc_51/about.html

5   Evaluation and comparison -- lxml.objectify vs. generateDS.py

5.1   API discovery

generateDS.py generates a class for each xs:complexType. Therefore, there is Python code that you can inspect to determine the (generated) API, for example, getters, setters, constructor, export function, etc. In order to do that, you will need to identify which generated class is the implementation for the element in which you are interested.

lxml.objectify objects can be inspected using objectify.dump(o) or one of the helper functions described in this document in section Useful tips and hints. In order to perform this inspection, you must get access to an object of the type that you want to inspect. Here are several ways to do that (and you may think of others):

  • Drop into the Python debugger by placing this code in your application where you have access to an object of the type you are interested in:

    import pdb
    pdb.set_trace()
    

    Or, if you have installed ipython and ipdb, use:

    import ipdb
    ipdb.set_trace()
    

    ipdb gives you tab completion for names available in the current scope.

  • Parse and dump an XML instance document (using objectify.dump(el)), capture it in a file, then look for the element of interest with your text editor. Here is a simple utility script to help do that:

    #!/usr/bin/env python
    
    import sys
    from lxml import objectify
    
    def dump(infilename):
        doc = objectify.parse(infilename)
        root = doc.getroot()
        print objectify.dump(root)
    
    def main():
        args = sys.argv[1:]
        infilename = args[0]
        dump(infilename)
    
    if __name__ == '__main__':
        main()
    
  • Insert the following code in your application at some point where it will have access to the element whose API you wish to discover:

    print objectify.dump(element)
    

    Or, if stdout (standard output) is not available and visible to you, something like the following:

    import tempfile
    
    with tempfile.NamedTemporaryFile('w', delete=False) as outfile:
        outfile.write(objectify.dump(element))
        outfilename = outfile.name
    
  • Or, use one of the helpers above, for example:

    print objectify_helpers.child_tags(element)
    

5.2   Namespaces

lxml.objectify handles namespaces correctly; generateDS.py, especially when there are multiple namespaces in the same XML document, does not.

Mostly, lxml.objectify handles namespaces for you without additional effort on your part. If you are working with an element that contains items from different namespaces, then see this: http://lxml.de/objectify.html#namespace-handling. Sometimes, when you use getattr(el, 'xxx') or el.iterchildren(tag='xxx'), you will need to include the namespace. Examples:

In [15]: rootgroup = root.RootGroup
In [16]: rootgroup.Group.tag
Out[16]: '{http://hdfgroup.org/HDF5/XML/schema/HDF5-File.xsd}Group'
In [17]: [el.tag for el in rootgroup.iterchildren(tag="{http://hdfgroup.org/HDF5/XML/schema/HDF5-File.xsd}Group")]
Out[17]:
['{http://hdfgroup.org/HDF5/XML/schema/HDF5-File.xsd}Group',
 '{http://hdfgroup.org/HDF5/XML/schema/HDF5-File.xsd}Group',
 '{http://hdfgroup.org/HDF5/XML/schema/HDF5-File.xsd}Group']

and:

In [24]: rootgroup.Dataset
Out[24]: <Element {http://hdfgroup.org/HDF5/XML/schema/HDF5-File.xsd}Dataset at 0x7f0293b65c68>
In [25]: getattr(rootgroup, "{http://hdfgroup.org/HDF5/XML/schema/HDF5-File.xsd}Dataset")
Out[25]: <Element {http://hdfgroup.org/HDF5/XML/schema/HDF5-File.xsd}Dataset at 0x7f0293b65c68>
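
The case where a qualified tag is actually required can be sketched as follows (Python 3). The two namespace URIs and the document are hypothetical. The parent and one child share a namespace, while the other child lives in a different one.

```python
from lxml import objectify

# Hypothetical namespace URIs, used only for this illustration.
root = objectify.fromstring(
    '<root xmlns="http://example.com/a" xmlns:b="http://example.com/b">'
    '<item>from a</item>'
    '<b:item>from b</b:item>'
    '</root>')

# Plain attribute access looks in the parent's own namespace.
same_ns = root.item.text

# A child in another namespace needs its fully qualified tag.
other_ns = getattr(root, '{http://example.com/b}item').text
also_other_ns = root['{http://example.com/b}item'].text

qualified_tags = [
    child.tag
    for child in root.iterchildren(tag='{http://example.com/b}item')]
```

Item access with a string key, as in root['{...}item'], is an alternative to getattr() that reads a little more naturally for qualified tags.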

5.3   Summary

Although their approaches are very different, generateDS.py and lxml.objectify seem to solve the same set of problems and answer equivalent sets of needs. generateDS.py generates and gives you an API for each element type, specifically a Python class. With lxml.objectify, you can discover a (simulated) API by inspecting a dump produced by lxml.objectify.dump(o) or by using lxml.objectify and lxml capabilities in each element to inspect the element.

When might you want to use one rather than the other?

  • Since generateDS.py requires an XML schema in order to generate code, if you do not have an XML schema for your document type, then generateDS.py is not an option for you.
  • If you must handle an XML document that is defined by an XML schema that contains multiple namespaces, then, because of the problems that generateDS.py has with namespaces, you should choose lxml.objectify.
  • If you want to produce Python code that defines and implements an API for a specific XML document type and you have an XML schema that defines that document type, then you may want to consider generateDS.py. If you want to be able to send that generated API for use by other developers, then the generateDS.py approach might be an advantage to you. However, the content produced by lxml.objectify.dump(o) is very close to a description of an API for accessing and manipulating each element in an XML document.