.. vim:ft=rst:
======================
lxml.objectify notes
======================
:author: Dave Kuhlman
:address:
dkuhlman (at) davekuhlman (dot) org
:revision: 1.1a
:date: |date|
.. |date| date:: %B %d, %Y
:Copyright: Copyright (c) 2015 Dave Kuhlman. All Rights Reserved.
This software is subject to the provisions of the MIT License
http://www.opensource.org/licenses/mit-license.php.
:Abstract: A document intended to help those getting started with
``lxml.objectify`` and, in particular, to help those
attempting to transition from ``generateDS.py``.
.. sectnum::
.. contents::
Introduction
==============
This document is an attempt to give a little help to those starting
out with ``lxml.objectify``. It does not attempt to replace
the official documentation, which you can find here:
http://lxml.de/objectify.html.
Much of the code in this document assumes that you have done the
following in your Python code::
from lxml import objectify
Migrating from generateDS.py to lxml.objectify
================================================
With ``lxml.objectify``, unlike ``generateDS.py``, there is no need
to generate code before processing an XML instance document.
Parsing an XML instance document
----------------------------------
Use something like the following::

    def objectify_parse(infilename):
        doctree = objectify.parse(infilename)
        root = doctree.getroot()
        return doctree, root

Or, when you want to validate against a schema while parsing, use the
following, which also needs ``from lxml import etree``::

    def objectify_parse_with_schema(schemaname, infilename):
        schema = etree.XMLSchema(file=schemaname)
        parser = objectify.makeparser(schema=schema)
        doctree = objectify.parse(infilename, parser=parser)
        root = doctree.getroot()
        return doctree, root

And, if validation against a schema is one of your needs, don't
forget the ``xmllint`` command line tool. For example::
$ xmllint --noout --schema my_schema.xsd my_instancedoc.xml
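As a small, self-contained illustration of validating while parsing, the sketch below builds a schema and two documents from in-memory strings rather than from files. The schema, the element names (``people``, ``person``, ``pet``) and the variable names are made up for this example. It assumes Python 3 and relies on ``lxml`` raising ``etree.XMLSyntaxError`` when a validating parser rejects a document.

```python
from lxml import etree, objectify

# Hypothetical schema: a <people> element holding one or more <person> strings.
XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="people">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="person" type="xs:string" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

VALID = b"<people><person>Alice</person><person>Bob</person></people>"
INVALID = b"<people><pet>Rex</pet></people>"

schema = etree.XMLSchema(etree.fromstring(XSD))
parser = objectify.makeparser(schema=schema)

# A valid document parses normally and can be navigated as usual.
root = objectify.fromstring(VALID, parser)
names = [person.text for person in root.person]

# An invalid document is rejected at parse time.
try:
    objectify.fromstring(INVALID, parser)
    rejected = False
except etree.XMLSyntaxError as exc:
    rejected = True
    print('validation failed:', exc)
```

The same parser object can be passed to ``objectify.parse()`` when the instance document lives in a file.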
Exporting an XML document
---------------------------
There are several ways::
>>> print etree.tostring(doctree)
>>> print etree.tostring(root)
>>> doctree.write(sys.stdout)
>>> doctree.write(sys.stdout, pretty_print=True)
You can also export a sub-tree::
In [163]: person = root.person[1]
In [164]: print etree.tostring(person, pretty_print=True)
And, with optional pretty printing (indenting) and an XML
declaration::
>>> doctree.write(my_output_file, pretty_print=True)
>>> doctree.write(my_output_file, xml_declaration=True)
>>> doctree.write(my_output_file, pretty_print=True, xml_declaration=True)
Yet more examples::
>>> a = objectify.fromstring('<aaa><bbb>111</bbb><bbb>222</bbb></aaa>')
>>> etree.tostring(a)
>>> print etree.tostring(a)
>>> print etree.tostring(a, pretty_print=True)
>>> print etree.tostring(a.bbb[1], pretty_print=True) # pretty print a subtree
Exporting without "ignorable whitespace"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The ``export`` methods generated by ``generateDS.py`` support an
optional argument (``pretty_print``, which defaults to ``True``).
Passing ``pretty_print=False`` exports a document *without* ignorable
whitespace. ``lxml.objectify`` supports that also:
1. Parse the document initially without the ignorable whitespace.
Example::
parser = etree.XMLParser(remove_blank_text=True)
doc = etree.parse(filename, parser)
root = doc.getroot()
2. In some cases you might need to remove ignorable whitespace with
something like the following::
for element in root.iter():
element.tail = None
The above code examples and more information on ignorable whitespace
and formatting serialized output (also known as "export" in
``generateDS.py``) can be found in the ``lxml`` FAQ:
http://lxml.de/FAQ.html#why-doesn-t-the-pretty-print-option-reformat-my-xml-output
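To make the effect of ``remove_blank_text`` concrete, here is a minimal sketch that parses the same (made-up) document with and without that option and compares the serialized results. It assumes Python 3; the document and variable names are hypothetical.

```python
from lxml import etree

# Hypothetical input containing indentation between the elements.
SOURCE = b"<root>\n  <a>1</a>\n  <b>2</b>\n</root>"

# Default parser: whitespace-only text nodes are kept and written back out.
as_parsed = etree.tostring(etree.fromstring(SOURCE), encoding='unicode')

# Parser that drops whitespace-only text nodes between elements.
parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(SOURCE, parser)
compact = etree.tostring(root, encoding='unicode')

# With the blank text gone, pretty_print is free to re-indent the output.
pretty = etree.tostring(root, pretty_print=True, encoding='unicode')

print(repr(as_parsed))
print(repr(compact))
print(pretty)
```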
The lxml.objectify API -- access to children and attributes
-------------------------------------------------------------
**Attributes** -- The attributes of an ``lxml.objectify`` XML element are
available in a dictionary-like object. But you can also access them
directly through the element. Examples::
In [81]: element.attrib
Out[81]: {'ratio': '3.2', 'id': '1', 'value': 'abcd'}
In [82]:
In [82]: element.get('ratio')
Out[82]: '3.2'
In [83]: print element.get('ratio')
3.2
In [84]: print element.get('ratioxxx')
None
And, use ``element.set(key, value)`` to set an attribute's value.
Iterate over the attributes using the standard dictionary API on the
element's ``el.attrib`` attribute. Example::
In [48]: link = root.Link[2]
In [49]: for key, value in link.attrib.items():
....: print 'key: {} value: {}'.format(key, value)
....:
key: rel value: down
key: type value: application/vnd.vmware.admin.vmwExtension+xml
key: href value: https://vcloud.example.com/api/admin/extension
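The attribute calls shown above can be tried on a small in-memory element. This is only a sketch under Python 3; the element ``item`` and its attribute values are made up to resemble the session above.

```python
from lxml import objectify

# Hypothetical element with three attributes.
element = objectify.fromstring('<item id="1" ratio="3.2" value="abcd"/>')

ratio = element.get('ratio')              # attribute values are strings
missing = element.get('ratioxxx')         # None when the attribute is absent
fallback = element.get('ratioxxx', 'n/a') # optional default value

# Change an attribute, then take a plain-dict snapshot of all of them.
element.set('ratio', '4.5')
attrs = dict(element.attrib)

for key, value in sorted(attrs.items()):
    print('key: {}  value: {}'.format(key, value))
```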
**Children** -- The children of an XML element are available by using
the child's tag as an attribute. For example, if the element
``people`` has one or more children whose tag is ``person``, then
those children can be accessed as follows::
In [87]: people.person # first person available without index
Out[87]:
In [88]: people.person[0] # same as previous
Out[88]:
In [89]: people.person[1]
Out[89]:
You can also use ``getattr()`` to access child elements. You may
need to do this when there are children from different namespaces
within the same element. Examples::
In [50]: rootgroup = root.RootGroup
In [51]: rootgroup.Group
Out[51]:
In [52]:
In [52]: getattr(rootgroup, 'Group')
Out[52]:
In [53]:
In [53]: getattr(rootgroup, '{http://hdfgroup.org/HDF5/XML/schema/HDF5-File.xsd}Group')
Out[53]:
In [54]:
In [54]: getattr(rootgroup, '{http://hdfgroup.org/HDF5/XML/schema/HDF5-File.xsd}Group')[1]
Out[54]:
Iterate over the children by using the element's
``el.iterchildren()`` method. Example::
In [47]: for child in root.iterchildren():
....: print child.tag
....:
{http://www.vmware.com/vcloud/v1.5}Link
{http://www.vmware.com/vcloud/v1.5}Link
{http://www.vmware.com/vcloud/v1.5}Link
{http://www.vmware.com/vcloud/v1.5}Link
{http://www.vmware.com/vcloud/v1.5}Link
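The child access patterns described above can be sketched on a tiny document. The ``people`` document below is hypothetical and the example assumes Python 3.

```python
from lxml import objectify

# Hypothetical document with repeated <person> children.
people = objectify.fromstring(
    '<people>'
    '<person><name>Alice</name></person>'
    '<person><name>Bob</name></person>'
    '<group>staff</group>'
    '</people>'
)

first = people.person             # first <person>, no index needed
second = people.person[1]         # index into same-tag siblings
count = len(people.person)        # number of <person> siblings
via_getattr = getattr(people, 'person')

# Iterating over a child walks all siblings that share its tag.
names = [person.name.text for person in people.person]

# iterchildren() visits every child regardless of tag.
tags = [child.tag for child in people.iterchildren()]

print(names, tags)
```

Note that ``len(people.person)`` counts only the ``person`` siblings, not all children of ``people``.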
Manipulating and modifying the element tree
---------------------------------------------
Modify text content -- You can assign to a leaf element to modify
its text content, for example::
>>> dataset.datanode = 'a simple string'
However, you may want to use ``lxml.objectify`` data types. If you
do not, ``lxml.objectify`` may put them in a different namespace.
Here are examples that preserve the existing data types::
>>> dataset.datanode = objectify.StringElement('a simple string')
>>> dataset.datanode = objectify.IntElement('200')
>>> dataset.datanode = objectify.FloatElement('300.5')
See the following for more on how to work with Python data types:
http://lxml.de/objectify.html#python-data-types
Creating new elements -- See this for information on how to add
elements to the XML element tree:
http://lxml.de/objectify.html#creating-objectify-trees
You can also copy existing elements or sub-trees of elements, for
example::
>>> import copy
>>> new_element = copy.deepcopy(old_element)
>>> parent_element.append(new_element)
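The following sketch pulls these operations together on a made-up ``dataset`` document: assigning plain Python values, copying an element, and then stripping the type annotations that ``lxml.objectify`` adds on assignment. It assumes Python 3; ``objectify.deannotate()`` is used here as one possible way to keep those annotations out of the exported XML, which may or may not be what you want.

```python
import copy

from lxml import etree, objectify

# Hypothetical document with a single leaf element.
dataset = objectify.fromstring('<dataset><datanode>old</datanode></dataset>')

# Assigning a Python value replaces the text content of the leaf.
dataset.datanode = 'a simple string'
as_text = dataset.datanode.text

# Assigning an int gives an element that behaves like a number.
dataset.datanode = 200
plus_one = dataset.datanode + 1
is_int = isinstance(dataset.datanode, objectify.IntElement)

# Copy the leaf and append the copy as a second <datanode>.
dataset.append(copy.deepcopy(dataset.datanode))
count = len(dataset.datanode)

# Assignment annotates elements with a py:pytype attribute;
# deannotate() removes those annotations before export.
annotated = etree.tostring(dataset, encoding='unicode')
objectify.deannotate(dataset, cleanup_namespaces=True)
cleaned = etree.tostring(dataset, encoding='unicode')

print(annotated)
print(cleaned)
```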
Useful tips and hints
=======================
A mini-library of helpful functions
-------------------------------------
Some of the helper functions mentioned below are available here:
`objectify_helpers.py `_.
Printing a (more) readable representation
-------------------------------------------
In order to get a picture of the API available at various elements,
you can use ``objectify.dump(element)``. For example::
In [237]: print objectify.dump(root.programmer)
programmer = None [ObjectifiedElement]
* id = '2'
* language = 'python'
* editor = 'xml'
name = 'Charles Carlson' [StringElement]
interest = 'programming' [StringElement]
category = 2233 [IntElement]
description = 'A very happy programmer' [StringElement]
email = 'charles@happyprogrammers.com' [StringElement]
elposint = 14 [IntElement]
elnonposint = 0 [IntElement]
elnegint = -12 [IntElement]
elnonnegint = 4 [IntElement]
eldate = '2005-04-26' [StringElement]
eldatetime = '2005-04-26T10:11:12' [StringElement]
eldatetime1 = '2006-05-27T10:11:12.40' [StringElement]
eltoken = 'aa bb cc\tdd\n ee' [StringElement]
elshort = 123 [IntElement]
ellong = 1324123412 [IntElement]
elparam = u'' [StringElement]
* id = 'id001'
* name = 'Davy'
* semantic = 'a big semantic'
* type = 'abc'
elparam = u'' [StringElement]
* id = 'id002'
* name = 'Davy'
* semantic = 'a big semantic'
* type = 'int'
A similar display can be obtained by using ``str(element)``. But,
in order to do so, you may need to call
``objectify.enable_recursive_str()`` first. For
example::
In [238]: print str(root.programmer)
programmer = None [ObjectifiedElement]
* id = '2'
* language = 'python'
* editor = 'xml'
name = 'Charles Carlson' [StringElement]
interest = 'programming' [StringElement]
category = 2233 [IntElement]
description = 'A very happy programmer' [StringElement]
email = 'charles@happyprogrammers.com' [StringElement]
elposint = 14 [IntElement]
elnonposint = 0 [IntElement]
elnegint = -12 [IntElement]
elnonnegint = 4 [IntElement]
eldate = '2005-04-26' [StringElement]
eldatetime = '2005-04-26T10:11:12' [StringElement]
eldatetime1 = '2006-05-27T10:11:12.40' [StringElement]
eltoken = 'aa bb cc\tdd\n ee' [StringElement]
elshort = 123 [IntElement]
ellong = 1324123412 [IntElement]
elparam = u'' [StringElement]
* id = 'id001'
* name = 'Davy'
* semantic = 'a big semantic'
* type = 'abc'
elparam = u'' [StringElement]
* id = 'id002'
* name = 'Davy'
* semantic = 'a big semantic'
* type = 'int'
This behavior of ``str(o)`` can be turned on and off with the
following::
In [75]: objectify.enable_recursive_str(True)
In [76]: objectify.enable_recursive_str(False)
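A short sketch of how the two displays relate, using a cut-down, made-up version of the ``programmer`` element shown above. It assumes Python 3, and it switches the recursive behavior off again at the end because the setting applies to the whole module.

```python
from lxml import objectify

# Hypothetical, cut-down version of the element dumped above.
programmer = objectify.fromstring(
    '<programmer id="2">'
    '<name>Charles Carlson</name>'
    '<category>2233</category>'
    '</programmer>'
)

dumped = objectify.dump(programmer)

# With recursive str enabled, str() produces the same tree display.
objectify.enable_recursive_str(True)
recursive = str(programmer)

# With it disabled, str() gives only the element's own text (none here).
objectify.enable_recursive_str(False)
plain = str(programmer)

print(dumped)
```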
And, here is an implementation that mimics ``objectify.dump(o)`` but has
several additional features:
- It enables you to limit the number of levels of nesting and
display of children and their children etc. Imagine displaying
the root node of a very large file containing many levels of
nested children.
- It writes to a file rather than accumulating a string. For some
situations, this saves having to type ``print`` in order to format
the output. And, again thinking about very large documents, it
might save us from building up a huge string.
::

    import sys

    def swrite(element, maxlevels=None, outfile=sys.stdout):
        """Recursively write out a formatted, readable representation of element.

        Possibly do shallow recursion.
        Limit recursion to maxlevels (default is all levels).
        Write output to file outfile (default is sys.stdout).
        """
        wrt = outfile.write
        swrite_(element, 0, maxlevels, wrt)

    def swrite_(element, indent, maxlevels, wrt):
        indentstr = ' ' * indent
        wrt('{}{}: {}\n'.format(indentstr, element.tag, repr(element), ))
        # Write one line per attribute, marked with "*".
        for name, value in element.attrib.iteritems():
            wrt(' {}* {}: {}\n'.format(indentstr, name, value, ))
        indent += 1
        # Stop descending once the requested depth is reached.
        if maxlevels is not None and indent > maxlevels:
            return
        for child in element.iterchildren():
            swrite_(child, indent, maxlevels, wrt)

Exploring element-specific API
--------------------------------
With ``lxml.objectify``, inspecting objects to determine the API for
that specific element type is a frequent task. You may find a
function something like the following helpful::

    Standard_attrs = set([ '__dict__', '__getattr__', 'addattr',
        'countchildren', 'descendantpaths', '__class__', '__contains__',
        '__copy__', '__deepcopy__', '__delattr__', '__delitem__',
        '__doc__', '__format__', '__getattribute__', '__getitem__',
        '__hash__', '__init__', '__iter__', '__len__', '__new__',
        '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__',
        '__reversed__', '__setattr__', '__setitem__', '__sizeof__',
        '__str__', '__subclasshook__', '_init', 'addnext',
        'addprevious', 'append', 'attrib', 'base', 'clear', 'extend',
        'find', 'findall', 'findtext', 'get', 'getchildren',
        'getiterator', 'getnext', 'getparent', 'getprevious',
        'getroottree', 'index', 'insert', 'items', 'iter',
        'iterancestors', 'iterchildren', 'iterdescendants', 'iterfind',
        'itersiblings', 'itertext', 'keys', 'makeelement', 'nsmap',
        'prefix', 'remove', 'replace', 'set', 'sourceline', 'tag',
        'tail', 'text', 'values', 'xpath', ])

    def members(element):
        names = [attr for attr in dir(element) if attr not in Standard_attrs]
        return names

I obtained that list of ``Standard_attrs`` by doing ``print
dir(element)`` on a standard element (and then modifying it a bit).
However, instead of calling that ``members(o)`` function (above),
the following snippet is likely just as useful::
In [96]: [child.tag for child in element.iterchildren()]
Out[96]: ['example1', 'name', 'interest', 'interest', 'category', 'hot.agent']
In [97]:
In [97]: sorted([child.tag for child in element.iterchildren()])
Out[97]: ['category', 'example1', 'hot.agent', 'interest', 'interest', 'name']
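When the document uses a namespace, ``child.tag`` includes it in ``{uri}name`` form, which can make such listings hard to read. The sketch below is one possible way to list only the local names, using ``etree.QName``. The document and the namespace URI are made up, and Python 3 is assumed.

```python
from collections import Counter

from lxml import etree, objectify

# Hypothetical namespaced document with a repeated child tag.
element = objectify.fromstring(
    '<group xmlns="http://example.com/ns">'
    '<name>x</name><interest>a</interest><interest>b</interest>'
    '</group>'
)

# Full tags carry the namespace URI.
full_tags = [child.tag for child in element.iterchildren()]

# QName(...).localname strips the namespace part.
local_tags = [etree.QName(child).localname for child in element.iterchildren()]

# Counting the local names shows which children repeat.
tag_counts = Counter(local_tags)

print(full_tags)
print(sorted(tag_counts.items()))
```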
And, to save typing, the following functions might be helpful::

    def children(element, tag=None):
        """Return a list of children of an element.

        Optional argument tag can be a single string or list of strings
        to select only children with that tag name.
        """
        child_list = [child for child in element.iterchildren(tag=tag)]
        return child_list

    def child_tags(element, tag=None):
        """Return a list of the tag names of the children of an element.

        Optional argument tag can be a single string or list of strings
        to select only children with that tag name.
        """
        tags = [child.tag for child in element.iterchildren(tag=tag)]
        return tags

Or, you may find this shallow dump function useful. It uses
``objectify.dump(o)``, but attempts to *only* return the description
of the top level object::

    def sdump(element):
        content = objectify.dump(element)
        content = content.splitlines()
        # Lines starting with this prefix are treated as nested and dropped.
        prefix = ' '
        content = [line for line in content if not line.startswith(prefix)]
        content = '\n'.join(content)
        return content

Searching an XML document
---------------------------
``lxml.objectify`` has its own XPath-like search capability with a
(possibly) simpler form of the XPath/XQuery language. See this for
information about ObjectPath: http://lxml.de/objectify.html#objectpath
And, you can also use the ``lxml`` XPath support on ``lxml.objectify``
elements. Example::
In [68]: root.xpath('.//@Name')
Out[68]:
['dataset1-1',
'dataset1-2',
'subgroup01',
'dataset2-1',
'dataset2-2',
'subgroup02',
'dataset3-1',
'dataset3-2',
'subgroup03',
'dataset3-3']
See this for information about the ``lxml`` support for
``xpath``: http://lxml.de/xpathxslt.html. And, see this for
information about the XPath path language:
- http://www.w3.org/TR/2014/REC-xpath-30-20140408/#unabbrev
- http://www.w3.org/TR/2014/REC-xpath-30-20140408/#abbrev
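Here is a small sketch that puts the two search styles side by side on a made-up document: ``xpath()`` for attribute and predicate queries, and ``objectify.ObjectPath`` for a dotted path to one element. The tag names, attribute values and variable names are hypothetical; Python 3 is assumed.

```python
from lxml import objectify

# Hypothetical document loosely modeled on the Name attributes above.
root = objectify.fromstring(
    '<file><group Name="subgroup01">'
    '<dataset Name="dataset1-1">10</dataset>'
    '<dataset Name="dataset1-2">20</dataset>'
    '</group></file>'
)

# XPath: collect every Name attribute, in document order.
names = list(root.xpath('.//@Name'))

# XPath: select elements with a predicate on an attribute.
matches = root.xpath('.//dataset[@Name="dataset1-2"]')

# ObjectPath: a dotted path starting at the root tag, with an index.
path = objectify.ObjectPath('file.group.dataset[1]')
second = path.find(root)

print(names, second.get('Name'), second.pyval)
```

The XPath query returns a list, while ``ObjectPath.find()`` returns a single element, so the two are not interchangeable in calling code.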
Sample applications with lxml.objectify
==========================================
1. Here is a sample application that parses and displays weather
information from an XML document: `weather_test.py
`_.
2. This sample application picks data out of an XML document that
was generated with ``h5dump``. For example::
$ h5dump -x my_data.hdf5 > my_data.hdf5.xml
This sample application attempts to create a new hdf5 data file from that XML
document. The code is here:
`obj_hdf_xml.py `_
Here is more information about HDF5:
- http://www.hdfgroup.org/
- http://docs.h5py.org/en/latest/index.html -- HDF5 for Python
3. Here are several small applications that pick data out of files
related to Vcloud. I've included the Python code, a sample XML
file, and a dump of the XML file produced by
``objectify.dump(root)``. The code is here:
`vcloud_samples.zip `_ And, you can learn
more about Vcloud here:
http://pubs.vmware.com/vcd-51/topic/com.vmware.vcloud.api.reference.doc_51/about.html
Evaluation and comparison -- lxml.objectify vs. generateDS.py
===============================================================
API discovery
---------------
``generateDS.py`` generates a class for each ``xs:complexType``.
Therefore, there is Python code that you can inspect to determine
the (generated) API, for example, getters, setters, constructor,
export function, etc. In order to do that, you will need to
identify which generated class is the implementation for the element
in which you are interested.
``lxml.objectify`` objects can be inspected using
``objectify.dump(o)`` or one of the helper functions described in
this document in section `Useful tips and hints`_. In order to
perform this inspection, you must get access to an object of the
type that you want to inspect. Here are several ways to do that
(and you may think of others):
- Drop into the Python debugger by placing this code in your
application where you have access to an object of the type you are
interested in::
import pdb
pdb.set_trace()
Or, if you have installed ``ipython`` and ``ipdb``, use::
import ipdb
ipdb.set_trace()
``ipdb`` gives you tab completion for names available in the
current scope.
- Parse and dump an XML instance document (using
  ``objectify.dump(el)``), capture it in a file, then look for the
  element of interest with your text editor. Here is a simple
  utility script to help do that::

      #!/usr/bin/env python
      import sys
      from lxml import objectify

      def dump(infilename):
          doc = objectify.parse(infilename)
          root = doc.getroot()
          print(objectify.dump(root))

      def main():
          args = sys.argv[1:]
          infilename = args[0]
          dump(infilename)

      if __name__ == '__main__':
          main()

- Insert the following code in your application at some point
where it will have access to the element whose API you wish to
discover::
print objectify.dump(element)
  Or, if stdout (standard output) is not available and visible to
  you, something like the following::

      import tempfile
      with tempfile.NamedTemporaryFile('w', delete=False) as outfile:
          outfile.write(objectify.dump(element))
          outfilename = outfile.name

- Or, use one of the helpers above, for example::
print objectify_helpers.child_tags(element)
Namespaces
------------
``lxml.objectify`` handles namespaces correctly; ``generateDS.py``,
especially when there are multiple namespaces in the same XML
document, does not.
Mostly, ``lxml.objectify`` handles namespaces for you without
additional effort on your part. If you are working with an element
that contains items from different namespaces, then see this:
http://lxml.de/objectify.html#namespace-handling. Sometimes, when
you use ``getattr(el, 'xxx')`` or ``el.iterchildren(tag='xxx')``, you
will need to include the namespace. Examples::
In [15]: rootgroup = root.RootGroup
In [16]: rootgroup.Group.tag
Out[16]: '{http://hdfgroup.org/HDF5/XML/schema/HDF5-File.xsd}Group'
In [17]: [el.tag for el in rootgroup.iterchildren(tag="{http://hdfgroup.org/HDF5/XML/schema/HDF5-File.xsd}Group")]
Out[17]:
['{http://hdfgroup.org/HDF5/XML/schema/HDF5-File.xsd}Group',
'{http://hdfgroup.org/HDF5/XML/schema/HDF5-File.xsd}Group',
'{http://hdfgroup.org/HDF5/XML/schema/HDF5-File.xsd}Group']
and::
In [24]: rootgroup.Dataset
Out[24]:
In [25]: getattr(rootgroup, "{http://hdfgroup.org/HDF5/XML/schema/HDF5-File.xsd}Dataset")
Out[25]:
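The namespace behavior described above can be reproduced on a small made-up document in which one child lives in a different namespace than its parent. Both namespace URIs and all tag names below are hypothetical; Python 3 is assumed.

```python
from lxml import objectify

# Hypothetical namespaces, for illustration only.
MAIN_NS = 'http://example.com/main'
EXTRA_NS = 'http://example.com/extra'

rootgroup = objectify.fromstring(
    '<RootGroup xmlns="http://example.com/main" '
    'xmlns:x="http://example.com/extra">'
    '<Group Name="g1"/><Group Name="g2"/><x:Note>hello</x:Note>'
    '</RootGroup>'
)

# Plain attribute access looks in the parent's namespace.
first_group_tag = rootgroup.Group.tag

# A child from another namespace is not found that way ...
try:
    rootgroup.Note
    plain_lookup_worked = True
except AttributeError:
    plain_lookup_worked = False

# ... but getattr() with a fully qualified tag finds it.
note = getattr(rootgroup, '{%s}Note' % EXTRA_NS)

# iterchildren() needs the qualified tag; a bare name matches only
# elements that have no namespace at all.
qualified = [el.get('Name')
             for el in rootgroup.iterchildren(tag='{%s}Group' % MAIN_NS)]
unqualified = list(rootgroup.iterchildren(tag='Group'))

print(first_group_tag, note.text, qualified, unqualified)
```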
Summary
---------
Although their approaches are very different, ``generateDS.py`` and
``lxml.objectify`` seem to solve the same set of problems and answer
equivalent sets of needs. ``generateDS.py`` generates and gives you
an API for each element type, specifically a Python class. With
``lxml.objectify``, you can discover a (simulated) API by inspecting
a dump produced by ``lxml.objectify.dump(o)`` or by using
``lxml.objectify`` and ``lxml`` capabilities in each element to
inspect the element.
When might you want to use one rather than the other?
- Since ``generateDS.py`` requires an XML schema in order to
generate code, if you do *not* have an XML schema for your
document type, then ``generateDS.py`` is not an option for you.
- If you must handle an XML document that is defined by an XML
schema that contains multiple namespaces, then, because of the
problems that ``generateDS.py`` has with namespaces, you should
choose ``lxml.objectify``.
- If you want to produce Python code that defines and implements an
API for a specific XML document type and you have an XML schema
that defines that document type, then you may want to consider
``generateDS.py``. If you want to be able to send that generated
API for use by other developers, then the ``generateDS.py``
approach might be an advantage to you. However, the content
produced by ``lxml.objectify.dump(o)`` is very close to a
description of an API for accessing and manipulating each element
in an XML document.