Iterating through XIST trees
There are three related methods available for iterating through an XML tree and
finding nodes in the tree: The methods walk()
,
walknodes()
and walkpaths()
.
The walk()
method
The method walk()
is a generator. When called without
any arguments it visits each node in the tree once. Furthermore without
arguments parent nodes are yielded before their children, and no attribute nodes
are yielded. (This can however be changed by passing certain arguments to
walk()
.)
What walk()
outputs is a Cursor
object (in fact walk()
always yields the same cursor
object, but the attributes will be updated during the traversal).
A Cursor
object has the following attributes:
root
The node where traversal has been started (i.e. the object for which the
walk()
method has been called).node
The current node being traversed.
path
A list of nodes that contains the path through the tree from the root to the current node (i.e.
path[0]
isroot
andpath[-1]
isnode
).index
A path of indices (e.g.
[0, 1]
if the current node is the second child of the first child of the root). Inside attributes the index path will contain the name of the attribute (or a (attribute name, namespace name) tuple inside a global attribute).event
A string that specifies which event is currently being handled. Possible values are:
"enterelementnode"
,"leaveelementnode"
,"enterattrnode"
,"leaveattrnode"
,"textnode"
,"commentnode"
,"doctypenode"
,"procinstnode"
,"entitynode"
and"nullnode"
.
The following example shows the basic usage of the walk()
method:
>>> from ll.xist.ns import html
>>> e = html.ul(html.li(i) for i in range(3))
>>> for cursor in e.walk():
... print(f"{cursor.event} {cursor.node!r}")
...
enterelementnode <ll.xist.ns.html.ul element object (3 children/no attrs) at 0x43fbb0>
enterelementnode <ll.xist.ns.html.li element object (1 child/no attrs) at 0x452750>
textnode <ll.xist.xsc.Text content='0' at 0x5b1670>
enterelementnode <ll.xist.ns.html.li element object (1 child/no attrs) at 0x452830>
textnode <ll.xist.xsc.Text content='1' at 0x5b16e8>
enterelementnode <ll.xist.ns.html.li element object (1 child/no attrs) at 0x5b30d0>
textnode <ll.xist.xsc.Text content='2' at 0x5b1760>
The path
attribute can be used like this:
>>> from ll.xist.ns import html
>>> e = html.ul(html.li(i) for i in range(3))
>>> for cursor in e.walk():
... print([f"{n.__class__.__module__}.{n.__class__.__qualname__}" for n in cursor.path])
...
['ll.xist.ns.html.ul']
['ll.xist.ns.html.ul', 'll.xist.ns.html.li']
['ll.xist.ns.html.ul', 'll.xist.ns.html.li', 'll.xist.xsc.Text']
['ll.xist.ns.html.ul', 'll.xist.ns.html.li']
['ll.xist.ns.html.ul', 'll.xist.ns.html.li', 'll.xist.xsc.Text']
['ll.xist.ns.html.ul', 'll.xist.ns.html.li']
['ll.xist.ns.html.ul', 'll.xist.ns.html.li', 'll.xist.xsc.Text']
The following example shows how the index
attribute works:
>>> from ll.xist.ns import html
>>> e = html.ul(html.li(i) for i in range(3))
>>> for cursor in e.walk():
... print(f"{cursor.index} {cursor.node!r}")
...
[] <ll.xist.ns.html.ul element object (5 children/no attrs) at 0x4b7bb0>
[0] <ll.xist.ns.html.li element object (1 child/no attrs) at 0x4ca750>
[0, 0] <ll.xist.xsc.Text content='0' at 0x629670>
[1] <ll.xist.ns.html.li element object (1 child/no attrs) at 0x4ca830>
[1, 0] <ll.xist.xsc.Text content='1' at 0x6296e8>
[2] <ll.xist.ns.html.li element object (1 child/no attrs) at 0x62b0d0>
[2, 0] <ll.xist.xsc.Text content='2' at 0x629760>
Changing which parts of the tree are traversed
The walk()
method has a few additional parameters that
specify which part of the tree should be traversed and in which order:
entercontent
(defaultTrue
)Should the content of an element be entered? Note that when you call
walk()
withentercontent
being false,walk()
will only yield the root node itself.enterattrs
(defaultFalse
)Should the attributes of an element be entered? The following example shows the usage of
enterattrs
:>>> from ll.xist.ns import html >>> e = html.ul(html.li(i, class_=f"li-{i}") for i in range(3)) >>> for cursor in e.walk(enterattrs=True): ... indent = "\t"*(len(cursor.path)-1) ... print(f"{indent}{cursor.node!r}") ... <ll.xist.ns.html.ul element object (3 children/no attrs) at 0x51e790> <ll.xist.ns.html.li element object (1 child/1 attr) at 0x51e8b0> <ll.xist.ns.html.coreattrs.class_ attr object (1 child) at 0x532f30> <ll.xist.xsc.Text content='0' at 0x67e6c0> <ll.xist.ns.html.li element object (1 child/1 attr) at 0x67f8b0> <ll.xist.ns.html.coreattrs.class_ attr object (1 child) at 0x671720> <ll.xist.xsc.Text content='1' at 0x67e7b0> <ll.xist.ns.html.li element object (1 child/1 attr) at 0x67f930> <ll.xist.ns.html.coreattrs.class_ attr object (1 child) at 0x671630> <ll.xist.xsc.Text content='2' at 0x67e990>
When both entercontent
and enterattrs
are true, the attributes
will always be entered before the content. Setting enterattrs
to true
will only visit the attribute nodes themselves, but not their content.
enterattr
(defaultFalse
)Should the content of the attributes of an element be entered? (This is only relevant if
enterattrs
is true.) The following example shows the usage of theenterattr
parameter:>>> from ll.xist.ns import html >>> e = html.ul(html.li(i, class_=f"li-{i}") for i in range(3)) >>> for cursor in e.walk(enterattrs=True, enterattr=True): ... indent = "\t"*(len(cursor.path)-1) ... print(f"{indent}{cursor.node!r}") ... <ll.xist.ns.html.ul element object (3 children/no attrs) at 0x4c1790> <ll.xist.ns.html.li element object (1 child/1 attr) at 0x4c18b0> <ll.xist.ns.html.coreattrs.class_ attr object (1 child) at 0x4d5f30> <ll.xist.xsc.Text content='li-0' at 0x621788> <ll.xist.xsc.Text content='0' at 0x621710> <ll.xist.ns.html.li element object (1 child/1 attr) at 0x6228b0> <ll.xist.ns.html.coreattrs.class_ attr object (1 child) at 0x614720> <ll.xist.xsc.Text content='li-1' at 0x621968> <ll.xist.xsc.Text content='1' at 0x621800> <ll.xist.ns.html.li element object (1 child/1 attr) at 0x622930> <ll.xist.ns.html.coreattrs.class_ attr object (1 child) at 0x614630> <ll.xist.xsc.Text content='li-2' at 0x621ad0> <ll.xist.xsc.Text content='2' at 0x6219e0>
Changing traversal order
The default traversal order is “top down”. The following walk()
parameters can be used to change that into “bottom up” order or into visiting
each element or attribute both on the way down and up:
enterelementnode
(defaultTrue
)Should the generator yield the cursor before it enters an element (i.e. before it visits the attributes and content of the element)? The cursor attribute
event
will have the value"enterelementnode"
in this case.leaveelementnode
(defaultFalse
)Should the generator yield the cursor after it has visited an element? The cursor attribute
event
will have the value"leaveelementnode"
in this case. Passingenterelementnode=False, leaveelementnode=True
towalk()
will change “top down” traversal into “bottom up”.enterattrnode
(defaultTrue
)Should the generator yield the cursor before it enters an attribute? The cursor attribute
event
will have the value"enterattrnode"
in this case. Note that the attribute will only be entered whenenterattr
is true and it will only be visited ifenterattrs
is true.leaveattrnode
(defaultFalse
)Should the generator yield the cursor after it has visited an attribute? The cursor attribute
event
will have the value"leaveattrnode"
in this case. Note that the attribute will only be entered whenenterattr
is true and it will only be visited ifenterattrs
is true.
Passing True
for all these parameters gives us the following output:
>>> from ll.xist.ns import html
>>> e = html.ul(html.li(i, class_=f"li-{i}") for i in range(3))
>>> for cursor in e.walk(entercontent=True, enterattrs=True, enterattr=True,
... enterelementnode=True, leaveelementnode=True,
... enterattrnode=True, leaveattrnode=True):
... indent = "\t"*(len(cursor.path)-1)
... print(f"{indent}{cursor.event} {cursor.index} {cursor.node!r}")
...
enterelementnode [] <ll.xist.ns.html.ul element object (3 children/no attrs) at 0x4cbe50>
enterelementnode [0] <ll.xist.ns.html.li element object (1 child/1 attr) at 0x4de850>
enterattrnode [0, 'class'] <ll.xist.ns.html.coreattrs.class_ attr object (1 child) at 0x4f2f90>
textnode [0, 'class', 0] <ll.xist.xsc.Text content='li-0' at 0x63f800>
leaveattrnode [0, 'class'] <ll.xist.ns.html.coreattrs.class_ attr object (1 child) at 0x4f2f90>
textnode [0, 0] <ll.xist.xsc.Text content='0' at 0x63f788>
leaveelementnode [0] <ll.xist.ns.html.li element object (1 child/1 attr) at 0x4de850>
enterelementnode [1] <ll.xist.ns.html.li element object (1 child/1 attr) at 0x63e870>
enterattrnode [1, 'class'] <ll.xist.ns.html.coreattrs.class_ attr object (1 child) at 0x631780>
textnode [1, 'class', 0] <ll.xist.xsc.Text content='li-1' at 0x63f9e0>
leaveattrnode [1, 'class'] <ll.xist.ns.html.coreattrs.class_ attr object (1 child) at 0x631780>
textnode [1, 0] <ll.xist.xsc.Text content='1' at 0x63f878>
leaveelementnode [1] <ll.xist.ns.html.li element object (1 child/1 attr) at 0x63e870>
enterelementnode [2] <ll.xist.ns.html.li element object (1 child/1 attr) at 0x63e8f0>
enterattrnode [2, 'class'] <ll.xist.ns.html.coreattrs.class_ attr object (1 child) at 0x631690>
textnode [2, 'class', 0] <ll.xist.xsc.Text content='li-2' at 0x63fb48>
leaveattrnode [2, 'class'] <ll.xist.ns.html.coreattrs.class_ attr object (1 child) at 0x631690>
textnode [2, 0] <ll.xist.xsc.Text content='2' at 0x63fa58>
leaveelementnode [2] <ll.xist.ns.html.li element object (1 child/1 attr) at 0x63e8f0>
leaveelementnode [] <ll.xist.ns.html.ul element object (3 children/no attrs) at 0x4cbe50>
Skipping parts of the tree
It is possible to change the cursor attributes that specify the traversal order
during the traversal to skip certain parts of the tree. In the following example
the content of li
elements is skipped if they have a
class
attribute:
>>> from ll.xist.ns import html
>>> e = html.ul(html.li(i, class_=None if i%2 else f"li-{i}") for i in range(3))
>>> for cursor in e.walk():
... if isinstance(cursor.node, html.li) and "class_" in cursor.node.attrs:
... cursor.entercontent = False
... indent = "\t"*(len(cursor.path)-1)
... print(f"{indent}{cursor.event} {cursor.node!r}")
...
enterelementnode <ll.xist.ns.html.ul element object (3 children/no attrs) at 0x495790>
enterelementnode <ll.xist.ns.html.li element object (1 child/1 attr) at 0x4958d0>
enterelementnode <ll.xist.ns.html.li element object (1 child/no attrs) at 0x5f6130>
textnode <ll.xist.xsc.Text content='1' at 0x5f4760>
enterelementnode <ll.xist.ns.html.li element object (1 child/1 attr) at 0x5f6570>
This works for the following attributes:
entercontent
enterattrs
enterattr
enterelementnode
leaveelementnode
enterattrnode
leaveattrnode
After the walk()
generator has been reentered and the
modified attribute has been taken into account all those attributes wil be
reset to their initial value (i.e. the value that has been passed to
walk()
).
The methods walknodes()
and walkpaths()
In addition to walk()
two other methods are available:
walknodes()
and walkpaths()
.
These generators don’t produce a cursor object like walk()
does. walknodes()
produces the node itself as the
following example demonstrates:
>>> from ll.xist.ns import html
>>> e = html.ul(html.li(i) for i in range(3))
>>> for node in e.walknodes():
... print(repr(node))
...
<ll.xist.ns.html.ul element object (3 children/no attrs) at 0x43fbb0>
<ll.xist.ns.html.li element object (1 child/no attrs) at 0x452750>
<ll.xist.xsc.Text content='0' at 0x5b1670>
<ll.xist.ns.html.li element object (1 child/no attrs) at 0x452830>
<ll.xist.xsc.Text content='1' at 0x5b16e8>
<ll.xist.ns.html.li element object (1 child/no attrs) at 0x5b30d0>
<ll.xist.xsc.Text content='2' at 0x5b1760>
walkpaths()
produces the path. This is a copy of the
path, so it won’t be changed once walkpaths()
is
reentered:
>>> from ll.xist.ns import html
>>> e = html.ul(html.li(i) for i in range(3))
>>> for path in e.walkpaths():
... print([f"{n.__class__.__module__}.{n.__class__.__qualname__}" for n in path])
...
['ll.xist.ns.html.ul']
['ll.xist.ns.html.ul', 'll.xist.ns.html.li']
['ll.xist.ns.html.ul', 'll.xist.ns.html.li', 'll.xist.xsc.Text']
['ll.xist.ns.html.ul', 'll.xist.ns.html.li']
['ll.xist.ns.html.ul', 'll.xist.ns.html.li', 'll.xist.xsc.Text']
['ll.xist.ns.html.ul', 'll.xist.ns.html.li']
['ll.xist.ns.html.ul', 'll.xist.ns.html.li', 'll.xist.xsc.Text']
Filtering the output of the tree traversal
All three tree traversal methods provide an additional argument (*selectors
)
that can be used to filter which nodes/paths are produced. This argument can be
specified multiple times (which also means that all other arguments must be
passed as keyword arguments).
Passing a node class
In the simplest case you can pass a Node
subclass to get
only instances of that class. The following example prints all the links on the
Python home page:
from ll.xist import xsc, parse
from ll.xist.ns import xml, html
doc = parse.tree(
parse.URL("http://www.python.org"),
parse.Expat(ns=True),
parse.Node(pool=xsc.Pool(xml, html, chars))
)
for node in doc.walknodes(html.a):
print(node.attrs.href)
This gives the output:
http://www.python.org/
http://www.python.org/#left%2Dhand%2Dnavigation
http://www.python.org/#content%2Dbody
http://www.python.org/search
http://www.python.org/about/
http://www.python.org/news/
http://www.python.org/doc/
http://www.python.org/download/
http://www.python.org/getit/
http://www.python.org/community/
...
Passing multiple selector arguments
You can also pass multiple classes to search for nodes that are an instance of any of the classes.
The following example will print all header element on the Python home page:
from ll.xist import xsc, parse
from ll.xist.ns import xml, html, chars
doc = parse.tree(
parse.URL("http://www.python.org"),
parse.Expat(ns=True),
parse.Node(pool=xsc.Pool(xml, html, chars))
)
for node in doc.walknodes(html.h1, html.h2, html.h3, html.h4, html.h5, html.h6):
print(node.string())
This will output:
<h1 id="logoheader">
<a accesskey="1" href="http://www.python.org/" id="logolink">
<img alt="homepage" border="0" id="logo" src="http://www.python.org/images/python-logo.gif" />
</a>
</h1>
<h4><a href="http://www.python.org/about/help/">Help</a></h4>
<h4><a href="http://pypi.python.org/pypi" title="Repository of Python Software">Package Index</a></h4>
<h4><a href="http://www.python.org/download/releases/2.7.3/">Quick Links (2.7.3)</a></h4>
<h4><a href="http://www.python.org/download/releases/3.3.0/">Quick Links (3.3.0)</a></h4>
<h4><a href="http://www.python.org/community/jobs/" title="Employers and Job Openings">Python Jobs</a></h4>
<h4><a href="http://www.python.org/community/merchandise/" title="T-shirts & more; a portion goes to the PSF">Python Merchandise</a></h4>
<h4><a href="http://wiki.python.org/moin/" style="margin-top: 1.5em">Python Wiki</a></h4>
<h4><a href="http://blog.python.org/" style="margin-top: 1.5em">Python Insider Blog</a></h4>
<h4><a href="http://wiki.python.org/moin/Python2orPython3" style="margin-top: 1.5em">Python 2 or 3?</a></h4>
<h4><a href="http://www.python.org/psf/donations/" style="color: #D58228; margin-top: 1.5em">Help Fund Python</a></h4>
<h4><a href="http://wiki.python.org/moin/Languages">Non-English Resources</a></h4>
<h1 class="pageheading">Python Programming Language – Official Website</h1>
<h4>Support the Python Community</h4>
<h4><a href="http://wiki.python.org/moin/Python2orPython3">Python 3</a> Poll</h4>
<h4>NASA uses Python...</h4>
<h4>What they are saying...</h4>
<h4>Using Python For...</h4>
<h2 class="news">Python 3.3.0 released</h2>
<h2 class="news">Third rc for Python 3.3.0 released</h2>
<h2 class="news">Python Software Foundation announces Distinguished Service Award</h2>
<h2 class="news">ConFoo conference in Canada, February 25th - March 13th</h2>
<h2 class="news">Second rc for Python 3.3.0 released</h2>
<h2 class="news">First rc for Python 3.3.0 released</h2>
<h2 class="news">Fifth annual pyArkansas conference to be held</h2>
Passing a callable
It is also possible to pass a function to walk()
.
This function will be called for each visited node and gets passed the path to
the visited node. If the function returns true, the node will be output.
The following example will find all external links on the Python home page:
from ll.xist import xsc, parse
from ll.xist.ns import xml, html, chars
doc = parse.tree(
parse.URL("http://www.python.org"),
parse.Expat(ns=True),
parse.Node(pool=xsc.Pool(xml, html, chars))
)
def isextlink(path):
return isinstance(path[-1], html.a) and not str(path[-1].attrs.href).startswith("http://www.python.org")
for node in doc.walknodes(isextlink):
print(node.attrs.href)
This gives the output:
http://docs.python.org/devguide/
http://pypi.python.org/pypi
http://docs.python.org/2/
http://docs.python.org/3/
http://wiki.python.org/moin/
http://blog.python.org/
http://wiki.python.org/moin/Python2orPython3
http://wiki.python.org/moin/Languages
http://wiki.python.org/moin/Languages
...
xfind
selectors
The selector arguments for the walk methods get converted into a so called
xfind
selector. xfind
selectors look somewhat
like XPath expressions, but are implemented as pure Python expressions
(overloading various Python operators).
Every subclass of Node
can be used as an xfind selector
and combined with other xfind
selector to create more complex
ones. For example searching for links that contain images works as follows:
for path in doc.walkpaths(html.a/html.img):
print(path[-2].attrs.href, path[-1].attrs.src)
The output looks like this:
http://www.python.org/ http://www.python.org/images/python-logo.gif
http://www.python.org/#left%2Dhand%2Dnavigation http://www.python.org/images/trans.gif
http://www.python.org/#content%2Dbody http://www.python.org/images/trans.gif
http://www.python.org/psf/donations/ http://www.python.org/images/donate.png
http://wiki.python.org/moin/Languages http://www.python.org/images/worldmap.jpg
http://www.python.org/about/success/usa/ http://www.python.org/images/success/nasa.jpg
If the img
elements are not immediate children of the
a
elements, the xfind
selector above
won’t output them. In this case you can use a “decendant selector” instead of a
“child selector”. To do this simply replace html.a/html.img
with
html.a//html.img
.
Apart from the /
and //
operators you can also use the |
and
&
operators to combine xfind
selector:
from ll.xist import xsc, parse, xfind
from ll.xist.ns import xml, html
doc = parse.tree(
parse.URL("http://www.python.org"),
parse.Expat(ns=True),
parse.Node(pool=xsc.Pool(xml, html, chars))
)
for node in doc.walknodes((html.a | html.area) & xfind.hasattr("href")):
print(node.attrs.href)
Here’s another example that finds all elements that have an id
attribute:
from ll.xist import xsc, parse, xfind
from ll.xist.ns import xml, html, chars
doc = parse.tree(
parse.URL("http://www.python.org"),
parse.Expat(ns=True),
parse.Node(pool=xsc.Pool(xml, html, chars))
)
for node in doc.walknodes(xfind.hasattr("id")):
print(node.attrs.id)
The output looks like this:
screen-switcher-stylesheet
logoheader
logolink
logo
skiptonav
skiptocontent
utility-menu
searchbox
searchform
...
For more examples refer to the documentation of the xfind
module.
CSS selectors
It’s also possible to use CSS selectors as selectors for the
walk()
method. The module ll.xist.css
provides a
function selector()
that turns a CSS selector expression
into an xfind
selector:
from ll.xist import xsc, parse, css
from ll.xist.ns import xml, html, chars
doc = parse.tree(
parse.URL("http://www.python.org"),
parse.Expat(ns=True),
parse.Node(pool=xsc.Pool(xml, html, chars))
)
for cursor in doc.walk(css.selector("div#menu ul.level-one li > a")):
print(cursor.node.attrs.href)
This outputs all the first level links in the navigation:
http://www.python.org/about/
http://www.python.org/news/
http://www.python.org/doc/
http://www.python.org/download/
http://www.python.org/getit/
http://www.python.org/community/
http://www.python.org/psf/
http://docs.python.org/devguide/
Most of the CSS 3 selectors are supported.
For more examples see the documentation of the css
module.