<?xml version="1.0" encoding="utf-8" ?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <meta name="generator" content="Docutils 0.12: http://docutils.sourceforge.net/" /> <meta name="version" content="S5 1.1" /> <title>Implementing XML languages with lxml</title> <style type="text/css"> /* :Author: David Goodger (goodger@python.org) :Id: $Id: html4css1.css 7614 2013-02-21 15:55:51Z milde $ :Copyright: This stylesheet has been placed in the public domain. Default cascading style sheet for the HTML output of Docutils. See http://docutils.sf.net/docs/howto/html-stylesheets.html for how to customize this style sheet. */ /* used to remove borders from tables and images */ .borderless, table.borderless td, table.borderless th { border: 0 } table.borderless td, table.borderless th { /* Override padding for "table.docutils td" with "! important". The right padding separates the table cells. */ padding: 0 0.5em 0 0 ! important } .first { /* Override more specific margin styles with "! important". */ margin-top: 0 ! important } .last, .with-subtitle { margin-bottom: 0 ! important } .hidden { display: none } a.toc-backref { text-decoration: none ; color: black } blockquote.epigraph { margin: 2em 5em ; } dl.docutils dd { margin-bottom: 0.5em } object[type="image/svg+xml"], object[type="application/x-shockwave-flash"] { overflow: hidden; } /* Uncomment (and remove this text!) 
to get bold-faced definition list terms dl.docutils dt { font-weight: bold } */ div.abstract { margin: 2em 5em } div.abstract p.topic-title { font-weight: bold ; text-align: center } div.admonition, div.attention, div.caution, div.danger, div.error, div.hint, div.important, div.note, div.tip, div.warning { margin: 2em ; border: medium outset ; padding: 1em } div.admonition p.admonition-title, div.hint p.admonition-title, div.important p.admonition-title, div.note p.admonition-title, div.tip p.admonition-title { font-weight: bold ; font-family: sans-serif } div.attention p.admonition-title, div.caution p.admonition-title, div.danger p.admonition-title, div.error p.admonition-title, div.warning p.admonition-title, .code .error { color: red ; font-weight: bold ; font-family: sans-serif } /* Uncomment (and remove this text!) to get reduced vertical space in compound paragraphs. div.compound .compound-first, div.compound .compound-middle { margin-bottom: 0.5em } div.compound .compound-last, div.compound .compound-middle { margin-top: 0.5em } */ div.dedication { margin: 2em 5em ; text-align: center ; font-style: italic } div.dedication p.topic-title { font-weight: bold ; font-style: normal } div.figure { margin-left: 2em ; margin-right: 2em } div.footer, div.header { clear: both; font-size: smaller } div.line-block { display: block ; margin-top: 1em ; margin-bottom: 1em } div.line-block div.line-block { margin-top: 0 ; margin-bottom: 0 ; margin-left: 1.5em } div.sidebar { margin: 0 0 0.5em 1em ; border: medium outset ; padding: 1em ; background-color: #ffffee ; width: 40% ; float: right ; clear: right } div.sidebar p.rubric { font-family: sans-serif ; font-size: medium } div.system-messages { margin: 5em } div.system-messages h1 { color: red } div.system-message { border: medium outset ; padding: 1em } div.system-message p.system-message-title { color: red ; font-weight: bold } div.topic { margin: 2em } h1.section-subtitle, h2.section-subtitle, h3.section-subtitle, 
h4.section-subtitle, h5.section-subtitle, h6.section-subtitle { margin-top: 0.4em } h1.title { text-align: center } h2.subtitle { text-align: center } hr.docutils { width: 75% } img.align-left, .figure.align-left, object.align-left { clear: left ; float: left ; margin-right: 1em } img.align-right, .figure.align-right, object.align-right { clear: right ; float: right ; margin-left: 1em } img.align-center, .figure.align-center, object.align-center { display: block; margin-left: auto; margin-right: auto; } .align-left { text-align: left } .align-center { clear: both ; text-align: center } .align-right { text-align: right } /* reset inner alignment in figures */ div.align-right { text-align: inherit } /* div.align-center * { */ /* text-align: left } */ ol.simple, ul.simple { margin-bottom: 1em } ol.arabic { list-style: decimal } ol.loweralpha { list-style: lower-alpha } ol.upperalpha { list-style: upper-alpha } ol.lowerroman { list-style: lower-roman } ol.upperroman { list-style: upper-roman } p.attribution { text-align: right ; margin-left: 50% } p.caption { font-style: italic } p.credits { font-style: italic ; font-size: smaller } p.label { white-space: nowrap } p.rubric { font-weight: bold ; font-size: larger ; color: maroon ; text-align: center } p.sidebar-title { font-family: sans-serif ; font-weight: bold ; font-size: larger } p.sidebar-subtitle { font-family: sans-serif ; font-weight: bold } p.topic-title { font-weight: bold } pre.address { margin-bottom: 0 ; margin-top: 0 ; font: inherit } pre.literal-block, pre.doctest-block, pre.math, pre.code { margin-left: 2em ; margin-right: 2em } pre.code .ln { color: grey; } /* line numbers */ pre.code, code { background-color: #eeeeee } pre.code .comment, code .comment { color: #5C6576 } pre.code .keyword, code .keyword { color: #3B0D06; font-weight: bold } pre.code .literal.string, code .literal.string { color: #0C5404 } pre.code .name.builtin, code .name.builtin { color: #352B84 } pre.code .deleted, code .deleted { 
background-color: #DEB0A1} pre.code .inserted, code .inserted { background-color: #A3D289} span.classifier { font-family: sans-serif ; font-style: oblique } span.classifier-delimiter { font-family: sans-serif ; font-weight: bold } span.interpreted { font-family: sans-serif } span.option { white-space: nowrap } span.pre { white-space: pre } span.problematic { color: red } span.section-subtitle { /* font-size relative to parent (h1..h6 element) */ font-size: 80% } table.citation { border-left: solid 1px gray; margin-left: 1px } table.docinfo { margin: 2em 4em } table.docutils { margin-top: 0.5em ; margin-bottom: 0.5em } table.footnote { border-left: solid 1px black; margin-left: 1px } table.docutils td, table.docutils th, table.docinfo td, table.docinfo th { padding-left: 0.5em ; padding-right: 0.5em ; vertical-align: top } table.docutils th.field-name, table.docinfo th.docinfo-name { font-weight: bold ; text-align: left ; white-space: nowrap ; padding-left: 0 } /* "booktabs" style (no vertical lines) */ table.docutils.booktabs { border: 0px; border-top: 2px solid; border-bottom: 2px solid; border-collapse: collapse; } table.docutils.booktabs * { border: 0px; } table.docutils.booktabs th { border-bottom: thin solid; text-align: left; } h1 tt.docutils, h2 tt.docutils, h3 tt.docutils, h4 tt.docutils, h5 tt.docutils, h6 tt.docutils { font-size: 100% } ul.auto-toc { list-style-type: none } </style> <!-- configuration parameters --> <meta name="defaultView" content="slideshow" /> <meta name="controlVis" content="hidden" /> <!-- style sheet links --> <script src="ui/default/slides.js" type="text/javascript"></script> <link rel="stylesheet" href="ui/default/slides.css" type="text/css" media="projection" id="slideProj" /> <link rel="stylesheet" href="ui/default/outline.css" type="text/css" media="screen" id="outlineStyle" /> <link rel="stylesheet" href="ui/default/print.css" type="text/css" media="print" id="slidePrint" /> <link rel="stylesheet" href="ui/default/opera.css" 
type="text/css" media="projection" id="operaFix" /> </head> <body> <div class="layout"> <div id="controls"></div> <div id="currentSlide"></div> <div id="header"> </div> <div id="footer"> <h1>Implementing XML languages with lxml</h1> <h2>Dr. Stefan Behnel, EuroPython 2008, Vilnius/Lietuva</h2> </div> </div> <div class="presentation"> <div class="slide" id="slide0"> <h1 class="title">Implementing XML languages with lxml</h1> <h2 class="subtitle" id="dr-stefan-behnel">Dr. Stefan Behnel</h2> <p class="center"><a class="reference external" href="http://codespeak.net/lxml/">http://codespeak.net/lxml/</a></p> <p class="center"><a class="reference external" href="mailto:lxml-dev@codespeak.net">lxml-dev@codespeak.net</a></p> <img alt="tagpython.png" class="center" src="tagpython.png" /> </div> <div class="slide" id="what-is-an-xml-language"> <h1>What is an »XML language«?</h1> <ul class="simple"> <li>a language in XML notation</li> <li>aka »XML dialect«<ul> <li>except that it's not a dialect</li> </ul> </li> <li>Examples:<ul> <li>XML Schema</li> <li>Atom/RSS</li> <li>(X)HTML</li> <li>Open Document Format</li> <li>SOAP</li> <li>... 
add your own one here</li> </ul> </li> </ul> </div> <div class="slide" id="popular-mistakes-to-avoid-1"> <h1>Popular mistakes to avoid (1)</h1> <p>"That's easy, I can use regular expressions!"</p> <p class="incremental center">No, you can't.</p> </div> <div class="slide" id="popular-mistakes-to-avoid-2"> <h1>Popular mistakes to avoid (2)</h1> <p>"This is tree data, I'll take the DOM!"</p> </div> <div class="slide" id="id1"> <h1>Popular mistakes to avoid (2)</h1> <p>"This is tree data, I'll take the DOM!"</p> <ul class="simple"> <li>DOM is ubiquitous, but it's as complicated as Java</li> <li>uglify your application with tons of DOM code to<ul> <li>walk over non-element nodes to find the data you need</li> <li>convert text content to other data types</li> <li>modify the XML tree in memory</li> </ul> </li> </ul> <p>=> write verbose, redundant, hard-to-maintain code</p> </div> <div class="slide" id="popular-mistakes-to-avoid-3"> <h1>Popular mistakes to avoid (3)</h1> <p>"SAX is <em>so</em> fast and consumes <em>no</em> memory!"</p> </div> <div class="slide" id="id2"> <h1>Popular mistakes to avoid (3)</h1> <p>"SAX is <em>so</em> fast and consumes <em>no</em> memory!"</p> <ul class="simple"> <li>but <em>writing</em> SAX code is <em>not</em> fast!</li> <li>write error-prone, state-keeping SAX code to<ul> <li>figure out where you are</li> <li>find the sections you need</li> <li>convert text content to other data types</li> <li>copy the XML data into custom data classes</li> <li>... 
and don't forget the way back into XML!</li> </ul> </li> </ul> <p>=> write confusing state-machine code</p> <p>=> debugging into existence</p> </div> <div class="slide" id="working-with-xml"> <h1>Working with XML</h1> <blockquote> <p><strong>Getting XML work done</strong></p> <p>(instead of getting time wasted)</p> </blockquote> </div> <div class="slide" id="how-can-you-work-with-xml"> <h1>How can you work with XML?</h1> <ul class="simple"> <li>Preparation:<ul> <li>Implement usable data classes as an abstraction layer</li> <li>Implement a mapping from XML to the data classes</li> <li>Implement a mapping from the data classes to XML</li> </ul> </li> <li>Workflow:<ul> <li>parse XML data</li> <li>map XML data to data classes</li> <li>work with data classes</li> <li>map data classes to XML</li> <li>serialise XML</li> </ul> </li> </ul> <ul class="incremental simple"> <li>Approach:<ul> <li>get rid of XML and do everything in your own code</li> </ul> </li> </ul> </div> <div class="slide" id="what-if-you-could-simplify-this"> <h1>What if you could simplify this?</h1> <ul class="simple"> <li>Preparation:<ul> <li>Extend usable XML API classes into an abstraction layer</li> </ul> </li> <li>Workflow:<ul> <li>parse XML data into XML API classes</li> <li>work with XML API classes</li> <li>serialise XML</li> </ul> </li> </ul> <ul class="incremental simple"> <li>Approach:<ul> <li>cover only the quirks of XML and make it work <em>for</em> you</li> </ul> </li> </ul> </div> <div class="slide" id="id3"> <h1>What if you could simplify this ...</h1> <ul class="simple"> <li>... without sacrificing usability or flexibility?</li> <li>... using a high-speed, full-featured, pythonic XML toolkit?</li> <li>... with the power of XPath, XSLT and XML validation?</li> </ul> <p class="incremental center">... 
then »lxml« is your friend!</p> </div> <div class="slide" id="overview"> <h1>Overview</h1> <ul class="simple"> <li>What is lxml?<ul> <li>what & who</li> </ul> </li> <li>How do you use it?<ul> <li>Lesson 0: quick API overview<ul> <li>ElementTree concepts and lxml features</li> </ul> </li> <li>Lesson 1: parse XML<ul> <li>how to get XML data into memory</li> </ul> </li> <li>Lesson 2: generate XML<ul> <li>how to write an XML generator for a language</li> </ul> </li> <li>Lesson 3: working with XML trees made easy<ul> <li>how to write an XML API for a language</li> </ul> </li> </ul> </li> </ul> </div> <div class="slide" id="what-is-lxml"> <h1>What is lxml?</h1> <ul class="simple"> <li>a fast, full-featured toolkit for XML and HTML handling<ul> <li><a class="reference external" href="http://codespeak.net/lxml/">http://codespeak.net/lxml/</a></li> <li><a class="reference external" href="mailto:lxml-dev@codespeak.net">lxml-dev@codespeak.net</a></li> </ul> </li> <li>based on and inspired by<ul> <li>the C libraries libxml2 and libxslt (by Daniel Veillard)</li> <li>the ElementTree API (by Fredrik Lundh)</li> <li>the Cython compiler (by Robert Bradshaw, Greg Ewing & me)</li> <li>the Python language (by Guido & [<em>paste Misc/ACKS here</em>])</li> <li>user feedback, ideas and patches (by you!)<ul> <li>keep doing that, we love you all!</li> </ul> </li> </ul> </li> <li>maintained and (in major parts) written by myself<ul> <li>initial design and implementation by Martijn Faassen</li> <li>extensive HTML API and tools by Ian Bicking</li> </ul> </li> </ul> </div> <div class="slide" id="what-do-you-get-for-your-money"> <h1>What do you get for your money?</h1> <ul class="simple"> <li>many tools in one:<ul> <li>Generic, ElementTree compatible XML API: <strong>lxml.etree</strong><ul> <li>but faster for many tasks and much more feature-rich</li> </ul> </li> <li>Special tool set for HTML handling: <strong>lxml.html</strong></li> <li>Special API for pythonic data binding: 
<strong>lxml.objectify</strong></li> <li>General purpose path languages: XPath and CSS selectors</li> <li>Validation: DTD, XML Schema, RelaxNG, Schematron</li> <li>XSLT, XInclude, C14N, ...</li> <li>Fast tree iteration, event-driven parsing, ...</li> </ul> </li> <li>it's free, but it's worth every €-Cent!<ul> <li>what users say:<ul> <li>»no qualification, I would recommend lxml for just about any HTML task«</li> <li>»THE tool [...] for newbies and experienced developers«</li> <li>»you can do pretty much anything with an intuitive API«</li> <li>»lxml takes all the pain out of XML«</li> </ul> </li> </ul> </li> </ul> </div> <div class="slide" id="lesson-0-a-quick-overview"> <h1>Lesson 0: a quick overview</h1> <blockquote> <p>why <strong>»lxml takes all the pain out of XML«</strong></p> <p>(a quick overview of lxml features and ElementTree concepts)</p> </blockquote> <!-- >>> from lxml import etree, cssselect, html >>> some_xml_data = "<root><speech class='dialog'><p>So be it!</p></speech><p>stuff</p></root>" >>> some_html_data = "<p>Just a quick note<br>next line</p>" >>> xml_tree = etree.XML(some_xml_data) >>> html_tree = html.fragment_fromstring(some_html_data) --> </div> <div class="slide" id="namespaces-in-elementtree"> <h1>Namespaces in ElementTree</h1> <ul> <li><p class="first">uses Clark notation:</p> <ul class="simple"> <li>wrap namespace URI in <tt class="docutils literal"><span class="pre">{...}</span></tt></li> <li>append the tag name</li> </ul> <div class="highlight"><pre><span class="gp">>>> </span><span class="n">tag</span> <span class="o">=</span> <span class="s2">"{http://www.w3.org/the/namespace}tagname"</span> <span class="gp">>>> </span><span class="n">element</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">Element</span><span class="p">(</span><span class="n">tag</span><span class="p">)</span> </pre></div> </li> <li><p class="first">no prefixes!</p> </li> <li><p class="first">a single, 
self-contained tag identifier</p> </li> </ul> </div> <div class="slide" id="text-content-in-elementtree"> <h1>Text content in ElementTree</h1> <ul> <li><p class="first">uses <tt class="docutils literal">.text</tt> and <tt class="docutils literal">.tail</tt> attributes:</p> <div class="highlight"><pre><span class="gp">>>> </span><span class="n">div</span> <span class="o">=</span> <span class="n">html</span><span class="o">.</span><span class="n">fragment_fromstring</span><span class="p">(</span> <span class="gp">... </span> <span class="s2">"<div><p>a paragraph<br>split in two</p> parts</div>"</span><span class="p">)</span> <span class="gp">>>> </span><span class="n">p</span> <span class="o">=</span> <span class="n">div</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="gp">>>> </span><span class="n">br</span> <span class="o">=</span> <span class="n">p</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="gp">>>> </span><span class="n">p</span><span class="o">.</span><span class="n">text</span> <span class="go">'a paragraph'</span> <span class="gp">>>> </span><span class="n">br</span><span class="o">.</span><span class="n">text</span> <span class="gp">>>> </span><span class="n">br</span><span class="o">.</span><span class="n">tail</span> <span class="go">'split in two'</span> <span class="gp">>>> </span><span class="n">p</span><span class="o">.</span><span class="n">tail</span> <span class="go">' parts'</span> </pre></div> </li> <li><p class="first">no text nodes!</p> <ul class="simple"> <li>simplifies tree traversal a lot</li> <li>simplifies many XML algorithms</li> </ul> </li> </ul> </div> <div class="slide" id="attributes-in-elementtree"> <h1>Attributes in ElementTree</h1> <ul> <li><p class="first">uses <tt class="docutils literal">.get()</tt> and <tt class="docutils literal">.set()</tt> methods:</p> <div class="highlight"><pre><span class="gp">>>> </span><span class="n">root</span> 
<span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">fromstring</span><span class="p">(</span> <span class="gp">... </span> <span class="s1">'<root a="the value" b="of an" c="attribute"/>'</span><span class="p">)</span> <span class="gp">>>> </span><span class="n">root</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">'a'</span><span class="p">)</span> <span class="go">'the value'</span> <span class="gp">>>> </span><span class="n">root</span><span class="o">.</span><span class="n">set</span><span class="p">(</span><span class="s1">'a'</span><span class="p">,</span> <span class="s2">"THE value"</span><span class="p">)</span> <span class="gp">>>> </span><span class="n">root</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">'a'</span><span class="p">)</span> <span class="go">'THE value'</span> </pre></div> </li> <li><p class="first">or the <tt class="docutils literal">.attrib</tt> dictionary property:</p> <div class="highlight"><pre><span class="gp">>>> </span><span class="n">d</span> <span class="o">=</span> <span class="n">root</span><span class="o">.</span><span class="n">attrib</span> <span class="gp">>>> </span><span class="nb">list</span><span class="p">(</span><span class="nb">sorted</span><span class="p">(</span><span class="n">d</span><span class="o">.</span><span class="n">keys</span><span class="p">()))</span> <span class="go">['a', 'b', 'c']</span> <span class="gp">>>> </span><span class="nb">list</span><span class="p">(</span><span class="nb">sorted</span><span class="p">(</span><span class="n">d</span><span class="o">.</span><span class="n">values</span><span class="p">()))</span> <span class="go">['THE value', 'attribute', 'of an']</span> </pre></div> </li> </ul> </div> <div class="slide" id="tree-iteration-in-lxml-etree-1"> <h1>Tree iteration in lxml.etree (1)</h1> <!-- >>> import collections --> <div 
class="highlight"><pre><span class="gp">>>> </span><span class="n">root</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">fromstring</span><span class="p">(</span> <span class="gp">... </span> <span class="s2">"<root> <a><b/><b/></a> <c><d/><e><f/></e><g/></c> </root>"</span><span class="p">)</span> <span class="gp">>>> </span><span class="k">print</span><span class="p">([</span><span class="n">child</span><span class="o">.</span><span class="n">tag</span> <span class="k">for</span> <span class="n">child</span> <span class="ow">in</span> <span class="n">root</span><span class="p">])</span> <span class="c1"># children</span> <span class="go">['a', 'c']</span> <span class="gp">>>> </span><span class="k">print</span><span class="p">([</span><span class="n">el</span><span class="o">.</span><span class="n">tag</span> <span class="k">for</span> <span class="n">el</span> <span class="ow">in</span> <span class="n">root</span><span class="o">.</span><span class="n">iter</span><span class="p">()])</span> <span class="c1"># self and descendants</span> <span class="go">['root', 'a', 'b', 'b', 'c', 'd', 'e', 'f', 'g']</span> <span class="gp">>>> </span><span class="k">print</span><span class="p">([</span><span class="n">el</span><span class="o">.</span><span class="n">tag</span> <span class="k">for</span> <span class="n">el</span> <span class="ow">in</span> <span class="n">root</span><span class="o">.</span><span class="n">iterdescendants</span><span class="p">()])</span> <span class="go">['a', 'b', 'b', 'c', 'd', 'e', 'f', 'g']</span> <span class="gp">>>> </span><span class="k">def</span> <span class="nf">iter_breadth_first</span><span class="p">(</span><span class="n">root</span><span class="p">):</span> <span class="gp">... 
</span> <span class="n">bfs_queue</span> <span class="o">=</span> <span class="n">collections</span><span class="o">.</span><span class="n">deque</span><span class="p">([</span><span class="n">root</span><span class="p">])</span> <span class="gp">... </span> <span class="k">while</span> <span class="n">bfs_queue</span><span class="p">:</span> <span class="gp">... </span> <span class="n">el</span> <span class="o">=</span> <span class="n">bfs_queue</span><span class="o">.</span><span class="n">popleft</span><span class="p">()</span> <span class="c1"># pop next element</span> <span class="gp">... </span> <span class="n">bfs_queue</span><span class="o">.</span><span class="n">extend</span><span class="p">(</span><span class="n">el</span><span class="p">)</span> <span class="c1"># append its children</span> <span class="gp">... </span> <span class="k">yield</span> <span class="n">el</span> <span class="gp">>>> </span><span class="k">print</span><span class="p">([</span><span class="n">el</span><span class="o">.</span><span class="n">tag</span> <span class="k">for</span> <span class="n">el</span> <span class="ow">in</span> <span class="n">iter_breadth_first</span><span class="p">(</span><span class="n">root</span><span class="p">)])</span> <span class="go">['root', 'a', 'c', 'b', 'b', 'd', 'e', 'g', 'f']</span> </pre></div> </div> <div class="slide" id="tree-iteration-in-lxml-etree-2"> <h1>Tree iteration in lxml.etree (2)</h1> <div class="highlight"><pre><span class="gp">>>> </span><span class="n">root</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">fromstring</span><span class="p">(</span> <span class="gp">... 
</span> <span class="s2">"<root> <a><b/><b/></a> <c><d/><e><f/></e><g/></c> </root>"</span><span class="p">)</span> <span class="gp">>>> </span><span class="n">tree_walker</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">iterwalk</span><span class="p">(</span><span class="n">root</span><span class="p">,</span> <span class="n">events</span><span class="o">=</span><span class="p">(</span><span class="s1">'start'</span><span class="p">,</span> <span class="s1">'end'</span><span class="p">))</span> <span class="gp">>>> </span><span class="k">for</span> <span class="p">(</span><span class="n">event</span><span class="p">,</span> <span class="n">element</span><span class="p">)</span> <span class="ow">in</span> <span class="n">tree_walker</span><span class="p">:</span> <span class="gp">... </span> <span class="k">print</span><span class="p">(</span><span class="s2">"</span><span class="si">%s</span><span class="s2"> (</span><span class="si">%s</span><span class="s2">)"</span> <span class="o">%</span> <span class="p">(</span><span class="n">element</span><span class="o">.</span><span class="n">tag</span><span class="p">,</span> <span class="n">event</span><span class="p">))</span> <span class="go">root (start)</span> <span class="go">a (start)</span> <span class="go">b (start)</span> <span class="go">b (end)</span> <span class="go">b (start)</span> <span class="go">b (end)</span> <span class="go">a (end)</span> <span class="go">c (start)</span> <span class="go">d (start)</span> <span class="go">d (end)</span> <span class="go">e (start)</span> <span class="go">f (start)</span> <span class="go">f (end)</span> <span class="go">e (end)</span> <span class="go">g (start)</span> <span class="go">g (end)</span> <span class="go">c (end)</span> <span class="go">root (end)</span> </pre></div> </div> <div class="slide" id="path-languages-in-lxml"> <h1>Path languages in lxml</h1> <div class="highlight"><pre><span 
class="nt"><root></span> <span class="nt"><speech</span> <span class="na">class=</span><span class="s">'dialog'</span><span class="nt">><p></span>So be it!<span class="nt"></p></speech></span> <span class="nt"><p></span>stuff<span class="nt"></p></span> <span class="nt"></root></span> </pre></div> <ul> <li><p class="first">search it with XPath</p> <div class="highlight"><pre><span class="gp">>>> </span><span class="n">find_paragraphs</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">XPath</span><span class="p">(</span><span class="s2">"//p"</span><span class="p">)</span> <span class="gp">>>> </span><span class="n">paragraphs</span> <span class="o">=</span> <span class="n">find_paragraphs</span><span class="p">(</span><span class="n">xml_tree</span><span class="p">)</span> <span class="gp">>>> </span><span class="k">print</span><span class="p">([</span> <span class="n">p</span><span class="o">.</span><span class="n">text</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">paragraphs</span> <span class="p">])</span> <span class="go">['So be it!', 'stuff']</span> </pre></div> </li> <li><p class="first">search it with CSS selectors</p> <div class="highlight"><pre><span class="gp">>>> </span><span class="n">find_dialogs</span> <span class="o">=</span> <span class="n">cssselect</span><span class="o">.</span><span class="n">CSSSelector</span><span class="p">(</span><span class="s2">"speech.dialog p"</span><span class="p">)</span> <span class="gp">>>> </span><span class="n">paragraphs</span> <span class="o">=</span> <span class="n">find_dialogs</span><span class="p">(</span><span class="n">xml_tree</span><span class="p">)</span> <span class="gp">>>> </span><span class="k">print</span><span class="p">([</span> <span class="n">p</span><span class="o">.</span><span class="n">text</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span 
class="n">paragraphs</span> <span class="p">])</span> <span class="go">['So be it!']</span> </pre></div> </li> </ul> </div> <div class="slide" id="summary-of-lesson-0"> <h1>Summary of lesson 0</h1> <ul class="simple"> <li>lxml comes with various tools<ul> <li>that aim to hide the quirks of XML</li> <li>that simplify finding and handling data</li> <li>that make XML a pythonic tool by itself</li> </ul> </li> </ul> </div> <div class="slide" id="lesson-1-parsing-xml-html"> <h1>Lesson 1: parsing XML/HTML</h1> <blockquote> <p><strong>The input side</strong></p> <p>(a quick overview)</p> </blockquote> </div> <div class="slide" id="parsing-xml-and-html-from"> <h1>Parsing XML and HTML from ...</h1> <ul class="simple"> <li>strings: <tt class="docutils literal">fromstring(xml_data)</tt><ul> <li>byte strings, but also unicode strings</li> </ul> </li> <li>filenames: <tt class="docutils literal">parse(filename)</tt></li> <li>HTTP/FTP URLs: <tt class="docutils literal">parse(url)</tt></li> <li>file objects: <tt class="docutils literal">parse(f)</tt><ul> <li>open the file in binary mode: <tt class="docutils literal">f = open(filename, 'rb')</tt></li> </ul> </li> <li>file-like objects: <tt class="docutils literal">parse(f)</tt><ul> <li>these only need an <tt class="docutils literal">f.read(size)</tt> method</li> </ul> </li> <li>data chunks: <tt class="docutils literal">parser.feed(xml_chunk)</tt><ul> <li><tt class="docutils literal">result = parser.close()</tt></li> </ul> </li> </ul> <p class="small right">(parsing from strings and filenames/URLs releases the GIL)</p> </div> <div class="slide" id="example-parsing-from-a-string"> <h1>Example: parsing from a string</h1> <ul> <li><p class="first">using the <tt class="docutils literal">fromstring()</tt> function:</p> <div class="highlight"><pre><span class="gp">>>> </span><span class="n">root_element</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">fromstring</span><span class="p">(</span><span 
class="n">some_xml_data</span><span class="p">)</span> </pre></div> </li> <li><p class="first">using the <tt class="docutils literal">fromstring()</tt> function with a specific parser:</p> <div class="highlight"><pre><span class="gp">>>> </span><span class="n">parser</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">HTMLParser</span><span class="p">(</span><span class="n">remove_comments</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="gp">>>> </span><span class="n">root_element</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">fromstring</span><span class="p">(</span><span class="n">some_html_data</span><span class="p">,</span> <span class="n">parser</span><span class="p">)</span> </pre></div> </li> <li><p class="first">or the <tt class="docutils literal">XML()</tt> and <tt class="docutils literal">HTML()</tt> aliases for literals in code:</p> <div class="highlight"><pre><span class="gp">>>> </span><span class="n">root_element</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">XML</span><span class="p">(</span><span class="s2">"<root><child/></root>"</span><span class="p">)</span> <span class="gp">>>> </span><span class="n">root_element</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">HTML</span><span class="p">(</span><span class="s2">"<p>some<br>paragraph</p>"</span><span class="p">)</span> </pre></div> </li> </ul> </div> <div class="slide" id="parsing-xml-into"> <h1>Parsing XML into ...</h1> <ul class="simple"> <li>a tree in memory<ul> <li><tt class="docutils literal">parse()</tt> and <tt class="docutils literal">fromstring()</tt> functions</li> </ul> </li> <li>a tree in memory, but step-by-step with a generator<ul> <li><tt class="docutils literal">iterparse()</tt> generates <tt class="docutils literal">(start/end, 
element)</tt> events</li> <li>tree can be cleaned up to save space</li> </ul> </li> <li>SAX-like callbacks without building a tree<ul> <li><tt class="docutils literal">parse()</tt> and <tt class="docutils literal">fromstring()</tt> functions</li> <li>pass a <tt class="docutils literal">target</tt> object into the parser</li> </ul> </li> </ul> </div> <div class="slide" id="summary-of-lesson-1"> <h1>Summary of lesson 1</h1> <ul class="simple"> <li>parsing XML/HTML in lxml is mostly straightforward<ul> <li>simple functions that do the job</li> </ul> </li> <li>advanced use cases are pretty simple<ul> <li>event-driven parsing using <tt class="docutils literal">iterparse()</tt></li> <li>special parser configuration with keyword arguments<ul> <li>configuration is generally local to a parser</li> </ul> </li> </ul> </li> <li>BTW: parsing is <em>very</em> fast, as is serialising<ul> <li>don't hesitate to do parse-serialise-parse cycles</li> </ul> </li> </ul> </div> <div class="slide" id="lesson-2-generating-xml"> <h1>Lesson 2: generating XML</h1> <blockquote> <p><strong>The output side</strong></p> <p>(and how to make it safe and simple)</p> </blockquote> </div> <div class="slide" id="the-example-language-atom"> <h1>The example language: Atom</h1> <p>The Atom XML format</p> <ul class="simple"> <li>Namespace: <a class="reference external" href="http://www.w3.org/2005/Atom">http://www.w3.org/2005/Atom</a></li> <li>IETF standard (RFC 4287) derived from RSS and friends</li> <li>Atom feeds describe news entries and annotated links<ul> <li>a <tt class="docutils literal">feed</tt> contains one or more <tt class="docutils literal">entry</tt> elements</li> <li>an <tt class="docutils literal">entry</tt> contains <tt class="docutils literal">author</tt>, <tt class="docutils literal">link</tt>, <tt class="docutils literal">summary</tt> and/or <tt class="docutils literal">content</tt></li> </ul> </li> </ul> </div> <div class="slide" id="example-generate-xml-1"> <h1>Example: generate XML 
(1)</h1> <p>The ElementMaker (or <em>E-factory</em>)</p> <div class="highlight"><pre><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">lxml.builder</span> <span class="kn">import</span> <span class="n">ElementMaker</span> <span class="gp">>>> </span><span class="n">A</span> <span class="o">=</span> <span class="n">ElementMaker</span><span class="p">(</span><span class="n">namespace</span><span class="o">=</span><span class="s2">"http://www.w3.org/2005/Atom"</span><span class="p">,</span> <span class="gp">... </span> <span class="n">nsmap</span><span class="o">=</span><span class="p">{</span><span class="bp">None</span> <span class="p">:</span> <span class="s2">"http://www.w3.org/2005/Atom"</span><span class="p">})</span> </pre></div> <div class="incremental"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">atom</span> <span class="o">=</span> <span class="n">A</span><span class="o">.</span><span class="n">feed</span><span class="p">(</span> <span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">author</span><span class="p">(</span> <span class="n">A</span><span class="o">.</span><span class="n">name</span><span class="p">(</span><span class="s2">"Stefan Behnel"</span><span class="p">)</span> <span class="p">),</span> <span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">entry</span><span class="p">(</span> <span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s2">"News from lxml"</span><span class="p">),</span> <span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">link</span><span class="p">(</span><span class="n">href</span><span class="o">=</span><span class="s2">"http://codespeak.net/lxml/"</span><span class="p">),</span> <span class="gp">... 
</span> <span class="n">A</span><span class="o">.</span><span class="n">summary</span><span class="p">(</span><span class="s2">"See what's <b>fun</b> about lxml..."</span><span class="p">,</span> <span class="gp">... </span> <span class="nb">type</span><span class="o">=</span><span class="s2">"html"</span><span class="p">),</span> <span class="gp">... </span> <span class="p">)</span> <span class="gp">... </span><span class="p">)</span> </pre></div> </div><div class="incremental"><div class="highlight"><pre><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">lxml.etree</span> <span class="kn">import</span> <span class="n">tostring</span> <span class="gp">>>> </span><span class="k">print</span><span class="p">(</span> <span class="n">tostring</span><span class="p">(</span><span class="n">atom</span><span class="p">,</span> <span class="n">pretty_print</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="p">)</span> </pre></div> </div></div> <div class="slide" id="example-generate-xml-2"> <h1>Example: generate XML (2)</h1> <div class="highlight"><pre><span class="gp">>>> </span><span class="n">atom</span> <span class="o">=</span> <span class="n">A</span><span class="o">.</span><span class="n">feed</span><span class="p">(</span> <span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">author</span><span class="p">(</span> <span class="n">A</span><span class="o">.</span><span class="n">name</span><span class="p">(</span><span class="s2">"Stefan Behnel"</span><span class="p">)</span> <span class="p">),</span> <span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">entry</span><span class="p">(</span> <span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s2">"News from lxml"</span><span class="p">),</span> <span class="gp">... 
</span> <span class="n">A</span><span class="o">.</span><span class="n">link</span><span class="p">(</span><span class="n">href</span><span class="o">=</span><span class="s2">"http://codespeak.net/lxml/"</span><span class="p">),</span> <span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">summary</span><span class="p">(</span><span class="s2">"See what's <b>fun</b> about lxml..."</span><span class="p">,</span> <span class="gp">... </span> <span class="nb">type</span><span class="o">=</span><span class="s2">"html"</span><span class="p">),</span> <span class="gp">... </span> <span class="p">)</span> <span class="gp">... </span><span class="p">)</span> </pre></div> <div class="highlight"><pre><span class="nt"><feed</span> <span class="na">xmlns=</span><span class="s">"http://www.w3.org/2005/Atom"</span><span class="nt">></span> <span class="nt"><author></span> <span class="nt"><name></span>Stefan Behnel<span class="nt"></name></span> <span class="nt"></author></span> <span class="nt"><entry></span> <span class="nt"><title></span>News from lxml<span class="nt"></title></span> <span class="nt"><link</span> <span class="na">href=</span><span class="s">"http://codespeak.net/lxml/"</span><span class="nt">/></span> <span class="nt"><summary</span> <span class="na">type=</span><span class="s">"html"</span><span class="nt">></span>See what's <span class="ni">&lt;</span>b<span class="ni">&gt;</span>fun<span class="ni">&lt;</span>/b<span class="ni">&gt;</span> about lxml...<span class="nt"></summary></span> <span class="nt"></entry></span> <span class="nt"></feed></span> </pre></div> </div> <div class="slide" id="be-careful-what-you-type"> <h1>Be careful what you type!</h1> <div class="highlight"><pre><span class="gp">>>> </span><span class="n">atom</span> <span class="o">=</span> <span class="n">A</span><span class="o">.</span><span class="n">feed</span><span class="p">(</span> <span class="gp">... 
</span> <span class="n">A</span><span class="o">.</span><span class="n">author</span><span class="p">(</span> <span class="n">A</span><span class="o">.</span><span class="n">name</span><span class="p">(</span><span class="s2">"Stefan Behnel"</span><span class="p">)</span> <span class="p">),</span> <span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">entry</span><span class="p">(</span> <span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">titel</span><span class="p">(</span><span class="s2">"News from lxml"</span><span class="p">),</span> <span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">link</span><span class="p">(</span><span class="n">href</span><span class="o">=</span><span class="s2">"http://codespeak.net/lxml/"</span><span class="p">),</span> <span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">summary</span><span class="p">(</span><span class="s2">"See what's <b>fun</b> about lxml..."</span><span class="p">,</span> <span class="gp">... </span> <span class="nb">type</span><span class="o">=</span><span class="s2">"html"</span><span class="p">),</span> <span class="gp">... </span> <span class="p">)</span> <span class="gp">... 
</span><span class="p">)</span> </pre></div> <div class="highlight"><pre><span class="nt"><feed</span> <span class="na">xmlns=</span><span class="s">"http://www.w3.org/2005/Atom"</span><span class="nt">></span> <span class="nt"><author></span> <span class="nt"><name></span>Stefan Behnel<span class="nt"></name></span> <span class="nt"></author></span> <span class="nt"><entry></span> <span class="nt"><titel></span>News from lxml<span class="nt"></titel></span> <span class="nt"><link</span> <span class="na">href=</span><span class="s">"http://codespeak.net/lxml/"</span><span class="nt">/></span> <span class="nt"><summary</span> <span class="na">type=</span><span class="s">"html"</span><span class="nt">></span>See what's <span class="ni">&lt;</span>b<span class="ni">&gt;</span>fun<span class="ni">&lt;</span>/b<span class="ni">&gt;</span> about lxml...<span class="nt"></summary></span> <span class="nt"></entry></span> <span class="nt"></feed></span> </pre></div> </div> <div class="slide" id="want-more-type-safety"> <h1>Want more 'type safety'?</h1> <p>Write an XML generator <em>module</em> instead:</p> <div class="highlight"><pre><span class="c1"># atomgen.py</span> <span class="kn">from</span> <span class="nn">lxml</span> <span class="kn">import</span> <span class="n">etree</span> <span class="kn">from</span> <span class="nn">lxml.builder</span> <span class="kn">import</span> <span class="n">ElementMaker</span> <span class="n">ATOM_NAMESPACE</span> <span class="o">=</span> <span class="s2">"http://www.w3.org/2005/Atom"</span> <span class="n">A</span> <span class="o">=</span> <span class="n">ElementMaker</span><span class="p">(</span><span class="n">namespace</span><span class="o">=</span><span class="n">ATOM_NAMESPACE</span><span class="p">,</span> <span class="n">nsmap</span><span class="o">=</span><span class="p">{</span><span class="bp">None</span> <span class="p">:</span> <span class="n">ATOM_NAMESPACE</span><span class="p">})</span> <span class="n">feed</span> 
<span class="o">=</span> <span class="n">A</span><span class="o">.</span><span class="n">feed</span> <span class="n">entry</span> <span class="o">=</span> <span class="n">A</span><span class="o">.</span><span class="n">entry</span> <span class="n">title</span> <span class="o">=</span> <span class="n">A</span><span class="o">.</span><span class="n">title</span> <span class="c1"># ... and so on and so forth ...</span> <span class="c1"># plus a little validation function: isvalid()</span> <span class="n">isvalid</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">RelaxNG</span><span class="p">(</span><span class="nb">file</span><span class="o">=</span><span class="s2">"atom.rng"</span><span class="p">)</span> </pre></div> </div> <div class="slide" id="the-atom-generator-module"> <h1>The Atom generator module</h1> <!-- >>> import sys >>> sys.path.insert(0, "ep2008") --> <div class="highlight"><pre><span class="gp">>>> </span><span class="kn">import</span> <span class="nn">atomgen</span> <span class="kn">as</span> <span class="nn">A</span> <span class="gp">>>> </span><span class="n">atom</span> <span class="o">=</span> <span class="n">A</span><span class="o">.</span><span class="n">feed</span><span class="p">(</span> <span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">author</span><span class="p">(</span> <span class="n">A</span><span class="o">.</span><span class="n">name</span><span class="p">(</span><span class="s2">"Stefan Behnel"</span><span class="p">)</span> <span class="p">),</span> <span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">entry</span><span class="p">(</span> <span class="gp">... 
</span> <span class="n">A</span><span class="o">.</span><span class="n">link</span><span class="p">(</span><span class="n">href</span><span class="o">=</span><span class="s2">"http://codespeak.net/lxml/"</span><span class="p">),</span> <span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s2">"News from lxml"</span><span class="p">),</span> <span class="gp">... </span> <span class="n">A</span><span class="o">.</span><span class="n">summary</span><span class="p">(</span><span class="s2">"See what's <b>fun</b> about lxml..."</span><span class="p">,</span> <span class="gp">... </span> <span class="nb">type</span><span class="o">=</span><span class="s2">"html"</span><span class="p">),</span> <span class="gp">... </span> <span class="p">)</span> <span class="gp">... </span><span class="p">)</span> <span class="gp">>>> </span><span class="n">A</span><span class="o">.</span><span class="n">isvalid</span><span class="p">(</span><span class="n">atom</span><span class="p">)</span> <span class="c1"># ok, forgot the IDs => invalid XML ...</span> <span class="go">False</span> <span class="gp">>>> </span><span class="n">title</span> <span class="o">=</span> <span class="n">A</span><span class="o">.</span><span class="n">titel</span><span class="p">(</span><span class="s2">"News from lxml"</span><span class="p">)</span> <span class="gt">Traceback (most recent call last):</span> <span class="c">...</span> <span class="gr">AttributeError</span>: <span class="n">'module' object has no attribute 'titel'</span> </pre></div> </div> <div class="slide" id="mixing-languages-1"> <h1>Mixing languages (1)</h1> <p>Atom can embed <em>serialised</em> HTML</p> <div class="highlight"><pre><span class="gp">>>> </span><span class="kn">import</span> <span class="nn">lxml.html.builder</span> <span class="kn">as</span> <span class="nn">h</span> <span class="gp">>>> </span><span class="n">html_fragment</span> <span 
class="o">=</span> <span class="n">h</span><span class="o">.</span><span class="n">DIV</span><span class="p">(</span> <span class="gp">... </span> <span class="s2">"this is some</span><span class="se">\n</span><span class="s2">"</span><span class="p">,</span> <span class="gp">... </span> <span class="n">h</span><span class="o">.</span><span class="n">A</span><span class="p">(</span><span class="s2">"HTML"</span><span class="p">,</span> <span class="n">href</span><span class="o">=</span><span class="s2">"http://w3.org/MarkUp/"</span><span class="p">),</span> <span class="gp">... </span> <span class="s2">"</span><span class="se">\n</span><span class="s2">content"</span><span class="p">)</span> </pre></div> <div class="incremental"><div class="highlight"><pre><span class="gp">>>> </span><span class="n">serialised_html</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">tostring</span><span class="p">(</span><span class="n">html_fragment</span><span class="p">,</span> <span class="n">method</span><span class="o">=</span><span class="s2">"html"</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s2">"unicode"</span><span class="p">)</span> <span class="gp">>>> </span><span class="n">summary</span> <span class="o">=</span> <span class="n">A</span><span class="o">.</span><span class="n">summary</span><span class="p">(</span><span class="n">serialised_html</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="s2">"html"</span><span class="p">)</span> </pre></div> </div><div class="incremental"><div class="highlight"><pre><span class="gp">>>> </span><span class="k">print</span><span class="p">(</span><span class="n">etree</span><span class="o">.</span><span class="n">tostring</span><span class="p">(</span><span class="n">summary</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s2">"unicode"</span><span class="p">))</span> <span class="go"><summary xmlns="http://www.w3.org/2005/Atom" type="html"></span> <span class="go"> &lt;div&gt;this is some</span> <span class="go"> &lt;a 
href="http://w3.org/MarkUp/"&gt;HTML&lt;/a&gt;</span> <span class="go"> content&lt;/div&gt;</span> <span class="go"></summary></span> </pre></div> </div></div> <div class="slide" id="mixing-languages-2"> <h1>Mixing languages (2)</h1> <p>Atom can also embed non-escaped XHTML</p> <div class="highlight"><pre><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">copy</span> <span class="kn">import</span> <span class="n">deepcopy</span> <span class="gp">>>> </span><span class="n">xhtml_fragment</span> <span class="o">=</span> <span class="n">deepcopy</span><span class="p">(</span><span class="n">html_fragment</span><span class="p">)</span> <span class="gp">>>> </span><span class="kn">from</span> <span class="nn">lxml.html</span> <span class="kn">import</span> <span class="n">html_to_xhtml</span> <span class="gp">>>> </span><span class="n">html_to_xhtml</span><span class="p">(</span><span class="n">xhtml_fragment</span><span class="p">)</span> <span class="gp">>>> </span><span class="n">summary</span> <span class="o">=</span> <span class="n">A</span><span class="o">.</span><span class="n">summary</span><span class="p">(</span><span class="n">xhtml_fragment</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="s2">"xhtml"</span><span class="p">)</span> </pre></div> <div class="incremental"><div class="highlight"><pre><span class="gp">>>> </span><span class="k">print</span><span class="p">(</span><span class="n">etree</span><span class="o">.</span><span class="n">tostring</span><span class="p">(</span><span class="n">summary</span><span class="p">,</span> <span class="n">pretty_print</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span> <span class="go"><summary xmlns="http://www.w3.org/2005/Atom" type="xhtml"></span> <span class="go"> <html:div xmlns:html="http://www.w3.org/1999/xhtml">this is some</span> <span class="go"> <html:a href="http://w3.org/MarkUp/">HTML</html:a></span> 
<span class="go"> content</html:div></span> <span class="go"></summary></span> </pre></div> </div></div> <div class="slide" id="summary-of-lesson-2"> <h1>Summary of lesson 2</h1> <ul class="simple"> <li>generating XML is easy<ul> <li>use the ElementMaker</li> </ul> </li> <li>wrap it in a module that provides<ul> <li>the target namespace</li> <li>an ElementMaker name for each language element</li> <li>a validator</li> <li>maybe additional helper functions</li> </ul> </li> <li>mixing languages is easy<ul> <li>define a generator module for each</li> </ul> </li> </ul> <p>... this is all you need for the <em>output</em> side of XML languages</p> </div> <div class="slide" id="lesson-3-designing-xml-apis"> <h1>Lesson 3: Designing XML APIs</h1> <blockquote> <p><strong>The Element API</strong></p> <p>(and how to make it the way <em>you</em> want)</p> </blockquote> </div> <div class="slide" id="trees-in-c-and-in-python"> <h1>Trees in C and in Python</h1> <ul class="simple"> <li>Trees have two representations:<ul> <li>a plain, complete, low-level C tree provided by libxml2</li> <li>a set of Python Element proxies, each representing one element</li> </ul> </li> <li>Proxies are created on-the-fly:<ul> <li>lxml creates an Element object for a C node on request</li> <li>proxies are garbage collected when going out of scope</li> <li>XML trees are garbage collected when deleting the last proxy</li> </ul> </li> </ul> <img alt="ep2008/proxies.png" class="center" src="ep2008/proxies.png" /> </div> <div class="slide" id="mapping-python-classes-to-nodes"> <h1>Mapping Python classes to nodes</h1> <ul class="simple"> <li>Proxies can be assigned to XML nodes <em>by user code</em><ul> <li>lxml tells you about a node, you return a class</li> </ul> </li> </ul> </div> <div class="slide" id="example-a-simple-element-class-1"> <h1>Example: a simple Element class (1)</h1> <ul> <li><p class="first">define a subclass of ElementBase</p> <div class="highlight"><pre><span class="gp">>>> </span><span 
class="k">class</span> <span class="nc">HonkElement</span><span class="p">(</span><span class="n">etree</span><span class="o">.</span><span class="n">ElementBase</span><span class="p">):</span> <span class="gp">... </span> <span class="nd">@property</span> <span class="gp">... </span> <span class="k">def</span> <span class="nf">honking</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="gp">... </span> <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">'honking'</span><span class="p">)</span> <span class="o">==</span> <span class="s1">'true'</span> </pre></div> </li> <li><p class="first">let it replace the default Element class</p> <div class="highlight"><pre><span class="gp">>>> </span><span class="n">lookup</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">ElementDefaultClassLookup</span><span class="p">(</span> <span class="gp">... 
</span> <span class="n">element</span><span class="o">=</span><span class="n">HonkElement</span><span class="p">)</span> <span class="gp">>>> </span><span class="n">parser</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">XMLParser</span><span class="p">()</span> <span class="gp">>>> </span><span class="n">parser</span><span class="o">.</span><span class="n">set_element_class_lookup</span><span class="p">(</span><span class="n">lookup</span><span class="p">)</span> </pre></div> </li> </ul> </div> <div class="slide" id="example-a-simple-element-class-2"> <h1>Example: a simple Element class (2)</h1> <ul> <li><p class="first">use the new Element class</p> <div class="highlight"><pre><span class="gp">>>> </span><span class="n">root</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">XML</span><span class="p">(</span><span class="s1">'<root><honk honking="true"/></root>'</span><span class="p">,</span> <span class="gp">... 
</span> <span class="n">parser</span><span class="p">)</span> <span class="gp">>>> </span><span class="n">root</span><span class="o">.</span><span class="n">honking</span> <span class="go">False</span> <span class="gp">>>> </span><span class="n">root</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">honking</span> <span class="go">True</span> </pre></div> </li> </ul> </div> <div class="slide" id="id4"> <h1>Mapping Python classes to nodes</h1> <ul class="simple"> <li>The Element class lookup<ul> <li>lxml tells you about a node, you return a class</li> <li>no restrictions on lookup algorithm</li> <li>each parser can use a different class lookup scheme</li> <li>lookup schemes can be chained through fallbacks</li> </ul> </li> <li>Classes can be selected based on<ul> <li>the node type (element, comment or processing instruction)<ul> <li><tt class="docutils literal">ElementDefaultClassLookup()</tt></li> </ul> </li> <li>the namespaced node name<ul> <li><tt class="docutils literal">CustomElementClassLookup()</tt> + a fallback</li> <li><tt class="docutils literal">ElementNamespaceClassLookup()</tt> + a fallback</li> </ul> </li> <li>the value of an attribute (e.g. 
<tt class="docutils literal">id</tt> or <tt class="docutils literal">class</tt>)<ul> <li><tt class="docutils literal">AttributeBasedElementClassLookup()</tt> + a fallback</li> </ul> </li> <li>read-only inspection of the tree<ul> <li><tt class="docutils literal">PythonElementClassLookup()</tt> + a fallback</li> </ul> </li> </ul> </li> </ul> </div> <div class="slide" id="designing-an-atom-api"> <h1>Designing an Atom API</h1> <ul> <li><p class="first">a feed is a container for entries</p> <div class="highlight"><pre><span class="c1"># atom.py</span> <span class="n">ATOM_NAMESPACE</span> <span class="o">=</span> <span class="s2">"http://www.w3.org/2005/Atom"</span> <span class="n">_ATOM_NS</span> <span class="o">=</span> <span class="s2">"{</span><span class="si">%s</span><span class="s2">}"</span> <span class="o">%</span> <span class="n">ATOM_NAMESPACE</span> <span class="k">class</span> <span class="nc">FeedElement</span><span class="p">(</span><span class="n">etree</span><span class="o">.</span><span class="n">ElementBase</span><span class="p">):</span> <span class="nd">@property</span> <span class="k">def</span> <span class="nf">entries</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="n">_ATOM_NS</span> <span class="o">+</span> <span class="s2">"entry"</span><span class="p">)</span> </pre></div> </li> <li><p class="first">it also has a couple of meta-data children, e.g. 
<tt class="docutils literal">title</tt></p> <div class="highlight"><pre><span class="k">class</span> <span class="nc">FeedElement</span><span class="p">(</span><span class="n">etree</span><span class="o">.</span><span class="n">ElementBase</span><span class="p">):</span> <span class="c1"># ...</span> <span class="nd">@property</span> <span class="k">def</span> <span class="nf">title</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="s2">"return the title or None"</span> <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="n">_ATOM_NS</span> <span class="o">+</span> <span class="s2">"title"</span><span class="p">)</span> </pre></div> </li> </ul> </div> <div class="slide" id="consider-lxml-objectify"> <h1>Consider lxml.objectify</h1> <ul class="simple"> <li>ready-to-use, generic Python object API for XML</li> </ul> <div class="highlight"><pre><span class="o">>>></span> <span class="kn">from</span> <span class="nn">lxml</span> <span class="kn">import</span> <span class="n">objectify</span> <span class="o">>>></span> <span class="n">feed</span> <span class="o">=</span> <span class="n">objectify</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="s2">"atom-example.xml"</span><span class="p">)</span><span class="o">.</span><span class="n">getroot</span><span class="p">()</span> <span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">feed</span><span class="o">.</span><span class="n">title</span><span class="p">)</span> <span class="n">Example</span> <span class="n">Feed</span> <span class="o">>>></span> <span class="k">print</span><span class="p">([</span><span class="n">entry</span><span class="o">.</span><span class="n">title</span> <span class="k">for</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">feed</span><span class="o">.</span><span class="n">entry</span><span class="p">])</span> <span class="p">[</span><span class="s1">'Atom-Powered Robots Run 
Amok'</span><span class="p">]</span> <span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">feed</span><span class="o">.</span><span class="n">entry</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">title</span><span class="p">)</span> <span class="n">Atom</span><span class="o">-</span><span class="n">Powered</span> <span class="n">Robots</span> <span class="n">Run</span> <span class="n">Amok</span> </pre></div> </div> <div class="slide" id="still-room-for-more-convenience"> <h1>Still room for more convenience</h1> <div class="highlight"><pre><span class="kn">from</span> <span class="nn">itertools</span> <span class="kn">import</span> <span class="n">chain</span> <span class="k">class</span> <span class="nc">FeedElement</span><span class="p">(</span><span class="n">objectify</span><span class="o">.</span><span class="n">ObjectifiedElement</span><span class="p">):</span> <span class="k">def</span> <span class="nf">addIDs</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="s2">"initialise the IDs of feed and entries"</span> <span class="k">for</span> <span class="n">element</span> <span class="ow">in</span> <span class="n">chain</span><span class="p">([</span><span class="bp">self</span><span class="p">],</span> <span class="bp">self</span><span class="o">.</span><span class="n">entry</span><span class="p">):</span> <span class="k">if</span> <span class="n">element</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="n">_ATOM_NS</span> <span class="o">+</span> <span class="s2">"id"</span><span class="p">)</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span> <span class="nb">id</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">SubElement</span><span class="p">(</span><span 
class="n">element</span><span class="p">,</span> <span class="n">_ATOM_NS</span> <span class="o">+</span> <span class="s2">"id"</span><span class="p">)</span> <span class="nb">id</span><span class="o">.</span><span class="n">text</span> <span class="o">=</span> <span class="n">make_guid</span><span class="p">()</span> </pre></div> </div> <div class="slide" id="incremental-api-design"> <h1>Incremental API design</h1> <ul class="simple"> <li>choose an XML API to start with<ul> <li>lxml.etree is general purpose</li> <li>lxml.objectify is nice for document-style XML</li> </ul> </li> <li>fix Elements that really need some API sugar<ul> <li>dict-mappings to children with specific content/attributes</li> <li>properties for specially typed attributes or child values</li> <li>simplified access to varying content types of an element</li> <li>shortcuts for unnecessarily deep subtrees</li> </ul> </li> <li>ignore what works well enough with the Element API<ul> <li>lists of homogeneous children -> Element iteration</li> <li>string attributes -> .get()/.set()</li> </ul> </li> <li>let the API grow at your fingertips<ul> <li>play with it and test use cases</li> <li>avoid "I want because I can" feature explosion!</li> </ul> </li> </ul> </div> <div class="slide" id="setting-up-the-element-mapping"> <h1>Setting up the Element mapping</h1> <p>Atom has a namespace => leave the mapping to lxml</p> <div class="highlight"><pre><span class="c1"># ...</span> <span class="n">_atom_lookup</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">ElementNamespaceClassLookup</span><span class="p">(</span> <span class="n">objectify</span><span class="o">.</span><span class="n">ObjectifyElementClassLookup</span><span class="p">())</span> <span class="c1"># map the classes to tag names</span> <span class="n">ns</span> <span class="o">=</span> <span class="n">_atom_lookup</span><span class="o">.</span><span class="n">get_namespace</span><span 
class="p">(</span><span class="n">ATOM_NAMESPACE</span><span class="p">)</span> <span class="n">ns</span><span class="p">[</span><span class="s2">"feed"</span><span class="p">]</span> <span class="o">=</span> <span class="n">FeedElement</span> <span class="n">ns</span><span class="p">[</span><span class="s2">"entry"</span><span class="p">]</span> <span class="o">=</span> <span class="n">EntryElement</span> <span class="c1"># ... and so on</span> <span class="c1"># or use ns.update(vars()) with appropriate class names</span> <span class="c1"># create a parser that does some whitespace cleanup</span> <span class="n">atom_parser</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">XMLParser</span><span class="p">(</span><span class="n">remove_blank_text</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="c1"># make it use our Atom classes</span> <span class="n">atom_parser</span><span class="o">.</span><span class="n">set_element_class_lookup</span><span class="p">(</span><span class="n">_atom_lookup</span><span class="p">)</span> <span class="c1"># and help users in using our parser setup</span> <span class="k">def</span> <span class="nf">parse</span><span class="p">(</span><span class="nb">input</span><span class="p">):</span> <span class="k">return</span> <span class="n">etree</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="nb">input</span><span class="p">,</span> <span class="n">atom_parser</span><span class="p">)</span> </pre></div> </div> <div class="slide" id="using-your-new-atom-api"> <h1>Using your new Atom API</h1> <div class="highlight"><pre><span class="gp">>>> </span><span class="kn">import</span> <span class="nn">atom</span> <span class="gp">>>> </span><span class="n">feed</span> <span class="o">=</span> <span class="n">atom</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span 
class="s2">"ep2008/atom-example.xml"</span><span class="p">)</span><span class="o">.</span><span class="n">getroot</span><span class="p">()</span> <span class="gp">>>> </span><span class="k">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">feed</span><span class="o">.</span><span class="n">entry</span><span class="p">))</span> <span class="go">1</span> <span class="gp">>>> </span><span class="k">print</span><span class="p">([</span><span class="n">entry</span><span class="o">.</span><span class="n">title</span> <span class="k">for</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">feed</span><span class="o">.</span><span class="n">entry</span><span class="p">])</span> <span class="go">['Atom-Powered Robots Run Amok']</span> <span class="gp">>>> </span><span class="n">link_tag</span> <span class="o">=</span> <span class="s2">"{</span><span class="si">%s</span><span class="s2">}link"</span> <span class="o">%</span> <span class="n">atom</span><span class="o">.</span><span class="n">ATOM_NAMESPACE</span> <span class="gp">>>> </span><span class="k">print</span><span class="p">([</span><span class="n">link</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">"href"</span><span class="p">)</span> <span class="k">for</span> <span class="n">link</span> <span class="ow">in</span> <span class="n">feed</span><span class="o">.</span><span class="n">iter</span><span class="p">(</span><span class="n">link_tag</span><span class="p">)])</span> <span class="go">['http://example.org/', 'http://example.org/2003/12/13/atom03']</span> </pre></div> </div> <div class="slide" id="summary-of-lesson-3"> <h1>Summary of lesson 3</h1> <p>To implement an XML API ...</p> <ol class="arabic simple"> <li>start off with lxml's Element API<ul> <li>or take a look at the object API of lxml.objectify</li> </ul> </li> <li>specialise it into a set of custom Element classes</li> 
<li>map them to XML tags using one of the lookup schemes</li> <li>improve the API incrementally while using it<ul> <li>discover inconveniences and beautify them</li> <li>avoid putting work into things that work</li> </ul> </li> </ol> </div> <div class="slide" id="conclusion"> <h1>Conclusion</h1> <p>lxml ...</p> <ul class="simple"> <li>provides a convenient set of tools for XML and HTML<ul> <li>parsing</li> <li>generating</li> <li>working with in-memory trees</li> </ul> </li> <li>follows Python idioms wherever possible<ul> <li>highly extensible through wrapping and subclassing</li> <li>callable objects for XPath, CSS selectors, XSLT, schemas</li> <li>iteration for tree traversal (even while parsing)</li> <li>list-/dict-like APIs, properties, keyword arguments, ...</li> </ul> </li> <li>makes extension and specialisation easy<ul> <li>write a special XML generator module in trivial code</li> <li>write your own XML API incrementally on-the-fly</li> </ul> </li> </ul> </div> </div> </body> </html>