<?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html> <head> <!-- Generated by HsColour, http://www.cs.york.ac.uk/fp/darcs/hscolour/ --> <title>Text/HTML/TagSoup.hs</title> <link type='text/css' rel='stylesheet' href='hscolour.css' /> </head> <body> <pre><a name="line-1"></a><span class='hs-comment'>{-# LANGUAGE TypeSynonymInstances, PatternGuards #-}</span> <a name="line-2"></a> <a name="line-3"></a><span class='hs-comment'>{-| <a name="line-4"></a> This module is for working with HTML/XML. It deals with both well-formed XML and <a name="line-5"></a> malformed HTML from the web. It features: <a name="line-6"></a> <a name="line-7"></a> * A lazy parser, based on the HTML 5 specification - see 'parseTags'. <a name="line-8"></a> <a name="line-9"></a> * A renderer that can write out HTML/XML - see 'renderTags'. <a name="line-10"></a> <a name="line-11"></a> * Utilities for extracting information from a document - see '~==', 'sections' and 'partitions'. <a name="line-12"></a> <a name="line-13"></a> The standard practice is to parse a 'String' to @[@'Tag' 'String'@]@ using 'parseTags', <a name="line-14"></a> then operate upon it to extract the necessary information. 
<a name="line-15"></a>-}</span> <a name="line-16"></a> <a name="line-17"></a><span class='hs-keyword'>module</span> <span class='hs-conid'>Text</span><span class='hs-varop'>.</span><span class='hs-conid'>HTML</span><span class='hs-varop'>.</span><span class='hs-conid'>TagSoup</span><span class='hs-layout'>(</span> <a name="line-18"></a> <span class='hs-comment'>-- * Data structures and parsing</span> <a name="line-19"></a> <span class='hs-conid'>Tag</span><span class='hs-layout'>(</span><span class='hs-keyglyph'>..</span><span class='hs-layout'>)</span><span class='hs-layout'>,</span> <span class='hs-conid'>Row</span><span class='hs-layout'>,</span> <span class='hs-conid'>Column</span><span class='hs-layout'>,</span> <span class='hs-conid'>Attribute</span><span class='hs-layout'>,</span> <a name="line-20"></a> <span class='hs-keyword'>module</span> <span class='hs-conid'>Text</span><span class='hs-varop'>.</span><span class='hs-conid'>HTML</span><span class='hs-varop'>.</span><span class='hs-conid'>TagSoup</span><span class='hs-varop'>.</span><span class='hs-conid'>Parser</span><span class='hs-layout'>,</span> <a name="line-21"></a> <span class='hs-keyword'>module</span> <span class='hs-conid'>Text</span><span class='hs-varop'>.</span><span class='hs-conid'>HTML</span><span class='hs-varop'>.</span><span class='hs-conid'>TagSoup</span><span class='hs-varop'>.</span><span class='hs-conid'>Render</span><span class='hs-layout'>,</span> <a name="line-22"></a> <span class='hs-varid'>canonicalizeTags</span><span class='hs-layout'>,</span> <a name="line-23"></a> <a name="line-24"></a> <span class='hs-comment'>-- * Tag identification</span> <a name="line-25"></a> <span class='hs-varid'>isTagOpen</span><span class='hs-layout'>,</span> <span class='hs-varid'>isTagClose</span><span class='hs-layout'>,</span> <span class='hs-varid'>isTagText</span><span class='hs-layout'>,</span> <span class='hs-varid'>isTagWarning</span><span class='hs-layout'>,</span> <span 
class='hs-varid'>isTagPosition</span><span class='hs-layout'>,</span> <a name="line-26"></a> <span class='hs-varid'>isTagOpenName</span><span class='hs-layout'>,</span> <span class='hs-varid'>isTagCloseName</span><span class='hs-layout'>,</span> <a name="line-27"></a> <a name="line-28"></a> <span class='hs-comment'>-- * Extraction</span> <a name="line-29"></a> <span class='hs-varid'>fromTagText</span><span class='hs-layout'>,</span> <span class='hs-varid'>fromAttrib</span><span class='hs-layout'>,</span> <a name="line-30"></a> <span class='hs-varid'>maybeTagText</span><span class='hs-layout'>,</span> <span class='hs-varid'>maybeTagWarning</span><span class='hs-layout'>,</span> <a name="line-31"></a> <span class='hs-varid'>innerText</span><span class='hs-layout'>,</span> <a name="line-32"></a> <a name="line-33"></a> <span class='hs-comment'>-- * Utility</span> <a name="line-34"></a> <span class='hs-varid'>sections</span><span class='hs-layout'>,</span> <span class='hs-varid'>partitions</span><span class='hs-layout'>,</span> <a name="line-35"></a> <a name="line-36"></a> <span class='hs-comment'>-- * Combinators</span> <a name="line-37"></a> <span class='hs-conid'>TagRep</span><span class='hs-layout'>(</span><span class='hs-keyglyph'>..</span><span class='hs-layout'>)</span><span class='hs-layout'>,</span> <span class='hs-layout'>(</span><span class='hs-varop'>~==</span><span class='hs-layout'>)</span><span class='hs-layout'>,</span><span class='hs-layout'>(</span><span class='hs-varop'>~/=</span><span class='hs-layout'>)</span> <a name="line-38"></a> <span class='hs-layout'>)</span> <span class='hs-keyword'>where</span> <a name="line-39"></a> <a name="line-40"></a><span class='hs-keyword'>import</span> <span class='hs-conid'>Text</span><span class='hs-varop'>.</span><span class='hs-conid'>HTML</span><span class='hs-varop'>.</span><span class='hs-conid'>TagSoup</span><span class='hs-varop'>.</span><span class='hs-conid'>Type</span> <a name="line-41"></a><span 
class='hs-keyword'>import</span> <span class='hs-conid'>Text</span><span class='hs-varop'>.</span><span class='hs-conid'>HTML</span><span class='hs-varop'>.</span><span class='hs-conid'>TagSoup</span><span class='hs-varop'>.</span><span class='hs-conid'>Parser</span> <a name="line-42"></a><span class='hs-keyword'>import</span> <span class='hs-conid'>Text</span><span class='hs-varop'>.</span><span class='hs-conid'>HTML</span><span class='hs-varop'>.</span><span class='hs-conid'>TagSoup</span><span class='hs-varop'>.</span><span class='hs-conid'>Render</span> <a name="line-43"></a><span class='hs-keyword'>import</span> <span class='hs-conid'>Data</span><span class='hs-varop'>.</span><span class='hs-conid'>Char</span> <a name="line-44"></a><span class='hs-keyword'>import</span> <span class='hs-conid'>Data</span><span class='hs-varop'>.</span><span class='hs-conid'>List</span> <a name="line-45"></a><span class='hs-keyword'>import</span> <span class='hs-conid'>Text</span><span class='hs-varop'>.</span><span class='hs-conid'>StringLike</span> <a name="line-46"></a> <a name="line-47"></a> <a name="line-48"></a><a name="canonicalizeTags"></a><span class='hs-comment'>-- | Turns all tag names and attribute names to lower case (attribute values</span> <a name="line-49"></a><span class='hs-comment'>-- are unchanged) and converts DOCTYPE to upper case.</span> <a name="line-50"></a><span class='hs-definition'>canonicalizeTags</span> <span class='hs-keyglyph'>::</span> <span class='hs-conid'>StringLike</span> <span class='hs-varid'>str</span> <span class='hs-keyglyph'>=></span> <span class='hs-keyglyph'>[</span><span class='hs-conid'>Tag</span> <span class='hs-varid'>str</span><span class='hs-keyglyph'>]</span> <span class='hs-keyglyph'>-></span> <span class='hs-keyglyph'>[</span><span class='hs-conid'>Tag</span> <span class='hs-varid'>str</span><span class='hs-keyglyph'>]</span> <a name="line-51"></a><span class='hs-definition'>canonicalizeTags</span> <span class='hs-keyglyph'>=</span> <span 
class='hs-varid'>map</span> <span class='hs-varid'>f</span> <a name="line-52"></a> <span class='hs-keyword'>where</span> <a name="line-53"></a> <span class='hs-varid'>f</span> <span class='hs-layout'>(</span><span class='hs-conid'>TagOpen</span> <span class='hs-varid'>tag</span> <span class='hs-varid'>attrs</span><span class='hs-layout'>)</span> <span class='hs-keyglyph'>|</span> <span class='hs-conid'>Just</span> <span class='hs-layout'>(</span><span class='hs-chr'>'!'</span><span class='hs-layout'>,</span><span class='hs-varid'>name</span><span class='hs-layout'>)</span> <span class='hs-keyglyph'>&lt;-</span> <span class='hs-varid'>uncons</span> <span class='hs-varid'>tag</span> <span class='hs-keyglyph'>=</span> <span class='hs-conid'>TagOpen</span> <span class='hs-layout'>(</span><span class='hs-chr'>'!'</span> <span class='hs-varop'>`cons`</span> <span class='hs-varid'>ucase</span> <span class='hs-varid'>name</span><span class='hs-layout'>)</span> <span class='hs-varid'>attrs</span> <a name="line-54"></a> <span class='hs-varid'>f</span> <span class='hs-layout'>(</span><span class='hs-conid'>TagOpen</span> <span class='hs-varid'>name</span> <span class='hs-varid'>attrs</span><span class='hs-layout'>)</span> <span class='hs-keyglyph'>=</span> <span class='hs-conid'>TagOpen</span> <span class='hs-layout'>(</span><span class='hs-varid'>lcase</span> <span class='hs-varid'>name</span><span class='hs-layout'>)</span> <span class='hs-keyglyph'>[</span><span class='hs-layout'>(</span><span class='hs-varid'>lcase</span> <span class='hs-varid'>k</span><span class='hs-layout'>,</span> <span class='hs-varid'>v</span><span class='hs-layout'>)</span> <span class='hs-keyglyph'>|</span> <span class='hs-layout'>(</span><span class='hs-varid'>k</span><span class='hs-layout'>,</span><span class='hs-varid'>v</span><span class='hs-layout'>)</span> <span class='hs-keyglyph'>&lt;-</span> <span class='hs-varid'>attrs</span><span class='hs-keyglyph'>]</span> <a name="line-55"></a> <span 
class='hs-varid'>f</span> <span class='hs-layout'>(</span><span class='hs-conid'>TagClose</span> <span class='hs-varid'>name</span><span class='hs-layout'>)</span> <span class='hs-keyglyph'>=</span> <span class='hs-conid'>TagClose</span> <span class='hs-layout'>(</span><span class='hs-varid'>lcase</span> <span class='hs-varid'>name</span><span class='hs-layout'>)</span> <a name="line-56"></a> <span class='hs-varid'>f</span> <span class='hs-varid'>a</span> <span class='hs-keyglyph'>=</span> <span class='hs-varid'>a</span> <a name="line-57"></a> <a name="line-58"></a> <span class='hs-varid'>ucase</span> <span class='hs-keyglyph'>=</span> <span class='hs-varid'>fromString</span> <span class='hs-varop'>.</span> <span class='hs-varid'>map</span> <span class='hs-varid'>toUpper</span> <span class='hs-varop'>.</span> <span class='hs-varid'>toString</span> <a name="line-59"></a> <span class='hs-varid'>lcase</span> <span class='hs-keyglyph'>=</span> <span class='hs-varid'>fromString</span> <span class='hs-varop'>.</span> <span class='hs-varid'>map</span> <span class='hs-varid'>toLower</span> <span class='hs-varop'>.</span> <span class='hs-varid'>toString</span> <a name="line-60"></a> <a name="line-61"></a> <a name="line-62"></a><span class='hs-comment'>-- | A class that allows a 'String' or a 'Tag' @str@ to be used as the pattern in a match</span> <a name="line-63"></a><span class='hs-keyword'>class</span> <span class='hs-conid'>TagRep</span> <span class='hs-varid'>a</span> <span class='hs-keyword'>where</span> <a name="line-64"></a> <span class='hs-varid'>toTagRep</span> <span class='hs-keyglyph'>::</span> <span class='hs-conid'>StringLike</span> <span class='hs-varid'>str</span> <span class='hs-keyglyph'>=></span> <span class='hs-varid'>a</span> <span class='hs-keyglyph'>-></span> <span class='hs-conid'>Tag</span> <span class='hs-varid'>str</span> <a name="line-65"></a> <a name="line-66"></a><span class='hs-keyword'>instance</span> <span class='hs-conid'>StringLike</span> <span 
class='hs-varid'>str</span> <span class='hs-keyglyph'>=></span> <span class='hs-conid'>TagRep</span> <span class='hs-layout'>(</span><span class='hs-conid'>Tag</span> <span class='hs-varid'>str</span><span class='hs-layout'>)</span> <span class='hs-keyword'>where</span> <span class='hs-varid'>toTagRep</span> <span class='hs-keyglyph'>=</span> <span class='hs-varid'>fmap</span> <span class='hs-varid'>castString</span> <a name="line-67"></a> <a name="line-68"></a><span class='hs-keyword'>instance</span> <span class='hs-conid'>TagRep</span> <span class='hs-conid'>String</span> <span class='hs-keyword'>where</span> <a name="line-69"></a> <span class='hs-varid'>toTagRep</span> <span class='hs-varid'>x</span> <span class='hs-keyglyph'>=</span> <span class='hs-keyword'>case</span> <span class='hs-varid'>parseTags</span> <span class='hs-varid'>x</span> <span class='hs-keyword'>of</span> <a name="line-70"></a> <span class='hs-keyglyph'>[</span><span class='hs-varid'>a</span><span class='hs-keyglyph'>]</span> <span class='hs-keyglyph'>-></span> <span class='hs-varid'>toTagRep</span> <span class='hs-varid'>a</span> <a name="line-71"></a> <span class='hs-keyword'>_</span> <span class='hs-keyglyph'>-></span> <span class='hs-varid'>error</span> <span class='hs-varop'>$</span> <span class='hs-str'>"When using a TagRep it must be exactly one tag, you gave: "</span> <span class='hs-varop'>++</span> <span class='hs-varid'>x</span> <a name="line-72"></a> <a name="line-73"></a> <a name="line-74"></a> <a name="line-75"></a><a name="~=="></a><span class='hs-comment'>-- | Performs an inexact match. The first item is the tag to test and the second is the pattern.</span> <a name="line-76"></a><span class='hs-comment'>-- A blank string in the pattern is considered to match anything.</span> <a name="line-77"></a><span class='hs-comment'>-- For example:</span> <a name="line-78"></a><span class='hs-comment'>--</span> <a name="line-79"></a><span class='hs-comment'>-- > (TagText "test" ~== TagText "" ) 
== True</span> <a name="line-80"></a><span class='hs-comment'>-- > (TagText "test" ~== TagText "test") == True</span> <a name="line-81"></a><span class='hs-comment'>-- > (TagText "test" ~== TagText "soup") == False</span> <a name="line-82"></a><span class='hs-comment'>--</span> <a name="line-83"></a><span class='hs-comment'>-- For 'TagOpen' missing attributes on the right are allowed.</span> <a name="line-84"></a><span class='hs-layout'>(</span><span class='hs-varop'>~==</span><span class='hs-layout'>)</span> <span class='hs-keyglyph'>::</span> <span class='hs-layout'>(</span><span class='hs-conid'>StringLike</span> <span class='hs-varid'>str</span><span class='hs-layout'>,</span> <span class='hs-conid'>TagRep</span> <span class='hs-varid'>t</span><span class='hs-layout'>)</span> <span class='hs-keyglyph'>=></span> <span class='hs-conid'>Tag</span> <span class='hs-varid'>str</span> <span class='hs-keyglyph'>-></span> <span class='hs-varid'>t</span> <span class='hs-keyglyph'>-></span> <span class='hs-conid'>Bool</span> <a name="line-85"></a><span class='hs-layout'>(</span><span class='hs-varop'>~==</span><span class='hs-layout'>)</span> <span class='hs-varid'>a</span> <span class='hs-varid'>b</span> <span class='hs-keyglyph'>=</span> <span class='hs-varid'>f</span> <span class='hs-varid'>a</span> <span class='hs-layout'>(</span><span class='hs-varid'>toTagRep</span> <span class='hs-varid'>b</span><span class='hs-layout'>)</span> <a name="line-86"></a> <span class='hs-keyword'>where</span> <a name="line-87"></a> <span class='hs-varid'>f</span> <span class='hs-layout'>(</span><span class='hs-conid'>TagText</span> <span class='hs-varid'>y</span><span class='hs-layout'>)</span> <span class='hs-layout'>(</span><span class='hs-conid'>TagText</span> <span class='hs-varid'>x</span><span class='hs-layout'>)</span> <span class='hs-keyglyph'>=</span> <span class='hs-varid'>strNull</span> <span class='hs-varid'>x</span> <span class='hs-varop'>||</span> <span 
class='hs-varid'>x</span> <span class='hs-varop'>==</span> <span class='hs-varid'>y</span> <a name="line-88"></a> <span class='hs-varid'>f</span> <span class='hs-layout'>(</span><span class='hs-conid'>TagClose</span> <span class='hs-varid'>y</span><span class='hs-layout'>)</span> <span class='hs-layout'>(</span><span class='hs-conid'>TagClose</span> <span class='hs-varid'>x</span><span class='hs-layout'>)</span> <span class='hs-keyglyph'>=</span> <span class='hs-varid'>strNull</span> <span class='hs-varid'>x</span> <span class='hs-varop'>||</span> <span class='hs-varid'>x</span> <span class='hs-varop'>==</span> <span class='hs-varid'>y</span> <a name="line-89"></a> <span class='hs-varid'>f</span> <span class='hs-layout'>(</span><span class='hs-conid'>TagOpen</span> <span class='hs-varid'>y</span> <span class='hs-varid'>ys</span><span class='hs-layout'>)</span> <span class='hs-layout'>(</span><span class='hs-conid'>TagOpen</span> <span class='hs-varid'>x</span> <span class='hs-varid'>xs</span><span class='hs-layout'>)</span> <span class='hs-keyglyph'>=</span> <span class='hs-layout'>(</span><span class='hs-varid'>strNull</span> <span class='hs-varid'>x</span> <span class='hs-varop'>||</span> <span class='hs-varid'>x</span> <span class='hs-varop'>==</span> <span class='hs-varid'>y</span><span class='hs-layout'>)</span> <span class='hs-varop'>&amp;&amp;</span> <span class='hs-varid'>all</span> <span class='hs-varid'>g</span> <span class='hs-varid'>xs</span> <a name="line-90"></a> <span class='hs-keyword'>where</span> <a name="line-91"></a> <span class='hs-varid'>g</span> <span class='hs-layout'>(</span><span class='hs-varid'>name</span><span class='hs-layout'>,</span><span class='hs-varid'>val</span><span class='hs-layout'>)</span> <span class='hs-keyglyph'>|</span> <span class='hs-varid'>strNull</span> <span class='hs-varid'>name</span> <span class='hs-keyglyph'>=</span> <span class='hs-varid'>val</span> <span class='hs-varop'>`elem`</span> <span class='hs-varid'>map</span> 
<span class='hs-varid'>snd</span> <span class='hs-varid'>ys</span> <a name="line-92"></a> <span class='hs-keyglyph'>|</span> <span class='hs-varid'>strNull</span> <span class='hs-varid'>val</span> <span class='hs-keyglyph'>=</span> <span class='hs-varid'>name</span> <span class='hs-varop'>`elem`</span> <span class='hs-varid'>map</span> <span class='hs-varid'>fst</span> <span class='hs-varid'>ys</span> <a name="line-93"></a> <span class='hs-varid'>g</span> <span class='hs-varid'>nameval</span> <span class='hs-keyglyph'>=</span> <span class='hs-varid'>nameval</span> <span class='hs-varop'>`elem`</span> <span class='hs-varid'>ys</span> <a name="line-94"></a> <span class='hs-varid'>f</span> <span class='hs-keyword'>_</span> <span class='hs-keyword'>_</span> <span class='hs-keyglyph'>=</span> <span class='hs-conid'>False</span> <a name="line-95"></a> <a name="line-96"></a><a name="~/="></a><span class='hs-comment'>-- | Negation of '~=='</span> <a name="line-97"></a><span class='hs-layout'>(</span><span class='hs-varop'>~/=</span><span class='hs-layout'>)</span> <span class='hs-keyglyph'>::</span> <span class='hs-layout'>(</span><span class='hs-conid'>StringLike</span> <span class='hs-varid'>str</span><span class='hs-layout'>,</span> <span class='hs-conid'>TagRep</span> <span class='hs-varid'>t</span><span class='hs-layout'>)</span> <span class='hs-keyglyph'>=></span> <span class='hs-conid'>Tag</span> <span class='hs-varid'>str</span> <span class='hs-keyglyph'>-></span> <span class='hs-varid'>t</span> <span class='hs-keyglyph'>-></span> <span class='hs-conid'>Bool</span> <a name="line-98"></a><span class='hs-layout'>(</span><span class='hs-varop'>~/=</span><span class='hs-layout'>)</span> <span class='hs-varid'>a</span> <span class='hs-varid'>b</span> <span class='hs-keyglyph'>=</span> <span class='hs-varid'>not</span> <span class='hs-layout'>(</span><span class='hs-varid'>a</span> <span class='hs-varop'>~==</span> <span class='hs-varid'>b</span><span 
class='hs-layout'>)</span> <a name="line-99"></a> <a name="line-100"></a> <a name="line-101"></a> <a name="line-102"></a><a name="sections"></a><span class='hs-comment'>-- | This function takes a list, and returns all suffixes whose</span> <a name="line-103"></a><span class='hs-comment'>-- first item matches the predicate.</span> <a name="line-104"></a><span class='hs-definition'>sections</span> <span class='hs-keyglyph'>::</span> <span class='hs-layout'>(</span><span class='hs-varid'>a</span> <span class='hs-keyglyph'>-></span> <span class='hs-conid'>Bool</span><span class='hs-layout'>)</span> <span class='hs-keyglyph'>-></span> <span class='hs-keyglyph'>[</span><span class='hs-varid'>a</span><span class='hs-keyglyph'>]</span> <span class='hs-keyglyph'>-></span> <span class='hs-keyglyph'>[</span><span class='hs-keyglyph'>[</span><span class='hs-varid'>a</span><span class='hs-keyglyph'>]</span><span class='hs-keyglyph'>]</span> <a name="line-105"></a><span class='hs-definition'>sections</span> <span class='hs-varid'>p</span> <span class='hs-keyglyph'>=</span> <span class='hs-varid'>filter</span> <span class='hs-layout'>(</span><span class='hs-varid'>p</span> <span class='hs-varop'>.</span> <span class='hs-varid'>head</span><span class='hs-layout'>)</span> <span class='hs-varop'>.</span> <span class='hs-varid'>init</span> <span class='hs-varop'>.</span> <span class='hs-varid'>tails</span> <a name="line-106"></a> <a name="line-107"></a><a name="partitions"></a><span class='hs-comment'>-- | This function is similar to 'sections', but splits the list</span> <a name="line-108"></a><span class='hs-comment'>-- so no element appears in any two partitions.</span> <a name="line-109"></a><span class='hs-definition'>partitions</span> <span class='hs-keyglyph'>::</span> <span class='hs-layout'>(</span><span class='hs-varid'>a</span> <span class='hs-keyglyph'>-></span> <span class='hs-conid'>Bool</span><span class='hs-layout'>)</span> <span class='hs-keyglyph'>-></span> <span 
class='hs-keyglyph'>[</span><span class='hs-varid'>a</span><span class='hs-keyglyph'>]</span> <span class='hs-keyglyph'>-></span> <span class='hs-keyglyph'>[</span><span class='hs-keyglyph'>[</span><span class='hs-varid'>a</span><span class='hs-keyglyph'>]</span><span class='hs-keyglyph'>]</span> <a name="line-110"></a><span class='hs-definition'>partitions</span> <span class='hs-varid'>p</span> <span class='hs-keyglyph'>=</span> <a name="line-111"></a> <span class='hs-keyword'>let</span> <span class='hs-varid'>notp</span> <span class='hs-keyglyph'>=</span> <span class='hs-varid'>not</span> <span class='hs-varop'>.</span> <span class='hs-varid'>p</span> <a name="line-112"></a> <span class='hs-keyword'>in</span> <span class='hs-varid'>groupBy</span> <span class='hs-layout'>(</span><span class='hs-varid'>const</span> <span class='hs-varid'>notp</span><span class='hs-layout'>)</span> <span class='hs-varop'>.</span> <span class='hs-varid'>dropWhile</span> <span class='hs-varid'>notp</span> </pre></body> </html>