<html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>3.4. Further information</title><link rel="stylesheet" href="fptools.css" type="text/css"><meta name="generator" content="DocBook XSL Stylesheets V1.73.2"><link rel="start" href="index.html" title="Happy User Guide"><link rel="up" href="sec-glr.html" title="Chapter 3. Generalized LR Parsing"><link rel="prev" href="sec-glr-semantics.html" title="3.3. Including semantic results"><link rel="next" href="sec-AttributeGrammar.html" title="Chapter 4. Attribute Grammars"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="3" align="center">3.4. Further information</th></tr><tr><td width="20%" align="left"><a accesskey="p" href="sec-glr-semantics.html">Prev</a> </td><th width="60%" align="center">Chapter 3. Generalized LR Parsing</th><td width="20%" align="right"> <a accesskey="n" href="sec-AttributeGrammar.html">Next</a></td></tr></table><hr></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="sec-glr-misc"></a>3.4. Further information</h2></div></div></div><p> Other useful information... </p><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="sec-glr-misc-examples"></a>3.4.1. The GLR examples</h3></div></div></div><p> The directory <code class="literal">examples/glr</code> contains several examples from the small to the large. Please consult these or use them as a base for your experiments. </p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="sec-glr-misc-graphs"></a>3.4.2. Viewing forests as graphs</h3></div></div></div><p> If you run the examples with <span class="application">GHC</span>, each run will produce a file <code class="literal">out.daVinci</code>. This is a graph in the format expected by the <span class="emphasis"><em>daVinci</em></span> graph visualization tool. (See <a class="ulink" href="http://www.informatik.uni-bremen.de/~davinci/" target="_top">http://www.informatik.uni-bremen.de/~davinci/</a> for more information. Educational use licenses are currently available without charge.) </p><p> We highly recommend looking at graphs of parse results - it really helps to understand the results. The graphs files are created with Sven Panne's library for communicating with <span class="emphasis"><em>daVinci</em></span>, supplemented with some extensions due to Callaghan. Copies of this code are included in the examples directory, for convenience. If you are trying to view large and complex graphs, contact Paul Callaghan (there are tools and techniques to make the graphs more manageable). </p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="sec-glr-misc-applications"></a>3.4.3. Some Applications of GLR parsing</h3></div></div></div><p> GLR parsing (and related techniques) aren't just for badly written grammars or for things like natural language (NL) where ambiguity is inescapable. There are applications where ambiguity can represent possible alternatives in pattern-matching tasks, and the flexibility of these parsing techniques and the resulting graphs support deep analyses. Below, we briefly discuss some examples, a mixture from our recent work and from the literature. </p><div class="variablelist"><dl><dt><span class="term">Gene sequence analysis</span></dt><dd><p> Combinations of structures within gene sequences can be expressed as a grammar, for example a "start" combination followed by a "promoter" combination then the gene proper. A recent undergraduate project has used this GLR implementation to detect candiate matches in data, and then to filter these matches with a mixture of local and global information. </p></dd><dt><span class="term">Rhythmic structure in poetry</span></dt><dd><p> Rhythmic patterns in (English) poetry obey certain rules, and in more modern poetry can break rules in particular ways to achieve certain effects. The standard rhythmic patterns (eg. iambic pentameter) can be encoded as a grammar, and deviations from the patterns also encoded as rules. The neutral reading can be parsed with this grammar, to give a forest of alternative matches. The forest can be analysed to give a preferred reading, and to highlight certain technical features of the poetry. An undergraduate project in Durham has used this implementation for this purpose, with promising results. </p></dd><dt><span class="term">Compilers -- instruction selection</span></dt><dd><p> Recent work has phrased the translation problem in compilers from intermediate representation to an instruction set for a given processor as a matching problem. Different constructs at the intermediate level can map to several combinations of machine instructions. This knowledge can be expressed as a grammar, and instances of the problem solved by parsing. The parse forest represents competing solutions, and allows selection of optimum solutions according to various measures. </p></dd><dt><span class="term">Robust parsing of ill-formed input</span></dt><dd><p> The extra flexibility of GLR parsing can simplify parsing of formal languages where a degree of `informality' is allowed. For example, Html parsing. Modern browsers contain complex parsers which are designed to try to extract useful information from Html text which doesn't follow the rules precisely, eg missing start tags or missing end tags. Html with missing tags can be written as an ambiguous grammar, and it should be a simple matter to extract a usable interpretation from a forest of parses. Notice the technique: we widen the scope of the grammar, parse with GLR, then extract a reasonable solution. This is arguably simpler than pushing an LR(1) or LL(1) parser past its limits, and also more maintainable. </p></dd><dt><span class="term">Natural Language Processing</span></dt><dd><p> Ambiguity is inescapable in the syntax of most human languages. In realistic systems, parse forests are useful to encode competing analyses in an efficient way, and they also provide a framework for further analysis and disambiguation. Note that ambiguity can have many forms, from simple phrase attachment uncertainty to more subtle forms involving mixtures of word senses. If some degree of ungrammaticality is to be tolerated in a system, which can be done by extending the grammar with productions incorporating common forms of infelicity, the degree of ambiguity increases further. For systems used on arbitrary text, such as on newspapers, it is not uncommon that many sentences permit several hundred or more analyses. With such grammars, parse forest techniques are essential. Many recent NLP systems use such techniques, including the Durham's earlier LOLITA system - which was mostly written in Haskell. </p></dd></dl></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="sec-glr-misc-workings"></a>3.4.4. Technical details</h3></div></div></div><p> The original implementation was developed by Ben Medlock, as his undergraduate final year project, using ideas from Peter Ljungloef's Licentiate thesis (see <a class="ulink" href="http://www.cs.chalmers.se/~peb/parsing" target="_top">http://www.cs.chalmers.se/~peb/parsing</a>, and we recommend the thesis for its clear analysis of parsing algorithms). Ljungloef's version produces lists of parse trees, but Medlock adapted this to produce an explicit graph containing parse structure information. He also incorporated the code into <span class="application">Happy</span>. </p><p> After Medlock's graduation, Callaghan extended the code to incorporate semantic information, and made several improvements to the original code, such as improved local packing and support for hidden left recursion. The performance of the code was significantly improved, after changes of representation (eg to a chart-style data structure) and technique. Medlock's code was also used in several student projects, including analysis of gene sequences (Fischer) and analysis of rhythmic patterns in poetry (Henderson). </p><p> The current code implements the standard GLR algorithm extended to handle hidden left recursion. Such recursion, as in the grammar below from Rekers [1992], causes the standard algorithm to loop because the empty reduction <code class="literal">A -> </code> is always possible and the LR parser will not change state. Alternatively, there is a problem because an unknown (at the start of parsing) number of <code class="literal">A</code> items are required, to match the number of <code class="literal">i</code> tokens in the input. </p><pre class="programlisting"> S -> A Q i | + A -> </pre><p> The solution to this is not surprising. Problematic recursions are detected as zero-span reductions in a state which has a <code class="literal">goto</code> table entry looping to itself. A special symbol is pushed to the stack on the first such reduction, and such reductions are done at most once for any token alternative for any input position. When popping from the stack, if the last token being popped is such a special symbol, then two stack tails are returned: one corresponding to a conventional pop (which removes the symbol) and the other to a duplication of the special symbol (the stack is not changed, but a copy of the symbol is returned). This allows sufficient copies of the empty symbol to appear on some stack, hence allowing the parse to complete. </p><p> The forest is held in a chart-style data structure, and this supports local ambiguity packing (chart parsing is discussed in Ljungloef's thesis, among other places). A limited amount of packing of live stacks is also done, to avoid some repetition of work. </p><p> [Rekers 1992] Parser Generation for Interactive Environments, PhD thesis, University of Amsterdam, 1992. </p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="sec-glr-misc-filter"></a>3.4.5. The <code class="option">--filter</code> option</h3></div></div></div><p> You might have noticed this GLR-related option. It is an experimental feature intended to restrict the amount of structure retained in the forest by discarding everything not required for the semantic results. It may or it may not work, and may be fixed in a future release. </p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="sec-glr-misc-limitations"></a>3.4.6. Limitations and future work</h3></div></div></div><p> The parser supports hidden left recursion, but makes no attempt to handle cyclic grammars that have rules which do not consume any input. If you have a grammar like this, for example with rules like <code class="literal">S -> S</code> or <code class="literal">S -> A S | x; A -> empty</code>, the implementation will loop until you run out of stack - but if it will happen, it often happens quite quickly! </p><p> The code has been used and tested frequently over the past few years, including being used in several undergraduate projects. It should be fairly stable, but as usual, can't be guaranteed bug-free. One day I will write it in Epigram! </p><p> If you have suggestions for improvements, or requests for features, please contact Paul Callaghan. There are some changes I am considering, and some views and/or encouragement from users will be much appreciated. Further information can be found on Callaghan's <a class="ulink" href="http://www.dur.ac.uk/p.c.callaghan/happy-glr" target="_top">GLR parser page</a>. </p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="sec-glr-misc-acknowledgements"></a>3.4.7. Thanks and acknowledgements</h3></div></div></div><p> Many thanks to the people who have used and tested this software in its various forms, including Julia Fischer, James Henderson, and Aarne Ranta. </p></div></div><div class="navfooter"><hr><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="sec-glr-semantics.html">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="sec-glr.html">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="sec-AttributeGrammar.html">Next</a></td></tr><tr><td width="40%" align="left" valign="top">3.3. Including semantic results </td><td width="20%" align="center"><a accesskey="h" href="index.html">Home</a></td><td width="40%" align="right" valign="top"> Chapter 4. Attribute Grammars</td></tr></table></div></body></html>