<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">


<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    
    <title>How to create highlighted search result excerpts &mdash; Whoosh 2.5.7 documentation</title>
    
    <link rel="stylesheet" href="_static/default.css" type="text/css" />
    <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
    
    <script type="text/javascript">
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    '',
        VERSION:     '2.5.7',
        COLLAPSE_INDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script>
    <script type="text/javascript" src="_static/jquery.js"></script>
    <script type="text/javascript" src="_static/underscore.js"></script>
    <script type="text/javascript" src="_static/doctools.js"></script>
    <link rel="top" title="Whoosh 2.5.7 documentation" href="index.html" />
    <link rel="next" title="Query expansion and Key word extraction" href="keywords.html" />
    <link rel="prev" title="Sorting and faceting" href="facets.html" /> 
  </head>
  <body>
    <div class="related">
      <h3>Navigation</h3>
      <ul>
        <li class="right" style="margin-right: 10px">
          <a href="genindex.html" title="General Index"
             accesskey="I">index</a></li>
        <li class="right" >
          <a href="py-modindex.html" title="Python Module Index"
             >modules</a> |</li>
        <li class="right" >
          <a href="keywords.html" title="Query expansion and Key word extraction"
             accesskey="N">next</a> |</li>
        <li class="right" >
          <a href="facets.html" title="Sorting and faceting"
             accesskey="P">previous</a> |</li>
        <li><a href="index.html">Whoosh 2.5.7 documentation</a> &raquo;</li> 
      </ul>
    </div>  

    <div class="document">
      <div class="documentwrapper">
        <div class="bodywrapper">
          <div class="body">
            
  <div class="section" id="how-to-create-highlighted-search-result-excerpts">
<h1>How to create highlighted search result excerpts<a class="headerlink" href="#how-to-create-highlighted-search-result-excerpts" title="Permalink to this headline">¶</a></h1>
<div class="section" id="overview">
<h2>Overview<a class="headerlink" href="#overview" title="Permalink to this headline">¶</a></h2>
<p>The highlighting system works as a pipeline, with four component types.</p>
<ul class="simple">
<li><strong>Fragmenters</strong> chop up the original text into <em>fragments</em>, based on the
locations of matched terms in the text.</li>
<li><strong>Scorers</strong> assign a score to each fragment, allowing the system to rank the
best fragments by whatever criterion you choose.</li>
<li><strong>Order functions</strong> control the order in which the top-scoring fragments are
presented to the user. For example, you can show the fragments in the order
they appear in the document (FIRST) or show higher-scoring fragments first
(SCORE).</li>
<li><strong>Formatters</strong> turn the fragment objects into human-readable output, such as
an HTML string.</li>
</ul>
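<p>As a rough mental model only, the four stages can be imitated in plain Python. This sketch is not Whoosh&#8217;s implementation: every function name is hypothetical, fragments are plain dictionaries, sentences stand in for fragments, and matching is a case-insensitive word comparison.</p>

```python
import re

WORD = re.compile(r"\w+")


def fragmenter(text, terms):
    """Split on sentence punctuation; keep sentences with a matched term."""
    for m in re.finditer(r"[^.!?]+[.!?]?", text):
        words = [w.lower() for w in WORD.findall(m.group())]
        matched = [w for w in words if w in terms]
        if matched:
            yield {"start": m.start(), "text": m.group().strip(),
                   "matched": matched}


def scorer(fragment):
    """Count matched terms, plus a bonus for distinct matched terms."""
    return len(fragment["matched"]) + len(set(fragment["matched"]))


def first(fragment):
    """Order function: fragments earlier in the text sort first."""
    return fragment["start"]


def formatter(fragment, terms):
    """Uppercase each matched word in the fragment text."""
    def repl(m):
        word = m.group()
        return word.upper() if word.lower() in terms else word
    return WORD.sub(repl, fragment["text"])


def highlight_sketch(text, terms, top=2, order=first):
    """Fragment, score, keep the best `top`, order, then format."""
    terms = {t.lower() for t in terms}
    best = sorted(fragmenter(text, terms), key=scorer, reverse=True)[:top]
    return [formatter(f, terms) for f in sorted(best, key=order)]
```

<p>Calling <tt class="docutils literal"><span class="pre">highlight_sketch(text,</span> <span class="pre">["whoosh"])</span></tt> returns a list of formatted strings, one per selected fragment. The point is the separation of concerns: each stage can be swapped without touching the others.</p>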
</div>
<div class="section" id="requirements">
<h2>Requirements<a class="headerlink" href="#requirements" title="Permalink to this headline">¶</a></h2>
<p>Highlighting requires that you have the text of the indexed document available.
You can keep the text in a stored field, or, if the original text is available
in a file, database column, etc., reload it on the fly. Note that you might
need to process the text first to remove HTML tags, wiki markup, and so on.</p>
</div>
<div class="section" id="how-to">
<h2>How to<a class="headerlink" href="#how-to" title="Permalink to this headline">¶</a></h2>
<p>Get search results:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="n">results</span> <span class="o">=</span> <span class="n">mysearcher</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">myquery</span><span class="p">)</span>
<span class="k">for</span> <span class="n">hit</span> <span class="ow">in</span> <span class="n">results</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="n">hit</span><span class="p">[</span><span class="s">&quot;title&quot;</span><span class="p">])</span>
</pre></div>
</div>
<p>You can use the <a class="reference internal" href="api/searching.html#whoosh.searching.Hit.highlights" title="whoosh.searching.Hit.highlights"><tt class="xref py py-meth docutils literal"><span class="pre">highlights()</span></tt></a> method on the
<a class="reference internal" href="api/searching.html#whoosh.searching.Hit" title="whoosh.searching.Hit"><tt class="xref py py-class docutils literal"><span class="pre">whoosh.searching.Hit</span></tt></a> object to get highlighted snippets from the
document containing the search terms.</p>
<p>The first argument is the name of the field to highlight. If the field is
stored, this is the only argument you need to supply:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="n">results</span> <span class="o">=</span> <span class="n">mysearcher</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">myquery</span><span class="p">)</span>
<span class="k">for</span> <span class="n">hit</span> <span class="ow">in</span> <span class="n">results</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="n">hit</span><span class="p">[</span><span class="s">&quot;title&quot;</span><span class="p">])</span>
    <span class="c"># Assume &quot;content&quot; field is stored</span>
    <span class="k">print</span><span class="p">(</span><span class="n">hit</span><span class="o">.</span><span class="n">highlights</span><span class="p">(</span><span class="s">&quot;content&quot;</span><span class="p">))</span>
</pre></div>
</div>
<p>If the field is not stored, you need to retrieve the text of the field some
other way, for example by reading it from the original file or a database. Then
you can supply the text to highlight with the <tt class="docutils literal"><span class="pre">text</span></tt> argument:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="n">results</span> <span class="o">=</span> <span class="n">mysearcher</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">myquery</span><span class="p">)</span>
<span class="k">for</span> <span class="n">hit</span> <span class="ow">in</span> <span class="n">results</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="n">hit</span><span class="p">[</span><span class="s">&quot;title&quot;</span><span class="p">])</span>

    <span class="c"># Assume the &quot;path&quot; stored field contains a path to the original file</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">hit</span><span class="p">[</span><span class="s">&quot;path&quot;</span><span class="p">])</span> <span class="k">as</span> <span class="n">fileobj</span><span class="p">:</span>
        <span class="n">filecontents</span> <span class="o">=</span> <span class="n">fileobj</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>

    <span class="k">print</span><span class="p">(</span><span class="n">hit</span><span class="o">.</span><span class="n">highlights</span><span class="p">(</span><span class="s">&quot;content&quot;</span><span class="p">,</span> <span class="n">text</span><span class="o">=</span><span class="n">filecontents</span><span class="p">))</span>
</pre></div>
</div>
</div>
<div class="section" id="the-character-limit">
<h2>The character limit<a class="headerlink" href="#the-character-limit" title="Permalink to this headline">¶</a></h2>
<p>By default, Whoosh only pulls fragments from the first 32K characters of the
text. This prevents very long texts from bogging down the highlighting process
too much, and is usually justified since important/summary information
tends to be at the start of a document. However, if you find the highlights are
missing information (for example, very long encyclopedia articles where the
terms appear in a later section), you can increase the fragmenter&#8217;s character
limit.</p>
<p>You can change the character limit on the results object like this:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="n">results</span> <span class="o">=</span> <span class="n">mysearcher</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">myquery</span><span class="p">)</span>
<span class="n">results</span><span class="o">.</span><span class="n">fragmenter</span><span class="o">.</span><span class="n">charlimit</span> <span class="o">=</span> <span class="mi">100000</span>
</pre></div>
</div>
<p>To turn off the character limit:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="n">results</span><span class="o">.</span><span class="n">fragmenter</span><span class="o">.</span><span class="n">charlimit</span> <span class="o">=</span> <span class="bp">None</span>
</pre></div>
</div>
<p>If you instantiate a custom fragmenter, you can set the character limit on it
directly:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">whoosh</span> <span class="kn">import</span> <span class="n">highlight</span>

<span class="n">sf</span> <span class="o">=</span> <span class="n">highlight</span><span class="o">.</span><span class="n">SentenceFragmenter</span><span class="p">(</span><span class="n">charlimit</span><span class="o">=</span><span class="mi">100000</span><span class="p">)</span>
<span class="n">results</span><span class="o">.</span><span class="n">fragmenter</span> <span class="o">=</span> <span class="n">sf</span>
</pre></div>
</div>
<p>See below for information on customizing the highlights.</p>
<p>If you increase or disable the character limit to highlight long documents, you
may need to use the tips in the &#8220;speeding up highlighting&#8221; section below to
make highlighting faster.</p>
</div>
<div class="section" id="customizing-the-highlights">
<h2>Customizing the highlights<a class="headerlink" href="#customizing-the-highlights" title="Permalink to this headline">¶</a></h2>
<div class="section" id="number-of-fragments">
<h3>Number of fragments<a class="headerlink" href="#number-of-fragments" title="Permalink to this headline">¶</a></h3>
<p>You can use the <tt class="docutils literal"><span class="pre">top</span></tt> keyword argument to control the number of fragments
returned in each snippet:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="c"># Show a maximum of 5 fragments from the document</span>
<span class="k">print</span><span class="p">(</span><span class="n">hit</span><span class="o">.</span><span class="n">highlights</span><span class="p">(</span><span class="s">&quot;content&quot;</span><span class="p">,</span> <span class="n">top</span><span class="o">=</span><span class="mi">5</span><span class="p">))</span>
</pre></div>
</div>
</div>
<div class="section" id="fragment-size">
<h3>Fragment size<a class="headerlink" href="#fragment-size" title="Permalink to this headline">¶</a></h3>
<p>The default fragmenter has a <tt class="docutils literal"><span class="pre">maxchars</span></tt> attribute (default 200) controlling
the maximum length of a fragment, and a <tt class="docutils literal"><span class="pre">surround</span></tt> attribute (default 20)
controlling the maximum number of characters of context to add at the beginning
and end of a fragment:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="c"># Allow larger fragments</span>
<span class="n">results</span><span class="o">.</span><span class="n">fragmenter</span><span class="o">.</span><span class="n">maxchars</span> <span class="o">=</span> <span class="mi">300</span>

<span class="c"># Show more context before and after</span>
<span class="n">results</span><span class="o">.</span><span class="n">fragmenter</span><span class="o">.</span><span class="n">surround</span> <span class="o">=</span> <span class="mi">50</span>
</pre></div>
</div>
</div>
<div class="section" id="fragmenter">
<h3>Fragmenter<a class="headerlink" href="#fragmenter" title="Permalink to this headline">¶</a></h3>
<p>A fragmenter controls how to extract excerpts from the original text.</p>
<p>The <tt class="docutils literal"><span class="pre">highlight</span></tt> module has the following pre-made fragmenters:</p>
<dl class="docutils">
<dt><a class="reference internal" href="api/highlight.html#whoosh.highlight.ContextFragmenter" title="whoosh.highlight.ContextFragmenter"><tt class="xref py py-class docutils literal"><span class="pre">whoosh.highlight.ContextFragmenter</span></tt></a> (the default)</dt>
<dd>This is a &#8220;smart&#8221; fragmenter that finds matched terms and then pulls
in surrounding text to form fragments. This fragmenter only yields
fragments that contain matched terms.</dd>
<dt><a class="reference internal" href="api/highlight.html#whoosh.highlight.SentenceFragmenter" title="whoosh.highlight.SentenceFragmenter"><tt class="xref py py-class docutils literal"><span class="pre">whoosh.highlight.SentenceFragmenter</span></tt></a></dt>
<dd>Tries to break the text into fragments based on sentence punctuation
(&#8220;.&#8221;, &#8220;!&#8221;, and &#8220;?&#8221;). This object works by looking in the original
text for a sentence end as the next character after each token&#8217;s
<tt class="docutils literal"><span class="pre">endchar</span></tt>. Can be fooled by e.g. source code, decimals, etc.</dd>
<dt><a class="reference internal" href="api/highlight.html#whoosh.highlight.WholeFragmenter" title="whoosh.highlight.WholeFragmenter"><tt class="xref py py-class docutils literal"><span class="pre">whoosh.highlight.WholeFragmenter</span></tt></a></dt>
<dd>Returns the entire text as one &#8220;fragment&#8221;. This can be useful if you
are highlighting a short bit of text and don&#8217;t need to fragment it.</dd>
</dl>
<p>The different fragmenters have different options. For example, the default
<a class="reference internal" href="api/highlight.html#whoosh.highlight.ContextFragmenter" title="whoosh.highlight.ContextFragmenter"><tt class="xref py py-class docutils literal"><span class="pre">ContextFragmenter</span></tt></a> lets you set the maximum
fragment size and the size of the context to add on either side:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="n">my_cf</span> <span class="o">=</span> <span class="n">highlight</span><span class="o">.</span><span class="n">ContextFragmenter</span><span class="p">(</span><span class="n">maxchars</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">surround</span><span class="o">=</span><span class="mi">30</span><span class="p">)</span>
</pre></div>
</div>
<p>See the <a class="reference internal" href="api/highlight.html#module-whoosh.highlight" title="whoosh.highlight"><tt class="xref py py-mod docutils literal"><span class="pre">whoosh.highlight</span></tt></a> docs for more information.</p>
<p>To use a different fragmenter:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="n">results</span><span class="o">.</span><span class="n">fragmenter</span> <span class="o">=</span> <span class="n">my_cf</span>
</pre></div>
</div>
</div>
<div class="section" id="scorer">
<h3>Scorer<a class="headerlink" href="#scorer" title="Permalink to this headline">¶</a></h3>
<p>A scorer is a callable that takes a <a class="reference internal" href="api/highlight.html#whoosh.highlight.Fragment" title="whoosh.highlight.Fragment"><tt class="xref py py-class docutils literal"><span class="pre">whoosh.highlight.Fragment</span></tt></a> object and
returns a sortable value (where higher values represent better fragments).
The default scorer adds up the number of matched terms in the fragment, and
adds a &#8220;bonus&#8221; for the number of <em>different</em> matched terms. The highlighting
system uses this score to select the best fragments to show to the user.</p>
<p>As an example of a custom scorer, to rank fragments by lowest standard
deviation of the positions of matched terms in the fragment:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="k">def</span> <span class="nf">StandardDeviationScorer</span><span class="p">(</span><span class="n">fragment</span><span class="p">):</span>
    <span class="sd">&quot;&quot;&quot;Gives higher scores to fragments where the matched terms are close</span>
<span class="sd">    together.</span>
<span class="sd">    &quot;&quot;&quot;</span>

    <span class="c"># Since lower values are better in this case, we need to negate the</span>
    <span class="c"># value</span>
    <span class="k">return</span> <span class="mi">0</span> <span class="o">-</span> <span class="n">stddev</span><span class="p">([</span><span class="n">t</span><span class="o">.</span><span class="n">pos</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">fragment</span><span class="o">.</span><span class="n">matches</span><span class="p">])</span>
</pre></div>
</div>
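<p>The scorer example above relies on a <tt class="docutils literal"><span class="pre">stddev</span></tt> function that is not defined on this page. A minimal sketch of such a helper follows, assuming the population standard deviation from Python&#8217;s standard <tt class="docutils literal"><span class="pre">statistics</span></tt> module is acceptable. The helper and its behaviour for an empty list are assumptions made for illustration, not part of Whoosh.</p>

```python
from statistics import pstdev


def stddev(values):
    """Population standard deviation of `values`; 0.0 when there are none.

    The empty-input fallback is an assumption, chosen so that a scorer built
    on this helper does not raise for a fragment without matched terms.
    """
    values = list(values)
    if not values:
        return 0.0
    return pstdev(values)
```

<p>Tightly clustered positions give a small deviation, so negating the result (as the scorer does) ranks those fragments higher.</p>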
<p>To use a different scorer:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="n">results</span><span class="o">.</span><span class="n">scorer</span> <span class="o">=</span> <span class="n">StandardDeviationScorer</span>
</pre></div>
</div>
</div>
<div class="section" id="order">
<h3>Order<a class="headerlink" href="#order" title="Permalink to this headline">¶</a></h3>
<p>The order is a function that takes a fragment and returns a sortable value used
to sort the highest-scoring fragments before presenting them to the user (where
fragments with lower values appear before fragments with higher values).</p>
<p>The <tt class="docutils literal"><span class="pre">highlight</span></tt> module has the following order functions.</p>
<dl class="docutils">
<dt><tt class="docutils literal"><span class="pre">FIRST</span></tt> (the default)</dt>
<dd>Show fragments in the order they appear in the document.</dd>
<dt><tt class="docutils literal"><span class="pre">SCORE</span></tt></dt>
<dd>Show highest scoring fragments first.</dd>
</dl>
<p>The <tt class="docutils literal"><span class="pre">highlight</span></tt> module also includes <tt class="docutils literal"><span class="pre">LONGER</span></tt> (longer fragments first) and
<tt class="docutils literal"><span class="pre">SHORTER</span></tt> (shorter fragments first), but they probably aren&#8217;t as generally
useful.</p>
<p>To use a different order:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="n">results</span><span class="o">.</span><span class="n">order</span> <span class="o">=</span> <span class="n">highlight</span><span class="o">.</span><span class="n">SCORE</span>
</pre></div>
</div>
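<p>Because an order function is simply a sort key, writing your own takes one short function. The sketch below uses stand-in fragment objects that model only a <tt class="docutils literal"><span class="pre">startchar</span></tt> attribute, and a hypothetical <tt class="docutils literal"><span class="pre">LATEST</span></tt> order (not provided by Whoosh) that shows fragments from the end of the document first.</p>

```python
from types import SimpleNamespace


def LATEST(fragment):
    """Hypothetical order function: later fragments sort first."""
    # Lower keys sort first, so negate the start offset.
    return 0 - fragment.startchar


# Stand-in objects; only the attribute used by the sort key is modelled.
fragments = [SimpleNamespace(startchar=s) for s in (40, 5, 300)]

in_document_order = sorted(fragments, key=lambda f: f.startchar)
latest_first = sorted(fragments, key=LATEST)
```

<p>A function like this would be assigned in the same way as the built-in orders, for example <tt class="docutils literal"><span class="pre">results.order</span> <span class="pre">=</span> <span class="pre">LATEST</span></tt>.</p>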
</div>
<div class="section" id="formatter">
<h3>Formatter<a class="headerlink" href="#formatter" title="Permalink to this headline">¶</a></h3>
<p>A formatter controls how the highest-scoring fragments are turned into a
formatted bit of text for display to the user. It can return anything
(e.g. plain text, HTML, a Genshi event stream, a SAX event generator,
or anything else useful to the calling system).</p>
<p>The <tt class="docutils literal"><span class="pre">highlight</span></tt> module contains the following pre-made formatters.</p>
<dl class="docutils">
<dt><a class="reference internal" href="api/highlight.html#whoosh.highlight.HtmlFormatter" title="whoosh.highlight.HtmlFormatter"><tt class="xref py py-class docutils literal"><span class="pre">whoosh.highlight.HtmlFormatter</span></tt></a></dt>
<dd>Outputs a string containing HTML tags (with a class attribute)
around the matched terms.</dd>
<dt><a class="reference internal" href="api/highlight.html#whoosh.highlight.UppercaseFormatter" title="whoosh.highlight.UppercaseFormatter"><tt class="xref py py-class docutils literal"><span class="pre">whoosh.highlight.UppercaseFormatter</span></tt></a></dt>
<dd>Converts the matched terms to UPPERCASE.</dd>
<dt><a class="reference internal" href="api/highlight.html#whoosh.highlight.GenshiFormatter" title="whoosh.highlight.GenshiFormatter"><tt class="xref py py-class docutils literal"><span class="pre">whoosh.highlight.GenshiFormatter</span></tt></a></dt>
<dd>Outputs a Genshi event stream, with the matched terms wrapped in a
configurable element.</dd>
</dl>
<p>The easiest way to create a custom formatter is to subclass
<tt class="docutils literal"><span class="pre">highlight.Formatter</span></tt> and override the <tt class="docutils literal"><span class="pre">format_token</span></tt> method:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="k">class</span> <span class="nc">BracketFormatter</span><span class="p">(</span><span class="n">highlight</span><span class="o">.</span><span class="n">Formatter</span><span class="p">):</span>
    <span class="sd">&quot;&quot;&quot;Puts square brackets around the matched terms.</span>
<span class="sd">    &quot;&quot;&quot;</span>

    <span class="k">def</span> <span class="nf">format_token</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">text</span><span class="p">,</span> <span class="n">token</span><span class="p">,</span> <span class="n">replace</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>
        <span class="c"># Use the get_text function to get the text corresponding to the</span>
        <span class="c"># token</span>
        <span class="n">tokentext</span> <span class="o">=</span> <span class="n">highlight</span><span class="o">.</span><span class="n">get_text</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">token</span><span class="p">,</span> <span class="n">replace</span><span class="p">)</span>

        <span class="c"># Return the text as you want it to appear in the highlighted</span>
        <span class="c"># string</span>
        <span class="k">return</span> <span class="s">&quot;[</span><span class="si">%s</span><span class="s">]&quot;</span> <span class="o">%</span> <span class="n">tokentext</span>
</pre></div>
</div>
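<p>To picture what a formatter does with the character offsets carried by each matched token, here is a stand-alone sketch in plain Python. It is not how <tt class="docutils literal"><span class="pre">highlight.Formatter</span></tt> is implemented; the function name and the stand-in token objects (which model only <tt class="docutils literal"><span class="pre">startchar</span></tt> and <tt class="docutils literal"><span class="pre">endchar</span></tt>) are assumptions for illustration.</p>

```python
from types import SimpleNamespace


def bracket_matches(text, tokens):
    """Wrap each matched span in square brackets; copy other text as is."""
    pieces = []
    index = 0
    for t in sorted(tokens, key=lambda tok: tok.startchar):
        pieces.append(text[index:t.startchar])               # text between matches
        pieces.append("[%s]" % text[t.startchar:t.endchar])  # the match itself
        index = t.endchar
    pieces.append(text[index:])                              # trailing text
    return "".join(pieces)


# Stand-in tokens marking both occurrences of "whoosh" in the sample text.
sample = "whoosh finds a whoosh"
tokens = [SimpleNamespace(startchar=0, endchar=6),
          SimpleNamespace(startchar=15, endchar=21)]
bracketed = bracket_matches(sample, tokens)
```

<p>Slicing the original text by offset, rather than using the analyzed token text, keeps the original capitalization and spelling in the output.</p>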
<p>To use a different formatter:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="n">brf</span> <span class="o">=</span> <span class="n">BracketFormatter</span><span class="p">()</span>
<span class="n">results</span><span class="o">.</span><span class="n">formatter</span> <span class="o">=</span> <span class="n">brf</span>
</pre></div>
</div>
<p>If you need more control over the formatting (or want to output something other
than strings), you will need to override other methods. See the documentation
for the <tt class="xref py py-class docutils literal"><span class="pre">whoosh.highlight.Formatter</span></tt> class.</p>
</div>
</div>
<div class="section" id="highlighter-object">
<h2>Highlighter object<a class="headerlink" href="#highlighter-object" title="Permalink to this headline">¶</a></h2>
<p>Rather than setting attributes on the results object, you can create a
reusable <a class="reference internal" href="api/highlight.html#whoosh.highlight.Highlighter" title="whoosh.highlight.Highlighter"><tt class="xref py py-class docutils literal"><span class="pre">whoosh.highlight.Highlighter</span></tt></a> object. Keyword arguments let
you change the <tt class="docutils literal"><span class="pre">fragmenter</span></tt>, <tt class="docutils literal"><span class="pre">scorer</span></tt>, <tt class="docutils literal"><span class="pre">order</span></tt>, and/or <tt class="docutils literal"><span class="pre">formatter</span></tt>:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="n">hi</span> <span class="o">=</span> <span class="n">highlight</span><span class="o">.</span><span class="n">Highlighter</span><span class="p">(</span><span class="n">fragmenter</span><span class="o">=</span><span class="n">my_cf</span><span class="p">,</span> <span class="n">scorer</span><span class="o">=</span><span class="n">StandardDeviationScorer</span><span class="p">)</span>
</pre></div>
</div>
<p>You can then use the <tt class="xref py py-meth docutils literal"><span class="pre">whoosh.highlight.Highlighter.highlight_hit()</span></tt> method
to get highlights for a <tt class="docutils literal"><span class="pre">Hit</span></tt> object:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">hit</span> <span class="ow">in</span> <span class="n">results</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="n">hit</span><span class="p">[</span><span class="s">&quot;title&quot;</span><span class="p">])</span>
    <span class="k">print</span><span class="p">(</span><span class="n">hi</span><span class="o">.</span><span class="n">highlight_hit</span><span class="p">(</span><span class="n">hit</span><span class="p">,</span> <span class="s">&quot;content&quot;</span><span class="p">))</span>
</pre></div>
</div>
<p>(When you assign to a <tt class="docutils literal"><span class="pre">Results</span></tt> object&#8217;s <tt class="docutils literal"><span class="pre">fragmenter</span></tt>, <tt class="docutils literal"><span class="pre">scorer</span></tt>, <tt class="docutils literal"><span class="pre">order</span></tt>,
or <tt class="docutils literal"><span class="pre">formatter</span></tt> attributes, you&#8217;re actually changing the values on the
results object&#8217;s default <tt class="docutils literal"><span class="pre">Highlighter</span></tt> object.)</p>
</div>
<div class="section" id="speeding-up-highlighting">
<h2>Speeding up highlighting<a class="headerlink" href="#speeding-up-highlighting" title="Permalink to this headline">¶</a></h2>
<p>Recording which terms matched in which documents during the search may make
highlighting faster, since it will skip documents it knows don&#8217;t contain any
matching terms in the given field:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="c"># Record per-document term matches</span>
<span class="n">results</span> <span class="o">=</span> <span class="n">searcher</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">myquery</span><span class="p">,</span> <span class="n">terms</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></div>
</div>
<div class="section" id="pinpointfragmenter">
<h3>PinpointFragmenter<a class="headerlink" href="#pinpointfragmenter" title="Permalink to this headline">¶</a></h3>
<p>Usually the highlighting system uses the field&#8217;s analyzer to re-tokenize the
document&#8217;s text to find the matching terms in context. If you have long
documents and have increased/disabled the character limit, and/or if the field
has a very complex analyzer, re-tokenizing may be slow.</p>
<p>Instead of re-tokenizing, Whoosh can look up the character positions of the
matched terms in the index. Looking up the character positions is not
instantaneous, but is usually faster than analyzing large amounts of text.</p>
<p>To use <a class="reference internal" href="api/highlight.html#whoosh.highlight.PinpointFragmenter" title="whoosh.highlight.PinpointFragmenter"><tt class="xref py py-class docutils literal"><span class="pre">whoosh.highlight.PinpointFragmenter</span></tt></a> and avoid re-tokenizing the
document text, you must do all of the following:</p>
<p>Index the field with character information (this will require re-indexing an
existing index):</p>
<div class="highlight-python"><div class="highlight"><pre><span class="c"># Index the start and end chars of each term</span>
<span class="kn">from</span> <span class="nn">whoosh</span> <span class="kn">import</span> <span class="n">fields</span>

<span class="n">schema</span> <span class="o">=</span> <span class="n">fields</span><span class="o">.</span><span class="n">Schema</span><span class="p">(</span><span class="n">content</span><span class="o">=</span><span class="n">fields</span><span class="o">.</span><span class="n">TEXT</span><span class="p">(</span><span class="n">stored</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">chars</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span>
</pre></div>
</div>
<p>Record per-document term matches in the results:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="c"># Record per-document term matches</span>
<span class="n">results</span> <span class="o">=</span> <span class="n">searcher</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">myquery</span><span class="p">,</span> <span class="n">terms</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></div>
</div>
<p>Set a <a class="reference internal" href="api/highlight.html#whoosh.highlight.PinpointFragmenter" title="whoosh.highlight.PinpointFragmenter"><tt class="xref py py-class docutils literal"><span class="pre">whoosh.highlight.PinpointFragmenter</span></tt></a> as the fragmenter:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="n">results</span><span class="o">.</span><span class="n">fragmenter</span> <span class="o">=</span> <span class="n">highlight</span><span class="o">.</span><span class="n">PinpointFragmenter</span><span class="p">()</span>
</pre></div>
</div>
</div>
<div class="section" id="pinpointfragmenter-limitations">
<h3>PinpointFragmenter limitations<a class="headerlink" href="#pinpointfragmenter-limitations" title="Permalink to this headline">¶</a></h3>
<p>When the highlighting system does not re-tokenize the text, it doesn&#8217;t know
where any other words are in the text except the matched terms it looked up in
the index. Therefore, when the fragmenter adds surrounding context, it just adds
a certain number of characters blindly, and so doesn&#8217;t distinguish between
content and whitespace, or break on word boundaries. For example:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">&gt;&gt;&gt; </span><span class="n">hit</span><span class="o">.</span><span class="n">highlights</span><span class="p">(</span><span class="s">&quot;content&quot;</span><span class="p">)</span>
<span class="go">&#39;re when the &lt;b&gt;fragmenter&lt;/b&gt;\n       ad&#39;</span>
</pre></div>
</div>
<p>(This can be embarrassing when the word fragments form dirty words!)</p>
<p>One way to avoid this is to not show any surrounding context, but then
fragments containing one matched term will contain ONLY that matched term:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">&gt;&gt;&gt; </span><span class="n">hit</span><span class="o">.</span><span class="n">highlights</span><span class="p">(</span><span class="s">&quot;content&quot;</span><span class="p">)</span>
<span class="go">&#39;&lt;b&gt;fragmenter&lt;/b&gt;&#39;</span>
</pre></div>
</div>
<p>Alternatively, you can normalize whitespace in the text before passing it to
the highlighting system:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">&gt;&gt;&gt; </span><span class="n">text</span> <span class="o">=</span> <span class="n">searcher</span><span class="o">.</span><span class="n">stored_</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="s">&quot;[</span><span class="se">\t\r\n</span><span class="s"> ]+&quot;</span><span class="p">,</span> <span class="s">&quot; &quot;</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">hit</span><span class="o">.</span><span class="n">highlights</span><span class="p">(</span><span class="s">&quot;content&quot;</span><span class="p">,</span> <span class="n">text</span><span class="o">=</span><span class="n">text</span><span class="p">)</span>
</pre></div>
</div>
<p>...and use the <tt class="docutils literal"><span class="pre">autotrim</span></tt> option of <tt class="docutils literal"><span class="pre">PinpointFragmenter</span></tt> to automatically
strip text before the first space and after the last space in the fragments:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">&gt;&gt;&gt; </span><span class="n">results</span><span class="o">.</span><span class="n">fragmenter</span> <span class="o">=</span> <span class="n">highlight</span><span class="o">.</span><span class="n">PinpointFragmenter</span><span class="p">(</span><span class="n">autotrim</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">hit</span><span class="o">.</span><span class="n">highlights</span><span class="p">(</span><span class="s">&quot;content&quot;</span><span class="p">)</span>
<span class="go">&#39;when the &lt;b&gt;fragmenter&lt;/b&gt;&#39;</span>
</pre></div>
</div>
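<p>The two steps above can be sketched in plain Python, without Whoosh. This is
only an illustration of the idea, not the library's implementation: the helper
names <tt class="docutils literal"><span class="pre">normalize_whitespace</span></tt> and <tt class="docutils literal"><span class="pre">trim_to_spaces</span></tt> are hypothetical, and the
sample string is the untagged text of the fragment shown earlier.</p>

```python
import re


def normalize_whitespace(text):
    # Collapse every run of tabs, newlines, and spaces into a single space.
    return re.sub(r"[\t\r\n ]+", " ", text)


def trim_to_spaces(fragment):
    # Hypothetical helper: drop whatever precedes the first space and
    # whatever follows the last space, so no partial words remain at the edges.
    first = fragment.find(" ")
    last = fragment.rfind(" ")
    if first == -1 or first == last:
        # Fewer than two spaces: nothing sensible to trim.
        return fragment
    return fragment[first + 1:last]


raw = "re when the fragmenter\n       ad"
clean = normalize_whitespace(raw)
print(trim_to_spaces(clean))
```

<p>Normalizing first matters: without it, the trimmed fragment would still carry
the embedded newline and the run of spaces.</p>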
</div>
</div>
<div class="section" id="using-the-low-level-api">
<h2>Using the low-level API<a class="headerlink" href="#using-the-low-level-api" title="Permalink to this headline">¶</a></h2>
<div class="section" id="usage">
<h3>Usage<a class="headerlink" href="#usage" title="Permalink to this headline">¶</a></h3>
<p>The following function lets you retokenize and highlight a piece of text using
an analyzer:</p>
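<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">whoosh.highlight</span> <span class="kn">import</span> <span class="n">highlight</span><span class="p">,</span> <span class="n">BasicFragmentScorer</span><span class="p">,</span> <span class="n">FIRST</span>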

<span class="n">excerpts</span> <span class="o">=</span> <span class="n">highlight</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">terms</span><span class="p">,</span> <span class="n">analyzer</span><span class="p">,</span> <span class="n">fragmenter</span><span class="p">,</span> <span class="n">formatter</span><span class="p">,</span> <span class="n">top</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span>
                     <span class="n">scorer</span><span class="o">=</span><span class="n">BasicFragmentScorer</span><span class="p">,</span> <span class="n">minscore</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">order</span><span class="o">=</span><span class="n">FIRST</span><span class="p">)</span>
</pre></div>
</div>
<dl class="docutils">
<dt><tt class="docutils literal"><span class="pre">text</span></tt></dt>
<dd>The original text of the document.</dd>
<dt><tt class="docutils literal"><span class="pre">terms</span></tt></dt>
<dd>A sequence or set containing the query words to match, e.g. (&#8220;render&#8221;,
&#8220;shader&#8221;).</dd>
<dt><tt class="docutils literal"><span class="pre">analyzer</span></tt></dt>
<dd>The analyzer to use to break the document text into tokens for matching
against the query terms. This is usually the analyzer for the field the
query terms are in.</dd>
<dt><tt class="docutils literal"><span class="pre">fragmenter</span></tt></dt>
<dd>A <a class="reference internal" href="api/highlight.html#whoosh.highlight.Fragmenter" title="whoosh.highlight.Fragmenter"><tt class="xref py py-class docutils literal"><span class="pre">whoosh.highlight.Fragmenter</span></tt></a> object; see <a class="reference internal" href="#fragmenter">Fragmenter</a> above.</dd>
<dt><tt class="docutils literal"><span class="pre">formatter</span></tt></dt>
<dd>A <tt class="xref py py-class docutils literal"><span class="pre">whoosh.highlight.Formatter</span></tt> object; see <a class="reference internal" href="#formatter">Formatter</a> above.</dd>
<dt><tt class="docutils literal"><span class="pre">top</span></tt></dt>
<dd>The number of fragments to include in the output.</dd>
<dt><tt class="docutils literal"><span class="pre">scorer</span></tt></dt>
<dd>A <a class="reference internal" href="api/highlight.html#whoosh.highlight.FragmentScorer" title="whoosh.highlight.FragmentScorer"><tt class="xref py py-class docutils literal"><span class="pre">whoosh.highlight.FragmentScorer</span></tt></a> object. The only scorer currently
included with Whoosh is <a class="reference internal" href="api/highlight.html#whoosh.highlight.BasicFragmentScorer" title="whoosh.highlight.BasicFragmentScorer"><tt class="xref py py-class docutils literal"><span class="pre">BasicFragmentScorer</span></tt></a>, the
default.</dd>
<dt><tt class="docutils literal"><span class="pre">minscore</span></tt></dt>
<dd>The minimum score a fragment must have to be considered for inclusion.</dd>
<dt><tt class="docutils literal"><span class="pre">order</span></tt></dt>
<dd>An ordering function that determines the order of the &#8220;top&#8221; fragments in the
output text.</dd>
</dl>
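<p>To make the roles of these parameters concrete, here is a toy pipeline in
plain Python. It is a sketch under simplifying assumptions, not how Whoosh
implements <tt class="docutils literal"><span class="pre">highlight()</span></tt>: the function name <tt class="docutils literal"><span class="pre">toy_highlight</span></tt> is hypothetical,
sentences stand in for fragments, a regular expression stands in for the
analyzer, the score is simply the number of distinct query terms in a fragment,
and matches are uppercased instead of being wrapped in markup.</p>

```python
import re

WORD = re.compile(r"\w+")


def toy_highlight(text, terms, top=3, minscore=1):
    """Toy stand-in for the analyze, fragment, score, order, format steps."""
    termset = frozenset(term.lower() for term in terms)

    # "Fragmenter": treat each sentence as one candidate fragment.
    sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]?", text)]

    scored = []
    for position, sentence in enumerate(sentences):
        # "Analyzer": lowercase word tokens, compared with the query terms.
        tokens = {m.group().lower() for m in WORD.finditer(sentence)}
        # "Scorer": count the distinct query terms found in the fragment.
        score = len(tokens.intersection(termset))
        if score >= minscore:
            scored.append((score, position, sentence))

    # Keep the `top` best fragments, then put them back in document order.
    best = sorted(scored, key=lambda item: (-item[0], item[1]))[:top]
    best.sort(key=lambda item: item[1])

    # "Formatter": uppercase every token that matches a query term.
    def mark(match):
        token = match.group()
        return token.upper() if token.lower() in termset else token

    return [WORD.sub(mark, sentence) for _, _, sentence in best]


sample = ("The render loop runs once per frame. Nothing to see here. "
          "Each shader is compiled before the render pass.")
for fragment in toy_highlight(sample, ("render", "shader"), top=2):
    print(fragment)
```

<p>Raising <tt class="docutils literal"><span class="pre">minscore</span></tt> to 2 in this sketch keeps only the sentence that contains
both terms, and lowering <tt class="docutils literal"><span class="pre">top</span></tt> to 1 keeps only the best-scoring sentence.</p>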
</div>
</div>
</div>


          </div>
        </div>
      </div>
      <div class="sphinxsidebar">
        <div class="sphinxsidebarwrapper">
  <h3><a href="index.html">Table Of Contents</a></h3>
  <ul>
<li><a class="reference internal" href="#">How to create highlighted search result excerpts</a><ul>
<li><a class="reference internal" href="#overview">Overview</a></li>
<li><a class="reference internal" href="#requirements">Requirements</a></li>
<li><a class="reference internal" href="#how-to">How to</a></li>
<li><a class="reference internal" href="#the-character-limit">The character limit</a></li>
<li><a class="reference internal" href="#customizing-the-highlights">Customizing the highlights</a><ul>
<li><a class="reference internal" href="#number-of-fragments">Number of fragments</a></li>
<li><a class="reference internal" href="#fragment-size">Fragment size</a></li>
<li><a class="reference internal" href="#fragmenter">Fragmenter</a></li>
<li><a class="reference internal" href="#scorer">Scorer</a></li>
<li><a class="reference internal" href="#order">Order</a></li>
<li><a class="reference internal" href="#formatter">Formatter</a></li>
</ul>
</li>
<li><a class="reference internal" href="#highlighter-object">Highlighter object</a></li>
<li><a class="reference internal" href="#speeding-up-highlighting">Speeding up highlighting</a><ul>
<li><a class="reference internal" href="#pinpointfragmenter">PinpointFragmenter</a></li>
<li><a class="reference internal" href="#pinpointfragmenter-limitations">PinpointFragmenter limitations</a></li>
</ul>
</li>
<li><a class="reference internal" href="#using-the-low-level-api">Using the low-level API</a><ul>
<li><a class="reference internal" href="#usage">Usage</a></li>
</ul>
</li>
</ul>
</li>
</ul>

        </div>
      </div>
      <div class="clearer"></div>
    </div>
    <div class="footer">
        &copy; Copyright 2007-2012 Matt Chaput.
      Created using <a href="http://sphinx.pocoo.org/">Sphinx</a> 1.1.3.
    </div>
  </body>
</html>