

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">


<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    
    <title>How to search &mdash; Whoosh 2.5.7 documentation</title>
    
    <link rel="stylesheet" href="_static/default.css" type="text/css" />
    <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
    
    <script type="text/javascript">
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    '',
        VERSION:     '2.5.7',
        COLLAPSE_INDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script>
    <script type="text/javascript" src="_static/jquery.js"></script>
    <script type="text/javascript" src="_static/underscore.js"></script>
    <script type="text/javascript" src="_static/doctools.js"></script>
    <link rel="top" title="Whoosh 2.5.7 documentation" href="index.html" />
    <link rel="next" title="Parsing user queries" href="parsing.html" />
    <link rel="prev" title="How to index documents" href="indexing.html" /> 
  </head>
  <body>
    <div class="related">
      <h3>Navigation</h3>
      <ul>
        <li class="right" style="margin-right: 10px">
          <a href="genindex.html" title="General Index"
             accesskey="I">index</a></li>
        <li class="right" >
          <a href="py-modindex.html" title="Python Module Index"
             >modules</a> |</li>
        <li class="right" >
          <a href="parsing.html" title="Parsing user queries"
             accesskey="N">next</a> |</li>
        <li class="right" >
          <a href="indexing.html" title="How to index documents"
             accesskey="P">previous</a> |</li>
        <li><a href="index.html">Whoosh 2.5.7 documentation</a> &raquo;</li> 
      </ul>
    </div>  

    <div class="document">
      <div class="documentwrapper">
        <div class="bodywrapper">
          <div class="body">
            
  <div class="section" id="how-to-search">
<h1>How to search<a class="headerlink" href="#how-to-search" title="Permalink to this headline">¶</a></h1>
<p>Once you&#8217;ve created an index and added documents to it, you can search for those
documents.</p>
<div class="section" id="the-searcher-object">
<h2>The <tt class="docutils literal"><span class="pre">Searcher</span></tt> object<a class="headerlink" href="#the-searcher-object" title="Permalink to this headline">¶</a></h2>
<p>To get a <a class="reference internal" href="api/searching.html#whoosh.searching.Searcher" title="whoosh.searching.Searcher"><tt class="xref py py-class docutils literal"><span class="pre">whoosh.searching.Searcher</span></tt></a> object, call <tt class="docutils literal"><span class="pre">searcher()</span></tt> on your
<tt class="docutils literal"><span class="pre">Index</span></tt> object:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="n">searcher</span> <span class="o">=</span> <span class="n">myindex</span><span class="o">.</span><span class="n">searcher</span><span class="p">()</span>
</pre></div>
</div>
<p>You&#8217;ll usually want to open the searcher using a <tt class="docutils literal"><span class="pre">with</span></tt> statement so the
searcher is automatically closed when you&#8217;re done with it (searcher objects
represent a number of open files, so if you don&#8217;t explicitly close them and the
system is slow to collect them, you can run out of file handles):</p>
<div class="highlight-python"><div class="highlight"><pre><span class="k">with</span> <span class="n">ix</span><span class="o">.</span><span class="n">searcher</span><span class="p">()</span> <span class="k">as</span> <span class="n">searcher</span><span class="p">:</span>
    <span class="o">...</span>
</pre></div>
</div>
<p>This is roughly equivalent to:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="n">searcher</span> <span class="o">=</span> <span class="n">ix</span><span class="o">.</span><span class="n">searcher</span><span class="p">()</span>
<span class="k">try</span><span class="p">:</span>
    <span class="o">...</span>
<span class="k">finally</span><span class="p">:</span>
    <span class="n">searcher</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
</pre></div>
</div>
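<p>The following plain-Python sketch illustrates why the <tt class="docutils literal"><span class="pre">with</span></tt> form is safer. <tt class="docutils literal"><span class="pre">FakeSearcher</span></tt> and <tt class="docutils literal"><span class="pre">search_and_fail</span></tt> are hypothetical stand-ins, not part of Whoosh; they only track whether <tt class="docutils literal"><span class="pre">close()</span></tt> was called when an error interrupts the block.</p>

```python
class FakeSearcher:
    """Hypothetical stand-in for a searcher; it only tracks open/closed state."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Always release resources, then let any exception propagate.
        self.close()
        return False


def search_and_fail(searcher):
    # Simulates an error raised in the middle of a search.
    raise ValueError("simulated failure")


fake = FakeSearcher()
try:
    with fake as s:
        search_and_fail(s)
except ValueError:
    pass

print(fake.closed)  # True: close() ran even though the block raised
```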
<p>The <tt class="docutils literal"><span class="pre">Searcher</span></tt> object is the main high-level interface for reading the index. It
has lots of useful methods for getting information about the index, such as
<tt class="docutils literal"><span class="pre">lexicon(fieldname)</span></tt>.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">&gt;&gt;&gt; </span><span class="nb">list</span><span class="p">(</span><span class="n">searcher</span><span class="o">.</span><span class="n">lexicon</span><span class="p">(</span><span class="s">&quot;content&quot;</span><span class="p">))</span>
<span class="go">[u&quot;document&quot;, u&quot;index&quot;, u&quot;whoosh&quot;]</span>
</pre></div>
</div>
<p>However, the most important method on the <tt class="docutils literal"><span class="pre">Searcher</span></tt> object is
<a class="reference internal" href="api/searching.html#whoosh.searching.Searcher.search" title="whoosh.searching.Searcher.search"><tt class="xref py py-meth docutils literal"><span class="pre">search()</span></tt></a>, which takes a
<a class="reference internal" href="api/query.html#whoosh.query.Query" title="whoosh.query.Query"><tt class="xref py py-class docutils literal"><span class="pre">whoosh.query.Query</span></tt></a> object and returns a
<a class="reference internal" href="api/searching.html#whoosh.searching.Results" title="whoosh.searching.Results"><tt class="xref py py-class docutils literal"><span class="pre">Results</span></tt></a> object:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">whoosh.qparser</span> <span class="kn">import</span> <span class="n">QueryParser</span>

<span class="n">qp</span> <span class="o">=</span> <span class="n">QueryParser</span><span class="p">(</span><span class="s">&quot;content&quot;</span><span class="p">,</span> <span class="n">schema</span><span class="o">=</span><span class="n">myindex</span><span class="o">.</span><span class="n">schema</span><span class="p">)</span>
<span class="n">q</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="s">u&quot;hello world&quot;</span><span class="p">)</span>

<span class="k">with</span> <span class="n">myindex</span><span class="o">.</span><span class="n">searcher</span><span class="p">()</span> <span class="k">as</span> <span class="n">s</span><span class="p">:</span>
    <span class="n">results</span> <span class="o">=</span> <span class="n">s</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">q</span><span class="p">)</span>
</pre></div>
</div>
<p>By default, the results contain at most the first 10 matching documents. To get
more results, use the <tt class="docutils literal"><span class="pre">limit</span></tt> keyword:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="n">results</span> <span class="o">=</span> <span class="n">s</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">q</span><span class="p">,</span> <span class="n">limit</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
</pre></div>
</div>
<p>If you want all results, use <tt class="docutils literal"><span class="pre">limit=None</span></tt>. However, setting the limit whenever
possible makes searches faster because Whoosh doesn&#8217;t need to examine and score
every document.</p>
<p>Since displaying a page of results at a time is a common pattern, the
<tt class="docutils literal"><span class="pre">search_page</span></tt> method lets you conveniently retrieve only the results on a
given page:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="n">results</span> <span class="o">=</span> <span class="n">s</span><span class="o">.</span><span class="n">search_page</span><span class="p">(</span><span class="n">q</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
</pre></div>
</div>
<p>The default page length is 10 hits. You can use the <tt class="docutils literal"><span class="pre">pagelen</span></tt> keyword argument
to set a different page length:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="n">results</span> <span class="o">=</span> <span class="n">s</span><span class="o">.</span><span class="n">search_page</span><span class="p">(</span><span class="n">q</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="n">pagelen</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
</pre></div>
</div>
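<p>To make the paging arithmetic concrete, here is a small sketch of which hits a given page covers. It assumes page numbers start at 1, as the <tt class="docutils literal"><span class="pre">search_page(q,</span> <span class="pre">1)</span></tt> call above suggests; <tt class="docutils literal"><span class="pre">page_slice</span></tt> is a hypothetical helper, not a Whoosh function.</p>

```python
def page_slice(pagenum, pagelen=10, total=None):
    """Return the 0-based, half-open (start, stop) range of hits on a page.

    Hypothetical helper: assumes 1-based page numbers.
    """
    if pagenum < 1:
        raise ValueError("page numbers start at 1")
    start = (pagenum - 1) * pagelen
    stop = start + pagelen
    if total is not None:
        # Clamp to the number of hits actually available.
        start = min(start, total)
        stop = min(stop, total)
    return start, stop


print(page_slice(1))                # (0, 10)
print(page_slice(5, pagelen=20))    # (80, 100)
print(page_slice(3, total=27))      # (20, 27)
```

<p>Under that assumption, page 5 with a page length of 20 corresponds to the 81st through 100th hits.</p>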
</div>
<div class="section" id="results-object">
<h2>Results object<a class="headerlink" href="#results-object" title="Permalink to this headline">¶</a></h2>
<p>The <a class="reference internal" href="api/searching.html#whoosh.searching.Results" title="whoosh.searching.Results"><tt class="xref py py-class docutils literal"><span class="pre">Results</span></tt></a> object acts like a list of the matched
documents. You can use it to access the stored fields of each hit document, to
display to the user.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">&gt;&gt;&gt; </span><span class="c"># Show the best hit&#39;s stored fields</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">results</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="go">{&quot;title&quot;: u&quot;Hello World in Python&quot;, &quot;path&quot;: u&quot;/a/b/c&quot;}</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">results</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="mi">2</span><span class="p">]</span>
<span class="go">[{&quot;title&quot;: u&quot;Hello World in Python&quot;, &quot;path&quot;: u&quot;/a/b/c&quot;},</span>
<span class="go">{&quot;title&quot;: u&quot;Foo&quot;, &quot;path&quot;: u&quot;/bar&quot;}]</span>
</pre></div>
</div>
<p>By default, <tt class="docutils literal"><span class="pre">Searcher.search(myquery)</span></tt> limits the number of hits to 10, so the
number of scored hits in the <tt class="docutils literal"><span class="pre">Results</span></tt> object may be less than the number of
matching documents in the index.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">&gt;&gt;&gt; </span><span class="c"># How many documents in the entire index would have matched?</span>
<span class="gp">&gt;&gt;&gt; </span><span class="nb">len</span><span class="p">(</span><span class="n">results</span><span class="p">)</span>
<span class="go">27</span>
<span class="gp">&gt;&gt;&gt; </span><span class="c"># How many scored and sorted documents in this Results object?</span>
<span class="gp">&gt;&gt;&gt; </span><span class="c"># This will often be less than len() if the number of hits was limited</span>
<span class="gp">&gt;&gt;&gt; </span><span class="c"># (the default).</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">results</span><span class="o">.</span><span class="n">scored_length</span><span class="p">()</span>
<span class="go">10</span>
</pre></div>
</div>
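<p>The difference between the two numbers can be modelled in plain Python. The sketch below is not Whoosh's implementation (it counts matches in the same pass, while Whoosh computes the total separately); <tt class="docutils literal"><span class="pre">top_hits</span></tt> and the scores are made up for illustration. It keeps only the best <tt class="docutils literal"><span class="pre">limit</span></tt> hits in a heap while counting every match.</p>

```python
import heapq


def top_hits(scored_docs, limit=10):
    """Keep the `limit` best (score, docnum) pairs and count all matches.

    Hypothetical model of a limited search, not Whoosh's collector.
    """
    total = 0
    heap = []  # min-heap holding the best hits seen so far
    for score, docnum in scored_docs:
        total += 1
        if limit is None or len(heap) < limit:
            heapq.heappush(heap, (score, docnum))
        elif score > heap[0][0]:
            # The new hit beats the worst hit currently kept.
            heapq.heapreplace(heap, (score, docnum))
    return sorted(heap, reverse=True), total


# 27 matching documents with invented scores.
docs = [(float(i % 7), i) for i in range(27)]
top, total = top_hits(docs, limit=10)
print(len(top), total)  # 10 27
```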
<p>Calling <tt class="docutils literal"><span class="pre">len(Results)</span></tt> runs a fast (unscored) version of the query again to
figure out the total number of matching documents. This is usually very fast
but for large indexes it can cause a noticeable delay. If you want to avoid
this delay on very large indexes, you can use the
<a class="reference internal" href="api/searching.html#whoosh.searching.Results.has_exact_length" title="whoosh.searching.Results.has_exact_length"><tt class="xref py py-meth docutils literal"><span class="pre">has_exact_length()</span></tt></a>,
<a class="reference internal" href="api/searching.html#whoosh.searching.Results.estimated_length" title="whoosh.searching.Results.estimated_length"><tt class="xref py py-meth docutils literal"><span class="pre">estimated_length()</span></tt></a>, and
<a class="reference internal" href="api/searching.html#whoosh.searching.Results.estimated_min_length" title="whoosh.searching.Results.estimated_min_length"><tt class="xref py py-meth docutils literal"><span class="pre">estimated_min_length()</span></tt></a> methods to estimate the
number of matching documents without calling <tt class="docutils literal"><span class="pre">len()</span></tt>:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="n">found</span> <span class="o">=</span> <span class="n">results</span><span class="o">.</span><span class="n">scored_length</span><span class="p">()</span>
<span class="k">if</span> <span class="n">results</span><span class="o">.</span><span class="n">has_exact_length</span><span class="p">():</span>
    <span class="k">print</span><span class="p">(</span><span class="s">&quot;Scored&quot;</span><span class="p">,</span> <span class="n">found</span><span class="p">,</span> <span class="s">&quot;of exactly&quot;</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">results</span><span class="p">),</span> <span class="s">&quot;documents&quot;</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">low</span> <span class="o">=</span> <span class="n">results</span><span class="o">.</span><span class="n">estimated_min_length</span><span class="p">()</span>
    <span class="n">high</span> <span class="o">=</span> <span class="n">results</span><span class="o">.</span><span class="n">estimated_length</span><span class="p">()</span>

    <span class="k">print</span><span class="p">(</span><span class="s">&quot;Scored&quot;</span><span class="p">,</span> <span class="n">found</span><span class="p">,</span> <span class="s">&quot;of between&quot;</span><span class="p">,</span> <span class="n">low</span><span class="p">,</span> <span class="s">&quot;and&quot;</span><span class="p">,</span> <span class="n">high</span><span class="p">,</span> <span class="s">&quot;documents&quot;</span><span class="p">)</span>
</pre></div>
</div>
</div>
<div class="section" id="scoring-and-sorting">
<h2>Scoring and sorting<a class="headerlink" href="#scoring-and-sorting" title="Permalink to this headline">¶</a></h2>
<div class="section" id="scoring">
<h3>Scoring<a class="headerlink" href="#scoring" title="Permalink to this headline">¶</a></h3>
<p>Normally the list of result documents is sorted by <em>score</em>. The
<a class="reference internal" href="api/scoring.html#module-whoosh.scoring" title="whoosh.scoring"><tt class="xref py py-mod docutils literal"><span class="pre">whoosh.scoring</span></tt></a> module contains implementations of various scoring
algorithms. The default is <a class="reference internal" href="api/scoring.html#whoosh.scoring.BM25F" title="whoosh.scoring.BM25F"><tt class="xref py py-class docutils literal"><span class="pre">BM25F</span></tt></a>.</p>
<p>You can set the scoring object to use when you create the searcher using the
<tt class="docutils literal"><span class="pre">weighting</span></tt> keyword argument:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">whoosh</span> <span class="kn">import</span> <span class="n">scoring</span>

<span class="k">with</span> <span class="n">myindex</span><span class="o">.</span><span class="n">searcher</span><span class="p">(</span><span class="n">weighting</span><span class="o">=</span><span class="n">scoring</span><span class="o">.</span><span class="n">TF_IDF</span><span class="p">())</span> <span class="k">as</span> <span class="n">s</span><span class="p">:</span>
    <span class="o">...</span>
</pre></div>
</div>
<p>A weighting model is a <a class="reference internal" href="api/scoring.html#whoosh.scoring.WeightingModel" title="whoosh.scoring.WeightingModel"><tt class="xref py py-class docutils literal"><span class="pre">WeightingModel</span></tt></a> subclass with a
<tt class="docutils literal"><span class="pre">scorer()</span></tt> method that produces a &#8220;scorer&#8221; instance. This instance has a
method that takes the current matcher and returns a floating point score.</p>
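<p>The shape of that arrangement can be sketched with toy classes. Everything here is hypothetical and deliberately simplified: the classes do not subclass anything in <tt class="docutils literal"><span class="pre">whoosh.scoring</span></tt>, the real <tt class="docutils literal"><span class="pre">scorer()</span></tt> signature takes more arguments, and the TF-IDF formula is just one common variant.</p>

```python
import math


class ToyMatcher:
    """Hypothetical matcher exposing only the weight of the current posting."""

    def __init__(self, weight):
        self._weight = weight

    def weight(self):
        return self._weight


class ToyScorer:
    """Scores one term: term weight in the document times the term's IDF."""

    def __init__(self, idf):
        self.idf = idf

    def score(self, matcher):
        return matcher.weight() * self.idf


class ToyTfIdf:
    """Toy weighting model: scorer() builds a scorer for one (field, term)."""

    def __init__(self, doc_count, doc_freqs):
        self.doc_count = doc_count
        self.doc_freqs = doc_freqs  # maps (fieldname, text) to document frequency

    def scorer(self, fieldname, text):
        df = self.doc_freqs.get((fieldname, text), 0)
        idf = math.log(self.doc_count / (df + 1)) + 1.0
        return ToyScorer(idf)


model = ToyTfIdf(doc_count=100, doc_freqs={("content", "whoosh"): 9})
scorer = model.scorer("content", "whoosh")
score = scorer.score(ToyMatcher(weight=2.0))
print(round(score, 3))  # 6.605
```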
</div>
<div class="section" id="sorting">
<h3>Sorting<a class="headerlink" href="#sorting" title="Permalink to this headline">¶</a></h3>
<p>See <a class="reference internal" href="facets.html"><em>Sorting and faceting</em></a>.</p>
</div>
</div>
<div class="section" id="highlighting-snippets-and-more-like-this">
<h2>Highlighting snippets and More Like This<a class="headerlink" href="#highlighting-snippets-and-more-like-this" title="Permalink to this headline">¶</a></h2>
<p>See <a class="reference internal" href="highlight.html"><em>How to create highlighted search result excerpts</em></a> and <a class="reference internal" href="keywords.html"><em>Query expansion and Key word extraction</em></a> for information on these topics.</p>
</div>
<div class="section" id="filtering-results">
<h2>Filtering results<a class="headerlink" href="#filtering-results" title="Permalink to this headline">¶</a></h2>
<p>You can use the <tt class="docutils literal"><span class="pre">filter</span></tt> keyword argument to <tt class="docutils literal"><span class="pre">search()</span></tt> to specify a set of
documents to permit in the results. The argument can be a
<a class="reference internal" href="api/query.html#whoosh.query.Query" title="whoosh.query.Query"><tt class="xref py py-class docutils literal"><span class="pre">whoosh.query.Query</span></tt></a> object, a <a class="reference internal" href="api/searching.html#whoosh.searching.Results" title="whoosh.searching.Results"><tt class="xref py py-class docutils literal"><span class="pre">whoosh.searching.Results</span></tt></a> object,
or a set-like object containing document numbers. The searcher caches the
results of running a filter query, so if you use the same query filter with a
searcher multiple times, the additional searches will be faster.</p>
<p>You can also specify a <tt class="docutils literal"><span class="pre">mask</span></tt> keyword argument to specify a set of documents
that are not permitted in the results.</p>
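<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">whoosh</span> <span class="kn">import</span> <span class="n">qparser</span><span class="p">,</span> <span class="n">query</span>

<span class="k">with</span> <span class="n">myindex</span><span class="o">.</span><span class="n">searcher</span><span class="p">()</span> <span class="k">as</span> <span class="n">s</span><span class="p">:</span>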
    <span class="n">qp</span> <span class="o">=</span> <span class="n">qparser</span><span class="o">.</span><span class="n">QueryParser</span><span class="p">(</span><span class="s">&quot;content&quot;</span><span class="p">,</span> <span class="n">myindex</span><span class="o">.</span><span class="n">schema</span><span class="p">)</span>
    <span class="n">user_q</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="n">query_string</span><span class="p">)</span>

    <span class="c"># Only show documents in the &quot;rendering&quot; chapter</span>
    <span class="n">allow_q</span> <span class="o">=</span> <span class="n">query</span><span class="o">.</span><span class="n">Term</span><span class="p">(</span><span class="s">&quot;chapter&quot;</span><span class="p">,</span> <span class="s">&quot;rendering&quot;</span><span class="p">)</span>
    <span class="c"># Don&#39;t show any documents where the &quot;tag&quot; field contains &quot;todo&quot;</span>
    <span class="n">restrict_q</span> <span class="o">=</span> <span class="n">query</span><span class="o">.</span><span class="n">Term</span><span class="p">(</span><span class="s">&quot;tag&quot;</span><span class="p">,</span> <span class="s">&quot;todo&quot;</span><span class="p">)</span>

    <span class="n">results</span> <span class="o">=</span> <span class="n">s</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">user_q</span><span class="p">,</span> <span class="nb">filter</span><span class="o">=</span><span class="n">allow_q</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="n">restrict_q</span><span class="p">)</span>
</pre></div>
</div>
<p>(If you specify both a <tt class="docutils literal"><span class="pre">filter</span></tt> and a <tt class="docutils literal"><span class="pre">mask</span></tt>, and a matching document
appears in both, the <tt class="docutils literal"><span class="pre">mask</span></tt> &#8220;wins&#8221; and the document is not permitted.)</p>
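<p>In terms of sets of document numbers, the combined effect can be modelled as an intersection followed by a difference. This is a plain-Python sketch of the semantics described above, with a hypothetical <tt class="docutils literal"><span class="pre">permitted</span></tt> helper and made-up document numbers, not the searcher's actual code path.</p>

```python
def permitted(matching, allow=None, deny=None):
    """Model filter/mask semantics on sets of document numbers.

    Hypothetical helper: `allow` plays the role of filter, `deny` of mask.
    """
    docs = set(matching)
    if allow is not None:
        docs &= set(allow)   # keep only documents the filter permits
    if deny is not None:
        docs -= set(deny)    # the mask is applied last, so it "wins"
    return docs


matching = {1, 2, 3, 4, 5}
allow = {2, 3, 4, 9}
deny = {4, 7}
print(sorted(permitted(matching, allow, deny)))  # [2, 3]
```

<p>Document 4 appears in both <tt class="docutils literal"><span class="pre">allow</span></tt> and <tt class="docutils literal"><span class="pre">deny</span></tt>, so it is excluded.</p>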
<p>To find out how many documents were filtered out of the results, use
<tt class="docutils literal"><span class="pre">results.filtered_count</span></tt> (or <tt class="docutils literal"><span class="pre">resultspage.results.filtered_count</span></tt>):</p>
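<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span><span class="p">,</span> <span class="n">timedelta</span>

<span class="kn">from</span> <span class="nn">whoosh</span> <span class="kn">import</span> <span class="n">qparser</span><span class="p">,</span> <span class="n">query</span>

<span class="k">with</span> <span class="n">myindex</span><span class="o">.</span><span class="n">searcher</span><span class="p">()</span> <span class="k">as</span> <span class="n">s</span><span class="p">:</span>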
    <span class="n">qp</span> <span class="o">=</span> <span class="n">qparser</span><span class="o">.</span><span class="n">QueryParser</span><span class="p">(</span><span class="s">&quot;content&quot;</span><span class="p">,</span> <span class="n">myindex</span><span class="o">.</span><span class="n">schema</span><span class="p">)</span>
    <span class="n">user_q</span> <span class="o">=</span> <span class="n">qp</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="n">query_string</span><span class="p">)</span>

    <span class="c"># Filter documents older than 7 days</span>
    <span class="n">old_q</span> <span class="o">=</span> <span class="n">query</span><span class="o">.</span><span class="n">DateRange</span><span class="p">(</span><span class="s">&quot;created&quot;</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span> <span class="n">datetime</span><span class="o">.</span><span class="n">now</span><span class="p">()</span> <span class="o">-</span> <span class="n">timedelta</span><span class="p">(</span><span class="n">days</span><span class="o">=</span><span class="mi">7</span><span class="p">))</span>
    <span class="n">results</span> <span class="o">=</span> <span class="n">s</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">user_q</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="n">old_q</span><span class="p">)</span>

    <span class="k">print</span><span class="p">(</span><span class="s">&quot;Filtered out </span><span class="si">%d</span><span class="s"> older documents&quot;</span> <span class="o">%</span> <span class="n">results</span><span class="o">.</span><span class="n">filtered_count</span><span class="p">)</span>
</pre></div>
</div>
</div>
<div class="section" id="which-terms-from-my-query-matched">
<h2>Which terms from my query matched?<a class="headerlink" href="#which-terms-from-my-query-matched" title="Permalink to this headline">¶</a></h2>
<p>You can use the <tt class="docutils literal"><span class="pre">terms=True</span></tt> keyword argument to <tt class="docutils literal"><span class="pre">search()</span></tt> to have the
search record which terms in the query matched which documents:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="k">with</span> <span class="n">myindex</span><span class="o">.</span><span class="n">searcher</span><span class="p">()</span> <span class="k">as</span> <span class="n">s</span><span class="p">:</span>
    <span class="n">results</span> <span class="o">=</span> <span class="n">s</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">myquery</span><span class="p">,</span> <span class="n">terms</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></div>
</div>
<p>You can then get information about which terms matched from the
<a class="reference internal" href="api/searching.html#whoosh.searching.Results" title="whoosh.searching.Results"><tt class="xref py py-class docutils literal"><span class="pre">whoosh.searching.Results</span></tt></a> and <a class="reference internal" href="api/searching.html#whoosh.searching.Hit" title="whoosh.searching.Hit"><tt class="xref py py-class docutils literal"><span class="pre">whoosh.searching.Hit</span></tt></a> objects:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="c"># Was this results object created with terms=True?</span>
<span class="k">if</span> <span class="n">results</span><span class="o">.</span><span class="n">has_matched_terms</span><span class="p">():</span>
    <span class="c"># What terms matched in the results?</span>
    <span class="k">print</span><span class="p">(</span><span class="n">results</span><span class="o">.</span><span class="n">matched_terms</span><span class="p">())</span>

    <span class="c"># What terms matched in each hit?</span>
    <span class="k">for</span> <span class="n">hit</span> <span class="ow">in</span> <span class="n">results</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="n">hit</span><span class="o">.</span><span class="n">matched_terms</span><span class="p">())</span>
</pre></div>
</div>
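<p>Once you have the matched terms, grouping them by field is often convenient for display. The sketch below assumes the terms arrive as <tt class="docutils literal"><span class="pre">(fieldname,</span> <span class="pre">text)</span></tt> pairs; the sample data and the <tt class="docutils literal"><span class="pre">terms_by_field</span></tt> helper are hypothetical, so check the shape your Whoosh version actually returns before relying on it.</p>

```python
from collections import defaultdict


def terms_by_field(matched):
    """Group (fieldname, text) pairs into {fieldname: sorted list of texts}.

    Hypothetical helper; the pair shape is an assumption.
    """
    grouped = defaultdict(set)
    for fieldname, text in matched:
        grouped[fieldname].add(text)
    return {name: sorted(texts) for name, texts in grouped.items()}


# Made-up sample standing in for the matched terms of one search.
sample = {("content", "hello"), ("content", "world"), ("title", "hello")}
grouped = terms_by_field(sample)
print(grouped["content"])  # ['hello', 'world']
```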
</div>
<div class="section" id="collapsing-results">
<span id="collapsing"></span><h2>Collapsing results<a class="headerlink" href="#collapsing-results" title="Permalink to this headline">¶</a></h2>
<p>Whoosh lets you eliminate all but the top N documents with the same facet key
from the results. This can be useful in a few situations:</p>
<ul class="simple">
<li>Eliminating duplicates at search time.</li>
<li>Restricting the number of matches per source. For example, in a web search
application, you might want to show at most three matches from any website.</li>
</ul>
<p>Whether a document should be collapsed is determined by the value of a &#8220;collapse
facet&#8221;. If a document has an empty collapse key, it will never be collapsed,
but otherwise only the top N documents with the same collapse key will appear
in the results.</p>
<p>See <a class="reference internal" href="facets.html"><em>Sorting and faceting</em></a> for information on facets.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="k">with</span> <span class="n">myindex</span><span class="o">.</span><span class="n">searcher</span><span class="p">()</span> <span class="k">as</span> <span class="n">s</span><span class="p">:</span>
    <span class="c"># Set the facet to collapse on and the maximum number of documents per</span>
    <span class="c"># facet value (default is 1)</span>
    <span class="n">results</span> <span class="o">=</span> <span class="n">s</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">myquery</span><span class="p">,</span> <span class="n">collapse</span><span class="o">=</span><span class="s">&quot;hostname&quot;</span><span class="p">,</span> <span class="n">collapse_limit</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>

    <span class="c"># Dictionary mapping collapse keys to the number of documents that</span>
    <span class="c"># were filtered out by collapsing on that key</span>
    <span class="k">print</span><span class="p">(</span><span class="n">results</span><span class="o">.</span><span class="n">collapsed_counts</span><span class="p">)</span>
</pre></div>
</div>
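<p>The collapsing rule itself (keep the top N per key, never collapse an empty key, count what was dropped) can be modelled in a few lines of plain Python. This sketch post-processes a best-first list, which is not how Whoosh does it internally; the <tt class="docutils literal"><span class="pre">collapse</span></tt> function and the sample hits are hypothetical.</p>

```python
from collections import Counter


def collapse(hits, key, limit=1):
    """Keep at most `limit` hits per collapse key from a best-first list.

    Returns (kept_hits, collapsed_counts). Hits with an empty key are
    never collapsed. Hypothetical model, not Whoosh's collector.
    """
    seen = Counter()
    collapsed = Counter()
    kept = []
    for hit in hits:
        k = key(hit)
        if k is None or k == "":
            kept.append(hit)
        elif seen[k] < limit:
            seen[k] += 1
            kept.append(hit)
        else:
            collapsed[k] += 1
    return kept, dict(collapsed)


hits = [
    {"url": "/a", "hostname": "example.org"},
    {"url": "/b", "hostname": "example.org"},
    {"url": "/c", "hostname": ""},
    {"url": "/d", "hostname": "example.org"},
    {"url": "/e", "hostname": "example.org"},
    {"url": "/f", "hostname": "example.net"},
]
kept, counts = collapse(hits, key=lambda h: h["hostname"], limit=3)
print([h["url"] for h in kept])  # ['/a', '/b', '/c', '/d', '/f']
print(counts)                    # {'example.org': 1}
```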
<p>Collapsing works with both scored and sorted results. You can use any of the
facet types available in the <a class="reference internal" href="api/sorting.html#module-whoosh.sorting" title="whoosh.sorting"><tt class="xref py py-mod docutils literal"><span class="pre">whoosh.sorting</span></tt></a> module.</p>
<p>By default, Whoosh uses the results order (score or sort key) to determine the
documents to collapse. For example, in scored results, the best scoring
documents would be kept. You can optionally specify a <tt class="docutils literal"><span class="pre">collapse_order</span></tt> facet
to control which documents to keep when collapsing.</p>
<p>For example, in a product search you could display results sorted by decreasing
price, and eliminate all but the highest rated item of each product type:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">whoosh</span> <span class="kn">import</span> <span class="n">sorting</span>

<span class="k">with</span> <span class="n">myindex</span><span class="o">.</span><span class="n">searcher</span><span class="p">()</span> <span class="k">as</span> <span class="n">s</span><span class="p">:</span>
    <span class="n">price_facet</span> <span class="o">=</span> <span class="n">sorting</span><span class="o">.</span><span class="n">FieldFacet</span><span class="p">(</span><span class="s">&quot;price&quot;</span><span class="p">,</span> <span class="n">reverse</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="n">type_facet</span> <span class="o">=</span> <span class="n">sorting</span><span class="o">.</span><span class="n">FieldFacet</span><span class="p">(</span><span class="s">&quot;type&quot;</span><span class="p">)</span>
    <span class="n">rating_facet</span> <span class="o">=</span> <span class="n">sorting</span><span class="o">.</span><span class="n">FieldFacet</span><span class="p">(</span><span class="s">&quot;rating&quot;</span><span class="p">,</span> <span class="n">reverse</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

    <span class="n">results</span> <span class="o">=</span> <span class="n">s</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">myquery</span><span class="p">,</span>
                       <span class="n">sortedby</span><span class="o">=</span><span class="n">price_facet</span><span class="p">,</span>  <span class="c"># Sort by reverse price</span>
                       <span class="n">collapse</span><span class="o">=</span><span class="n">type_facet</span><span class="p">,</span>  <span class="c"># Collapse on product type</span>
                       <span class="n">collapse_order</span><span class="o">=</span><span class="n">rating_facet</span>  <span class="c"># Collapse to highest rated</span>
                       <span class="p">)</span>
</pre></div>
</div>
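<p>As a rough model of what a separate collapse order means, the sketch below picks the survivor of each product type by rating and only then orders the survivors by price. The <tt class="docutils literal"><span class="pre">collapse_best</span></tt> helper and the product data are hypothetical, and the sketch ignores empty collapse keys and limits other than 1.</p>

```python
def collapse_best(items, sort_key, collapse_key, order_key):
    """Keep one item per collapse key, chosen by order_key, sorted by sort_key.

    Hypothetical model: smaller order_key values are preferred.
    """
    best = {}
    for item in items:
        k = collapse_key(item)
        if k not in best or order_key(item) < order_key(best[k]):
            best[k] = item
    return sorted(best.values(), key=sort_key)


products = [
    {"name": "A", "type": "lamp", "price": 30, "rating": 3},
    {"name": "B", "type": "lamp", "price": 20, "rating": 5},
    {"name": "C", "type": "desk", "price": 200, "rating": 4},
    {"name": "D", "type": "desk", "price": 150, "rating": 2},
]
survivors = collapse_best(
    products,
    sort_key=lambda p: -p["price"],       # display by decreasing price
    collapse_key=lambda p: p["type"],     # one item per product type
    order_key=lambda p: -p["rating"],     # keep the highest rated
)
print([p["name"] for p in survivors])  # ['C', 'B']
```

<p>Note that the lamp kept is the highest rated one ("B"), not the most expensive one ("A"), even though the display order is by price.</p>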
<p>The collapsing happens during the search, so it is usually more efficient than
finding everything and post-processing the results. However, if the collapsing
eliminates a large number of documents, collapsed search can take longer
because the search has to consider more documents and remove many
already-collected documents.</p>
<p>Since this collector must sometimes go back and remove already-collected
documents, if you use it in combination with
<a class="reference internal" href="api/collectors.html#whoosh.collectors.TermsCollector" title="whoosh.collectors.TermsCollector"><tt class="xref py py-class docutils literal"><span class="pre">TermsCollector</span></tt></a> and/or
<a class="reference internal" href="api/collectors.html#whoosh.collectors.FacetCollector" title="whoosh.collectors.FacetCollector"><tt class="xref py py-class docutils literal"><span class="pre">FacetCollector</span></tt></a>, those collectors may contain
information about documents that were filtered out of the final results by
collapsing.</p>
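<p>To make the effect of collapsing concrete, here is a minimal plain-Python sketch of the idea. It does not use Whoosh and is not how the collector is implemented; the <tt class="docutils literal"><span class="pre">collapse()</span></tt> function and the product data are hypothetical and only model &#8220;keep the highest rated document per type, then sort by price&#8221;.</p>

```python
def collapse(docs, sort_key, collapse_key, order_key):
    """Keep one document per collapse_key value, then sort the survivors.

    This is a plain-Python model of the idea, not Whoosh's implementation.
    """
    best = {}
    for doc in docs:
        key = doc[collapse_key]
        kept = best.get(key)
        # Replace the already-collected document if this one ranks higher
        if kept is None or doc[order_key] > kept[order_key]:
            best[key] = doc
    # Sort the remaining documents by the main sort key, highest first
    return sorted(best.values(), key=lambda d: d[sort_key], reverse=True)


products = [
    {"name": "A", "type": "phone", "price": 300, "rating": 3},
    {"name": "B", "type": "phone", "price": 200, "rating": 5},
    {"name": "C", "type": "laptop", "price": 900, "rating": 4},
    {"name": "D", "type": "laptop", "price": 1200, "rating": 2},
]

collapsed = collapse(products, "price", "type", "rating")
names = [d["name"] for d in collapsed]
print(names)
```

<p>The replacement step in the loop corresponds to removing an already-collected document. Anything that counted document A before it was replaced by B would still include it, which is the situation described above for terms and facet collectors.</p>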
</div>
<div class="section" id="time-limited-searches">
<h2>Time limited searches<a class="headerlink" href="#time-limited-searches" title="Permalink to this headline">¶</a></h2>
<p>To limit the amount of time a search can take:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="kn">from</span> <span class="nn">whoosh.collectors</span> <span class="kn">import</span> <span class="n">TimeLimitCollector</span><span class="p">,</span> <span class="n">TimeLimit</span>

<span class="k">with</span> <span class="n">myindex</span><span class="o">.</span><span class="n">searcher</span><span class="p">()</span> <span class="k">as</span> <span class="n">s</span><span class="p">:</span>
    <span class="c"># Get a collector object</span>
    <span class="n">c</span> <span class="o">=</span> <span class="n">s</span><span class="o">.</span><span class="n">collector</span><span class="p">(</span><span class="n">limit</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">sortedby</span><span class="o">=</span><span class="s">&quot;title_exact&quot;</span><span class="p">)</span>
    <span class="c"># Wrap it in a TimeLimitCollector and set the time limit to 10 seconds</span>
    <span class="n">tlc</span> <span class="o">=</span> <span class="n">TimeLimitCollector</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">timelimit</span><span class="o">=</span><span class="mf">10.0</span><span class="p">)</span>

    <span class="c"># Try searching</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">s</span><span class="o">.</span><span class="n">search_with_collector</span><span class="p">(</span><span class="n">myquery</span><span class="p">,</span> <span class="n">tlc</span><span class="p">)</span>
    <span class="k">except</span> <span class="n">TimeLimit</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">&quot;Search took too long, aborting!&quot;</span><span class="p">)</span>

    <span class="c"># You can still get partial results from the collector</span>
    <span class="n">results</span> <span class="o">=</span> <span class="n">tlc</span><span class="o">.</span><span class="n">results</span><span class="p">()</span>
</pre></div>
</div>
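<p>If you run time limited searches in several places, it may help to wrap the try/except pattern above in a small helper that also reports whether the limit was hit. The sketch below is one possible shape for such a helper; <tt class="docutils literal"><span class="pre">search_with_deadline()</span></tt> is a hypothetical name, and the <tt class="docutils literal"><span class="pre">Fake...</span></tt> and <tt class="docutils literal"><span class="pre">SlowSearcher</span></tt> classes are stand-ins so the example runs without an index.</p>

```python
def search_with_deadline(searcher, query, collector, timeout_error):
    """Run the search and report whether it was cut short.

    Hypothetical helper: ``timeout_error`` is the exception class to catch,
    for example ``TimeLimit`` from the example above.
    """
    timed_out = False
    try:
        searcher.search_with_collector(query, collector)
    except timeout_error:
        timed_out = True
    # The collector keeps whatever it gathered before the exception
    return collector.results(), timed_out


# Stand-ins so the sketch runs without an index; with Whoosh you would pass
# the searcher, the time limited collector and the TimeLimit exception class.
class FakeTimeout(Exception):
    pass


class FakeCollector:
    def __init__(self):
        self.hits = []

    def results(self):
        return list(self.hits)


class SlowSearcher:
    def search_with_collector(self, query, collector):
        collector.hits.append("first hit")
        raise FakeTimeout()  # Simulate the time limit expiring mid-search


partial, timed_out = search_with_deadline(
    SlowSearcher(), "any query", FakeCollector(), FakeTimeout)
print(partial, timed_out)
```

<p>With the objects from the example above, the call would presumably look like <tt class="docutils literal"><span class="pre">search_with_deadline(s,</span> <span class="pre">myquery,</span> <span class="pre">tlc,</span> <span class="pre">TimeLimit)</span></tt>.</p>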
</div>
<div class="section" id="convenience-methods">
<h2>Convenience methods<a class="headerlink" href="#convenience-methods" title="Permalink to this headline">¶</a></h2>
<p>The <a class="reference internal" href="api/searching.html#whoosh.searching.Searcher.document" title="whoosh.searching.Searcher.document"><tt class="xref py py-meth docutils literal"><span class="pre">document()</span></tt></a> and
<a class="reference internal" href="api/searching.html#whoosh.searching.Searcher.documents" title="whoosh.searching.Searcher.documents"><tt class="xref py py-meth docutils literal"><span class="pre">documents()</span></tt></a> methods on the <tt class="docutils literal"><span class="pre">Searcher</span></tt> object let
you retrieve the stored fields of documents matching terms you pass in keyword
arguments.</p>
<p>This is especially useful for fields such as dates/times, identifiers, paths,
and so on.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">&gt;&gt;&gt; </span><span class="nb">list</span><span class="p">(</span><span class="n">searcher</span><span class="o">.</span><span class="n">documents</span><span class="p">(</span><span class="n">indexeddate</span><span class="o">=</span><span class="s">u&quot;20051225&quot;</span><span class="p">))</span>
<span class="go">[{&quot;title&quot;: u&quot;Christmas presents&quot;}, {&quot;title&quot;: u&quot;Turkey dinner report&quot;}]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="k">print</span><span class="p">(</span><span class="n">searcher</span><span class="o">.</span><span class="n">document</span><span class="p">(</span><span class="n">path</span><span class="o">=</span><span class="s">u&quot;/a/b/c&quot;</span><span class="p">))</span>
<span class="go">{&quot;title&quot;: &quot;Document C&quot;}</span>
</pre></div>
</div>
<p>These methods have some limitations:</p>
<ul class="simple">
<li>The results are not scored.</li>
<li>Multiple keywords are always AND-ed together.</li>
<li>The entire value of each keyword argument is considered a single term; you
can&#8217;t search for multiple terms in the same field.</li>
</ul>
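<p>The following plain-Python sketch models these limitations. It is a simplified illustration, not the library&#8217;s code: <tt class="docutils literal"><span class="pre">lookup_documents()</span></tt> and the <tt class="docutils literal"><span class="pre">index</span></tt> list of (terms, stored fields) pairs are hypothetical.</p>

```python
def lookup_documents(index, **kw):
    """Yield stored fields of documents whose terms match every keyword.

    Plain-Python model for illustration; ``index`` is a hypothetical list of
    (terms, stored) pairs, where ``terms`` maps a field name to the set of
    terms indexed for that field.
    """
    for terms, stored in index:
        # Every keyword must match (AND), and each value is looked up as
        # one whole term, never split into words
        if all(value in terms.get(field, ()) for field, value in kw.items()):
            yield stored


index = [
    ({"path": {"/a/b/c"}, "content": {"hello", "world"}},
     {"title": "Document C"}),
    ({"path": {"/a/b/d"}, "content": {"hello"}},
     {"title": "Document D"}),
]

by_path = list(lookup_documents(index, path="/a/b/c"))
both = list(lookup_documents(index, content="hello", path="/a/b/d"))
phrase = list(lookup_documents(index, content="hello world"))
print(by_path, both, phrase)
```

<p>The last lookup finds nothing because &#8220;hello world&#8221; is treated as one term, and no document was indexed with that exact term. When you need OR logic, several terms in one field, or scoring, build a query and use the <tt class="docutils literal"><span class="pre">search()</span></tt> method instead.</p>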
</div>
<div class="section" id="combining-results-objects">
<h2>Combining Results objects<a class="headerlink" href="#combining-results-objects" title="Permalink to this headline">¶</a></h2>
<p>It is sometimes useful to use the results of another query to influence the
order of a <a class="reference internal" href="api/searching.html#whoosh.searching.Results" title="whoosh.searching.Results"><tt class="xref py py-class docutils literal"><span class="pre">whoosh.searching.Results</span></tt></a> object.</p>
<p>For example, you might have a &#8220;best bet&#8221; field. This field contains hand-picked
keywords for documents. When the user searches for those keywords, you want
those documents to be placed at the top of the results list. You could try to
do this by boosting the &#8220;bestbet&#8221; field tremendously, but that can have
unpredictable effects on scoring. It&#8217;s much easier to simply run the query
twice and combine the results:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="c"># Parse the user query</span>
<span class="n">userquery</span> <span class="o">=</span> <span class="n">queryparser</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="n">querystring</span><span class="p">)</span>

<span class="c"># Get the terms searched for that exist in the index</span>
<span class="n">termset</span> <span class="o">=</span> <span class="n">userquery</span><span class="o">.</span><span class="n">existing_terms</span><span class="p">(</span><span class="n">s</span><span class="o">.</span><span class="n">reader</span><span class="p">())</span>

<span class="c"># Formulate a &quot;best bet&quot; query for the terms the user</span>
<span class="c"># searched for in the &quot;content&quot; field</span>
<span class="n">bbq</span> <span class="o">=</span> <span class="n">Or</span><span class="p">([</span><span class="n">Term</span><span class="p">(</span><span class="s">&quot;bestbet&quot;</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span> <span class="k">for</span> <span class="n">fieldname</span><span class="p">,</span> <span class="n">text</span>
          <span class="ow">in</span> <span class="n">termset</span> <span class="k">if</span> <span class="n">fieldname</span> <span class="o">==</span> <span class="s">&quot;content&quot;</span><span class="p">])</span>

<span class="c"># Find documents matching the searched for terms</span>
<span class="n">results</span> <span class="o">=</span> <span class="n">s</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">bbq</span><span class="p">,</span> <span class="n">limit</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>

<span class="c"># Find documents that match the original query</span>
<span class="n">allresults</span> <span class="o">=</span> <span class="n">s</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">userquery</span><span class="p">,</span> <span class="n">limit</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>

<span class="c"># Add the user query results on to the end of the &quot;best bet&quot;</span>
<span class="c"># results. If documents appear in both result sets, push them</span>
<span class="c"># to the top of the combined results.</span>
<span class="n">results</span><span class="o">.</span><span class="n">upgrade_and_extend</span><span class="p">(</span><span class="n">allresults</span><span class="p">)</span>
</pre></div>
</div>
<p>The <tt class="docutils literal"><span class="pre">Results</span></tt> object supports the following methods:</p>
<dl class="docutils">
<dt><tt class="docutils literal"><span class="pre">Results.extend(results)</span></tt></dt>
<dd>Adds the documents in &#8216;results&#8217; on to the end of the list of result
documents.</dd>
<dt><tt class="docutils literal"><span class="pre">Results.filter(results)</span></tt></dt>
<dd>Removes any result documents that do not also appear in &#8216;results&#8217;.</dd>
<dt><tt class="docutils literal"><span class="pre">Results.upgrade(results)</span></tt></dt>
<dd>Any result documents that also appear in &#8216;results&#8217; are moved to the top
of the list of result documents.</dd>
<dt><tt class="docutils literal"><span class="pre">Results.upgrade_and_extend(results)</span></tt></dt>
<dd>Any result documents that also appear in &#8216;results&#8217; are moved to the top
of the list of result documents. Then any other documents in &#8216;results&#8217; are
added on to the list of result documents.</dd>
</dl>
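<p>As a rough illustration of how the combining methods reorder documents, the sketch below models <tt class="docutils literal"><span class="pre">extend</span></tt>, <tt class="docutils literal"><span class="pre">upgrade</span></tt> and <tt class="docutils literal"><span class="pre">upgrade_and_extend</span></tt> on plain lists of document names. It rests on some assumptions: the real methods modify the <tt class="docutils literal"><span class="pre">Results</span></tt> object in place, whereas these hypothetical functions return new lists, and documents already present are assumed not to be added a second time.</p>

```python
def extend(mine, other):
    """Append documents from ``other`` that are not already in ``mine``."""
    return mine + [doc for doc in other if doc not in mine]


def upgrade(mine, other):
    """Move documents that also appear in ``other`` to the top."""
    shared = [doc for doc in mine if doc in other]
    rest = [doc for doc in mine if doc not in other]
    return shared + rest


def upgrade_and_extend(mine, other):
    """Upgrade shared documents, then append the ones only in ``other``."""
    return extend(upgrade(mine, other), other)


best_bets = ["contact", "faq", "pricing"]
all_hits = ["blog", "pricing", "about", "faq"]

extended = extend(best_bets, all_hits)
upgraded = upgrade(best_bets, all_hits)
combined = upgrade_and_extend(best_bets, all_hits)
print(extended)
print(upgraded)
print(combined)
```

<p>In the combined list, &#8220;faq&#8221; and &#8220;pricing&#8221; come first because they appear in both lists, followed by the remaining best bet and then the documents found only by the user query.</p>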
</div>
</div>


          </div>
        </div>
      </div>
      <div class="sphinxsidebar">
        <div class="sphinxsidebarwrapper">
  <h3><a href="index.html">Table Of Contents</a></h3>
  <ul>
<li><a class="reference internal" href="#">How to search</a><ul>
<li><a class="reference internal" href="#the-searcher-object">The <tt class="docutils literal"><span class="pre">Searcher</span></tt> object</a></li>
<li><a class="reference internal" href="#results-object">Results object</a></li>
<li><a class="reference internal" href="#scoring-and-sorting">Scoring and sorting</a><ul>
<li><a class="reference internal" href="#scoring">Scoring</a></li>
<li><a class="reference internal" href="#sorting">Sorting</a></li>
</ul>
</li>
<li><a class="reference internal" href="#highlighting-snippets-and-more-like-this">Highlighting snippets and More Like This</a></li>
<li><a class="reference internal" href="#filtering-results">Filtering results</a></li>
<li><a class="reference internal" href="#which-terms-from-my-query-matched">Which terms from my query matched?</a></li>
<li><a class="reference internal" href="#collapsing-results">Collapsing results</a></li>
<li><a class="reference internal" href="#time-limited-searches">Time limited searches</a></li>
<li><a class="reference internal" href="#convenience-methods">Convenience methods</a></li>
<li><a class="reference internal" href="#combining-results-objects">Combining Results objects</a></li>
</ul>
</li>
</ul>

        </div>
      </div>
      <div class="clearer"></div>
    </div>
    <div class="footer">
        &copy; Copyright 2007-2012 Matt Chaput.
      Created using <a href="http://sphinx.pocoo.org/">Sphinx</a> 1.1.3.
    </div>
  </body>
</html>