<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <title>Indexing and searching N-grams — Whoosh 2.5.1 documentation</title> <link rel="stylesheet" href="_static/default.css" type="text/css" /> <link rel="stylesheet" href="_static/pygments.css" type="text/css" /> <script type="text/javascript"> var DOCUMENTATION_OPTIONS = { URL_ROOT: '', VERSION: '2.5.1', COLLAPSE_INDEX: false, FILE_SUFFIX: '.html', HAS_SOURCE: true }; </script> <script type="text/javascript" src="_static/jquery.js"></script> <script type="text/javascript" src="_static/underscore.js"></script> <script type="text/javascript" src="_static/doctools.js"></script> <link rel="top" title="Whoosh 2.5.1 documentation" href="index.html" /> <link rel="next" title="Sorting and faceting" href="facets.html" /> <link rel="prev" title="Stemming, variations, and accent folding" href="stemming.html" /> </head> <body> <div class="related"> <h3>Navigation</h3> <ul> <li class="right" style="margin-right: 10px"> <a href="genindex.html" title="General Index" accesskey="I">index</a></li> <li class="right" > <a href="py-modindex.html" title="Python Module Index" >modules</a> |</li> <li class="right" > <a href="facets.html" title="Sorting and faceting" accesskey="N">next</a> |</li> <li class="right" > <a href="stemming.html" title="Stemming, variations, and accent folding" accesskey="P">previous</a> |</li> <li><a href="index.html">Whoosh 2.5.1 documentation</a> »</li> </ul> </div> <div class="document"> <div class="documentwrapper"> <div class="bodywrapper"> <div class="body"> <div class="section" id="indexing-and-searching-n-grams"> <h1>Indexing and searching N-grams<a class="headerlink" href="#indexing-and-searching-n-grams" title="Permalink to this headline">¶</a></h1> <div class="section" id="overview"> <h2>Overview<a 
class="headerlink" href="#overview" title="Permalink to this headline">¶</a></h2>
<p>N-gram indexing is a powerful way to get fast “search as you type” functionality, like that in iTunes. It is also useful for quick and effective indexing of languages that are written without word breaks, such as Chinese and Japanese.</p>
<p>An N-gram is a group of N characters: bigrams are groups of two characters, trigrams are groups of three characters, and so on.</p>
<p>Whoosh includes two methods for analyzing N-gram fields: an N-gram tokenizer, and a filter that breaks tokens into N-grams.</p>
<p><a class="reference internal" href="api/analysis.html#whoosh.analysis.NgramTokenizer" title="whoosh.analysis.NgramTokenizer"><tt class="xref py py-class docutils literal"><span class="pre">whoosh.analysis.NgramTokenizer</span></tt></a> tokenizes the entire field into N-grams. This is more useful for Chinese, Japanese, and Korean text, where it’s useful to index bigrams of characters rather than individual characters. Using this tokenizer with languages that separate words with spaces produces tokens that contain spaces.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">whoosh.analysis</span> <span class="kn">import</span> <span class="n">NgramTokenizer</span>
<span class="gp">>>> </span><span class="n">ngt</span> <span class="o">=</span> <span class="n">NgramTokenizer</span><span class="p">(</span><span class="n">minsize</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">maxsize</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
<span class="gp">>>> </span><span class="p">[</span><span class="n">token</span><span class="o">.</span><span class="n">text</span> <span class="k">for</span> <span class="n">token</span> <span class="ow">in</span> <span class="n">ngt</span><span class="p">(</span><span class="s">u"hi there"</span><span class="p">)]</span>
<span class="go">[u'hi', u'hi ', u'hi t', u'i ', u'i t', u'i th', u' t', u' th', u' the', u'th',</span>
<span class="go">u'the', u'ther', u'he', u'her', u'here', u'er', u'ere', u're']</span>
</pre></div> </div>
<p><a class="reference internal" href="api/analysis.html#whoosh.analysis.NgramFilter" title="whoosh.analysis.NgramFilter"><tt class="xref py py-class docutils literal"><span class="pre">whoosh.analysis.NgramFilter</span></tt></a> breaks individual tokens into N-grams as part of an analysis pipeline. This is more useful for languages that separate words with spaces.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="kn">from</span> <span class="nn">whoosh.analysis</span> <span class="kn">import</span> <span class="n">StandardAnalyzer</span><span class="p">,</span> <span class="n">NgramFilter</span>
<span class="gp">>>> </span><span class="n">my_analyzer</span> <span class="o">=</span> <span class="n">StandardAnalyzer</span><span class="p">()</span> <span class="o">|</span> <span class="n">NgramFilter</span><span class="p">(</span><span class="n">minsize</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">maxsize</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
<span class="gp">>>> </span><span class="p">[</span><span class="n">token</span><span class="o">.</span><span class="n">text</span> <span class="k">for</span> <span class="n">token</span> <span class="ow">in</span> <span class="n">my_analyzer</span><span class="p">(</span><span class="s">u"rendering shaders"</span><span class="p">)]</span>
<span class="go">[u'ren', u'rend', u'end', u'ende', u'nde', u'nder', u'der', u'deri', u'eri',</span>
<span class="go">u'erin', u'rin', u'ring', u'ing', u'sha', u'shad', u'had', u'hade', u'ade',</span>
<span class="go">u'ader', u'der', u'ders', u'ers']</span>
</pre></div> </div>
<p>Whoosh includes two pre-configured field types for N-grams: <a class="reference internal" href="api/fields.html#whoosh.fields.NGRAM" title="whoosh.fields.NGRAM"><tt class="xref py py-class docutils literal"><span class="pre">whoosh.fields.NGRAM</span></tt></a> and <a class="reference internal" href="api/fields.html#whoosh.fields.NGRAMWORDS" title="whoosh.fields.NGRAMWORDS"><tt class="xref py py-class docutils literal"><span class="pre">whoosh.fields.NGRAMWORDS</span></tt></a>. 
The only difference is that <tt class="docutils literal"><span class="pre">NGRAM</span></tt> runs all text through the N-gram tokenizer, including whitespace and punctuation, while <tt class="docutils literal"><span class="pre">NGRAMWORDS</span></tt> extracts words from the text using a tokenizer, then runs each word through the N-gram filter.</p>
<p>TBD.</p> </div> </div> </div> </div> </div>
<div class="sphinxsidebar"> <div class="sphinxsidebarwrapper"> <h3><a href="index.html">Table Of Contents</a></h3> <ul> <li><a class="reference internal" href="#">Indexing and searching N-grams</a><ul> <li><a class="reference internal" href="#overview">Overview</a></li> </ul> </li> </ul> <h4>Previous topic</h4> <p class="topless"><a href="stemming.html" title="previous chapter">Stemming, variations, and accent folding</a></p> <h4>Next topic</h4> <p class="topless"><a href="facets.html" title="next chapter">Sorting and faceting</a></p> <h3>This Page</h3> <ul class="this-page-menu"> <li><a href="_sources/ngrams.txt" rel="nofollow">Show Source</a></li> </ul> <div id="searchbox" style="display: none"> <h3>Quick search</h3> <form class="search" action="search.html" method="get"> <input type="text" name="q" /> <input type="submit" value="Go" /> <input type="hidden" name="check_keywords" value="yes" /> <input type="hidden" name="area" value="default" /> </form> <p class="searchtip" style="font-size: 90%"> Enter search terms or a module, class or function name. 
</p> </div> <script type="text/javascript">$('#searchbox').show(0);</script> </div> </div> <div class="clearer"></div> </div> <div class="footer"> © Copyright 2007-2012 Matt Chaput. Created using <a href="http://sphinx.pocoo.org/">Sphinx</a> 1.1.3. </div> </body> </html>