<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <title>8.3.29. tokenize — groonga v3.0.5 documentation</title> <link rel="stylesheet" href="../../_static/groonga.css" type="text/css" /> <link rel="stylesheet" href="../../_static/pygments.css" type="text/css" /> <script type="text/javascript"> var DOCUMENTATION_OPTIONS = { URL_ROOT: '../../', VERSION: '3.0.5', COLLAPSE_INDEX: false, FILE_SUFFIX: '.html', HAS_SOURCE: true }; </script> <script type="text/javascript" src="../../_static/jquery.js"></script> <script type="text/javascript" src="../../_static/underscore.js"></script> <script type="text/javascript" src="../../_static/doctools.js"></script> <link rel="shortcut icon" href="../../_static/favicon.ico"/> <link rel="top" title="groonga v3.0.5 documentation" href="../../index.html" /> <link rel="up" title="8.3. Command" href="../command.html" /> <link rel="next" title="8.3.30. truncate" href="truncate.html" /> <link rel="prev" title="8.3.28. table_remove" href="table_remove.html" /> </head> <body> <div class="header"> <h1 class="title"> <a id="top-link" href="../../index.html"> <span class="project">groonga</span> <span class="separator">-</span> <span class="description">An open-source fulltext search engine and column store.</span> </a> </h1> <div class="other-language-links"> <ul> <li><a href="../../../../ja/html/reference/commands/tokenize.html"><img src="../../_static/jp.png" alt="日本語">日本語版はこちら</a></li> </ul> </div> </div> <div class="related"> <h3>Navigation</h3> <ul> <li class="right" style="margin-right: 10px"> <a href="../../genindex.html" title="General Index" accesskey="I">index</a></li> <li class="right" > <a href="truncate.html" title="8.3.30. truncate" accesskey="N">next</a> |</li> <li class="right" > <a href="table_remove.html" title="8.3.28. 
table_remove" accesskey="P">previous</a> |</li> <li><a href="../../index.html">groonga v3.0.5 documentation</a> »</li> <li><a href="../../reference.html" >8. Reference manual</a> »</li> <li><a href="../command.html" accesskey="U">8.3. Command</a> »</li> </ul> </div> <div class="document"> <div class="documentwrapper"> <div class="bodywrapper"> <div class="body"> <div class="section" id="tokenize"> <h1>8.3.29. <tt class="docutils literal"><span class="pre">tokenize</span></tt><a class="headerlink" href="#tokenize" title="Permalink to this headline">¶</a></h1> <div class="section" id="summary"> <h2>8.3.29.1. Summary<a class="headerlink" href="#summary" title="Permalink to this headline">¶</a></h2> <p>The <tt class="docutils literal"><span class="pre">tokenize</span></tt> command tokenizes text with the specified tokenizer. It is useful for debugging tokenization.</p> </div> <div class="section" id="syntax"> <h2>8.3.29.2. Syntax<a class="headerlink" href="#syntax" title="Permalink to this headline">¶</a></h2> <p>The <tt class="docutils literal"><span class="pre">tokenize</span></tt> command has required parameters and optional parameters. <tt class="docutils literal"><span class="pre">tokenizer</span></tt> and <tt class="docutils literal"><span class="pre">string</span></tt> are required. The others are optional:</p> <div class="highlight-none"><div class="highlight"><pre>tokenize tokenizer string [normalizer=null] [flags=NONE] </pre></div> </div> </div> <div class="section" id="usage"> <h2>8.3.29.3. 
Usage<a class="headerlink" href="#usage" title="Permalink to this headline">¶</a></h2> <p>Here is a simple example.</p> <p>Execution example:</p> <div class="highlight-none"><div class="highlight"><pre>tokenize TokenBigram "Fulltext Search" # [ # [ # 0, # 1337566253.89858, # 0.000355720520019531 # ], # [ # { # "position": 0, # "value": "Fu" # }, # { # "position": 1, # "value": "ul" # }, # { # "position": 2, # "value": "ll" # }, # { # "position": 3, # "value": "lt" # }, # { # "position": 4, # "value": "te" # }, # { # "position": 5, # "value": "ex" # }, # { # "position": 6, # "value": "xt" # }, # { # "position": 7, # "value": "t " # }, # { # "position": 8, # "value": " S" # }, # { # "position": 9, # "value": "Se" # }, # { # "position": 10, # "value": "ea" # }, # { # "position": 11, # "value": "ar" # }, # { # "position": 12, # "value": "rc" # }, # { # "position": 13, # "value": "ch" # }, # { # "position": 14, # "value": "h" # } # ] # ] </pre></div> </div> <p>This example uses only the required parameters. <tt class="docutils literal"><span class="pre">tokenizer</span></tt> is <tt class="docutils literal"><span class="pre">TokenBigram</span></tt> and <tt class="docutils literal"><span class="pre">string</span></tt> is <tt class="docutils literal"><span class="pre">"Fulltext</span> <span class="pre">Search"</span></tt>. It returns the tokens that are generated by tokenizing <tt class="docutils literal"><span class="pre">"Fulltext</span> <span class="pre">Search"</span></tt> with the <tt class="docutils literal"><span class="pre">TokenBigram</span></tt> tokenizer. Because no normalizer is specified, <tt class="docutils literal"><span class="pre">"Fulltext</span> <span class="pre">Search"</span></tt> is not normalized.</p> </div> <div class="section" id="parameters"> <h2>8.3.29.4. Parameters<a class="headerlink" href="#parameters" title="Permalink to this headline">¶</a></h2> <p>This section describes all parameters, grouped into required and optional parameters.</p> <div class="section" id="required-parameters"> <h3>8.3.29.4.1. 
Required parameters<a class="headerlink" href="#required-parameters" title="Permalink to this headline">¶</a></h3> <p>There are two required parameters, <tt class="docutils literal"><span class="pre">tokenizer</span></tt> and <tt class="docutils literal"><span class="pre">string</span></tt>.</p> <div class="section" id="tokenizer"> <h4>8.3.29.4.1.1. <tt class="docutils literal"><span class="pre">tokenizer</span></tt><a class="headerlink" href="#tokenizer" title="Permalink to this headline">¶</a></h4> <p>It specifies the tokenizer name. The <tt class="docutils literal"><span class="pre">tokenize</span></tt> command uses the tokenizer whose name is given by <tt class="docutils literal"><span class="pre">tokenizer</span></tt>.</p> <p>See <a class="reference internal" href="../tokenizers.html"><em>Tokenizers</em></a> for details about built-in tokenizers.</p> <p>Here is an example that uses the <tt class="docutils literal"><span class="pre">TokenTrigram</span></tt> tokenizer.</p> <p>Execution example:</p> <div class="highlight-none"><div class="highlight"><pre>tokenize TokenTrigram "Fulltext Search" # [ # [ # 0, # 1337566253.89858, # 0.000355720520019531 # ], # [ # { # "position": 0, # "value": "Ful" # }, # { # "position": 1, # "value": "ull" # }, # { # "position": 2, # "value": "llt" # }, # { # "position": 3, # "value": "lte" # }, # { # "position": 4, # "value": "tex" # }, # { # "position": 5, # "value": "ext" # }, # { # "position": 6, # "value": "xt " # }, # { # "position": 7, # "value": "t S" # }, # { # "position": 8, # "value": " Se" # }, # { # "position": 9, # "value": "Sea" # }, # { # "position": 10, # "value": "ear" # }, # { # "position": 11, # "value": "arc" # }, # { # "position": 12, # "value": "rch" # }, # { # "position": 13, # "value": "ch" # }, # { # "position": 14, # "value": "h" # } # ] # ] </pre></div> </div> <p>If you want to use other tokenizers, you need to register an additional tokenizer plugin with the <a class="reference internal" href="register.html"><em>register</em></a> command. 
For example, you can use a MySQL-compatible normalizer by registering <a class="reference external" href="https://github.com/groonga/groonga-normalizer-mysql">groonga-normalizer-mysql</a>.</p> </div> <div class="section" id="string"> <h4>8.3.29.4.1.2. <tt class="docutils literal"><span class="pre">string</span></tt><a class="headerlink" href="#string" title="Permalink to this headline">¶</a></h4> <p>It specifies the string that you want to tokenize. If you want to include spaces in <tt class="docutils literal"><span class="pre">string</span></tt>, you need to quote <tt class="docutils literal"><span class="pre">string</span></tt> with single quotes (<tt class="docutils literal"><span class="pre">'</span></tt>) or double quotes (<tt class="docutils literal"><span class="pre">"</span></tt>).</p> <p>Here is an example that uses spaces in <tt class="docutils literal"><span class="pre">string</span></tt>.</p> <p>Execution example:</p> <div class="highlight-none"><div class="highlight"><pre>tokenize TokenBigram "Groonga is a fast fulltext earch engine!" 
# [ # [ # 0, # 1337566253.89858, # 0.000355720520019531 # ], # [ # { # "position": 0, # "value": "Gr" # }, # { # "position": 1, # "value": "ro" # }, # { # "position": 2, # "value": "oo" # }, # { # "position": 3, # "value": "on" # }, # { # "position": 4, # "value": "ng" # }, # { # "position": 5, # "value": "ga" # }, # { # "position": 6, # "value": "a " # }, # { # "position": 7, # "value": " i" # }, # { # "position": 8, # "value": "is" # }, # { # "position": 9, # "value": "s " # }, # { # "position": 10, # "value": " a" # }, # { # "position": 11, # "value": "a " # }, # { # "position": 12, # "value": " f" # }, # { # "position": 13, # "value": "fa" # }, # { # "position": 14, # "value": "as" # }, # { # "position": 15, # "value": "st" # }, # { # "position": 16, # "value": "t " # }, # { # "position": 17, # "value": " f" # }, # { # "position": 18, # "value": "fu" # }, # { # "position": 19, # "value": "ul" # }, # { # "position": 20, # "value": "ll" # }, # { # "position": 21, # "value": "lt" # }, # { # "position": 22, # "value": "te" # }, # { # "position": 23, # "value": "ex" # }, # { # "position": 24, # "value": "xt" # }, # { # "position": 25, # "value": "t " # }, # { # "position": 26, # "value": " e" # }, # { # "position": 27, # "value": "ea" # }, # { # "position": 28, # "value": "ar" # }, # { # "position": 29, # "value": "rc" # }, # { # "position": 30, # "value": "ch" # }, # { # "position": 31, # "value": "h " # }, # { # "position": 32, # "value": " e" # }, # { # "position": 33, # "value": "en" # }, # { # "position": 34, # "value": "ng" # }, # { # "position": 35, # "value": "gi" # }, # { # "position": 36, # "value": "in" # }, # { # "position": 37, # "value": "ne" # }, # { # "position": 38, # "value": "e!" # }, # { # "position": 39, # "value": "!" # } # ] # ] </pre></div> </div> </div> </div> <div class="section" id="optional-parameters"> <h3>8.3.29.4.2. 
Optional parameters<a class="headerlink" href="#optional-parameters" title="Permalink to this headline">¶</a></h3> <p>The following parameters are optional.</p> <div class="section" id="normalizer"> <h4>8.3.29.4.2.1. <tt class="docutils literal"><span class="pre">normalizer</span></tt><a class="headerlink" href="#normalizer" title="Permalink to this headline">¶</a></h4> <p>It specifies the normalizer name. The <tt class="docutils literal"><span class="pre">tokenize</span></tt> command uses the normalizer whose name is given by <tt class="docutils literal"><span class="pre">normalizer</span></tt>. The normalizer is important for N-gram family tokenizers such as <tt class="docutils literal"><span class="pre">TokenBigram</span></tt>.</p> <p>The normalizer detects the character type of each character while normalizing. N-gram family tokenizers use these character types while tokenizing.</p> <p>Here is an example that doesn't use a normalizer.</p> <p>Execution example:</p> <div class="highlight-none"><div class="highlight"><pre>tokenize TokenBigram "Fulltext Search" # [ # [ # 0, # 1337566253.89858, # 0.000355720520019531 # ], # [ # { # "position": 0, # "value": "Fu" # }, # { # "position": 1, # "value": "ul" # }, # { # "position": 2, # "value": "ll" # }, # { # "position": 3, # "value": "lt" # }, # { # "position": 4, # "value": "te" # }, # { # "position": 5, # "value": "ex" # }, # { # "position": 6, # "value": "xt" # }, # { # "position": 7, # "value": "t " # }, # { # "position": 8, # "value": " S" # }, # { # "position": 9, # "value": "Se" # }, # { # "position": 10, # "value": "ea" # }, # { # "position": 11, # "value": "ar" # }, # { # "position": 12, # "value": "rc" # }, # { # "position": 13, # "value": "ch" # }, # { # "position": 14, # "value": "h" # } # ] # ] </pre></div> </div> <p>All alphabetic characters are tokenized two characters at a time. 
For example, <tt class="docutils literal"><span class="pre">Fu</span></tt> is a token.</p> <p>Here is an example that uses a normalizer.</p> <p>Execution example:</p> <div class="highlight-none"><div class="highlight"><pre>tokenize TokenBigram "Fulltext Search" NormalizerAuto # [ # [ # 0, # 1337566253.89858, # 0.000355720520019531 # ], # [ # { # "position": 0, # "value": "fulltext" # }, # { # "position": 1, # "value": "search" # } # ] # ] </pre></div> </div> <p>Consecutive alphabetic characters are tokenized as one token. For example, <tt class="docutils literal"><span class="pre">fulltext</span></tt> is a token.</p> <p>If you want to tokenize two characters at a time with a normalizer, use <tt class="docutils literal"><span class="pre">TokenBigramSplitSymbolAlpha</span></tt>.</p> <p>Execution example:</p> <div class="highlight-none"><div class="highlight"><pre>tokenize TokenBigramSplitSymbolAlpha "Fulltext Search" NormalizerAuto # [ # [ # 0, # 1337566253.89858, # 0.000355720520019531 # ], # [ # { # "position": 0, # "value": "fu" # }, # { # "position": 1, # "value": "ul" # }, # { # "position": 2, # "value": "ll" # }, # { # "position": 3, # "value": "lt" # }, # { # "position": 4, # "value": "te" # }, # { # "position": 5, # "value": "ex" # }, # { # "position": 6, # "value": "xt" # }, # { # "position": 7, # "value": "t" # }, # { # "position": 8, # "value": "se" # }, # { # "position": 9, # "value": "ea" # }, # { # "position": 10, # "value": "ar" # }, # { # "position": 11, # "value": "rc" # }, # { # "position": 12, # "value": "ch" # }, # { # "position": 13, # "value": "h" # } # ] # ] </pre></div> </div> <p>All alphabetic characters are tokenized two characters at a time and normalized to lowercase. For example, <tt class="docutils literal"><span class="pre">fu</span></tt> is a token.</p> </div> <div class="section" id="flags"> <h4>8.3.29.4.2.2. 
<tt class="docutils literal"><span class="pre">flags</span></tt><a class="headerlink" href="#flags" title="Permalink to this headline">¶</a></h4> <p>It specifies options that customize tokenization. You can specify multiple options separated by "<tt class="docutils literal"><span class="pre">|</span></tt>". For example, <tt class="docutils literal"><span class="pre">NONE|ENABLE_TOKENIZED_DELIMITER</span></tt>.</p> <p>Here are the available flags.</p> <table border="1" class="docutils"> <colgroup> <col width="50%" /> <col width="50%" /> </colgroup> <thead valign="bottom"> <tr class="row-odd"><th class="head">Flag</th> <th class="head">Description</th> </tr> </thead> <tbody valign="top"> <tr class="row-even"><td><tt class="docutils literal"><span class="pre">NONE</span></tt></td> <td>Just ignored.</td> </tr> <tr class="row-odd"><td><tt class="docutils literal"><span class="pre">ENABLE_TOKENIZED_DELIMITER</span></tt></td> <td>Enables tokenized delimiter. See <a class="reference internal" href="../tokenizers.html"><em>Tokenizers</em></a> for details about tokenized delimiter.</td> </tr> </tbody> </table> <p>Here is an example that uses <tt class="docutils literal"><span class="pre">ENABLE_TOKENIZED_DELIMITER</span></tt>.</p> <p>Execution example:</p> <div class="highlight-none"><div class="highlight"><pre>tokenize TokenDelimit "Fulltext Seacrch" NormalizerAuto ENABLE_TOKENIZED_DELIMITER # [ # [ # 0, # 1337566253.89858, # 0.000355720520019531 # ], # [ # { # "position": 0, # "value": "full" # }, # { # "position": 1, # "value": "text sea" # }, # { # "position": 2, # "value": "crch" # } # ] # ] </pre></div> </div> <p>The <tt class="docutils literal"><span class="pre">TokenDelimit</span></tt> tokenizer is one of the tokenizers that support tokenized delimiter. <tt class="docutils literal"><span class="pre">ENABLE_TOKENIZED_DELIMITER</span></tt> enables tokenized delimiter. Tokenized delimiter is a special character that indicates a token border. 
It is <tt class="docutils literal"><span class="pre">U+FFFE</span></tt>. This code point is not assigned to any character, so it does not appear in normal strings. That makes it a good character for this purpose. If <tt class="docutils literal"><span class="pre">ENABLE_TOKENIZED_DELIMITER</span></tt> is enabled, the target string is treated as an already tokenized string, and the tokenizer just splits it at each tokenized delimiter. The output of the example above is consistent with <tt class="docutils literal"><span class="pre">U+FFFE</span></tt> appearing in the input between <tt class="docutils literal"><span class="pre">Full</span></tt> and <tt class="docutils literal"><span class="pre">text</span></tt> and between <tt class="docutils literal"><span class="pre">Sea</span></tt> and <tt class="docutils literal"><span class="pre">crch</span></tt>; the character is usually invisible when displayed.</p> </div> </div> </div> <div class="section" id="return-value"> <h2>8.3.29.5. Return value<a class="headerlink" href="#return-value" title="Permalink to this headline">¶</a></h2> <p>The <tt class="docutils literal"><span class="pre">tokenize</span></tt> command returns the generated tokens. Each token has some attributes in addition to the token itself. More attributes may be added in the future:</p> <div class="highlight-none"><div class="highlight"><pre>[HEADER, tokens] </pre></div> </div> <p><tt class="docutils literal"><span class="pre">HEADER</span></tt></p> <blockquote> <div>See <a class="reference internal" href="../command/output_format.html"><em>Output format</em></a> for details about <tt class="docutils literal"><span class="pre">HEADER</span></tt>.</div></blockquote> <p><tt class="docutils literal"><span class="pre">tokens</span></tt></p> <blockquote> <div><p><tt class="docutils literal"><span class="pre">tokens</span></tt> is an array of tokens. 
Each token is an object that has the following attributes.</p> <table border="1" class="docutils"> <colgroup> <col width="50%" /> <col width="50%" /> </colgroup> <thead valign="bottom"> <tr class="row-odd"><th class="head">Name</th> <th class="head">Description</th> </tr> </thead> <tbody valign="top"> <tr class="row-even"><td><tt class="docutils literal"><span class="pre">value</span></tt></td> <td>The token itself.</td> </tr> <tr class="row-odd"><td><tt class="docutils literal"><span class="pre">position</span></tt></td> <td>The position of the token in the token sequence. The examples above start at 0.</td> </tr> </tbody> </table> </div></blockquote> </div> <div class="section" id="see-also"> <h2>8.3.29.6. See also<a class="headerlink" href="#see-also" title="Permalink to this headline">¶</a></h2> <ul class="simple"> <li><a class="reference internal" href="../tokenizers.html"><em>Tokenizers</em></a></li> </ul> </div> </div> </div> </div> </div> <div class="sphinxsidebar"> <div class="sphinxsidebarwrapper"> <h3><a href="../../index.html">Table Of Contents</a></h3> <ul> <li><a class="reference internal" href="#">8.3.29. <tt class="docutils literal"><span class="pre">tokenize</span></tt></a><ul> <li><a class="reference internal" href="#summary">8.3.29.1. Summary</a></li> <li><a class="reference internal" href="#syntax">8.3.29.2. Syntax</a></li> <li><a class="reference internal" href="#usage">8.3.29.3. Usage</a></li> <li><a class="reference internal" href="#parameters">8.3.29.4. Parameters</a><ul> <li><a class="reference internal" href="#required-parameters">8.3.29.4.1. Required parameters</a><ul> <li><a class="reference internal" href="#tokenizer">8.3.29.4.1.1. <tt class="docutils literal"><span class="pre">tokenizer</span></tt></a></li> <li><a class="reference internal" href="#string">8.3.29.4.1.2. <tt class="docutils literal"><span class="pre">string</span></tt></a></li> </ul> </li> <li><a class="reference internal" href="#optional-parameters">8.3.29.4.2. 
Optional parameters</a><ul> <li><a class="reference internal" href="#normalizer">8.3.29.4.2.1. <tt class="docutils literal"><span class="pre">normalizer</span></tt></a></li> <li><a class="reference internal" href="#flags">8.3.29.4.2.2. <tt class="docutils literal"><span class="pre">flags</span></tt></a></li> </ul> </li> </ul> </li> <li><a class="reference internal" href="#return-value">8.3.29.5. Return value</a></li> <li><a class="reference internal" href="#see-also">8.3.29.6. See also</a></li> </ul> </li> </ul> <h4>Previous topic</h4> <p class="topless"><a href="table_remove.html" title="previous chapter">8.3.28. table_remove</a></p> <h4>Next topic</h4> <p class="topless"><a href="truncate.html" title="next chapter">8.3.30. <tt class="docutils literal"><span class="pre">truncate</span></tt></a></p> <h3>This Page</h3> <ul class="this-page-menu"> <li><a href="../../_sources/reference/commands/tokenize.txt" rel="nofollow">Show Source</a></li> </ul> <div id="searchbox" style="display: none"> <h3>Quick search</h3> <form class="search" action="../../search.html" method="get"> <input type="text" name="q" /> <input type="submit" value="Go" /> <input type="hidden" name="check_keywords" value="yes" /> <input type="hidden" name="area" value="default" /> </form> <p class="searchtip" style="font-size: 90%"> Enter search terms or a module, class or function name. </p> </div> <script type="text/javascript">$('#searchbox').show(0);</script> </div> </div> <div class="clearer"></div> </div> <div class="related"> <h3>Navigation</h3> <ul> <li class="right" style="margin-right: 10px"> <a href="../../genindex.html" title="General Index" >index</a></li> <li class="right" > <a href="truncate.html" title="8.3.30. truncate" >next</a> |</li> <li class="right" > <a href="table_remove.html" title="8.3.28. table_remove" >previous</a> |</li> <li><a href="../../index.html">groonga v3.0.5 documentation</a> »</li> <li><a href="../../reference.html" >8. 
Reference manual</a> »</li> <li><a href="../command.html" >8.3. Command</a> »</li> </ul> </div> <div class="footer"> © Copyright 2009-2013, Brazil, Inc. </div> </body> </html>