Sophie: groonga-doc-3.0.5-1.fc18 x86

groonga-doc-3.0.5-1.fc18.x86_64.rpm

.. -*- rst -*-

.. highlightlang:: none

.. groonga-command
.. database: commands_tokenize

``tokenize``
============

Summary
-------

``tokenize`` command tokenizes text by the specified tokenizer.
It is useful to debug tokenization.

Syntax
------

``tokenize`` command has required parameters and optional parameters.
``tokenizer`` and ``string`` are required parameters. Others are
optional::

  tokenize tokenizer
           string
           [normalizer=null]
           [flags=NONE]

Usage
-----

Here is a simple example.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/simple_example.log
.. tokenize TokenBigram "Fulltext Search"

It has only required parameters. ``tokenizer`` is ``TokenBigram`` and
``string`` is ``"Fulltext Search"``. It returns tokens that is
generated by tokenizing ``"Fulltext Search"`` with ``TokenBigram``
tokenizer. It doesn't normalize ``"Fulltext Search"``.

Parameters
----------

This section describes all parameters. Parameters are categorized.

Required parameters
^^^^^^^^^^^^^^^^^^^

There are required parameters, ``tokenizer`` and ``string``.

``tokenizer``
"""""""""""""

It specifies the tokenizer name. ``tokenize`` command uses the
tokenizer that is named ``tokenizer``.

See :doc:`/reference/tokenizers` about built-in tokenizers.

Here is an example to use ``TokenTrigram`` tokenizer.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/tokenizer_token_trigram.log
.. tokenize TokenTrigram "Fulltext Search"

If you want to use other tokenizers, you need to register additional
tokenizer plugin by :doc:`register` command. For example, you can use
MySQL compatible normalizer by registering `groonga-normalizer-mysql
<https://github.com/groonga/groonga-normalizer-mysql>`_.

``string``
""""""""""

It specifies any string which you want to tokenize. If you want to
include spaces in ``string``, you need to quote ``string`` by
single quotation (``'``) or double quotation (``"``).

Here is an example to use spaces in ``string``.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/string_include_spaces.log
.. tokenize TokenBigram "Groonga is a fast fulltext earch engine!"

Optional parameters
^^^^^^^^^^^^^^^^^^^

There are optional parameters.

``normalizer``
""""""""""""""

It specifies the normalizer name. ``tokenize`` command uses the
normalizer that is named ``normalizer``. Normalizer is important for
N-gram family tokenizers such as ``TokenBigram``.

Normalizer detects character type for each character while
normalizing. N-gram family tokenizers use character types while
tokenizing.

Here is an example that doesn't use normalizer.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/normalizer_none.log
.. tokenize TokenBigram "Fulltext Search"

All alphabets are tokenized by two characters. For example, ``Fu`` is
a token.

Here is an example that uses normalizer.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/normalizer_use.log
.. tokenize TokenBigram "Fulltext Search" NormalizerAuto

Continuous alphabets are tokenized as one token. For example,
``fulltext`` is a token.

If you want to tokenize by two characters with noramlizer, use
``TokenBigramSplitSymbolAlpha``.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/normalizer_use_with_split_symbol_alpha.log
.. tokenize TokenBigramSplitSymbolAlpha "Fulltext Search" NormalizerAuto

All alphabets are tokenized by two characters. And they are normalized
to lower case characters. For example, ``fu`` is a token.

``flags``
"""""""""

It specifies a tokenization customize options. You can specify
multiple options separated by "``|``". For example,
``NONE|ENABLE_TOKENIZED_DELIMITER``.

Here are available flags.

.. list-table::
   :header-rows: 1

   * - Flag
     - Description
   * - ``NONE``
     - Just ignored.
   * - ``ENABLE_TOKENIZED_DELIMITER``
     - Enables tokenized delimiter. See :doc:`/reference/tokenizers` about
       tokenized delimiter details.

Here is an example that uses ``ENABLE_TOKENIZED_DELIMITER``.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/flags_enable_tokenized_delimiter.log
.. tokenize TokenDelimit "Fullï¿¾text Seaï¿¾crch" NormalizerAuto ENABLE_TOKENIZED_DELIMITER

``TokenDelimit`` tokenizer is one of tokenized delimiter supported
tokenizer. ``ENABLE_TOKENIZED_DELIMITER`` enables tokenized delimiter.
Tokenized delimiter is special character that indicates token
border. It is ``U+FFFE``. The character is not assigned any
character. It means that the character is not appeared in normal
string. So the character is good character for this puropose. If
``ENABLE_TOKENIZED_DELIMITER`` is enabled, the target string is
treated as already tokenized string. Tokenizer just tokenizes by
tokenized delimiter.

Return value
------------

``tokenize`` command returns tokenized tokens. Each token has some
attributes except token itself. The attributes will be increased in
the feature::

  [HEADER, tokens]

``HEADER``

  See :doc:`/reference/command/output_format` about ``HEADER``.

``tokens``

  ``tokens`` is an array of token. Token is an object that has the following
  attributes.

  .. list-table::
     :header-rows: 1

     * - Name
       - Description
     * - ``value``
       - Token itself.
     * - ``position``
       - The N-th token.

See also
--------

* :doc:`/reference/tokenizers`