Sophie: festival-speechtools-devel-1.2.96-18.fc14 i686

  <sect1 id='ngram-build-manual'>
	<title><command>ngram_build</command> <emphasis>Train n-gram language model</emphasis></title>

    <toc depth='1'></toc>
    <para>
    </para>
    <sect2>
      <title>Synopsis</title>
      <para>
      </para>
        <!-- /amd/projects/festival/versions/v_mpiro/speech_tools_linux/bin/ngram_build -sgml_synopsis -->
        <para>
<cmdsynopsis><command>ngram_build</command> [input file0] [input file1] ... -o [output file]
<arg>-p <replaceable>ifile</replaceable></arg>
<arg>-order <replaceable>int</replaceable></arg>
<arg>-smooth <replaceable>int</replaceable></arg>
<arg>-o <replaceable>ofile</replaceable></arg>
<arg>-input_format <replaceable>string</replaceable></arg>
<arg>-otype <replaceable>string</replaceable></arg>
<arg>-sparse </arg>
<arg>-dense </arg>
<arg>-backoff <replaceable>int</replaceable></arg>
<arg>-floor <replaceable>double</replaceable></arg>
<arg>-freqsmooth <replaceable>int</replaceable></arg>
<arg>-trace </arg>
<arg>-save_compressed </arg>
<arg>-oov_mode <replaceable>string</replaceable></arg>
<arg>-oov_marker <replaceable>string</replaceable></arg>
<arg>-prev_tag <replaceable>string</replaceable></arg>
<arg>-prev_prev_tag <replaceable>string</replaceable></arg>
<arg>-last_tag <replaceable>string</replaceable></arg>
<arg>-default_tags </arg>
</cmdsynopsis>
        </para>
        <!-- DONE /amd/projects/festival/versions/v_mpiro/speech_tools_linux/bin/ngram_build -sgml_synopsis -->
      <para>

ngram_build offers basic ngram language model estimation. 
<formalpara>
<para><title>Input data format</title></para>
<para> Two input formats are supported. In sentence_per_line format,
the program will deal with start and end of sentence (if required) by
using special vocabulary items specified by -prev_tag, -prev_prev_tag
and -last_tag. For example, the input sentence: </para>
<screen>
the cat sat on the mat
</screen>
would be treated as
<screen>
... prev_prev_tag prev_prev_tag prev_tag the cat sat on the mat last_tag
</screen>
where prev_prev_tag is the argument to -prev_prev_tag, and so on. A
default set of tag names is also available. This input format is only
useful for sliding-window type applications (e.g. language modelling
for speech recognition).
The second input format is ngram_per_line, which is useful either for
non-sliding-window applications or where the user requires a
treatment of start/end of sentence different from that described
above. In this format each line of the input file contains one complete
ngram. For the same example as above (to build a trigram model), the
input would be:
<para>
<screen>
prev_prev_tag prev_tag the
prev_tag the cat
the cat sat
cat sat on
sat on the
on the mat
the mat last_tag
</screen>
</para>
</formalpara>
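<para>The relationship between the two input formats can be sketched in
Python. This is only an illustration, not the program's own code: the
function name sentence_to_ngrams and its parameters are hypothetical, the
tag strings are the placeholders used in the example above, and an order
of at least 2 is assumed.</para>

```python
def sentence_to_ngrams(sentence, order=3, prev_tag="prev_tag",
                       prev_prev_tag="prev_prev_tag", last_tag="last_tag"):
    """Return ngram_per_line rows for one sentence_per_line input line."""
    assert order >= 2, "this sketch only covers bigrams and above"
    words = sentence.split()
    # One prev_tag directly before the sentence, preceded by as many
    # prev_prev_tag items as are needed to fill the first window.
    padding = [prev_prev_tag] * (order - 2) + [prev_tag]
    tokens = padding + words + [last_tag]
    # Slide a window of `order` tokens across the padded sentence.
    return [" ".join(tokens[i:i + order])
            for i in range(len(tokens) - order + 1)]


for row in sentence_to_ngrams("the cat sat on the mat"):
    print(row)
```

<para>With the default order of 3 this prints the seven trigram lines
listed above, from "prev_prev_tag prev_tag the" to "the mat last_tag".</para>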
<formalpara>
<para><title>Representation</title></para>
<para> The internal representation of the model becomes important for
higher values of N where, if V is the vocabulary size, \(V^N\) becomes
very large. In such cases, we cannot explicitly hold probabilities for
all possible ngrams, and a sparse representation must be used
(i.e. only non-zero probabilities are stored).</para>
</formalpara>
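<para>The size argument can be made concrete with a short Python sketch.
It assumes nothing about how ngram_build stores its model internally; the
helper names dense_size and sparse_counts are hypothetical, and a plain
counter keyed by the ngram stands in for a sparse representation.</para>

```python
from collections import Counter


def dense_size(vocab_size, order):
    """Cells needed by a dense table: one per possible ngram, V ** N."""
    return vocab_size ** order


def sparse_counts(ngrams):
    """Keep an entry only for ngrams that were actually observed."""
    return Counter(tuple(ngram.split()) for ngram in ngrams)


observed = ["the cat sat", "cat sat on", "sat on the", "on the mat"]
counts = sparse_counts(observed)
vocab = {word for ngram in counts for word in ngram}

print(dense_size(len(vocab), 3))  # 125 cells for a 5-word vocabulary
print(len(counts))                # 4 entries in the sparse table
print(dense_size(20000, 3))       # 8000000000000 cells for V = 20000
```

<para>Even for this toy data the dense trigram table is mostly empty; with
a vocabulary of 20000 words it would need 8 * 10^12 cells, which is why
the sparse form is needed for higher orders.</para>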
<formalpara>
<para><title>Getting more robust probability estimates</title></para>
<para>The common techniques for obtaining better estimates for low- and
zero-frequency ngrams are provided, namely smoothing and backing off.</para>
</formalpara>
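<para>As an illustration of smoothing, the following Python sketch applies
the textbook Good-Turing adjustment r* = (r + 1) * N(r + 1) / N(r), where
N(r) is the number of distinct ngrams seen r times, to counts up to a
chosen frequency. It is an assumption that this resembles what ngram_build
does; the function name good_turing_counts and the max_freq parameter are
hypothetical, and no renormalisation or backing off is attempted.</para>

```python
from collections import Counter


def good_turing_counts(counts, max_freq):
    """Return Good-Turing adjusted counts for frequencies up to max_freq."""
    # freq_of_freq[r] is N(r): how many distinct ngrams occurred r times.
    freq_of_freq = Counter(counts.values())
    adjusted = {}
    for ngram, r in counts.items():
        n_r, n_next = freq_of_freq[r], freq_of_freq[r + 1]
        if max_freq >= r and n_next > 0:
            adjusted[ngram] = (r + 1) * n_next / n_r
        else:
            # Counts above max_freq, or with no ngrams observed at r + 1,
            # are left unchanged.
            adjusted[ngram] = float(r)
    return adjusted


counts = {"a b": 1, "b c": 1, "c d": 1, "d e": 2, "e f": 3}
for ngram, value in good_turing_counts(counts, max_freq=1).items():
    print(ngram, round(value, 3))
```

<para>Here the three singleton ngrams are discounted from 1 to 2/3, which
frees probability mass for unseen ngrams, while the counts above the
threshold are kept as they are.</para>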
<formalpara>
<para><title>Testing an ngram model</title></para>
<para>Use the <link linkend='ngram-test-manual'>ngram_test</link> program.</para>
</formalpara>
      </para>
    </sect2>
    <sect2>
      <title>OPTIONS</title>
      <para>
      </para>
        <!-- /amd/projects/festival/versions/v_mpiro/speech_tools_linux/bin/ngram_build -sgml_options -->
        <para>
<variablelist>
<varlistentry><term>-w</term>
<LISTITEM><PARA>
<replaceable>ifile</replaceable>

filename containing word list (required) 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-p</term>
<LISTITEM><PARA>
<replaceable>ifile</replaceable>

filename containing predictee word list 
(default is to use wordlist given by -w) 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-order</term>
<LISTITEM><PARA>
<replaceable>int</replaceable>

order, 1=unigram, 2=bigram etc. (default 2) 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-smooth</term>
<LISTITEM><PARA>
<replaceable>int</replaceable>

Good-Turing smooth the grammar up to the 
given frequency 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-o</term>
<LISTITEM><PARA>
<replaceable>ofile</replaceable>

Output file for constructed ngram 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-input_format</term>
<LISTITEM><PARA>
<replaceable>string</replaceable>

format of input data, one of sentence_per_line (default), 
sentence_per_file or ngram_per_line 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-otype</term>
<LISTITEM><PARA>
<replaceable>string</replaceable>

format of output file, one of cstr_ascii, 
cstr_bin or htk_ascii 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-sparse</term>
<LISTITEM><PARA>

build ngram in sparse representation 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-dense</term>
<LISTITEM><PARA>

build ngram in dense representation (default) 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-backoff</term>
<LISTITEM><PARA>
<replaceable>int</replaceable>

build backoff ngram (requires -smooth) 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-floor</term>
<LISTITEM><PARA>
<replaceable>double</replaceable>

frequency floor value used with some ngrams 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-freqsmooth</term>
<LISTITEM><PARA>
<replaceable>int</replaceable>

build frequency backed off smoothed ngram 
(requires the -smooth option) 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-trace</term>
<LISTITEM><PARA>

give verbose output about build process 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-save_compressed</term>
<LISTITEM><PARA>

save ngram in gzipped format 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-oov_mode</term>
<LISTITEM><PARA>
<replaceable>string</replaceable>

what to do about out-of-vocabulary words, 
one of skip_ngram, skip_sentence (default), 
skip_file, or use_oov_marker 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-oov_marker</term>
<LISTITEM><PARA>
<replaceable>string</replaceable>

special word for OOV words (default !OOV); 
use in conjunction with '-oov_mode use_oov_marker'. 
The remaining options define the pseudo-words used at sentence boundaries. 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-prev_tag</term>
<LISTITEM><PARA>
<replaceable>string</replaceable>

tag before sentence start 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-prev_prev_tag</term>
<LISTITEM><PARA>
<replaceable>string</replaceable>

tag used for all word positions before 'prev_tag' 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-last_tag</term>
<LISTITEM><PARA>
<replaceable>string</replaceable>

tag after sentence end 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-default_tags</term>
<LISTITEM><PARA>

use the default tags !ENTER, !EXIT and !EXIT for -prev_tag, 
-prev_prev_tag and -last_tag respectively </PARA></LISTITEM>
</varlistentry>
</variablelist>
        </para>
        <!-- DONE /amd/projects/festival/versions/v_mpiro/speech_tools_linux/bin/ngram_build -sgml_options -->
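<para>The options above might be combined as in the following Python
sketch, which only assembles a command line and does not run it. The
helper ngram_build_command and the file names words.txt, train.txt and
model.ngram are hypothetical; the flags are those documented above, and
the argument order follows the synopsis (input files first).</para>

```python
import shlex


def ngram_build_command(wordlist, inputs, output, order=2, default_tags=False):
    """Assemble an ngram_build argument list from documented options."""
    args = ["ngram_build"] + list(inputs)
    args += ["-o", output, "-w", wordlist, "-order", str(order)]
    if default_tags:
        # Use the default sentence-boundary tags instead of naming them.
        args.append("-default_tags")
    return args


cmd = ngram_build_command("words.txt", ["train.txt"], "model.ngram",
                          order=3, default_tags=True)
print(shlex.join(cmd))
```

<para>The resulting list can be passed to subprocess.run on a system where
ngram_build is installed; printing it first makes it easy to check the
options before training a model.</para>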
    </sect2>
  </sect1>