<sect1 id='ngram-build-manual'> <title><command>ngram_build</command> <emphasis>Train n-gram language model</emphasis></title> <toc depth='1'></toc> <para> </para> <sect2> <title>Synopsis</title> <para> </para> <!-- /amd/projects/festival/versions/v_mpiro/speech_tools_linux/bin/ngram_build -sgml_synopsis --> <para> <cmdsynopsis><command>ngram_build</command>[input file0] [input file1] ... -o [output file]<arg>-p <replaceable>ifile</replaceable></arg> <arg>-order <replaceable>int</replaceable></arg> <arg>-smooth <replaceable>int</replaceable></arg> <arg>-o <replaceable>ofile</replaceable></arg> <arg>-input_format <replaceable>string</replaceable></arg> <arg>-otype <replaceable>string</replaceable></arg> <arg>-sparse </arg> <arg>-dense </arg> <arg>-backoff <replaceable>int</replaceable></arg> <arg>-floor <replaceable>double</replaceable></arg> <arg>-freqsmooth <replaceable>int</replaceable></arg> <arg>-trace </arg> <arg>-save_compressed </arg> <arg>-oov_mode <replaceable>string</replaceable></arg> <arg>-oov_marker <replaceable>string</replaceable></arg> <arg>-prev_tag <replaceable>string</replaceable></arg> <arg>-prev_prev_tag <replaceable>string</replaceable></arg> <arg>-last_tag <replaceable>string</replaceable></arg> <arg>-default_tags </arg> </cmdsynopsis> </para> <!-- DONE /amd/projects/festival/versions/v_mpiro/speech_tools_linux/bin/ngram_build -sgml_synopsis --> <para> ngram_build offers basic ngram language model estimation. <formalpara> <para><title>Input data format</title></para> <para> Two input formats are supported. In sentence_per_line format, the program will deal with start and end of sentence (if required) by using special vocabulary items specified by -prev_tag, -prev_prev_tag and -last_tag. For example, the input sentence: </para> <screen> the cat sat on the mat </screen> would be treated as <screen> ... prev_prev_tag prev_prev_tag prev_tag the cat sat on the mat last_tag </screen> where prev_prev_tag is the argument to -prev_prev_tag, and so on. 
A default set of tag names is also available. This input format is only useful for sliding-window type applications (e.g. language modelling for speech recognition). The second input format is ngram_per_line, which is useful either for non-sliding-window applications or where the user requires a treatment of start/end of sentence different from that provided above. In this format the input file simply contains one complete ngram per line. For the same example as above (to build a trigram model) this would be: <para> <screen>
prev_prev_tag prev_tag the
prev_tag the cat
the cat sat
cat sat on
sat on the
on the mat
the mat last_tag
</screen> </para> </formalpara> <formalpara> <para><title>Representation</title></para> \[V^N\] <para> The internal representation of the model becomes important for higher values of N where, if V is the vocabulary size, \(V^N\) becomes very large. In such cases, we cannot explicitly hold probabilities for all possible ngrams, and a sparse representation must be used (i.e. only non-zero probabilities are stored).</para> </formalpara> <formalpara> <para><title>Getting more robust probability estimates</title></para> <para> Two common techniques for obtaining better estimates for low- and zero-frequency ngrams are provided: smoothing and backing off.</para> </formalpara> <formalpara> <para><title>Testing an ngram model</title></para> Use the <link linkend=ngram-test-manual>ngram_test</link> program. 
</formalpara> </para> </sect2> <sect2> <title>OPTIONS</title> <para> </para> <!-- /amd/projects/festival/versions/v_mpiro/speech_tools_linux/bin/ngram_build -sgml_options --> <para> <variablelist> <varlistentry><term>-w</term> <LISTITEM><PARA> <replaceable>ifile</replaceable> filename containing word list (required) </PARA></LISTITEM> </varlistentry> <varlistentry><term>-p</term> <LISTITEM><PARA> <replaceable>ifile</replaceable> filename containing predictee word list (default is to use the word list given by -w) </PARA></LISTITEM> </varlistentry> <varlistentry><term>-order</term> <LISTITEM><PARA> <replaceable>int</replaceable> order, 1=unigram, 2=bigram etc. (default 2) </PARA></LISTITEM> </varlistentry> <varlistentry><term>-smooth</term> <LISTITEM><PARA> <replaceable>int</replaceable> Good-Turing smooth the grammar up to the given frequency </PARA></LISTITEM> </varlistentry> <varlistentry><term>-o</term> <LISTITEM><PARA> <replaceable>ofile</replaceable> Output file for constructed ngram </PARA></LISTITEM> </varlistentry> <varlistentry><term>-input_format</term> <LISTITEM><PARA> <replaceable>string</replaceable> format of input data, one of sentence_per_line (default), sentence_per_file or ngram_per_line 
</PARA></LISTITEM> </varlistentry> <varlistentry><term>-otype</term> <LISTITEM><PARA> <replaceable>string</replaceable> format of output file, one of cstr_ascii, cstr_bin or htk_ascii </PARA></LISTITEM> </varlistentry> <varlistentry><term>-sparse</term> <LISTITEM><PARA> build ngram in sparse representation </PARA></LISTITEM> </varlistentry> <varlistentry><term>-dense</term> <LISTITEM><PARA> build ngram in dense representation (default) </PARA></LISTITEM> </varlistentry> <varlistentry><term>-backoff</term> <LISTITEM><PARA> <replaceable>int</replaceable> build backoff ngram (requires -smooth) </PARA></LISTITEM> </varlistentry> <varlistentry><term>-floor</term> <LISTITEM><PARA> <replaceable>double</replaceable> frequency floor value used with some ngrams </PARA></LISTITEM> </varlistentry> <varlistentry><term>-freqsmooth</term> <LISTITEM><PARA> <replaceable>int</replaceable> build frequency backed-off smoothed ngram (requires -smooth) </PARA></LISTITEM> </varlistentry> <varlistentry><term>-trace</term> <LISTITEM><PARA> give verbose output about the build process </PARA></LISTITEM> </varlistentry> <varlistentry><term>-save_compressed</term> <LISTITEM><PARA> save ngram in gzipped format </PARA></LISTITEM> </varlistentry> <varlistentry><term>-oov_mode</term> <LISTITEM><PARA> <replaceable>string</replaceable> what to do about out-of-vocabulary words, one of skip_ngram, skip_sentence (default), skip_file, or use_oov_marker </PARA></LISTITEM> </varlistentry> <varlistentry><term>-oov_marker</term> <LISTITEM><PARA> <replaceable>string</replaceable> special word for out-of-vocabulary words (default !OOV); use in conjunction with '-oov_mode use_oov_marker' </PARA></LISTITEM> </varlistentry> <varlistentry><term>-prev_tag</term> <LISTITEM><PARA> <replaceable>string</replaceable> pseudo-word used as the tag before sentence start </PARA></LISTITEM> </varlistentry> <varlistentry><term>-prev_prev_tag</term> <LISTITEM><PARA> <replaceable>string</replaceable> pseudo-word used for all words before 'prev_tag' 
</PARA></LISTITEM> </varlistentry> <varlistentry><term>-last_tag</term> <LISTITEM><PARA> <replaceable>string</replaceable> pseudo-word placed after sentence end </PARA></LISTITEM> </varlistentry> <varlistentry><term>-default_tags</term> <LISTITEM><PARA> use the default tags !ENTER, !EXIT and !EXIT for -prev_tag, -prev_prev_tag and -last_tag respectively </PARA></LISTITEM> </varlistentry> </variablelist> </para> <!-- DONE /amd/projects/festival/versions/v_mpiro/speech_tools_linux/bin/ngram_build -sgml_options --> </sect2> </sect1>