  <sect1 id='wagon-manual'>
	<title><command>wagon</command> <emphasis>CART building program</emphasis></title>

    <toc depth='1'></toc>
    <para>
    </para>
    <sect2>
      <title>Synopsis</title>
      <para>
      </para>
        <!-- /amd/projects/festival/versions/v_mpiro/speech_tools_linux/bin/wagon -sgml_synopsis -->
        <para>
<cmdsynopsis><command>wagon</command>[options]<arg>-desc <replaceable>ifile</replaceable></arg>
<arg>-data <replaceable>ifile</replaceable></arg>
<arg>-stop <replaceable>int</replaceable> (default 50)</arg>
<arg>-test <replaceable>ifile</replaceable></arg>
<arg>-frs <replaceable>float</replaceable> (default 10)</arg>
<arg>-dlist </arg>
<arg>-dtree </arg>
<arg>-output <replaceable>ofile</replaceable></arg>
<arg>-o <replaceable>ofile</replaceable></arg>
<arg>-distmatrix <replaceable>ifile</replaceable></arg>
<arg>-quiet </arg>
<arg>-verbose </arg>
<arg>-predictee <replaceable>string</replaceable></arg>
<arg>-ignore <replaceable>string</replaceable></arg>
<arg>-count_field <replaceable>string</replaceable></arg>
<arg>-stepwise </arg>
<arg>-swlimit <replaceable>float</replaceable> (default 0.0)</arg>
<arg>-swopt <replaceable>string</replaceable></arg>
<arg>-balance <replaceable>float</replaceable></arg>
<arg>-held_out <replaceable>int</replaceable></arg>
<arg>-heap <replaceable>int</replaceable> (default 210000)</arg>
<arg>-noprune </arg>
</cmdsynopsis>
        </para>
        <!-- DONE /amd/projects/festival/versions/v_mpiro/speech_tools_linux/bin/wagon -sgml_synopsis -->
      <para>

wagon is used to build CART trees from feature data. Its basic
features include:
<itemizedlist>
<listitem><para>both decision trees and decision lists are supported</para></listitem>
<listitem><para>predictees can be discrete or continuous</para></listitem>
<listitem><para>input features may be discrete or continuous</para></listitem>
<listitem><para>many options for controlling tree building</para>
<itemizedlist>
<listitem><para>fixed stop value</para></listitem>
<listitem><para>balancing</para></listitem>
<listitem><para>held-out data and pruning</para></listitem>
<listitem><para>stepwise use of input features</para></listitem>
<listitem><para>choice of optimization criteria: correct/entropy (for 
classification) and rmse/correlation (for regression)</para></listitem>
</itemizedlist>
</listitem>
</itemizedlist>
A detailed description of building CART models can be found in the
<link linkend="cart-overview">CART model overview</link> section.
      </para>
    </sect2>
    <sect2>
      <title>OPTIONS</title>
      <para>
      </para>
        <!-- /amd/projects/festival/versions/v_mpiro/speech_tools_linux/bin/wagon -sgml_options -->
        <para>
<variablelist>
<varlistentry><term>-desc</term>
<LISTITEM><PARA>
<replaceable>ifile</replaceable>

Field description file 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-data</term>
<LISTITEM><PARA>
<replaceable>ifile</replaceable>

Datafile, one vector per line 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-stop</term>
<LISTITEM><PARA>
<replaceable>int</replaceable>
(default 50)
Minimum number of examples for leaf nodes 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-test</term>
<LISTITEM><PARA>
<replaceable>ifile</replaceable>

Datafile to test tree on 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-frs</term>
<LISTITEM><PARA>
<replaceable>float</replaceable>
(default 10)
Float range split: the number of partitions to 
split a float feature range into 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-dlist</term>
<LISTITEM><PARA>

Build a decision list (rather than tree) 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-dtree</term>
<LISTITEM><PARA>

Build a decision tree (rather than a list); this is the default 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-output</term>
<LISTITEM><PARA>
<replaceable>ofile</replaceable>

File to save output tree in (same as -o) 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-o</term>
<LISTITEM><PARA>
<replaceable>ofile</replaceable>

File to save output tree in 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-distmatrix</term>
<LISTITEM><PARA>
<replaceable>ifile</replaceable>

A distance matrix for clustering 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-quiet</term>
<LISTITEM><PARA>

No questions printed during building 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-verbose</term>
<LISTITEM><PARA>

Lots of information printed during build 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-predictee</term>
<LISTITEM><PARA>
<replaceable>string</replaceable>

Name of field to predict (default is first field) 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-ignore</term>
<LISTITEM><PARA>
<replaceable>string</replaceable>

Filename or bracket list of fields to ignore 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-count_field</term>
<LISTITEM><PARA>
<replaceable>string</replaceable>

Name of field containing count weight for samples 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-stepwise</term>
<LISTITEM><PARA>

Incrementally find best features 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-swlimit</term>
<LISTITEM><PARA>
<replaceable>float</replaceable>
(default 0.0)
Percentage improvement necessary for stepwise 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-swopt</term>
<LISTITEM><PARA>
<replaceable>string</replaceable>

Parameter to optimize for stepwise. For 
classification the options are correct or entropy; 
for regression the options are rmse or correlation. 
correct and correlation are the defaults. 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-balance</term>
<LISTITEM><PARA>
<replaceable>float</replaceable>

For derived stop size: if the size of the dataset at a node, divided 
by balance, is greater than stop, it is used as the stop value. 
If balance is 0 (the default), stop is always used as is. 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-held_out</term>
<LISTITEM><PARA>
<replaceable>int</replaceable>

Percent to hold out for pruning 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-heap</term>
<LISTITEM><PARA>
<replaceable>int</replaceable>
(default 210000)
Set size of Lisp heap. This should not normally need 
to be changed from its default; it matters only with *very* 
large description files (> 1M) 
</PARA></LISTITEM>
</varlistentry>

<varlistentry><term>-noprune</term>
<LISTITEM><PARA>

No (same class) pruning required </PARA></LISTITEM>
</varlistentry>
</variablelist>
        </para>
        <!-- DONE /amd/projects/festival/versions/v_mpiro/speech_tools_linux/bin/wagon -sgml_options -->
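        <para>
The interaction between -stop and -balance, and the meaning of -frs, can be
sketched in a few lines of Python.  This is only an illustration of the
option descriptions above, not wagon's own code: the function names
<literal>effective_stop</literal> and <literal>float_split_points</literal>
are hypothetical, and the use of evenly spaced thresholds for -frs is an
assumption.
<programlisting>
```python
def effective_stop(node_size, stop=50, balance=0.0):
    """Return the stop value used at a node holding node_size samples.

    Follows the -balance description: with balance 0 the fixed stop value
    is used as is; otherwise node_size / balance replaces it when larger.
    """
    if balance == 0:
        return stop
    derived = node_size / balance
    return derived if derived > stop else stop


def float_split_points(lo, hi, frs=10):
    """Candidate thresholds dividing [lo, hi] into frs partitions.

    ASSUMPTION: the partitions are of equal width; the option text only
    says how many partitions there are.
    """
    step = (hi - lo) / frs
    return [lo + i * step for i in range(1, int(frs))]


print(effective_stop(1000, stop=50, balance=10))
print(float_split_points(0.0, 1.0, frs=4))
```
</programlisting>
Under this reading, a larger balance value lowers the derived stop size, so
the fixed stop value applies at more nodes.
        </para>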
    </sect2>
    <simplesect>
      <title>Building Trees</title>
    <para>
To build a decision tree (or list), Wagon requires data and a description
of it.  A data file consists of a set of samples, one per line, each
consisting of the same set of features.  Features may be categorical
or continuous.  By default the first feature is the predictee and the
others are used as predictors.  A typical data file looks like
this:
</para>
<para>
<screen>
0.399 pau sh  0   0     0 1 1 0 0 0 0 0 0 
0.082 sh  iy  pau onset 0 1 0 0 1 1 0 0 1
0.074 iy  hh  sh  coda  1 0 1 0 1 1 0 0 1
0.048 hh  ae  iy  onset 0 1 0 1 1 1 0 1 1
0.062 ae  d   hh  coda  1 0 0 1 1 1 0 1 1
0.020 d   y   ae  coda  2 0 1 1 1 1 0 1 1
0.082 y   ax  d   onset 0 1 0 1 1 1 1 1 1
0.082 ax  r   y   coda  1 0 0 1 1 1 1 1 1
0.036 r   d   ax  coda  2 0 1 1 1 1 1 1 1
...
</screen>
</para>
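<para>
As a rough sketch of the layout just described, the following Python reads
such a file, checks that every sample has the same number of fields, and
separates the predictee (the first field, by default) from the predictors.
The function name <literal>parse_wagon_data</literal> is hypothetical and
is not part of the speech tools.
<programlisting>
```python
def parse_wagon_data(text):
    """Split each non-empty line into (predictee, predictors).

    Fields are whitespace separated; every sample must have the same
    number of fields as the first one.
    """
    samples = []
    width = None
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if width is None:
            width = len(fields)
        elif len(fields) != width:
            raise ValueError(
                f"line {lineno}: expected {width} fields, got {len(fields)}"
            )
        samples.append((fields[0], fields[1:]))
    return samples


example = """\
0.399 pau sh  0   0     0 1 1 0 0 0 0 0 0
0.082 sh  iy  pau onset 0 1 0 0 1 1 0 0 1
"""
samples = parse_wagon_data(example)
print(len(samples), "samples,", len(samples[0][1]), "predictors each")
```
</programlisting>
Values are kept as strings here, because whether a field is categorical or
continuous is not stated in the data file itself.
</para>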
<para>
The data may come from any source, such as the festival script 
dumpfeats, which allows such files to be created easily from utterance
files.  
</para><para>
In addition to a data file, a description file is also required that 
gives a name and a type to each of the features in the data file.
For the above example it would look like
</para><para>
<screen>
((segment_duration float)
( name  aa ae ah ao aw ax ay b ch d dh dx eh el em en er ey f g 
hh ih iy jh k l m n nx ng ow oy p r s sh t th uh uw v w y z zh pau )
( n.name 0 aa ae ah ao aw ax ay b ch d dh dx eh el em en er ey f g 
hh ih iy jh k l m n nx ng ow oy p r s sh t th uh uw v w y z zh pau )
( p.name 0 aa ae ah ao aw ax ay b ch d dh dx eh el em en er ey f g 
hh ih iy jh k l m n nx ng ow oy p r s sh t th uh uw v w y z zh pau )
(position_type 0 onset coda)
(pos_in_syl float)
(syl_initial 0 1)
(syl_final   0 1)
(R:Sylstructure.parent.R:Syllable.p.syl_break float)
(R:Sylstructure.parent.syl_break float)
(R:Sylstructure.parent.R:Syllable.n.syl_break float)
(R:Sylstructure.parent.R:Syllable.p.stress 0 1)
(R:Sylstructure.parent.stress 0 1)
(R:Sylstructure.parent.R:Syllable.n.stress 0 1)
)
</screen>
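</para><para>
To illustrate how a description lines up with the data, the sketch below
parses a description in the bracketed form shown above and checks one
sample against it.  It is a simplified reading, not wagon's parser: the
names <literal>parse_description</literal> and
<literal>check_sample</literal> are hypothetical, and only float fields
and explicit value lists are handled.
<programlisting>
```python
def parse_description(text):
    """Parse a bracketed description into a list of (name, kind) pairs.

    kind is the string "float" for continuous features, otherwise the set
    of permitted discrete values.  Other field types are not handled.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    if len(tokens) == 0 or tokens[0] != "(" or tokens[-1] != ")":
        raise ValueError("description must be one bracketed list")
    fields = []
    current = None
    for tok in tokens[1:-1]:  # skip the outermost pair of brackets
        if tok == "(":
            current = []
        elif current is None:
            raise ValueError(f"unexpected token {tok!r} outside a field")
        elif tok == ")":
            if not current:
                raise ValueError("empty field entry")
            name, rest = current[0], current[1:]
            fields.append((name, "float" if rest == ["float"] else set(rest)))
            current = None
        else:
            current.append(tok)
    return fields


def check_sample(fields, values):
    """Raise ValueError if a sample does not fit the description."""
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} values, got {len(values)}")
    for (name, kind), value in zip(fields, values):
        if kind == "float":
            float(value)  # raises ValueError if not numeric
        elif value not in kind:
            raise ValueError(f"{name}: {value!r} is not a listed value")


desc = """((segment_duration float)
(position_type 0 onset coda)
(syl_initial 0 1))"""
fields = parse_description(desc)
check_sample(fields, ["0.082", "onset", "1"])  # fits, so nothing is raised
```
</programlisting>
A check of this kind catches values that are missing from a feature's value
list before a tree is built.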
</para><para>
The feature names are arbitrary, but because they appear in the generated
trees, it is most useful, if the trees are to be used for prediction on
an utterance, for the names to be features and/or pathnames.  
</para><para>
Wagon can be used to build a tree from such files with the command
<screen>
wagon -data feats.data -desc fest.desc -stop 10 -output feats.tree
</screen>
A test data set may also be given, which must match the given data description.
If specified, the built tree will be tested on the test set and the results
on it will be presented on completion; without a test set, the
results are given with respect to the training data.  However, in the
stepwise case the test set is used in the multi-level training process,
so it cannot be considered true test data, and more reasonable 
results should be found by applying the generated tree to truly
held-out data (via the program wagon_test).
    </para>
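    <para>
The train-then-evaluate workflow described above could be scripted roughly
as follows.  This sketch only assembles the command lines; the file names
<literal>feats.train</literal> and <literal>feats.heldout</literal> are
placeholders, the helper names are hypothetical, and the wagon_test options
used here (-data, -desc, -tree) are an assumption that should be checked
against the wagon_test manual.
<programlisting>
```python
import shlex


def wagon_train_cmd(data, desc, output, stop=50, test=None):
    """Build a wagon command line from options listed in this manual."""
    cmd = ["wagon", "-data", data, "-desc", desc,
           "-stop", str(stop), "-output", output]
    if test is not None:
        cmd += ["-test", test]
    return cmd


def wagon_test_cmd(data, desc, tree):
    """Build a wagon_test command line.

    ASSUMPTION: wagon_test accepts -data, -desc and -tree.
    """
    return ["wagon_test", "-data", data, "-desc", desc, "-tree", tree]


train = wagon_train_cmd("feats.train", "fest.desc", "feats.tree", stop=10)
evaluate = wagon_test_cmd("feats.heldout", "fest.desc", "feats.tree")

# Print the commands; pass the lists to subprocess.run(..., check=True)
# to execute them when the speech tools are installed.
print(shlex.join(train))
print(shlex.join(evaluate))
```
</programlisting>
Keeping the held-out file out of the training command entirely means it is
never seen during tree building, including in the stepwise case.
    </para>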
    </simplesect>
  </sect1>