Sophie

Sophie

distrib > Fedora > 14 > x86_64 > media > updates > by-pkgid > 0e54ba0ee564ce6063a5e83aa86060c5 > files > 446

festival-speechtools-devel-1.2.96-18.fc14.i686.rpm



<chapter>
    <title>Speech Classes</title>


<sect1><title>Overview</title>
<toc></toc>
<para>

      EST offers two classes for handling and storing speech
      information of all types: Waveforms and Tracks.  Both are
      basically matrices with one dimension representing time, the
      other representing a particular channel and the value at that
      position representing an amplitude. There are signficant
      differences between them, however, that makes the use of two
      separate classes preferable to one.

</para>
<sect2><title>Waveforms</title><para>

Waveforms store digitalled sampled acoustic waveforms. They are
composed of a matrix of shorts, where rows represent individual
samples and columns represent channels. Waves can have arbitrarily
many channels, though 1 (mono) and 2 (stereo) are the most common.
Waves are stored as shorts as this is the most common file format,
which ensures fast compatibility with most file formats and hardware.
As each sample is representing by a 16-bit short, the dynamic range of
a wave is 96dB.
</para>


</sect2><sect2><title>The Track Class</title>
<para>
The track class is used to represent the outcome of a signal
processing operation on a section of speech. It can be thought of as
representing a series of <emphasis>frames</emphasis>, where each frame
represents signal processing information at a specified time point.
</para>



<sect3><title>The Amplitude Matrix</title>
<para>
Each frame is a set of ordered coefficients, which represent the
output of a signal processing operation on a single section of
speech. For example, a frame may represent a spectrum, a cepstrum, or
a set of linear predication coefficients. An alternative view is to
visualise the track as a set of channels, where each channel
represents a how a particular type of information varies with
time. For instance, a channel might represent how the energy between
500Hz and 600Hz varies over the course of an utterance.

</para><para>
Frames and Channels are stored as a matrix of floats, where each point
in the matrix represents the amplitude of a given frame and a given
channel.
</para>

</sect3><sect3><title>The Time Array</title>
<para>
In addition to the amplitude matrix, tracks also contain a
<emphasis>time</emphasis>array, which has the same number of elements
as frames in the amplitude matrix. The time array is aligned
one-to-one with the frames. Each position in the time array represents
the time of its frame. In many forms of signal processing, frames are
at fixed intervals (often 10ms), and in such cases it would be
possible to store this as a single global value. However, the track
class is extremely general in terms of time positions and allows
frames to be spaced irregularly, which is particularly useful when
dealing with pitch-synchronous processing.
</para>

</sect3><sect3><title>The Break/Value Array</title> 
<para>

The track class also contains a <emphasis>break/value</emphasis> array,
each element of which also as a one-to-one correspondence with a
frame. In many representations some frames have undefined values, and
the break/value array is used to represent this. For example, F0
contours and formants do not have values during unvoiced sections of
speech, and hence frames representing unvoiced sections may be tagged
as breaks in the break/value array. By default, it is assumed that all
amplitudes are defined and hence no breaks are set at contour
initialisation or resizing.
</para>
<para>
In time, this will be replaced by the more general Auxiliary Matrix.
</para>
</sect3><sect3><title>Channel names</title> 
<para>
trackmaps etc.
</para>

<sect3><title>The Auxiliary Matrix</title>
<para>
It is inappropriate to store certain information in the amplitude array.
</para>



</sect3><sect3><title>Sub-tracks, channel and frame extaction</title> 
<para>
The track class provides an easy mechanism for dealing with a single
portion of the track at a time. If we have say a track with 10
channels and 500 frames, it is possible to assign a vector to any
single frame or channel, or to assign a sub-track to any contiguous
set of frames or channels. Any values that are changed in the frame or
channel vectors or sub-track, will affect the underlying tack. It is
of course possible to copy values in and out if values need to be
changed without changing the underlying track.
</para>
</sect3>
</sect2>
</sect1>
<sect1><title>Programs</title>
<para>The following programs are available:
</para>

<formalpara><title><link linkend="ch-wave-manual">ch_wave</link<para>performs basic operates on
waveforms, such as adding headers, resampling, rescaling, multi to
single channel conversion etc.
</para></formalpara>

<formalpara><title><link linkend="ch-track-manual">ch_track</link<para>performs basic operates on
coefficient tracks, such as adding headers, resampling, rescaling,
multi to single channel conversion etc. </para></formalpara>

</sect1>


<sect1><title>Classes</title>
<toc depth=2></toc>
&docpp-EST-Wave-2;
&docpp-EST-Track-2;
</sect1>


<sect1><title>Functions</title>

<sect2><title>Auxiliary Track Functions</title>
&docpp-EST-track-aux-h-3;
</sect2>
</sect1>
&trackexamplesection;

</chapter>

<!--
Local Variables:
mode: sgml
sgml-doctype: "speechtools.sgml"
sgml-parent-document: ("speechtools.sgml" "chapter" "book")
End:

<sect1><title>Example Code</title>
&trackexamplesection;
</sect1>
<sect1><title>Class Definitions</title>
<xref linkend='Ref-EST-Wave'>
<xref linkend='Ref-EST-Track'>
</sect1>

-->