<HTML>
<HEAD>
<!-- This HTML file has been created by texi2html 1.52
     from ../festival.texi on 2 August 2001 -->

<TITLE>Festival Speech Synthesis System - 20  UniSyn synthesizer</TITLE>
</HEAD>
<BODY bgcolor="#ffffff">
Go to the <A HREF="festival_1.html">first</A>, <A HREF="festival_19.html">previous</A>, <A HREF="festival_21.html">next</A>, <A HREF="festival_35.html">last</A> section, <A HREF="festival_toc.html">table of contents</A>.
<P><HR><P>


<H1><A NAME="SEC76" HREF="festival_toc.html#TOC76">20  UniSyn synthesizer</A></H1>

<P>
<A NAME="IDX255"></A>
<A NAME="IDX256"></A>
<A NAME="IDX257"></A>
Since version 1.3 a new general synthesizer module has been included.  It is
designed to replace the older diphone synthesizer described in the
next chapter.  The redesign provides a generalized
waveform synthesizer and signal processing module that can be used
even when the units being concatenated are not diphones.  At the same
time the full diphone (or other) database pre-processing
functions were added to the Speech Tools library.

</P>


<H2><A NAME="SEC77" HREF="festival_toc.html#TOC77">20.1  UniSyn database format</A></H2>

<P>
<A NAME="IDX258"></A>
<A NAME="IDX259"></A>
<A NAME="IDX260"></A>
The UniSyn synthesis modules can use databases in two basic
formats, <EM>separate</EM> and <EM>grouped</EM>.  In the separate format
all files (signal, pitchmark and coefficient files) are accessed
individually during synthesis.  This is the standard use during
database development.  In the grouped format the database is collected
together into a single special file containing all the information
necessary for waveform synthesis.  This format is designed
for distribution and general use of the database.

</P>
<P>
A database should consist of a set of waveforms (which may be
translated into a set of coefficients if the desired signal
processing method requires it), a set of pitchmarks and an index.  The
pitchmarks are necessary as most of our current signal processing methods are
pitch synchronous.

</P>


<H3><A NAME="SEC78" HREF="festival_toc.html#TOC78">20.1.1  Generating pitchmarks</A></H3>

<P>
<A NAME="IDX261"></A>
<A NAME="IDX262"></A>
Pitchmarks may be derived from laryngograph files using the
<TT>`pitchmark'</TT> program distributed with the speech
tools.  Choosing the parameters to this program is still a bit of
an art form.  The first major issue is which way up the lar
files are.  We have seen both, though it does seem that CSTR's
are most often upside down while others (e.g. OGI's) are the right way
up.  The <CODE>-inv</CODE> argument to <TT>`pitchmark'</TT> is specifically
provided to cater for this.  There are other issues in getting the
pitchmarks aligned.  The basic command for generating pitchmarks
is

<PRE>
pitchmark -inv lar/file001.lar -o pm/file001.pm -otype est \
     -min 0.005 -max 0.012 -fill -def 0.01 -wave_end
</PRE>

<P>
The <TT>`-min'</TT>, <TT>`-max'</TT> and <TT>`-def'</TT> (fill value for unvoiced
regions) settings may need to be changed depending on the speaker's pitch
range.  The values above are suitable for a male speaker.  The <TT>`-fill'</TT>
option states that unvoiced sections should be filled with equally
spaced pitchmarks.

</P>


<H3><A NAME="SEC79" HREF="festival_toc.html#TOC79">20.1.2  Generating LPC coefficients</A></H3>

<P>
<A NAME="IDX263"></A>
<A NAME="IDX264"></A>
<A NAME="IDX265"></A>
LPC coefficients are generated using the <TT>`sig2fv'</TT> command.  Two
stages are required: generating the LPC coefficients with <TT>`sig2fv'</TT>
and generating the residual with <TT>`sigfilter'</TT>.  The prototypical
commands for these are

<PRE>
sig2fv wav/file001.wav -o lpc/file001.lpc -otype est -lpc_order 16 \
    -coefs "lpc" -pm pm/file001.pm -preemph 0.95 -factor 3 \
    -window_type hamming
sigfilter wav/file001.wav -o lpc/file001.res -otype nist \
    -lpcfilter lpc/file001.lpc -inv_filter
</PRE>

<P>
<A NAME="IDX266"></A>
For some databases you may need to normalize the power.  Properly
normalizing power is difficult but we provide a simple function which may
do the job acceptably.  You should do this on the waveform before
LPC analysis (and ensure you also do the residual extraction on the normalized
waveform rather than the original).

<PRE>
ch_wave -scaleN 0.5 wav/file001.wav -o file001.Nwav
</PRE>

<P>
This normalizes the power by first scaling the signal to its maximum level
and then multiplying it by the given factor.  If the database waveforms are
clean (i.e. no clicks) this can give reasonable results.

</P>


<H2><A NAME="SEC80" HREF="festival_toc.html#TOC80">20.2  Generating a diphone index</A></H2>

<P>
<A NAME="IDX267"></A>
The diphone index consists of a short header followed by an
ASCII list of each diphone, the file it comes from, and its
start, middle and end times in seconds.  For most databases this
file needs to be generated by some database specific script.

</P>
<P>
An example header is

<PRE>
EST_File index
DataType ascii
NumEntries 2005
IndexName rab_diphone
EST_Header_End
</PRE>

<P>
The most notable part is the number of entries, which
can get out of sync with the actual number of entries if you edit
the file by hand.  That is, if you add an entry and the system still
can't find it, check that the number of entries is right.

</P>
<P>
The entries themselves may take on one of two forms, full
entries or index entries.  Full entries consist of a diphone
name, where the phones are separated by "-"; a file name
which is used to index into the pitchmark, LPC and waveform files;
and the start, middle (changeover point between phones) and end
of the diphone in that file, in seconds.  For example

<PRE>
r-uh    edx_1001        0.225   0.261   0.320
r-e     edx_1002        0.224   0.273   0.326
r-i     edx_1003        0.240   0.280   0.321
r-o     edx_1004        0.212   0.253   0.320
</PRE>

<P>
The second form of entry is an index entry which 
simply states that reference to that diphone should actually be made
to another.  For example

<PRE>
aa-ll   &#38;aa-l
</PRE>

<P>
This states that the diphone <CODE>aa-ll</CODE> should actually use the
diphone <CODE>aa-l</CODE>.  Note there are a number of ways to specify
alternates for missing diphones, and this method is best used for fixing
single or small classes of missing or broken diphones.  Index
entries may appear anywhere in the file but can't be nested.

</P>
<P>
Some checks are made on reading this index to ensure times etc.
are reasonable, but multiple entries for the same diphone are not
checked for; in that case the later one will be selected.

</P>


<H2><A NAME="SEC81" HREF="festival_toc.html#TOC81">20.3  Database declaration</A></H2>

<P>
There are two major types of database, <EM>grouped</EM> and <EM>ungrouped</EM>.
Grouped databases come as a single file containing the diphone index,
coefficients and residuals for the diphones.  This is the standard way
databases are distributed as voices in Festival.  Ungrouped databases
access diphones from individual files and are designed as a method
for debugging and testing databases before distribution.  Using an
ungrouped database is slower but allows quicker changes to the index
and the associated coefficient files and residuals without rebuilding the
group file.

</P>
<P>
<A NAME="IDX268"></A>
A database is declared to the system through the command
<CODE>us_diphone_init</CODE>.  This function takes a parameter list of
various features used for setting up a database.  The features are
<DL COMPACT>

<DT><CODE>name</CODE>
<DD>
An atomic name for this database, used in selecting it from the current
set of loaded databases.
<DT><CODE>index_file</CODE>
<DD>
The name of a file containing either a diphone index, as described above,
or a group file.  The feature <CODE>grouped</CODE> states whether
this is a group file or a simple index file.
<DT><CODE>grouped</CODE>
<DD>
Takes the value <CODE>"true"</CODE> or <CODE>"false"</CODE>.  This defines
whether the index file is a simple index or a grouped file.
<DT><CODE>coef_dir</CODE>
<DD>
The directory containing the coefficients, (LPC or just pitchmarks in
the PSOLA case).  
<DT><CODE>sig_dir</CODE>
<DD>
The directory containing the signal files (residual for LPC, full waveforms
for PSOLA).
<DT><CODE>coef_ext</CODE>
<DD>
The extension for coefficient files, typically <CODE>".lpc"</CODE> for LPC
files and <CODE>".pm"</CODE> for pitchmark files.
<DT><CODE>sig_ext</CODE>
<DD>
The extension for signal files, typically <CODE>".res"</CODE> for LPC residual
files and <CODE>".wav"</CODE> for waveform files.
<DT><CODE>default_diphone</CODE>
<DD>
<A NAME="IDX269"></A>
The diphone to be used when the requested one doesn't exist.  No matter
how careful you are you should always include a default diphone in a
distributed diphone database.  Synthesis will throw an error if
no diphone is found and there is no default.  Although it is usually
an error when the default is required, it's better to fill in something than
to stop synthesizing.  Typical values for this are silence to silence
or schwa to schwa.
<DT><CODE>alternates_left</CODE>
<DD>
<A NAME="IDX270"></A>
<A NAME="IDX271"></A>
A list of pairs showing the alternate phone names for the left phone in
a diphone pair.  This list is used to rewrite the diphone name when
the directly requested one doesn't exist.  This is the recommended
method for dealing with systematic holes in a diphone database.
<DT><CODE>alternates_right</CODE>
<DD>
A list of pairs showing the alternate phone names for the right phone in
a diphone pair.  This list is used to rewrite the diphone name when
the directly requested one doesn't exist.  This is the recommended
method for dealing with systematic holes in a diphone database.
</DL>

<P>
An example database definition is 

<PRE>
(set! rab_diphone_dir "/projects/festival/lib/voices/english/rab_diphone")
(set! rab_lpc_group 
      (list
       '(name "rab_lpc_group")
       (list 'index_file 
             (path-append rab_diphone_dir "group/rablpc16k.group"))
       '(alternates_left ((i ii) (ll l) (u uu) (i@ ii) (uh @) (a aa)
                                 (u@ uu) (w @) (o oo) (e@ ei) (e ei)
                                 (r @)))
       '(alternates_right ((i ii) (ll l) (u uu) (i@ ii) 
                                  (y i) (uh @) (r @) (w @)))
       '(default_diphone @-@@)
       '(grouped "true")))
(us_diphone_init rab_lpc_group)
</PRE>
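
<P>
For comparison, an ungrouped declaration of the same database might look
like the following sketch.  It uses only the features listed above and
reuses <CODE>rab_diphone_dir</CODE> from the previous example.  The database
name <CODE>rab_lpc_sep</CODE>, the index file name
<TT>`dic/rabdiph.est'</TT> and the <TT>`lpc'</TT> directory are assumptions
made for illustration and will differ between databases.  The
<CODE>alternates_left</CODE> and <CODE>alternates_right</CODE> features could
be added in the same way as in the grouped case.
</P>

<PRE>
;; Sketch only: the database name, index file and directories below
;; are hypothetical and must be changed to match your own database.
(set! rab_lpc_sep
      (list
       '(name "rab_lpc_sep")
       (list 'index_file
             (path-append rab_diphone_dir "dic/rabdiph.est"))
       '(grouped "false")
       ;; LPC coefficients and residuals are read from separate files
       (list 'coef_dir (path-append rab_diphone_dir "lpc"))
       (list 'sig_dir (path-append rab_diphone_dir "lpc"))
       '(coef_ext ".lpc")
       '(sig_ext ".res")
       '(default_diphone @-@@)))
(us_diphone_init rab_lpc_sep)
</PRE>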



<H2><A NAME="SEC82" HREF="festival_toc.html#TOC82">20.4  Making groupfiles</A></H2>

<P>
<A NAME="IDX272"></A>
<A NAME="IDX273"></A>
The function <CODE>us_make_group_file</CODE> will make a group file
of the currently selected US diphone database.  It loads in all diphones
in the database and saves them in the named file.  An optional
second argument allows specification of how the group file will
be saved.  These options are given as a feature list.  There
are three possible options
<DL COMPACT>

<DT><CODE>track_file_format</CODE>
<DD>
The format for the coefficient files.  By default this is
<CODE>est_binary</CODE>, currently the only other alternative is <CODE>est_ascii</CODE>.
<DT><CODE>sig_file_format</CODE>
<DD>
The format for the signal parts of the database.  By default
this is <CODE>snd</CODE> (Sun's Audio format).  This was chosen as it has
the smallest header and supports various sample formats.  Any format
supported by the Edinburgh Speech Tools is allowed.
<DT><CODE>sig_sample_format</CODE>
<DD>
The format for the samples in the signal files.  By default this
is <CODE>mulaw</CODE>.  This is suitable when the signal files are LPC
residuals.  LPC residuals have a much smaller dynamic range than
plain PCM files.  Because the <CODE>mulaw</CODE> representation is half the size
(8 bits) of standard PCM files (16 bits) this significantly reduces
the size of the group file while only marginally altering the quality of
synthesis (and from experiments the effect is not perceptible).  However
when saving group files where the signals are not LPC residuals (e.g.
in PSOLA) using this default <CODE>mulaw</CODE> is not recommended and
<CODE>short</CODE> should probably be used.
</DL>
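
<P>
As a sketch of how these options fit together, the call below writes a
group file with the default values spelled out explicitly.  It assumes
that the database has already been declared and selected, and that the
feature list takes the same (name value) form as the database
declaration.  The output file name is hypothetical.
</P>

<PRE>
;; Sketch only: the output file name is hypothetical.
;; The options repeat the defaults described above; for a database
;; whose signals are not LPC residuals (e.g. PSOLA),
;; sig_sample_format would probably be "short" instead.
(us_make_group_file
 "group/rablpc16k.group"
 '((track_file_format "est_binary")
   (sig_file_format "snd")
   (sig_sample_format "mulaw")))
</PRE>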



<H2><A NAME="SEC83" HREF="festival_toc.html#TOC83">20.5  UniSyn module selection</A></H2>

<P>
As part of voice selection, a UniSyn database may be selected as follows

<PRE>
  (set! UniSyn_module_hooks (list rab_diphone_const_clusters ))
  (set! us_abs_offset 0.0)
  (set! window_factor 1.0)
  (set! us_rel_offset 0.0)
  (set! us_gain 0.9)

  (Parameter.set 'Synth_Method 'UniSyn)
  (Parameter.set 'us_sigpr 'lpc)
  (us_db_select rab_db_name)
</PRE>

<P>
The <CODE>UniSyn_module_hooks</CODE> are run before synthesis; see the next
section on diphone selection.  At present only <CODE>lpc</CODE>
is supported by the UniSyn module, though potentially there may be
others.

</P>
<P>
<A NAME="IDX274"></A>
<A NAME="IDX275"></A>
An optional implementation of TD-PSOLA <CITE>moulines90</CITE> has been
written but fear of legal problems unfortunately prevents it being in
the public distribution, but this policy should not be taken as
acknowledging or not acknowledging any alleged patent violation.

</P>


<H2><A NAME="SEC84" HREF="festival_toc.html#TOC84">20.6  Diphone selection</A></H2>

<P>
<A NAME="IDX276"></A>
<A NAME="IDX277"></A>
Diphone names are constructed for each phone-phone pair in the Segment
relation in an utterance.  In forming a diphone name UniSyn first checks
each segment for the feature <CODE>us_diphone_left</CODE>
(or <CODE>us_diphone_right</CODE> for the right hand part of the diphone); if
that doesn't exist it uses the feature <CODE>us_diphone</CODE>, and if that doesn't
exist either, the feature <CODE>name</CODE>.  Thus it is possible to specify diphone
names which are not simply the concatenation of two segment names.

</P>
<P>
This feature is used to specify consonant cluster diphone names
for our English voices.  The hook <CODE>UniSyn_module_hooks</CODE> is run 
before selection and we specify a function to add <CODE>us_diphone_*</CODE>
features as appropriate.  See the function <CODE>rab_diphone_fix_phone_name</CODE>
in <TT>`lib/voices/english/rab_diphone/festvox/rab_diphone.scm'</TT> for
an example.

</P>
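<P>
The general shape of such a hook can be sketched as follows.  This is
not the function used by our voices: the function name
<CODE>my_diphone_names</CODE> and the rule it applies (looking up an
<CODE>l</CODE> that is followed by silence as <CODE>ll</CODE>) are invented
for illustration, and the silence name <CODE>#</CODE> is an assumption about
the phone set.
</P>

<PRE>
;; Sketch only: my_diphone_names and its rule are hypothetical.
(define (my_diphone_names utt)
  "(my_diphone_names UTT)
Mark each l that is followed by silence so that it is looked up as ll
when diphone names are constructed."
  (mapcar
   (lambda (seg)
     (if (and (string-equal (item.name seg) "l")
              (string-equal (item.feat seg "n.name") "#"))
         ;; us_diphone overrides name on both sides of the segment
         (item.set_feat seg "us_diphone" "ll")))
   (utt.relation.items utt 'Segment))
  utt)

(set! UniSyn_module_hooks (list my_diphone_names))
</PRE>

<P>
Setting <CODE>us_diphone_left</CODE> or <CODE>us_diphone_right</CODE> instead
would change the name on one side of the segment only.
</P>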
<P>
Once the diphone name is created it is used to select the diphone from
the database.  If it is not found the name is converted using the list
of <CODE>alternates_left</CODE> and <CODE>alternates_right</CODE> as specified in
the database declaration.  If that doesn't specify a diphone in the
database either, the <CODE>default_diphone</CODE> is selected and a warning is
printed.  If no default diphone is specified or the default diphone
doesn't exist in the database an error is thrown.

</P>
<P><HR><P>
Go to the <A HREF="festival_1.html">first</A>, <A HREF="festival_19.html">previous</A>, <A HREF="festival_21.html">next</A>, <A HREF="festival_35.html">last</A> section, <A HREF="festival_toc.html">table of contents</A>.
</BODY>
</HTML>