This is festival.info, produced by Makeinfo version 3.12h from festival.texi.

This file documents the `Festival' Speech Synthesis System, a general text to speech system for making your computer talk and developing new synthesis techniques.

Copyright (C) 1996-2001 University of Edinburgh

Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are preserved on all copies.

Permission is granted to copy and distribute modified versions of this manual under the conditions for verbatim copying, provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one.

Permission is granted to copy and distribute translations of this manual into another language, under the above conditions for modified versions, except that this permission notice may be stated in a translation approved by the authors.

File: festival.info, Node: General intonation, Next: Using ToBI, Prev: Tilt intonation, Up: Intonation

General intonation
==================

As there seem to be a number of intonation theories that predict F0 contours by rule (possibly using trained parameters), this module aids the external specification of such rules for a wide class of intonation theories (though primarily those that might be referred to as the ToBI group). It is designed to be multi-lingual and to offer a quick way to port often pre-existing rules into Festival without writing new C++ code.

The accent prediction part uses the same mechanism as the Simple intonation method described above, a decision tree for accent prediction; thus the tree in the variable `int_accent_cart_tree' is used on each syllable to predict an `IntEvent'.

The target part calls a specified Scheme function which returns a list of target points for a syllable. In this way any arbitrary tests may be done to produce the target points.
For example, here is a function which returns three target points for each syllable with an `IntEvent' related to it (i.e. accented syllables).

     (define (targ_func1 utt syl)
       "(targ_func1 UTT STREAMITEM)
     Returns a list of targets for the given syllable."
       (let ((start (item.feat syl 'syllable_start))
             (end (item.feat syl 'syllable_end)))
         (if (equal? (item.feat syl "R:Intonation.daughter1.name") "Accented")
             (list
              (list start 110)
              (list (/ (+ start end) 2.0) 140)
              (list end 100)))))

This function may be identified as the function to call by the following setup parameters.

     (Parameter.set 'Int_Method 'General)
     (Parameter.set 'Int_Target_Method Int_Targets_General)
     (set! int_general_params
           (list
            (list 'targ_func targ_func1)))

File: festival.info, Node: Using ToBI, Prev: General intonation, Up: Intonation

Using ToBI
==========

An example implementation of a ToBI to F0 target module is included in `lib/tobi_rules.scm', based on the rules described in `jilka96'. This uses the general intonation method discussed in the previous section. It is designed to be useful to people who are experimenting with ToBI (`silverman92'), rather than for general text to speech.

To use this method you need to load `lib/tobi_rules.scm' and call `setup_tobi_f0_method'. The default is a male pitch range, i.e. for `voice_rab_diphone'. You can change it for other pitch ranges by changing the following variables.

     (Parameter.set 'Default_Topline 110)
     (Parameter.set 'Default_Start_Baseline 87)
     (Parameter.set 'Default_End_Baseline 83)
     (Parameter.set 'Current_Topline (Parameter.get 'Default_Topline))
     (Parameter.set 'Valley_Dip 75)

An example using this from STML is given in `examples/tobi.stml'. But it can also be used from Scheme.
For example, before defining an utterance you should execute the following, either from the command line or in some setup file.

     (voice_rab_diphone)
     (require 'tobi_rules)
     (setup_tobi_f0_method)

In order to allow specification of accents, tones, and break levels you must use an utterance type that allows such specification. For example

     (Utterance Words
      (boy
       (saw ((accent H*)))
       the
       (girl ((accent H*)))
       in
       the
       (park ((accent H*) (tone H-)))
       with
       the
       (telescope ((accent H*) (tone H-H%)))))

     (Utterance Words
      (The
       (boy ((accent L*)))
       saw
       the
       (girl ((accent H*) (tone L-)))
       with
       the
       (telescope ((accent H*) (tone H-H%)))))

You can display the synthesized form of these utterances in Xwaves. Start an Xwaves and an Xlabeller and call the function `display' on the synthesized utterance.

File: festival.info, Node: Duration, Next: UniSyn synthesizer, Prev: Intonation, Up: Top

Duration
********

A number of different duration prediction modules are available, with varying levels of sophistication. Segmental duration prediction is done by the module `Duration', which calls different actual methods depending on the parameter `Duration_Method'.

All of the following duration methods may be further affected by both a global duration stretch and a per-word one.

If the parameter `Duration_Stretch' is set, all absolute durations predicted by any of the duration methods described here are multiplied by the parameter's value. For example

     (Parameter.set 'Duration_Stretch 1.2)

will make everything speak more slowly.

In addition to the global stretch method, if the feature `dur_stretch' on the related `Token' is set it will also be used as a multiplicative factor on the duration produced by the selected method. That is `R:Syllable.parent.parent.R:Token.parent.dur_stretch'. There is a Lisp function `duration_find_stretch' which will return the combined global and local duration stretch factor for a given segment item.
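The combination of the two stretch factors is a simple product applied to the predicted duration. A minimal sketch of that logic, in Python for illustration (the function name is hypothetical; Festival computes this internally via `duration_find_stretch'):

```python
def stretched_duration(predicted, duration_stretch=1.0, token_dur_stretch=1.0):
    """Apply the global Duration_Stretch parameter and the local Token
    dur_stretch feature multiplicatively to a predicted segment duration."""
    return predicted * duration_stretch * token_dur_stretch

# A 0.1 s segment, 20% slower globally and 50% slower for this word:
stretched_duration(0.1, 1.2, 1.5)  # 0.18 seconds
```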
Note these global and local methods of affecting the duration produced by models are crude and should be considered hacks. Uniform modification of durations is not what happens in real speech. These parameters are typically used when the underlying duration method is lacking in some way. However, they can be useful. Note it is quite easy to implement new duration methods in Scheme directly.

* Menu:

* Default durations::     Fixed length durations
* Average durations::
* Klatt durations::       Klatt rules from book.
* CART durations::        Tree based durations

File: festival.info, Node: Default durations, Next: Average durations, Up: Duration

Default durations
=================

If parameter `Duration_Method' is set to `Default', the simplest duration model is used. All segments are 100 milliseconds (this can be modified by `Duration_Stretch', and/or the localised Token-related `dur_stretch' feature).

File: festival.info, Node: Average durations, Next: Klatt durations, Prev: Default durations, Up: Duration

Average durations
=================

If parameter `Duration_Method' is set to `Averages' then segmental durations are set to their averages. The variable `phoneme_durations' should be an a-list of phones and averages in seconds. The file `lib/mrpa_durs.scm' has an example for the mrpa phoneset. If a segment is found that does not appear in the list, a default duration of 0.1 seconds is assigned and a warning message generated.

File: festival.info, Node: Klatt durations, Next: CART durations, Prev: Average durations, Up: Duration

Klatt durations
===============

If parameter `Duration_Method' is set to `Klatt', the duration rules from the Klatt book (`allen87', chapter 9) are used. This method requires minimum and inherent durations for each phoneme in the phoneset. This information is held in the variable `duration_klatt_params'. Each member of this list is a three-tuple of phone name, inherent duration and minimum duration. An example for the mrpa phoneset is in `lib/klatt_durs.scm'.
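The general shape of the Klatt model is that rules shrink or stretch only the part of the inherent duration that lies above the phone's minimum. A sketch of that core relationship, in Python for illustration (this is a simplification of the rules in `allen87', not Festival's C++ implementation; the function name and example numbers are hypothetical):

```python
def klatt_duration(inherent, minimum, percentage):
    """Klatt-style duration: only the compressible part of the inherent
    duration (the part above the minimum) is scaled by the rules.
    `percentage` is the combined effect of the applicable rules
    (100 means no change)."""
    return ((inherent - minimum) * percentage) / 100.0 + minimum

# Rules that halve the compressible part of a 140 ms phone (60 ms minimum):
klatt_duration(0.14, 0.06, 50)  # 0.10 seconds
```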
File: festival.info, Node: CART durations, Prev: Klatt durations, Up: Duration

CART durations
==============

Two very similar methods of duration prediction by CART tree are supported. The first, used when parameter `Duration_Method' is `Tree', simply predicts durations directly for each segment. The tree is set in the variable `duration_cart_tree'.

The second, which seems to give better results, is used when parameter `Duration_Method' is `Tree_ZScores'. In this second model the tree predicts zscores (number of standard deviations from the mean) rather than durations directly. (This follows `campbell91', but we don't deal in syllable durations here.) This method requires means and standard deviations for each phone. The variable `duration_cart_tree' should contain the zscore prediction tree and the variable `duration_ph_info' should contain a list of phone, mean duration, and standard deviation for each phone in the phoneset.

An example tree trained from 460 sentences spoken by Gordon is in `lib/gswdurtreeZ'. Phone means and standard deviations are in `lib/gsw_durs.scm'.

After prediction the segmental duration is calculated by the simple formula

     duration = mean + (zscore * standard deviation)

This method has also been used for other duration models that affect an inherent duration by some factor: if the tree predicts factors rather than zscores and the `duration_ph_info' entries are phone, 0.0, inherent duration, the above formula will generate the desired result. Klatt and Klatt-like rules can be implemented in this way without adding a new method.

File: festival.info, Node: UniSyn synthesizer, Next: Diphone synthesizer, Prev: Duration, Up: Top

UniSyn synthesizer
******************

Since 1.3 a new general synthesizer module has been included. This is designed to replace the older diphone synthesizer described in the next chapter.
A redesign was made in order to have a generalized waveform synthesizer and signal processing module that could be used even when the units being concatenated are not diphones. Also at this stage the full diphone (or other) database pre-processing functions were added to the Speech Tool library.

UniSyn database format
======================

The UniSyn synthesis modules can use databases in two basic formats, _separate_ and _grouped_. Separate is when all files (signal, pitchmark and coefficient files) are accessed individually during synthesis. This is the standard use during database development. Group format is when a database is collected together into a single special file containing all information necessary for waveform synthesis. This format is designed to be used for distribution and general use of the database.

A database should consist of a set of waveforms (which may be translated into a set of coefficients if the desired signal processing method requires it), a set of pitchmarks and an index. The pitchmarks are necessary as most of our current signal processing methods are pitch synchronous.

Generating pitchmarks
---------------------

Pitchmarks may be derived from laryngograph files using the `pitchmark' program distributed with the speech tools. The actual parameters to this program are still a bit of an art form. The first major issue is which direction (polarity) the lar files are in. We have seen both: it seems CSTR's are most often upside down while others (e.g. OGI's) are the right way up. The `-inv' argument to `pitchmark' is specifically provided to cater for this. There are other issues in getting the pitchmarks aligned. The basic command for generating pitchmarks is

     pitchmark -inv lar/file001.lar -o pm/file001.pm -otype est \
        -min 0.005 -max 0.012 -fill -def 0.01 -wave_end

The `-min', `-max' and `-def' (fill values for unvoiced regions) may need to be changed depending on the speaker's pitch range. The above is suitable for a male speaker.
The `-fill' option states that unvoiced sections should be filled with equally spaced pitchmarks.

Generating LPC coefficients
---------------------------

LPC coefficients are generated using the `sig2fv' command. Two stages are required, generating the LPC coefficients and generating the residual. The prototypical commands for these are

     sig2fv wav/file001.wav -o lpc/file001.lpc -otype est -lpc_order 16 \
         -coefs "lpc" -pm pm/file001.pm -preemph 0.95 -factor 3 \
         -window_type hamming
     sigfilter wav/file001.wav -o lpc/file001.res -otype nist \
         -lpcfilter lpc/file001.lpc -inv_filter

For some databases you may need to normalize the power. Properly normalizing power is difficult but we provide a simple function which may do the job acceptably. You should do this on the waveform before LPC analysis (and ensure you also do the residual extraction on the normalized waveform rather than the original).

     ch_wave -scaleN 0.5 wav/file001.wav -o file001.Nwav

This normalizes the power by maximizing the signal first, then multiplying it by the given factor. If the database waveforms are clean (i.e. no clicks) this can give reasonable results.

Generating a diphone index
==========================

The diphone index consists of a short header followed by an ascii list of each diphone, the file it comes from, and its start, middle and end times in seconds. For most databases this file needs to be generated by some database-specific script. An example header is

     EST_File index
     DataType ascii
     NumEntries 2005
     IndexName rab_diphone
     EST_Header_End

The most notable part is the number of entries, which you should note can get out of sync with the actual number of entries if you hand edit entries. I.e. if you add an entry and the system still can't find it, check that the number of entries is right.

The entries themselves may take on one of two forms, full entries or index entries.
Full entries consist of a diphone name, where the phones are separated by "-"; a file name which is used to index into the pitchmark, LPC and waveform files; and the start, middle (change over point between phones) and end of the diphone in the file, in seconds. For example

     r-uh edx_1001 0.225 0.261 0.320
     r-e  edx_1002 0.224 0.273 0.326
     r-i  edx_1003 0.240 0.280 0.321
     r-o  edx_1004 0.212 0.253 0.320

The second form of entry is an index entry which simply states that reference to that diphone should actually be made to another. For example

     aa-ll &aa-l

This states that the diphone `aa-ll' should actually use the diphone `aa-l'. Note there are a number of ways to specify alternates for missing diphones, and this method is best used for fixing single or small classes of missing or broken diphones. Index entries may appear anywhere in the file but can't be nested. Some checks are made on reading this index to ensure times etc. are reasonable, but multiple entries for the same diphone are not checked; in that case the later one will be selected.

Database declaration
====================

There are two major types of database, _grouped_ and _ungrouped_. Grouped databases come as a single file containing the diphone index, coefficients and residuals for the diphones. This is the standard way databases are distributed as voices in Festival. Ungrouped databases access diphones from individual files and are designed as a method for debugging and testing databases before distribution. Using an ungrouped database is slower but allows quicker changes to the index, and to the associated coefficient files and residuals, without rebuilding the group file.

A database is declared to the system through the command `us_diphone_init'. This function takes a parameter list of various features used for setting up a database. The features are

`name'
     An atomic name for this database, used in selecting it from the current set of loaded databases.
`index_file'
     A filename containing either a diphone index, as described above, or a group file. The feature `grouped' defines the distinction between this being a group file or a simple index file.

`grouped'
     Takes the value `"true"' or `"false"'. This defines whether the index file is a simple index or a grouped file.

`coef_dir'
     The directory containing the coefficients (LPC, or just pitchmarks in the PSOLA case).

`sig_dir'
     The directory containing the signal files (residuals for LPC, full waveforms for PSOLA).

`coef_ext'
     The extension for coefficient files, typically `".lpc"' for LPC files and `".pm"' for pitchmark files.

`sig_ext'
     The extension for signal files, typically `".res"' for LPC residual files and `".wav"' for waveform files.

`default_diphone'
     The diphone to be used when the requested one doesn't exist. No matter how careful you are, you should always include a default diphone for distributed diphone databases. Synthesis will throw an error if no diphone is found and there is no default. Although it is usually an error when this is required, it's better to fill in something than stop synthesizing. Typical values for this are silence to silence or schwa to schwa.

`alternates_left'
     A list of pairs showing the alternate phone names for the left phone in a diphone pair. This list is used to rewrite the diphone name when the directly requested one doesn't exist. This is the recommended method for dealing with systematic holes in a diphone database.

`alternates_right'
     A list of pairs showing the alternate phone names for the right phone in a diphone pair. This list is used to rewrite the diphone name when the directly requested one doesn't exist. This is the recommended method for dealing with systematic holes in a diphone database.

An example database definition is

     (set! rab_diphone_dir "/projects/festival/lib/voices/english/rab_diphone")
     (set!
rab_lpc_group
           (list
            '(name "rab_lpc_group")
            (list 'index_file
                  (path-append rab_diphone_dir "group/rablpc16k.group"))
            '(alternates_left ((i ii) (ll l) (u uu) (i@ ii) (uh @) (a aa)
                               (u@ uu) (w @) (o oo) (e@ ei) (e ei) (r @)))
            '(alternates_right ((i ii) (ll l) (u uu) (i@ ii) (y i)
                                (uh @) (r @) (w @)))
            '(default_diphone @-@@)
            '(grouped "true")))
     (us_diphone_init rab_lpc_group)

Making groupfiles
=================

The function `us_make_group_file' will make a group file of the currently selected US diphone database. It loads in all diphones in the database and saves them in the named file. An optional second argument allows specification of how the group file will be saved. These options are given as a feature list. There are three possible options

`track_file_format'
     The format for the coefficient files. By default this is `est_binary'; currently the only other alternative is `est_ascii'.

`sig_file_format'
     The format for the signal parts of the database. By default this is `snd' (Sun's audio format). This was chosen as it has the smallest header and supports various sample formats. Any format supported by the Edinburgh Speech Tools is allowed.

`sig_sample_format'
     The format for the samples in the signal files. By default this is `mulaw'. This is suitable when the signal files are LPC residuals. LPC residuals have a much smaller dynamic range than plain PCM files. Because `mulaw' representation is half the size (8 bits) of standard PCM files (16 bits), this significantly reduces the size of the group file while only marginally altering the quality of synthesis (and from experiments the effect is not perceptible). However, when saving group files where the signals are not LPC residuals (e.g. in PSOLA), using this default `mulaw' is not recommended and `short' should probably be used.

UniSyn module selection
=======================

In a voice selection a UniSyn database may be selected as follows

     (set! UniSyn_module_hooks (list rab_diphone_const_clusters ))
     (set!
us_abs_offset 0.0)
     (set! window_factor 1.0)
     (set! us_rel_offset 0.0)
     (set! us_gain 0.9)

     (Parameter.set 'Synth_Method 'UniSyn)
     (Parameter.set 'us_sigpr 'lpc)
     (us_db_select rab_db_name)

The `UniSyn_module_hooks' are run before synthesis; see the next section about diphone name selection. At present only `lpc' is supported by the UniSyn module, though potentially there may be others.

An optional implementation of TD-PSOLA `moulines90' has been written, but fear of legal problems unfortunately prevents it being in the public distribution; this policy should not be taken as acknowledging or not acknowledging any alleged patent violation.

Diphone selection
=================

Diphone names are constructed for each phone-phone pair in the Segment relation in an utterance. In forming a diphone name UniSyn first checks for the feature `us_diphone_left' (or `us_diphone_right' for the right hand part of the diphone), then, if that doesn't exist, the feature `us_diphone', then, if that doesn't exist, the feature `name'. Thus it is possible to specify diphone names which are not simply the concatenation of two segment names. This feature is used to specify consonant cluster diphone names for our English voices. The hook `UniSyn_module_hooks' is run before selection and we specify a function to add `us_diphone_*' features as appropriate. See the function `rab_diphone_fix_phone_name' in `lib/voices/english/rab_diphone/festvox/rab_diphone.scm' for an example.

Once the diphone name is created it is used to select the diphone from the database. If it is not found, the name is converted using the lists of `alternates_left' and `alternates_right' as specified in the database declaration. If that doesn't specify a diphone in the database, the `default_diphone' is selected and a warning is printed. If no default diphone is specified or the default diphone doesn't exist in the database, an error is thrown.
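The lookup and fallback order just described can be summarised in a short sketch, in Python for illustration (Festival implements this in C++; the function name and the use of dictionaries for the index and alternates lists are hypothetical):

```python
def select_diphone(name, index, alternates_left, alternates_right, default=None):
    """Look the diphone up directly; if absent, rewrite the left and right
    phone names via the alternates lists; finally fall back to the
    default diphone, or fail if there is none."""
    if name in index:
        return name
    left, right = name.split("-")
    # Rewrite each half using the declared alternates, if any.
    rewritten = f"{alternates_left.get(left, left)}-{alternates_right.get(right, right)}"
    if rewritten in index:
        return rewritten
    if default in index:
        return default  # Festival would also print a warning here
    raise KeyError(f"no diphone (or default) found for {name}")

index = {"aa-l", "@-@@"}
select_diphone("aa-ll", index, {}, {"ll": "l"}, "@-@@")  # 'aa-l'
select_diphone("zz-zz", index, {}, {}, "@-@@")           # '@-@@'
```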
File: festival.info, Node: Diphone synthesizer, Next: Other synthesis methods, Prev: UniSyn synthesizer, Up: Top

Diphone synthesizer
*******************

_NOTE:_ use of this diphone synthesizer is deprecated and it will probably be removed from future versions; all of its functionality has been replaced by the UniSyn synthesizer. It is not compiled by default; if required, add `ALSO_INCLUDE += diphone' to your `festival/config/config' file.

A basic diphone synthesizer offers a method for making speech from segments, durations and intonation targets. This module was mostly written by Alistair Conkie but the base diphone format is compatible with previous CSTR diphone synthesizers.

The synthesizer offers residual excited LPC based synthesis (`hunt89') and PSOLA (TM) (`moulines90') (PSOLA is not available for distribution).

* Menu:

* Diphone database format::   Format of basic dbs
* LPC databases::             Building and using LPC files.
* Group files::               Efficient binary formats
* Diphone_Init::              Loading diphone databases
* Access strategies::         Various access methods
* Diphone selection::         Mapping phones to special diphone names

File: festival.info, Node: Diphone database format, Next: LPC databases, Up: Diphone synthesizer

Diphone database format
=======================

A diphone database consists of a _dictionary file_, a set of _waveform files_, and a set of _pitch mark files_. These files are in the same format as the previous CSTR (Osprey) synthesizer.

The dictionary file consists of one entry per line. Each entry consists of five fields: a diphone name of the form P1-P2, a filename (without extension), a floating point start position in the file in milliseconds, a mid position in milliseconds (change in phone), and an end position in milliseconds. Lines starting with a semi-colon and blank lines are ignored. The list may be in any order. For example a partial list of phones may look like
     ch-l r021 412.035 463.009 518.23
     jh-l d747 305.841 382.301 446.018
     h-l  d748 356.814 403.54  437.522
     #-@  d404 233.628 297.345 331.327
     @-#  d001 836.814 938.761 1002.48

Waveform files may be in any form, as long as every file is the same type, headered or unheadered, and as long as the format is supported by the speech tools wave reading functions. These may be standard linear PCM waveform files in the case of PSOLA, or LPC coefficients and residual when using the residual LPC synthesizer. *Note LPC databases::.

Pitch mark files consist of a simple list of positions in milliseconds (plus places after the point) in order, one per line, of each pitch mark in the file. For high quality diphone synthesis these should be derived from laryngograph data. During unvoiced sections pitch marks should be artificially created at reasonable intervals (e.g. 10 ms). In the current format there is no way to determine the "real" pitch marks from the "unvoiced" pitch marks.

It is normal to hold a diphone database in a directory with a number of sub-directories, namely `dic/' containing the dictionary file, `wave/' for the waveform files, typically of whole nonsense words (sometimes this directory is called `vox/' for historical reasons) and `pm/' for the pitch mark files. The filename in the dictionary entry should be the same for the waveform file and the pitch mark file (with different extensions).

File: festival.info, Node: LPC databases, Next: Group files, Prev: Diphone database format, Up: Diphone synthesizer

LPC databases
=============

The standard method for diphone resynthesis in the released system is residual excited LPC (`hunt89'). The actual method of resynthesis isn't important to the database format, but if residual LPC synthesis is to be used then it is necessary to make the LPC coefficient files and their corresponding residuals.
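The relationship between the coefficients and the residual is worth spelling out: the residual is what remains after inverse filtering the waveform with the LPC predictor, and passing it back through the matching all-pole filter reconstructs the waveform exactly. A minimal sketch of that identity, in Python for illustration (this uses direct-form predictor coefficients on a whole signal; the speech tools store reflection coefficients and work frame by frame, so this is the concept only, not their implementation):

```python
def lpc_residual(signal, a):
    """Inverse filter: e[n] = s[n] - sum_k a[k] * s[n-1-k]."""
    return [s_n - sum(a[k] * (signal[n - 1 - k] if n - 1 - k >= 0 else 0.0)
                      for k in range(len(a)))
            for n, s_n in enumerate(signal)]

def lpc_resynth(residual, a):
    """Forward (all-pole) filter: s[n] = e[n] + sum_k a[k] * s[n-1-k]."""
    out = []
    for n, e_n in enumerate(residual):
        out.append(e_n + sum(a[k] * (out[n - 1 - k] if n - 1 - k >= 0 else 0.0)
                             for k in range(len(a))))
    return out

sig = [0.1, 0.4, -0.2, 0.3, 0.0]
a = [0.5, -0.25]                      # illustrative predictor coefficients
res = lpc_residual(sig, a)
rebuilt = lpc_resynth(res, a)         # reconstructs sig exactly
```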
Previous versions of the system used a "host of hacky little scripts" to do this, but now that the Edinburgh Speech Tools support LPC analysis we can provide a walk-through for generating these. We assume that the waveform files of nonsense words are in a directory called `wave/'. The LPC coefficients and residuals will be, in this example, stored in `lpc16k/' with extensions `.lpc' and `.res' respectively.

Before starting it is worth considering power normalization. We have found this important on all of the databases we have collected so far. The `ch_wave' program, part of the speech tools, with the option `-scaleN 0.4' may be used if a more complex method is not available.

The following shell command generates the files

     for i in wave/*.wav
     do
        fname=`basename $i .wav`
        echo $i
        lpc_analysis -reflection -shift 0.01 -order 18 -o lpc16k/$fname.lpc \
            -r lpc16k/$fname.res -otype htk -rtype nist $i
     done

It is said that the LPC order should be the sample rate divided by one thousand, plus 2. This may or may not be appropriate, and if you are particularly worried about the database size it is worth experimenting.

The program `lpc_analysis', found in `speech_tools/bin', can be used to generate the LPC coefficients and residual. Note these should be reflection coefficients so they may be quantised (as they are in group files).

The coefficients and residual files produced by different LPC analysis programs may start at different offsets. For example Entropic's ESPS functions generate LPC coefficients that are offset by one frame shift (e.g. 0.01 seconds). Our own `lpc_analysis' routine has no offset. The `Diphone_Init' parameter list allows these offsets to be specified.
When using the above function to generate the LPC files, the description parameters should include

     (lpc_frame_offset 0)
     (lpc_res_offset 0.0)

while when generating using ESPS routines the description should be

     (lpc_frame_offset 1)
     (lpc_res_offset 0.01)

The defaults actually follow the ESPS form, that is, `lpc_frame_offset' is 1 and `lpc_res_offset' is equal to the frame shift, if they are not explicitly mentioned.

Note the biggest problem we had in implementing the residual excited LPC resynthesizer was getting the right part of the residual to line up with the right LPC coefficients describing the pitch mark. Making errors in this degrades the synthesized waveform notably, but not seriously, making it difficult to determine whether it is an offset problem or some other bug.

Although we have started investigating whether extracting pitch synchronous LPC parameters rather than fixed shift parameters gives better performance, we haven't finished this work. `lpc_analysis' supports pitch synchronous analysis but the raw "ungrouped" access method does not yet. At present the LPC parameters are extracted at a particular pitch mark by interpolating over the closest LPC parameters. The "group" files hold these interpolated parameters pitch synchronously.

The American English voice `kd' was created using the speech tools `lpc_analysis' program and its set up should be looked at if you are going to copy it. The British English voice `rb' was constructed using ESPS routines.

File: festival.info, Node: Group files, Next: Diphone_Init, Prev: LPC databases, Up: Diphone synthesizer

Group files
===========

Databases may be accessed directly, but this is usually too inefficient for any purpose except debugging. It is expected that _group files_ will be built which contain a binary representation of the database. A group file is a compact, efficient representation of the diphone database. Group files are byte order independent, so may be shared between machines of different byte orders and word sizes.
Certain information in a group file may be changed at load time, so a database name, access strategy etc. may be changed from what was set originally in the group file.

A group file contains the basic parameters, the diphone index, the signal (original waveform or LPC residual), LPC coefficients, and the pitch marks. It is all you need for a run-time synthesizer. Various compression mechanisms are supported to allow smaller databases if desired. A full English LPC plus residual database at 8k ulaw is about 3 megabytes, while a full 16 bit version at 16k is about 8 megabytes.

Group files are created with the `Diphone.group' command which takes a database name and an output filename as arguments. Making group files can take some time, especially if they are large. The `group_type' parameter specifies `raw' or `ulaw' for encoding signal files. This can significantly reduce the size of databases.

Group files may be partially loaded (see access strategies) at run time for quicker start up and to minimise run-time memory requirements.

File: festival.info, Node: Diphone_Init, Next: Access strategies, Prev: Group files, Up: Diphone synthesizer

Diphone_Init
============

The basic method for describing a database is through the `Diphone_Init' command. This function takes a single argument, a list of pairs of parameter name and value. The parameters are

`name'
     An atomic name for this database.

`group_file'
     The filename of a group file, which may itself contain parameters describing itself.

`type'
     The default value is `pcm', but for distributed voices this is always `lpc'.

`index_file'
     A filename containing the diphone dictionary.

`signal_dir'
     A directory (slash terminated) containing the pcm waveform files.

`signal_ext'
     A dot prefixed extension for the pcm waveform files.

`pitch_dir'
     A directory (slash terminated) containing the pitch mark files.
`pitch_ext'
     A dot prefixed extension for the pitch files.

`lpc_dir'
     A directory (slash terminated) containing the LPC coefficient files and residual files.

`lpc_ext'
     A dot prefixed extension for the LPC coefficient files.

`lpc_type'
     The type of LPC file (as supported by the speech tools).

`lpc_frame_offset'
     The number of frames "missing" from the beginning of the file. Often LPC parameters are offset by one frame.

`lpc_res_ext'
     A dot prefixed extension for the residual files.

`lpc_res_type'
     The type of the residual files; this is a standard waveform type as supported by the speech tools.

`lpc_res_offset'
     Number of seconds "missing" from the beginning of the residual file. Some LPC analysis techniques do not generate a residual until after one frame.

`samp_freq'
     Sample frequency of signal files.

`phoneset'
     Phoneset used; must already be declared.

`num_diphones'
     Total number of diphones in the database. If specified, this must be equal to or bigger than the number of entries in the index file. If it is not specified, the square of the number of phones in the phoneset is used.

`sig_band'
     Number of sample points around the actual diphone to take from the file. This should be larger than any windowing used on the signal, and/or up to the pitch marks outside the diphone signal.

`alternates_after'
     List of pairs of phones stating replacements for the second part of the diphone when the basic diphone is not found in the diphone database.

`alternates_before'
     List of pairs of phones stating replacements for the first part of the diphone when the basic diphone is not found in the diphone database.

`default_diphone'
     When unexpected combinations occur and no appropriate diphone can be found, this diphone should be used. This should be specified for all diphone databases that are to be robust. We usually use the silence to silence diphone. No matter how carefully you design your diphone set, conditions where an unknown diphone is requested seem to _always_ happen.
     If this is not set and a diphone is requested that is not in the
     database, an error occurs and synthesis will stop.

Examples of general set up, making group files, and general use are in

     lib/voices/english/rab_diphone/festvox/rab_diphone.scm

File: festival.info,  Node: Access strategies,  Next: Diphone selection,  Prev: Diphone_Init,  Up: Diphone synthesizer

Access strategies
=================

Three basic accessing strategies are available when using diphone databases. They are designed to optimise access time, start up time and space requirements.

`direct'
     Load all signals at database init time. This is the slowest
     startup but the fastest to access. This is ideal for servers. It
     is also useful for small databases that can be loaded quickly. It
     is reasonable for many group files.

`dynamic'
     Load signals as they are required. This has much faster start up
     and will only gradually use up memory as the diphones are
     actually used. Useful for larger databases, and for non-group
     file access.

`ondemand'
     Load the signals as they are requested but free them if they are
     not required again immediately. This is slower access but
     requires low memory usage. In group files the re-reads are quite
     cheap as the database is well cached and a file descriptor is
     already open for the file.

Note that in group files pitch marks (and LPC coefficients) are always fully loaded (cf. `direct'), as they are typically smaller. Only signals (waveform files or residuals) are potentially dynamically loaded.

File: festival.info,  Node: Diphone selection,  Prev: Access strategies,  Up: Diphone synthesizer

Diphone selection
=================

The appropriate diphone is selected based on the name of the phone identified in the segment stream. However for better diphone synthesis it is useful to augment the diphone database with other diphones in addition to the ones directly from the phoneme set.
For example dark and light l's, distinguishing consonants from their consonant cluster form and their isolated form. There are two methods to identify this modification from the basic name.

When the diphone module is called, the hook `diphone_module_hooks' is applied. That is a list of functions which will be applied to the utterance. Its main purpose is to allow the conversion of the basic name into an augmented one. For example, converting a basic `l' into a dark l, denoted by `ll'. The functions given in `diphone_module_hooks' may set the feature `diphone_phone_name' which, if set, will be used rather than the `name' of the segment.

For example suppose we wish to use a dark l (`ll') rather than a normal l for all l's that appear in the coda of a syllable. First we would define a function which identifies this condition and sets the feature `diphone_phone_name' to mark the name change. The following function would achieve this

     (define (fix_dark_ls utt)
     "(fix_dark_ls UTT)
     Identify ls in coda position and relabel them as ll."
       (mapcar
        (lambda (seg)
          (if (and (string-equal "l" (item.name seg))
                   (string-equal "+" (item.feat seg "p.ph_vc"))
                   (item.relation.prev seg "SylStructure"))
              (item.set_feat seg "diphone_phone_name" "ll")))
        (utt.relation.items utt 'Segment))
       utt)

Then when we wish to use this for a particular voice we need to add

     (set! diphone_module_hooks (list fix_dark_ls))

in the voice selection function.

For a more complex example including consonant cluster identification see the American English voice `ked' in `festival/lib/voices/english/ked/festvox/kd_diphone.scm'. The function `ked_diphone_fix_phone_name' carries out a number of mappings.

The second method for changing a name is during actual look up of a diphone in the database. The list of alternates is given by the `Diphone_Init' function. These are used when the specified diphone can't be found.
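As an illustrative sketch (the database name, index path and fallback diphone here are hypothetical, not taken from an actual voice), the alternates might be declared in the `Diphone_Init' parameter list like this:

```scheme
;; Hypothetical fragment of a Diphone_Init call showing only the
;; alternate and fallback parameters described above; a real call
;; would also give the signal, pitch and LPC parameters.
(Diphone_Init
 (list
  '(name "example_diphone")
  '(index_file "dic/diphone.index")
  '(alternates_after ((ll l)))   ; second half: if x-ll is missing, try x-l
  '(alternates_before ((ll l)))  ; first half: if ll-x is missing, try l-x
  '(default_diphone #-#)))       ; last resort: silence-to-silence diphone
```
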
For example we often allow mappings of dark l, `ll', to `l', as sometimes the dark l diphone doesn't actually exist in the database.

File: festival.info,  Node: Other synthesis methods,  Next: Audio output,  Prev: Diphone synthesizer,  Up: Top

Other synthesis methods
***********************

Festival supports a number of other synthesis systems

* Menu:

* LPC diphone synthesizer::     A small LPC synthesizer (Donovan diphones)
* MBROLA::                      Interface to MBROLA
* Synthesizers in development::

File: festival.info,  Node: LPC diphone synthesizer,  Next: MBROLA,  Up: Other synthesis methods

LPC diphone synthesizer
=======================

A very simple, and very efficient, LPC diphone synthesizer using the "donovan" diphones is also supported. This synthesis method is primarily the work of Steve Isard and later Alistair Conkie. The synthesis quality is not as good as the residual excited LPC diphone synthesizer, but it has the advantage of being much smaller. The donovan diphone database is under 800k.

The diphones are loaded through the `Donovan_Init' function, which takes the name of the dictionary file and the diphone file as arguments; see the following file for details

     lib/voices/english/don_diphone/festvox/don_diphone.scm

File: festival.info,  Node: MBROLA,  Next: Synthesizers in development,  Prev: LPC diphone synthesizer,  Up: Other synthesis methods

MBROLA
======

As an example of how Festival may use a completely external synthesis method we support the free system MBROLA. MBROLA is both a diphone synthesis technique and an actual system that constructs waveforms from segment, duration and F0 target information. For details see the MBROLA home page at `http://tcts.fpms.ac.be/synthesis/mbrola.html'. MBROLA already supports a number of diphone sets including French, Spanish, German and Romanian.

Festival support for MBROLA is in the file `lib/mbrola.scm'. It is all in Scheme. The function `MBROLA_Synth' is called when parameter `Synth_Method' is `MBROLA'.
The function simply saves the segment, duration and target information from the utterance, calls the external `mbrola' program with the selected diphone database, and reloads the generated waveform back into the utterance.

An MBROLA-ized version of the Roger diphone set is available from the MBROLA site. The simple Festival end is distributed as part of the system in `festvox_en1.tar.gz'.

The following variables are used by the process

`mbrola_progname'
     the pathname of the mbrola executable.

`mbrola_database'
     the name of the database to use. This variable is switched
     between different speakers.

File: festival.info,  Node: Synthesizers in development,  Prev: MBROLA,  Up: Other synthesis methods

Synthesizers in development
===========================

In addition to the above synthesizers, Festival also supports CSTR's older PSOLA synthesizer written by Paul Taylor. But as the newer diphone synthesizer produces similar quality output and is a newer (and hence cleaner) implementation, further development of the older module is unlikely.

An experimental unit selection synthesis module is included in `modules/clunits/'. It is an implementation of `black97c'. It is included for people wishing to continue research in the area, rather than as a fully usable waveform synthesis engine. Although it sometimes gives excellent results, it also sometimes gives amazingly bad ones too. We included this as an example of one possible framework for selection-based synthesis.

As one of our funded projects is to specifically develop new selection based synthesis algorithms, we expect to include more models within later versions of the system.

Also, now that Festival has been released, other groups are working on new synthesis techniques in the system. Many of these will become available and where possible we will give pointers from the Festival home page to them.
In particular, there is an alternative residual excited LPC module implemented at the Center for Spoken Language Understanding (CSLU) at the Oregon Graduate Institute (OGI).

File: festival.info,  Node: Audio output,  Next: Voices,  Prev: Other synthesis methods,  Up: Top

Audio output
************

If you have never heard any audio on your machine then you must first work out whether you have the appropriate hardware. If you do, you also need the appropriate software to drive it. Festival can directly interface with a number of audio systems or use external methods for playing audio.

The currently supported audio methods are

`NAS'
     NCD's NAS is a network-transparent audio system (formerly called
     netaudio). If you already run servers on your machines you simply
     need to ensure your `AUDIOSERVER' environment variable is set (or
     your `DISPLAY' variable if your audio output device is the same
     as your X Windows display). You may set NAS as your audio output
     method by the command

          (Parameter.set 'Audio_Method 'netaudio)

`/dev/audio'
     On many systems `/dev/audio' offers a simple low level method for
     audio output. It is limited to mu-law encoding at 8KHz. Some
     implementations of `/dev/audio' allow other sample rates and
     sample types, but as that is non-standard this method only uses
     the common format. Typical systems that offer these are Suns,
     Linux and FreeBSD machines. You may set direct `/dev/audio'
     access as your audio method by the command

          (Parameter.set 'Audio_Method 'sunaudio)

`/dev/audio (16bit)'
     Later Sun Microsystems workstations support 16 bit linear audio
     at various sample rates. This form of audio output is supported.
     It is a compile time option (as it requires include files that
     only exist on Sun machines).
     If your installation supports it (check the members of the list
     `*modules*') you can select 16 bit audio output on Suns by the
     command

          (Parameter.set 'Audio_Method 'sun16audio)

     Note this will send the audio to the local machine where the
     festival binary is running; this might not be the one you are
     sitting next to--that's why we recommend netaudio. A hacky
     solution to playing audio on a local machine from a remote
     machine without using netaudio is described in *Note
     Installation::.

`/dev/dsp (voxware)'
     Both FreeBSD and Linux have a very similar audio interface
     through `/dev/dsp'. There is compile time support for these in
     the speech tools and when compiled with that option Festival may
     utilise it. Check the value of the variable `*modules*' to see
     which audio devices are directly supported. On FreeBSD, if
     supported, you may select local 16 bit linear audio by the
     command

          (Parameter.set 'Audio_Method 'freebsd16audio)

     While under Linux, if supported, you may use the command

          (Parameter.set 'Audio_Method 'linux16audio)

     Some earlier (and smaller) machines only have 8 bit audio even
     though they include a `/dev/dsp' (Soundblaster PRO for example).
     This was not dealt with properly in earlier versions of the
     system, but now the support automatically checks the sample
     width supported and uses it accordingly. 8 bit audio at
     frequencies higher than 8K sounds better than straight 8k ulaw,
     so this feature is useful.

`mplayer'
     Under Windows NT or 95 you can use the `mplayer' command, which
     we have found requires special treatment to get its parameters
     right.
     Rather than using `Audio_Command' you can select this on Windows
     machines with the following command

          (Parameter.set 'Audio_Method 'mplayeraudio)

     Alternatively built-in audio output is available with

          (Parameter.set 'Audio_Method 'win32audio)

`SGI IRIX'
     Builtin audio output is now available for SGI's IRIX 6.2 using
     the command

          (Parameter.set 'Audio_Method 'irixaudio)

`Audio Command'
     Alternatively the user can provide a command that can play an
     audio file. Festival will execute that command in an environment
     where the shell variable `SR' is set to the sample rate (in Hz)
     and `FILE' to the name of an unheadered raw, 16 bit file
     containing the synthesized waveform in the byte order of the
     machine Festival is running on. You can specify your audio play
     command, and that Festival should execute it, through the
     following commands

          (Parameter.set 'Audio_Command "sun16play -f $SR $FILE")
          (Parameter.set 'Audio_Method 'Audio_Command)

     On SGI machines under IRIX the equivalent would be

          (Parameter.set 'Audio_Command "sfplay -i integer 16 2scomp rate $SR end $FILE")
          (Parameter.set 'Audio_Method 'Audio_Command)

For the `Audio_Command' method of playing waveforms, Festival supports two additional audio parameters. `Audio_Required_Rate' allows you to use Festival's internal sample rate conversion function to convert to any desired rate. Note this may not be as good as playing the waveform at the sample rate it was originally created in, but as some hardware devices are restrictive in what sample rates they support, or have naive resampling functions, this could be optimal. The second additional audio parameter is `Audio_Required_Format', which can be used to specify the desired output format of the file. The default is unheadered raw, but this may be any of the values supported by the speech tools (including nist, esps, snd, riff, aiff, audlab, raw and, if you really want it, ascii).
For example suppose you have a program that only plays sun headered files at 16000 Hz; you can set up audio output as

     (Parameter.set 'Audio_Method 'Audio_Command)
     (Parameter.set 'Audio_Required_Rate 16000)
     (Parameter.set 'Audio_Required_Format 'snd)
     (Parameter.set 'Audio_Command "sunplay $FILE")

Where the audio method supports it, you can specify an alternative audio device for machines that have more than one audio device.

     (Parameter.set 'Audio_Device "/dev/dsp2")

If Netaudio is not available and you need to play audio on a machine different from the one Festival is running on, we have had reports that `snack' (`http://www.speech.kth.se/snack/') is a possible solution. It allows remote play but importantly also supports Windows 95/NT based clients.

Because you do not want to wait for a whole file to be synthesized before you can play it, Festival also offers an _audio spooler_ that allows the playing of audio files while continuing to synthesize the following utterances. On reasonable workstations this allows the breaks between utterances to be as short as your hardware allows them to be.

The audio spooler may be started by selecting asynchronous mode

     (audio_mode 'async)

This is switched on by default by the function `tts'. You may put Festival back into synchronous mode (i.e. the `utt.play' command will wait until the audio has finished playing before returning) by the command

     (audio_mode 'sync)

Additional related commands are

`(audio_mode 'close)'
     Close the audio server down but wait until it is cleared. This is
     useful in scripts etc. when you wish to exit only when all audio
     is complete.

`(audio_mode 'shutup)'
     Close the audio down now, stopping the current file being played
     and any in the queue. Note that this may take some time to take
     effect depending on which audio method you use. Sometimes there
     can be 100s of milliseconds of audio in the device itself which
     cannot be stopped.

`(audio_mode 'query)'
     Lists the size of each waveform currently in the queue.
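The audio spooler described above can be driven from a script. Here is a minimal sketch (the sentences are invented; `SayText' is Festival's standard synthesize-and-play function):

```scheme
;; Queue several utterances asynchronously, then wait for the
;; queue to drain before continuing (e.g. before a script exits).
(audio_mode 'async)             ; start the audio spooler
(SayText "First sentence.")     ; returns once the audio is queued
(SayText "Second sentence.")    ; synthesized while the first plays
(audio_mode 'close)             ; block until all queued audio is played
```

Using `(audio_mode 'close)' rather than `(audio_mode 'shutup)' at the end ensures nothing in the queue is cut off.
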
File: festival.info,  Node: Voices,  Next: Tools,  Prev: Audio output,  Up: Top

Voices
******

This chapter gives some general suggestions about adding new voices to Festival. Festival attempts to offer an environment where new voices and languages can easily be slotted into the system.

* Menu:

* Current voices::          Currently available voices
* Building a new voice::    Building a new voice
* Defining a new voice::    Defining a new voice