This is festival.info, produced by Makeinfo version 3.12h from festival.texi.

This file documents the `Festival' Speech Synthesis System, a general text to speech system for making your computer talk and developing new synthesis techniques.

Copyright (C) 1996-2001 University of Edinburgh

Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are preserved on all copies. Permission is granted to copy and distribute modified versions of this manual under the conditions for verbatim copying, provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one. Permission is granted to copy and distribute translations of this manual into another language, under the above conditions for modified versions, except that this permission notice may be stated in a translation approved by the authors.

File: festival.info, Node: Utterance structure, Next: Utterance types, Up: Utterances

Utterance structure
===================

Festival's basic object for synthesis is the _utterance_. An utterance represents some chunk of text that is to be rendered as speech. In general you may think of it as a sentence, but in many cases it won't actually conform to the standard linguistic syntactic form of a sentence. In general the process of text to speech is to take an utterance which contains a simple string of characters and convert it step by step, filling out the utterance structure with more information until a waveform is built that says what the text contains.

The processes involved in conversion are, in general, as follows

_Tokenization_
     Converting the string of characters into a list of tokens. Typically this means whitespace separated tokens of the original text string.

_Token identification_
     Identification of general types for the tokens. Usually this is trivial but it requires some work to identify tokens of digits as years, dates, numbers etc.
_Token to word_
     Convert each token to zero or more words, expanding numbers, abbreviations etc.

_Part of speech_
     Identify the syntactic part of speech for the words.

_Prosodic phrasing_
     Chunk the utterance into prosodic phrases.

_Lexical lookup_
     Find the pronunciation of each word from a lexicon/letter to sound rule system, including phonetic and syllable structure.

_Intonational accents_
     Assign intonation accents to appropriate syllables.

_Assign duration_
     Assign a duration to each phone in the utterance.

_Generate F0 contour (tune)_
     Generate the tune based on accents etc.

_Render waveform_
     Render the waveform from phones, durations and F0 target values. This itself may take several steps including unit selection (be they diphones or other sized units), imposition of the desired prosody (duration and F0) and waveform reconstruction.

The number of steps and what actually happens may vary and is dependent on the particular voice selected and the utterance's _type_, see below. Each of these steps in Festival is achieved by a _module_ which will typically add new information to the utterance structure.

An utterance structure consists of a set of _items_ which may be part of one or more _relations_. Items represent things like words and phones, though they may also be used to represent less concrete objects like noun phrases, and nodes in metrical trees. An item contains a set of features (name and value). Relations are typically simple lists of items or trees of items. For example the `Word' relation is a simple list of items each of which represents a word in the utterance. Those words will also be in other relations, such as the `SylStructure' relation, where the word will be the top of a tree structure containing its syllables and segments.

Unlike previous versions of the system, items (then called stream items) are not bound to any particular relation (or stream): they are merely part of whichever relations they are within.
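This structure can be examined directly from a running Festival interpreter. The following sketch synthesizes a short text and lists the relations built and the items in the `Word' relation (the exact relations returned depend on the voice; `utt.synth', `utt.relationnames', `utt.relation.items' and `item.name' are the standard access functions described later in this chapter):

```scheme
;; A sketch, to be typed at the Festival prompt.
(set! utt1 (utt.synth (Utterance Text "hello world")))
(utt.relationnames utt1)       ; relations built, e.g. Token, Word, Segment ...
(mapcar item.name (utt.relation.items utt1 'Word))  ; the Word items
```

The same word items returned by the last call can also be viewed as roots of `SylStructure' trees, illustrating that one item may live in several relations at once.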
Importantly this allows much more general relations to be made over items than was allowed in the previous system. This new architecture is the continuation of our goal of providing a general, efficient structure for representing complex interrelated utterance objects. The architecture is fully general and new items and relations may be defined at run time, such that new modules may use any relations they wish. However within our standard English (and other) voices we have used a specific set of relations as follows.

_Token_
     a list of trees. This is first formed as a list of tokens found in a character text string. Each root's daughters are the `Word's that the token is related to.

_Word_
     a list of words. These items will also appear as daughters (leaf nodes) of the `Token' relation. They may also appear in the `Syntax' relation (as leafs) if the parser is used. They will also be leafs of the `Phrase' relation.

_Phrase_
     a list of trees. This is a list of phrase roots whose daughters are the `Word's within those phrases.

_Syntax_
     a single tree. This, if the probabilistic parser is called, is a syntactic binary branching tree over the members of the `Word' relation.

_SylStructure_
     a list of trees. This links the `Word', `Syllable' and `Segment' relations. Each `Word' is the root of a tree whose immediate daughters are its syllables, and their daughters in turn are its segments.

_Syllable_
     a list of syllables. Each member will also be in the `SylStructure' relation. In that relation its parent will be the word it is in and its daughters will be the segments that are in it. Syllables are also in the `Intonation' relation, giving links to their related intonation events.

_Segment_
     a list of segments (phones). Each member (except silences) will be a leaf node in the `SylStructure' relation. These may also be in the `Target' relation, linking them to F0 target points.

_IntEvent_
     a list of intonation events (accents and boundaries).
     These are related to syllables through the `Intonation' relation as leafs of that relation. Thus their parent in the `Intonation' relation is the syllable these events are attached to.

_Intonation_
     a list of trees relating syllables to intonation events. Roots of the trees in `Intonation' are `Syllable's and their daughters are `IntEvent's.

_Wave_
     a single item with a feature called `wave' whose value is the generated waveform.

This list is non-exhaustive: some modules may add other relations, and not all utterances will have all these relations, but the above is the general case.

File: festival.info, Node: Utterance types, Next: Example utterance types, Prev: Utterance structure, Up: Utterances

Utterance types
===============

The primary purpose of types is to define which modules are to be applied to an utterance. `UttTypes' are defined in `lib/synthesis.scm'. The function `defUttType' defines which modules are to be applied to an utterance of that type. The function `utt.synth', when called, applies this list of modules to an utterance before waveform synthesis is called.

For example when a `Segments' type utterance is synthesized it need only have its values loaded into a `Segment' relation and a `Target' relation; then the low level waveform synthesis module `Wave_Synth' is called. This is defined as follows

     (defUttType Segments
       (Initialize utt)
       (Wave_Synth utt))

A more complex type is the `Text' type utterance, which requires many more modules to be called before a waveform can be synthesized

     (defUttType Text
       (Initialize utt)
       (Text utt)
       (Token utt)
       (POS utt)
       (Phrasify utt)
       (Word utt)
       (Intonation utt)
       (Duration utt)
       (Int_Targets utt)
       (Wave_Synth utt))

The `Initialize' module should normally be called for all types. It loads the necessary relations from the input form and deletes all other relations (if any exist) ready for synthesis.
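As a concrete illustration of the `Segments' type in use (a sketch to be run inside Festival; the phone names assume the default English phone set, durations are in seconds and F0 targets in Hz):

```scheme
;; Build a Segments-type utterance by hand and synthesize it.
;; Only Initialize and Wave_Synth are applied for this type.
(set! utt1
      (Utterance Segments
                 ((#  0.19)
                  (h  0.055 (0 115))
                  (@  0.037 (0.018 136))
                  (l  0.064)
                  (ou 0.208 (0.0 134) (0.100 135) (0.208 123))
                  (#  0.19))))
(utt.synth utt1)
(utt.play utt1)   ; play the resulting waveform
```

Because the phones, durations and targets are given explicitly, no text analysis, duration or intonation modules are needed before waveform synthesis.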
Modules may be directly defined as C/C++ functions and declared with a Lisp name, or as simple functions in Lisp that check some global parameter before calling a specific module (e.g. choosing between different intonation modules).

These types are used when calling the function `utt.synth', and individual modules may be called explicitly by hand if required.

Because we expect waveform synthesis methods themselves to become complex, with a defined set of functions to select, join, and modify units, we now support an additional notion of `SynthTypes'. Like `UttTypes' these define a set of functions to apply to an utterance. These may be defined using the `defSynthType' function. For example

     (defSynthType Festival
       (print "synth method Festival")
       (print "select")
       (simple_diphone_select utt)
       (print "join")
       (cut_unit_join utt)
       (print "impose")
       (simple_impose utt)
       (simple_power utt)
       (print "synthesis")
       (frames_lpc_synthesis utt))

A `SynthType' is selected by naming it as the value of the parameter `Synth_Method'.

During the application of the function `utt.synth' three hooks are applied. This allows additional control of the synthesis process. `before_synth_hooks' is applied before any modules are applied. `after_analysis_hooks' is applied at the start of `Wave_Synth' when all text, linguistic and prosodic processing have been done. `after_synth_hooks' is applied after all modules have been applied. These are useful for things such as altering the volume of a voice that happens to be quieter than others, or outputting information for a talking head before waveform synthesis occurs, so that preparation of the facial frames and synthesis of the waveform may be done in parallel. (See `festival/examples/th-mode.scm' for an example use of these hooks for a talking head text mode.)
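For instance, an after-synthesis hook might be used to boost the gain of a voice that is quieter than others (a sketch; the factor 1.9 is an arbitrary example value):

```scheme
;; after_synth_hooks may be a single function or a list of
;; functions, each taking the utterance as its one argument.
(set! after_synth_hooks
      (list
       (lambda (utt)
         (utt.wave.rescale utt 1.9))))
```

Every utterance subsequently synthesized with `utt.synth' will have its waveform rescaled after all modules have run.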
File: festival.info, Node: Example utterance types, Next: Utterance modules, Prev: Utterance types, Up: Utterances

Example utterance types
=======================

A number of utterance types are currently supported. It is easy to add new ones but the standard distribution includes the following.

`Text'
     Raw text as a string.

          (Utterance Text "This is an example")

`Words'
     A list of words

          (Utterance Words (this is an example))

     Words may be atomic, or lists if further features need to be specified. For example to specify a word and its part of speech you can use

          (Utterance Words (I (live (pos v)) in (Reading (pos n) (tone H-H%))))

     Note: the use of the tone feature requires an intonation mode that supports it. Any feature and value named in the input will be added to the Word item.

`Phrase'
     This allows explicit phrasing and features on Tokens to be specified. The input consists of a list of phrases, each containing a list of tokens.

          (Utterance Phrase
                     ((Phrase ((name B))
                              I saw the man
                              (in ((EMPH 1)))
                              the park)
                      (Phrase ((name BB))
                              with the telescope)))

     ToBI tones and accents may also be specified on Tokens, but these will only take effect if the selected intonation method uses them.

`Segments'
     This allows specification of segments, durations and F0 target values.

          (Utterance Segments
                     ((# 0.19)
                      (h 0.055 (0 115))
                      (@ 0.037 (0.018 136))
                      (l 0.064)
                      (ou 0.208 (0.0 134) (0.100 135) (0.208 123))
                      (# 0.19)))

     Note the times are in _seconds_ NOT milliseconds. The format of each segment entry is: segment name, duration in seconds, and a list of target values. Each target value consists of a pair of a point into the segment (in seconds) and an F0 value in Hz.

`Phones'
     This allows a simple specification of a list of phones. Synthesis specifies fixed durations (specified in `FP_duration', default 100 ms) and monotone intonation (specified in `FP_F0', default 120 Hz). This may be used for simple checks for waveform synthesizers etc.
          (Utterance Phones (# h @ l ou #))

     Note the function `SayPhones' allows synthesis and playing of lists of phones through this utterance type.

`Wave'
     A waveform file. Synthesis here simply involves loading the file.

          (Utterance Wave fred.wav)

Others are supported, as defined in `lib/synthesis.scm', but are used internally by various parts of the system. These include `Tokens' used in TTS, and `SegF0' used by `utt.resynth'.

File: festival.info, Node: Utterance modules, Next: Accessing an utterance, Prev: Example utterance types, Up: Utterances

Utterance modules
=================

The module is the basic unit that does the work of synthesis. Within Festival there are duration modules, intonation modules, wave synthesis modules etc. As stated above, the utterance type defines the set of modules which are to be applied to the utterance. These modules in turn will create relations and items so that ultimately a waveform is generated, if required. Many of the chapters in this manual are solely concerned with particular modules in the system.

Note that many modules have internal choices, such as which duration method to use or which intonation method to use. Such general choices are often made through the `Parameter' system. Parameters may be set for different features like `Duration_Method', `Synth_Method' etc. Formerly the values for these parameters were atomic values but now they may be the functions themselves. For example, to select the Klatt duration rules

     (Parameter.set 'Duration_Method Duration_Klatt)

This allows new modules to be added without requiring changes to the central Lisp functions such as `Duration', `Intonation', and `Wave_Synth'.

File: festival.info, Node: Accessing an utterance, Next: Features, Prev: Utterance modules, Up: Utterances

Accessing an utterance
======================

There are a number of standard functions that allow one to access parts of an utterance and traverse through it.
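For example, a simple traversal might print each segment and its end time (a sketch using the access functions described in this node; `utt1' is assumed to hold an already synthesized utterance):

```scheme
;; Print "<phone> <end-time>" for every item in the Segment relation.
(mapcar
 (lambda (seg)
   (format t "%s %s\n" (item.name seg) (item.feat seg "end")))
 (utt.relation.items utt1 'Segment))
```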
Functions exist in Lisp (and of course C++) for accessing an utterance. The Lisp access functions are

`(utt.relationnames UTT)'
     returns a list of the names of the relations currently created in `UTT'.

`(utt.relation.items UTT RELATIONNAME)'
     returns a list of all items in `RELATIONNAME' in `UTT'. This is nil if no relation of that name exists. Note that for tree relations this will give the items in pre-order.

`(utt.relation_tree UTT RELATIONNAME)'
     A Lisp tree representation of the items in `RELATIONNAME' in `UTT'. The Lisp bracketing reflects the tree structure in the relation.

`(utt.relation.leafs UTT RELATIONNAME)'
     A list of all the leafs of the items in `RELATIONNAME' in `UTT'. Leafs are defined as those items with no daughters within that relation. For simple list relations `utt.relation.leafs' and `utt.relation.items' will return the same thing.

`(utt.relation.first UTT RELATIONNAME)'
     returns the first item in `RELATIONNAME'. Returns `nil' if this relation contains no items.

`(utt.relation.last UTT RELATIONNAME)'
     returns the last (the most next) item in `RELATIONNAME'. Returns `nil' if this relation contains no items.

`(item.feat ITEM FEATNAME)'
     returns the value of feature `FEATNAME' in `ITEM'. `FEATNAME' may be a feature name, feature function name, or pathname (see below), allowing reference to other parts of the utterance this item is in.

`(item.features ITEM)'
     Returns an assoc list of feature-value pairs of all local features on this item.

`(item.name ITEM)'
     Returns the name of this `ITEM'. This could also be accessed as `(item.feat ITEM 'name)'.

`(item.set_name ITEM NEWNAME)'
     Sets the name of `ITEM' to `NEWNAME'. This is equivalent to `(item.set_feat ITEM 'name NEWNAME)'.

`(item.set_feat ITEM FEATNAME FEATVALUE)'
     Sets the value of `FEATNAME' to `FEATVALUE' in `ITEM'. `FEATNAME' should be a simple name and not refer to next, previous or other relations via links.
`(item.relation ITEM RELATIONNAME)'
     Return the item as viewed from `RELATIONNAME', or `nil' if `ITEM' is not in that relation.

`(item.relationnames ITEM)'
     Return a list of relation names that this item is in.

`(item.relationname ITEM)'
     Return the relation name that this item is currently being viewed as.

`(item.next ITEM)'
     Return the next item in `ITEM''s current relation, or `nil' if there is no next.

`(item.prev ITEM)'
     Return the previous item in `ITEM''s current relation, or `nil' if there is no previous.

`(item.parent ITEM)'
     Return the parent of `ITEM' in `ITEM''s current relation, or `nil' if there is no parent.

`(item.daughter1 ITEM)'
     Return the first daughter of `ITEM' in `ITEM''s current relation, or `nil' if there are no daughters.

`(item.daughter2 ITEM)'
     Return the second daughter of `ITEM' in `ITEM''s current relation, or `nil' if there is no second daughter.

`(item.daughtern ITEM)'
     Return the last daughter of `ITEM' in `ITEM''s current relation, or `nil' if there are no daughters.

`(item.leafs ITEM)'
     Return a list of all leaf items (those with no daughters) dominated by this item.

`(item.next_leaf ITEM)'
     Find the next item in this relation that has no daughters. Note this may traverse up the tree from this point to search for such an item.

As from 1.2 the utterance structure may be fully manipulated from Scheme. Relations and items may be created and deleted as easily as they can in C++.

`(utt.relation.present UTT RELATIONNAME)'
     returns `t' if a relation named `RELATIONNAME' is present, `nil' otherwise.

`(utt.relation.create UTT RELATIONNAME)'
     Creates a new relation called `RELATIONNAME'. If this relation already exists it is deleted first and items in the relation are dereferenced from it (deleting the items if they are no longer referenced by any relation). Thus creating a relation guarantees an empty relation.

`(utt.relation.delete UTT RELATIONNAME)'
     Deletes the relation called `RELATIONNAME' in utt.
     All items in that relation are dereferenced from the relation and if they are no longer in any relation the items themselves are deleted.

`(utt.relation.append UTT RELATIONNAME ITEM)'
     Append `ITEM' to the end of the relation named `RELATIONNAME' in `UTT'. Returns `nil' if there is no relation named `RELATIONNAME' in `UTT', otherwise returns the item appended. This new item becomes the last in the top list. `ITEM' may be an item itself (in this or another relation) or a Lisp description of an item, which consists of a list containing a name and a set of feature-value pairs. If `ITEM' is `nil' or unspecified a new empty item is added. If `ITEM' is already in this relation it is dereferenced from its current position (and an empty item re-inserted).

`(item.insert ITEM1 ITEM2 DIRECTION)'
     Insert `ITEM2' into `ITEM1''s relation in the direction specified by `DIRECTION'. `DIRECTION' may take the values `before', `after', `above' and `below'. If unspecified, `after' is assumed. Note it is not recommended to insert `above' and `below'; the functions `item.insert_parent' and `item.append_daughter' should normally be used for tree building. Inserting using `before' and `after' within daughters is perfectly safe.

`(item.append_daughter PARENT DAUGHTER)'
     Append `DAUGHTER', an item or a description of an item, to the item `PARENT' in `PARENT''s relation.

`(item.insert_parent DAUGHTER NEWPARENT)'
     Insert a new parent above `DAUGHTER'. `NEWPARENT' may be an item or the description of an item.

`(item.delete ITEM)'
     Delete this item from all relations it is in. All daughters of this item in each relation are also removed from that relation (which may in turn cause them to be deleted if they cease to be referenced by any other relation).

`(item.relation.remove ITEM)'
     Remove this item from this relation, and any of its daughters. Other relations this item is in remain untouched.

`(item.move_tree FROM TO)'
     Move the item `FROM' to the position of `TO' in `TO''s relation.
     `FROM' will often be in the same relation as `TO' but that isn't necessary. The contents of `TO' are dereferenced: its daughters are saved, then descendants of `FROM' are recreated under the new `TO', then `TO''s previous daughters are dereferenced. The order of this is important as `FROM' may be part of `TO''s descendants. Note that if `TO' is part of `FROM''s descendants no moving occurs and `nil' is returned. For example, to remove all punctuation terminal nodes in the Syntax relation the call would be something like

          (define (syntax_remove_punc p)
            (if (string-equal "punc" (item.feat (item.daughter2 p) "pos"))
                (item.move_tree (item.daughter1 p) p)
                (mapcar syntax_remove_punc (item.daughters p))))

`(item.exchange_trees ITEM1 ITEM2)'
     Exchange `ITEM1' and `ITEM2' and their descendants in `ITEM2''s relation. If `ITEM1' is within `ITEM2''s descendants or vice versa, `nil' is returned and no exchange takes place. If `ITEM1' is not in `ITEM2''s relation, no exchange takes place.

Daughters of a node are actually represented as a list whose first daughter is doubly linked to the parent. Although being aware of this structure may be useful, it is recommended that all access go through the tree specific functions `*.parent' and `*.daughter*' which properly deal with the structure; thus if the internal structure ever changes in the future only these tree access functions need be updated.

With the above functions quite elaborate utterance manipulations can be performed, for example in post-lexical rules, where modifications to the segments are required based on the words and their context. *Note Post-lexical rules:: for an example of using various utterance access functions.

File: festival.info, Node: Features, Next: Utterance I/O, Prev: Accessing an utterance, Up: Utterances

Features
========

In previous versions items had a number of predefined features. This is no longer the case and all features are optional.
Particularly, the `start' and `end' features are no longer fixed, though those names are still used in the relations where they are appropriate. Specific functions are provided for the `name' feature but they are just shorthand for normal feature access. Simple features directly access the features in the underlying `EST_Feature' class in an item.

In addition to simple features there is a mechanism for relating functions to names; thus accessing a feature may actually call a function. For example the feature `num_syls' is defined as a feature function which will count the number of syllables in the given word, rather than simply access a pre-existing feature. Feature functions are usually dependent on the particular relation the item is in, e.g. some feature functions are only appropriate for items in the `Word' relation, or only appropriate for those in the `IntEvent' relation.

The third aspect of feature names is a path component. These are parts of the name (separated by `.') that indicate some traversal of the utterance structure. For example the feature `name' will access the name feature on the given item. The feature `n.name' will return the name feature on the next item (in that item's relation). A number of basic direction operators are defined.

`n.'
     next
`p.'
     previous
`nn.'
     next next
`pp.'
     previous previous
`parent.'
     parent
`daughter1.'
     first daughter
`daughter2.'
     second daughter
`daughtern.'
     last daughter
`first.'
     most previous item
`last.'
     most next item

Also you may specify traversal to another relation, through the `R:<relationname>.' operator. For example, given an item in the `Syllable' relation, `R:SylStructure.parent.name' would give the name of the word the syllable is in.

Some more complex examples are as follows, assuming we are starting from an item in the `Syllable' relation.
`stress'
     This item's lexical stress

`n.stress'
     The next syllable's lexical stress

`p.stress'
     The previous syllable's lexical stress

`R:SylStructure.parent.name'
     The word this syllable is in

`R:SylStructure.parent.R:Word.n.name'
     The word next to the word this syllable is in

`n.R:SylStructure.parent.name'
     The word the next syllable is in

`R:SylStructure.daughtern.ph_vc'
     The phonetic feature `vc' of the final segment in this syllable

A list of all feature functions is given in an appendix of this document. *Note Feature functions::. New functions may also be added in Lisp.

In C++ feature values are of class _EST_Val_ which may be a string, int, or float (or any arbitrary object). In Scheme this distinction cannot always be made and sometimes when you expect an int you actually get a string. Care should be taken to ensure the right matching functions are used in Scheme. It is recommended you use `string-append' or `string-match' as they will always work.

If a pathname does not identify a valid path for the particular item (e.g. there is no next) `"0"' is returned.

When collecting data from speech databases it is often useful to collect a whole set of features from all utterances in a database. These features can then be used for building various models (both CART tree models and linear regression modules use these feature names). A number of functions exist to help in this task. For example

     (utt.features utt1 'Word '(name pos p.pos n.pos))

will return a list of the name and part of speech context for each word in the utterance. *Note Extracting features:: for an example of extracting sets of features from a database for use in building stochastic models.

File: festival.info, Node: Utterance I/O, Prev: Features, Up: Utterances

Utterance I/O
=============

A number of functions are available to allow an utterance's structure to be made available for other programs.
The whole structure, all relations, items and features, may be saved in an ASCII format using the function `utt.save'. This file may be reloaded using the `utt.load' function. Note the waveform is not saved using this form.

Individual aspects of an utterance may be selectively saved. The waveform itself may be saved using the function `utt.save.wave'. This will save the waveform in the named file in the format specified in the `Parameter' `Wavefiletype'. All formats supported by the Edinburgh Speech Tools are valid, including `nist', `esps', `sun', `riff', `aiff', `raw' and `ulaw'. Note the functions `utt.wave.rescale' and `utt.wave.resample' may be used to change the gain and sample frequency of the waveform before saving it. A waveform may be imported into an existing utterance with the function `utt.import.wave'. This is specifically designed to allow external methods of waveform synthesis. However if you just wish to play an external wave or make it into an utterance you should consider the utterance `Wave' type.

The segments of an utterance may be saved in a file using the function `utt.save.segs', which saves the segments of the named utterance in xlabel format. Any other relation may also be saved using the more general `utt.save.relation', which takes the additional argument of a relation name. The name of each item and the end feature of each item are saved in the named file, again in xlabel format; other features are saved in extra fields. For more elaborate saving methods you can easily write a Scheme function to save data in an utterance in whatever format is required. See the file `lib/mbrola.scm' for an example.

A simple function to allow the displaying of an utterance in Entropic's Xwaves tool is provided by the function `display'. It simply saves the waveform and the segments and sends appropriate commands to (the already running) Xwaves and xlabel programs.
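Putting some of these together, a synthesized utterance might be saved as a RIFF (`.wav') waveform plus an xlabel segment file as follows (a sketch; `utt1' is assumed to hold an already synthesized utterance, and the filenames are arbitrary examples):

```scheme
(Parameter.set 'Wavefiletype 'riff)   ; choose the output waveform format
(utt.save.wave utt1 "example.wav")    ; save the waveform
(utt.save.segs utt1 "example.lab")    ; save segment timings in xlabel format
(utt.save utt1 "example.utt")         ; save the full structure (no waveform)
```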
A function to synthesize an externally specified utterance is provided by `utt.resynth', which takes two filename arguments, an xlabel segment file and an F0 file. This function loads, synthesizes and plays an utterance synthesized from these files. The loading is provided by the underlying function `utt.load.segf0'.

File: festival.info, Node: Text analysis, Next: POS tagging, Prev: Utterances, Up: Top

Text analysis
*************

* Menu:

* Tokenizing::                  Splitting text into tokens
* Token to word rules::
* Homograph disambiguation::    "Wed 5 may wind US Sen up"

File: festival.info, Node: Tokenizing, Next: Token to word rules, Up: Text analysis

Tokenizing
==========

A crucial stage in text processing is the initial tokenization of text. A _token_ in Festival is an atom separated by whitespace from a text file (or string). If punctuation for the current language is defined, characters matching that punctuation are removed from the beginning and end of a token and held as features of the token. The default list of characters to be treated as whitespace is defined as

     (defvar token.whitespace " \t\n\r")

While the default sets of punctuation characters are

     (defvar token.punctuation "\"'`.,:;!?(){}[]")
     (defvar token.prepunctuation "\"'`({[")

These are declared in `lib/token.scm' but may be changed for different languages, text modes etc.

File: festival.info, Node: Token to word rules, Next: Homograph disambiguation, Prev: Tokenizing, Up: Text analysis

Token to word rules
===================

Tokens are further analysed into lists of words. A word is an atom that can be given a pronunciation by the lexicon (or letter to sound rules). A token may give rise to a number of words or none at all. For example the basic tokens

     This pocket-watch was made in 1983.
would give a word relation of

     this pocket watch was made in nineteen eighty three

Because the relationship between tokens and words is in some cases complex, a user function may be specified for translating tokens into words. This is designed to deal with things like numbers, email addresses, and other non-obvious pronunciations of tokens as zero or more words. Currently a builtin function `builtin_english_token_to_words' offers much of the necessary functionality for English but a user may further customize this.

If the user defines a function `token_to_words', which takes two arguments, a token item and a token name, it will be called by the `Token_English' and `Token_Any' modules. A substantial example is given as `english_token_to_words' in `festival/lib/token.scm'. It is quite elaborate and covers most of the common multi-word tokens in English including numbers, money symbols, Roman numerals, dates, times, plurals of symbols, number ranges, telephone numbers and various other symbols.

Let us look at the treatment of one particular phenomenon which shows the use of these rules. Consider the expression "$12 million" which should be rendered as the words "twelve million dollars". Note the word "dollars", which is introduced by the "$" sign, ends up after the end of the expression. There are two cases we need to deal with as there are two tokens. The first clause of the `cond' below checks if the current token name is a money symbol and that the following word is a magnitude (million, billion, trillion, zillion etc.). If that is the case the "$" is removed and the remaining number is pronounced, by calling the builtin token to word function. The second clause deals with the second token. It confirms the previous token is a money value (using the same regular expression as before) and then returns the word followed by the word "dollars".
If it is neither of these forms then the builtin function is called.

     (define (token_to_words token name)
     "(token_to_words TOKEN NAME)
     Returns a list of words for NAME from TOKEN."
       (cond
        ((and (string-matches name "\\$[0-9,]+\\(\\.[0-9]+\\)?")
              (string-matches (item.feat token "n.name") ".*illion.?"))
         (builtin_english_token_to_words token (string-after name "$")))
        ((and (string-matches (item.feat token "p.name")
                              "\\$[0-9,]+\\(\\.[0-9]+\\)?")
              (string-matches name ".*illion.?"))
         (list name "dollars"))
        (t
         (builtin_english_token_to_words token name))))

It is valid to make some conditions return no words, though some care should be taken with that, as punctuation information may no longer be available to later processing if there are no words related to a token.

File: festival.info, Node: Homograph disambiguation, Prev: Token to word rules, Up: Text analysis

Homograph disambiguation
========================

Not all tokens can be rendered as words easily. Their context may affect the way they are to be pronounced. For example in the utterance

     On May 5 1985, 1985 people moved to Livingston.

the tokens "1985" should be pronounced differently: the first as a year, "nineteen eighty five", and the second as a quantity, "one thousand nine hundred and eighty five". Numbers may also be pronounced as ordinals, as in the "5" above; it should be "fifth" rather than "five".

Also, the pronunciation of certain words cannot simply be found from their orthographic form alone. Linguistic part of speech tags help to disambiguate a large class of homographs, e.g. "lives". A part of speech tagger is included in Festival and discussed in *Note POS tagging::. But even part of speech isn't sufficient in a number of cases. Words such as "bass", "wind", "bow" etc. cannot be distinguished by part of speech alone; some semantic information is also required. As full semantic analysis of text is outwith the realms of Festival's capabilities some other method for disambiguation is required.
Following the work of `yarowsky96' we have included a method by which
identified tokens may be further labelled with extra tags to help
identify their type.  Yarowsky uses _decision lists_ to identify
different types for homographs.  Decision lists are a restricted form
of decision trees which have some advantages over full trees: they are
easier to build, and Yarowsky has shown them to be adequate for
typical homograph resolution.

Using disambiguators
--------------------

Festival offers a method for assigning a `token_pos' feature to each
token.  It does so using Yarowsky-type disambiguation techniques.  A
list of disambiguators can be provided in the variable
`token_pos_cart_trees'.  Each disambiguator consists of a regular
expression and a CART tree (which may be a decision list, as they have
the same format).  If a token matches the regular expression the CART
tree is applied to the token and the resulting class is assigned to
the token via the feature `token_pos'.  This is done by the
`Token_POS' module.

For example, the following disambiguator distinguishes "St" (street
and saint) and "Dr" (doctor and drive).

     ("\\([dD][Rr]\\|[Ss][tT]\\)"
      ((n.name is 0)
       ((p.cap is 1)
        ((street))
        ((p.name matches "[0-9]*\\(1[sS][tT]\\|2[nN][dD]\\|3[rR][dD]\\|[0-9][tT][hH]\\)")
         ((street))
         ((title))))
       ((punc matches ".*,.*")
        ((street))
        ((p.punc matches ".*,.*")
         ((title))
         ((n.cap is 0)
          ((street))
          ((p.cap is 0)
           ((p.name matches "[0-9]*\\(1[sS][tT]\\|2[nN][dD]\\|3[rR][dD]\\|[0-9][tT][hH]\\)")
            ((street))
            ((title)))
           ((pp.name matches "[1-9][0-9]+")
            ((street))
            ((title)))))))))

Note that these only assign values for the feature `token_pos' and do
nothing more.  You must have a related token to word rule that
interprets this feature value and does the required translation.
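Such a tree only takes effect once it appears in
`token_pos_cart_trees'.  The following is an illustrative sketch of
one way to register it; the variable `st_dr_tree', assumed to hold the
tree above without its leading regular expression, is hypothetical,
and in practice the standard voices set `token_pos_cart_trees'
directly in their setup files.

     ;; Sketch only: `st_dr_tree' is assumed to hold the CART tree
     ;; shown above (minus the leading regular expression).  This
     ;; replaces any previously installed disambiguators.
     (set! token_pos_cart_trees
           (list
            (list "\\([dD][Rr]\\|[Ss][tT]\\)" st_dr_tree)))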
For example the corresponding token to word rule for the above
disambiguator is

     ((string-matches name "\\([dD][Rr]\\|[Ss][tT]\\)")
      (if (string-equal (item.feat token "token_pos") "street")
          (if (string-matches name "[dD][rR]")
              (list "drive")
              (list "street"))
          (if (string-matches name "[dD][rR]")
              (list "doctor")
              (list "saint"))))

Building disambiguators
-----------------------

Festival offers some support for building disambiguation trees.  The
basic method is to find all occurrences of a homographic token in a
large text database, label each occurrence into classes, extract
appropriate context features for these tokens and finally build a
classification tree or decision list based on the extracted features.

The extraction and building of trees is not yet a fully automated
process in Festival, but the file `festival/examples/toksearch.scm'
shows some basic Scheme code we use for extracting tokens from very
large collections of text.

The function `extract_tokens' does the real work.  It reads the given
file, token by token, into a token stream.  Each token is tested
against the desired tokens and if there is a match the named features
are extracted.  The token stream will be extended to provide the
necessary context.  Note that only some features will make any sense
in this situation.  There is only a token relation, so referring to
words, syllables etc. is not productive.

In this example databases are identified by a file that lists all the
files in the text databases.  Its name is expected to be
`bin/DBNAME.files' where `DBNAME' is the name of the database.  The
file should contain a list of filenames in the database, e.g. for the
Gutenberg texts the file `bin/Gutenberg.files' contains

     gutenberg/etext90/bill11.txt
     gutenberg/etext90/const11.txt
     gutenberg/etext90/getty11.txt
     gutenberg/etext90/jfk11.txt
     ...

Extracting the tokens is typically done in two passes.  The first pass
extracts the context (I've used 5 tokens either side).
It extracts the file and position, so the token is identified, and the
word in context.

Next those examples should be labelled with a small set of classes
which identify the type of the token.  For example for a token like
"Dr" whether it is a person's title or a street identifier.  Note that
hand-labelling can be laborious, though it is surprising how few
tokens of particular types actually exist in 62 million words.

The next task is to extract the tokens with the features that will
best distinguish the particular token.  In our "Dr" case this will
involve punctuation around the token, capitalisation of surrounding
tokens, etc.  After extracting the distinguishing features you must
line up the labels with these extracted features.  It would be easier
to extract both the context and the desired features at the same time,
but experience shows that in labelling, more appropriate features come
to mind that will distinguish classes better, and you don't want to
have to label twice.

Once a set of examples consisting of the label and features is created
it is easy to use `wagon' to create the corresponding decision tree or
decision list.  `wagon' supports both decision trees and decision
lists; it may be worth experimenting to find out which gives the best
results on some held out test data.  It appears that decision trees
are typically better, but are often much larger, and the size does not
always justify the sometimes only slightly better results.

File: festival.info, Node: POS tagging, Next: Phrase breaks, Prev: Text analysis, Up: Top

POS tagging
***********

Part of speech tagging is a fairly well-defined process.  Festival
includes a part of speech tagger following the HMM-type taggers as
found in the Xerox tagger and others (e.g. `DeRose88').  Part of
speech tags are assigned based on the probability distribution of tags
given a word, and on ngrams of tags.  These models are externally
specified and a Viterbi decoder is used to assign part of speech tags
at run time.
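The tagger is driven entirely by a handful of Scheme variables,
described below.  As an illustrative sketch only, a voice might set
them along these lines; every name and value here is an assumption
made up for the example, and the actual English settings live in
`lib/pos.scm'.

     ;; Sketch only: lexicon, ngram and tag names are invented for
     ;; illustration; see `lib/pos.scm' for the real English values.
     (set! pos_lex_name "example_poslex")
     (set! pos_ngram_name 'example_pos_ngram)
     (set! pos_p_start_tag "punc")   ;; sentence final punctuation tag
     (set! pos_pp_start_tag "nn")    ;; a common noun tag
     (set! pos_map
           '(((nn nns nnp nnps) n)          ;; collapse noun tags to n
             ((vb vbd vbg vbn vbp vbz) v))) ;; collapse verb tags to v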
So far this tagger has only been used for English but there is nothing
language specific about it.  The module `POS' assigns the tags.  It
accesses the following variables for parameterization.

`pos_lex_name'
     The name of a "lexicon" holding reverse probabilities of words
     given a tag (indexed by word).  If this is unset or has the value
     `NIL' no part of speech tagging takes place.

`pos_ngram_name'
     The name of a loaded ngram model of part of speech tags (loaded
     by `ngram.load').

`pos_p_start_tag'
     The name of the most likely tag before the start of an utterance.
     This is typically the tag for sentence final punctuation marks.

`pos_pp_start_tag'
     The name of the most likely tag two before the start of an
     utterance.  For English this is typically a simple noun, but for
     other languages it might be a verb.  If the ngram model is bigger
     than three this tag is effectively repeated for the previous left
     contexts.

`pos_map'
     We have found that it is often better to use a rich tagset for
     prediction of part of speech tags but that in later use (phrase
     breaks and dictionary lookup) a much more constrained tagset is
     better.  Thus mapping of the predicted tagset to a different
     tagset is supported.  `pos_map' should be a list of pairs
     consisting of a list of tags to be mapped and the new tag they
     are to be mapped to.

Note it is important to have the part of speech tagger match the tags
used in later parts of the system, particularly the lexicon.  Only two
of our lexicons used so far have (mappable) part of speech labels.

An example of the part of speech tagger for English can be found in
`lib/pos.scm'.

File: festival.info, Node: Phrase breaks, Next: Intonation, Prev: POS tagging, Up: Top

Phrase breaks
*************

There are two methods for predicting phrase breaks in Festival, one
simple and one sophisticated.  These two methods are selected through
the parameter `Phrase_Method' and phrasing is achieved by the module
`Phrasify'.

The first method is by CART tree.
If parameter `Phrase_Method' is `cart_tree', the CART tree in the
variable `phrase_cart_tree' is applied to each word to see if a break
should be inserted or not.  The tree should predict categories `BB'
(for big break), `B' (for break) or `NB' (for no break).  A simple
example of a tree to predict phrase breaks is given in the file
`lib/phrase.scm'.

     (set! simple_phrase_cart_tree
      '
      ((R:Token.parent.punc in ("?" "." ":"))
       ((BB))
       ((R:Token.parent.punc in ("'" "\"" "," ";"))
        ((B))
        ((n.name is 0)
         ((BB))
         ((NB))))))

The second and more elaborate method of phrase break prediction is
used when the parameter `Phrase_Method' is `prob_models'.  In this
case a probabilistic model, giving the probability of a break after a
word based on the part of speech of the neighbouring words and the
previous word, is combined with an ngram model of the distribution of
breaks and non-breaks, using a Viterbi decoder to find the optimal
phrasing of the utterance.  The results using this technique are good
and even show good results on unseen data from other researchers'
phrase break tests (see `black97b').  However sometimes it does sound
wrong, suggesting there is still further work required.

Parameters for this module are set through the feature list held in
the variable `phr_break_params', an example of which for English is
set in `english_phr_break_params' in the file `lib/phrase.scm'.  The
feature names and meanings are

`pos_ngram_name'
     The name of a loaded ngram that gives probability distributions
     of B/NB given previous, current and next part of speech.

`pos_ngram_filename'
     The filename containing `pos_ngram_name'.

`break_ngram_name'
     The name of a loaded ngram of B/NB distributions.  This is
     typically a 6 or 7-gram.

`break_ngram_filename'
     The filename containing `break_ngram_name'.

`gram_scale_s'
     A weighting factor for breaks in the break/non-break ngram.
     Increasing the value inserts more breaks, reducing it causes
     fewer breaks to be inserted.
`phrase_type_tree'
     A CART tree that is used to predict the type of break given the
     predicted break position.  This (rather crude) technique is
     currently used to distinguish major and minor breaks.

`break_tags'
     A list of the break tags (typically `(B NB)').

`pos_map'
     A part of speech map used to map the `pos' feature of words into
     a smaller tagset used by the phrase predictor.

File: festival.info, Node: Intonation, Next: Duration, Prev: Phrase breaks, Up: Top

Intonation
**********

A number of different intonation modules are available with varying
levels of control.  In general intonation is generated in two steps.

  1. Prediction of accents (and/or end tones) on a per syllable basis.

  2. Prediction of F0 target values; this must be done after durations
     are predicted.

Reflecting this split there are two main intonation modules that call
sub-modules depending on the desired intonation methods.  The
`Intonation' and `Int_Targets' modules are defined in Lisp
(`lib/intonation.scm') and call sub-modules which are (so far) in C++.

* Menu:

* Default intonation::    Effectively none at all.
* Simple intonation::     Accents and hats.
* Tree intonation::       Accents and Tones, and F0 prediction by LR
* Tilt intonation::       Using the Tilt intonation model
* General intonation::    A programmable intonation module
* Using ToBI::            A ToBI by rule example

File: festival.info, Node: Default intonation, Next: Simple intonation, Up: Intonation

Default intonation
==================

This is the simplest form of intonation and offers the modules
`Intonation_Default' and `Intonation_Targets_Default'.  The first of
these actually does nothing at all.  `Intonation_Targets_Default'
simply creates a target at the start of the utterance, and one at the
end.  By default these values are 130 Hz and 110 Hz.  These values may
be set through the parameter `duffint_params'; for example the
following will generate a monotone at 150 Hz.

     (set! duffint_params '((start 150) (end 150)))
     (Parameter.set 'Int_Method 'DuffInt)
     (Parameter.set 'Int_Target_Method Int_Targets_Default)

File: festival.info, Node: Simple intonation, Next: Tree intonation, Prev: Default intonation, Up: Intonation

Simple intonation
=================

This module uses the CART tree in `int_accent_cart_tree' to predict if
each syllable is accented or not.  A predicted value of `NONE' means
no accent is generated by the corresponding `Int_Targets_Simple'
function.  Any other predicted value will cause a `hat' accent to be
put on that syllable.

A default `int_accent_cart_tree' is available in the value
`simple_accent_cart_tree' in `lib/intonation.scm'.  It simply predicts
accents on the stressed syllables of content words in poly-syllabic
words, and on the only syllable in single syllable content words.  Its
form is

     (set! simple_accent_cart_tree
      '
      ((R:SylStructure.parent.gpos is content)
       ((stress is 1)
        ((Accented))
        ((position_type is single)
         ((Accented))
         ((NONE))))
       ((NONE))))

The function `Int_Targets_Simple' uses parameters in the a-list in the
variable `int_simple_params'.  There are two interesting parameters:
`f0_mean', which gives the mean F0 for this speaker (default 110 Hz),
and `f0_std', the standard deviation of F0 for this speaker (default
25 Hz).  This second value is used to determine the amount of
variation to be put in the generated targets.

For each Phrase in the given utterance an F0 is generated starting at
`f0_mean+(f0_std*0.6)' and declining `f0_std' Hz over the length of
the phrase until the last syllable, whose end is set to
`f0_mean-f0_std'.  An imaginary line called `baseline' is drawn from
the start to the end (minus the final extra fall).  For each syllable
that is accented (i.e. has an IntEvent related to it) three targets
are added: one at the start, one in mid vowel, and one at the end.
The start and end are at position `baseline' Hz (as declined for that
syllable) and the mid vowel is set to `baseline+f0_std'.
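The global pitch behaviour of this module can thus be adjusted simply
by resetting the a-list; for example, for a speaker with a higher mean
pitch and a livelier range (the values here are purely illustrative):

     ;; Illustrative values only: raise the mean F0 and widen the
     ;; variation used when generating targets.
     (set! int_simple_params
           '((f0_mean 130) (f0_std 30)))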
Note this model is not supposed to be complex or comprehensive but it
offers a very quick and easy way to generate something other than a
fixed line F0.  Something similar to this has been used for Spanish
and Welsh without (too many) people complaining.  However it is not
designed as a serious intonation module.

File: festival.info, Node: Tree intonation, Next: Tilt intonation, Prev: Simple intonation, Up: Intonation

Tree intonation
===============

This module is more flexible.  Two different CART trees can be used to
predict `accents' and `endtones'.  Although at present this module is
used for an implementation of the ToBI intonation labelling system it
could be used for many different types of intonation system.

The target module for this method uses a linear regression model to
predict start, mid-vowel and end targets for each syllable using
arbitrarily specified features.  This follows the work described in
`black96'.  The LR models are held as described in *Note Linear
regression::.  Three models are used, in the variables `f0_lr_start',
`f0_lr_mid' and `f0_lr_end'.

File: festival.info, Node: Tilt intonation, Next: General intonation, Prev: Tree intonation, Up: Intonation

Tilt intonation
===============

Tilt description to be inserted.