This is festival.info, produced by Makeinfo version 3.12h from festival.texi.

This file documents the `Festival' Speech Synthesis System, a general text to speech system for making your computer talk and developing new synthesis techniques.

Copyright (C) 1996-2001 University of Edinburgh

Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are preserved on all copies. Permission is granted to copy and distribute modified versions of this manual under the conditions for verbatim copying, provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one. Permission is granted to copy and distribute translations of this manual into another language, under the above conditions for modified versions, except that this permission notice may be stated in a translation approved by the authors.

File: festival.info, Node: Utterance structure, Next: Utterance types, Up: Utterances

Utterance structure
===================

Festival's basic object for synthesis is the _utterance_. An utterance represents some chunk of text that is to be rendered as speech. In general you may think of it as a sentence, but in many cases it won't actually conform to the standard linguistic syntactic form of a sentence. In general the process of text to speech is to take an utterance which contains a simple string of characters and convert it step by step, filling out the utterance structure with more information until a waveform is built that says what the text contains.

The processes involved in conversion are, in general, as follows

_Tokenization_
     Converting the string of characters into a list of tokens. Typically this means whitespace separated tokens of the original text string.

_Token identification_
     Identification of general types for the tokens. Usually this is trivial but it requires some work to identify tokens of digits as years, dates, numbers etc.
_Token to word_
     Convert each token to zero or more words, expanding numbers, abbreviations etc.

_Part of speech_
     Identify the syntactic part of speech for the words.

_Prosodic phrasing_
     Chunk the utterance into prosodic phrases.

_Lexical lookup_
     Find the pronunciation of each word from a lexicon/letter to sound rule system, including phonetic and syllable structure.

_Intonational accents_
     Assign intonation accents to appropriate syllables.

_Assign duration_
     Assign a duration to each phone in the utterance.

_Generate F0 contour (tune)_
     Generate the tune based on accents etc.

_Render waveform_
     Render the waveform from phones, durations and F0 target values. This itself may take several steps including unit selection (be they diphones or other sized units), imposition of the desired prosody (duration and F0) and waveform reconstruction.

The number of steps and what actually happens may vary and is dependent on the particular voice selected and the utterance's _type_, see below. Each of these steps in Festival is achieved by a _module_ which will typically add new information to the utterance structure.

An utterance structure consists of a set of _items_ which may be part of one or more _relations_. Items represent things like words and phones, though they may also be used to represent less concrete objects like noun phrases, and nodes in metrical trees. An item contains a set of features (name and value). Relations are typically simple lists of items or trees of items. For example the `Word' relation is a simple list of items each of which represents a word in the utterance. Those words will also be in other relations, such as the `SylStructure' relation, where the word will be the top of a tree structure containing its syllables and segments.

Unlike previous versions of the system, items (then called stream items) are not bound to any particular relation (or stream): they are merely part of whichever relations they are within.
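This structure can be examined directly from a running Festival interpreter. The following sketch synthesizes a short text and lists the relations built and the items in the `Word' relation (the exact relations returned depend on the voice; `utt.synth', `utt.relationnames', `utt.relation.items' and `item.name' are the standard access functions described later in this chapter):

```scheme
;; A sketch, to be typed at the Festival prompt.
(set! utt1 (utt.synth (Utterance Text "hello world")))
(utt.relationnames utt1)       ; relations built, e.g. Token, Word, Segment ...
(mapcar item.name (utt.relation.items utt1 'Word))  ; the Word items
```

The same word items returned by the last call can also be viewed as roots of `SylStructure' trees, illustrating that one item may live in several relations at once.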
Importantly this allows much more general relations to be made over items than was allowed in the previous system. This new architecture is the continuation of our goal of providing a general, efficient structure for representing complex interrelated utterance objects. The architecture is fully general and new items and relations may be defined at run time, such that new modules may use any relations they wish. However within our standard English (and other) voices we have used a specific set of relations as follows.

_Token_
     a list of trees. This is first formed as a list of tokens found in a character text string. Each root's daughters are the `Word's that the token is related to.

_Word_
     a list of words. These items will also appear as daughters (leaf nodes) of the `Token' relation. They may also appear in the `Syntax' relation (as leafs) if the parser is used. They will also be leafs of the `Phrase' relation.

_Phrase_
     a list of trees. This is a list of phrase roots whose daughters are the `Word's within those phrases.

_Syntax_
     a single tree. This, if the probabilistic parser is called, is a syntactic binary branching tree over the members of the `Word' relation.

_SylStructure_
     a list of trees. This links the `Word', `Syllable' and `Segment' relations. Each `Word' is the root of a tree whose immediate daughters are its syllables, and their daughters in turn are its segments.

_Syllable_
     a list of syllables. Each member will also be in the `SylStructure' relation. In that relation its parent will be the word it is in and its daughters will be the segments that are in it. Syllables are also in the `Intonation' relation, giving links to their related intonation events.

_Segment_
     a list of segments (phones). Each member (except silences) will be a leaf node in the `SylStructure' relation. These may also be in the `Target' relation, linking them to F0 target points.

_IntEvent_
     a list of intonation events (accents and boundaries).
     These are related to syllables through the `Intonation' relation as leafs of that relation. Thus their parent in the `Intonation' relation is the syllable these events are attached to.

_Intonation_
     a list of trees relating syllables to intonation events. Roots of the trees in `Intonation' are `Syllable's and their daughters are `IntEvent's.

_Wave_
     a single item with a feature called `wave' whose value is the generated waveform.

This list is non-exhaustive: some modules may add other relations, and not all utterances will have all these relations, but the above is the general case.

File: festival.info, Node: Utterance types, Next: Example utterance types, Prev: Utterance structure, Up: Utterances

Utterance types
===============

The primary purpose of types is to define which modules are to be applied to an utterance. `UttTypes' are defined in `lib/synthesis.scm'. The function `defUttType' defines which modules are to be applied to an utterance of that type. The function `utt.synth', when called, applies this list of modules to an utterance before waveform synthesis is called.

For example when a `Segments' type utterance is synthesized it need only have its values loaded into a `Segment' relation and a `Target' relation; then the low level waveform synthesis module `Wave_Synth' is called. This is defined as follows

     (defUttType Segments
       (Initialize utt)
       (Wave_Synth utt))

A more complex type is the `Text' type utterance, which requires many more modules to be called before a waveform can be synthesized

     (defUttType Text
       (Initialize utt)
       (Text utt)
       (Token utt)
       (POS utt)
       (Phrasify utt)
       (Word utt)
       (Intonation utt)
       (Duration utt)
       (Int_Targets utt)
       (Wave_Synth utt))

The `Initialize' module should normally be called for all types. It loads the necessary relations from the input form and deletes all other relations (if any exist) ready for synthesis.
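As a concrete illustration of the `Segments' type in use (a sketch to be run inside Festival; the phone names assume the default English phone set, durations are in seconds and F0 targets in Hz):

```scheme
;; Build a Segments-type utterance by hand and synthesize it.
;; Only Initialize and Wave_Synth are applied for this type.
(set! utt1
      (Utterance Segments
                 ((#  0.19)
                  (h  0.055 (0 115))
                  (@  0.037 (0.018 136))
                  (l  0.064)
                  (ou 0.208 (0.0 134) (0.100 135) (0.208 123))
                  (#  0.19))))
(utt.synth utt1)
(utt.play utt1)   ; play the resulting waveform
```

Because the phones, durations and targets are given explicitly, no text analysis, duration or intonation modules are needed before waveform synthesis.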
Modules may be directly defined as C/C++ functions and declared with a Lisp name, or as simple functions in Lisp that check some global parameter before calling a specific module (e.g. choosing between different intonation modules).

These types are used when calling the function `utt.synth', and individual modules may be called explicitly by hand if required.

Because we expect waveform synthesis methods themselves to become complex, with a defined set of functions to select, join, and modify units, we now support an additional notion of `SynthTypes'. Like `UttTypes' these define a set of functions to apply to an utterance. These may be defined using the `defSynthType' function. For example

     (defSynthType Festival
       (print "synth method Festival")
       (print "select")
       (simple_diphone_select utt)
       (print "join")
       (cut_unit_join utt)
       (print "impose")
       (simple_impose utt)
       (simple_power utt)
       (print "synthesis")
       (frames_lpc_synthesis utt))

A `SynthType' is selected by naming it as the value of the parameter `Synth_Method'.

During the application of the function `utt.synth' three hooks are applied. This allows additional control of the synthesis process. `before_synth_hooks' is applied before any modules are applied. `after_analysis_hooks' is applied at the start of `Wave_Synth' when all text, linguistic and prosodic processing have been done. `after_synth_hooks' is applied after all modules have been applied. These are useful for things such as altering the volume of a voice that happens to be quieter than others, or outputting information for a talking head before waveform synthesis occurs, so that preparation of the facial frames and synthesis of the waveform may be done in parallel. (See `festival/examples/th-mode.scm' for an example use of these hooks for a talking head text mode.)
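For instance, an after-synthesis hook might be used to boost the gain of a voice that is quieter than others (a sketch; the factor 1.9 is an arbitrary example value):

```scheme
;; after_synth_hooks may be a single function or a list of
;; functions, each taking the utterance as its one argument.
(set! after_synth_hooks
      (list
       (lambda (utt)
         (utt.wave.rescale utt 1.9))))
```

Every utterance subsequently synthesized with `utt.synth' will have its waveform rescaled after all modules have run.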
File: festival.info, Node: Example utterance types, Next: Utterance modules, Prev: Utterance types, Up: Utterances

Example utterance types
=======================

A number of utterance types are currently supported. It is easy to add new ones but the standard distribution includes the following.

`Text'
     Raw text as a string.

          (Utterance Text "This is an example")

`Words'
     A list of words

          (Utterance Words (this is an example))

     Words may be atomic, or lists if further features need to be specified. For example to specify a word and its part of speech you can use

          (Utterance Words (I (live (pos v)) in (Reading (pos n) (tone H-H%))))

     Note: the use of the tone feature requires an intonation mode that supports it. Any feature and value named in the input will be added to the Word item.

`Phrase'
     This allows explicit phrasing and features on Tokens to be specified. The input consists of a list of phrases, each containing a list of tokens.

          (Utterance Phrase
                     ((Phrase ((name B))
                              I saw the man
                              (in ((EMPH 1)))
                              the park)
                      (Phrase ((name BB))
                              with the telescope)))

     ToBI tones and accents may also be specified on Tokens, but these will only take effect if the selected intonation method uses them.

`Segments'
     This allows specification of segments, durations and F0 target values.

          (Utterance Segments
                     ((# 0.19)
                      (h 0.055 (0 115))
                      (@ 0.037 (0.018 136))
                      (l 0.064)
                      (ou 0.208 (0.0 134) (0.100 135) (0.208 123))
                      (# 0.19)))

     Note the times are in _seconds_ NOT milliseconds. The format of each segment entry is: segment name, duration in seconds, and a list of target values. Each target value consists of a pair of a point into the segment (in seconds) and an F0 value in Hz.

`Phones'
     This allows a simple specification of a list of phones. Synthesis specifies fixed durations (specified in `FP_duration', default 100 ms) and monotone intonation (specified in `FP_F0', default 120 Hz). This may be used for simple checks for waveform synthesizers etc.
          (Utterance Phones (# h @ l ou #))

     Note the function `SayPhones' allows synthesis and playing of lists of phones through this utterance type.

`Wave'
     A waveform file. Synthesis here simply involves loading the file.

          (Utterance Wave fred.wav)

Others are supported, as defined in `lib/synthesis.scm', but are used internally by various parts of the system. These include `Tokens' used in TTS, and `SegF0' used by `utt.resynth'.

File: festival.info, Node: Utterance modules, Next: Accessing an utterance, Prev: Example utterance types, Up: Utterances

Utterance modules
=================

The module is the basic unit that does the work of synthesis. Within Festival there are duration modules, intonation modules, wave synthesis modules etc. As stated above, the utterance type defines the set of modules which are to be applied to the utterance. These modules in turn will create relations and items so that ultimately a waveform is generated, if required. Many of the chapters in this manual are solely concerned with particular modules in the system.

Note that many modules have internal choices, such as which duration method to use or which intonation method to use. Such general choices are often made through the `Parameter' system. Parameters may be set for different features like `Duration_Method', `Synth_Method' etc. Formerly the values for these parameters were atomic values but now they may be the functions themselves. For example, to select the Klatt duration rules

     (Parameter.set 'Duration_Method Duration_Klatt)

This allows new modules to be added without requiring changes to the central Lisp functions such as `Duration', `Intonation', and `Wave_Synth'.

File: festival.info, Node: Accessing an utterance, Next: Features, Prev: Utterance modules, Up: Utterances

Accessing an utterance
======================

There are a number of standard functions that allow one to access parts of an utterance and traverse through it.
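For example, a simple traversal might print each segment and its end time (a sketch using the access functions described in this node; `utt1' is assumed to hold an already synthesized utterance):

```scheme
;; Print "<phone> <end-time>" for every item in the Segment relation.
(mapcar
 (lambda (seg)
   (format t "%s %s\n" (item.name seg) (item.feat seg "end")))
 (utt.relation.items utt1 'Segment))
```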
Functions exist in Lisp (and of course C++) for accessing an utterance. The Lisp access functions are

`(utt.relationnames UTT)'
     returns a list of the names of the relations currently created in `UTT'.

`(utt.relation.items UTT RELATIONNAME)'
     returns a list of all items in `RELATIONNAME' in `UTT'. This is nil if no relation of that name exists. Note that for tree relations this will give the items in pre-order.

`(utt.relation_tree UTT RELATIONNAME)'
     A Lisp tree representation of the items in `RELATIONNAME' in `UTT'. The Lisp bracketing reflects the tree structure in the relation.

`(utt.relation.leafs UTT RELATIONNAME)'
     A list of all the leafs of the items in `RELATIONNAME' in `UTT'. Leafs are defined as those items with no daughters within that relation. For simple list relations `utt.relation.leafs' and `utt.relation.items' will return the same thing.

`(utt.relation.first UTT RELATIONNAME)'
     returns the first item in `RELATIONNAME'. Returns `nil' if this relation contains no items.

`(utt.relation.last UTT RELATIONNAME)'
     returns the last (the most next) item in `RELATIONNAME'. Returns `nil' if this relation contains no items.

`(item.feat ITEM FEATNAME)'
     returns the value of feature `FEATNAME' in `ITEM'. `FEATNAME' may be a feature name, feature function name, or pathname (see below), allowing reference to other parts of the utterance this item is in.

`(item.features ITEM)'
     Returns an assoc list of feature-value pairs of all local features on this item.

`(item.name ITEM)'
     Returns the name of this `ITEM'. This could also be accessed as `(item.feat ITEM 'name)'.

`(item.set_name ITEM NEWNAME)'
     Sets the name of `ITEM' to `NEWNAME'. This is equivalent to `(item.set_feat ITEM 'name NEWNAME)'.

`(item.set_feat ITEM FEATNAME FEATVALUE)'
     Sets the value of `FEATNAME' to `FEATVALUE' in `ITEM'. `FEATNAME' should be a simple name and not refer to next, previous or other relations via links.
`(item.relation ITEM RELATIONNAME)'
     Return the item as viewed from `RELATIONNAME', or `nil' if `ITEM' is not in that relation.

`(item.relationnames ITEM)'
     Return a list of relation names that this item is in.

`(item.relationname ITEM)'
     Return the relation name that this item is currently being viewed as.

`(item.next ITEM)'
     Return the next item in `ITEM''s current relation, or `nil' if there is no next.

`(item.prev ITEM)'
     Return the previous item in `ITEM''s current relation, or `nil' if there is no previous.

`(item.parent ITEM)'
     Return the parent of `ITEM' in `ITEM''s current relation, or `nil' if there is no parent.

`(item.daughter1 ITEM)'
     Return the first daughter of `ITEM' in `ITEM''s current relation, or `nil' if there are no daughters.

`(item.daughter2 ITEM)'
     Return the second daughter of `ITEM' in `ITEM''s current relation, or `nil' if there is no second daughter.

`(item.daughtern ITEM)'
     Return the last daughter of `ITEM' in `ITEM''s current relation, or `nil' if there are no daughters.

`(item.leafs ITEM)'
     Return a list of all leaf items (those with no daughters) dominated by this item.

`(item.next_leaf ITEM)'
     Find the next item in this relation that has no daughters. Note this may traverse up the tree from this point to search for such an item.

As from 1.2 the utterance structure may be fully manipulated from Scheme. Relations and items may be created and deleted as easily as they can in C++.

`(utt.relation.present UTT RELATIONNAME)'
     returns `t' if a relation named `RELATIONNAME' is present, `nil' otherwise.

`(utt.relation.create UTT RELATIONNAME)'
     Creates a new relation called `RELATIONNAME'. If this relation already exists it is deleted first and items in the relation are dereferenced from it (deleting the items if they are no longer referenced by any relation). Thus creating a relation guarantees an empty relation.

`(utt.relation.delete UTT RELATIONNAME)'
     Deletes the relation called `RELATIONNAME' in utt.
     All items in that relation are dereferenced from the relation and if they are no longer in any relation the items themselves are deleted.

`(utt.relation.append UTT RELATIONNAME ITEM)'
     Append `ITEM' to the end of the relation named `RELATIONNAME' in `UTT'. Returns `nil' if there is no relation named `RELATIONNAME' in `UTT', otherwise returns the item appended. This new item becomes the last in the top list. `ITEM' may be an item itself (in this or another relation) or a Lisp description of an item, which consists of a list containing a name and a set of feature-value pairs. If `ITEM' is `nil' or unspecified a new empty item is added. If `ITEM' is already in this relation it is dereferenced from its current position (and an empty item re-inserted).

`(item.insert ITEM1 ITEM2 DIRECTION)'
     Insert `ITEM2' into `ITEM1''s relation in the direction specified by `DIRECTION'. `DIRECTION' may take the values `before', `after', `above' and `below'. If unspecified, `after' is assumed. Note it is not recommended to insert `above' and `below'; the functions `item.insert_parent' and `item.append_daughter' should normally be used for tree building. Inserting using `before' and `after' within daughters is perfectly safe.

`(item.append_daughter PARENT DAUGHTER)'
     Append `DAUGHTER', an item or a description of an item, to the item `PARENT' in `PARENT''s relation.

`(item.insert_parent DAUGHTER NEWPARENT)'
     Insert a new parent above `DAUGHTER'. `NEWPARENT' may be an item or the description of an item.

`(item.delete ITEM)'
     Delete this item from all relations it is in. All daughters of this item in each relation are also removed from that relation (which may in turn cause them to be deleted if they cease to be referenced by any other relation).

`(item.relation.remove ITEM)'
     Remove this item from this relation, and any of its daughters. Other relations this item is in remain untouched.

`(item.move_tree FROM TO)'
     Move the item `FROM' to the position of `TO' in `TO''s relation.
     `FROM' will often be in the same relation as `TO' but that isn't necessary. The contents of `TO' are dereferenced: its daughters are saved, then descendants of `FROM' are recreated under the new `TO', then `TO''s previous daughters are dereferenced. The order of this is important as `FROM' may be part of `TO''s descendants. Note that if `TO' is part of `FROM''s descendants no moving occurs and `nil' is returned. For example, to remove all punctuation terminal nodes in the Syntax relation the call would be something like

          (define (syntax_remove_punc p)
            (if (string-equal "punc" (item.feat (item.daughter2 p) "pos"))
                (item.move_tree (item.daughter1 p) p)
                (mapcar syntax_remove_punc (item.daughters p))))

`(item.exchange_trees ITEM1 ITEM2)'
     Exchange `ITEM1' and `ITEM2' and their descendants in `ITEM2''s relation. If `ITEM1' is within `ITEM2''s descendants or vice versa, `nil' is returned and no exchange takes place. If `ITEM1' is not in `ITEM2''s relation, no exchange takes place.

Daughters of a node are actually represented as a list whose first daughter is doubly linked to the parent. Although being aware of this structure may be useful, it is recommended that all access go through the tree specific functions `*.parent' and `*.daughter*' which properly deal with the structure; thus if the internal structure ever changes in the future only these tree access functions need be updated.

With the above functions quite elaborate utterance manipulations can be performed, for example in post-lexical rules, where modifications to the segments are required based on the words and their context. *Note Post-lexical rules:: for an example of using various utterance access functions.

File: festival.info, Node: Features, Next: Utterance I/O, Prev: Accessing an utterance, Up: Utterances

Features
========

In previous versions items had a number of predefined features. This is no longer the case and all features are optional.
Particularly, the `start' and `end' features are no longer fixed, though those names are still used in the relations where they are appropriate. Specific functions are provided for the `name' feature but they are just shorthand for normal feature access. Simple features directly access the features in the underlying `EST_Feature' class in an item.

In addition to simple features there is a mechanism for relating functions to names; thus accessing a feature may actually call a function. For example the feature `num_syls' is defined as a feature function which will count the number of syllables in the given word, rather than simply access a pre-existing feature. Feature functions are usually dependent on the particular relation the item is in, e.g. some feature functions are only appropriate for items in the `Word' relation, or only appropriate for those in the `IntEvent' relation.

The third aspect of feature names is a path component. These are parts of the name (separated by `.') that indicate some traversal of the utterance structure. For example the feature `name' will access the name feature on the given item. The feature `n.name' will return the name feature on the next item (in that item's relation). A number of basic direction operators are defined.

`n.'
     next
`p.'
     previous
`nn.'
     next next
`pp.'
     previous previous
`parent.'
     parent
`daughter1.'
     first daughter
`daughter2.'
     second daughter
`daughtern.'
     last daughter
`first.'
     most previous item
`last.'
     most next item

Also you may specify traversal to another relation, through the `R:<relationname>.' operator. For example, given an item in the `Syllable' relation, `R:SylStructure.parent.name' would give the name of the word the syllable is in.

Some more complex examples are as follows, assuming we are starting from an item in the `Syllable' relation.
`stress'
     This item's lexical stress

`n.stress'
     The next syllable's lexical stress

`p.stress'
     The previous syllable's lexical stress

`R:SylStructure.parent.name'
     The word this syllable is in

`R:SylStructure.parent.R:Word.n.name'
     The word next to the word this syllable is in

`n.R:SylStructure.parent.name'
     The word the next syllable is in

`R:SylStructure.daughtern.ph_vc'
     The phonetic feature `vc' of the final segment in this syllable

A list of all feature functions is given in an appendix of this document. *Note Feature functions::. New functions may also be added in Lisp.

In C++ feature values are of class _EST_Val_ which may be a string, int, or float (or any arbitrary object). In Scheme this distinction cannot always be made and sometimes when you expect an int you actually get a string. Care should be taken to ensure the right matching functions are used in Scheme. It is recommended you use `string-append' or `string-match' as they will always work.

If a pathname does not identify a valid path for the particular item (e.g. there is no next) `"0"' is returned.

When collecting data from speech databases it is often useful to collect a whole set of features from all utterances in a database. These features can then be used for building various models (both CART tree models and linear regression modules use these feature names). A number of functions exist to help in this task. For example

     (utt.features utt1 'Word '(name pos p.pos n.pos))

will return a list of the name and part of speech context for each word in the utterance. *Note Extracting features:: for an example of extracting sets of features from a database for use in building stochastic models.

File: festival.info, Node: Utterance I/O, Prev: Features, Up: Utterances

Utterance I/O
=============

A number of functions are available to allow an utterance's structure to be made available for other programs.
The whole structure, all relations, items and features, may be saved in an ASCII format using the function `utt.save'. This file may be reloaded using the `utt.load' function. Note the waveform is not saved using this form.

Individual aspects of an utterance may be selectively saved. The waveform itself may be saved using the function `utt.save.wave'. This will save the waveform in the named file in the format specified in the `Parameter' `Wavefiletype'. All formats supported by the Edinburgh Speech Tools are valid, including `nist', `esps', `sun', `riff', `aiff', `raw' and `ulaw'. Note the functions `utt.wave.rescale' and `utt.wave.resample' may be used to change the gain and sample frequency of the waveform before saving it. A waveform may be imported into an existing utterance with the function `utt.import.wave'. This is specifically designed to allow external methods of waveform synthesis. However if you just wish to play an external wave or make it into an utterance you should consider the utterance `Wave' type.

The segments of an utterance may be saved in a file using the function `utt.save.segs', which saves the segments of the named utterance in xlabel format. Any other relation may also be saved using the more general `utt.save.relation', which takes the additional argument of a relation name. The name of each item and the end feature of each item are saved in the named file, again in xlabel format; other features are saved in extra fields. For more elaborate saving methods you can easily write a Scheme function to save data in an utterance in whatever format is required. See the file `lib/mbrola.scm' for an example.

A simple function to allow the displaying of an utterance in Entropic's Xwaves tool is provided by the function `display'. It simply saves the waveform and the segments and sends appropriate commands to (the already running) Xwaves and xlabel programs.
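Putting some of these together, a synthesized utterance might be saved as a RIFF (`.wav') waveform plus an xlabel segment file as follows (a sketch; `utt1' is assumed to hold an already synthesized utterance, and the filenames are arbitrary examples):

```scheme
(Parameter.set 'Wavefiletype 'riff)   ; choose the output waveform format
(utt.save.wave utt1 "example.wav")    ; save the waveform
(utt.save.segs utt1 "example.lab")    ; save segment timings in xlabel format
(utt.save utt1 "example.utt")         ; save the full structure (no waveform)
```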
A function to synthesize an externally specified utterance is provided by `utt.resynth', which takes two filename arguments, an xlabel segment file and an F0 file. This function loads, synthesizes and plays an utterance synthesized from these files. The loading is provided by the underlying function `utt.load.segf0'.

File: festival.info, Node: Text analysis, Next: POS tagging, Prev: Utterances, Up: Top

Text analysis
*************

* Menu:

* Tokenizing::                  Splitting text into tokens
* Token to word rules::
* Homograph disambiguation::    "Wed 5 may wind US Sen up"

File: festival.info, Node: Tokenizing, Next: Token to word rules, Up: Text analysis

Tokenizing
==========

A crucial stage in text processing is the initial tokenization of text. A _token_ in Festival is an atom separated by whitespace from a text file (or string). If punctuation for the current language is defined, characters matching that punctuation are removed from the beginning and end of a token and held as features of the token. The default list of characters to be treated as whitespace is defined as

     (defvar token.whitespace " \t\n\r")

While the default sets of punctuation characters are

     (defvar token.punctuation "\"'`.,:;!?(){}[]")
     (defvar token.prepunctuation "\"'`({[")

These are declared in `lib/token.scm' but may be changed for different languages, text modes etc.

File: festival.info, Node: Token to word rules, Next: Homograph disambiguation, Prev: Tokenizing, Up: Text analysis

Token to word rules
===================

Tokens are further analysed into lists of words. A word is an atom that can be given a pronunciation by the lexicon (or letter to sound rules). A token may give rise to a number of words or none at all. For example the basic tokens

     This pocket-watch was made in 1983.
would give a word relation of

     this pocket watch was made in nineteen eighty three

Because the relationship between tokens and words is in some cases complex, a user function may be specified for translating tokens into words. This is designed to deal with things like numbers, email addresses, and other non-obvious pronunciations of tokens as zero or more words. Currently a builtin function `builtin_english_token_to_words' offers much of the necessary functionality for English but a user may further customize this.

If the user defines a function `token_to_words', which takes two arguments, a token item and a token name, it will be called by the `Token_English' and `Token_Any' modules. A substantial example is given as `english_token_to_words' in `festival/lib/token.scm'. It is quite elaborate and covers most of the common multi-word tokens in English including numbers, money symbols, Roman numerals, dates, times, plurals of symbols, number ranges, telephone numbers and various other symbols.

Let us look at the treatment of one particular phenomenon which shows the use of these rules. Consider the expression "$12 million" which should be rendered as the words "twelve million dollars". Note the word "dollars", which is introduced by the "$" sign, ends up after the end of the expression. There are two cases we need to deal with as there are two tokens. The first clause of the `cond' below checks if the current token name is a money symbol and that the following word is a magnitude (million, billion, trillion, zillion etc.). If that is the case the "$" is removed and the remaining number is pronounced, by calling the builtin token to word function. The second clause deals with the second token. It confirms the previous token is a money value (using the same regular expression as before) and then returns the word followed by the word "dollars".
If it is neither of these forms then the builtin function is called.

     (define (token_to_words token name)
     "(token_to_words TOKEN NAME)
     Returns a list of words for NAME from TOKEN."
       (cond
        ((and (string-matches name "\\$[0-9,]+\\(\\.[0-9]+\\)?")
              (string-matches (item.feat token "n.name") ".*illion.?"))
         (builtin_english_token_to_words token (string-after name "$")))
        ((and (string-matches (item.feat token "p.name")
                              "\\$[0-9,]+\\(\\.[0-9]+\\)?")
              (string-matches name ".*illion.?"))
         (list name "dollars"))
        (t
         (builtin_english_token_to_words token name))))

It is valid to make some conditions return no words, though some care should be taken with that, as punctuation information may no longer be available to later processing if there are no words related to a token.

File: festival.info, Node: Homograph disambiguation, Prev: Token to word rules, Up: Text analysis

Homograph disambiguation
========================

Not all tokens can be rendered as words easily. Their context may affect the way they are to be pronounced. For example in the utterance

     On May 5 1985, 1985 people moved to Livingston.

the tokens "1985" should be pronounced differently: the first as a year, "nineteen eighty five", and the second as a quantity, "one thousand nine hundred and eighty five". Numbers may also be pronounced as ordinals, as in the "5" above; it should be "fifth" rather than "five".

Also, the pronunciation of certain words cannot simply be found from their orthographic form alone. Linguistic part of speech tags help to disambiguate a large class of homographs, e.g. "lives". A part of speech tagger is included in Festival and discussed in *Note POS tagging::. But even part of speech isn't sufficient in a number of cases. Words such as "bass", "wind", "bow" etc. cannot be distinguished by part of speech alone; some semantic information is also required. As full semantic analysis of text is outwith the realms of Festival's capabilities some other method for disambiguation is required.
Following the work of `yarowsky96' we have included a method by which
identified tokens may be further labelled with extra tags to help
identify their type.  Yarowsky uses _decision lists_ to identify
different types for homographs.  Decision lists are a restricted form
of decision trees which have some advantages over full trees: they are
easier to build, and Yarowsky has shown them to be adequate for
typical homograph resolution.

Using disambiguators
--------------------

Festival offers a method for assigning a `token_pos' feature to each
token.  It does so using Yarowsky-type disambiguation techniques.  A
list of disambiguators can be provided in the variable
`token_pos_cart_trees'.  Each disambiguator consists of a regular
expression and a CART tree (which may be a decision list, as they have
the same format).  If a token matches the regular expression the CART
tree is applied to the token and the resulting class is assigned to
the token via the feature `token_pos'.  This is done by the
`Token_POS' module.

For example, the following disambiguator distinguishes "St" (street
and saint) and "Dr" (doctor and drive).

     ("\\([dD][Rr]\\|[Ss][tT]\\)"
      ((n.name is 0)
       ((p.cap is 1)
        ((street))
        ((p.name matches "[0-9]*\\(1[sS][tT]\\|2[nN][dD]\\|3[rR][dD]\\|[0-9][tT][hH]\\)")
         ((street))
         ((title))))
       ((punc matches ".*,.*")
        ((street))
        ((p.punc matches ".*,.*")
         ((title))
         ((n.cap is 0)
          ((street))
          ((p.cap is 0)
           ((p.name matches "[0-9]*\\(1[sS][tT]\\|2[nN][dD]\\|3[rR][dD]\\|[0-9][tT][hH]\\)")
            ((street))
            ((title)))
           ((pp.name matches "[1-9][0-9]+")
            ((street))
            ((title)))))))))

Note that these only assign values for the feature `token_pos' and do
nothing more.  You must have a related token to word rule that
interprets this feature value and does the required translation.
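Such a tree only takes effect once it appears in
`token_pos_cart_trees'.  The following is an illustrative sketch of
one way to register it; the variable `st_dr_tree', assumed to hold the
tree above without its leading regular expression, is hypothetical,
and in practice the standard voices set `token_pos_cart_trees'
directly in their setup files.

     ;; Sketch only: `st_dr_tree' is assumed to hold the CART tree
     ;; shown above (minus the leading regular expression).  This
     ;; replaces any previously installed disambiguators.
     (set! token_pos_cart_trees
           (list
            (list "\\([dD][Rr]\\|[Ss][tT]\\)" st_dr_tree)))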
For example the corresponding token to word rule for the above
disambiguator is

     ((string-matches name "\\([dD][Rr]\\|[Ss][tT]\\)")
      (if (string-equal (item.feat token "token_pos") "street")
          (if (string-matches name "[dD][rR]")
              (list "drive")
              (list "street"))
          (if (string-matches name "[dD][rR]")
              (list "doctor")
              (list "saint"))))

Building disambiguators
-----------------------

Festival offers some support for building disambiguation trees.  The
basic method is to find all occurrences of a homographic token in a
large text database, label each occurrence into classes, extract
appropriate context features for these tokens and finally build a
classification tree or decision list based on the extracted features.

The extraction and building of trees is not yet a fully automated
process in Festival, but the file `festival/examples/toksearch.scm'
shows some basic Scheme code we use for extracting tokens from very
large collections of text.

The function `extract_tokens' does the real work.  It reads the given
file, token by token, into a token stream.  Each token is tested
against the desired tokens and if there is a match the named features
are extracted.  The token stream will be extended to provide the
necessary context.  Note that only some features will make any sense
in this situation.  There is only a token relation, so referring to
words, syllables etc. is not productive.

In this example databases are identified by a file that lists all the
files in the text databases.  Its name is expected to be
`bin/DBNAME.files' where `DBNAME' is the name of the database.  The
file should contain a list of filenames in the database, e.g. for the
Gutenberg texts the file `bin/Gutenberg.files' contains

     gutenberg/etext90/bill11.txt
     gutenberg/etext90/const11.txt
     gutenberg/etext90/getty11.txt
     gutenberg/etext90/jfk11.txt
     ...

Extracting the tokens is typically done in two passes.  The first pass
extracts the context (I've used 5 tokens either side).
It extracts the file and position, so the token is identified, and the
word in context.

Next those examples should be labelled with a small set of classes
which identify the type of the token.  For example for a token like
"Dr" whether it is a person's title or a street identifier.  Note that
hand-labelling can be laborious, though it is surprising how few
tokens of particular types actually exist in 62 million words.

The next task is to extract the tokens with the features that will
best distinguish the particular token.  In our "Dr" case this will
involve punctuation around the token, capitalisation of surrounding
tokens, etc.  After extracting the distinguishing features you must
line up the labels with these extracted features.  It would be easier
to extract both the context and the desired features at the same time,
but experience shows that in labelling, more appropriate features come
to mind that will distinguish classes better, and you don't want to
have to label twice.

Once a set of examples consisting of the label and features is created
it is easy to use `wagon' to create the corresponding decision tree or
decision list.  `wagon' supports both decision trees and decision
lists; it may be worth experimenting to find out which gives the best
results on some held out test data.  It appears that decision trees
are typically better, but are often much larger, and the size does not
always justify the sometimes only slightly better results.

File: festival.info, Node: POS tagging, Next: Phrase breaks, Prev: Text analysis, Up: Top

POS tagging
***********

Part of speech tagging is a fairly well-defined process.  Festival
includes a part of speech tagger following the HMM-type taggers as
found in the Xerox tagger and others (e.g. `DeRose88').  Part of
speech tags are assigned based on the probability distribution of tags
given a word, and on ngrams of tags.  These models are externally
specified and a Viterbi decoder is used to assign part of speech tags
at run time.
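The tagger is driven entirely by a handful of Scheme variables,
described below.  As an illustrative sketch only, a voice might set
them along these lines; every name and value here is an assumption
made up for the example, and the actual English settings live in
`lib/pos.scm'.

     ;; Sketch only: lexicon, ngram and tag names are invented for
     ;; illustration; see `lib/pos.scm' for the real English values.
     (set! pos_lex_name "example_poslex")
     (set! pos_ngram_name 'example_pos_ngram)
     (set! pos_p_start_tag "punc")   ;; sentence final punctuation tag
     (set! pos_pp_start_tag "nn")    ;; a common noun tag
     (set! pos_map
           '(((nn nns nnp nnps) n)          ;; collapse noun tags to n
             ((vb vbd vbg vbn vbp vbz) v))) ;; collapse verb tags to v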
So far this tagger has only been used for English but there is nothing
language specific about it.  The module `POS' assigns the tags.  It
accesses the following variables for parameterization.

`pos_lex_name'
     The name of a "lexicon" holding reverse probabilities of words
     given a tag (indexed by word).  If this is unset or has the value
     `NIL' no part of speech tagging takes place.

`pos_ngram_name'
     The name of a loaded ngram model of part of speech tags (loaded
     by `ngram.load').

`pos_p_start_tag'
     The name of the most likely tag before the start of an utterance.
     This is typically the tag for sentence final punctuation marks.

`pos_pp_start_tag'
     The name of the most likely tag two before the start of an
     utterance.  For English this is typically a simple noun, but for
     other languages it might be a verb.  If the ngram model is bigger
     than three this tag is effectively repeated for the previous left
     contexts.

`pos_map'
     We have found that it is often better to use a rich tagset for
     prediction of part of speech tags but that in later use (phrase
     breaks and dictionary lookup) a much more constrained tagset is
     better.  Thus mapping of the predicted tagset to a different
     tagset is supported.  `pos_map' should be a list of pairs
     consisting of a list of tags to be mapped and the new tag they
     are to be mapped to.

Note it is important to have the part of speech tagger match the tags
used in later parts of the system, particularly the lexicon.  Only two
of our lexicons used so far have (mappable) part of speech labels.

An example of the part of speech tagger for English can be found in
`lib/pos.scm'.

File: festival.info, Node: Phrase breaks, Next: Intonation, Prev: POS tagging, Up: Top

Phrase breaks
*************

There are two methods for predicting phrase breaks in Festival, one
simple and one sophisticated.  These two methods are selected through
the parameter `Phrase_Method' and phrasing is achieved by the module
`Phrasify'.

The first method is by CART tree.
If parameter `Phrase_Method' is `cart_tree', the CART tree in the
variable `phrase_cart_tree' is applied to each word to see if a break
should be inserted or not.  The tree should predict categories `BB'
(for big break), `B' (for break) or `NB' (for no break).  A simple
example of a tree to predict phrase breaks is given in the file
`lib/phrase.scm'.

     (set! simple_phrase_cart_tree
      '
      ((R:Token.parent.punc in ("?" "." ":"))
       ((BB))
       ((R:Token.parent.punc in ("'" "\"" "," ";"))
        ((B))
        ((n.name is 0)
         ((BB))
         ((NB))))))

The second and more elaborate method of phrase break prediction is
used when the parameter `Phrase_Method' is `prob_models'.  In this
case a probabilistic model, giving the probability of a break after a
word based on the part of speech of the neighbouring words and the
previous word, is combined with an ngram model of the distribution of
breaks and non-breaks, using a Viterbi decoder to find the optimal
phrasing of the utterance.  The results using this technique are good
and even show good results on unseen data from other researchers'
phrase break tests (see `black97b').  However sometimes it does sound
wrong, suggesting there is still further work required.

Parameters for this module are set through the feature list held in
the variable `phr_break_params', an example of which for English is
set in `english_phr_break_params' in the file `lib/phrase.scm'.  The
feature names and meanings are

`pos_ngram_name'
     The name of a loaded ngram that gives probability distributions
     of B/NB given previous, current and next part of speech.

`pos_ngram_filename'
     The filename containing `pos_ngram_name'.

`break_ngram_name'
     The name of a loaded ngram of B/NB distributions.  This is
     typically a 6 or 7-gram.

`break_ngram_filename'
     The filename containing `break_ngram_name'.

`gram_scale_s'
     A weighting factor for breaks in the break/non-break ngram.
     Increasing the value inserts more breaks, reducing it causes
     fewer breaks to be inserted.
`phrase_type_tree'
     A CART tree that is used to predict the type of break given the
     predicted break position.  This (rather crude) technique is
     currently used to distinguish major and minor breaks.

`break_tags'
     A list of the break tags (typically `(B NB)').

`pos_map'
     A part of speech map used to map the `pos' feature of words into
     a smaller tagset used by the phrase predictor.

File: festival.info, Node: Intonation, Next: Duration, Prev: Phrase breaks, Up: Top

Intonation
**********

A number of different intonation modules are available with varying
levels of control.  In general intonation is generated in two steps.

  1. Prediction of accents (and/or end tones) on a per syllable basis.

  2. Prediction of F0 target values; this must be done after durations
     are predicted.

Reflecting this split there are two main intonation modules that call
sub-modules depending on the desired intonation methods.  The
`Intonation' and `Int_Targets' modules are defined in Lisp
(`lib/intonation.scm') and call sub-modules which are (so far) in C++.

* Menu:

* Default intonation::    Effectively none at all.
* Simple intonation::     Accents and hats.
* Tree intonation::       Accents and Tones, and F0 prediction by LR
* Tilt intonation::       Using the Tilt intonation model
* General intonation::    A programmable intonation module
* Using ToBI::            A ToBI by rule example

File: festival.info, Node: Default intonation, Next: Simple intonation, Up: Intonation

Default intonation
==================

This is the simplest form of intonation and offers the modules
`Intonation_Default' and `Intonation_Targets_Default'.  The first of
these actually does nothing at all.  `Intonation_Targets_Default'
simply creates a target at the start of the utterance, and one at the
end.  By default these values are 130 Hz and 110 Hz.  These values may
be set through the parameter `duffint_params'; for example the
following will generate a monotone at 150 Hz.

     (set! duffint_params '((start 150) (end 150)))
     (Parameter.set 'Int_Method 'DuffInt)
     (Parameter.set 'Int_Target_Method Int_Targets_Default)

File: festival.info, Node: Simple intonation, Next: Tree intonation, Prev: Default intonation, Up: Intonation

Simple intonation
=================

This module uses the CART tree in `int_accent_cart_tree' to predict if
each syllable is accented or not.  A predicted value of `NONE' means
no accent is generated by the corresponding `Int_Targets_Simple'
function.  Any other predicted value will cause a `hat' accent to be
put on that syllable.

A default `int_accent_cart_tree' is available in the value
`simple_accent_cart_tree' in `lib/intonation.scm'.  It simply predicts
accents on the stressed syllables of content words in poly-syllabic
words, and on the only syllable in single syllable content words.  Its
form is

     (set! simple_accent_cart_tree
      '
      ((R:SylStructure.parent.gpos is content)
       ((stress is 1)
        ((Accented))
        ((position_type is single)
         ((Accented))
         ((NONE))))
       ((NONE))))

The function `Int_Targets_Simple' uses parameters in the a-list in the
variable `int_simple_params'.  There are two interesting parameters:
`f0_mean', which gives the mean F0 for this speaker (default 110 Hz),
and `f0_std', the standard deviation of F0 for this speaker (default
25 Hz).  This second value is used to determine the amount of
variation to be put in the generated targets.

For each Phrase in the given utterance an F0 is generated starting at
`f0_mean+(f0_std*0.6)' and declining `f0_std' Hz over the length of
the phrase until the last syllable, whose end is set to
`f0_mean-f0_std'.  An imaginary line called `baseline' is drawn from
the start to the end (minus the final extra fall).  For each syllable
that is accented (i.e. has an IntEvent related to it) three targets
are added: one at the start, one in mid vowel, and one at the end.
The start and end are at position `baseline' Hz (as declined for that
syllable) and the mid vowel is set to `baseline+f0_std'.
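The global pitch behaviour of this module can thus be adjusted simply
by resetting the a-list; for example, for a speaker with a higher mean
pitch and a livelier range (the values here are purely illustrative):

     ;; Illustrative values only: raise the mean F0 and widen the
     ;; variation used when generating targets.
     (set! int_simple_params
           '((f0_mean 130) (f0_std 30)))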
Note this model is not supposed to be complex or comprehensive but it
offers a very quick and easy way to generate something other than a
fixed line F0.  Something similar to this has been used for Spanish
and Welsh without (too many) people complaining.  However it is not
designed as a serious intonation module.

File: festival.info, Node: Tree intonation, Next: Tilt intonation, Prev: Simple intonation, Up: Intonation

Tree intonation
===============

This module is more flexible.  Two different CART trees can be used to
predict `accents' and `endtones'.  Although at present this module is
used for an implementation of the ToBI intonation labelling system it
could be used for many different types of intonation system.

The target module for this method uses a linear regression model to
predict start, mid-vowel and end targets for each syllable using
arbitrarily specified features.  This follows the work described in
`black96'.  The LR models are held as described in *Note Linear
regression::.  Three models are used, in the variables `f0_lr_start',
`f0_lr_mid' and `f0_lr_end'.

File: festival.info, Node: Tilt intonation, Next: General intonation, Prev: Tree intonation, Up: Intonation

Tilt intonation
===============

Tilt description to be inserted.