This is festival.info, produced by Makeinfo version 3.12h from festival.texi.

This file documents the `Festival' Speech Synthesis System, a general text to speech system for making your computer talk and developing new synthesis techniques.

Copyright (C) 1996-2001 University of Edinburgh

Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are preserved on all copies.

Permission is granted to copy and distribute modified versions of this manual under the conditions for verbatim copying, provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one.

Permission is granted to copy and distribute translations of this manual into another language, under the above conditions for modified versions, except that this permission notice may be stated in a translation approved by the authors.

File: festival.info, Node: General intonation, Next: Using ToBI, Prev: Tilt intonation, Up: Intonation

General intonation
==================

As there seem to be a number of intonation theories that predict F0 contours by rule (possibly using trained parameters), this module aids the external specification of such rules for a wide class of intonation theories (though primarily those that might be referred to as the ToBI group). It is designed to be multi-lingual and to offer a quick way to port often pre-existing rules into Festival without writing new C++ code.

The accent prediction part uses the same mechanism as the Simple intonation method described above, a decision tree for accent prediction; thus the tree in the variable `int_accent_cart_tree' is used on each syllable to predict an `IntEvent'.

The target part calls a specified Scheme function which returns a list of target points for a syllable. In this way any arbitrary tests may be done to produce the target points.
For example, here is a function which returns three target points for each syllable with an `IntEvent' related to it (i.e. accented syllables).

     (define (targ_func1 utt syl)
       "(targ_func1 UTT STREAMITEM)
     Returns a list of targets for the given syllable."
       (let ((start (item.feat syl 'syllable_start))
             (end (item.feat syl 'syllable_end)))
         (if (equal? (item.feat syl "R:Intonation.daughter1.name") "Accented")
             (list
              (list start 110)
              (list (/ (+ start end) 2.0) 140)
              (list end 100)))))

This function may be identified as the function to call by the following setup parameters.

     (Parameter.set 'Int_Method 'General)
     (Parameter.set 'Int_Target_Method Int_Targets_General)
     (set! int_general_params
           (list
            (list 'targ_func targ_func1)))

File: festival.info, Node: Using ToBI, Prev: General intonation, Up: Intonation

Using ToBI
==========

An example implementation of a ToBI to F0 target module is included in `lib/tobi_rules.scm', based on the rules described in `jilka96'. This uses the general intonation method discussed in the previous section. It is designed to be useful to people who are experimenting with ToBI (`silverman92'), rather than for general text to speech.

To use this method you need to load `lib/tobi_rules.scm' and call `setup_tobi_f0_method'. The default is a male pitch range, i.e. for `voice_rab_diphone'. You can change it for other pitch ranges by changing the following variables.

     (Parameter.set 'Default_Topline 110)
     (Parameter.set 'Default_Start_Baseline 87)
     (Parameter.set 'Default_End_Baseline 83)
     (Parameter.set 'Current_Topline (Parameter.get 'Default_Topline))
     (Parameter.set 'Valley_Dip 75)

An example using this from STML is given in `examples/tobi.stml'. But it can also be used from Scheme.
For example, before defining an utterance you should execute the following, either from the command line or in some setup file.

     (voice_rab_diphone)
     (require 'tobi_rules)
     (setup_tobi_f0_method)

In order to allow specification of accents, tones, and break levels you must use an utterance type that allows such specification. For example

     (Utterance Words
      (boy
       (saw ((accent H*)))
       the
       (girl ((accent H*)))
       in
       the
       (park ((accent H*) (tone H-)))
       with
       the
       (telescope ((accent H*) (tone H-H%)))))

     (Utterance Words
      (The
       (boy ((accent L*)))
       saw
       the
       (girl ((accent H*) (tone L-)))
       with
       the
       (telescope ((accent H*) (tone H-H%)))))

You can display the synthesized form of these utterances in Xwaves. Start an Xwaves and an Xlabeller and call the function `display' on the synthesized utterance.

File: festival.info, Node: Duration, Next: UniSyn synthesizer, Prev: Intonation, Up: Top

Duration
********

A number of different duration prediction modules are available, with varying levels of sophistication. Segmental duration prediction is done by the module `Duration', which calls different actual methods depending on the parameter `Duration_Method'.

All of the following duration methods may be further affected by both a global duration stretch and a per-word one.

If the parameter `Duration_Stretch' is set, all absolute durations predicted by any of the duration methods described here are multiplied by the parameter's value. For example

     (Parameter.set 'Duration_Stretch 1.2)

will make everything speak more slowly.

In addition to the global stretch method, if the feature `dur_stretch' on the related `Token' is set it will also be used as a multiplicative factor on the duration produced by the selected method. That is `R:Syllable.parent.parent.R:Token.parent.dur_stretch'. There is a Lisp function `duration_find_stretch' which will return the combined global and local duration stretch factor for a given segment item.
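The combination of the two stretch factors is a simple product applied to the predicted duration. A minimal sketch of that logic, in Python for illustration (the function name is hypothetical; Festival computes this internally via `duration_find_stretch'):

```python
def stretched_duration(predicted, duration_stretch=1.0, token_dur_stretch=1.0):
    """Apply the global Duration_Stretch parameter and the local Token
    dur_stretch feature multiplicatively to a predicted segment duration."""
    return predicted * duration_stretch * token_dur_stretch

# A 0.1 s segment, 20% slower globally and 50% slower for this word:
stretched_duration(0.1, 1.2, 1.5)  # 0.18 seconds
```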
Note these global and local methods of affecting the duration produced by models are crude and should be considered hacks. Uniform modification of durations is not what happens in real speech. These parameters are typically used when the underlying duration method is lacking in some way. However, they can be useful. Note it is quite easy to implement new duration methods in Scheme directly.

* Menu:

* Default durations::     Fixed length durations
* Average durations::
* Klatt durations::       Klatt rules from book.
* CART durations::        Tree based durations

File: festival.info, Node: Default durations, Next: Average durations, Up: Duration

Default durations
=================

If parameter `Duration_Method' is set to `Default', the simplest duration model is used. All segments are 100 milliseconds (this can be modified by `Duration_Stretch', and/or the localised Token-related `dur_stretch' feature).

File: festival.info, Node: Average durations, Next: Klatt durations, Prev: Default durations, Up: Duration

Average durations
=================

If parameter `Duration_Method' is set to `Averages' then segmental durations are set to their averages. The variable `phoneme_durations' should be an a-list of phones and averages in seconds. The file `lib/mrpa_durs.scm' has an example for the mrpa phoneset. If a segment is found that does not appear in the list, a default duration of 0.1 seconds is assigned and a warning message generated.

File: festival.info, Node: Klatt durations, Next: CART durations, Prev: Average durations, Up: Duration

Klatt durations
===============

If parameter `Duration_Method' is set to `Klatt', the duration rules from the Klatt book (`allen87', chapter 9) are used. This method requires minimum and inherent durations for each phoneme in the phoneset. This information is held in the variable `duration_klatt_params'. Each member of this list is a three-tuple of phone name, inherent duration and minimum duration. An example for the mrpa phoneset is in `lib/klatt_durs.scm'.
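The general shape of the Klatt model is that rules shrink or stretch only the part of the inherent duration that lies above the phone's minimum. A sketch of that core relationship, in Python for illustration (this is a simplification of the rules in `allen87', not Festival's C++ implementation; the function name and example numbers are hypothetical):

```python
def klatt_duration(inherent, minimum, percentage):
    """Klatt-style duration: only the compressible part of the inherent
    duration (the part above the minimum) is scaled by the rules.
    `percentage` is the combined effect of the applicable rules
    (100 means no change)."""
    return ((inherent - minimum) * percentage) / 100.0 + minimum

# Rules that halve the compressible part of a 140 ms phone (60 ms minimum):
klatt_duration(0.14, 0.06, 50)  # 0.10 seconds
```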
File: festival.info, Node: CART durations, Prev: Klatt durations, Up: Duration

CART durations
==============

Two very similar methods of duration prediction by CART tree are supported. The first, used when parameter `Duration_Method' is `Tree', simply predicts durations directly for each segment. The tree is set in the variable `duration_cart_tree'.

The second, which seems to give better results, is used when parameter `Duration_Method' is `Tree_ZScores'. In this second model the tree predicts zscores (number of standard deviations from the mean) rather than durations directly. (This follows `campbell91', but we don't deal in syllable durations here.) This method requires means and standard deviations for each phone. The variable `duration_cart_tree' should contain the zscore prediction tree and the variable `duration_ph_info' should contain a list of phone, mean duration, and standard deviation for each phone in the phoneset.

An example tree trained from 460 sentences spoken by Gordon is in `lib/gswdurtreeZ'. Phone means and standard deviations are in `lib/gsw_durs.scm'.

After prediction the segmental duration is calculated by the simple formula

     duration = mean + (zscore * standard deviation)

This method has also been used for other duration models that affect an inherent duration by some factor: if the tree predicts factors rather than zscores and the `duration_ph_info' entries are phone, 0.0, inherent duration, the above formula will generate the desired result. Klatt and Klatt-like rules can be implemented in this way without adding a new method.

File: festival.info, Node: UniSyn synthesizer, Next: Diphone synthesizer, Prev: Duration, Up: Top

UniSyn synthesizer
******************

Since 1.3 a new general synthesizer module has been included. This is designed to replace the older diphone synthesizer described in the next chapter.
A redesign was made in order to have a generalized waveform synthesizer and signal processing module that could be used even when the units being concatenated are not diphones. Also at this stage the full diphone (or other) database pre-processing functions were added to the Speech Tool library.

UniSyn database format
======================

The UniSyn synthesis modules can use databases in two basic formats, _separate_ and _grouped_. Separate is when all files (signal, pitchmark and coefficient files) are accessed individually during synthesis. This is the standard use during database development. Group format is when a database is collected together into a single special file containing all information necessary for waveform synthesis. This format is designed to be used for distribution and general use of the database.

A database should consist of a set of waveforms (which may be translated into a set of coefficients if the desired signal processing method requires it), a set of pitchmarks and an index. The pitchmarks are necessary as most of our current signal processing methods are pitch synchronous.

Generating pitchmarks
---------------------

Pitchmarks may be derived from laryngograph files using the `pitchmark' program distributed with the speech tools. The actual parameters to this program are still a bit of an art form. The first major issue is which direction (polarity) the lar files are in. We have seen both: it seems CSTR's are most often upside down while others (e.g. OGI's) are the right way up. The `-inv' argument to `pitchmark' is specifically provided to cater for this. There are other issues in getting the pitchmarks aligned. The basic command for generating pitchmarks is

     pitchmark -inv lar/file001.lar -o pm/file001.pm -otype est \
        -min 0.005 -max 0.012 -fill -def 0.01 -wave_end

The `-min', `-max' and `-def' (fill values for unvoiced regions) may need to be changed depending on the speaker's pitch range. The above is suitable for a male speaker.
The `-fill' option states that unvoiced sections should be filled with equally spaced pitchmarks.

Generating LPC coefficients
---------------------------

LPC coefficients are generated using the `sig2fv' command. Two stages are required, generating the LPC coefficients and generating the residual. The prototypical commands for these are

     sig2fv wav/file001.wav -o lpc/file001.lpc -otype est -lpc_order 16 \
         -coefs "lpc" -pm pm/file001.pm -preemph 0.95 -factor 3 \
         -window_type hamming
     sigfilter wav/file001.wav -o lpc/file001.res -otype nist \
         -lpcfilter lpc/file001.lpc -inv_filter

For some databases you may need to normalize the power. Properly normalizing power is difficult but we provide a simple function which may do the job acceptably. You should do this on the waveform before LPC analysis (and ensure you also do the residual extraction on the normalized waveform rather than the original).

     ch_wave -scaleN 0.5 wav/file001.wav -o file001.Nwav

This normalizes the power by maximizing the signal first, then multiplying it by the given factor. If the database waveforms are clean (i.e. no clicks) this can give reasonable results.

Generating a diphone index
==========================

The diphone index consists of a short header followed by an ascii list of each diphone, the file it comes from, and its start, middle and end times in seconds. For most databases this file needs to be generated by some database-specific script. An example header is

     EST_File index
     DataType ascii
     NumEntries 2005
     IndexName rab_diphone
     EST_Header_End

The most notable part is the number of entries, which you should note can get out of sync with the actual number of entries if you hand edit entries. I.e. if you add an entry and the system still can't find it, check that the number of entries is right.

The entries themselves may take on one of two forms, full entries or index entries.
Full entries consist of a diphone name, where the phones are separated by "-"; a file name which is used to index into the pitchmark, LPC and waveform files; and the start, middle (change over point between phones) and end of the diphone in the file, in seconds. For example

     r-uh edx_1001 0.225 0.261 0.320
     r-e  edx_1002 0.224 0.273 0.326
     r-i  edx_1003 0.240 0.280 0.321
     r-o  edx_1004 0.212 0.253 0.320

The second form of entry is an index entry which simply states that reference to that diphone should actually be made to another. For example

     aa-ll &aa-l

This states that the diphone `aa-ll' should actually use the diphone `aa-l'. Note there are a number of ways to specify alternates for missing diphones, and this method is best used for fixing single or small classes of missing or broken diphones. Index entries may appear anywhere in the file but can't be nested. Some checks are made on reading this index to ensure times etc. are reasonable, but multiple entries for the same diphone are not checked; in that case the later one will be selected.

Database declaration
====================

There are two major types of database, _grouped_ and _ungrouped_. Grouped databases come as a single file containing the diphone index, coefficients and residuals for the diphones. This is the standard way databases are distributed as voices in Festival. Ungrouped databases access diphones from individual files and are designed as a method for debugging and testing databases before distribution. Using an ungrouped database is slower but allows quicker changes to the index, and to the associated coefficient files and residuals, without rebuilding the group file.

A database is declared to the system through the command `us_diphone_init'. This function takes a parameter list of various features used for setting up a database. The features are

`name'
     An atomic name for this database, used in selecting it from the current set of loaded databases.
`index_file'
     A filename containing either a diphone index, as described above, or a group file. The feature `grouped' defines the distinction between this being a group file or a simple index file.

`grouped'
     Takes the value `"true"' or `"false"'. This defines whether the index file is a simple index or a grouped file.

`coef_dir'
     The directory containing the coefficients (LPC, or just pitchmarks in the PSOLA case).

`sig_dir'
     The directory containing the signal files (residuals for LPC, full waveforms for PSOLA).

`coef_ext'
     The extension for coefficient files, typically `".lpc"' for LPC files and `".pm"' for pitchmark files.

`sig_ext'
     The extension for signal files, typically `".res"' for LPC residual files and `".wav"' for waveform files.

`default_diphone'
     The diphone to be used when the requested one doesn't exist. No matter how careful you are, you should always include a default diphone for distributed diphone databases. Synthesis will throw an error if no diphone is found and there is no default. Although it is usually an error when this is required, it's better to fill in something than stop synthesizing. Typical values for this are silence to silence or schwa to schwa.

`alternates_left'
     A list of pairs showing the alternate phone names for the left phone in a diphone pair. This list is used to rewrite the diphone name when the directly requested one doesn't exist. This is the recommended method for dealing with systematic holes in a diphone database.

`alternates_right'
     A list of pairs showing the alternate phone names for the right phone in a diphone pair. This list is used to rewrite the diphone name when the directly requested one doesn't exist. This is the recommended method for dealing with systematic holes in a diphone database.

An example database definition is

     (set! rab_diphone_dir "/projects/festival/lib/voices/english/rab_diphone")
     (set!
rab_lpc_group
           (list
            '(name "rab_lpc_group")
            (list 'index_file
                  (path-append rab_diphone_dir "group/rablpc16k.group"))
            '(alternates_left ((i ii) (ll l) (u uu) (i@ ii) (uh @) (a aa)
                               (u@ uu) (w @) (o oo) (e@ ei) (e ei) (r @)))
            '(alternates_right ((i ii) (ll l) (u uu) (i@ ii) (y i)
                                (uh @) (r @) (w @)))
            '(default_diphone @-@@)
            '(grouped "true")))
     (us_diphone_init rab_lpc_group)

Making groupfiles
=================

The function `us_make_group_file' will make a group file of the currently selected US diphone database. It loads in all diphones in the database and saves them in the named file. An optional second argument allows specification of how the group file will be saved. These options are given as a feature list. There are three possible options

`track_file_format'
     The format for the coefficient files. By default this is `est_binary'; currently the only other alternative is `est_ascii'.

`sig_file_format'
     The format for the signal parts of the database. By default this is `snd' (Sun's audio format). This was chosen as it has the smallest header and supports various sample formats. Any format supported by the Edinburgh Speech Tools is allowed.

`sig_sample_format'
     The format for the samples in the signal files. By default this is `mulaw'. This is suitable when the signal files are LPC residuals. LPC residuals have a much smaller dynamic range than plain PCM files. Because `mulaw' representation is half the size (8 bits) of standard PCM files (16 bits), this significantly reduces the size of the group file while only marginally altering the quality of synthesis (and from experiments the effect is not perceptible). However, when saving group files where the signals are not LPC residuals (e.g. in PSOLA), using this default `mulaw' is not recommended and `short' should probably be used.

UniSyn module selection
=======================

In a voice selection a UniSyn database may be selected as follows

     (set! UniSyn_module_hooks (list rab_diphone_const_clusters ))
     (set!
us_abs_offset 0.0)
     (set! window_factor 1.0)
     (set! us_rel_offset 0.0)
     (set! us_gain 0.9)

     (Parameter.set 'Synth_Method 'UniSyn)
     (Parameter.set 'us_sigpr 'lpc)
     (us_db_select rab_db_name)

The `UniSyn_module_hooks' are run before synthesis; see the next section about diphone name selection. At present only `lpc' is supported by the UniSyn module, though potentially there may be others.

An optional implementation of TD-PSOLA `moulines90' has been written, but fear of legal problems unfortunately prevents it being in the public distribution; this policy should not be taken as acknowledging or not acknowledging any alleged patent violation.

Diphone selection
=================

Diphone names are constructed for each phone-phone pair in the Segment relation in an utterance. In forming a diphone name UniSyn first checks for the feature `us_diphone_left' (or `us_diphone_right' for the right hand part of the diphone), then, if that doesn't exist, the feature `us_diphone', then, if that doesn't exist, the feature `name'. Thus it is possible to specify diphone names which are not simply the concatenation of two segment names. This feature is used to specify consonant cluster diphone names for our English voices. The hook `UniSyn_module_hooks' is run before selection and we specify a function to add `us_diphone_*' features as appropriate. See the function `rab_diphone_fix_phone_name' in `lib/voices/english/rab_diphone/festvox/rab_diphone.scm' for an example.

Once the diphone name is created it is used to select the diphone from the database. If it is not found, the name is converted using the lists of `alternates_left' and `alternates_right' as specified in the database declaration. If that doesn't specify a diphone in the database, the `default_diphone' is selected and a warning is printed. If no default diphone is specified or the default diphone doesn't exist in the database, an error is thrown.
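The lookup and fallback order just described can be summarised in a short sketch, in Python for illustration (Festival implements this in C++; the function name and the use of dictionaries for the index and alternates lists are hypothetical):

```python
def select_diphone(name, index, alternates_left, alternates_right, default=None):
    """Look the diphone up directly; if absent, rewrite the left and right
    phone names via the alternates lists; finally fall back to the
    default diphone, or fail if there is none."""
    if name in index:
        return name
    left, right = name.split("-")
    # Rewrite each half using the declared alternates, if any.
    rewritten = f"{alternates_left.get(left, left)}-{alternates_right.get(right, right)}"
    if rewritten in index:
        return rewritten
    if default in index:
        return default  # Festival would also print a warning here
    raise KeyError(f"no diphone (or default) found for {name}")

index = {"aa-l", "@-@@"}
select_diphone("aa-ll", index, {}, {"ll": "l"}, "@-@@")  # 'aa-l'
select_diphone("zz-zz", index, {}, {}, "@-@@")           # '@-@@'
```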
File: festival.info, Node: Diphone synthesizer, Next: Other synthesis methods, Prev: UniSyn synthesizer, Up: Top

Diphone synthesizer
*******************

_NOTE:_ use of this diphone synthesizer is deprecated and it will probably be removed from future versions; all of its functionality has been replaced by the UniSyn synthesizer. It is not compiled by default; if required, add `ALSO_INCLUDE += diphone' to your `festival/config/config' file.

A basic diphone synthesizer offers a method for making speech from segments, durations and intonation targets. This module was mostly written by Alistair Conkie but the base diphone format is compatible with previous CSTR diphone synthesizers.

The synthesizer offers residual excited LPC based synthesis (`hunt89') and PSOLA (TM) (`moulines90') (PSOLA is not available for distribution).

* Menu:

* Diphone database format::   Format of basic dbs
* LPC databases::             Building and using LPC files.
* Group files::               Efficient binary formats
* Diphone_Init::              Loading diphone databases
* Access strategies::         Various access methods
* Diphone selection::         Mapping phones to special diphone names

File: festival.info, Node: Diphone database format, Next: LPC databases, Up: Diphone synthesizer

Diphone database format
=======================

A diphone database consists of a _dictionary file_, a set of _waveform files_, and a set of _pitch mark files_. These files are in the same format as the previous CSTR (Osprey) synthesizer.

The dictionary file consists of one entry per line. Each entry consists of five fields: a diphone name of the form P1-P2, a filename (without extension), a floating point start position in the file in milliseconds, a mid position in milliseconds (change in phone), and an end position in milliseconds. Lines starting with a semi-colon and blank lines are ignored. The list may be in any order. For example a partial list of phones may look like
     ch-l r021 412.035 463.009 518.23
     jh-l d747 305.841 382.301 446.018
     h-l  d748 356.814 403.54  437.522
     #-@  d404 233.628 297.345 331.327
     @-#  d001 836.814 938.761 1002.48

Waveform files may be in any form, as long as every file is the same type, headered or unheadered, and as long as the format is supported by the speech tools wave reading functions. These may be standard linear PCM waveform files in the case of PSOLA, or LPC coefficients and residual when using the residual LPC synthesizer. *Note LPC databases::.

Pitch mark files consist of a simple list of positions in milliseconds (plus places after the point) in order, one per line, of each pitch mark in the file. For high quality diphone synthesis these should be derived from laryngograph data. During unvoiced sections pitch marks should be artificially created at reasonable intervals (e.g. 10 ms). In the current format there is no way to determine the "real" pitch marks from the "unvoiced" pitch marks.

It is normal to hold a diphone database in a directory with a number of sub-directories, namely `dic/' containing the dictionary file, `wave/' for the waveform files, typically of whole nonsense words (sometimes this directory is called `vox/' for historical reasons) and `pm/' for the pitch mark files. The filename in the dictionary entry should be the same for the waveform file and the pitch mark file (with different extensions).

File: festival.info, Node: LPC databases, Next: Group files, Prev: Diphone database format, Up: Diphone synthesizer

LPC databases
=============

The standard method for diphone resynthesis in the released system is residual excited LPC (`hunt89'). The actual method of resynthesis isn't important to the database format, but if residual LPC synthesis is to be used then it is necessary to make the LPC coefficient files and their corresponding residuals.
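The relationship between the coefficients and the residual is worth spelling out: the residual is what remains after inverse filtering the waveform with the LPC predictor, and passing it back through the matching all-pole filter reconstructs the waveform exactly. A minimal sketch of that identity, in Python for illustration (this uses direct-form predictor coefficients on a whole signal; the speech tools store reflection coefficients and work frame by frame, so this is the concept only, not their implementation):

```python
def lpc_residual(signal, a):
    """Inverse filter: e[n] = s[n] - sum_k a[k] * s[n-1-k]."""
    return [s_n - sum(a[k] * (signal[n - 1 - k] if n - 1 - k >= 0 else 0.0)
                      for k in range(len(a)))
            for n, s_n in enumerate(signal)]

def lpc_resynth(residual, a):
    """Forward (all-pole) filter: s[n] = e[n] + sum_k a[k] * s[n-1-k]."""
    out = []
    for n, e_n in enumerate(residual):
        out.append(e_n + sum(a[k] * (out[n - 1 - k] if n - 1 - k >= 0 else 0.0)
                             for k in range(len(a))))
    return out

sig = [0.1, 0.4, -0.2, 0.3, 0.0]
a = [0.5, -0.25]                      # illustrative predictor coefficients
res = lpc_residual(sig, a)
rebuilt = lpc_resynth(res, a)         # reconstructs sig exactly
```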
Previous versions of the system used a "host of hacky little scripts" to do this, but now that the Edinburgh Speech Tools support LPC analysis we can provide a walk-through for generating these. We assume that the waveform files of nonsense words are in a directory called `wave/'. The LPC coefficients and residuals will be, in this example, stored in `lpc16k/' with extensions `.lpc' and `.res' respectively.

Before starting it is worth considering power normalization. We have found this important on all of the databases we have collected so far. The `ch_wave' program, part of the speech tools, with the option `-scaleN 0.4' may be used if a more complex method is not available.

The following shell command generates the files

     for i in wave/*.wav
     do
        fname=`basename $i .wav`
        echo $i
        lpc_analysis -reflection -shift 0.01 -order 18 -o lpc16k/$fname.lpc \
            -r lpc16k/$fname.res -otype htk -rtype nist $i
     done

It is said that the LPC order should be the sample rate divided by one thousand, plus 2. This may or may not be appropriate, and if you are particularly worried about the database size it is worth experimenting.

The program `lpc_analysis', found in `speech_tools/bin', can be used to generate the LPC coefficients and residual. Note these should be reflection coefficients so they may be quantised (as they are in group files).

The coefficients and residual files produced by different LPC analysis programs may start at different offsets. For example Entropic's ESPS functions generate LPC coefficients that are offset by one frame shift (e.g. 0.01 seconds). Our own `lpc_analysis' routine has no offset. The `Diphone_Init' parameter list allows these offsets to be specified.
When using the above function to generate the LPC files, the description parameters should include

     (lpc_frame_offset 0)
     (lpc_res_offset 0.0)

while when generating using ESPS routines the description should be

     (lpc_frame_offset 1)
     (lpc_res_offset 0.01)

The defaults actually follow the ESPS form, that is, `lpc_frame_offset' is 1 and `lpc_res_offset' is equal to the frame shift, if they are not explicitly mentioned.

Note the biggest problem we had in implementing the residual excited LPC resynthesizer was getting the right part of the residual to line up with the right LPC coefficients describing the pitch mark. Making errors in this degrades the synthesized waveform notably, but not seriously, making it difficult to determine whether it is an offset problem or some other bug.

Although we have started investigating whether extracting pitch synchronous LPC parameters rather than fixed shift parameters gives better performance, we haven't finished this work. `lpc_analysis' supports pitch synchronous analysis but the raw "ungrouped" access method does not yet. At present the LPC parameters are extracted at a particular pitch mark by interpolating over the closest LPC parameters. The "group" files hold these interpolated parameters pitch synchronously.

The American English voice `kd' was created using the speech tools `lpc_analysis' program and its set up should be looked at if you are going to copy it. The British English voice `rb' was constructed using ESPS routines.

File: festival.info, Node: Group files, Next: Diphone_Init, Prev: LPC databases, Up: Diphone synthesizer

Group files
===========

Databases may be accessed directly, but this is usually too inefficient for any purpose except debugging. It is expected that _group files_ will be built which contain a binary representation of the database. A group file is a compact, efficient representation of the diphone database. Group files are byte order independent, so may be shared between machines of different byte orders and word sizes.
Certain information in a group file may be changed at load time, so a database name, access strategy etc. may be changed from what was set originally in the group file.

A group file contains the basic parameters, the diphone index, the signal (original waveform or LPC residual), LPC coefficients, and the pitch marks. It is all you need for a run-time synthesizer. Various compression mechanisms are supported to allow smaller databases if desired. A full English LPC plus residual database at 8k ulaw is about 3 megabytes, while a full 16 bit version at 16k is about 8 megabytes.

Group files are created with the `Diphone.group' command which takes a database name and an output filename as arguments. Making group files can take some time, especially if they are large. The `group_type' parameter specifies `raw' or `ulaw' for encoding signal files. This can significantly reduce the size of databases.

Group files may be partially loaded (see access strategies) at run time for quicker start up and to minimise run-time memory requirements.

File: festival.info, Node: Diphone_Init, Next: Access strategies, Prev: Group files, Up: Diphone synthesizer

Diphone_Init
============

The basic method for describing a database is through the `Diphone_Init' command. This function takes a single argument, a list of pairs of parameter name and value. The parameters are

`name'
     An atomic name for this database.

`group_file'
     The filename of a group file, which may itself contain parameters describing itself.

`type'
     The default value is `pcm', but for distributed voices this is always `lpc'.

`index_file'
     A filename containing the diphone dictionary.

`signal_dir'
     A directory (slash terminated) containing the pcm waveform files.

`signal_ext'
     A dot prefixed extension for the pcm waveform files.

`pitch_dir'
     A directory (slash terminated) containing the pitch mark files.
`pitch_ext'
     A dot prefixed extension for the pitch files.

`lpc_dir'
     A directory (slash terminated) containing the LPC coefficient files and residual files.

`lpc_ext'
     A dot prefixed extension for the LPC coefficient files.

`lpc_type'
     The type of LPC file (as supported by the speech tools).

`lpc_frame_offset'
     The number of frames "missing" from the beginning of the file. Often LPC parameters are offset by one frame.

`lpc_res_ext'
     A dot prefixed extension for the residual files.

`lpc_res_type'
     The type of the residual files; this is a standard waveform type as supported by the speech tools.

`lpc_res_offset'
     Number of seconds "missing" from the beginning of the residual file. Some LPC analysis techniques do not generate a residual until after one frame.

`samp_freq'
     Sample frequency of signal files.

`phoneset'
     Phoneset used; must already be declared.

`num_diphones'
     Total number of diphones in the database. If specified, this must be equal to or bigger than the number of entries in the index file. If it is not specified, the square of the number of phones in the phoneset is used.

`sig_band'
     Number of sample points around the actual diphone to take from the file. This should be larger than any windowing used on the signal, and/or up to the pitch marks outside the diphone signal.

`alternates_after'
     List of pairs of phones stating replacements for the second part of the diphone when the basic diphone is not found in the diphone database.

`alternates_before'
     List of pairs of phones stating replacements for the first part of the diphone when the basic diphone is not found in the diphone database.

`default_diphone'
     When unexpected combinations occur and no appropriate diphone can be found, this diphone should be used. This should be specified for all diphone databases that are to be robust. We usually use the silence to silence diphone. No matter how carefully you design your diphone set, conditions where an unknown diphone is requested seem to _always_ happen.
     If this is not set and a diphone is requested that is not in the
     database, an error occurs and synthesis will stop.

Examples of general set up, making group files, and general use are in

     lib/voices/english/rab_diphone/festvox/rab_diphone.scm

File: festival.info,  Node: Access strategies,  Next: Diphone selection,  Prev: Diphone_Init,  Up: Diphone synthesizer

Access strategies
=================

Three basic accessing strategies are available when using diphone databases. They are designed to optimise access time, start up time and space requirements.

`direct'
     Load all signals at database init time. This is the slowest
     startup but the fastest to access. This is ideal for servers. It
     is also useful for small databases that can be loaded quickly. It
     is reasonable for many group files.

`dynamic'
     Load signals as they are required. This has much faster start up
     and will only gradually use up memory as the diphones are
     actually used. Useful for larger databases, and for non-group
     file access.

`ondemand'
     Load the signals as they are requested but free them if they are
     not required again immediately. This is slower access but
     requires low memory usage. In group files the re-reads are quite
     cheap as the database is well cached and a file descriptor is
     already open for the file.

Note that in group files pitch marks (and LPC coefficients) are always fully loaded (cf. `direct'), as they are typically smaller. Only signals (waveform files or residuals) are potentially dynamically loaded.

File: festival.info,  Node: Diphone selection,  Prev: Access strategies,  Up: Diphone synthesizer

Diphone selection
=================

The appropriate diphone is selected based on the name of the phone identified in the segment stream. However for better diphone synthesis it is useful to augment the diphone database with other diphones in addition to the ones directly from the phoneme set.
For example dark and light l's, distinguishing consonants from their consonant cluster form and their isolated form. There are two methods to identify this modification from the basic name.

When the diphone module is called, the hook `diphone_module_hooks' is applied. That is a list of functions which will be applied to the utterance. Its main purpose is to allow the conversion of the basic name into an augmented one. For example, converting a basic `l' into a dark l, denoted by `ll'. The functions given in `diphone_module_hooks' may set the feature `diphone_phone_name' which, if set, will be used rather than the `name' of the segment.

For example suppose we wish to use a dark l (`ll') rather than a normal l for all l's that appear in the coda of a syllable. First we would define a function which identifies this condition and sets the feature `diphone_phone_name' to mark the name change. The following function would achieve this

     (define (fix_dark_ls utt)
     "(fix_dark_ls UTT)
     Identify ls in coda position and relabel them as ll."
       (mapcar
        (lambda (seg)
          (if (and (string-equal "l" (item.name seg))
                   (string-equal "+" (item.feat seg "p.ph_vc"))
                   (item.relation.prev seg "SylStructure"))
              (item.set_feat seg "diphone_phone_name" "ll")))
        (utt.relation.items utt 'Segment))
       utt)

Then when we wish to use this for a particular voice we need to add

     (set! diphone_module_hooks (list fix_dark_ls))

in the voice selection function.

For a more complex example including consonant cluster identification see the American English voice `ked' in `festival/lib/voices/english/ked/festvox/kd_diphone.scm'. The function `ked_diphone_fix_phone_name' carries out a number of mappings.

The second method for changing a name is during actual look up of a diphone in the database. The list of alternates is given by the `Diphone_Init' function. These are used when the specified diphone can't be found.
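As an illustrative sketch (the database name, index path and fallback diphone here are hypothetical, not taken from an actual voice), the alternates might be declared in the `Diphone_Init' parameter list like this:

```scheme
;; Hypothetical fragment of a Diphone_Init call showing only the
;; alternate and fallback parameters described above; a real call
;; would also give the signal, pitch and LPC parameters.
(Diphone_Init
 (list
  '(name "example_diphone")
  '(index_file "dic/diphone.index")
  '(alternates_after ((ll l)))   ; second half: if x-ll is missing, try x-l
  '(alternates_before ((ll l)))  ; first half: if ll-x is missing, try l-x
  '(default_diphone #-#)))       ; last resort: silence-to-silence diphone
```
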
For example we often allow mappings of dark l, `ll', to `l', as sometimes the dark l diphone doesn't actually exist in the database.

File: festival.info,  Node: Other synthesis methods,  Next: Audio output,  Prev: Diphone synthesizer,  Up: Top

Other synthesis methods
***********************

Festival supports a number of other synthesis systems

* Menu:

* LPC diphone synthesizer::     A small LPC synthesizer (Donovan diphones)
* MBROLA::                      Interface to MBROLA
* Synthesizers in development::

File: festival.info,  Node: LPC diphone synthesizer,  Next: MBROLA,  Up: Other synthesis methods

LPC diphone synthesizer
=======================

A very simple, and very efficient, LPC diphone synthesizer using the "donovan" diphones is also supported. This synthesis method is primarily the work of Steve Isard and later Alistair Conkie. The synthesis quality is not as good as the residual excited LPC diphone synthesizer, but it has the advantage of being much smaller. The donovan diphone database is under 800k.

The diphones are loaded through the `Donovan_Init' function, which takes the name of the dictionary file and the diphone file as arguments; see the following file for details

     lib/voices/english/don_diphone/festvox/don_diphone.scm

File: festival.info,  Node: MBROLA,  Next: Synthesizers in development,  Prev: LPC diphone synthesizer,  Up: Other synthesis methods

MBROLA
======

As an example of how Festival may use a completely external synthesis method we support the free system MBROLA. MBROLA is both a diphone synthesis technique and an actual system that constructs waveforms from segment, duration and F0 target information. For details see the MBROLA home page at `http://tcts.fpms.ac.be/synthesis/mbrola.html'. MBROLA already supports a number of diphone sets including French, Spanish, German and Romanian.

Festival support for MBROLA is in the file `lib/mbrola.scm'. It is all in Scheme. The function `MBROLA_Synth' is called when parameter `Synth_Method' is `MBROLA'.
The function simply saves the segment, duration and target information from the utterance, calls the external `mbrola' program with the selected diphone database, and reloads the generated waveform back into the utterance.

An MBROLA-ized version of the Roger diphone set is available from the MBROLA site. The simple Festival end is distributed as part of the system in `festvox_en1.tar.gz'.

The following variables are used by the process

`mbrola_progname'
     the pathname of the mbrola executable.

`mbrola_database'
     the name of the database to use. This variable is switched
     between different speakers.

File: festival.info,  Node: Synthesizers in development,  Prev: MBROLA,  Up: Other synthesis methods

Synthesizers in development
===========================

In addition to the above synthesizers, Festival also supports CSTR's older PSOLA synthesizer written by Paul Taylor. But as the newer diphone synthesizer produces similar quality output and is a newer (and hence cleaner) implementation, further development of the older module is unlikely.

An experimental unit selection synthesis module is included in `modules/clunits/'. It is an implementation of `black97c'. It is included for people wishing to continue research in the area, rather than as a fully usable waveform synthesis engine. Although it sometimes gives excellent results, it also sometimes gives amazingly bad ones too. We included this as an example of one possible framework for selection-based synthesis.

As one of our funded projects is to specifically develop new selection based synthesis algorithms, we expect to include more models within later versions of the system.

Also, now that Festival has been released, other groups are working on new synthesis techniques in the system. Many of these will become available and where possible we will give pointers from the Festival home page to them.
In particular, there is an alternative residual excited LPC module implemented at the Center for Spoken Language Understanding (CSLU) at the Oregon Graduate Institute (OGI).

File: festival.info,  Node: Audio output,  Next: Voices,  Prev: Other synthesis methods,  Up: Top

Audio output
************

If you have never heard any audio on your machine then you must first work out whether you have the appropriate hardware. If you do, you also need the appropriate software to drive it. Festival can directly interface with a number of audio systems or use external methods for playing audio.

The currently supported audio methods are

`NAS'
     NCD's NAS is a network-transparent audio system (formerly called
     netaudio). If you already run servers on your machines you simply
     need to ensure your `AUDIOSERVER' environment variable is set (or
     your `DISPLAY' variable if your audio output device is the same
     as your X Windows display). You may set NAS as your audio output
     method by the command

          (Parameter.set 'Audio_Method 'netaudio)

`/dev/audio'
     On many systems `/dev/audio' offers a simple low level method for
     audio output. It is limited to mu-law encoding at 8KHz. Some
     implementations of `/dev/audio' allow other sample rates and
     sample types, but as that is non-standard this method only uses
     the common format. Typical systems that offer these are Suns,
     Linux and FreeBSD machines. You may set direct `/dev/audio'
     access as your audio method by the command

          (Parameter.set 'Audio_Method 'sunaudio)

`/dev/audio (16bit)'
     Later Sun Microsystems workstations support 16 bit linear audio
     at various sample rates. This form of audio output is supported.
     It is a compile time option (as it requires include files that
     only exist on Sun machines).
     If your installation supports it (check the members of the list
     `*modules*') you can select 16 bit audio output on Suns by the
     command

          (Parameter.set 'Audio_Method 'sun16audio)

     Note this will send the audio to the local machine where the
     festival binary is running; this might not be the one you are
     sitting next to--that's why we recommend netaudio. A hacky
     solution to playing audio on a local machine from a remote
     machine without using netaudio is described in *Note
     Installation::.

`/dev/dsp (voxware)'
     Both FreeBSD and Linux have a very similar audio interface
     through `/dev/dsp'. There is compile time support for these in
     the speech tools and when compiled with that option Festival may
     utilise it. Check the value of the variable `*modules*' to see
     which audio devices are directly supported. On FreeBSD, if
     supported, you may select local 16 bit linear audio by the
     command

          (Parameter.set 'Audio_Method 'freebsd16audio)

     While under Linux, if supported, you may use the command

          (Parameter.set 'Audio_Method 'linux16audio)

     Some earlier (and smaller) machines only have 8 bit audio even
     though they include a `/dev/dsp' (Soundblaster PRO for example).
     This was not dealt with properly in earlier versions of the
     system, but now the support automatically checks the sample
     width supported and uses it accordingly. 8 bit audio at
     frequencies higher than 8K sounds better than straight 8k ulaw,
     so this feature is useful.

`mplayer'
     Under Windows NT or 95 you can use the `mplayer' command, which
     we have found requires special treatment to get its parameters
     right.
     Rather than using `Audio_Command' you can select this on Windows
     machines with the following command

          (Parameter.set 'Audio_Method 'mplayeraudio)

     Alternatively built-in audio output is available with

          (Parameter.set 'Audio_Method 'win32audio)

`SGI IRIX'
     Builtin audio output is now available for SGI's IRIX 6.2 using
     the command

          (Parameter.set 'Audio_Method 'irixaudio)

`Audio Command'
     Alternatively the user can provide a command that can play an
     audio file. Festival will execute that command in an environment
     where the shell variable `SR' is set to the sample rate (in Hz)
     and `FILE' to the name of an unheadered raw, 16 bit file
     containing the synthesized waveform in the byte order of the
     machine Festival is running on. You can specify your audio play
     command, and that Festival should execute it, through the
     following commands

          (Parameter.set 'Audio_Command "sun16play -f $SR $FILE")
          (Parameter.set 'Audio_Method 'Audio_Command)

     On SGI machines under IRIX the equivalent would be

          (Parameter.set 'Audio_Command "sfplay -i integer 16 2scomp rate $SR end $FILE")
          (Parameter.set 'Audio_Method 'Audio_Command)

For the `Audio_Command' method of playing waveforms, Festival supports two additional audio parameters. `Audio_Required_Rate' allows you to use Festival's internal sample rate conversion function to convert to any desired rate. Note this may not be as good as playing the waveform at the sample rate it was originally created in, but as some hardware devices are restrictive in what sample rates they support, or have naive resampling functions, this could be optimal. The second additional audio parameter is `Audio_Required_Format', which can be used to specify the desired output format of the file. The default is unheadered raw, but this may be any of the values supported by the speech tools (including nist, esps, snd, riff, aiff, audlab, raw and, if you really want it, ascii).
For example suppose you have a program that only plays sun headered files at 16000 Hz; you can set up audio output as

     (Parameter.set 'Audio_Method 'Audio_Command)
     (Parameter.set 'Audio_Required_Rate 16000)
     (Parameter.set 'Audio_Required_Format 'snd)
     (Parameter.set 'Audio_Command "sunplay $FILE")

Where the audio method supports it, you can specify an alternative audio device for machines that have more than one audio device.

     (Parameter.set 'Audio_Device "/dev/dsp2")

If Netaudio is not available and you need to play audio on a machine different from the one Festival is running on, we have had reports that `snack' (`http://www.speech.kth.se/snack/') is a possible solution. It allows remote play but importantly also supports Windows 95/NT based clients.

Because you do not want to wait for a whole file to be synthesized before you can play it, Festival also offers an _audio spooler_ that allows the playing of audio files while continuing to synthesize the following utterances. On reasonable workstations this allows the breaks between utterances to be as short as your hardware allows them to be.

The audio spooler may be started by selecting asynchronous mode

     (audio_mode 'async)

This is switched on by default by the function `tts'. You may put Festival back into synchronous mode (i.e. the `utt.play' command will wait until the audio has finished playing before returning) by the command

     (audio_mode 'sync)

Additional related commands are

`(audio_mode 'close)'
     Close the audio server down but wait until it is cleared. This is
     useful in scripts etc. when you wish to exit only when all audio
     is complete.

`(audio_mode 'shutup)'
     Close the audio down now, stopping the current file being played
     and any in the queue. Note that this may take some time to take
     effect depending on which audio method you use. Sometimes there
     can be 100s of milliseconds of audio in the device itself which
     cannot be stopped.

`(audio_mode 'query)'
     Lists the size of each waveform currently in the queue.
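The audio spooler described above can be driven from a script. Here is a minimal sketch (the sentences are invented; `SayText' is Festival's standard synthesize-and-play function):

```scheme
;; Queue several utterances asynchronously, then wait for the
;; queue to drain before continuing (e.g. before a script exits).
(audio_mode 'async)             ; start the audio spooler
(SayText "First sentence.")     ; returns once the audio is queued
(SayText "Second sentence.")    ; synthesized while the first plays
(audio_mode 'close)             ; block until all queued audio is played
```

Using `(audio_mode 'close)' rather than `(audio_mode 'shutup)' at the end ensures nothing in the queue is cut off.
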
File: festival.info,  Node: Voices,  Next: Tools,  Prev: Audio output,  Up: Top

Voices
******

This chapter gives some general suggestions about adding new voices to Festival. Festival attempts to offer an environment where new voices and languages can easily be slotted into the system.

* Menu:

* Current voices::          Currently available voices
* Building a new voice::    Building a new voice
* Defining a new voice::    Defining a new voice