7 APIs

Flite is a library that we expect to be embedded in other applications. Included with the distribution is a small example executable that allows synthesis of strings of text and text files from the command line.

You may want to look at Bard (http://festvox.org/bard/), an ebook reader tightly coupled to Flite as its synthesizer. It is the most elaborate use of the Flite API within our suite of programs.


7.1 flite binary

The example flite binary may be suitable for very simple applications. Unlike Festival, its start-up time is very short (less than 25ms on a PIII 500MHz), making it practical (on larger machines) to call it each time you need to synthesize something.

flite TEXT OUTPUTTYPE

If TEXT contains a space it is treated as a string of text and converted to speech; if it does not contain a space it is treated as a file name and the contents of that file are converted to speech. The option -t specifies that TEXT is to be treated as text (not a filename) and -f forces treatment as a file. Thus

flite -t hello 

will say the word "hello" while

flite hello 

will say the content of the file hello. Likewise

flite "hello world."

will say the words "hello world" while

flite -f "hello world"

will say the contents of a file named hello world. If no argument is specified, text is read from standard input.
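For example, text can be piped in from another program:

echo "Hello world." | flite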

The second argument, OUTPUTTYPE, is the name of a file the output is written to. If it is play, the output is sent directly to the audio device; if it is none, the audio is created but discarded (this is used for benchmarking); if it is stream, the audio is streamed through a callback function (though this is not particularly useful in the command-line version). If OUTPUTTYPE is omitted, play is assumed. You can also set the output type explicitly with the -o flag.

flite -f doc/alice -o alice.wav

7.2 Voice selection

All the voices in the distribution are collected into a single simple list in the global variable flite_voice_list. You can select a voice from this list on the command line

flite -voice awb -f doc/alice -o alice.wav

and list the voices currently supported in the binary with

flite -lv

The voices that get linked in are those listed in the VOICES variable in main/Makefile. You can change that as you require.

Voices may also be dynamically loaded from files as well as built in. The argument to the -voice option may be a pathname to a dumped (Clustergen) voice. This may be a Unix pathname or a URL (only the http and file protocols are supported). For example

flite -voice file://cmu_us_awb.flitevox -f doc/alice -o alice.wav
flite -voice http://festvox.org/voices/cmu_us_ksp.flitevox -f doc/alice -o alice.wav

Voices will be loaded once and added to flite_voice_list. Although these voices are often small (a few megabytes), some time is still required to read them in the first time. The voices are not memory-mapped; they are read into newly created structures.

This loading function is currently only supported for Clustergen voices.


7.3 C example

Each voice in Flite is held in a structure, a pointer to which is returned by the voice registration function. In the standard distribution, the example diphone voice is cmu_us_kal.

Here is a simple C program that uses the flite library:

#include <stdio.h>
#include <stdlib.h>
#include "flite.h"

cst_voice *register_cmu_us_kal(const char *voxdir);

int main(int argc, char **argv)
{
    cst_voice *v;

    if (argc != 2)
    {
        fprintf(stderr,"usage: flite_test FILE\n");
        exit(-1);
    }

    flite_init();

    /* Register the KAL diphone voice and get a pointer to it */
    v = register_cmu_us_kal(NULL);

    /* Synthesize each sentence in the file, playing it to the audio device */
    flite_file_to_speech(argv[1],v,"play");

    return 0;
}

Assuming the shell variable FLITEDIR is set to the flite directory, the following will compile the program (with appropriate changes for your platform if necessary):

gcc -Wall -g -o flite_test flite_test.c -I$FLITEDIR/include -L$FLITEDIR/lib \
    -lflite_cmu_us_kal -lflite_usenglish -lflite_cmulex -lflite -lm

7.4 Public Functions

Although you are, of course, welcome to call lower-level functions, there are a few key functions that will satisfy most users of Flite.

void flite_init(void);

This must be called before any other flite function can be called. As of Flite 1.1, it actually does nothing at all, but there is no guarantee that this will remain true.

cst_wave *flite_text_to_wave(const char *text,cst_voice *voice);

Returns a waveform (as defined in include/cst_wave.h) synthesized from the given text string by the given voice.
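For example, a minimal sketch (assuming the KAL voice, registered as in the example above) that synthesizes a string and saves the waveform to disk with cst_wave_save_riff from include/cst_wave.h:

#include "flite.h"

cst_voice *register_cmu_us_kal(const char *voxdir);

int main(void)
{
    cst_voice *v;
    cst_wave *w;

    flite_init();
    v = register_cmu_us_kal(NULL);

    /* Build the waveform in memory rather than playing it */
    w = flite_text_to_wave("Hello world.", v);

    cst_wave_save_riff(w, "hello.wav");  /* write as a RIFF (WAV) file */
    delete_wave(w);

    return 0;
}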

float flite_file_to_speech(const char *filename, cst_voice *voice, const char *outtype);

synthesizes all the sentences in the file filename with the given voice. Output (at present) can only reasonably be play or none. If the feature file_start_position is set to an integer, that position is used as the starting point in the file to be synthesized.
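For example, a sketch of starting synthesis part-way through a file, setting the feature on the voice (feat_set_int is the standard integer feature setter; the offset here is hypothetical):

feat_set_int(v->features, "file_start_position", 1024);
flite_file_to_speech("doc/alice", v, "play");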

float flite_text_to_speech(const char *text, cst_voice *voice, const char *outtype);

synthesizes the text in the string pointed to by text with the given voice. outtype may be a filename to which the generated waveform is written, "play" to send it to the audio device, or "none" to discard it. The return value is the number of seconds of speech generated.
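For example (assuming a registered voice v):

float durn;
durn = flite_text_to_speech("Hello world.", v, "play");
printf("spoke %1.3f seconds\n", durn);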

cst_utterance *flite_synth_text(const char *text,cst_voice *voice);

synthesizes the given text with the given voice and returns the resulting utterance for further processing and access.

cst_utterance *flite_synth_phones(const char *phones,cst_voice *voice);

synthesizes the given phones with the given voice and returns the resulting utterance for further processing and access.
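For example, a sketch (assuming a registered voice v) that walks the Word relation of a synthesized utterance and prints each word:

cst_utterance *utt;
cst_item *item;

utt = flite_synth_text("Hello world.", v);
for (item = relation_head(utt_relation(utt, "Word"));
     item;
     item = item_next(item))
    printf("%s\n", item_feat_string(item, "name"));
delete_utterance(utt);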

cst_voice *flite_voice_select(const char *name);

returns a pointer to the voice named name, or NULL if there is no match. If name == NULL, the first voice in the voice list is returned. If name is a URL (starting with file: or http:), that file will be accessed and the voice downloaded from there.
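For example (the voice file name is illustrative):

cst_voice *v;

v = flite_voice_select("file://cmu_us_awb.flitevox");
if (v == NULL)    /* no match: fall back to the first voice in the list */
    v = flite_voice_select(NULL);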

float flite_ssml_file_to_speech(const char *filename, cst_voice *voice, const char *outtype);

Will read the file as SSML. Not all SSML tags are supported, but many are; unsupported ones are ignored. Voice selection works by naming the internal name of the voice, or the name may be a URL, in which case the voice will be loaded. The audio tag is supported for loading waveform files; again, URLs are supported.

float flite_ssml_text_to_speech(const char *text, cst_voice *voice, const char *outtype);

Will treat the text as SSML.
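For example, a hypothetical call (the voice name and audio file are illustrative, and since Flite supports a subset of SSML the exact tag coverage may vary):

flite_ssml_text_to_speech(
    "<speak>"
    " <voice name=\"slt\"> Hello world. </voice>"
    " <audio src=\"file://chime.wav\"/>"
    "</speak>",
    v, "play");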

int flite_voice_add_lex_addenda(cst_voice *v, const cst_string *lexfile);

loads the pronunciations from lexfile into the lexicon identified in the given voice (which will cause all other voices using that lexicon to also get this new addenda list). An example lexicon file is given in flite/tools/examples.lex. Words may be in double quotes, and an optional part-of-speech tag may be given. A colon separates the headword/POS tag from the list of phonemes. Stress values (if used in the lexicon) must be specified. Invalid phonemes are reported on standard output.
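Following that format, hypothetical entries might look like this (the phonemes are from the US English phone set, with stress marked on the vowels):

"hello" : hh ah0 l ow1
photo n : f ow1 t ow0

Such a file would then be loaded with flite_voice_add_lex_addenda(v, "my.lex").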


7.5 Streaming Synthesis

In 1.4, support was added for streaming synthesis. Basically, you may provide a callback function that will be called with waveform data as soon as it is available. This can potentially reduce the delay between sending text to the synthesizer and having audio available.

The support is through a callback function of type

int audio_stream_chunk(const cst_wave *w, int start, int size, 
                       int last, cst_audio_streaming_info *asi)

If the utterance feature streaming_info is set (it can be set in a voice or in an utterance), the LPC or MLSA resynthesis functions will call the provided function as buffers become available. The LPC and MLSA waveform synthesis functions are used for diphone, limited domain, unit selection, and clustergen voices. Note that explicit support is required for streaming, so new waveform synthesis functions may not have this functionality.

An example streaming function is provided in src/audio/au_streaming.c and is used by the example flite main program when stream is given as the playing option. (Though in the command-line program it isn't really useful.)

In order to use streaming you must provide a callback function in your particular thread. This is done by adding features to the voice in your thread. Suppose your function is declared as

int example_audio_stream_chunk(const cst_wave *w, int start, int size, 
                       int last, cst_audio_streaming_info *asi)

You can add this function as the streaming function through the statement

     cst_audio_streaming_info *asi;
...
     asi = new_audio_streaming_info();
     asi->asc = example_audio_stream_chunk;
     feat_set(voice->features,
             "streaming_info",
             audio_streaming_info_val(asi));

You may also optionally include your own pointer to any additional information you want passed to your function. For example

typedef struct {
   cst_audiodev *fd;
   int count;
} my_callback_struct;

my_callback_struct *mcs;
cst_audio_streaming_info *asi;

...

mcs = cst_alloc(my_callback_struct,1);
mcs->fd = NULL;
mcs->count = 1;

asi = new_audio_streaming_info();
asi->asc = example_audio_stream_chunk;
asi->userdata = mcs;
feat_set(voice->features,
         "streaming_info",
         audio_streaming_info_val(asi));
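A minimal sketch of what the callback itself might look like (this hypothetical version just counts samples through the userdata pointer; CST_AUDIO_STREAM_CONT and CST_AUDIO_STREAM_STOP are the defined return values):

int example_audio_stream_chunk(const cst_wave *w, int start, int size, 
                       int last, cst_audio_streaming_info *asi)
{
    my_callback_struct *mcs = (my_callback_struct *)asi->userdata;

    /* This chunk's samples are w->samples[start] .. w->samples[start+size-1] */
    mcs->count += size;

    if (last)
        printf("received %d samples at %d Hz\n", mcs->count, w->sample_rate);

    return CST_AUDIO_STREAM_CONT;  /* keep synthesizing */
}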

Another example is given in testsuite/by_word_main.c, which shows a callback function that also prints the token as it is being synthesized. The utt field in the cst_audio_streaming_info structure will be set to the current utterance. Please note that the item field in the cst_audio_streaming_info structure is for your convenience and is not set by anyone at all. The previous sentence exists in the documentation so that I can point at it when users fail to read it.

