Speech Synthesizer Review

Over the past few days I've been playing around with the various speech synthesizers on the market today. What I've compiled here is a list of the more feature-rich synthesizers. "Richness" was determined by two factors: how I felt the speech synthesis sounded, and what features the synthesizer provided towards our goal of adding lip-syncing and other speech-determined acts (gesture, gaze, etc.). Below is a table outlining my findings (with sound samples and URLs where available), followed by a more in-depth description of each synthesizer. (Or at least as much depth as I could get.)

(Ben A. Moore)

Starting where Ben stopped, I have added two more synthesizers to this review. I also checked for more recent versions of the synthesizers already mentioned on this page and they all still seemed up-to-date.

(Casper Eyckelhof)

Notes

You can hear samples of what I feel are the better options for synthesizers below. After I had compiled this document, Jeff informed me that we also need to get timing information back from the synthesizer, so I'm now looking into that as well. Hopefully this should be a sufficient amount of information to start with.

Samples and Links

Name | Company | Prosody | Callbacks | Compare Samples | Demo | Install
Laureate | BT Labs | X | X [1] | Male (British), Female (British) | Try it! | Server, Local
Truetalk | Entropic | X | X [2] | | | Server, Local
Whistler | Microsoft Research | X [3] | X [4] | Male, Female | | Local
TTS3000/M | Lernout & Hauspie | X | X | British | | ?
Next Generation Speech | Lucent (AT&T Research) | X | ? | Male, Female | Try it! | ?
Bell Labs TTS | AT&T Bell Labs | X | no | Male, Female | Try it! | ?
Festival | CSTR University of Edinburgh | X | ? | Male w/ Festival (British), Male w/ MBROLA (British) | Try it! [5] | Local (compile)
ETI-Eloquence | ETI-Eloquence | X | X | Male, Female | Try it | Server, Local
Realspeak | Lernout & Hauspie | X | ? | | | ?
Elan Speech Engine | Elan informatique | X | X | | Try it | Local
Some of the samples are British English; to check, hold the mouse over the sample icon.



  Key:
X: Has feature  ?: Unknown  [icon]: Play sample  [icon]: Male  [icon]: Female
1. Only provides AudioStart and AudioStop callbacks from SAPI.
2. Only provides callbacks at ends of words and from marks set within the text.
3. Control is limited to the voice pitch.
4. This is the reference implementation of the SAPI. It provides "Visual" data for lip-syncing.
5. There are some bugs in this page, which cause problems with Netscape. (Only tested under Windows.)



 
Name | Voice | Prosody | SAPI | Callbacks
Laureate | Generated from voice samples | Pitch adjustment, speed, word duration, volume | Yes [1] | AudioStart, AudioStop
Truetalk | Parameterized (Vocal Tract, Pitch, Breathiness, Gender) | Pitch, Speed, Volume, Tone (Rising/Falling) | No | Word Boundaries, Text Indices
Whistler | Generated from voice samples | Pitch, Speed, Volume | Yes | AudioStart, AudioStop, AttribChanged, Bookmark, ClickIn, ClickOut, TextDataStarted, TextDataStopped, Visual, WordPosition
TTS3000/M | Based on speech samples | Volume, Pitch, Speed, Pauses | No | ?
Next Generation Speech | Unit Selection | ? | ? | ?
Bell Labs TTS | Parameterized (same as Truetalk) | Same as Truetalk | No | None
Festival | Diphone and Unit Selection | Pitch, Tone, Emphasis, Pauses, Volume | No | None
ETI-Eloquence | Parameterized (Pitch, Speed, Volume, Head Size, Pitch Fluctuation, Breathiness, Roughness) | Pitch, Speed, Volume, Emphasis, Tone, Intonation, Pauses | Yes | Wave-out, Text Indices, Phoneme Indices (probably all SAPI callbacks, not 100% clear from website)
RealSpeak | Concatenation of human voice segments | Pauses, emphasis, ? | ? | ?
Elan Speech Engine | Concatenative synthesis | Pitch, speed, pauses, ? | Yes | (probably all SAPI callbacks, not 100% clear from website)

Laureate

This is the BT Labs speech synthesizer. It allows for the creation of new voices from samples of a human voice. Laureate provides a lot of control over the prosody of the generated speech, but unfortunately may not provide a sufficient number of callbacks to allow for accurate lip-syncing. They implement the Microsoft SAPI interface, but currently only provide notification of "AudioStart" and "AudioStop" events. Their representative believes that it may be possible to do lip-syncing with only start and stop notification, but that would have to be studied.
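
One way to study the start/stop-only idea is to spread the phonemes of an utterance across the interval between the two notifications. The sketch below only illustrates that idea and is not Laureate's API: the function name, the phoneme labels, and the assumption that vowels last about twice as long as consonants are all made up for this example.

```python
from typing import List, Tuple

# Hypothetical weighting: vowels are assumed to last roughly twice as
# long as consonants. The labels and weights are illustrative only.
VOWELS = {"AA", "AE", "AH", "EH", "IY", "OW", "UW"}


def estimate_phoneme_times(
    phonemes: List[str], audio_start: float, audio_stop: float
) -> List[Tuple[str, float, float]]:
    """Spread phonemes over [audio_start, audio_stop] in proportion to weight."""
    if audio_stop < audio_start:
        raise ValueError("audio_stop must not precede audio_start")
    weights = [2.0 if p in VOWELS else 1.0 for p in phonemes]
    total = sum(weights)
    span = audio_stop - audio_start
    schedule = []
    cursor = audio_start
    for phoneme, weight in zip(phonemes, weights):
        duration = span * weight / total
        # Each entry is (phoneme, estimated start, estimated end) in seconds.
        schedule.append((phoneme, cursor, cursor + duration))
        cursor += duration
    return schedule
```

The estimate is only as good as the weights, so any drift between the guessed and the real timing would have to be measured against actual audio before relying on it.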

TrueTalk

TrueTalk is our old stand-by speech synthesizer. I'm fairly familiar with its interface and could probably get it to do what we want. It only provides a few callbacks as well. There are two types: one is sent when "index markers" are passed over in the text, and the other at the end of each synthesized word. We have had fairly decent results with this engine so far.
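
To make the two callback types concrete, here is a small simulation of how an application might receive them in order. It does not use TrueTalk itself: the `[mark:N]` notation, the function name, and the handler names are hypothetical and chosen only for this sketch.

```python
import re
from typing import Callable, List, Tuple

# Made-up marker syntax for this sketch only; it is NOT TrueTalk's notation.
MARK = re.compile(r"^\[mark:(\d+)\]$")


def simulate_events(
    text: str,
    on_index: Callable[[int], None],
    on_word_end: Callable[[str], None],
) -> None:
    """Walk the text and fire the two kinds of callback in order."""
    for token in text.split():
        match = MARK.match(token)
        if match:
            on_index(int(match.group(1)))  # an index marker was passed over
        else:
            on_word_end(token)  # a word finished "speaking"


events: List[Tuple[str, object]] = []
simulate_events(
    "Hello [mark:1] over there",
    on_index=lambda n: events.append(("index", n)),
    on_word_end=lambda w: events.append(("word", w)),
)
```

In a setup like this, a gesture or gaze change could be attached to a marker number while the word-end events keep the mouth animation roughly in step.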

Whistler

This is Microsoft's offering to the speech community. It provides all of the callbacks we could need but is sorely lacking in the ability to modify the voice. You have control over the pitch. The other properties limit your control to pre-named settings; I haven't messed around with these enough to entirely discount them, but I have my doubts about their usefulness.
Additions by Casper: Microsoft says on their website that the voice still has a distinct machine sound, whereas other companies claim that their engines sound (almost) human.
The Whistler engine does include a dialog to change the pronunciation of words. An example can be found (after a default installation of SAPI 4) in C:\Program Files\Microsoft Speech SDK\TTS, called attstest.exe.

TTS3000/M

The engine doesn't sound too bad, but I have very little information about this product other than the fact that it provides a lot of control over the voice. (I e-mailed them, but they haven't gotten back to me yet.)

Next Generation Speech

This is the new AT&T (Lucent) TTS engine. It sounds awesome, but it is not available yet and I don't have much info on how its interface works. One possibility would be to try to get a pre-pre-pre-release (as source if possible) and then modify it to do what we need as far as callbacks are concerned. It is built upon CHATR and Festival, which provide its text processing, so it is only an audio-generating engine at this point. (They may never make it do the text processing itself, since CHATR and Festival already do this part well.)
Personal note: I think this may be a better option for a later project, but starting on it now would probably be good.
Update: Since I initially produced this page, Lucent has started providing 3 voices for their new engine. I've added some new samples to the table above.
Update2: I've added a new page that highlights some of the prosody control features found in this synthesizer. This page is in a very rough form and currently does not provide explanations of any of the control strings provided.

Bell Labs TTS

This is the old standby synthesizer. It provides all the voice control that we need, but it does not provide the callbacks necessary to do what we want. Many of the synthesizers on the market today are based upon this engine (including Entropic's). I'm still gathering more detailed information about this engine.

Festival

I had initially left Festival out of this list because all of the samples I had heard to date were terrible. Then I ran across a sample from their new engine (this may be the same work being done at AT&T), and I was impressed. So I'll collect some more info on Festival soon and post it here.
Update: The new version of Festival (1.4.0) was released last month. It is capable of using the new synthesis methods heard in the samples of AT&T's new synthesis engine. Unfortunately it does not ship with a voice; you have to provide that yourself. They do provide a tutorial on how to produce your own voices for use with Festival.
Considerations: Festival is a research product. As such it has some ease-of-use problems with its installation. This should be taken into consideration depending on what its intended use is. (E.g. try not to ship it to people unless you are going to precompile it.)

ETI-Eloquence

This engine definitely has possibilities. It provides most of the features we need, including the ability to look up the phoneme representation of a word. (Whistler also can provide this feature, but now through a Java/ActiveX interface.) It sounds OK (nowhere near the Next Generation Speech from AT&T) and will probably improve over time.
Update by Casper: ETI-Eloquence now supports SAPI. The manually tweaked voices (using prosody tags) sound especially good. The fact that they support tags within the strings to be spoken means a lot of relatively easy control over prosodic features.
Some quotes from their website that are interesting:
For each phoneme (speech sound), ETI-Eloquence can signal the application as the phoneme is being played through the output device, to allow the application to synchronize animation with the speech.
ETI-Eloquence provides application developers with three kinds of dictionaries for overriding the default pronunciation of items in the input text: special words, abbreviations, and roots.
ETI-Eloquence provides a number of prosody (intonation and timing) annotations, which can be used [...] to produce a wide range of special prosodic effects.
All features are described in more detail in their online white paper.
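
The per-phoneme signalling quoted above is what a lip-sync layer would hook into. As a rough sketch of the application side only, the code below maps each reported phoneme to a mouth shape. The class, the `on_phoneme` method, the phoneme labels, and the mouth-shape names are assumptions for illustration and are not taken from the ETI-Eloquence API.

```python
from typing import Dict, List

# Illustrative phoneme-to-mouth-shape table; all labels are assumptions.
PHONEME_TO_VISEME: Dict[str, str] = {
    "P": "closed", "B": "closed", "M": "closed",
    "F": "lip-teeth", "V": "lip-teeth",
    "AA": "open", "AE": "open",
    "OW": "rounded", "UW": "rounded",
}


class MouthAnimator:
    """Collects the mouth shapes an animation layer would display."""

    def __init__(self) -> None:
        self.frames: List[str] = []

    def on_phoneme(self, phoneme: str) -> None:
        # Hypothetical callback: called as each phoneme is played.
        viseme = PHONEME_TO_VISEME.get(phoneme, "neutral")
        # Skip repeats so the face does not re-trigger the same pose.
        if not self.frames or self.frames[-1] != viseme:
            self.frames.append(viseme)
```

Registering `on_phoneme` with a real engine would depend on that engine's notification mechanism, which is not shown here.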

RealSpeak

The samples on the website sound pretty nice, although there are some mistakes in stressing the right syllables. But the information on their homepage is minimal. A few promising lines:
Exception dictionaries are easily created to customize specific words
Available in a female voice
Control sequences available to specify pauses, emphasis, DTMF tones, etc.

This engine was designed for quality and not for low memory and CPU usage, so it probably takes a lot of resources.
Something that might be useful is the fact that this engine comes in multiple languages. Only English and German are shipping so far, but French, Italian, Korean, Spanish, Dutch, and Japanese will follow.

I've sent them an email with questions regarding the API, callbacks, etc., and am still waiting for an answer.
Update: This engine is unsuitable for desktop applications because of the resources it needs. It is designed for the telecommunications market.

Elan Speech Engine

The quality of this engine is not very good: although you can understand it with no problem, it just doesn't sound natural. This might be especially true for the English voices; I do think the German and French voices are better. The engine supports SAPI, and if they put some more work into the English voices, it could be an option in the future.
Note: there is a downloadable demo, but I didn't get it to run on my PC.
 

Some final notes

Running a speech synthesizer as a server: I don't think it would be impossible to take any of the solutions here and turn it into a server-based speech synthesizer. The only thing you would need to do is get the audio data from the synthesizer and pipe it to a network stream or file. Most of the demo programs that come with these engines show that they can write the audio data out to a file.
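
A minimal sketch of the server idea, under stated assumptions: `fake_synthesize` is a stand-in that produces a short tone as WAV data where a real engine would be called, and the one-line-in, length-prefixed-audio-out exchange is an invented protocol for this example, not something any of the engines above define.

```python
import io
import math
import socket
import struct
import wave


def fake_synthesize(text: str, rate: int = 8000) -> bytes:
    """Stand-in for a real engine: returns WAV bytes, 50 ms of tone per character."""
    n_samples = rate * len(text) // 20  # rate / 20 samples equals 50 ms
    frames = b"".join(
        struct.pack("<h", int(8000 * math.sin(2 * math.pi * 440 * i / rate)))
        for i in range(n_samples)
    )
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(frames)
    return buffer.getvalue()


def handle_request(conn: socket.socket) -> None:
    """Read one line of text, then reply with a 4-byte length and the audio."""
    text = conn.makefile("r", encoding="utf-8").readline().strip()
    audio = fake_synthesize(text)
    conn.sendall(struct.pack("!I", len(audio)) + audio)
```

A real server would call `handle_request` for each accepted connection. Timing or phoneme callbacks would still need their own channel back to the client, since only the audio is piped here.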

I haven't listed what platforms are supported by each system yet. But generally you can assume they are available for Win9x/NT, with the exception of Truetalk. A few are also available on Unix and/or Linux.

Last updated November 5, 1999 by Casper Eyckelhof