Speech Synthesizer Review
Over the past few days I've been playing around with the various speech
synthesizers on the market today. What I've compiled here is a list of
the more feature-rich synthesizers. "Richness" was determined by two factors:
how the speech synthesis sounded to me, and what features the synthesizer
provided towards our goal of adding lip-syncing and other speech-determined
acts (gesture, gaze, etc.). Below is a table outlining my findings (with
sound samples and URLs where available), followed by a more in-depth description
of each synthesizer. (Or at least as much depth as I could get.)
(Ben A. Moore)
Starting where Ben stopped, I have added two more synthesizers to this
review. I also checked for more recent versions of the synthesizers already
mentioned on this page, and they all still seemed up to date.
You can hear samples of what I feel are the better options below.
After I had compiled this document, Jeff informed
me that we also need to get timing information back from the synthesizer,
so I'm now looking into that as well. Hopefully this should
be a sufficient amount of information to start with.
Samples and Links

Legend: X = has feature; || = play sample.
There are some bugs in this page which cause it problems with Netscape.
(Only tested under Windows.)

Laureate
  Synthesis: generated from voice samples
  Voice control: pitch adjustment, speed, word duration, volume
  Callbacks: only AudioStart and AudioStop from SAPI

TrueTalk
  Synthesis: parameterized (vocal tract, pitch, breathiness, ...)
  Voice control: pitch, speed, volume, tone (rising/falling)
  Callbacks: word boundaries, text indices (at ends of words and at marks
  set within the text)

Whistler
  Synthesis: generated from voice samples
  Voice control: pitch, speed, volume (in practice limited to the voice pitch)
  Callbacks: AudioStart, AudioStop, AttribChanged, Bookmark, ClickIn,
  ClickOut, TextDataStarted, TextDataStopped, Visual, WordPosition
  Note: the reference implementation of SAPI; provides "Visual" data for
  lip-syncing.

(Unnamed engine)
  Synthesis: based on speech samples
  Voice control: volume, pitch, speed, pauses

Next Generation Speech
  (no interface details available yet)

Bell Labs TTS
  Synthesis: parameterized (same as TrueTalk)
  Voice control: same as TrueTalk

Festival
  Synthesis: diphone and unit selection
  Voice control: pitch, tone, emphasis, pauses, volume

ETI-Eloquence
  Synthesis: parameterized (pitch, speed, volume, head size, pitch
  fluctuation, breathiness, roughness)
  Voice control: pitch, speed, volume, emphasis, tone, intonation, ...
  Callbacks: wave-out, text indices, phoneme indices (probably all SAPI
  callbacks; not 100% clear from website)

(Unnamed engine)
  Synthesis: concatenation of human voice segments

Elan Speech Engine
  Voice control: pitch, speed, pauses, ?
  Callbacks: probably all SAPI callbacks; not 100% clear from website
Laureate
This is the BT Labs speech synthesizer. It allows for the
creation of new voices from samples of a human voice. Laureate provides
a lot of control over the prosody of the generated text, but unfortunately
may not provide a sufficient number of callbacks to allow for accurate
lip-syncing. They implement the Microsoft SAPI interface, but currently
only provide notification of "AudioStart" and "AudioStop" events. Their
representative believes that it may be possible to do lip syncing with
only start and stop notification, but that would have to be studied.
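One way to study that question would be to spread the reported utterance duration over the words of the text. The sketch below is my own (plain Python, no engine attached; all names are hypothetical), apportioning the interval between the start and stop notifications by word length:

```python
# A rough sketch (not part of any engine's API) of lip-sync timing when
# only AudioStart/AudioStop notifications exist: spread the known
# utterance duration over the words in proportion to their length.

def estimate_word_timings(text, audio_start, audio_stop):
    """Return (word, start, end) tuples apportioning [audio_start,
    audio_stop] to each word by its share of the character count."""
    words = text.split()
    total_chars = sum(len(w) for w in words) or 1
    duration = audio_stop - audio_start
    timings = []
    t = audio_start
    for w in words:
        span = duration * len(w) / total_chars
        timings.append((w, t, t + span))
        t += span
    return timings
```

This ignores pauses and per-phoneme variation, so it is at best a baseline to compare against real word-boundary callbacks.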
TrueTalk
TrueTalk is our old stand-by speech synthesizer. I'm fairly
familiar with its interface and could probably get it to do what we want.
It only provides a few callbacks as well. There are two types: one
is sent when "index markers" are passed over in the text, and the other
at the end of each synthesized word. We have had fairly decent results
with this engine so far.
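Index markers of this kind map naturally onto gesture and gaze triggers. The sketch below shows only the dispatch pattern; the engine interface itself is omitted (and TrueTalk's real one differs): a registry of actions keyed by marker id, invoked from the engine's marker callback.

```python
# Sketch: driving gestures from index-marker callbacks. The registry
# maps marker ids (placed in the text before synthesis) to callables.

class MarkerDispatcher:
    def __init__(self):
        self._actions = {}

    def register(self, marker_id, action):
        """Associate a callable with an index marker placed in the text."""
        self._actions[marker_id] = action

    def on_marker(self, marker_id):
        """Call from the engine's callback as synthesis passes a marker;
        markers with no registered action are ignored."""
        action = self._actions.get(marker_id)
        if action is not None:
            action()
```

For example, registering a head-nod action at marker 1 would make the nod fire exactly when synthesis reaches that point in the sentence.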
Whistler
This is Microsoft's offering to the speech community. It
provides all of the callbacks we could need, but is sorely lacking in the
ability to modify the voice. You have control over the pitch; the other
properties limit your control to prenamed settings. I haven't messed
around with these properties enough to entirely discount them, but I have
my doubts about their usefulness.
Additions by Casper: Microsoft says on their website
that the voice still has a distinct machine sound, whereas other companies
claim that their engines sound (almost) human.
The Whistler engine does include a dialog to change the
pronunciation of words. An example can be found (after a default installation
of the SAPI 4) in C:\Program Files\Microsoft Speech SDK\TTS, called attstest.exe
The engine doesn't sound too bad, but I have very little
information about this product other than the fact that it provides a lot
of control over the voice. (I e-mailed them, but they haven't gotten back
to me yet.)
Next Generation Speech
This is the new AT&T (Lucent) TTS engine. It sounds awesome,
but it is not available yet and I don't have much info on how its interface
works. A possibility would be to try to get a pre-pre-pre-release (as source
if possible) and then modify it to do what we need as far as callbacks
are concerned. It is built upon CHATR and uses Festival for its text
processing, so it is only an audio-generating engine at this point. (They
may never make it do the text processing itself, since CHATR and Festival
already do this part well.) Personal note: I think this may be a
better option for a later project, but starting on it now would probably
be good. Update: Since I initially produced this page, Lucent has
begun providing three voices for the new engine. I've added some new samples
to the table above. Update 2: I've added a new page
that highlights some of the prosody control features found in this synthesizer.
This page is in a very rough form and currently does not provide explanations
of any of the control strings provided.
Bell Labs TTS
This is the old standby synthesizer. It provides all the
voice control that we need, but it does not provide the callbacks necessary
to do what we want. Many of the synthesizers on the market today are based
upon this engine (including Entropic's). I'm still gathering more detailed
information about this engine.
Festival
I had initially left Festival out of this list because all
of the samples I had heard to date were terrible. Then I ran across a sample
from their new engine (this may be the same work being done at AT&T),
and I was impressed. So I'll collect some more info on Festival soon and
post it here. Update: The new version of Festival (1.4.0) was released
last month. It is capable of using the new synthesis methods heard in
the samples of AT&T's new synthesis engine. Unfortunately it does not
ship with a voice; you have to provide that yourself. They do provide a
tutorial on how to produce your own voices for use with Festival. Considerations:
Festival is a research product. As such it has some ease-of-use problems with its
installation. This should be taken into consideration depending on what
its intended use is. (E.g., try not to ship it to people unless you are
going to precompile it.)
ETI-Eloquence
This engine definitely has possibilities. It provides most
of the features we need, including the ability to look up the phoneme representation
of a word. (Whistler can also provide this feature, but now through a
Java/ActiveX interface.) It sounds OK (nowhere near the Next Generation
Speech from AT&T) and will probably improve over time.
Update by Casper: ETI-Eloquence now supports SAPI.
Especially the manually tweaked voices (using prosody tags) sound good. The
fact that they support tags within the strings to be spoken means a lot
of relatively easy control over prosodic features.
Some quotes from their website that are interesting:
For each phoneme (speech sound), ETI-Eloquence can
signal the application as the phoneme is being played through the output
device, to allow the application to synchronize animation with the speech.
ETI-Eloquence provides application developers with
three kinds of dictionaries for overriding the default pronunciation of
items in the input text: special words,
abbreviations, and roots.
ETI-Eloquence provides a number of prosody (intonation
and timing) annotations, which can be used [...] to produce a wide range
of special prosodic effects.
All features are described in more detail in their online documentation.
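Per-phoneme signals of the kind quoted above map directly onto lip-syncing: each phoneme selects a mouth shape (viseme) as it plays. The grouping below is an illustrative assumption of mine, not ETI-Eloquence's actual phoneme inventory:

```python
# Sketch: turning per-phoneme callbacks into mouth shapes. The phoneme
# symbols and the viseme grouping are illustrative, not engine-specific.

VISEME_FOR_PHONEME = {
    "p": "closed", "b": "closed", "m": "closed",  # lips together
    "f": "lip-teeth", "v": "lip-teeth",           # lip against teeth
    "aa": "open", "ae": "open",                   # open vowels
    "uw": "rounded", "ow": "rounded",             # rounded vowels
}

def viseme_for(phoneme):
    """Map a phoneme symbol to a mouth shape, defaulting to 'neutral'."""
    return VISEME_FOR_PHONEME.get(phoneme, "neutral")
```

An animation loop would call viseme_for() from the engine's phoneme callback and pose the mouth accordingly.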
The samples on the website sound pretty nice, although there
are some mistakes in stressing the right syllables. But the information
on their homepage is minimal. A few promising lines:
- Exception dictionaries are easily created to customize
- Available in a female voice
- Control sequences available to specify pauses, emphasis,
  DTMF tones, etc.
This engine was designed for quality and not for low memory
and CPU usage, so it probably takes a lot of resources.
Something that might be useful is the fact that this
engine comes in multiple languages. Only English and German are shipping
so far, but French, Italian, Korean, Spanish, Dutch, and Japanese will follow.
I've sent them an email with
questions regarding the API, callbacks, etc., and am still waiting for an answer.
Update: This engine is unsuitable for desktop
applications because of the resources it needs; it is designed for the
telecommunications market.
Elan Speech Engine
The quality of this engine is not very good: although you
can understand it with no problem, it just doesn't sound natural. This
might be especially true for the English voices; I do think the German
and French voices are better. The
engine supports SAPI, and if they put some more work into the English voices,
it could be an option in the future.
Note: there is a downloadable demo, but I didn't get
it to run on my PC.
Some final notes
Running a speech synthesizer as a server: I think
it would be possible to take any of the solutions here and turn
it into a server-based speech synthesizer. The only thing you would need
to do is get the audio data from the synthesizer and pipe it to a network
stream or file. Most of the demo programs for these engines
show that the engine can write its audio data out to a file.
I haven't listed what platforms are supported by each
system yet, but generally you can assume they are available for Win9x/NT,
with the exception of TrueTalk. A few are also available on Unix and/or
Last updated November 5, 1999 by Casper