There are a couple of different ways that vocal synthesis can be approached. The method I'm using is called "formant synthesis", and it's one of the oldest techniques used for computer speech synthesis.

In English, there are approximately 40 distinct "sounds" that make up the language, which are referred to as "phonemes".

There are different phonetic systems, but one of the simplest for American English is the "Arpabet", which uses plain text characters to represent phonemes. For example, the word "dictionary" would be written:

/D IH K SH AH N EH R IY/

You can play around with the online CMU Pronouncing Dictionary to see how this works.

I use the CMU Dictionary to convert English into phonemes. If a word isn't found in the dictionary, I fall back to a public domain program called "Reciter" which guesses how to pronounce the word.
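The lookup-with-fallback idea can be sketched in a few lines of Python. To be clear, the tiny dictionary and the letter-to-sound rules below are made up for illustration - the real CMU Dictionary has well over 100,000 entries, and Reciter uses a much richer rule table:

```python
# A tiny stand-in for the CMU Dictionary (real entries also carry
# stress digits, e.g. "D IH1 K SH AH0 N EH2 R IY0").
CMU_SUBSET = {
    "dictionary": ["D", "IH", "K", "SH", "AH", "N", "EH", "R", "IY"],
    "beet": ["B", "IY", "T"],
    "boot": ["B", "UW", "T"],
}

# A few crude one-letter rules, standing in for Reciter's rule set.
FALLBACK_RULES = {"b": "B", "d": "D", "e": "IY", "o": "OW", "p": "P", "t": "T"}

def to_phonemes(word):
    """Look the word up; if it's not in the dictionary, guess."""
    word = word.lower()
    if word in CMU_SUBSET:
        return CMU_SUBSET[word]
    return [FALLBACK_RULES[ch] for ch in word if ch in FALLBACK_RULES]
```

Dictionary words come back exactly as listed; anything else gets a (often wrong, but pronounceable) guess.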

Phonemes are turned into sound by simulating the human vocal tract electronically. Before explaining that, let me give a (very simplified) explanation of how we create vocal sounds.

As air passes through the glottal folds, the folds vibrate and create sound. By controlling the tension (which in turn controls the length of the folds' opening), we can raise and lower the pitch we create. This pitch is called the fundamental frequency (F0), which we hear as the pitch of the voice.
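The simplest way to model that glottal source digitally is a pulse train: one pulse every fs/F0 samples. This is a deliberately crude sketch (a real glottal waveform is much smoother), but it captures the key point that the pulse rate sets the perceived pitch:

```python
def glottal_pulse_train(f0, fs, n_samples):
    """Crude glottal source: one unit pulse every fs/f0 samples.
    f0 is the fundamental frequency in Hz, fs the sample rate."""
    period = fs / f0          # samples per glottal cycle
    out = [0.0] * n_samples
    t = 0.0
    while t < n_samples:
        out[int(t)] = 1.0     # place a pulse at the start of each cycle
        t += period
    return out
```

Raising f0 packs the pulses closer together, which we hear as a higher pitch.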

This pitched glottal pulse (which resembles a kazoo sound) passes through our mouth. We use our tongue to create one or more resonating chambers that reinforce specific frequencies in the glottal pitch. These reinforced frequencies are called "formants", and they are what distinguish one phoneme from another.

For example, (borrowing from the SoftVoice website), here are a number of vowel sounds, and the frequencies of resonance for the "average" male speaker in Hz:

/IY/ (beet): 270, 2300, 3000
/IH/ (bit): 400, 2000, 2550
/EH/ (bet): 530, 1850, 2500
/AE/ (bat): 660, 1700, 2400
/AH/ (but): 640, 1200, 2400
/UW/ (boot): 300, 870, 2250

In the phoneme /IY/ (as in beet), the first formant (F1) is at 270Hz, the second (F2) is at 2300Hz, and the third (F3) is at 3000Hz. Again, these formants don't alter the fundamental pitch, and remain fixed no matter what pitch you're singing.
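The table above translates naturally into a lookup from phoneme to formant targets. This is just the quoted "average male" figures restated as data; my program's actual tables differ:

```python
# Phoneme -> (F1, F2, F3) in Hz, from the SoftVoice figures above.
VOWEL_FORMANTS = {
    "IY": (270, 2300, 3000),  # beet
    "IH": (400, 2000, 2550),  # bit
    "EH": (530, 1850, 2500),  # bet
    "AE": (660, 1700, 2400),  # bat
    "AH": (640, 1200, 2400),  # but
    "UW": (300, 870, 2250),   # boot
}

f1, f2, f3 = VOWEL_FORMANTS["IY"]  # the /IY/ example from the text
```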

Some phonemes are obviously more complex than that. For example, the phonemes /IY/ and /UW/ are diphthongs, and consist of two distinct targets. But I'm digressing...

To do this electronically, I generate a waveform that approximates a glottal pulse at the desired pitch, and pass it through a series of bandpass filters - one for each formant frequency - to resonate at the desired frequencies. The output is a rough approximation of the sound.
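Here's a minimal sketch of that pipeline, using the second-order digital resonator that Klatt-style synthesizers are built from. The formant bandwidths (60/90/120 Hz) are typical textbook values I've picked for illustration, not the ones my program actually uses:

```python
import math

def resonator(signal, freq, bandwidth, fs):
    """Second-order IIR resonator: boosts energy near `freq`;
    `bandwidth` controls how sharp the resonant peak is."""
    r = math.exp(-math.pi * bandwidth / fs)
    b1 = 2.0 * r * math.cos(2.0 * math.pi * freq / fs)
    b2 = -r * r
    a0 = 1.0 - b1 - b2            # normalize for unity gain at DC
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a0 * x + b1 * y1 + b2 * y2
        y2, y1 = y1, y
        out.append(y)
    return out

def synthesize_vowel(formants, f0=100, fs=8000, n_samples=8000):
    """Pulse train at f0, passed through one resonator per formant."""
    period = int(fs / f0)
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]
    for freq, bw in zip(formants, (60, 90, 120)):   # assumed bandwidths
        signal = resonator(signal, freq, bw, fs)
    return signal
```

Feeding in the /IY/ targets (270, 2300, 3000) gives a buzzy but recognizably vowel-like waveform.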

Changing the pitch of the glottal pulse changes the pitch that's being sung. Changing the resonating filters to new values changes the phoneme that's being sung.
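Moving from one phoneme to the next means gliding the filter frequencies from one set of targets to another. A simple way to do that (a sketch, not necessarily how my program does it) is linear interpolation between formant frames:

```python
def interpolate_formants(start, end, n_steps):
    """Glide each formant linearly from its start value to its end
    value over n_steps frames, so one phoneme morphs into the next."""
    frames = []
    for i in range(n_steps):
        t = i / (n_steps - 1) if n_steps > 1 else 0.0
        frames.append(tuple(s + t * (e - s) for s, e in zip(start, end)))
    return frames
```

Each frame would then drive the resonators for a few milliseconds before moving on to the next.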

Some sounds (like the frication in /F/ or the plosive in /T/) are created by different means than those described above. I used to synthesize them, but I now use digital samples because they give better results.

Very little of the work I've done is my own idea. I've borrowed heavily from the published work of Dennis Klatt, who wrote one of the first text-to-speech computer programs.

If you're curious, I'd highly recommend downloading this Formant Synthesis Demo. Click and drag in the area marked F1/F2 (formants one and two) and you'll get a good idea of how this works.

Did that clear up some of the mystery?


-- David Cuny
My virtual singer development blog
