Thanks for your response. Before getting into details, I should clarify: this project is just re-implementing what others have already done, and much better than I could. From Text to Speech: The MITalk System
(Allen) was a great resource for me. The technology has essentially been abandoned in favor of other, better approaches; I'll explain why I was unable to go down that route.
Also, my immediate goal is to get Don something usable. If DynaVox
finally gets him a better synthesis program, then, hurrah! No need for this program.
I'm aware of the Vocaloid
software - I've got the Avanna
voice as well, because it's probably got the best English accent of all the current Vocaloids.
I've also spent a lot of time looking at UTAU
, a free synthetic singer written along the lines of Vocaloid.
In fact, my initial approach was exactly what you suggested: record various phonemes (using Vowel/Consonant/Vowel patterns), cross-fade them together, and use pitch shifting.
I've actually written a number of tools to do this. The stumbling block was the pitch shifting, which needs to shift some
frequencies (the glottal pulse) while keeping others fixed (the formants), or you get the "Mickey Mouse" effect.
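To illustrate why naive pitch shifting fails, here's a tiny numpy sketch (all the numbers are made up for the demo, not taken from my tools). It builds a harmonic signal with a single "formant" bump at 700 Hz, then pitch-shifts it up an octave by dropping every other sample - and the formant doubles right along with the pitch:

```python
import numpy as np

sr = 16000
t = np.arange(sr) / sr               # one second of samples
f0 = 100                             # pitch: glottal pulses at 100 Hz
formant = 700                        # a single made-up "formant" peak

# Harmonic-rich source whose harmonic amplitudes follow a formant-shaped envelope.
sig = np.zeros_like(t)
for k in range(1, 40):
    f = k * f0
    amp = np.exp(-((f - formant) / 200.0) ** 2)
    sig += amp * np.sin(2 * np.pi * f * t)

def spectral_peak(x, sr):
    """Frequency (Hz) of the strongest bin in the magnitude spectrum."""
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    return np.argmax(spec) * sr / len(x)

# Naive pitch shift up one octave: drop every other sample, play at the same rate.
shifted = sig[::2]

print(spectral_peak(sig, sr))        # ~700: the formant where we put it
print(spectral_peak(shifted, sr))    # ~1400: the formant doubled too ("Mickey Mouse")
```

A proper shifter has to move the 100 Hz harmonic spacing while leaving that 700 Hz envelope peak alone, which is exactly what makes the problem hard.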
BiaB uses the astonishingly good élastique
algorithm. I couldn't find any free libraries that gave decent results - even the RubberBand library
, which has formant preservation, didn't do an acceptable job.
I tried FFT-based pitch shifting, but didn't have much luck.
I got better results with PSOLA (Pitch Synchronous Overlap and Add), but there were significant artifacts; here's an example.
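For anyone unfamiliar with PSOLA, the core of the time-domain version fits in a few lines. This is just a toy sketch under strong assumptions (a known, constant f0, so pitch marks fall at fixed intervals) - real implementations need pitch-mark detection on actual speech, and that's where the artifacts creep in:

```python
import numpy as np

def psola_shift(x, sr, f0, ratio):
    """Toy TD-PSOLA pitch shift, assuming a known constant pitch f0.
    Two-period Hann-windowed grains are cut at each input pitch mark and
    overlap-added at a spacing of period/ratio (ratio > 1 raises pitch).
    Each grain is copied unmodified, so the spectral envelope (the
    formants) is largely preserved - unlike plain resampling."""
    period = int(round(sr / f0))
    win = np.hanning(2 * period)
    out = np.zeros_like(x)
    new_spacing = period / ratio          # output pitch-mark spacing
    k = 0
    while True:
        t_out = int(round(k * new_spacing))
        # Reuse the input grain whose pitch mark is nearest in time, so
        # grains repeat (ratio > 1) or get skipped (ratio < 1) and the
        # overall duration stays unchanged.
        t_in = int(round(t_out / period)) * period
        if t_out + 2 * period > len(out) or t_in + 2 * period > len(x):
            break
        out[t_out:t_out + 2 * period] += x[t_in:t_in + 2 * period] * win
        k += 1
    return out

# Shift a one-second 100 Hz tone up a fifth (ratio 1.5):
sr = 16000
x = np.sin(2 * np.pi * 100 * np.arange(sr) / sr)
y = psola_shift(x, sr, 100, 1.5)
```

The grain repetition/skipping is also the source of the characteristic PSOLA artifacts: whenever the pitch marks are even slightly misplaced, the overlapped grains interfere and you hear roughness.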
The examples I'd heard of formant-based synthesis convinced me that, while it lacked realism, it was
capable of creating intelligible
and musical synthesis. I think you'll agree that, with some tuning, this synthesizer may not create realistic
voices, but it can create understandable ones.
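For the curious, the source-filter idea at the heart of a formant synthesizer is tiny. This is an illustrative sketch, not my actual code - the formant frequencies and bandwidths are ballpark textbook values for the vowel /a/, not tuned ones:

```python
import numpy as np

sr = 16000
f0 = 110                              # glottal pulse rate
n = sr // 2                           # half a second of samples

# Source: a bare impulse train standing in for the glottal pulse.
src = np.zeros(n)
src[::sr // f0] = 1.0

def resonator(x, freq, bw, sr):
    """Second-order IIR resonator: one formant at `freq` Hz, bandwidth `bw` Hz."""
    r = np.exp(-np.pi * bw / sr)
    theta = 2 * np.pi * freq / sr
    a1, a2 = 2 * r * np.cos(theta), -r * r
    y = np.zeros_like(x)
    for i in range(len(x)):
        # Negative indices at i < 2 wrap to the still-zero end of y.
        y[i] = x[i] + a1 * y[i - 1] + a2 * y[i - 2]
    return y

# Filter: cascade three formant resonators, ballpark values for /a/.
out = src
for freq, bw in [(730, 90), (1090, 110), (2440, 170)]:
    out = resonator(out, freq, bw, sr)
out /= np.abs(out).max()              # normalize for playback
```

Everything interesting - a better glottal source, anti-formants for nasals, noise sources for fricatives, and per-phoneme parameter tracks - is layered on top of this skeleton, which is where all the tuning work lives.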
And to be honest, I've been focused on just getting the code to work. I've spent very little time on fine-tuning the phonemes. This is alpha software, and there's lots of room for improvement.
That said, in Text-to-Speech Synthesis
, Paul Taylor argues that formant-based synthesis is intrinsically unnatural because it can't capture the details of real speech, so I don't hold high hopes for it.
I've considered mixing pre-recorded audio with synthesized sounds, as eSpeak
does, but that raises plenty of issues. And there's still the issue of handling sounds like /B/, /D/ and /G/, which are both voiced and plosive
consonants. So for the moment, I'm sticking with "pure" synthesis.
I hope that somewhat explains the approach I've taken. Despite the many flaws, I figured it was time to move ahead with the project. For the moment, I'll be focusing on creating a UI.