Hi, Pat.
Thanks for your response. Before getting into details, I should clarify: this project is just re-implementing what others have already done, and done much better than I could.
From Text to Speech: The MITalk System (Allen et al.) was a great resource for me. That technology has essentially been abandoned in favor of other, better approaches, and I'll explain why I wasn't able to go down that route.
Also, my immediate goal is to get Don something usable. If DynaVox finally gets him a better synthesis program, then hurrah! No need for this program.
I'm aware of the Vocaloid software - I've got the Avanna voicebank as well, since it's probably got the best English accent of all the current Vocaloids. I've also spent a lot of time looking at UTAU, a free singing synthesizer written along the lines of Vocaloid.
In fact, my initial approach was exactly what you suggested: record various phonemes (using Vowel/Consonant/Vowel patterns), cross-fade them together, and use pitch shifting.
I've actually written a number of tools to do this. The stumbling block was the pitch shifting: it needs to shift some frequencies (the glottal pulse) while keeping others fixed (the formants), or you get the "Mickey Mouse" effect.
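For what it's worth, the cross-fading itself is the easy part. Here's a minimal equal-power crossfade sketch in Python/NumPy (the names are illustrative, not taken from my actual tools):

```python
import numpy as np

def crossfade(a, b, n):
    """Equal-power crossfade: blend the last n samples of `a` into
    the first n samples of `b`. The sine/cosine fade curves keep the
    summed power roughly constant across the joint, which avoids the
    dip you get with plain linear fades."""
    t = np.linspace(0.0, np.pi / 2, n)
    mid = a[-n:] * np.cos(t) + b[:n] * np.sin(t)
    return np.concatenate([a[:-n], mid, b[n:]])
```

For VCV recordings, `a` and `b` would be the two phoneme segments, with `n` covering a pitch period or two. As I said, this part works fine; it's the pitch shifting that follows which is hard.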
BiaB uses the astonishingly good élastique algorithm. I couldn't find any free libraries that gave decent results - even the Rubber Band library, which has formant preservation, didn't do an acceptable job.
I tried FFT-based pitch shifting, but didn't have much luck.
I got better results with PSOLA (Pitch Synchronous Overlap and Add), but there were significant artifacts. Here's an example.
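To make the idea concrete, here's a stripped-down PSOLA sketch in Python/NumPy. It's not my implementation, just the core of the technique, and it assumes the pitch marks (glottal-pulse positions) are already known - which is the genuinely hard part in practice:

```python
import numpy as np

def psola(x, marks, factor):
    """Pitch-shift x by `factor` (>1 raises pitch): extract a
    two-period Hann-windowed grain at each pitch mark, then
    overlap-add the grains at a re-scaled spacing."""
    period = int(np.mean(np.diff(marks)))       # average glottal period
    step = max(1, int(round(period / factor)))  # new spacing between grains
    win = np.hanning(2 * period)
    y = np.zeros(len(x))
    t = period                                  # output write position
    while t + period < len(y):
        # grab the grain whose pitch mark lies nearest this output position
        m = marks[np.argmin([abs(m - t) for m in marks])]
        if period <= m <= len(x) - period:
            y[t - period:t + period] += x[m - period:m + period] * win
        t += step
    return y
```

Because each grain keeps the original waveform shape (and hence the spectral envelope) while only the grain spacing changes, the formants should stay put as the pitch moves - exactly the property plain resampling lacks. The artifacts I mentioned come from grain boundaries and from pitch-mark errors, which this sketch does nothing to address.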
The examples of formant-based synthesis I'd heard convinced me that, while it lacks realism, it's capable of producing intelligible and musical output. I think you'll agree that, with some tuning, this synthesizer may not create realistic voices, but it can create understandable ones.
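As an illustration of what "pure" formant synthesis boils down to (a generic Klatt-style cascade, not my actual code): a pulse train at the glottal frequency is fed through a chain of two-pole resonators, one per formant. All the names and the formant values below are just for the example:

```python
import numpy as np

def resonator(x, freq, bw, sr):
    """Two-pole resonator (Klatt-style): unity gain at DC and a
    resonant peak of bandwidth `bw` Hz centered at `freq` Hz."""
    c = -np.exp(-2 * np.pi * bw / sr)
    b = 2 * np.exp(-np.pi * bw / sr) * np.cos(2 * np.pi * freq / sr)
    a = 1.0 - b - c
    y1 = y2 = 0.0
    out = np.empty(len(x))
    for i, xn in enumerate(x):
        yn = a * xn + b * y1 + c * y2
        out[i] = yn
        y2, y1 = y1, yn
    return out

def vowel(f0, formants, bandwidths, dur, sr):
    """Crude vowel: an impulse train at f0 (a stand-in for the
    glottal source) through cascaded formant resonators."""
    src = np.zeros(int(dur * sr))
    src[::int(sr / f0)] = 1.0
    y = src
    for f, bw in zip(formants, bandwidths):
        y = resonator(y, f, bw, sr)
    return y / np.max(np.abs(y))
```

Something like `vowel(120, [730, 1090, 2440], [90, 110, 120], 0.5, 8000)` should give a buzzy but recognizably /a/-like vowel - intelligible, as I said, but nobody would mistake it for a recording.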
And to be honest, I've been focused on just getting the code to work. I've spent very little time on fine-tuning the phonemes. This is alpha software, and there's lots of room for improvement.
That said, in Text-to-Speech Synthesis, Paul Taylor argues that formant-based synthesis is intrinsically unnatural because it can't capture the fine detail of real speech, so I don't hold high hopes for it.
I've considered mixing pre-recorded audio with synthesized sounds, as eSpeak does, but that raises plenty of issues. And there's still the problem of handling sounds like /B/, /D/ and /G/, which are voiced consonants. So for the moment, I'm sticking with "pure" synthesis.
I hope that somewhat explains the approach I've taken. Despite the many flaws, I figured it was time to move ahead with the project. For the moment, I'll be focusing on creating a UI.