Hi, Guenter.

Thanks for the link!

My goal is to create a free Vocaloid-like program that can sing reasonably well in English.

I'm focusing on the rendering engine, rather than the GUI. For one thing, I think Sinsy shows that it's possible to get reasonably good output if there's some built-in intelligence about phrasing and such.

There's already a free editor for Vocaloid/Utau written in Java called Cadencii, so there's no need to reinvent that wheel.

I've done some proof-of-concept work. The pitch detection and pitch shifting code seem to work reasonably well. Here's an example. You can ignore the "buzzy" sound - that's part of the original recording. The important thing is that it (more or less) preserves the character of the sound without introducing the "chipmunk effect".

I'm creating a sort of "melodic backbone" of frequency information that includes preparation, portamento, overshoot and vibrato, and while it's pretty rough, it shows promise. Right now, it only renders the melody as a sine wave. Here's an example.
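To give a rough idea of what I mean by a frequency curve rendered as a sine wave, here's a minimal sketch. It is not my actual code: the class name, method names and parameter values are made up for illustration, and it only covers portamento and vibrato (preparation and overshoot are left out). It assumes a 44.1 kHz sample rate, a raised-cosine glide in log-frequency, and vibrato expressed in cents.

```java
public class PitchCurveSketch {
    static final int SAMPLE_RATE = 44100; // assumed sample rate

    /** Builds a per-sample frequency curve: glide from startHz to endHz, then hold with vibrato. */
    static double[] buildCurve(double startHz, double endHz, double glideSec,
                               double totalSec, double vibratoHz, double vibratoCents) {
        int n = (int) Math.round(totalSec * SAMPLE_RATE);
        int glide = (int) Math.round(glideSec * SAMPLE_RATE);
        double[] freq = new double[n];
        for (int i = 0; i < n; i++) {
            // Portamento: raised-cosine interpolation in log-frequency,
            // so the glide sounds even in musical (semitone) terms.
            double t = glide > 0 ? Math.min(1.0, (double) i / glide) : 1.0;
            double s = 0.5 - 0.5 * Math.cos(Math.PI * t);
            double logF = Math.log(startHz) + s * (Math.log(endHz) - Math.log(startHz));

            // Vibrato: sinusoidal deviation in cents, starting once the glide is done.
            double cents = 0.0;
            if (i >= glide) {
                cents = vibratoCents
                        * Math.sin(2.0 * Math.PI * vibratoHz * (i - glide) / SAMPLE_RATE);
            }
            freq[i] = Math.exp(logF) * Math.pow(2.0, cents / 1200.0);
        }
        return freq;
    }

    /** Renders the curve as a sine wave by accumulating phase, which avoids clicks as the pitch moves. */
    static double[] renderSine(double[] freq) {
        double[] out = new double[freq.length];
        double phase = 0.0;
        for (int i = 0; i < freq.length; i++) {
            out[i] = Math.sin(phase);
            phase += 2.0 * Math.PI * freq[i] / SAMPLE_RATE;
        }
        return out;
    }
}
```

Something like `renderSine(buildCurve(220, 440, 0.1, 0.5, 5.5, 30))` would give half a second that slides up an octave and then wobbles by about 30 cents. The important bit is accumulating phase sample by sample rather than computing `sin(2*pi*f*t)` directly, since the latter glitches whenever the frequency changes.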

I'm currently in the process of gluing these parts together into a simple rendering engine that can take phonemes sung on one pitch and make them follow that melodic curve. Once that's done, I can have it start singing "La, la, la".

I've also experimented with cross-fading phonemes together, both with simple time-based crossfades and more complex harmonic-based crossfades (like in Vocaloid). I can't hear much difference between the two. When I add that, I should be able to render simple melodies like "do-be-do-be-do".
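For reference, the simple time-based version is roughly the following. This is a sketch rather than my real code: the class and method names are hypothetical, and it assumes a plain linear fade over a fixed number of overlapping samples. The harmonic-based approach is a different thing entirely and isn't shown here.

```java
public class CrossfadeSketch {
    /**
     * Joins two sample buffers, linearly cross-fading the last `overlap`
     * samples of `a` into the first `overlap` samples of `b`.
     */
    static double[] crossfade(double[] a, double[] b, int overlap) {
        if (overlap < 0 || overlap > a.length || overlap > b.length) {
            throw new IllegalArgumentException("overlap must fit inside both buffers");
        }
        double[] out = new double[a.length + b.length - overlap];
        int start = a.length - overlap;

        // Untouched head of the first sound.
        System.arraycopy(a, 0, out, 0, start);

        // Overlap region: weight slides from mostly-a to mostly-b.
        for (int i = 0; i < overlap; i++) {
            double w = (i + 0.5) / overlap;
            out[start + i] = (1.0 - w) * a[start + i] + w * b[i];
        }

        // Untouched tail of the second sound.
        System.arraycopy(b, overlap, out, a.length, b.length - overlap);
        return out;
    }
}
```

The two weights always sum to 1, so if both sounds are at the same pitch and roughly in phase the level stays steady through the join. If they aren't in phase, a linear fade can dip in the middle, which is one reason to look at fancier methods.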

If I can get it that far, I'll need to start building a sung database of phonemes. I'm not looking forward to that, because English requires a ton of phonemes. I expect to record blocks of blends in VCV format, like "ahtah ehteh ihtih ohtoh uhtuh aytay eetee ietie ohtoh ...", much like UTAU does. Each of the blends can be used for VCV blends ("...ahtah..."), VC endings ("...aht") and CV beginnings ("tah...").
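To show how one recording can serve all three purposes, here's a hypothetical sketch. Nothing like this exists yet: the class name and the two marker fields are my own invention for illustration, and it assumes each VCV recording has been hand-labeled with where the consonant starts and ends.

```java
import java.util.Arrays;

public class VcvSampleSketch {
    final double[] samples;   // the whole recording, e.g. "ahtah"
    final int consonantStart; // index where the consonant begins (assumed hand-labeled)
    final int consonantEnd;   // index just past the end of the consonant

    VcvSampleSketch(double[] samples, int consonantStart, int consonantEnd) {
        if (consonantStart < 0 || consonantStart > consonantEnd
                || consonantEnd > samples.length) {
            throw new IllegalArgumentException("bad consonant markers");
        }
        this.samples = samples.clone();
        this.consonantStart = consonantStart;
        this.consonantEnd = consonantEnd;
    }

    /** Full blend, e.g. "...ahtah...". */
    double[] vcv() {
        return samples.clone();
    }

    /** Vowel plus consonant, e.g. "...aht": everything up to the end of the consonant. */
    double[] vcEnding() {
        return Arrays.copyOfRange(samples, 0, consonantEnd);
    }

    /** Consonant plus vowel, e.g. "tah...": everything from the start of the consonant. */
    double[] cvBeginning() {
        return Arrays.copyOfRange(samples, consonantStart, samples.length);
    }
}
```

The consonant ends up in both the VC and CV slices, which is the point: one recording of "ahtah" yields "aht" and "tah" without recording them separately.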

It's being coded in Java mainly because I'm comfortable coding in it. I suspect it would be fairly straightforward to convert to C or C# when it's done.

I'd considered trying to do this with the UTAU engine, but there's no documentation on it that I can find. Also, it seems to require Japanese localization, and that's a bit of a pain.

If you've got any additional questions, let me know, or shoot me an email - I PM'd you my email address.


-- David Cuny
My virtual singer development blog

Vocal control, you say. Never heard of it. Is that some kind of ProTools thing?