Originally Posted by Jeff Yankauer
Are any of the AI sounds extracted directly from copyrighted material, or does it use a different set of loops/synths/samples to create tracks that sound similar to the copyrighted material that it was trained on, without actually having directly sampled audio from the copyrighted material?
Neither - the audio isn't stored in the way you are describing.

AI is "trained" on the audio. The training process consists of the data (in this case, potentially copyrighted audio) being fed into a self-organizing neural network, and some output being created. The match between the output and the expected output is then measured to create an error value, and the weights in the neural network are adjusted so that the next time the same input is given, the output will be closer to the desired output.
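To make that loop concrete, here's a toy sketch of a single training step. The "network" is just one linear layer with made-up sizes, standing in for a real model; the point is only the measure-error-then-nudge-weights cycle, not an actual audio architecture.

```python
import numpy as np

# Toy sketch: a single linear layer stands in for the whole network.
# Sizes and names here are illustrative, not from any real model.
rng = np.random.default_rng(0)
weights = rng.normal(size=(3, 2))      # 3 inputs -> 2 outputs

def train_step(x, target, lr=0.1):
    """Feed input through, measure the error, nudge the weights."""
    global weights
    output = x @ weights               # the network's current guess
    error = output - target            # mismatch with the expected output
    # Move the weights a small step in the direction that shrinks the error.
    weights -= lr * np.outer(x, error)
    return float((error ** 2).mean())

x = np.array([1.0, 0.5, -0.2])
target = np.array([0.3, -0.1])
losses = [train_step(x, target) for _ in range(50)]
# After repeated steps, the same input lands closer to the desired output.
assert losses[-1] < losses[0]
```

Note that nowhere in this process is the training data itself stored; only the weights change.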

For the sake of creating an example, imagine that the AI was being trained to generate 4 seconds of audio. The input might be "lion roar", and the output would be audio representing the roar of a lion.

But what would that audio actually look like? In reality, it probably wouldn't be the raw audio stream, because that's a lot of computation. Instead, it would likely be a series of 10ms frames, with each frame representing the audio as a spectral envelope. A spectral envelope is a coarse representation of the energy within a set of frequency bands. There are various ways of converting spectral envelopes back into audio.
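Here's one illustrative way to compute such a representation: chop the audio into 10ms frames, take the spectrum of each frame, and pool it into a small number of coarse bands. The sample rate, band count, and equal-width pooling are all arbitrary choices for the sketch.

```python
import numpy as np

# Illustrative only: turn 1 second of audio into 10 ms spectral-envelope
# frames, each a coarse set of band energies. All constants are assumed.
SR = 16000                      # sample rate
FRAME = SR // 100               # 10 ms -> 160 samples per frame
N_BANDS = 40

def spectral_envelope(audio):
    # Slice into non-overlapping 10 ms frames.
    frames = audio[: len(audio) // FRAME * FRAME].reshape(-1, FRAME)
    spectra = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # power spectrum
    # Pool the FFT bins into N_BANDS coarse bands (simple equal-width split).
    bins = np.array_split(np.arange(spectra.shape[1]), N_BANDS)
    return np.stack([spectra[:, b].sum(axis=1) for b in bins], axis=1)

t = np.arange(SR) / SR
audio = np.sin(2 * np.pi * 440 * t)          # 1 s of a 440 Hz tone
env = spectral_envelope(audio)
print(env.shape)   # (100, 40): 100 frames of 40 band energies each
```

Going the other direction (envelope back to audio) is lossy by design, which is why the generated sound is only an approximation of the original.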

So right off the bat, the audio that's being generated is an inexact representation of the sound. The job of the AI is to predict what the energy levels in the spectral bands are going to be over the course of 4 seconds. At 10ms per frame, that's 400 frames for 4 seconds (4000 milliseconds).

It's likely to have some sort of "memory", so the network might keep track of the last 20 frames as part of its input. So the actual question to the neural network would be more like "Given these last 20 frames, and the target being "lion roar", predict the values for the 40 spectral bands in the next frame".
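The generation loop that question implies can be sketched like this. The "network" below is just untrained random weights; the frame count, band count, and label-vector size are the numbers from the example above, and the label embedding is a stand-in for however "lion roar" would actually be encoded.

```python
import numpy as np

# Hypothetical shape of the problem: given the last 20 frames (40 bands
# each) plus a label vector, predict the next frame's 40 band energies.
# The "model" is random weights, to show the data flow, not real training.
CONTEXT, N_BANDS, LABEL_DIM = 20, 40, 8
rng = np.random.default_rng(0)
W = rng.normal(size=(CONTEXT * N_BANDS + LABEL_DIM, N_BANDS)) * 0.01

def predict_next_frame(last_frames, label_vec):
    x = np.concatenate([last_frames.ravel(), label_vec])
    return x @ W                       # 40 predicted band energies

def generate(label_vec, n_frames=400):
    """Roll forward one frame at a time: 400 frames = 4 s at 10 ms each."""
    frames = np.zeros((CONTEXT, N_BANDS))   # start from silence
    out = []
    for _ in range(n_frames):
        nxt = predict_next_frame(frames[-CONTEXT:], label_vec)
        out.append(nxt)
        frames = np.vstack([frames, nxt])   # predicted frame feeds back in
    return np.stack(out)

roar = generate(rng.normal(size=LABEL_DIM))
print(roar.shape)   # (400, 40)
```

Each predicted frame is fed back in as context for the next one, which is how a fixed-size network can produce arbitrarily long output.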

The neural network is a self-organizing network, so it's anyone's guess what the innards of the network are going to be. But you're not going to find any "sampled audio" in the network.

On the other hand, that doesn't mean that the network won't replicate the copyrighted audio that it was trained on.


-- David Cuny
My virtual singer development blog

Vocal control, you say. Never heard of it. Is that some kind of ProTools thing?