In theory, if something is learnable, AI will eventually find a method of learning it.
For the purposes of AI training, audio needs to be converted into a representation the model can work with. That's typically a short-time FFT, whose output is then grouped into frequency bands to better approximate human hearing.
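As a rough sketch of that front end (using librosa, with illustrative parameter values and a placeholder file name):

```python
import numpy as np
import librosa

# Load audio; "track.wav" is just a placeholder path.
y, sr = librosa.load("track.wav", sr=44100)

# Short-time FFT: each column is the spectrum of one short window of audio.
stft = librosa.stft(y, n_fft=2048, hop_length=512)
power = np.abs(stft) ** 2

# Group the linear FFT bins into mel bands, which are narrow at low
# frequencies and wide at high ones, roughly matching human frequency
# resolution, then move to a log (dB) scale, closer to perceived loudness.
mel = librosa.feature.melspectrogram(S=power, sr=sr, n_mels=128)
log_mel = librosa.power_to_db(mel)
print(log_mel.shape)  # (n_mels, n_frames)
```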
Audio compression also takes psychoacoustics into account, deciding which frequencies in the audio you aren't likely to miss because they're being masked by other, louder frequencies, and discarding those frequencies so there's less information to encode. Because transients require more information to encode, they are also often simplified. High frequencies may be replaced by bands of noise.
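A toy illustration of the masking idea - a loud frequency bin raises the audibility threshold of its neighbours, so quiet bins near it can be thrown away. Real codecs like MP3 and AAC use far more elaborate psychoacoustic models; the spreading width and the 20 dB offset here are made up:

```python
import numpy as np

def crude_masking_mask(mag_frame, spread_bins=8, offset_db=20.0):
    """Return True for bins judged audible, False for bins a codec could drop."""
    mag_db = 20 * np.log10(mag_frame + 1e-12)
    # Each bin's masking threshold: the loudest neighbour within
    # `spread_bins`, minus a fixed offset (both values are arbitrary here).
    threshold = np.full_like(mag_db, -np.inf)
    for shift in range(-spread_bins, spread_bins + 1):
        threshold = np.maximum(threshold, np.roll(mag_db, shift) - offset_db)
    return mag_db >= threshold

rng = np.random.default_rng(0)
frame = rng.random(1025) * 0.01   # quiet background across the spectrum
frame[100] = 1.0                  # one loud component that masks its neighbours
audible = crude_masking_mask(frame)
print(f"{(~audible).sum()} of {frame.size} bins judged maskable")
```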
So the encoding process certainly loses information.
But I also suspect there's a lot of bass information that's hard to pull out of the audio because it's simply not there in the first place - especially in cases where the bass and kick drum are locked together.
If that's the case, recovering that "missing" audio and filling in the holes is a matter of extrapolation for the neural network.
We do that all the time, and we often get it wrong.
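For a neural network, that extrapolation is usually framed as a reconstruction problem: hide part of the signal and train the model to fill it back in from the surrounding context. A minimal sketch of that framing - the tiny conv net, the random stand-in data, and the mask shape are all hypothetical:

```python
import torch
import torch.nn as nn

# Tiny stand-in network; real systems are far larger and train on real audio.
model = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

spec = torch.rand(8, 1, 128, 64)   # stand-in for a batch of log-mel spectrograms
mask = torch.ones_like(spec)
mask[:, :, :16, :] = 0.0           # pretend the lowest bands are missing/masked

for step in range(100):
    pred = model(spec * mask)                        # network only sees the context
    loss = ((pred - spec) ** 2 * (1 - mask)).mean()  # score only the hidden region
    opt.zero_grad()
    loss.backward()
    opt.step()
```

A loss like this rewards the statistically most plausible fill given the surrounding context, which is exactly where the tension below comes from.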
I suspect there's a fine line between a neural network successfully guessing what a partially masked bass line might be, and it hallucinating a statistically probable but incorrect bass line.
That may be why the quality has only progressed to where it currently is - it's better to be incomplete but accurate than more complete but wrong.