benanne OP t1_j3qxvaa wrote on January 10, 2023 at 1:50 PM

Reply to comment by jimmymvp in [R] Diffusion language models by benanne

As I understand it, the main motivation for latent diffusion is that in perceptual domains, ~99% of information content in the input signals is less perceptually relevant, so it does not make sense to spend a lot of model capacity on it (lossy image compression methods like JPEG are based on the same observation). Training an autoencoder first to get rid of the majority of this irrelevant information can greatly simplify the generative modelling problem at almost no cost to fidelity.

This idea was originally used with great success to adapt autoregressive models to perceptual domains. Autoregression in pixel space (e.g. PixelRNN, PixelCNN) or amplitude space for audio (e.g. WaveNet, SampleRNN) does work, but it doesn't scale very well. Things work much better if you first use VQ-VAE (or even better, VQGAN) to compress the input signals, and then apply autoregression in its latent space.

The same is true for diffusion models, though in this case there is another mechanism we can use to reduce the influence of perceptually irrelevant information: changing the relative weighting of the noise levels during training, to downweight high-frequency components. Diffusion models actually do this out of the box when compared to likelihood-based models, which is why I believe they have completely taken over generative modelling of perceptual signals (as I discuss in the blog post).

But despite the availability of this reweighting mechanism, the latent approach can still provide further efficiency benefits. Stable Diffusion is testament to this: I believe the only reason they are able to offer up a model that generates high-res content on a single consumer GPU, is because of the adversarial autoencoder they use to get rid of all the imperceptible fine-grained details first.

I think this synergy between adversarial models (for low-level detail) and likelihood- or diffusion-based models (for structure and content) is still underutilised. There's a little bit more discussion about this in section 6 of my blog post on typicality: https://benanne.github.io/2020/09/01/typicality.html#right-level (though this largely predates the rise of diffusion models)