Noise vs. Clean Data: What Should a Diffusion Model Learn? (Part 2)
By Alexander (Hongyi) Huang
Overview: In Part 1, we explored the empirical distinction between ε-prediction and x₀-prediction. Here, we develop an information-theoretic perspective that explains those observations.
Introduction
To understand why x₀-prediction outperforms ε-prediction, we can treat the forward diffusion process as a noisy channel and the model as a learned decoder. This perspective partially explains why ε-prediction becomes fundamentally ill-conditioned in high ambient dimensions, even though the underlying data lie on a low-dimensional manifold.
Diffusion as an Information Channel
The forward diffusion step

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I),$$

can be viewed as transmitting the signal x₀ through an information channel that progressively decreases the signal-to-noise ratio (SNR). The reverse model acts as a decoder that attempts to maximize the mutual information between the corrupted sample x_t and the quantity it is trained to predict (either ε or x₀).
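To make the channel picture concrete, here is a minimal NumPy sketch of the one-shot forward step above. The function name and the specific ᾱ_t values are illustrative choices, not from the post; the SNR of the channel is ᾱ_t / (1 − ᾱ_t), which shrinks as ᾱ_t decays toward 0.

```python
import numpy as np

def forward_diffuse(x0, alpha_bar, rng):
    """One-shot forward diffusion: x_t = sqrt(a)*x0 + sqrt(1-a)*eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return xt, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(1000)  # stand-in "clean" signal with unit variance

for alpha_bar in [0.99, 0.5, 0.01]:
    # Channel SNR: signal power / noise power = alpha_bar / (1 - alpha_bar)
    snr = alpha_bar / (1.0 - alpha_bar)
    xt, eps = forward_diffuse(x0, alpha_bar, rng)
    print(f"alpha_bar={alpha_bar:.2f}  SNR={snr:8.2f}")
```

Running the loop shows the SNR dropping from ~99 to ~0.01 as ᾱ_t decreases, which is the "progressively noisier channel" in the text.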
Entropy Scales With Dimension
A key property of differential entropy is that it scales linearly with the dimension of the random vector. For Gaussian noise ε ~ N(0, I_D),

$$h(\varepsilon) = \frac{D}{2}\log(2\pi e),$$

so the entropy of ε grows as O(D). In contrast, the clean data x₀ live on a manifold of intrinsic dimension d ≪ D, so their entropy scales as O(d), independent of the ambient dimension D.¹
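The linear scaling is easy to verify numerically from the standard closed form for the entropy of an isotropic Gaussian (the helper name below is mine):

```python
import numpy as np

def gaussian_entropy(D, sigma2=1.0):
    """Differential entropy (in nats) of N(0, sigma2 * I_D)."""
    return 0.5 * D * np.log(2.0 * np.pi * np.e * sigma2)

# Doubling the dimension doubles the entropy: h scales as O(D).
for D in [10, 100, 1000]:
    print(f"D={D:5d}  h = {gaussian_entropy(D):.2f} nats")
```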
Mutual Information in ε-Prediction
Although x₀ occupies only a d-dimensional manifold, the Gaussianly distributed noise ε spreads across all D ambient dimensions. Only the components of ε aligned with directions in which the data vary—the "tangent directions" of the manifold—contain information about x₀, and there are only d such directions. The remaining D − d directions are orthogonal to the data manifold and thus contain no signal.
Consequently, while h(ε) = O(D), the conditional entropy satisfies

$$h(\varepsilon \mid x_t) \;\ge\; \frac{D - d}{2}\log(2\pi e),$$

since the orthogonal components cannot be recovered from the corrupted sample. The mutual information is therefore

$$I(x_t; \varepsilon) \;=\; h(\varepsilon) - h(\varepsilon \mid x_t) \;\le\; \frac{d}{2}\log(2\pi e) \;=\; O(d).$$

Thus the decoder receives only O(d) bits of useful information but must predict a D-dimensional vector. The fraction of informative signal is

$$\frac{d}{D} \;\longrightarrow\; 0 \quad \text{as } D \to \infty,$$
explaining why ε-prediction becomes poorly conditioned in high-dimensional ambient spaces.
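To get a feel for how small the informative fraction d/D can be, here is a back-of-the-envelope calculation. The ambient dimension is that of a 256×256 RGB image; the intrinsic dimensions d are made-up placeholders, since true intrinsic dimensions of image datasets are not given in the post:

```python
# Illustrative only: the intrinsic dimensions d below are hypothetical.
D = 3 * 256 * 256          # ambient dimension of a 256x256 RGB image
for d in [10, 100, 1000]:
    print(f"d={d:5d}  informative fraction d/D = {d / D:.2e}")
```

Even a generous d = 1000 leaves the informative fraction below one percent of the ambient dimension.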
Mutual Information in x₀-Prediction
When predicting x₀ directly, the model focuses solely on the structured, low-dimensional signal. The mutual information satisfies

$$I(x_t; x_0) = O(d),$$

but crucially the model also needs to predict only a quantity of intrinsic dimension d. The channel preserves O(d) bits of information, and the prediction target has O(d) degrees of freedom, making the estimation problem well-conditioned even when D is very large.
Geometrically, one may think of the data manifold as a curved sheet embedded in ℝ^D. Variations along the sheet (the "tangent directions") describe meaningful changes in the data, while directions orthogonal to the sheet do not. x₀-prediction only involves these tangent directions, while ε-prediction requires modeling noise in all D directions, most of which are irrelevant. As the ambient dimension increases, this disparity becomes increasingly severe.
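A toy sketch of this geometric contrast, using a flat d-dimensional subspace as a stand-in for the curved sheet (the construction is mine, not the author's): the singular value spectrum of clean data collapses after d directions, while isotropic noise fills all D directions equally.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, n = 64, 4, 5000

# Clean data on a d-dimensional linear subspace of R^D (a flat "manifold").
A, _ = np.linalg.qr(rng.standard_normal((D, d)))   # orthonormal basis, D x d
z = rng.standard_normal((n, d))
x0 = z @ A.T                                       # n x D samples, rank-d structure

eps = rng.standard_normal((n, D))                  # isotropic noise, full rank D

s_x0 = np.linalg.svd(x0, compute_uv=False)
s_eps = np.linalg.svd(eps, compute_uv=False)

print("x0:  singular values beyond d are ~0:", s_x0[d:].max())
print("eps: all D singular values comparable:", s_eps.min() / s_eps.max())
```

The clean data's spectrum is supported on only d directions; the noise target forces the model to account for all D.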
Conclusion
Theoretically, optimizing the mean squared error yields the optimal ε-prediction model ε*(x_t, t) = E[ε | x_t] and the optimal x₀-prediction model x₀*(x_t, t) = E[x₀ | x_t]. If both models have infinite capacity to achieve their theoretical optima, they are equivalent, related by

$$x_0^*(x_t, t) \;=\; \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\,\varepsilon^*(x_t, t)}{\sqrt{\bar{\alpha}_t}}.$$
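This equivalence is just a rearrangement of the forward step, so the two parameterizations can be converted mechanically. A minimal sketch (function names are mine) with a round-trip check:

```python
import numpy as np

def x0_from_eps(xt, eps_hat, alpha_bar):
    """Implied clean-data estimate from a noise estimate."""
    return (xt - np.sqrt(1.0 - alpha_bar) * eps_hat) / np.sqrt(alpha_bar)

def eps_from_x0(xt, x0_hat, alpha_bar):
    """Implied noise estimate from a clean-data estimate."""
    return (xt - np.sqrt(alpha_bar) * x0_hat) / np.sqrt(1.0 - alpha_bar)

# Round-trip check on synthetic values.
rng = np.random.default_rng(0)
alpha_bar = 0.3
x0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

assert np.allclose(x0_from_eps(xt, eps, alpha_bar), x0)
assert np.allclose(eps_from_x0(xt, x0, alpha_bar), eps)
```

The equivalence holds exactly only for the optimal (infinite-capacity) predictors; for a finite-capacity network the two parameterizations distribute approximation error very differently, which is the point of the next paragraph.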
However, all neural networks in practice are constrained by finite capacity and cannot reach this theoretical optimum. As the ambient dimension D grows, the d-dimensional signal along the "tangent directions" in ε-prediction is overwhelmed by the (D − d)-dimensional noise along the "orthogonal directions". Consequently, a disproportionate share of the model's capacity is wasted on predicting irreducible randomness. In contrast, x₀-prediction lets the model devote all of its finite capacity to the underlying structure, achieving robust performance in high dimensions.²
Footnotes

1. To make sense of the entropy of clean data distributed on a d-dimensional manifold, we can think of the clean data as being spread over a thin band of small width surrounding the manifold, which keeps the differential entropy finite.

2. These are only my preliminary, non-rigorous thoughts on understanding diffusion models in the framework of information theory.