
Noise vs. Clean Data: What Should a Diffusion Model Learn? (Part 2)

By Alexander (Hongyi) Huang

Overview: In Part 1, we explored the empirical distinction between $\epsilon$-prediction and $x_0$-prediction. Here, we discuss an information-theoretic perspective that explains these observations.

Introduction

To understand why $x_0$-prediction outperforms $\epsilon$-prediction, we can treat the forward diffusion process as a noisy channel and the model as a learned decoder. This perspective partially explains why $\epsilon$-prediction becomes fundamentally ill-conditioned in high ambient dimensions, even though the underlying data lie on a low-dimensional manifold.

Diffusion as an Information Channel

The forward diffusion step

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I_D)$$

can be viewed as transmitting the signal $x_0$ through an information channel that progressively decreases the signal-to-noise ratio (SNR). The reverse model acts as a decoder that attempts to recover its training target (either $\epsilon$ or $x_0$) from the corrupted sample $x_t$; how much it can recover is limited by the mutual information between the two.
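
For concreteness, here is a minimal PyTorch sketch of this forward step (`alpha_bar_t` plays the role of $\bar{\alpha}_t$):

```python
import torch

def forward_diffusion(x0: torch.Tensor, alpha_bar_t: float):
    """Corrupt a clean sample x0 at noise level alpha_bar_t.

    Returns the noisy sample x_t together with the noise epsilon that
    produced it, i.e. the two candidate regression targets in this post.
    """
    eps = torch.randn_like(x0)  # epsilon ~ N(0, I_D)
    x_t = alpha_bar_t ** 0.5 * x0 + (1 - alpha_bar_t) ** 0.5 * eps
    return x_t, eps
```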

Entropy Scales With Dimension

A key property of differential entropy is that it scales linearly with the dimension of the random vector. For Gaussian noise,

$$h(\epsilon) = \frac{D}{2}\log(2\pi e)$$

so the entropy of $\epsilon$ grows as $O(D)$. In contrast, the clean data $x_0$ live on a manifold of intrinsic dimension $d \ll D$, so their entropy scales as $h(x_0) = O(d)$, independent of the ambient dimension $D$.¹
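
A quick back-of-the-envelope computation makes the gap concrete. The numbers below are purely illustrative (a 256×256 RGB image for $D$, an arbitrary $d = 50$ for the intrinsic dimension), and a $d$-dimensional Gaussian stands in for the $O(d)$-entropy data distribution:

```python
import math

def gaussian_entropy_nats(dim: int) -> float:
    """Differential entropy of N(0, I_dim) in nats: (dim / 2) * log(2*pi*e)."""
    return dim / 2 * math.log(2 * math.pi * math.e)

D = 3 * 256 * 256  # ambient dimension of a 256x256 RGB image
d = 50             # hypothetical intrinsic dimension of the data manifold

print(f"h(eps) ~ {gaussian_entropy_nats(D):,.0f} nats  (scales with D)")
print(f"h(x0)  ~ {gaussian_entropy_nats(d):,.0f} nats  (independent of D)")
```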

Mutual Information in ε-Prediction

Although $x_0$ occupies only a $d$-dimensional manifold, the Gaussian noise $\epsilon \in \mathbb{R}^D$ spreads across all $D$ ambient dimensions. Only the components of $\epsilon$ aligned with directions in which the data vary (the "tangent directions" of the manifold) contain information about $x_0$, and there are only $d$ such directions. The remaining $D-d$ directions are orthogonal to the data manifold and thus contain no signal.

Consequently, while $h(\epsilon) = O(D)$, the conditional entropy satisfies

$$h(\epsilon \mid x_t) \approx O(D-d)$$

since the orthogonal components cannot be recovered from the corrupted sample. The mutual information is therefore

$$I(\epsilon; x_t) = h(\epsilon) - h(\epsilon \mid x_t) \approx O(D) - O(D-d) = O(d)$$

Thus the decoder receives only $O(d)$ bits of useful information but must predict a $D$-dimensional vector. The fraction of informative signal is

$$\frac{O(d)}{O(D)} \to 0 \qquad \text{as } D \to \infty$$

explaining why $\epsilon$-prediction becomes poorly conditioned in high-dimensional ambient spaces.
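
The same accounting can be tallied numerically. Keeping the Gaussian constant $\tfrac{1}{2}\log(2\pi e)$ that the $O(\cdot)$ notation hides, and again using an illustrative $d = 50$:

```python
import math

LOG_2PIE = math.log(2 * math.pi * math.e)

def eps_information_budget(D: int, d: int):
    """Entropy accounting for epsilon-prediction, following the argument above.

    h(eps)       ~ (D / 2)       * log(2*pi*e)  -- full ambient noise
    h(eps | x_t) ~ ((D - d) / 2) * log(2*pi*e)  -- orthogonal components, assumed unrecoverable
    I(eps; x_t)  ~ (d / 2)       * log(2*pi*e)  -- only tangent components carry signal
    """
    h_eps = D / 2 * LOG_2PIE
    mutual_info = h_eps - (D - d) / 2 * LOG_2PIE
    return mutual_info, mutual_info / h_eps

for D in (1_000, 100_000, 10_000_000):
    mi, frac = eps_information_budget(D, d=50)
    print(f"D = {D:>10,}: I(eps; x_t) ~ {mi:5.1f} nats, informative fraction {frac:.1e}")
```

The mutual information stays fixed at roughly $\tfrac{d}{2}\log(2\pi e) \approx 71$ nats while the prediction target grows with $D$, so the informative fraction decays like $d/D$.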

Mutual Information in x₀-Prediction

When predicting $x_0$ directly, the model focuses solely on the structured, low-dimensional signal. The mutual information satisfies

$$I(x_0; x_t) = h(x_0) - h(x_0 \mid x_t) \approx O(d)$$

but crucially, the model now needs to predict only a quantity of intrinsic dimension $d$. The channel preserves $O(d)$ bits of information, and the prediction target has $O(d)$ degrees of freedom, making the estimation problem well-conditioned even when $D$ is very large.

Geometrically, one may think of the data manifold as a curved sheet embedded in $\mathbb{R}^D$. Variations along the sheet (the "tangent directions") describe meaningful changes in the data, while directions orthogonal to the sheet do not. $x_0$-prediction involves only these tangent directions, whereas $\epsilon$-prediction requires modeling noise in all $D$ directions, most of which are irrelevant. As the ambient dimension increases, this disparity becomes increasingly severe.

Conclusion

Theoretically, minimizing the mean squared error yields the optimal $\epsilon$-prediction model $\epsilon_\theta(x_t, t) = \mathbb{E}[\epsilon \mid x_t]$ and the optimal $x_0$-prediction model $x_\theta(x_t, t) = \mathbb{E}[x_0 \mid x_t]$. If both models had infinite capacity to achieve these theoretical optima, they would be equivalent, related by

$$\mathbb{E}[\epsilon \mid x_t] = \frac{x_t - \sqrt{\bar{\alpha}_t}\,\mathbb{E}[x_0 \mid x_t]}{\sqrt{1-\bar{\alpha}_t}}.$$
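
In code, this equivalence is a one-line conversion (a sketch; `x0_hat` stands for the output of an $x_0$-prediction model):

```python
import torch

def eps_hat_from_x0_hat(x_t: torch.Tensor, x0_hat: torch.Tensor,
                        alpha_bar_t: float) -> torch.Tensor:
    """Recover the implied noise estimate E[eps | x_t] from an x0-prediction
    model's output x0_hat = E[x0 | x_t], using the identity above."""
    return (x_t - alpha_bar_t ** 0.5 * x0_hat) / (1 - alpha_bar_t) ** 0.5
```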

However, neural networks in practice have only finite capacity and cannot reach these theoretical optima. As the ambient dimension $D$ increases, the $d$-dimensional signal along the tangent directions in $\epsilon$-prediction is overwhelmed by the $(D-d)$-dimensional noise along the orthogonal directions. Consequently, a disproportionate share of the model's capacity is wasted on predicting irreducible randomness. In contrast, $x_0$-prediction lets the model focus all of its finite capacity on the underlying structure, and so achieves robust performance in high dimensions.²



Footnotes

  1. To make sense of the entropy of clean data distributed on a $d$-dimensional manifold, we can think of the clean data as being spread over a thin band of width $\epsilon'$ surrounding the manifold, which keeps the differential entropy finite.

  2. These are only my preliminary, non-rigorous thoughts on understanding diffusion models in the framework of information theory.