
Noise vs. Clean Data: What Should a Diffusion Model Learn? (Part 2)

By Alexander (Hongyi) Huang

Overview: In Part 1, we explored the empirical distinction between $\epsilon$-prediction and $x_0$-prediction. Here, we discuss an information-theoretic perspective that explains these observations.

Introduction

To understand why $x_0$-prediction outperforms $\epsilon$-prediction, we can treat the forward diffusion process as a noisy channel and the model as a learned decoder. This perspective partially explains why $\epsilon$-prediction becomes fundamentally ill-conditioned in high ambient dimensions, even though the underlying data lie on a low-dimensional manifold.

Diffusion as an Information Channel

The forward diffusion step

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I_D)$$

can be viewed as transmitting the signal $x_0$ through an information channel that progressively decreases the signal-to-noise ratio (SNR). The reverse model acts as a decoder that attempts to recover its training target (either $\epsilon$ or $x_0$) from the corrupted sample $x_t$; how much it can recover is limited by the mutual information between the two.
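
For concreteness, here is a minimal PyTorch sketch of this forward step (`alpha_bar_t` plays the role of $\bar{\alpha}_t$):

```python
import torch

def forward_diffusion(x0: torch.Tensor, alpha_bar_t: float):
    """Corrupt a clean sample x0 at noise level alpha_bar_t.

    Returns the noisy sample x_t together with the noise epsilon that
    produced it, i.e. the two candidate regression targets in this post.
    """
    eps = torch.randn_like(x0)  # epsilon ~ N(0, I_D)
    x_t = alpha_bar_t ** 0.5 * x0 + (1 - alpha_bar_t) ** 0.5 * eps
    return x_t, eps
```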

Entropy Scales With Dimension

A key property of differential entropy is that it scales linearly with the dimension of the random vector. For Gaussian noise,

$$h(\epsilon) = \frac{D}{2}\log(2\pi e)$$

so the entropy of $\epsilon$ grows as $O(D)$. In contrast, the clean data $x_0$ live on a manifold of intrinsic dimension $d \ll D$, so their entropy scales as $h(x_0) = O(d)$, independent of the ambient dimension $D$.¹
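
A quick back-of-the-envelope computation makes the gap concrete. The numbers below are purely illustrative (a 256×256 RGB image for $D$, an arbitrary $d = 50$ for the intrinsic dimension), and a $d$-dimensional Gaussian stands in for the $O(d)$-entropy data distribution:

```python
import math

def gaussian_entropy_nats(dim: int) -> float:
    """Differential entropy of N(0, I_dim) in nats: (dim / 2) * log(2*pi*e)."""
    return dim / 2 * math.log(2 * math.pi * math.e)

D = 3 * 256 * 256  # ambient dimension of a 256x256 RGB image
d = 50             # hypothetical intrinsic dimension of the data manifold

print(f"h(eps) ~ {gaussian_entropy_nats(D):,.0f} nats  (scales with D)")
print(f"h(x0)  ~ {gaussian_entropy_nats(d):,.0f} nats  (independent of D)")
```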

Mutual Information in ε-Prediction

Although $x_0$ occupies only a $d$-dimensional manifold, the Gaussian noise $\epsilon \in \mathbb{R}^D$ spreads across all $D$ ambient dimensions. Only the components of $\epsilon$ aligned with directions in which the data vary (the "tangent directions" of the manifold) contain information about $x_0$, and there are only $d$ such directions. The remaining $D-d$ directions are orthogonal to the data manifold and thus contain no signal.

Consequently, while $h(\epsilon) = O(D)$, the conditional entropy satisfies

$$h(\epsilon \mid x_t) \approx O(D-d)$$

since the orthogonal components cannot be recovered from the corrupted sample. The mutual information is therefore

$$I(\epsilon; x_t) = h(\epsilon) - h(\epsilon \mid x_t) \approx O(D) - O(D-d) = O(d)$$

Thus the decoder receives only $O(d)$ bits of useful information but must predict a $D$-dimensional vector. The fraction of informative signal is

$$\frac{O(d)}{O(D)} \to 0 \qquad \text{as } D \to \infty$$

explaining why $\epsilon$-prediction becomes poorly conditioned in high-dimensional ambient spaces.
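
The same accounting can be tallied numerically. Keeping the Gaussian constant $\tfrac{1}{2}\log(2\pi e)$ that the $O(\cdot)$ notation hides, and again using an illustrative $d = 50$:

```python
import math

LOG_2PIE = math.log(2 * math.pi * math.e)

def eps_information_budget(D: int, d: int):
    """Entropy accounting for epsilon-prediction, following the argument above.

    h(eps)       ~ (D / 2)       * log(2*pi*e)  -- full ambient noise
    h(eps | x_t) ~ ((D - d) / 2) * log(2*pi*e)  -- orthogonal components, assumed unrecoverable
    I(eps; x_t)  ~ (d / 2)       * log(2*pi*e)  -- only tangent components carry signal
    """
    h_eps = D / 2 * LOG_2PIE
    mutual_info = h_eps - (D - d) / 2 * LOG_2PIE
    return mutual_info, mutual_info / h_eps

for D in (1_000, 100_000, 10_000_000):
    mi, frac = eps_information_budget(D, d=50)
    print(f"D = {D:>10,}: I(eps; x_t) ~ {mi:5.1f} nats, informative fraction {frac:.1e}")
```

The mutual information stays fixed at roughly $\tfrac{d}{2}\log(2\pi e) \approx 71$ nats while the prediction target grows with $D$, so the informative fraction decays like $d/D$.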

Mutual Information in x₀-Prediction

When predicting $x_0$ directly, the model focuses solely on the structured, low-dimensional signal. The mutual information satisfies

$$I(x_0; x_t) = h(x_0) - h(x_0 \mid x_t) \approx O(d)$$

but crucially, the model now needs to predict only a quantity of intrinsic dimension $d$. The channel preserves $O(d)$ bits of information, and the prediction target has $O(d)$ degrees of freedom, making the estimation problem well-conditioned even when $D$ is very large.

Geometrically, one may think of the data manifold as a curved sheet embedded in $\mathbb{R}^D$. Variations along the sheet (the "tangent directions") describe meaningful changes in the data, while directions orthogonal to the sheet do not. $x_0$-prediction involves only these tangent directions, whereas $\epsilon$-prediction requires modeling noise in all $D$ directions, most of which are irrelevant. As the ambient dimension increases, this disparity becomes increasingly severe.

Conclusion

Theoretically, minimizing the mean squared error yields the optimal $\epsilon$-prediction model $\epsilon_\theta(x_t, t) = \mathbb{E}[\epsilon \mid x_t]$ and the optimal $x_0$-prediction model $x_\theta(x_t, t) = \mathbb{E}[x_0 \mid x_t]$. If both models had infinite capacity to achieve these theoretical optima, they would be equivalent, related by

$$\mathbb{E}[\epsilon \mid x_t] = \frac{x_t - \sqrt{\bar{\alpha}_t}\,\mathbb{E}[x_0 \mid x_t]}{\sqrt{1-\bar{\alpha}_t}}.$$
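
In code, this equivalence is a one-line conversion (a sketch; `x0_hat` stands for the output of an $x_0$-prediction model):

```python
import torch

def eps_hat_from_x0_hat(x_t: torch.Tensor, x0_hat: torch.Tensor,
                        alpha_bar_t: float) -> torch.Tensor:
    """Recover the implied noise estimate E[eps | x_t] from an x0-prediction
    model's output x0_hat = E[x0 | x_t], using the identity above."""
    return (x_t - alpha_bar_t ** 0.5 * x0_hat) / (1 - alpha_bar_t) ** 0.5
```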

However, neural networks in practice have only finite capacity and cannot reach these theoretical optima. As the ambient dimension $D$ increases, the $d$-dimensional signal along the tangent directions in $\epsilon$-prediction is overwhelmed by the $(D-d)$-dimensional noise along the orthogonal directions. Consequently, a disproportionate share of the model's capacity is wasted on predicting irreducible randomness. In contrast, $x_0$-prediction lets the model focus all of its finite capacity on the underlying structure, and so achieves robust performance in high dimensions.²



Footnotes

  1. To make sense of the entropy of clean data distributed on a $d$-dimensional manifold, we can think of the clean data as being spread over a thin band of width $\epsilon'$ surrounding the manifold, which keeps the differential entropy finite.

  2. These are only my preliminary, non-rigorous thoughts on understanding diffusion models in the framework of information theory.