Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training

Tony Bonnaire^\dagger, LPENS, Université PSL, Paris, [email protected]
Raphaël Urfin^\dagger, LPENS, Université PSL, Paris, [email protected]
Giulio Biroli, LPENS, Université PSL, Paris, [email protected]
Marc Mézard, Department of Computing Sciences, Bocconi University, Milano, [email protected]
\daggerEqual contribution.

Abstract

Diffusion models have achieved remarkable success across a wide range of generative tasks. A key challenge is understanding the mechanisms that prevent their memorization of training data and allow generalization. In this work, we investigate the role of the training dynamics in the transition from generalization to memorization. Through extensive experiments and theoretical analysis, we identify two distinct timescales: an early time τgen\tau_\mathrm{gen} at which models begin to generate high-quality samples, and a later time τmem\tau_\mathrm{mem} beyond which memorization emerges. Crucially, we find that τmem\tau_\mathrm{mem} increases linearly with the training set size nn, while τgen\tau_\mathrm{gen} remains constant. This creates a window of training times, growing with nn, in which models generalize effectively, despite showing strong memorization if training continues beyond it. It is only when nn becomes larger than a model-dependent threshold that overfitting disappears at infinite training times. These findings reveal a form of implicit dynamical regularization in the training dynamics, which allows models to avoid memorization even in highly overparameterized settings. Our results are supported by numerical experiments with standard U-Net architectures on realistic and synthetic datasets, and by a theoretical analysis using a tractable random features model studied in the high-dimensional limit.

1. Introduction

Diffusion Models [DMs, 1, 2, 3, 4] achieve state-of-the-art performance in a wide variety of AI tasks such as the generation of images [5], audio [6], videos [7], and scientific data [8, 9]. This class of generative models, inspired by out-of-equilibrium thermodynamics [1], corresponds to a two-stage process: the first one, called forward, gradually adds noise to the data, whereas the second one, called backward, generates new data by denoising Gaussian white noise samples. In DMs, the reverse process typically involves solving a stochastic differential equation (SDE) with a force field called score. However, it is also possible to define a deterministic transport through an ordinary differential equation (ODE), treating the score as a velocity field, an approach that is for instance followed in flow matching [10].
Understanding the generalization properties of score-based generative methods is a central issue in machine learning, and a particularly important question is how memorization of the training set is avoided in practice. A model without regularization achieving zero training loss only learns the empirical score, and is bound to reproduce samples of the training dataset at the end of the backward process. This memorization regime [11, 12] is empirically observed when the training set is small and disappears when it increases beyond a model-dependent threshold [13]. Understanding the mechanisms controlling this change of regimes from memorization to generalization is a central challenge for both theory and applications. Model regularization and inductive biases imposed by the network architecture were shown to play a role [14, 15], as well as a dynamical regularization due to the finiteness of the learning rate [16]. However, the regime shift described above is consistently observed even in models where all these regularization mechanisms are present. This suggests that the core mechanism behind the transition from memorization to generalization lies elsewhere. In this work, we demonstrate -- first through numerical experiments, and then via the theoretical analysis of a simplified model -- that this transition is driven by an implicit dynamical bias towards generalizing solutions that emerges during training, and which allows the model to avoid the memorization phase.

Figure 1: Qualitative summary of our contributions. (Left) Illustration of the training dynamics of a diffusion model. Depending on the training time τ\tau, we identify three regimes measured by the inverse quality of the generated samples (blue curve) and their memorization fraction (red curve). The generalization regime extends over a large window of training times which increases with the training set size nn. On top, we show a one dimensional example of the learned score function during training (orange). The gray line gives the exact empirical score, at a given noise level, while the black dashed line corresponds to the true (population) score. (Right) Phase diagram in the (n,p)(n, p) plane illustrating three regimes of diffusion models: Memorization when nn is sufficiently small at fixed pp, Architectural Regularization for n>n(p)n>n^{\star}(p) (which is model and dataset dependent, as discussed in [17, 14]), and Dynamical Regularization, corresponding to a large intermediate generalization regime obtained when the training dynamics is stopped early, i.e. τ[τgen,τmem]\tau \in \left[\tau_\mathrm{gen}, \tau_\mathrm{mem}\right].

Contributions and theoretical picture. We investigate the dynamics of score learning using gradient descent, both numerically and analytically, and study the generation properties of the score depending on the time τ\tau at which the training is stopped. The theoretical picture built from our results and combining several findings from the recent literature is illustrated in Figure 1. The two main parameters are the size of the training set nn and the expressivity of the class of score functions on which one trains the model, characterized by a number of parameters pp; when both nn and pp are large one can identify three main regimes. Given pp, if nn is larger than n(p)n^*(p) (which depends on the training set and on the class of scores), the score model is not expressive enough to represent the empirical score associated with the nn training data, and instead provides a smooth interpolation, approximately independent of the training set. In this regime, even with a very large training time τ\tau\to\infty, memorization does not occur because the model is regularized by its architecture and the finite number of parameters. When n<n(p)n<n^*(p) the model is expressive enough to memorize, and two timescales emerge during training: one, τgen\tau_\mathrm{gen}, is the minimum training time required to achieve high-quality data generation; the second, τmem>τgen\tau_\mathrm{mem}>\tau_\mathrm{gen}, signals when further training induces memorization, and causes the model to increasingly reproduce the training samples (left panel). The first timescale, τgen\tau_\mathrm{gen}, is found to be independent of nn, whereas the second, τmem\tau_\mathrm{mem}, grows approximately linearly with nn, thus opening a large window of training times during which the model generalizes if training is stopped early, i.e. when τ[τgen,τmem]\tau \in [\tau_\mathrm{gen}, \tau_\mathrm{mem}].
Our results show that implicit dynamical regularization in training plays a crucial role in score-based generative models, substantially enlarging the generalization regime (see right panel of Figure 1), and hence allowing models to avoid memorization even in highly overparameterized settings. We find that the key mechanism behind the widening gap between τgen\tau_\mathrm{gen} and τmem\tau_\mathrm{mem} is the irregularity of the empirical score at low noise level and large nn. In this regime the models used to approximate the score provide a smooth interpolation that remains stable over a long window of training times and closely approximates the population score, a behavior likely rooted in the spectral bias of neural networks [18]. Only at very long training times do the dynamics converge to the low-lying minimum corresponding to the empirical score, leading to memorization (as illustrated in the one-dimensional examples in the left panel of Figure 1).
The theoretical picture described above is based on our numerical and analytical results, and builds on previous works, in particular numerical analyses characterizing the memorization--generalization transition [19, 20], analytical works on memorization of DMs [17, 14, 13], and studies on the spectral bias of deep neural networks [18]. Our numerical experiments use a class of scores based on a realistic U-Net [21] trained on downscaled images of the CelebA dataset [22]. By varying nn and pp, we measure the evolution of the sample quality (through FID) and the fraction of memorization during learning, which support the theoretical scenario presented in Figure 1. Additional experimental results on synthetic data are provided in Supplemental Material (SM, Sects. Appendix A and Appendix B). On the analytical side, we focus on a class of scores constructed from random features and simplified models of data, following [17]. In this setting, the timescales of training dynamics correspond directly to the inverse eigenvalues of the random feature correlation matrix. Leveraging tools from random matrix theory, we compute the spectrum in the limit of large datasets, high-dimensional data, and overparameterized models. This analysis reveals, in a fully tractable way, how the theoretical picture of Figure 1 emerges within the random feature framework.
Related works.
  • The memorization transition in DMs has been the subject of several recent empirical investigations [23, 24, 25] which have demonstrated that state-of-the-art image DMs -- including Stable Diffusion and DALL·E -- can reproduce a non-negligible portion of their training data, indicating a form of memorization. Several additional works [19, 20] examined how this phenomenon is influenced by factors such as data distribution, model configuration, and training procedure, and provide a strong basis for the numerical part of our work.
  • A series of theoretical studies in the high-dimensional regime have analyzed the memorization--generalization transition during the generative dynamics under the empirical score assumption [12, 26, 27], showing how trajectories are attracted to the training samples. Within this high-dimensional framework, [28, 29, 30, 17] study the score learning for various model classes. In particular, [17] uses a Random Feature Neural Network [31]. The authors compute the asymptotic training and test losses for τ\tau\rightarrow\infty and relate it to memorization.
    The theoretical part of our work generalizes this approach to study the role of training dynamics and early stopping in the memorization--generalization transition.
  • Recent works have also uncovered complementary sources of implicit regularization explaining how DMs avoid memorization. Architectural biases and limited network capacity were for instance shown to constrain memorization in [14, 13], and finiteness of the learning rate prevents the model from learning the empirical score in [16]. Also related to our analysis, [32] provides general bounds showing the beneficial role of early stopping the training dynamics to enhance generalization for finitely supported target distributions, as well as a study of its effect for one-dimensional Gaussian mixtures.
  • Finally, previous studies on supervised learning [18, 33], and more recently on DMs [34], have shown that deep neural networks display a frequency-dependent learning speed, and hence a learning bias towards low frequency functions.
    This fact plays an important role in the results we present since the empirical score contains a low frequency part that is close to the population score, and a high-frequency part that is dataset-dependent. To the best of our knowledge, the training time to learn the high-frequency part and hence memorize, that we find to scale with nn, has not been studied from this perspective in the context of score-based generative methods.
Setting: generative diffusion and score learning. Standard DMs define a transport from a target distribution P0P_0 in Rd\mathbb{R}^d to a Gaussian white noise N(0,Id)\mathcal{N}(0, \bm{I}_d) through a forward process defined as an Ornstein-Uhlenbeck (OU) stochastic differential equation (SDE):
dx=x(t)dt+dB(t),(1)\begin{align} \mathrm{d} \bm{\mathrm{x}} = - \bm{\mathrm{x}}(t) \mathrm{d} t + \mathrm{d} \bm{\mathrm{B}}(t), \end{align}\tag{1}
where dB(t)\mathrm{d} \bm{\mathrm{B}}(t) is the square root of two times the increment of a standard Wiener process. Generation is performed by time-reversing the SDE Equation 1 using the score function s(x,t)=xlogPt(x)\bm{\mathrm{s}}(\bm{\mathrm{x}}, t) = \nabla_{\bm{\mathrm{x}}} \log P_t(\bm{\mathrm{x}}),
dx=[x(t)+2s(x,t)]dt+dB(t),(2)\begin{align} - \mathrm{d} \bm{\mathrm{x}} = \left[\bm{\mathrm{x}}(t) + 2 \bm{\mathrm{s}}(\bm{\mathrm{x}}, t) \right] \mathrm{d} t + \mathrm{d} \bm{\mathrm{B}}(t), \end{align}\tag{2}
where Pt(x)P_t(\bm{\mathrm{x}}) is the probability density at time tt along the forward process, and the noise dB(t)\mathrm{d} \bm{\mathrm{B}}(t) is again the square root of two times the increment of a standard Wiener process. As shown in the seminal works [35, 36], s(x,t)\bm{\mathrm{s}}(\bm{\mathrm{x}}, t) can be obtained by minimizing the score matching loss
s^(x,t)=argminsExP0,ξN(0,Id)[Δts(x(t),t)+ξ2],(3)\begin{align} \hat{\bm{\mathrm{s}}}(\bm{\mathrm{x}}, t)= \arg\min_{\bm{\mathrm{s}}} \mathbb{E}_{\bm{\mathrm{x}} \sim P_0, \bm{\xi} \sim \mathcal{N}(0, \bm{I}_d)} \left[\lVert \sqrt{\Delta_t} \bm{\mathrm{s}}(\bm{\mathrm{x}}(t), t)+ \bm{\xi}\rVert^2 \right], \end{align}\tag{3}
where Δt=1e2t\Delta_t=1-e^{-2t}. In practice, the optimization problem is restricted to a parametrized class of functions sθ(x(t),t)\bm{\mathrm{s}}_{\bm{\theta}}(\bm{\mathrm{x}}(t), t) defined, for example, by a neural network with parameters θ\bm{\theta}. The expectation over x\bm{\mathrm{x}} is replaced by the empirical average over the training set (nn iid samples xν\bm{\mathrm{x}}^\nu drawn from P0P_0),
Lt(θ,{xν}ν=1n)=1nν=1nEξN(0,Id)[Δtsθ(xν(t))+ξ2],(4)\begin{align} \mathcal{L}_t(\bm{\theta}, \{\bm{\mathrm{x}}^\nu\}_{\nu=1}^n) = \frac{1}{n} \sum_{\nu=1}^n \mathbb{E}_{\bm{\xi} \sim \mathcal{N}(0, \bm{I}_d)} \left[\lVert \sqrt{\Delta_t} \bm{\mathrm{s}}_{\bm{\theta}}(\bm{\mathrm{x}}^\nu(t))+ \bm{\xi}\rVert^2\right], \end{align}\tag{4}
where xtν(ξ)=etxν+Δtξ\bm{\mathrm{x}}^\nu_t(\bm{\xi})=e^{-t} \bm{\mathrm{x}}^\nu+\sqrt{\Delta_t} \bm{\xi}. The loss in (Equation 4) can be minimized with standard optimizers, such as stochastic gradient descent [SGD, 37] or Adam [38]. In practice, a single model conditioned on the diffusion time tt is trained by integrating (Equation 4) over time [39]. The solution of the minimization of Equation 4 is the so-called empirical score (e.g. [12, 11]), defined as semp(x,t)=xlogPtemp(x)\bm{\mathrm{s}}_{\mathrm{emp}}(\bm{\mathrm{x}}, t) = \nabla_{\bm{\mathrm{x}}} \log P_t^\mathrm{emp}(\bm{\mathrm{x}}), with
Ptemp(x)=1n(2πΔt)d/2ν=1ne12Δtxxνet22.(5)P_t^\mathrm{emp}(\bm{\mathrm{x}}) = \frac{1}{n\left(2\pi\Delta_t\right)^{d/2}} \sum_{\nu=1}^n e^{-\frac{1}{2\Delta_t} \lVert \bm{\mathrm{x}}- \bm{\mathrm{x}}^{\nu} e^{-t} \rVert^2_2}.\tag{5}
This solution is known to inevitably recreate samples of the training set at the end of the generative process (i.e., it perfectly memorizes), unless nn grows exponentially with the dimension dd [12]. However, this is not the case in many practical applications, where memorization is only observed for relatively small values of nn and disappears well before nn becomes exponentially large in dd. The empirical minimization performed in practice, within a given class of models and a given minimization procedure, does not drive the optimization to the global minimum of Equation 4, but instead to a smoother estimate of the score that is essentially independent of the training set and has good generalization properties [13], much like the global minimum of Equation 3 would. Understanding how this is possible, and in particular the role played by the training dynamics in avoiding memorization, is the central aim of the present work.
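To make the empirical score of Equation 5 concrete, here is a minimal NumPy sketch (ours, not the paper's code; function and variable names are illustrative) that evaluates it on a toy training set by differentiating the log of the Gaussian mixture.

```python
import numpy as np

def empirical_score(x, t, train):
    """Empirical score s_emp(x, t) = grad_x log P_t^emp(x) of Eq. 5:
    a mixture of n Gaussians centred on e^{-t} x^nu with variance Delta_t."""
    Delta_t = 1.0 - np.exp(-2.0 * t)
    centers = np.exp(-t) * train                      # (n, d) shrunk training points
    diff = x[None, :] - centers                       # (n, d)
    logw = -0.5 * np.sum(diff**2, axis=1) / Delta_t   # unnormalized log-weights
    w = np.exp(logw - logw.max())                     # numerically stable softmax
    w /= w.sum()
    # gradient of the log of a Gaussian mixture with equal widths
    return -(x - w @ centers) / Delta_t

rng = np.random.default_rng(0)
train = rng.normal(size=(8, 2))                       # n = 8 toy samples in d = 2
print(empirical_score(rng.normal(size=2), t=0.1, train=train))
```

At small noise levels (small tt), this score develops sharp basins around each training point, illustrating the low-noise irregularity of the empirical score invoked in the introduction.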

2. Generalization and memorization during training of diffusion models


Figure 2: Memorization transition as a function of the training set size nn for U-Net score models on CelebA. (Left) FID (solid lines, left axis) and memorization fraction fmemf_\mathrm{mem} (dashed lines, right axis) against training time τ\tau for various nn. Inset: normalized memorization fraction fmem(τ)/fmem(τmax)f_\mathrm{mem}(\tau)/f_\mathrm{mem}(\tau_\mathrm{max}) with the rescaled time τ/n\tau/n. (Middle) Training (solid lines) and test (dashed lines) loss with τ\tau for several nn at fixed t=0.01t=0.01. Inset: both losses plotted against τ/n\tau/n. Error bars on the losses are imperceptible. (Right) Generated samples from the model trained with n=1024n=1024 for τ=100\tau=100 K or τ=1.62\tau=1.62 M steps, along with their nearest neighbors in the training set.

Data & architecture. We conduct our experiments on the CelebA face dataset [22], which we convert to grayscale downsampled images of size d=32×32d=32\times32, and vary the training set size nn from 128 up to 32768. Our score model has a U-Net architecture [21] with three resolution levels and a base channel width of WW with multipliers 1, 2 and 3 respectively. All our networks are DDPMs [2] trained to predict the injected noise at diffusion time tt using SGD with momentum at fixed batch size min(n,512)\min(n, 512). The models are all conditioned on tt, i.e. a single model approximates the score at all times, and make use of a standard sinusoidal position embedding [40] that is added to the features of each resolution. More details about the numerical setup can be found in SM (Appendix A).
Evaluation metrics. To study the transition from generalization to memorization, we monitor the loss Equation 4 during training at a fixed diffusion time t=0.01t = 0.01. At various numbers of SGD updates τ\tau, we compute the loss on all nn training examples (training loss) and on a held-out test set of 2048 images (test loss). To characterize the score obtained after a training time τ\tau, we assess the originality and quality of samples by generating 10K samples using DDIM accelerated sampling [41]. We compute (i) the Fréchet-Inception Distance [FID, 42] against 10K test samples, which we use to identify the generalization time τgen\tau_\mathrm{gen}; and (ii) the fraction of memorized generated samples fmem(τ)f_\mathrm{mem}(\tau), which gives access to τmem\tau_\mathrm{mem}, the memorization time. Following previous numerical studies [20, 19], a generated sample xτ\bm{\mathrm{x}}_\tau is considered memorized if
Exτ[xτaμ12xτaμ22]<k,(6)\mathbb{E}_{\bm{\mathrm{x}}_\tau} \left[\frac{\lVert \bm{\mathrm{x}}_\tau - \bm{\mathrm{a}}^{\mu_1}\rVert_2}{\lVert \bm{\mathrm{x}}_\tau - \bm{\mathrm{a}}^{\mu_2}\rVert_2} \right] < k,\tag{6}
where aμ1\bm{\mathrm{a}}^{\mu_1} and aμ2\bm{\mathrm{a}}^{\mu_2} are the nearest and second nearest neighbors of xτ\bm{\mathrm{x}}_\tau in the training set in the L2L_2 sense. In what follows, we work with k=1/3k=1/3 [20, 19], but we checked that varying kk to 1/2 or 1/4 does not impact the claims about the scaling. Error bars in the figures correspond to twice the standard deviation over 5 different test sets for FIDs, and 5 noise realizations for Ltrain\mathcal{L}_\mathrm{train} and Ltest\mathcal{L}_\mathrm{test}. For fmemf_\mathrm{mem}, we report the 95% CIs on the mean evaluated with 1,000 bootstrap samples.
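As an illustration of the criterion in Equation 6, the short NumPy sketch below (ours, not the paper's evaluation code) applies the nearest-neighbor ratio test per generated sample and returns the memorization fraction; sizes and names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

def memorization_fraction(generated, train, k=1/3):
    """Fraction of generated samples whose L2 distance to the nearest training
    image is smaller than k times the distance to the second nearest one."""
    d = cdist(generated.reshape(len(generated), -1),
              train.reshape(len(train), -1))          # (n_gen, n_train) distances
    d.sort(axis=1)
    return np.mean(d[:, 0] / d[:, 1] < k)

rng = np.random.default_rng(0)
train = rng.normal(size=(256, 16 * 16))
generated = np.concatenate([rng.normal(size=(90, 16 * 16)),                       # novel samples
                            train[:10] + 1e-3 * rng.normal(size=(10, 16 * 16))])  # near-copies
print(memorization_fraction(generated, train))        # ~0.10: only the near-copies count
```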
Role of training set size on the learning dynamics. At fixed model capacity (p=4×106p=4\times 10^6, base width W=32W=32), we investigate how the training set size nn impacts the previous metrics. In the left panel of Figure 2, we first report the FID (solid lines) and fmem(τ)f_\mathrm{mem}(\tau) (dashed lines) for various nn. All training dynamics exhibit two phases. First, the FID quickly decreases to reach a minimum value on a timescale τgen\tau_\mathrm{gen} (approximately 100 K SGD updates) that does not depend on nn. In the right panel, the generated samples at τ=100\tau=100 K clearly differ from their nearest neighbors in the training set, indicating that the model generalizes correctly. Beyond this time, the FID remains flat. fmem(τ)f_\mathrm{mem}(\tau) is zero until a later time τmem\tau_\mathrm{mem} after which it increases, clearly signaling the entrance into a memorization regime, as illustrated by the generated samples in the right-most panel of Figure 2, which are very close to their nearest neighbors. Both the transition time τmem\tau_\mathrm{mem} and the value of the final fraction fmem(τmax)f_\mathrm{mem}(\tau_\mathrm{max}) (with τmax\tau_\mathrm{max} being one to four million SGD steps) vary with nn. The inset plot shows the normalized memorization fraction fmem(τ)/fmem(τmax)f_\mathrm{mem}(\tau)/f_\mathrm{mem}(\tau_\mathrm{max}) against the rescaled time τ/n\tau/n, making all curves collapse and increase at around τ/n300\tau/n \approx 300, showing that τmemn\tau_\mathrm{mem} \propto n, and demonstrating the existence of a generalization window for τ[τgen,τmem]\tau \in \left[\tau_\mathrm{gen}, \tau_\mathrm{mem}\right] that widens linearly with nn, as illustrated in the left panel of Figure 1.
As highlighted in the introduction, memorization in DMs is ultimately driven by the overfitting of the empirical score semp(x,t)\bm{\mathrm{s}}_\mathrm{emp}(\bm{\mathrm{x}}, t). The evolution of Ltrain(τ)\mathcal{L}_\mathrm{train}(\tau) and Ltest(τ)\mathcal{L}_\mathrm{test}(\tau) at fixed t=0.01t=0.01 is shown in the middle panel of Figure 2 for nn ranging from 512 to 32768. Initially, the two losses remain nearly indistinguishable, indicating that the learned score sθ(x,t)\bm{\mathrm{s}}_{\bm{\theta}}(\bm{\mathrm{x}}, t) does not depend on the training set. Beyond a critical time, Ltrain\mathcal{L}_\mathrm{train} continues to decrease while Ltest\mathcal{L}_\mathrm{test} increases, leading to a nonzero generalization loss whose magnitude depends on nn. As nn increases, this critical time also increases and, eventually, the training and test loss gap shrinks: for n=32768n=32768, the test loss remains close to the training loss, even after 11 million SGD steps. The inset shows the evolution of both losses with τ/n\tau/n, demonstrating that the overfitting time scales linearly with the training set size nn, just like the time τmem\tau_\mathrm{mem} identified in the left panel. Moreover, there is a consistent lag between the overfitting time and τmem\tau_\mathrm{mem} at fixed nn, reflecting the additional training required for the model to overfit the empirical score sufficiently to reproduce the training samples, and therefore to impact the memorization fraction.
Memorization is not due to data repetition. We must stress that this delayed memorization with nn is not due to the mere repetition of training samples, as one might at first expect. In SM Sects. Appendix A and Appendix B, we show that full-batch updates still yield τmemn\tau_\mathrm{mem}\propto n. In other words, even if at fixed τ\tau all models have processed each sample equally often, larger training sets consistently postpone memorization. This confirms that memorization in DMs is driven by a fundamental nn-dependent change in the loss landscape -- not by sample repetition during training.

Figure 3: Effect of the number of parameters in the U-Net architecture on the timescales of the training dynamics. (Left) FID (panels A, B) and normalized memorization fraction fmem(τ)/fmem(τmax)f_\mathrm{mem}(\tau)/f_\mathrm{mem}(\tau_\mathrm{max}) (panels C, D) for various nn and WW during training. In panels B and D, time is rescaled such that all curves collapse. (Right) (n,p)(n, p) phase diagram of generalization vs memorization for U-Nets trained on CelebA. Curves show, for τ{τgen,3τgen,8τgen}\tau \in \{\tau_\mathrm{gen}, 3\tau_\mathrm{gen}, 8\tau_\mathrm{gen}\}, the minimal dataset size n(p)n(p) satisfying fmem(τ)=0f_\mathrm{mem}(\tau)=0. The shaded background indicates the memorization--generalization boundary for τ=τgen\tau=\tau_\mathrm{gen}.

Effect of the model capacity. To study more precisely the role of the model capacity on the memorization--generalization transition, we vary the number of parameters pp by changing the U-Net base width W{8,16,32,48,64}W \in \{8, 16, 32, 48, 64\}, resulting in a total of p{0.26,1,4,9,16}×106p\in\{0.26, 1, 4, 9, 16\}\times10^6 parameters. In the left panel of Figure 3, we plot both the FID (top row) and the normalized memorization fraction (bottom row) as functions of τ\tau for several widths WW and training set sizes nn. Panels A and C demonstrate that higher-capacity networks (larger WW) achieve high-quality generation and begin to memorize earlier than smaller ones. Panels B and D show that the two characteristic timescales simply scale as τgenW1\tau_\mathrm{gen} \propto W^{-1} and τmemnW1\tau_\mathrm{mem} \propto nW^{-1}. In particular, this implies that, for W>8W>8, the critical training set size ngm(p)n_\mathrm{gm}(p) at which τmem=τgen\tau_\mathrm{mem}=\tau_\mathrm{gen} is approximately independent of pp (at least over the limited range of pp we explored). When n>ngm(p)n>n_\mathrm{gm}(p), the interval [τgen,τmem]\left[\tau_\mathrm{gen}, \tau_\mathrm{mem}\right] opens up, so that early stopping within this window yields high-quality samples without memorization. In the right panel of Figure 3, we display this boundary (solid line) in the (n,p)(n, p) plane by fixing the training time to τ=τgen\tau=\tau_\mathrm{gen}, which we identify numerically using the collapse of all FIDs at around Wτgen3×106W\tau_\mathrm{gen}\approx 3\times 10^6 (see panel B), and computing the smallest nn such that fmem(τ)=0f_\mathrm{mem}(\tau)=0. The resulting solid curve delineates two regimes: below the curve, memorization already starts at τgen\tau_\mathrm{gen}; above the curve, the models generalize perfectly under early stopping. We repeat this experiment for τ=3τgen\tau=3\tau_\mathrm{gen} and τ=8τgen\tau=8\tau_\mathrm{gen}, showing saturation to larger and larger pp as τ\tau increases. Eventually, for τ\tau \to \infty, we expect these successive boundaries to converge to the architectural regularization threshold n(p)n^\star(p), i.e. the point beyond which the network avoids memorization because it is not expressive enough, as found in [17] and highlighted in the right panel of Figure 1. In order to estimate n(p)n^\star(p), we measure for a given τ\tau the smallest n(τ)n(\tau) yielding fmem0f_\mathrm{mem}\approx0. The curve n(τ)n(\tau) approaches n(p)n^\star(p) for large τ\tau. We therefore estimate n(p)n^\star(p) by measuring the asymptotic value of n(τ)n(\tau), which in practice is already reached at τ=τmax=2\tau=\tau_\mathrm{max}=2 M updates for the values of WW we focus on.

3. Training dynamics of a Random Features Network

Notations. We use bold symbols for vectors and matrices. The L2L^2 norm of a vector x\bm{\mathrm{x}} is denoted by x=(ixi2)1/2\lVert \bm{\mathrm{x}} \rVert = (\sum_i \bm{\mathrm{x}}_i^2)^{1/2}. We write f=O(g)f = \mathcal{O}(g) to mean that in the limit n,pn, p \to \infty, there exists a constant CC such that fCg\lvert f \rvert \leq C \lvert g \rvert.
Setting. We study analytically a model introduced in [17], where the data lie in dd dimensions. We parametrize the score with a Random Features Neural Network [RFNN, 31]
sA(x)=Apσ(Wxd).(7)\begin{align} \bm{\mathrm{s}}_{\bm{\mathrm{A}}}(\bm{\mathrm{x}})=\frac{\bm{\mathrm{A}}}{\sqrt{p}}\sigma\left(\frac{\bm{\mathrm{W}} \bm{\mathrm{x}}}{\sqrt{d}}\right). \end{align}\tag{7}
An RFNN, illustrated in Figure 4 (left), is a two-layer neural network whose first layer weights (WRp×d\bm{\mathrm{W}}\in\mathbb{R}^{p\times d}) are drawn from a Gaussian distribution and remain frozen, while the second layer weights (ARd×p\bm{\mathrm{A}} \in \mathbb{R}^{d\times p}) are learned during training. This model has already served as a theoretical framework for studying several behaviors of deep neural networks, such as the double descent phenomenon [43, 44]. σ\sigma is an element-wise non-linear activation function. We consider a training set of nn iid samples xνPx\bm{\mathrm{x}}^\nu \sim P_{\bm{\mathrm{x}}} for ν=1,,n\nu = 1, \ldots, n and we focus on the high-dimensional limit d,p,nd, p, n\rightarrow\infty with the ratios ψp=p/d,ψn=n/d\psi_p=p/d, \psi_n=n/d kept fixed. We study the training dynamics associated with the minimization of the empirical score matching loss defined in (Equation 4) at a fixed diffusion time tt. This is a simplification compared to practical methods, which use a single model for all tt; it has already been studied in previous theoretical works [28, 17]. The loss (Equation 4) is rescaled by a factor 1/d1/d in order to ensure a finite limit at large dd. We also study the evolution of the test loss evaluated on test points and the distance to the exact score s(x)=logPx\bm{\mathrm{s}}(\bm{\mathrm{x}})=\nabla\log P_{\bm{\mathrm{x}}},
Ltest=1dEx,ξ[ΔtsA(xt(ξ))+ξ2],Escore=1dEx[sA(x)logPx2],(8)\begin{align} \mathcal{L}_\mathrm{test}=\frac{1}{d}\mathbb{E}_{\bm{\mathrm{x}}, \bm{\xi}}\left[\lVert \sqrt{\Delta_t} \bm{\mathrm{s}}_{\bm{\mathrm{A}}}(\bm{\mathrm{x}}_t(\bm{\xi}))+ \bm{\xi}\rVert ^2\right], \quad\mathcal{E}_\mathrm{score}=\frac{1}{d}\mathbb{E}_{\bm{\mathrm{x}}}\left[\lVert \bm{\mathrm{s}}_{\bm{\mathrm{A}}}(\bm{\mathrm{x}})-\nabla\log P_{\bm{\mathrm{x}}}\rVert^2\right], \end{align}\tag{8}
where the expectations Ex,ξ\mathbb{E}_{\bm{\mathrm{x}}, \bm{\xi}} are computed over xPx\bm{\mathrm{x}}\sim P_{\bm{\mathrm{x}}} and ξN(0,Id)\bm{\xi} \sim \mathcal{N}(0, \bm{I}_d). The generalization loss, defined as Lgen=LtestLtrain\mathcal{L}_\mathrm{gen} = \mathcal{L}_\mathrm{test} - \mathcal{L}_\mathrm{train}, indicates the degree of overfitting in the model while the distance to the exact score Escore\mathcal{E}_\mathrm{score} measures the quality of the generation as it is an upper bound on the Kullback–Leibler divergence between the target and generated distributions [45, 46]. The weights A\bm{\mathrm{A}} are updated via gradient descent
A(k+1)=A(k)ηALtrain(A(k)),(9)\begin{align} \bm{\mathrm{A}}^{(k+1)}= \bm{\mathrm{A}}^{(k)}-\eta\nabla_{\bm{\mathrm{A}}}\mathcal{L}_\mathrm{train}(\bm{\mathrm{A}}^{(k)}), \end{align}\tag{9}
where η\eta is the learning rate. In the high-dimensional limit, as the learning rate η0\eta \to 0, and after rescaling time as τ=kη/d2\tau = k\eta / d^2, the discrete-time dynamics converges to the following continuous-time gradient flow:
A˙(τ)=d2ALtrain(A(τ))=2ΔtdpAU2dΔtpVT,(10)\begin{align} \dot{\bm{\mathrm{A}}}(\tau)=-d^2\nabla_{\bm{\mathrm{A}}}\mathcal{L}_\mathrm{train}(\bm{\mathrm{A}}(\tau)) =-2\Delta_t\frac{d}{p} \bm{\mathrm{A}} \bm{\mathrm{U}}-\frac{2d\sqrt{\Delta_t}}{\sqrt{p}} \bm{\mathrm{V}}^T, \end{align}\tag{10}
with
U=1nν=1nEξ[σ(Wxtν(ξ)d)σ(Wxtν(ξ)d)T],V=1nν=1nEξ[σ(Wxtν(ξ)d)ξT].\begin{align} \bm{\mathrm{U}}=\frac{1}{n}\sum_{\nu=1}^n\mathbb{E}_{\bm{\xi}}\left[\sigma\left(\frac{\bm{\mathrm{W}} \bm{\mathrm{x}}_t^\nu(\bm{\xi})}{\sqrt{d}}\right)\sigma\left(\frac{\bm{\mathrm{W}} \bm{\mathrm{x}}_t^\nu(\bm{\xi})}{\sqrt{d}}\right)^T\right], \quad \bm{\mathrm{V}}=\frac{1}{n}\sum_{\nu=1}^n\mathbb{E}_{\bm{\xi}}\left[\sigma\left(\frac{\bm{\mathrm{W}} \bm{\mathrm{x}}_t^\nu(\bm{\xi})}{\sqrt{d}}\right) \bm{\xi}^T\right]. \end{align}
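To make this concrete, the NumPy sketch below (ours; small sizes, σ = tanh, isotropic Gaussian data, and a Monte Carlo estimate of the ξ-averages) builds U and V and uses the linearity of Equation 10 in A to write its solution mode by mode: the mode associated with an eigenvalue λ of U relaxes on a timescale proportional to ψ_p/(Δ_t λ), which is the origin of the two timescales discussed below.

```python
import numpy as np

rng = np.random.default_rng(0)
d, psi_p, psi_n, t, n_mc = 30, 8, 4, 0.1, 200          # small sizes for illustration
p, n = psi_p * d, psi_n * d
Delta_t = 1.0 - np.exp(-2.0 * t)

W = rng.normal(size=(p, d))                            # frozen first-layer weights
X = rng.normal(size=(n, d))                            # training data ~ N(0, I_d)

# Monte Carlo estimate of the xi-averages defining U (p x p) and V (p x d)
U, V = np.zeros((p, p)), np.zeros((p, d))
for _ in range(n_mc):
    xi = rng.normal(size=(n, d))
    Xt = np.exp(-t) * X + np.sqrt(Delta_t) * xi        # x_t^nu(xi)
    Phi = np.tanh(Xt @ W.T / np.sqrt(d))               # (n, p) random features
    U += Phi.T @ Phi / n
    V += Phi.T @ xi / n
U, V = U / n_mc, V / n_mc

# Linear flow A' = -2 Delta_t (d/p) A U - (2 d sqrt(Delta_t)/sqrt(p)) V^T with A(0) = 0:
# A(tau) = A_inf (I - exp(-c1 U tau)), so the mode with eigenvalue lam relaxes in ~ 1/(c1 lam).
lam, Q = np.linalg.eigh(U)
c1 = 2.0 * Delta_t * d / p                             # = 2 Delta_t / psi_p
A_inf = -(2.0 * d * np.sqrt(Delta_t) / np.sqrt(p)) / c1 * np.linalg.solve(U, V).T

def A_of(tau):
    return A_inf @ (Q * (1.0 - np.exp(-c1 * lam * tau))) @ Q.T

print("slowest / fastest relaxation times:", 1.0 / (c1 * lam.min()), 1.0 / (c1 * lam.max()))
```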

Figure 4: (Left) Illustration of an RFNN. (Middle/Right) Spectrum of U\bm{\mathrm{U}}. Density ρ(λ)\rho(\lambda) from Theorem 1 in the overparameterized Regime I described in Theorem 2, with ψp=64\psi_p = 64, ψn=8\psi_n = 8, t=0.01t = 0.01, and ρΣ(λ)=δ(λ1)\rho_{\boldsymbol{\Sigma}}(\lambda)=\delta(\lambda-1). The bulk of the spectrum (orange) is between λ10\lambda\approx10 and λ45\lambda\approx45. The histogram shows the eigenvalues from a single realization of U\bm{\mathrm{U}} at d=100d = 100. Inset: zoom near λ=0\lambda = 0 (in blue) showing the first bulk ρ1\rho_1 and the delta peak at λ=st2\lambda = s_t^2. (Right) Same as (Middle), but with ρΣ(λ)=12δ(λ0.5)+12δ(λ1.5)\rho_{\boldsymbol{\Sigma}}(\lambda) = \frac{1}{2}\delta(\lambda - 0.5) + \frac{1}{2}\delta(\lambda - 1.5). The first bulk in blue remains unchanged, as it depends only on σx2=Tr(Σ)/d=1\sigma_{\bm{\mathrm{x}}}^2 = \operatorname{Tr}(\boldsymbol{\Sigma})/d = 1 in both cases, while the second bulk varies with Σ\boldsymbol{\Sigma}.

Assumptions. For our analytical results to hold, we make the following mathematical assumptions which are standard when studying Random Features [47, 48, 49] namely (i) the activation function σ\sigma admits a Hermite polynomial expansion σ(x)=s=0αss!Hes(x)\sigma(x)=\sum_{s=0}^\infty\frac{\alpha_s}{s!}He_s(x); and (ii) the data distribution PxP_{\bm{\mathrm{x}}} has sub-Gaussian tails and a covariance Σ=EPx[xxT]\boldsymbol{\Sigma}=\mathbb{E}_{P_{\bm{\mathrm{x}}}}[\bm{\mathrm{x}} \bm{\mathrm{x}}^T] with bounded spectrum. We assume that the empirical distribution of eigenvalues of Σ\boldsymbol{\Sigma} converges weakly in the high dimensional limit to a deterministic density ρΣ(λ)\rho_{\boldsymbol{\Sigma}}(\lambda) and that Tr(Σ)/d\operatorname{Tr}(\boldsymbol{\Sigma})/d converges to a finite limit (for a more precise mathematical statement see SM Appendix C.3). Moreover, we make additional assumptions that are not essential to the proofs but which simplify the analysis: (iii) the activation function σ\sigma verifies μ0=Ez[σ(z)]=0\mu_0=\mathbb{E}_z[\sigma(z)]=0; and (iv) the second layer A\bm{\mathrm{A}} is initialized with zero weights A(τ=0)=0\bm{\mathrm{A}}(\tau=0)=0. In numerical applications, unless specified, we use σ(z)=tanh(z)\sigma(z)=\tanh(z) and Px=N(0,Id)P_{\bm{\mathrm{x}}}=\mathcal{N}(0, \bm{I}_d).

Figure 5: Evolution of the training and test losses for the RFNN. (A) Distance to the true score Escore\mathcal{E}_\mathrm{score} against training time τ\tau for ψn=4,8,16,32\psi_n=4, 8, 16, 32, ψp=64,t=0.1\psi_p=64, t=0.1 and d=100d=100. In the inset, the training time is rescaled by τmem=ψp/Δtλmin\tau_\mathrm{mem}=\psi_p/\Delta_t\lambda_\mathrm{min}. (B) Training (solid) and test (dashed) losses for various ψn\psi_n. The inset shows both losses rescaled by τmem\tau_\mathrm{mem}. (C) Heatmaps of Lgen\mathcal{L}_\mathrm{gen} for τ=103\tau=10^{3} (top) and τ=104\tau=10^4 (bottom) as a function of ψn\psi_n and ψp\psi_p. All the curves use Pytorch [50] gradient descent. More numerical details can be found in SM Section D.

Emergence of the two timescales during training. We first show in Figure 5 that the behavior of training and test losses in the RF model mirrors the one found in realistic cases in Section 2, with a separation of timescales τgen\tau_\mathrm{gen} and τmem\tau_\mathrm{mem} which increases with nn. Equation (Equation 10) is linear in A\bm{\mathrm{A}} and hence it can be solved exactly (see SM). The timescales of the training dynamics are given by the inverse eigenvalues of the p×pp\times p matrix ΔtU/ψp\Delta_t \bm{\mathrm{U}}/\psi_p. Building on the Gaussian Equivalence Principle [GEP, 51, 48, 52] and the theory of linear pencils [53], the authors of [17] derive a coupled system of equations characterizing the Stieltjes transform of the eigenvalue density ρ(λ)\rho(\lambda) of U\bm{\mathrm{U}} for isotropic Gaussian data that lie in a DD-dimensional subspace with DdD\le d and D=O(d)D=\mathcal{O}(d). We offer an alternative derivation, presented in SM and valid for a general covariance, using the replica method [54] -- a heuristic method from the statistical physics of disordered systems -- yielding the more compact formulation of the spectrum stated in Theorem 1. Before stating the theorem, we introduce
bt=Eu,v[vσ(etσxu+Δtv)],at=Eu,v[σ(etσxu+Δtv)uetσx],vt2=Eu,v,w[σ(etσxu+Δtv)σ(etσxu+Δtw)]at2e2tσx2,st2=Eu[σ(Γtu)2]at2e2tσx2vt2bt2,\begin{align} &b_t=\mathbb{E}_{u, v}[v\sigma(e^{-t}\sigma_{\bm{\mathrm{x}}}u+\sqrt{\Delta_t}v)], \quad a_t=\mathbb{E}_{u, v}[\sigma(e^{-t}\sigma_{\bm{\mathrm{x}}}u+\sqrt{\Delta_t}v)\frac{u}{e^{-t}\sigma_{\bm{\mathrm{x}}}}], \\ &v_t^2=\mathbb{E}_{u, v, w}[\sigma(e^{-t}\sigma_{\bm{\mathrm{x}}}u+\sqrt{\Delta_t}v)\sigma(e^{-t}\sigma_{\bm{\mathrm{x}}}u+\sqrt{\Delta_t}w)]-a_t^2e^{-2t}\sigma_{\bm{\mathrm{x}}}^2, \\ &s_t^2= \mathbb{E}_u[\sigma(\Gamma_t u)^2]-a_t^2e^{-2t}\sigma_{\bm{\mathrm{x}}}^2-v_t^2-b_t^2, \end{align}
where σx2=Tr(Σ)d\sigma_{\bm{\mathrm{x}}}^2=\frac{\operatorname{Tr}(\boldsymbol{\Sigma})}{d}, Γt=e2tσx2+Δt=1+e2t(σx21)\Gamma_t=e^{-2t}\sigma_{\bm{\mathrm{x}}}^2+\Delta_t=1+e^{-2t}(\sigma_{\bm{\mathrm{x}}}^2-1) and the expectation is over the u,v,wu, v, w random variables which are independent standard Gaussian N(0,1)\mathcal{N}(0, 1).
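For concreteness, these Gaussian expectations can be evaluated numerically. The sketch below (ours) uses Gauss-Hermite quadrature for σ = tanh and σ_x = 1 (so that Γ_t = 1), the default setting of the numerical experiments; these constants are what enter Theorems 1 and 2.

```python
import numpy as np

def rf_constants(t, sigma=np.tanh, n_nodes=80):
    """Gauss-Hermite evaluation of (b_t, a_t, v_t^2, s_t^2) for sigma_x = 1,
    hence Gamma_t = 1.  Uses E_{u~N(0,1)}[f(u)] ~ sum_i w_i f(sqrt(2) x_i)/sqrt(pi)."""
    x, w = np.polynomial.hermite.hermgauss(n_nodes)
    u, w = np.sqrt(2.0) * x, w / np.sqrt(np.pi)        # nodes/weights for N(0, 1)
    et, Dt = np.exp(-t), 1.0 - np.exp(-2.0 * t)
    U, V = np.meshgrid(u, u, indexing="ij")            # 2D grid over (u, v)
    W2 = np.outer(w, w)
    S = sigma(et * U + np.sqrt(Dt) * V)                # sigma(e^{-t} u + sqrt(Delta_t) v)
    b = np.sum(W2 * V * S)                             # b_t
    a = np.sum(W2 * U * S) / et                        # a_t
    g = S @ w                                          # g(u) = E_v[sigma(...)]
    v2 = np.sum(w * g**2) - (a * et) ** 2              # v_t^2
    s2 = np.sum(w * sigma(u) ** 2) - (a * et) ** 2 - v2 - b**2   # s_t^2 (Gamma_t = 1)
    return b, a, v2, s2

print(rf_constants(t=0.01))
```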

Theorem 1

Let q(z)=1pTr(UzIp)1q(z)=\frac{1}{p} \operatorname{Tr}(\bm{\mathrm{U}}-z \bm{I}_p)^{-1}, r(z)=1pTr(Σ1/2WT(UzIp)1WΣ1/2)r(z)=\frac{1}{p} \operatorname{Tr}(\boldsymbol{\Sigma}^{1/2} \bm{\mathrm{W}}^T(\bm{\mathrm{U}}-z \bm{I}_p)^{-1} \bm{\mathrm{W}} \boldsymbol{\Sigma}^{1/2}) and s(z)=1pTr(WT(UzIp)1W)s(z)=\frac{1}{p} \operatorname{Tr}(\bm{\mathrm{W}}^T(\bm{\mathrm{U}}-z \bm{I}_p)^{-1} \bm{\mathrm{W}}), with zCz\in\mathbb{C}. Let
s^(q)=bt2ψp+1q,r^(r,q)=ψpat2e2t1+at2e2tψpψnr+ψpvt2ψnq.\begin{align} &\hat{s}(q)=b_t^2\psi_p+\frac{1}{q}, \\ &\hat{r}(r, q)=\frac{\psi_p a_t^2e^{-2t}}{1+\frac{a_t^2e^{-2t}\psi_p }{\psi_n }r+\frac{\psi_p v_t^2}{\psi_n }q}. \end{align}
Then q(z),r(z)q(z), r(z) and s(z)s(z) satisfy the following set of three equations:
s=dρΣ(λ)1s^(q)+λr^(r,q),r=dρΣ(λ)λs^(q)+λr^(r,q),ψp(st2z)+ψpvt21+at2e2tψpψnr+ψpvt2ψnq+1ψpqsq2=0,\begin{align} &s=\int \mathrm{d} \rho_{\boldsymbol{\Sigma}}(\lambda)\frac{1}{\hat{s}(q)+\lambda\hat{r}(r, q)}, \\ &r=\int \mathrm{d} \rho_{\boldsymbol{\Sigma}}(\lambda)\frac{\lambda}{\hat{s}(q)+\lambda\hat{r}(r, q)}, \\ &\psi_p(s_t^2-z)+\frac{\psi_pv_t^2}{1+\frac{a_t^2e^{-2t}\psi_p} {\psi_n} r+\frac{\psi_p v_t^2}{\psi_n} q}+\frac{1-\psi_p}{q}-\frac{s}{q^2}=0, \end{align}
The eigenvalue distribution of U\bm{\mathrm{U}}, ρ(λ)\rho(\lambda), can then be obtained using the Sokhotski–Plemelj inversion formula ρ(λ)=limε0+1πImq(λ+iε)\rho(\lambda)=\underset{\varepsilon \rightarrow0^+}{\lim}\frac{1}{\pi}\operatorname{Im}q(\lambda+i\varepsilon).
We now focus on the asymptotic regime ψp,ψn1\psi_p, \psi_n \gg 1, typical for strongly over‑parameterized models trained on large data sets. In this limit, the spectrum of U\bm{\mathrm{U}} can be described analytically by the following Theorem 2.

Theorem 2: Informal

Let ρ\rho denote the spectral density of U\bm{\mathrm{U}}.
  • Regime I (overparametrized): ψp>ψn1\psi_p>\psi_n\gg 1.
ρ(λ)=(11+ψnψp)δ(λst2)+ψnψpρ1(λ)+1ψpρ2(λ).\rho(\lambda)=\Bigl(1-\frac{1+\psi_n}{\psi_p}\Bigr)\delta(\lambda-{s_t^2}) +\frac{\psi_n}{\psi_p}\, \rho_1(\lambda) +\frac{1}{\psi_p}\, \rho_2(\lambda).
  • Regime II (underparametrized): ψn>ψp1\psi_n>\psi_p\gg 1.
ρ(λ)=(11ψp)ρ1(λ)+1ψpρ2(λ).\rho(\lambda)=\Bigl(1-\frac{1}{\psi_p}\Bigr)\rho_1(\lambda) +\frac{1}{\psi_p}\, \rho_2(\lambda).
where ρ1\rho_1 is an atomless measure with support
[st2+vt2(1ψp/ψn)2,  st2+vt2(1+ψp/ψn)2],\left[s_t^2 + v_t^2\left(1-\sqrt{\psi_p/\psi_n}\right)^{2}, \; s_t^2 + v_t^2\left(1+\sqrt{\psi_p/\psi_n}\right)^{2}\right],
and ρ2\rho_2 coincides with the asymptotic eigenvalue bulk density of the population covariance U~=EX[U]\tilde{\bm{\mathrm{U}}}=\mathbb{E}_{\bm{\mathrm{X}}}[\bm{\mathrm{U}}]; ρ2\rho_2 is independent of ψn\psi_n and its support is on the scale ψp\psi_p.
The eigenvectors associated with δ(λst2)\delta(\lambda-{s_t^2}) leave both training and test losses unchanged and are therefore irrelevant. In the limit ψpψn\psi_p\gg \psi_n, the supports of ρ1\rho_1 and ρ2\rho_2 are respectively on the scales ψp/ψn\psi_p/\psi_n and ψp\psi_p, i.e. they are well separated.
The proofs of both theorems are shown in SM (Appendix C). We recall that training timescales are directly related to the eigenvalues λ\lambda via the relation τ=ψp/Δtλ\tau= \psi_p / \Delta_t \lambda. Theorem 2 therefore demonstrates the emergence of the two training timescales τmem\tau_{\mathrm{mem}} and τgen\tau_{\mathrm{gen}} in the overparametrized regime of the RFNN model. They are respectively associated with the measures ρ1\rho_1 and ρ2\rho_2, which are well separated in regime I, for ψpψn1\psi_p\gg\psi_n\gg1, as shown in Figure 4.
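The scale separation can be checked directly from the expressions in Theorem 2. The sketch below (ours; σ = tanh, σ_x = 1, constants estimated by a simple Monte Carlo average) prints the support edges of ρ1 and the associated slow timescale τ_mem = ψ_p/(Δ_t λ_min) for several ψ_n at fixed ψ_p: since λ_min ≈ v_t² ψ_p/ψ_n when ψ_p ≫ ψ_n, τ_mem grows roughly linearly with ψ_n, i.e. with n.

```python
import numpy as np

rng = np.random.default_rng(0)

def constants_mc(t, n_samples=2_000_000, sigma=np.tanh):
    """Monte Carlo estimate of v_t^2 and s_t^2 for sigma_x = 1 (Gamma_t = 1)."""
    et, Dt = np.exp(-t), 1.0 - np.exp(-2.0 * t)
    u, v, w = rng.normal(size=(3, n_samples))
    s_uv = sigma(et * u + np.sqrt(Dt) * v)
    s_uw = sigma(et * u + np.sqrt(Dt) * w)
    a = np.mean(u * s_uv) / et
    b = np.mean(v * s_uv)
    v2 = np.mean(s_uv * s_uw) - (a * et) ** 2
    s2 = np.mean(sigma(u) ** 2) - (a * et) ** 2 - v2 - b**2
    return v2, s2

t, psi_p = 0.01, 64
Dt = 1.0 - np.exp(-2.0 * t)
v2, s2 = constants_mc(t)
for psi_n in (4, 8, 16, 32):
    ratio = psi_p / psi_n
    lam_min = s2 + v2 * (1.0 - np.sqrt(ratio)) ** 2    # left edge of rho_1 (Regime I)
    lam_max = s2 + v2 * (1.0 + np.sqrt(ratio)) ** 2    # right edge of rho_1
    tau_mem = psi_p / (Dt * lam_min)                   # slow (memorization) timescale
    print(f"psi_n={psi_n:3d}  rho_1 support=[{lam_min:.3f}, {lam_max:.3f}]  tau_mem~{tau_mem:.0f}")
```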
Generalization: The timescale τgen\tau_{\mathrm{gen}} on which the first relaxation takes place is associated with the formation of the generalization regime. It is related to the bulk ρ2\rho_2 and is of order 1/Δt1/\Delta_t. This regime only depends on the population covariance Σ\boldsymbol{\Sigma} of the data and is independent of the specific realization of the dataset. On this timescale, which is of order one, both the training Ltrain\mathcal{L}_{\mathrm{train}} and test Ltest\mathcal{L}_{\mathrm{test}} losses decrease. The generalization loss Lgen=LtestLtrain\mathcal{L}_\mathrm{gen} = \mathcal{L}_{\mathrm{test}}-\mathcal{L}_{\mathrm{train}} is zero, and Escore\mathcal{E}_\mathrm{score} tends to a value that we find to scale as O(ψnη)\mathcal{O}(\psi_n^{-\eta}) with η0.59\eta\simeq0.59 numerically (see Figure 5).
Memorization: The timescale τmem\tau_{\mathrm{mem}}, on which the second stage of the dynamics takes place, is associated with overfitting and memorization. It is related to the bulk ρ1\rho_1, and scales as ψp/Δtλmin\psi_p / \Delta_t \lambda_\mathrm{\min}, where λmin\lambda_\mathrm{\min} is the left edge of ρ1\rho_1. In the overparameterized regime pnp \gg n, τmem\tau_{\mathrm{mem}} becomes large and of order ψn/Δt\psi_n/\Delta_t, thus implying a scaling of τmem\tau_{\mathrm{mem}} with nn. On this timescale, the training loss decreases while the test loss increases, converging to their respective asymptotic values as computed in [17]. Figure 5 indeed shows that all training and test curves separate, and correspondingly the generalization loss Lgen\mathcal{L}_{\mathrm{gen}} increases, at a time that scales with ψp/Δtλmin\psi_p /\Delta_t \lambda_{\min}, as shown in the inset.
As nn increases, the asymptotic (τ\tau \rightarrow \infty) generalization loss Lgen\mathcal{L}_{\mathrm{gen}} decreases, indicating reduced overfitting. For n>n(p)=pn>n^*(p) = p, although some overfitting remains (i.e., Lgen>0\mathcal{L}_{\mathrm{gen}} > 0), the value of Lgen\mathcal{L}_{\mathrm{gen}} is substantially reduced, and the model is no longer expressive enough to memorize the training data, as shown in [17]. This regime corresponds to the Architectural Regularization phase in Figure 1. We show in Figure 5 (panel C) how the generalization loss Lgen\mathcal{L}_{\mathrm{gen}} varies in the (n,p)(n, p) plane depending on the time τ\tau at which training is stopped. In agreement with the above results, we find that the generalization--memorization transition line depends on τ\tau and moves upward for larger values of τ\tau, similarly to the numerical results presented in Figure 3 and the illustration in Figure 1.

4. Conclusions

We have shown that the training dynamics of neural network-based score functions display a form of implicit regularization that prevents memorization even in highly overparameterized diffusion models. Specifically, we have identified two well-separated timescales in the learning: τgen\tau_\mathrm{gen}, at which models begin to generate high-quality, novel samples, and τmem\tau_\mathrm{mem}, beyond which they start to memorize the training data. The gap between these timescales grows with the size of the training set, leading to a broad window where early-stopped models generate high-quality novel samples. We have demonstrated that this phenomenon happens in realistic settings, for controlled synthetic data, and in analytically tractable models. Although our analysis focuses on DMs, the underlying score-learning mechanism we uncover is common to all score-based generative models such as stochastic interpolants [55] or flow matching [10]; we therefore expect our results to generalize to this broader class.
Limitations and future works.
  • While we derived our results under SGD optimization, most DMs are trained in practice with Adam [38]. In SM Sects. Appendix A and Appendix D, we show that the two key timescales still arise using Adam, although after far fewer optimization steps. Studying how different optimizers shift these timescales would be valuable for practical usage.
  • All experiments in Section 2 are conducted with unconditional DMs. We additionally verify in SM Sect. B, using a toy Gaussian mixture dataset and classifier-free guidance [56], that the same scaling of τmem\tau_\mathrm{mem} with nn holds in the conditional settings. Understanding precisely how the absolute timescales τmem\tau_\mathrm{mem} and τgen\tau_\mathrm{gen} depend on the conditioning remains an open question.
  • Our numerical experiments cover a range of pp between 1M and 16M. Exploring a wider range is essential to map the full (n,p)(n, p) phase diagram sketched in Figure 1 and understand the precise effect of expressivity on dynamical regularization.
  • Finally, our theoretical analysis relies on well-controlled data and score models that reproduce the core effects. Extending these analytical frameworks to richer data distributions (such as Gaussian mixtures or data from the hidden manifold model) and to structured architectures would be valuable to further characterize the implicit dynamical regularization of training score functions. In particular, investigating how heavy-tailed data distributions [57] affect the picture described here could be valuable.
  • Although DMs trained on large and diverse datasets likely avoid the memorization regime we study here, some industrial models were shown to exhibit partial memorization [23, 24]. Our results provide practical guidelines (early stopping, controlling the network capacity) to train DMs robustly and hence avoid memorization, which can be especially helpful in data-scarce domains (e.g., physical sciences).

Acknowledgments

The authors thank Valentin De Bortoli for initial motivating discussions on memorization--generalization transitions. RU thanks Beatrice Achilli, Jérome Garnier-Brun, Carlo Lucibello and Enrico Ventura for insightful discussions. RU is grateful to Bocconi University for its hospitality during his stay, during which part of this work was conducted. This work was performed using HPC resources from GENCI-IDRIS (Grant 2025-AD011016319). GB acknowledges support from the French government under the management of the Agence Nationale PR[AI] RIE-PSAI (ANR-23-IACL-0008). MM acknowledges the support of the PNRR-PE-AI FAIR project funded by the NextGeneration EU program. After completing this work, we became aware that A. Favero, A. Sclocchi, and M. Wyart [58] had also been investigating the memorization--generalization transition from a similar perspective.

Appendix

Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training
Supplementary Material (SM)
Tony Bonnaire^\dagger, Raphaël Urfin^\dagger, Giulio Biroli, Marc Mézard
This document provides detailed derivations and additional experiments supporting the main text (MT). In Appendix A, we give details about the numerical experiments carried out in Section 2. In Appendix B we provide additional numerical experiments on simplified score and data models. Appendix C gives formal proofs of the main theorems of Section 3. Finally, Appendix D provides more details on the numerical experiments of Section 3.

A. Numerical experiments on CelebA

A.1 Details on the numerical setup

Dataset. All numerical experiments in Section 2 of the MT use the CelebA face dataset [22]. We center-crop each RGB image to 32×3232\times 32 pixels and convert to grayscale images in order to accelerate the training of our Diffusion Models (DMs). To precisely control the samples seen by a model, no data augmentation is applied, and we vary the training set size nn in the window [128,32768]\left[128, 32768\right]. Examples of training samples are shown in the left-most block of Figure 6.
Architecture. As commonly done in DDPM implementations [e.g., 2, 4], the network approximating the score function is a U-Net [21] made of three resolution levels, each containing two residual blocks with channel multipliers {1,2,4}\{1, 2, 4\} respectively. We apply attention to the two coarsest resolutions, and embed the diffusion time via sinusoidal position embedding [40]. The base channel width WW varies from 16 to 64 depending on the experiment, resulting in a total of 1 to 16 million trainable parameters.
Time reparameterization. Compared to the framework presented in the MT, the DDPMs we train make use of a time reparameterization of the forward and backward processes with a variance schedule {βt}t=1T\{\beta_{t'}\}_{t'=1}^T, where TT is the time horizon given as a number of steps, fixed to 1000 in our experiments. The variance evolves linearly from β1=104\beta_1 = 10^{-4} to β1000=2×102\beta_{1000} = 2 \times 10^{-2}. A sample at time tt', denoted x(t)\bm{\mathrm{x}}(t'), can be expressed from x(0)\bm{\mathrm{x}}(0) as the following interpolation
x(t)=α(t)x(0)+1α(t)ξ,\bm{\mathrm{x}}(t') = \sqrt{\overline{\alpha}(t')} \bm{\mathrm{x}}(0) + \sqrt{1 - \overline{\alpha}(t')} \bm{\xi},
where α(t)=s=1t(1βs)\overline{\alpha}(t') = \prod_{s=1}^{t'} (1-\beta_s), and ξ\bm{\xi} is a standard and centered Gaussian noise. This is a reparameterization of the Ornstein-Uhlenbeck process from Equation 1 defined through time tt in the MT, with
t=12log(α(t)).t = -\frac{1}{2} \log \left(\overline{\alpha}(t')\right).
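A minimal sketch (ours) of this reparameterization, using the linear schedule given above: it builds ᾱ(t'), the corresponding continuous time t, and the forward interpolation x(t').

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 2e-2, T)              # linear variance schedule beta_1..beta_T
alpha_bar = np.cumprod(1.0 - betas)             # \bar{alpha}(t'), t' = 1..T
t_cont = -0.5 * np.log(alpha_bar)               # continuous OU time t = -log(abar)/2

def forward(x0, t_prime, rng=np.random.default_rng(0)):
    """Noisy sample x(t') = sqrt(abar) x0 + sqrt(1 - abar) xi."""
    a = alpha_bar[t_prime - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * rng.normal(size=np.shape(x0))

print(f"t' = 1    -> t = {t_cont[0]:.2e}")
print(f"t' = 1000 -> t = {t_cont[-1]:.2f}")     # deepest noise level reached by the schedule
```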
Training. All DMs are trained with Stochastic Gradient Descent (SGD) at fixed learning rate η=0.01\eta = 0.01, fixed momentum β=0.95\beta=0.95 and batch size B=min(n,512)B=\min(n, 512). We focus on SGD to facilitate the analysis of time scaling, avoiding the complications that adaptive optimization schemes like Adam [38] may introduce. We train each model for at least 2M SGD steps, and longer for large values of nn, for which memorization appears only later. We do not employ exponential moving average or learning-rate warm-up.
Generation. To accelerate sampling while preserving FID, we employ the DDIM sampler of [41], which replaces the Markovian reverse SDE with a deterministic, non-Markovian update. Given a trained denoiser ξθ(xt,t)\bm{\xi}_{\bm{\theta}}(\bm{\mathrm{x}}_t, t), we iterate for t=T,,1t=T', \ldots, 1
xt1=α(t1) xt1α(t)  ξθ(xt,t)α(t)+1α(t1)ξθ(xt,t),\bm{\mathrm{x}}_{t-1} = \sqrt{\overline{\alpha}(t-1)}\ \frac{\bm{\mathrm{x}}_t - \sqrt{1-\overline{\alpha}(t)}\; \bm{\xi}_{\bm{\theta}}(\bm{\mathrm{x}}_t, t)}{\sqrt{\overline{\alpha}(t)}} + \sqrt{1-\overline{\alpha}(t-1)} \bm{\xi}_{\bm{\theta}}(\bm{\mathrm{x}}_t, t),
with T=200T'=200. During training, we generate at 40 milestones a set of 10,000 samples to assess generalization and memorization. Examples of samples obtained from a model trained on n=1024n=1024 samples with base width W=32W=32 are shown in the middle and right blocks from Figure 6 for two training times, τ=190\tau=190 K and τ=1.62\tau=1.62 M. At τ=190\tau=190 K the model generalizes (fmem=0%f_\mathrm{mem}=0\%) and achieves a test FID of 35.1. After too much training, memorization sets in and, by τ=1.62\tau=1.62 M steps, nearly half the generated samples reproduce training images (fmem=47.2%f_\mathrm{mem}=47.2\%).
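The sketch below (ours) implements the deterministic DDIM update of the equation above, over T' = 200 timesteps uniformly sub-sampled from the T = 1000 training steps (a standard DDIM choice; the sub-sampling and the placeholder `denoiser` are our assumptions, not the exact implementation used here).

```python
import numpy as np

T, T_prime, d = 1000, 200, 32 * 32
betas = np.linspace(1e-4, 2e-2, T)
alpha_bar = np.concatenate([[1.0], np.cumprod(1.0 - betas)])   # abar[0] = 1 handles the final step

def denoiser(x, t):
    """Placeholder for the trained noise predictor xi_theta(x_t, t)."""
    return np.zeros_like(x)

steps = np.linspace(T, 1, T_prime).round().astype(int)         # T' timesteps, from T down to 1

rng = np.random.default_rng(0)
x = rng.normal(size=d)                                         # start from white noise
for i, t in enumerate(steps):
    t_prev = steps[i + 1] if i + 1 < T_prime else 0
    eps = denoiser(x, t)
    x0_hat = (x - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
    x = np.sqrt(alpha_bar[t_prev]) * x0_hat + np.sqrt(1.0 - alpha_bar[t_prev]) * eps
```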
Statistical evaluation. FIDs [42] are computed using the pytorch-fid Python package, with 10,000 generated samples and 10,000 test samples, averaged over 5 independent runs with disjoint test sets. Error bars in the MT denote twice the standard deviation. Training and test losses are estimated similarly over 5 repeated evaluations on the nn training samples and 2048 test samples, and give negligible confidence intervals. For the memorization fraction fmem(τ)f_\mathrm{mem}(\tau), we report the standard error on the mean obtained via bootstrap resampling of the 10,000 generated samples. We also verified that the scaling of the memorization time τmem\tau_\mathrm{mem} is insensitive to the choice of the threshold kk used to define fmemf_\mathrm{mem} in Equation 6 by testing larger and lower values.
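The bootstrap estimate can be reproduced with a few lines; the sketch below (ours, with toy indicators rather than real evaluations) resamples per-sample memorization flags over the 10,000 generated images.

```python
import numpy as np

def bootstrap_fmem(memorized, n_boot=1000, seed=0):
    """Bootstrap standard error and 95% CI of the memorization fraction from a
    boolean array of per-sample memorization flags."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(memorized), size=(n_boot, len(memorized)))
    fracs = memorized[idx].mean(axis=1)
    return fracs.std(), np.percentile(fracs, [2.5, 97.5])

memorized = np.random.default_rng(1).random(10_000) < 0.05     # toy flags, ~5% memorized
se, ci = bootstrap_fmem(memorized)
print(f"f_mem = {memorized.mean():.3f}, bootstrap SE = {se:.4f}, 95% CI = {ci}")
```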
Computing resources. Most trainings were performed on Nvidia H100 GPUs (80GB of memory). A typical run of 2M steps takes approximately 50 hours on two GPUs and varies with the model size (defined through its base width WW). In total, we train 18 distinct models for the several n,Wn, W configurations of the MT. The longest training (n=32768n=32768 and W=32W=32 in Figure 2) ran for 11M steps. The generation of 10,000 samples over 40 training times takes around an additional hour per model on the same hardware.

Figure 6: Training and generation on CelebA. The left-most block shows random training images. Middle and right blocks show generated samples in the left column (after τ=190\tau=190 K and τ=1.62\tau=1.62 M SGD updates respectively), alongside each sample's nearest neighbor in the training set in the right column. All generated images come from model trained on n=1024n=1024 with base width W=32W=32.


Figure 7: Impact of batch size and optimizer on the scaling of τmem\tau_\mathrm{mem}. FID (solid lines, left axis) and memorization fraction fmemf_\mathrm{mem} (in %, dashed lines, right axis) against training time τ\tau for various nn. Inset: normalized memorization fraction fmem(τ)/fmem(τmax)f_\mathrm{mem}(\tau)/f_\mathrm{mem}(\tau_\mathrm{max}) with the rescaled time τ/n\tau/n. (Left) Memorization scaling for B=nB=n. (Right) Generalization--Memorization transition with Adam optimizer for W=64W=64.


A.2 Batch-size effect: repetition vs. memorization

All the experiments in the MT use a fixed batch size $B=512$, and in Section 2 we emphasize that the observed $\mathcal{O}(n)$ scaling of $\tau_\mathrm{mem}$ cannot be explained by repetition over training samples. To validate this statement, the left panel of Figure 7 shows FID and memorization fraction curves when we train the models with full-batch updates ($B=n$) for $n\in\left[128, 512\right]$. At any fixed $\tau$, every sample has been seen exactly $\tau$ times. Yet $\tau_\mathrm{mem}$ continues to grow linearly with $n$, as shown in the inset. This demonstrates that larger datasets reshape the loss landscape -- requiring proportionally more updates to overfit -- rather than memorization simply building up through repeated exposure to the training samples.

A.3 What about Adam?

We conclude this section by repeating our analysis at fixed $W=64$ using the Adam optimizer [38] instead of SGD with momentum. The learning rate is $\eta = 1\times10^{-4}$, the gradient-average coefficients are $(\beta_1, \beta_2)=(0.9, 0.999)$, and the batch size is $B=\min(512, n)$. We keep all other settings and evaluation metrics as above. As shown in the right panel of Figure 7, Adam yields the same two-phase training dynamics: first a generalization regime with $f_\mathrm{mem}=0$ and good performance (small FID), and later a memorization phase at $\tau_\mathrm{mem}\propto n$, as shown in the inset. The only difference is that both $\tau_\mathrm{gen}$ and $\tau_\mathrm{mem}$ occur after far fewer steps than with SGD. This further indicates that the emergence of the two well-separated timescales and their scaling is a fundamental property of the loss landscape.
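For reference, the optimizer switch amounts to a one-line change in the training script. The snippet below is a schematic PyTorch sketch with the hyper-parameters listed above; `model` stands for the U-Net denoiser and is a placeholder.

```python
import torch

def make_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # Adam configuration used in this appendix; the training loop is unchanged,
    # only the optimizer object differs from the SGD-with-momentum runs of the MT.
    return torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))

# Example with a stand-in module:
optimizer = make_optimizer(torch.nn.Linear(8, 8))
```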

B. Generalization--memorization transition in the Gaussian Mixture Model

The aim of this section is to show that our results hold for data distributions other than natural images and for score models other than U-Net architectures.

B.1 Settings

Data distribution. We focus on data sampled i.i.d. from a $d$-dimensional Gaussian Mixture Model (GMM) made of two balanced Gaussians centered at $\pm \bm{\mu}$ with unit covariance, i.e.,
\mathbb{P}_0 = \frac{1}{2} \mathcal{N}(\bm{\mu}, \bm{I}_d) + \frac{1}{2}\mathcal{N}(- \bm{\mu}, \bm{I}_d).
In what follows, we work with $\bm{\mu} = \bm{1}_d$, where $\bm{1}_d = \left[1, \ldots, 1\right]^{\mathsf{T}} \in \mathbb{R}^d$. In this controlled setup, the generalization score can be computed analytically from $\mathbb{P}_0$ and reads
\bm{\mathrm{s}}_\mathrm{gen}(\bm{\mathrm{x}}_t, t) = \bm{\mu} e^{-t} \tanh\left(\bm{\mathrm{x}}_t \cdot \bm{\mu} e^{-t}\right) - \bm{\mathrm{x}}_t.
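A direct implementation of this analytical score is straightforward. The sketch below assumes $\bm{\mu}=\bm{1}_d$ and the conventions of this section; the function name is illustrative only.

```python
import numpy as np

def score_gmm(x_t: np.ndarray, t: float) -> np.ndarray:
    """Exact score s_gen(x_t, t) = mu e^{-t} tanh(x_t . mu e^{-t}) - x_t
    for the symmetric two-Gaussian mixture with mu = 1_d and unit covariance."""
    x_t = np.atleast_2d(x_t)                  # shape (N, d)
    d = x_t.shape[1]
    m = np.exp(-t) * np.ones(d)               # component mean at diffusion time t
    return np.tanh(x_t @ m)[:, None] * m - x_t

# Example: score of a batch of noisy points at t = 0.5 for d = 8.
x = np.random.default_rng(0).normal(size=(4, 8))
s = score_gmm(x, t=0.5)                       # shape (4, 8)
```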
Score model. The denoiser $\bm{\xi}_{\bm{\theta}}(\bm{\mathrm{x}}_t, t)$ is implemented as a lightweight residual multi-layer neural network (see Figure 8): an input layer projecting $\mathbb{R}^d \to \mathbb{R}^W$, followed by three identical residual blocks and an output layer projecting back to $\mathbb{R}^d$. Each block consists of two fully connected layers of width $W$, a skip connection, and a layer normalization. We encode the diffusion time $t$ via a sinusoidal positional embedding and add it to the first feature of each block. The total number of parameters in the network is $p(d, W) = W(2d + 13) + d + 6W^2$. For $d=8$ and $W=128$, the reference setting of this section, this yields $p=102,024$ trainable parameters.

Figure 8: Basic ResNet architecture of the GMM numerical experiments. Residual network with three residual blocks, each made of two fully-connected layers followed by a layer normalization. The width of the network is $W$, and the input and output sizes are $d$.

Training and computing resources. Unless otherwise specified, we train every model of this section with SGD at fixed learning rate $\eta = 6\times 10^{-3}$ and momentum $\beta=0.95$, using full-batch updates $B=n$ for $n\in\{128, 256, 512, 1024, 2048, 4096\}$ and running for up to 4M updates. All experiments are executed on an Nvidia RTX 2080 Ti, with the largest setting $n=4096$ requiring around 10 hours to complete.
Generalization and memorization metrics. In addition to the memorization fraction $f_\mathrm{mem}(\tau)$, we exploit this controlled setting, where we know the true data distribution $\mathbb{P}_0$, to directly measure how closely it matches the generated distribution $\mathbb{P}_{\bm{\theta}}$ via the Kullback-Leibler (KL) divergence
D_\mathrm{KL}(\mathbb{P}_{\bm{\theta}} \| \mathbb{P}_0) = \int \mathrm{d} \bm{\mathrm{x}}\, \mathbb{P}_{\bm{\theta}}(\bm{\mathrm{x}}) \log \mathbb{P}_{\bm{\theta}}(\bm{\mathrm{x}}) - \int \mathrm{d} \bm{\mathrm{x}}\, \mathbb{P}_{\bm{\theta}}(\bm{\mathrm{x}}) \log \mathbb{P}_{0}(\bm{\mathrm{x}}).
The cross-entropy term EPθ[logP0]\mathbb{E}_{\mathbb{P}_{\bm{\theta}}}\left[\log\mathbb{P}_0\right] is easy to estimate using Monte Carlo,
dxPθ(x)logP0(x)1Nμ=1NlogP0(x~μ),\int \mathrm{d} \bm{\mathrm{x}} \mathbb{P}_{\bm{\theta}}(\bm{\mathrm{x}}) \log \mathbb{P}_{0}(\bm{\mathrm{x}}) \approx \frac{1}{N} \sum_{\mu=1}^N \log \mathbb{P}_0(\tilde{\bm{\mathrm{x}}}_\mu),
where $\{\tilde{\bm{\mathrm{x}}}_\mu\}_{\mu=1}^N$ are $N=10,000$ samples drawn from the model with parameters $\bm{\theta}(\tau)$ at training time $\tau$. Estimating the negative entropy term $\mathbb{E}_{\mathbb{P}_{\bm{\theta}}}\left[\log\mathbb{P}_{\bm{\theta}}\right]$ is more challenging, since DMs only give access to the score function $\bm{\mathrm{s}}_{\bm{\theta}}(\bm{\mathrm{x}}, t) = \nabla_{\bm{\mathrm{x}}} \log \mathbb{P}_{\bm{\theta}}(\bm{\mathrm{x}})$ and not to the underlying probability distribution $\mathbb{P}_{\bm{\theta}}$. We can however employ time integration to express it as a function of the score only,
EPθ[logPθ]0TdtI(t)d2log(2πe),\mathbb{E}_{\mathbb{P}_{\bm{\theta}}}\left[\log\mathbb{P}_{\bm{\theta}}\right] \approx \int_{0}^T \mathrm{d} t I(t) - \frac{d}{2} \log \left(2\pi e\right),
with
I(t) = \frac{\beta_t}{2N}\sum_{\mu=1}^N \left[\tilde{\bm{\mathrm{x}}}_\mu \cdot \bm{\mathrm{s}}_{\bm{\theta}}(\tilde{\bm{\mathrm{x}}}_\mu, t) + \lVert \bm{\mathrm{s}}_{\bm{\theta}}(\tilde{\bm{\mathrm{x}}}_\mu, t)\rVert^2 \right].
This expression assumes that the model has learned an accurate representation of the score function. Note that samples are generated using a standard Euler-Maruyama discretization of the backward process (Equation 2 of the MT) over $T=1000$ timesteps.
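The two estimates above can be combined into a single KL estimator. The sketch below is schematic: it assumes a callable `score_model(x, t)` returning $\bm{\mathrm{s}}_{\bm{\theta}}$, generated samples `x_gen`, a noise-schedule function `beta(t)` and a time grid; these names are illustrative and not the exact code used to produce the figures.

```python
import numpy as np
from scipy.special import logsumexp

def log_p0(x, d=8):
    """Exact log-density of the two-component GMM P_0 with mu = 1_d, unit covariance."""
    mu = np.ones(d)
    log_norm = -0.5 * d * np.log(2 * np.pi)
    lp_plus = log_norm - 0.5 * np.sum((x - mu) ** 2, axis=1)
    lp_minus = log_norm - 0.5 * np.sum((x + mu) ** 2, axis=1)
    return logsumexp(np.stack([lp_plus, lp_minus]), axis=0) - np.log(2.0)

def kl_estimate(x_gen, score_model, beta, t_grid):
    """Monte-Carlo KL(P_theta || P_0): time-integrated negative entropy term
    minus the cross-entropy term estimated on generated samples."""
    d = x_gen.shape[1]
    # Negative entropy term: integrate I(t) over the time grid (trapezoidal rule).
    I_vals = []
    for t in t_grid:
        s = score_model(x_gen, t)
        I_vals.append(beta(t) / 2.0 * np.mean(np.sum(x_gen * s + s ** 2, axis=1)))
    neg_entropy = np.trapz(I_vals, t_grid) - 0.5 * d * np.log(2 * np.pi * np.e)
    # Cross-entropy term: Monte-Carlo average of log P_0 over generated samples.
    cross_entropy = np.mean(log_p0(x_gen, d))
    return neg_entropy - cross_entropy
```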

B.2 Scaling of $\tau_\mathrm{mem}$ and $\tau_\mathrm{gen}$ with $n$ and $W$

In Figure 9, the left panel shows how the KL divergence and memorization fraction evolve with training time $\tau$ for different training set sizes $n$ at fixed width $W=128$, while the right panel fixes $n=2048$ and varies $W$. In both cases, we observe two distinct phases. First, the KL divergence decreases to near zero on a timescale $\tau_\mathrm{gen}$ independent of $n$, during which the model fully generalizes ($f_\mathrm{mem}=0$). Beyond $\tau_\mathrm{gen}$, both $D_\mathrm{KL}(\mathbb{P}_{\bm{\theta}}\|\mathbb{P}_0)$ and $f_\mathrm{mem}$ begin to rise at a time $\tau_\mathrm{mem}$ that scales linearly with $n$, as highlighted by the inset of the left panel. In contrast, $\tau_\mathrm{mem}$ scales as $W^{-1}$, as shown in the inset of the right panel. While the precise dependence of $\tau_\mathrm{gen}$ on $W$ remains inconclusive in this setting and requires a more careful analysis, these results on the GMM mirror the main findings of the MT: the training dynamics of DMs unfolds first in a generalization phase, and only later -- at $\tau_\mathrm{mem} \propto n/W$ -- does memorization begin.

Figure 9: Generalization--Memorization transition as a function of the training set size $n$ and width $W$ for ResNet score models on the GMM ($d=8$). KL divergence (solid lines, left axis) and memorization fraction $f_\mathrm{mem}$ (in %, dashed lines, right axis) against training time $\tau$ for (Left) $n \in \{256, 512, 1024, 2048, 4096\}$ at fixed $W=128$ and (Right) $W \in \{64, 128, 256\}$ at fixed $n=2048$. Insets: $D_\mathrm{KL}(\mathbb{P}_{\bm{\theta}} \| \mathbb{P}_0)$ and $f_\mathrm{mem}$ against the rescaled time $\tau/n$ (left) and $\tau W$ (right).


B.3 Discussion on conditional diffusion models

Conditional generation aims to sample from distributions of the form $p(\bm{\mathrm{x}} | \bm{\mathrm{y}})$, where $\bm{\mathrm{y}}$ denotes a conditioning variable such as a class label, a text embedding, or any other contextual information. DMs can naturally be extended to this setting, for instance using classifier-free guidance [56]. Although conditioning often improves sample quality in practice, memorization effects have also been observed in models trained conditionally [25, 59, 60]. Our analysis does not rely on the model being unconditional, since conditioning variables typically enter the model as additional inputs, and we expect our results to hold in this setting as well. To investigate this, we train a classifier-free guidance model to generate samples from our Gaussian mixture conditionally on the class label, and compute the memorized fraction as a function of $\tau$, which we report in Figure 10. In the inset, when rescaling the training time by $n$, the curves for $n \in \{256, 512, 1024\}$ all collapse, confirming that the phenomenon persists in the conditional setting. For more complex datasets, $\tau_\mathrm{mem}$ and $\tau_\mathrm{gen}$ may in fact depend on the conditioning variable, and intermediate regimes could exist where certain classes have already entered the generalization (or memorization) phase while others have not yet.

Figure 10: Effect of guidance on $\tau_\mathrm{mem}$. Evolution of $f_\mathrm{mem}$ as a function of $\tau$ for $n \in \{256, 512, 1024\}$ at fixed $W=64$.


C. Proofs of the analytical results

In the following, we provide the mathematical arguments and the proofs for the statements in the MT. The section using the replica method is not mathematically rigorous, but relies on a well-established method of theoretical physics that has already been shown to provide correct results in several cases. The final result is rigorous, since it can alternatively be obtained from the rigorous free random matrix approach of [17], as shown in Appendix C.4.

C.1 Notations

We recall here the notations used throughout Section 3 of the MT and Appendix C of the SM.
\begin{align} &d: \text{Data dimension}\\ &n:\text{Number of data points}\\ &p:\text{Dimension of the hidden layers of the RFNN}\\ &\bm{I}_d:\text{Identity matrix in dimension } d\\ &\sigma(x):\text{Activation function of the model}\\ &P_{\bm{\mathrm{x}}}:\text{Distribution of the data points}\\ &P_{t}:\text{Distribution of the noisy data points at diffusion time } t\\ &\psi_n=\frac{n}{d}\\ &\psi_p=\frac{p}{d}\\ &\Delta_t=1-e^{-2t}\\ & \boldsymbol{\Sigma}=\mathbb{E}_{\bm{\mathrm{x}}\sim P_{\bm{\mathrm{x}}}}[\bm{\mathrm{x}} \bm{\mathrm{x}}^T]\\ & \boldsymbol{\Sigma}_t=e^{-2t} \boldsymbol{\Sigma}+\Delta_t \bm{I}_d\\ &\Gamma_t^2=\frac{\operatorname{Tr}(\boldsymbol{\Sigma}_t)}{d}\\ &\sigma_{\bm{\mathrm{x}}}^2=\frac{\operatorname{Tr}(\boldsymbol{\Sigma})}{d} \end{align}
σ2=Ez[σ(Γtz)2]bt2=(Eu,v[vσ(etσxu+Δtv)])2at=Eu,v[σ(etσxu+Δtv)uetσx]vt2=Eu,v,w[σ(etσxu+Δtv)σ(etσxu+Δtw)]at2e2tσx2st2=σ2at2e2tσx2vt2bt2μ1(t)=Eu[σ(Γtu)u]=e2tσx2at2+bt2.\begin{align} &\lVert \sigma\rVert^2=\mathbb{E}_z[\sigma(\Gamma_t z)^2]\\ &b_t^2=\left(\mathbb{E}_{u, v}[v\sigma(e^{-t}\sigma_{\bm{\mathrm{x}}}u+\sqrt{\Delta_t}v)] \right)^2\\ &a_t=\mathbb{E}_{u, v}[\sigma(e^{-t}\sigma_{\bm{\mathrm{x}}}u+\sqrt{\Delta_t}v)\frac{u}{e^{-t}\sigma_{\bm{\mathrm{x}}}}]\\ &v_t^2=\mathbb{E}_{u, v, w}[\sigma(e^{-t}\sigma_{\bm{\mathrm{x}}}u+\sqrt{\Delta_t}v)\sigma(e^{-t}\sigma_{\bm{\mathrm{x}}}u+\sqrt{\Delta_t}w)]-a_t^2e^{-2t}\sigma_{\bm{\mathrm{x}}}^2\\ &s_t^2=\lVert \sigma \rVert^2-a_t^2e^{-2t}\sigma_{\bm{\mathrm{x}}}^2-v_t^2-b_t^2\\ &\mu_1(t)=\mathbb{E}_{u}[\sigma(\Gamma_tu)u]=\sqrt{e^{-2t}\sigma_{\bm{\mathrm{x}}}^2a_t^2+b_t^2}. \end{align}
Unless specified, all the expectation values are taken for standard Gaussian variables. We will denote
X=[x1...xn]Rd×n\begin{align} \bm{\mathrm{X}}=[\bm{\mathrm{x}}^1\lvert...\rvert \bm{\mathrm{x}}^n]\in\mathbb{R}^{d\times n} \end{align}
the matrix whose columns are the data point vectors and likewise we decompose W\bm{\mathrm{W}} as
W=[(W1) ⁣(Wp) ⁣]Rp×d,\begin{align} \bm{\mathrm{W}} = \begin{bmatrix} (\bm{\mathrm{W}}_1)^{\!\top} \\ \vdots \\ (\bm{\mathrm{W}}_p)^{\!\top} \end{bmatrix} \in \mathbb{R}^{p\times d}, \end{align}
where $\bm{\mathrm{W}}_i \in \mathbb{R}^{d}$ denotes the $i$-th row of $\bm{\mathrm{W}}$.
We recall the definitions of the matrices U\bm{\mathrm{U}} and V\bm{\mathrm{V}}
U=1nν=1nEξ[σ(Wxtν(ξ)d)σ(Wxtν(ξ)d)T],V=1nν=1nEξ[σ(Wxtν(ξ)d)ξT].\begin{align} & \bm{\mathrm{U}}=\frac{1}{n}\sum_{\nu=1}^n\mathbb{E}_{\bm{\xi}}\left[\sigma\left(\frac{\bm{\mathrm{W}} \bm{\mathrm{x}}_t^\nu(\bm{\xi})}{\sqrt{d}}\right)\sigma\left(\frac{\bm{\mathrm{W}} \bm{\mathrm{x}}_t^\nu(\bm{\xi})}{\sqrt{d}}\right)^T\right], \\ & \bm{\mathrm{V}}=\frac{1}{n}\sum_{\nu=1}^n\mathbb{E}_{\bm{\xi}}\left[\sigma\left(\frac{\bm{\mathrm{W}} \bm{\mathrm{x}}_t^\nu(\bm{\xi})}{\sqrt{d}}\right) \bm{\xi}^T\right]. \end{align}

C.2 Closed form of the learning dynamics

Proposition 3

Let A(τ)\bm{\mathrm{A}}(\tau) be the solution of the gradient flow (Equation 10) defined in the MT with initial conditions A(τ=0)=A0\bm{\mathrm{A}}(\tau=0)= \bm{\mathrm{A}}_0, then
A(τ)p=1ΔtVTU1+(1ΔtVTU1+A0p)e2ΔtψpUτ\begin{align} \frac{\bm{\mathrm{A}}(\tau)}{\sqrt{p}}=-\frac{1}{\sqrt{\Delta_t}} \bm{\mathrm{V}}^T \bm{\mathrm{U}}^{-1}+(\frac{1}{\sqrt{\Delta_t}} \bm{\mathrm{V}}^T \bm{\mathrm{U}}^{-1}+\frac{\bm{\mathrm{A}}_0}{\sqrt{p}})e^{-\frac{2\Delta_t}{\psi_p} \bm{\mathrm{U}} \tau} \end{align}
with
V=1nν=1nEξ[σ(Wxtν(ξ)d)ξT].\begin{align} \bm{\mathrm{V}}=\frac{1}{n}\sum_{\nu=1}^n\mathbb{E}_{\bm{\xi}}[\sigma(\frac{\bm{\mathrm{W}} \bm{\mathrm{x}}_t^\nu(\bm{\xi})}{\sqrt{d}}) \bm{\xi}^T]. \end{align}
Proof: We expand the square in the training loss
Ltrain(A)=1+ΔtdTr(ATpApU)+2ΔtdTr(ApV)\begin{align} \mathcal{L}_\mathrm{train}(\bm{\mathrm{A}})=1+\frac{\Delta_t}{d} \operatorname{Tr}(\frac{\bm{\mathrm{A}}^T}{\sqrt{p}}\frac{\bm{\mathrm{A}}}{\sqrt{p}} \bm{\mathrm{U}})+\frac{2\sqrt{\Delta_t}}{d} \operatorname{Tr}(\frac{\bm{\mathrm{A}}}{\sqrt{p}} \bm{\mathrm{V}}) \end{align}
and compute the gradient
ALtrain(A(τ))=2ΔtdApU+2ΔtdpVT.\begin{align} \nabla_{\bm{\mathrm{A}}}\mathcal{L}_\mathrm{train}(\bm{\mathrm{A}}(\tau))=\frac{2\Delta_t}{d}\frac{\bm{\mathrm{A}}}{p} \bm{\mathrm{U}}+\frac{2\sqrt{\Delta_t}}{d\sqrt{p}} \bm{\mathrm{V}}^T. \end{align}
Solving this ordinary differential equation yields the desired result. Consequently, the timescales of the dynamics of $\bm{\mathrm{A}}(\tau)$ are determined by the inverses of the eigenvalues of $\Delta_t \bm{\mathrm{U}} / \psi_p$.
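Proposition 3 is easy to check numerically: for random (small) positive-definite $\bm{\mathrm{U}}$, random $\bm{\mathrm{V}}$ and $\bm{\mathrm{A}}_0$, the closed form can be compared against a direct Euler integration of the linear flow $\frac{\mathrm{d}}{\mathrm{d}\tau}\frac{\bm{\mathrm{A}}}{\sqrt{p}} = -\frac{2\Delta_t}{\psi_p}\frac{\bm{\mathrm{A}}}{\sqrt{p}}\bm{\mathrm{U}} - \frac{2\sqrt{\Delta_t}}{\psi_p}\bm{\mathrm{V}}^T$ that the closed form solves (up to the overall time normalization of Equation 10). The snippet below is a sketch under these assumptions, with arbitrary toy values for the constants.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
d, p, psi_p, Delta_t = 4, 6, 1.5, 0.3   # toy sizes and constants (illustrative values)

# Random symmetric positive-definite U, random V and initial condition B0 = A0 / sqrt(p).
M = rng.normal(size=(p, p)); U = M @ M.T / p + 0.1 * np.eye(p)
V = rng.normal(size=(p, d))
B0 = rng.normal(size=(d, p))

B_star = -(V.T @ np.linalg.inv(U)) / np.sqrt(Delta_t)      # fixed point of the flow

def B_closed(tau):
    # Closed form of Proposition 3, written for B = A / sqrt(p).
    return B_star + (B0 - B_star) @ expm(-2.0 * Delta_t * U * tau / psi_p)

# Euler integration of dB/dtau = -(2 Delta_t/psi_p) B U - (2 sqrt(Delta_t)/psi_p) V^T.
tau_max, n_steps = 5.0, 20000
dt = tau_max / n_steps
B = B0.copy()
for _ in range(n_steps):
    B += dt * (-(2 * Delta_t / psi_p) * B @ U - (2 * np.sqrt(Delta_t) / psi_p) * V.T)

print(np.max(np.abs(B - B_closed(tau_max))))   # small discretization error
```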

C.3 Gaussian Equivalence Principle

As explained in [47, 48, 49], the Gaussian Equivalence Principle, which applies in the high-dimensional setting considered here, establishes an equivalence between the spectral properties of the original model and those of a Gaussian covariate model in which the nonlinear activation function is replaced by a linear term plus a nonlinear term acting as noise,
σ(Wxd)κ1Wxd+κη,xN(0,Ex[xxT]),ηN(0,Ip),(11)\begin{align} \sigma\left(\frac{ \bm{\mathrm{W}} \bm{\mathrm{x}}}{\sqrt d}\right) \to \kappa_1 \frac{\bm{\mathrm{W}} \bm{\mathrm{x}}'}{\sqrt d} + \kappa_* \bm{\eta}, \quad \bm{\mathrm{x}}'\sim\mathcal{N}(0, \mathbb{E}_{\bm{\mathrm{x}}}[\bm{\mathrm{x}} \bm{\mathrm{x}}^T]), \quad \bm{\eta}\sim \mathcal{N}(0, \bm{I}_p), \end{align}\tag{11}
where $\kappa_1, \kappa_*$ are constants that depend on the distribution of the data and on the activation function $\sigma$; we recall their expressions:
κ1=Ez[σ(σxz)zσx],κ=Ez[σ(σxz)2]κ12σx2.\begin{align} &\kappa_1=\mathbb{E}_z[\sigma(\sigma_{\bm{\mathrm{x}}}z)\frac{z}{\sigma_{\bm{\mathrm{x}}}}], \\ &\kappa_*=\sqrt{\mathbb{E}_z[\sigma(\sigma_{\bm{\mathrm{x}}}z)^2]-\kappa_1^2\sigma_{\bm{\mathrm{x}}}^2}. \end{align}
The expectations are taken with respect to $z\sim \mathcal{N}(0, 1)$, with $\sigma^2_{\bm{\mathrm{x}}}= \operatorname{Tr}(\boldsymbol{\Sigma})/d$; a numerical sketch for estimating $\kappa_1$ and $\kappa_*$ is given after the conditions below. The Gaussian Equivalence Principle (GEP) holds if the distribution $P_{\bm{\mathrm{x}}}$ of the vector $\bm{\mathrm{x}}$ satisfies the following conditions:
  • (i) Px(x)P_{\bm{\mathrm{x}}}(\bm{\mathrm{x}}) has sub-Gaussian tails: there exists a constant C>0C>0 such that for all A0A \ge 0 and each entry xi\bm{\mathrm{x}}_i,
P(xiA)2eA2/C.\begin{align} \mathbb{P}(\lvert \bm{\mathrm{x}}_i \rvert \ge A) \le 2 \, e^{-A^2 / C}. \end{align}
  • (ii) The data covariance matrix Σ=ExPx[xxT]\boldsymbol{\Sigma}=\mathbb{E}_{\bm{\mathrm{x}}\sim P_{\bm{\mathrm{x}}}}[\bm{\mathrm{x}} \bm{\mathrm{x}}^T] is bounded: there exists a constant K>0K>0 independent of dd such that λmax(Σ)<K\lambda_{\textrm{max}}(\boldsymbol{\Sigma})<K and TrΣd<K\frac{\operatorname{Tr} \boldsymbol{\Sigma}}{d}<K where λmax(Σ)\lambda_{\textrm{max}}(\boldsymbol{\Sigma}) denotes the spectral norm of Σ\boldsymbol{\Sigma}.
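As a minimal illustration, the constants $\kappa_1$ and $\kappa_*$ can be estimated with a one-dimensional Monte-Carlo average. The sketch below uses $\sigma=\tanh$ and a user-supplied value of $\sigma_{\bm{\mathrm{x}}}$; it is an illustrative helper, not part of the analytical derivation.

```python
import numpy as np

def kappas(activation, sigma_x, n_mc=1_000_000, seed=0):
    """Monte-Carlo estimate of kappa_1 and kappa_* for a given activation."""
    z = np.random.default_rng(seed).normal(size=n_mc)
    sz = activation(sigma_x * z)
    kappa1 = np.mean(sz * z / sigma_x)        # E[sigma(sigma_x z) z / sigma_x]
    kappa_star = np.sqrt(np.mean(sz ** 2) - kappa1 ** 2 * sigma_x ** 2)
    return kappa1, kappa_star

k1, ks = kappas(np.tanh, sigma_x=1.0)
```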
In this section, we outline the derivation of the Gaussian Equivalence Principle (GEP) for the matrices $\bm{\mathrm{U}}, \tilde{\bm{\mathrm{U}}}, \bm{\mathrm{V}}$ and $\tilde{\bm{\mathrm{V}}}$ under an arbitrary input covariance. This generalizes the approach developed in [17], which considered only the case of data drawn from $\mathcal{N}(0, \bm{I}_d)$. A more rigorous approach, following [52], is left for future work. We will make use of the Mehler kernel formula [61], which states that for $f$ and $g$ test functions on $\mathbb{R}$,
\begin{align} \mathbb{E}_{u, v\sim P^\gamma}[f(u)g(v)]=\sum_{s=0}^\infty \frac{\gamma^s}{s!}\mathbb{E}_{u\sim\mathcal{N}(0,1)}[He_s(u)f(u)]\,\mathbb{E}_{v\sim\mathcal{N}(0,1)}[He_s(v)g(v)],
\end{align}
where the expectation on the left-hand side is taken over jointly Gaussian random variables $u$ and $v$ with zero mean, unit variance, and correlation $\gamma$, while on the right-hand side the expectations are taken over independent standard Gaussian variables. $He_s$ denotes the $s$-th (probabilists') Hermite polynomial. We recall some useful properties of the Hermite polynomials [62] (a numerical check of the truncated expansion used below is sketched after this list):
  • They form an orthogonal basis of $L^2(\mathbb{R}, \frac{e^{-x^2/2}}{\sqrt{2\pi}} \mathrm{d} x)$.
  • The first Hermite polynomials are $He_0(x)=1$ and $He_1(x)=x$.
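The truncation used repeatedly below (keeping only the $s=0, 1$ terms) can be checked numerically. The sketch below compares both sides of the Mehler expansion for correlated Gaussians, using probabilists' Hermite polynomials from numpy and a smooth test function ($f=g=\tanh$, chosen purely for illustration).

```python
import numpy as np
from numpy.polynomial import hermite_e as He
from math import factorial

def mehler_check(gamma=0.1, s_max=6, n_mc=2_000_000, seed=0):
    rng = np.random.default_rng(seed)
    f = np.tanh
    # Left-hand side: E[f(u) f(v)] for jointly Gaussian (u, v) with correlation gamma.
    u = rng.normal(size=n_mc)
    v = gamma * u + np.sqrt(1 - gamma ** 2) * rng.normal(size=n_mc)
    lhs = np.mean(f(u) * f(v))
    # Right-hand side: truncated Hermite expansion with an independent Gaussian.
    w = rng.normal(size=n_mc)
    rhs = 0.0
    for s in range(s_max + 1):
        coeffs = np.zeros(s + 1); coeffs[s] = 1.0        # coefficients selecting He_s
        c_s = np.mean(He.hermeval(w, coeffs) * f(w))     # E[He_s(w) f(w)]
        rhs += gamma ** s / factorial(s) * c_s ** 2
    return lhs, rhs

print(mehler_check())   # the two values agree up to Monte-Carlo error
```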

Lemma 4: Gaussian Equivalence Principle for $\bm{\mathrm{U}}$

In the limit n,p,dn, p, d\rightarrow\infty with ψp=p/d,ψn=n/d\psi_p=p/d, \psi_n=n/d and with a dataset {xν}ν=1n\{\bm{\mathrm{x}}^\nu\}_{\nu=1}^n sampled from a distribution PxP_{\bm{\mathrm{x}}} which verifies assumptions (i) and (ii) with Σ=EPx[xxT]\boldsymbol{\Sigma}=\mathbb{E}_{P_{\bm{\mathrm{x}}}}[\bm{\mathrm{x}} \bm{\mathrm{x}}^T], the matrix
U=1nν=1nEξ[σ(Wxtν(ξ)d)σ(Wxtν(ξ)d)T]\begin{align} \bm{\mathrm{U}}=\frac{1}{n}\sum_{\nu=1}^n\mathbb{E}_{\bm{\xi}}[\sigma(\frac{\bm{\mathrm{W}} \bm{\mathrm{x}}_t^\nu(\bm{\xi})}{\sqrt{d}})\sigma(\frac{\bm{\mathrm{W}} \bm{\mathrm{x}}_t^\nu(\bm{\xi})}{\sqrt{d}})^T] \end{align}
has the same spectrum as its Gaussian equivalent
U=GnGTn+bt2WWTd+st2Ip\begin{align} \bm{\mathrm{U}}=\frac{\bm{\mathrm{G}}}{\sqrt{n}}\frac{\bm{\mathrm{G}}^T}{\sqrt{n}}+b_t^2\frac{\bm{\mathrm{W}} \bm{\mathrm{W}}^T}{d}+s_t^2 \bm{I} _p \end{align}
where
G=etatWdX+vtΩ,\begin{align} \bm{\mathrm{G}}=e^{-t} a_t\frac{\bm{\mathrm{W}}}{\sqrt{d}} \bm{\mathrm{X}}'+v_t \bm{\Omega}, \end{align}
$\bm{\mathrm{X}}'\in\mathbb{R}^{d\times n}$ is a matrix whose columns $\bm{\mathrm{x}}'^\nu$ are sampled according to $\mathcal{N}(0, \boldsymbol{\Sigma})$, and $\bm{\Omega}\in \mathbb{R}^{p\times n}$ has i.i.d. standard Gaussian entries independent of $\bm{\mathrm{X}}'$ and $\bm{\mathrm{W}}$.
Proof: For the sake of clarity, in this proof we make the covariance of the data explicit by writing the data points as $\bm{\mathrm{x}}^\nu= \boldsymbol{\Sigma}^{1/2} \bm{\mathrm{z}}^\nu$, where the vectors $\bm{\mathrm{z}}^\nu$ have unit variance. Let us focus on the element of $\bm{\mathrm{U}}$ in position $(i, j)$,
Uij=1nν=1nEξ[σ(Wik(et(Σ1/2)klzlν+Δtξk)d)σ(Wjk(et(Σ1/2)klzlν+Δtξk)d)],\begin{align} \bm{\mathrm{U}}_{ij}=\frac{1}{n}\sum_{\nu=1}^n\mathbb{E}_{\bm{\xi}}[\sigma\left(\frac{\bm{\mathrm{W}}_{ik} (e^{-t}(\Sigma^{1/2})_{kl} \bm{\mathrm{z}}^\nu_l+\sqrt{\Delta_t} \bm{\xi}_k)}{\sqrt{d}}\right)\sigma\left(\frac{\bm{\mathrm{W}}_{jk'} (e^{-t}(\Sigma^{1/2})_{k'l'} \bm{\mathrm{z}}^\nu_{l'}+\sqrt{\Delta_t} \bm{\xi}_{k'})}{\sqrt{d}}\right)], \end{align}
where repeated indices mean that there is a hidden sum. We introduce the random variable χiν=Wik(et(Σ1/2)klzlν+Δtξk)d\chi_i^\nu=\frac{\bm{\mathrm{W}}_{ik} (e^{-t}(\Sigma^{1/2})_{kl} \bm{\mathrm{z}}^\nu_l+\sqrt{\Delta_t} \bm{\xi}_k)}{\sqrt{d}}. In the high dimensional limit it converges to a Gaussian random variable by the Central Limit Theorem (since the tails of the data distribution are sub-Gaussian). If i=ji=j, the diagonal terms concentrate with respect to the data points and we can thus replace the sum by an average
Uii=Eχ[σ(χ)2]+O(1/n).\begin{align} \bm{\mathrm{U}}_{ii}=\mathbb{E}_{\chi}[\sigma(\chi)^2]+\mathcal{O}(1/n). \end{align}
The finite-$n$ corrections can be discarded because they do not affect the spectrum of $\bm{\mathrm{U}}$ in the high-dimensional limit. The variable $\chi$ can be taken Gaussian with mean 0 and variance $\mathbb{E}_{\bm{\mathrm{W}}_i, \bm{\mathrm{z}}, \xi}[\chi^2]=\mathbb{E}_{\bm{\mathrm{W}}_i, \bm{\mathrm{z}}, \xi}\left[\frac{\bm{\mathrm{W}}_i^T \boldsymbol{\Sigma}_t \bm{\mathrm{W}}_i}{d}\right]=\frac{\operatorname{Tr}(\boldsymbol{\Sigma}_t)}{d}=\Gamma_t^2$, hence
Uii=Eχ[σ(χ)2]+O(1/n)=EuN(0,1)[σ(Γtu)2]=σ2.\begin{align} \bm{\mathrm{U}}_{ii}=\mathbb{E}_{\chi}[\sigma(\chi)^2]+\mathcal{O}(1/n)=\mathbb{E}_{u\sim\mathcal{N}(0, 1)}[\sigma(\Gamma_tu)^2]=\lvert \lvert \sigma\rvert\rvert^2. \end{align}
If iji\neq j, we denote ηiν=etWiTΣ1/2zd\eta_i^\nu=e^{-t}\frac{\bm{\mathrm{W}}_i^T \Sigma^{1/2} \bm{\mathrm{z}}}{\sqrt{d}}. For now we consider W\bm{\mathrm{W}} and the zν\bm{\mathrm{z}}^\nu fixed and look at ξ\xi. We use the Mehler Kernel formula for the random variables u=WiTξ/du= \bm{\mathrm{W}}_i^T \bm{\xi}/\sqrt{d} and v=WjTξ/dv= \bm{\mathrm{W}}_j^T \bm{\xi}/\sqrt{d} that have correlation Eξ[uv]=WiTWjd\mathbb{E}_{\bm{\xi}}[uv]=\frac{\bm{\mathrm{W}}_i^T \bm{\mathrm{W}}_j}{d}
Eu,v[σ(ηiν+Δtu)σ(ηjν+Δtv)]=s=1(WiTWj/d)ss!Eu[Hes(u)σ(ηiν+Δtu)]Eu[Hes(v)σ(ηjν+Δtv)].\begin{align} &\mathbb{E}_{u, v}[\sigma\left(\eta_i^\nu+\sqrt{\Delta_t}u\right)\sigma\left(\eta_j^\nu+\sqrt{\Delta_t}v\right)]\\ &=\sum_{s=1}^\infty\frac{(\bm{\mathrm{W}}_i^T \bm{\mathrm{W}}_j/d)^s}{s!}\mathbb{E}_u[He_s(u)\sigma\left(\eta_i^\nu+\sqrt{\Delta_t}u\right)]\mathbb{E}_u[He_s(v)\sigma\left(\eta_j^\nu+\sqrt{\Delta_t}v\right)]. \end{align}
We truncate at order s=1s=1 since the corrections are order O(1/d)\mathcal{O}(1/d).
Uij=1nν=1nEu[σ(ηiν+Δtu)]Ev[σ(ηjν+Δtv)]+1nν=1nWiTWjdEu[uσ(ηiν+Δtu)]Ev[vσ(ηjν+Δtv)]=1nν=1nEu[σ(ηiν+Δtu)]Ev[σ(ηjν+Δtv)]+WiTWjdEη[Eu[uσ(ηi+Δtu)]Ev[vσ(ηj+Δtv)]].\begin{align} \bm{\mathrm{U}}_{ij}&=\frac{1}{n}\sum_{\nu=1}^n \mathbb{E}_u[\sigma\left(\eta_i^\nu+\sqrt{\Delta_t}u\right)]\mathbb{E}_v[\sigma\left(\eta_j^\nu+\sqrt{\Delta_t}v\right)]\\ &+\frac{1}{n}\sum_{\nu=1}^n\frac{\bm{\mathrm{W}}_i^T \bm{\mathrm{W}}_j}{d}\mathbb{E}_u[u\sigma\left(\eta_i^\nu+\sqrt{\Delta_t}u\right)]\mathbb{E}_v[v\sigma\left(\eta_j^\nu+\sqrt{\Delta_t}v\right)]\\ &=\frac{1}{n}\sum_{\nu=1}^n \mathbb{E}_u[\sigma\left(\eta_i^\nu+\sqrt{\Delta_t}u\right)]\mathbb{E}_v[\sigma\left(\eta_j^\nu+\sqrt{\Delta_t}v\right)]\\ &+\frac{\bm{\mathrm{W}}_i^T \bm{\mathrm{W}}_j}{d}\mathbb{E}_{\bm{\eta}}[\mathbb{E}_u[u\sigma\left(\bm{\eta}_i+\sqrt{\Delta_t}u\right)]\mathbb{E}_v[v\sigma\left(\bm{\eta}_j+\sqrt{\Delta_t}v\right)]]. \end{align}
where we neglected $\mathcal{O}(1/d)$ corrections, and where the law of $\bm{\eta}$ can be considered Gaussian with zero mean and correlation $\mathbb{E}[\eta_i^\nu \eta_j^\nu]=\frac{e^{-2t} \operatorname{Tr}(\boldsymbol{\Sigma})}{d}\delta_{ij}=e^{-2t}\sigma_{\bm{\mathrm{x}}}^2\delta_{ij}$. The coefficient in front of $\frac{\bm{\mathrm{W}}_i^T \bm{\mathrm{W}}_j}{d}$ is therefore
bt2=(Eu,v[vσ(etσxu+Δtv)])2.\begin{align} b_t^2=(\mathbb{E}_{u, v}[v\sigma(e^{-t}\sigma_{\bm{\mathrm{x}}}u+\sqrt{\Delta_t}v)])^2. \end{align}
Denote σ0(η)=Eu[σ(η+Δtu)]\sigma_0(\eta)=\mathbb{E}_u[\sigma(\eta+\sqrt{\Delta_t}u)]. We now focus on
1nνσ0(ηiν)σ0(ηjν).\begin{align} \frac{1}{n}\sum_\nu \sigma_0(\eta_i^\nu)\sigma_0(\eta_j^\nu). \end{align}
We use the GEP on σ0\sigma_0
σ0(etWiTxνd)atetWiTxνd+ vtΩiν,xνN(0,Σ),ΩiνN(0,Ip),\begin{align} \sigma_0\left(\frac{e^{-t} \bm{\mathrm{W}}_i^T \bm{\mathrm{x}}^\nu}{\sqrt d}\right) \to a_t e^{-t}\frac{\bm{\mathrm{W}}_i^T \bm{\mathrm{x}}'^\nu}{\sqrt d} + \ v_t \bm{\Omega}_i^\nu, \quad \bm{\mathrm{x}}'^\nu\sim\mathcal{N}(0, \boldsymbol{\Sigma}), \quad \bm{\Omega}_i^\nu\sim \mathcal{N}(0, \bm{I}_{p}), \end{align}
with at=Eu[σ0(etσxu)uetσx]=Eu,v[σ(etσxu+Δtv)uetσx]a_t=\mathbb{E}_u[\sigma_0(e^{-t}\sigma_{\bm{\mathrm{x}}}u)\frac{u}{e^{-t}\sigma_{\bm{\mathrm{x}}}}]=\mathbb{E}_{u, v}[\sigma(e^{-t}\sigma_{\bm{\mathrm{x}}} u+\sqrt{\Delta_t }v)\frac{u}{e^{-t}\sigma_{\bm{\mathrm{x}}}}] and vt2=Eu[σ0(etσxu)2]at2e2tσx2=Eu,v,w[σ(etσxu+Δtv)σ(etσxu+Δtw)]at2e2tσx2v_t^2=\mathbb{E}_u[\sigma_0(e^{-t}\sigma_{\bm{\mathrm{x}}}u)^2]-a_t^2e^{-2t}\sigma_{\bm{\mathrm{x}}}^2=\mathbb{E}_{u, v, w}[\sigma(e^{-t}\sigma_{\bm{\mathrm{x}}}u+\sqrt{\Delta_t}v)\sigma(e^{-t}\sigma_{\bm{\mathrm{x}}}u+\sqrt{\Delta_t}w)]-a_t^2e^{-2t}\sigma_{\bm{\mathrm{x}}}^2. Hence the truncated expansion yields for iji\neq j
Uij=1nν=1n(atetWiTxνd+vtΩiν)(atetWjTxνd+vtΩjν)T+bt2WiTWjd.\begin{align} \bm{\mathrm{U}}_{ij}=\frac{1}{n}\sum_{\nu=1}^n\left(a_te^{-t}\frac{\bm{\mathrm{W}}_i^T \bm{\mathrm{x}}'^\nu}{\sqrt{d}}+v_t \bm{\Omega}_i^\nu\right)\left(a_te^{-t}\frac{\bm{\mathrm{W}}_j^T \bm{\mathrm{x}}'^\nu}{\sqrt{d}}+v_t \bm{\Omega}_j^\nu \right)^T+b_t^2\frac{\bm{\mathrm{W}}_i^T \bm{\mathrm{W}}_j}{d}. \end{align}
Finally, we deal with the diagonal terms: extending the off-diagonal expression to $i=j$ would give $a_t^2e^{-2t}\sigma_{\bm{\mathrm{x}}}^2+v_t^2+b_t^2$ instead of the correct value $\lVert\sigma\rVert^2$ computed above. We therefore need to subtract
(at2e2tσx2+vt2+bt2)Ip.\begin{align} \left(a_t^2e^{-2t}\sigma_{\bm{\mathrm{x}}}^2+v_t^2+b_t^2\right) \bm{I}_p. \end{align}
and add $\lVert \sigma\rVert^2 \bm{I}_p$ in its place. The Gaussian equivalent of $\bm{\mathrm{U}}$ thus reads
U=GnGTn+bt2WWTd+st2Ip,\begin{align} \bm{\mathrm{U}}=\frac{\bm{\mathrm{G}}}{\sqrt{n}}\frac{\bm{\mathrm{G}}^T}{\sqrt{n}}+b_t^2\frac{\bm{\mathrm{W}} \bm{\mathrm{W}}^T}{d}+s_t^2 \bm{I}_p, \end{align}
with st2=σ2at2e2tσx2vt2bt2s_t^2=\lVert \sigma \rVert^2- a_t^2e^{-2t}\sigma_{\bm{\mathrm{x}}}^2-v_t^2-b_t^2.

Lemma (GEP for $\tilde{\bm{\mathrm{U}}}$).

Let
U~=Ey[σ(Wyd)σ(Wyd)T],\begin{align} \tilde{\bm{\mathrm{U}}}=\mathbb{E}_{\bm{\mathrm{y}}}[\sigma(\frac{\bm{\mathrm{W}} \bm{\mathrm{y}}}{\sqrt{d}})\sigma(\frac{\bm{\mathrm{W}} \bm{\mathrm{y}}}{\sqrt{d}})^T], \end{align}
where the expectation value is taken over $\bm{\mathrm{y}}\sim P_t$. Then the Gaussian equivalent of $\tilde{\bm{\mathrm{U}}}$ reads
\begin{align} \frac{\mu_1^2(t)}{\Gamma_t^2}\frac{\bm{\mathrm{W}} \boldsymbol{\Sigma}_t \bm{\mathrm{W}}^T}{d}+\left(\lVert \sigma \rVert^2-\mu_1^2(t)\right) \bm{I}_p, \end{align}
where μ12(t)\mu_1^2(t) and σ2\lVert \sigma \rVert^2 are defined in Appendix C.1.
Proof: For a vector y\bm{\mathrm{y}} sampled from PtP_t, the WiTyd\frac{\bm{\mathrm{W}}_i^T \bm{\mathrm{y}}}{\sqrt{d}} are asymptotically Gaussian with 0 mean, variance Ey[WiTydWiTyd]=WiTΣtWidΓt2\mathbb{E}_{\bm{\mathrm{y}}}[\frac{\bm{\mathrm{W}}_i^T \bm{\mathrm{y}}}{\sqrt{d}} \frac{\bm{\mathrm{W}}_i^T \bm{\mathrm{y}}}{\sqrt{d}}]=\frac{\bm{\mathrm{W}}_i^T \boldsymbol{\Sigma}_t \bm{\mathrm{W}}_i}{d}\sim\Gamma_t^2 and correlation Ey[WiTydWjTyd]=WiTΣtWjd\mathbb{E}_{\bm{\mathrm{y}}}[\frac{\bm{\mathrm{W}}_i^T \bm{\mathrm{y}}}{\sqrt{d}} \frac{\bm{\mathrm{W}}_j^T \bm{\mathrm{y}}}{\sqrt{d}}]=\frac{\bm{\mathrm{W}}_i^T \boldsymbol{\Sigma}_t \bm{\mathrm{W}}_j}{d}. We apply Mehler Kernel formula to U~\tilde{\bm{\mathrm{U}}}
U~ij=s1s!(Wik(Σt)klWjlΓt2d)sEu[σ(Γtu)Hes(u)]Ev[σ(Γtv)Hes(v)],\begin{align} \tilde{\bm{\mathrm{U}}}_{ij}=\sum_s\frac{1}{s!}\left(\frac{\bm{\mathrm{W}}_{ik}(\boldsymbol{\Sigma}_t)_{kl} \bm{\mathrm{W}}_{jl}}{\Gamma_t^2d}\right)^s \mathbb{E}_u[\sigma(\Gamma_tu)He_s(u)]\mathbb{E}_v[\sigma(\Gamma_tv)He_s(v)], \end{align}
where the expectation on uu and vv is standard Gaussian. We keep only terms at order O(1/d)\mathcal{O}(1/\sqrt{d}). If iji\neq j we keep the terms up to order s=1s=1.
U~ij=(Wik(Σt)klWjlΓt2d)Eu[σ(Γtu)u]2.\begin{align} \tilde{\bm{\mathrm{U}}}_{ij}=\left(\frac{\bm{\mathrm{W}}_{ik}(\boldsymbol{\Sigma}_t)_{kl} \bm{\mathrm{W}}_{jl}}{\Gamma_t^2d}\right)\mathbb{E}_u[\sigma(\Gamma_tu)u]^2. \end{align}
For $i=j$ we cannot truncate the expansion, because all terms are $\mathcal{O}_d(1)$. Hence the diagonal terms are asymptotically
\begin{align} \tilde{\bm{\mathrm{U}}}_{ii}=\mathbb{E}_{u\sim\mathcal{N}(0, 1)}[\sigma^2(\Gamma_t u)]=\lVert\sigma\rVert^2. \end{align}
Taking care of the diagonal terms, the Gaussian Equivalent matrix reads
\begin{align} \tilde{\bm{\mathrm{U}}}=\frac{\mu_1^2(t)}{\Gamma_t^2}\frac{\bm{\mathrm{W}} \boldsymbol{\Sigma}_t \bm{\mathrm{W}}^T}{d}+\left(\lVert \sigma \rVert^2-\mu_1^2(t)\right) \bm{I}_p \end{align}
where μ1(t)=Eu[σ(Γtu)u]\mu_1(t)=\mathbb{E}_{u}[\sigma(\Gamma_tu)u].
Building on the GEP of U~\tilde{\bm{\mathrm{U}}}, we prove the following lemma on the scaling of the eigenvalues in the bulk.

Lemma 5: Scaling of the bulk of $\tilde{\bm{\mathrm{U}}}$

We assume that Σ\boldsymbol{\Sigma} is positive definite and that the spectral norm λmax(Σ)\lambda_{\mathrm{max}}(\boldsymbol{\Sigma}) stays Od(1)\mathcal{O}_d(1). In the high dimensional limit p>d1p>d\gg 1, the spectrum of U~\tilde{\bm{\mathrm{U}}} is asymptotically equal to
(11ψp)δ(λ(σ2μ12(t)))+1ψpρbulk(λ),\begin{align} \left(1-\frac{1}{\psi_p}\right)\delta(\lambda-(\lVert \sigma \rVert^2-\mu_1^2(t)))+\frac{1}{\psi_p}\rho_{\mathrm{bulk}}(\lambda), \end{align}
where ρbulk(λ)\rho_{\mathrm{bulk}}(\lambda) is an atomless measure whose support is of order O(ψp)\mathcal{O}(\psi_p).
Proof: Since $p>d$, $\bm{\mathrm{W}}\in\mathbb{R}^{p\times d}$ and $\boldsymbol{\Sigma}_t\in \mathbb{R}^{d\times d}$, the matrix $\bm{\mathrm{W}} \boldsymbol{\Sigma}_t \bm{\mathrm{W}}^T/d$ has rank at most $d$, so the spectrum admits a Dirac mass at $\lambda=\lVert \sigma \rVert^2-\mu_1^2(t)$ with weight $(p-d)/p$. For the order of magnitude of the eigenvalues in the bulk, first observe that the nonzero eigenvalues of $\frac{\bm{\mathrm{W}} \boldsymbol{\Sigma}_t \bm{\mathrm{W}}^T}{d}$ coincide with those of $\frac{\boldsymbol{\Sigma}_t^{1/2} \bm{\mathrm{W}}^T \bm{\mathrm{W}} \boldsymbol{\Sigma}_t^{1/2}}{d}$. We can bound the spectral norm of the product by the product of the spectral norms,
\begin{align} \lambda_{\mathrm{max}}\left(\frac{\boldsymbol{\Sigma}_t^{1/2} \bm{\mathrm{W}}^T \bm{\mathrm{W}} \boldsymbol{\Sigma}_t^{1/2}}{d}\right)\le \lambda_{\mathrm{max}}\left(\frac{\bm{\mathrm{W}}^T \bm{\mathrm{W}}}{d}\right) \lambda_{\mathrm{max}}(\boldsymbol{\Sigma}_t)\lesssim\mathcal{O}(\psi_p), \end{align}
since we assumed that $\lambda_{\mathrm{max}}(\boldsymbol{\Sigma}_t)=e^{-2t}\lambda_{\mathrm{max}}(\boldsymbol{\Sigma})+\Delta_t=\mathcal{O}(1)$, and since $\lambda_{\mathrm{max}}(\frac{\bm{\mathrm{W}}^T \bm{\mathrm{W}}}{d})=\mathcal{O}(\psi_p)$ by the Marchenko-Pastur law [63]. To bound the bulk from below, we use the following inequality
\begin{align} \lambda_{\mathrm{min}}(\boldsymbol{\Sigma}_t)\,\lambda_{\mathrm{min}}\left(\frac{\bm{\mathrm{W}}^T \bm{\mathrm{W}}}{d}\right)\le \lambda_{\mathrm{min}}\left(\frac{\boldsymbol{\Sigma}_t^{1/2} \bm{\mathrm{W}}^T \bm{\mathrm{W}} \boldsymbol{\Sigma}_t^{1/2}}{d}\right). \end{align}
Since $\boldsymbol{\Sigma}_t$ is positive definite and $\lambda_{\mathrm{min}}(\frac{\bm{\mathrm{W}}^T \bm{\mathrm{W}}}{d})$ is itself of order $\psi_p$ (for $\psi_p>1$), the lower bound is also of order $\psi_p$. This shows that the support of the bulk is of order $\psi_p$.
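The $\mathcal{O}(\psi_p)$ scaling invoked here is easy to check numerically: for $\bm{\mathrm{W}}$ with i.i.d. standard Gaussian entries, the largest eigenvalue of $\bm{\mathrm{W}}^T\bm{\mathrm{W}}/d$ concentrates near the Marchenko-Pastur edge $(1+\sqrt{\psi_p})^2$, which is $\mathcal{O}(\psi_p)$ for large $\psi_p$. A minimal sketch:

```python
import numpy as np

d, psi_p = 400, 4.0
p = int(psi_p * d)
W = np.random.default_rng(0).normal(size=(p, d))
lam_max = np.linalg.eigvalsh(W.T @ W / d).max()
mp_edge = (1 + np.sqrt(psi_p)) ** 2          # Marchenko-Pastur upper edge
print(lam_max, mp_edge)                      # the two values are close
```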

Lemma (GEP for $\bm{\mathrm{V}}$ and $\tilde{\bm{\mathrm{V}}}$).

Let
V=1nν=1nEξ[σ(Wxtν(ξ)d)ξT],V~=Ex,ξ[σ(Wxt(ξ)d)ξT].\begin{align} & \bm{\mathrm{V}}=\frac{1}{n}\sum_{\nu=1}^n\mathbb{E}_{\bm{\xi}}[\sigma(\frac{\bm{\mathrm{W}} \bm{\mathrm{x}}_t^\nu(\bm{\xi})}{\sqrt{d}}) \bm{\xi}^T], \\ &\tilde{\bm{\mathrm{V}}}=\mathbb{E}_{\bm{\mathrm{x}}, \bm{\xi}}[\sigma(\frac{\bm{\mathrm{W}} \bm{\mathrm{x}}_t(\bm{\xi})}{\sqrt{d}}) \bm{\xi}^T]. \end{align}
In the train and test losses, they can be replaced by their common Gaussian equivalent:
V~=V=μ1(t)ΔtΓtWd.\begin{align} \tilde{\bm{\mathrm{V}}}= \bm{\mathrm{V}}=\frac{\mu_1(t)\sqrt{\Delta_t}}{\Gamma_t}\frac{\bm{\mathrm{W}}}{\sqrt{d}}. \end{align}
Proof: The two matrices only differ element-wise by quantities of order $\mathcal{O}(1/n)$ and therefore have the same Gaussian equivalent matrix. We focus on $\tilde{\bm{\mathrm{V}}}$. Introduce the random variable $\bm{\eta}_i=\frac{\bm{\mathrm{W}}_{ik}(e^{-t} \bm{\mathrm{x}}_k+\sqrt{\Delta_t} \bm{\xi}_k)}{\sqrt{d}}$. It has zero mean, variance $\mathbb{E}_{\bm{\mathrm{x}}, \bm{\xi}}[\bm{\eta}_i^2]=\frac{\bm{\mathrm{W}}_i^T \boldsymbol{\Sigma}_t \bm{\mathrm{W}}_i}{d}\sim\Gamma_t^2$, and correlation with $\bm{\xi}$ given by $\gamma_{ij}=\mathbb{E}_{\bm{\mathrm{x}}, \bm{\xi}}[\bm{\eta}_i \bm{\xi}_j]=\frac{\sqrt{\Delta_t} \bm{\mathrm{W}}_{ij}}{\sqrt{d}}$. We apply the Mehler kernel formula:
V~ij=Ex,ξ[σ(Γt(Wik(et(Σt)klzl+Δtξl)Γtd))ξj]=s1s!(WijΔtΓtd)sEu[σ(Γtu)Hes(u)]Ev[vHes(v)]=0+ΔtΓtWijdEu[σ(Γtu)u]Ev[v2]+O(1d)=Δtμ1(t)ΓtWijd.\begin{align} \tilde{\bm{\mathrm{V}}}_{ij}&=\mathbb{E}_{\bm{\mathrm{x}}, \bm{\xi}}[\sigma\left(\Gamma_t(\frac{\bm{\mathrm{W}}_{ik}(e^{-t}(\boldsymbol{\Sigma}_t)_{kl} \bm{\mathrm{z}}_l+\sqrt{\Delta_t}\xi_l)}{\Gamma_t\sqrt{d}})\right)\xi_{j}]\\ &=\sum_s\frac{1}{s!}\left(\frac{\bm{\mathrm{W}}_{ij}\sqrt{\Delta_t}}{\Gamma_t\sqrt{d}}\right)^s\mathbb{E}_u[\sigma(\Gamma_t u)He_s(u)]\mathbb{E}_v[vHe_s(v)]\\ &=0+\frac{\sqrt{\Delta_t}}{\Gamma_t}\frac{\bm{\mathrm{W}}_{ij}}{\sqrt{d}}\mathbb{E}_u[\sigma(\Gamma_t u)u]\mathbb{E}_v[v^2]+\mathcal{O}(\frac{1}{d})\\ &=\frac{\sqrt{\Delta_t}\mu_1(t)}{\Gamma_t}\frac{\bm{\mathrm{W}}_{ij}}{\sqrt{d}}. \end{align}

C.4 Proof of Theorem 1

We recall Theorem 1 of the MT.

Theorem 6

Let q(z)=1pTr(UzIp)1q(z)=\frac{1}{p} \operatorname{Tr}(\bm{\mathrm{U}}-z \bm{I}_p)^{-1}, r(z)=1pTr(Σ1/2WT(UzIp)1WΣ1/2)r(z)=\frac{1}{p} \operatorname{Tr}(\boldsymbol{\Sigma}^{1/2} \bm{\mathrm{W}}^T(\bm{\mathrm{U}}-z \bm{I}_p)^{-1} \bm{\mathrm{W}} \boldsymbol{\Sigma}^{1/2}) and s(z)=1pTr(WT(UzIp)1W)s(z)=\frac{1}{p} \operatorname{Tr}(\bm{\mathrm{W}}^T(\bm{\mathrm{U}}-z \bm{I}_p)^{-1} \bm{\mathrm{W}}), with zCz\in\mathbb{C}. Let
s^(q)=bt2ψp+1q,r^(r,q)=ψpat2e2t1+at2e2tψpψnr+ψpvt2ψnq.\begin{align} &\hat{s}(q)=b_t^2\psi_p+\frac{1}{q}, \\ &\hat{r}(r, q)=\frac{\psi_p a_t^2e^{-2t}}{1+\frac{a_t^2e^{-2t}\psi_p }{\psi_n }r+\frac{\psi_p v_t^2}{\psi_n }q}. \end{align}
Then q(z),r(z)q(z), r(z) and s(z)s(z) satisfy the following set of three equations:
s=dρΣ(λ)1s^(q)+λr^(r,q),r=dρΣ(λ)λs^(q)+λr^(r,q),ψp(st2z)+ψpvt21+at2e2tψpψnr+ψpvt2ψnq+1ψpqsq2=0,\begin{align} &s=\int \mathrm{d} \rho_{\boldsymbol{\Sigma}}(\lambda)\frac{1}{\hat{s}(q)+\lambda\hat{r}(r, q)}, \\ &r=\int \mathrm{d} \rho_{\boldsymbol{\Sigma}}(\lambda)\frac{\lambda}{\hat{s}(q)+\lambda\hat{r}(r, q)}, \\ &\psi_p(s_t^2-z)+\frac{\psi_pv_t^2}{1+\frac{a_t^2e^{-2t}\psi_p} {\psi_n} r+\frac{\psi_p v_t^2}{\psi_n} q}+\frac{1-\psi_p}{q}-\frac{s}{q^2}=0, \end{align}
The eigenvalue distribution $\rho(\lambda)$ of $\bm{\mathrm{U}}$ can then be obtained using the Sokhotski–Plemelj inversion formula $\rho(\lambda)=\lim_{\varepsilon \rightarrow 0^+}\frac{1}{\pi}\operatorname{Im} q(\lambda+i\varepsilon)$.
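As an illustration of this inversion step (independently of the specific self-consistent equations above), the sketch below recovers the spectral density of a random Wishart-type matrix from its empirical Stieltjes transform evaluated slightly off the real axis; the same recipe applies to the solution $q(z)$ of the equations of Theorem 1.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 1000, 2000
G = rng.normal(size=(p, n))
eigs = np.linalg.eigvalsh(G @ G.T / n)       # Wishart-type matrix as a stand-in for U

def stieltjes(z):
    # Empirical Stieltjes transform q(z) = (1/p) Tr (M - z I)^{-1}.
    return np.mean(1.0 / (eigs - z))

eps = 1e-3
lambdas = np.linspace(0.0, 3.5, 200)
# Sokhotski-Plemelj inversion: rho(lambda) ~ (1/pi) Im q(lambda + i eps).
rho = np.array([stieltjes(l + 1j * eps).imag / np.pi for l in lambdas])
# `rho` now approximates the Marchenko-Pastur density of the eigenvalues above.
```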
We first show that the equations for the Stieltjes transform of $\rho$ found in Ref. [17] using linear pencils [53] in the case $P_{\bm{\mathrm{x}}}=\mathcal{N}(0, \bm{I}_d)$, i.e. $\rho_{\boldsymbol{\Sigma}}(\lambda)=\delta(\lambda-1)$, can be reduced to the equations of Theorem 1 with our definitions of $\mu_1(t), s_t$ and $v_t$. The equations of [17] read
ζ1(st2z+e2tμ12ζ2ζ3+vt2ζ2+Δtμ12ζ4)1=0ζ2(ψn+vt2ψpζ1etμ1ζ3)ψn=0etμ1ψpζ1(1+etμ1ζ2ζ3)+(1+(Δtμ12ψpζ1)ζ3)=0e2tμ12ψpζ1ζ2ζ4+(1+Δtμ12ψpζ1)ζ41=0,\begin{align} &\zeta_1(s_t^2-z+e^{-2t}\mu_1^2\zeta_2\zeta_3+v_t^2\zeta_2+\Delta_t\mu_1^2\zeta_4)-1=0\\ &\zeta_2(\psi_n+v_t^2\psi_p\zeta_1-e^{-t}\mu_1\zeta_3)-\psi_n=0\\ &e^{-t}\mu_1\psi_p\zeta_1(1+e^{-t}\mu_1\zeta_2\zeta_3)+(1+(\Delta_t\mu_1^2\psi_p\zeta_1)\zeta_3)=0\\ &e^{-2t}\mu_1^2\psi_p\zeta_1\zeta_2\zeta_4+(1+\Delta_t\mu_1^2\psi_p\zeta_1)\zeta_4-1=0, \end{align}
with $\zeta_1=q$ and $\zeta_{2, 3, 4}$ auxiliary variables. We make the change of variables $r=-\frac{\zeta_3}{e^{-t}\mu_1\psi_p}$. The second equation then relates $\zeta_2$ to $q$ and $r$:
ζ2=11+e2tμ12ψpψnr+vt2ψpψnq.\begin{align} \zeta_2=\frac{1}{1+\frac{e^{-2t}\mu_1^2\psi_p}{\psi_n}r+\frac{v_t^2\psi_p}{\psi_n}q}. \end{align}
Injecting this into the second equation gives the second equation of Theorem 1. The fourth equation gives
ζ4=11+μ12ψpq(Δt+e2tζ2).\begin{align} \zeta_4=\frac{1}{1+\mu_1^2\psi_pq(\Delta_t+e^{-2t}\zeta_2)}. \end{align}
Injecting this into the first equation gives
q(st2z+e2tμ12ζ2r(etμ1ψp)+vt2ζ2+Δtμ1211+μ12ψpq(Δt+e2tζ2))1=0.\begin{align} q(s_t^2-z+e^{-2t}\mu_1^2\zeta_2r(-e^{-t}\mu_1\psi_p)+v_t^2\zeta_2+\Delta_t\mu_1^2\frac{1}{1+\mu_1^2\psi_pq(\Delta_t+e^{-2t}\zeta_2)})-1=0. \end{align}
After some algebra, we recover the first equation of Theorem 1.
We now prove Theorem 1 using a replica computation, inspired by the calculation done in Ref. [44].
Proof: Our goal is to compute the Stieltjes transform of the matrix U\bm{\mathrm{U}}.
q=limp1pEW,X,Ω[Tr(UzIp)1]=zlimp1pEW,X,Ω[logdet(UzIp)]=2zlimp1pEW,X,Ω[logdet(UzIp)1/2].\begin{align} q&=\underset{p\rightarrow\infty}{\lim}\frac{1}{p}\mathbb{E}_{\bm{\mathrm{W}}, \bm{\mathrm{X}}, \bm{\Omega}}[\operatorname{Tr}(\bm{\mathrm{U}}-z \bm{I}_p)^{-1}]\\ &=-\partial_z \underset{p\rightarrow\infty}{\lim}\frac{1}{p}\mathbb{E}_{\bm{\mathrm{W}}, \bm{\mathrm{X}}, \bm{\Omega}}[\log\det (\bm{\mathrm{U}}-z \bm{I}_p)]\\ &=2\partial_z\underset{p\rightarrow\infty}{\lim}\frac{1}{p}\mathbb{E}_{\bm{\mathrm{W}}, \bm{\mathrm{X}}, \bm{\Omega}}[\log\det (\bm{\mathrm{U}}-z \bm{I}_p)^{-1/2}]. \end{align}
The so-called replica trick consists of replacing $\log x$ by $\lim_{s\rightarrow 0}\frac{x^s-1}{s}$. Applying this identity, we obtain
q=2zlims0limp1psEW,X,Ω[det(UzIp)s/21],\begin{align} q=2\partial_z\underset{s\rightarrow0}{\lim}\underset{p\rightarrow\infty}{\lim}\frac{1}{ps}\mathbb{E}_{\bm{\mathrm{W}}, \bm{\mathrm{X}}, \bm{\Omega}}[\det (\bm{\mathrm{U}}-z \bm{I}_p)^{-s/2}-1], \end{align}
where as usual with replica computations we have inverted the order of the limits pp\rightarrow \infty and s0s\rightarrow 0. We define the partition function Z\mathcal{Z} as
Z=det(UzIp)1/2=dϕ(2π)p/2e12ϕT(UzIp)ϕ.\begin{align} \mathcal{Z}= \det(\bm{\mathrm{U}}-z \bm{I}_p)^{-1/2}=\int \frac{\mathrm{d}\phi}{(2\pi)^{p/2}}e^{-\frac{1}{2}{\phi}^T(\bm{\mathrm{U}}-z \bm{I}_p)\phi}. \end{align}
We replace U\bm{\mathrm{U}} by its Gaussian equivalent proved in Lemma 4 and write the partition function for an arbitrary integer ss
EW,X,Ω[Zs]=a=1sdϕa(2π)p/2EW,X,Ω[e12ϕaT(UzIp)ϕa]=a=1sdϕa(2π)p/2e12ϕaT(zst2)ϕa   EW,X,Ω[e12nϕaT(atetWXνd+vtΩν)(atetWXνd+vtΩν)Tϕaebt22dϕaTWWTϕa].\begin{align} \mathbb{E}_{\bm{\mathrm{W}}, \bm{\mathrm{X}}, \bm{\Omega}}[\mathcal{Z}^s]&=\int\prod_{a=1}^s \frac{\mathrm{d}\phi^a}{(2\pi)^{p/2}}\mathbb{E}_{\bm{\mathrm{W}}, \bm{\mathrm{X}}, \bm{\Omega}}[e^{-\frac{1}{2}{\phi^a}^T(\bm{\mathrm{U}}-z \bm{I}_p)\phi^a}]\\ &=\int\prod_{a=1}^s \frac{\mathrm{d}\phi^a}{(2\pi)^{p/2}}e^{\frac{1}{2}{\phi^a}^T(z-s_t^2)\phi^a}\nonumber \\ &\ \ \ \mathbb{E}_{\bm{\mathrm{W}}, \bm{\mathrm{X}}, \bm{\Omega}}[e^{-\frac{1}{2n}{\phi^a}^T\left(a_te^{-t}\frac{\bm{\mathrm{W}} \bm{\mathrm{X}}^\nu}{\sqrt{d}}+v_t \bm{\Omega}^\nu\right)\left(a_te^{-t}\frac{\bm{\mathrm{W}} \bm{\mathrm{X}}^\nu}{\sqrt{d}}+v_t \bm{\Omega}^\nu\right)^T\phi^a}e^{-\frac{b_t^2}{2d}{\phi^a}^T \bm{\mathrm{W}} \bm{\mathrm{W}}^T\phi^a}]. \end{align}
We first perform the computation for integer values of ss, and then analytically continue the result to the limit s0s \to 0. To compute the expectation over X\bm{\mathrm{X}}, W\bm{\mathrm{W}}, and Ω\bm{\Omega}, we need the following standard result from Gaussian integration
dxe12xGxT+JxT=e12logdetG+12JG1JT,\begin{align} \int \mathrm{d} \bm{\mathrm{x}} e^{-\frac{1}{2} \bm{\mathrm{x}} \bm{\mathrm{G}} \bm{\mathrm{x}}^T+ \bm{\mathrm{J}} \bm{\mathrm{x}}^T}=e^{-\frac{1}{2}\log \det \bm{\mathrm{G}}+\frac{1}{2} \bm{\mathrm{J}} \bm{\mathrm{G}}^{-1} \bm{\mathrm{J}}^T}, \end{align}
where G\bm{\mathrm{G}} is a square matrix and J\bm{\mathrm{J}} a vector. Averaging over the data set. The dataset dependence enters through
EX[e12nϕaT(atetWXνd+vtΩν)(atetWXνd+vtΩν)Tϕa]=EX[eat2e2t2ndϕaTWXνXνTWTϕaeatetvt2dnϕaT(WXνΩT+ΩXνTWT)ϕa]evt22nϕaTΩΩTϕa.\begin{align} &\mathbb{E}_{\bm{\mathrm{X}}}[e^{-\frac{1}{2n}{\phi^a}^T\left(a_te^{-t}\frac{\bm{\mathrm{W}} \bm{\mathrm{X}}^\nu}{\sqrt{d}}+v_t \bm{\Omega}^\nu\right)\left(a_te^{-t}\frac{\bm{\mathrm{W}} \bm{\mathrm{X}}^\nu}{\sqrt{d}}+v_t \bm{\Omega}^\nu\right)^T\phi^a}]\nonumber \\ &=\mathbb{E}_{\bm{\mathrm{X}}}[e^{-\frac{a_t^2e^{-2t}}{2nd}{\phi^a}^T \bm{\mathrm{W}} \bm{\mathrm{X}}^\nu {\bm{\mathrm{X}}^\nu}^T \bm{\mathrm{W}}^T\phi^a}e^{-\frac{a_te^{-t}v_t}{2\sqrt{d}n}{\phi^a}^T(\bm{\mathrm{W}} \bm{\mathrm{X}}^\nu \bm{\Omega}^T+ \bm{\Omega} {\bm{\mathrm{X}}^\nu}^T \bm{\mathrm{W}}^T)\phi^a}] e^{-\frac{v_t^2}{2n}{\phi^a}^T \bm{\Omega} \bm{\Omega}^T{\phi^a}}. \end{align}
We introduce, for each replica $\phi^a$, a Fourier representation of the delta function using the auxiliary variables $\omega^a, \hat{\omega}^a \in \mathbb{R}^{d}$ (throughout the computation, we discard non-exponential prefactors, as they give subleading contributions):
dω^aeiω^a(pωaϕaTWΣ1/2)=1.\begin{align} \int \mathrm{d}\hat{\omega}^a e^{i\hat{\omega}^a(\sqrt{p}\omega^a- {\phi^a}^T \bm{\mathrm{W}} \Sigma^{1/2})}=1. \end{align}
In the following, we do the change of variable Xν=Σ1/2Zν\bm{\mathrm{X}}^\nu= \Sigma^{1/2} \bm{\mathrm{Z}}^\nu with Zν\bm{\mathrm{Z}}^\nu a dd dimensional Gaussian random variable with unit variance.
EX[e12nϕaT(atetWXνd+vtΩν)(atetWXνd+vtΩν)Tϕa]=EZ[eat2e2tp2ndωaTZνZνTωaeatetvtpdna,νΩνϕaωaZν]evt22nϕaTΩΩTϕa.\begin{align} &\mathbb{E}_{\bm{\mathrm{X}}}[e^{-\frac{1}{2n}{\phi^a}^T\left(a_te^{-t}\frac{\bm{\mathrm{W}} \bm{\mathrm{X}}^\nu}{\sqrt{d}}+v_t \bm{\Omega}^\nu\right)\left(a_te^{-t}\frac{\bm{\mathrm{W}} \bm{\mathrm{X}}^\nu}{\sqrt{d}}+v_t \bm{\Omega}^\nu\right)^T\phi^a}]\nonumber \\ &=\mathbb{E}_{\bm{\mathrm{Z}}}[e^{-\frac{a_t^2e^{-2t}p}{2nd}{\omega^a}^T \bm{\mathrm{Z}}^\nu {\bm{\mathrm{Z}}^\nu}^T\omega^a}e^{-\frac{a_t e^{-t}v_t\sqrt{p}}{\sqrt{d}n}\sum_{a, \nu} \bm{\Omega}^\nu \phi^a \omega^a\cdot \bm{\mathrm{Z}}^\nu}] e^{-\frac{v_t^2}{2n}{\phi^a}^T \bm{\Omega} \bm{\Omega}^T{\phi^a}}. \end{align}
Denote $\bm{\mathrm{G}}_{\bm{\mathrm{Z}}}=\frac{a_t^2e^{-2t}p}{dn}\sum_a\omega^a{\omega^a}^T$ and $(\bm{\mathrm{J}}_{\bm{\mathrm{Z}}})^\nu=\frac{a_te^{-t}v_t\sqrt{p}}{\sqrt{d}n}\sum_{a}(\bm{\Omega}^\nu\cdot \phi^a)\, \omega^a$, then
EX[e12nϕaT(atetWXνd+vtΩν)(atetWXνd+vtΩν)Tϕa]=en2logdet(1+GZ)eatetvtp2dn2ν(Ωνϕa)(Ωνϕb)ωah(1+GZ)k,l1ωblevt22nϕaTΩΩTϕa,\begin{align} &\mathbb{E}_{\bm{\mathrm{X}}}[e^{-\frac{1}{2n}{\phi^a}^T\left(a_te^{-t}\frac{\bm{\mathrm{W}} \bm{\mathrm{X}}^\nu}{\sqrt{d}}+v_t \bm{\Omega}^\nu\right)\left(a_te^{-t}\frac{\bm{\mathrm{W}} \bm{\mathrm{X}}^\nu}{\sqrt{d}}+v_t \bm{\Omega}^\nu\right)^T\phi^a}]=\nonumber \\ &e^{-\frac{n}{2}\log \det(1+ \bm{\mathrm{G}}_{\bm{\mathrm{Z}}})}e^{\frac{a_te^{-t}v_tp}{2dn^2} \sum_\nu (\bm{\Omega}^\nu\cdot \phi^a)(\bm{\Omega}^\nu\cdot \phi^b) {\omega^a}_h(1+ \bm{\mathrm{G}}_{\bm{\mathrm{Z}}})^{-1}_{k, l} {\omega^b}_l}e^{-\frac{v_t^2}{2n}{\phi^a}^T \bm{\Omega} \bm{\Omega}^T{\phi^a}}, \end{align}
where repeated indices mean that there is an implicit summation. Averaging over Ω\bm{\Omega}. The terms that depend on Ω\bm{\Omega} are
EΩ[eatetvt2dn2ν(Ωνϕa)(Ωνϕb)ωak(1+GX)k,l1ωblevt22nϕaTΩΩTϕa]=(EΩν[eatetvtp2dn2(Ωνϕa)(Ωνϕb)ωak(1+GX)k,l1ωblevt22nϕaTΩνΩνTϕa])n=en2logdet(1+GΩ),\begin{align} &\mathbb{E}_{\bm{\Omega}}[e^{\frac{a_te^{-t}v_t}{2dn^2} \sum_\nu (\bm{\Omega}^\nu\cdot \phi^a)(\bm{\Omega}^\nu\cdot \phi^b) {\omega^a}_k(1+ \bm{\mathrm{G}}_{\bm{\mathrm{X}}})^{-1}_{k, l} {\omega^b}_l}e^{-\frac{v_t^2}{2n}{\phi^a}^T \bm{\Omega} \bm{\Omega}^T{\phi^a}}]\nonumber \\ &=(\mathbb{E}_{\bm{\Omega}^\nu}[e^{\frac{a_te^{-t}v_tp}{2dn^2} (\bm{\Omega}^\nu\cdot \phi^a)(\bm{\Omega}^\nu\cdot \phi^b) {\omega^a}_k(1+ \bm{\mathrm{G}}_{\bm{\mathrm{X}}})^{-1}_{k, l} {\omega^b}_l}e^{-\frac{v_t^2}{2n}{\phi^a}^T \bm{\Omega}^\nu {\bm{\Omega}^\nu}^T{\phi^a}}])^n\\ &=e^{-\frac{n}{2}\log \det (1+ \bm{\mathrm{G}}_{\bm{\Omega}})}, \end{align}
with
(GΩ)k,l=ϕa(vt2nδabatetvtpdn2ωak(1+GZ)k,l1ωbl)ϕbT.\begin{align} (\bm{\mathrm{G}}_{\bm{\Omega}})_{k, l}=\phi^a(\frac{v_t^2}{n}\delta_{ab} -\frac{a_te^{-t}v_tp}{dn^2} {\omega^a}_k(1+ \bm{\mathrm{G}}_{\bm{\mathrm{Z}}})^{-1}_{k, l}{\omega^b}_{l}){\phi^b}^T. \end{align}
We are left with
EW,X,Ω[Zs]=a=1sdϕa(2π)p/2dωadω^ae12(zst2)ϕaϕaTebt2p2dωaTΣ1ωaEW[eiω^a(pωaϕaTWΣ1/2) en2logdet(Id+GZ)en2logdet(Id+GΩ)].\begin{align} &\mathbb{E}_{\bm{\mathrm{W}}, \bm{\mathrm{X}}, \bm{\Omega}}[\mathcal{Z}^s]=\int\prod_{a=1}^s \frac{\mathrm{d}\phi^a}{(2\pi)^{p/2}} \mathrm{d}\omega^a d\hat{\omega}^a e^{\frac{1}{2}(z-s_t^2)\phi^a {\phi^a}^T}e^{-\frac{b_t^2p}{2d}{\omega^a}^T \boldsymbol{\Sigma}^{-1}\omega^a}\nonumber \\ &\mathbb{E}_{\bm{\mathrm{W}}}[e^{i\hat{\omega}^a(\sqrt{p}\omega^a- {\phi^a}^T \bm{\mathrm{W}} \Sigma^{1/2})}\ e^{-\frac{n}{2}\log \det(\bm{I}_d+ \bm{\mathrm{G}}_{Z})}e^{-\frac{n}{2}\log \det (\bm{I}_d+ \bm{\mathrm{G}}_{\bm{\Omega}})}]. \end{align}
Averaging over the random features W\bm{\mathrm{W}}. W\bm{\mathrm{W}} only appears through eiω^aWTϕaΣ1/2e^{-i\hat{\omega}^a \bm{\mathrm{W}}^T{\phi^a} \Sigma^{1/2}}.
EW[eiaω^a(pωaWTϕaΣt1/2)]=eipaω^aωa(EW[eiω^kaϕaiWli(Σt1/2)kl])=eipω^aωae12ω^ka(Σ)klω^lbϕiaϕib=eipaω^aωae12a,bω^aΣω^bϕaϕb.\begin{align} \mathbb{E}_{\bm{\mathrm{W}}}[e^{i\sum_a\hat{\omega}^a(\sqrt{p}\omega^a- \bm{\mathrm{W}}^T{\phi^a} \Sigma^{1/2}_t)}]&=e^{i\sqrt{p}\sum_a\hat{\omega}^a\cdot\omega^a}(\mathbb{E}_{\bm{\mathrm{W}}}[e^{-i \hat{\omega}_k^a{\phi^a}_i \bm{\mathrm{W}}_{li}(\Sigma^{1/2}_t)_{kl}}])\\ &=e^{i\sqrt{p}\hat{\omega}^a\cdot\omega^a}e^{-\frac{1}{2}\hat{\omega}^a_k(\boldsymbol{\Sigma})_{kl}\hat{\omega}^b_{l}\phi^a_i\phi^b_i}\\ &=e^{i\sqrt{p}\sum_a\hat{\omega}^a\cdot\omega^a} e^{-\frac{1}{2}\sum_{a, b} \hat{\omega}^a \boldsymbol{\Sigma} \hat{\omega}^b \phi^a\cdot \phi^b}. \end{align}
We end up with
EW,X,Ω[Zs]=a=1sdϕadωadω^ae12(zst2)ϕaϕaTebt2p2dωaTΣ1ωaeipaω^aωae12a,bω^aΣω^bϕaϕben2logdet(Id+GZ)en2logdet(Id+GΩ).\begin{align} \mathbb{E}_{\bm{\mathrm{W}}, \bm{\mathrm{X}}, \bm{\Omega}}[\mathcal{Z}^s]&=\int\prod_{a=1}^s \mathrm{d}\phi^a \mathrm{d}\omega^a d\hat{\omega}^a e^{\frac{1}{2}(z-s_t^2)\phi^a {\phi^a}^T}e^{-\frac{b_t^2p}{2d}{\omega^a}^T \boldsymbol{\Sigma}^{-1}\omega^a}e^{i\sqrt{p}\sum_a\hat{\omega}^a\cdot\omega^a}\nonumber \\ & e^{-\frac{1}{2}\sum_{a, b} \hat{\omega}^a \boldsymbol{\Sigma} \hat{\omega}^b \phi^a\cdot \phi^b} e^{-\frac{n}{2}\log \det(\bm{I}_d+ \bm{\mathrm{G}}_{Z})}e^{-\frac{n}{2}\log \det (\bm{I}_d+ \bm{\mathrm{G}}_{\bm{\Omega}})}. \end{align}
Averaging over the ω^a\hat{\omega}^a. We can integrate with respect to ω^\hat{\omega}. The only terms that appear with it are
adω^aeipaω^aωae12a,bω^aΣω^bϕaϕb.\begin{align} \int \prod_a \mathrm{d}\hat{\omega}^a e^{i\sqrt{p} \sum_a\hat{\omega}^a\cdot \omega^a} e^{-\frac{1}{2}\sum_{a, b} \hat{\omega}^a \boldsymbol{\Sigma} \hat{\omega}^b \phi^a\cdot \phi^b}. \end{align}
Denote Jia=ipωia\bm{\mathrm{J}}_i^a=i\sqrt{p}\omega_i^a and Gklab=Σkl ϕaϕb\bm{\mathrm{G}}_{kl}^{ab}= \boldsymbol{\Sigma}_{kl} \ \phi^a\cdot\phi^b, then the integral is of the form
adω^aei,aJiaω^iae12i,j,a,bω^iaGijabω^jb=e12logdet(G)+12JTG1J.\begin{align} \int \prod_a \mathrm{d}\hat{\omega}^a e^{\sum_{i, a} \bm{\mathrm{J}}_i^a \hat{\omega}_i^a} e^{-\frac{1}{2}\sum_{i, j, a, b} \hat{\omega}_i^a \bm{\mathrm{G}}_{ij}^{ab}\hat{\omega}_j^b} =e^{-\frac{1}{2}\log \det(\bm{\mathrm{G}})+\frac{1}{2} \bm{\mathrm{J}}^T \bm{\mathrm{G}}^{-1} \bm{\mathrm{J}}}. \end{align}
This gives
EW,X,Ω[Zs]=a=1sdϕadωae12(zst2)ϕaϕaTebt2p2dωaTΣ1ωaen2logdet(Id+GZ)en2logdet(Id+GΩ)e12logdet(G)+12JTG1J.\begin{align} \mathbb{E}_{\bm{\mathrm{W}}, \bm{\mathrm{X}}, \bm{\Omega}}[\mathcal{Z}^s]&=\int\prod_{a=1}^s \mathrm{d}\phi^a \mathrm{d}\omega^a e^{\frac{1}{2}(z-s_t^2)\phi^a {\phi^a}^T}e^{-\frac{b_t^2p}{2d}{\omega^a}^T \boldsymbol{\Sigma}^{-1}\omega^a}e^{-\frac{n}{2}\log \det(\bm{I}_d+ \bm{\mathrm{G}}_{Z})}\nonumber \\ & e^{-\frac{n}{2}\log \det (\bm{I}_d+ \bm{\mathrm{G}}_{\bm{\Omega}})}e^{-\frac{1}{2}\log \det(\bm{\mathrm{G}})+\frac{1}{2} \bm{\mathrm{J}}^T \bm{\mathrm{G}}^{-1} \bm{\mathrm{J}}}. \end{align}
Introducing the order parameters. We define the order parameters as Qab=1pϕaϕb\bm{\mathrm{Q}}^{ab} = \frac{1}{p} \phi^a \cdot \phi^b and Rab=1dωaωb\bm{\mathrm{R}}^{ab} = \frac{1}{d} \omega^a \cdot \omega^b. To enforce these constraints, we use the following delta function representations
1=dQabdQ^abe12Q^ab(pQabϕaϕb),1=dRabdR^abe12R^ab(dRabωaωb),\begin{align} &1=\int \mathrm{d} \bm{\mathrm{Q}}^{ab} \mathrm{d}\hat{\bm{\mathrm{Q}}}^{ab} e^{\frac{1}{2}\hat{\bm{\mathrm{Q}}}^{ab}(p \bm{\mathrm{Q}}^{ab}-\phi^a \cdot \phi^b)}, \\ &1=\int \mathrm{d} \bm{\mathrm{R}}^{ab} \mathrm{d}\hat{\bm{\mathrm{R}}}^{ab} e^{\frac{1}{2}\hat{\bm{\mathrm{R}}}^{ab}(d \bm{\mathrm{R}}^{ab}-\omega^a \cdot \omega^b)}, \end{align}
EW,Y,Ω[Zs]=a=1sdϕadωadQabdQ^abdRabdR^abe12Q^ab(pQabϕaϕb)e12R^ab(dRabωaωb)ep2(zst2)TrQen2logdet(Im+at2e2gpnR)ebt2p2dωaTΣ1ωaen2log(1+pn(vt2at2e2tvt2nR(1+at2e2tpnR)1)Q)e12logdet(ΣQ)e12ωkaΣkl1(Q1)abωlb.\begin{align} \mathbb{E}_{\bm{\mathrm{W}}, Y, \bm{\Omega}}[\mathcal{Z}^s]&=\int\prod_{a=1}^s \mathrm{d}\phi^a \mathrm{d}\omega^a \mathrm{d} \bm{\mathrm{Q}}^{ab} \mathrm{d}\hat{\bm{\mathrm{Q}}}^{ab} \mathrm{d} \bm{\mathrm{R}}^{ab} \mathrm{d}\hat{\bm{\mathrm{R}}}^{ab}\nonumber \\ &e^{\frac{1}{2}\hat{\bm{\mathrm{Q}}}^{ab}(p \bm{\mathrm{Q}}^{ab}-\phi^a \cdot \phi^b)}e^{\frac{1}{2}\hat{\bm{\mathrm{R}}}^{ab}(d \bm{\mathrm{R}}^{ab}-\omega^a \cdot \omega^b)}\nonumber \\ &e^{\frac{p}{2}(z-s_t^2) \operatorname{Tr} \bm{\mathrm{Q}}}e^{-\frac{n}{2}\log \det (\bm{I}_m+\frac{a_t^2e^{-2g}p}{n} \bm{\mathrm{R}})}e^{-\frac{b_t^2p}{2d}{\omega^a}^T \boldsymbol{\Sigma}^{-1}\omega^a}\nonumber \\ & e^{-\frac{n}{2}\log(1+\frac{p}{n}(v_t^2-\frac{a_t^2e^{-2t}v_t^2}{n} \bm{\mathrm{R}}(1+\frac{a_t^2e^{-2t}p}{n} \bm{\mathrm{R}})^{-1}) \bm{\mathrm{Q}})}\nonumber \\ &e^{-\frac{1}{2}\log \det(\boldsymbol{\Sigma} \otimes \bm{\mathrm{Q}})}e^{-\frac{1}{2}\omega_k^a \boldsymbol{\Sigma}^{-1}_{kl}(\bm{\mathrm{Q}}^{-1})_{ab}\omega_l^b}. \end{align}
We also introduce Sab=ωkaΣ1ωlb/d\bm{\mathrm{S}}^{ab}=\omega_k^a \boldsymbol{\Sigma}^{-1}\omega_l^b/d.
EW,X,Ω[Zs]=a=1sdϕadωadQabdQ^abdRabdR^abdSabdS^abe12Q^ab(pQabϕaϕb)e12R^ab(dRabωaωb)e12S^ab(dSabωaΣ1ωb)ep2(zst2)TrQen2logdet(Im+at2e2tpnR)ebt2p2Tr(S)en2log(1+pn(vt2at2e2tvt2nR(1+at2vt2pnR)1)Q)e12logdet(ΣQ)ed2Tr(SQ1).\begin{align} \mathbb{E}_{\bm{\mathrm{W}}, \bm{\mathrm{X}}, \bm{\Omega}}[\mathcal{Z}^s]&=\int\prod_{a=1}^s \mathrm{d}\phi^a \mathrm{d}\omega^a \mathrm{d} \bm{\mathrm{Q}}^{ab} \mathrm{d}\hat{\bm{\mathrm{Q}}}^{ab} \mathrm{d} \bm{\mathrm{R}}^{ab} \mathrm{d}\hat{\bm{\mathrm{R}}}^{ab} \mathrm{d} \bm{\mathrm{S}}^{ab} \mathrm{d} \hat{\bm{\mathrm{S}}}^{ab}\nonumber \\ &e^{\frac{1}{2}\hat{\bm{\mathrm{Q}}}^{ab}(p \bm{\mathrm{Q}}^{ab}-\phi^a \cdot \phi^b)}e^{\frac{1}{2}\hat{\bm{\mathrm{R}}}^{ab}(d \bm{\mathrm{R}}^{ab}-\omega^a \cdot \omega^b)}e^{\frac{1}{2}\hat{\bm{\mathrm{S}}}^{ab}(dS^{ab}-\omega^a \boldsymbol{\Sigma}^{-1}\omega^b)}\nonumber \\ &e^{\frac{p}{2}(z-s_t^2) \operatorname{Tr} \bm{\mathrm{Q}}}e^{-\frac{n}{2}\log \det (\bm{I}_m+\frac{a_t^2e^{-2t}p}{n} \bm{\mathrm{R}})}e^{-\frac{b_t^2 p}{2} \operatorname{Tr}(\bm{\mathrm{S}})}\nonumber \\ & e^{-\frac{n}{2}\log(1+\frac{p}{n}(v_t^2-\frac{a_t^2e^{-2t}v_t^2}{n} \bm{\mathrm{R}}(1+\frac{a_t^2v_t^2p}{n} \bm{\mathrm{R}})^{-1}) \bm{\mathrm{Q}})}\nonumber \\ &e^{-\frac{1}{2}\log \det(\boldsymbol{\Sigma} \otimes \bm{\mathrm{Q}})}e^{-\frac{d}{2} \operatorname{Tr} (\bm{\mathrm{S}} \bm{\mathrm{Q}}^{-1})}. \end{align}
The integration over dϕa\mathrm{d} \phi^a and dωa\mathrm{d} \omega^a gives
EW,X,Ω[Zs]=a=1sdQabdQ^abdRabdR^abdSabdS^abep2Tr(Q^Q)ep2logdetQ^ed2R^abRabed2S^abSabe12logdet(R^Id+S^Σ1)ep2(zst2)TrQen2logdet(Im+at2e2tpnR)ebt2p2Tr(S)en2log(1+pn(vt2at2e2tvt2nR(1+at2e2tpnR)1)Q)e12logdet(ΣQ)ed2Tr(SQ1).\begin{align} \mathbb{E}_{\bm{\mathrm{W}}, \bm{\mathrm{X}}, \bm{\Omega}}[\mathcal{Z}^s]&=\int\prod_{a=1}^s \mathrm{d} \bm{\mathrm{Q}}^{ab} \mathrm{d}\hat{\bm{\mathrm{Q}}}^{ab} \mathrm{d} \bm{\mathrm{R}}^{ab} \mathrm{d}\hat{\bm{\mathrm{R}}}^{ab} \mathrm{d} \bm{\mathrm{S}}^{ab} \mathrm{d} \hat{\bm{\mathrm{S}}}^{ab}\nonumber \\ &e^{\frac{p}{2} \operatorname{Tr}(\hat{\bm{\mathrm{Q}}} \bm{\mathrm{Q}})}e^{-\frac{p}{2}\log \det \hat{\bm{\mathrm{Q}}}}e^{\frac{d}{2}\hat{\bm{\mathrm{R}}}^{ab} \bm{\mathrm{R}}^{ab}}e^{\frac{d}{2}\hat{\bm{\mathrm{S}}}^{ab} \bm{\mathrm{S}}^{ab}}\nonumber \\ &e^{-\frac{1}{2}\log\det(\hat{\bm{\mathrm{R}}}\otimes \bm{I}_d+\hat{\bm{\mathrm{S}}}\otimes \boldsymbol{\Sigma}^{-1})}\nonumber \\ &e^{\frac{p}{2}(z-s_t^2) \operatorname{Tr} \bm{\mathrm{Q}}}e^{-\frac{n}{2}\log \det (\bm{I}_m+\frac{a_t^2e^{-2t}p}{n} \bm{\mathrm{R}})}e^{-\frac{b_t^2 p}{2} \operatorname{Tr}(\bm{\mathrm{S}})}\nonumber \\ & e^{-\frac{n}{2}\log(1+\frac{p}{n}(v_t^2-\frac{a_t^2e^{-2t}v_t^2}{n} \bm{\mathrm{R}}(1+\frac{a_t^2e^{-2t}p}{n} \bm{\mathrm{R}})^{-1}) \bm{\mathrm{Q}})}\nonumber \\ &e^{-\frac{1}{2}\log \det(\boldsymbol{\Sigma} \otimes \bm{\mathrm{Q}})}e^{-\frac{d}{2} \operatorname{Tr} (\bm{\mathrm{S}} \bm{\mathrm{Q}}^{-1})}. \end{align}
We need to combine e12logdet(ΣQ)e^{-\frac{1}{2}\log \det(\boldsymbol{\Sigma} \otimes \bm{\mathrm{Q}})} and e12logdet(R^Id+S^Σ1)e^{-\frac{1}{2}\log\det(\hat{\bm{\mathrm{R}}}\otimes \bm{I}_d+\hat{\bm{\mathrm{S}}}\otimes \boldsymbol{\Sigma}^{-1})},
e12logdet(ΣQ)e12logdet(R^Id+S^Σ1)=e12logdet(QS^Id+QR^Σ)=ed2logdet(QS^)e12logdet(ImId+R^S^1Σ)\begin{align} e^{-\frac{1}{2}\log \det(\boldsymbol{\Sigma} \otimes \bm{\mathrm{Q}})}e^{-\frac{1}{2}\log\det(\hat{\bm{\mathrm{R}}}\otimes \bm{I}_d+\hat{\bm{\mathrm{S}}}\otimes \boldsymbol{\Sigma}^{-1})} &=e^{-\frac{1}{2}\log \det(\bm{\mathrm{Q}}\hat{\bm{\mathrm{S}}}\otimes \bm{I}_d+ \bm{\mathrm{Q}}\hat{\bm{\mathrm{R}}}\otimes \boldsymbol{\Sigma})}\\ &=e^{-\frac{d}{2}\log \det(\bm{\mathrm{Q}}\hat{\bm{\mathrm{S}}})}e^{-\frac{1}{2}\log \det(\bm{I}_m\otimes \bm{I}_d+\hat{\bm{\mathrm{R}}}\hat{\bm{\mathrm{S}}}^{-1}\otimes \boldsymbol{\Sigma})} \end{align}
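This identity can be checked numerically on small random matrices. The following sketch is our own illustration (not part of the derivation): it draws random positive-definite $\bm{\mathrm{Q}}$, $\hat{\bm{\mathrm{R}}}$, $\hat{\bm{\mathrm{S}}}$ and $\boldsymbol{\Sigma}$ and compares the log-determinants of the two sides.

```python
import numpy as np

rng = np.random.default_rng(0)
s, d = 3, 4  # small replica and data dimensions, for illustration only

def random_spd(k):
    """Random symmetric positive-definite k x k matrix."""
    m = rng.standard_normal((k, k))
    return m @ m.T + k * np.eye(k)

Q, R_hat, S_hat = random_spd(s), random_spd(s), random_spd(s)
Sigma = random_spd(d)
logdet = lambda m: np.linalg.slogdet(m)[1]

# Left-hand side: log det(Sigma x Q) + log det(R_hat x I_d + S_hat x Sigma^{-1}),
# where "x" denotes the Kronecker product.
lhs = logdet(np.kron(Sigma, Q)) + logdet(
    np.kron(R_hat, np.eye(d)) + np.kron(S_hat, np.linalg.inv(Sigma))
)

# Right-hand side: d log det(Q S_hat) + log det(I + R_hat S_hat^{-1} x Sigma).
rhs = d * logdet(Q @ S_hat) + logdet(
    np.eye(s * d) + np.kron(R_hat @ np.linalg.inv(S_hat), Sigma)
)

assert np.isclose(lhs, rhs)  # the two expressions agree up to numerical precision
```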
Then for e12logdet(ImId+R^S^1Σ)e^{-\frac{1}{2}\log \det(\bm{I}_m\otimes \bm{I}_d+\hat{\bm{\mathrm{R}}}\hat{\bm{\mathrm{S}}}^{-1}\otimes \boldsymbol{\Sigma})}, we can introduce ρΣ(λ)\rho_{\boldsymbol{\Sigma}}(\lambda) the density of eigenvalues of Σ\boldsymbol{\Sigma}
\begin{align} -\frac{1}{2}\log \det(\bm{I}_m\otimes \bm{I}_d+\hat{\bm{\mathrm{R}}}\hat{\bm{\mathrm{S}}}^{-1}\otimes \boldsymbol{\Sigma})&=-\frac{1}{2} \operatorname{Tr} \log(\bm{I}_m\otimes \bm{I}_d+\hat{\bm{\mathrm{R}}}\hat{\bm{\mathrm{S}}}^{-1}\otimes \boldsymbol{\Sigma})\\ &=\frac{1}{2}\sum_{l\ge1}\frac{(-1)^l}{l} \operatorname{Tr}((\hat{\bm{\mathrm{R}}}\hat{\bm{\mathrm{S}}}^{-1})^l) \operatorname{Tr}(\boldsymbol{\Sigma}^l)\\ &=\frac{d}{2}\int \mathrm{d}\lambda\,\rho_{\boldsymbol{\Sigma}}(\lambda)\sum_{l\ge1}\frac{(-1)^l}{l} \operatorname{Tr}((\hat{\bm{\mathrm{R}}}\hat{\bm{\mathrm{S}}}^{-1})^l)\lambda^l\\ &=-\frac{d}{2}\int \mathrm{d}\lambda\,\rho_{\boldsymbol{\Sigma}}(\lambda) \operatorname{Tr}\log(\bm{I}_m\otimes \bm{I}_d+\lambda\hat{\bm{\mathrm{R}}}\hat{\bm{\mathrm{S}}}^{-1}). \end{align}
We end up with
EW,X,Ω[Zm]=dQdQ^dRdR^dSdS^ed2S(Q,Q^,R,R^,S,S^),\begin{align} \mathbb{E}_{\bm{\mathrm{W}}, \bm{\mathrm{X}}, \bm{\Omega}}[\mathcal{Z}^m]=\int \mathrm{d} \bm{\mathrm{Q}} \mathrm{d}\hat{\bm{\mathrm{Q}}} \mathrm{d} \bm{\mathrm{R}} \mathrm{d}\hat{\bm{\mathrm{R}}} \mathrm{d} \bm{\mathrm{S}} \mathrm{d}\hat{\bm{\mathrm{S}}}e^{-\frac{d}{2}S(\bm{\mathrm{Q}}, \hat{\bm{\mathrm{Q}}}, \bm{\mathrm{R}}, \hat{\bm{\mathrm{R}}}, \bm{\mathrm{S}}, \hat{\bm{\mathrm{S}}})}, \end{align}
where the action reads
S(Q,Q^,R,R^,S,S^)=ψplogdetQ^ψpTr(QQ^)Tr(RR^)Tr(SS^)ψp(zst2)TrQ+ψnlogdet(Is+at2e2tpnR)+bt2ψpTrS+ψnlog(Is+pn(vt2at2e2tvt2nR(Is+at2e2tpnR)1)Q)+logdet(QS^)+dλρΣ(λ)Trlog(ImId+λR^S^1)+Tr(SQ1).\begin{align} S(\bm{\mathrm{Q}}, \hat{\bm{\mathrm{Q}}}, \bm{\mathrm{R}}, \hat{\bm{\mathrm{R}}}, \bm{\mathrm{S}}, \hat{\bm{\mathrm{S}}})&=\psi_p \log \det\hat{\bm{\mathrm{Q}}}-\psi_p \operatorname{Tr}(\bm{\mathrm{Q}}\hat{\bm{\mathrm{Q}}})- \operatorname{Tr}(\bm{\mathrm{R}}\hat{\bm{\mathrm{R}}})- \operatorname{Tr}(\bm{\mathrm{S}}\hat{\bm{\mathrm{S}}})\nonumber \\ &-\psi_p(z-s_t^2) \operatorname{Tr} \bm{\mathrm{Q}}+\psi_n\log \det(\bm{I}_s+\frac{a_t^2e^{-2t}p}{n} \bm{\mathrm{R}})+b_t^2\psi_p \operatorname{Tr} \bm{\mathrm{S}}\nonumber \\ &+\psi_n \log(\bm{I}_s+\frac{p}{n}(v_t^2-\frac{a_t^2e^{-2t}v_t^2}{n} \bm{\mathrm{R}}(\bm{I}_s+\frac{a_t^2e^{-2t}p}{n} \bm{\mathrm{R}})^{-1}) \bm{\mathrm{Q}})\nonumber \\ &+\log \det(\bm{\mathrm{Q}}\hat{\bm{\mathrm{S}}})+\int \mathrm{d}\lambda\rho_{\boldsymbol{\Sigma}}(\lambda) \operatorname{Tr}\log(\bm{I}_m\otimes \bm{I}_d+\lambda\hat{\bm{\mathrm{R}}}\hat{\bm{\mathrm{S}}}^{-1}) + \operatorname{Tr}(\bm{\mathrm{S}} \bm{\mathrm{Q}}^{-1}). \end{align}
In the high-dimensional limit, the partition function is dominated by the saddle point. Differentiating with respect to $\hat{\bm{\mathrm{Q}}}$ we get
Q^1=Q,\begin{align} \hat{\bm{\mathrm{Q}}}^{-1}= \bm{\mathrm{Q}}, \end{align}
which yields
S(Q,R,R^,S,S^)=ψplogdetQTr(RR^)Tr(SS^)ψp(zst2)TrQ+ψnlogdet(Is+at2e2tpnR)+bt2ψpTrS+ψnlog(Is+pn(vt2at2e2tvt2nR(Is+at2e2tpnR)1)Q)+logdet(QS^)+dλρΣ(λ)Trlog(ImId+λR^S^1)+Tr(SQ1).\begin{align} S(\bm{\mathrm{Q}}, \bm{\mathrm{R}}, \hat{\bm{\mathrm{R}}}, \bm{\mathrm{S}}, \hat{\bm{\mathrm{S}}})&=-\psi_p \log \det \bm{\mathrm{Q}}- \operatorname{Tr}(\bm{\mathrm{R}}\hat{\bm{\mathrm{R}}})- \operatorname{Tr}(\bm{\mathrm{S}}\hat{\bm{\mathrm{S}}})\nonumber \\ &-\psi_p(z-s_t^2) \operatorname{Tr} \bm{\mathrm{Q}}+\psi_n\log \det(\bm{I}_s+\frac{a_t^2e^{-2t}p}{n} \bm{\mathrm{R}})+b_t^2\psi_p \operatorname{Tr} \bm{\mathrm{S}}\nonumber \\ &+\psi_n \log(\bm{I}_s+\frac{p}{n}(v_t^2-\frac{a_t^2e^{-2t}v_t^2}{n} \bm{\mathrm{R}}(\bm{I}_s+\frac{a_t^2e^{-2t}p}{n} \bm{\mathrm{R}})^{-1}) \bm{\mathrm{Q}})\nonumber \\ &+\log \det(\bm{\mathrm{Q}}\hat{\bm{\mathrm{S}}})+\int \mathrm{d}\lambda\rho_{\boldsymbol{\Sigma}}(\lambda) \operatorname{Tr}\log(\bm{I}_m\otimes \bm{I}_d+\lambda\hat{\bm{\mathrm{R}}}\hat{\bm{\mathrm{S}}}^{-1})\nonumber \\ &+ \operatorname{Tr}(\bm{\mathrm{S}} \bm{\mathrm{Q}}^{-1}). \end{align}
As a sanity check, if Σ=Id\boldsymbol{\Sigma}= \bm{I}_d, differentiation with respect to R^\hat{\bm{\mathrm{R}}} and S^\hat{\bm{\mathrm{S}}} yields
R=S=(S^+R^)1,\begin{align} \bm{\mathrm{R}}= \bm{\mathrm{S}}=(\hat{\bm{\mathrm{S}}}+\hat{\bm{\mathrm{R}}})^{-1}, \end{align}
and we recover the same action as before.

RS Ansatz. As before, we introduce an RS ansatz for all the matrices and moreover assume that only the diagonal terms are non-vanishing, i.e. they are of the form $\bm{\mathrm{Q}}=q \bm{I}_s$. This ansatz yields
S(q,r,r^,s,s^)/s=ψplogqrr^ss^ψp(zst2)q+ψnlog(1+at2e2tpnr+pvt2nq)+bt2ψps+log(q)+dλ  ρΣ(λ)log(s^+λr^)+sq.\begin{align} S(q, r, \hat{r}, s, \hat{s})/s&=-\psi_p \log q-r\hat{r}-s\hat{s}\nonumber \\ &-\psi_p(z-s_t^2)q+\psi_n\log(1+\frac{a_t^2e^{-2t}p}{n}r+\frac{pv_t^2}{n}q)+b_t^2\psi_p s\nonumber \\ &+\log (q)+\int \mathrm{d}\lambda\; \rho_{\boldsymbol{\Sigma}}(\lambda)\log(\hat{s}+\lambda\hat{r})+\frac{s}{q}. \end{align}
Let us differentiate with respect to the 5 variables
Ss=s^+bt2ψp+1q,Sr=r^+ψpat2e2t1+at2e2tpnr+pvt2nq,Ss^=s+dλρΣ(λ)1s^+λr^,Sr^=r+dλρΣ(λ)λs^+λr^,Sq=ψpqψp(zst2)+ψpvt21+at2e2tpnr+pvt2nq+1qsq2.\begin{align} &\frac{\partial S}{\partial s }=-\hat{s}+b_t^2\psi_p+\frac{1}{q}, \\ &\frac{\partial S}{\partial r }=-\hat{r}+\frac{\psi_p a_t^2e^{-2t}}{1+\frac{a_t^2e^{-2t}p}{n}r+\frac{pv_t^2}{n}q}, \\ &\frac{\partial S}{\partial \hat{s}}=-s+\int \mathrm{d}\lambda \rho_{\boldsymbol{\Sigma}}(\lambda)\frac{1}{\hat{s}+\lambda\hat{r}}, \\ &\frac{\partial S}{\partial \hat{r}}=-r+\int \mathrm{d}\lambda \rho_{\boldsymbol{\Sigma}}(\lambda)\frac{\lambda}{\hat{s}+\lambda\hat{r}}, \\ &\frac{\partial S}{\partial q}=-\frac{\psi_p}{q}-\psi_p(z-s_t^2)+\frac{\psi_pv_t^2}{1+\frac{a_t^2e^{-2t}p}{n}r+\frac{pv_t^2}{n}q}+\frac{1}{q}-\frac{s}{q^2}. \end{align}
Hence the saddle point equations read
s^=bt2ψp+1q,r^=ψpat2e2t1+at2e2tpnr+pvt2nq,s=dρΣ(λ)1s^+λr^,r=dρΣ(λ)λs^+λr^,ψp(st2z)+ψpvt21+at2e2tpnr+pvt2nq+1ψpqsq2=0.\begin{align} &\hat{s}=b_t^2\psi_p+\frac{1}{q}, \\ &\hat{r}=\frac{\psi_p a_t^2e^{-2t}}{1+\frac{a_t^2e^{-2t}p}{n}r+\frac{pv_t^2}{n}q}, \\ &s=\int \mathrm{d} \rho_{\boldsymbol{\Sigma}}(\lambda)\frac{1}{\hat{s}+\lambda\hat{r}}, \\ &r=\int \mathrm{d} \rho_{\boldsymbol{\Sigma}}(\lambda)\frac{\lambda}{\hat{s}+\lambda\hat{r}}, \\ &\psi_p(s_t^2-z)+\frac{\psi_pv_t^2}{1+\frac{a_t^2e^{-2t}p}{n}r+\frac{pv_t^2}{n}q}+\frac{1-\psi_p}{q}-\frac{s}{q^2}=0. \end{align}
Finally, we observe that the solution qq^* to the saddle point equations corresponds to the Stieltjes transform of ρ\rho.
\begin{align} 2\partial_z\frac{1}{p}\frac{\mathbb{E}[\mathcal{Z}^s]-1}{s}=2\partial_z\frac{1}{p}\frac{e^{-\frac{d}{2}S(q^*, r^*)}-1}{s}\underset{s\rightarrow0}{\rightarrow}-2\partial_z\frac{1}{p}\frac{d}{2}S(q^*, r^*)=q^*. \end{align}
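As a practical aside (our own sketch, not part of the derivation), the saddle-point equations above can be iterated numerically at $z=\lambda+i\varepsilon$ to obtain $q(z)$, and hence the spectral density through $\rho(\lambda)=\operatorname{Im} q(\lambda+i\varepsilon)/\pi$. Below, $\rho_{\boldsymbol{\Sigma}}=\delta(\lambda-1)$ and the values of $\psi_p$, $\psi_n$ and of the time-dependent coefficients $s_t^2, v_t^2, a_t^2, b_t^2$ are placeholders (their definitions are given in the MT); the naive damped iteration is not guaranteed to converge for all parameters.

```python
import numpy as np

# Placeholder parameter values; s_t^2, v_t^2, a_t^2, b_t^2 must be set from their MT definitions.
psi_p, psi_n, t = 64.0, 8.0, 0.01
st2, vt2, at2, bt2 = 0.02, 1.0, 1.0, 0.02
lam_Sigma, w_Sigma = np.array([1.0]), np.array([1.0])    # rho_Sigma = delta(lambda - 1)

def resolvent_q(z, n_iter=5000, damping=0.3):
    """Damped fixed-point iteration of the saddle-point equations, returning q(z)."""
    q = r = s = 1j                                        # start in the upper half-plane
    alpha = at2 * np.exp(-2 * t) * psi_p / psi_n          # a_t^2 e^{-2t} p / n
    beta = vt2 * psi_p / psi_n                            # v_t^2 p / n
    for _ in range(n_iter):
        D = 1.0 + alpha * r + beta * q                    # common denominator
        s_hat = bt2 * psi_p + 1.0 / q
        r_hat = psi_p * at2 * np.exp(-2 * t) / D
        s_new = np.sum(w_Sigma / (s_hat + lam_Sigma * r_hat))
        r_new = np.sum(w_Sigma * lam_Sigma / (s_hat + lam_Sigma * r_hat))
        # last saddle-point equation, solved for q using the previous iterates
        q_new = (psi_p - 1.0) / (psi_p * (st2 - z) + psi_p * vt2 / D - s / q**2)
        q = damping * q_new + (1 - damping) * q
        r = damping * r_new + (1 - damping) * r
        s = damping * s_new + (1 - damping) * s
    return q

# Eigenvalue density from the Stieltjes transform q(z) = int rho(l) / (l - z) dl
eps = 1e-3
lams = np.linspace(0.0, 2 * psi_p, 400)
rho = np.array([resolvent_q(lam + 1j * eps).imag / np.pi for lam in lams])
```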

C.5 Proof of Theorem 2

We recall Theorem 2 of the MT.

Theorem 7 (Informal)

Let ρ\rho denote the spectral density of U\bm{\mathrm{U}}.
  • Regime I (overparametrized): ψp>ψn1\psi_p>\psi_n\gg 1.
ρ(λ)=(11+ψnψp)δ(λst2)+ψnψpρ1(λ)+1ψpρ2(λ).\rho(\lambda)=\Bigl(1-\frac{1+\psi_n}{\psi_p}\Bigr)\delta(\lambda-{s_t^2}) +\frac{\psi_n}{\psi_p}\, \rho_1(\lambda) +\frac{1}{\psi_p}\, \rho_2(\lambda).
  • Regime II (underparametrized): ψn>ψp1\psi_n>\psi_p\gg 1.
ρ(λ)=(11ψp)ρ1(λ)+1ψpρ2(λ).\rho(\lambda)=\Bigl(1-\frac{1}{\psi_p}\Bigr)\rho_1(\lambda) +\frac{1}{\psi_p}\, \rho_2(\lambda).
where $\rho_1$ is an atomless measure with support
[st2+vt2(1ψp/ψn)2,  st2+vt2(1+ψp/ψn)2],\left[s_t^2 + v_t^2\left(1-\sqrt{\psi_p/\psi_n}\right)^{2}, \; s_t^2 + v_t^2\left(1+\sqrt{\psi_p/\psi_n}\right)^{2}\right],
and ρ2\rho_2 coincides with the asymptotic eigenvalue bulk density of the population covariance U~=EX[U]\tilde{\bm{\mathrm{U}}}=\mathbb{E}_{\bm{\mathrm{X}}}[\bm{\mathrm{U}}]; ρ2\rho_2 is independent of ψn\psi_n and its support is on the scale ψp\psi_p.
The eigenvectors associated with δ(λst2)\delta(\lambda-{s_t^2}) leave both training and test losses unchanged and are therefore irrelevant. In the limit ψpψn\psi_p\gg \psi_n, the supports of ρ1\rho_1 and ρ2\rho_2 are respectively on the scales ψp/ψn\psi_p/\psi_n and ψp\psi_p, i.e. they are well separated.
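As a concrete illustration of these weights (our own arithmetic, using the parameter values of Figure 14 below), take $\psi_p=64$ and $\psi_n=8$ in Regime I:
\begin{align}
\rho(\lambda)=\frac{55}{64}\,\delta(\lambda-s_t^2)+\frac{8}{64}\,\rho_1(\lambda)+\frac{1}{64}\,\rho_2(\lambda),
\end{align}
i.e. roughly $86\%$ of the eigenvalues sit on the delta peak, $12.5\%$ fall in the first bulk $\rho_1$, and about $1.6\%$ in the second bulk $\rho_2$.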
We now proceed to prove Theorem 2.
Proof: Delta peak. We first account for the delta peak in the spectrum. We use the Gaussian equivalence for $\bm{\mathrm{U}}$ computed in Lemma 4. Let $\bm{\Omega}^\nu\in\mathbb{R}^p$ be the $\nu$-th column of $\bm{\Omega}$ and $\bm{\mathrm{W}}_k\in\mathbb{R}^p$ the $k$-th column of $\bm{\mathrm{W}}$. Suppose that a vector $\bm{\mathrm{v}}\in\mathbb{R}^p$ is orthogonal to all of these vectors,
\begin{align} &\forall \nu=1, \dots, n, \quad\sum_{i=1}^p \bm{\Omega}^\nu_i \bm{\mathrm{v}}_i=0, \\ &\forall k=1, \ldots, d, \quad\sum_{i=1}^p \bm{\mathrm{W}}_{ik} \bm{\mathrm{v}}_i=0. \end{align}
then $\bm{\mathrm{U}} \bm{\mathrm{v}}=s_t^2 \bm{\mathrm{v}}$. These are $n+d$ linear constraints on a vector of size $p$, so nontrivial solutions exist whenever $n+d\le p$; a delta peak at $s_t^2$ therefore appears as soon as $\psi_p \ge \psi_n+1$. Next, we extract its weight. Recall that the Stieltjes transform satisfies
q(z)  =  ρ(λ)λzdλ,q(z) \;=\;\int \frac{\rho(\lambda)}{\lambda - z}\, \mathrm{d}\lambda,
and a point mass of weight ff at λ=st2\lambda = s_t^2 contributes fzst2fε\tfrac{-f}{z - s_t^2}\approx \tfrac{f}{\varepsilon} as zst2εz \to s_t^2 - \varepsilon . Meanwhile
s(z)  =  1pTr ⁣[WT(UzI)1W],r(z)  =  1pTr ⁣[Σ1/2WT(UzI)1WΣ1/2]s(z) \;=\;\frac1p \operatorname{Tr}\!\bigl[\bm{\mathrm{W}}^T(\bm{\mathrm{U}} - z \bm{I})^{-1} \bm{\mathrm{W}}\bigr], \quad r(z) \;=\;\frac1p \operatorname{Tr}\!\bigl[{\boldsymbol{\Sigma}}^{1/2} \bm{\mathrm{W}}^T(\bm{\mathrm{U}} - z \bm{I})^{-1} \bm{\mathrm{W}}{\boldsymbol{\Sigma}}^{1/2}\bigr]
remain finite in that limit, since the corresponding eigenvectors satisfy $\bm{\mathrm{W}}^T \bm{\mathrm{v}}=0$. We substitute this Ansatz into the equations of Theorem 1. The first equation reads
ψnpvt2n1+e2tμ12pσx2nr+pnvt2q+ψp(st2z)+1ψpqsq2=0,\begin{align} \psi_n\frac{\frac{pv_t^2}{n}}{1+\frac{e^{-2t}\mu_1^2p\sigma_{\bm{\mathrm{x}}}^2}{n}r+\frac{p}{n}v_t^2q}+\psi_p(s_t^2-z)+\frac{1-\psi_p}{q}-\frac{s}{q^2}=0, \end{align}
and simplifies to
ψnεf+ψpε+(1ψp)εf=0.\begin{align} \frac{\psi_n\varepsilon}{f}+\psi_p\varepsilon+\frac{(1-\psi_p)\varepsilon}{f}=0. \end{align}
It readily gives
f=11ψpψnψp.\begin{align} f=1-\frac{1}{\psi_p}-\frac{\psi_n}{\psi_p}. \end{align}
Thus the point mass at st2s_t^2 has weight 11ψpψnψp1-\frac{1}{\psi_p}-\frac{\psi_n}{\psi_p}, in agreement with the counting of degrees of freedom presented above.
Finally, one checks that these isolated eigenvalues do not contribute to the train and test losses. After expanding the square they read
Ltrain(A)=1+ΔtdTr(ATpApU)+2ΔtdTr(ApV)Ltest(A)=1+ΔtdTr(ATpApU~)+2ΔtdTr(ApV~)\begin{align} &\mathcal{L}_\mathrm{train}(\bm{\mathrm{A}})=1+\frac{\Delta_t}{d} \operatorname{Tr}(\frac{\bm{\mathrm{A}}^T}{\sqrt{p}}\frac{\bm{\mathrm{A}}}{\sqrt{p}} \bm{\mathrm{U}})+\frac{2\sqrt{\Delta_t}}{d} \operatorname{Tr}(\frac{\bm{\mathrm{A}}}{\sqrt{p}} \bm{\mathrm{V}})\\ &\mathcal{L}_\mathrm{test}(\bm{\mathrm{A}})=1+\frac{\Delta_t}{d} \operatorname{Tr}(\frac{\bm{\mathrm{A}}^T}{\sqrt{p}}\frac{\bm{\mathrm{A}}}{\sqrt{p}}\tilde{\bm{\mathrm{U}}})+\frac{2\sqrt{\Delta_t}}{d} \operatorname{Tr}(\frac{\bm{\mathrm{A}}}{\sqrt{p}}\tilde{\bm{\mathrm{V}}}) \end{align}
The terms that appear in the loss are of the form $\operatorname{Tr}(\bm{\mathrm{A}}^T \bm{\mathrm{A}}\,...)$ and $\operatorname{Tr}(\bm{\mathrm{A}} \bm{\mathrm{V}})$. The trace can be decomposed in the basis of eigenvectors of $\bm{\mathrm{U}}$. The eigenvectors associated with the delta peak satisfy $\bm{\mathrm{W}}^T \bm{\mathrm{v}}=0$. Looking at the expression of the matrix $\bm{\mathrm{A}}= \bm{\mathrm{W}}^T...+ \bm{\mathrm{A}}_0$, one easily sees that, for initial conditions $\bm{\mathrm{A}}_0=0$, one has $\bm{\mathrm{v}}^T \bm{\mathrm{A}}^T=0$, so the subspace corresponding to these isolated eigenvalues does not contribute to the loss.
First bulk. Using the expression for q=1pTr1UzIpq=\frac{1}{p} \operatorname{Tr}\frac{1}{\bm{\mathrm{U}}-z \bm{I}_p} and r(z)=1pTr(Σ1/2WT(UzI)1WΣ1/2)r(z)=\frac{1}{p} \operatorname{Tr}(\boldsymbol{\Sigma}^{1/2} \bm{\mathrm{W}}^T(\bm{\mathrm{U}}-z \bm{I})^{-1} \bm{\mathrm{W}} \boldsymbol{\Sigma}^{1/2}) we make the following Ansatz in the large ψp\psi_p limit:
q=Oψp(1),r=Oψp(1ψp).\begin{align} q=\mathcal{O}_{\psi_p}(1), \quad r=\mathcal{O}_{\psi_p}(\frac{1}{\psi_p}). \end{align}
In this limit the saddle point equations become, at leading order in $\psi_p$,
s^=bt2ψpr^=ψpat2e2t1+vt2pnrs=O(1/ψp)r=O(1/ψp)(st2z)+vt21+pvt2nq1q=0.\begin{align} &\hat{s}=b_t^2\psi_p\\ &\hat{r}=\frac{\psi_pa_t^2e^{-2t}}{1+\frac{v_t^2p}{n}r}\\ &s=\mathcal{O}(1/\psi_p)\\ &r=\mathcal{O}(1/\psi_p)\\ &(s_t^2-z)+\frac{ v_t^2}{1+\frac{pv_t^2}{n}q}-\frac{1}{q}=0. \end{align}
We can thus focus on the last equation, which involves $q$ only. Multiplying it by $q(1+\frac{pv_t^2}{n}q)$ turns it into a quadratic equation in $q$,
\begin{align}
\frac{pv_t^2}{n}(s_t^2-z)q^2+\left(s_t^2-z+v_t^2\left(1-\frac{p}{n}\right)\right)q-1=0.
\end{align}
If its discriminant is negative, the two roots are complex conjugates and the density of eigenvalues is therefore non-zero. The edges of the bulk are where the discriminant, evaluated at $z=\lambda$, vanishes,
\begin{align} \Delta=\left(s_t^2-\lambda+\left(1-\frac{p}{n}\right)v_t^2\right)^2+4(s_t^2-\lambda)\frac{p}{n}v_t^2=0. \end{align}
It vanishes for
λ±=st2+vt2(1±pn)2\begin{align} \lambda_{\pm}=s_t^2+v_t^2\left(1\pm\sqrt{\frac{p}{n}}\right)^2 \end{align}
which are the edges of the first bulk $\rho_1$. We have checked this result, and hence validated the Ansatz, by solving numerically the equations on $r, q$. Interestingly, at leading order the expression of the first bulk is independent of $\rho_{\boldsymbol{\Sigma}}$.
Second Bulk. We scale q=Oψp(1/ψp)q=\mathcal{O}_{\psi_p}(1/\psi_p) and r=Oψp(1/ψp)r=\mathcal{O}_{\psi_p}(1/\psi_p). The equations on s^\hat{s} and r^\hat{r} lead to
s^=ψpbt2+1qr^=ψpat2e2t.\begin{align} &\hat{s}=\psi_pb_t^2+\frac{1}{q}\\ &\hat{r}=\psi_pa_t^2e^{-2t}. \end{align}
This yields the following equation on qq
ψp(st2z)+ψpvt2+1ψpq1qdρΣ(λ)1+qψp(bt2+λat2e2t)=0.\begin{align} \psi_p(s_t^2-z)+\psi_pv_t^2+\frac{1-\psi_p}{q}-\frac{1}{q}\int\frac{\mathrm{d}\rho_{\boldsymbol{\Sigma}}(\lambda)}{1+q\psi_p(b_t^2+\lambda a_t^2e^{-2t})}=0. \end{align}
We denote the shifted variable z=zst2vt2z'=z-s_t^2-v_t^2. This yields
ψpz+1ψpq1qdρΣ(λ)1+qψp(bt2+λat2e2t)=0.\begin{align} -\psi_pz'+\frac{1-\psi_p}{q}-\frac{1}{q}\int\frac{\mathrm{d}\rho_{\boldsymbol{\Sigma}}(\lambda)}{1+q\psi_p(b_t^2+\lambda a_t^2e^{-2t})}=0. \end{align}
We decompose the integral
dρΣ(λ)1+qψp(bt2+λat2e2t)=dρΣ(λ)(1+qψp(bt2+λat2e2t)qψp(bt2+λat2e2t))1+qψp(bt2+λat2e2t)=1qψpdρΣ(λ)(bt2+λat2e2t)1+qψp(bt2+λat2e2t)\begin{align} \int\frac{\mathrm{d}\rho_{\boldsymbol{\Sigma}}(\lambda)}{1+q\psi_p(b_t^2+\lambda a_t^2e^{-2t})}&=\int\frac{\mathrm{d}\rho_{\boldsymbol{\Sigma}}(\lambda)(1+q\psi_p(b_t^2+\lambda a_t^2e^{-2t})-q\psi_p(b_t^2+\lambda a_t^2e^{-2t}))}{1+q\psi_p(b_t^2+\lambda a_t^2e^{-2t})}\\ &=1-q\psi_p\int\frac{\mathrm{d}\rho_{\boldsymbol{\Sigma}}(\lambda)(b_t^2+\lambda a_t^2e^{-2t})}{1+q\psi_p(b_t^2+\lambda a_t^2e^{-2t})} \end{align}
By plugging this back in the equation we find
q=(zdρΣ(λ)(bt2+λat2e2t)1+ψpq(bt2+λat2e2t))1.\begin{align} q=-\left(z'-\int \frac{\mathrm{d} \rho_{\boldsymbol{\Sigma}}(\lambda)(b_t^2+\lambda a_t^2e^{-2t})}{1+\psi_p q(b_t^2+\lambda a_t^2e^{-2t})}\right)^{-1}. \end{align}
We do the change of variable μ=bt2+λat2e2t\mu=b_t^2+\lambda a_t^2e^{-2t}. This yields
q=(z1at2e2tdμρΣ(μbt2at2e2t)μ1+ψpqμ)1.\begin{align} q=-\left(z'-\frac{1}{a_t^2e^{-2t}}\int \frac{\mathrm{d}\mu \rho_{\boldsymbol{\Sigma}}(\frac{\mu-b_t^2}{a_t^2e^{-2t}})\mu}{1+\psi_p q\mu}\right)^{-1}. \end{align}
An integration by parts gives $b_t^2=\Delta_t\mu_1^2(t)$ and $a_t^2=\mu_1^2(t)/\sigma_{\bm{\mathrm{x}}}^2$. We thus realize that the integral is over the eigenvalue distribution of $\mu_1^2(t)(e^{-2t} \boldsymbol{\Sigma}+\Delta_t \bm{I}_d)$,
q=(zdμρμ12(t)Σt(μ)μ1+ψpqμ)1.\begin{align} q=-\left(z'-\int \frac{\mathrm{d}\mu \rho_{\mu_1^2(t) \boldsymbol{\Sigma}_t}(\mu)\mu}{1+\psi_p q\mu}\right)^{-1}. \end{align}
We recognize the Bai-Silverstein equations [64, 65] for the eigenvalue density of the matrix
U~=μ12(t)WΣtWTd+(st2+vt2)Ip=Ex[U]\begin{align} \tilde{\bm{\mathrm{U}}}=\mu_1^2(t)\frac{\bm{\mathrm{W}} \boldsymbol{\Sigma}_t \bm{\mathrm{W}}^T}{d}+(s_t^2+v_t^2) \bm{I}_p=\mathbb{E}_{\bm{\mathrm{x}}}[\bm{\mathrm{U}}] \end{align}
which is the population version of $\bm{\mathrm{U}}$ and is thus independent of $n$. Lemma 5 then gives the order of the eigenvalues in the bulk of $\rho_2$.

C.6 Dynamics on the fast timescales

In the following we denote for a matrix ARp×p\bm{\mathrm{A}}\in \mathbb{R}^{p\times p},
Aop=supvRp,v=1Av\begin{align} \lVert \bm{\mathrm{A}}\rVert_{\mathrm{op}}=\underset{\bm{\mathrm{v}}\in\mathbb{R}^p, \lVert \bm{\mathrm{v}}\rVert=1}{\sup}\lVert \bm{\mathrm{A}} \bm{\mathrm{v}}\rVert \end{align}
the operator norm and
AF=(i,j=1pAij2)1/2\begin{align} \lVert \bm{\mathrm{A}}\rVert_\mathrm{F}=(\sum_{i, j=1}^p \bm{\mathrm{A}}_{ij}^2)^{1/2} \end{align}
the Frobenius norm. Before deriving the fast‐time behavior, we need the following lemma.

Lemma 8

The operator norm of UU~\bm{\mathrm{U}}-\tilde{\bm{\mathrm{U}}} satisfies
(UU~)op=O(ψpψn),\begin{align} \lVert (\bm{\mathrm{U}}-\tilde{\bm{\mathrm{U}}})\rVert_{\mathrm{op}}=\mathcal{O}(\frac{\psi_p}{\sqrt{\psi_n}}), \end{align}
when pndp\gg n\gg d.
Proof: On the one hand,
\begin{align} \bm{\mathrm{U}}=e^{-2t}a_t^2\frac{\bm{\mathrm{W}} \bm{\mathrm{X}} \bm{\mathrm{X}}^T \bm{\mathrm{W}}^T}{nd}+b_t^2\frac{\bm{\mathrm{W}} \bm{\mathrm{W}}^T}{d}+v_t^2\frac{\bm{\Omega} \bm{\Omega}^T}{n}+\frac{e^{-t}a_tv_t}{n\sqrt{d}}\left(\bm{\mathrm{W}} \bm{\mathrm{X}} \bm{\Omega}^T+ \bm{\Omega} \bm{\mathrm{X}}^T \bm{\mathrm{W}}^T\right)+s_t^2 \bm{I}_p \end{align}
and on the other hand,
U~=μ12e2tWΣWTd+Δtμ12WWTd+(st2+vt2)Ip.\begin{align} \tilde{\bm{\mathrm{U}}}=\mu_1^2e^{-2t}\frac{\bm{\mathrm{W}} \boldsymbol{\Sigma} \bm{\mathrm{W}}^T}{d}+\Delta_t\mu_1^2\frac{\bm{\mathrm{W}} \bm{\mathrm{W}}^T}{d}+(s_t^2+v_t^2) \bm{I}_p. \end{align}
We also note the identities bt2=Δtμ12(t)b_t^2=\Delta_t\mu_1^2(t) and at2=μ12(t)a_t^2=\mu_1^2(t).
UU~=at2e2tWd(XXTnΣ)WTd+vt2(ΩΩTnIp)+atvtetnd(ΩXTWT+WXΩT).\begin{align} \bm{\mathrm{U}}-\tilde{\bm{\mathrm{U}}}=a_t^2e^{-2t}\frac{\bm{\mathrm{W}}}{\sqrt{d}}(\frac{\bm{\mathrm{X}} \bm{\mathrm{X}}^T}{n}- \boldsymbol{\Sigma})\frac{\bm{\mathrm{W}}^T}{\sqrt{d}}+v_t^2(\frac{\bm{\Omega} \bm{\Omega}^T}{n}- \bm{I}_p)+\frac{a_tv_te^{-t}}{n\sqrt{d}}(\bm{\Omega} \bm{\mathrm{X}}^T \bm{\mathrm{W}}^T+ \bm{\mathrm{W}} \bm{\mathrm{X}} \bm{\Omega}^T). \end{align}
We can bound its operator norm
(UU~)opC1Wd(XXTnΣ)WTdop+C2(ΩΩTnIp)op+C3ndΩXTWT+WXΩTop,\begin{align} \lVert (\bm{\mathrm{U}}-\tilde{\bm{\mathrm{U}}})&\rVert_{\mathrm{op}}\le C_1 \lVert \frac{\bm{\mathrm{W}}}{\sqrt{d}}(\frac{\bm{\mathrm{X}} \bm{\mathrm{X}}^T}{n}- \boldsymbol{\Sigma})\frac{\bm{\mathrm{W}}^T}{\sqrt{d}}\rVert_{\mathrm{op}}+C_2\lVert (\frac{\bm{\Omega} \bm{\Omega}^T}{n}- \bm{I}_p)\rVert_{\mathrm{op}} \nonumber \\ &\qquad+\frac{C_3}{n\sqrt{d}}\lVert \bm{\Omega} \bm{\mathrm{X}}^T \bm{\mathrm{W}}^T+ \bm{\mathrm{W}} \bm{\mathrm{X}} \bm{\Omega}^T\rVert_{\mathrm{op}}, \end{align}
where $C_1, C_2, C_3$ are constants independent of $p, n, d$. We bound each of the three terms on the right-hand side. We will use the fact that, for a symmetric matrix, the operator norm $\lVert \cdot\rVert_{\mathrm{op}}$ is equal to its largest eigenvalue in absolute value.
First term.
Wd(XXTnΣ)WTdop.\begin{align} \lVert \frac{\bm{\mathrm{W}}}{\sqrt{d}}(\frac{\bm{\mathrm{X}} \bm{\mathrm{X}}^T}{n}- \boldsymbol{\Sigma})\frac{\bm{\mathrm{W}}^T}{\sqrt{d}}\rVert_{\mathrm{op}}. \end{align}
We observe that $\frac{\bm{\mathrm{W}}}{\sqrt{d}}(\frac{\bm{\mathrm{X}} \bm{\mathrm{X}}^T}{n}- \boldsymbol{\Sigma})\frac{\bm{\mathrm{W}}^T}{\sqrt{d}}$ and $\frac{\bm{\mathrm{W}}^T}{\sqrt{d}}\frac{\bm{\mathrm{W}}}{\sqrt{d}}(\frac{\bm{\mathrm{X}} \bm{\mathrm{X}}^T}{n}- \boldsymbol{\Sigma})$ have the same eigenvalues up to the multiplicity of $0$: they both have the same moments $\operatorname{Tr}[(\cdot)^k]$ owing to the cyclicity of the trace. We then use the sub-multiplicativity of the operator norm,
Wd(XXTnΣ)WTdopWTdWdop(XXTnΣ)op.\begin{align} \lVert \frac{\bm{\mathrm{W}}}{\sqrt{d}}(\frac{\bm{\mathrm{X}} \bm{\mathrm{X}}^T}{n}- \boldsymbol{\Sigma})\frac{\bm{\mathrm{W}}^T}{\sqrt{d}}\rVert_{\mathrm{op}}\le\lVert \frac{\bm{\mathrm{W}}^T}{\sqrt{d}} \frac{\bm{\mathrm{W}}}{\sqrt{d}}\rVert_{\mathrm{op}}\lVert(\frac{\bm{\mathrm{X}} \bm{\mathrm{X}}^T}{n}- \boldsymbol{\Sigma})\rVert_{\mathrm{op}}. \end{align}
We can do the same operation by introducing $\bm{\mathrm{X}}= \boldsymbol{\Sigma}^{1/2} \bm{\mathrm{Z}}$ with $\bm{\mathrm{Z}}\in\mathbb{R}^{d\times n}$ having standard Gaussian entries,
(XXTnΣ)op=Σ1/2(ZZTnId)Σ1/2op(ZZTnId)opΣop.\begin{align} \lVert(\frac{\bm{\mathrm{X}} \bm{\mathrm{X}}^T}{n}- \boldsymbol{\Sigma})\rVert_{\mathrm{op}}=\lVert \boldsymbol{\Sigma}^{1/2}(\frac{\bm{\mathrm{Z}} \bm{\mathrm{Z}}^T}{n}- \bm{I}_d) \boldsymbol{\Sigma}^{1/2}\rVert_{\mathrm{op}}\le\lVert(\frac{\bm{\mathrm{Z}} \bm{\mathrm{Z}}^T}{n}- \bm{I}_d)\rVert_{\mathrm{op}}\lVert \boldsymbol{\Sigma}\rVert_{\mathrm{op}}. \end{align}
Among our assumptions, we had $\lVert \boldsymbol{\Sigma}\rVert_{\mathrm{op}}=\mathcal{O}(1)$. The spectrum of $\frac{\bm{\mathrm{Z}} \bm{\mathrm{Z}}^T}{n}- \bm{I}_d$ follows a (shifted) Marchenko-Pastur law whose largest eigenvalue is of order $\sqrt{d/n}$, while for $\frac{\bm{\mathrm{W}}^T \bm{\mathrm{W}}}{d}$ it is of order $\frac{p}{d}$. The bound reads
Wd(XXTnΣ)WTdopO(pnd).\begin{align} \lVert \frac{\bm{\mathrm{W}}}{\sqrt{d}}(\frac{\bm{\mathrm{X}} \bm{\mathrm{X}}^T}{n}- \boldsymbol{\Sigma})\frac{\bm{\mathrm{W}}^T}{\sqrt{d}}\rVert_{\mathrm{op}}\le\mathcal{O}(\frac{p}{\sqrt{nd}}). \end{align}
Second term.
(ΩΩTnIp)op.\begin{align} \lVert (\frac{\bm{\Omega} \bm{\Omega}^T}{n}- \bm{I}_p)\rVert_{\mathrm{op}}. \end{align}
We observe that the spectrum of $\bm{\Omega} \bm{\Omega}^T/n- \bm{I}_p$ is of the Marchenko-Pastur type, and thus its largest eigenvalue is of order $\mathcal{O}(p/n)$, yielding
(ΩΩTnIp)opO(p/n).\begin{align} \lVert (\frac{\bm{\Omega} \bm{\Omega}^T}{n}- \bm{I}_p)\rVert_{\mathrm{op}}\le \mathcal{O}(p/n). \end{align}
Third term.
ΩXTWT+WXΩTop.\begin{align} \lVert \bm{\Omega} \bm{\mathrm{X}}^T \bm{\mathrm{W}}^T+ \bm{\mathrm{W}} \bm{\mathrm{X}} \bm{\Omega}^T\rVert_{\mathrm{op}}. \end{align}
We first bound the operator norm by the Frobenius norm.
ΩXTWT+WXΩTop2ΩXTWTF.\begin{align} \lVert \bm{\Omega} \bm{\mathrm{X}}^T \bm{\mathrm{W}}^T+ \bm{\mathrm{W}} \bm{\mathrm{X}} \bm{\Omega}^T\rVert_{\mathrm{op}}\le 2\lVert \bm{\Omega} \bm{\mathrm{X}}^T \bm{\mathrm{W}}^T\rVert_{\mathrm{F}}. \end{align}
We expand the square
ΩXTWT+WXΩTF2=Ck=1di=1p(ν=1nΩiνXkνWkl)2.\begin{align} \lVert \bm{\Omega} \bm{\mathrm{X}}^T \bm{\mathrm{W}}^T+ \bm{\mathrm{W}} \bm{\mathrm{X}} \bm{\Omega}^T\rVert_{\mathrm{F}}^2=C\sum_{k=1}^d\sum_{i=1}^p(\sum_{\nu=1}^n \bm{\Omega}_i^\nu \bm{\mathrm{X}}_k^\nu \bm{\mathrm{W}}_{kl})^2. \end{align}
The Central Limit Theorem yields
ν=1nΩiνXkνWkl=O(n)Wkl,\begin{align} \sum_{\nu=1}^n \bm{\Omega}_i^\nu \bm{\mathrm{X}}_k^\nu \bm{\mathrm{W}}_{kl}=\mathcal{O}(\sqrt{n}) \bm{\mathrm{W}}_{kl}, \end{align}
hence
1ndΩXTWT+WXΩTop=O(ndpnd)=O(pn)\begin{align} \frac{1}{n\sqrt{d}}\lVert \bm{\Omega} \bm{\mathrm{X}}^T \bm{\mathrm{W}}^T+ \bm{\mathrm{W}} \bm{\mathrm{X}} \bm{\Omega}^T\rVert_{\mathrm{op}}=\mathcal{O}(\frac{\sqrt{ndp}}{n\sqrt{d}})=\mathcal{O}(\sqrt{\frac{p}{n}}) \end{align}
Putting all the contributions together yields
(UU~)opO(pdn)=O(ψpψn).\begin{align} \lVert (\bm{\mathrm{U}}-\tilde{\bm{\mathrm{U}}})\rVert_{\mathrm{op}}&\le \mathcal{O}(\frac{p}{\sqrt{dn}})=\mathcal{O}(\frac{\psi_p}{\sqrt{\psi_n}}). \end{align}
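This scaling can be probed numerically. The sketch below is ours; the coefficients $a_t, b_t, v_t, s_t$ are placeholders standing in for their MT definitions. It builds the Gaussian-equivalent matrices $\bm{\mathrm{U}}$ and $\tilde{\bm{\mathrm{U}}}$ for $\boldsymbol{\Sigma}=\bm{I}_d$ and checks that $\lVert\bm{\mathrm{U}}-\tilde{\bm{\mathrm{U}}}\rVert_{\mathrm{op}}\sqrt{\psi_n}/\psi_p$ stays of order one as $\psi_n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
d, psi_p, t = 50, 32, 0.01
at, bt, vt, st = 1.0, 0.1, 1.0, 0.1          # placeholder time-dependent coefficients
p = psi_p * d

def op_norm_diff(psi_n):
    """Operator norm of U - U_tilde for the Gaussian-equivalent model with Sigma = I_d."""
    n = psi_n * d
    W = rng.standard_normal((p, d))
    X = rng.standard_normal((d, n))          # Sigma = I_d, so X = Z
    Om = rng.standard_normal((p, n))
    WX = W @ X
    U = (np.exp(-2 * t) * at**2 * WX @ WX.T / (n * d)
         + bt**2 * W @ W.T / d
         + vt**2 * Om @ Om.T / n
         + np.exp(-t) * at * vt / (n * np.sqrt(d)) * (WX @ Om.T + Om @ WX.T)
         + st**2 * np.eye(p))
    U_tilde = (at**2 * np.exp(-2 * t) + bt**2) * W @ W.T / d + (st**2 + vt**2) * np.eye(p)
    return np.linalg.norm(U - U_tilde, 2)

for psi_n in (2, 4, 8, 16):
    # Lemma 8 predicts the rescaled quantity below to remain of order one
    print(psi_n, op_norm_diff(psi_n) * np.sqrt(psi_n) / psi_p)
```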

Proposition (Informal).

On timescales 1τψn1\ll \tau\ll \psi_n, both the train and test losses satisfy
LtrainLtest1O(Δt).\begin{align} \mathcal{L}_\mathrm{train}\simeq\mathcal{L}_\mathrm{test}\simeq 1-\mathcal{O}(\Delta_t). \end{align}
Proof: According to the spectral analysis of $\bm{\mathrm{U}}$ conducted previously, there are two bulks in the spectrum that contribute to the dynamics: a first bulk with eigenvalues of order $\frac{\psi_p}{\psi_n}$ and a second bulk with eigenvalues of order $\psi_p$, in the $\psi_p, \psi_n\gg1$ limit. Hence, in the regime $1\ll\tau\ll \psi_n$, $e^{-\lambda\frac{\Delta_t\tau}{\psi_p}}\simeq0$ if $\lambda$ is in the second bulk and $e^{-\lambda\frac{\Delta_t\tau}{\psi_p}}\simeq1$ if $\lambda$ is in the first bulk. We recall the expressions of the train and test losses
Ltrain(A)=1+ΔtdTr(ATpApU)+2ΔtdTr(ApV)Ltest(A)=1+ΔtdTr(ATpApU~)+2ΔtdTr(ApV~)\begin{align} &\mathcal{L}_\mathrm{train}(\bm{\mathrm{A}})=1+\frac{\Delta_t}{d} \operatorname{Tr}(\frac{\bm{\mathrm{A}}^T}{\sqrt{p}}\frac{\bm{\mathrm{A}}}{\sqrt{p}} \bm{\mathrm{U}})+\frac{2\sqrt{\Delta_t}}{d} \operatorname{Tr}(\frac{\bm{\mathrm{A}}}{\sqrt{p}} \bm{\mathrm{V}})\\ &\mathcal{L}_\mathrm{test}(\bm{\mathrm{A}})=1+\frac{\Delta_t}{d} \operatorname{Tr}(\frac{\bm{\mathrm{A}}^T}{\sqrt{p}}\frac{\bm{\mathrm{A}}}{\sqrt{p}}\tilde{\bm{\mathrm{U}}})+\frac{2\sqrt{\Delta_t}}{d} \operatorname{Tr}(\frac{\bm{\mathrm{A}}}{\sqrt{p}}\tilde{\bm{\mathrm{V}}}) \end{align}
and use the expression of $\bm{\mathrm{A}}(\tau)$ in Proposition 3, which we expand in the basis of eigenvectors $\{\bm{\mathrm{v}}_\lambda\}_{\lambda\in \mathrm{Sp}(\bm{\mathrm{U}})}$ of $\bm{\mathrm{U}}$.
A(τ)p=1ΔtVTU1(e2ΔtdUτIp)=1ΔtVTU1λ(e2Δtdλτ1)vλvλT1ΔtVTU1λρ2vλvλT,\begin{align} \frac{\bm{\mathrm{A}}(\tau)}{\sqrt{p}}&=\frac{1}{\sqrt{\Delta_t}} \bm{\mathrm{V}}^T \bm{\mathrm{U}}^{-1}(e^{-\frac{2\Delta_t}{d} \bm{\mathrm{U}} \tau}- \bm{I}_p)\\ &=\frac{1}{\sqrt{\Delta_t}} \bm{\mathrm{V}}^T \bm{\mathrm{U}}^{-1}\sum_\lambda(e^{-\frac{2\Delta_t}{d}\lambda \tau}-1) \bm{\mathrm{v}}_\lambda \bm{\mathrm{v}}_\lambda^T\\ &\sim -\frac{1}{\sqrt{\Delta_t}} \bm{\mathrm{V}}^T \bm{\mathrm{U}}^{-1}\sum_{\lambda\in \rho_2} \bm{\mathrm{v}}_\lambda \bm{\mathrm{v}}_\lambda^T, \end{align}
where λρ2\lambda\in\rho_2 means that the eigenvalue λ\lambda belongs to the second bulk. We also have that V\bm{\mathrm{V}} and V~\tilde{\bm{\mathrm{V}}} have the same GEP μ1(t)ΔtΓtWd\frac{\mu_1(t)\sqrt{\Delta_t}}{\Gamma_t}\frac{\bm{\mathrm{W}}}{\sqrt{d}} and they thus cancel each other when computing the generalization loss Lgen=LtestLtrain\mathcal{L}_{\mathrm{gen}}=\mathcal{L}_{\mathrm{test}}-\mathcal{L}_{\mathrm{train}}. It reads
Lgen=μ12(t)ΔtΓt2dTr(λ,λρ2vλvλTU1WWTdU1vλvλT(UU~))=μ12ΔtΓt2d(λ,λρ2vλTU1WWTdU1vλvλT(UU~)vλ)=μ12ΔtΓt2d(λ,λρ2vλT1λWWTd1λvλvλT(UU~)vλ)(12)\begin{align} \mathcal{L}_\mathrm{gen}&=-\frac{\mu_1^2(t)\Delta_t}{\Gamma_t^2d} \operatorname{Tr}(\sum_{\lambda, \lambda'\in\rho_2} \bm{\mathrm{v}}_{\lambda'} \bm{\mathrm{v}}_{\lambda'}^T \bm{\mathrm{U}}^{-1}\frac{\bm{\mathrm{W}} \bm{\mathrm{W}}^T}{d} \bm{\mathrm{U}}^{-1} \bm{\mathrm{v}}_{\lambda} \bm{\mathrm{v}}_{\lambda}^T(\bm{\mathrm{U}}-\tilde{\bm{\mathrm{U}}}))\\ &=-\frac{\mu_1^2\Delta_t}{\Gamma_t^2d}(\sum_{\lambda, \lambda'\in\rho_2} \bm{\mathrm{v}}_{\lambda'}^T \bm{\mathrm{U}}^{-1}\frac{\bm{\mathrm{W}} \bm{\mathrm{W}}^T}{d} \bm{\mathrm{U}}^{-1} \bm{\mathrm{v}}_{\lambda} \bm{\mathrm{v}}_{\lambda}^T(\bm{\mathrm{U}}-\tilde{\bm{\mathrm{U}}}) \bm{\mathrm{v}}_{\lambda'})\\ &=-\frac{\mu_1^2\Delta_t}{\Gamma_t^2d}(\sum_{\lambda, \lambda'\in\rho_2} \bm{\mathrm{v}}_{\lambda'}^T\frac{1}{\lambda'}\frac{\bm{\mathrm{W}} \bm{\mathrm{W}}^T}{d}\frac{1}{\lambda} \bm{\mathrm{v}}_{\lambda} \bm{\mathrm{v}}_{\lambda}^T(\bm{\mathrm{U}}-\tilde{\bm{\mathrm{U}}}) \bm{\mathrm{v}}_{\lambda'})\\ \end{align}\tag{12}
We then use Lemma 8 --- which states that the operator norm of UU~\bm{\mathrm{U}}-\tilde{\bm{\mathrm{U}}} in the subspace spanned by the eigenvectors of the second bulk is bounded by O(ψpψn)\mathcal{O}(\frac{\psi_p}{\sqrt{\psi_n}}) --- to bound Lgen\mathcal{L}_{\mathrm{gen}},
Lgenμ12ΔtΓt2d(λ,λρ2vλT1λWWTd1λvλvλT(UU~)vλ)opμ12ΔtΓt2dd1ψp2WWTdopψpψnO(dψp2dψp2ψn)=O(1ψn).\begin{align} \lvert \mathcal{L}_{\mathrm{gen}} \rvert &\le \lVert \frac{\mu_1^2\Delta_t}{\Gamma_t^2d}(\sum_{\lambda, \lambda'\in\rho_2} \bm{\mathrm{v}}_{\lambda'}^T\frac{1}{\lambda'}\frac{\bm{\mathrm{W}} \bm{\mathrm{W}}^T}{d}\frac{1}{\lambda} \bm{\mathrm{v}}_{\lambda} \bm{\mathrm{v}}_{\lambda}^T(\bm{\mathrm{U}}-\tilde{\bm{\mathrm{U}}}) \bm{\mathrm{v}}_{\lambda'})\rVert_{\mathrm{op}}\\ &\le \frac{\mu_1^2\Delta_t}{\Gamma_t^2d} d\frac{1}{\psi_p^2} \lVert \frac{\bm{\mathrm{W}} \bm{\mathrm{W}}^T}{d}\rVert_{\mathrm{op}}\frac{\psi_p}{\sqrt{\psi_n}}\le\mathcal{O}(\frac{d\psi_p^2}{d\psi_p^2\sqrt{\psi_n}})=\mathcal{O}(\frac{1}{\sqrt{\psi_n}}). \end{align}
We also used the fact that the sums contain $d$ terms (only the diagonal ones matter) and that the eigenvalues scale as $\psi_p$. The bound shows that $\mathcal{L}_{\mathrm{gen}}$ vanishes asymptotically in the regime of a large number of data points and parameters. Therefore, on the fast timescale we find $\mathcal{L}_\mathrm{train}\simeq \mathcal{L}_\mathrm{test}$. Let us now focus on $\mathcal{L}_\mathrm{train}$
Ltrain=1+μ12ΔtΓt2d(λ,λρ2vλT1λWWTd1λvλvλTUvλ)2Δtμ12Γt2dλρ2vλTWWTdU1vλ=1μ12ΔtΓt2dλρ21λvλTWWTdvλ.\begin{align} \mathcal{L}_\mathrm{train}& = 1+\frac{\mu_1^2\Delta_t}{\Gamma_t^2 d}(\sum_{\lambda, \lambda'\in\rho_2} \bm{\mathrm{v}}_{\lambda'}^T\frac{1}{\lambda'}\frac{\bm{\mathrm{W}} \bm{\mathrm{W}}^T}{d}\frac{1}{\lambda} \bm{\mathrm{v}}_{\lambda} \bm{\mathrm{v}}_{\lambda}^T \bm{\mathrm{U}} \bm{\mathrm{v}}_{\lambda'})-\frac{2\Delta_t\mu_1^2}{\Gamma_t^2d}\sum_{\lambda\in\rho_2} \bm{\mathrm{v}}_\lambda^T\frac{\bm{\mathrm{W}} \bm{\mathrm{W}}^T}{d} \bm{\mathrm{U}}^{-1} \bm{\mathrm{v}}_\lambda\\ &=1-\frac{\mu_1^2\Delta_t}{\Gamma_t^2d}\sum_{\lambda\in\rho_2}\frac{1}{\lambda} \bm{\mathrm{v}}_\lambda^T\frac{\bm{\mathrm{W}} \bm{\mathrm{W}}^T}{d} \bm{\mathrm{v}}_\lambda. \end{align}
There are $d$ terms in the sum, and the eigenvalues of $\bm{\mathrm{U}}$ and of $\frac{\bm{\mathrm{W}} \bm{\mathrm{W}}^T}{d}$ are both of order $\mathcal{O}(\psi_p)$, hence the sum divided by $d$ is a positive $\mathcal{O}(1)$ quantity. Thus, in this training-time regime, $1\ll\tau\ll \psi_n$, we obtain:
LtrainLtest=1O(Δt).\begin{align} \mathcal{L}_\mathrm{train}\sim\mathcal{L}_\mathrm{test}=1-\mathcal{O}(\Delta_t). \end{align}

D. Numerical experiments for Random Features

Details on the numerical experiments. All the numerical experiments for the RFNN were conducted using $\sigma=\tanh$ and $\sigma_{\bm{\mathrm{x}}}=1$ unless specified otherwise. At each step, the gradient of the loss was computed using the full batch of data points. The train loss was estimated by adding noise to each data point $N=100$ times. The test loss was computed by drawing $n$ new points from the data distribution and noising each one $N$ times. The error on the score was evaluated by drawing 10,000 points from the noisy distribution $P_t=\mathcal{N}(0, \Gamma_t^2 \bm{I}_d)$.
Effect of tt.

Figure 11: Generalization loss for different diffusion times $t$. Generalization loss $\mathcal{L}_{\mathrm{gen}}$ against (Left) training time $\tau$ and (Right) rescaled training time $\tau/\tau_{\mathrm{gen}}$ for $\psi_p=32$, $d=100$, and different $\psi_n$ and $t$.

We present plots for different diffusion times $t$ in Figure 11 and show that rescaling the training time $\tau$ by $\tau_{\mathrm{mem}}=\psi_p/(\Delta_t\lambda_\mathrm{min})$ also makes the loss curves collapse. Of particular interest is the behavior of $\tau_{\mathrm{mem}}$, and more specifically of the ratio $\tau_{\mathrm{mem}} / \tau_{\mathrm{gen}}$, at small $t$. Recall that
λmin=st2+vt2(1ψpψn)2.\lambda_{\min} = s_t^2 + v_t^2 \left(1 - \sqrt{\frac{\psi_p}{\psi_n}}\right)^2.
In the overparameterized regime pnp \gg n, this ratio is independent of tt since vt2μ2v_t^2\sim\mu_*^2 and st2ts_t^2\sim{t}. However, when pnp \sim n, a nontrivial scaling emerges: since λminst2t\lambda_{\min} \sim s_t^2 \sim t, it follows that
τmemτgen1t,\frac{\tau_{\mathrm{mem}}}{\tau_{\mathrm{gen}}} \sim \frac{1}{t},
implying that the two timescales become increasingly separated. It is unclear whether this behavior is related to specific properties of the learned score function or to the approach of the interpolation threshold. We leave this question for future investigation.
Experiments with σx21\sigma_{\bm{\mathrm{x}}}^2\neq1. In Figure 12, we present train and test loss curves for σx1\sigma_{\bm{\mathrm{x}}}\neq1. We see that our prediction of the timescale of memorization computed in the MT holds for general data variance.

Figure 12: Different $\sigma_{\bm{\mathrm{x}}}^2$. Train loss (solid line) and test loss (dotted line) for $\psi_p=64$, $t=0.1$, $d=100$, different $\psi_n$, and $\sigma_{\bm{\mathrm{x}}}=2$ (top) and $\sigma_{\bm{\mathrm{x}}}=0.5$ (bottom), against the training time $\tau$ and the rescaled training time $\tau/\tau_{\mathrm{mem}}$.

Scaling of Escore\mathcal{E}_{\mathrm{score}} with nn. In the RF model, the error with respect to the true score, as defined in the main text,
\begin{align} \mathcal{E}_{\mathrm{Score}} = \frac{1}{d} \mathbb{E}_{\bm{\mathrm{y}} \sim \mathcal{N}(0, \Gamma_t^2 \bm{I}_d)} \left[\left\lVert \bm{\mathrm{s}}_{\bm{\mathrm{A}}(\tau)}(\bm{\mathrm{y}}) + \frac{\bm{\mathrm{y}}}{\Gamma_t^2} \right\rVert^2 \right], \end{align}
serves as a measure of the generalization capability of the generative process. As shown in [45], the Kullback–Leibler divergence between the true data distribution PxP_{\bm{\mathrm{x}}} and the generated distribution P^\hat{P} can be upper bounded
DKL(PxP^)d2dtEScore(At),\begin{align} \mathcal{D}_{\mathrm{KL}}(P_{\bm{\mathrm{x}}} \, \|\, \hat{P}) \leq \frac{d}{2} \int \mathrm{d} t\, \mathcal{E}_{\mathrm{Score}}(\bm{\mathrm{A}}_t), \end{align}
where the integral is taken over the estimates of the parameter matrix $\bm{\mathrm{A}}$ at all diffusion times $t$. This bound assumes that the reverse dynamics are integrated exactly, starting from infinite time. In practical settings, however, one typically relies on an approximate scheme and initiates the reverse process at a large but finite time $T$. A generalization of this bound under such conditions can be found in [46]. We have numerically investigated the behaviour of $\mathcal{E}_{\mathrm{score}}$ in Figure 13. On the fast timescale $\tau_{\mathrm{gen}}$, it decreases down to a minimal value $\mathcal{E}_{\mathrm{score}}^*$ that depends only on $\psi_n$, following a power law $\psi_n^{-\eta}$ with $\eta \simeq 0.59$. We leave an accurate numerical estimate of $\eta$, and the development of a theory for it, to future work.
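For completeness, the Monte Carlo estimator of $\mathcal{E}_{\mathrm{Score}}$ described above can be sketched as follows (our illustration; `score_fn` stands for whatever trained score model $\bm{\mathrm{s}}_{\bm{\mathrm{A}}(\tau)}$ is being evaluated):

```python
import numpy as np

def score_error(score_fn, Gamma_t, d, n_samples=10_000, seed=0):
    """Monte Carlo estimate of E_Score = (1/d) E_y || score_fn(y) + y / Gamma_t^2 ||^2
    with y ~ N(0, Gamma_t^2 I_d), following the protocol described in this appendix."""
    rng = np.random.default_rng(seed)
    y = Gamma_t * rng.standard_normal((n_samples, d))   # samples from the noisy distribution P_t
    residual = score_fn(y) + y / Gamma_t**2             # deviation from the exact Gaussian score -y / Gamma_t^2
    return np.mean(np.sum(residual**2, axis=1)) / d
```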

Figure 13: Effect of ψn\psi_n on EScore\mathcal{E}_{\mathrm{Score}}^*. (Left) Error between the learned score and the true score EScore\mathcal{E}_{\mathrm{Score}} for ψp=32\psi_p = 32, t=0.01t = 0.01, and various values of ψn\psi_n. (Right) Minimum score error EScore=minτ[EScore(τ)]\mathcal{E}_{\mathrm{Score}}^* = \underset{\tau}{\min}[\mathcal{E}_{\mathrm{Score}}(\tau)] as a function of ψn\psi_n, showing a power-law decay with exponent approximately 0.59-0.59. The error bars correspond to thrice the standard deviation over 10 runs with new initial conditions.

Spectrum of U. In Figure 14, we compare the solutions of the equations of Theorem 1 to the histogram of finite size realizations of U\bm{\mathrm{U}}.

Figure 14: Spectrum of U\bm{\mathrm{U}}. Solutions of the equations in Theorem 3.1. (solid lines) and empirical spectrum for ρΣ(λ)=δ(λ1)\rho_{\boldsymbol{\Sigma}}(\lambda)=\delta(\lambda-1) and d=100d=100 (histogram). (Left) ψp=64\psi_p=64, ψn=8\psi_n=8, t=0.01t=0.01. (Right) ψp=64\psi_p=64, ψn=32\psi_n=32, t=0.01t=0.01.

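A finite-size histogram like those of Figure 14 can be obtained by diagonalizing one realization of the Gaussian-equivalent matrix $\bm{\mathrm{U}}$ written in the proof of Lemma 8; a minimal sketch (ours, with placeholder time-dependent coefficients and a reduced dimension for speed):

```python
import numpy as np

rng = np.random.default_rng(1)
d, psi_p, psi_n, t = 50, 64, 8, 0.01          # Figure 14 uses d = 100; d = 50 keeps this fast
at, bt, vt, st = 1.0, 0.1, 1.0, 0.1           # placeholder time-dependent coefficients
p, n = psi_p * d, psi_n * d

# One realization of the Gaussian-equivalent U with Sigma = I_d (cf. the proof of Lemma 8)
W = rng.standard_normal((p, d))
X = rng.standard_normal((d, n))
Om = rng.standard_normal((p, n))
WX = W @ X
U = (np.exp(-2 * t) * at**2 * WX @ WX.T / (n * d) + bt**2 * W @ W.T / d
     + vt**2 * Om @ Om.T / n
     + np.exp(-t) * at * vt / (n * np.sqrt(d)) * (WX @ Om.T + Om @ WX.T)
     + st**2 * np.eye(p))

eigs = np.linalg.eigvalsh(U)                             # histogram these against the theory curves
delta_weight = np.mean(np.isclose(eigs, st**2, atol=1e-6))
# delta_weight should be close to 1 - (1 + psi_n) / psi_p, i.e. 55/64 here
```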
Effect of Adam optimization. Numerical experiments with RFNN on Gaussian data show that the linear scaling of the memorization time with nn holds also for the Adam optimizer as shown in Figure 15.

Figure 15: Adam. Train loss (solid line) and test loss (dotted line) at $t=0.01$, $d=100$, $\psi_p=64$ for several $\psi_n$ with the PyTorch [50] implementation of Adam. The inset shows the effect of a rescaling of the training time by $n$.


References

[1] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. In Bach, F. and Blei, D., editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 2256–2265, Lille, France. PMLR.
[2] Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion probabilistic models.
[3] Song, Y. and Ermon, S. (2019). Generative modeling by estimating gradients of the data distribution. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
[4] Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. (2021b). Score-based generative modeling through stochastic differential equations.
[5] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. (2021). High-resolution image synthesis with latent diffusion models.
[6] Zhang, C., Zhang, C., Zheng, S., Zhang, M., Qamar, M., Bae, S.-H., and Kweon, I. S. (2023). A survey on audio diffusion models: Text to speech synthesis and enhancement in generative ai.
[7] Liu, Y., Zhang, K., Li, Y., Yan, Z., Gao, C., Chen, R., Yuan, Z., Huang, Y., Sun, H., Gao, J., He, L., and Sun, L. (2024). Sora: A review on background, technology, limitations, and opportunities of large vision models.
[8] Li, T., Biferale, L., Bonaccorso, F., and et al. (2024b). Synthetic lagrangian turbulence by generative diffusion models. Nat Mach Intell, 6:393–403.
[9] Price, I., Sanchez-Gonzalez, A., Alet, F., and et al. (2025). Probabilistic weather forecasting with machine learning. Nature, 637:84–90.
[10] Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., and Le, M. (2023). Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations.
[11] Li, S., Chen, S., and Li, Q. (2024a). A good score does not lead to a good generative model.
[12] Biroli, G., Bonnaire, T., de Bortoli, V., and Mézard, M. (2024). Dynamical regimes of diffusion models. Nature Communications, 15(9957). Open access.
[13] Kadkhodaie, Z., Guth, F., Simoncelli, E. P., and Mallat, S. (2024). Generalization in diffusion models arises from geometry-adaptive harmonic representations. In The Twelfth International Conference on Learning Representations.
[14] Kamb, M. and Ganguli, S. (2024). An analytic theory of creativity in convolutional diffusion models.
[15] Shah, K., Kalavasis, A., Klivans, A. R., and Daras, G. (2025). Does generation require memorization? creative diffusion models using ambient diffusion.
[16] Wu, Y.-H., Marion, P., Biau, G., and Boyer, C. (2025). Taking a big step: Large learning rates in denoising score matching prevent memorization.
[17] George, A. J., Veiga, R., and Macris, N. (2025). Denoising score matching with random features: Insights on diffusion models from precise learning curves.
[18] Rahaman, N., Baratin, A., Arpit, D., Draxler, F., Lin, M., Hamprecht, F., Bengio, Y., and Courville, A. (2019). On the spectral bias of neural networks. In International conference on machine learning, pages 5301–5310. PMLR.
[19] Gu, X., Du, C., Pang, T., Li, C., Lin, M., and Wang, Y. (2023). On memorization in diffusion models.
[20] Yoon, T., Choi, J. Y., Kwon, S., and Ryu, E. K. (2023). Diffusion probabilistic models generalize when they fail to memorize. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling.
[21] Ronneberger, O., Fischer, P., and Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In Navab, N., Hornegger, J., Wells, W. M., and Frangi, A. F., editors, Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241, Cham. Springer International Publishing.
[22] Liu, Z., Luo, P., Wang, X., and Tang, X. (2015). Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV).
[23] Carlini, N., Hayes, J., Nasr, M., Jagielski, M., Sehwag, V., Tramèr, F., Balle, B., Ippolito, D., and Wallace, E. (2023). Extracting training data from diffusion models. In Proceedings of the 32nd USENIX Conference on Security Symposium, SEC '23, USA. USENIX Association.
[24] Somepalli, G., Singla, V., Goldblum, M., Geiping, J., and Goldstein, T. (2023a). Diffusion art or digital forgery? investigating data replication in diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[25] Somepalli, G., Singla, V., Goldblum, M., Geiping, J., and Goldstein, T. (2023b). Understanding and mitigating copying in diffusion models. Advances in Neural Information Processing Systems, 36:47783–47803.
[26] Achilli, B., Ventura, E., Silvestri, G., Pham, B., Raya, G., Krotov, D., Lucibello, C., and Ambrogioni, L. (2024). Losing dimensions: Geometric memorization in generative diffusion.
[27] Ventura, E., Achilli, B., Silvestri, G., Lucibello, C., and Ambrogioni, L. (2025). Manifolds, random matrices and spectral gaps: The geometric phases of generative diffusion.
[28] Cui, H., Krzakala, F., Vanden-Eijnden, E., and Zdeborova, L. (2024). Analysis of learning a flow-based generative model from limited sample complexity. In The Twelfth International Conference on Learning Representations.
[29] Cui, H., Pehlevan, C., and Lu, Y. M. (2025). A precise asymptotic analysis of learning diffusion models: theory and insights.
[30] Wang, P., Zhang, H., Zhang, Z., Chen, S., Ma, Y., and Qu, Q. (2024). Diffusion models learn low-dimensional distributions via subspace clustering.
[31] Rahimi, A. and Recht, B. (2007). Random features for large-scale kernel machines. In Platt, J., Koller, D., Singer, Y., and Roweis, S., editors, Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc.
[32] Li, P., Li, Z., Zhang, H., and Bian, J. (2025). On the generalization properties of diffusion models.
[33] Xu, Z.-Q. J., Zhang, Y., Luo, T., Xiao, Y., and Ma, Z. (2020). Frequency principle: Fourier analysis sheds light on deep neural networks. Communications in Computational Physics, 28(5):1746–1767.
[34] Wang, B. (2025). An analytical theory of power law spectral bias in the learning dynamics of diffusion models.
[35] Hyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(24):695–709.
[36] Vincent, P. (2011). A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674.
[37] Robbins, H. and Monro, S. (1951). A stochastic approximation method. The annals of mathematical statistics, pages 400–407.
[38] Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In Bengio, Y. and LeCun, Y., editors, ICLR (Poster).
[39] Karras, T., Aittala, M., Aila, T., and Laine, S. (2022). Elucidating the design space of diffusion-based generative models.
[40] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. u., and Polosukhin, I. (2017). Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
[41] Song, J., Meng, C., and Ermon, S. (2022). Denoising diffusion implicit models.
[42] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. (2017). Gans trained by a two time-scale update rule converge to a local nash equilibrium.
[43] Mei, S., Misiakiewicz, T., and Montanari, A. (2019). Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. In Beygelzimer, A. and Hsu, D., editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 2388–2464. PMLR.
[44] D'Ascoli, S., Refinetti, M., Biroli, G., and Krzakala, F. (2020). Double trouble in double descent: Bias and variance(s) in the lazy regime. In III, H. D. and Singh, A., editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 2280–2290. PMLR.
[45] Song, Y., Durkan, C., Murray, I., and Ermon, S. (2021a). Maximum likelihood training of score-based diffusion models.
[46] Bortoli, V. D. (2022). Convergence of denoising diffusion models under the manifold hypothesis. Transactions on Machine Learning Research. Expert Certification.
[47] Péché, S. (2019). A note on the pennington-worah distribution. Electronic Communications in Probability, 24:1–7.
[48] Goldt, S., Loureiro, B., Reeves, G., Krzakala, F., Mézard, M., and Zdeborová, L. (2021). The gaussian equivalence of generative models for learning with shallow neural networks.
[49] Hu, H. and Lu, Y. M. (2023). Universality laws for high-dimensional learning with random features. IEEE Transactions on Information Theory, 69(3):1932–1964.
[50] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. (2019). Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, volume 32, pages 8024–8035. Curran Associates, Inc.
[51] Gerace, F., Loureiro, B., Krzakala, F., Mezard, M., and Zdeborova, L. (2020). Generalisation error in learning with random features and the hidden manifold model. In III, H. D. and Singh, A., editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3452–3462. PMLR.
[52] Mei, S. and Montanari, A. (2020). The generalization error of random features regression: Precise asymptotics and double descent curve.
[53] Bodin, A. P. M. (2024). Random Matrix Methods for High-Dimensional Machine Learning Models. PhD thesis, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland.
[54] Mézard, M., Parisi, G., and Virasoro, M. A. (1987). Spin Glass Theory and Beyond: An Introduction to the Replica Method and Its Applications, volume 9 of Lecture Notes in Physics. World Scientific Publishing Company, Singapore.
[55] Albergo, M. S., Boffi, N. M., and Vanden-Eijnden, E. (2023). Stochastic interpolants: A unifying framework for flows and diffusions.
[56] Ho, J. and Salimans, T. (2022). Classifier-free diffusion guidance.
[57] Adomaityte, U., Defilippis, L., Loureiro, B., and Sicuro, G. (2024). High-dimensional robust regression under heavy-tailed data: asymptotics and universality. Journal of Statistical Mechanics: Theory and Experiment, 2024(11):114002.
[58] Favero, A., Sclocchi, A., and Wyart, M. (2025). Bigger isn't always memorizing: Early stopping overparameterized diffusion models.
[59] Wen, Y., Liu, Y., Chen, C., and Lyu, L. (2024). Detecting, explaining, and mitigating memorization in diffusion models.
[60] Chen, C., Liu, D., and Xu, C. (2024). Towards memorization-free diffusion models.
[61] Kibble, W. F. (1945). An extension of a theorem of Mehler's on Hermite polynomials. Mathematical Proceedings of the Cambridge Philosophical Society, 41(1):12–15.
[62] Bach, F. (2023). Polynomial magic iii: Hermite polynomials. https://francisbach.com/hermite-polynomials/ Accessed: 2025-10-09.
[63] Potters, M. and Bouchaud, J.-P. (2020). A First Course in Random Matrix Theory: for Physicists, Engineers and Data Scientists. Cambridge University Press.
[64] Silverstein, J. and Bai, Z. (1995). On the empirical distribution of eigenvalues of a class of large dimensional random matrices. Journal of Multivariate Analysis, 54(2):175–192.
[65] Bai, Z. and Zhou, W. (2008). Large sample covariance matrices without independence structures in columns. Statistica Sinica, 18(2):425–442.