Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training

Tony Bonnaire^\dagger, LPENS, Université PSL, Paris, [email protected]
Raphaël Urfin^\dagger, LPENS, Université PSL, Paris, [email protected]
Giulio Biroli, LPENS, Université PSL, Paris, [email protected]
Marc Mézard, Department of Computing Sciences, Bocconi University, Milano, [email protected]
\daggerEqual contribution.

Abstract

Diffusion models have achieved remarkable success across a wide range of generative tasks. A key challenge is understanding the mechanisms that prevent their memorization of training data and allow generalization. In this work, we investigate the role of the training dynamics in the transition from generalization to memorization. Through extensive experiments and theoretical analysis, we identify two distinct timescales: an early time τgen\tau_\mathrm{gen} at which models begin to generate high-quality samples, and a later time τmem\tau_\mathrm{mem} beyond which memorization emerges. Crucially, we find that τmem\tau_\mathrm{mem} increases linearly with the training set size nn, while τgen\tau_\mathrm{gen} remains constant. This creates a window of training times, growing with nn, in which models generalize effectively, despite showing strong memorization if training continues beyond it. It is only when nn becomes larger than a model-dependent threshold that overfitting disappears at infinite training times. These findings reveal a form of implicit dynamical regularization in the training dynamics, which allows models to avoid memorization even in highly overparameterized settings. Our results are supported by numerical experiments with standard U-Net architectures on realistic and synthetic datasets, and by a theoretical analysis using a tractable random features model studied in the high-dimensional limit.

1. Introduction

Diffusion Models [DMs, 1, 2, 3, 4] achieve state-of-the-art performance in a wide variety of AI tasks such as the generation of images [5], audio [6], videos [7], and scientific data [8, 9]. This class of generative models, inspired by out-of-equilibrium thermodynamics [1], corresponds to a two-stage process: the first one, called forward, gradually adds noise to the data, whereas the second one, called backward, generates new data by denoising Gaussian white noise samples. In DMs, the reverse process typically involves solving a stochastic differential equation (SDE) with a force field called score. However, it is also possible to define a deterministic transport through an ordinary differential equation (ODE), treating the score as a velocity field, an approach that is for instance followed in flow matching [10].
Understanding the generalization properties of score-based generative methods is a central issue in machine learning, and a particularly important question is how memorization of the training set is avoided in practice. A model without regularization achieving zero training loss only learns the empirical score, and is bound to reproduce samples of the training dataset at the end of the backward process. This memorization regime [11, 12] is empirically observed when the training set is small and disappears when it increases beyond a model-dependent threshold [13]. Understanding the mechanisms controlling this change of regimes from memorization to generalization is a central challenge for both theory and applications. Model regularization and inductive biases imposed by the network architecture were shown to play a role [14, 15], as well as a dynamical regularization due to the finiteness of the learning rate [16]. However, the regime shift described above is consistently observed even in models where all these regularization mechanisms are present. This suggests that the core mechanism behind the transition from memorization to generalization lies elsewhere. In this work, we demonstrate -- first through numerical experiments, and then via the theoretical analysis of a simplified model -- that this transition is driven by an implicit dynamical bias towards generalizing solutions that emerges during training, and which allows the model to avoid the memorization phase.

Figure 1: Qualitative summary of our contributions. (Left) Illustration of the training dynamics of a diffusion model. Depending on the training time τ\tau, we identify three regimes measured by the inverse quality of the generated samples (blue curve) and their memorization fraction (red curve). The generalization regime extends over a large window of training times which increases with the training set size nn. On top, we show a one dimensional example of the learned score function during training (orange). The gray line gives the exact empirical score, at a given noise level, while the black dashed line corresponds to the true (population) score. (Right) Phase diagram in the (n,p)(n, p) plane illustrating three regimes of diffusion models: Memorization when nn is sufficiently small at fixed pp, Architectural Regularization for n>n(p)n>n^{\star}(p) (which is model and dataset dependent, as discussed in [17, 14]), and Dynamical Regularization, corresponding to a large intermediate generalization regime obtained when the training dynamics is stopped early, i.e. τ[τgen,τmem]\tau \in \left[\tau_\mathrm{gen}, \tau_\mathrm{mem}\right].

Contributions and theoretical picture. We investigate the dynamics of score learning using gradient descent, both numerically and analytically, and study the generation properties of the score depending on the time τ\tau at which the training is stopped. The theoretical picture built from our results and combining several findings from the recent literature is illustrated in Figure 1. The two main parameters are the size of the training set nn and the expressivity of the class of score functions on which one trains the model, characterized by a number of parameters pp; when both nn and pp are large one can identify three main regimes. Given pp, if nn is larger than n(p)n^*(p) (which depends on the training set and on the class of scores), the score model is not expressive enough to represent the empirical score associated with the nn training data, and instead provides a smooth interpolation, approximately independent of the training set. In this regime, even with a very large training time τ\tau\to\infty, memorization does not occur because the model is regularized by its architecture and the finite number of parameters. When n<n(p)n<n^*(p) the model is expressive enough to memorize, and two timescales emerge during training: one, τgen\tau_\mathrm{gen}, is the minimum training time required to achieve high-quality data generation; the second, τmem>τgen\tau_\mathrm{mem}>\tau_\mathrm{gen}, signals when further training induces memorization, and causes the model to increasingly reproduce the training samples (left panel). The first timescale, τgen\tau_\mathrm{gen}, is found to be independent of nn, whereas the second, τmem\tau_\mathrm{mem}, grows approximately linearly with nn, thus opening a large window of training times during which the model generalizes if training is stopped early, i.e. when τ[τgen,τmem]\tau \in [\tau_\mathrm{gen}, \tau_\mathrm{mem}].
Our results show that implicit dynamical regularization in training plays a crucial role in score-based generative models, substantially enlarging the generalization regime (see right panel of Figure 1), and hence allowing models to avoid memorization even in highly overparameterized settings. We find that the key mechanism behind the widening gap between τgen\tau_\mathrm{gen} and τmem\tau_\mathrm{mem} is the irregularity of the empirical score at low noise level and large nn. In this regime the models used to approximate the score provide a smooth interpolation that remains stable over a long window of training times and closely approximates the population score, a behavior likely rooted in the spectral bias of neural networks [18]. Only at very long training times do the dynamics converge to the low-lying minimum corresponding to the empirical score, leading to memorization (as illustrated in the one-dimensional examples in the left panel of Figure 1).
The theoretical picture described above is based on our numerical and analytical results, and builds on previous works, in particular numerical analyses characterizing the memorization--generalization transition [19, 20], analytical works on memorization of DMs [17, 14, 13], and studies on the spectral bias of deep neural networks [18]. Our numerical experiments use a class of scores based on a realistic U-Net [21] trained on downscaled images of the CelebA dataset [22]. By varying nn and pp, we measure the evolution of the sample quality (through FID) and the fraction of memorization during learning, which support the theoretical scenario presented in Figure 1. Additional experimental results on synthetic data are provided in Supplemental Material (SM, Sects. Appendix A and Appendix B). On the analytical side, we focus on a class of scores constructed from random features and simplified models of data, following [17]. In this setting, the timescales of training dynamics correspond directly to the inverse eigenvalues of the random feature correlation matrix. Leveraging tools from random matrix theory, we compute the spectrum in the limit of large datasets, high-dimensional data, and overparameterized models. This analysis reveals, in a fully tractable way, how the theoretical picture of Figure 1 emerges within the random feature framework.
Related works.
  • The memorization transition in DMs has been the subject of several recent empirical investigations [23, 24, 25] which have demonstrated that state-of-the-art image DMs -- including Stable Diffusion and DALL·E -- can reproduce a non-negligible portion of their training data, indicating a form of memorization. Several additional works [19, 20] examined how this phenomenon is influenced by factors such as data distribution, model configuration, and training procedure, and provide a strong basis for the numerical part of our work.
  • A series of theoretical studies in the high-dimensional regime have analyzed the memorization--generalization transition during the generative dynamics under the empirical score assumption [12, 26, 27], showing how trajectories are attracted to the training samples. Within this high-dimensional framework, [28, 29, 30, 17] study the score learning for various model classes. In particular, [17] uses a Random Feature Neural Network [31]. The authors compute the asymptotic training and test losses for τ\tau\rightarrow\infty and relate it to memorization.
    The theoretical part of our work generalizes this approach to study the role of training dynamics and early stopping in the memorization--generalization transition.
  • Recent works have also uncovered complementary sources of implicit regularization explaining how DMs avoid memorization. Architectural biases and limited network capacity were for instance shown to constrain memorization in [14, 13], and finiteness of the learning rate prevents the model from learning the empirical score in [16]. Also related to our analysis, [32] provides general bounds showing the beneficial role of early stopping the training dynamics to enhance generalization for finitely supported target distributions, as well as a study of its effect for one-dimensional Gaussian mixtures.
  • Finally, previous studies on supervised learning [18, 33], and more recently on DMs [34], have shown that deep neural networks display a frequency-dependent learning speed, and hence a learning bias towards low frequency functions.
    This fact plays an important role in the results we present since the empirical score contains a low frequency part that is close to the population score, and a high-frequency part that is dataset-dependent. To the best of our knowledge, the training time to learn the high-frequency part and hence memorize, that we find to scale with nn, has not been studied from this perspective in the context of score-based generative methods.
Setting: generative diffusion and score learning. Standard DMs define a transport from a target distribution P0P_0 in Rd\mathbb{R}^d to a Gaussian white noise N(0,Id)\mathcal{N}(0, \bm{I}_d) through a forward process defined as an Ornstein-Uhlenbeck (OU) stochastic differential equation (SDE):
dx=x(t)dt+dB(t),(1)\begin{align} \mathrm{d} \bm{\mathrm{x}} = - \bm{\mathrm{x}}(t) \mathrm{d} t + \mathrm{d} \bm{\mathrm{B}}(t), \end{align}\tag{1}
where dB(t)\mathrm{d} \bm{\mathrm{B}}(t) is the square root of two times the increment of a standard Wiener process. Generation is performed by time-reversing the SDE Equation 1 using the score function s(x,t)=xlogPt(x)\bm{\mathrm{s}}(\bm{\mathrm{x}}, t) = \nabla_{\bm{\mathrm{x}}} \log P_t(\bm{\mathrm{x}}),
dx=[x(t)+2s(x,t)]dt+dB(t),(2)\begin{align} - \mathrm{d} \bm{\mathrm{x}} = \left[\bm{\mathrm{x}}(t) + 2 \bm{\mathrm{s}}(\bm{\mathrm{x}}, t) \right] \mathrm{d} t + \mathrm{d} \bm{\mathrm{B}}(t), \end{align}\tag{2}
where Pt(x)P_t(\bm{\mathrm{x}}) is the probability density at time tt along the forward process, and the noise dB(t)\mathrm{d} \bm{\mathrm{B}}(t) is again the square root of two times the increment of a standard Wiener process. As shown in the seminal works [35, 36], s(x,t)\bm{\mathrm{s}}(\bm{\mathrm{x}}, t) can be obtained by minimizing the score matching loss
s^(x,t)=argminsExP0,ξN(0,Id)[Δts(x(t),t)+ξ2],(3)\begin{align} \hat{\bm{\mathrm{s}}}(\bm{\mathrm{x}}, t)= \arg\min_{\bm{\mathrm{s}}} \mathbb{E}_{\bm{\mathrm{x}} \sim P_0, \bm{\xi} \sim \mathcal{N}(0, \bm{I}_d)} \left[\lVert \sqrt{\Delta_t} \bm{\mathrm{s}}(\bm{\mathrm{x}}(t), t)+ \bm{\xi}\rVert^2 \right], \end{align}\tag{3}
where Δt=1e2t\Delta_t=1-e^{-2t}. In practice, the optimization problem is restricted to a parametrized class of functions sθ(x(t),t)\bm{\mathrm{s}}_{\bm{\theta}}(\bm{\mathrm{x}}(t), t) defined, for example, by a neural network with parameters θ\bm{\theta}. The expectation over x\bm{\mathrm{x}} is replaced by the empirical average over the training set (nn iid samples xν\bm{\mathrm{x}}^\nu drawn from P0P_0),
Lt(θ,{xν}ν=1n)=1nν=1nEξN(0,Id)[Δtsθ(xν(t))+ξ2],(4)\begin{align} \mathcal{L}_t(\bm{\theta}, \{\bm{\mathrm{x}}^\nu\}_{\nu=1}^n) = \frac{1}{n} \sum_{\nu=1}^n \mathbb{E}_{\bm{\xi} \sim \mathcal{N}(0, \bm{I}_d)} \left[\lVert \sqrt{\Delta_t} \bm{\mathrm{s}}_{\bm{\theta}}(\bm{\mathrm{x}}^\nu(t))+ \bm{\xi}\rVert^2\right], \end{align}\tag{4}
where xtν(ξ)=etxν+Δtξ\bm{\mathrm{x}}^\nu_t(\bm{\xi})=e^{-t} \bm{\mathrm{x}}^\nu+\sqrt{\Delta_t} \bm{\xi}. The loss in (Equation 4) can be minimized with standard optimizers, such as stochastic gradient descent [SGD, 37] or Adam [38]. In practice, a single model conditioned on the diffusion time tt is trained by integrating (Equation 4) over time [39]. The solution of the minimization of Equation 4 is the so-called empirical score (e.g. [12, 11]), defined as semp(x,t)=xlogPtemp(x)\bm{\mathrm{s}}_{\mathrm{emp}}(\bm{\mathrm{x}}, t) = \nabla_{\bm{\mathrm{x}}} \log P_t^\mathrm{emp}(\bm{\mathrm{x}}), with
Ptemp(x)=1n(2πΔt)d/2ν=1ne12Δtxxνet22.(5)P_t^\mathrm{emp}(\bm{\mathrm{x}}) = \frac{1}{n\left(2\pi\Delta_t\right)^{d/2}} \sum_{\nu=1}^n e^{-\frac{1}{2\Delta_t} \lVert \bm{\mathrm{x}}- \bm{\mathrm{x}}^{\nu} e^{-t} \rVert^2_2}.\tag{5}
This solution is known to inevitably recreate samples of the training set at the end of the generative process (i.e., it perfectly memorizes), unless nn grows exponentially with the dimension dd [12]. However, this is not the case in many practical applications, where memorization is only observed for relatively small values of nn and disappears well before nn becomes exponentially large in dd. The empirical minimization performed in practice, within a given class of models and a given minimization procedure, does not drive the optimization to the global minimum of Equation 4, but instead to a smoother estimate of the score that is essentially independent of the training set and has good generalization properties [13], much like the global minimum of Equation 3 would. Understanding how this is possible, and in particular the role played by the training dynamics in avoiding memorization, is the central aim of the present work.
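To make the empirical score of Equation 5 concrete, here is a minimal NumPy sketch (ours, not the paper's code; function and variable names are illustrative) that evaluates it on a toy training set by differentiating the log of the Gaussian mixture.

```python
import numpy as np

def empirical_score(x, t, train):
    """Empirical score s_emp(x, t) = grad_x log P_t^emp(x) of Eq. 5:
    a mixture of n Gaussians centred on e^{-t} x^nu with variance Delta_t."""
    Delta_t = 1.0 - np.exp(-2.0 * t)
    centers = np.exp(-t) * train                      # (n, d) shrunk training points
    diff = x[None, :] - centers                       # (n, d)
    logw = -0.5 * np.sum(diff**2, axis=1) / Delta_t   # unnormalized log-weights
    w = np.exp(logw - logw.max())                     # numerically stable softmax
    w /= w.sum()
    # gradient of the log of a Gaussian mixture with equal widths
    return -(x - w @ centers) / Delta_t

rng = np.random.default_rng(0)
train = rng.normal(size=(8, 2))                       # n = 8 toy samples in d = 2
print(empirical_score(rng.normal(size=2), t=0.1, train=train))
```

At small noise levels (small tt), this score develops sharp basins around each training point, illustrating the low-noise irregularity of the empirical score invoked in the introduction.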

2. Generalization and memorization during training of diffusion models


Figure 2: Memorization transition as a function of the training set size nn for U-Net score models on CelebA. (Left) FID (solid lines, left axis) and memorization fraction fmemf_\mathrm{mem} (dashed lines, right axis) against training time τ\tau for various nn. Inset: normalized memorization fraction fmem(τ)/fmem(τmax)f_\mathrm{mem}(\tau)/f_\mathrm{mem}(\tau_\mathrm{max}) with the rescaled time τ/n\tau/n. (Middle) Training (solid lines) and test (dashed lines) loss with τ\tau for several nn at fixed t=0.01t=0.01. Inset: both losses plotted against τ/n\tau/n. Error bars on the losses are imperceptible. (Right) Generated samples from the model trained with n=1024n=1024 for τ=100\tau=100 K or τ=1.62\tau=1.62 M steps, along with their nearest neighbors in the training set.

Data & architecture. We conduct our experiments on the CelebA face dataset [22], which we convert to grayscale downsampled images of size d=32×32d=32\times32, and vary the training set size nn from 128 up to 32768. Our score model has a U-Net architecture [21] with three resolution levels and a base channel width of WW with multipliers 1, 2 and 3 respectively. All our networks are DDPMs [2] trained to predict the injected noise at diffusion time tt using SGD with momentum at fixed batch size min(n,512)\min(n, 512). The models are all conditioned on tt, i.e. a single model approximates the score at all times, and make use of a standard sinusoidal position embedding [40] that is added to the features of each resolution. More details about the numerical setup can be found in SM (Appendix A).
Evaluation metrics. To study the transition from generalization to memorization, we monitor the loss Equation 4 during training at a fixed diffusion time t=0.01t = 0.01. At various numbers of SGD updates τ\tau, we compute the loss on all nn training examples (training loss) and on a held-out test set of 2048 images (test loss). To characterize the score obtained after a training time τ\tau, we assess the originality and quality of samples by generating 10K samples using DDIM accelerated sampling [41]. We compute (i) the Fréchet-Inception Distance [FID, 42] against 10K test samples, which we use to identify the generalization time τgen\tau_\mathrm{gen}; and (ii) the fraction of memorized generated samples fmem(τ)f_\mathrm{mem}(\tau), which gives access to τmem\tau_\mathrm{mem}, the memorization time. Following previous numerical studies [20, 19], a generated sample xτ\bm{\mathrm{x}}_\tau is considered memorized if
Exτ[xτaμ12xτaμ22]<k,(6)\mathbb{E}_{\bm{\mathrm{x}}_\tau} \left[\frac{\lVert \bm{\mathrm{x}}_\tau - \bm{\mathrm{a}}^{\mu_1}\rVert_2}{\lVert \bm{\mathrm{x}}_\tau - \bm{\mathrm{a}}^{\mu_2}\rVert_2} \right] < k,\tag{6}
where aμ1\bm{\mathrm{a}}^{\mu_1} and aμ2\bm{\mathrm{a}}^{\mu_2} are the nearest and second nearest neighbors of xτ\bm{\mathrm{x}}_\tau in the training set in the L2L_2 sense. In what follows, we work with k=1/3k=1/3 [20, 19], but we checked that varying kk to 1/2 or 1/4 does not impact the claims about the scaling. Error bars in the figures correspond to twice the standard deviation over 5 different test sets for FIDs, and 5 noise realizations for Ltrain\mathcal{L}_\mathrm{train} and Ltest\mathcal{L}_\mathrm{test}. For fmemf_\mathrm{mem}, we report the 95% CIs on the mean evaluated with 1,000 bootstrap samples.
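As an illustration of the criterion in Equation 6, the short NumPy sketch below (ours, not the paper's evaluation code) applies the nearest-neighbor ratio test per generated sample and returns the memorization fraction; sizes and names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

def memorization_fraction(generated, train, k=1/3):
    """Fraction of generated samples whose L2 distance to the nearest training
    image is smaller than k times the distance to the second nearest one."""
    d = cdist(generated.reshape(len(generated), -1),
              train.reshape(len(train), -1))          # (n_gen, n_train) distances
    d.sort(axis=1)
    return np.mean(d[:, 0] / d[:, 1] < k)

rng = np.random.default_rng(0)
train = rng.normal(size=(256, 16 * 16))
generated = np.concatenate([rng.normal(size=(90, 16 * 16)),                       # novel samples
                            train[:10] + 1e-3 * rng.normal(size=(10, 16 * 16))])  # near-copies
print(memorization_fraction(generated, train))        # ~0.10: only the near-copies count
```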
Role of training set size on the learning dynamics. At fixed model capacity (p=4×106p=4\times 10^6, base width W=32W=32), we investigate how the training set size nn impacts the previous metrics. In the left panel of Figure 2, we first report the FID (solid lines) and fmem(τ)f_\mathrm{mem}(\tau) (dashed lines) for various nn. All training dynamics exhibit two phases. First, the FID quickly decreases to reach a minimum value on a timescale τgen\tau_\mathrm{gen} (approximately 100 K SGD updates) that does not depend on nn. In the right panel, the generated samples at τ=100\tau=100 K clearly differ from their nearest neighbors in the training set, indicating that the model generalizes correctly. Beyond this time, the FID remains flat. fmem(τ)f_\mathrm{mem}(\tau) is zero until a later time τmem\tau_\mathrm{mem} after which it increases, clearly signaling the entrance into a memorization regime, as illustrated by the generated samples in the right-most panel of Figure 2, which are very close to their nearest neighbors. Both the transition time τmem\tau_\mathrm{mem} and the value of the final fraction fmem(τmax)f_\mathrm{mem}(\tau_\mathrm{max}) (with τmax\tau_\mathrm{max} being one to four million SGD steps) vary with nn. The inset plot shows the normalized memorization fraction fmem(τ)/fmem(τmax)f_\mathrm{mem}(\tau)/f_\mathrm{mem}(\tau_\mathrm{max}) against the rescaled time τ/n\tau/n, making all curves collapse and increase at around τ/n300\tau/n \approx 300, showing that τmemn\tau_\mathrm{mem} \propto n, and demonstrating the existence of a generalization window for τ[τgen,τmem]\tau \in \left[\tau_\mathrm{gen}, \tau_\mathrm{mem}\right] that widens linearly with nn, as illustrated in the left panel of Figure 1.
As highlighted in the introduction, memorization in DMs is ultimately driven by the overfitting of the empirical score semp(x,t)\bm{\mathrm{s}}_\mathrm{emp}(\bm{\mathrm{x}}, t). The evolution of Ltrain(τ)\mathcal{L}_\mathrm{train}(\tau) and Ltest(τ)\mathcal{L}_\mathrm{test}(\tau) at fixed t=0.01t=0.01 is shown in the middle panel of Figure 2 for nn ranging from 512 to 32768. Initially, the two losses remain nearly indistinguishable, indicating that the learned score sθ(x,t)\bm{\mathrm{s}}_{\bm{\theta}}(\bm{\mathrm{x}}, t) does not depend on the training set. Beyond a critical time, Ltrain\mathcal{L}_\mathrm{train} continues to decrease while Ltest\mathcal{L}_\mathrm{test} increases, leading to a nonzero generalization loss whose magnitude depends on nn. As nn increases, this critical time also increases and, eventually, the training and test loss gap shrinks: for n=32768n=32768, the test loss remains close to the training loss, even after 11 million SGD steps. The inset shows the evolution of both losses with τ/n\tau/n, demonstrating that the overfitting time scales linearly with the training set size nn, just like the time τmem\tau_\mathrm{mem} identified in the left panel. Moreover, there is a consistent lag between the overfitting time and τmem\tau_\mathrm{mem} at fixed nn, reflecting the additional training required for the model to overfit the empirical score sufficiently to reproduce the training samples, and therefore to impact the memorization fraction.
Memorization is not due to data repetition. We must stress that this delayed memorization with nn is not due to the mere repetition of training samples, as one might at first expect. In SM Sects. Appendix A and Appendix B, we show that full-batch updates still yield τmemn\tau_\mathrm{mem}\propto n. In other words, even if at fixed τ\tau all models have processed each sample equally often, larger training sets consistently postpone memorization. This confirms that memorization in DMs is driven by a fundamental nn-dependent change in the loss landscape -- not by sample repetition during training.

Figure 3: Effect of the number of parameters in the U-Net architecture on the timescales of the training dynamics. (Left) FID (panels A, B) and normalized memorization fraction fmem(τ)/fmem(τmax)f_\mathrm{mem}(\tau)/f_\mathrm{mem}(\tau_\mathrm{max}) (panels C, D) for various nn and WW during training. In panels B and D, time is rescaled such that all curves collapse. (Right) (n,p)(n, p) phase diagram of generalization vs memorization for U-Nets trained on CelebA. Curves show, for τ{τgen,3τgen,8τgen}\tau \in \{\tau_\mathrm{gen}, 3\tau_\mathrm{gen}, 8\tau_\mathrm{gen}\}, the minimal dataset size n(p)n(p) satisfying fmem(τ)=0f_\mathrm{mem}(\tau)=0. The shaded background indicates the memorization--generalization boundary for τ=τgen\tau=\tau_\mathrm{gen}.

Effect of the model capacity. To study more precisely the role of the model capacity on the memorization--generalization transition, we vary the number of parameters pp by changing the U-Net base width W{8,16,32,48,64}W \in \{8, 16, 32, 48, 64\}, resulting in a total of p{0.26,1,4,9,16}×106p\in\{0.26, 1, 4, 9, 16\}\times10^6 parameters. In the left panel of Figure 3, we plot both the FID (top row) and the normalized memorization fraction (bottom row) as functions of τ\tau for several widths WW and training set sizes nn. Panels A and C demonstrate that higher-capacity networks (larger WW) achieve high-quality generation and begin to memorize earlier than smaller ones. Panels B and D show that the two characteristic timescales simply scale as τgenW1\tau_\mathrm{gen} \propto W^{-1} and τmemnW1\tau_\mathrm{mem} \propto nW^{-1}. In particular, this implies that, for W>8W>8, the critical training set size ngm(p)n_\mathrm{gm}(p) at which τmem=τgen\tau_\mathrm{mem}=\tau_\mathrm{gen} is approximately independent of pp (at least over the limited range of pp we explored). When n>ngm(p)n>n_\mathrm{gm}(p), the interval [τgen,τmem]\left[\tau_\mathrm{gen}, \tau_\mathrm{mem}\right] opens up, so that early stopping within this window yields high-quality samples without memorization. In the right panel of Figure 3, we display this boundary (solid line) in the (n,p)(n, p) plane by fixing the training time to τ=τgen\tau=\tau_\mathrm{gen}, which we identify numerically using the collapse of all FIDs at around Wτgen3×106W\tau_\mathrm{gen}\approx 3\times 10^6 (see panel B), and computing the smallest nn such that fmem(τ)=0f_\mathrm{mem}(\tau)=0. The resulting solid curve delineates two regimes: below the curve, memorization already starts at τgen\tau_\mathrm{gen}; above the curve, the models generalize perfectly under early stopping. We repeat this experiment for τ=3τgen\tau=3\tau_\mathrm{gen} and τ=8τgen\tau=8\tau_\mathrm{gen}, showing saturation to larger and larger pp as τ\tau increases. Eventually, for τ\tau \to \infty, we expect these successive boundaries to converge to the architectural regularization threshold n(p)n^\star(p), i.e. the point beyond which the network avoids memorization because it is not expressive enough, as found in [17] and highlighted in the right panel of Figure 1. In order to estimate n(p)n^\star(p), we measure for a given τ\tau the smallest n(τ)n(\tau) yielding fmem0f_\mathrm{mem}\approx0. The curve n(τ)n(\tau) approaches n(p)n^\star(p) for large τ\tau. We therefore estimate n(p)n^\star(p) by measuring the asymptotic value of n(τ)n(\tau), which in practice is already reached at τ=τmax=2\tau=\tau_\mathrm{max}=2 M updates for the values of WW we focus on.

3. Training dynamics of a Random Features Network

Notations. We use bold symbols for vectors and matrices. The L2L^2 norm of a vector x\bm{\mathrm{x}} is denoted by x=(ixi2)1/2\lVert \bm{\mathrm{x}} \rVert = (\sum_i \bm{\mathrm{x}}_i^2)^{1/2}. We write f=O(g)f = \mathcal{O}(g) to mean that in the limit n,pn, p \to \infty, there exists a constant CC such that fCg\lvert f \rvert \leq C \lvert g \rvert.
Setting. We study analytically a model introduced in [17], where the data lie in dd dimensions. We parametrize the score with a Random Features Neural Network [RFNN, 31]
sA(x)=Apσ(Wxd).(7)\begin{align} \bm{\mathrm{s}}_{\bm{\mathrm{A}}}(\bm{\mathrm{x}})=\frac{\bm{\mathrm{A}}}{\sqrt{p}}\sigma\left(\frac{\bm{\mathrm{W}} \bm{\mathrm{x}}}{\sqrt{d}}\right). \end{align}\tag{7}
An RFNN, illustrated in Figure 4 (left), is a two-layer neural network whose first layer weights (WRp×d\bm{\mathrm{W}}\in\mathbb{R}^{p\times d}) are drawn from a Gaussian distribution and remain frozen, while the second layer weights (ARd×p\bm{\mathrm{A}} \in \mathbb{R}^{d\times p}) are learned during training. This model has already served as a theoretical framework for studying several behaviors of deep neural networks, such as the double descent phenomenon [43, 44]. σ\sigma is an element-wise non-linear activation function. We consider a training set of nn iid samples xνPx\bm{\mathrm{x}}^\nu \sim P_{\bm{\mathrm{x}}} for ν=1,,n\nu = 1, \ldots, n and we focus on the high-dimensional limit d,p,nd, p, n\rightarrow\infty with the ratios ψp=p/d,ψn=n/d\psi_p=p/d, \psi_n=n/d kept fixed. We study the training dynamics associated with the minimization of the empirical score matching loss defined in (Equation 4) at a fixed diffusion time tt. This is a simplification compared to practical methods, which use a single model for all tt; it has already been studied in previous theoretical works [28, 17]. The loss (Equation 4) is rescaled by a factor 1/d1/d in order to ensure a finite limit at large dd. We also study the evolution of the test loss evaluated on test points and the distance to the exact score s(x)=logPx\bm{\mathrm{s}}(\bm{\mathrm{x}})=\nabla\log P_{\bm{\mathrm{x}}},
Ltest=1dEx,ξ[ΔtsA(xt(ξ))+ξ2],Escore=1dEx[sA(x)logPx2],(8)\begin{align} \mathcal{L}_\mathrm{test}=\frac{1}{d}\mathbb{E}_{\bm{\mathrm{x}}, \bm{\xi}}\left[\lVert \sqrt{\Delta_t} \bm{\mathrm{s}}_{\bm{\mathrm{A}}}(\bm{\mathrm{x}}_t(\bm{\xi}))+ \bm{\xi}\rVert ^2\right], \quad\mathcal{E}_\mathrm{score}=\frac{1}{d}\mathbb{E}_{\bm{\mathrm{x}}}\left[\lVert \bm{\mathrm{s}}_{\bm{\mathrm{A}}}(\bm{\mathrm{x}})-\nabla\log P_{\bm{\mathrm{x}}}\rVert^2\right], \end{align}\tag{8}
where the expectations Ex,ξ\mathbb{E}_{\bm{\mathrm{x}}, \bm{\xi}} are computed over xPx\bm{\mathrm{x}}\sim P_{\bm{\mathrm{x}}} and ξN(0,Id)\bm{\xi} \sim \mathcal{N}(0, \bm{I}_d). The generalization loss, defined as Lgen=LtestLtrain\mathcal{L}_\mathrm{gen} = \mathcal{L}_\mathrm{test} - \mathcal{L}_\mathrm{train}, indicates the degree of overfitting in the model while the distance to the exact score Escore\mathcal{E}_\mathrm{score} measures the quality of the generation as it is an upper bound on the Kullback–Leibler divergence between the target and generated distributions [45, 46]. The weights A\bm{\mathrm{A}} are updated via gradient descent
A(k+1)=A(k)ηALtrain(A(k)),(9)\begin{align} \bm{\mathrm{A}}^{(k+1)}= \bm{\mathrm{A}}^{(k)}-\eta\nabla_{\bm{\mathrm{A}}}\mathcal{L}_\mathrm{train}(\bm{\mathrm{A}}^{(k)}), \end{align}\tag{9}
where η\eta is the learning rate. In the high-dimensional limit, as the learning rate η0\eta \to 0, and after rescaling time as τ=kη/d2\tau = k\eta / d^2, the discrete-time dynamics converges to the following continuous-time gradient flow:
A˙(τ)=d2ALtrain(A(τ))=2ΔtdpAU2dΔtpVT,(10)\begin{align} \dot{\bm{\mathrm{A}}}(\tau)=-d^2\nabla_{\bm{\mathrm{A}}}\mathcal{L}_\mathrm{train}(\bm{\mathrm{A}}(\tau)) =-2\Delta_t\frac{d}{p} \bm{\mathrm{A}} \bm{\mathrm{U}}-\frac{2d\sqrt{\Delta_t}}{\sqrt{p}} \bm{\mathrm{V}}^T, \end{align}\tag{10}
with
U=1nν=1nEξ[σ(Wxtν(ξ)d)σ(Wxtν(ξ)d)T],V=1nν=1nEξ[σ(Wxtν(ξ)d)ξT].\begin{align} \bm{\mathrm{U}}=\frac{1}{n}\sum_{\nu=1}^n\mathbb{E}_{\bm{\xi}}\left[\sigma\left(\frac{\bm{\mathrm{W}} \bm{\mathrm{x}}_t^\nu(\bm{\xi})}{\sqrt{d}}\right)\sigma\left(\frac{\bm{\mathrm{W}} \bm{\mathrm{x}}_t^\nu(\bm{\xi})}{\sqrt{d}}\right)^T\right], \quad \bm{\mathrm{V}}=\frac{1}{n}\sum_{\nu=1}^n\mathbb{E}_{\bm{\xi}}\left[\sigma\left(\frac{\bm{\mathrm{W}} \bm{\mathrm{x}}_t^\nu(\bm{\xi})}{\sqrt{d}}\right) \bm{\xi}^T\right]. \end{align}
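To make this concrete, the NumPy sketch below (ours; small sizes, σ = tanh, isotropic Gaussian data, and a Monte Carlo estimate of the ξ-averages) builds U and V and uses the linearity of Equation 10 in A to write its solution mode by mode: the mode associated with an eigenvalue λ of U relaxes on a timescale proportional to ψ_p/(Δ_t λ), which is the origin of the two timescales discussed below.

```python
import numpy as np

rng = np.random.default_rng(0)
d, psi_p, psi_n, t, n_mc = 30, 8, 4, 0.1, 200          # small sizes for illustration
p, n = psi_p * d, psi_n * d
Delta_t = 1.0 - np.exp(-2.0 * t)

W = rng.normal(size=(p, d))                            # frozen first-layer weights
X = rng.normal(size=(n, d))                            # training data ~ N(0, I_d)

# Monte Carlo estimate of the xi-averages defining U (p x p) and V (p x d)
U, V = np.zeros((p, p)), np.zeros((p, d))
for _ in range(n_mc):
    xi = rng.normal(size=(n, d))
    Xt = np.exp(-t) * X + np.sqrt(Delta_t) * xi        # x_t^nu(xi)
    Phi = np.tanh(Xt @ W.T / np.sqrt(d))               # (n, p) random features
    U += Phi.T @ Phi / n
    V += Phi.T @ xi / n
U, V = U / n_mc, V / n_mc

# Linear flow A' = -2 Delta_t (d/p) A U - (2 d sqrt(Delta_t)/sqrt(p)) V^T with A(0) = 0:
# A(tau) = A_inf (I - exp(-c1 U tau)), so the mode with eigenvalue lam relaxes in ~ 1/(c1 lam).
lam, Q = np.linalg.eigh(U)
c1 = 2.0 * Delta_t * d / p                             # = 2 Delta_t / psi_p
A_inf = -(2.0 * d * np.sqrt(Delta_t) / np.sqrt(p)) / c1 * np.linalg.solve(U, V).T

def A_of(tau):
    return A_inf @ (Q * (1.0 - np.exp(-c1 * lam * tau))) @ Q.T

print("slowest / fastest relaxation times:", 1.0 / (c1 * lam.min()), 1.0 / (c1 * lam.max()))
```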

Figure 4: (Left) Illustration of an RFNN. (Middle/Right) Spectrum of U\bm{\mathrm{U}}. Density ρ(λ)\rho(\lambda) from Theorem 1 in the overparameterized Regime I described in Theorem 2, with ψp=64\psi_p = 64, ψn=8\psi_n = 8, t=0.01t = 0.01, and ρΣ(λ)=δ(λ1)\rho_{\boldsymbol{\Sigma}}(\lambda)=\delta(\lambda-1). The bulk of the spectrum (orange) is between λ10\lambda\approx10 and λ45\lambda\approx45. The histogram shows the eigenvalues from a single realization of U\bm{\mathrm{U}} at d=100d = 100. Inset: zoom near λ=0\lambda = 0 (in blue) showing the first bulk ρ1\rho_1 and the delta peak at λ=st2\lambda = s_t^2. (Right) Same as (Middle), but with ρΣ(λ)=12δ(λ0.5)+12δ(λ1.5)\rho_{\boldsymbol{\Sigma}}(\lambda) = \frac{1}{2}\delta(\lambda - 0.5) + \frac{1}{2}\delta(\lambda - 1.5). The first bulk in blue remains unchanged, as it depends only on σx2=Tr(Σ)/d=1\sigma_{\bm{\mathrm{x}}}^2 = \operatorname{Tr}(\boldsymbol{\Sigma})/d = 1 in both cases, while the second bulk varies with Σ\boldsymbol{\Sigma}.

Assumptions. For our analytical results to hold, we make the following mathematical assumptions which are standard when studying Random Features [47, 48, 49] namely (i) the activation function σ\sigma admits a Hermite polynomial expansion σ(x)=s=0αss!Hes(x)\sigma(x)=\sum_{s=0}^\infty\frac{\alpha_s}{s!}He_s(x); and (ii) the data distribution PxP_{\bm{\mathrm{x}}} has sub-Gaussian tails and a covariance Σ=EPx[xxT]\boldsymbol{\Sigma}=\mathbb{E}_{P_{\bm{\mathrm{x}}}}[\bm{\mathrm{x}} \bm{\mathrm{x}}^T] with bounded spectrum. We assume that the empirical distribution of eigenvalues of Σ\boldsymbol{\Sigma} converges weakly in the high dimensional limit to a deterministic density ρΣ(λ)\rho_{\boldsymbol{\Sigma}}(\lambda) and that Tr(Σ)/d\operatorname{Tr}(\boldsymbol{\Sigma})/d converges to a finite limit (for a more precise mathematical statement see SM Appendix C.3). Moreover, we make additional assumptions that are not essential to the proofs but which simplify the analysis: (iii) the activation function σ\sigma verifies μ0=Ez[σ(z)]=0\mu_0=\mathbb{E}_z[\sigma(z)]=0; and (iv) the second layer A\bm{\mathrm{A}} is initialized with zero weights A(τ=0)=0\bm{\mathrm{A}}(\tau=0)=0. In numerical applications, unless specified, we use σ(z)=tanh(z)\sigma(z)=\tanh(z) and Px=N(0,Id)P_{\bm{\mathrm{x}}}=\mathcal{N}(0, \bm{I}_d).

Figure 5: Evolution of the training and test losses for the RFNN. (A) Distance to the true score Escore\mathcal{E}_\mathrm{score} against training time τ\tau for ψn=4,8,16,32\psi_n=4, 8, 16, 32, ψp=64,t=0.1\psi_p=64, t=0.1 and d=100d=100. In the inset, the training time is rescaled by τmem=ψp/Δtλmin\tau_\mathrm{mem}=\psi_p/\Delta_t\lambda_\mathrm{min}. (B) Training (solid) and test (dashed) losses for various ψn\psi_n. The inset shows both losses rescaled by τmem\tau_\mathrm{mem}. (C) Heatmaps of Lgen\mathcal{L}_\mathrm{gen} for τ=103\tau=10^{3} (top) and τ=104\tau=10^4 (bottom) as a function of ψn\psi_n and ψp\psi_p. All the curves use Pytorch [50] gradient descent. More numerical details can be found in SM Section D.

Emergence of the two timescales during training. We first show in Figure 5 that the behavior of training and test losses in the RF model mirrors the one found in realistic cases in Section 2, with a separation of timescales τgen\tau_\mathrm{gen} and τmem\tau_\mathrm{mem} which increases with nn. Equation (Equation 10) is linear in A\bm{\mathrm{A}} and hence it can be solved exactly (see SM). The timescales of the training dynamics are given by the inverse eigenvalues of the p×pp\times p matrix ΔtU/ψp\Delta_t \bm{\mathrm{U}}/\psi_p. Building on the Gaussian Equivalence Principle [GEP, 51, 48, 52] and the theory of linear pencils [53], the authors of [17] derive a coupled system of equations characterizing the Stieltjes transform of the eigenvalue density ρ(λ)\rho(\lambda) of U\bm{\mathrm{U}} for isotropic Gaussian data that lie in a DD-dimensional subspace with DdD\le d and D=O(d)D=\mathcal{O}(d). We offer an alternative derivation, presented in SM and valid for a general covariance, using the replica method [54] -- a heuristic method from the statistical physics of disordered systems -- yielding the more compact formulation of the spectrum stated in Theorem 1. Before stating the theorem, we introduce
bt=Eu,v[vσ(etσxu+Δtv)],at=Eu,v[σ(etσxu+Δtv)uetσx],vt2=Eu,v,w[σ(etσxu+Δtv)σ(etσxu+Δtw)]at2e2tσx2,st2=Eu[σ(Γtu)2]at2e2tσx2vt2bt2,\begin{align} &b_t=\mathbb{E}_{u, v}[v\sigma(e^{-t}\sigma_{\bm{\mathrm{x}}}u+\sqrt{\Delta_t}v)], \quad a_t=\mathbb{E}_{u, v}[\sigma(e^{-t}\sigma_{\bm{\mathrm{x}}}u+\sqrt{\Delta_t}v)\frac{u}{e^{-t}\sigma_{\bm{\mathrm{x}}}}], \\ &v_t^2=\mathbb{E}_{u, v, w}[\sigma(e^{-t}\sigma_{\bm{\mathrm{x}}}u+\sqrt{\Delta_t}v)\sigma(e^{-t}\sigma_{\bm{\mathrm{x}}}u+\sqrt{\Delta_t}w)]-a_t^2e^{-2t}\sigma_{\bm{\mathrm{x}}}^2, \\ &s_t^2= \mathbb{E}_u[\sigma(\Gamma_t u)^2]-a_t^2e^{-2t}\sigma_{\bm{\mathrm{x}}}^2-v_t^2-b_t^2, \end{align}
where σx2=Tr(Σ)d\sigma_{\bm{\mathrm{x}}}^2=\frac{\operatorname{Tr}(\boldsymbol{\Sigma})}{d}, Γt=e2tσx2+Δt=1+e2t(σx21)\Gamma_t=e^{-2t}\sigma_{\bm{\mathrm{x}}}^2+\Delta_t=1+e^{-2t}(\sigma_{\bm{\mathrm{x}}}^2-1) and the expectation is over the u,v,wu, v, w random variables which are independent standard Gaussian N(0,1)\mathcal{N}(0, 1).
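For concreteness, these Gaussian expectations can be evaluated numerically. The sketch below (ours) uses Gauss-Hermite quadrature for σ = tanh and σ_x = 1 (so that Γ_t = 1), the default setting of the numerical experiments; these constants are what enter Theorems 1 and 2.

```python
import numpy as np

def rf_constants(t, sigma=np.tanh, n_nodes=80):
    """Gauss-Hermite evaluation of (b_t, a_t, v_t^2, s_t^2) for sigma_x = 1,
    hence Gamma_t = 1.  Uses E_{u~N(0,1)}[f(u)] ~ sum_i w_i f(sqrt(2) x_i)/sqrt(pi)."""
    x, w = np.polynomial.hermite.hermgauss(n_nodes)
    u, w = np.sqrt(2.0) * x, w / np.sqrt(np.pi)        # nodes/weights for N(0, 1)
    et, Dt = np.exp(-t), 1.0 - np.exp(-2.0 * t)
    U, V = np.meshgrid(u, u, indexing="ij")            # 2D grid over (u, v)
    W2 = np.outer(w, w)
    S = sigma(et * U + np.sqrt(Dt) * V)                # sigma(e^{-t} u + sqrt(Delta_t) v)
    b = np.sum(W2 * V * S)                             # b_t
    a = np.sum(W2 * U * S) / et                        # a_t
    g = S @ w                                          # g(u) = E_v[sigma(...)]
    v2 = np.sum(w * g**2) - (a * et) ** 2              # v_t^2
    s2 = np.sum(w * sigma(u) ** 2) - (a * et) ** 2 - v2 - b**2   # s_t^2 (Gamma_t = 1)
    return b, a, v2, s2

print(rf_constants(t=0.01))
```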

Theorem 1

Let q(z)=1pTr(UzIp)1q(z)=\frac{1}{p} \operatorname{Tr}(\bm{\mathrm{U}}-z \bm{I}_p)^{-1}, r(z)=1pTr(Σ1/2WT(UzIp)1WΣ1/2)r(z)=\frac{1}{p} \operatorname{Tr}(\boldsymbol{\Sigma}^{1/2} \bm{\mathrm{W}}^T(\bm{\mathrm{U}}-z \bm{I}_p)^{-1} \bm{\mathrm{W}} \boldsymbol{\Sigma}^{1/2}) and s(z)=1pTr(WT(UzIp)1W)s(z)=\frac{1}{p} \operatorname{Tr}(\bm{\mathrm{W}}^T(\bm{\mathrm{U}}-z \bm{I}_p)^{-1} \bm{\mathrm{W}}), with zCz\in\mathbb{C}. Let
s^(q)=bt2ψp+1q,r^(r,q)=ψpat2e2t1+at2e2tψpψnr+ψpvt2ψnq.\begin{align} &\hat{s}(q)=b_t^2\psi_p+\frac{1}{q}, \\ &\hat{r}(r, q)=\frac{\psi_p a_t^2e^{-2t}}{1+\frac{a_t^2e^{-2t}\psi_p }{\psi_n }r+\frac{\psi_p v_t^2}{\psi_n }q}. \end{align}
Then q(z),r(z)q(z), r(z) and s(z)s(z) satisfy the following set of three equations:
s=dρΣ(λ)1s^(q)+λr^(r,q),r=dρΣ(λ)λs^(q)+λr^(r,q),ψp(st2z)+ψpvt21+at2e2tψpψnr+ψpvt2ψnq+1ψpqsq2=0,\begin{align} &s=\int \mathrm{d} \rho_{\boldsymbol{\Sigma}}(\lambda)\frac{1}{\hat{s}(q)+\lambda\hat{r}(r, q)}, \\ &r=\int \mathrm{d} \rho_{\boldsymbol{\Sigma}}(\lambda)\frac{\lambda}{\hat{s}(q)+\lambda\hat{r}(r, q)}, \\ &\psi_p(s_t^2-z)+\frac{\psi_pv_t^2}{1+\frac{a_t^2e^{-2t}\psi_p} {\psi_n} r+\frac{\psi_p v_t^2}{\psi_n} q}+\frac{1-\psi_p}{q}-\frac{s}{q^2}=0, \end{align}
The eigenvalue distribution of U\bm{\mathrm{U}}, ρ(λ)\rho(\lambda), can then be obtained using the Sokhotski–Plemelj inversion formula ρ(λ)=limε0+1πImq(λ+iε)\rho(\lambda)=\underset{\varepsilon \rightarrow0^+}{\lim}\frac{1}{\pi}\operatorname{Im}q(\lambda+i\varepsilon).
We now focus on the asymptotic regime ψp,ψn1\psi_p, \psi_n \gg 1, typical for strongly over‑parameterized models trained on large data sets. In this limit, the spectrum of U\bm{\mathrm{U}} can be described analytically by the following Theorem 2.

Theorem 2: Informal

Let ρ\rho denote the spectral density of U\bm{\mathrm{U}}.
  • Regime I (overparametrized): ψp>ψn1\psi_p>\psi_n\gg 1.
ρ(λ)=(11+ψnψp)δ(λst2)+ψnψpρ1(λ)+1ψpρ2(λ).\rho(\lambda)=\Bigl(1-\frac{1+\psi_n}{\psi_p}\Bigr)\delta(\lambda-{s_t^2}) +\frac{\psi_n}{\psi_p}\, \rho_1(\lambda) +\frac{1}{\psi_p}\, \rho_2(\lambda).
  • Regime II (underparametrized): ψn>ψp1\psi_n>\psi_p\gg 1.
ρ(λ)=(11ψp)ρ1(λ)+1ψpρ2(λ).\rho(\lambda)=\Bigl(1-\frac{1}{\psi_p}\Bigr)\rho_1(\lambda) +\frac{1}{\psi_p}\, \rho_2(\lambda).
where ρ1\rho_1 is an atomless measure with support
[st2+vt2(1ψp/ψn)2,  st2+vt2(1+ψp/ψn)2],\left[s_t^2 + v_t^2\left(1-\sqrt{\psi_p/\psi_n}\right)^{2}, \; s_t^2 + v_t^2\left(1+\sqrt{\psi_p/\psi_n}\right)^{2}\right],
and ρ2\rho_2 coincides with the asymptotic eigenvalue bulk density of the population covariance U~=EX[U]\tilde{\bm{\mathrm{U}}}=\mathbb{E}_{\bm{\mathrm{X}}}[\bm{\mathrm{U}}]; ρ2\rho_2 is independent of ψn\psi_n and its support is on the scale ψp\psi_p.
The eigenvectors associated with δ(λst2)\delta(\lambda-{s_t^2}) leave both training and test losses unchanged and are therefore irrelevant. In the limit ψpψn\psi_p\gg \psi_n, the supports of ρ1\rho_1 and ρ2\rho_2 are respectively on the scales ψp/ψn\psi_p/\psi_n and ψp\psi_p, i.e. they are well separated.
The proofs of both theorems are shown in SM (Appendix C). We recall that training timescales are directly related to the eigenvalues λ\lambda via the relation τ=ψp/Δtλ\tau= \psi_p / \Delta_t \lambda. Theorem 2 therefore demonstrates the emergence of the two training timescales τmem\tau_{\mathrm{mem}} and τgen\tau_{\mathrm{gen}} in the overparametrized regime of the RFNN model. They are respectively associated with the measures ρ1\rho_1 and ρ2\rho_2, which are well separated in regime I, for ψpψn1\psi_p\gg\psi_n\gg1, as shown in Figure 4.
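The scale separation can be checked directly from the expressions in Theorem 2. The sketch below (ours; σ = tanh, σ_x = 1, constants estimated by a simple Monte Carlo average) prints the support edges of ρ1 and the associated slow timescale τ_mem = ψ_p/(Δ_t λ_min) for several ψ_n at fixed ψ_p: since λ_min ≈ v_t² ψ_p/ψ_n when ψ_p ≫ ψ_n, τ_mem grows roughly linearly with ψ_n, i.e. with n.

```python
import numpy as np

rng = np.random.default_rng(0)

def constants_mc(t, n_samples=2_000_000, sigma=np.tanh):
    """Monte Carlo estimate of v_t^2 and s_t^2 for sigma_x = 1 (Gamma_t = 1)."""
    et, Dt = np.exp(-t), 1.0 - np.exp(-2.0 * t)
    u, v, w = rng.normal(size=(3, n_samples))
    s_uv = sigma(et * u + np.sqrt(Dt) * v)
    s_uw = sigma(et * u + np.sqrt(Dt) * w)
    a = np.mean(u * s_uv) / et
    b = np.mean(v * s_uv)
    v2 = np.mean(s_uv * s_uw) - (a * et) ** 2
    s2 = np.mean(sigma(u) ** 2) - (a * et) ** 2 - v2 - b**2
    return v2, s2

t, psi_p = 0.01, 64
Dt = 1.0 - np.exp(-2.0 * t)
v2, s2 = constants_mc(t)
for psi_n in (4, 8, 16, 32):
    ratio = psi_p / psi_n
    lam_min = s2 + v2 * (1.0 - np.sqrt(ratio)) ** 2    # left edge of rho_1 (Regime I)
    lam_max = s2 + v2 * (1.0 + np.sqrt(ratio)) ** 2    # right edge of rho_1
    tau_mem = psi_p / (Dt * lam_min)                   # slow (memorization) timescale
    print(f"psi_n={psi_n:3d}  rho_1 support=[{lam_min:.3f}, {lam_max:.3f}]  tau_mem~{tau_mem:.0f}")
```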
Generalization: The timescale τgen\tau_{\mathrm{gen}} on which the first relaxation takes place is associated with the formation of the generalization regime. It is related to the bulk ρ2\rho_2 and is of order 1/Δt1/\Delta_t. This regime only depends on the population covariance Σ\boldsymbol{\Sigma} of the data and is independent of the specific realization of the dataset. On this timescale, which is of order one, both the training Ltrain\mathcal{L}_{\mathrm{train}} and test Ltest\mathcal{L}_{\mathrm{test}} losses decrease. The generalization loss Lgen=LtestLtrain\mathcal{L}_\mathrm{gen} = \mathcal{L}_{\mathrm{test}}-\mathcal{L}_{\mathrm{train}} is zero, and Escore\mathcal{E}_\mathrm{score} tends to a value that we find to scale as O(ψnη)\mathcal{O}(\psi_n^{-\eta}) with η0.59\eta\simeq0.59 numerically (see Figure 5).
Memorization: The timescale τmem\tau_{\mathrm{mem}}, on which the second stage of the dynamics takes place, is associated with overfitting and memorization. It is related to the bulk ρ1\rho_1, and scales as ψp/Δtλmin\psi_p / \Delta_t \lambda_\mathrm{\min}, where λmin\lambda_\mathrm{\min} is the left edge of ρ1\rho_1. In the overparameterized regime pnp \gg n, τmem\tau_{\mathrm{mem}} becomes large and of order ψn/Δt\psi_n/\Delta_t, thus implying a scaling of τmem\tau_{\mathrm{mem}} with nn. On this timescale, the training loss decreases while the test loss increases, converging to their respective asymptotic values as computed in [17]. Figure 5 indeed shows that all training and test curves separate, and correspondingly the generalization loss Lgen\mathcal{L}_{\mathrm{gen}} increases, at a time that scales with ψp/Δtλmin\psi_p /\Delta_t \lambda_{\min}, as shown in the inset.
As nn increases, the asymptotic (τ\tau \rightarrow \infty) generalization loss Lgen\mathcal{L}_{\mathrm{gen}} decreases, indicating reduced overfitting. For n>n(p)=pn>n^*(p) = p, although some overfitting remains (i.e., Lgen>0\mathcal{L}_{\mathrm{gen}} > 0), the value of Lgen\mathcal{L}_{\mathrm{gen}} is substantially reduced, and the model is no longer expressive enough to memorize the training data, as shown in [17]. This regime corresponds to the Architectural Regularization phase in Figure 1. We show in Figure 5 (panel C) how the generalization loss Lgen\mathcal{L}_{\mathrm{gen}} varies in the (n,p)(n, p) plane depending on the time τ\tau at which training is stopped. In agreement with the above results, we find that the generalization--memorization transition line depends on τ\tau and moves upward for larger values of τ\tau, similarly to the numerical results presented in Figure 3 and the illustration in Figure 1.

4. Conclusions

We have shown that the training dynamics of neural network-based score functions display a form of implicit regularization that prevents memorization even in highly overparameterized diffusion models. Specifically, we have identified two well-separated timescales in the learning: τgen\tau_\mathrm{gen}, at which models begin to generate high-quality, novel samples, and τmem\tau_\mathrm{mem}, beyond which they start to memorize the training data. The gap between these timescales grows with the size of the training set, leading to a broad window where early-stopped models generate high-quality novel samples. We have demonstrated that this phenomenon happens in realistic settings, for controlled synthetic data, and in analytically tractable models. Although our analysis focuses on DMs, the underlying score-learning mechanism we uncover is common to all score-based generative models such as stochastic interpolants [55] or flow matching [10]; we therefore expect our results to generalize to this broader class.
Limitations and future works.
  • While we derived our results under SGD optimization, most DMs are trained in practice with Adam [38]. In SM Sects. Appendix A and Appendix D, we show that the two key timescales still arise using Adam, although after far fewer optimization steps. Studying how different optimizers shift these timescales would be valuable for practical usage.
  • All experiments in Section 2 are conducted with unconditional DMs. We additionally verify in SM Sect. B, using a toy Gaussian mixture dataset and classifier-free guidance [56], that the same scaling of τmem\tau_\mathrm{mem} with nn holds in the conditional settings. Understanding precisely how the absolute timescales τmem\tau_\mathrm{mem} and τgen\tau_\mathrm{gen} depend on the conditioning remains an open question.
  • Our numerical experiments cover a range of pp between 1M and 16M. Exploring a wider range is essential to map the full (n,p)(n, p) phase diagram sketched in Figure 1 and understand the precise effect of expressivity on dynamical regularization.
  • Finally, our theoretical analysis relies on well-controlled data and score models that reproduce the core effects. Extending these analytical frameworks to richer data distributions (such as Gaussian mixtures or data from the hidden manifold model) and to structured architectures would be valuable to further characterize the implicit dynamical regularization of training score functions. In particular, investigating how heavy-tailed data distributions [57] affect the picture described here could be valuable.
  • Although DMs trained on large and diverse datasets likely avoid the memorization regime we study here, some industrial models were shown to exhibit partial memorization [23, 24]. Our results provide practical guidelines (early stopping, controlling the network capacity) to train DMs robustly and hence avoid memorization, which can be especially helpful in data-scarce domains (e.g., physical sciences).

Acknowledgments

The authors thank Valentin De Bortoli for initial motivating discussions on memorization--generalization transitions. RU thanks Beatrice Achilli, Jérome Garnier-Brun, Carlo Lucibello and Enrico Ventura for insightful discussions. RU is grateful to Bocconi University for its hospitality during his stay, during which part of this work was conducted. This work was performed using HPC resources from GENCI-IDRIS (Grant 2025-AD011016319). GB acknowledges support from the French government under the management of the Agence Nationale PR[AI] RIE-PSAI (ANR-23-IACL-0008). MM acknowledges the support of the PNRR-PE-AI FAIR project funded by the NextGeneration EU program. After completing this work, we became aware that A. Favero, A. Sclocchi, and M. Wyart [58] had also been investigating the memorization--generalization transition from a similar perspective.

Appendix

Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training
Supplementary Material (SM)
Tony Bonnaire^\dagger, Raphaël Urfin^\dagger, Giulio Biroli, Marc Mézard
This document provides detailed derivations and additional experiments supporting the main text (MT). In Appendix A, we give details about the numerical experiments carried out in Section 2. In Appendix B we provide additional numerical experiments on simplified score and data models. Appendix C gives formal proofs of the main theorems of Section 3. Finally, Appendix D provides more details on the numerical experiments of Section 3.

A. Numerical experiments on CelebA

A.1 Details on the numerical setup

Dataset. All numerical experiments in Section 2 of the MT use the CelebA face dataset [22]. We center-crop each RGB image to 32×3232\times 32 pixels and convert to grayscale images in order to accelerate the training of our Diffusion Models (DMs). To precisely control the samples seen by a model, no data augmentation is applied, and we vary the training set size nn in the window [128,32768]\left[128, 32768\right]. Examples of training samples are shown in the left-most block of Figure 6.
Architecture. As commonly done in DDPM implementations [e.g., 2, 4], the network approximating the score function is a U-Net [21] made of three resolution levels, each containing two residual blocks with channel multipliers {1,2,4}\{1, 2, 4\} respectively. We apply attention to the two coarsest resolutions, and embed the diffusion time via sinusoidal position embedding [40]. The base channel width WW varies from 16 to 64 depending on the experiment, resulting in a total of 1 to 16 million trainable parameters.
Time reparameterization. Compared to the framework presented in the MT, the DDPMs we train make use of a time reparameterization of the forward and backward processes with a variance schedule {βt}t=1T\{\beta_{t'}\}_{t'=1}^T, where TT is the time horizon given as a number of steps, fixed to 1000 in our experiments. The variance evolves linearly from β1=104\beta_1 = 10^{-4} to β1000=2×102\beta_{1000} = 2 \times 10^{-2}. A sample at time tt', denoted x(t)\bm{\mathrm{x}}(t'), can be expressed from x(0)\bm{\mathrm{x}}(0) as the following interpolation
x(t)=α(t)x(0)+1α(t)ξ,\bm{\mathrm{x}}(t') = \sqrt{\overline{\alpha}(t')} \bm{\mathrm{x}}(0) + \sqrt{1 - \overline{\alpha}(t')} \bm{\xi},
where α(t)=s=1t(1βs)\overline{\alpha}(t') = \prod_{s=1}^{t'} (1-\beta_s), and ξ\bm{\xi} is a standard and centered Gaussian noise. This is a reparameterization of the Ornstein-Uhlenbeck process from Equation 1 defined through time tt in the MT, with
t=12log(α(t)).t = -\frac{1}{2} \log \left(\overline{\alpha}(t')\right).
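A minimal sketch (ours) of this reparameterization, using the linear schedule given above: it builds ᾱ(t'), the corresponding continuous time t, and the forward interpolation x(t').

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 2e-2, T)              # linear variance schedule beta_1..beta_T
alpha_bar = np.cumprod(1.0 - betas)             # \bar{alpha}(t'), t' = 1..T
t_cont = -0.5 * np.log(alpha_bar)               # continuous OU time t = -log(abar)/2

def forward(x0, t_prime, rng=np.random.default_rng(0)):
    """Noisy sample x(t') = sqrt(abar) x0 + sqrt(1 - abar) xi."""
    a = alpha_bar[t_prime - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * rng.normal(size=np.shape(x0))

print(f"t' = 1    -> t = {t_cont[0]:.2e}")
print(f"t' = 1000 -> t = {t_cont[-1]:.2f}")     # deepest noise level reached by the schedule
```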
Training. All DMs are trained with Stochastic Gradient Descent (SGD) at fixed learning rate η=0.01\eta = 0.01, fixed momentum β=0.95\beta=0.95 and batch size B=min(n,512)B=\min(n, 512). We focus on SGD to facilitate the analysis of time scaling, avoiding the complications that adaptive optimization schemes like Adam [38] may introduce. We train each model for at least 2M SGD steps, and longer for large values of nn, for which memorization appears only later. We do not employ exponential moving average or learning-rate warm-up.
Generation. To accelerate sampling while preserving FID, we employ the DDIM sampler of [41], which replaces the Markovian reverse SDE with a deterministic, non-Markovian update. Given a trained denoiser ξθ(xt,t)\bm{\xi}_{\bm{\theta}}(\bm{\mathrm{x}}_t, t), we iterate for t=T,,1t=T', \ldots, 1
xt1=α(t1) xt1α(t)  ξθ(xt,t)α(t)+1α(t1)ξθ(xt,t),\bm{\mathrm{x}}_{t-1} = \sqrt{\overline{\alpha}(t-1)}\ \frac{\bm{\mathrm{x}}_t - \sqrt{1-\overline{\alpha}(t)}\; \bm{\xi}_{\bm{\theta}}(\bm{\mathrm{x}}_t, t)}{\sqrt{\overline{\alpha}(t)}} + \sqrt{1-\overline{\alpha}(t-1)} \bm{\xi}_{\bm{\theta}}(\bm{\mathrm{x}}_t, t),
with T=200T'=200. During training, we generate at 40 milestones a set of 10,000 samples to assess generalization and memorization. Examples of samples obtained from a model trained on n=1024n=1024 samples with base width W=32W=32 are shown in the middle and right blocks from Figure 6 for two training times, τ=190\tau=190 K and τ=1.62\tau=1.62 M. At τ=190\tau=190 K the model generalizes (fmem=0%f_\mathrm{mem}=0\%) and achieves a test FID of 35.1. After too much training, memorization sets in and, by τ=1.62\tau=1.62 M steps, nearly half the generated samples reproduce training images (fmem=47.2%f_\mathrm{mem}=47.2\%).
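The sketch below (ours) implements the deterministic DDIM update of the equation above, over T' = 200 timesteps uniformly sub-sampled from the T = 1000 training steps (a standard DDIM choice; the sub-sampling and the placeholder `denoiser` are our assumptions, not the exact implementation used here).

```python
import numpy as np

T, T_prime, d = 1000, 200, 32 * 32
betas = np.linspace(1e-4, 2e-2, T)
alpha_bar = np.concatenate([[1.0], np.cumprod(1.0 - betas)])   # abar[0] = 1 handles the final step

def denoiser(x, t):
    """Placeholder for the trained noise predictor xi_theta(x_t, t)."""
    return np.zeros_like(x)

steps = np.linspace(T, 1, T_prime).round().astype(int)         # T' timesteps, from T down to 1

rng = np.random.default_rng(0)
x = rng.normal(size=d)                                         # start from white noise
for i, t in enumerate(steps):
    t_prev = steps[i + 1] if i + 1 < T_prime else 0
    eps = denoiser(x, t)
    x0_hat = (x - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
    x = np.sqrt(alpha_bar[t_prev]) * x0_hat + np.sqrt(1.0 - alpha_bar[t_prev]) * eps
```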
Statistical evaluation. FIDs [42] are computed using the pytorch-fid Python package, with 10,000 generated samples and 10,000 test samples, averaged over 5 independent runs with disjoint test sets. Error bars in the MT denote twice the standard deviation. Training and test losses are estimated similarly over 5 repeated evaluations on the nn training samples and 2048 test samples, and give negligible confidence intervals. For the memorization fraction fmem(τ)f_\mathrm{mem}(\tau), we report the standard error on the mean obtained via bootstrap resampling of the 10,000 generated samples. We also verified that the scaling of the memorization time τmem\tau_\mathrm{mem} is insensitive to the choice of the threshold kk used to define fmemf_\mathrm{mem} in Equation 6 by testing larger and lower values.
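The bootstrap estimate can be reproduced with a few lines; the sketch below (ours, with toy indicators rather than real evaluations) resamples per-sample memorization flags over the 10,000 generated images.

```python
import numpy as np

def bootstrap_fmem(memorized, n_boot=1000, seed=0):
    """Bootstrap standard error and 95% CI of the memorization fraction from a
    boolean array of per-sample memorization flags."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(memorized), size=(n_boot, len(memorized)))
    fracs = memorized[idx].mean(axis=1)
    return fracs.std(), np.percentile(fracs, [2.5, 97.5])

memorized = np.random.default_rng(1).random(10_000) < 0.05     # toy flags, ~5% memorized
se, ci = bootstrap_fmem(memorized)
print(f"f_mem = {memorized.mean():.3f}, bootstrap SE = {se:.4f}, 95% CI = {ci}")
```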
Computing resources. Most trainings were performed on Nvidia H100 GPUs (80GB of memory). A typical run of 2M steps takes approximately 50 hours on two GPUs and varies with the model size (defined through its base width WW). In total, we train 18 distinct models for the several n,Wn, W configurations of the MT. The longest training (n=32768n=32768 and W=32W=32 in Figure 2) ran for 11M steps. The generation of 10,000 samples over 40 training times takes around an additional hour per model on the same hardware.

Figure 6: Training and generation on CelebA. The left-most block shows random training images. Middle and right blocks show generated samples in the left column (after τ=190\tau=190 K and τ=1.62\tau=1.62 M SGD updates respectively), alongside each sample's nearest neighbor in the training set in the right column. All generated images come from model trained on n=1024n=1024 with base width W=32W=32.


Figure 7: Impact of batch size and optimizer on the scaling of τmem\tau_\mathrm{mem}. FID (solid lines, left axis) and memorization fraction fmemf_\mathrm{mem} (in %, dashed lines, right axis) against training time τ\tau for various nn. Inset: normalized memorization fraction fmem(τ)/fmem(τmax)f_\mathrm{mem}(\tau)/f_\mathrm{mem}(\tau_\mathrm{max}) with the rescaled time τ/n\tau/n. (Left) Memorization scaling for B=nB=n. (Right) Generalization--Memorization transition with Adam optimizer for W=64W=64.


A.2 Batch-size effect: repetition vs. memorization

All the experiments in the MT use a fixed batch size $B=512$, and in Section 2 we emphasize that the observed $\mathcal{O}(n)$ scaling of $\tau_\mathrm{mem}$ cannot be explained by repetition over training samples. To validate this statement, the left panel of Figure 7 shows FID and memorization fraction curves when we train the models with full-batch updates ($B=n$) for $n\in\left[128, 512\right]$. At any fixed $\tau$, every sample has been seen exactly $\tau$ times. Yet $\tau_\mathrm{mem}$ continues to grow linearly with $n$, as shown in the inset. This demonstrates that larger datasets reshape the loss landscape -- requiring proportionally more updates to overfit -- rather than memorization simply building up through repeated exposure to the training samples.

A.3 What about Adam?

We conclude this section by repeating our analysis at fixed $W=64$ using the Adam optimizer [38] instead of SGD with momentum. The learning rate is $\eta = 1\times10^{-4}$, the gradient-average coefficients are $(\beta_1, \beta_2)=(0.9, 0.999)$, and the batch size is $B=\min(512, n)$. We keep all other settings and evaluation metrics as above. As shown in the right panel of Figure 7, Adam yields the same two-phase training dynamics: first a generalization regime with $f_\mathrm{mem}=0$ and good performance (small FID), and later a memorization phase at $\tau_\mathrm{mem}\propto n$, as shown in the inset. The only difference is that both $\tau_\mathrm{gen}$ and $\tau_\mathrm{mem}$ occur after far fewer steps than with SGD. This further indicates that the emergence of the two well-separated timescales and their scaling is a fundamental property of the loss landscape.
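For reference, the optimizer switch amounts to a one-line change in the training script. The snippet below is a schematic PyTorch sketch with the hyper-parameters listed above; `model` stands for the U-Net denoiser and is a placeholder.

```python
import torch

def make_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # Adam configuration used in this appendix; the training loop is unchanged,
    # only the optimizer object differs from the SGD-with-momentum runs of the MT.
    return torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))

# Example with a stand-in module:
optimizer = make_optimizer(torch.nn.Linear(8, 8))
```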

B. Generalization--memorization transition in the Gaussian Mixture Model

The aim of this section is to show that our results hold for data distributions other than natural images and for score models other than U-Net architectures.

B.1 Settings

Data distribution. We focus on data sampled i.i.d. from a $d$-dimensional Gaussian Mixture Model (GMM) made of two balanced Gaussians centered at $\pm \bm{\mu}$ with unit covariance, i.e.,
\mathbb{P}_0 = \frac{1}{2} \mathcal{N}(\bm{\mu}, \bm{I}_d) + \frac{1}{2}\mathcal{N}(- \bm{\mu}, \bm{I}_d).
In what follows, we work with $\bm{\mu} = \bm{1}_d$, where $\bm{1}_d = \left[1, \ldots, 1\right]^{\mathsf{T}} \in \mathbb{R}^d$. In this controlled setup, the generalization score can be computed analytically from $\mathbb{P}_0$ and reads
\bm{\mathrm{s}}_\mathrm{gen}(\bm{\mathrm{x}}_t, t) = \bm{\mu} e^{-t} \tanh\left(\bm{\mathrm{x}}_t \cdot \bm{\mu} e^{-t}\right) - \bm{\mathrm{x}}_t.
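A direct implementation of this analytical score is straightforward. The sketch below assumes $\bm{\mu}=\bm{1}_d$ and the conventions of this section; the function name is illustrative only.

```python
import numpy as np

def score_gmm(x_t: np.ndarray, t: float) -> np.ndarray:
    """Exact score s_gen(x_t, t) = mu e^{-t} tanh(x_t . mu e^{-t}) - x_t
    for the symmetric two-Gaussian mixture with mu = 1_d and unit covariance."""
    x_t = np.atleast_2d(x_t)                  # shape (N, d)
    d = x_t.shape[1]
    m = np.exp(-t) * np.ones(d)               # component mean at diffusion time t
    return np.tanh(x_t @ m)[:, None] * m - x_t

# Example: score of a batch of noisy points at t = 0.5 for d = 8.
x = np.random.default_rng(0).normal(size=(4, 8))
s = score_gmm(x, t=0.5)                       # shape (4, 8)
```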
Score model. The denoiser $\bm{\xi}_{\bm{\theta}}(\bm{\mathrm{x}}_t, t)$ is implemented as a lightweight residual multi-layer neural network (see Figure 8): an input layer projecting $\mathbb{R}^d \to \mathbb{R}^W$, followed by three identical residual blocks and an output layer projecting back to $\mathbb{R}^d$. Each block consists of two fully connected layers of width $W$, a skip connection, and a layer normalization. We encode the diffusion time $t$ via a sinusoidal positional embedding and add it to the first feature of each block. The total number of parameters in the network is $p(d, W) = W(2d + 13) + d + 6W^2$. For $d=8$ and $W=128$, the reference setting of this section, this yields $p=102,024$ trainable parameters.

Figure 8: Basic ResNet architecture of the GMM numerical experiments. Residual network with three residual blocks, each made of two fully-connected layers followed by a layer normalization. The width of the network is $W$, and the input and output sizes are $d$.

Training and computing resources. Unless otherwise specified, we train every model of this section with SGD at fixed learning rate $\eta = 6\times 10^{-3}$ and momentum $\beta=0.95$, using full-batch updates $B=n$ for $n\in\{128, 256, 512, 1024, 2048, 4096\}$ and running for up to 4M updates. All experiments are executed on an Nvidia RTX 2080 Ti, with the largest setting $n=4096$ requiring around 10 hours to complete.
Generalization and memorization metrics. In addition to the memorization fraction $f_\mathrm{mem}(\tau)$, we exploit this controlled setting, where we know the true data distribution $\mathbb{P}_0$, to directly measure how closely it matches the generated distribution $\mathbb{P}_{\bm{\theta}}$ via the Kullback-Leibler (KL) divergence
D_\mathrm{KL}(\mathbb{P}_{\bm{\theta}} \| \mathbb{P}_0) = \int \mathrm{d} \bm{\mathrm{x}}\, \mathbb{P}_{\bm{\theta}}(\bm{\mathrm{x}}) \log \mathbb{P}_{\bm{\theta}}(\bm{\mathrm{x}}) - \int \mathrm{d} \bm{\mathrm{x}}\, \mathbb{P}_{\bm{\theta}}(\bm{\mathrm{x}}) \log \mathbb{P}_{0}(\bm{\mathrm{x}}).
The cross-entropy term EPθ[logP0]\mathbb{E}_{\mathbb{P}_{\bm{\theta}}}\left[\log\mathbb{P}_0\right] is easy to estimate using Monte Carlo,
dxPθ(x)logP0(x)1Nμ=1NlogP0(x~μ),\int \mathrm{d} \bm{\mathrm{x}} \mathbb{P}_{\bm{\theta}}(\bm{\mathrm{x}}) \log \mathbb{P}_{0}(\bm{\mathrm{x}}) \approx \frac{1}{N} \sum_{\mu=1}^N \log \mathbb{P}_0(\tilde{\bm{\mathrm{x}}}_\mu),
where $\{\tilde{\bm{\mathrm{x}}}_\mu\}_{\mu=1}^N$ are $N=10,000$ samples drawn from the model with parameters $\bm{\theta}(\tau)$ at training time $\tau$. Estimating the negative entropy term $\mathbb{E}_{\mathbb{P}_{\bm{\theta}}}\left[\log\mathbb{P}_{\bm{\theta}}\right]$ is more challenging, since DMs only give access to the score function $\bm{\mathrm{s}}_{\bm{\theta}}(\bm{\mathrm{x}}, t) = \nabla_{\bm{\mathrm{x}}} \log \mathbb{P}_{\bm{\theta}}(\bm{\mathrm{x}})$ and not to the underlying probability distribution $\mathbb{P}_{\bm{\theta}}$. We can however employ time integration to express it as a function of the score only,
EPθ[logPθ]0TdtI(t)d2log(2πe),\mathbb{E}_{\mathbb{P}_{\bm{\theta}}}\left[\log\mathbb{P}_{\bm{\theta}}\right] \approx \int_{0}^T \mathrm{d} t I(t) - \frac{d}{2} \log \left(2\pi e\right),
with
I(t) = \frac{\beta_t}{2N}\sum_{\mu=1}^N \left[\tilde{\bm{\mathrm{x}}}_\mu \cdot \bm{\mathrm{s}}_{\bm{\theta}}(\tilde{\bm{\mathrm{x}}}_\mu, t) + \lVert \bm{\mathrm{s}}_{\bm{\theta}}(\tilde{\bm{\mathrm{x}}}_\mu, t)\rVert^2 \right].
This expression assumes that the model has learned an accurate representation of the score function. Note that samples are generated using a standard Euler-Maruyama discretization of the backward process (Equation 2 of the MT) over $T=1000$ timesteps.
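The two estimates above can be combined into a single KL estimator. The sketch below is schematic: it assumes a callable `score_model(x, t)` returning $\bm{\mathrm{s}}_{\bm{\theta}}$, generated samples `x_gen`, a noise-schedule function `beta(t)` and a time grid; these names are illustrative and not the exact code used to produce the figures.

```python
import numpy as np
from scipy.special import logsumexp

def log_p0(x, d=8):
    """Exact log-density of the two-component GMM P_0 with mu = 1_d, unit covariance."""
    mu = np.ones(d)
    log_norm = -0.5 * d * np.log(2 * np.pi)
    lp_plus = log_norm - 0.5 * np.sum((x - mu) ** 2, axis=1)
    lp_minus = log_norm - 0.5 * np.sum((x + mu) ** 2, axis=1)
    return logsumexp(np.stack([lp_plus, lp_minus]), axis=0) - np.log(2.0)

def kl_estimate(x_gen, score_model, beta, t_grid):
    """Monte-Carlo KL(P_theta || P_0): time-integrated negative entropy term
    minus the cross-entropy term estimated on generated samples."""
    d = x_gen.shape[1]
    # Negative entropy term: integrate I(t) over the time grid (trapezoidal rule).
    I_vals = []
    for t in t_grid:
        s = score_model(x_gen, t)
        I_vals.append(beta(t) / 2.0 * np.mean(np.sum(x_gen * s + s ** 2, axis=1)))
    neg_entropy = np.trapz(I_vals, t_grid) - 0.5 * d * np.log(2 * np.pi * np.e)
    # Cross-entropy term: Monte-Carlo average of log P_0 over generated samples.
    cross_entropy = np.mean(log_p0(x_gen, d))
    return neg_entropy - cross_entropy
```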

B.2 Scaling of $\tau_\mathrm{mem}$ and $\tau_\mathrm{gen}$ with $n$ and $W$

In Figure 9, the left panel shows how the KL divergence and memorization fraction evolve with training time $\tau$ for different training set sizes $n$ at fixed width $W=128$, while the right panel fixes $n=2048$ and varies $W$. In both cases, we observe two distinct phases. First, the KL divergence decreases to near zero on a timescale $\tau_\mathrm{gen}$ independent of $n$, during which the model fully generalizes ($f_\mathrm{mem}=0$). Beyond $\tau_\mathrm{gen}$, both $D_\mathrm{KL}(\mathbb{P}_{\bm{\theta}}\|\mathbb{P}_0)$ and $f_\mathrm{mem}$ begin to rise at a time $\tau_\mathrm{mem}$ that scales linearly with $n$, as highlighted by the inset of the left panel. In contrast, $\tau_\mathrm{mem}$ scales as $W^{-1}$, as shown in the inset of the right panel. While the precise dependence of $\tau_\mathrm{gen}$ on $W$ remains inconclusive in this setting and requires a more careful analysis, these results on the GMM mirror the main findings of the MT: the training dynamics of DMs unfolds first in a generalization phase, and only later -- at $\tau_\mathrm{mem} \propto n/W$ -- does memorization begin.

Figure 9: Generalization--Memorization transition as a function of the training set size $n$ and width $W$ for ResNet score models on the GMM ($d=8$). KL divergence (solid lines, left axis) and memorization fraction $f_\mathrm{mem}$ (in %, dashed lines, right axis) against training time $\tau$ for (Left) $n \in \{256, 512, 1024, 2048, 4096\}$ at fixed $W=128$ and (Right) $W \in \{64, 128, 256\}$ at fixed $n=2048$. Insets: $D_\mathrm{KL}(\mathbb{P}_{\bm{\theta}} \| \mathbb{P}_0)$ and $f_\mathrm{mem}$ against the rescaled time $\tau/n$ (left) and $\tau W$ (right).


B.3 Discussion on conditional diffusion models

Conditional generation aims to sample from distributions of the form $p(\bm{\mathrm{x}} | \bm{\mathrm{y}})$, where $\bm{\mathrm{y}}$ denotes a conditioning variable such as a class label, a text embedding, or any other contextual information. DMs can naturally be extended to this setting, for instance using classifier-free guidance [56]. Although conditioning often improves sample quality in practice, memorization effects have also been observed in models trained conditionally [25, 59, 60]. Our analysis does not rely on the model being unconditional, since conditioning variables typically enter the model as additional inputs, and we expect our results to hold in this setting as well. To investigate this, we train a classifier-free guidance model to generate samples from our Gaussian mixture conditionally on the class label, and compute the memorized fraction as a function of $\tau$, which we report in Figure 10. In the inset, when rescaling the training time by $n$, the curves for $n \in \{256, 512, 1024\}$ all collapse, confirming that the phenomenon persists in the conditional setting. For more complex datasets, $\tau_\mathrm{mem}$ and $\tau_\mathrm{gen}$ may in fact depend on the conditioning variable, and intermediate regimes could exist where certain classes have already entered the generalization (or memorization) phase while others have not yet.

Figure 10: Effect of guidance on $\tau_\mathrm{mem}$. Evolution of $f_\mathrm{mem}$ as a function of $\tau$ for $n \in \{256, 512, 1024\}$ at fixed $W=64$.


C. Proofs of the analytical results

In the following, we provide the mathematical arguments and the proofs for the statements in the MT. The section using the replica method is not mathematically rigorous, but relies on a well-established method of theoretical physics that has already been shown to provide correct results in several cases. The final result is rigorous, since it can alternatively be obtained from the rigorous free random matrix approach of [17], as shown in Appendix C.4.

C.1 Notations

We recall here the notations used throughout Section 3 of the MT and Appendix C of the SM.
\begin{align} &d: \text{Data dimension}\\ &n:\text{Number of data points}\\ &p:\text{Dimension of the hidden layers of the RFNN}\\ &\bm{I}_d:\text{Identity matrix in dimension } d\\ &\sigma(x):\text{Activation function of the model}\\ &P_{\bm{\mathrm{x}}}:\text{Distribution of the data points}\\ &P_{t}:\text{Distribution of the noisy data points at diffusion time } t\\ &\psi_n=\frac{n}{d}\\ &\psi_p=\frac{p}{d}\\ &\Delta_t=1-e^{-2t}\\ & \boldsymbol{\Sigma}=\mathbb{E}_{\bm{\mathrm{x}}\sim P_{\bm{\mathrm{x}}}}[\bm{\mathrm{x}} \bm{\mathrm{x}}^T]\\ & \boldsymbol{\Sigma}_t=e^{-2t} \boldsymbol{\Sigma}+\Delta_t \bm{I}_d\\ &\Gamma_t^2=\frac{\operatorname{Tr}(\boldsymbol{\Sigma}_t)}{d}\\ &\sigma_{\bm{\mathrm{x}}}^2=\frac{\operatorname{Tr}(\boldsymbol{\Sigma})}{d} \end{align}
σ2=Ez[σ(Γtz)2]bt2=(Eu,v[vσ(etσxu+Δtv)])2at=Eu,v[σ(etσxu+Δtv)uetσx]vt2=Eu,v,w[σ(etσxu+Δtv)σ(etσxu+Δtw)]at2e2tσx2st2=σ2at2e2tσx2vt2bt2μ1(t)=Eu[σ(Γtu)u]=e2tσx2at2+bt2.\begin{align} &\lVert \sigma\rVert^2=\mathbb{E}_z[\sigma(\Gamma_t z)^2]\\ &b_t^2=\left(\mathbb{E}_{u, v}[v\sigma(e^{-t}\sigma_{\bm{\mathrm{x}}}u+\sqrt{\Delta_t}v)] \right)^2\\ &a_t=\mathbb{E}_{u, v}[\sigma(e^{-t}\sigma_{\bm{\mathrm{x}}}u+\sqrt{\Delta_t}v)\frac{u}{e^{-t}\sigma_{\bm{\mathrm{x}}}}]\\ &v_t^2=\mathbb{E}_{u, v, w}[\sigma(e^{-t}\sigma_{\bm{\mathrm{x}}}u+\sqrt{\Delta_t}v)\sigma(e^{-t}\sigma_{\bm{\mathrm{x}}}u+\sqrt{\Delta_t}w)]-a_t^2e^{-2t}\sigma_{\bm{\mathrm{x}}}^2\\ &s_t^2=\lVert \sigma \rVert^2-a_t^2e^{-2t}\sigma_{\bm{\mathrm{x}}}^2-v_t^2-b_t^2\\ &\mu_1(t)=\mathbb{E}_{u}[\sigma(\Gamma_tu)u]=\sqrt{e^{-2t}\sigma_{\bm{\mathrm{x}}}^2a_t^2+b_t^2}. \end{align}
Unless specified, all the expectation values are taken for standard Gaussian variables. We will denote
X=[x1...xn]Rd×n\begin{align} \bm{\mathrm{X}}=[\bm{\mathrm{x}}^1\lvert...\rvert \bm{\mathrm{x}}^n]\in\mathbb{R}^{d\times n} \end{align}
the matrix whose columns are the data point vectors and likewise we decompose W\bm{\mathrm{W}} as
W=[(W1) ⁣(Wp) ⁣]Rp×d,\begin{align} \bm{\mathrm{W}} = \begin{bmatrix} (\bm{\mathrm{W}}_1)^{\!\top} \\ \vdots \\ (\bm{\mathrm{W}}_p)^{\!\top} \end{bmatrix} \in \mathbb{R}^{p\times d}, \end{align}
where $\bm{\mathrm{W}}_i \in \mathbb{R}^{d}$ denotes the $i$-th row of $\bm{\mathrm{W}}$.
We recall the definitions of the matrices U\bm{\mathrm{U}} and V\bm{\mathrm{V}}
U=1nν=1nEξ[σ(Wxtν(ξ)d)σ(Wxtν(ξ)d)T],V=1nν=1nEξ[σ(Wxtν(ξ)d)ξT].\begin{align} & \bm{\mathrm{U}}=\frac{1}{n}\sum_{\nu=1}^n\mathbb{E}_{\bm{\xi}}\left[\sigma\left(\frac{\bm{\mathrm{W}} \bm{\mathrm{x}}_t^\nu(\bm{\xi})}{\sqrt{d}}\right)\sigma\left(\frac{\bm{\mathrm{W}} \bm{\mathrm{x}}_t^\nu(\bm{\xi})}{\sqrt{d}}\right)^T\right], \\ & \bm{\mathrm{V}}=\frac{1}{n}\sum_{\nu=1}^n\mathbb{E}_{\bm{\xi}}\left[\sigma\left(\frac{\bm{\mathrm{W}} \bm{\mathrm{x}}_t^\nu(\bm{\xi})}{\sqrt{d}}\right) \bm{\xi}^T\right]. \end{align}

C.2 Closed form of the learning dynamics

Proposition 3

Let A(τ)\bm{\mathrm{A}}(\tau) be the solution of the gradient flow (Equation 10) defined in the MT with initial conditions A(τ=0)=A0\bm{\mathrm{A}}(\tau=0)= \bm{\mathrm{A}}_0, then
A(τ)p=1ΔtVTU1+(1ΔtVTU1+A0p)e2ΔtψpUτ\begin{align} \frac{\bm{\mathrm{A}}(\tau)}{\sqrt{p}}=-\frac{1}{\sqrt{\Delta_t}} \bm{\mathrm{V}}^T \bm{\mathrm{U}}^{-1}+(\frac{1}{\sqrt{\Delta_t}} \bm{\mathrm{V}}^T \bm{\mathrm{U}}^{-1}+\frac{\bm{\mathrm{A}}_0}{\sqrt{p}})e^{-\frac{2\Delta_t}{\psi_p} \bm{\mathrm{U}} \tau} \end{align}
with
V=1nν=1nEξ[σ(Wxtν(ξ)d)ξT].\begin{align} \bm{\mathrm{V}}=\frac{1}{n}\sum_{\nu=1}^n\mathbb{E}_{\bm{\xi}}[\sigma(\frac{\bm{\mathrm{W}} \bm{\mathrm{x}}_t^\nu(\bm{\xi})}{\sqrt{d}}) \bm{\xi}^T]. \end{align}
Proof: We expand the square in the training loss
Ltrain(A)=1+ΔtdTr(ATpApU)+2ΔtdTr(ApV)\begin{align} \mathcal{L}_\mathrm{train}(\bm{\mathrm{A}})=1+\frac{\Delta_t}{d} \operatorname{Tr}(\frac{\bm{\mathrm{A}}^T}{\sqrt{p}}\frac{\bm{\mathrm{A}}}{\sqrt{p}} \bm{\mathrm{U}})+\frac{2\sqrt{\Delta_t}}{d} \operatorname{Tr}(\frac{\bm{\mathrm{A}}}{\sqrt{p}} \bm{\mathrm{V}}) \end{align}
and compute the gradient
ALtrain(A(τ))=2ΔtdApU+2ΔtdpVT.\begin{align} \nabla_{\bm{\mathrm{A}}}\mathcal{L}_\mathrm{train}(\bm{\mathrm{A}}(\tau))=\frac{2\Delta_t}{d}\frac{\bm{\mathrm{A}}}{p} \bm{\mathrm{U}}+\frac{2\sqrt{\Delta_t}}{d\sqrt{p}} \bm{\mathrm{V}}^T. \end{align}
Solving this ordinary differential equation yields the desired result. Consequently, the timescales of the dynamics of $\bm{\mathrm{A}}(\tau)$ are determined by the inverses of the eigenvalues of $\Delta_t \bm{\mathrm{U}} / \psi_p$.
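Proposition 3 is easy to check numerically: for random (small) positive-definite $\bm{\mathrm{U}}$, random $\bm{\mathrm{V}}$ and $\bm{\mathrm{A}}_0$, the closed form can be compared against a direct Euler integration of the linear flow $\frac{\mathrm{d}}{\mathrm{d}\tau}\frac{\bm{\mathrm{A}}}{\sqrt{p}} = -\frac{2\Delta_t}{\psi_p}\frac{\bm{\mathrm{A}}}{\sqrt{p}}\bm{\mathrm{U}} - \frac{2\sqrt{\Delta_t}}{\psi_p}\bm{\mathrm{V}}^T$ that the closed form solves (up to the overall time normalization of Equation 10). The snippet below is a sketch under these assumptions, with arbitrary toy values for the constants.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
d, p, psi_p, Delta_t = 4, 6, 1.5, 0.3   # toy sizes and constants (illustrative values)

# Random symmetric positive-definite U, random V and initial condition B0 = A0 / sqrt(p).
M = rng.normal(size=(p, p)); U = M @ M.T / p + 0.1 * np.eye(p)
V = rng.normal(size=(p, d))
B0 = rng.normal(size=(d, p))

B_star = -(V.T @ np.linalg.inv(U)) / np.sqrt(Delta_t)      # fixed point of the flow

def B_closed(tau):
    # Closed form of Proposition 3, written for B = A / sqrt(p).
    return B_star + (B0 - B_star) @ expm(-2.0 * Delta_t * U * tau / psi_p)

# Euler integration of dB/dtau = -(2 Delta_t/psi_p) B U - (2 sqrt(Delta_t)/psi_p) V^T.
tau_max, n_steps = 5.0, 20000
dt = tau_max / n_steps
B = B0.copy()
for _ in range(n_steps):
    B += dt * (-(2 * Delta_t / psi_p) * B @ U - (2 * np.sqrt(Delta_t) / psi_p) * V.T)

print(np.max(np.abs(B - B_closed(tau_max))))   # small discretization error
```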

C.3 Gaussian Equivalence Principle

As explained in [47, 48, 49], the Gaussian Equivalence Principle, which applies in the high-dimensional setting considered here, establishes an equivalence between the spectral properties of the original model and those of a Gaussian covariate model in which the nonlinear activation function is replaced by a linear term plus a nonlinear term acting as noise,
σ(Wxd)κ1Wxd+κη,xN(0,Ex[xxT]),ηN(0,Ip),(11)\begin{align} \sigma\left(\frac{ \bm{\mathrm{W}} \bm{\mathrm{x}}}{\sqrt d}\right) \to \kappa_1 \frac{\bm{\mathrm{W}} \bm{\mathrm{x}}'}{\sqrt d} + \kappa_* \bm{\eta}, \quad \bm{\mathrm{x}}'\sim\mathcal{N}(0, \mathbb{E}_{\bm{\mathrm{x}}}[\bm{\mathrm{x}} \bm{\mathrm{x}}^T]), \quad \bm{\eta}\sim \mathcal{N}(0, \bm{I}_p), \end{align}\tag{11}
where $\kappa_1, \kappa_*$ are constants that depend on the distribution of the data and on the activation function $\sigma$; we recall their expressions:
κ1=Ez[σ(σxz)zσx],κ=Ez[σ(σxz)2]κ12σx2.\begin{align} &\kappa_1=\mathbb{E}_z[\sigma(\sigma_{\bm{\mathrm{x}}}z)\frac{z}{\sigma_{\bm{\mathrm{x}}}}], \\ &\kappa_*=\sqrt{\mathbb{E}_z[\sigma(\sigma_{\bm{\mathrm{x}}}z)^2]-\kappa_1^2\sigma_{\bm{\mathrm{x}}}^2}. \end{align}
The expectations are taken with respect to $z\sim \mathcal{N}(0, 1)$, with $\sigma^2_{\bm{\mathrm{x}}}= \operatorname{Tr}(\boldsymbol{\Sigma})/d$; a numerical sketch for estimating $\kappa_1$ and $\kappa_*$ is given after the conditions below. The Gaussian Equivalence Principle (GEP) holds if the distribution $P_{\bm{\mathrm{x}}}$ of the vector $\bm{\mathrm{x}}$ satisfies the following conditions:
  • (i) Px(x)P_{\bm{\mathrm{x}}}(\bm{\mathrm{x}}) has sub-Gaussian tails: there exists a constant C>0C>0 such that for all A0A \ge 0 and each entry xi\bm{\mathrm{x}}_i,
P(xiA)2eA2/C.\begin{align} \mathbb{P}(\lvert \bm{\mathrm{x}}_i \rvert \ge A) \le 2 \, e^{-A^2 / C}. \end{align}
  • (ii) The data covariance matrix Σ=ExPx[xxT]\boldsymbol{\Sigma}=\mathbb{E}_{\bm{\mathrm{x}}\sim P_{\bm{\mathrm{x}}}}[\bm{\mathrm{x}} \bm{\mathrm{x}}^T] is bounded: there exists a constant K>0K>0 independent of dd such that λmax(Σ)<K\lambda_{\textrm{max}}(\boldsymbol{\Sigma})<K and TrΣd<K\frac{\operatorname{Tr} \boldsymbol{\Sigma}}{d}<K where λmax(Σ)\lambda_{\textrm{max}}(\boldsymbol{\Sigma}) denotes the spectral norm of Σ\boldsymbol{\Sigma}.
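As a minimal illustration, the constants $\kappa_1$ and $\kappa_*$ can be estimated with a one-dimensional Monte-Carlo average. The sketch below uses $\sigma=\tanh$ and a user-supplied value of $\sigma_{\bm{\mathrm{x}}}$; it is an illustrative helper, not part of the analytical derivation.

```python
import numpy as np

def kappas(activation, sigma_x, n_mc=1_000_000, seed=0):
    """Monte-Carlo estimate of kappa_1 and kappa_* for a given activation."""
    z = np.random.default_rng(seed).normal(size=n_mc)
    sz = activation(sigma_x * z)
    kappa1 = np.mean(sz * z / sigma_x)        # E[sigma(sigma_x z) z / sigma_x]
    kappa_star = np.sqrt(np.mean(sz ** 2) - kappa1 ** 2 * sigma_x ** 2)
    return kappa1, kappa_star

k1, ks = kappas(np.tanh, sigma_x=1.0)
```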
In this section, we outline the derivation of the Gaussian Equivalence Principle (GEP) for the matrices $\bm{\mathrm{U}}, \tilde{\bm{\mathrm{U}}}, \bm{\mathrm{V}}$ and $\tilde{\bm{\mathrm{V}}}$ under an arbitrary input covariance. This generalizes the approach developed in [17], which considered only the case of data drawn from $\mathcal{N}(0, \bm{I}_d)$. A more rigorous approach, following [52], is left for future work. We will make use of the Mehler kernel formula [61], which states that for $f$ and $g$ test functions on $\mathbb{R}$,
\begin{align} \mathbb{E}_{u, v\sim P^\gamma}[f(u)g(v)]=\sum_{s=0}^\infty \frac{\gamma^s}{s!}\mathbb{E}_{u\sim\mathcal{N}(0,1)}[He_s(u)f(u)]\,\mathbb{E}_{v\sim\mathcal{N}(0,1)}[He_s(v)g(v)],
\end{align}
where the expectation on the left-hand side is taken over jointly Gaussian random variables $u$ and $v$ with zero mean, unit variance, and correlation $\gamma$, while on the right-hand side the expectations are taken over independent standard Gaussian variables. $He_s$ denotes the $s$-th (probabilists') Hermite polynomial. We recall some useful properties of the Hermite polynomials [62] (a numerical check of the truncated expansion used below is sketched after this list):
  • They form an orthogonal basis of $L^2(\mathbb{R}, \frac{e^{-x^2/2}}{\sqrt{2\pi}} \mathrm{d} x)$.
  • The first Hermite polynomials are $He_0(x)=1$ and $He_1(x)=x$.
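The truncation used repeatedly below (keeping only the $s=0, 1$ terms) can be checked numerically. The sketch below compares both sides of the Mehler expansion for correlated Gaussians, using probabilists' Hermite polynomials from numpy and a smooth test function ($f=g=\tanh$, chosen purely for illustration).

```python
import numpy as np
from numpy.polynomial import hermite_e as He
from math import factorial

def mehler_check(gamma=0.1, s_max=6, n_mc=2_000_000, seed=0):
    rng = np.random.default_rng(seed)
    f = np.tanh
    # Left-hand side: E[f(u) f(v)] for jointly Gaussian (u, v) with correlation gamma.
    u = rng.normal(size=n_mc)
    v = gamma * u + np.sqrt(1 - gamma ** 2) * rng.normal(size=n_mc)
    lhs = np.mean(f(u) * f(v))
    # Right-hand side: truncated Hermite expansion with an independent Gaussian.
    w = rng.normal(size=n_mc)
    rhs = 0.0
    for s in range(s_max + 1):
        coeffs = np.zeros(s + 1); coeffs[s] = 1.0        # coefficients selecting He_s
        c_s = np.mean(He.hermeval(w, coeffs) * f(w))     # E[He_s(w) f(w)]
        rhs += gamma ** s / factorial(s) * c_s ** 2
    return lhs, rhs

print(mehler_check())   # the two values agree up to Monte-Carlo error
```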

Lemma 4: Gaussian Equivalence Principle for $\bm{\mathrm{U}}$

In the limit n,p,dn, p, d\rightarrow\infty with ψp=p/d,ψn=n/d\psi_p=p/d, \psi_n=n/d and with a dataset {xν}ν=1n\{\bm{\mathrm{x}}^\nu\}_{\nu=1}^n sampled from a distribution PxP_{\bm{\mathrm{x}}} which verifies assumptions (i) and (ii) with Σ=EPx[xxT]\boldsymbol{\Sigma}=\mathbb{E}_{P_{\bm{\mathrm{x}}}}[\bm{\mathrm{x}} \bm{\mathrm{x}}^T], the matrix
U=1nν=1nEξ[σ(Wxtν(ξ)d)σ(Wxtν(ξ)d)T]\begin{align} \bm{\mathrm{U}}=\frac{1}{n}\sum_{\nu=1}^n\mathbb{E}_{\bm{\xi}}[\sigma(\frac{\bm{\mathrm{W}} \bm{\mathrm{x}}_t^\nu(\bm{\xi})}{\sqrt{d}})\sigma(\frac{\bm{\mathrm{W}} \bm{\mathrm{x}}_t^\nu(\bm{\xi})}{\sqrt{d}})^T] \end{align}
has the same spectrum as its Gaussian equivalent
U=GnGTn+bt2WWTd+st2Ip\begin{align} \bm{\mathrm{U}}=\frac{\bm{\mathrm{G}}}{\sqrt{n}}\frac{\bm{\mathrm{G}}^T}{\sqrt{n}}+b_t^2\frac{\bm{\mathrm{W}} \bm{\mathrm{W}}^T}{d}+s_t^2 \bm{I} _p \end{align}
where
G=etatWdX+vtΩ,\begin{align} \bm{\mathrm{G}}=e^{-t} a_t\frac{\bm{\mathrm{W}}}{\sqrt{d}} \bm{\mathrm{X}}'+v_t \bm{\Omega}, \end{align}
$\bm{\mathrm{X}}'\in\mathbb{R}^{d\times n}$ is a matrix whose columns $\bm{\mathrm{x}}'^\nu$ are sampled according to $\mathcal{N}(0, \boldsymbol{\Sigma})$, and $\bm{\Omega}\in \mathbb{R}^{p\times n}$ has i.i.d. standard Gaussian entries independent of $\bm{\mathrm{X}}'$ and $\bm{\mathrm{W}}$.
Proof: For the sake of clarity, in this proof we make the covariance of the data explicit by writing the data points as $\bm{\mathrm{x}}^\nu= \boldsymbol{\Sigma}^{1/2} \bm{\mathrm{z}}^\nu$, where the vectors $\bm{\mathrm{z}}^\nu$ have unit variance. Let us focus on the element of $\bm{\mathrm{U}}$ in position $(i, j)$,
Uij=1nν=1nEξ[σ(Wik(et(Σ1/2)klzlν+Δtξk)d)σ(Wjk(et(Σ1/2)klzlν+Δtξk)d)],\begin{align} \bm{\mathrm{U}}_{ij}=\frac{1}{n}\sum_{\nu=1}^n\mathbb{E}_{\bm{\xi}}[\sigma\left(\frac{\bm{\mathrm{W}}_{ik} (e^{-t}(\Sigma^{1/2})_{kl} \bm{\mathrm{z}}^\nu_l+\sqrt{\Delta_t} \bm{\xi}_k)}{\sqrt{d}}\right)\sigma\left(\frac{\bm{\mathrm{W}}_{jk'} (e^{-t}(\Sigma^{1/2})_{k'l'} \bm{\mathrm{z}}^\nu_{l'}+\sqrt{\Delta_t} \bm{\xi}_{k'})}{\sqrt{d}}\right)], \end{align}
where repeated indices mean that there is a hidden sum. We introduce the random variable χiν=Wik(et(Σ1/2)klzlν+Δtξk)d\chi_i^\nu=\frac{\bm{\mathrm{W}}_{ik} (e^{-t}(\Sigma^{1/2})_{kl} \bm{\mathrm{z}}^\nu_l+\sqrt{\Delta_t} \bm{\xi}_k)}{\sqrt{d}}. In the high dimensional limit it converges to a Gaussian random variable by the Central Limit Theorem (since the tails of the data distribution are sub-Gaussian). If i=ji=j, the diagonal terms concentrate with respect to the data points and we can thus replace the sum by an average
Uii=Eχ[σ(χ)2]+O(1/n).\begin{align} \bm{\mathrm{U}}_{ii}=\mathbb{E}_{\chi}[\sigma(\chi)^2]+\mathcal{O}(1/n). \end{align}
The finite-$n$ corrections can be discarded because they do not affect the spectrum of $\bm{\mathrm{U}}$ in the high-dimensional limit. The variable $\chi$ can be taken Gaussian with mean 0 and variance $\mathbb{E}_{\bm{\mathrm{W}}_i, \bm{\mathrm{z}}, \xi}[\chi^2]=\mathbb{E}_{\bm{\mathrm{W}}_i, \bm{\mathrm{z}}, \xi}\left[\frac{\bm{\mathrm{W}}_i^T \boldsymbol{\Sigma}_t \bm{\mathrm{W}}_i}{d}\right]=\frac{\operatorname{Tr}(\boldsymbol{\Sigma}_t)}{d}=\Gamma_t^2$, hence
Uii=Eχ[σ(χ)2]+O(1/n)=EuN(0,1)[σ(Γtu)2]=σ2.\begin{align} \bm{\mathrm{U}}_{ii}=\mathbb{E}_{\chi}[\sigma(\chi)^2]+\mathcal{O}(1/n)=\mathbb{E}_{u\sim\mathcal{N}(0, 1)}[\sigma(\Gamma_tu)^2]=\lvert \lvert \sigma\rvert\rvert^2. \end{align}
If iji\neq j, we denote ηiν=etWiTΣ1/2zd\eta_i^\nu=e^{-t}\frac{\bm{\mathrm{W}}_i^T \Sigma^{1/2} \bm{\mathrm{z}}}{\sqrt{d}}. For now we consider W\bm{\mathrm{W}} and the zν\bm{\mathrm{z}}^\nu fixed and look at ξ\xi. We use the Mehler Kernel formula for the random variables u=WiTξ/du= \bm{\mathrm{W}}_i^T \bm{\xi}/\sqrt{d} and v=WjTξ/dv= \bm{\mathrm{W}}_j^T \bm{\xi}/\sqrt{d} that have correlation Eξ[uv]=WiTWjd\mathbb{E}_{\bm{\xi}}[uv]=\frac{\bm{\mathrm{W}}_i^T \bm{\mathrm{W}}_j}{d}
Eu,v[σ(ηiν+Δtu)σ(ηjν+Δtv)]=s=1(WiTWj/d)ss!Eu[Hes(u)σ(ηiν+Δtu)]Eu[Hes(v)σ(ηjν+Δtv)].\begin{align} &\mathbb{E}_{u, v}[\sigma\left(\eta_i^\nu+\sqrt{\Delta_t}u\right)\sigma\left(\eta_j^\nu+\sqrt{\Delta_t}v\right)]\\ &=\sum_{s=1}^\infty\frac{(\bm{\mathrm{W}}_i^T \bm{\mathrm{W}}_j/d)^s}{s!}\mathbb{E}_u[He_s(u)\sigma\left(\eta_i^\nu+\sqrt{\Delta_t}u\right)]\mathbb{E}_u[He_s(v)\sigma\left(\eta_j^\nu+\sqrt{\Delta_t}v\right)]. \end{align}
We truncate at order s=1s=1 since the corrections are order O(1/d)\mathcal{O}(1/d).
Uij=1nν=1nEu[σ(ηiν+Δtu)]Ev[σ(ηjν+Δtv)]+1nν=1nWiTWjdEu[uσ(ηiν+Δtu)]Ev[vσ(ηjν+Δtv)]=1nν=1nEu[σ(ηiν+Δtu)]Ev[σ(ηjν+Δtv)]+WiTWjdEη[Eu[uσ(ηi+Δtu)]Ev[vσ(ηj+Δtv)]].\begin{align} \bm{\mathrm{U}}_{ij}&=\frac{1}{n}\sum_{\nu=1}^n \mathbb{E}_u[\sigma\left(\eta_i^\nu+\sqrt{\Delta_t}u\right)]\mathbb{E}_v[\sigma\left(\eta_j^\nu+\sqrt{\Delta_t}v\right)]\\ &+\frac{1}{n}\sum_{\nu=1}^n\frac{\bm{\mathrm{W}}_i^T \bm{\mathrm{W}}_j}{d}\mathbb{E}_u[u\sigma\left(\eta_i^\nu+\sqrt{\Delta_t}u\right)]\mathbb{E}_v[v\sigma\left(\eta_j^\nu+\sqrt{\Delta_t}v\right)]\\ &=\frac{1}{n}\sum_{\nu=1}^n \mathbb{E}_u[\sigma\left(\eta_i^\nu+\sqrt{\Delta_t}u\right)]\mathbb{E}_v[\sigma\left(\eta_j^\nu+\sqrt{\Delta_t}v\right)]\\ &+\frac{\bm{\mathrm{W}}_i^T \bm{\mathrm{W}}_j}{d}\mathbb{E}_{\bm{\eta}}[\mathbb{E}_u[u\sigma\left(\bm{\eta}_i+\sqrt{\Delta_t}u\right)]\mathbb{E}_v[v\sigma\left(\bm{\eta}_j+\sqrt{\Delta_t}v\right)]]. \end{align}
where we neglected $\mathcal{O}(1/d)$ corrections, and where the law of $\bm{\eta}$ can be considered Gaussian with zero mean and correlation $\mathbb{E}[\eta_i^\nu \eta_j^\nu]=\frac{e^{-2t} \operatorname{Tr}(\boldsymbol{\Sigma})}{d}\delta_{ij}=e^{-2t}\sigma_{\bm{\mathrm{x}}}^2\delta_{ij}$. The coefficient in front of $\frac{\bm{\mathrm{W}}_i^T \bm{\mathrm{W}}_j}{d}$ is therefore
bt2=(Eu,v[vσ(etσxu+Δtv)])2.\begin{align} b_t^2=(\mathbb{E}_{u, v}[v\sigma(e^{-t}\sigma_{\bm{\mathrm{x}}}u+\sqrt{\Delta_t}v)])^2. \end{align}
Denote σ0(η)=Eu[σ(η+Δtu)]\sigma_0(\eta)=\mathbb{E}_u[\sigma(\eta+\sqrt{\Delta_t}u)]. We now focus on
1nνσ0(ηiν)σ0(ηjν).\begin{align} \frac{1}{n}\sum_\nu \sigma_0(\eta_i^\nu)\sigma_0(\eta_j^\nu). \end{align}
We use the GEP on σ0\sigma_0
σ0(etWiTxνd)atetWiTxνd+ vtΩiν,xνN(0,Σ),ΩiνN(0,Ip),\begin{align} \sigma_0\left(\frac{e^{-t} \bm{\mathrm{W}}_i^T \bm{\mathrm{x}}^\nu}{\sqrt d}\right) \to a_t e^{-t}\frac{\bm{\mathrm{W}}_i^T \bm{\mathrm{x}}'^\nu}{\sqrt d} + \ v_t \bm{\Omega}_i^\nu, \quad \bm{\mathrm{x}}'^\nu\sim\mathcal{N}(0, \boldsymbol{\Sigma}), \quad \bm{\Omega}_i^\nu\sim \mathcal{N}(0, \bm{I}_{p}), \end{align}
with at=Eu[σ0(etσxu)uetσx]=Eu,v[σ(etσxu+Δtv)uetσx]a_t=\mathbb{E}_u[\sigma_0(e^{-t}\sigma_{\bm{\mathrm{x}}}u)\frac{u}{e^{-t}\sigma_{\bm{\mathrm{x}}}}]=\mathbb{E}_{u, v}[\sigma(e^{-t}\sigma_{\bm{\mathrm{x}}} u+\sqrt{\Delta_t }v)\frac{u}{e^{-t}\sigma_{\bm{\mathrm{x}}}}] and vt2=Eu[σ0(etσxu)2]at2e2tσx2=Eu,v,w[σ(etσxu+Δtv)σ(etσxu+Δtw)]at2e2tσx2v_t^2=\mathbb{E}_u[\sigma_0(e^{-t}\sigma_{\bm{\mathrm{x}}}u)^2]-a_t^2e^{-2t}\sigma_{\bm{\mathrm{x}}}^2=\mathbb{E}_{u, v, w}[\sigma(e^{-t}\sigma_{\bm{\mathrm{x}}}u+\sqrt{\Delta_t}v)\sigma(e^{-t}\sigma_{\bm{\mathrm{x}}}u+\sqrt{\Delta_t}w)]-a_t^2e^{-2t}\sigma_{\bm{\mathrm{x}}}^2. Hence the truncated expansion yields for iji\neq j
Uij=1nν=1n(atetWiTxνd+vtΩiν)(atetWjTxνd+vtΩjν)T+bt2WiTWjd.\begin{align} \bm{\mathrm{U}}_{ij}=\frac{1}{n}\sum_{\nu=1}^n\left(a_te^{-t}\frac{\bm{\mathrm{W}}_i^T \bm{\mathrm{x}}'^\nu}{\sqrt{d}}+v_t \bm{\Omega}_i^\nu\right)\left(a_te^{-t}\frac{\bm{\mathrm{W}}_j^T \bm{\mathrm{x}}'^\nu}{\sqrt{d}}+v_t \bm{\Omega}_j^\nu \right)^T+b_t^2\frac{\bm{\mathrm{W}}_i^T \bm{\mathrm{W}}_j}{d}. \end{align}
Finally, we deal with the diagonal terms: extending the off-diagonal expression to $i=j$ would give $a_t^2e^{-2t}\sigma_{\bm{\mathrm{x}}}^2+v_t^2+b_t^2$ instead of the correct value $\lVert\sigma\rVert^2$ computed above. We therefore need to subtract
(at2e2tσx2+vt2+bt2)Ip.\begin{align} \left(a_t^2e^{-2t}\sigma_{\bm{\mathrm{x}}}^2+v_t^2+b_t^2\right) \bm{I}_p. \end{align}
and add $\lVert \sigma\rVert^2 \bm{I}_p$ in its place. The Gaussian equivalent of $\bm{\mathrm{U}}$ thus reads
U=GnGTn+bt2WWTd+st2Ip,\begin{align} \bm{\mathrm{U}}=\frac{\bm{\mathrm{G}}}{\sqrt{n}}\frac{\bm{\mathrm{G}}^T}{\sqrt{n}}+b_t^2\frac{\bm{\mathrm{W}} \bm{\mathrm{W}}^T}{d}+s_t^2 \bm{I}_p, \end{align}
with st2=σ2at2e2tσx2vt2bt2s_t^2=\lVert \sigma \rVert^2- a_t^2e^{-2t}\sigma_{\bm{\mathrm{x}}}^2-v_t^2-b_t^2.

Lemma (GEP for $\tilde{\bm{\mathrm{U}}}$).

Let
U~=Ey[σ(Wyd)σ(Wyd)T],\begin{align} \tilde{\bm{\mathrm{U}}}=\mathbb{E}_{\bm{\mathrm{y}}}[\sigma(\frac{\bm{\mathrm{W}} \bm{\mathrm{y}}}{\sqrt{d}})\sigma(\frac{\bm{\mathrm{W}} \bm{\mathrm{y}}}{\sqrt{d}})^T], \end{align}
where the expectation value is taken over $\bm{\mathrm{y}}\sim P_t$. Then the Gaussian equivalent of $\tilde{\bm{\mathrm{U}}}$ reads
\begin{align} \frac{\mu_1^2(t)}{\Gamma_t^2}\frac{\bm{\mathrm{W}} \boldsymbol{\Sigma}_t \bm{\mathrm{W}}^T}{d}+\left(\lVert \sigma \rVert^2-\mu_1^2(t)\right) \bm{I}_p, \end{align}
where μ12(t)\mu_1^2(t) and σ2\lVert \sigma \rVert^2 are defined in Appendix C.1.
Proof: For a vector y\bm{\mathrm{y}} sampled from PtP_t, the WiTyd\frac{\bm{\mathrm{W}}_i^T \bm{\mathrm{y}}}{\sqrt{d}} are asymptotically Gaussian with 0 mean, variance Ey[WiTydWiTyd]=WiTΣtWidΓt2\mathbb{E}_{\bm{\mathrm{y}}}[\frac{\bm{\mathrm{W}}_i^T \bm{\mathrm{y}}}{\sqrt{d}} \frac{\bm{\mathrm{W}}_i^T \bm{\mathrm{y}}}{\sqrt{d}}]=\frac{\bm{\mathrm{W}}_i^T \boldsymbol{\Sigma}_t \bm{\mathrm{W}}_i}{d}\sim\Gamma_t^2 and correlation Ey[WiTydWjTyd]=WiTΣtWjd\mathbb{E}_{\bm{\mathrm{y}}}[\frac{\bm{\mathrm{W}}_i^T \bm{\mathrm{y}}}{\sqrt{d}} \frac{\bm{\mathrm{W}}_j^T \bm{\mathrm{y}}}{\sqrt{d}}]=\frac{\bm{\mathrm{W}}_i^T \boldsymbol{\Sigma}_t \bm{\mathrm{W}}_j}{d}. We apply Mehler Kernel formula to U~\tilde{\bm{\mathrm{U}}}
U~ij=s1s!(Wik(Σt)klWjlΓt2d)sEu[σ(Γtu)Hes(u)]Ev[σ(Γtv)Hes(v)],\begin{align} \tilde{\bm{\mathrm{U}}}_{ij}=\sum_s\frac{1}{s!}\left(\frac{\bm{\mathrm{W}}_{ik}(\boldsymbol{\Sigma}_t)_{kl} \bm{\mathrm{W}}_{jl}}{\Gamma_t^2d}\right)^s \mathbb{E}_u[\sigma(\Gamma_tu)He_s(u)]\mathbb{E}_v[\sigma(\Gamma_tv)He_s(v)], \end{align}
where the expectation on uu and vv is standard Gaussian. We keep only terms at order O(1/d)\mathcal{O}(1/\sqrt{d}). If iji\neq j we keep the terms up to order s=1s=1.
U~ij=(Wik(Σt)klWjlΓt2d)Eu[σ(Γtu)u]2.\begin{align} \tilde{\bm{\mathrm{U}}}_{ij}=\left(\frac{\bm{\mathrm{W}}_{ik}(\boldsymbol{\Sigma}_t)_{kl} \bm{\mathrm{W}}_{jl}}{\Gamma_t^2d}\right)\mathbb{E}_u[\sigma(\Gamma_tu)u]^2. \end{align}
For $i=j$ we cannot truncate the expansion, because all terms are $\mathcal{O}_d(1)$. Hence the diagonal terms are asymptotically
\begin{align} \tilde{\bm{\mathrm{U}}}_{ii}=\mathbb{E}_{u\sim\mathcal{N}(0, 1)}[\sigma^2(\Gamma_t u)]=\lVert\sigma\rVert^2. \end{align}
Taking care of the diagonal terms, the Gaussian Equivalent matrix reads
\begin{align} \tilde{\bm{\mathrm{U}}}=\frac{\mu_1^2(t)}{\Gamma_t^2}\frac{\bm{\mathrm{W}} \boldsymbol{\Sigma}_t \bm{\mathrm{W}}^T}{d}+\left(\lVert \sigma \rVert^2-\mu_1^2(t)\right) \bm{I}_p \end{align}
where μ1(t)=Eu[σ(Γtu)u]\mu_1(t)=\mathbb{E}_{u}[\sigma(\Gamma_tu)u].
Building on the GEP of U~\tilde{\bm{\mathrm{U}}}, we prove the following lemma on the scaling of the eigenvalues in the bulk.

Lemma 5: Scaling of the bulk of $\tilde{\bm{\mathrm{U}}}$

We assume that Σ\boldsymbol{\Sigma} is positive definite and that the spectral norm λmax(Σ)\lambda_{\mathrm{max}}(\boldsymbol{\Sigma}) stays Od(1)\mathcal{O}_d(1). In the high dimensional limit p>d1p>d\gg 1, the spectrum of U~\tilde{\bm{\mathrm{U}}} is asymptotically equal to
(11ψp)δ(λ(σ2μ12(t)))+1ψpρbulk(λ),\begin{align} \left(1-\frac{1}{\psi_p}\right)\delta(\lambda-(\lVert \sigma \rVert^2-\mu_1^2(t)))+\frac{1}{\psi_p}\rho_{\mathrm{bulk}}(\lambda), \end{align}
where ρbulk(λ)\rho_{\mathrm{bulk}}(\lambda) is an atomless measure whose support is of order O(ψp)\mathcal{O}(\psi_p).
Proof: Since $p>d$, $\bm{\mathrm{W}}\in\mathbb{R}^{p\times d}$ and $\boldsymbol{\Sigma}_t\in \mathbb{R}^{d\times d}$, the matrix $\bm{\mathrm{W}} \boldsymbol{\Sigma}_t \bm{\mathrm{W}}^T/d$ has rank at most $d$, so the spectrum admits a Dirac mass at $\lambda=\lVert \sigma \rVert^2-\mu_1^2(t)$ with weight $(p-d)/p$. For the order of magnitude of the eigenvalues in the bulk, first observe that the nonzero eigenvalues of $\frac{\bm{\mathrm{W}} \boldsymbol{\Sigma}_t \bm{\mathrm{W}}^T}{d}$ coincide with those of $\frac{\boldsymbol{\Sigma}_t^{1/2} \bm{\mathrm{W}}^T \bm{\mathrm{W}} \boldsymbol{\Sigma}_t^{1/2}}{d}$. We can bound the spectral norm of the product by the product of the spectral norms,
\begin{align} \lambda_{\mathrm{max}}\left(\frac{\boldsymbol{\Sigma}_t^{1/2} \bm{\mathrm{W}}^T \bm{\mathrm{W}} \boldsymbol{\Sigma}_t^{1/2}}{d}\right)\le \lambda_{\mathrm{max}}\left(\frac{\bm{\mathrm{W}}^T \bm{\mathrm{W}}}{d}\right) \lambda_{\mathrm{max}}(\boldsymbol{\Sigma}_t)\lesssim\mathcal{O}(\psi_p), \end{align}
since we assumed that $\lambda_{\mathrm{max}}(\boldsymbol{\Sigma}_t)=e^{-2t}\lambda_{\mathrm{max}}(\boldsymbol{\Sigma})+\Delta_t=\mathcal{O}(1)$, and since $\lambda_{\mathrm{max}}(\frac{\bm{\mathrm{W}}^T \bm{\mathrm{W}}}{d})=\mathcal{O}(\psi_p)$ by the Marchenko-Pastur law [63]. To bound the bulk from below, we use the following inequality
\begin{align} \lambda_{\mathrm{min}}(\boldsymbol{\Sigma}_t)\,\lambda_{\mathrm{min}}\left(\frac{\bm{\mathrm{W}}^T \bm{\mathrm{W}}}{d}\right)\le \lambda_{\mathrm{min}}\left(\frac{\boldsymbol{\Sigma}_t^{1/2} \bm{\mathrm{W}}^T \bm{\mathrm{W}} \boldsymbol{\Sigma}_t^{1/2}}{d}\right). \end{align}
Since $\boldsymbol{\Sigma}_t$ is positive definite and $\lambda_{\mathrm{min}}(\frac{\bm{\mathrm{W}}^T \bm{\mathrm{W}}}{d})$ is itself of order $\psi_p$ (for $\psi_p>1$), the lower bound is also of order $\psi_p$. This shows that the support of the bulk is of order $\psi_p$.
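The $\mathcal{O}(\psi_p)$ scaling invoked here is easy to check numerically: for $\bm{\mathrm{W}}$ with i.i.d. standard Gaussian entries, the largest eigenvalue of $\bm{\mathrm{W}}^T\bm{\mathrm{W}}/d$ concentrates near the Marchenko-Pastur edge $(1+\sqrt{\psi_p})^2$, which is $\mathcal{O}(\psi_p)$ for large $\psi_p$. A minimal sketch:

```python
import numpy as np

d, psi_p = 400, 4.0
p = int(psi_p * d)
W = np.random.default_rng(0).normal(size=(p, d))
lam_max = np.linalg.eigvalsh(W.T @ W / d).max()
mp_edge = (1 + np.sqrt(psi_p)) ** 2          # Marchenko-Pastur upper edge
print(lam_max, mp_edge)                      # the two values are close
```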

Lemma (GEP for $\bm{\mathrm{V}}$ and $\tilde{\bm{\mathrm{V}}}$).

Let
V=1nν=1nEξ[σ(Wxtν(ξ)d)ξT],V~=Ex,ξ[σ(Wxt(ξ)d)ξT].\begin{align} & \bm{\mathrm{V}}=\frac{1}{n}\sum_{\nu=1}^n\mathbb{E}_{\bm{\xi}}[\sigma(\frac{\bm{\mathrm{W}} \bm{\mathrm{x}}_t^\nu(\bm{\xi})}{\sqrt{d}}) \bm{\xi}^T], \\ &\tilde{\bm{\mathrm{V}}}=\mathbb{E}_{\bm{\mathrm{x}}, \bm{\xi}}[\sigma(\frac{\bm{\mathrm{W}} \bm{\mathrm{x}}_t(\bm{\xi})}{\sqrt{d}}) \bm{\xi}^T]. \end{align}
In the train and test losses, they can be replaced by their common Gaussian equivalent:
V~=V=μ1(t)ΔtΓtWd.\begin{align} \tilde{\bm{\mathrm{V}}}= \bm{\mathrm{V}}=\frac{\mu_1(t)\sqrt{\Delta_t}}{\Gamma_t}\frac{\bm{\mathrm{W}}}{\sqrt{d}}. \end{align}
Proof: The two matrices only differ element-wise by quantities of order $\mathcal{O}(1/n)$ and therefore have the same Gaussian equivalent matrix. We focus on $\tilde{\bm{\mathrm{V}}}$. Introduce the random variable $\bm{\eta}_i=\frac{\bm{\mathrm{W}}_{ik}(e^{-t} \bm{\mathrm{x}}_k+\sqrt{\Delta_t} \bm{\xi}_k)}{\sqrt{d}}$. It has zero mean, variance $\mathbb{E}_{\bm{\mathrm{x}}, \bm{\xi}}[\bm{\eta}_i^2]=\frac{\bm{\mathrm{W}}_i^T \boldsymbol{\Sigma}_t \bm{\mathrm{W}}_i}{d}\sim\Gamma_t^2$, and correlation with $\bm{\xi}$ given by $\gamma_{ij}=\mathbb{E}_{\bm{\mathrm{x}}, \bm{\xi}}[\bm{\eta}_i \bm{\xi}_j]=\frac{\sqrt{\Delta_t} \bm{\mathrm{W}}_{ij}}{\sqrt{d}}$. We apply the Mehler kernel formula:
V~ij=Ex,ξ[σ(Γt(Wik(et(Σt)klzl+Δtξl)Γtd))ξj]=s1s!(WijΔtΓtd)sEu[σ(Γtu)Hes(u)]Ev[vHes(v)]=0+ΔtΓtWijdEu[σ(Γtu)u]Ev[v2]+O(1d)=Δtμ1(t)ΓtWijd.\begin{align} \tilde{\bm{\mathrm{V}}}_{ij}&=\mathbb{E}_{\bm{\mathrm{x}}, \bm{\xi}}[\sigma\left(\Gamma_t(\frac{\bm{\mathrm{W}}_{ik}(e^{-t}(\boldsymbol{\Sigma}_t)_{kl} \bm{\mathrm{z}}_l+\sqrt{\Delta_t}\xi_l)}{\Gamma_t\sqrt{d}})\right)\xi_{j}]\\ &=\sum_s\frac{1}{s!}\left(\frac{\bm{\mathrm{W}}_{ij}\sqrt{\Delta_t}}{\Gamma_t\sqrt{d}}\right)^s\mathbb{E}_u[\sigma(\Gamma_t u)He_s(u)]\mathbb{E}_v[vHe_s(v)]\\ &=0+\frac{\sqrt{\Delta_t}}{\Gamma_t}\frac{\bm{\mathrm{W}}_{ij}}{\sqrt{d}}\mathbb{E}_u[\sigma(\Gamma_t u)u]\mathbb{E}_v[v^2]+\mathcal{O}(\frac{1}{d})\\ &=\frac{\sqrt{\Delta_t}\mu_1(t)}{\Gamma_t}\frac{\bm{\mathrm{W}}_{ij}}{\sqrt{d}}. \end{align}

C.4 Proof of Theorem 1

We recall Theorem 1 of the MT.

Theorem 6

Let q(z)=1pTr(UzIp)1q(z)=\frac{1}{p} \operatorname{Tr}(\bm{\mathrm{U}}-z \bm{I}_p)^{-1}, r(z)=1pTr(Σ1/2WT(UzIp)1WΣ1/2)r(z)=\frac{1}{p} \operatorname{Tr}(\boldsymbol{\Sigma}^{1/2} \bm{\mathrm{W}}^T(\bm{\mathrm{U}}-z \bm{I}_p)^{-1} \bm{\mathrm{W}} \boldsymbol{\Sigma}^{1/2}) and s(z)=1pTr(WT(UzIp)1W)s(z)=\frac{1}{p} \operatorname{Tr}(\bm{\mathrm{W}}^T(\bm{\mathrm{U}}-z \bm{I}_p)^{-1} \bm{\mathrm{W}}), with zCz\in\mathbb{C}. Let
s^(q)=bt2ψp+1q,r^(r,q)=ψpat2e2t1+at2e2tψpψnr+ψpvt2ψnq.\begin{align} &\hat{s}(q)=b_t^2\psi_p+\frac{1}{q}, \\ &\hat{r}(r, q)=\frac{\psi_p a_t^2e^{-2t}}{1+\frac{a_t^2e^{-2t}\psi_p }{\psi_n }r+\frac{\psi_p v_t^2}{\psi_n }q}. \end{align}
Then q(z),r(z)q(z), r(z) and s(z)s(z) satisfy the following set of three equations:
s=dρΣ(λ)1s^(q)+λr^(r,q),r=dρΣ(λ)λs^(q)+λr^(r,q),ψp(st2z)+ψpvt21+at2e2tψpψnr+ψpvt2ψnq+1ψpqsq2=0,\begin{align} &s=\int \mathrm{d} \rho_{\boldsymbol{\Sigma}}(\lambda)\frac{1}{\hat{s}(q)+\lambda\hat{r}(r, q)}, \\ &r=\int \mathrm{d} \rho_{\boldsymbol{\Sigma}}(\lambda)\frac{\lambda}{\hat{s}(q)+\lambda\hat{r}(r, q)}, \\ &\psi_p(s_t^2-z)+\frac{\psi_pv_t^2}{1+\frac{a_t^2e^{-2t}\psi_p} {\psi_n} r+\frac{\psi_p v_t^2}{\psi_n} q}+\frac{1-\psi_p}{q}-\frac{s}{q^2}=0, \end{align}
The eigenvalue distribution $\rho(\lambda)$ of $\bm{\mathrm{U}}$ can then be obtained using the Sokhotski–Plemelj inversion formula $\rho(\lambda)=\lim_{\varepsilon \rightarrow 0^+}\frac{1}{\pi}\operatorname{Im} q(\lambda+i\varepsilon)$.
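As an illustration of this inversion step (independently of the specific self-consistent equations above), the sketch below recovers the spectral density of a random Wishart-type matrix from its empirical Stieltjes transform evaluated slightly off the real axis; the same recipe applies to the solution $q(z)$ of the equations of Theorem 1.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 1000, 2000
G = rng.normal(size=(p, n))
eigs = np.linalg.eigvalsh(G @ G.T / n)       # Wishart-type matrix as a stand-in for U

def stieltjes(z):
    # Empirical Stieltjes transform q(z) = (1/p) Tr (M - z I)^{-1}.
    return np.mean(1.0 / (eigs - z))

eps = 1e-3
lambdas = np.linspace(0.0, 3.5, 200)
# Sokhotski-Plemelj inversion: rho(lambda) ~ (1/pi) Im q(lambda + i eps).
rho = np.array([stieltjes(l + 1j * eps).imag / np.pi for l in lambdas])
# `rho` now approximates the Marchenko-Pastur density of the eigenvalues above.
```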
We first show that the equations for the Stieltjes transform of $\rho$ found in Ref. [17] using linear pencils [53] in the case $P_{\bm{\mathrm{x}}}=\mathcal{N}(0, \bm{I}_d)$, i.e. $\rho_{\boldsymbol{\Sigma}}(\lambda)=\delta(\lambda-1)$, can be reduced to the equations of Theorem 1 with our definitions of $\mu_1(t), s_t$ and $v_t$. The equations of [17] read
ζ1(st2z+e2tμ12ζ2ζ3+vt2ζ2+Δtμ12ζ4)1=0ζ2(ψn+vt2ψpζ1etμ1ζ3)ψn=0etμ1ψpζ1(1+etμ1ζ2ζ3)+(1+(Δtμ12ψpζ1)ζ3)=0e2tμ12ψpζ1ζ2ζ4+(1+Δtμ12ψpζ1)ζ41=0,\begin{align} &\zeta_1(s_t^2-z+e^{-2t}\mu_1^2\zeta_2\zeta_3+v_t^2\zeta_2+\Delta_t\mu_1^2\zeta_4)-1=0\\ &\zeta_2(\psi_n+v_t^2\psi_p\zeta_1-e^{-t}\mu_1\zeta_3)-\psi_n=0\\ &e^{-t}\mu_1\psi_p\zeta_1(1+e^{-t}\mu_1\zeta_2\zeta_3)+(1+(\Delta_t\mu_1^2\psi_p\zeta_1)\zeta_3)=0\\ &e^{-2t}\mu_1^2\psi_p\zeta_1\zeta_2\zeta_4+(1+\Delta_t\mu_1^2\psi_p\zeta_1)\zeta_4-1=0, \end{align}
with $\zeta_1=q$ and $\zeta_{2, 3, 4}$ auxiliary variables. We make the change of variables $r=-\frac{\zeta_3}{e^{-t}\mu_1\psi_p}$. The second equation then relates $\zeta_2$ to $q$ and $r$:
ζ2=11+e2tμ12ψpψnr+vt2ψpψnq.\begin{align} \zeta_2=\frac{1}{1+\frac{e^{-2t}\mu_1^2\psi_p}{\psi_n}r+\frac{v_t^2\psi_p}{\psi_n}q}. \end{align}
Injecting this into the second equation gives the second equation of Theorem 1. The fourth equation gives
ζ4=11+μ12ψpq(Δt+e2tζ2).\begin{align} \zeta_4=\frac{1}{1+\mu_1^2\psi_pq(\Delta_t+e^{-2t}\zeta_2)}. \end{align}
Injecting this into the first equation gives
q(st2z+e2tμ12ζ2r(etμ1ψp)+vt2ζ2+Δtμ1211+μ12ψpq(Δt+e2tζ2))1=0.\begin{align} q(s_t^2-z+e^{-2t}\mu_1^2\zeta_2r(-e^{-t}\mu_1\psi_p)+v_t^2\zeta_2+\Delta_t\mu_1^2\frac{1}{1+\mu_1^2\psi_pq(\Delta_t+e^{-2t}\zeta_2)})-1=0. \end{align}
After some algebra, we recover the first equation of Theorem 1.
We now prove Theorem 1 using a replica computation, inspired by the calculation done in Ref. [44].
Proof: Our goal is to compute the Stieltjes transform of the matrix U\bm{\mathrm{U}}.
q=limp1pEW,X,Ω[Tr(UzIp)1]=zlimp1pEW,X,Ω[logdet(UzIp)]=2zlimp1pEW,X,Ω[logdet(UzIp)1/2].\begin{align} q&=\underset{p\rightarrow\infty}{\lim}\frac{1}{p}\mathbb{E}_{\bm{\mathrm{W}}, \bm{\mathrm{X}}, \bm{\Omega}}[\operatorname{Tr}(\bm{\mathrm{U}}-z \bm{I}_p)^{-1}]\\ &=-\partial_z \underset{p\rightarrow\infty}{\lim}\frac{1}{p}\mathbb{E}_{\bm{\mathrm{W}}, \bm{\mathrm{X}}, \bm{\Omega}}[\log\det (\bm{\mathrm{U}}-z \bm{I}_p)]\\ &=2\partial_z\underset{p\rightarrow\infty}{\lim}\frac{1}{p}\mathbb{E}_{\bm{\mathrm{W}}, \bm{\mathrm{X}}, \bm{\Omega}}[\log\det (\bm{\mathrm{U}}-z \bm{I}_p)^{-1/2}]. \end{align}
The so-called replica trick consists of replacing $\log x$ by $\lim_{s\rightarrow 0}\frac{x^s-1}{s}$. Applying this identity, we obtain
q=2zlims0limp1psEW,X,Ω[det(UzIp)s/21],\begin{align} q=2\partial_z\underset{s\rightarrow0}{\lim}\underset{p\rightarrow\infty}{\lim}\frac{1}{ps}\mathbb{E}_{\bm{\mathrm{W}}, \bm{\mathrm{X}}, \bm{\Omega}}[\det (\bm{\mathrm{U}}-z \bm{I}_p)^{-s/2}-1], \end{align}
where as usual with replica computations we have inverted the order of the limits pp\rightarrow \infty and s0s\rightarrow 0. We define the partition function Z\mathcal{Z} as
Z=det(UzIp)1/2=dϕ(2π)p/2e12ϕT(UzIp)ϕ.\begin{align} \mathcal{Z}= \det(\bm{\mathrm{U}}-z \bm{I}_p)^{-1/2}=\int \frac{\mathrm{d}\phi}{(2\pi)^{p/2}}e^{-\frac{1}{2}{\phi}^T(\bm{\mathrm{U}}-z \bm{I}_p)\phi}. \end{align}
We replace U\bm{\mathrm{U}} by its Gaussian equivalent proved in Lemma 4 and write the partition function for an arbitrary integer ss
EW,X,Ω[Zs]=a=1sdϕa(2π)p/2EW,X,Ω[e12ϕaT(UzIp)ϕa]=a=1sdϕa(2π)p/2e12ϕaT(zst2)ϕa   EW,X,Ω[e12nϕaT(atetWXνd+vtΩν)(atetWXνd+vtΩν)Tϕaebt22dϕaTWWTϕa].\begin{align} \mathbb{E}_{\bm{\mathrm{W}}, \bm{\mathrm{X}}, \bm{\Omega}}[\mathcal{Z}^s]&=\int\prod_{a=1}^s \frac{\mathrm{d}\phi^a}{(2\pi)^{p/2}}\mathbb{E}_{\bm{\mathrm{W}}, \bm{\mathrm{X}}, \bm{\Omega}}[e^{-\frac{1}{2}{\phi^a}^T(\bm{\mathrm{U}}-z \bm{I}_p)\phi^a}]\\ &=\int\prod_{a=1}^s \frac{\mathrm{d}\phi^a}{(2\pi)^{p/2}}e^{\frac{1}{2}{\phi^a}^T(z-s_t^2)\phi^a}\nonumber \\ &\ \ \ \mathbb{E}_{\bm{\mathrm{W}}, \bm{\mathrm{X}}, \bm{\Omega}}[e^{-\frac{1}{2n}{\phi^a}^T\left(a_te^{-t}\frac{\bm{\mathrm{W}} \bm{\mathrm{X}}^\nu}{\sqrt{d}}+v_t \bm{\Omega}^\nu\right)\left(a_te^{-t}\frac{\bm{\mathrm{W}} \bm{\mathrm{X}}^\nu}{\sqrt{d}}+v_t \bm{\Omega}^\nu\right)^T\phi^a}e^{-\frac{b_t^2}{2d}{\phi^a}^T \bm{\mathrm{W}} \bm{\mathrm{W}}^T\phi^a}]. \end{align}
We first perform the computation for integer values of ss, and then analytically continue the result to the limit s0s \to 0. To compute the expectation over X\bm{\mathrm{X}}, W\bm{\mathrm{W}}, and Ω\bm{\Omega}, we need the following standard result from Gaussian integration
dxe12xGxT+JxT=e12logdetG+12JG1JT,\begin{align} \int \mathrm{d} \bm{\mathrm{x}} e^{-\frac{1}{2} \bm{\mathrm{x}} \bm{\mathrm{G}} \bm{\mathrm{x}}^T+ \bm{\mathrm{J}} \bm{\mathrm{x}}^T}=e^{-\frac{1}{2}\log \det \bm{\mathrm{G}}+\frac{1}{2} \bm{\mathrm{J}} \bm{\mathrm{G}}^{-1} \bm{\mathrm{J}}^T}, \end{align}
where G\bm{\mathrm{G}} is a square matrix and J\bm{\mathrm{J}} a vector. Averaging over the data set. The dataset dependence enters through
EX[e12nϕaT(atetWXνd+vtΩν)(atetWXνd+vtΩν)Tϕa]=EX[eat2e2t2ndϕaTWXνXνTWTϕaeatetvt2dnϕaT(WXνΩT+ΩXνTWT)ϕa]evt22nϕaTΩΩTϕa.\begin{align} &\mathbb{E}_{\bm{\mathrm{X}}}[e^{-\frac{1}{2n}{\phi^a}^T\left(a_te^{-t}\frac{\bm{\mathrm{W}} \bm{\mathrm{X}}^\nu}{\sqrt{d}}+v_t \bm{\Omega}^\nu\right)\left(a_te^{-t}\frac{\bm{\mathrm{W}} \bm{\mathrm{X}}^\nu}{\sqrt{d}}+v_t \bm{\Omega}^\nu\right)^T\phi^a}]\nonumber \\ &=\mathbb{E}_{\bm{\mathrm{X}}}[e^{-\frac{a_t^2e^{-2t}}{2nd}{\phi^a}^T \bm{\mathrm{W}} \bm{\mathrm{X}}^\nu {\bm{\mathrm{X}}^\nu}^T \bm{\mathrm{W}}^T\phi^a}e^{-\frac{a_te^{-t}v_t}{2\sqrt{d}n}{\phi^a}^T(\bm{\mathrm{W}} \bm{\mathrm{X}}^\nu \bm{\Omega}^T+ \bm{\Omega} {\bm{\mathrm{X}}^\nu}^T \bm{\mathrm{W}}^T)\phi^a}] e^{-\frac{v_t^2}{2n}{\phi^a}^T \bm{\Omega} \bm{\Omega}^T{\phi^a}}. \end{align}
We introduce, for each replica $\phi^a$, a Fourier representation of the delta function using the auxiliary variables $\omega^a, \hat{\omega}^a \in \mathbb{R}^{d}$ (throughout the computation, we discard non-exponential prefactors, as they give subleading contributions):
dω^aeiω^a(pωaϕaTWΣ1/2)=1.\begin{align} \int \mathrm{d}\hat{\omega}^a e^{i\hat{\omega}^a(\sqrt{p}\omega^a- {\phi^a}^T \bm{\mathrm{W}} \Sigma^{1/2})}=1. \end{align}
In the following, we do the change of variable Xν=Σ1/2Zν\bm{\mathrm{X}}^\nu= \Sigma^{1/2} \bm{\mathrm{Z}}^\nu with Zν\bm{\mathrm{Z}}^\nu a dd dimensional Gaussian random variable with unit variance.
EX[e12nϕaT(atetWXνd+vtΩν)(atetWXνd+vtΩν)Tϕa]=EZ[eat2e2tp2ndωaTZνZνTωaeatetvtpdna,νΩνϕaωaZν]evt22nϕaTΩΩTϕa.\begin{align} &\mathbb{E}_{\bm{\mathrm{X}}}[e^{-\frac{1}{2n}{\phi^a}^T\left(a_te^{-t}\frac{\bm{\mathrm{W}} \bm{\mathrm{X}}^\nu}{\sqrt{d}}+v_t \bm{\Omega}^\nu\right)\left(a_te^{-t}\frac{\bm{\mathrm{W}} \bm{\mathrm{X}}^\nu}{\sqrt{d}}+v_t \bm{\Omega}^\nu\right)^T\phi^a}]\nonumber \\ &=\mathbb{E}_{\bm{\mathrm{Z}}}[e^{-\frac{a_t^2e^{-2t}p}{2nd}{\omega^a}^T \bm{\mathrm{Z}}^\nu {\bm{\mathrm{Z}}^\nu}^T\omega^a}e^{-\frac{a_t e^{-t}v_t\sqrt{p}}{\sqrt{d}n}\sum_{a, \nu} \bm{\Omega}^\nu \phi^a \omega^a\cdot \bm{\mathrm{Z}}^\nu}] e^{-\frac{v_t^2}{2n}{\phi^a}^T \bm{\Omega} \bm{\Omega}^T{\phi^a}}. \end{align}
Denote $\bm{\mathrm{G}}_{\bm{\mathrm{Z}}}=\frac{a_t^2e^{-2t}p}{dn}\sum_a\omega^a{\omega^a}^T$ and $(\bm{\mathrm{J}}_{\bm{\mathrm{Z}}})^\nu=\frac{a_te^{-t}v_t\sqrt{p}}{\sqrt{d}n}\sum_{a}(\bm{\Omega}^\nu\cdot \phi^a)\, \omega^a$, then
EX[e12nϕaT(atetWXνd+vtΩν)(atetWXνd+vtΩν)Tϕa]=en2logdet(1+GZ)eatetvtp2dn2ν(Ωνϕa)(Ωνϕb)ωah(1+GZ)k,l1ωblevt22nϕaTΩΩTϕa,\begin{align} &\mathbb{E}_{\bm{\mathrm{X}}}[e^{-\frac{1}{2n}{\phi^a}^T\left(a_te^{-t}\frac{\bm{\mathrm{W}} \bm{\mathrm{X}}^\nu}{\sqrt{d}}+v_t \bm{\Omega}^\nu\right)\left(a_te^{-t}\frac{\bm{\mathrm{W}} \bm{\mathrm{X}}^\nu}{\sqrt{d}}+v_t \bm{\Omega}^\nu\right)^T\phi^a}]=\nonumber \\ &e^{-\frac{n}{2}\log \det(1+ \bm{\mathrm{G}}_{\bm{\mathrm{Z}}})}e^{\frac{a_te^{-t}v_tp}{2dn^2} \sum_\nu (\bm{\Omega}^\nu\cdot \phi^a)(\bm{\Omega}^\nu\cdot \phi^b) {\omega^a}_h(1+ \bm{\mathrm{G}}_{\bm{\mathrm{Z}}})^{-1}_{k, l} {\omega^b}_l}e^{-\frac{v_t^2}{2n}{\phi^a}^T \bm{\Omega} \bm{\Omega}^T{\phi^a}}, \end{align}
where repeated indices mean that there is an implicit summation. Averaging over Ω\bm{\Omega}. The terms that depend on Ω\bm{\Omega} are
EΩ[eatetvt2dn2ν(Ωνϕa)(Ωνϕb)ωak(1+GX)k,l1ωblevt22nϕaTΩΩTϕa]=(EΩν[eatetvtp2dn2(Ωνϕa)(Ωνϕb)ωak(1+GX)k,l1ωblevt22nϕaTΩνΩνTϕa])n=en2logdet(1+GΩ),\begin{align} &\mathbb{E}_{\bm{\Omega}}[e^{\frac{a_te^{-t}v_t}{2dn^2} \sum_\nu (\bm{\Omega}^\nu\cdot \phi^a)(\bm{\Omega}^\nu\cdot \phi^b) {\omega^a}_k(1+ \bm{\mathrm{G}}_{\bm{\mathrm{X}}})^{-1}_{k, l} {\omega^b}_l}e^{-\frac{v_t^2}{2n}{\phi^a}^T \bm{\Omega} \bm{\Omega}^T{\phi^a}}]\nonumber \\ &=(\mathbb{E}_{\bm{\Omega}^\nu}[e^{\frac{a_te^{-t}v_tp}{2dn^2} (\bm{\Omega}^\nu\cdot \phi^a)(\bm{\Omega}^\nu\cdot \phi^b) {\omega^a}_k(1+ \bm{\mathrm{G}}_{\bm{\mathrm{X}}})^{-1}_{k, l} {\omega^b}_l}e^{-\frac{v_t^2}{2n}{\phi^a}^T \bm{\Omega}^\nu {\bm{\Omega}^\nu}^T{\phi^a}}])^n\\ &=e^{-\frac{n}{2}\log \det (1+ \bm{\mathrm{G}}_{\bm{\Omega}})}, \end{align}
with
(GΩ)k,l=ϕa(vt2nδabatetvtpdn2ωak(1+GZ)k,l1ωbl)ϕbT.\begin{align} (\bm{\mathrm{G}}_{\bm{\Omega}})_{k, l}=\phi^a(\frac{v_t^2}{n}\delta_{ab} -\frac{a_te^{-t}v_tp}{dn^2} {\omega^a}_k(1+ \bm{\mathrm{G}}_{\bm{\mathrm{Z}}})^{-1}_{k, l}{\omega^b}_{l}){\phi^b}^T. \end{align}
We are left with
EW,X,Ω[Zs]=a=1sdϕa(2π)p/2dωadω^ae12(zst2)ϕaϕaTebt2p2dωaTΣ1ωaEW[eiω^a(pωaϕaTWΣ1/2) en2logdet(Id+GZ)en2logdet(Id+GΩ)].\begin{align} &\mathbb{E}_{\bm{\mathrm{W}}, \bm{\mathrm{X}}, \bm{\Omega}}[\mathcal{Z}^s]=\int\prod_{a=1}^s \frac{\mathrm{d}\phi^a}{(2\pi)^{p/2}} \mathrm{d}\omega^a d\hat{\omega}^a e^{\frac{1}{2}(z-s_t^2)\phi^a {\phi^a}^T}e^{-\frac{b_t^2p}{2d}{\omega^a}^T \boldsymbol{\Sigma}^{-1}\omega^a}\nonumber \\ &\mathbb{E}_{\bm{\mathrm{W}}}[e^{i\hat{\omega}^a(\sqrt{p}\omega^a- {\phi^a}^T \bm{\mathrm{W}} \Sigma^{1/2})}\ e^{-\frac{n}{2}\log \det(\bm{I}_d+ \bm{\mathrm{G}}_{Z})}e^{-\frac{n}{2}\log \det (\bm{I}_d+ \bm{\mathrm{G}}_{\bm{\Omega}})}]. \end{align}
Averaging over the random features W\bm{\mathrm{W}}. W\bm{\mathrm{W}} only appears through eiω^aWTϕaΣ1/2e^{-i\hat{\omega}^a \bm{\mathrm{W}}^T{\phi^a} \Sigma^{1/2}}.
EW[eiaω^a(pωaWTϕaΣt1/2)]=eipaω^aωa(EW[eiω^kaϕaiWli(Σt1/2)kl])=eipω^aωae12ω^ka(Σ)klω^lbϕiaϕib=eipaω^aωae12a,bω^aΣω^bϕaϕb.\begin{align} \mathbb{E}_{\bm{\mathrm{W}}}[e^{i\sum_a\hat{\omega}^a(\sqrt{p}\omega^a- \bm{\mathrm{W}}^T{\phi^a} \Sigma^{1/2}_t)}]&=e^{i\sqrt{p}\sum_a\hat{\omega}^a\cdot\omega^a}(\mathbb{E}_{\bm{\mathrm{W}}}[e^{-i \hat{\omega}_k^a{\phi^a}_i \bm{\mathrm{W}}_{li}(\Sigma^{1/2}_t)_{kl}}])\\ &=e^{i\sqrt{p}\hat{\omega}^a\cdot\omega^a}e^{-\frac{1}{2}\hat{\omega}^a_k(\boldsymbol{\Sigma})_{kl}\hat{\omega}^b_{l}\phi^a_i\phi^b_i}\\ &=e^{i\sqrt{p}\sum_a\hat{\omega}^a\cdot\omega^a} e^{-\frac{1}{2}\sum_{a, b} \hat{\omega}^a \boldsymbol{\Sigma} \hat{\omega}^b \phi^a\cdot \phi^b}. \end{align}
We end up with
EW,X,Ω[Zs]=a=1sdϕadωadω^ae12(zst2)ϕaϕaTebt2p2dωaTΣ1ωaeipaω^aωae12a,bω^aΣω^bϕaϕben2logdet(Id+GZ)en2logdet(Id+GΩ).\begin{align} \mathbb{E}_{\bm{\mathrm{W}}, \bm{\mathrm{X}}, \bm{\Omega}}[\mathcal{Z}^s]&=\int\prod_{a=1}^s \mathrm{d}\phi^a \mathrm{d}\omega^a d\hat{\omega}^a e^{\frac{1}{2}(z-s_t^2)\phi^a {\phi^a}^T}e^{-\frac{b_t^2p}{2d}{\omega^a}^T \boldsymbol{\Sigma}^{-1}\omega^a}e^{i\sqrt{p}\sum_a\hat{\omega}^a\cdot\omega^a}\nonumber \\ & e^{-\frac{1}{2}\sum_{a, b} \hat{\omega}^a \boldsymbol{\Sigma} \hat{\omega}^b \phi^a\cdot \phi^b} e^{-\frac{n}{2}\log \det(\bm{I}_d+ \bm{\mathrm{G}}_{Z})}e^{-\frac{n}{2}\log \det (\bm{I}_d+ \bm{\mathrm{G}}_{\bm{\Omega}})}. \end{align}
Averaging over the ω^a\hat{\omega}^a. We can integrate with respect to ω^\hat{\omega}. The only terms that appear with it are
adω^aeipaω^aωae12a,bω^aΣω^bϕaϕb.\begin{align} \int \prod_a \mathrm{d}\hat{\omega}^a e^{i\sqrt{p} \sum_a\hat{\omega}^a\cdot \omega^a} e^{-\frac{1}{2}\sum_{a, b} \hat{\omega}^a \boldsymbol{\Sigma} \hat{\omega}^b \phi^a\cdot \phi^b}. \end{align}
Denote Jia=ipωia\bm{\mathrm{J}}_i^a=i\sqrt{p}\omega_i^a and Gklab=Σkl ϕaϕb\bm{\mathrm{G}}_{kl}^{ab}= \boldsymbol{\Sigma}_{kl} \ \phi^a\cdot\phi^b, then the integral is of the form
adω^aei,aJiaω^iae12i,j,a,bω^iaGijabω^jb=e12logdet(G)+12JTG1J.\begin{align} \int \prod_a \mathrm{d}\hat{\omega}^a e^{\sum_{i, a} \bm{\mathrm{J}}_i^a \hat{\omega}_i^a} e^{-\frac{1}{2}\sum_{i, j, a, b} \hat{\omega}_i^a \bm{\mathrm{G}}_{ij}^{ab}\hat{\omega}_j^b} =e^{-\frac{1}{2}\log \det(\bm{\mathrm{G}})+\frac{1}{2} \bm{\mathrm{J}}^T \bm{\mathrm{G}}^{-1} \bm{\mathrm{J}}}. \end{align}
This gives
EW,X,Ω[Zs]=a=1sdϕadωae12(zst2)ϕaϕaTebt2p2dωaTΣ1ωaen2logdet(Id+GZ)en2logdet(Id+GΩ)e12logdet(G)+12JTG1J.\begin{align} \mathbb{E}_{\bm{\mathrm{W}}, \bm{\mathrm{X}}, \bm{\Omega}}[\mathcal{Z}^s]&=\int\prod_{a=1}^s \mathrm{d}\phi^a \mathrm{d}\omega^a e^{\frac{1}{2}(z-s_t^2)\phi^a {\phi^a}^T}e^{-\frac{b_t^2p}{2d}{\omega^a}^T \boldsymbol{\Sigma}^{-1}\omega^a}e^{-\frac{n}{2}\log \det(\bm{I}_d+ \bm{\mathrm{G}}_{Z})}\nonumber \\ & e^{-\frac{n}{2}\log \det (\bm{I}_d+ \bm{\mathrm{G}}_{\bm{\Omega}})}e^{-\frac{1}{2}\log \det(\bm{\mathrm{G}})+\frac{1}{2} \bm{\mathrm{J}}^T \bm{\mathrm{G}}^{-1} \bm{\mathrm{J}}}. \end{align}
Introducing the order parameters. We define the order parameters as Qab=1pϕaϕb\bm{\mathrm{Q}}^{ab} = \frac{1}{p} \phi^a \cdot \phi^b and Rab=1dωaωb\bm{\mathrm{R}}^{ab} = \frac{1}{d} \omega^a \cdot \omega^b. To enforce these constraints, we use the following delta function representations
1=dQabdQ^abe12Q^ab(pQabϕaϕb),1=dRabdR^abe12R^ab(dRabωaωb),\begin{align} &1=\int \mathrm{d} \bm{\mathrm{Q}}^{ab} \mathrm{d}\hat{\bm{\mathrm{Q}}}^{ab} e^{\frac{1}{2}\hat{\bm{\mathrm{Q}}}^{ab}(p \bm{\mathrm{Q}}^{ab}-\phi^a \cdot \phi^b)}, \\ &1=\int \mathrm{d} \bm{\mathrm{R}}^{ab} \mathrm{d}\hat{\bm{\mathrm{R}}}^{ab} e^{\frac{1}{2}\hat{\bm{\mathrm{R}}}^{ab}(d \bm{\mathrm{R}}^{ab}-\omega^a \cdot \omega^b)}, \end{align}
EW,Y,Ω[Zs]=a=1sdϕadωadQabdQ^abdRabdR^abe12Q^ab(pQabϕaϕb)e12R^ab(dRabωaωb)ep2(zst2)TrQen2logdet(Im+at2e2gpnR)ebt2p2dωaTΣ1ωaen2log(1+pn(vt2at2e2tvt2nR(1+at2e2tpnR)1)Q)e12logdet(ΣQ)e12ωkaΣkl1(Q1)abωlb.\begin{align} \mathbb{E}_{\bm{\mathrm{W}}, Y, \bm{\Omega}}[\mathcal{Z}^s]&=\int\prod_{a=1}^s \mathrm{d}\phi^a \mathrm{d}\omega^a \mathrm{d} \bm{\mathrm{Q}}^{ab} \mathrm{d}\hat{\bm{\mathrm{Q}}}^{ab} \mathrm{d} \bm{\mathrm{R}}^{ab} \mathrm{d}\hat{\bm{\mathrm{R}}}^{ab}\nonumber \\ &e^{\frac{1}{2}\hat{\bm{\mathrm{Q}}}^{ab}(p \bm{\mathrm{Q}}^{ab}-\phi^a \cdot \phi^b)}e^{\frac{1}{2}\hat{\bm{\mathrm{R}}}^{ab}(d \bm{\mathrm{R}}^{ab}-\omega^a \cdot \omega^b)}\nonumber \\ &e^{\frac{p}{2}(z-s_t^2) \operatorname{Tr} \bm{\mathrm{Q}}}e^{-\frac{n}{2}\log \det (\bm{I}_m+\frac{a_t^2e^{-2g}p}{n} \bm{\mathrm{R}})}e^{-\frac{b_t^2p}{2d}{\omega^a}^T \boldsymbol{\Sigma}^{-1}\omega^a}\nonumber \\ & e^{-\frac{n}{2}\log(1+\frac{p}{n}(v_t^2-\frac{a_t^2e^{-2t}v_t^2}{n} \bm{\mathrm{R}}(1+\frac{a_t^2e^{-2t}p}{n} \bm{\mathrm{R}})^{-1}) \bm{\mathrm{Q}})}\nonumber \\ &e^{-\frac{1}{2}\log \det(\boldsymbol{\Sigma} \otimes \bm{\mathrm{Q}})}e^{-\frac{1}{2}\omega_k^a \boldsymbol{\Sigma}^{-1}_{kl}(\bm{\mathrm{Q}}^{-1})_{ab}\omega_l^b}. \end{align}
We also introduce Sab=ωkaΣ1ωlb/d\bm{\mathrm{S}}^{ab}=\omega_k^a \boldsymbol{\Sigma}^{-1}\omega_l^b/d.
EW,X,Ω[Zs]=a=1sdϕadωadQabdQ^abdRabdR^abdSabdS^abe12Q^ab(pQabϕaϕb)e12R^ab(dRabωaωb)e12S^ab(dSabωaΣ1ωb)ep2(zst2)TrQen2logdet(Im+at2e2tpnR)ebt2p2Tr(S)en2log(1+pn(vt2at2e2tvt2nR(1+at2vt2pnR)1)Q)e12logdet(ΣQ)ed2Tr(SQ1).\begin{align} \mathbb{E}_{\bm{\mathrm{W}}, \bm{\mathrm{X}}, \bm{\Omega}}[\mathcal{Z}^s]&=\int\prod_{a=1}^s \mathrm{d}\phi^a \mathrm{d}\omega^a \mathrm{d} \bm{\mathrm{Q}}^{ab} \mathrm{d}\hat{\bm{\mathrm{Q}}}^{ab} \mathrm{d} \bm{\mathrm{R}}^{ab} \mathrm{d}\hat{\bm{\mathrm{R}}}^{ab} \mathrm{d} \bm{\mathrm{S}}^{ab} \mathrm{d} \hat{\bm{\mathrm{S}}}^{ab}\nonumber \\ &e^{\frac{1}{2}\hat{\bm{\mathrm{Q}}}^{ab}(p \bm{\mathrm{Q}}^{ab}-\phi^a \cdot \phi^b)}e^{\frac{1}{2}\hat{\bm{\mathrm{R}}}^{ab}(d \bm{\mathrm{R}}^{ab}-\omega^a \cdot \omega^b)}e^{\frac{1}{2}\hat{\bm{\mathrm{S}}}^{ab}(dS^{ab}-\omega^a \boldsymbol{\Sigma}^{-1}\omega^b)}\nonumber \\ &e^{\frac{p}{2}(z-s_t^2) \operatorname{Tr} \bm{\mathrm{Q}}}e^{-\frac{n}{2}\log \det (\bm{I}_m+\frac{a_t^2e^{-2t}p}{n} \bm{\mathrm{R}})}e^{-\frac{b_t^2 p}{2} \operatorname{Tr}(\bm{\mathrm{S}})}\nonumber \\ & e^{-\frac{n}{2}\log(1+\frac{p}{n}(v_t^2-\frac{a_t^2e^{-2t}v_t^2}{n} \bm{\mathrm{R}}(1+\frac{a_t^2v_t^2p}{n} \bm{\mathrm{R}})^{-1}) \bm{\mathrm{Q}})}\nonumber \\ &e^{-\frac{1}{2}\log \det(\boldsymbol{\Sigma} \otimes \bm{\mathrm{Q}})}e^{-\frac{d}{2} \operatorname{Tr} (\bm{\mathrm{S}} \bm{\mathrm{Q}}^{-1})}. \end{align}
The integration over dϕa\mathrm{d} \phi^a and dωa\mathrm{d} \omega^a gives
EW,X,Ω[Zs]=a=1sdQabdQ^abdRabdR^abdSabdS^abep2Tr(Q^Q)ep2logdetQ^ed2R^abRabed2S^abSabe12logdet(R^Id+S^Σ1)ep2(zst2)TrQen2logdet(Im+at2e2tpnR)ebt2p2Tr(S)en2log(1+pn(vt2at2e2tvt2nR(1+at2e2tpnR)1)Q)e12logdet(ΣQ)ed2Tr(SQ1).\begin{align} \mathbb{E}_{\bm{\mathrm{W}}, \bm{\mathrm{X}}, \bm{\Omega}}[\mathcal{Z}^s]&=\int\prod_{a=1}^s \mathrm{d} \bm{\mathrm{Q}}^{ab} \mathrm{d}\hat{\bm{\mathrm{Q}}}^{ab} \mathrm{d} \bm{\mathrm{R}}^{ab} \mathrm{d}\hat{\bm{\mathrm{R}}}^{ab} \mathrm{d} \bm{\mathrm{S}}^{ab} \mathrm{d} \hat{\bm{\mathrm{S}}}^{ab}\nonumber \\ &e^{\frac{p}{2} \operatorname{Tr}(\hat{\bm{\mathrm{Q}}} \bm{\mathrm{Q}})}e^{-\frac{p}{2}\log \det \hat{\bm{\mathrm{Q}}}}e^{\frac{d}{2}\hat{\bm{\mathrm{R}}}^{ab} \bm{\mathrm{R}}^{ab}}e^{\frac{d}{2}\hat{\bm{\mathrm{S}}}^{ab} \bm{\mathrm{S}}^{ab}}\nonumber \\ &e^{-\frac{1}{2}\log\det(\hat{\bm{\mathrm{R}}}\otimes \bm{I}_d+\hat{\bm{\mathrm{S}}}\otimes \boldsymbol{\Sigma}^{-1})}\nonumber \\ &e^{\frac{p}{2}(z-s_t^2) \operatorname{Tr} \bm{\mathrm{Q}}}e^{-\frac{n}{2}\log \det (\bm{I}_m+\frac{a_t^2e^{-2t}p}{n} \bm{\mathrm{R}})}e^{-\frac{b_t^2 p}{2} \operatorname{Tr}(\bm{\mathrm{S}})}\nonumber \\ & e^{-\frac{n}{2}\log(1+\frac{p}{n}(v_t^2-\frac{a_t^2e^{-2t}v_t^2}{n} \bm{\mathrm{R}}(1+\frac{a_t^2e^{-2t}p}{n} \bm{\mathrm{R}})^{-1}) \bm{\mathrm{Q}})}\nonumber \\ &e^{-\frac{1}{2}\log \det(\boldsymbol{\Sigma} \otimes \bm{\mathrm{Q}})}e^{-\frac{d}{2} \operatorname{Tr} (\bm{\mathrm{S}} \bm{\mathrm{Q}}^{-1})}. \end{align}
We need to combine e12logdet(ΣQ)e^{-\frac{1}{2}\log \det(\boldsymbol{\Sigma} \otimes \bm{\mathrm{Q}})} and e12logdet(R^Id+S^Σ1)e^{-\frac{1}{2}\log\det(\hat{\bm{\mathrm{R}}}\otimes \bm{I}_d+\hat{\bm{\mathrm{S}}}\otimes \boldsymbol{\Sigma}^{-1})},
e12logdet(ΣQ)e12logdet(R^Id+S^Σ1)=e12logdet(QS^Id+QR^Σ)=ed2logdet(QS^)e12logdet(ImId+R^S^1Σ)\begin{align} e^{-\frac{1}{2}\log \det(\boldsymbol{\Sigma} \otimes \bm{\mathrm{Q}})}e^{-\frac{1}{2}\log\det(\hat{\bm{\mathrm{R}}}\otimes \bm{I}_d+\hat{\bm{\mathrm{S}}}\otimes \boldsymbol{\Sigma}^{-1})} &=e^{-\frac{1}{2}\log \det(\bm{\mathrm{Q}}\hat{\bm{\mathrm{S}}}\otimes \bm{I}_d+ \bm{\mathrm{Q}}\hat{\bm{\mathrm{R}}}\otimes \boldsymbol{\Sigma})}\\ &=e^{-\frac{d}{2}\log \det(\bm{\mathrm{Q}}\hat{\bm{\mathrm{S}}})}e^{-\frac{1}{2}\log \det(\bm{I}_m\otimes \bm{I}_d+\hat{\bm{\mathrm{R}}}\hat{\bm{\mathrm{S}}}^{-1}\otimes \boldsymbol{\Sigma})} \end{align}
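This identity can be checked numerically on small random matrices. The following sketch is our own illustration (not part of the derivation): it draws random positive-definite $\bm{\mathrm{Q}}$, $\hat{\bm{\mathrm{R}}}$, $\hat{\bm{\mathrm{S}}}$ and $\boldsymbol{\Sigma}$ and compares the log-determinants of the two sides.

```python
import numpy as np

rng = np.random.default_rng(0)
s, d = 3, 4  # small replica and data dimensions, for illustration only

def random_spd(k):
    """Random symmetric positive-definite k x k matrix."""
    m = rng.standard_normal((k, k))
    return m @ m.T + k * np.eye(k)

Q, R_hat, S_hat = random_spd(s), random_spd(s), random_spd(s)
Sigma = random_spd(d)
logdet = lambda m: np.linalg.slogdet(m)[1]

# Left-hand side: log det(Sigma x Q) + log det(R_hat x I_d + S_hat x Sigma^{-1}),
# where "x" denotes the Kronecker product.
lhs = logdet(np.kron(Sigma, Q)) + logdet(
    np.kron(R_hat, np.eye(d)) + np.kron(S_hat, np.linalg.inv(Sigma))
)

# Right-hand side: d log det(Q S_hat) + log det(I + R_hat S_hat^{-1} x Sigma).
rhs = d * logdet(Q @ S_hat) + logdet(
    np.eye(s * d) + np.kron(R_hat @ np.linalg.inv(S_hat), Sigma)
)

assert np.isclose(lhs, rhs)  # the two expressions agree up to numerical precision
```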
Then for e12logdet(ImId+R^S^1Σ)e^{-\frac{1}{2}\log \det(\bm{I}_m\otimes \bm{I}_d+\hat{\bm{\mathrm{R}}}\hat{\bm{\mathrm{S}}}^{-1}\otimes \boldsymbol{\Sigma})}, we can introduce ρΣ(λ)\rho_{\boldsymbol{\Sigma}}(\lambda) the density of eigenvalues of Σ\boldsymbol{\Sigma}
\begin{align} -\frac{1}{2}\log \det(\bm{I}_m\otimes \bm{I}_d+\hat{\bm{\mathrm{R}}}\hat{\bm{\mathrm{S}}}^{-1}\otimes \boldsymbol{\Sigma})&=-\frac{1}{2} \operatorname{Tr} \log(\bm{I}_m\otimes \bm{I}_d+\hat{\bm{\mathrm{R}}}\hat{\bm{\mathrm{S}}}^{-1}\otimes \boldsymbol{\Sigma})\\ &=\frac{1}{2}\sum_{l\ge1}\frac{(-1)^l}{l} \operatorname{Tr}((\hat{\bm{\mathrm{R}}}\hat{\bm{\mathrm{S}}}^{-1})^l) \operatorname{Tr}(\boldsymbol{\Sigma}^l)\\ &=\frac{d}{2}\int \mathrm{d}\lambda\,\rho_{\boldsymbol{\Sigma}}(\lambda)\sum_{l\ge1}\frac{(-1)^l}{l} \operatorname{Tr}((\hat{\bm{\mathrm{R}}}\hat{\bm{\mathrm{S}}}^{-1})^l)\lambda^l\\ &=-\frac{d}{2}\int \mathrm{d}\lambda\,\rho_{\boldsymbol{\Sigma}}(\lambda) \operatorname{Tr}\log(\bm{I}_m\otimes \bm{I}_d+\lambda\hat{\bm{\mathrm{R}}}\hat{\bm{\mathrm{S}}}^{-1}). \end{align}
We end up with
EW,X,Ω[Zm]=dQdQ^dRdR^dSdS^ed2S(Q,Q^,R,R^,S,S^),\begin{align} \mathbb{E}_{\bm{\mathrm{W}}, \bm{\mathrm{X}}, \bm{\Omega}}[\mathcal{Z}^m]=\int \mathrm{d} \bm{\mathrm{Q}} \mathrm{d}\hat{\bm{\mathrm{Q}}} \mathrm{d} \bm{\mathrm{R}} \mathrm{d}\hat{\bm{\mathrm{R}}} \mathrm{d} \bm{\mathrm{S}} \mathrm{d}\hat{\bm{\mathrm{S}}}e^{-\frac{d}{2}S(\bm{\mathrm{Q}}, \hat{\bm{\mathrm{Q}}}, \bm{\mathrm{R}}, \hat{\bm{\mathrm{R}}}, \bm{\mathrm{S}}, \hat{\bm{\mathrm{S}}})}, \end{align}
where the action reads
S(Q,Q^,R,R^,S,S^)=ψplogdetQ^ψpTr(QQ^)Tr(RR^)Tr(SS^)ψp(zst2)TrQ+ψnlogdet(Is+at2e2tpnR)+bt2ψpTrS+ψnlog(Is+pn(vt2at2e2tvt2nR(Is+at2e2tpnR)1)Q)+logdet(QS^)+dλρΣ(λ)Trlog(ImId+λR^S^1)+Tr(SQ1).\begin{align} S(\bm{\mathrm{Q}}, \hat{\bm{\mathrm{Q}}}, \bm{\mathrm{R}}, \hat{\bm{\mathrm{R}}}, \bm{\mathrm{S}}, \hat{\bm{\mathrm{S}}})&=\psi_p \log \det\hat{\bm{\mathrm{Q}}}-\psi_p \operatorname{Tr}(\bm{\mathrm{Q}}\hat{\bm{\mathrm{Q}}})- \operatorname{Tr}(\bm{\mathrm{R}}\hat{\bm{\mathrm{R}}})- \operatorname{Tr}(\bm{\mathrm{S}}\hat{\bm{\mathrm{S}}})\nonumber \\ &-\psi_p(z-s_t^2) \operatorname{Tr} \bm{\mathrm{Q}}+\psi_n\log \det(\bm{I}_s+\frac{a_t^2e^{-2t}p}{n} \bm{\mathrm{R}})+b_t^2\psi_p \operatorname{Tr} \bm{\mathrm{S}}\nonumber \\ &+\psi_n \log(\bm{I}_s+\frac{p}{n}(v_t^2-\frac{a_t^2e^{-2t}v_t^2}{n} \bm{\mathrm{R}}(\bm{I}_s+\frac{a_t^2e^{-2t}p}{n} \bm{\mathrm{R}})^{-1}) \bm{\mathrm{Q}})\nonumber \\ &+\log \det(\bm{\mathrm{Q}}\hat{\bm{\mathrm{S}}})+\int \mathrm{d}\lambda\rho_{\boldsymbol{\Sigma}}(\lambda) \operatorname{Tr}\log(\bm{I}_m\otimes \bm{I}_d+\lambda\hat{\bm{\mathrm{R}}}\hat{\bm{\mathrm{S}}}^{-1}) + \operatorname{Tr}(\bm{\mathrm{S}} \bm{\mathrm{Q}}^{-1}). \end{align}
In the high-dimensional limit, the partition function is dominated by the saddle point. Differentiating with respect to $\hat{\bm{\mathrm{Q}}}$ we get
Q^1=Q,\begin{align} \hat{\bm{\mathrm{Q}}}^{-1}= \bm{\mathrm{Q}}, \end{align}
which yields
S(Q,R,R^,S,S^)=ψplogdetQTr(RR^)Tr(SS^)ψp(zst2)TrQ+ψnlogdet(Is+at2e2tpnR)+bt2ψpTrS+ψnlog(Is+pn(vt2at2e2tvt2nR(Is+at2e2tpnR)1)Q)+logdet(QS^)+dλρΣ(λ)Trlog(ImId+λR^S^1)+Tr(SQ1).\begin{align} S(\bm{\mathrm{Q}}, \bm{\mathrm{R}}, \hat{\bm{\mathrm{R}}}, \bm{\mathrm{S}}, \hat{\bm{\mathrm{S}}})&=-\psi_p \log \det \bm{\mathrm{Q}}- \operatorname{Tr}(\bm{\mathrm{R}}\hat{\bm{\mathrm{R}}})- \operatorname{Tr}(\bm{\mathrm{S}}\hat{\bm{\mathrm{S}}})\nonumber \\ &-\psi_p(z-s_t^2) \operatorname{Tr} \bm{\mathrm{Q}}+\psi_n\log \det(\bm{I}_s+\frac{a_t^2e^{-2t}p}{n} \bm{\mathrm{R}})+b_t^2\psi_p \operatorname{Tr} \bm{\mathrm{S}}\nonumber \\ &+\psi_n \log(\bm{I}_s+\frac{p}{n}(v_t^2-\frac{a_t^2e^{-2t}v_t^2}{n} \bm{\mathrm{R}}(\bm{I}_s+\frac{a_t^2e^{-2t}p}{n} \bm{\mathrm{R}})^{-1}) \bm{\mathrm{Q}})\nonumber \\ &+\log \det(\bm{\mathrm{Q}}\hat{\bm{\mathrm{S}}})+\int \mathrm{d}\lambda\rho_{\boldsymbol{\Sigma}}(\lambda) \operatorname{Tr}\log(\bm{I}_m\otimes \bm{I}_d+\lambda\hat{\bm{\mathrm{R}}}\hat{\bm{\mathrm{S}}}^{-1})\nonumber \\ &+ \operatorname{Tr}(\bm{\mathrm{S}} \bm{\mathrm{Q}}^{-1}). \end{align}
As a sanity check, if Σ=Id\boldsymbol{\Sigma}= \bm{I}_d, differentiation with respect to R^\hat{\bm{\mathrm{R}}} and S^\hat{\bm{\mathrm{S}}} yields
R=S=(S^+R^)1,\begin{align} \bm{\mathrm{R}}= \bm{\mathrm{S}}=(\hat{\bm{\mathrm{S}}}+\hat{\bm{\mathrm{R}}})^{-1}, \end{align}
and we recover the same action as before.

RS Ansatz. As before, we introduce an RS ansatz for all the matrices and moreover assume that only the diagonal terms are non-vanishing, i.e. they are of the form $\bm{\mathrm{Q}}=q \bm{I}_s$. This ansatz yields
S(q,r,r^,s,s^)/s=ψplogqrr^ss^ψp(zst2)q+ψnlog(1+at2e2tpnr+pvt2nq)+bt2ψps+log(q)+dλ  ρΣ(λ)log(s^+λr^)+sq.\begin{align} S(q, r, \hat{r}, s, \hat{s})/s&=-\psi_p \log q-r\hat{r}-s\hat{s}\nonumber \\ &-\psi_p(z-s_t^2)q+\psi_n\log(1+\frac{a_t^2e^{-2t}p}{n}r+\frac{pv_t^2}{n}q)+b_t^2\psi_p s\nonumber \\ &+\log (q)+\int \mathrm{d}\lambda\; \rho_{\boldsymbol{\Sigma}}(\lambda)\log(\hat{s}+\lambda\hat{r})+\frac{s}{q}. \end{align}
Let us differentiate with respect to the 5 variables
Ss=s^+bt2ψp+1q,Sr=r^+ψpat2e2t1+at2e2tpnr+pvt2nq,Ss^=s+dλρΣ(λ)1s^+λr^,Sr^=r+dλρΣ(λ)λs^+λr^,Sq=ψpqψp(zst2)+ψpvt21+at2e2tpnr+pvt2nq+1qsq2.\begin{align} &\frac{\partial S}{\partial s }=-\hat{s}+b_t^2\psi_p+\frac{1}{q}, \\ &\frac{\partial S}{\partial r }=-\hat{r}+\frac{\psi_p a_t^2e^{-2t}}{1+\frac{a_t^2e^{-2t}p}{n}r+\frac{pv_t^2}{n}q}, \\ &\frac{\partial S}{\partial \hat{s}}=-s+\int \mathrm{d}\lambda \rho_{\boldsymbol{\Sigma}}(\lambda)\frac{1}{\hat{s}+\lambda\hat{r}}, \\ &\frac{\partial S}{\partial \hat{r}}=-r+\int \mathrm{d}\lambda \rho_{\boldsymbol{\Sigma}}(\lambda)\frac{\lambda}{\hat{s}+\lambda\hat{r}}, \\ &\frac{\partial S}{\partial q}=-\frac{\psi_p}{q}-\psi_p(z-s_t^2)+\frac{\psi_pv_t^2}{1+\frac{a_t^2e^{-2t}p}{n}r+\frac{pv_t^2}{n}q}+\frac{1}{q}-\frac{s}{q^2}. \end{align}
Hence the saddle point equations read
s^=bt2ψp+1q,r^=ψpat2e2t1+at2e2tpnr+pvt2nq,s=dρΣ(λ)1s^+λr^,r=dρΣ(λ)λs^+λr^,ψp(st2z)+ψpvt21+at2e2tpnr+pvt2nq+1ψpqsq2=0.\begin{align} &\hat{s}=b_t^2\psi_p+\frac{1}{q}, \\ &\hat{r}=\frac{\psi_p a_t^2e^{-2t}}{1+\frac{a_t^2e^{-2t}p}{n}r+\frac{pv_t^2}{n}q}, \\ &s=\int \mathrm{d} \rho_{\boldsymbol{\Sigma}}(\lambda)\frac{1}{\hat{s}+\lambda\hat{r}}, \\ &r=\int \mathrm{d} \rho_{\boldsymbol{\Sigma}}(\lambda)\frac{\lambda}{\hat{s}+\lambda\hat{r}}, \\ &\psi_p(s_t^2-z)+\frac{\psi_pv_t^2}{1+\frac{a_t^2e^{-2t}p}{n}r+\frac{pv_t^2}{n}q}+\frac{1-\psi_p}{q}-\frac{s}{q^2}=0. \end{align}
Finally, we observe that the solution qq^* to the saddle point equations corresponds to the Stieltjes transform of ρ\rho.
\begin{align} 2\partial_z\frac{1}{p}\frac{\mathbb{E}[\mathcal{Z}^s]-1}{s}=2\partial_z\frac{1}{p}\frac{e^{-\frac{d}{2}S(q^*, r^*)}-1}{s}\underset{s\rightarrow0}{\rightarrow}-2\partial_z\frac{1}{p}\frac{d}{2}S(q^*, r^*)=q^*. \end{align}
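As a practical aside (our own sketch, not part of the derivation), the saddle-point equations above can be iterated numerically at $z=\lambda+i\varepsilon$ to obtain $q(z)$, and hence the spectral density through $\rho(\lambda)=\operatorname{Im} q(\lambda+i\varepsilon)/\pi$. Below, $\rho_{\boldsymbol{\Sigma}}=\delta(\lambda-1)$ and the values of $\psi_p$, $\psi_n$ and of the time-dependent coefficients $s_t^2, v_t^2, a_t^2, b_t^2$ are placeholders (their definitions are given in the MT); the naive damped iteration is not guaranteed to converge for all parameters.

```python
import numpy as np

# Placeholder parameter values; s_t^2, v_t^2, a_t^2, b_t^2 must be set from their MT definitions.
psi_p, psi_n, t = 64.0, 8.0, 0.01
st2, vt2, at2, bt2 = 0.02, 1.0, 1.0, 0.02
lam_Sigma, w_Sigma = np.array([1.0]), np.array([1.0])    # rho_Sigma = delta(lambda - 1)

def resolvent_q(z, n_iter=5000, damping=0.3):
    """Damped fixed-point iteration of the saddle-point equations, returning q(z)."""
    q = r = s = 1j                                        # start in the upper half-plane
    alpha = at2 * np.exp(-2 * t) * psi_p / psi_n          # a_t^2 e^{-2t} p / n
    beta = vt2 * psi_p / psi_n                            # v_t^2 p / n
    for _ in range(n_iter):
        D = 1.0 + alpha * r + beta * q                    # common denominator
        s_hat = bt2 * psi_p + 1.0 / q
        r_hat = psi_p * at2 * np.exp(-2 * t) / D
        s_new = np.sum(w_Sigma / (s_hat + lam_Sigma * r_hat))
        r_new = np.sum(w_Sigma * lam_Sigma / (s_hat + lam_Sigma * r_hat))
        # last saddle-point equation, solved for q using the previous iterates
        q_new = (psi_p - 1.0) / (psi_p * (st2 - z) + psi_p * vt2 / D - s / q**2)
        q = damping * q_new + (1 - damping) * q
        r = damping * r_new + (1 - damping) * r
        s = damping * s_new + (1 - damping) * s
    return q

# Eigenvalue density from the Stieltjes transform q(z) = int rho(l) / (l - z) dl
eps = 1e-3
lams = np.linspace(0.0, 2 * psi_p, 400)
rho = np.array([resolvent_q(lam + 1j * eps).imag / np.pi for lam in lams])
```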

C.5 Proof of Theorem 2

We recall Theorem 2 of the MT.

Theorem 7 (Informal)

Let ρ\rho denote the spectral density of U\bm{\mathrm{U}}.
  • Regime I (overparametrized): ψp>ψn1\psi_p>\psi_n\gg 1.
ρ(λ)=(11+ψnψp)δ(λst2)+ψnψpρ1(λ)+1ψpρ2(λ).\rho(\lambda)=\Bigl(1-\frac{1+\psi_n}{\psi_p}\Bigr)\delta(\lambda-{s_t^2}) +\frac{\psi_n}{\psi_p}\, \rho_1(\lambda) +\frac{1}{\psi_p}\, \rho_2(\lambda).
  • Regime II (underparametrized): ψn>ψp1\psi_n>\psi_p\gg 1.
ρ(λ)=(11ψp)ρ1(λ)+1ψpρ2(λ).\rho(\lambda)=\Bigl(1-\frac{1}{\psi_p}\Bigr)\rho_1(\lambda) +\frac{1}{\psi_p}\, \rho_2(\lambda).
where $\rho_1$ is an atomless measure with support
[st2+vt2(1ψp/ψn)2,  st2+vt2(1+ψp/ψn)2],\left[s_t^2 + v_t^2\left(1-\sqrt{\psi_p/\psi_n}\right)^{2}, \; s_t^2 + v_t^2\left(1+\sqrt{\psi_p/\psi_n}\right)^{2}\right],
and ρ2\rho_2 coincides with the asymptotic eigenvalue bulk density of the population covariance U~=EX[U]\tilde{\bm{\mathrm{U}}}=\mathbb{E}_{\bm{\mathrm{X}}}[\bm{\mathrm{U}}]; ρ2\rho_2 is independent of ψn\psi_n and its support is on the scale ψp\psi_p.
The eigenvectors associated with δ(λst2)\delta(\lambda-{s_t^2}) leave both training and test losses unchanged and are therefore irrelevant. In the limit ψpψn\psi_p\gg \psi_n, the supports of ρ1\rho_1 and ρ2\rho_2 are respectively on the scales ψp/ψn\psi_p/\psi_n and ψp\psi_p, i.e. they are well separated.
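As a concrete illustration of these weights (our own arithmetic, using the parameter values of Figure 14 below), take $\psi_p=64$ and $\psi_n=8$ in Regime I:
\begin{align}
\rho(\lambda)=\frac{55}{64}\,\delta(\lambda-s_t^2)+\frac{8}{64}\,\rho_1(\lambda)+\frac{1}{64}\,\rho_2(\lambda),
\end{align}
i.e. roughly $86\%$ of the eigenvalues sit on the delta peak, $12.5\%$ fall in the first bulk $\rho_1$, and about $1.6\%$ in the second bulk $\rho_2$.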
We now proceed to prove Theorem 2.
Proof: Delta peak. We first account for the delta peak in the spectrum. We use the Gaussian equivalence for $\bm{\mathrm{U}}$ computed in Lemma 4. Let $\bm{\Omega}^\nu\in\mathbb{R}^p$ be the $\nu$-th column of $\bm{\Omega}$ and $\bm{\mathrm{W}}_k\in\mathbb{R}^p$ the $k$-th column of $\bm{\mathrm{W}}$. Suppose that a vector $\bm{\mathrm{v}}\in\mathbb{R}^p$ is orthogonal to all of these vectors,
\begin{align} &\forall \nu=1, \dots, n, \quad\sum_{i=1}^p \bm{\Omega}^\nu_i \bm{\mathrm{v}}_i=0, \\ &\forall k=1, \ldots, d, \quad\sum_{i=1}^p \bm{\mathrm{W}}_{ik} \bm{\mathrm{v}}_i=0. \end{align}
then $\bm{\mathrm{U}} \bm{\mathrm{v}}=s_t^2 \bm{\mathrm{v}}$. These are $n+d$ linear constraints on a vector of size $p$, so nontrivial solutions exist whenever $n+d\le p$; a delta peak at $s_t^2$ therefore appears as soon as $\psi_p \ge \psi_n+1$. Next, we extract its weight. Recall that the Stieltjes transform satisfies
q(z)  =  ρ(λ)λzdλ,q(z) \;=\;\int \frac{\rho(\lambda)}{\lambda - z}\, \mathrm{d}\lambda,
and a point mass of weight ff at λ=st2\lambda = s_t^2 contributes fzst2fε\tfrac{-f}{z - s_t^2}\approx \tfrac{f}{\varepsilon} as zst2εz \to s_t^2 - \varepsilon . Meanwhile
s(z)  =  1pTr ⁣[WT(UzI)1W],r(z)  =  1pTr ⁣[Σ1/2WT(UzI)1WΣ1/2]s(z) \;=\;\frac1p \operatorname{Tr}\!\bigl[\bm{\mathrm{W}}^T(\bm{\mathrm{U}} - z \bm{I})^{-1} \bm{\mathrm{W}}\bigr], \quad r(z) \;=\;\frac1p \operatorname{Tr}\!\bigl[{\boldsymbol{\Sigma}}^{1/2} \bm{\mathrm{W}}^T(\bm{\mathrm{U}} - z \bm{I})^{-1} \bm{\mathrm{W}}{\boldsymbol{\Sigma}}^{1/2}\bigr]
remain finite in that limit, since the corresponding eigenvectors satisfy $\bm{\mathrm{W}}^T \bm{\mathrm{v}}=0$. We substitute this Ansatz into the equations of Theorem 1. The first equation reads
ψnpvt2n1+e2tμ12pσx2nr+pnvt2q+ψp(st2z)+1ψpqsq2=0,\begin{align} \psi_n\frac{\frac{pv_t^2}{n}}{1+\frac{e^{-2t}\mu_1^2p\sigma_{\bm{\mathrm{x}}}^2}{n}r+\frac{p}{n}v_t^2q}+\psi_p(s_t^2-z)+\frac{1-\psi_p}{q}-\frac{s}{q^2}=0, \end{align}
and simplifies to
ψnεf+ψpε+(1ψp)εf=0.\begin{align} \frac{\psi_n\varepsilon}{f}+\psi_p\varepsilon+\frac{(1-\psi_p)\varepsilon}{f}=0. \end{align}
It readily gives
f=11ψpψnψp.\begin{align} f=1-\frac{1}{\psi_p}-\frac{\psi_n}{\psi_p}. \end{align}
Thus the point mass at st2s_t^2 has weight 11ψpψnψp1-\frac{1}{\psi_p}-\frac{\psi_n}{\psi_p}, in agreement with the counting of degrees of freedom presented above.
Finally, one checks that these isolated eigenvalues do not contribute to the train and test losses. After expanding the square they read
Ltrain(A)=1+ΔtdTr(ATpApU)+2ΔtdTr(ApV)Ltest(A)=1+ΔtdTr(ATpApU~)+2ΔtdTr(ApV~)\begin{align} &\mathcal{L}_\mathrm{train}(\bm{\mathrm{A}})=1+\frac{\Delta_t}{d} \operatorname{Tr}(\frac{\bm{\mathrm{A}}^T}{\sqrt{p}}\frac{\bm{\mathrm{A}}}{\sqrt{p}} \bm{\mathrm{U}})+\frac{2\sqrt{\Delta_t}}{d} \operatorname{Tr}(\frac{\bm{\mathrm{A}}}{\sqrt{p}} \bm{\mathrm{V}})\\ &\mathcal{L}_\mathrm{test}(\bm{\mathrm{A}})=1+\frac{\Delta_t}{d} \operatorname{Tr}(\frac{\bm{\mathrm{A}}^T}{\sqrt{p}}\frac{\bm{\mathrm{A}}}{\sqrt{p}}\tilde{\bm{\mathrm{U}}})+\frac{2\sqrt{\Delta_t}}{d} \operatorname{Tr}(\frac{\bm{\mathrm{A}}}{\sqrt{p}}\tilde{\bm{\mathrm{V}}}) \end{align}
The terms that appear in the loss are of the form $\operatorname{Tr}(\bm{\mathrm{A}}^T \bm{\mathrm{A}}\,...)$ and $\operatorname{Tr}(\bm{\mathrm{A}} \bm{\mathrm{V}})$. The trace can be decomposed in the basis of eigenvectors of $\bm{\mathrm{U}}$. The eigenvectors associated with the delta peak satisfy $\bm{\mathrm{W}}^T \bm{\mathrm{v}}=0$. Looking at the expression of the matrix $\bm{\mathrm{A}}= \bm{\mathrm{W}}^T...+ \bm{\mathrm{A}}_0$, one easily sees that, for initial conditions $\bm{\mathrm{A}}_0=0$, one has $\bm{\mathrm{v}}^T \bm{\mathrm{A}}^T=0$, so the subspace corresponding to these isolated eigenvalues does not contribute to the loss.
First bulk. Using the expression for q=1pTr1UzIpq=\frac{1}{p} \operatorname{Tr}\frac{1}{\bm{\mathrm{U}}-z \bm{I}_p} and r(z)=1pTr(Σ1/2WT(UzI)1WΣ1/2)r(z)=\frac{1}{p} \operatorname{Tr}(\boldsymbol{\Sigma}^{1/2} \bm{\mathrm{W}}^T(\bm{\mathrm{U}}-z \bm{I})^{-1} \bm{\mathrm{W}} \boldsymbol{\Sigma}^{1/2}) we make the following Ansatz in the large ψp\psi_p limit:
q=Oψp(1),r=Oψp(1ψp).\begin{align} q=\mathcal{O}_{\psi_p}(1), \quad r=\mathcal{O}_{\psi_p}(\frac{1}{\psi_p}). \end{align}
In this limit the saddle point equations become, at leading order in $\psi_p$,
s^=bt2ψpr^=ψpat2e2t1+vt2pnrs=O(1/ψp)r=O(1/ψp)(st2z)+vt21+pvt2nq1q=0.\begin{align} &\hat{s}=b_t^2\psi_p\\ &\hat{r}=\frac{\psi_pa_t^2e^{-2t}}{1+\frac{v_t^2p}{n}r}\\ &s=\mathcal{O}(1/\psi_p)\\ &r=\mathcal{O}(1/\psi_p)\\ &(s_t^2-z)+\frac{ v_t^2}{1+\frac{pv_t^2}{n}q}-\frac{1}{q}=0. \end{align}
We can thus focus on the last equation, which involves $q$ only. Multiplying it by $q(1+\frac{pv_t^2}{n}q)$ turns it into a quadratic equation in $q$,
\begin{align}
\frac{pv_t^2}{n}(s_t^2-z)q^2+\left(s_t^2-z+v_t^2\left(1-\frac{p}{n}\right)\right)q-1=0.
\end{align}
If its discriminant is negative, the two roots are complex conjugates and the density of eigenvalues is therefore non-zero. The edges of the bulk are where the discriminant, evaluated at $z=\lambda$, vanishes,
\begin{align} \Delta=\left(s_t^2-\lambda+\left(1-\frac{p}{n}\right)v_t^2\right)^2+4(s_t^2-\lambda)\frac{p}{n}v_t^2=0. \end{align}
It vanishes for
λ±=st2+vt2(1±pn)2\begin{align} \lambda_{\pm}=s_t^2+v_t^2\left(1\pm\sqrt{\frac{p}{n}}\right)^2 \end{align}
which are the edges of the first bulk $\rho_1$. We have checked this result, and hence validated the Ansatz, by solving numerically the equations on $r, q$. Interestingly, at leading order the expression of the first bulk is independent of $\rho_{\boldsymbol{\Sigma}}$.
Second Bulk. We scale q=Oψp(1/ψp)q=\mathcal{O}_{\psi_p}(1/\psi_p) and r=Oψp(1/ψp)r=\mathcal{O}_{\psi_p}(1/\psi_p). The equations on s^\hat{s} and r^\hat{r} lead to
s^=ψpbt2+1qr^=ψpat2e2t.\begin{align} &\hat{s}=\psi_pb_t^2+\frac{1}{q}\\ &\hat{r}=\psi_pa_t^2e^{-2t}. \end{align}
This yields the following equation on qq
ψp(st2z)+ψpvt2+1ψpq1qdρΣ(λ)1+qψp(bt2+λat2e2t)=0.\begin{align} \psi_p(s_t^2-z)+\psi_pv_t^2+\frac{1-\psi_p}{q}-\frac{1}{q}\int\frac{\mathrm{d}\rho_{\boldsymbol{\Sigma}}(\lambda)}{1+q\psi_p(b_t^2+\lambda a_t^2e^{-2t})}=0. \end{align}
We denote the shifted variable z=zst2vt2z'=z-s_t^2-v_t^2. This yields
ψpz+1ψpq1qdρΣ(λ)1+qψp(bt2+λat2e2t)=0.\begin{align} -\psi_pz'+\frac{1-\psi_p}{q}-\frac{1}{q}\int\frac{\mathrm{d}\rho_{\boldsymbol{\Sigma}}(\lambda)}{1+q\psi_p(b_t^2+\lambda a_t^2e^{-2t})}=0. \end{align}
We decompose the integral
dρΣ(λ)1+qψp(bt2+λat2e2t)=dρΣ(λ)(1+qψp(bt2+λat2e2t)qψp(bt2+λat2e2t))1+qψp(bt2+λat2e2t)=1qψpdρΣ(λ)(bt2+λat2e2t)1+qψp(bt2+λat2e2t)\begin{align} \int\frac{\mathrm{d}\rho_{\boldsymbol{\Sigma}}(\lambda)}{1+q\psi_p(b_t^2+\lambda a_t^2e^{-2t})}&=\int\frac{\mathrm{d}\rho_{\boldsymbol{\Sigma}}(\lambda)(1+q\psi_p(b_t^2+\lambda a_t^2e^{-2t})-q\psi_p(b_t^2+\lambda a_t^2e^{-2t}))}{1+q\psi_p(b_t^2+\lambda a_t^2e^{-2t})}\\ &=1-q\psi_p\int\frac{\mathrm{d}\rho_{\boldsymbol{\Sigma}}(\lambda)(b_t^2+\lambda a_t^2e^{-2t})}{1+q\psi_p(b_t^2+\lambda a_t^2e^{-2t})} \end{align}
By plugging this back in the equation we find
q=(zdρΣ(λ)(bt2+λat2e2t)1+ψpq(bt2+λat2e2t))1.\begin{align} q=-\left(z'-\int \frac{\mathrm{d} \rho_{\boldsymbol{\Sigma}}(\lambda)(b_t^2+\lambda a_t^2e^{-2t})}{1+\psi_p q(b_t^2+\lambda a_t^2e^{-2t})}\right)^{-1}. \end{align}
We do the change of variable μ=bt2+λat2e2t\mu=b_t^2+\lambda a_t^2e^{-2t}. This yields
q=(z1at2e2tdμρΣ(μbt2at2e2t)μ1+ψpqμ)1.\begin{align} q=-\left(z'-\frac{1}{a_t^2e^{-2t}}\int \frac{\mathrm{d}\mu \rho_{\boldsymbol{\Sigma}}(\frac{\mu-b_t^2}{a_t^2e^{-2t}})\mu}{1+\psi_p q\mu}\right)^{-1}. \end{align}
An integration by parts gives $b_t^2=\Delta_t\mu_1^2(t)$ and $a_t^2=\mu_1^2(t)/\sigma_{\bm{\mathrm{x}}}^2$. We thus realize that the integral is over the eigenvalue distribution of $\mu_1^2(t)(e^{-2t} \boldsymbol{\Sigma}+\Delta_t \bm{I}_d)$,
q=(zdμρμ12(t)Σt(μ)μ1+ψpqμ)1.\begin{align} q=-\left(z'-\int \frac{\mathrm{d}\mu \rho_{\mu_1^2(t) \boldsymbol{\Sigma}_t}(\mu)\mu}{1+\psi_p q\mu}\right)^{-1}. \end{align}
We recognize the Bai-Silverstein equations [64, 65] for the eigenvalue density of the matrix
U~=μ12(t)WΣtWTd+(st2+vt2)Ip=Ex[U]\begin{align} \tilde{\bm{\mathrm{U}}}=\mu_1^2(t)\frac{\bm{\mathrm{W}} \boldsymbol{\Sigma}_t \bm{\mathrm{W}}^T}{d}+(s_t^2+v_t^2) \bm{I}_p=\mathbb{E}_{\bm{\mathrm{x}}}[\bm{\mathrm{U}}] \end{align}
which is the population version of $\bm{\mathrm{U}}$ and is thus independent of $n$. Lemma 5 then gives the order of the eigenvalues in the bulk of $\rho_2$.

C.6 Dynamics on the fast timescales

In the following we denote for a matrix ARp×p\bm{\mathrm{A}}\in \mathbb{R}^{p\times p},
Aop=supvRp,v=1Av\begin{align} \lVert \bm{\mathrm{A}}\rVert_{\mathrm{op}}=\underset{\bm{\mathrm{v}}\in\mathbb{R}^p, \lVert \bm{\mathrm{v}}\rVert=1}{\sup}\lVert \bm{\mathrm{A}} \bm{\mathrm{v}}\rVert \end{align}
the operator norm and
AF=(i,j=1pAij2)1/2\begin{align} \lVert \bm{\mathrm{A}}\rVert_\mathrm{F}=(\sum_{i, j=1}^p \bm{\mathrm{A}}_{ij}^2)^{1/2} \end{align}
the Frobenius norm. Before deriving the fast‐time behavior, we need the following lemma.

Lemma 8

The operator norm of UU~\bm{\mathrm{U}}-\tilde{\bm{\mathrm{U}}} satisfies
(UU~)op=O(ψpψn),\begin{align} \lVert (\bm{\mathrm{U}}-\tilde{\bm{\mathrm{U}}})\rVert_{\mathrm{op}}=\mathcal{O}(\frac{\psi_p}{\sqrt{\psi_n}}), \end{align}
when pndp\gg n\gg d.
Proof: On the one hand,
\begin{align} \bm{\mathrm{U}}=e^{-2t}a_t^2\frac{\bm{\mathrm{W}} \bm{\mathrm{X}} \bm{\mathrm{X}}^T \bm{\mathrm{W}}^T}{nd}+b_t^2\frac{\bm{\mathrm{W}} \bm{\mathrm{W}}^T}{d}+v_t^2\frac{\bm{\Omega} \bm{\Omega}^T}{n}+\frac{e^{-t}a_tv_t}{n\sqrt{d}}\left(\bm{\mathrm{W}} \bm{\mathrm{X}} \bm{\Omega}^T+ \bm{\Omega} \bm{\mathrm{X}}^T \bm{\mathrm{W}}^T\right)+s_t^2 \bm{I}_p \end{align}
and on the other hand,
U~=μ12e2tWΣWTd+Δtμ12WWTd+(st2+vt2)Ip.\begin{align} \tilde{\bm{\mathrm{U}}}=\mu_1^2e^{-2t}\frac{\bm{\mathrm{W}} \boldsymbol{\Sigma} \bm{\mathrm{W}}^T}{d}+\Delta_t\mu_1^2\frac{\bm{\mathrm{W}} \bm{\mathrm{W}}^T}{d}+(s_t^2+v_t^2) \bm{I}_p. \end{align}
We also note the identities bt2=Δtμ12(t)b_t^2=\Delta_t\mu_1^2(t) and at2=μ12(t)a_t^2=\mu_1^2(t).
UU~=at2e2tWd(XXTnΣ)WTd+vt2(ΩΩTnIp)+atvtetnd(ΩXTWT+WXΩT).\begin{align} \bm{\mathrm{U}}-\tilde{\bm{\mathrm{U}}}=a_t^2e^{-2t}\frac{\bm{\mathrm{W}}}{\sqrt{d}}(\frac{\bm{\mathrm{X}} \bm{\mathrm{X}}^T}{n}- \boldsymbol{\Sigma})\frac{\bm{\mathrm{W}}^T}{\sqrt{d}}+v_t^2(\frac{\bm{\Omega} \bm{\Omega}^T}{n}- \bm{I}_p)+\frac{a_tv_te^{-t}}{n\sqrt{d}}(\bm{\Omega} \bm{\mathrm{X}}^T \bm{\mathrm{W}}^T+ \bm{\mathrm{W}} \bm{\mathrm{X}} \bm{\Omega}^T). \end{align}
We can bound its operator norm
(UU~)opC1Wd(XXTnΣ)WTdop+C2(ΩΩTnIp)op+C3ndΩXTWT+WXΩTop,\begin{align} \lVert (\bm{\mathrm{U}}-\tilde{\bm{\mathrm{U}}})&\rVert_{\mathrm{op}}\le C_1 \lVert \frac{\bm{\mathrm{W}}}{\sqrt{d}}(\frac{\bm{\mathrm{X}} \bm{\mathrm{X}}^T}{n}- \boldsymbol{\Sigma})\frac{\bm{\mathrm{W}}^T}{\sqrt{d}}\rVert_{\mathrm{op}}+C_2\lVert (\frac{\bm{\Omega} \bm{\Omega}^T}{n}- \bm{I}_p)\rVert_{\mathrm{op}} \nonumber \\ &\qquad+\frac{C_3}{n\sqrt{d}}\lVert \bm{\Omega} \bm{\mathrm{X}}^T \bm{\mathrm{W}}^T+ \bm{\mathrm{W}} \bm{\mathrm{X}} \bm{\Omega}^T\rVert_{\mathrm{op}}, \end{align}
where $C_1, C_2, C_3$ are constants independent of $p, n, d$. We bound each of the three terms on the right-hand side. We will use the fact that, for a symmetric matrix, the operator norm $\lVert \cdot\rVert_{\mathrm{op}}$ is equal to its largest eigenvalue in absolute value.
First term.
Wd(XXTnΣ)WTdop.\begin{align} \lVert \frac{\bm{\mathrm{W}}}{\sqrt{d}}(\frac{\bm{\mathrm{X}} \bm{\mathrm{X}}^T}{n}- \boldsymbol{\Sigma})\frac{\bm{\mathrm{W}}^T}{\sqrt{d}}\rVert_{\mathrm{op}}. \end{align}
We observe that $\frac{\bm{\mathrm{W}}}{\sqrt{d}}(\frac{\bm{\mathrm{X}} \bm{\mathrm{X}}^T}{n}- \boldsymbol{\Sigma})\frac{\bm{\mathrm{W}}^T}{\sqrt{d}}$ and $\frac{\bm{\mathrm{W}}^T}{\sqrt{d}}\frac{\bm{\mathrm{W}}}{\sqrt{d}}(\frac{\bm{\mathrm{X}} \bm{\mathrm{X}}^T}{n}- \boldsymbol{\Sigma})$ have the same eigenvalues up to the multiplicity of $0$: they both have the same moments $\operatorname{Tr}[(\cdot)^k]$ owing to the cyclicity of the trace. We then use the sub-multiplicativity of the operator norm,
Wd(XXTnΣ)WTdopWTdWdop(XXTnΣ)op.\begin{align} \lVert \frac{\bm{\mathrm{W}}}{\sqrt{d}}(\frac{\bm{\mathrm{X}} \bm{\mathrm{X}}^T}{n}- \boldsymbol{\Sigma})\frac{\bm{\mathrm{W}}^T}{\sqrt{d}}\rVert_{\mathrm{op}}\le\lVert \frac{\bm{\mathrm{W}}^T}{\sqrt{d}} \frac{\bm{\mathrm{W}}}{\sqrt{d}}\rVert_{\mathrm{op}}\lVert(\frac{\bm{\mathrm{X}} \bm{\mathrm{X}}^T}{n}- \boldsymbol{\Sigma})\rVert_{\mathrm{op}}. \end{align}
We can do the same operation by introducing $\bm{\mathrm{X}}= \boldsymbol{\Sigma}^{1/2} \bm{\mathrm{Z}}$ with $\bm{\mathrm{Z}}\in\mathbb{R}^{d\times n}$ having standard Gaussian entries,
(XXTnΣ)op=Σ1/2(ZZTnId)Σ1/2op(ZZTnId)opΣop.\begin{align} \lVert(\frac{\bm{\mathrm{X}} \bm{\mathrm{X}}^T}{n}- \boldsymbol{\Sigma})\rVert_{\mathrm{op}}=\lVert \boldsymbol{\Sigma}^{1/2}(\frac{\bm{\mathrm{Z}} \bm{\mathrm{Z}}^T}{n}- \bm{I}_d) \boldsymbol{\Sigma}^{1/2}\rVert_{\mathrm{op}}\le\lVert(\frac{\bm{\mathrm{Z}} \bm{\mathrm{Z}}^T}{n}- \bm{I}_d)\rVert_{\mathrm{op}}\lVert \boldsymbol{\Sigma}\rVert_{\mathrm{op}}. \end{align}
Among our assumptions, we had $\lVert \boldsymbol{\Sigma}\rVert_{\mathrm{op}}=\mathcal{O}(1)$. The spectrum of $\frac{\bm{\mathrm{Z}} \bm{\mathrm{Z}}^T}{n}- \bm{I}_d$ follows a (shifted) Marchenko-Pastur law whose largest eigenvalue is of order $\sqrt{d/n}$, while for $\frac{\bm{\mathrm{W}}^T \bm{\mathrm{W}}}{d}$ it is of order $\frac{p}{d}$. The bound reads
Wd(XXTnΣ)WTdopO(pnd).\begin{align} \lVert \frac{\bm{\mathrm{W}}}{\sqrt{d}}(\frac{\bm{\mathrm{X}} \bm{\mathrm{X}}^T}{n}- \boldsymbol{\Sigma})\frac{\bm{\mathrm{W}}^T}{\sqrt{d}}\rVert_{\mathrm{op}}\le\mathcal{O}(\frac{p}{\sqrt{nd}}). \end{align}
Second term.
(ΩΩTnIp)op.\begin{align} \lVert (\frac{\bm{\Omega} \bm{\Omega}^T}{n}- \bm{I}_p)\rVert_{\mathrm{op}}. \end{align}
We observe that the spectrum of $\bm{\Omega} \bm{\Omega}^T/n- \bm{I}_p$ is of the Marchenko-Pastur type, and thus its largest eigenvalue is of order $\mathcal{O}(p/n)$, yielding
(ΩΩTnIp)opO(p/n).\begin{align} \lVert (\frac{\bm{\Omega} \bm{\Omega}^T}{n}- \bm{I}_p)\rVert_{\mathrm{op}}\le \mathcal{O}(p/n). \end{align}
Third term.
ΩXTWT+WXΩTop.\begin{align} \lVert \bm{\Omega} \bm{\mathrm{X}}^T \bm{\mathrm{W}}^T+ \bm{\mathrm{W}} \bm{\mathrm{X}} \bm{\Omega}^T\rVert_{\mathrm{op}}. \end{align}
We first bound the operator norm by the Frobenius norm.
ΩXTWT+WXΩTop2ΩXTWTF.\begin{align} \lVert \bm{\Omega} \bm{\mathrm{X}}^T \bm{\mathrm{W}}^T+ \bm{\mathrm{W}} \bm{\mathrm{X}} \bm{\Omega}^T\rVert_{\mathrm{op}}\le 2\lVert \bm{\Omega} \bm{\mathrm{X}}^T \bm{\mathrm{W}}^T\rVert_{\mathrm{F}}. \end{align}
We expand the square
ΩXTWT+WXΩTF2=Ck=1di=1p(ν=1nΩiνXkνWkl)2.\begin{align} \lVert \bm{\Omega} \bm{\mathrm{X}}^T \bm{\mathrm{W}}^T+ \bm{\mathrm{W}} \bm{\mathrm{X}} \bm{\Omega}^T\rVert_{\mathrm{F}}^2=C\sum_{k=1}^d\sum_{i=1}^p(\sum_{\nu=1}^n \bm{\Omega}_i^\nu \bm{\mathrm{X}}_k^\nu \bm{\mathrm{W}}_{kl})^2. \end{align}
The Central Limit Theorem yields
ν=1nΩiνXkνWkl=O(n)Wkl,\begin{align} \sum_{\nu=1}^n \bm{\Omega}_i^\nu \bm{\mathrm{X}}_k^\nu \bm{\mathrm{W}}_{kl}=\mathcal{O}(\sqrt{n}) \bm{\mathrm{W}}_{kl}, \end{align}
hence
1ndΩXTWT+WXΩTop=O(ndpnd)=O(pn)\begin{align} \frac{1}{n\sqrt{d}}\lVert \bm{\Omega} \bm{\mathrm{X}}^T \bm{\mathrm{W}}^T+ \bm{\mathrm{W}} \bm{\mathrm{X}} \bm{\Omega}^T\rVert_{\mathrm{op}}=\mathcal{O}(\frac{\sqrt{ndp}}{n\sqrt{d}})=\mathcal{O}(\sqrt{\frac{p}{n}}) \end{align}
Putting all the contributions together yields
(UU~)opO(pdn)=O(ψpψn).\begin{align} \lVert (\bm{\mathrm{U}}-\tilde{\bm{\mathrm{U}}})\rVert_{\mathrm{op}}&\le \mathcal{O}(\frac{p}{\sqrt{dn}})=\mathcal{O}(\frac{\psi_p}{\sqrt{\psi_n}}). \end{align}
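This scaling can be probed numerically. The sketch below is ours; the coefficients $a_t, b_t, v_t, s_t$ are placeholders standing in for their MT definitions. It builds the Gaussian-equivalent matrices $\bm{\mathrm{U}}$ and $\tilde{\bm{\mathrm{U}}}$ for $\boldsymbol{\Sigma}=\bm{I}_d$ and checks that $\lVert\bm{\mathrm{U}}-\tilde{\bm{\mathrm{U}}}\rVert_{\mathrm{op}}\sqrt{\psi_n}/\psi_p$ stays of order one as $\psi_n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
d, psi_p, t = 50, 32, 0.01
at, bt, vt, st = 1.0, 0.1, 1.0, 0.1          # placeholder time-dependent coefficients
p = psi_p * d

def op_norm_diff(psi_n):
    """Operator norm of U - U_tilde for the Gaussian-equivalent model with Sigma = I_d."""
    n = psi_n * d
    W = rng.standard_normal((p, d))
    X = rng.standard_normal((d, n))          # Sigma = I_d, so X = Z
    Om = rng.standard_normal((p, n))
    WX = W @ X
    U = (np.exp(-2 * t) * at**2 * WX @ WX.T / (n * d)
         + bt**2 * W @ W.T / d
         + vt**2 * Om @ Om.T / n
         + np.exp(-t) * at * vt / (n * np.sqrt(d)) * (WX @ Om.T + Om @ WX.T)
         + st**2 * np.eye(p))
    U_tilde = (at**2 * np.exp(-2 * t) + bt**2) * W @ W.T / d + (st**2 + vt**2) * np.eye(p)
    return np.linalg.norm(U - U_tilde, 2)

for psi_n in (2, 4, 8, 16):
    # Lemma 8 predicts the rescaled quantity below to remain of order one
    print(psi_n, op_norm_diff(psi_n) * np.sqrt(psi_n) / psi_p)
```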

Proposition (Informal).

On timescales 1τψn1\ll \tau\ll \psi_n, both the train and test losses satisfy
LtrainLtest1O(Δt).\begin{align} \mathcal{L}_\mathrm{train}\simeq\mathcal{L}_\mathrm{test}\simeq 1-\mathcal{O}(\Delta_t). \end{align}
Proof: According to the spectral analysis of $\bm{\mathrm{U}}$ conducted previously, there are two bulks in the spectrum that contribute to the dynamics: a first bulk with eigenvalues of order $\frac{\psi_p}{\psi_n}$ and a second bulk with eigenvalues of order $\psi_p$, in the $\psi_p, \psi_n\gg1$ limit. Hence, in the regime $1\ll\tau\ll \psi_n$, $e^{-\lambda\frac{\Delta_t\tau}{\psi_p}}\simeq0$ if $\lambda$ is in the second bulk and $e^{-\lambda\frac{\Delta_t\tau}{\psi_p}}\simeq1$ if $\lambda$ is in the first bulk. We recall the expressions of the train and test losses
Ltrain(A)=1+ΔtdTr(ATpApU)+2ΔtdTr(ApV)Ltest(A)=1+ΔtdTr(ATpApU~)+2ΔtdTr(ApV~)\begin{align} &\mathcal{L}_\mathrm{train}(\bm{\mathrm{A}})=1+\frac{\Delta_t}{d} \operatorname{Tr}(\frac{\bm{\mathrm{A}}^T}{\sqrt{p}}\frac{\bm{\mathrm{A}}}{\sqrt{p}} \bm{\mathrm{U}})+\frac{2\sqrt{\Delta_t}}{d} \operatorname{Tr}(\frac{\bm{\mathrm{A}}}{\sqrt{p}} \bm{\mathrm{V}})\\ &\mathcal{L}_\mathrm{test}(\bm{\mathrm{A}})=1+\frac{\Delta_t}{d} \operatorname{Tr}(\frac{\bm{\mathrm{A}}^T}{\sqrt{p}}\frac{\bm{\mathrm{A}}}{\sqrt{p}}\tilde{\bm{\mathrm{U}}})+\frac{2\sqrt{\Delta_t}}{d} \operatorname{Tr}(\frac{\bm{\mathrm{A}}}{\sqrt{p}}\tilde{\bm{\mathrm{V}}}) \end{align}
and use the expression of $\bm{\mathrm{A}}(\tau)$ in Proposition 3, which we expand in the basis of eigenvectors $\{\bm{\mathrm{v}}_\lambda\}_{\lambda\in \mathrm{Sp}(\bm{\mathrm{U}})}$ of $\bm{\mathrm{U}}$.
A(τ)p=1ΔtVTU1(e2ΔtdUτIp)=1ΔtVTU1λ(e2Δtdλτ1)vλvλT1ΔtVTU1λρ2vλvλT,\begin{align} \frac{\bm{\mathrm{A}}(\tau)}{\sqrt{p}}&=\frac{1}{\sqrt{\Delta_t}} \bm{\mathrm{V}}^T \bm{\mathrm{U}}^{-1}(e^{-\frac{2\Delta_t}{d} \bm{\mathrm{U}} \tau}- \bm{I}_p)\\ &=\frac{1}{\sqrt{\Delta_t}} \bm{\mathrm{V}}^T \bm{\mathrm{U}}^{-1}\sum_\lambda(e^{-\frac{2\Delta_t}{d}\lambda \tau}-1) \bm{\mathrm{v}}_\lambda \bm{\mathrm{v}}_\lambda^T\\ &\sim -\frac{1}{\sqrt{\Delta_t}} \bm{\mathrm{V}}^T \bm{\mathrm{U}}^{-1}\sum_{\lambda\in \rho_2} \bm{\mathrm{v}}_\lambda \bm{\mathrm{v}}_\lambda^T, \end{align}
where λρ2\lambda\in\rho_2 means that the eigenvalue λ\lambda belongs to the second bulk. We also have that V\bm{\mathrm{V}} and V~\tilde{\bm{\mathrm{V}}} have the same GEP μ1(t)ΔtΓtWd\frac{\mu_1(t)\sqrt{\Delta_t}}{\Gamma_t}\frac{\bm{\mathrm{W}}}{\sqrt{d}} and they thus cancel each other when computing the generalization loss Lgen=LtestLtrain\mathcal{L}_{\mathrm{gen}}=\mathcal{L}_{\mathrm{test}}-\mathcal{L}_{\mathrm{train}}. It reads
Lgen=μ12(t)ΔtΓt2dTr(λ,λρ2vλvλTU1WWTdU1vλvλT(UU~))=μ12ΔtΓt2d(λ,λρ2vλTU1WWTdU1vλvλT(UU~)vλ)=μ12ΔtΓt2d(λ,λρ2vλT1λWWTd1λvλvλT(UU~)vλ)(12)\begin{align} \mathcal{L}_\mathrm{gen}&=-\frac{\mu_1^2(t)\Delta_t}{\Gamma_t^2d} \operatorname{Tr}(\sum_{\lambda, \lambda'\in\rho_2} \bm{\mathrm{v}}_{\lambda'} \bm{\mathrm{v}}_{\lambda'}^T \bm{\mathrm{U}}^{-1}\frac{\bm{\mathrm{W}} \bm{\mathrm{W}}^T}{d} \bm{\mathrm{U}}^{-1} \bm{\mathrm{v}}_{\lambda} \bm{\mathrm{v}}_{\lambda}^T(\bm{\mathrm{U}}-\tilde{\bm{\mathrm{U}}}))\\ &=-\frac{\mu_1^2\Delta_t}{\Gamma_t^2d}(\sum_{\lambda, \lambda'\in\rho_2} \bm{\mathrm{v}}_{\lambda'}^T \bm{\mathrm{U}}^{-1}\frac{\bm{\mathrm{W}} \bm{\mathrm{W}}^T}{d} \bm{\mathrm{U}}^{-1} \bm{\mathrm{v}}_{\lambda} \bm{\mathrm{v}}_{\lambda}^T(\bm{\mathrm{U}}-\tilde{\bm{\mathrm{U}}}) \bm{\mathrm{v}}_{\lambda'})\\ &=-\frac{\mu_1^2\Delta_t}{\Gamma_t^2d}(\sum_{\lambda, \lambda'\in\rho_2} \bm{\mathrm{v}}_{\lambda'}^T\frac{1}{\lambda'}\frac{\bm{\mathrm{W}} \bm{\mathrm{W}}^T}{d}\frac{1}{\lambda} \bm{\mathrm{v}}_{\lambda} \bm{\mathrm{v}}_{\lambda}^T(\bm{\mathrm{U}}-\tilde{\bm{\mathrm{U}}}) \bm{\mathrm{v}}_{\lambda'})\\ \end{align}\tag{12}
We then use Lemma 8 --- which states that the operator norm of UU~\bm{\mathrm{U}}-\tilde{\bm{\mathrm{U}}} in the subspace spanned by the eigenvectors of the second bulk is bounded by O(ψpψn)\mathcal{O}(\frac{\psi_p}{\sqrt{\psi_n}}) --- to bound Lgen\mathcal{L}_{\mathrm{gen}},
Lgenμ12ΔtΓt2d(λ,λρ2vλT1λWWTd1λvλvλT(UU~)vλ)opμ12ΔtΓt2dd1ψp2WWTdopψpψnO(dψp2dψp2ψn)=O(1ψn).\begin{align} \lvert \mathcal{L}_{\mathrm{gen}} \rvert &\le \lVert \frac{\mu_1^2\Delta_t}{\Gamma_t^2d}(\sum_{\lambda, \lambda'\in\rho_2} \bm{\mathrm{v}}_{\lambda'}^T\frac{1}{\lambda'}\frac{\bm{\mathrm{W}} \bm{\mathrm{W}}^T}{d}\frac{1}{\lambda} \bm{\mathrm{v}}_{\lambda} \bm{\mathrm{v}}_{\lambda}^T(\bm{\mathrm{U}}-\tilde{\bm{\mathrm{U}}}) \bm{\mathrm{v}}_{\lambda'})\rVert_{\mathrm{op}}\\ &\le \frac{\mu_1^2\Delta_t}{\Gamma_t^2d} d\frac{1}{\psi_p^2} \lVert \frac{\bm{\mathrm{W}} \bm{\mathrm{W}}^T}{d}\rVert_{\mathrm{op}}\frac{\psi_p}{\sqrt{\psi_n}}\le\mathcal{O}(\frac{d\psi_p^2}{d\psi_p^2\sqrt{\psi_n}})=\mathcal{O}(\frac{1}{\sqrt{\psi_n}}). \end{align}
We also used the fact that the sums contain $d$ terms (only the diagonal ones matter) and that the eigenvalues scale as $\psi_p$. The bound shows that $\mathcal{L}_{\mathrm{gen}}$ vanishes asymptotically in the regime of a large number of data points and parameters. Therefore, on the fast timescale we find $\mathcal{L}_\mathrm{train}\simeq \mathcal{L}_\mathrm{test}$. Let us now focus on $\mathcal{L}_\mathrm{train}$
Ltrain=1+μ12ΔtΓt2d(λ,λρ2vλT1λWWTd1λvλvλTUvλ)2Δtμ12Γt2dλρ2vλTWWTdU1vλ=1μ12ΔtΓt2dλρ21λvλTWWTdvλ.\begin{align} \mathcal{L}_\mathrm{train}& = 1+\frac{\mu_1^2\Delta_t}{\Gamma_t^2 d}(\sum_{\lambda, \lambda'\in\rho_2} \bm{\mathrm{v}}_{\lambda'}^T\frac{1}{\lambda'}\frac{\bm{\mathrm{W}} \bm{\mathrm{W}}^T}{d}\frac{1}{\lambda} \bm{\mathrm{v}}_{\lambda} \bm{\mathrm{v}}_{\lambda}^T \bm{\mathrm{U}} \bm{\mathrm{v}}_{\lambda'})-\frac{2\Delta_t\mu_1^2}{\Gamma_t^2d}\sum_{\lambda\in\rho_2} \bm{\mathrm{v}}_\lambda^T\frac{\bm{\mathrm{W}} \bm{\mathrm{W}}^T}{d} \bm{\mathrm{U}}^{-1} \bm{\mathrm{v}}_\lambda\\ &=1-\frac{\mu_1^2\Delta_t}{\Gamma_t^2d}\sum_{\lambda\in\rho_2}\frac{1}{\lambda} \bm{\mathrm{v}}_\lambda^T\frac{\bm{\mathrm{W}} \bm{\mathrm{W}}^T}{d} \bm{\mathrm{v}}_\lambda. \end{align}
There are $d$ terms in the sum, and the eigenvalues of $\bm{\mathrm{U}}$ and of $\frac{\bm{\mathrm{W}} \bm{\mathrm{W}}^T}{d}$ are both of order $\mathcal{O}(\psi_p)$, hence the sum divided by $d$ is a positive $\mathcal{O}(1)$ quantity. Thus, in this training-time regime, $1\ll\tau\ll \psi_n$, we obtain:
LtrainLtest=1O(Δt).\begin{align} \mathcal{L}_\mathrm{train}\sim\mathcal{L}_\mathrm{test}=1-\mathcal{O}(\Delta_t). \end{align}

D. Numerical experiments for Random Features

Details on the numerical experiments. All the numerical experiments for the RFNN were conducted using $\sigma=\tanh$ and $\sigma_{\bm{\mathrm{x}}}=1$ unless specified otherwise. At each step, the gradient of the loss was computed using the full batch of data points. The train loss was estimated by adding noise to each data point $N=100$ times. The test loss was computed by drawing $n$ new points from the data distribution and noising each one $N$ times. The error on the score was evaluated by drawing 10,000 points from the noisy distribution $P_t=\mathcal{N}(0, \Gamma_t^2 \bm{I}_d)$.
Effect of tt.

Figure 11: Generalization loss for different diffusion times $t$. Generalization loss $\mathcal{L}_{\mathrm{gen}}$ against (Left) training time $\tau$ and (Right) rescaled training time $\tau/\tau_{\mathrm{gen}}$ for $\psi_p=32$, $d=100$, and different $\psi_n$ and $t$.

We present plots for different diffusion times $t$ in Figure 11 and show that rescaling the training time $\tau$ by $\tau_{\mathrm{mem}}=\psi_p/(\Delta_t\lambda_\mathrm{min})$ also makes the loss curves collapse. Of particular interest is the behavior of $\tau_{\mathrm{mem}}$, and more specifically of the ratio $\tau_{\mathrm{mem}} / \tau_{\mathrm{gen}}$, at small $t$. Recall that
λmin=st2+vt2(1ψpψn)2.\lambda_{\min} = s_t^2 + v_t^2 \left(1 - \sqrt{\frac{\psi_p}{\psi_n}}\right)^2.
In the overparameterized regime pnp \gg n, this ratio is independent of tt since vt2μ2v_t^2\sim\mu_*^2 and st2ts_t^2\sim{t}. However, when pnp \sim n, a nontrivial scaling emerges: since λminst2t\lambda_{\min} \sim s_t^2 \sim t, it follows that
τmemτgen1t,\frac{\tau_{\mathrm{mem}}}{\tau_{\mathrm{gen}}} \sim \frac{1}{t},
implying that the two timescales become increasingly separated. It is unclear whether this behavior is related to specific properties of the learned score function or to the approach of the interpolation threshold. We leave this question for future investigation.
Experiments with σx21\sigma_{\bm{\mathrm{x}}}^2\neq1. In Figure 12, we present train and test loss curves for σx1\sigma_{\bm{\mathrm{x}}}\neq1. We see that our prediction of the timescale of memorization computed in the MT holds for general data variance.

Figure 12: Different $\sigma_{\bm{\mathrm{x}}}^2$. Train loss (solid line) and test loss (dotted line) for $\psi_p=64$, $t=0.1$, $d=100$, different $\psi_n$, and $\sigma_{\bm{\mathrm{x}}}=2$ (top) and $\sigma_{\bm{\mathrm{x}}}=0.5$ (bottom), against the training time $\tau$ and the rescaled training time $\tau/\tau_{\mathrm{mem}}$.

Scaling of Escore\mathcal{E}_{\mathrm{score}} with nn. In the RF model, the error with respect to the true score, as defined in the main text,
\begin{align} \mathcal{E}_{\mathrm{Score}} = \frac{1}{d} \mathbb{E}_{\bm{\mathrm{y}} \sim \mathcal{N}(0, \Gamma_t^2 \bm{I}_d)} \left[\left\lVert \bm{\mathrm{s}}_{\bm{\mathrm{A}}(\tau)}(\bm{\mathrm{y}}) + \frac{\bm{\mathrm{y}}}{\Gamma_t^2} \right\rVert^2 \right], \end{align}
serves as a measure of the generalization capability of the generative process. As shown in [45], the Kullback–Leibler divergence between the true data distribution PxP_{\bm{\mathrm{x}}} and the generated distribution P^\hat{P} can be upper bounded
DKL(PxP^)d2dtEScore(At),\begin{align} \mathcal{D}_{\mathrm{KL}}(P_{\bm{\mathrm{x}}} \, \|\, \hat{P}) \leq \frac{d}{2} \int \mathrm{d} t\, \mathcal{E}_{\mathrm{Score}}(\bm{\mathrm{A}}_t), \end{align}
where the integral is taken over the estimates of the parameter matrix $\bm{\mathrm{A}}$ at all diffusion times $t$. This bound assumes that the reverse dynamics are integrated exactly, starting from infinite time. In practical settings, however, one typically relies on an approximate scheme and initiates the reverse process at a large but finite time $T$. A generalization of this bound under such conditions can be found in [46]. We have numerically investigated the behaviour of $\mathcal{E}_{\mathrm{score}}$ in Figure 13. On the fast timescale $\tau_{\mathrm{gen}}$, it decreases down to a minimal value $\mathcal{E}_{\mathrm{score}}^*$ that depends only on $\psi_n$, following a power law $\psi_n^{-\eta}$ with $\eta \simeq 0.59$. We leave an accurate numerical estimate of $\eta$, and the development of a theory for it, to future work.
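For completeness, the Monte Carlo estimator of $\mathcal{E}_{\mathrm{Score}}$ described above can be sketched as follows (our illustration; `score_fn` stands for whatever trained score model $\bm{\mathrm{s}}_{\bm{\mathrm{A}}(\tau)}$ is being evaluated):

```python
import numpy as np

def score_error(score_fn, Gamma_t, d, n_samples=10_000, seed=0):
    """Monte Carlo estimate of E_Score = (1/d) E_y || score_fn(y) + y / Gamma_t^2 ||^2
    with y ~ N(0, Gamma_t^2 I_d), following the protocol described in this appendix."""
    rng = np.random.default_rng(seed)
    y = Gamma_t * rng.standard_normal((n_samples, d))   # samples from the noisy distribution P_t
    residual = score_fn(y) + y / Gamma_t**2             # deviation from the exact Gaussian score -y / Gamma_t^2
    return np.mean(np.sum(residual**2, axis=1)) / d
```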

Figure 13: Effect of ψn\psi_n on EScore\mathcal{E}_{\mathrm{Score}}^*. (Left) Error between the learned score and the true score EScore\mathcal{E}_{\mathrm{Score}} for ψp=32\psi_p = 32, t=0.01t = 0.01, and various values of ψn\psi_n. (Right) Minimum score error EScore=minτ[EScore(τ)]\mathcal{E}_{\mathrm{Score}}^* = \underset{\tau}{\min}[\mathcal{E}_{\mathrm{Score}}(\tau)] as a function of ψn\psi_n, showing a power-law decay with exponent approximately 0.59-0.59. The error bars correspond to thrice the standard deviation over 10 runs with new initial conditions.

Spectrum of U. In Figure 14, we compare the solutions of the equations of Theorem 1 to the histogram of finite size realizations of U\bm{\mathrm{U}}.

Figure 14: Spectrum of U\bm{\mathrm{U}}. Solutions of the equations in Theorem 3.1. (solid lines) and empirical spectrum for ρΣ(λ)=δ(λ1)\rho_{\boldsymbol{\Sigma}}(\lambda)=\delta(\lambda-1) and d=100d=100 (histogram). (Left) ψp=64\psi_p=64, ψn=8\psi_n=8, t=0.01t=0.01. (Right) ψp=64\psi_p=64, ψn=32\psi_n=32, t=0.01t=0.01.

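A finite-size histogram like those of Figure 14 can be obtained by diagonalizing one realization of the Gaussian-equivalent matrix $\bm{\mathrm{U}}$ written in the proof of Lemma 8; a minimal sketch (ours, with placeholder time-dependent coefficients and a reduced dimension for speed):

```python
import numpy as np

rng = np.random.default_rng(1)
d, psi_p, psi_n, t = 50, 64, 8, 0.01          # Figure 14 uses d = 100; d = 50 keeps this fast
at, bt, vt, st = 1.0, 0.1, 1.0, 0.1           # placeholder time-dependent coefficients
p, n = psi_p * d, psi_n * d

# One realization of the Gaussian-equivalent U with Sigma = I_d (cf. the proof of Lemma 8)
W = rng.standard_normal((p, d))
X = rng.standard_normal((d, n))
Om = rng.standard_normal((p, n))
WX = W @ X
U = (np.exp(-2 * t) * at**2 * WX @ WX.T / (n * d) + bt**2 * W @ W.T / d
     + vt**2 * Om @ Om.T / n
     + np.exp(-t) * at * vt / (n * np.sqrt(d)) * (WX @ Om.T + Om @ WX.T)
     + st**2 * np.eye(p))

eigs = np.linalg.eigvalsh(U)                             # histogram these against the theory curves
delta_weight = np.mean(np.isclose(eigs, st**2, atol=1e-6))
# delta_weight should be close to 1 - (1 + psi_n) / psi_p, i.e. 55/64 here
```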
Effect of Adam optimization. Numerical experiments with RFNN on Gaussian data show that the linear scaling of the memorization time with nn holds also for the Adam optimizer as shown in Figure 15.

Figure 15: Adam. Train loss (solid line) and test loss (dotted line) at $t=0.01$, $d=100$, $\psi_p=64$ for several $\psi_n$ with the PyTorch [50] implementation of Adam. The inset shows the effect of a rescaling of the training time by $n$.


References

[1] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. In Bach, F. and Blei, D., editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 2256–2265, Lille, France. PMLR.
[2] Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion probabilistic models.
[3] Song, Y. and Ermon, S. (2019). Generative modeling by estimating gradients of the data distribution. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
[4] Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. (2021b). Score-based generative modeling through stochastic differential equations.
[5] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. (2021). High-resolution image synthesis with latent diffusion models.
[6] Zhang, C., Zhang, C., Zheng, S., Zhang, M., Qamar, M., Bae, S.-H., and Kweon, I. S. (2023). A survey on audio diffusion models: Text to speech synthesis and enhancement in generative ai.
[7] Liu, Y., Zhang, K., Li, Y., Yan, Z., Gao, C., Chen, R., Yuan, Z., Huang, Y., Sun, H., Gao, J., He, L., and Sun, L. (2024). Sora: A review on background, technology, limitations, and opportunities of large vision models.
[8] Li, T., Biferale, L., Bonaccorso, F., and et al. (2024b). Synthetic lagrangian turbulence by generative diffusion models. Nat Mach Intell, 6:393–403.
[9] Price, I., Sanchez-Gonzalez, A., Alet, F., and et al. (2025). Probabilistic weather forecasting with machine learning. Nature, 637:84–90.
[10] Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., and Le, M. (2023). Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations.
[11] Li, S., Chen, S., and Li, Q. (2024a). A good score does not lead to a good generative model.
[12] Biroli, G., Bonnaire, T., de Bortoli, V., and Mézard, M. (2024). Dynamical regimes of diffusion models. Nature Communications, 15(9957). Open access.
[13] Kadkhodaie, Z., Guth, F., Simoncelli, E. P., and Mallat, S. (2024). Generalization in diffusion models arises from geometry-adaptive harmonic representations. In The Twelfth International Conference on Learning Representations.
[14] Kamb, M. and Ganguli, S. (2024). An analytic theory of creativity in convolutional diffusion models.
[15] Shah, K., Kalavasis, A., Klivans, A. R., and Daras, G. (2025). Does generation require memorization? creative diffusion models using ambient diffusion.
[16] Wu, Y.-H., Marion, P., Biau, G., and Boyer, C. (2025). Taking a big step: Large learning rates in denoising score matching prevent memorization.
[17] George, A. J., Veiga, R., and Macris, N. (2025). Denoising score matching with random features: Insights on diffusion models from precise learning curves.
[18] Rahaman, N., Baratin, A., Arpit, D., Draxler, F., Lin, M., Hamprecht, F., Bengio, Y., and Courville, A. (2019). On the spectral bias of neural networks. In International conference on machine learning, pages 5301–5310. PMLR.
[19] Gu, X., Du, C., Pang, T., Li, C., Lin, M., and Wang, Y. (2023). On memorization in diffusion models.
[20] Yoon, T., Choi, J. Y., Kwon, S., and Ryu, E. K. (2023). Diffusion probabilistic models generalize when they fail to memorize. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling.
[21] Ronneberger, O., Fischer, P., and Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In Navab, N., Hornegger, J., Wells, W. M., and Frangi, A. F., editors, Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241, Cham. Springer International Publishing.
[22] Liu, Z., Luo, P., Wang, X., and Tang, X. (2015). Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV).
[23] Carlini, N., Hayes, J., Nasr, M., Jagielski, M., Sehwag, V., Tramèr, F., Balle, B., Ippolito, D., and Wallace, E. (2023). Extracting training data from diffusion models. In Proceedings of the 32nd USENIX Conference on Security Symposium, SEC '23, USA. USENIX Association.
[24] Somepalli, G., Singla, V., Goldblum, M., Geiping, J., and Goldstein, T. (2023a). Diffusion art or digital forgery? investigating data replication in diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[25] Somepalli, G., Singla, V., Goldblum, M., Geiping, J., and Goldstein, T. (2023b). Understanding and mitigating copying in diffusion models. Advances in Neural Information Processing Systems, 36:47783–47803.
[26] Achilli, B., Ventura, E., Silvestri, G., Pham, B., Raya, G., Krotov, D., Lucibello, C., and Ambrogioni, L. (2024). Losing dimensions: Geometric memorization in generative diffusion.
[27] Ventura, E., Achilli, B., Silvestri, G., Lucibello, C., and Ambrogioni, L. (2025). Manifolds, random matrices and spectral gaps: The geometric phases of generative diffusion.
[28] Cui, H., Krzakala, F., Vanden-Eijnden, E., and Zdeborova, L. (2024). Analysis of learning a flow-based generative model from limited sample complexity. In The Twelfth International Conference on Learning Representations.
[29] Cui, H., Pehlevan, C., and Lu, Y. M. (2025). A precise asymptotic analysis of learning diffusion models: theory and insights.
[30] Wang, P., Zhang, H., Zhang, Z., Chen, S., Ma, Y., and Qu, Q. (2024). Diffusion models learn low-dimensional distributions via subspace clustering.
[31] Rahimi, A. and Recht, B. (2007). Random features for large-scale kernel machines. In Platt, J., Koller, D., Singer, Y., and Roweis, S., editors, Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc.
[32] Li, P., Li, Z., Zhang, H., and Bian, J. (2025). On the generalization properties of diffusion models.
[33] Xu, Z.-Q. J., Zhang, Y., Luo, T., Xiao, Y., and Ma, Z. (2020). Frequency principle: Fourier analysis sheds light on deep neural networks. Communications in Computational Physics, 28(5):1746–1767.
[34] Wang, B. (2025). An analytical theory of power law spectral bias in the learning dynamics of diffusion models.
[35] Hyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(24):695–709.
[36] Vincent, P. (2011). A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674.
[37] Robbins, H. and Monro, S. (1951). A stochastic approximation method. The annals of mathematical statistics, pages 400–407.
[38] Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In Bengio, Y. and LeCun, Y., editors, ICLR (Poster).
[39] Karras, T., Aittala, M., Aila, T., and Laine, S. (2022). Elucidating the design space of diffusion-based generative models.
[40] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. u., and Polosukhin, I. (2017). Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
[41] Song, J., Meng, C., and Ermon, S. (2022). Denoising diffusion implicit models.
[42] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. (2017). Gans trained by a two time-scale update rule converge to a local nash equilibrium.
[43] Mei, S., Misiakiewicz, T., and Montanari, A. (2019). Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. In Beygelzimer, A. and Hsu, D., editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 2388–2464. PMLR.
[44] D'Ascoli, S., Refinetti, M., Biroli, G., and Krzakala, F. (2020). Double trouble in double descent: Bias and variance(s) in the lazy regime. In III, H. D. and Singh, A., editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 2280–2290. PMLR.
[45] Song, Y., Durkan, C., Murray, I., and Ermon, S. (2021a). Maximum likelihood training of score-based diffusion models.
[46] Bortoli, V. D. (2022). Convergence of denoising diffusion models under the manifold hypothesis. Transactions on Machine Learning Research. Expert Certification.
[47] Péché, S. (2019). A note on the pennington-worah distribution. Electronic Communications in Probability, 24:1–7.
[48] Goldt, S., Loureiro, B., Reeves, G., Krzakala, F., Mézard, M., and Zdeborová, L. (2021). The gaussian equivalence of generative models for learning with shallow neural networks.
[49] Hu, H. and Lu, Y. M. (2023). Universality laws for high-dimensional learning with random features. IEEE Transactions on Information Theory, 69(3):1932–1964.
[50] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. (2019). Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, volume 32, pages 8024–8035. Curran Associates, Inc.
[51] Gerace, F., Loureiro, B., Krzakala, F., Mezard, M., and Zdeborova, L. (2020). Generalisation error in learning with random features and the hidden manifold model. In III, H. D. and Singh, A., editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3452–3462. PMLR.
[52] Mei, S. and Montanari, A. (2020). The generalization error of random features regression: Precise asymptotics and double descent curve.
[53] Bodin, A. P. M. (2024). Random Matrix Methods for High-Dimensional Machine Learning Models. PhD thesis, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland.
[54] Mézard, M., Parisi, G., and Virasoro, M. A. (1987). Spin Glass Theory and Beyond: An Introduction to the Replica Method and Its Applications, volume 9 of Lecture Notes in Physics. World Scientific Publishing Company, Singapore.
[55] Albergo, M. S., Boffi, N. M., and Vanden-Eijnden, E. (2023). Stochastic interpolants: A unifying framework for flows and diffusions.
[56] Ho, J. and Salimans, T. (2022). Classifier-free diffusion guidance.
[57] Adomaityte, U., Defilippis, L., Loureiro, B., and Sicuro, G. (2024). High-dimensional robust regression under heavy-tailed data: asymptotics and universality. Journal of Statistical Mechanics: Theory and Experiment, 2024(11):114002.
[58] Favero, A., Sclocchi, A., and Wyart, M. (2025). Bigger isn't always memorizing: Early stopping overparameterized diffusion models.
[59] Wen, Y., Liu, Y., Chen, C., and Lyu, L. (2024). Detecting, explaining, and mitigating memorization in diffusion models.
[60] Chen, C., Liu, D., and Xu, C. (2024). Towards memorization-free diffusion models.
[61] Kibble, W. F. (1945). An extension of a theorem of Mehler's on Hermite polynomials. Mathematical Proceedings of the Cambridge Philosophical Society, 41(1):12–15.
[62] Bach, F. (2023). Polynomial magic iii: Hermite polynomials. https://francisbach.com/hermite-polynomials/ Accessed: 2025-10-09.
[63] Potters, M. and Bouchaud, J.-P. (2020). A First Course in Random Matrix Theory: for Physicists, Engineers and Data Scientists. Cambridge University Press.
[64] Silverstein, J. and Bai, Z. (1995). On the empirical distribution of eigenvalues of a class of large dimensional random matrices. Journal of Multivariate Analysis, 54(2):175–192.
[65] Bai, Z. and Zhou, W. (2008). Large sample covariance matrices without independence structures in columns. Statistica Sinica, 18(2):425–442.