77 Synthetic Data Generation

Synthetic data is information that is produced by an algorithm rather than collected from a real observation, yet is intended to carry the statistical signal of the real thing. The motivation is practical. Real data is expensive to label, encumbered by privacy law, imbalanced across the classes we care about, and frequently unavailable for the rare events that matter most. Synthetic data promises an escape from these constraints, and over the past decade it has moved from a niche augmentation trick to a central pillar of how modern systems are trained, evaluated, and stress tested. This chapter develops the subject rigorously and practically. It covers simulation, the major generative model families, the rise of large language models as data engines, privacy-preserving synthesis, and the failure mode that haunts the whole enterprise, model collapse.

Definition: synthetic data

Let $p_{\text{data}}$ be the true distribution that real observations are drawn from. A synthetic data generator is a sampling procedure $g$, possibly stochastic and possibly conditioned on real data $D$, whose outputs $\tilde{x} = g(\cdot)$ follow an induced distribution $p_g$. The generator is useful to the extent that $p_g$ is close to $p_{\text{data}}$ on the statistics a downstream task depends on, while $g$ relaxes a constraint that makes sampling from $p_{\text{data}}$ directly impractical, such as labeling cost, privacy exposure, or the rarity of an event of interest. Closeness is task relative: a generator can be excellent for training a fraud classifier and useless for estimating a tail quantile, because the two tasks weight $p_{\text{data}}$ differently.

The families covered below differ in how they obtain $p_g$. Simulators encode it by hand from domain knowledge. Generative models learn it from samples of $p_{\text{data}}$. Large language models reuse a distribution already learned during pretraining and steer it with prompts. The diagram below organizes the landscape.

flowchart TD
  A["Synthetic data generation"] --> B["Simulation, knowledge encoded by hand"]
  A --> C["Learned generative models, density fit from data"]
  A --> D["LLM generation, pretrained distribution steered by prompts"]
  B --> B1["Mechanistic and physics simulators"]
  B --> B2["Domain randomization"]
  C --> C1["VAE, latent variable, stable, blurry"]
  C --> C2["GAN, adversarial, sharp, unstable"]
  C --> C3["Diffusion, denoising, high fidelity, slow"]
  D --> D1["Self instruct"]
  D --> D2["Distillation"]
  D --> D3["Augmentation and rephrasing"]

77.1 1. Why Synthesize Data

77.1.1 1.1 The Demand Curve for Data

The scaling behavior of deep learning ties model quality to dataset size. Empirically, test loss for a model with $N$ parameters trained on $D$ tokens follows an approximate power law, $L(N, D) \approx L_\infty + a N^{-\alpha} + b D^{-\beta}$, so reductions in loss demand large multiplicative increases in $D$. The power law form is the crux of the problem. Because the data term decays as $D^{-\beta}$ with $\beta$ well below one, every halving of that term costs the same multiplicative increase in tokens. Halving the data driven excess loss $b D^{-\beta}$ does not require twice the data; it requires a factor of $2^{1/\beta}$, which for $\beta \approx 0.1$, the order of magnitude reported by Kaplan et al., is $2^{10} = 1024$, roughly one thousand. Human generated data is finite. High quality text on the public web is a bounded resource, and credible projections suggest the stock of useful tokens will be largely exhausted within a few years of continued scaling. Synthetic data is one of the few levers that can extend the curve, alongside multiple epochs over the same data and multimodal sources.
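The arithmetic behind that factor is easy to check. The sketch below assumes the data term has exactly the form $b D^{-\beta}$; the function name `data_multiplier` and the sample exponents are illustrative choices, not values from any particular study.

```python
def data_multiplier(loss_ratio: float, beta: float) -> float:
    """Factor k by which D must grow so that b * D**(-beta) shrinks by loss_ratio.

    Solving b * (k * D)**(-beta) = b * D**(-beta) / loss_ratio gives
    k = loss_ratio ** (1 / beta).
    """
    return loss_ratio ** (1.0 / beta)


# Illustrative exponents only: smaller beta makes each halving far more expensive.
for beta in (0.1, 0.3, 0.5):
    k = data_multiplier(2.0, beta)
    print(f"beta={beta}: halving the data term needs about {k:,.1f}x more tokens")
```

The multiplier depends only on the ratio and on $\beta$, not on the current dataset size, which is what makes the cost of continued scaling geometric.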

77.1.2 1.2 Beyond Volume

Volume is not the only reason to synthesize. Four others recur in practice.

Privacy. Health records, financial transactions, and location traces cannot be shared freely. A synthetic surrogate that preserves utility while breaking the link to real individuals lets teams collaborate and publish.
Class imbalance and rare events. Fraud, equipment failure, and adverse drug reactions are scarce by definition. Generating plausible minority examples can rebalance a training set.
Coverage and edge cases. Autonomous systems must handle situations that are dangerous or unethical to collect in the wild, such as a child running into a road.
Controllability. Synthetic pipelines expose knobs. We can dial the lighting, the dialect, the failure rate, or the label distribution, which makes systematic evaluation possible.

77.2 2. Simulation Based Generation

77.2.1 2.1 Mechanistic and Physics Based Simulators

The oldest form of synthetic data comes from simulators that encode domain knowledge directly. Game engines and ray tracers render labeled images for perception. Physics solvers produce sensor readings for robotics. Agent based models generate population level behavior in epidemiology and economics. The defining feature is that labels are free. Because the simulator knows the ground truth pose, depth, or segmentation mask, annotation cost collapses to zero.

The defining problem is the reality gap, the distribution shift between the simulator distribution $p_{\text{sim}}$ and the real distribution $p_{\text{real}}$. A model $f$ trained to minimize risk under $p_{\text{sim}}$ minimizes $R_{\text{sim}}(f) = \mathbb{E}_{p_{\text{sim}}}[\ell(f(x), y)]$, but it is deployed against $R_{\text{real}}(f) = \mathbb{E}_{p_{\text{real}}}[\ell(f(x), y)]$. The gap between the two risks is controlled by how far apart the distributions are. Domain adaptation bounds have the schematic form $R_{\text{real}}(f) \le R_{\text{sim}}(f) + d(p_{\text{sim}}, p_{\text{real}})$, where $d$ is a discrepancy that depends on the model class; the full statements add a term for how well a single model can fit both domains at once, omitted here. A model trained only on synthetic renders often fails on real photographs because textures, noise, and lighting differ in ways the simulator did not capture, which inflates $d$ even when $R_{\text{sim}}$ is near zero. The two levers for shrinking the gap are making the simulator more realistic and making the model invariant to the differences. Domain randomization takes the second path.

77.2.2 2.2 Domain Randomization

Domain randomization closes the gap by making the simulator deliberately diverse. Rather than trying to render one photorealistic world, we randomize textures, colors, camera positions, and lighting across a wide range. The intuition is that if the real world looks like just one more random variation, a model trained across the randomized ensemble will treat reality as in distribution. Formally, if $p_{\text{sim}}(\theta)$ is a distribution over simulator parameters $\theta$, we train on the marginal $\mathbb{E}_{\theta \sim p_{\text{sim}}}[\,\mathcal{L}(f, \mathcal{D}_\theta)\,]$ and hope the support is broad enough to contain the real distribution. The randomization works to the extent that real observations fall inside the support of the randomized training mixture, so that no real observation is far out of distribution; this is a heuristic criterion rather than a guarantee. This technique underpinned early sim to real transfer for robotic manipulation and remains a strong baseline. Its cost is sample efficiency: a wider randomization range forces the model to fit a harder, more varied problem, so it needs more capacity and more data to reach a given accuracy. The practical art is to randomize the nuisance factors that the simulator gets wrong, such as texture and lighting, while keeping the task relevant structure, such as object geometry, faithful.
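Drawing $\theta \sim p_{\text{sim}}(\theta)$ can be sketched in a few lines. Everything named here is hypothetical: the parameter names, their ranges, and the `mug_v1` mesh stand in for whatever a real simulator exposes. The point is only the split between randomized nuisance factors and fixed task structure.

```python
import random

# Hypothetical nuisance parameters and ranges; a real simulator defines its own.
RANGES = {
    "light_intensity": (0.2, 2.0),
    "texture_id": (0, 499),          # integer index into an assumed texture bank
    "camera_yaw_deg": (-30.0, 30.0),
}


def sample_scene_params(rng: random.Random) -> dict:
    """Draw one theta ~ p_sim(theta): nuisance factors vary, geometry does not."""
    tex_lo, tex_hi = RANGES["texture_id"]
    return {
        "light_intensity": rng.uniform(*RANGES["light_intensity"]),
        "texture_id": rng.randint(tex_lo, tex_hi),
        "camera_yaw_deg": rng.uniform(*RANGES["camera_yaw_deg"]),
        "object_mesh": "mug_v1",     # task relevant structure is kept faithful
    }


rng = random.Random(0)
thetas = [sample_scene_params(rng) for _ in range(1000)]
```

Each sampled dictionary would be handed to the renderer to produce one labeled scene. Widening a range in `RANGES` broadens the support of the training mixture at the cost of a harder learning problem.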

77.3 3. Generative Models for Data

When no mechanistic simulator exists, we learn the data distribution from samples. The goal of a generative model is to approximate an unknown density $p_{\text{data}}(x)$ with a model $p_\theta(x)$ that we can sample from. Three families dominate.

77.3.1 3.1 Variational Autoencoders

A variational autoencoder (VAE) couples an encoder $q_\phi(z \mid x)$ that maps data to a latent code with a decoder $p_\theta(x \mid z)$ that reconstructs it. Training maximizes the evidence lower bound,

\[\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right),\]

which trades reconstruction fidelity against keeping the posterior close to a simple prior $p(z)$, usually a standard normal. To generate, we sample $z \sim p(z)$ and decode. VAEs are stable to train and give a smooth, interpretable latent space, but the Gaussian assumptions and the averaging behavior of the reconstruction term tend to produce blurry samples. They remain valuable for tabular data, anomaly detection, and as components inside larger systems.
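When the encoder outputs a diagonal Gaussian and the prior is a standard normal, the KL term of the ELBO has a closed form, $\tfrac{1}{2}\sum_j (\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2)$. The sketch below assumes that common parameterization, with the encoder emitting a mean and a log variance per latent dimension. The function names and the example values are illustrative.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), the reparameterization trick."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


# Example encoder output for a two dimensional latent code (made-up numbers).
mu, log_var = [0.5, -1.0], [0.0, math.log(0.25)]
kl = kl_to_standard_normal(mu, log_var)
z = reparameterize(mu, log_var, random.Random(0))
```

The KL term is zero only when the posterior equals the prior, so it pulls every encoded point toward the region the decoder will see at generation time.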

77.3.2 3.2 Generative Adversarial Networks

A generative adversarial network (GAN) pits a generator $G$ against a discriminator $D$ in a minimax game,

\[\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))].\]

The discriminator learns to separate real from fake, and the generator learns to fool it. For a fixed generator the optimal discriminator is $D^*(x) = p_{\text{data}}(x) / (p_{\text{data}}(x) + p_g(x))$, and substituting it back reduces the generator objective to $2\,\mathrm{JSD}(p_{\text{data}} \,\|\, p_g) - \log 4$, where $\mathrm{JSD}$ is the Jensen Shannon divergence. At the theoretical optimum that divergence is zero, so the generator distribution matches the data distribution and the discriminator is reduced to outputting $\tfrac{1}{2}$ everywhere, maximally confused. GANs produce sharp, high fidelity samples and dominated image synthesis for years. They are also notoriously hard to train. The two failure modes to know are non convergence, where the adversarial dynamics oscillate rather than settle, and mode collapse, where the generator maps many inputs to a few outputs and ignores large regions of the data distribution. Wasserstein losses, gradient penalties, and spectral normalization were developed to stabilize training. For tabular synthesis, conditional variants such as CTGAN handle mixed discrete and continuous columns and skewed marginals.
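The link between the optimal discriminator and the Jensen Shannon divergence can be verified numerically on a finite sample space, where expectations become plain sums. The two distributions below are made-up examples; the identity being checked is $V(D^*, G) = 2\,\mathrm{JSD}(p_{\text{data}} \,\|\, p_g) - \log 4$.

```python
import math


def gan_value(p_data, p_g, d):
    """V(D, G) = E_data[log D(x)] + E_g[log(1 - D(x))] on a finite sample space."""
    return sum(pd * math.log(dx) + pg * math.log(1.0 - dx)
               for pd, pg, dx in zip(p_data, p_g, d))


def kl(p, q):
    """Kullback-Leibler divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen Shannon divergence: average KL to the midpoint distribution."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


p_data = [0.5, 0.3, 0.2]   # illustrative "real" distribution
p_g = [0.2, 0.3, 0.5]      # illustrative generator distribution
d_star = [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]
v_star = gan_value(p_data, p_g, d_star)
```

Here `v_star` agrees with `2 * jsd(p_data, p_g) - math.log(4)` to floating point precision, and any other discriminator, such as one that outputs one half everywhere, attains a lower value against this generator.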

77.3.3 3.3 Diffusion Models

Diffusion models have become the state of the art for high fidelity generation. The idea is to define a forward process that gradually corrupts data with Gaussian noise across $T$ steps,

\[q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t \mathbf{I}\right),\]

until the signal is pure noise. A useful property is that the forward process has a closed form marginal that lets us jump to any step in one shot. Writing $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$,

\[q(x_t \mid x_0) = \mathcal{N}\!\left(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t)\mathbf{I}\right),\]

so a noised sample is $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, \mathbf{I})$. The model then learns the reverse process that denoises step by step. A neural network $\epsilon_\theta(x_t, t)$ is trained to predict the noise added at each step, minimizing a simple regression objective,

\[\mathcal{L} = \mathbb{E}_{x_0, \epsilon, t}\left[\,\|\epsilon - \epsilon_\theta(x_t, t)\|^2\,\right].\]

The training input $x_t$ is drawn directly from the closed form marginal above, and the regression target is the very noise $\epsilon$ used to construct it. This is why diffusion training is so stable: there is no adversary, just a regression toward known noise. Sampling starts from noise and iterates the learned reverse steps. Conditioning, such as a text prompt, enters through the network input, and classifier free guidance sharpens adherence to the condition by extrapolating from the unconditional prediction past the conditional one at sampling time. Diffusion models avoid the adversarial instability of GANs and cover modes far better, at the cost of slow, multi step sampling, although distillation and few step solvers have narrowed that gap. They power most current text to image systems and have extended to audio, video, molecules, and tabular data. The following sketch shows the training loop at a high level.

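# Diffusion training step (PyTorch sketch; the toy model, data, and schedule are assumptions)
import torch
T = 1000
alpha_bar = torch.cumprod(1 - torch.linspace(1e-4, 0.02, T), dim=0)  # assumed linear beta schedule
model = torch.nn.Linear(3, 2)            # toy stand-in for eps_theta(x_t, t)
x0 = torch.randn(64, 2)                  # stand-in for sample_batch(data)
t = torch.randint(0, T, (64,))           # random diffusion step per example, 0-indexed
noise = torch.randn_like(x0)
ab = alpha_bar[t].unsqueeze(1)           # shape (64, 1), broadcasts over features
xt = ab.sqrt() * x0 + (1 - ab).sqrt() * noise
pred = model(torch.cat([xt, t.unsqueeze(1) / T], dim=1))  # condition on the normalized step
loss = torch.nn.functional.mse_loss(pred, noise)          # predict the injected noise
loss.backward()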
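The closed form marginal $q(x_t \mid x_0)$ can be checked by composing the one step transitions directly. Since every step is linear and Gaussian, it suffices to track the coefficient on $x_0$ and the accumulated noise variance. The linear schedule from $10^{-4}$ to $0.02$ over $T = 1000$ steps is an assumed example, not something fixed by the derivation.

```python
import math

T = 1000
# Assumed linear beta schedule; any schedule with 0 < beta_t < 1 works the same way.
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]


def compose_forward(betas):
    """Apply q(x_t | x_{t-1}) step by step, tracking (coefficient on x_0, noise variance)."""
    coef, var, out = 1.0, 0.0, []
    for b in betas:
        coef = math.sqrt(1.0 - b) * coef   # the mean is scaled by sqrt(1 - beta_t)
        var = (1.0 - b) * var + b          # old noise is scaled, fresh noise is added
        out.append((coef, var))
    return out


def closed_form(betas):
    """Use alpha_bar_t = prod_s (1 - beta_s): coefficient sqrt(alpha_bar), variance 1 - alpha_bar."""
    alpha_bar, out = 1.0, []
    for b in betas:
        alpha_bar *= 1.0 - b
        out.append((math.sqrt(alpha_bar), 1.0 - alpha_bar))
    return out


stepwise, direct = compose_forward(betas), closed_form(betas)
max_gap = max(abs(a - c) + abs(b - d) for (a, b), (c, d) in zip(stepwise, direct))
```

The two routes agree up to rounding at every step, which is what allows training to sample $x_t$ in one shot instead of simulating $t$ noising steps. Under this schedule the final noise variance is within a fraction of a percent of one, so $x_T$ is close to pure noise.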

77.3.4 3.4 Choosing a Family

There is no universally best model. VAEs are a sensible default for tabular and low dimensional data where stability and a structured latent space matter. GANs still compete when sample sharpness is paramount and training budget is limited. Diffusion models are the choice when fidelity and mode coverage justify heavier compute. For text and code, the autoregressive transformer, discussed next, dominates outright. The table summarizes the tradeoffs.

Family	Training stability	Sample fidelity	Mode coverage	Sampling cost	Typical use
VAE	High	Lower, often blurry	Good	Cheap, one pass	Tabular, anomaly detection, latent components
GAN	Low, adversarial	High, sharp	Risk of mode collapse	Cheap, one pass	Images on a budget, tabular via CTGAN
Diffusion	High	Highest	Strong	Expensive, multi step	Images, audio, video, molecules
Autoregressive	High	High for sequences	Strong	Sequential, token by token	Text, code, structured sequences

Mature open source implementations exist for all four. Hugging Face Diffusers covers diffusion models and Hugging Face Transformers covers autoregressive models, both on top of PyTorch, while the Synthetic Data Vault provides VAE and GAN based tabular synthesizers, including CTGAN. All are freely available and widely used.

77.4 4. LLM Generated Data

77.4.1 4.1 The Shift to Language Models as Data Engines

Large language models changed synthetic data from a research curiosity into a production workflow. An autoregressive language model factorizes the joint distribution over a token sequence as $p_\theta(x_{1:T}) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t})$, and sampling from this learned distribution, optionally conditioned on a prompt $c$ to give $p_\theta(x_t \mid x_{<t}, c)$, is exactly the act of generating synthetic text. The model has already absorbed a vast distribution during pretraining, so unlike a VAE or GAN there is no per task density to fit. We reuse the pretrained distribution and steer it. A capable model can be prompted to produce instruction and response pairs, question and answer sets, classification examples with rationales, and entire dialogues. The output is fluent, controllable through natural language instructions, and cheap relative to human annotation. Several influential instruction tuned models were trained largely on data generated by a stronger model, and the practice of distilling a teacher model into a smaller student through generated data is now routine.
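The chain rule factorization and ancestral sampling can be illustrated with a toy model. The sketch uses a hypothetical bigram table, which conditions only on the previous token and so is a drastic simplification of $p_\theta(x_t \mid x_{<t})$. The vocabulary and probabilities are invented for the example.

```python
import math
import random

# Hypothetical bigram table p(next | prev); "<s>" starts and "</s>" ends a sequence.
BIGRAM = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "dog": 0.7},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def sample_sequence(rng):
    """Ancestral sampling: draw each token from its conditional given the previous one."""
    tokens, prev = [], "<s>"
    while prev != "</s>":
        choices = BIGRAM[prev]
        prev = rng.choices(list(choices), weights=list(choices.values()))[0]
        tokens.append(prev)
    return tokens[:-1]  # drop the end marker


def log_prob(tokens):
    """Chain rule: log p(x_1..x_T) is the sum of the conditional log probabilities."""
    prev, total = "<s>", 0.0
    for tok in tokens + ["</s>"]:
        total += math.log(BIGRAM[prev][tok])
        prev = tok
    return total


rng = random.Random(0)
samples = [sample_sequence(rng) for _ in range(5)]
```

Conditioning on a prompt corresponds to fixing the first tokens and sampling only the continuation. A real language model replaces the lookup table with a neural network, but the sampling loop has the same shape.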

77.4.2 4.2 Common Patterns

Three patterns appear repeatedly.

Self instruct. A model bootstraps a diverse instruction set from a small seed pool by generating new tasks, filtering for quality, and using the result to fine tune itself or a smaller model.
Distillation. A strong teacher generates inputs and high quality outputs, and a cheaper student is trained to imitate them. This compresses capability into a deployable size.
Augmentation and rephrasing. Existing examples are paraphrased, translated, or perturbed to expand coverage without changing the underlying labels.

77.4.3 4.3 Quality Control and Verification

The central risk of LLM generated data is that fluent text is not necessarily correct text. Hallucinated facts, subtle reasoning errors, and homogeneous phrasing degrade the resulting model. Effective pipelines therefore treat generation as the first stage of a filter. Verification strategies include execution feedback, where generated code or math is checked by running it or a verifier; model based judging, where a separate model scores candidates against a rubric; consistency filtering, where only answers a model reaches by multiple independent paths are retained; and deduplication to limit repetition. The reliable signal comes from grounding generation in a checkable oracle. Synthetic data for code and mathematics has advanced faster than for open ended text precisely because correctness is verifiable there.

generate -> filter for correctness -> deduplicate -> balance -> train
            (execution, judge, consistency, schema checks)
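The filter and deduplicate stages can be sketched for the code generation case, where execution gives a checkable oracle. The candidate strings below stand in for model outputs, and `passes_tests` and `filter_and_dedup` are hypothetical helper names. Calling `exec` on untrusted text is unsafe outside a sandbox, so treat this as an illustration of the control flow only.

```python
def passes_tests(source: str, func_name: str, cases) -> bool:
    """Execution feedback: keep a candidate only if it runs and passes every case."""
    namespace = {}
    try:
        exec(source, namespace)  # NOTE: real pipelines must sandbox untrusted code
        fn = namespace[func_name]
        return all(fn(*args) == expected for args, expected in cases)
    except Exception:
        return False             # syntax errors and runtime errors both count as failures


def filter_and_dedup(candidates, func_name, cases):
    """Filter for correctness, then drop whitespace-normalized exact duplicates."""
    kept, seen = [], set()
    for src in candidates:
        if not passes_tests(src, func_name, cases):
            continue
        key = " ".join(src.split())
        if key not in seen:
            seen.add(key)
            kept.append(src)
    return kept


candidates = [
    "def add(a, b):\n    return a + b",
    "def add(a, b):\n    return a - b",       # fluent but wrong
    "def add(a, b):\n    return a  +  b",      # duplicate up to whitespace
    "def add(a, b) return a + b",              # does not parse
]
cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
kept = filter_and_dedup(candidates, "add", cases)
```

Only the first candidate survives. Production pipelines typically add near duplicate detection and a balancing step after this, and run the execution check in an isolated process with time and memory limits.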

77.5 5. Privacy Preserving Synthesis

77.5.1 5.1 The Goal and Its Subtlety

A common claim is that synthetic data is automatically private because no row corresponds to a real person. This is false. A model trained on sensitive data can memorize and regurgitate individual records, and membership inference attacks can determine whether a specific person was in the training set by probing the model or its outputs. Privacy must be engineered, not assumed.

77.5.2 5.2 Differential Privacy

The rigorous standard is differential privacy (DP). A randomized mechanism $\mathcal{M}$ is $(\varepsilon, \delta)$ differentially private if for all datasets $D$ and $D'$ differing in one record and all output sets $S$,

\[\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon}\, \Pr[\mathcal{M}(D') \in S] + \delta.\]

The parameter $\varepsilon$ bounds how much any single individual can influence the output, and the additive $\delta$ allows a small probability of exceeding that bound. Smaller $\varepsilon$ means stronger privacy. Two structural properties make the definition useful in practice. Post processing means any function applied to a DP output, without further access to the private data, is still DP with the same budget, so once a generator is trained privately every sample it produces is covered for free. Composition means the budgets of repeated accesses accumulate, adding up in the simplest analysis, which is why training, a sequence of many gradient steps, must account for its total spend. The dominant training technique is DP-SGD, which clips per example gradients to a fixed norm $C$, so no single record can dominate, and adds calibrated Gaussian noise of scale proportional to $C$ to the summed gradient before each update. A generative model trained under DP-SGD inherits the guarantee, so its synthetic samples carry the same protection by the post processing property. The cost is a privacy utility tradeoff. Tighter privacy injects more noise and lowers fidelity, and that tension is fundamental rather than a temporary engineering gap. The open source Opacus library implements DP-SGD for PyTorch with built in privacy accounting, which makes the budget explicit rather than guesswork.
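The two mechanical ingredients of DP-SGD, per example clipping and Gaussian noise, fit in a short sketch. It uses plain Python lists for gradients and performs no privacy accounting, so it shows the shape of one update rather than a private training procedure. The function names and the `noise_multiplier` parameter are illustrative, not the Opacus API.

```python
import math
import random


def clip(grad, c):
    """Scale a per example gradient so its L2 norm is at most c."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, c / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, c, noise_multiplier, rng):
    """One update: clip each gradient, sum, add N(0, (noise_multiplier * c)^2) noise, average."""
    n = len(per_example_grads)
    clipped = [clip(g, c) for g in per_example_grads]
    summed = [sum(column) for column in zip(*clipped)]
    noisy = [s + rng.gauss(0.0, noise_multiplier * c) for s in summed]
    return [p - lr * g / n for p, g in zip(params, noisy)]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]   # made-up per example gradients
new_params = dp_sgd_step([0.0, 0.0], grads, lr=0.1, c=1.0, noise_multiplier=1.1, rng=rng)
```

Clipping bounds the sensitivity of the summed gradient to any one record at $C$, which is what lets noise with standard deviation proportional to $C$ yield a guarantee. Converting the noise multiplier, sampling rate, and number of steps into a final $(\varepsilon, \delta)$ is the accountant's job and is omitted here.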

77.5.3 5.3 Measuring Privacy and Utility

Synthetic data should be evaluated on both axes. Utility is measured by fidelity, how well marginal and joint distributions match the real data, and by downstream task performance, often using the train on synthetic, test on real protocol. Privacy is measured by resistance to membership inference and by distance to nearest real records, since a synthetic point that is nearly identical to a training row is a leak regardless of any aggregate metric. A responsible release reports a privacy budget and an attack based audit, not just a fidelity score.

77.6 6. Evaluation

77.6.1 6.1 Fidelity, Diversity, and Utility

Three properties define good synthetic data. Fidelity asks whether samples are individually realistic. Diversity asks whether the samples cover the full range of the real distribution rather than a few modes. Utility asks whether a model trained on the synthetic data performs well on real data. These can conflict. A generator that copies a handful of real examples scores high on fidelity but fails diversity and offers little utility. For images, the Frechet Inception Distance compares feature statistics by modeling the real and generated features as Gaussians with means $\mu_r, \mu_g$ and covariances $\Sigma_r, \Sigma_g$ in the embedding space of a pretrained network, then computing

\[\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{tr}\!\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right).\]

Lower is better, and FID penalizes both a mean shift, through the first term, and a covariance mismatch, through the second, so it is sensitive to fidelity and diversity at once. Its weakness is that a single scalar cannot say which axis failed. Precision and recall style metrics address this by separating the two: precision is the fraction of generated samples that fall within the support of the real data, a fidelity measure, and recall is the fraction of real data covered by the generated support, a diversity measure. The most decisive test remains train on synthetic, test on real, because it measures the only thing that ultimately matters.
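In one dimension the covariance matrices are scalars $\sigma_r^2$ and $\sigma_g^2$, and the trace term collapses to $(\sigma_r - \sigma_g)^2$, so the Fréchet distance becomes $(\mu_r - \mu_g)^2 + (\sigma_r - \sigma_g)^2$. The sketch below computes this special case on made-up numbers. A real FID uses high dimensional features from a pretrained network and a matrix square root.

```python
import statistics


def frechet_distance_1d(real, generated):
    """Frechet distance between Gaussians fit to two 1-D samples (scalar special case of FID)."""
    mu_r, mu_g = statistics.fmean(real), statistics.fmean(generated)
    sd_r, sd_g = statistics.stdev(real), statistics.stdev(generated)
    return (mu_r - mu_g) ** 2 + (sd_r - sd_g) ** 2


real = [0.0, 1.0, 2.0, 3.0, 4.0]
shifted = [x + 3.0 for x in real]      # same spread, wrong location
collapsed = [2.0, 2.0, 2.0, 2.1, 1.9]  # right location, far too little spread

d_same = frechet_distance_1d(real, real)
d_shift = frechet_distance_1d(real, shifted)
d_collapse = frechet_distance_1d(real, collapsed)
```

The shifted sample is penalized only through the mean term and the collapsed sample only through the spread term, yet both produce a positive score. That illustrates why a single FID value cannot say whether fidelity or diversity failed.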

77.6.2 6.2 The Train on Synthetic, Test on Real Protocol

The protocol is simple. Train a downstream model entirely on synthetic data, then evaluate it on a held out set of real data. The gap to a model trained on real data quantifies how much signal the synthetic pipeline preserved. Crucially, the test set must be real and untouched by the generator, otherwise the evaluation inherits the generator’s blind spots.
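A minimal end to end run of the protocol is sketched below on a toy problem. The assumptions are heavy and deliberate: the "real" data are two one dimensional Gaussian classes, the "generator" is a Gaussian fit per class, and the downstream model is a nearest centroid classifier. None of these choices come from the text; they only make the protocol's steps concrete.

```python
import random
import statistics

rng = random.Random(0)


def draw_real(n):
    """Stand-in for real data: class y is drawn from N(4 * y, 1)."""
    return [(rng.gauss(4.0 * y, 1.0), y) for y in (0, 1) for _ in range(n)]


def fit_generator(rows):
    """Toy generator: one Gaussian per class, fit to the real training rows."""
    params = {}
    for y in (0, 1):
        xs = [x for x, label in rows if label == y]
        params[y] = (statistics.fmean(xs), statistics.stdev(xs))
    return params


def sample_synthetic(params, n):
    return [(rng.gauss(*params[y]), y) for y in (0, 1) for _ in range(n)]


def train_centroids(rows):
    return {y: statistics.fmean([x for x, label in rows if label == y]) for y in (0, 1)}


def accuracy(centroids, rows):
    hits = sum(min(centroids, key=lambda y: abs(x - centroids[y])) == label
               for x, label in rows)
    return hits / len(rows)


real_train, real_test = draw_real(500), draw_real(500)  # the test set never touches the generator
synthetic = sample_synthetic(fit_generator(real_train), 500)

acc_tstr = accuracy(train_centroids(synthetic), real_test)   # train on synthetic, test on real
acc_real = accuracy(train_centroids(real_train), real_test)  # baseline: train on real
tstr_gap = acc_real - acc_tstr
```

The quantity of interest is `tstr_gap`. On this easy problem the generator's model family matches the data, so the gap is small. A generator that drops a mode or blurs a class boundary would show up as a larger gap on the same untouched test set.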

77.7 7. Model Collapse and the Pitfalls

77.7.1 7.1 What Model Collapse Is

Model collapse is the degenerative process that occurs when generative models are trained on data produced by earlier generative models, recursively, over generations. Each generation samples from an imperfect approximation of the previous distribution. Errors compound. Tails of the distribution, the rare events and minority modes, are sampled less and less until they vanish, and the model’s outputs drift toward a narrow, over smoothed average. Late stage collapse can converge to near constant output that bears little resemblance to the original data.

77.7.2 7.2 Why It Happens

Three error sources drive collapse. Statistical error arises because we train on finite samples, so rare events are sometimes simply absent. Functional expressivity error arises because models cannot perfectly represent the true distribution. Functional approximation error arises from imperfect optimization and biased estimators. Each generation, sampling from the model rather than reality, re injects and amplifies these errors. The mathematical signature is a steady contraction of variance.

Worked example: variance collapse under recursion

Consider the simplest possible recursive loop. At generation $0$ the truth is $\mathcal{N}(\mu, \sigma^2)$. At each generation we draw $n$ samples from the current model, fit a new Gaussian by maximum likelihood, and discard the old data, keeping only the fresh samples. Let $\sigma_k^2$ denote the variance of the fitted model at generation $k$. The maximum likelihood variance estimate from $n$ samples of a $\mathcal{N}(\mu_k, \sigma_k^2)$ source is, in expectation, the biased estimator $\mathbb{E}[\hat{\sigma}_{k+1}^2 \mid \sigma_k^2] = \tfrac{n-1}{n}\,\sigma_k^2$. Because each generation only sees its predecessor’s samples, the expected variances form a geometric sequence,

\[\mathbb{E}[\sigma_k^2] = \left(\frac{n-1}{n}\right)^k \sigma^2 \xrightarrow[k \to \infty]{} 0.\]

The expected variance decays to zero no matter how large $n$ is, as long as it is finite, because the bias is multiplicative and compounds at every step. With $n = 1000$ samples per generation the expected variance shrinks by a tenth of a percent each round, which sounds negligible, yet after three thousand generations it has fallen to about five percent of its starting value, since $0.999^{3000} \approx e^{-3} \approx 0.05$, and it keeps shrinking until the distribution is a spike. The spike need not sit exactly at $\mu$, because the fitted mean also takes a small random step each generation. The tails go first, then everything. This toy model is not a curiosity; it is the cleanest illustration of why feeding a model its own output without an external anchor is self limiting. The dependable way to stop the decay is to keep injecting samples from the true source, which re-anchors each fit to reality instead of letting the errors compound.
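The recursion in the worked example is short enough to simulate directly. The sketch uses a deliberately small $n$ so that collapse is visible within a few hundred generations. The seed, sample size, and generation count are arbitrary choices.

```python
import random
import statistics


def expected_variance(sigma2, n, k):
    """Closed form from the worked example: ((n - 1) / n) ** k * sigma^2."""
    return ((n - 1) / n) ** k * sigma2


def simulate(n, generations, rng, mu=0.0, sigma=1.0):
    """Refit a Gaussian to n samples of the previous fit, discarding older data each time."""
    history = [sigma ** 2]
    for _ in range(generations):
        samples = [rng.gauss(mu, sigma) for _ in range(n)]
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples, mu)  # maximum likelihood: divides by n, not n - 1
        history.append(sigma ** 2)
    return history


history = simulate(n=10, generations=300, rng=random.Random(0))
print(f"fitted variance after 300 generations: {history[-1]:.3e}")
print(f"expected variance, n=1000, k=3000: {expected_variance(1.0, 1000, 3000):.4f}")
```

A single run is noisier than the expectation, and it typically collapses faster than the geometric formula suggests, because sampling noise compounds alongside the multiplicative bias.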

Real generative models add the expressivity and approximation errors on top of this purely statistical effect, so collapse in practice is faster and less benign.

77.7.3 7.3 Mitigations

Collapse is not inevitable. The decisive mitigation is to retain real data. If each generation continues to train on the original real corpus alongside synthetic data, rather than replacing it, the degeneration is largely arrested. Accumulating data, where synthetic outputs are added to rather than substituted for the real anchor, avoids the worst outcomes in both theory and experiment. Other defenses include provenance tracking to distinguish human from machine generated content, strong verification filters that discard low quality synthetic samples before they enter the training pool, and mixing ratios that cap the synthetic fraction. The practical guidance is direct. Never let a model’s own outputs silently become its entire diet.
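The difference between replacing and accumulating can be seen in the same Gaussian toy setting. In the sketch below both runs start from the same small real sample. One refits on fresh synthetic data only, the other keeps the real anchor and every earlier synthetic batch in the training pool. The sizes, seed, and function names are arbitrary illustrative choices.

```python
import random


def mle_fit(xs):
    """Maximum likelihood Gaussian fit: sample mean and variance with divisor n."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, var


def run(n, generations, accumulate, rng):
    """Return the fitted variance after repeatedly training on model-generated samples."""
    pool = [rng.gauss(0.0, 1.0) for _ in range(n)]   # real anchor drawn from N(0, 1)
    mean, var = mle_fit(pool)
    for _ in range(generations):
        fresh = [rng.gauss(mean, var ** 0.5) for _ in range(n)]
        # accumulate: real data and all earlier batches stay in the pool
        # replace: each generation sees only its predecessor's samples
        pool = pool + fresh if accumulate else fresh
        mean, var = mle_fit(pool)
    return var


replace_var = run(20, 400, accumulate=False, rng=random.Random(0))
accumulate_var = run(20, 400, accumulate=True, rng=random.Random(0))
print(f"replace: {replace_var:.3e}   accumulate: {accumulate_var:.3f}")
```

With replacement the fitted variance falls by many orders of magnitude, while with accumulation it settles at a value of the same order as the real data's variance. The toy run does not show that accumulation recovers the true distribution exactly, only that the real anchor prevents the runaway contraction.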

77.7.4 7.4 The Broader Risk

Model collapse is not only a per project concern. As machine generated text and images fill the public web, future models scraped from that web will increasingly train on their predecessors’ output without anyone intending it. This raises the value of data with verified human provenance and of curated, attested corpora collected before the saturation point. The commons that made large models possible is itself at risk of contamination, which is one reason data provenance has become a first class engineering concern.

77.8 8. Practical Guidance

A workable synthetic data program follows a few principles. Start from the cheapest source that meets the need, which is often a simulator or simple augmentation before any learned generator. Anchor everything in real data, both as a training ingredient and as the evaluation set that the generator never sees. Verify aggressively, preferring oracles that can check correctness over judgments of plausibility. Measure privacy and utility as separate axes and report a budget, not a vibe. Cap the synthetic fraction and track provenance so that collapse cannot creep in unnoticed. Treat synthetic data as a powerful supplement to reality, never a wholesale replacement for it.

77.9 9. Summary

Synthetic data has become indispensable because real data is finite, costly, imbalanced, and constrained by privacy. Simulation offers free labels at the price of a reality gap that domain randomization helps close. Learned generative models, the VAE, the GAN, and the diffusion model, approximate data distributions with different tradeoffs among stability, fidelity, and coverage, while transformers dominate synthetic text and code. Large language models have made generation cheap and controllable, shifting the bottleneck from production to verification. Differential privacy gives synthesis a rigorous privacy guarantee at a measurable utility cost. The overriding pitfall is model collapse, the recursive degeneration that strips away the tails of a distribution when models feed on their own output, and its remedy is to keep real data in the loop. Used with discipline, synthetic data extends the reach of machine learning. Used carelessly, it quietly hollows it out.

77.10 References

Kaplan, J. et al. Scaling Laws for Neural Language Models. 2020. https://arxiv.org/abs/2001.08361
Villalobos, P. et al. Will We Run Out of Data? Limits of LLM Scaling Based on Human Generated Data. 2022. https://arxiv.org/abs/2211.04325
Tobin, J. et al. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. 2017. https://arxiv.org/abs/1703.06907
Kingma, D. P. and Welling, M. Auto-Encoding Variational Bayes. 2013. https://arxiv.org/abs/1312.6114
Goodfellow, I. et al. Generative Adversarial Networks. 2014. https://arxiv.org/abs/1406.2661
Xu, L. et al. Modeling Tabular Data using Conditional GAN (CTGAN). 2019. https://arxiv.org/abs/1907.00503
Ho, J., Jain, A., and Abbeel, P. Denoising Diffusion Probabilistic Models. 2020. https://arxiv.org/abs/2006.11239
Wang, Y. et al. Self-Instruct: Aligning Language Models with Self-Generated Instructions. 2022. https://arxiv.org/abs/2212.10560
Abadi, M. et al. Deep Learning with Differential Privacy (DP-SGD). 2016. https://arxiv.org/abs/1607.00133
Dwork, C. and Roth, A. The Algorithmic Foundations of Differential Privacy. 2014. https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf
Shumailov, I. et al. The Curse of Recursion: Training on Generated Data Makes Models Forget. 2023. https://arxiv.org/abs/2305.17493
Gerstgrasser, M. et al. Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data. 2024. https://arxiv.org/abs/2404.01413
Heusel, M. et al. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium (FID). 2017. https://arxiv.org/abs/1706.08500
Shokri, R. et al. Membership Inference Attacks Against Machine Learning Models. 2017. https://arxiv.org/abs/1610.05820

# Synthetic Data Generation Synthetic data is information that is produced by an algorithm rather than collected from a real observation, yet is intended to carry the statistical signal of the real thing. The motivation is practical. Real data is expensive to label, encumbered by privacy law, imbalanced across the classes we care about, and frequently unavailable for the rare events that matter most. Synthetic data promises an escape from these constraints, and over the past decade it has moved from a niche augmentation trick to a central pillar of how modern systems are trained, evaluated, and stress tested. This chapter develops the subject rigorously and practically. It covers simulation, the major generative model families, the rise of large language models as data engines, privacy-preserving synthesis, and the failure mode that haunts the whole enterprise, model collapse. ::: {.callout-note title="Definition: synthetic data"} Let $p_{\text{data}}$ be the true distribution that real observations are drawn from. A synthetic data generator is a sampling procedure $g$, possibly stochastic and possibly conditioned on real data $D$, whose outputs $\tilde{x} = g(\cdot)$ are induced by a distribution $p_g$. The generator is useful to the extent that $p_g$ is close to $p_{\text{data}}$ on the statistics a downstream task depends on, while $g$ relaxes a constraint that makes sampling from $p_{\text{data}}$ directly impractical, such as labeling cost, privacy exposure, or the rarity of an event of interest. Closeness is task relative: a generator can be excellent for training a fraud classifier and useless for estimating a tail quantile, because the two tasks weight $p_{\text{data}}$ differently. ::: The families covered below differ in how they obtain $p_g$. Simulators encode it by hand from domain knowledge. Generative models learn it from samples of $p_{\text{data}}$. 
Large language models reuse a distribution already learned during pretraining and steer it with prompts. The diagram below organizes the landscape. ```{mermaid} flowchart TD A["Synthetic data generation"] --> B["Simulation, knowledge encoded by hand"] A --> C["Learned generative models, density fit from data"] A --> D["LLM generation, pretrained distribution steered by prompts"] B --> B1["Mechanistic and physics simulators"] B --> B2["Domain randomization"] C --> C1["VAE, latent variable, stable, blurry"] C --> C2["GAN, adversarial, sharp, unstable"] C --> C3["Diffusion, denoising, high fidelity, slow"] D --> D1["Self instruct"] D --> D2["Distillation"] D --> D3["Augmentation and rephrasing"] ``` ## 1. Why Synthesize Data ### 1.1 The Demand Curve for Data The scaling behavior of deep learning ties model quality to dataset size. Empirically, test loss for a model with $N$ parameters trained on $D$ tokens follows an approximate power law, $L(N, D) \approx L_\infty + a N^{-\alpha} + b D^{-\beta}$, so reductions in loss demand large multiplicative increases in $D$. The power law form is the crux of the problem. Because the data term decays as $D^{-\beta}$ with $\beta$ well below one, each fixed increment of loss reduction costs a roughly geometric increase in tokens. Halving the data driven excess loss does not require twice the data; it requires a factor of $2^{1/\beta}$, which for typical $\beta \approx 0.1$ is a factor of roughly one thousand. Human generated data is finite. High quality text on the public web is a bounded resource, and credible projections suggest the stock of useful tokens will be largely exhausted within a few years of continued scaling. Synthetic data is one of the few levers that can extend the curve, alongside multiple epochs over the same data and multimodal sources. ### 1.2 Beyond Volume Volume is not the only reason to synthesize. Four others recur in practice. 1. Privacy. 
Health records, financial transactions, and location traces cannot be shared freely. A synthetic surrogate that preserves utility while breaking the link to real individuals lets teams collaborate and publish.
2. Class imbalance and rare events. Fraud, equipment failure, and adverse drug reactions are scarce by definition. Generating plausible minority examples can rebalance a training set.
3. Coverage and edge cases. Autonomous systems must handle situations that are dangerous or unethical to collect in the wild, such as a child running into a road.
4. Controllability. Synthetic pipelines expose knobs. We can dial the lighting, the dialect, the failure rate, or the label distribution, which makes systematic evaluation possible.

## 2. Simulation Based Generation

### 2.1 Mechanistic and Physics Based Simulators

The oldest form of synthetic data comes from simulators that encode domain knowledge directly. Game engines and ray tracers render labeled images for perception. Physics solvers produce sensor readings for robotics. Agent based models generate population level behavior in epidemiology and economics.

The defining feature is that labels are free. Because the simulator knows the ground truth pose, depth, or segmentation mask, annotation cost collapses to zero.

The defining problem is the reality gap, the distribution shift between the simulator distribution $p_{\text{sim}}$ and the real distribution $p_{\text{real}}$. A model $f$ trained to minimize risk under $p_{\text{sim}}$ minimizes $R_{\text{sim}}(f) = \mathbb{E}_{p_{\text{sim}}}[\ell(f(x), y)]$, but it is deployed against $R_{\text{real}}(f) = \mathbb{E}_{p_{\text{real}}}[\ell(f(x), y)]$. The gap between the two risks is controlled by how far apart the distributions are. A standard domain adaptation bound, in simplified form, writes $R_{\text{real}}(f) \le R_{\text{sim}}(f) + d(p_{\text{sim}}, p_{\text{real}})$, where $d$ is a discrepancy that depends on the model class.
A model trained only on synthetic renders often fails on real photographs because textures, noise, and lighting differ in ways the simulator did not capture, which inflates $d$ even when $R_{\text{sim}}$ is near zero. The two levers for shrinking the gap are making the simulator more realistic and making the model invariant to the differences. Domain randomization takes the second path.

### 2.2 Domain Randomization

Domain randomization closes the gap by making the simulator deliberately diverse. Rather than trying to render one photorealistic world, we randomize textures, colors, camera positions, and lighting across a wide range. The intuition is that if the real world looks like just one more random variation, a model trained across the randomized ensemble will treat reality as in distribution.

Formally, if $p_{\text{sim}}(\theta)$ is a distribution over simulator parameters $\theta$, we minimize the expected loss $\mathbb{E}_{\theta \sim p_{\text{sim}}}[\,\mathcal{L}(f, \mathcal{D}_\theta)\,]$ and hope the support is broad enough to contain the real distribution. Informally, the randomization succeeds when $p_{\text{real}}$ falls within the range spanned by the rendered variations, so that no real observation is out of distribution relative to the training mixture.

This technique underpinned early sim to real transfer for robotic manipulation and remains a strong baseline. Its cost is sample efficiency: a wider randomization range forces the model to fit a harder, more varied problem, so it needs more capacity and more data to reach a given accuracy. The practical art is to randomize the nuisance factors that the simulator gets wrong, such as texture and lighting, while keeping the task relevant structure, such as object geometry, faithful.

## 3. Generative Models for Data

When no mechanistic simulator exists, we learn the data distribution from samples.
The goal of a generative model is to approximate an unknown density $p_{\text{data}}(x)$ with a model $p_\theta(x)$ that we can sample from. Three families dominate.

### 3.1 Variational Autoencoders

A variational autoencoder (VAE) couples an encoder $q_\phi(z \mid x)$ that maps data to a latent code with a decoder $p_\theta(x \mid z)$ that reconstructs it. Training maximizes the evidence lower bound,

$$\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right),$$

which trades reconstruction fidelity against keeping the posterior close to a simple prior $p(z)$, usually a standard normal. To generate, we sample $z \sim p(z)$ and decode.

VAEs are stable to train and give a smooth, interpretable latent space, but the Gaussian assumptions and the averaging behavior of the reconstruction term tend to produce blurry samples. They remain valuable for tabular data, anomaly detection, and as components inside larger systems.

### 3.2 Generative Adversarial Networks

A generative adversarial network (GAN) pits a generator $G$ against a discriminator $D$ in a minimax game,

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))].$$

The discriminator learns to separate real from fake, and the generator learns to fool it. For a fixed generator the optimal discriminator is $D^*(x) = p_{\text{data}}(x) / (p_{\text{data}}(x) + p_g(x))$, and substituting it back reduces the generator objective to the Jensen Shannon divergence between $p_{\text{data}}$ and $p_g$, up to a positive scale factor and an additive constant. At the theoretical optimum that divergence is zero, so the generator distribution matches the data distribution and the discriminator is reduced to outputting $\tfrac{1}{2}$ everywhere, maximally confused.

GANs produce sharp, high fidelity samples and dominated image synthesis for years. They are also notoriously hard to train.
The two failure modes to know are non convergence, where the adversarial dynamics oscillate rather than settle, and mode collapse, where the generator maps many inputs to a few outputs and ignores large regions of the data distribution. Wasserstein losses, gradient penalties, and spectral normalization were developed to stabilize training. For tabular synthesis, conditional variants such as CTGAN handle mixed discrete and continuous columns and skewed marginals.

### 3.3 Diffusion Models

Diffusion models have become the state of the art for high fidelity generation. The idea is to define a forward process that gradually corrupts data with Gaussian noise across $T$ steps,

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t \mathbf{I}\right),$$

until the signal is pure noise. A useful property is that the forward process has a closed form marginal that lets us jump to any step in one shot. Writing $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$,

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t)\mathbf{I}\right),$$

so a noised sample is $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, \mathbf{I})$.

The model then learns the reverse process that denoises step by step. A neural network $\epsilon_\theta(x_t, t)$ is trained to predict the noise added at each step, minimizing a simple regression objective,

$$\mathcal{L} = \mathbb{E}_{x_0, \epsilon, t}\left[\,\|\epsilon - \epsilon_\theta(x_t, t)\|^2\,\right].$$

The input $x_t$ in this objective is exactly the closed form sample above, and the target $\epsilon$ is the noise used to build it, which is why diffusion training is so stable: there is no adversary, just a regression toward known noise. Sampling starts from noise and iterates the learned reverse steps.
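
The closed form marginal can be checked numerically. The sketch below is a minimal NumPy illustration rather than code from any diffusion library. It assumes a linear $\beta_t$ schedule, $T = 200$, and a single scalar data point $x_0 = 2$, all arbitrary choices made for illustration. It applies the one step kernel $q(x_t \mid x_{t-1})$ repeatedly and compares the empirical mean and variance at the final step with $\sqrt{\bar{\alpha}_T}\, x_0$ and $1 - \bar{\alpha}_T$.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 200
betas = np.linspace(1e-4, 0.02, T)   # assumed linear noise schedule (illustrative)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative product of alpha_t = 1 - beta_t

x0_value = 2.0                       # one scalar data point, replicated many times
x = np.full(100_000, x0_value)

# Apply q(x_t | x_{t-1}) one step at a time.
for t in range(T):
    eps = rng.standard_normal(x.shape)
    x = np.sqrt(1.0 - betas[t]) * x + np.sqrt(betas[t]) * eps

# Closed form marginal q(x_T | x_0): mean and variance.
closed_mean = np.sqrt(alpha_bar[-1]) * x0_value
closed_var = 1.0 - alpha_bar[-1]

print(f"mean: stepwise {x.mean():.4f}, closed form {closed_mean:.4f}")
print(f"var:  stepwise {x.var():.4f}, closed form {closed_var:.4f}")
```

The two pairs of numbers should agree up to Monte Carlo error, which is the property that lets a training step sample $x_t$ directly instead of simulating every intermediate step.
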
Conditioning, such as a text prompt, enters through the network input, and classifier free guidance sharpens adherence to the condition by extrapolating between the conditional and unconditional predictions at sampling time.

Diffusion models avoid the adversarial instability of GANs and cover modes far better, at the cost of slow, multi step sampling, although distillation and few step solvers have narrowed that gap. They power most current text to image systems and have extended to audio, video, molecules, and tabular data. The following sketch shows a single training step at a high level.

```python
# Diffusion training step (toy PyTorch sketch; the schedule, data, and model are stand-in assumptions)
import torch

T = 1000
alpha_bar = torch.cumprod(1 - torch.linspace(1e-4, 0.02, T), dim=0)  # assumed linear beta schedule
model = torch.nn.Linear(3, 2)                      # toy eps_theta reading (x_t, t / T)
x0 = torch.randn(64, 2)                            # stand-in for sample_batch(data)
t = torch.randint(0, T, (64,))                     # random diffusion step (0-indexed)
noise = torch.randn_like(x0)
ab = alpha_bar[t].unsqueeze(1)
xt = ab.sqrt() * x0 + (1 - ab).sqrt() * noise
pred = model(torch.cat([xt, t.unsqueeze(1) / T], dim=1))
loss = torch.nn.functional.mse_loss(pred, noise)   # predict the injected noise
loss.backward()
```

### 3.4 Choosing a Family

There is no universally best model. VAEs are a sensible default for tabular and low dimensional data where stability and a structured latent space matter. GANs still compete when sample sharpness is paramount and training budget is limited. Diffusion models are the choice when fidelity and mode coverage justify heavier compute. For text and code, the autoregressive transformer, discussed next, dominates outright. The table summarizes the tradeoffs.

| Family | Training stability | Sample fidelity | Mode coverage | Sampling cost | Typical use |
|---|---|---|---|---|---|
| VAE | High | Lower, often blurry | Good | Cheap, one pass | Tabular, anomaly detection, latent components |
| GAN | Low, adversarial | High, sharp | Risk of mode collapse | Cheap, one pass | Images on a budget, tabular via CTGAN |
| Diffusion | High | Highest | Strong | Expensive, multi step | Images, audio, video, molecules |
| Autoregressive | High | High for sequences | Strong | Sequential, token by token | Text, code, structured sequences |

Mature open source implementations exist for all four.
Hugging Face Diffusers and PyTorch cover diffusion and autoregressive models, the Synthetic Data Vault provides VAE and GAN based tabular synthesizers including CTGAN, and these libraries are free, well documented, and widely used.

## 4. LLM Generated Data

### 4.1 The Shift to Language Models as Data Engines

Large language models changed synthetic data from a research curiosity into a production workflow. An autoregressive language model factorizes the joint distribution over a token sequence as $p_\theta(x_{1:T}) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t})$, and sampling from this learned distribution, optionally conditioned on a prompt $c$ to give $p_\theta(x_t \mid x_{<t}, c)$, is exactly the act of generating synthetic text. The model has already absorbed a vast distribution during pretraining, so unlike a VAE or GAN there is no per task density to fit. We reuse the pretrained distribution and steer it.

A capable model can be prompted to produce instruction and response pairs, question and answer sets, classification examples with rationales, and entire dialogues. The output is fluent, controllable through natural language instructions, and cheap relative to human annotation. Several influential instruction tuned models were trained largely on data generated by a stronger model, and the practice of distilling a teacher model into a smaller student through generated data is now routine.

### 4.2 Common Patterns

Three patterns appear repeatedly.

1. Self instruct. A model bootstraps a diverse instruction set from a small seed pool by generating new tasks, filtering for quality, and using the result to fine tune itself or a smaller model.
2. Distillation. A strong teacher generates inputs and high quality outputs, and a cheaper student is trained to imitate them. This compresses capability into a deployable size.
3. Augmentation and rephrasing. Existing examples are paraphrased, translated, or perturbed to expand coverage without changing the underlying labels.
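
The bootstrap step of the self instruct pattern can be sketched in a few lines. This is a toy illustration under stated assumptions, not the published pipeline: `toy_propose` is a hypothetical stand-in for a language model call, `jaccard` is a simple word overlap score used as the novelty filter, and the 0.7 threshold is an arbitrary illustrative choice.

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two instructions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def bootstrap(seed_pool, propose, rounds=2, max_similarity=0.7):
    """Grow an instruction pool, keeping only candidates unlike everything kept so far."""
    pool = list(seed_pool)
    for _ in range(rounds):
        for candidate in propose(pool):
            if all(jaccard(candidate, kept) < max_similarity for kept in pool):
                pool.append(candidate)
    return pool


def toy_propose(pool):
    """Hypothetical stand-in for an LLM call: fills fixed templates from the first two pool items."""
    verbs = ["Summarize", "Translate", "Classify"]
    return [f"{verb} the following text about {item.split()[-1]}"
            for verb in verbs for item in pool[:2]]


seeds = ["Write a poem about autumn", "Explain recursion"]
pool = bootstrap(seeds, toy_propose)
for instruction in pool:
    print(instruction)
```

In a real pipeline the proposal function would call a model and the filter would also check quality, but the shape is the same: propose, reject near duplicates, and grow the pool.
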
### 4.3 Quality Control and Verification

The central risk of LLM generated data is that fluent text is not necessarily correct text. Hallucinated facts, subtle reasoning errors, and homogeneous phrasing degrade the resulting model. Effective pipelines therefore treat generation as the first stage of a filter.

Verification strategies include execution feedback, where generated code or math is checked by running it or a verifier; model based judging, where a separate model scores candidates against a rubric; consistency filtering, where only answers a model reaches by multiple independent paths are retained; and deduplication to limit repetition. The reliable signal comes from grounding generation in a checkable oracle. Synthetic data for code and mathematics has advanced faster than for open ended text precisely because correctness is verifiable there.

```text
generate -> filter for correctness -> deduplicate -> balance -> train
            (execution, judge, consistency, schema checks)
```

## 5. Privacy Preserving Synthesis

### 5.1 The Goal and Its Subtlety

A common claim is that synthetic data is automatically private because no row corresponds to a real person. This is false. A model trained on sensitive data can memorize and regurgitate individual records, and membership inference attacks can determine whether a specific person was in the training set by probing the model or its outputs. Privacy must be engineered, not assumed.

### 5.2 Differential Privacy

The rigorous standard is differential privacy (DP). A randomized mechanism $\mathcal{M}$ is $(\varepsilon, \delta)$ differentially private if for all datasets $D$ and $D'$ differing in one record and all output sets $S$,

$$\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon}\, \Pr[\mathcal{M}(D') \in S] + \delta.$$

The parameter $\varepsilon$ bounds how much any single individual can influence the output, and the additive $\delta$ allows a small probability of exceeding that bound. Smaller $\varepsilon$ means stronger privacy.
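
The definition can be checked exhaustively for a small mechanism. The sketch below uses randomized response on a single bit, a mechanism that is not discussed in this chapter and is chosen here only because its output probabilities are easy to enumerate. It reports the true bit with probability $e^{\varepsilon} / (1 + e^{\varepsilon})$ and the flipped bit otherwise. The function names are hypothetical, and each dataset is assumed to be one record, so the neighboring pairs are simply the two bit values.

```python
import itertools
import math


def randomized_response_probs(true_bit, mechanism_epsilon):
    """Output distribution of randomized response applied to a single-bit record."""
    p_truth = math.exp(mechanism_epsilon) / (1.0 + math.exp(mechanism_epsilon))
    return {true_bit: p_truth, 1 - true_bit: 1.0 - p_truth}


def satisfies_dp(mechanism_epsilon, claimed_epsilon, delta=0.0):
    """Check the (epsilon, delta) inequality for every neighboring pair and every output set S."""
    outputs = [0, 1]
    subsets = [s for r in range(len(outputs) + 1)
               for s in itertools.combinations(outputs, r)]
    for d, d_prime in [(0, 1), (1, 0)]:
        p = randomized_response_probs(d, mechanism_epsilon)
        q = randomized_response_probs(d_prime, mechanism_epsilon)
        for s in subsets:
            lhs = sum(p[o] for o in s)
            rhs = math.exp(claimed_epsilon) * sum(q[o] for o in s) + delta
            if lhs > rhs + 1e-12:  # small tolerance for floating point equality
                return False
    return True


print(satisfies_dp(mechanism_epsilon=1.0, claimed_epsilon=1.0))
print(satisfies_dp(mechanism_epsilon=1.0, claimed_epsilon=0.5))
```

The first check holds with equality on the singleton output sets, and the second fails, which shows that a mechanism tuned for one $\varepsilon$ does not satisfy a stricter one.
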
Two structural properties make the definition useful in practice. Post processing means any function applied to a DP output is still DP with the same budget, so once a generator is trained privately every sample it produces is covered for free. Composition means the budgets of repeated accesses add up, which is why training, a sequence of many gradient steps, must account for its total spend.

The dominant training technique is DP-SGD, which clips per example gradients to a fixed norm $C$, so no single record can dominate, and adds calibrated Gaussian noise of scale proportional to $C$ before each update. A generative model trained under DP-SGD inherits the guarantee, so its synthetic samples carry the same protection by the post processing property.

The cost is a privacy utility tradeoff. Tighter privacy injects more noise and lowers fidelity, and that tension is fundamental rather than a temporary engineering gap. The open source Opacus library implements DP-SGD for PyTorch with built in privacy accounting, which makes the budget explicit rather than guesswork.

### 5.3 Measuring Privacy and Utility

Synthetic data should be evaluated on both axes. Utility is measured by fidelity, how well marginal and joint distributions match the real data, and by downstream task performance, often using the train on synthetic, test on real protocol. Privacy is measured by resistance to membership inference and by distance to nearest real records, since a synthetic point that is nearly identical to a training row is a leak regardless of any aggregate metric. A responsible release reports a privacy budget and an attack based audit, not just a fidelity score.

## 6. Evaluation

### 6.1 Fidelity, Diversity, and Utility

Three properties define good synthetic data. Fidelity asks whether samples are individually realistic. Diversity asks whether the samples cover the full range of the real distribution rather than a few modes.
Utility asks whether a model trained on the synthetic data performs well on real data. These can conflict. A generator that copies a handful of real examples scores high on fidelity but fails diversity and offers little utility.

For images, the Frechet Inception Distance compares feature statistics by modeling the real and generated features as Gaussians with means $\mu_r, \mu_g$ and covariances $\Sigma_r, \Sigma_g$ in the embedding space of a pretrained network, then computing

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{tr}\!\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right).$$

Lower is better, and FID penalizes both a mean shift, through the first term, and a covariance mismatch, through the second, so it is sensitive to fidelity and diversity at once. Its weakness is that a single scalar cannot say which axis failed. Precision and recall style metrics address this by separating the two: precision is the fraction of generated samples that fall within the support of the real data, a fidelity measure, and recall is the fraction of real data covered by the generated support, a diversity measure. The most decisive test remains train on synthetic, test on real, because it measures the only thing that ultimately matters.

### 6.2 The Train on Synthetic, Test on Real Protocol

The protocol is simple. Train a downstream model entirely on synthetic data, then evaluate it on a held out set of real data. The gap to a model trained on real data quantifies how much signal the synthetic pipeline preserved. Crucially, the test set must be real and untouched by the generator, otherwise the evaluation inherits the generator's blind spots.

## 7. Model Collapse and the Pitfalls

### 7.1 What Model Collapse Is

Model collapse is the degenerative process that occurs when generative models are trained on data produced by earlier generative models, recursively, over generations.
Each generation samples from an imperfect approximation of the previous distribution. Errors compound. Tails of the distribution, the rare events and minority modes, are sampled less and less until they vanish, and the model's outputs drift toward a narrow, over smoothed average. Late stage collapse can converge to near constant output that bears little resemblance to the original data.

### 7.2 Why It Happens

Three error sources drive collapse. Statistical error arises because we train on finite samples, so rare events are sometimes simply absent. Functional expressivity error arises because models cannot perfectly represent the true distribution. Functional approximation error arises from imperfect optimization and biased estimators. Each generation, sampling from the model rather than reality, re injects and amplifies these errors. The mathematical signature is a steady contraction of variance.

::: {.callout-tip title="Worked example: variance collapse under recursion"}
Consider the simplest possible recursive loop. At generation $0$ the truth is $\mathcal{N}(\mu, \sigma^2)$. At each generation we draw $n$ samples from the current model, fit a new Gaussian by maximum likelihood, and discard the old data, keeping only the fresh samples. Let $\sigma_k^2$ denote the variance of the fitted model at generation $k$.

The maximum likelihood variance estimate from $n$ samples of a $\mathcal{N}(\mu_k, \sigma_k^2)$ source is biased downward: in expectation, $\mathbb{E}[\hat{\sigma}_{k+1}^2 \mid \sigma_k^2] = \tfrac{n-1}{n}\,\sigma_k^2$. Because each generation only sees its predecessor's samples, the expected variances form a geometric sequence,

$$\mathbb{E}[\sigma_k^2] = \left(\frac{n-1}{n}\right)^k \sigma^2 \xrightarrow[k \to \infty]{} 0.$$

The expected variance decays to zero no matter how large $n$ is, as long as it is finite, because the bias is multiplicative and compounds at every step.
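
The loop is short enough to simulate directly. The sketch below is a minimal NumPy illustration of the recursion just described, with a deliberately small $n = 10$ and 200 generations so that the decay is visible quickly. The function name, the seed, and these settings are illustrative assumptions. A single run is noisy and will not track the expected value exactly, but the downward trend is unmistakable.

```python
import numpy as np


def recursive_gaussian_fit(n, generations, mu=0.0, sigma=1.0, seed=0):
    """Return the fitted variance at each generation of the recursive fitting loop."""
    rng = np.random.default_rng(seed)
    variances = [sigma ** 2]
    for _ in range(generations):
        samples = rng.normal(mu, sigma, size=n)  # draw from the current model only
        mu = samples.mean()                      # maximum likelihood mean
        sigma = samples.std()                    # maximum likelihood std (ddof=0, biased)
        variances.append(sigma ** 2)
    return np.array(variances)


variances = recursive_gaussian_fit(n=10, generations=200)
for k in (0, 10, 50, 100, 200):
    print(f"generation {k:3d}: fitted variance {variances[k]:.3e}")
```

Replacing `samples.std()` with the unbiased `samples.std(ddof=1)` removes the bias in expectation, but a single trajectory still drifts toward zero because the sampling noise is multiplicative. That is the statistical error described above acting on its own.
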
With $n = 1000$ samples per generation the expected variance shrinks by a tenth of a percent each round, which sounds negligible, yet it falls by roughly a factor of ten every 2,300 generations, so after a few thousand generations the distribution is well on its way to a spike at $\mu$. The tails go first, then everything. This toy model is not a curiosity; it is the cleanest illustration of why feeding a model its own output without an external anchor is self limiting. The only way to stop the geometric decay is to keep injecting samples from the true source, which resets the multiplicative bias instead of letting it compound.
:::

Even this idealized case of fitting a Gaussian to its own samples each generation shows the estimated variance shrinking toward zero over time, a clean illustration of how tails disappear under recursion. Real generative models add the expressivity and approximation errors on top, so collapse in practice is faster and less benign.

### 7.3 Mitigations

Collapse is not inevitable. The decisive mitigation is to retain real data. If each generation continues to train on the original real corpus alongside synthetic data, rather than replacing it, the degeneration is largely arrested. Accumulating data, where synthetic outputs are added to rather than substituted for the real anchor, avoids the worst outcomes in both theory and experiment.

Other defenses include provenance tracking to distinguish human from machine generated content, strong verification filters that discard low quality synthetic samples before they enter the training pool, and mixing ratios that cap the synthetic fraction. The practical guidance is direct. Never let a model's own outputs silently become its entire diet.

### 7.4 The Broader Risk

Model collapse is not only a per project concern. As machine generated text and images fill the public web, future models scraped from that web will increasingly train on their predecessors' output without anyone intending it.
This raises the value of data with verified human provenance and of curated, attested corpora collected before the saturation point. The commons that made large models possible is itself at risk of contamination, which is one reason data provenance has become a first class engineering concern.

## 8. Practical Guidance

A workable synthetic data program follows a few principles. Start from the cheapest source that meets the need, which is often a simulator or simple augmentation before any learned generator. Anchor everything in real data, both as a training ingredient and as the evaluation set that the generator never sees. Verify aggressively, preferring oracles that can check correctness over judgments of plausibility. Measure privacy and utility as separate axes and report a budget, not a vibe. Cap the synthetic fraction and track provenance so that collapse cannot creep in unnoticed. Treat synthetic data as a powerful supplement to reality, never a wholesale replacement for it.

## 9. Summary

Synthetic data has become indispensable because real data is finite, costly, imbalanced, and constrained by privacy. Simulation offers free labels at the price of a reality gap that domain randomization helps close. Learned generative models, the VAE, the GAN, and the diffusion model, approximate data distributions with different tradeoffs among stability, fidelity, and coverage, while transformers dominate synthetic text and code. Large language models have made generation cheap and controllable, shifting the bottleneck from production to verification. Differential privacy gives synthesis a rigorous privacy guarantee at a measurable utility cost. The overriding pitfall is model collapse, the recursive degeneration that strips away the tails of a distribution when models feed on their own output, and its remedy is to keep real data in the loop. Used with discipline, synthetic data extends the reach of machine learning. Used carelessly, it quietly hollows it out.
## References

1. Kaplan, J. et al. Scaling Laws for Neural Language Models. 2020. https://arxiv.org/abs/2001.08361
2. Villalobos, P. et al. Will We Run Out of Data? Limits of LLM Scaling Based on Human Generated Data. 2022. https://arxiv.org/abs/2211.04325
3. Tobin, J. et al. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. 2017. https://arxiv.org/abs/1703.06907
4. Kingma, D. P. and Welling, M. Auto-Encoding Variational Bayes. 2013. https://arxiv.org/abs/1312.6114
5. Goodfellow, I. et al. Generative Adversarial Networks. 2014. https://arxiv.org/abs/1406.2661
6. Xu, L. et al. Modeling Tabular Data using Conditional GAN (CTGAN). 2019. https://arxiv.org/abs/1907.00503
7. Ho, J., Jain, A., and Abbeel, P. Denoising Diffusion Probabilistic Models. 2020. https://arxiv.org/abs/2006.11239
8. Wang, Y. et al. Self-Instruct: Aligning Language Models with Self-Generated Instructions. 2022. https://arxiv.org/abs/2212.10560
9. Abadi, M. et al. Deep Learning with Differential Privacy (DP-SGD). 2016. https://arxiv.org/abs/1607.00133
10. Dwork, C. and Roth, A. The Algorithmic Foundations of Differential Privacy. 2014. https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf
11. Shumailov, I. et al. The Curse of Recursion: Training on Generated Data Makes Models Forget. 2023. https://arxiv.org/abs/2305.17493
12. Gerstgrasser, M. et al. Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data. 2024. https://arxiv.org/abs/2404.01413
13. Heusel, M. et al. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium (FID). 2017. https://arxiv.org/abs/1706.08500
14. Shokri, R. et al. Membership Inference Attacks Against Machine Learning Models. 2017. https://arxiv.org/abs/1610.05820