| |
| |
| |
| |
| export const DEFAULT_CONTENT = ` |
| <h2>Introduction</h2> |
| |
| <p>Recent work has revealed remarkably predictable relationships between the scale of neural language models and their performance. As <span data-type="glossary" term="Scaling law" definition="A power-law relationship between model size, dataset size, compute budget, and performance."></span> research matures, practitioners increasingly rely on these empirical regularities to guide decisions about model architecture, training data, and compute allocation.</p> |
| |
| <div data-component="sidenote"><p>The concept of scaling laws in deep learning predates the current LLM era. Similar power-law behaviors were observed in machine translation and image classification as early as 2017.</p></div> |
| |
| <p>This paper presents an empirical study of <span data-type="glossary" term="Neural scaling" definition="The phenomenon where model performance improves as a power law with increasing model size, data, or compute."></span> behavior across three key axes: model size (number of parameters), dataset size (number of tokens), and compute budget (FLOPs). Building on the foundational work of <span data-type="citation" key="kaplan2020"></span> and subsequent refinements by <span data-type="citation" key="hoffmann2022"></span>, we investigate how these relationships hold across different architectural variants and training regimes.<span data-type="footnote" content="Our experiments span five orders of magnitude in compute, from single-GPU runs to multi-node clusters with 256 GPUs."></span></p> |
| |
| <p>We make the following contributions:</p> |
| |
| <ol> |
| <li>A comprehensive benchmark of scaling behavior across <strong>five model sizes</strong> ranging from 125M to 70B parameters</li> |
| <li>An analysis of <em>compute-optimal training</em>, extending the Chinchilla framework to mixture-of-experts architectures</li> |
| <li>Practical guidelines for resource allocation in large-scale training campaigns</li> |
| </ol> |
| |
| <h2>Background</h2> |
| |
| <h3>The Transformer architecture</h3> |
| |
| <p>The <span data-type="glossary" term="Transformer" definition="A neural network architecture based entirely on self-attention mechanisms, introduced by Vaswani et al. (2017)."></span> architecture <span data-type="citation" key="vaswani2017"></span> forms the backbone of all models studied in this work. At its core, the self-attention mechanism computes a weighted combination of value vectors, where weights are determined by the compatibility between query and key vectors:</p> |
| |
| <div data-type="block-math" data-latex="\\text{Attention}(Q, K, V) = \\text{softmax}\\!\\left(\\frac{QK^\\top}{\\sqrt{d_k}}\\right) V"></div> |
| |
| <p>where <span data-type="inline-math" data-latex="Q \\in \\mathbb{R}^{n \\times d_k}"></span> are queries, <span data-type="inline-math" data-latex="K \\in \\mathbb{R}^{m \\times d_k}"></span> are keys, and <span data-type="inline-math" data-latex="V \\in \\mathbb{R}^{m \\times d_v}"></span> are values. Dividing by <span data-type="inline-math" data-latex="\\sqrt{d_k}"></span> keeps the dot products from growing in magnitude with the key dimension <span data-type="inline-math" data-latex="d_k"></span>.<span data-type="footnote" content="Without this scaling factor, large dot products would push the softmax into saturated regions where its gradients are extremely small, making training unstable."></span></p> |
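| |
| <p>A brief sketch of why this particular factor is used, under the simplifying assumption (ours, for illustration) that the components of a query <span data-type="inline-math" data-latex="q"></span> and a key <span data-type="inline-math" data-latex="k"></span> are independent with zero mean and unit variance. The dot product then has variance <span data-type="inline-math" data-latex="d_k"></span>, and dividing by <span data-type="inline-math" data-latex="\\sqrt{d_k}"></span> brings it back to one:</p> |
| |
| <div data-type="block-math" data-latex="\\text{Var}(q \\cdot k) = \\sum_{i=1}^{d_k} \\text{Var}(q_i k_i) = d_k \\quad \\Longrightarrow \\quad \\text{Var}\\!\\left(\\frac{q \\cdot k}{\\sqrt{d_k}}\\right) = 1"></div> |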
| |
| <h3>Pre-training objectives</h3> |
| |
| <p>All models in our study are trained with the standard autoregressive language modeling objective. Given a sequence of tokens <span data-type="inline-math" data-latex="x_1, x_2, \\ldots, x_T"></span>, the model minimizes the negative log-likelihood:</p> |
| |
| <div data-type="block-math" data-latex="\\mathcal{L}(\\theta) = -\\sum_{t=1}^{T} \\log P_\\theta(x_t \\mid x_{<t})"></div> |
| |
| <p>This objective naturally decomposes across sequence positions, enabling efficient parallelization. Prior work <span data-type="citation" key="devlin2019"></span> demonstrated that bidirectional objectives can yield superior representations for downstream tasks, but autoregressive training remains the dominant paradigm for generative models.</p> |
| |
| <h4>Loss decomposition</h4> |
| |
| <p>For analysis purposes, we decompose the training loss into contributions from different frequency bins:</p> |
| |
| <div data-type="block-math" data-latex="\\begin{aligned} \\mathcal{L}_{\\text{total}} &= \\mathcal{L}_{\\text{high-freq}} + \\mathcal{L}_{\\text{mid-freq}} + \\mathcal{L}_{\\text{low-freq}} \\\\ &= -\\sum_{t \\in H} \\log P(x_t \\mid x_{<t}) - \\sum_{t \\in M} \\log P(x_t \\mid x_{<t}) - \\sum_{t \\in L} \\log P(x_t \\mid x_{<t}) \\end{aligned}"></div> |
| |
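| <p>where <span data-type="inline-math" data-latex="H, M, L"></span> partition the token positions according to whether the target token <span data-type="inline-math" data-latex="x_t"></span> falls in the high-, mid-, or low-frequency bin of the vocabulary. This decomposition reveals that scaling primarily benefits predictions on rare tokens.</p> |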
| |
| <h2>Experimental setup</h2> |
| |
| <h3>Model configurations</h3> |
| |
| <p>We train five model sizes spanning nearly three orders of magnitude in parameter count, from 125M to 70B. All models use the standard decoder-only Transformer architecture with rotary position embeddings (RoPE) and SwiGLU activations.</p> |
| |
| <table> |
| <tr> |
| <th>Model</th> |
| <th>Parameters</th> |
| <th>Layers</th> |
| <th>Hidden dim</th> |
| <th>Heads</th> |
| </tr> |
| <tr> |
| <td>Small</td> |
| <td>125M</td> |
| <td>12</td> |
| <td>768</td> |
| <td>12</td> |
| </tr> |
| <tr> |
| <td>Medium</td> |
| <td>1.3B</td> |
| <td>24</td> |
| <td>2048</td> |
| <td>16</td> |
| </tr> |
| <tr> |
| <td>Large</td> |
| <td>6.7B</td> |
| <td>32</td> |
| <td>4096</td> |
| <td>32</td> |
| </tr> |
| <tr> |
| <td>XL</td> |
| <td>13B</td> |
| <td>40</td> |
| <td>5120</td> |
| <td>40</td> |
| </tr> |
| <tr> |
| <td>XXL</td> |
| <td>70B</td> |
| <td>80</td> |
| <td>8192</td> |
| <td>64</td> |
| </tr> |
| </table> |
| |
| <h3>Training infrastructure</h3> |
| |
| <div data-component="accordion" title="Hardware and software details" open="false"><p>All experiments were conducted on clusters of NVIDIA A100 80GB GPUs connected via NVLink and InfiniBand. We used a custom distributed training framework built on PyTorch FSDP with mixed-precision (bf16) training. Gradient checkpointing was enabled for models above 6.7B parameters to fit within GPU memory constraints.</p></div> |
| |
| <p>The following code snippet illustrates our distributed training configuration:</p> |
| |
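| <pre><code class="language-python">import torch |
| from torch.distributed.fsdp import ( |
|     FullyShardedDataParallel as FSDP, |
|     MixedPrecision, |
|     ShardingStrategy, |
| ) |
| from torch.optim.lr_scheduler import CosineAnnealingLR |
| from transformers import AutoConfig, AutoModelForCausalLM |
| |
| # Assumes torch.distributed is already initialized (e.g. launched with torchrun). |
| total_steps = 100_000  # placeholder; set to the planned number of optimizer steps |
| |
| config = AutoConfig.from_pretrained("meta-llama/Llama-3-8B") |
| config.use_cache = False  # the KV cache is not needed during training |
| |
| model = AutoModelForCausalLM.from_config(config) |
| # FSDP has no activation_checkpointing argument; enable checkpointing on the |
| # Hugging Face model before wrapping it. |
| model.gradient_checkpointing_enable() |
| model = FSDP( |
|     model, |
|     mixed_precision=MixedPrecision( |
|         param_dtype=torch.bfloat16,   # parameters in bf16 for compute |
|         reduce_dtype=torch.float32,   # gradient reduction in fp32 |
|         buffer_dtype=torch.bfloat16, |
|     ), |
|     sharding_strategy=ShardingStrategy.FULL_SHARD, |
| ) |
| |
| # Create the optimizer after wrapping so it sees the sharded parameters. |
| optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1) |
| scheduler = CosineAnnealingLR(optimizer, T_max=total_steps, eta_min=3e-5)</code></pre> |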
| |
| <div data-component="note" title="Reproducibility" emoji="💡" variant="info"><p>All experiments use a fixed random seed (42) for weight initialization and data shuffling. However, non-determinism in CUDA operations means exact reproducibility across hardware configurations is not guaranteed. We report means and standard deviations across three independent runs for each configuration.</p></div> |
| |
| <h3>Dataset</h3> |
| |
| <p>We train on a deduplicated mixture of web text, books, and scientific papers totaling 2.4 trillion tokens. The mixture weights were optimized following the approach of <span data-type="citation" key="brown2020"></span>, with upsampling of high-quality sources.</p> |
| |
| <h2>Scaling behavior</h2> |
| |
| <h3>Compute-optimal training</h3> |
| |
| <p>Our central finding confirms and extends the <span data-type="glossary" term="Chinchilla scaling" definition="The observation by Hoffmann et al. (2022) that compute-optimal training requires scaling model size and data size in roughly equal proportions."></span> hypothesis: for a given compute budget <span data-type="inline-math" data-latex="C"></span>, the optimal model size <span data-type="inline-math" data-latex="N^*"></span> and token count <span data-type="inline-math" data-latex="D^*"></span> follow power laws:</p> |
| |
| <div data-type="block-math" data-latex="N^* \\propto C^{0.50}, \\quad D^* \\propto C^{0.50}"></div> |
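| |
| <p>To make the exponents concrete, here is a rough sketch that assumes the common approximation <span data-type="inline-math" data-latex="C \\approx 6ND"></span> for dense Transformer training FLOPs (an assumption borrowed from the scaling-law literature, not a quantity fitted in this study). Equal exponents mean that a 100× larger budget is split evenly between parameters and tokens:</p> |
| |
| <div data-type="block-math" data-latex="C \\to 100\\,C \\quad \\Longrightarrow \\quad N^* \\to 100^{0.50}\\,N^* = 10\\,N^*, \\quad D^* \\to 100^{0.50}\\,D^* = 10\\,D^*"></div> |
| |
| <p>The product of the two factors recovers the 100× increase in compute, as the approximation requires.</p> |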
| |
| <p>The interactive visualization below shows the scaling relationship across our model suite:</p> |
| |
| <div data-component="htmlEmbed" src="d3-scaling-chart.html" title="Compute-performance scaling" desc="Validation loss as a function of training compute (FLOPs) for each model size." wide="false" downloadable="true"></div> |
| |
| <h3>Memory requirements</h3> |
| |
| <h4>Training memory breakdown</h4> |
| |
| <p>Understanding memory consumption is critical for capacity planning. Training memory consists of four components: model parameters, gradients, optimizer states, and activations.<span data-type="footnote" content="With the Adam optimizer, the optimizer states comprise first and second moment estimates, each holding one value per model parameter. When stored at the same precision as the parameters, the optimizer states alone consume 2× the parameter memory; keeping them in fp32 alongside bf16 parameters increases this ratio."></span> The chart below shows how these components scale across model sizes and sequence lengths:</p> |
| |
| <div data-component="htmlEmbed" src="d3-memory-bar.html" title="Training memory breakdown" desc="Memory consumption by component across model sizes and sequence lengths. Use the dropdowns to explore different configurations." wide="false" downloadable="true"></div> |
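| |
| <p>As a rough illustration of the components that do not depend on sequence length, consider a hypothetical mixed-precision Adam setup, which may differ from the exact configuration used in our runs: bf16 parameters and gradients at 2 bytes each, plus fp32 master weights and two fp32 moment estimates at 4 bytes each. For the 70B model, with <span data-type="inline-math" data-latex="N"></span> the parameter count, this gives:</p> |
| |
| <div data-type="block-math" data-latex="M_{\\text{state}} \\approx (2 + 2 + 4 + 4 + 4)\\,N = 16N\\ \\text{bytes} \\approx 16 \\times 70 \\times 10^{9}\\ \\text{bytes} \\approx 1.12\\ \\text{TB}"></div> |
| |
| <p>Activation memory comes on top of this figure and is the only component that grows with sequence length.</p> |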
| |
| <div data-component="note" title="Practical constraint" emoji="⚠️" variant="danger"><p>The 70B model requires at minimum 4-way tensor parallelism at sequence length 4096 with selective recomputation. Without activation checkpointing, memory requirements exceed 1.5 TB, making training infeasible even on a full 8×A100 node (640 GB of GPU memory).</p></div> |
| |
| <h4>Activation checkpointing strategies</h4> |
| |
| <p>We evaluate three recomputation strategies and find that <strong>selective recomputation</strong> (checkpointing only attention layers) provides the best trade-off between memory savings and compute overhead:</p> |
| |
| <ul> |
| <li><strong>None</strong>: full activation storage, maximum memory, fastest wall-clock time</li> |
| <li><strong>Selective</strong>: checkpoint attention blocks only, ~75% memory reduction in activations with less than 5% throughput loss</li> |
| <li><strong>Full</strong>: checkpoint every layer, ~94% activation memory reduction but 30-40% throughput penalty</li> |
| </ul> |
| |
| <h2>Evaluation</h2> |
| |
| <h3>Benchmark performance</h3> |
| |
| <p>We evaluate all models on a suite of standard benchmarks. The table below summarizes key results:</p> |
| |
| <table> |
| <tr> |
| <th>Model</th> |
| <th>MMLU</th> |
| <th>HellaSwag</th> |
| <th>ARC-C</th> |
| <th>TruthfulQA</th> |
| </tr> |
| <tr> |
| <td>Small (125M)</td> |
| <td>26.1</td> |
| <td>31.2</td> |
| <td>22.8</td> |
| <td>42.1</td> |
| </tr> |
| <tr> |
| <td>Medium (1.3B)</td> |
| <td>34.7</td> |
| <td>52.6</td> |
| <td>31.4</td> |
| <td>38.9</td> |
| </tr> |
| <tr> |
| <td>Large (6.7B)</td> |
| <td>52.3</td> |
| <td>74.1</td> |
| <td>45.6</td> |
| <td>35.2</td> |
| </tr> |
| <tr> |
| <td>XL (13B)</td> |
| <td>61.8</td> |
| <td>80.4</td> |
| <td>53.2</td> |
| <td>39.7</td> |
| </tr> |
| <tr> |
| <td>XXL (70B)</td> |
| <td>72.4</td> |
| <td>86.9</td> |
| <td>62.1</td> |
| <td>44.3</td> |
| </tr> |
| </table> |
| |
| <div data-component="sidenote"><p>TruthfulQA shows a U-shaped pattern with model scale: mid-sized models perform worst (the 6.7B model scores lowest, at 35.2), as they are large enough to memorize common misconceptions but not large enough to reason past them.</p></div> |
| |
| <h3>Classification analysis</h3> |
| |
| <p>To understand where scaling helps most, we analyze per-class performance on a 10-class text classification task. The confusion matrices below compare our baseline (1.3B) and improved (6.7B) models:</p> |
| |
| <div data-component="wide"><div data-component="htmlEmbed" src="d3-confusion-matrix.html" title="Confusion matrix comparison" desc="Per-class classification accuracy for the baseline (1.3B) and improved (6.7B) models, shown alongside a delta matrix that highlights the classes with the largest gains." wide="false"></div></div> |
| |
| <p>The delta matrix reveals that scaling disproportionately benefits classes 2, 3, 5, and 8 (categories with high inter-class confusion at smaller scales), while already well-separated classes (0, 7, 9) see modest improvements.</p> |
| |
| <h2>Analysis</h2> |
| |
| <h3>Training dynamics</h3> |
| |
| <div data-component="quoteBlock" author="Ilya Sutskever" source="NeurIPS 2024 keynote"><p>If you have a very large neural network and you train it on a very large dataset, you get very good results. It really is that simple.</p></div> |
| |
| <p>While the quote above captures the high-level intuition, the reality is more nuanced. Our experiments reveal several non-obvious phenomena during training:</p> |
| |
| <div data-component="mermaid" code="graph LR\n A[Initialize model] --> B[Warmup phase]\n B --> C{Loss plateau?}\n C -->|No| D[Continue training]\n D --> C\n C -->|Yes| E[Increase learning rate]\n E --> F{Loss decreases?}\n F -->|Yes| D\n F -->|No| G[Reduce batch size]\n G --> D"></div> |
| |
| <h4>Phase transitions</h4> |
| |
| <p>We observe distinct <strong>phase transitions</strong> in model capabilities as training progresses. These transitions are characterized by sudden improvements in specific task categories, consistent with the "emergence" hypothesis <span data-type="citation" key="wei2022"></span>.<span data-type="footnote" content="The emergence hypothesis suggests that certain capabilities appear abruptly at specific scale thresholds, rather than improving gradually. This remains a topic of active debate in the field, with some researchers arguing that emergent abilities may be artifacts of evaluation metrics rather than genuine phase transitions in model behavior."></span></p> |
| |
| <div data-component="stack" layout="2-column" gap="medium"><div data-type="stack-column"><p><strong>Early training (0-20% tokens)</strong></p><p>Models acquire basic syntax, word co-occurrence statistics, and simple factual associations. Loss decreases rapidly following a power law. All model sizes show similar learning curves during this phase.</p></div><div data-type="stack-column"><p><strong>Late training (80-100% tokens)</strong></p><p>Larger models continue improving on complex reasoning tasks while smaller models plateau. The gap between model sizes widens, with diminishing returns for the smallest models. Multi-step reasoning capabilities emerge only in models above 6.7B parameters.</p></div></div> |
| |
| <h3>Compute efficiency</h3> |
| |
| <p>A critical finding is the relationship between total compute <span data-type="inline-math" data-latex="C"></span> and achieved loss <span data-type="inline-math" data-latex="L"></span>:</p> |
| |
| <div data-type="block-math" data-latex="L(C) = \\left(\\frac{C_0}{C}\\right)^{\\alpha_C} + L_\\infty"></div> |
| |
| <p>where <span data-type="inline-math" data-latex="\\alpha_C \\approx 0.050"></span> is the compute scaling exponent and <span data-type="inline-math" data-latex="L_\\infty"></span> represents the irreducible loss.<span data-type="footnote" content="The irreducible loss corresponds to the entropy of natural language itself and cannot be reduced by any model regardless of scale. Estimates place it around 1.69 nats for English text."></span> This implies that reducing the reducible loss <span data-type="inline-math" data-latex="L - L_\\infty"></span> by 10% requires approximately <span data-type="inline-math" data-latex="8.2\\times"></span> more compute, since <span data-type="inline-math" data-latex="0.9^{-1/\\alpha_C} = 0.9^{-20} \\approx 8.2"></span>.</p> |
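| |
| <p>A short worked example of how steep this relationship is, taking the fitted form above at face value: the reducible term <span data-type="inline-math" data-latex="L - L_\\infty"></span> scales as <span data-type="inline-math" data-latex="C^{-\\alpha_C}"></span>, so shrinking it by a factor <span data-type="inline-math" data-latex="r"></span> requires scaling compute by <span data-type="inline-math" data-latex="r^{1/\\alpha_C}"></span>. Halving the reducible loss would therefore take roughly a million-fold increase in compute:</p> |
| |
| <div data-type="block-math" data-latex="\\frac{C'}{C} = r^{1/\\alpha_C}, \\qquad r = 2,\\ \\alpha_C = 0.050 \\quad \\Longrightarrow \\quad \\frac{C'}{C} = 2^{20} \\approx 1.05 \\times 10^{6}"></div> |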
| |
| <div data-component="note" title="Cost implications" emoji="💰" variant="info"><p>At current cloud GPU prices ($2/GPU-hour for A100), training a 70B parameter model for 2T tokens costs approximately <strong>$2.4M</strong>. Scaling to 400B parameters at compute-optimal token counts would cost an estimated <strong>$45M</strong>, highlighting the importance of getting scaling predictions right before committing resources.</p></div> |
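| |
| <p>As a back-of-the-envelope check on the 70B estimate, the sketch below rests on two assumptions that are not part of our measurements: training FLOPs follow <span data-type="inline-math" data-latex="C \\approx 6ND"></span>, and an A100 peaks at 312 TFLOP/s for dense bf16. At $2/GPU-hour, $2.4M corresponds to 1.2M GPU-hours, which implies the following sustained throughput per GPU:</p> |
| |
| <div data-type="block-math" data-latex="C \\approx 6 \\times (70 \\times 10^{9}) \\times (2 \\times 10^{12}) = 8.4 \\times 10^{23}\\ \\text{FLOPs}, \\qquad \\frac{8.4 \\times 10^{23}}{1.2 \\times 10^{6} \\times 3600\\ \\text{s}} \\approx 1.9 \\times 10^{14}\\ \\text{FLOP/s}"></div> |
| |
| <p>Under these assumptions the estimate presumes roughly 62% of peak hardware utilization; at lower utilization the cost rises proportionally.</p> |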
| |
| <h2>Discussion</h2> |
| |
| <h3>Limitations</h3> |
| |
| <div data-component="note" title="Important caveats" emoji="⚠️" variant="danger"><p>Our scaling laws are fit on models up to 70B parameters and may not extrapolate reliably beyond this range. The power-law fits assume a fixed architecture family; mixture-of-experts models, retrieval-augmented systems, and other architectural innovations may follow different scaling trajectories.</p></div> |
| |
| <p>Several limitations should be considered when interpreting our results:</p> |
| |
| <ol> |
| <li>All models share the same tokenizer. Tokenizer choice can significantly affect apparent scaling behavior, particularly for multilingual or code-heavy evaluations.</li> |
| <li>Our scaling-law fits focus exclusively on pre-training loss. The relationship between pre-training loss and downstream task performance is complex and task-dependent.</li> |
| <li>Our compute estimates do not include the cost of hyperparameter search, which can add 2-5× overhead for novel architectures.</li> |
| </ol> |
| |
| <h3>Broader impact</h3> |
| |
| <blockquote> |
| <p>"The ability to predict model performance before training enables more responsible allocation of computational resources and reduces wasteful experimentation."</p> |
| </blockquote> |
| |
| <p>Scaling laws have implications beyond pure research. They enable organizations to make informed decisions about resource investment, potentially reducing the carbon footprint of unnecessary large-scale training runs.</p> |
| |
| <div data-component="fullWidth"><p>This research was conducted as part of the Hugging Face Science initiative, which aims to advance open research in machine learning. All models, training code, and evaluation scripts are available under the Apache 2.0 license at <a href="https://huggingface.co">huggingface.co</a>.</p></div> |
| |
| <h2>Conclusion</h2> |
| |
| <p>We have presented a comprehensive empirical study of scaling laws for neural language models, confirming the power-law relationship between compute and performance across five model sizes. Our key findings include:</p> |
| |
| <ul> |
| <li>Compute-optimal training follows the <span data-type="inline-math" data-latex="N^* \\propto C^{0.50}"></span> scaling predicted by the Chinchilla framework</li> |
| <li>Memory requirements scale super-linearly with sequence length due to activation storage</li> |
| <li>Scaling disproportionately benefits rare tokens and high-confusion classes</li> |
| <li>Phase transitions in model capabilities emerge at predictable compute thresholds</li> |
| </ul> |
| |
| <p>These results provide actionable guidance for practitioners planning large-scale training campaigns. Future work should extend this analysis to multimodal models, investigate the interaction between scaling and alignment training, and develop theoretical frameworks that explain the observed power-law behavior.</p> |
| |
| <div data-component="accordion" title="Acknowledgments" open="false"><p>We thank the Hugging Face compute team for providing access to GPU clusters, the open-source contributors who maintain the training infrastructure, and the anonymous reviewers for their insightful feedback. Special thanks to the BigScience and EleutherAI communities for inspiring this line of research.</p></div> |
| |
| <hr /> |
| |
| <div data-component="hfUser" username="tfrere" name="Thibaud Frere" url="https://huggingface.co/tfrere"></div> |
| |
| <div data-type="bibliography"></div> |
| `; |
|
|
| |
| |
| |
| export const SEED_CITATIONS: Record<string, any> = { |
| vaswani2017: { |
| id: "vaswani2017", |
| type: "paper-conference", |
| title: "Attention is all you need", |
| author: [ |
| { family: "Vaswani", given: "Ashish" }, |
| { family: "Shazeer", given: "Noam" }, |
| { family: "Parmar", given: "Niki" }, |
| { family: "Uszkoreit", given: "Jakob" }, |
| { family: "Jones", given: "Llion" }, |
| { family: "Gomez", given: "Aidan N." }, |
| { family: "Kaiser", given: "Lukasz" }, |
| { family: "Polosukhin", given: "Illia" }, |
| ], |
| issued: { "date-parts": [[2017]] }, |
| "container-title": "Advances in Neural Information Processing Systems", |
| volume: "30", |
| DOI: "10.48550/arXiv.1706.03762", |
| }, |
| devlin2019: { |
| id: "devlin2019", |
| type: "paper-conference", |
| title: "BERT: Pre-training of deep bidirectional transformers for language understanding", |
| author: [ |
| { family: "Devlin", given: "Jacob" }, |
| { family: "Chang", given: "Ming-Wei" }, |
| { family: "Lee", given: "Kenton" }, |
| { family: "Toutanova", given: "Kristina" }, |
| ], |
| issued: { "date-parts": [[2019]] }, |
| "container-title": "Proceedings of NAACL-HLT", |
| page: "4171-4186", |
| DOI: "10.18653/v1/N19-1423", |
| }, |
| kaplan2020: { |
| id: "kaplan2020", |
| type: "article-journal", |
| title: "Scaling laws for neural language models", |
| author: [ |
| { family: "Kaplan", given: "Jared" }, |
| { family: "McCandlish", given: "Sam" }, |
| { family: "Henighan", given: "Tom" }, |
| { family: "Brown", given: "Tom B." }, |
| { family: "Chess", given: "Benjamin" }, |
| { family: "Child", given: "Rewon" }, |
| { family: "Gray", given: "Scott" }, |
| { family: "Radford", given: "Alec" }, |
| { family: "Wu", given: "Jeffrey" }, |
| { family: "Amodei", given: "Dario" }, |
| ], |
| issued: { "date-parts": [[2020]] }, |
| "container-title": "arXiv preprint arXiv:2001.08361", |
| DOI: "10.48550/arXiv.2001.08361", |
| }, |
| brown2020: { |
| id: "brown2020", |
| type: "paper-conference", |
| title: "Language models are few-shot learners", |
| author: [ |
| { family: "Brown", given: "Tom B." }, |
| { family: "Mann", given: "Benjamin" }, |
| { family: "Ryder", given: "Nick" }, |
| { family: "Subbiah", given: "Melanie" }, |
| { family: "Kaplan", given: "Jared" }, |
| { family: "Dhariwal", given: "Prafulla" }, |
| { family: "Neelakantan", given: "Arvind" }, |
| { family: "Shyam", given: "Pranav" }, |
| { family: "Sastry", given: "Girish" }, |
| { family: "Askell", given: "Amanda" }, |
| ], |
| issued: { "date-parts": [[2020]] }, |
| "container-title": "Advances in Neural Information Processing Systems", |
| volume: "33", |
| DOI: "10.48550/arXiv.2005.14165", |
| }, |
| hoffmann2022: { |
| id: "hoffmann2022", |
| type: "article-journal", |
| title: "Training compute-optimal large language models", |
| author: [ |
| { family: "Hoffmann", given: "Jordan" }, |
| { family: "Borgeaud", given: "Sebastian" }, |
| { family: "Mensch", given: "Arthur" }, |
| { family: "Buchatskaya", given: "Elena" }, |
| { family: "Cai", given: "Trevor" }, |
| { family: "Rutherford", given: "Eliza" }, |
| { family: "de Las Casas", given: "Diego" }, |
| { family: "Hendricks", given: "Lisa Anne" }, |
| { family: "Welbl", given: "Johannes" }, |
| { family: "Clark", given: "Aidan" }, |
| ], |
| issued: { "date-parts": [[2022]] }, |
| "container-title": "arXiv preprint arXiv:2203.15556", |
| DOI: "10.48550/arXiv.2203.15556", |
| }, |
| wei2022: { |
| id: "wei2022", |
| type: "article-journal", |
| title: "Emergent abilities of large language models", |
| author: [ |
| { family: "Wei", given: "Jason" }, |
| { family: "Tay", given: "Yi" }, |
| { family: "Bommasani", given: "Rishi" }, |
| { family: "Raffel", given: "Colin" }, |
| { family: "Zoph", given: "Barret" }, |
| { family: "Borgeaud", given: "Sebastian" }, |
| { family: "Yogatama", given: "Dani" }, |
| { family: "Bosma", given: "Maarten" }, |
| { family: "Zhou", given: "Denny" }, |
| { family: "Metzler", given: "Donald" }, |
| ], |
| issued: { "date-parts": [[2022]] }, |
| "container-title": "Transactions on Machine Learning Research", |
| DOI: "10.48550/arXiv.2206.07682", |
| }, |
| }; |
|
|