/** * Default demo content loaded when a document is empty (first sync). * Covers every block/mark type supported by the editor. */ export const DEFAULT_CONTENT = `

Introduction

Recent work has revealed remarkably predictable relationships between the scale of neural language models and their performance. As research matures, practitioners increasingly rely on these empirical regularities to guide decisions about model architecture, training data, and compute allocation.

The concept of scaling laws in deep learning predates the current LLM era. Similar power-law behaviors were observed in machine translation and image classification as early as 2017.

This paper presents an empirical study of behavior across three key axes: model size (number of parameters), dataset size (number of tokens), and compute budget (FLOPs). Building on the foundational work of and subsequent refinements by , we investigate how these relationships hold across different architectural variants and training regimes.

We make the following contributions:

A comprehensive benchmark of scaling behavior across five model families ranging from 125M to 70B parameters
An analysis of compute-optimal training, extending the Chinchilla framework to mixture-of-experts architectures
Practical guidelines for resource allocation in large-scale training campaigns

Background

The Transformer architecture

The architecture forms the backbone of all models studied in this work. At its core, the self-attention mechanism computes a weighted combination of value vectors, where weights are determined by the compatibility between query and key vectors:

where are queries, are keys, and are values. The scaling factor prevents the dot products from growing too large in magnitude.

Pre-training objectives

All models in our study are trained with the standard autoregressive language modeling objective. Given a sequence of tokens , the model maximizes:

This objective naturally decomposes across sequence positions, enabling efficient parallelization. Prior work demonstrated that bidirectional objectives can yield superior representations for downstream tasks, but autoregressive training remains the dominant paradigm for generative models.

Loss decomposition

For analysis purposes, we decompose the training loss into contributions from different frequency bins:

where partition the vocabulary by token frequency. This decomposition reveals that scaling primarily benefits predictions on rare tokens.

Experimental setup

Model configurations

We train five model families spanning three orders of magnitude in parameter count. All models use the standard decoder-only Transformer architecture with rotary position embeddings (RoPE) and SwiGLU activations.

Model	Parameters	Layers	Hidden dim	Heads
Small	125M	12	768	12
Medium	1.3B	24	2048	16
Large	6.7B	32	4096	32
XL	13B	40	5120	40
XXL	70B	80	8192	64

Training infrastructure

All experiments were conducted on clusters of NVIDIA A100 80GB GPUs connected via NVLink and InfiniBand. We used a custom distributed training framework built on PyTorch FSDP with mixed-precision (bf16) training. Gradient checkpointing was enabled for models above 6.7B parameters to fit within GPU memory constraints.

The following code snippet illustrates our distributed training configuration:

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM, AutoConfig

config = AutoConfig.from_pretrained("meta-llama/Llama-3-8B")
config.use_cache = False

model = AutoModelForCausalLM.from_config(config)
model = FSDP(
    model,
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.float32,
        buffer_dtype=torch.bfloat16,
    ),
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    activation_checkpointing=True,
)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = CosineAnnealingLR(optimizer, T_max=total_steps, eta_min=3e-5)

All experiments use a fixed random seed (42) for weight initialization and data shuffling. However, non-determinism in CUDA operations means exact reproducibility across hardware configurations is not guaranteed. We report means and standard deviations across three independent runs for each configuration.

Dataset

We train on a deduplicated mixture of web text, books, and scientific papers totaling 2.4 trillion tokens. The mixture weights were optimized following the approach of , with upsampling of high-quality sources.

Scaling behavior

Compute-optimal training

Our central finding confirms and extends the hypothesis: for a given compute budget , the optimal model size and token count follow power laws:

The interactive visualization below shows the scaling relationship across our model suite:

Memory requirements

Training memory breakdown

Understanding memory consumption is critical for capacity planning. Training memory consists of four components: model parameters, gradients, optimizer states, and activations. The chart below shows how these components scale across model sizes and sequence lengths:

The 70B model requires at minimum 4-way tensor parallelism at sequence length 4096 with selective recomputation. Without activation checkpointing, memory requirements exceed 1.5 TB, making training infeasible even on 8×A100 nodes.

Activation checkpointing strategies

We evaluate three recomputation strategies and find that selective recomputation (checkpointing only attention layers) provides the best trade-off between memory savings and compute overhead:

None: full activation storage, maximum memory, fastest wall-clock time
Selective: checkpoint attention blocks only, ~75% memory reduction in activations with <5% throughput loss
Full: checkpoint every layer, ~94% activation memory reduction but 30-40% throughput penalty

Evaluation

Benchmark performance

We evaluate all models on a suite of standard benchmarks. The table below summarizes key results:

Model	MMLU	HellaSwag	ARC-C	TruthfulQA
Small (125M)	26.1	31.2	22.8	42.1
Medium (1.3B)	34.7	52.6	31.4	38.9
Large (6.7B)	52.3	74.1	45.6	35.2
XL (13B)	61.8	80.4	53.2	39.7
XXL (70B)	72.4	86.9	62.1	44.3

TruthfulQA shows a U-shaped pattern with model scale: medium-sized models perform worst as they are large enough to memorize common misconceptions but not large enough to reason past them.

Classification analysis

To understand where scaling helps most, we analyze per-class performance on a 10-class text classification task. The confusion matrices below compare our baseline (1.3B) and improved (6.7B) models:

The delta matrix reveals that scaling disproportionately benefits classes 2, 3, 5, and 8 (categories with high inter-class confusion at smaller scales), while already well-separated classes (0, 7, 9) see modest improvements.

Analysis

Training dynamics

If you have a very large neural network and you train it on a very large dataset, you get very good results. It really is that simple.

While the quote above captures the high-level intuition, the reality is more nuanced. Our experiments reveal several non-obvious phenomena during training:

Phase transitions

We observe distinct phase transitions in model capabilities as training progresses. These transitions are characterized by sudden improvements in specific task categories, consistent with the "emergence" hypothesis .

Early training (0-20% tokens)

Models acquire basic syntax, word co-occurrence statistics, and simple factual associations. Loss decreases rapidly following a power law. All model sizes show similar learning curves during this phase.

Late training (80-100% tokens)

Larger models continue improving on complex reasoning tasks while smaller models plateau. The gap between model sizes widens, with diminishing returns for the smallest models. Multi-step reasoning capabilities emerge only in models above 6.7B parameters.

Compute efficiency

A critical finding is the relationship between total compute and achieved loss :

where is the compute scaling exponent and represents the irreducible loss. This implies that reducing loss by 10% requires approximately more compute.

At current cloud GPU prices ($2/GPU-hour for A100), training a 70B parameter model for 2T tokens costs approximately $2.4M. Scaling to 400B parameters at compute-optimal token counts would cost an estimated $45M, highlighting the importance of getting scaling predictions right before committing resources.

Discussion

Limitations

Our scaling laws are fit on models up to 70B parameters and may not extrapolate reliably beyond this range. The power-law fits assume a fixed architecture family; mixture-of-experts models, retrieval-augmented systems, and other architectural innovations may follow different scaling trajectories.

Several limitations should be considered when interpreting our results:

All models share the same tokenizer. Tokenizer choice can significantly affect apparent scaling behavior, particularly for multilingual or code-heavy evaluations.
We focus exclusively on pre-training loss. The relationship between pre-training loss and downstream task performance is complex and task-dependent.
Our compute estimates do not include the cost of hyperparameter search, which can add 2-5× overhead for novel architectures.

Broader impact

"The ability to predict model performance before training enables more responsible allocation of computational resources and reduces wasteful experimentation."

Scaling laws have implications beyond pure research. They enable organizations to make informed decisions about resource investment, potentially reducing the carbon footprint of unnecessary large-scale training runs.

This research was conducted as part of the Hugging Face Science initiative, which aims to advance open research in machine learning. All models, training code, and evaluation scripts are available under the Apache 2.0 license at huggingface.co.

Conclusion

We have presented a comprehensive empirical study of scaling laws for neural language models, confirming the power-law relationship between compute and performance across five model sizes. Our key findings include:

Compute-optimal training follows the scaling predicted by the Chinchilla framework
Memory requirements scale super-linearly with sequence length due to activation storage
Scaling disproportionately benefits rare tokens and high-confusion classes
Phase transitions in model capabilities emerge at predictable compute thresholds

These results provide actionable guidance for practitioners planning large-scale training campaigns. Future work should extend this analysis to multimodal models, investigate the interaction between scaling and alignment training, and develop theoretical frameworks that explain the observed power-law behavior.

We thank the Hugging Face compute team for providing access to GPU clusters, the open-source contributors who maintain the training infrastructure, and the anonymous reviewers for their insightful feedback. Special thanks to the BigScience and EleutherAI communities for inspiring this line of research.

`; /** * Demo citation entries (CSL-JSON) seeded into Y.Map("citations"). */ export const SEED_CITATIONS: Record = { vaswani2017: { id: "vaswani2017", type: "paper-conference", title: "Attention is all you need", author: [ { family: "Vaswani", given: "Ashish" }, { family: "Shazeer", given: "Noam" }, { family: "Parmar", given: "Niki" }, { family: "Uszkoreit", given: "Jakob" }, { family: "Jones", given: "Llion" }, { family: "Gomez", given: "Aidan N." }, { family: "Kaiser", given: "Lukasz" }, { family: "Polosukhin", given: "Illia" }, ], issued: { "date-parts": [[2017]] }, "container-title": "Advances in Neural Information Processing Systems", volume: "30", DOI: "10.48550/arXiv.1706.03762", }, devlin2019: { id: "devlin2019", type: "paper-conference", title: "BERT: Pre-training of deep bidirectional transformers for language understanding", author: [ { family: "Devlin", given: "Jacob" }, { family: "Chang", given: "Ming-Wei" }, { family: "Lee", given: "Kenton" }, { family: "Toutanova", given: "Kristina" }, ], issued: { "date-parts": [[2019]] }, "container-title": "Proceedings of NAACL-HLT", page: "4171-4186", DOI: "10.18653/v1/N19-1423", }, kaplan2020: { id: "kaplan2020", type: "article-journal", title: "Scaling laws for neural language models", author: [ { family: "Kaplan", given: "Jared" }, { family: "McCandlish", given: "Sam" }, { family: "Henighan", given: "Tom" }, { family: "Brown", given: "Tom B." }, { family: "Chess", given: "Benjamin" }, { family: "Child", given: "Rewon" }, { family: "Gray", given: "Scott" }, { family: "Radford", given: "Alec" }, { family: "Wu", given: "Jeffrey" }, { family: "Amodei", given: "Dario" }, ], issued: { "date-parts": [[2020]] }, "container-title": "arXiv preprint arXiv:2001.08361", DOI: "10.48550/arXiv.2001.08361", }, brown2020: { id: "brown2020", type: "paper-conference", title: "Language models are few-shot learners", author: [ { family: "Brown", given: "Tom B." }, { family: "Mann", given: "Benjamin" }, { family: "Ryder", given: "Nick" }, { family: "Subbiah", given: "Melanie" }, { family: "Kaplan", given: "Jared" }, { family: "Dhariwal", given: "Prafulla" }, { family: "Neelakantan", given: "Arvind" }, { family: "Shyam", given: "Pranav" }, { family: "Sastry", given: "Girish" }, { family: "Askell", given: "Amanda" }, ], issued: { "date-parts": [[2020]] }, "container-title": "Advances in Neural Information Processing Systems", volume: "33", DOI: "10.48550/arXiv.2005.14165", }, hoffmann2022: { id: "hoffmann2022", type: "article-journal", title: "Training compute-optimal large language models", author: [ { family: "Hoffmann", given: "Jordan" }, { family: "Borgeaud", given: "Sebastian" }, { family: "Mensch", given: "Arthur" }, { family: "Buchatskaya", given: "Elena" }, { family: "Cai", given: "Trevor" }, { family: "Rutherford", given: "Eliza" }, { family: "de Las Casas", given: "Diego" }, { family: "Hendricks", given: "Lisa Anne" }, { family: "Welbl", given: "Johannes" }, { family: "Clark", given: "Aidan" }, ], issued: { "date-parts": [[2022]] }, "container-title": "arXiv preprint arXiv:2203.15556", DOI: "10.48550/arXiv.2203.15556", }, wei2022: { id: "wei2022", type: "article-journal", title: "Emergent abilities of large language models", author: [ { family: "Wei", given: "Jason" }, { family: "Tay", given: "Yi" }, { family: "Bommasani", given: "Rishi" }, { family: "Raffel", given: "Colin" }, { family: "Zoph", given: "Barret" }, { family: "Borgeaud", given: "Sebastian" }, { family: "Yogatama", given: "Dani" }, { family: "Bosma", given: "Maarten" }, { family: "Zhou", given: "Denny" }, { family: "Metzler", given: "Donald" }, ], issued: { "date-parts": [[2022]] }, "container-title": "Transactions on Machine Learning Research", DOI: "10.48550/arXiv.2206.07682", }, };