/**
 * Default demo content loaded when a document is empty (first sync).
 * Covers every block/mark type supported by the editor.
 */
export const DEFAULT_CONTENT = `
Recent work has revealed remarkably predictable relationships between the scale of neural language models and their performance. As research matures, practitioners increasingly rely on these empirical regularities to guide decisions about model architecture, training data, and compute allocation.
The concept of scaling laws in deep learning predates the current LLM era. Similar power-law behaviors were observed in machine translation and image classification as early as 2017.
This paper presents an empirical study of scaling behavior across three key axes: model size (number of parameters), dataset size (number of tokens), and compute budget (FLOPs). Building on foundational work on neural scaling laws and subsequent refinements, we investigate how these relationships hold across different architectural variants and training regimes.
We make the following contributions:
The Transformer architecture forms the backbone of all models studied in this work. At its core, the self-attention mechanism computes a weighted combination of value vectors, where weights are determined by the compatibility between query and key vectors:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V
$$

where $Q$ are queries, $K$ are keys, and $V$ are values. The scaling factor $1/\sqrt{d_k}$, with $d_k$ the key dimension, prevents the dot products from growing too large in magnitude.
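The scaled dot-product attention described above can be sketched in a few lines of plain Python. This is a minimal illustration on nested lists, not the implementation used in our experiments; the function names `softmax` and `attention` and the toy inputs are assumptions made for this example.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on nested lists.

    Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v.
    """
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    d_v = len(V[0])
    out = []
    for q in Q:
        # Compatibility of this query with every key, scaled by 1/sqrt(d_k).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights = softmax(scores)
        # Weighted combination of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)])
    return out

# Toy example: one query that matches the first key more closely.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
result = attention(Q, K, V)
print(result)
```

Because the query aligns with the first key, the output leans toward the first value vector while still mixing in the second.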
All models in our study are trained with the standard autoregressive language modeling objective. Given a sequence of tokens $x_1, \dots, x_T$, the model maximizes:

$$
\mathcal{L}(\theta) = \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})
$$
This objective naturally decomposes across sequence positions, enabling efficient parallelization. Prior work demonstrated that bidirectional objectives can yield superior representations for downstream tasks, but autoregressive training remains the dominant paradigm for generative models.
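To make the decomposition across positions concrete, the sketch below sums per-position log-probabilities for a short sequence. The helper `sequence_log_likelihood` and the probability tables are hypothetical and chosen only for illustration; a real model would produce these conditionals from its softmax output.

```python
import math

def sequence_log_likelihood(step_probs, tokens):
    """Sum of log p(x_t | x_<t) over positions.

    step_probs[t] maps each candidate token to the probability assigned
    at position t, given tokens[:t].
    """
    return sum(math.log(step_probs[t][tok]) for t, tok in enumerate(tokens))

# Hypothetical conditionals for a three-token sequence.
tokens = ["the", "cat", "sat"]
step_probs = [
    {"the": 0.5, "a": 0.5},
    {"cat": 0.25, "dog": 0.75},
    {"sat": 0.5, "ran": 0.5},
]
ll = sequence_log_likelihood(step_probs, tokens)
print(ll)
```

Each term depends only on the prefix up to that position, which is why all positions can be scored in parallel during training.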
For analysis purposes, we decompose the training loss into contributions from different frequency bins:
where the bins partition the vocabulary by token frequency. This decomposition reveals that scaling primarily benefits predictions on rare tokens.
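One way such a decomposition could be computed is sketched below. The function name, the bin thresholds, and the token counts are all assumptions for this example; the exact binning used in our analysis is not reproduced here.

```python
from collections import Counter

def loss_by_frequency_bin(tokens, token_losses, corpus_counts, edges):
    """Split the mean per-token loss into per-bin contributions.

    edges are ascending count thresholds. A token with corpus count c goes
    to bin k, where k is the number of thresholds that c reaches.
    The returned contributions sum to the overall mean loss.
    """
    contrib = [0.0] * (len(edges) + 1)
    for tok, loss in zip(tokens, token_losses):
        k = sum(corpus_counts[tok] >= e for e in edges)
        contrib[k] += loss / len(tokens)
    return contrib

# Hypothetical corpus statistics and per-token losses.
corpus_counts = Counter({"the": 1000, "model": 50, "zygote": 2})
tokens = ["the", "model", "the", "zygote"]
token_losses = [0.5, 2.0, 0.5, 6.0]
bins = loss_by_frequency_bin(tokens, token_losses, corpus_counts, edges=[10, 100])
print(bins)  # index 0 = rarest bin, last index = most frequent bin
```

In this toy case the single rare token accounts for most of the mean loss, which is the kind of pattern the decomposition is meant to expose.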
We train five model sizes spanning nearly three orders of magnitude in parameter count (125M to 70B). All models use the standard decoder-only Transformer architecture with rotary position embeddings (RoPE) and SwiGLU activations.
| Model | Parameters | Layers | Hidden dim | Heads |
|---|---|---|---|---|
| Small | 125M | 12 | 768 | 12 |
| Medium | 1.3B | 24 | 2048 | 16 |
| Large | 6.7B | 32 | 4096 | 32 |
| XL | 13B | 40 | 5120 | 40 |
| XXL | 70B | 80 | 8192 | 64 |
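As a rough sanity check on the table, the widely used approximation of 12 × layers × hidden² non-embedding parameters can be evaluated for each configuration. This is only a sketch: it ignores embeddings, normalization weights, and any attention variants, so it is expected to undershoot the nominal sizes somewhat. The function name `approx_params` is introduced here for illustration.

```python
def approx_params(n_layers, d_model):
    """Rough non-embedding parameter count: 12 * n_layers * d_model**2."""
    return 12 * n_layers * d_model ** 2

# (layers, hidden dim) taken from the table above.
models = {
    "Small": (12, 768),
    "Medium": (24, 2048),
    "Large": (32, 4096),
    "XL": (40, 5120),
    "XXL": (80, 8192),
}

for name, (layers, hidden) in models.items():
    print(f"{name}: {approx_params(layers, hidden) / 1e9:.2f}B non-embedding parameters")
```

The estimates land close to the nominal sizes (for example about 6.4B for the Large configuration), with the gap largely explained by the terms the approximation leaves out.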
All experiments were conducted on clusters of NVIDIA A100 80GB GPUs connected via NVLink and InfiniBand. We used a custom distributed training framework built on PyTorch FSDP with mixed-precision (bf16) training. Gradient checkpointing was enabled for models above 6.7B parameters to fit within GPU memory constraints.
The following code snippet illustrates our distributed training configuration:
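```python
import torch
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)
from torch.optim.lr_scheduler import CosineAnnealingLR
from transformers import AutoConfig, AutoModelForCausalLM

# Assumes torch.distributed is already initialized (e.g. launched via torchrun).
config = AutoConfig.from_pretrained("meta-llama/Llama-3-8B")
config.use_cache = False  # KV caching is not needed during training
model = AutoModelForCausalLM.from_config(config)
# FSDP takes no activation-checkpointing argument; enable it on the model instead.
model.gradient_checkpointing_enable()

model = FSDP(
    model,
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.float32,
        buffer_dtype=torch.bfloat16,
    ),
    sharding_strategy=ShardingStrategy.FULL_SHARD,
)

total_steps = 100_000  # placeholder: set to the planned number of optimizer steps
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = CosineAnnealingLR(optimizer, T_max=total_steps, eta_min=3e-5)
```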
All experiments use a fixed random seed (42) for weight initialization and data shuffling. However, non-determinism in CUDA operations means exact reproducibility across hardware configurations is not guaranteed. We report means and standard deviations across three independent runs for each configuration.
We train on a deduplicated mixture of web text, books, and scientific papers totaling 2.4 trillion tokens. The mixture weights were tuned following prior work, with upsampling of high-quality sources.
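Our central finding confirms and extends the compute-optimal scaling hypothesis: for a given compute budget $C$, the optimal model size $N_{\mathrm{opt}}$ and token count $D_{\mathrm{opt}}$ follow power laws:

$$
N_{\mathrm{opt}}(C) \propto C^{a}, \qquad D_{\mathrm{opt}}(C) \propto C^{b}
$$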
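A power-law exponent of this kind is typically estimated by linear regression in log-log space. The sketch below shows that procedure on synthetic data; the generating constants (0.1 and 0.5) are arbitrary illustrative values, not the exponents fitted in this work, and `fit_power_law` is a name introduced for this example.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = k * x**a in log-log space; returns (k, a)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    # Slope of the regression line is the power-law exponent.
    a = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum((u - mx) ** 2 for u in lx)
    k = math.exp(my - a * mx)
    return k, a

# Synthetic data only: generated from N = 0.1 * C**0.5, not measured results.
compute = [1e18, 1e19, 1e20, 1e21]
n_opt = [0.1 * c ** 0.5 for c in compute]

k, a = fit_power_law(compute, n_opt)
print(f"fitted exponent a = {a:.3f}, prefactor k = {k:.3f}")
```

On real measurements the points scatter around the line, so the fit quality and the compute range covered both affect how far the exponent can be trusted.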
The interactive visualization below shows the scaling relationship across our model suite:
Understanding memory consumption is critical for capacity planning. Training memory consists of four components: model parameters, gradients, optimizer states, and activations. The chart below shows how these components scale across model sizes and sequence lengths:
The 70B model requires at minimum 4-way tensor parallelism at sequence length 4096 with selective recomputation. Without activation checkpointing, memory requirements exceed 1.5 TB, making training infeasible even on 8×A100 nodes.
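The parameter, gradient, and optimizer-state components can be estimated with simple per-parameter byte accounting. The sketch below assumes bf16 parameters and gradients (2 bytes each) and 12 bytes of fp32 AdamW state per parameter (master weights plus two moment estimates); these byte counts are a common convention and an assumption here, and activations are deliberately left out because they depend on batch size, sequence length, and the recomputation strategy.

```python
def static_training_memory_gb(n_params, param_bytes=2, grad_bytes=2, optim_bytes=12):
    """Memory for parameters, gradients, and optimizer state, in GB (no activations)."""
    return n_params * (param_bytes + grad_bytes + optim_bytes) / 1e9

sizes = [
    ("Small", 125e6),
    ("Medium", 1.3e9),
    ("Large", 6.7e9),
    ("XL", 13e9),
    ("XXL", 70e9),
]

for name, n in sizes:
    print(f"{name}: {static_training_memory_gb(n):,.0f} GB before activations")
```

Under these assumptions the 70B model already needs about 1.1 TB for static state alone, which has to be sharded across devices before any activation memory is counted.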
We evaluate three recomputation strategies and find that selective recomputation (checkpointing only attention layers) provides the best trade-off between memory savings and compute overhead:
We evaluate all models on a suite of standard benchmarks. The table below summarizes key results:
| Model | MMLU | HellaSwag | ARC-C | TruthfulQA |
|---|---|---|---|---|
| Small (125M) | 26.1 | 31.2 | 22.8 | 42.1 |
| Medium (1.3B) | 34.7 | 52.6 | 31.4 | 38.9 |
| Large (6.7B) | 52.3 | 74.1 | 45.6 | 35.2 |
| XL (13B) | 61.8 | 80.4 | 53.2 | 39.7 |
| XXL (70B) | 72.4 | 86.9 | 62.1 | 44.3 |
TruthfulQA shows a U-shaped pattern with model scale: scores fall from 42.1 (125M) to a minimum of 35.2 at 6.7B before recovering to 44.3 at 70B. Mid-sized models perform worst because they are large enough to memorize common misconceptions but not large enough to reason past them.
To understand where scaling helps most, we analyze per-class performance on a 10-class text classification task. The confusion matrices below compare our baseline (1.3B) and improved (6.7B) models:
The delta matrix reveals that scaling disproportionately benefits classes 2, 3, 5, and 8 (categories with high inter-class confusion at smaller scales), while already well-separated classes (0, 7, 9) see modest improvements.
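A delta matrix and the per-class gains can be derived from two confusion matrices as sketched below. The 3-class matrices are hypothetical and far smaller than the 10-class task discussed above; they only illustrate the computation and are not results from our experiments.

```python
def per_class_recall(cm):
    """Recall per class from a confusion matrix (rows = true, columns = predicted)."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]

# Hypothetical matrices; class 1 is heavily confused with its neighbours at baseline.
baseline = [[90, 5, 5], [20, 60, 20], [5, 5, 90]]
improved = [[92, 4, 4], [10, 80, 10], [4, 4, 92]]

# Element-wise difference: positive on the diagonal means more correct predictions.
delta = [[b - a for a, b in zip(row_a, row_b)] for row_a, row_b in zip(baseline, improved)]

gain = [r1 - r0 for r0, r1 in zip(per_class_recall(baseline), per_class_recall(improved))]
most_improved = max(range(len(gain)), key=lambda i: gain[i])
print(delta)
print(f"largest recall gain: class {most_improved}")
```

In this toy case the class with the most confusion at baseline gains the most, mirroring the pattern described above.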
> If you have a very large neural network and you train it on a very large dataset, you get very good results. It really is that simple.
While the quote above captures the high-level intuition, the reality is more nuanced. Our experiments reveal several non-obvious phenomena during training:
We observe distinct phase transitions in model capabilities as training progresses. These transitions are characterized by sudden improvements in specific task categories, consistent with the "emergence" hypothesis.
Early training (0-20% tokens)
Models acquire basic syntax, word co-occurrence statistics, and simple factual associations. Loss decreases rapidly following a power law. All model sizes show similar learning curves during this phase.
Late training (80-100% tokens)
Larger models continue improving on complex reasoning tasks while smaller models plateau. The gap between model sizes widens, with diminishing returns for the smallest models. Multi-step reasoning capabilities emerge only in models above 6.7B parameters.
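A critical finding is the relationship between total compute $C$ and achieved loss $L$:

$$
L(C) = A\,C^{-\alpha} + L_\infty
$$

where $\alpha$ is the compute scaling exponent, $A$ is a fitted constant, and $L_\infty$ represents the irreducible loss. This implies that reducing the reducible loss $L(C) - L_\infty$ by 10% requires $0.9^{-1/\alpha}$ times more compute.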
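To see how sensitive the required compute is to the exponent, assume the reducible part of the loss follows a pure power law in compute, proportional to C raised to the power −α. The sketch below evaluates the resulting compute multiplier; the exponent values are hypothetical and are not the fitted exponent from this work.

```python
def compute_multiplier(alpha, loss_ratio=0.9):
    """Factor by which compute must grow so that a term proportional to
    C**(-alpha) shrinks to loss_ratio times its current value."""
    return loss_ratio ** (-1.0 / alpha)

# Hypothetical exponents, for illustration only.
for alpha in (0.05, 0.1):
    factor = compute_multiplier(alpha)
    print(f"alpha = {alpha}: about {factor:.1f}x more compute for a 10% reduction")
```

Smaller exponents make each further reduction much more expensive, which is why small errors in the fitted exponent translate into large errors in budget forecasts.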
At current cloud GPU prices ($2/GPU-hour for A100), training a 70B parameter model for 2T tokens costs approximately $2.4M. Scaling to 400B parameters at compute-optimal token counts would cost an estimated $45M, highlighting the importance of getting scaling predictions right before committing resources.
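A back-of-the-envelope version of this cost estimate can be written using the common approximation that training takes about 6 FLOPs per parameter per token. The sketch assumes the A100 peak of 312 TFLOP/s for dense bf16 and a hardware utilization of 50%; the utilization in particular is an assumption, and the final figure scales inversely with it.

```python
def training_cost_usd(n_params, n_tokens, peak_flops=312e12,
                      utilization=0.5, usd_per_gpu_hour=2.0):
    """Estimate training cost from the C ~ 6 * N * D FLOPs approximation."""
    total_flops = 6 * n_params * n_tokens
    # Effective throughput per GPU is peak throughput times assumed utilization.
    gpu_seconds = total_flops / (peak_flops * utilization)
    gpu_hours = gpu_seconds / 3600
    return gpu_hours * usd_per_gpu_hour

cost = training_cost_usd(70e9, 2e12)
print(f"estimated cost: {cost / 1e6:.2f}M USD")
```

With these assumptions the estimate comes out in the same range as the figure quoted above; higher utilization or lower prices would bring it down proportionally.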
Our scaling laws are fit on models up to 70B parameters and may not extrapolate reliably beyond this range. The power-law fits assume a fixed architecture family; mixture-of-experts models, retrieval-augmented systems, and other architectural innovations may follow different scaling trajectories.
Several limitations should be considered when interpreting our results:
"The ability to predict model performance before training enables more responsible allocation of computational resources and reduces wasteful experimentation."
Scaling laws have implications beyond pure research. They enable organizations to make informed decisions about resource investment, potentially reducing the carbon footprint of unnecessary large-scale training runs.
This research was conducted as part of the Hugging Face Science initiative, which aims to advance open research in machine learning. All models, training code, and evaluation scripts are available under the Apache 2.0 license at huggingface.co.
We have presented a comprehensive empirical study of scaling laws for neural language models, confirming the power-law relationship between compute and performance across five model sizes. Our key findings include:
These results provide actionable guidance for practitioners planning large-scale training campaigns. Future work should extend this analysis to multimodal models, investigate the interaction between scaling and alignment training, and develop theoretical frameworks that explain the observed power-law behavior.
We thank the Hugging Face compute team for providing access to GPU clusters, the open-source contributors who maintain the training infrastructure, and the anonymous reviewers for their insightful feedback. Special thanks to the BigScience and EleutherAI communities for inspiring this line of research.