---
license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: text-generation
---
# FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale

This repository contains model checkpoints for the FineInstructions project, as introduced in the paper [FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale](https://huggingface.co/papers/2601.22146).

## Description

FineInstructions is a procedure that transforms internet-scale pre-training documents into billions of synthetic instruction and answer training pairs. The dataset uses ~18M instruction templates created from real user-written queries and prompts. These templates are matched to and instantiated with human-written source documents from unstructured pre-training corpora.
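
The template-instantiation step described above can be sketched as follows. This is a minimal illustrative example: the template text, placeholder name, and helper function are assumptions for demonstration, not the project's actual schema or code.

```python
# Illustrative sketch of instantiating an instruction template with a
# human-written source document. The template format and the {document}
# placeholder are hypothetical, not the actual FineInstructions schema.

def instantiate_template(template: str, document: str) -> str:
    """Fill a template's document placeholder with a source document
    drawn from an unstructured pre-training corpus."""
    return template.format(document=document)

template = (
    "Read the passage below and summarize its main argument.\n\n"
    "Passage:\n{document}"
)
document = "Transformers process all tokens in parallel via self-attention."

instruction = instantiate_template(template, document)
print(instruction)
```

At scale, this matching and filling is repeated across billions of documents, yielding the synthetic instruction–answer pairs used for pre-training.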

With "supervised" synthetic training data generated at this scale, an LLM can be pre-trained from scratch solely with the instruction-tuning objective. This approach is more in-distribution with the expected downstream usage of LLMs (responding to user prompts). Experimental results show that pre-training on FineInstructions outperforms standard pre-training on benchmarks measuring free-form response quality.

## Citation

If you use this project in your research, please cite:
```bibtex
@article{patel2026fineinstructions,
  title={FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale},
  author={Patel, Ajay and Raffel, Colin and Callison-Burch, Chris},
  journal={arXiv preprint arXiv:2601.22146},
  year={2026},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  doi={10.48550/arXiv.2601.22146}
}
```