---
license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: text-generation
---
# FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale

This repository contains model checkpoints for the FineInstructions project, as introduced in the paper [FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale](https://huggingface.co/papers/2601.22146).
## Description

FineInstructions is a procedure that transforms internet-scale pre-training documents into billions of synthetic instruction and answer training pairs. The dataset uses ~18M instruction templates created from real user-written queries and prompts. These templates are matched to and instantiated with human-written source documents from unstructured pre-training corpora.

With "supervised" synthetic training data generated at this scale, an LLM can be pre-trained from scratch solely with the instruction-tuning objective. This approach is more in-distribution with the expected downstream usage of LLMs (responding to user prompts). Experimental results show that pre-training on FineInstructions outperforms standard pre-training on benchmarks measuring free-form response quality.
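As a rough illustration of the template matching and instantiation step described above, the toy sketch below pairs a source document with a template and fills it in. Everything here is an assumption made for illustration: the `$topic` placeholder format, the keyword-overlap matcher, the `TEMPLATES` list, and the function names are hypothetical, and the whole document is reused as the answer for simplicity. The actual FineInstructions pipeline is described in the paper and is not reproduced here.

```python
import re
from string import Template

# Hypothetical templates for illustration only; the real project uses
# ~18M templates created from user-written queries and prompts.
TEMPLATES = [
    {"keywords": {"recipe", "ingredients"}, "text": "How do I make $topic?"},
    {"keywords": {"error", "traceback"}, "text": "Why am I seeing this problem with $topic?"},
]


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def match_template(document, templates):
    """Toy matcher (assumption): pick the template with the largest keyword overlap."""
    words = tokenize(document)
    best = max(templates, key=lambda t: len(t["keywords"] & words))
    # Return None when no template shares any keyword with the document.
    return best if best["keywords"] & words else None


def instantiate(template, topic, document):
    """Fill the template placeholder and pair it with the source document."""
    instruction = Template(template["text"]).substitute(topic=topic)
    # Simplification: the full document stands in for the answer.
    return {"instruction": instruction, "answer": document}


doc = "Pancake recipe. Ingredients: flour, eggs, milk. Whisk and fry."
template = match_template(doc, TEMPLATES)
pair = instantiate(template, topic="pancakes", document=doc)
print(pair["instruction"])
```

The point of the sketch is only the shape of the data flow: a template derived from a user query is matched to a human-written document, and the pair becomes one instruction and answer training example.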
## Citation

If you use this project in your research, please cite:
```bibtex
@article{patel2026fineinstructions,
  title={FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale},
  author={Patel, Ajay and Raffel, Colin and Callison-Burch, Chris},
  journal={arXiv preprint arXiv:2601.22146},
  year={2026},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  doi={10.48550/arXiv.2601.22146}
}
```