---
title: README
emoji: π
colorFrom: gray
colorTo: blue
sdk: static
pinned: true
thumbnail: >-
https://cdn-uploads.huggingface.co/production/uploads/61c40eeb727d1257bf3cf5ba/2U1E0gmfmqa8MacXVuz-h.jpeg
short_description: Synthetic Pre-Training Scale Instruction-Tuning Data
license: mit
---

**✨ Paper:** https://arxiv.org/abs/2601.22146
**✨ Code:** *Coming soon*
**✨ Datasets:**
- ~18M FineTemplates (instruction templates created from real user queries): https://huggingface.co/datasets/fineinstructions/finetemplates
- ~1B+ FineInstructions generated from the Nemotron-CC corpus (300B tokens): https://huggingface.co/datasets/fineinstructions/fineinstructions_nemotron
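
Given the size of these datasets, streaming a few rows is usually a more practical first look than a full download. The following is a minimal sketch, assuming the `datasets` package is installed and that each repository exposes a `train` split (an unverified assumption). The helper names `take` and `preview` are hypothetical.

```python
from itertools import islice

# Repository IDs taken from the dataset links above.
TEMPLATES_REPO = "fineinstructions/finetemplates"
INSTRUCTIONS_REPO = "fineinstructions/fineinstructions_nemotron"


def take(rows, n):
    """Return the first n items of any iterable as a list."""
    return list(islice(rows, n))


def preview(repo_id, n=3, split="train"):
    """Stream the first n rows of a Hub dataset without a full download.

    Assumption: the repository has a split named `split`; adjust it if
    the dataset card says otherwise.
    """
    # Imported lazily so the module loads even without `datasets` installed.
    from datasets import load_dataset

    return take(load_dataset(repo_id, split=split, streaming=True), n)


# Example (requires network access):
# rows = preview(TEMPLATES_REPO)
```
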
**✨ Models:**
1. Query Genericizer (Query → Instruction Template): https://huggingface.co/fineinstructions/query_templatizer
2. Document → Template Matching / Retrieval Embedding: https://huggingface.co/fineinstructions/instruction_template_retrieval_embedding
3. Template Instantiator (Document + Template → Synthetic Instruction-Answer Pair): https://huggingface.co/fineinstructions/template_instantiator
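
The three models above form a pipeline: query to template, document to matching template, then document plus template to an instruction-answer pair. The toy sketch below only illustrates that data flow. It does not call the released models, and everything in it is an assumption made for illustration: the `{{slot}}` placeholder syntax, the bag-of-words cosine similarity standing in for the retrieval embedding, and the function names.

```python
import math
import re
from collections import Counter


def templatize(query: str, entities: dict[str, str]) -> str:
    """Stage 1 (toy): swap specific spans in a query for {{placeholders}}."""
    template = query
    for slot, span in entities.items():
        template = template.replace(span, "{{" + slot + "}}")
    return template


def _bow(text: str) -> Counter:
    """Lowercased bag of alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_template(document: str, templates: list[str]) -> str:
    """Stage 2 (toy): return the template most similar to the document."""
    doc_vec = _bow(document)
    return max(templates, key=lambda t: _cosine(doc_vec, _bow(t)))


def instantiate(template: str, slots: dict[str, str], answer: str) -> dict[str, str]:
    """Stage 3 (toy): fill the placeholders and attach an answer.

    In this sketch the caller supplies both the slot values and the answer.
    """
    instruction = template
    for slot, value in slots.items():
        instruction = instruction.replace("{{" + slot + "}}", value)
    return {"instruction": instruction, "answer": answer}


if __name__ == "__main__":
    template = templatize(
        "What is the boiling point of ethanol?",
        {"property": "boiling point", "substance": "ethanol"},
    )
    document = "Water has a boiling point of 100 degrees Celsius at sea level."
    best = retrieve_template(document, [template, "How do I cook {{dish}} in an oven?"])
    print(instantiate(
        best,
        {"property": "boiling point", "substance": "water"},
        "100 degrees Celsius at sea level",
    ))
```
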
**✨ Citation:**
```bibtex
@article{patel2026fineinstructions,
title={FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale},
author={Patel, Ajay and Raffel, Colin and Callison-Burch, Chris},
journal={arXiv preprint arXiv:2601.22146},
year={2026},
archivePrefix={arXiv},
primaryClass={cs.CL},
doi={10.48550/arXiv.2601.22146}
}
```
**✨ Built with DataDreamer:** http://datadreamer.dev/
**✨ FineInstructions Pipeline:**