| title: README | |
| emoji: π | |
| colorFrom: gray | |
| colorTo: blue | |
| sdk: static | |
| pinned: true | |
| thumbnail: >- | |
| https://cdn-uploads.huggingface.co/production/uploads/61c40eeb727d1257bf3cf5ba/2U1E0gmfmqa8MacXVuz-h.jpeg | |
| short_description: Synthetic Pre-Training Scale Instruction-Tuning Data | |
| license: mit | |
|  | |
| **✨ Paper:** https://arxiv.org/abs/2601.22146 | |
| **✨ Code:** *Coming soon* | |
| **✨ Datasets:** | |
| - ~18M FineTemplates (instruction templates created from real user queries): https://huggingface.co/datasets/fineinstructions/finetemplates | |
| - ~1B+ FineInstructions generated from the Nemotron-CC corpus (300B tokens): https://huggingface.co/datasets/fineinstructions/fineinstructions_nemotron | |
| **✨ Models:** | |
| 1. Query Genericizer (Query → Instruction Template): https://huggingface.co/fineinstructions/query_templatizer | |
| 2. Document ↔ Template Matching / Retrieval Embedding: https://huggingface.co/fineinstructions/instruction_template_retrieval_embedding | |
| 3. Template Instantiator (Document + Template → Synthetic Instruction-Answer Pair): https://huggingface.co/fineinstructions/template_instantiator | |
| **✨ Citation:** | |
| ``` | |
| @article{patel2026fineinstructions, | |
| title={FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale}, | |
| author={Patel, Ajay and Raffel, Colin and Callison-Burch, Chris}, | |
| journal={arXiv preprint arXiv:2601.22146}, | |
| year={2026}, | |
| archivePrefix={arXiv}, | |
| primaryClass={cs.CL}, | |
| doi={10.48550/arXiv.2601.22146} | |
| } | |
| ``` | |
| **✨ Built with DataDreamer:** http://datadreamer.dev/ | |
| **✨ FineInstructions Pipeline:** | |
|  |
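
The three models listed under Models chain together: a query becomes a generic template, a document is matched to a template, and the matched template is instantiated against the document to produce an instruction-answer pair. Below is a toy sketch of that data flow only. Every function here is a hypothetical stand-in (string replacement and word overlap) and is not taken from the released models or code, which are learned models rather than rules. The `<topic>` slot syntax and the use of the whole document as the answer are also assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Pair:
    instruction: str
    answer: str


def templatize(query: str, entity: str) -> str:
    # Stand-in for the Query Genericizer: swap a concrete entity for a slot.
    # (Assumption: the real model decides what to genericize on its own.)
    return query.replace(entity, "<topic>")


def tokens(text: str) -> set:
    # Lowercased alphabetic words; digits and punctuation are ignored.
    return set(re.findall(r"[a-z]+", text.lower()))


def match_template(document: str, templates: list) -> str:
    # Stand-in for the retrieval embedding: pick the template with the
    # highest Jaccard word overlap with the document.
    doc = tokens(document)

    def score(template: str) -> float:
        t = tokens(template)
        return len(doc & t) / len(doc | t)

    return max(templates, key=score)


def instantiate(document: str, template: str, topic: str) -> Pair:
    # Stand-in for the Template Instantiator: fill the slot and use the
    # document text itself as the answer.
    return Pair(template.replace("<topic>", topic), document)


# Hypothetical example data, not drawn from the datasets.
queries = [
    ("What is the boiling point of water?", "water"),
    ("How do I bake sourdough bread?", "sourdough bread"),
]
templates = [templatize(q, e) for q, e in queries]

document = "Ethanol has a boiling point of 78.37 degrees Celsius."
best = match_template(document, templates)
pair = instantiate(document, best, "ethanol")

print(pair.instruction)
print(pair.answer)
```

The sketch keeps the three stages as separate functions so that each stand-in could be swapped for a call to the corresponding model without changing the surrounding flow.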