---
library_name: transformers
datasets:
- HuggingFaceTB/smollm-corpus
---

# Doge-tokenizer

Tokenizer for training models on [smollm-corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus). It also supports reasoning fine-tuning in the style of R1.

This tokenizer was trained on 2M samples drawn from:

- FineWeb-Edu 70%
- Cosmopedia v2 20%
- Python-Edu 5%
- FineMath 5%
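
To make the mixture concrete, the sketch below converts the listed percentages into per-source sample counts. It assumes the percentages refer to sample counts (not tokens or bytes) and that "2M" means exactly 2,000,000. The names `TOTAL_SAMPLES`, `MIXTURE_PERCENT`, and `samples_per_source` are illustrative and not part of any released training script.

```python
# Assumption: "2M samples" is exactly 2,000,000 and percentages are by sample count.
TOTAL_SAMPLES = 2_000_000

# Mixture as listed in this README (values are percentages).
MIXTURE_PERCENT = {
    "FineWeb-Edu": 70,
    "Cosmopedia v2": 20,
    "Python-Edu": 5,
    "FineMath": 5,
}


def samples_per_source(total, mixture_percent):
    """Split `total` samples across sources according to integer percentages."""
    if sum(mixture_percent.values()) != 100:
        raise ValueError("mixture percentages must sum to 100")
    # Integer arithmetic avoids floating-point rounding in the counts.
    return {name: total * pct // 100 for name, pct in mixture_percent.items()}


counts = samples_per_source(TOTAL_SAMPLES, MIXTURE_PERCENT)
for name, n in counts.items():
    print(f"{name}: {n:,}")
```

Under these assumptions, this gives 1,400,000 samples from FineWeb-Edu, 400,000 from Cosmopedia v2, and 100,000 each from Python-Edu and FineMath.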