# ace-1/mgpt2-tokenizer
A **pure-Python** Byte-Pair Encoding tokenizer trained to better handle:
- English
- Hindi (Devanagari + transliterated Latin)
- Kannada (Kannada script + transliterated Latin)
This repo ships a custom tokenizer implementation, so it must be loaded with `trust_remote_code=True`.
## Quickstart
```python
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained('ace-1/mgpt2-tokenizer', trust_remote_code=True)
text = "Hello! नमस्ते! ನಮಸ್ಕಾರ! namaste! namaskara!"
ids = tok.encode(text)
print(len(ids), ids[:20])
print(tok.decode(ids))
```
## Tokenizer spec
- **Vocabulary size**: 50,257 (matches GPT‑2 exactly)
- 256 byte tokens + 50,000 merges + `<|endoftext|>`
- **Special tokens**: `<|endoftext|>`
- **Implementation**: custom Python tokenizer under `tokenizer/` (loaded dynamically)
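The spec above (256 byte tokens plus ranked merges) can be illustrated with a minimal byte-level BPE encode loop. The merge table here is a toy stand-in, not the repo's actual merges:

```python
def bpe_encode(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    """Encode text to token ids: start from raw UTF-8 bytes, then
    greedily apply the lowest-ranked merge until none applies."""
    ids = list(text.encode("utf-8"))  # base tokens are the 256 byte values
    while len(ids) >= 2:
        # find the adjacent pair with the lowest merge rank
        pairs = set(zip(ids, ids[1:]))
        best = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if best not in merges:
            break  # no merge applies to any remaining pair
        new_id = 256 + merges[best]  # merged tokens start after the 256 bytes
        out, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == best:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

# toy merge table: rank 0 merges the byte pair for "ab"
toy_merges = {(ord("a"), ord("b")): 0}
print(bpe_encode("abab", toy_merges))  # -> [256, 256]
```

The real tokenizer additionally reserves id 50,256 for `<|endoftext|>` after the 50,000 merges.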
## Training corpus (tokenizer)
The tokenizer was trained on a deterministic mixture built from:
- FineWeb‑Edu (English)
- AI4Bharat Sangraha synthetic splits: `hin_Deva`, `hin_Latn`, `kan_Knda`, `kan_Latn`
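The exact mixture recipe is not specified here, but a deterministic mixture generally means sampling across sources with fixed weights and a fixed seed so the corpus is bit-for-bit reproducible. A hypothetical sketch (source names and weights are illustrative only):

```python
import random

def deterministic_mixture(sources: dict[str, list[str]],
                          weights: dict[str, float],
                          n: int, seed: int = 0) -> list[str]:
    """Draw n documents by seeded weighted sampling over sources.
    Same inputs and seed always yield the same sequence."""
    rng = random.Random(seed)
    names = sorted(sources)  # fixed ordering, independent of dict insertion
    probs = [weights[k] for k in names]
    cursors = {k: 0 for k in names}
    out = []
    while len(out) < n:
        name = rng.choices(names, weights=probs)[0]
        docs = sources[name]
        out.append(docs[cursors[name] % len(docs)])  # cycle within a source
        cursors[name] += 1
    return out
```

Because the RNG is seeded and source order is sorted, two runs produce identical corpora, which is what makes tokenizer training reproducible.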
## Evaluation
This repo includes `evaluation.json` with **tokenizer-only** metrics:
- tokens per 1k bytes (lower is better)
- p95 tokens per line (lower is better)
- bucket breakdown: latin / devanagari / kannada / mixed
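The two headline metrics are easy to reproduce given any tokenizer callable. A sketch, assuming `tokenize` returns a list of token ids for a string (the actual `evaluation.json` schema may differ):

```python
import math

def tokens_per_1k_bytes(lines: list[str], tokenize) -> float:
    """Lower is better: tokens needed per kilobyte of UTF-8 text."""
    total_tokens = sum(len(tokenize(line)) for line in lines)
    total_bytes = sum(len(line.encode("utf-8")) for line in lines)
    return 1000 * total_tokens / total_bytes

def p95_tokens_per_line(lines: list[str], tokenize) -> int:
    """Lower is better: 95th percentile of per-line token counts."""
    counts = sorted(len(tokenize(line)) for line in lines)
    idx = math.ceil(0.95 * len(counts)) - 1  # nearest-rank percentile
    return counts[idx]

# toy tokenizer: one token per whitespace-separated word
toy = lambda s: s.split()
sample = ["hello world", "a b c d"]
print(tokens_per_1k_bytes(sample, toy))   # 6 tokens / 18 bytes
print(p95_tokens_per_line(sample, toy))
```

Byte-based (rather than character-based) normalization keeps the comparison fair across scripts, since Devanagari and Kannada characters take 3 bytes each in UTF-8 versus 1 for ASCII.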
## Files
- Native trained artifact: `mgpt2.model` (minbpe-style `.model` file)
- `tokenizer.vocab` / `tokenizer.model` (HF artifacts generated from the native model)
- `tokenization_mgpt2.py` (root module entrypoint for `transformers` dynamic loading)
## Notes / limitations
- This is a **slow tokenizer** (pure Python). It is intended for research and reproducibility.
- Downstream LM metrics (perplexity, instruction following, DPO) are reported in the main mgpt2 project repo as controlled experiments vs a baseline GPT‑2 tokenizer/model.