# ace-1/mgpt2-tokenizer

A **pure-Python** Byte-Pair Encoding tokenizer trained to better handle:

- English
- Hindi (Devanagari + transliterated Latin)
- Kannada (Kannada script + transliterated Latin)

This repo is meant to be used with `trust_remote_code=True`.

## Quickstart

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('ace-1/mgpt2-tokenizer', trust_remote_code=True)

text = "Hello! नमस्ते! ನಮಸ್ಕಾರ! namaste! namaskara!"
ids = tok.encode(text)
print(len(ids), ids[:20])
print(tok.decode(ids))
```

## Tokenizer spec

- **Vocabulary size**: 50,257 (matches GPT‑2 exactly)
  - 256 byte tokens + 50,000 merges + `<|endoftext|>`
- **Special tokens**: `<|endoftext|>`
- **Implementation**: custom Python tokenizer under `tokenizer/` (loaded dynamically)

## Training corpus (tokenizer)

The tokenizer was trained on a deterministic mixture built from:

- FineWeb‑Edu (English)
- AI4Bharat Sangraha synthetic splits: `hin_Deva`, `hin_Latn`, `kan_Knda`, `kan_Latn`

## Evaluation

This repo includes `evaluation.json` with **tokenizer-only** metrics:

- tokens per 1k bytes (lower is better)
- p95 tokens per line (lower is better)
- bucket breakdown: latin / devanagari / kannada / mixed

## Files

- Native trained artifact: `mgpt2.model` (minbpe-style `.model` file)
- `tokenizer.vocab` / `tokenizer.model` (HF artifacts generated from the native model)
- `tokenization_mgpt2.py` (root module entrypoint for `transformers` dynamic loading)

## Notes / limitations

- This is a **slow tokenizer** (pure Python); it is intended for research and reproducibility.
- Downstream LM metrics (perplexity, instruction following, DPO) are reported in the main mgpt2 project repo as controlled experiments against a baseline GPT‑2 tokenizer/model.
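As a rough illustration of the byte-level BPE scheme described in the spec above (256 base byte tokens plus ranked merges), here is a minimal encode sketch in the minbpe style. The `merges` dict layout (byte-pair → new token id, with earlier merges getting lower ids) is an assumption for illustration, not the repo's actual `.model` format:

```python
def apply_merge(ids, pair, idx):
    # Replace every occurrence of `pair` in `ids` with the merged token id `idx`.
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def bpe_encode(text, merges):
    # Start from raw UTF-8 bytes (ids 0..255), then greedily apply the
    # lowest-ranked eligible merge until no known pair remains.
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no trained merge applies to any adjacent pair
        ids = apply_merge(ids, pair, merges[pair])
    return ids
```

Because encoding starts from bytes, any Unicode input (Devanagari, Kannada, emoji) is representable even with zero merges; training merges only makes the output shorter.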
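The evaluation metrics listed above can be computed from any encoder. A minimal sketch, assuming only a callable `encode(text) -> list[int]`; the function names here are illustrative, not the repo's actual evaluation code:

```python
def tokens_per_1k_bytes(texts, encode):
    # Total tokens emitted per 1,000 UTF-8 bytes of input (lower is better).
    total_tokens = sum(len(encode(t)) for t in texts)
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    return 1000 * total_tokens / total_bytes

def p95_tokens_per_line(lines, encode):
    # 95th-percentile token count over individual lines (lower is better).
    counts = sorted(len(encode(line)) for line in lines)
    idx = min(len(counts) - 1, int(0.95 * len(counts)))
    return counts[idx]
```

As a sanity check, a pure byte-level fallback encoder (`lambda t: list(t.encode("utf-8"))`) scores exactly 1000 tokens per 1k bytes; a trained vocabulary should score well below that on in-domain text.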