# ace-1/mgpt2-tokenizer

A **pure-Python** Byte-Pair Encoding tokenizer trained to better handle:

- English
- Hindi (Devanagari + transliterated Latin)
- Kannada (Kannada script + transliterated Latin)

This repo is meant to be used with `trust_remote_code=True`.

## Quickstart

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('ace-1/mgpt2-tokenizer', trust_remote_code=True)

text = "Hello! नमस्ते! ನಮಸ್ಕಾರ! namaste! namaskara!"
ids = tok.encode(text)
print(len(ids), ids[:20])
print(tok.decode(ids))
```

## Tokenizer spec

- **Vocabulary size**: 50,257 (matches GPT‑2 exactly)
  - 256 byte tokens + 50,000 merges + `<|endoftext|>`
- **Special tokens**: `<|endoftext|>`
- **Implementation**: custom Python tokenizer under `tokenizer/` (loaded dynamically)

## Training corpus (tokenizer)

The tokenizer was trained on a deterministic mixture built from:

- FineWeb‑Edu (English)
- AI4Bharat Sangraha synthetic splits: `hin_Deva`, `hin_Latn`, `kan_Knda`, `kan_Latn`

## Evaluation

This repo includes `evaluation.json` with **tokenizer-only** metrics:

- tokens per 1k bytes (lower is better)
- p95 tokens per line (lower is better)
- bucket breakdown: latin / devanagari / kannada / mixed

## Files

- Native trained artifact: `mgpt2.model` (minbpe-style `.model` file)
- `tokenizer.vocab` / `tokenizer.model` (HF artifacts generated from the native model)
- `tokenization_mgpt2.py` (root module entrypoint for `transformers` dynamic loading)

## Notes / limitations

- This is a **slow tokenizer** (pure Python); it is intended for research and reproducibility.
- Downstream LM metrics (perplexity, instruction following, DPO) are reported in the main mgpt2 project repo as controlled experiments against a baseline GPT‑2 tokenizer/model.
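As a rough illustration of the byte-level BPE scheme described in the spec above (256 base byte tokens plus ranked merges), here is a minimal encode sketch in the minbpe style. The `merges` dict layout (byte-pair → new token id, with earlier merges getting lower ids) is an assumption for illustration, not the repo's actual `.model` format:

```python
def apply_merge(ids, pair, idx):
    # Replace every occurrence of `pair` in `ids` with the merged token id `idx`.
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def bpe_encode(text, merges):
    # Start from raw UTF-8 bytes (ids 0..255), then greedily apply the
    # lowest-ranked eligible merge until no known pair remains.
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no trained merge applies to any adjacent pair
        ids = apply_merge(ids, pair, merges[pair])
    return ids
```

Because encoding starts from bytes, any Unicode input (Devanagari, Kannada, emoji) is representable even with zero merges; training merges only makes the output shorter.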
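The evaluation metrics listed above can be computed from any encoder. A minimal sketch, assuming only a callable `encode(text) -> list[int]`; the function names here are illustrative, not the repo's actual evaluation code:

```python
def tokens_per_1k_bytes(texts, encode):
    # Total tokens emitted per 1,000 UTF-8 bytes of input (lower is better).
    total_tokens = sum(len(encode(t)) for t in texts)
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    return 1000 * total_tokens / total_bytes

def p95_tokens_per_line(lines, encode):
    # 95th-percentile token count over individual lines (lower is better).
    counts = sorted(len(encode(line)) for line in lines)
    idx = min(len(counts) - 1, int(0.95 * len(counts)))
    return counts[idx]
```

As a sanity check, a pure byte-level fallback encoder (`lambda t: list(t.encode("utf-8"))`) scores exactly 1000 tokens per 1k bytes; a trained vocabulary should score well below that on in-domain text.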