# ace-1/mgpt2-tokenizer
|
|
A **pure-Python** Byte-Pair Encoding tokenizer trained to better handle:
- English
- Hindi (Devanagari + transliterated Latin)
- Kannada (Kannada script + transliterated Latin)
|
|
This repo is meant to be used with `trust_remote_code=True`.
|
|
## Quickstart
|
|
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('ace-1/mgpt2-tokenizer', trust_remote_code=True)
text = "Hello! नमस्ते! ನಮಸ್ಕಾರ! namaste! namaskara!"
ids = tok.encode(text)
print(len(ids), ids[:20])
print(tok.decode(ids))
```
|
|
## Tokenizer spec
|
|
- **Vocabulary size**: 50,257 (matches GPT‑2 exactly)
  - 256 byte tokens + 50,000 merges + `<|endoftext|>`
- **Special tokens**: `<|endoftext|>`
- **Implementation**: custom Python tokenizer under `tokenizer/` (loaded dynamically)
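The vocabulary arithmetic above can be checked directly; the layout below follows the standard GPT‑2-style byte-level BPE convention stated in the spec:

```python
# GPT-2-style byte-level BPE vocabulary layout:
#   256 base byte tokens + 50,000 learned merges + 1 special token.
BASE_BYTES = 256
NUM_MERGES = 50_000
SPECIALS = ["<|endoftext|>"]

vocab_size = BASE_BYTES + NUM_MERGES + len(SPECIALS)
print(vocab_size)  # 50257
```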
|
|
## Training corpus (tokenizer)
|
|
The tokenizer was trained on a deterministic mixture built from:
- FineWeb‑Edu (English)
- AI4Bharat Sangraha synthetic splits: `hin_Deva`, `hin_Latn`, `kan_Knda`, `kan_Latn`
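The exact mixing recipe lives in the training code; a minimal sketch of one way to build a deterministic (seeded, hence reproducible) mixture follows. The function name, weights, and seed here are illustrative, not the ones actually used:

```python
import random


def deterministic_mixture(sources, weights, n_docs, seed=0):
    """Draw n_docs documents from several corpora with fixed weights.

    A fixed seed and a sorted source order make the sample
    reproducible run-to-run.
    sources: dict name -> list of documents
    weights: dict name -> sampling weight
    """
    rng = random.Random(seed)
    names = sorted(sources)  # stable ordering regardless of dict insertion order
    probs = [weights[n] for n in names]
    out = []
    for _ in range(n_docs):
        name = rng.choices(names, weights=probs, k=1)[0]
        pool = sources[name]
        out.append(pool[rng.randrange(len(pool))])
    return out


# Illustrative weights and toy documents only:
mix = deterministic_mixture(
    {"fineweb_edu": ["doc a", "doc b"], "hin_Deva": ["नमस्ते"], "kan_Knda": ["ನಮಸ್ಕಾರ"]},
    {"fineweb_edu": 0.5, "hin_Deva": 0.25, "kan_Knda": 0.25},
    n_docs=4,
)
```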
|
|
## Evaluation
|
|
This repo includes `evaluation.json` with **tokenizer-only** metrics:
- tokens per 1k bytes (lower is better)
- p95 tokens per line (lower is better)
- bucket breakdown: latin / devanagari / kannada / mixed
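`evaluation.json` is produced by this repo's own scripts; the sketch below only illustrates how metrics of this kind can be computed. The function names, the nearest-rank p95 definition, and the codepoint-range bucketing are assumptions, not the repo's exact implementation:

```python
def script_bucket(text):
    """Classify text as latin / devanagari / kannada / mixed by codepoint range."""
    scripts = set()
    for ch in text:
        cp = ord(ch)
        if 0x0900 <= cp <= 0x097F:          # Devanagari block
            scripts.add("devanagari")
        elif 0x0C80 <= cp <= 0x0CFF:        # Kannada block
            scripts.add("kannada")
        elif "a" <= ch.lower() <= "z":      # basic Latin letters
            scripts.add("latin")
    if len(scripts) == 1:
        return scripts.pop()
    # Punctuation/digit-only text falls back to "latin" here (a sketch choice).
    return "mixed" if scripts else "latin"


def tokens_per_1k_bytes(encode, texts):
    """Average token count per 1,000 UTF-8 bytes (lower = more compression)."""
    total_tokens = sum(len(encode(t)) for t in texts)
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    return 1000 * total_tokens / total_bytes


def p95_tokens_per_line(encode, lines):
    """Nearest-rank 95th percentile of per-line token counts."""
    counts = sorted(len(encode(line)) for line in lines)
    idx = min(len(counts) - 1, int(0.95 * (len(counts) - 1)))
    return counts[idx]
```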
|
|
## Files
|
|
- Native trained artifact: `mgpt2.model` (minbpe-style `.model` file)
- `tokenizer.vocab` / `tokenizer.model` (HF artifacts generated from the native model)
- `tokenization_mgpt2.py` (root module entrypoint for `transformers` dynamic loading)
|
|
## Notes / limitations
|
|
- This is a **slow tokenizer** (pure Python). It is intended for research and reproducibility.
- Downstream LM metrics (perplexity, instruction following, DPO) are reported in the main mgpt2 project repo as controlled experiments against a baseline GPT‑2 tokenizer/model.
|
|
|
|