# ace-1/mgpt2-tokenizer
A **pure-Python** Byte-Pair Encoding tokenizer trained to better handle:
- English
- Hindi (Devanagari + transliterated Latin)
- Kannada (Kannada script + transliterated Latin)
This repo ships a custom tokenizer implementation, so it must be loaded with `trust_remote_code=True`.
## Quickstart
```python
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained('ace-1/mgpt2-tokenizer', trust_remote_code=True)
text = "Hello! नमस्ते! ನಮಸ್ಕಾರ! namaste! namaskara!"
ids = tok.encode(text)
print(len(ids), ids[:20])
print(tok.decode(ids))
```
## Tokenizer spec
- **Vocabulary size**: 50,257 (matches GPT‑2 exactly)
- 256 byte tokens + 50,000 merges + `<|endoftext|>`
- **Special tokens**: `<|endoftext|>`
- **Implementation**: custom Python tokenizer under `tokenizer/` (loaded dynamically)
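The spec above (256 byte tokens plus ranked merges) can be illustrated with a minimal byte-level BPE encode loop. The merge table here is a toy stand-in, not the repo's actual merges:

```python
def bpe_encode(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    """Encode text to token ids: start from raw UTF-8 bytes, then
    greedily apply the lowest-ranked merge until none applies."""
    ids = list(text.encode("utf-8"))  # base tokens are the 256 byte values
    while len(ids) >= 2:
        # find the adjacent pair with the lowest merge rank
        pairs = set(zip(ids, ids[1:]))
        best = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if best not in merges:
            break  # no merge applies to any remaining pair
        new_id = 256 + merges[best]  # merged tokens start after the 256 bytes
        out, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == best:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

# toy merge table: rank 0 merges the byte pair for "ab"
toy_merges = {(ord("a"), ord("b")): 0}
print(bpe_encode("abab", toy_merges))  # -> [256, 256]
```

The real tokenizer additionally reserves id 50,256 for `<|endoftext|>` after the 50,000 merges.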
## Training corpus (tokenizer)
The tokenizer was trained on a deterministic mixture built from:
- FineWeb‑Edu (English)
- AI4Bharat Sangraha synthetic splits: `hin_Deva`, `hin_Latn`, `kan_Knda`, `kan_Latn`
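The exact mixture recipe is not specified here, but a deterministic mixture generally means sampling across sources with fixed weights and a fixed seed so the corpus is bit-for-bit reproducible. A hypothetical sketch (source names and weights are illustrative only):

```python
import random

def deterministic_mixture(sources: dict[str, list[str]],
                          weights: dict[str, float],
                          n: int, seed: int = 0) -> list[str]:
    """Draw n documents by seeded weighted sampling over sources.
    Same inputs and seed always yield the same sequence."""
    rng = random.Random(seed)
    names = sorted(sources)  # fixed ordering, independent of dict insertion
    probs = [weights[k] for k in names]
    cursors = {k: 0 for k in names}
    out = []
    while len(out) < n:
        name = rng.choices(names, weights=probs)[0]
        docs = sources[name]
        out.append(docs[cursors[name] % len(docs)])  # cycle within a source
        cursors[name] += 1
    return out
```

Because the RNG is seeded and source order is sorted, two runs produce identical corpora, which is what makes tokenizer training reproducible.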
## Evaluation
This repo includes `evaluation.json` with **tokenizer-only** metrics:
- tokens per 1k bytes (lower is better)
- p95 tokens per line (lower is better)
- bucket breakdown: latin / devanagari / kannada / mixed
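The two headline metrics are easy to reproduce given any tokenizer callable. A sketch, assuming `tokenize` returns a list of token ids for a string (the actual `evaluation.json` schema may differ):

```python
import math

def tokens_per_1k_bytes(lines: list[str], tokenize) -> float:
    """Lower is better: tokens needed per kilobyte of UTF-8 text."""
    total_tokens = sum(len(tokenize(line)) for line in lines)
    total_bytes = sum(len(line.encode("utf-8")) for line in lines)
    return 1000 * total_tokens / total_bytes

def p95_tokens_per_line(lines: list[str], tokenize) -> int:
    """Lower is better: 95th percentile of per-line token counts."""
    counts = sorted(len(tokenize(line)) for line in lines)
    idx = math.ceil(0.95 * len(counts)) - 1  # nearest-rank percentile
    return counts[idx]

# toy tokenizer: one token per whitespace-separated word
toy = lambda s: s.split()
sample = ["hello world", "a b c d"]
print(tokens_per_1k_bytes(sample, toy))   # 6 tokens / 18 bytes
print(p95_tokens_per_line(sample, toy))
```

Byte-based (rather than character-based) normalization keeps the comparison fair across scripts, since Devanagari and Kannada characters take 3 bytes each in UTF-8 versus 1 for ASCII.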
## Files
- Native trained artifact: `mgpt2.model` (minbpe-style `.model` file)
- `tokenizer.vocab` / `tokenizer.model` (HF artifacts generated from the native model)
- `tokenization_mgpt2.py` (root module entrypoint for `transformers` dynamic loading)
## Notes / limitations
- This is a **slow tokenizer** (pure Python). It is intended for research and reproducibility.
- Downstream LM metrics (perplexity, instruction following, DPO) are reported in the main mgpt2 project repo as controlled experiments vs a baseline GPT‑2 tokenizer/model.