+---
+language:
+- mr
+license: mit
+tags:
+- tokenizer
+- bpe
+- marathi
+- devanagari
+library_name: tokenizers
+---
+# Marathi BPE Tokenizer
+A Byte Pair Encoding (BPE) tokenizer trained on Marathi text written in the Devanagari script.
+## Model Details
+- **Model Type:** BPE Tokenizer
+- **Language:** Marathi (mr)
+- **Script:** Devanagari
+- **Vocabulary Size:** 4,845 tokens (845 base graphemes + 4,000 merged tokens)
+- **Base Vocabulary:** 845 graphemes
+- **Merge Operations:** 4,000
+- **License:** MIT
+## Training Details
+The tokenizer was trained using a custom Byte Pair Encoding implementation optimized for Devanagari script:
+- **Starting Unit:** Unicode extended grapheme clusters (not bytes)
+- **Training Corpus Size:** 92,627 characters
+- **Compression Ratio (Grapheme):** 2.84x
+- **Compression Ratio (Byte):** 12.30x
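The card does not spell out how the two ratios are computed. A plausible reading is "input units divided by output tokens", and the sketch below follows that assumption. `compression_ratios` is a hypothetical helper, and the grapheme and token counts passed to it are illustrative values, not figures from the training run.

```python
def compression_ratios(text: str, n_graphemes: int, n_tokens: int) -> tuple[float, float]:
    """Return (graphemes per token, UTF-8 bytes per token).

    Assumed definition: ratio = number of input units / number of tokens.
    """
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    n_bytes = len(text.encode("utf-8"))
    return n_graphemes / n_tokens, n_bytes / n_tokens


# Illustrative input: one word, counted by hand as 3 graphemes, encoded as 1 token.
sample = "मराठी"
grapheme_ratio, byte_ratio = compression_ratios(sample, n_graphemes=3, n_tokens=1)
print(f"grapheme ratio: {grapheme_ratio:.2f}x, byte ratio: {byte_ratio:.2f}x")
```

Under this definition the byte ratio is always the larger of the two for Devanagari, because each code point takes three bytes in UTF-8 and a grapheme often spans more than one code point.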
+## Usage
+```python
+from tokenizers import Tokenizer
+# Load the tokenizer
+tokenizer = Tokenizer.from_pretrained("pandurangpatil/sample-marathi-bpe-tokenizer")
+# Encode text
+text = "नमस्कार! हे एक मराठी टोकनायझर आहे."
+encoded = tokenizer.encode(text)
+print(f"Token IDs: {encoded.ids}")
+print(f"Tokens: {encoded.tokens}")
+# Decode back to text
+decoded = tokenizer.decode(encoded.ids)
+print(f"Decoded: {decoded}")
+```
+### Using with Custom Scripts
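+To work with the raw artifacts (`vocab.json`, `merges.json`, `id_to_token.json`) directly, using the `encode` and `decode` helpers from `tokenizer_utils`: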
+```python
+import json
+from tokenizer_utils import encode, decode
+# Load artifacts
+with open('vocab.json', 'r', encoding='utf-8') as f:
+    token_to_id = json.load(f)
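+with open('merges.json', 'r', encoding='utf-8') as f:
+    # Keys are stored as "left_id,right_id" strings; convert them to (int, int) tuples
+    merges_str = json.load(f)
+    merges = {tuple(map(int, k.split(','))): v for k, v in merges_str.items()}
+with open('id_to_token.json', 'r', encoding='utf-8') as f:
+    # JSON object keys are always strings, so cast the token IDs back to int
+    id_to_token = {int(k): v for k, v in json.load(f).items()}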
+# Encode and decode
+text = "मराठी मजकूर"
+token_ids = encode(merges, token_to_id, text)
+reconstructed = decode(id_to_token, token_ids)
+```
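The internals of `tokenizer_utils` are not shown here, so the following is only a sketch of how merges in the `(left_id, right_id) -> merged_id` shape loaded above could be applied. It assumes that the mapped value is the ID of the merged token and that dictionary order reflects the order in which merges were learned. `apply_merges` and `decode_ids` are hypothetical names, and the vocabulary is a toy example.

```python
def apply_merges(ids, merges):
    """Apply BPE merges to a list of token IDs, earliest-learned merge first."""
    # Rank each pair by its position in the dict (assumed to be learning order).
    rank = {pair: i for i, pair in enumerate(merges)}
    ids = list(ids)
    while len(ids) > 1:
        pairs = list(zip(ids, ids[1:]))
        best = min(pairs, key=lambda p: rank.get(p, float("inf")))
        if best not in rank:
            break  # no adjacent pair has a learned merge
        new_id = merges[best]
        out, i = [], 0
        while i < len(ids):
            # Replace every non-overlapping occurrence of the best pair.
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids


def decode_ids(ids, id_to_token):
    """Concatenate the token strings for a list of IDs."""
    return "".join(id_to_token[i] for i in ids)


# Toy data: three base graphemes and two merged tokens.
id_to_token = {0: "म", 1: "रा", 2: "ठी", 3: "मरा", 4: "मराठी"}
merges = {(0, 1): 3, (3, 2): 4}

merged = apply_merges([0, 1, 2], merges)
print(merged, decode_ids(merged, id_to_token))
```

Picking the lowest-ranked pair on each pass mirrors the order in which merges were learned during training, which is the usual way BPE encoders reproduce the training-time segmentation.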
+## Grapheme-based Approach
+Unlike traditional byte-level BPE, this tokenizer:
+- Starts from **Unicode grapheme clusters** as base units rather than bytes
+- Keeps Devanagari combining characters (matras, virama) attached to their base consonant
+- Produces tokens made of whole graphemes, so no token splits a character from its combining marks
+- Reaches 2.84x compression per grapheme and 12.30x per byte on the training corpus
+Example of grapheme segmentation:
+- नमस्कार → [न, म, स्, का, र] (graphemes)
+- Each grapheme preserves visual/phonetic integrity
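The segmentation in the example above can be approximated with the standard library alone. This is a minimal sketch, not the tokenizer's actual segmentation code: `approx_graphemes` is a hypothetical helper that attaches every combining mark to the preceding character. A full UAX #29 implementation handles more cases (for example zero-width joiners) and may group conjuncts differently depending on the Unicode version.

```python
import unicodedata


def approx_graphemes(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it."""
    clusters: list[str] = []
    for ch in text:
        # Categories Mn/Mc/Me cover Devanagari matras, virama, anusvara, and nukta.
        if clusters and unicodedata.category(ch) in {"Mn", "Mc", "Me"}:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = "नमस्कार"
clusters = approx_graphemes(word)
print(clusters)
```

Because the virama is itself a combining mark, it stays attached to the consonant before it, which is why स् appears as its own unit in the segmentation shown above.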
+## Limitations
+- Trained on a small corpus (92,627 characters)
+- May not generalize well to domains outside training data
+- Does not include special tokens for ML models (PAD, UNK, BOS, EOS)
+- Designed for tokenization research and experimentation
+## Citation
+```bibtex
+@misc{marathi-bpe-tokenizer,
+  author = {Your Name},
+  title = {Marathi BPE Tokenizer},
+  year = {2025},
+  publisher = {HuggingFace},
+  url = {https://huggingface.co/pandurangpatil/sample-marathi-bpe-tokenizer}
+}
+```
+## License
+MIT License - see LICENSE file for details.