---
language:
- hi
tags:
- tokenizer
- bpe
- hindi
- devanagari
- byte-pair-encoding
- nlp
license: mit
library_name: custom
---
# Hindi BPE Tokenizer
A Byte Pair Encoding (BPE) tokenizer optimized for Hindi text written in the Devanagari script.
## 🎯 Model Description
- **Vocabulary Size:** 5,500 tokens
- **Compression Ratio:** 6.52X average (up to 10.44X on technical text)
- **Training Corpus:** 575K characters (1.5MB) of diverse Hindi text
- **Decoding Accuracy:** 100%
- **Training Time:** ~30 seconds
## ✨ Features
- **Hindi-Optimized:** Devanagari Unicode prioritization (`\u0900-\u097F`)
- **High Compression:** 6.52X average, up to 10.44X on technical text
- **Perfect Decoding:** 100% accuracy in text reconstruction
- **Simple API:** Easy encode/decode with compression stats
- **Fast Training:** Train from scratch in ~30 seconds
## 📦 Installation
```bash
# Clone the repository
git clone https://huggingface.co/ansul90/hindi-bpe-tokenizer
cd hindi-bpe-tokenizer
# Install dependencies
pip install regex numpy
# Or with uv:
uv add regex numpy
```
## 🚀 Quick Start
### Step 1: Train the Tokenizer
> **⚠️ Note:** This repository does not include the pre-trained model file (543MB).
> You need to train it once locally, which takes only ~30 seconds.
```bash
python train_bpe_simple.py
```
This will:
- Load the Hindi corpus (included)
- Train the BPE tokenizer
- Generate `hindi_bpe_tokenizer.json` (~543MB)
- Test on 8 Hindi samples
- Display performance metrics
### Step 2: Use the Tokenizer
```python
from hindi_bpe_tokenizer import HindiBPETokenizer
# Load trained tokenizer
tokenizer = HindiBPETokenizer()
tokenizer.load('hindi_bpe_tokenizer.json')
# Encode Hindi text
text = "भारत एक महान देश है।"
tokens = tokenizer.encode(text)
print(f"Tokens: {tokens}")
# Decode back to text
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")
# Get compression statistics
stats = tokenizer.get_compression_stats(text)
print(f"Compression ratio: {stats['compression_ratio']:.2f}X")
print(f"Original bytes: {stats['original_bytes']}")
print(f"Compressed tokens: {stats['compressed_tokens']}")
```
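Because decoding is reported as lossless, a quick round-trip check is a convenient sanity test. The sketch below reuses the `tokenizer` loaded above and assumes only the `encode`/`decode` methods already shown; the sample sentences are illustrative.

```python
# Sanity check: encoding followed by decoding should reproduce the input exactly.
samples = [
    "भारत एक महान देश है।",
    "विज्ञान और प्रौद्योगिकी",
    "मौसम आज बहुत अच्छा है।",
]
for text in samples:
    assert tokenizer.decode(tokenizer.encode(text)) == text, f"Round-trip failed for: {text}"
print("All round-trips matched the original text.")
```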
## 📊 Performance Metrics
| Metric | Value |
|--------|-------|
| Vocabulary Size | 5,500 tokens |
| Compression Ratio | 6.52X (avg), 10.44X (best) |
| Decoding Accuracy | 100% |
| Training Corpus | 575K chars, 1.5MB |
| Training Time | ~30 seconds |
### Test Results on Different Text Types
| Category | Original Bytes | Compressed Tokens | Compression Ratio |
|----------|---------------|-------------------|-------------------|
| Space Mission | 204 | 31 | 6.58X |
| Cricket News | 146 | 27 | 5.41X |
| Science & Tech | 188 | 18 | **10.44X** |
| Language | 123 | 18 | 6.83X |
| Education | 140 | 17 | 8.24X |
| Environment | 132 | 21 | 6.29X |
| Mixed Content | 125 | 34 | 3.68X |
| Long Sentence | 240 | 33 | 7.27X |
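The ratios above follow directly from the two preceding columns: compression ratio = original UTF-8 bytes ÷ compressed tokens (for example, 204 / 31 ≈ 6.58X). If you want to recompute a row yourself, here is a minimal sketch; `compression_ratio` is a hypothetical helper and only the `encode` method from Quick Start is assumed.

```python
def compression_ratio(tokenizer, text: str) -> float:
    """Ratio of raw UTF-8 size to BPE token count."""
    original_bytes = len(text.encode("utf-8"))  # bytes before tokenization
    num_tokens = len(tokenizer.encode(text))    # tokens after BPE encoding
    return original_bytes / num_tokens

print(f"{compression_ratio(tokenizer, 'भारत एक महान देश है।'):.2f}X")
```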
## 🔧 Advanced Usage
### Custom Training
```python
from hindi_bpe_tokenizer import HindiBPETokenizer
# Create tokenizer with custom vocabulary size
tokenizer = HindiBPETokenizer(vocab_size=8000)
# Load your custom Hindi corpus
with open('my_corpus.txt', 'r', encoding='utf-8') as f:
    corpus = f.read()
# Train
tokenizer.train(corpus, verbose=True)
# Save
tokenizer.save('my_custom_tokenizer.json')
```
### Get Detailed Statistics
```python
stats = tokenizer.get_compression_stats("हिंदी टेक्स्ट")
print(f"Original characters: {stats['original_chars']}")
print(f"Original bytes: {stats['original_bytes']}")
print(f"Compressed tokens: {stats['compressed_tokens']}")
print(f"Compression ratio: {stats['compression_ratio']:.2f}X")
print(f"Vocabulary size: {stats['vocab_size']:,}")
```
## 📁 Repository Structure
```
hindi-bpe-tokenizer/
├── hindi_bpe_tokenizer.py # Core implementation (8KB)
├── train_bpe_simple.py # Training script (5KB)
├── create_diverse_hindi_corpus.py # Corpus generator (17KB)
├── hindi_corpus.txt # Training data (1.5MB)
├── training_results.json # Performance metrics (2KB)
├── pyproject.toml # Dependencies
└── README.md # This file
```
**Note:** `hindi_bpe_tokenizer.json` (543MB) is generated when you run `train_bpe_simple.py`.
## 🎓 Training Data
The tokenizer was trained on diverse Hindi content including:
- **News:** Cricket, space missions, current events
- **Science & Technology:** विज्ञान, प्रौद्योगिकी vocabulary
- **Education & Environment:** शिक्षा, पर्यावरण topics
- **Politics & Governance:** राजनीति, संविधान terms
- **Daily Life:** Common phrases, daily vocabulary
- **Complete Alphabet:** All Devanagari letters, vowels, consonants
- **Numbers:** Both Arabic (0-9) and Devanagari (०-९) numerals
## 🔬 Technical Details
### BPE Algorithm
1. Start with a base vocabulary of 256 byte values
2. Find the most frequent adjacent byte pair in the corpus
3. Merge that pair into a new token
4. Repeat until the target vocabulary size is reached (see the sketch below)
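For reference, the merge loop can be sketched in a few lines of plain Python. This is an illustrative, unoptimized version of the steps above, not the exact code in `hindi_bpe_tokenizer.py`; the helper name and the toy call at the end are made up for the example.

```python
from collections import Counter

def train_bpe(data: bytes, target_vocab_size: int):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair."""
    ids = list(data)                       # start from raw bytes (vocab 0-255)
    merges = {}                            # (left, right) -> new token id
    next_id = 256
    while next_id < target_vocab_size:
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair = pairs.most_common(1)[0][0]  # most frequent adjacent pair
        merges[pair] = next_id
        # Replace every occurrence of the pair with the new token id.
        merged, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                merged.append(next_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        ids = merged
        next_id += 1
    return merges

merges = train_bpe("भारत एक महान देश है। भारत एक महान देश है।".encode("utf-8"), 280)
print(f"Learned {len(merges)} merges")
```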
### Hindi-Specific Optimizations
- Prioritized Unicode blocks: Devanagari (`\u0900-\u097F`) plus the adjacent Bengali block (`\u0980-\u09FF`)
- Optimized regex pattern for Hindi word boundaries
- JSON-based serialization for easy sharing
### Regex Pattern
```python
r""" ?[\u0900-\u097F]+| ?[\u0980-\u09FF]+| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
```
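The pattern pre-splits text into word-like chunks (Devanagari runs, other letters, digits, punctuation, whitespace), presumably so that byte-level merges never cross word boundaries, as in GPT-2-style BPE. A minimal illustration with the `regex` library, using the pattern copied verbatim from above:

```python
import regex

PATTERN = regex.compile(
    r""" ?[\u0900-\u097F]+| ?[\u0980-\u09FF]+| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

chunks = PATTERN.findall("भारत एक महान देश है। ISRO ने 2023 में चंद्रयान भेजा।")
print(chunks)
# Each chunk is then encoded to UTF-8 bytes and tokenized independently.
```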
## 🤝 Use Cases
- Text compression for Hindi documents
- Tokenization for Hindi NLP models
- Language model preprocessing
- Text analysis and statistics
- Educational purposes (understanding BPE)
## 📝 Requirements
- Python 3.13+
- `regex` library (for Unicode support)
- `numpy` (optional, for numerical operations)
## ⚠️ Limitations
- Trained on specific corpus (may need retraining for domain-specific text)
- Best for Devanagari script (Hindi)
- Compression ratio varies by text type
- Not optimized for mixed Hindi-English text (compression drops to ~3.68X)
## 🛠️ Troubleshooting
### Issue: `ModuleNotFoundError: No module named 'regex'`
```bash
pip install regex
```
### Issue: `hindi_bpe_tokenizer.json` not found
```bash
# Train the tokenizer first
python train_bpe_simple.py
```
### Issue: `UnicodeDecodeError`
```python
# Ensure files are read with UTF-8 encoding
with open('file.txt', 'r', encoding='utf-8') as f:
    content = f.read()
```
## 📚 Citation
If you use this tokenizer in your research or project, please cite:
```bibtex
@misc{hindi_bpe_tokenizer_2025,
  title={Hindi BPE Tokenizer: Byte Pair Encoding for Devanagari Script},
  author={Your Name},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/ansul90/hindi-bpe-tokenizer}
}
```
## 📄 License
MIT License - See LICENSE file for details
## 🙏 Acknowledgments
- Inspired by OpenAI's GPT-2 BPE implementation
- Built for the Hindi NLP community
- Based on the paper: "Neural Machine Translation of Rare Words with Subword Units" (Sennrich et al., 2016)
## 📧 Contact & Links
- **Hugging Face:** [ansul90/hindi-bpe-tokenizer](https://huggingface.co/ansul90/hindi-bpe-tokenizer)
- **GitHub:** Your GitHub link
- **Email:** Your email
---
**धन्यवाद (Thank you) for using Hindi BPE Tokenizer!** 🙏