---
language:
- hi
tags:
- tokenizer
- bpe
- hindi
- devanagari
- byte-pair-encoding
- nlp
license: mit
library_name: custom
---

# Hindi BPE Tokenizer

A Byte Pair Encoding (BPE) tokenizer optimized for Hindi text in Devanagari script.

## 🎯 Model Description

- **Vocabulary Size:** 5,500 tokens
- **Compression Ratio:** 6.52X average (up to 10.44X on technical text)
- **Training Corpus:** 575K characters (1.5MB) of diverse Hindi text
- **Decoding Accuracy:** 100%
- **Training Time:** ~30 seconds

## ✨ Features

- **Hindi-Optimized:** Devanagari Unicode prioritization (`\u0900-\u097F`)
- **High Compression:** 6.52X average, up to 10.44X on technical text
- **Perfect Decoding:** 100% accuracy in text reconstruction
- **Simple API:** Easy encode/decode with compression stats
- **Fast Training:** Train from scratch in ~30 seconds

## 📦 Installation

```bash
# Clone the repository
git clone https://huggingface.co/ansul90/hindi-bpe-tokenizer
cd hindi-bpe-tokenizer

# Install dependencies
pip install regex numpy
# Or with uv:
uv add regex numpy
```

## 🚀 Quick Start

### Step 1: Train the Tokenizer

> **⚠️ Note:** This repository does not include the pre-trained model file (543MB).
> You need to train it once locally, which takes only ~30 seconds.
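Under the hood, training runs the classic BPE loop described under Technical Details: start from raw UTF-8 bytes, repeatedly merge the most frequent adjacent pair. The following toy, self-contained sketch shows that loop for intuition only; the function name and structure are illustrative, not this repository's implementation:

```python
from collections import Counter

def toy_bpe_train(text: str, num_merges: int):
    """Learn `num_merges` byte-pair merges from `text` (toy illustration)."""
    tokens = list(text.encode("utf-8"))  # start from the 256-value byte vocabulary
    merges = {}
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        pair = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges[pair] = next_id
        merged, i = [], 0
        while i < len(tokens):            # replace every occurrence with the new token
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(next_id)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        next_id += 1
    return tokens, merges

tokens, merges = toy_bpe_train("भारत भारत भारत", num_merges=10)
print(len("भारत भारत भारत".encode("utf-8")), "bytes ->", len(tokens), "tokens")
```

Each learned merge shortens the token sequence, which is where the compression figures above come from.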
```bash
python train_bpe_simple.py
```

This will:

- Load the Hindi corpus (included)
- Train the BPE tokenizer
- Generate `hindi_bpe_tokenizer.json` (~543MB)
- Test on 8 Hindi samples
- Display performance metrics

### Step 2: Use the Tokenizer

```python
from hindi_bpe_tokenizer import HindiBPETokenizer

# Load trained tokenizer
tokenizer = HindiBPETokenizer()
tokenizer.load('hindi_bpe_tokenizer.json')

# Encode Hindi text
text = "भारत एक महान देश है।"
tokens = tokenizer.encode(text)
print(f"Tokens: {tokens}")

# Decode back to text
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")

# Get compression statistics
stats = tokenizer.get_compression_stats(text)
print(f"Compression ratio: {stats['compression_ratio']:.2f}X")
print(f"Original bytes: {stats['original_bytes']}")
print(f"Compressed tokens: {stats['compressed_tokens']}")
```

## 📊 Performance Metrics

| Metric | Value |
|--------|-------|
| Vocabulary Size | 5,500 tokens |
| Compression Ratio | 6.52X (avg), 10.44X (best) |
| Decoding Accuracy | 100% |
| Training Corpus | 575K chars, 1.5MB |
| Training Time | ~30 seconds |

### Test Results on Different Text Types

| Category | Original Bytes | Compressed Tokens | Compression Ratio |
|----------|----------------|-------------------|-------------------|
| Space Mission | 204 | 31 | 6.58X |
| Cricket News | 146 | 27 | 5.41X |
| Science & Tech | 188 | 18 | **10.44X** |
| Language | 123 | 18 | 6.83X |
| Education | 140 | 17 | 8.24X |
| Environment | 132 | 21 | 6.29X |
| Mixed Content | 125 | 34 | 3.68X |
| Long Sentence | 240 | 33 | 7.27X |

## 🔧 Advanced Usage

### Custom Training

```python
from hindi_bpe_tokenizer import HindiBPETokenizer

# Create tokenizer with custom vocabulary size
tokenizer = HindiBPETokenizer(vocab_size=8000)

# Load your custom Hindi corpus
with open('my_corpus.txt', 'r', encoding='utf-8') as f:
    corpus = f.read()

# Train
tokenizer.train(corpus, verbose=True)

# Save
tokenizer.save('my_custom_tokenizer.json')
```

### Get Detailed Statistics
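The ratio reported alongside these statistics is original UTF-8 bytes divided by token count, as the metrics table above reflects (e.g. 204 bytes / 31 tokens = 6.58X). A quick, tokenizer-independent sanity check of that arithmetic, where the token count of 8 is a hypothetical value rather than measured output:

```python
text = "भारत एक महान देश है।"       # same sample as in Quick Start
utf8_bytes = len(text.encode("utf-8"))  # each Devanagari character is 3 bytes in UTF-8
num_tokens = 8                          # hypothetical; a real value comes from tokenizer.encode(text)
ratio = utf8_bytes / num_tokens
print(f"{utf8_bytes} bytes / {num_tokens} tokens = {ratio:.2f}X")
# → 52 bytes / 8 tokens = 6.50X
```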
```python
stats = tokenizer.get_compression_stats("हिंदी टेक्स्ट")
print(f"Original characters: {stats['original_chars']}")
print(f"Original bytes: {stats['original_bytes']}")
print(f"Compressed tokens: {stats['compressed_tokens']}")
print(f"Compression ratio: {stats['compression_ratio']:.2f}X")
print(f"Vocabulary size: {stats['vocab_size']:,}")
```

## 📁 Repository Structure

```
hindi-bpe-tokenizer/
├── hindi_bpe_tokenizer.py          # Core implementation (8KB)
├── train_bpe_simple.py             # Training script (5KB)
├── create_diverse_hindi_corpus.py  # Corpus generator (17KB)
├── hindi_corpus.txt                # Training data (1.5MB)
├── training_results.json           # Performance metrics (2KB)
├── pyproject.toml                  # Dependencies
└── README.md                       # This file
```

**Note:** `hindi_bpe_tokenizer.json` (543MB) is generated when you run `train_bpe_simple.py`.

## 🎓 Training Data

The tokenizer was trained on diverse Hindi content, including:

- **News:** Cricket, space missions, current events
- **Science & Technology:** science (विज्ञान) and technology (प्रौद्योगिकी) vocabulary
- **Education & Environment:** education (शिक्षा) and environment (पर्यावरण) topics
- **Politics & Governance:** politics (राजनीति) and constitution (संविधान) terms
- **Daily Life:** Common phrases, everyday vocabulary
- **Complete Alphabet:** All Devanagari letters, vowels, and consonants
- **Numbers:** Both Arabic (0-9) and Devanagari (०-९) numerals

## 🔬 Technical Details

### BPE Algorithm

1. Start with the 256-token byte vocabulary
2. Find the most frequent byte pair in the corpus
3. Merge that pair into a new token
4. Repeat until the target vocabulary size is reached

### Hindi-Specific Optimizations

- Unicode blocks matched first in the pattern: `\u0900-\u097F` (Devanagari) and `\u0980-\u09FF` (Bengali)
- Optimized regex pattern for Hindi word boundaries
- JSON-based serialization for easy sharing

### Regex Pattern

```python
r""" ?[\u0900-\u097F]+| ?[\u0980-\u09FF]+| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
```

## 🤝 Use Cases

- Text compression for Hindi documents
- Tokenization for Hindi NLP models
- Language model preprocessing
- Text analysis and statistics
- Educational purposes (understanding BPE)

## 📝 Requirements

- Python 3.13+
- `regex` library (for Unicode property support)
- `numpy` (optional, for numerical operations)

## ⚠️ Limitations

- Trained on a specific corpus (may need retraining for domain-specific text)
- Best suited to Devanagari script (Hindi)
- Compression ratio varies by text type
- Not optimized for mixed Hindi-English text (compression drops to ~3.68X)

## 🛠️ Troubleshooting

### Issue: `ModuleNotFoundError: No module named 'regex'`

```bash
pip install regex
```

### Issue: `hindi_bpe_tokenizer.json` not found

```bash
# Train the tokenizer first
python train_bpe_simple.py
```

### Issue: `UnicodeDecodeError`

```python
# Ensure files are read with UTF-8 encoding
with open('file.txt', 'r', encoding='utf-8') as f:
    content = f.read()
```

## 📚 Citation

If you use this tokenizer in your research or project, please cite:

```bibtex
@misc{hindi_bpe_tokenizer_2025,
  title={Hindi BPE Tokenizer: Byte Pair Encoding for Devanagari Script},
  author={Your Name},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/ansul90/hindi-bpe-tokenizer}
}
```

## 📄 License

MIT License - see the LICENSE file for details.

## 🙏 Acknowledgments

- Inspired by OpenAI's GPT-2 BPE implementation
- Built for the Hindi NLP community
- Based on the paper "Neural Machine Translation of Rare Words with Subword Units" (Sennrich et al., 2016)

## 📧 Contact & Links

- **Hugging Face:** [ansul90/hindi-bpe-tokenizer](https://huggingface.co/ansul90/hindi-bpe-tokenizer)
- **GitHub:** Your GitHub link
- **Email:** Your email

---

**धन्यवाद (Thank you) for using the Hindi BPE Tokenizer!** 🙏