hindi-tokenizer / README.md
HSinghHuggingFace's picture
Hindi language tokenizer
d83c04d
---
title: Hindi BPE Tokenizer
colorFrom: blue
colorTo: red
sdk: streamlit
sdk_version: 1.31.1
app_file: app.py
pinned: false
---
# Hindi BPE Tokenizer
A Streamlit web application for encoding Hindi text to BPE tokens and decoding tokens back to text.
## Features
- Encode Hindi text to BPE tokens and token IDs
- Decode token IDs back to Hindi text
- Pre-trained on 5,000,000 lines of Hindi text
- Vocabulary size: 4,500 tokens
- Includes special tokens: `<pad>`, `<unk>`, `<s>`, `</s>`
## Usage
1. **Encoding**: Enter Hindi text in the left panel and click "Encode"
2. **Decoding**: Enter comma-separated token IDs in the right panel and click "Decode"
## Technical Details
- BPE (Byte Pair Encoding) tokenizer
- Trained on IndicCorp Hindi dataset
- Compression ratio > 3.2
- Preserves Hindi Unicode range (\\u0900-\\u097F)