---
title: Hindi BPE Tokenizer
colorFrom: blue
colorTo: red
sdk: streamlit
sdk_version: 1.31.1
app_file: app.py
pinned: false
---

# Hindi BPE Tokenizer

A Streamlit web application for encoding Hindi text to BPE tokens and decoding tokens back to text.

## Features

- Encode Hindi text to BPE tokens and token IDs
- Decode token IDs back to Hindi text
- Pre-trained on 5,000,000 lines of Hindi text
- Vocabulary size: 4,500 tokens
- Includes four special tokens

## Usage

1. **Encoding**: Enter Hindi text in the left panel and click "Encode"
2. **Decoding**: Enter comma-separated token IDs in the right panel and click "Decode"

## Technical Details

- BPE (Byte Pair Encoding) tokenizer
- Trained on the IndicCorp Hindi dataset
- Compression ratio > 3.2
- Preserves the Hindi (Devanagari) Unicode range (`\u0900-\u097F`)
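
The snippet below is a minimal sketch of three small pieces described above: parsing the comma-separated token IDs used for decoding, checking how much of a text falls in the `\u0900-\u097F` range, and computing a compression ratio. It is not taken from `app.py`. The function names (`parse_token_ids`, `devanagari_fraction`, `compression_ratio`) are hypothetical, and the ratio is assumed here to mean characters per token; a byte-based definition is also common and would give different numbers.

```python
import re
from typing import List

# Devanagari Unicode block named in Technical Details (U+0900 to U+097F).
DEVANAGARI = re.compile(r"[\u0900-\u097F]")


def parse_token_ids(raw: str) -> List[int]:
    """Parse a comma-separated string such as "12, 7, 301" into integers.

    Empty fields (for example a trailing comma) are skipped.
    Raises ValueError if a field is not an integer.
    """
    return [int(part) for part in raw.split(",") if part.strip()]


def devanagari_fraction(text: str) -> float:
    """Return the share of non-whitespace characters in the Devanagari block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if DEVANAGARI.match(c))
    return hits / len(chars)


def compression_ratio(text: str, token_ids: List[int]) -> float:
    """Assumed definition: Unicode code points in the text per token produced."""
    if not token_ids:
        raise ValueError("token_ids must not be empty")
    return len(text) / len(token_ids)
```

With a real tokenizer, `token_ids` would come from the encoder; any list of integers works for trying out the helpers.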