metadata
title: Hindi BPE Tokenizer
colorFrom: blue
colorTo: red
sdk: streamlit
sdk_version: 1.31.1
app_file: app.py
pinned: false
Hindi BPE Tokenizer
A Streamlit web application for encoding Hindi text to BPE tokens and decoding tokens back to text.
Features
- Encode Hindi text to BPE tokens and token IDs
- Decode token IDs back to Hindi text
- Pre-trained on 5,000,000 lines of Hindi text
- Vocabulary size: 4,500 tokens
- Includes special tokens: <pad>, <unk>, <s>, </s>
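
As a rough illustration of how the four special tokens might sit alongside the learned vocabulary, here is a minimal sketch. The ID assignment (special tokens first, starting at 0), the choice to wrap sequences in <s> and </s>, and the helper names `build_vocab`, `tokens_to_ids`, and `ids_to_text` are assumptions made for this example; the app's actual layout may differ.

```python
# Hypothetical layout: special tokens take the first four IDs (an assumption).
SPECIAL_TOKENS = ["<pad>", "<unk>", "<s>", "</s>"]


def build_vocab(tokens):
    """Map special tokens to IDs 0-3, then assign IDs to the remaining tokens."""
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def tokens_to_ids(tokens, vocab):
    """Wrap the sequence in <s> ... </s>; unseen tokens fall back to <unk>."""
    unk = vocab["<unk>"]
    body = [vocab.get(tok, unk) for tok in tokens]
    return [vocab["<s>"]] + body + [vocab["</s>"]]


def ids_to_text(ids, vocab):
    """Invert the mapping and drop special tokens from the decoded text."""
    id_to_tok = {i: tok for tok, i in vocab.items()}
    return "".join(id_to_tok[i] for i in ids if id_to_tok[i] not in SPECIAL_TOKENS)
```

With this layout, any token missing from the vocabulary maps to the ID of <unk>, and decoding silently omits it.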
Usage
- Encoding: Enter Hindi text in the left panel and click "Encode"
- Decoding: Enter comma-separated token IDs in the right panel and click "Decode"
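
The decoding panel expects comma-separated token IDs, so the input has to be turned into a list of integers before it reaches the tokenizer. A minimal sketch of that parsing step follows; the function name `parse_token_ids` is hypothetical and the app may handle malformed input differently.

```python
def parse_token_ids(raw: str) -> list[int]:
    """Parse a string such as "12, 7, 301" into a list of integer token IDs."""
    ids = []
    for piece in raw.split(","):
        piece = piece.strip()
        if not piece:
            continue  # tolerate trailing commas and blank entries
        ids.append(int(piece))  # raises ValueError on non-numeric input
    return ids
```

For example, `parse_token_ids("12, 7, 301,")` returns `[12, 7, 301]`, while a non-numeric entry raises `ValueError`, which a caller could catch and report to the user.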
Technical Details
- BPE (Byte Pair Encoding) tokenizer
- Trained on IndicCorp Hindi dataset
- Compression ratio > 3.2
- Preserves characters in the Devanagari Unicode block (U+0900-U+097F), the script used to write Hindi
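
The points above can be illustrated with a toy, character-level BPE trainer. This is a sketch under stated assumptions, not the code used to train this tokenizer: it keeps only Devanagari runs, starts from single code points rather than bytes, and defines the compression ratio as characters per token, which the README does not specify. All function names are hypothetical.

```python
import re
from collections import Counter

# Runs of characters in the Devanagari block (U+0900-U+097F).
DEVANAGARI = re.compile(r"[\u0900-\u097F]+")


def most_frequent_pair(seqs):
    """Return the most common adjacent symbol pair, or None if no pairs exist."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges; return the merges and tokenized words."""
    seqs = [list(word) for word in DEVANAGARI.findall(text)]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(seqs)
        if pair is None:
            break  # every word is already a single token
        merges.append(pair)
        seqs = [merge_pair(seq, pair) for seq in seqs]
    return merges, seqs


def compression_ratio(text, seqs):
    """Characters per token over the Devanagari portion (assumed definition)."""
    n_chars = sum(len(word) for word in DEVANAGARI.findall(text))
    n_tokens = sum(len(seq) for seq in seqs)
    return n_chars / n_tokens
```

A quick usage example: `merges, seqs = train_bpe("नमस्ते नमस्ते नमस्ते", 10)` followed by `compression_ratio("नमस्ते नमस्ते नमस्ते", seqs)`. Each learned merge shortens the token sequences, so the ratio rises as more merges are applied.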