metadata
title: Hindi BPE Tokenizer
colorFrom: blue
colorTo: red
sdk: streamlit
sdk_version: 1.31.1
app_file: app.py
pinned: false

Hindi BPE Tokenizer

A Streamlit web application that encodes Hindi text into BPE tokens and decodes token IDs back into text.

Features

  • Encode Hindi text to BPE tokens and token IDs
  • Decode token IDs back to Hindi text
  • Tokenizer trained on 5,000,000 lines of Hindi text
  • Vocabulary size: 4,500 tokens
  • Includes special tokens: <pad>, <unk>, <s>, </s>
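For readers new to BPE, the merge procedure behind the encode and decode features can be sketched as follows. This is a toy illustration only, not the code in app.py: it assumes byte-level BPE over UTF-8 bytes, skips the special tokens, and the function names (`train_toy_bpe`, `decode_toy_bpe`) are hypothetical. The Space's actual tokenizer may work on characters or use a library.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair of IDs, or None if there is none."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_toy_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`.

    Assumption: IDs 0-255 are raw bytes and merged tokens start at 256.
    """
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        new_id = 256 + step
        merges[pair] = new_id
        ids = merge_pair(ids, pair, new_id)
    return ids, merges


def decode_toy_bpe(ids, merges):
    """Expand merged IDs back to bytes, then decode the bytes as UTF-8."""
    expand = {new_id: pair for pair, new_id in merges.items()}
    stack, out = list(reversed(ids)), bytearray()
    while stack:
        tok = stack.pop()
        if tok in expand:
            left, right = expand[tok]
            stack.append(right)  # pushed first so that `left` is popped first
            stack.append(left)
        else:
            out.append(tok)
    return out.decode("utf-8")
```

Each merge replaces the most frequent adjacent pair with a new ID, so the sequence gets shorter as the vocabulary grows. Decoding reverses the merges and recovers the original text exactly.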

Usage

  1. Encoding: Enter Hindi text in the left panel and click "Encode"
  2. Decoding: Enter comma-separated token IDs in the right panel and click "Decode"
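The decode panel takes token IDs as comma-separated text, so that input has to be turned into integers before decoding. A minimal sketch of such parsing is shown below. The helper name `parse_token_ids` is hypothetical and the validation rules are assumptions; the app's own handling may differ.

```python
def parse_token_ids(raw):
    """Turn a string such as "12, 7, 431" into [12, 7, 431]."""
    ids = []
    for piece in raw.split(","):
        piece = piece.strip()
        if not piece:
            continue  # tolerate trailing commas and blank entries
        try:
            value = int(piece)
        except ValueError:
            raise ValueError(f"not a token ID: {piece!r}") from None
        if value < 0:
            raise ValueError(f"token IDs must be non-negative: {piece!r}")
        ids.append(value)
    return ids
```

For example, `parse_token_ids("12, 7, 431")` returns `[12, 7, 431]`, while input such as `"12, abc"` raises a `ValueError` that the UI could show to the user.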

Technical Details

  • BPE (Byte Pair Encoding) tokenizer
  • Trained on IndicCorp Hindi dataset
  • Compression ratio > 3.2
  • Preserves characters in the Devanagari Unicode block (U+0900 to U+097F)
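The last two points can be illustrated with a short sketch. The README does not say how the compression ratio is measured; the code below assumes UTF-8 bytes of input per output token, which is one common definition. Both function names are hypothetical and do not come from app.py.

```python
import re

# The Devanagari Unicode block, U+0900 to U+097F (the range quoted above).
DEVANAGARI_RUN = re.compile(r"[\u0900-\u097F]+")


def devanagari_runs(text):
    """Return the maximal runs of Devanagari characters found in `text`."""
    return DEVANAGARI_RUN.findall(text)


def compression_ratio(text, token_ids):
    """UTF-8 bytes of `text` per token (an assumed definition of the ratio)."""
    if not token_ids:
        raise ValueError("token_ids must not be empty")
    return len(text.encode("utf-8")) / len(token_ids)
```

Under this definition, a six-character Devanagari word (18 UTF-8 bytes) encoded as three tokens has a ratio of 6.0. If the ratio were instead counted in characters per token, the same example would give 2.0, so the definition matters when comparing against the 3.2 figure.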