	---
	title: Hindi BPE Tokenizer
	colorFrom: blue
	colorTo: red
	sdk: streamlit
	sdk_version: 1.31.1
	app_file: app.py
	pinned: false
	---

	# Hindi BPE Tokenizer

	A Streamlit web application for encoding Hindi text to BPE tokens and decoding tokens back to text.

	## Features

	- Encode Hindi text to BPE tokens and token IDs
	- Decode token IDs back to Hindi text
	- Pre-trained on 5,000,000 lines of Hindi text
	- Vocabulary size: 4,500 tokens
	- Includes special tokens: `<pad>`, `<unk>`, `<s>`, `</s>`
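The encode and decode features listed above can be illustrated with a toy BPE implementation. This is a minimal sketch, not the code in `app.py`: the function names (`train_bpe`, `build_vocab`, `encode`, `decode`) are hypothetical, the placement of the special tokens at IDs 0 to 3 is an assumption, and for brevity it merges over the raw character sequence (spaces included) rather than pre-splitting into words.

```python
from collections import Counter

# Assumption: the four special tokens occupy the first IDs, in this order.
SPECIAL_TOKENS = ["<pad>", "<unk>", "<s>", "</s>"]


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(lines, num_merges):
    """Learn up to `num_merges` merge rules, most frequent pair first."""
    seqs = [list(line) for line in lines]  # one symbol per code point
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        seqs = [apply_merge(seq, best) for seq in seqs]
    return merges


def build_vocab(lines, merges):
    """Map special tokens, then base characters, then merged tokens to IDs."""
    chars = sorted({ch for line in lines for ch in line})
    tokens = SPECIAL_TOKENS + chars + [a + b for a, b in merges]
    tokens = list(dict.fromkeys(tokens))  # drop duplicates, keep order
    return {tok: i for i, tok in enumerate(tokens)}


def encode(text, merges, vocab):
    """Apply the learned merges in order; unseen symbols map to <unk>."""
    symbols = list(text)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return [vocab.get(s, vocab["<unk>"]) for s in symbols]


def decode(ids, vocab):
    """Concatenate token strings, skipping special tokens."""
    id_to_token = {i: tok for tok, i in vocab.items()}
    return "".join(
        id_to_token[i] for i in ids if id_to_token[i] not in SPECIAL_TOKENS
    )


lines = ["नमस्ते दुनिया", "नमस्ते भारत"]
merges = train_bpe(lines, num_merges=10)
vocab = build_vocab(lines, merges)
ids = encode("नमस्ते भारत", merges, vocab)
print(ids)
print(decode(ids, vocab))
```

Because every merge only concatenates adjacent symbols, decoding the IDs of text made of known characters reproduces the input exactly; characters never seen in training fall back to `<unk>` and are dropped on decode.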

	## Usage

	1. Encoding: Enter Hindi text in the left panel and click "Encode"
	2. Decoding: Enter comma-separated token IDs in the right panel and click "Decode"
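The decoding step takes comma-separated token IDs as text, so the input has to be parsed and range-checked before it can be looked up in the vocabulary. The helper below is a hedged sketch of one way to do that; `parse_token_ids` is a hypothetical name, and the default `vocab_size` of 4500 is taken from the vocabulary size stated in the Features list, not from the app's source.

```python
def parse_token_ids(raw, vocab_size=4500):
    """Parse a string like "12, 7, 431" into a list of valid token IDs.

    Empty fields (for example from a trailing comma) are ignored.
    Raises ValueError for non-integer fields or IDs outside [0, vocab_size).
    """
    ids = []
    for part in raw.split(","):
        part = part.strip()
        if not part:
            continue
        value = int(part)  # raises ValueError on non-numeric input
        if not 0 <= value < vocab_size:
            raise ValueError(
                f"token ID {value} is outside the range 0..{vocab_size - 1}"
            )
        ids.append(value)
    return ids


print(parse_token_ids("12, 7, 431"))
```

Rejecting out-of-range IDs up front lets the UI show a clear error message instead of failing during the vocabulary lookup.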

	## Technical Details

	- BPE (Byte Pair Encoding) tokenizer
	- Trained on IndicCorp Hindi dataset
	- Compression ratio > 3.2
	- Preserves characters in the Devanagari Unicode block (U+0900-U+097F), the script used to write Hindi
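The compression ratio and the Unicode range above can be checked with a few lines of Python. This is a sketch under stated assumptions: this README does not define how the ratio is measured, so the code shows two common conventions (characters per token and UTF-8 bytes per token), and the helper names are hypothetical.

```python
def chars_per_token(text, token_ids):
    """Compression ratio as Unicode code points per token (one convention)."""
    if not token_ids:
        raise ValueError("token_ids must not be empty")
    return len(text) / len(token_ids)


def bytes_per_token(text, token_ids):
    """Compression ratio as UTF-8 bytes per token (another convention)."""
    if not token_ids:
        raise ValueError("token_ids must not be empty")
    return len(text.encode("utf-8")) / len(token_ids)


def is_devanagari(ch):
    """True if the character lies in the Devanagari block U+0900-U+097F."""
    return "\u0900" <= ch <= "\u097F"


def devanagari_fraction(text):
    """Share of non-whitespace characters that fall in the Devanagari block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_devanagari(ch) for ch in chars) / len(chars)


text = "नमस्ते"
token_ids = [5, 6]  # placeholder IDs, purely for illustration
print(chars_per_token(text, token_ids))
print(bytes_per_token(text, token_ids))
print(devanagari_fraction("नमस्ते world"))
```

Each Devanagari code point takes three bytes in UTF-8, so the two conventions differ by about a factor of three on pure Hindi text; it is worth stating which one a reported ratio uses.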