---
language:
  - bn
license: apache-2.0
tags:
  - tokenizer
  - bangla
  - bengali
---

# Custom Bangla Tokenizer

A specialized Bangla (Bengali) tokenizer extracted from a multilingual model and extended with characters missing from the base vocabulary.

## Features

- Focused on Bangla text tokenization
- Adds support for characters missing from the base vocabulary, such as ঢ়
- Reduced vocabulary size: 21,607 tokens
- Compatible with Hugging Face Transformers
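The character ঢ় (U+09DD, BENGALI LETTER RHA) illustrates why explicit coverage matters. U+09DD is a Unicode composition exclusion: both NFD and NFC normalization split it into the base letter ঢ plus the nukta sign, so a tokenizer trained on normalized text may never see the single-codepoint form. A quick check with the Python standard library (not part of this tokenizer's code):

```python
import unicodedata

# Precomposed form: U+09DD BENGALI LETTER RHA ("ঢ়")
rha = "\u09dd"

# NFD splits it into U+09A2 (DDHA) + U+09BC (nukta sign)
assert unicodedata.normalize("NFD", rha) == "\u09a2\u09bc"

# U+09DD is a composition exclusion, so even NFC leaves it
# decomposed -- normalizing pipelines only see the two-codepoint form
assert unicodedata.normalize("NFC", rha) == "\u09a2\u09bc"

print(unicodedata.name(rha))  # BENGALI LETTER RHA
```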

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yasserius/bangla-tokenizer")

# Tokenize Bangla text
text = "আমি বাংলায় কথা বলি"
tokens = tokenizer.tokenize(text)
print(tokens)
```
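Because the base model (MuRIL, a BERT-style model) uses WordPiece, `tokenize()` breaks out-of-vocabulary words into subword pieces prefixed with `##`. The segmentation rule is greedy longest-match-first; a self-contained toy sketch of that rule (the tiny vocabulary here is hypothetical, for illustration only -- the real tokenizer loads its 21,607-entry vocabulary from the Hub):

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation, as in WordPiece.

    Toy illustration; non-initial pieces carry the '##' prefix.
    """
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            # No piece matched at this position: the whole word maps to UNK
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary
vocab = {"ব", "##লি"}
print(wordpiece("বলি", vocab))  # ['ব', '##লি']
```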

## Model Details

- Base model: extracted from `google/muril-base-cased`
- Language: Bengali (Bangla)
- Vocabulary size: 21,607
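Trimming the vocabulary mainly shrinks the embedding table. Back-of-the-envelope arithmetic, under the assumption (not stated in this card) that MuRIL-base uses a roughly 197k-entry vocabulary and a hidden size of 768:

```python
# Assumed figures for the base model; only the 21,607 is from this card
full_vocab, trimmed_vocab, hidden = 197_285, 21_607, 768

full_params = full_vocab * hidden        # embedding parameters before trimming
trimmed_params = trimmed_vocab * hidden  # embedding parameters after trimming

print(trimmed_params)                    # 16,594,176 parameters
print(full_params - trimmed_params)      # parameters removed from the table
```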