---
language:
- bn
license: apache-2.0
tags:
- tokenizer
- bangla
- bengali
---

# Custom Bangla Tokenizer

A specialized Bangla/Bengali tokenizer extracted from multilingual models and extended with missing characters.

## Features

- Focused on Bangla text tokenization
- Support for characters like ঢ় and other missing characters
- Reduced vocabulary size: 21,607 tokens
- Compatible with Hugging Face Transformers

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yasserius/bangla-tokenizer")

# Tokenize Bangla text
text = "আমি বাংলায় কথা বলি"
tokens = tokenizer.tokenize(text)
print(tokens)
```

## Model Details

- Base model: Extracted from google/muril-base-cased
- Language: Bengali/Bangla
- Vocabulary size: 21,607
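
The feature list mentions support for characters such as ঢ় that were previously missing. One way to check that coverage yourself is sketched below. `find_unknown_chars` is a hypothetical helper written for this README, not part of Transformers. It assumes only that the tokenizer exposes `tokenize()` and `unk_token`, as Hugging Face tokenizers do.

```python
from typing import Iterable, List


def find_unknown_chars(tokenizer, chars: Iterable[str]) -> List[str]:
    """Return the characters that the tokenizer cannot represent.

    `tokenizer` is any object exposing `tokenize(text)` and `unk_token`.
    This is a hypothetical helper for illustration, not a library function.
    """
    missing = []
    for ch in chars:
        tokens = tokenizer.tokenize(ch)
        # An empty result or an unknown-token marker means the character
        # is not covered by the vocabulary.
        if not tokens or tokenizer.unk_token in tokens:
            missing.append(ch)
    return missing
```

With the tokenizer loaded as in the Usage section, calling `find_unknown_chars(tokenizer, ["ঢ়", "ড়", "য়"])` should return an empty list if those characters are covered. The sample characters are illustrative only, so substitute whichever ones matter for your data.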