---
language:
- bn
license: apache-2.0
tags:
- tokenizer
- bangla
- bengali
---

# Custom Bangla Tokenizer

A specialized Bangla (Bengali) tokenizer whose vocabulary was extracted from a multilingual model (google/muril-base-cased) and extended with missing characters.

## Features

- Focused on Bangla text tokenization
- Adds support for ঢ় and other previously missing characters
- Reduced vocabulary size: 21,607 tokens
- Compatible with Hugging Face Transformers

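One way to check character coverage, such as the ঢ় support listed above, is to look for words that fall back to the unknown token. The sketch below is illustrative and not part of this repository: `find_unknown_words` and `demo` are hypothetical helper names, and it assumes the tokenizer defines `unk_token`, as BERT-style tokenizers usually do.

```python
def find_unknown_words(tokenizer, text):
    """Return the whitespace-separated words that tokenize to the unknown token.

    Hypothetical helper: `tokenizer` is any object exposing `tokenize()` and
    `unk_token`, which Hugging Face tokenizers provide.
    """
    unknown = []
    for word in text.split():
        if tokenizer.unk_token in tokenizer.tokenize(word):
            unknown.append(word)
    return unknown


def demo():
    # Imported lazily so the helper above can be used without transformers.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("yasserius/bangla-tokenizer")
    # Sample sentence chosen for illustration; two of its words contain ঢ়.
    sample = "আষাঢ় মাসে গাঢ় মেঘ"
    print(find_unknown_words(tokenizer, sample))
```

Calling `demo()` downloads the tokenizer from the Hugging Face Hub, so it needs network access. An empty list means every word in the sample was covered by the vocabulary.
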
## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yasserius/bangla-tokenizer")

# Tokenize Bangla text (the sentence means "I speak Bangla")
text = "আমি বাংলায় কথা বলি"
tokens = tokenizer.tokenize(text)
print(tokens)
```

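Models consume integer IDs rather than token strings, so it can help to see the encode and decode steps as well. This is a minimal sketch, not the author's documented workflow: `encode` and `decode` are standard Transformers tokenizer methods, while `encode_and_decode` and `demo` are hypothetical names introduced here.

```python
def encode_and_decode(tokenizer, text):
    """Return (token_ids, decoded_text) for `text`, without special tokens.

    Hypothetical helper: works with any tokenizer exposing the standard
    `encode()` and `decode()` methods.
    """
    ids = tokenizer.encode(text, add_special_tokens=False)
    return ids, tokenizer.decode(ids)


def demo():
    # Imported lazily so the helper above can be used without transformers.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("yasserius/bangla-tokenizer")
    ids, decoded = encode_and_decode(tokenizer, "আমি বাংলায় কথা বলি")
    print(ids)
    print(decoded)
```

Calling `demo()` requires network access to fetch the tokenizer. The decoded text is not guaranteed to match the input character for character, since a tokenizer may normalize whitespace or characters along the way.
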
## Model Details

- Base model: Extracted from google/muril-base-cased
- Language: Bengali/Bangla
- Vocabulary size: 21,607
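
The vocabulary size can be checked after loading. In Transformers, `vocab_size` reports the base vocabulary, while `len(tokenizer)` also counts tokens added afterwards. Which of the two equals 21,607 for this tokenizer is not stated here, so the sketch below reports both. The names `vocabulary_sizes` and `demo` are hypothetical.

```python
def vocabulary_sizes(tokenizer):
    """Return the base vocabulary size and the size including added tokens.

    Hypothetical helper: relies on `vocab_size` and `len()`, both of which
    Hugging Face tokenizers support.
    """
    return {
        "base": tokenizer.vocab_size,
        "with_added_tokens": len(tokenizer),
    }


def demo():
    # Imported lazily so the helper above can be used without transformers.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("yasserius/bangla-tokenizer")
    print(vocabulary_sizes(tokenizer))
```

Calling `demo()` requires network access to fetch the tokenizer.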