---
language:
- bn
license: apache-2.0
tags:
- tokenizer
- bangla
- bengali
---
# Custom Bangla Tokenizer
A Bangla/Bengali tokenizer extracted from the multilingual model google/muril-base-cased and extended with characters missing from its vocabulary.
## Features
- Focused on Bangla text tokenization
- Adds characters missing from the base vocabulary, such as ঢ়
- Reduced vocabulary size compared with the multilingual base: 21,607 tokens
- Compatible with Hugging Face Transformers
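
One way to check the extended character support is to look for the unknown token in the tokenizer's output. The sketch below assumes a Hugging Face tokenizer object that exposes `tokenize` and `unk_token`. The helper name `find_unsupported` is hypothetical and not part of this repository.

```python
def find_unsupported(tokenizer, samples):
    """Return the samples whose tokenization contains the unknown token.

    Illustrative helper (not part of this repo). `tokenizer` is assumed to be
    a Hugging Face tokenizer with `tokenize()` and `unk_token`.
    """
    unk = tokenizer.unk_token
    # A sample counts as unsupported if any of its tokens is the unknown token.
    return [s for s in samples if unk in tokenizer.tokenize(s)]
```

With a loaded tokenizer, an empty result from `find_unsupported(tokenizer, ["ঢ়"])` would be consistent with the character support listed above.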
## Usage
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("yasserius/bangla-tokenizer")
# Tokenize Bangla text
text = "আমি বাংলায় কথা বলি"
tokens = tokenizer.tokenize(text)
print(tokens)
```
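
Model input usually needs token IDs rather than token strings. The following is a minimal sketch, assuming the standard Transformers tokenizer methods `encode` and `decode`. The helper name `roundtrip` is hypothetical.

```python
def roundtrip(tokenizer, text):
    """Encode `text` to token IDs, then decode the IDs back to a string.

    Illustrative helper (not part of this repo). Special tokens are left out
    so the decoded string can be compared with the input.
    """
    ids = tokenizer.encode(text, add_special_tokens=False)
    decoded = tokenizer.decode(ids)
    return ids, decoded
```

Calling `roundtrip(tokenizer, "আমি বাংলায় কথা বলি")` returns the ID list and the decoded text. The decoded string may differ slightly from the input in spacing or normalization, depending on the tokenizer's settings.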
## Model Details
- Base model: google/muril-base-cased (this tokenizer was extracted from it)
- Language: Bengali/Bangla
- Vocabulary size: 21,607
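
The vocabulary size can be checked against a loaded tokenizer. In Transformers, `vocab_size` reports the base vocabulary, while `len(tokenizer)` also counts added tokens. The two can differ if the extra characters were registered as added tokens. Which of the two corresponds to 21,607 is not stated here, so the sketch reports both. `vocab_summary` is a hypothetical helper.

```python
def vocab_summary(tokenizer):
    """Report base, total, and added vocabulary counts for a tokenizer.

    Illustrative helper (not part of this repo). Assumes a Hugging Face
    tokenizer, where `vocab_size` excludes added tokens and `len()` includes them.
    """
    base = tokenizer.vocab_size
    total = len(tokenizer)
    return {"base": base, "total": total, "added": total - base}
```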