---
language: ml
license: mit
tags:
- malayalam
- tokenizer
- bpe
library_name: tokenizers
version: 0.1.0
---
# Malayalam BPE Tokenizer
A Byte Pair Encoding (BPE) tokenizer trained on a Malayalam text corpus.
Trained with the [HuggingFace tokenizers](https://github.com/huggingface/tokenizers) library,
using Metaspace pre-tokenization and NFC normalization so that Malayalam Unicode
conjuncts are handled correctly.
## Details
| Property | Value |
|---|---|
| Algorithm | BPE (Byte Pair Encoding) |
| Vocabulary size | 16,000 |
| Pre-tokenizer | Metaspace (`▁`) |
| Normalizer | NFC + Strip |
| Special tokens | `<unk>`, `<s>`, `</s>`, `<pad>`, `<mask>` |
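The configuration in the table can be reproduced with the tokenizers library. The sketch below is illustrative, not the published training script: the corpus file and the special-token list are placeholder assumptions.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Toy one-line corpus so the sketch is self-contained; replace with real corpus files.
with open("corpus.txt", "w", encoding="utf-8") as f:
    f.write("മലയാളം ഒരു ദ്രാവിഡ ഭാഷയാണ്\n")

# A common special-token convention — not necessarily this model's exact set.
specials = ["<unk>", "<s>", "</s>", "<pad>", "<mask>"]

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.normalizer = normalizers.Sequence([normalizers.NFC(), normalizers.Strip()])
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()  # marks word boundaries with "▁"

trainer = trainers.BpeTrainer(vocab_size=16000, special_tokens=specials)
tokenizer.train(["corpus.txt"], trainer)
tokenizer.save("malayalam-bpe.json")
```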
## Usage
```python
from transformers import AutoTokenizer

# Loads the tokenizers-library JSON as a fast tokenizer.
tokenizer = AutoTokenizer.from_pretrained("smc/malayalam-bpe-tokenizer")

text = "മലയാളം ഒരു ദ്രാവിഡ ഭാഷയാണ്"  # "Malayalam is a Dravidian language"
tokens = tokenizer.tokenize(text)
print(tokens)

# Full encoding: input IDs and attention mask as PyTorch tensors.
encoded = tokenizer(text, return_tensors="pt")
print(encoded)
```
## Notes
- Use Metaspace (not ByteLevel) pre-tokenization: ByteLevel operates on raw UTF-8 bytes,
  so learned merges can split a multi-byte Malayalam code point across tokens, producing
  unreadable byte-level pieces.
- NFC normalization folds canonically equivalent sequences (e.g. the two-part spellings of the
  vowel signs ൊ/ോ/ൌ) into a single form, so visually identical Malayalam text always produces
  the same tokens; ZWJ/ZWNJ, which select chillu and conjunct forms, pass through NFC unchanged.
- Trained and published from [smc/malayalam-tokenizer](https://github.com/smc/malayalam-tokenizer).
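The effect of NFC normalization can be checked with Python's standard `unicodedata` module: the precomposed vowel sign ൊ (U+0D4A) is canonically equivalent to the two-part sequence െ (U+0D46) + ാ (U+0D3E), and NFC folds both spellings into one.

```python
import unicodedata

# "കൊ" typed two ways: precomposed vowel sign O vs. its canonical decomposition.
composed = "\u0d15\u0d4a"          # ക + ൊ (U+0D4A)
decomposed = "\u0d15\u0d46\u0d3e"  # ക + െ (U+0D46) + ാ (U+0D3E)

print(composed == decomposed)                              # distinct code point sequences
print(unicodedata.normalize("NFC", decomposed) == composed)  # identical after NFC
```

Without NFC, the two spellings would receive different token sequences even though they render identically.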