---
language: ml
license: mit
tags:
- malayalam
- tokenizer
- bpe
library_name: tokenizers
version: 0.1.0
---

# Malayalam BPE Tokenizer

A Byte Pair Encoding (BPE) tokenizer trained on a Malayalam text corpus. It was trained with the [HuggingFace tokenizers](https://github.com/huggingface/tokenizers) library, using Metaspace pre-tokenization and NFC normalization for correct handling of Malayalam Unicode conjuncts.

## Details

| Property | Value |
|---|---|
| Algorithm | BPE (Byte Pair Encoding) |
| Vocabulary size | 16,000 |
| Pre-tokenizer | Metaspace (`▁`) |
| Normalizer | NFC + Strip |
| Special tokens | `<s>`, `</s>`, `<unk>`, `<pad>`, `<mask>` |

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("smc/malayalam-bpe-tokenizer")

text = "മലയാളം ഒരു ദ്രാവിഡ ഭാഷയാണ്"
tokens = tokenizer.tokenize(text)
print(tokens)

encoded = tokenizer(text, return_tensors="pt")
print(encoded)
```

## Notes

- Metaspace (not ByteLevel) pre-tokenization is used: ByteLevel operates on raw UTF-8 bytes and can split Malayalam's multibyte characters into invalid byte fragments.
- NFC normalization ensures Malayalam conjuncts formed via ZWJ/ZWNJ are handled consistently.
- Trained and published from [smc/malayalam-tokenizer](https://github.com/smc/malayalam-tokenizer).
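The ByteLevel caveat in the Notes can be seen directly in plain Python: every Malayalam code point occupies 3 bytes in UTF-8, so merges over raw bytes can land inside a character, producing fragments that are not valid text on their own. NFC also gives each string a single canonical form before tokenization. A small demonstration:

```python
import unicodedata

text = "മലയാളം"

# 6 code points, but 18 UTF-8 bytes (3 bytes per Malayalam code point).
# A byte-level tokenizer sees the 18 bytes, so a merge boundary can fall
# mid-character; a Metaspace pre-tokenizer works on whole code points.
print(len(text))                  # 6
print(len(text.encode("utf-8")))  # 18

# NFC normalization maps canonically equivalent sequences to one form,
# so visually identical Malayalam strings tokenize identically.
normalized = unicodedata.normalize("NFC", text)
print(normalized == text)
```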
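Putting the Details table together, a training run with this configuration could be sketched as below. This is a minimal illustration, not the published training script (which lives in the smc/malayalam-tokenizer repo); the toy corpus, variable names, and the exact special-token ordering are assumptions.

```python
from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers

# BPE model with NFC + Strip normalization and Metaspace ("▁") pre-tokenization,
# matching the configuration in the Details table above.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.normalizer = normalizers.Sequence([normalizers.NFC(), normalizers.Strip()])
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
tokenizer.decoder = decoders.Metaspace()

trainer = trainers.BpeTrainer(
    vocab_size=16000,
    special_tokens=["<s>", "</s>", "<unk>", "<pad>", "<mask>"],
)

# Toy one-sentence corpus purely for illustration; the real tokenizer was
# trained on a much larger Malayalam corpus.
corpus = ["മലയാളം ഒരു ദ്രാവിഡ ഭാഷയാണ്"]
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Because Metaspace keeps whole code points, encode/decode round-trips cleanly.
ids = tokenizer.encode("മലയാളം").ids
print(tokenizer.decode(ids))
```

The Metaspace decoder mirrors the pre-tokenizer, so `decode(encode(text).ids)` recovers the original text with word boundaries intact.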