---
language: ml
license: mit
tags:
- malayalam
- tokenizer
- bpe
library_name: tokenizers
version: 0.1.0
---

# Malayalam BPE Tokenizer

A Byte Pair Encoding (BPE) tokenizer trained on a Malayalam text corpus, built with the [HuggingFace tokenizers](https://github.com/huggingface/tokenizers) library using Metaspace pre-tokenization and NFC normalization for correct handling of Malayalam Unicode conjuncts.

## Details

| Property | Value |
|---|---|
| Algorithm | BPE (Byte Pair Encoding) |
| Vocabulary size | 16,000 |
| Pre-tokenizer | Metaspace (`▁`) |
| Normalizer | NFC + Strip |
| Special tokens | `<s>`, `</s>`, `<unk>`, `<pad>`, `<mask>` |

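A tokenizer with the settings in this table can be assembled with the `tokenizers` library. The following is a minimal training sketch, not the exact script used for this repo; the in-memory corpus is illustrative only:

```python
from tokenizers import Tokenizer, normalizers, pre_tokenizers, trainers
from tokenizers.models import BPE

# BPE model with an explicit unknown token
tokenizer = Tokenizer(BPE(unk_token="<unk>"))

# Normalizer: NFC + Strip, as listed in the table above
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFC(), normalizers.Strip()]
)

# Metaspace replaces spaces with "▁" before BPE merges are learned
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()

trainer = trainers.BpeTrainer(
    vocab_size=16000,
    special_tokens=["<s>", "</s>", "<unk>", "<pad>", "<mask>"],
)

# Illustrative two-sentence corpus; the published tokenizer was trained
# on a full Malayalam corpus
corpus = [
    "മലയാളം ഒരു ദ്രാവിഡ ഭാഷയാണ്",
    "കേരളത്തിലെ ഔദ്യോഗിക ഭാഷ മലയാളമാണ്",
]
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.encode("മലയാളം").tokens)
```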
## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("smc/malayalam-bpe-tokenizer")

text = "മലയാളം ഒരു ദ്രാവിഡ ഭാഷയാണ്"
tokens = tokenizer.tokenize(text)
print(tokens)

encoded = tokenizer(text, return_tensors="pt")
print(encoded)
```

## Notes

- Use Metaspace (not ByteLevel) pre-tokenization: ByteLevel splits Malayalam's multibyte UTF-8 sequences into individual bytes, so base-vocabulary tokens are not valid Malayalam characters.
- NFC normalization ensures Malayalam conjuncts formed via ZWJ/ZWNJ are handled consistently.
- Trained and published from [smc/malayalam-tokenizer](https://github.com/smc/malayalam-tokenizer).

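The value of NFC can be seen with Python's standard library alone. The Malayalam vowel sign ൊ (U+0D4A) has a canonical decomposition into െ (U+0D46) + ാ (U+0D3E): the two spellings render identically but compare unequal as strings, and NFC folds them to the same code-point sequence so they tokenize identically. This example is a standard-library illustration, not code from this repo:

```python
import unicodedata

composed = "പൊ"                # പ + ൊ (U+0D4A): two code points
decomposed = "പ\u0d46\u0d3e"   # പ + െ + ാ: three code points

# Visually identical, but different code-point sequences
print(composed == decomposed)

# NFC recomposes the decomposed spelling into the canonical form
print(unicodedata.normalize("NFC", decomposed) == composed)
```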