---
language: ml
license: mit
tags:
- malayalam
- tokenizer
- bpe
library_name: tokenizers
version: 0.1.0
---

# Malayalam BPE Tokenizer

A Byte Pair Encoding (BPE) tokenizer trained on a Malayalam text corpus. It was trained with the [HuggingFace tokenizers](https://github.com/huggingface/tokenizers) library, using Metaspace pre-tokenization and NFC normalization for correct handling of Malayalam Unicode conjuncts.

## Details

| Property | Value |
|---|---|
| Algorithm | BPE (Byte Pair Encoding) |
| Vocabulary size | 16,000 |
| Pre-tokenizer | Metaspace (`▁`) |
| Normalizer | NFC + Strip |
| Special tokens | `<s>`, `</s>`, `<unk>`, `<pad>`, `<mask>` |

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("smc/malayalam-bpe-tokenizer")

text = "മലയാളം ഒരു ദ്രാവിഡ ഭാഷയാണ്"
tokens = tokenizer.tokenize(text)
print(tokens)

encoded = tokenizer(text, return_tensors="pt")
print(encoded)
```

## Notes

- Metaspace (not ByteLevel) pre-tokenization is used: ByteLevel operates on raw UTF-8 bytes and can split Malayalam's multibyte characters into invalid byte fragments.
- NFC normalization ensures Malayalam conjuncts formed via ZWJ/ZWNJ are handled consistently.
- Trained and published from [smc/malayalam-tokenizer](https://github.com/smc/malayalam-tokenizer).
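The ByteLevel caveat in the Notes can be seen directly in plain Python: every Malayalam code point occupies 3 bytes in UTF-8, so merges over raw bytes can land inside a character, producing fragments that are not valid text on their own. NFC also gives each string a single canonical form before tokenization. A small demonstration:

```python
import unicodedata

text = "മലയാളം"

# 6 code points, but 18 UTF-8 bytes (3 bytes per Malayalam code point).
# A byte-level tokenizer sees the 18 bytes, so a merge boundary can fall
# mid-character; a Metaspace pre-tokenizer works on whole code points.
print(len(text))                  # 6
print(len(text.encode("utf-8")))  # 18

# NFC normalization maps canonically equivalent sequences to one form,
# so visually identical Malayalam strings tokenize identically.
normalized = unicodedata.normalize("NFC", text)
print(normalized == text)
```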
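Putting the Details table together, a training run with this configuration could be sketched as below. This is a minimal illustration, not the published training script (which lives in the smc/malayalam-tokenizer repo); the toy corpus, variable names, and the exact special-token ordering are assumptions.

```python
from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers

# BPE model with NFC + Strip normalization and Metaspace ("▁") pre-tokenization,
# matching the configuration in the Details table above.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.normalizer = normalizers.Sequence([normalizers.NFC(), normalizers.Strip()])
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
tokenizer.decoder = decoders.Metaspace()

trainer = trainers.BpeTrainer(
    vocab_size=16000,
    special_tokens=["<s>", "</s>", "<unk>", "<pad>", "<mask>"],
)

# Toy one-sentence corpus purely for illustration; the real tokenizer was
# trained on a much larger Malayalam corpus.
corpus = ["മലയാളം ഒരു ദ്രാവിഡ ഭാഷയാണ്"]
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Because Metaspace keeps whole code points, encode/decode round-trips cleanly.
ids = tokenizer.encode("മലയാളം").ids
print(tokenizer.decode(ids))
```

The Metaspace decoder mirrors the pre-tokenizer, so `decode(encode(text).ids)` recovers the original text with word boundaries intact.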