---
language: ml
license: mit
tags:
- malayalam
- tokenizer
- bpe
library_name: tokenizers
version: 0.1.0
---

# Malayalam BPE Tokenizer

A Byte Pair Encoding (BPE) tokenizer trained on a Malayalam text corpus, built with the [HuggingFace tokenizers](https://github.com/huggingface/tokenizers) library using Metaspace pre-tokenization and NFC normalization for correct handling of Malayalam Unicode conjuncts.

## Details

| Property | Value |
|---|---|
| Algorithm | BPE (Byte Pair Encoding) |
| Vocabulary size | 16,000 |
| Pre-tokenizer | Metaspace (`▁`) |
| Normalizer | NFC + Strip |
| Special tokens | `<s>`, `</s>`, `<unk>`, `<pad>`, `<mask>` |

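A tokenizer with the settings in this table can be assembled with the `tokenizers` library. The following is a minimal training sketch, not the exact script used for this repo; the in-memory corpus is illustrative only:

```python
from tokenizers import Tokenizer, normalizers, pre_tokenizers, trainers
from tokenizers.models import BPE

# BPE model with an explicit unknown token
tokenizer = Tokenizer(BPE(unk_token="<unk>"))

# Normalizer: NFC + Strip, as listed in the table above
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFC(), normalizers.Strip()]
)

# Metaspace replaces spaces with "▁" before BPE merges are learned
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()

trainer = trainers.BpeTrainer(
    vocab_size=16000,
    special_tokens=["<s>", "</s>", "<unk>", "<pad>", "<mask>"],
)

# Illustrative two-sentence corpus; the published tokenizer was trained
# on a full Malayalam corpus
corpus = [
    "മലയാളം ഒരു ദ്രാവിഡ ഭാഷയാണ്",
    "കേരളത്തിലെ ഔദ്യോഗിക ഭാഷ മലയാളമാണ്",
]
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.encode("മലയാളം").tokens)
```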
## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("smc/malayalam-bpe-tokenizer")

text = "മലയാളം ഒരു ദ്രാവിഡ ഭാഷയാണ്"
tokens = tokenizer.tokenize(text)
print(tokens)

encoded = tokenizer(text, return_tensors="pt")
print(encoded)
```

## Notes

- Use Metaspace (not ByteLevel) pre-tokenization: ByteLevel splits Malayalam's multibyte UTF-8 sequences into individual bytes, so base-vocabulary tokens are not valid Malayalam characters.
- NFC normalization ensures Malayalam conjuncts formed via ZWJ/ZWNJ are handled consistently.
- Trained and published from [smc/malayalam-tokenizer](https://github.com/smc/malayalam-tokenizer).

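The value of NFC can be seen with Python's standard library alone. The Malayalam vowel sign ൊ (U+0D4A) has a canonical decomposition into െ (U+0D46) + ാ (U+0D3E): the two spellings render identically but compare unequal as strings, and NFC folds them to the same code-point sequence so they tokenize identically. This example is a standard-library illustration, not code from this repo:

```python
import unicodedata

composed = "പൊ"                # പ + ൊ (U+0D4A): two code points
decomposed = "പ\u0d46\u0d3e"   # പ + െ + ാ: three code points

# Visually identical, but different code-point sequences
print(composed == decomposed)

# NFC recomposes the decomposed spelling into the canonical form
print(unicodedata.normalize("NFC", decomposed) == composed)
```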