Update README.md

fae928a verified 16 days ago

3.63 kB

	---
	license: mit
	language:
	- as
	tags:
	- assamese
	- tokenizer
	- axomiya
	- indic
	---

	# Assamese Tokenizer

	অসমীয়া ভাষাৰ বাবে এটি টোকেনাইজাৰ।

	A tokenizer for the Assamese language (অসমীয়া). It converts Assamese text into tokens, smaller units that AI models can process and learn from.

	## What is a tokenizer?

	Computers & AI models process numerical data, not natural language. A tokenizer bridges this gap by converting text into numerical representations, it breaks sentences into smaller units called tokens and assigns each token a unique numeric identifier.

	For example, "অসম এখন ধুনীয়া ৰাজ্য" is split into 5 tokens:

	`অসম` → `এখন` → `ধুনীয়া` → `ৰাজ্য` → `।`

	Each token has a numeric ID. A language model trained on these IDs learns which tokens follow which, capturing grammar, style, and meaning.

	## Why this tokenizer exists

	Most tokenizers are designed for English or Hindi. Assamese support is limited and often inadequate. This tokenizer was built from scratch for Assamese language, it understands the Assamese script, handles compound words, and covers the full character set.

	- 32,000 tokens — common words remain intact; rare words split naturally
	- Zero unknown tokens — every Assamese character is recognized
	- Lossless roundtrip — encoding and decoding produces the original text
	- Assamese digits work individually — `২০২৪` is split into separate digits rather than merged

	## Special tokens

	These tokens are used for chat and instruction-following models:

	`<\|system\|>` `<\|user\|>` `<\|assistant\|>` `<\|endoftext\|>`

	## Training data

	Trained on 12.5 million Assamese sentences collected from public sources including news, books, Wikipedia, and web content. The data was cleaned, filtered for quality, and deduplicated.

	## Usage

	```python
	import sentencepiece as spm

	sp = spm.SentencePieceProcessor()
	sp.Load("tokenizer.model")

	text = "অসম এখন ধুনীয়া ৰাজ্য।"
	ids = sp.EncodeAsIds(text)
	pieces = sp.EncodeAsPieces(text)
	decoded = sp.DecodeIds(ids)

	print(f"Tokens: {len(pieces)}, IDs: {ids}")
	print(f"Match: {decoded == text}")
	```

	Output:
	```
	Tokens: 5, IDs: [346, 344, 4628, 550, 282]
	Match: True
	```

	## Training an Assamese language model

	The tokenizer is the foundation. Here is how it fits into a complete training pipeline:

	Step 1 — Tokenize your data
	```python
	import sentencepiece as spm

	sp = spm.SentencePieceProcessor()
	sp.Load("tokenizer.model")

	with open("corpus.txt", "r", encoding="utf-8") as f:
	text = f.read()

	ids = sp.EncodeAsIds(text)
	```

	Step 2 — Train a model
	Feed the token IDs into a transformer architecture. The model learns to predict the next token in a sequence, which teaches it Assamese grammar and style.

	Step 3 — Generate text
	```python
	prompt = "অসম এখন"
	prompt_ids = sp.EncodeAsIds(prompt)

	# The model predicts subsequent tokens one at a time
	# generated_ids = model.generate(prompt_ids)

	# Convert the output back to Assamese
	# generated_text = sp.DecodeIds(generated_ids)
	```

	The tokenizer remains the same throughout, it is used for both training and inference.

	## Files

	\| File \| Description \|
	\|------\|-------------\|
	\| `tokenizer.model` \| The trained tokenizer model \|
	\| `tokenizer.vocab` \| Vocabulary of 32,000 tokens with scores \|
	\| `demo.py` \| Example script demonstrating usage \|

	## Author

	Anand Dey

	eMail - ananddey.nic@gmail.com

	## License

	MIT