---
license: mit
language:
- as
tags:
- assamese
- tokenizer
- axomiya
- indic
---
# Assamese Tokenizer
অসমীয়া ভাষাৰ বাবে এটি টোকেনাইজাৰ।
A tokenizer for the **Assamese language** (অসমীয়া). It converts Assamese text into tokens, smaller units that AI models can process and learn from.
## What is a tokenizer?
Computers and AI models process numbers, not natural language. A tokenizer bridges this gap: it breaks text into smaller units called tokens and assigns each token a unique numeric ID.
For example, **"অসম এখন ধুনীয়া ৰাজ্য।"** is split into 5 tokens:
`অসম` `এখন` `ধুনীয়া` `ৰাজ্য` `।`
Each token has a numeric ID. A language model trained on these IDs learns which tokens follow which, capturing grammar, style, and meaning.
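
As a toy illustration of the text-to-ID mapping, the sketch below splits on whitespace and assigns IDs in order of first appearance. This is not how this tokenizer works internally: its vocabulary is learned from data and its IDs differ. The function names are made up for this example.

```python
def build_vocab(sentences):
    """Assign each distinct whitespace-separated token an ID in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Map each token of the sentence to its ID (raises KeyError for unseen tokens)."""
    return [vocab[token] for token in sentence.split()]


def decode(ids, vocab):
    """Invert the mapping and join the tokens with single spaces."""
    id_to_token = {i: t for t, i in vocab.items()}
    return " ".join(id_to_token[i] for i in ids)


# The danda is separated by a space here so the toy whitespace split sees it as a token.
sentence = "অসম এখন ধুনীয়া ৰাজ্য ।"
vocab = build_vocab([sentence])
ids = encode(sentence, vocab)
print(ids)  # [0, 1, 2, 3, 4]
print(decode(ids, vocab) == sentence)  # True
```

A real subword tokenizer differs mainly in that it can split rare words into smaller known pieces instead of failing on unseen tokens.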
## Why this tokenizer exists
Most tokenizers are designed for English or Hindi; Assamese support is limited and often inadequate. This tokenizer was built **from scratch** for the Assamese language. It understands the Assamese script, handles compound words, and covers the full character set.
- **32,000 tokens**: common words remain intact; rare words split into smaller pieces
- **Zero unknown tokens**: every Assamese character is recognized
- **Lossless roundtrip**: encoding and then decoding reproduces the original text
- **Assamese digits are tokenized individually**: `২০২৪` is split into separate digits rather than merged
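
The roundtrip property is easy to spot-check on your own text. A minimal sketch, assuming `encode` and `decode` are any pair of callables (for this tokenizer, `sp.EncodeAsIds` and `sp.DecodeIds`). The helper name `roundtrip_failures` and the character-level stand-in codec are hypothetical and exist only so the example runs without the model file.

```python
def roundtrip_failures(encode, decode, texts):
    """Return the texts that do not survive encode -> decode unchanged."""
    return [t for t in texts if decode(encode(t)) != t]


# Stand-in codec (an assumption for demonstration): one ID per Unicode code point.
def char_encode(text):
    return [ord(ch) for ch in text]


def char_decode(ids):
    return "".join(chr(i) for i in ids)


samples = ["অসম এখন ধুনীয়া ৰাজ্য।", "২০২৪"]
failures = roundtrip_failures(char_encode, char_decode, samples)
print(failures)  # []
```

With the real tokenizer loaded as `sp`, pass `sp.EncodeAsIds` and `sp.DecodeIds` instead of the stand-ins, and inspect `sp.EncodeAsPieces("২০২৪")` to see how digits are split.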
## Special tokens
These tokens are used for chat and instruction-following models:
`<|system|>` `<|user|>` `<|assistant|>` `<|endoftext|>`
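
This README does not specify a chat template, so the layout below (role token followed by the message text, one turn per line, ending with `<|assistant|>` to cue the reply) is an assumption for illustration, as is the helper name `build_prompt`. Match whatever layout your model was actually trained on.

```python
ROLE_TOKENS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}


def build_prompt(messages):
    """Format (role, text) pairs into one prompt string that ends with the assistant tag."""
    lines = [f"{ROLE_TOKENS[role]}{text}" for role, text in messages]
    # Hypothetical convention: a trailing assistant tag asks the model to write the reply.
    lines.append(ROLE_TOKENS["assistant"])
    return "\n".join(lines)


prompt = build_prompt([
    ("system", "You are a helpful assistant."),
    ("user", "নমস্কাৰ"),
])
print(prompt)
```

During training, a completed assistant turn would typically be followed by `<|endoftext|>` so the model learns where a reply stops; that convention is also an assumption here.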
## Training data
Trained on **12.5 million** Assamese sentences collected from public sources including news, books, Wikipedia, and web content. The data was cleaned, filtered for quality, and deduplicated.
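
The exact cleaning pipeline is not documented here. As a rough sketch of what such a step can look like, the code below normalizes whitespace, drops very short lines and lines that are mostly outside the Unicode Bengali block (U+0980 to U+09FF, which Assamese shares with Bengali), and removes exact duplicates. The function names and thresholds are hypothetical, and the script check cannot tell Assamese from Bengali.

```python
def assamese_ratio(text):
    """Fraction of non-space characters that fall in the Unicode Bengali block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    in_block = sum(1 for ch in chars if "\u0980" <= ch <= "\u09ff")
    return in_block / len(chars)


def clean_corpus(sentences, min_ratio=0.8, min_chars=5):
    """Normalize whitespace, filter by length and script, and drop exact duplicates."""
    seen = set()
    kept = []
    for sentence in sentences:
        s = " ".join(sentence.split())  # collapse runs of whitespace
        if len(s) < min_chars or assamese_ratio(s) < min_ratio:
            continue
        if s in seen:
            continue
        seen.add(s)
        kept.append(s)
    return kept


raw = [
    "অসম এখন ধুনীয়া ৰাজ্য।",
    "অসম  এখন ধুনীয়া ৰাজ্য।",   # duplicate after whitespace normalization
    "Click here to subscribe",  # not Assamese script
    "ক",                         # too short
]
cleaned = clean_corpus(raw)
print(len(cleaned))  # 1
```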
## Usage
```python
import sentencepiece as spm

# Load the trained tokenizer model
sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")

text = "অসম এখন ধুনীয়া ৰাজ্য।"
ids = sp.EncodeAsIds(text)        # text -> token IDs
pieces = sp.EncodeAsPieces(text)  # text -> token strings
decoded = sp.DecodeIds(ids)       # token IDs -> text

print(f"Tokens: {len(pieces)}, IDs: {ids}")
print(f"Match: {decoded == text}")
```
Output:
```
Tokens: 5, IDs: [346, 344, 4628, 550, 282]
Match: True
```
## Training an Assamese language model
The tokenizer is the foundation. Here is how it fits into a complete training pipeline:
**Step 1 — Tokenize your data**
```python
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")
# Read the whole corpus as UTF-8 text
with open("corpus.txt", "r", encoding="utf-8") as f:
    text = f.read()
# Convert the text into a flat list of token IDs
ids = sp.EncodeAsIds(text)
```
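
For a large corpus, reading and encoding everything as one string can use a lot of memory. A possible alternative is to encode line by line and mark the end of each line with a separator ID. In the sketch below, `encode_lines` is a hypothetical helper, `encode` is any callable that returns a list of IDs, and using `<|endoftext|>` as the separator is an assumption rather than something this README prescribes.

```python
def encode_lines(lines, encode, eos_id):
    """Yield token IDs line by line, appending eos_id after each non-empty line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        yield from encode(line)
        yield eos_id


# Stand-in encoder so the example runs without the model file:
# each word is replaced by its length.
def fake_encode(text):
    return [len(word) for word in text.split()]


ids = list(encode_lines(["a bb", "", "ccc"], fake_encode, eos_id=0))
print(ids)  # [1, 2, 0, 3, 0]
```

With the real tokenizer, this would look like `encode_lines(open("corpus.txt", encoding="utf-8"), sp.EncodeAsIds, sp.PieceToId("<|endoftext|>"))`.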
**Step 2 — Train a model**
Feed the token IDs into a transformer architecture. The model learns to predict the next token in a sequence, which teaches it Assamese grammar and style.
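
Next-token prediction means each training example pairs a window of token IDs with the same window shifted one position to the right. A minimal sketch of that slicing, using the IDs from the usage example; the helper name `make_training_pairs` and the `block_size` parameter are hypothetical, and real training code would normally batch these as tensors.

```python
def make_training_pairs(ids, block_size):
    """Cut a token ID stream into (input, target) pairs; target is input shifted by one."""
    pairs = []
    for start in range(0, len(ids) - block_size, block_size):
        chunk = ids[start:start + block_size + 1]
        pairs.append((chunk[:-1], chunk[1:]))
    return pairs


ids = [346, 344, 4628, 550, 282]
pairs = make_training_pairs(ids, block_size=2)
for inputs, targets in pairs:
    print(inputs, "->", targets)
# [346, 344] -> [344, 4628]
# [4628, 550] -> [550, 282]
```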
**Step 3 — Generate text**
```python
prompt = "অসম এখন"
prompt_ids = sp.EncodeAsIds(prompt)
# The model predicts subsequent tokens one at a time
# generated_ids = model.generate(prompt_ids)
# Convert the output back to Assamese
# generated_text = sp.DecodeIds(generated_ids)
```
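
The "one token at a time" loop can be sketched with greedy decoding, which always picks the highest-scoring next token. Here `next_token_scores` stands in for a trained model and `dummy_scores` is a made-up scorer used only to make the example runnable; neither is part of this repository, and real models usually sample rather than decode greedily.

```python
def greedy_generate(next_token_scores, prompt_ids, eos_id, max_new_tokens=20):
    """Append the highest-scoring next token until eos_id appears or the limit is reached."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = next_token_scores(ids)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == eos_id:
            break
        ids.append(next_id)
    return ids


def dummy_scores(ids, vocab_size=5):
    """Hypothetical model: always favors (last ID + 1) modulo the vocabulary size."""
    scores = [0.0] * vocab_size
    scores[(ids[-1] + 1) % vocab_size] = 1.0
    return scores


generated_ids = greedy_generate(dummy_scores, [1], eos_id=4)
print(generated_ids)  # [1, 2, 3]
```

With a real model, `generated_ids` would then be passed to `sp.DecodeIds` to get Assamese text back.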
The tokenizer stays the same throughout: it is used for both training and inference.
## Files
| File | Description |
|------|-------------|
| `tokenizer.model` | The trained tokenizer model |
| `tokenizer.vocab` | Vocabulary of 32,000 tokens with scores |
| `demo.py` | Example script demonstrating usage |
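
SentencePiece normally writes a `.vocab` file as one entry per line, the piece and its score separated by a tab, in ID order. Assuming `tokenizer.vocab` follows that layout, it can be inspected with a few lines of Python. The helper name `parse_vocab` is hypothetical, and the sample lines below are illustrative, not copied from the actual file.

```python
def parse_vocab(lines):
    """Parse 'piece<TAB>score' lines into (id, piece, score) tuples, using line order as the ID."""
    entries = []
    for index, line in enumerate(lines):
        piece, score = line.rstrip("\n").rsplit("\t", 1)
        entries.append((index, piece, float(score)))
    return entries


# Illustrative sample in the assumed format (not the real vocabulary contents).
sample = ["<unk>\t0\n", "<s>\t0\n", "</s>\t0\n", "▁অসম\t-7.5\n"]
entries = parse_vocab(sample)
print(entries[3])  # (3, '▁অসম', -7.5)
```

To read the real file, pass an open file handle: `parse_vocab(open("tokenizer.vocab", encoding="utf-8"))`.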
## Author
**Anand Dey**
**Email:** ananddey.nic@gmail.com
## License
MIT