---
license: mit
language:
  - as
tags:
  - assamese
  - tokenizer
  - axomiya
  - indic
---

# Assamese Tokenizer

অসমীয়া ভাষাৰ বাবে এটি টোকেনাইজাৰ।

A tokenizer for the **Assamese language** (অসমীয়া). It converts Assamese text into tokens, smaller units that AI models can process and learn from.

## What is a tokenizer?

Computers and AI models process numbers, not natural language. A tokenizer bridges this gap: it breaks sentences into smaller units called tokens and assigns each token a unique numeric identifier.

For example, **"অসম এখন ধুনীয়া ৰাজ্য।"** ("Assam is a beautiful state.") is split into 5 tokens, with the closing danda `।` counted as its own token:

`অসম` → `এখন` → `ধুনীয়া` → `ৰাজ্য` → `।`

Each token has a numeric ID. A language model trained on these IDs learns which tokens follow which, capturing grammar, style, and meaning.
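
To make the text-to-ID mapping concrete, here is a toy sketch. It splits on whitespace and numbers tokens in order of first appearance; both are simplifications for illustration only, and the helpers `build_vocab`, `encode`, and `decode` are hypothetical names, not part of this repository. The real tokenizer uses a trained SentencePiece model instead.

```python
def build_vocab(sentences):
    """Give each distinct whitespace-separated token an ID, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def encode(text, vocab):
    """Map each token of the text to its numeric ID."""
    return [vocab[token] for token in text.split()]


def decode(ids, vocab):
    """Map IDs back to tokens and join them with single spaces."""
    id_to_token = {i: t for t, i in vocab.items()}
    return " ".join(id_to_token[i] for i in ids)


# The danda is set off by a space here only so that split() sees it as its own token.
sentence = "অসম এখন ধুনীয়া ৰাজ্য ।"
vocab = build_vocab([sentence])
ids = encode(sentence, vocab)

print(ids)                              # [0, 1, 2, 3, 4]
print(decode(ids, vocab) == sentence)   # True
```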

## Why this tokenizer exists

Most tokenizers are designed for English or Hindi. Assamese support is limited and often inadequate. This tokenizer was built **from scratch** for the Assamese language: it understands the Assamese script, handles compound words, and covers the full character set.

- **32,000 tokens**: common words remain intact; rare words are split into smaller subword pieces
- **Zero unknown tokens**: every Assamese character is recognized
- **Lossless roundtrip**: encoding and then decoding reproduces the original text
- **Assamese digits are tokenized individually**: `২০২৪` is split into separate digits rather than merged
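
The roundtrip and unknown-token properties above can be checked on your own text with a small helper. This is a minimal sketch: `check_tokenizer` is a hypothetical name, and it accepts any object that exposes the `EncodeAsIds`, `DecodeIds`, and `unk_id` methods of `SentencePieceProcessor`. Text with irregular whitespace may not roundtrip exactly, depending on the normalization the model was trained with, which is not stated here.

```python
def check_tokenizer(sp, texts):
    """Return (roundtrip_failures, texts_with_unknowns) for a SentencePiece-like processor."""
    unk = sp.unk_id()
    roundtrip_failures, with_unknowns = [], []
    for text in texts:
        ids = sp.EncodeAsIds(text)
        if sp.DecodeIds(ids) != text:
            roundtrip_failures.append(text)   # decoding did not reproduce the input
        if unk in ids:
            with_unknowns.append(text)        # at least one piece mapped to <unk>
    return roundtrip_failures, with_unknowns


# Usage with the real model (requires the sentencepiece package and tokenizer.model):
#   import sentencepiece as spm
#   sp = spm.SentencePieceProcessor()
#   sp.Load("tokenizer.model")
#   print(check_tokenizer(sp, ["অসম এখন ধুনীয়া ৰাজ্য।", "২০২৪"]))
#   print(sp.EncodeAsPieces("২০২৪"))  # inspect how the digits are split
```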

## Special tokens

These tokens are used for chat and instruction-following models:

`<|system|>` `<|user|>` `<|assistant|>` `<|endoftext|>`
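
One possible way to lay out a conversation with these tokens is sketched below. The turn order, the newline separators, and the `format_chat` helper are assumptions made for illustration; this README lists only the tokens themselves, not a chat template.

```python
SYSTEM, USER, ASSISTANT, END = "<|system|>", "<|user|>", "<|assistant|>", "<|endoftext|>"
ROLE_TOKENS = {"system": SYSTEM, "user": USER, "assistant": ASSISTANT}


def format_chat(messages, add_generation_prompt=True):
    """Join (role, text) pairs into one string, each turn prefixed by its role token."""
    parts = []
    for role, text in messages:
        parts.append(f"{ROLE_TOKENS[role]}\n{text}")
        if role == "assistant":
            parts.append(END)        # close a completed assistant reply
    if add_generation_prompt:
        parts.append(ASSISTANT)      # leave the assistant turn open for the model
    return "\n".join(parts)


prompt = format_chat([("user", "অসম এখন ধুনীয়া ৰাজ্য।")])
print(prompt)
```

The resulting string would then be passed through the tokenizer like any other text.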

## Training data

Trained on **12.5 million** Assamese sentences collected from public sources including news, books, Wikipedia, and web content. The data was cleaned, filtered for quality, and deduplicated.
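
The exact cleaning pipeline is not described here, so the following is only a sketch of what cleaning, filtering, and deduplication can look like. The function names, the 0.6 script ratio, and the 10-character minimum are all illustrative assumptions. Note that the Unicode block U+0980 to U+09FF is shared by Bengali and Assamese, so this filter alone cannot tell the two languages apart.

```python
import unicodedata


def is_mostly_assamese(line, min_ratio=0.6):
    """True if at least min_ratio of non-space characters are in the Bengali-Assamese block."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return False
    in_block = sum(1 for c in chars if "\u0980" <= c <= "\u09FF")
    return in_block / len(chars) >= min_ratio


def clean_corpus(lines, min_chars=10):
    """Normalize, filter, and deduplicate lines, keeping first-seen order."""
    seen = set()
    for line in lines:
        line = unicodedata.normalize("NFC", line)
        line = " ".join(line.split())          # collapse runs of whitespace
        if len(line) < min_chars or not is_mostly_assamese(line):
            continue                           # too short or wrong script
        if line in seen:
            continue                           # exact duplicate
        seen.add(line)
        yield line
```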

## Usage

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")

text = "অসম এখন ধুনীয়া ৰাজ্য।"
ids = sp.EncodeAsIds(text)
pieces = sp.EncodeAsPieces(text)
decoded = sp.DecodeIds(ids)

print(f"Tokens: {len(pieces)}, IDs: {ids}")
print(f"Match: {decoded == text}")
```

Output:
```
Tokens: 5, IDs: [346, 344, 4628, 550, 282]
Match: True
```

## Training an Assamese language model

The tokenizer is the foundation. Here is how it fits into a complete training pipeline:

**Step 1 — Tokenize your data**
```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")

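# Encode one line at a time so the corpus is never passed to the tokenizer as a single huge string
ids = []
with open("corpus.txt", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:  # skip blank lines
            ids.extend(sp.EncodeAsIds(line))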
```

**Step 2 — Train a model**
Feed the token IDs into a transformer architecture. The model learns to predict the next token in a sequence, which teaches it Assamese grammar and style.
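
Next-token prediction means each training example pairs a window of IDs with the same window shifted one position to the left. The sketch below builds such pairs from the example IDs shown in Usage; `make_training_pairs` and `context_len` are hypothetical names, and real pipelines usually batch these windows into tensors.

```python
def make_training_pairs(ids, context_len):
    """Slide a window over the ID stream; targets are the inputs shifted by one token."""
    pairs = []
    for start in range(0, len(ids) - context_len):
        inputs = ids[start : start + context_len]
        targets = ids[start + 1 : start + context_len + 1]
        pairs.append((inputs, targets))
    return pairs


ids = [346, 344, 4628, 550, 282]
for inputs, targets in make_training_pairs(ids, context_len=3):
    print(inputs, "->", targets)
# [346, 344, 4628] -> [344, 4628, 550]
# [344, 4628, 550] -> [4628, 550, 282]
```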

**Step 3 — Generate text**
```python
# `sp` is the SentencePieceProcessor loaded in Step 1
prompt = "অসম এখন"
prompt_ids = sp.EncodeAsIds(prompt)

# The model predicts subsequent tokens one at a time
# generated_ids = model.generate(prompt_ids)

# Convert the output back to Assamese
# generated_text = sp.DecodeIds(generated_ids)
```
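
The "one token at a time" loop can be sketched without a real model. Here `generate`, `next_token_fn`, and `eos_id` are hypothetical names: `next_token_fn` stands in for whatever function your trained model exposes to pick the next token ID, and the toy version below simply replays a fixed continuation.

```python
def generate(prompt_ids, next_token_fn, eos_id, max_new_tokens=20):
    """Append one predicted token at a time until eos_id or the length limit is reached."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = next_token_fn(ids)   # a real model would score the whole context here
        if next_id == eos_id:
            break
        ids.append(next_id)
    return ids


# Stand-in "model": replays a fixed continuation, then returns the end-of-sequence ID.
continuation = iter([4628, 550, 282])
toy_model = lambda ids: next(continuation, -1)

print(generate([346, 344], toy_model, eos_id=-1))  # [346, 344, 4628, 550, 282]
```

The returned IDs would then go through `sp.DecodeIds` to recover Assamese text.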

The tokenizer remains the same throughout: it is used for both training and inference.

## Files

| File | Description |
|------|-------------|
| `tokenizer.model` | The trained tokenizer model |
| `tokenizer.vocab` | Vocabulary of 32,000 tokens with scores |
| `demo.py` | Example script demonstrating usage |
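
The `.vocab` file can be inspected directly. The sketch below assumes the usual SentencePiece layout of one `piece<TAB>score` pair per line, in token-ID order; `load_vocab` is a hypothetical helper, not something shipped in this repository.

```python
def load_vocab(path):
    """Read a SentencePiece .vocab file into a list of (piece, score); index = token ID."""
    entries = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue                       # ignore blank lines
            piece, _, score = line.rpartition("\t")
            entries.append((piece, float(score)))
    return entries


# Usage:
#   vocab = load_vocab("tokenizer.vocab")
#   print(len(vocab))    # this README states 32,000 entries
#   print(vocab[:5])     # first few pieces with their scores
```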

## Author

**Anand Dey**

**Email:** ananddey.nic@gmail.com

## License

MIT