| --- |
| license: mit |
| language: |
| - as |
| tags: |
| - assamese |
| - tokenizer |
| - axomiya |
| - indic |
| --- |
| |
| # Assamese Tokenizer |
|
|
| অসমীয়া ভাষাৰ বাবে এটি টোকেনাইজাৰ। |
|
|
| A tokenizer for the **Assamese language** (অসমীয়া). It converts Assamese text into tokens, smaller units that AI models can process and learn from. |
|
|
| ## What is a tokenizer? |
|
|
| Computers and AI models process numbers, not natural language. A tokenizer bridges this gap: it breaks sentences into smaller units called tokens and assigns each token a unique numeric ID. |
|
|
| For example, **"অসম এখন ধুনীয়া ৰাজ্য।"** is split into 5 tokens: |
|
|
| `অসম` → `এখন` → `ধুনীয়া` → `ৰাজ্য` → `।` |
|
|
| Each token has a numeric ID. A language model trained on these IDs learns which tokens follow which, capturing grammar, style, and meaning. |
|
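The token-to-ID mapping can be illustrated with a toy vocabulary. This is a minimal sketch only: the IDs are assigned in order of first appearance and are made up, so they do not match the real vocabulary in `tokenizer.model`.

```python
# Toy illustration only: these IDs are invented, not the real vocabulary.
sentence_tokens = ["অসম", "এখন", "ধুনীয়া", "ৰাজ্য", "।"]

# Assign each distinct token an ID in order of first appearance.
token_to_id = {}
for token in sentence_tokens:
    token_to_id.setdefault(token, len(token_to_id))

# Reverse mapping, used to turn IDs back into tokens.
id_to_token = {i: t for t, i in token_to_id.items()}

ids = [token_to_id[t] for t in sentence_tokens]
restored = [id_to_token[i] for i in ids]

print(ids)                          # [0, 1, 2, 3, 4]
print(restored == sentence_tokens)  # True
```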
|
| ## Why this tokenizer exists |
|
|
| Most tokenizers are designed for English or Hindi. Assamese support is limited and often inadequate. This tokenizer was built **from scratch** for the Assamese language: it understands the Assamese script, handles compound words, and covers the full character set. |
|
|
| - **32,000 tokens**: common words remain intact; rare words split into smaller subword pieces |
| - **Zero unknown tokens**: every Assamese character is recognized |
| - **Lossless roundtrip**: encoding and then decoding reproduces the original text |
| - **Assamese digits are tokenized individually**: `২০২৪` is split into separate digits rather than merged |
|
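The roundtrip and unknown-token properties above can be spot-checked on your own text. The helpers below are a hedged sketch: `check_roundtrip` and `contains_unknown` are hypothetical names, and the demo uses a stand-in code-point codec so the block runs without the model file.

```python
from typing import Callable, Iterable, List


def check_roundtrip(
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    samples: Iterable[str],
) -> List[str]:
    """Return the samples that do NOT survive encode -> decode unchanged."""
    failures = []
    for text in samples:
        if decode(encode(text)) != text:
            failures.append(text)
    return failures


def contains_unknown(ids: List[int], unk_id: int) -> bool:
    """True if the encoded IDs include the unknown-token ID."""
    return unk_id in ids


# Stand-in codec (Unicode code points) so the sketch runs on its own.
samples = ["অসম এখন ধুনীয়া ৰাজ্য।", "২০২৪"]
failures = check_roundtrip(
    lambda s: [ord(c) for c in s],
    lambda ids: "".join(chr(i) for i in ids),
    samples,
)
print(failures)  # []
```

With the real tokenizer loaded as `sp`, the same check would presumably be `check_roundtrip(sp.EncodeAsIds, sp.DecodeIds, samples)`, with `sp.unk_id()` passed to `contains_unknown`. Calling `sp.EncodeAsPieces("২০২৪")` is one way to inspect how digits are split.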
|
| ## Special tokens |
|
|
| These tokens are used for chat and instruction-following models: |
|
|
| `<|system|>` `<|user|>` `<|assistant|>` `<|endoftext|>` |
|
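As a rough illustration of how these markers might frame a conversation, the sketch below joins each message as role marker, content, end marker. This layout is an assumption for demonstration: the README does not define a chat template, and `format_chat` is a hypothetical helper.

```python
from typing import Dict, List

# Token strings taken from the list above.
ROLE_TOKENS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}
END_TOKEN = "<|endoftext|>"


def format_chat(messages: List[Dict[str, str]]) -> str:
    """Join messages as role marker + content + end marker.

    Assumption: this ordering is a hypothetical template, not one
    defined by the tokenizer itself.
    """
    parts = []
    for message in messages:
        parts.append(ROLE_TOKENS[message["role"]] + message["content"] + END_TOKEN)
    return "".join(parts)


example = format_chat([
    {"role": "user", "content": "অসম"},
    {"role": "assistant", "content": "অসম এখন ধুনীয়া ৰাজ্য।"},
])
print(example)
```

Whether each marker encodes to a single ID depends on how the tokenizer was trained; checking `sp.PieceToId("<|user|>")` and encoding a formatted string are reasonable ways to verify.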
|
| ## Training data |
|
|
| Trained on **12.5 million** Assamese sentences collected from public sources including news, books, Wikipedia, and web content. The data was cleaned, filtered for quality, and deduplicated. |
|
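The exact cleaning pipeline is not described here. As one possible sketch of the kind of steps named above, the code below normalizes whitespace and Unicode form, drops lines that are mostly not in the Bengali-Assamese Unicode block, and removes exact duplicates. The function names and the `min_script_ratio` threshold are illustrative assumptions, not the author's method.

```python
import unicodedata
from typing import Iterable, List


def is_bengali_assamese(ch: str) -> bool:
    # Assamese is written with the Unicode Bengali block (U+0980 to U+09FF).
    return "\u0980" <= ch <= "\u09ff"


def clean_and_dedupe(lines: Iterable[str], min_script_ratio: float = 0.5) -> List[str]:
    """Normalize, filter by script ratio, and drop exact duplicates."""
    seen = set()
    kept = []
    for line in lines:
        # NFC-normalize and collapse runs of whitespace to single spaces.
        text = " ".join(unicodedata.normalize("NFC", line).split())
        letters = [ch for ch in text if not ch.isspace()]
        if not letters:
            continue  # empty or whitespace-only line
        ratio = sum(is_bengali_assamese(ch) for ch in letters) / len(letters)
        if ratio < min_script_ratio:
            continue  # mostly non-Assamese script
        if text in seen:
            continue  # exact duplicate after normalization
        seen.add(text)
        kept.append(text)
    return kept


raw = ["অসম  এখন ধুনীয়া ৰাজ্য।", "অসম এখন ধুনীয়া ৰাজ্য।", "hello world", "   "]
print(len(clean_and_dedupe(raw)))  # 1
```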
|
| ## Usage |
|
|
| ```python |
| import sentencepiece as spm |
| |
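| # Load the trained tokenizer model |
| sp = spm.SentencePieceProcessor() |
| sp.Load("tokenizer.model") |
| |
| text = "অসম এখন ধুনীয়া ৰাজ্য।" |
| ids = sp.EncodeAsIds(text)        # text -> token IDs |
| pieces = sp.EncodeAsPieces(text)  # text -> subword strings |
| decoded = sp.DecodeIds(ids)       # token IDs -> text |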
| |
| print(f"Tokens: {len(pieces)}, IDs: {ids}") |
| print(f"Match: {decoded == text}") |
| ``` |
|
|
| Output: |
| ``` |
| Tokens: 5, IDs: [346, 344, 4628, 550, 282] |
| Match: True |
| ``` |
|
|
| ## Training an Assamese language model |
|
|
| The tokenizer is the foundation. Here is how it fits into a complete training pipeline: |
|
|
| **Step 1 — Tokenize your data** |
| ```python |
| import sentencepiece as spm |
| |
| sp = spm.SentencePieceProcessor() |
| sp.Load("tokenizer.model") |
| |
| # Read the corpus as UTF-8 text |
| with open("corpus.txt", "r", encoding="utf-8") as f: |
|     text = f.read() |
| |
| # Convert the text into a flat list of token IDs |
| ids = sp.EncodeAsIds(text) |
| ``` |
|
|
| **Step 2 — Train a model** |
| Feed the token IDs into a transformer architecture. The model learns to predict the next token in a sequence, which teaches it Assamese grammar and style. |
|
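Next-token prediction is usually set up by pairing each input window with the same window shifted by one token. The sketch below shows that idea in plain Python; `make_training_pairs` and `block_size` are hypothetical names, and a real pipeline would typically batch these into tensors.

```python
from typing import List, Tuple


def make_training_pairs(ids: List[int], block_size: int) -> List[Tuple[List[int], List[int]]]:
    """Cut a token stream into (input, target) windows, target shifted by one."""
    pairs = []
    for start in range(0, len(ids) - block_size, block_size):
        # Take block_size + 1 tokens so input and target both have block_size.
        chunk = ids[start:start + block_size + 1]
        pairs.append((chunk[:-1], chunk[1:]))
    return pairs


token_ids = [346, 344, 4628, 550, 282]  # IDs reused from the Usage output
for inputs, targets in make_training_pairs(token_ids, block_size=2):
    print(inputs, "->", targets)
# [346, 344] -> [344, 4628]
# [4628, 550] -> [550, 282]
```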
|
| **Step 3 — Generate text** |
| ```python |
| prompt = "অসম এখন" |
| prompt_ids = sp.EncodeAsIds(prompt) |
| |
| # The model predicts subsequent tokens one at a time |
| # generated_ids = model.generate(prompt_ids) |
| |
| # Convert the output back to Assamese |
| # generated_text = sp.DecodeIds(generated_ids) |
| ``` |
|
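The "one token at a time" loop can be sketched without a trained model. In the example below, `generate_greedy` and `dummy_next_token` are hypothetical: the stand-in function replays a fixed continuation, where a real model would return its most likely next token ID.

```python
from typing import Callable, List


def generate_greedy(
    next_token_fn: Callable[[List[int]], int],
    prompt_ids: List[int],
    max_new_tokens: int,
    eos_id: int,
) -> List[int]:
    """Append one predicted token at a time until eos_id or the length limit."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = next_token_fn(ids)
        if next_id == eos_id:
            break
        ids.append(next_id)
    return ids


# Stand-in "model": replays a fixed continuation, then emits the end ID (-1).
continuation = [4628, 550, 282]


def dummy_next_token(ids: List[int]) -> int:
    position = len(ids) - 2  # the prompt below has 2 tokens
    return continuation[position] if position < len(continuation) else -1


result = generate_greedy(dummy_next_token, [346, 344], max_new_tokens=10, eos_id=-1)
print(result)  # [346, 344, 4628, 550, 282]
```

The resulting IDs would then go through `sp.DecodeIds` to produce Assamese text.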
|
| The tokenizer stays the same throughout: the same model file is used for both training and inference. |
|
|
| ## Files |
|
|
| | File | Description | |
| |------|-------------| |
| | `tokenizer.model` | The trained tokenizer model | |
| | `tokenizer.vocab` | Vocabulary of 32,000 tokens with scores | |
| | `demo.py` | Example script demonstrating usage | |
|
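SentencePiece `.vocab` files are plain text with one `piece<TAB>score` entry per line, so `tokenizer.vocab` can be inspected without loading the model. The sketch below parses that layout; `parse_vocab` is a hypothetical helper, the sample lines are invented for illustration, and the line position is assumed to correspond to the token ID.

```python
from typing import Iterable, List, Tuple


def parse_vocab(lines: Iterable[str]) -> List[Tuple[str, float]]:
    """Parse 'piece<TAB>score' lines into (piece, score) tuples."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the last tab so the piece itself is kept intact.
        piece, score = line.rsplit("\t", 1)
        entries.append((piece, float(score)))
    return entries


# Invented sample lines; "▁" (U+2581) is SentencePiece's word-boundary marker.
sample = ["<unk>\t0\n", "▁অসম\t-5.25\n"]
print(parse_vocab(sample))
```

For the real file, the same function can be fed an open file handle: `with open("tokenizer.vocab", encoding="utf-8") as f: vocab = parse_vocab(f)`.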
|
| ## Author |
|
|
| **Anand Dey** |
|
|
| **Email:** ananddey.nic@gmail.com |
|
|
| ## License |
|
|
| MIT |
|
|