---
license: mit
language:
- as
tags:
- assamese
- tokenizer
- axomiya
- indic
---

# Assamese Tokenizer

অসমীয়া ভাষাৰ বাবে এটি টোকেনাইজাৰ।

A tokenizer for the **Assamese language** (অসমীয়া). It converts Assamese text into tokens, the smaller units that AI models can process and learn from.

## What is a tokenizer?

Computers and AI models process numerical data, not natural language. A tokenizer bridges this gap: it breaks sentences into smaller units called tokens and assigns each token a unique numeric identifier.

For example, **"অসম এখন ধুনীয়া ৰাজ্য।"** is split into 5 tokens:

`অসম` → `এখন` → `ধুনীয়া` → `ৰাজ্য` → `।`

Each token has a numeric ID. A language model trained on these IDs learns which tokens follow which, capturing grammar, style, and meaning.

## Why this tokenizer exists

Most tokenizers are designed for English or Hindi, and their Assamese support is limited and often inadequate. This tokenizer was built **from scratch** for the Assamese language. It understands the Assamese script, handles compound words, and covers the full character set.

- **32,000 tokens**: common words remain intact; rare words split naturally
- **Zero unknown tokens**: every Assamese character is recognized
- **Lossless roundtrip**: encoding and then decoding reproduces the original text
- **Assamese digits work individually**: `২০২৪` is split into separate digits rather than merged

## Special tokens

These tokens are used for chat and instruction-following models:

`<|system|>` `<|user|>` `<|assistant|>` `<|endoftext|>`

## Training data

Trained on **12.5 million** Assamese sentences collected from public sources including news, books, Wikipedia, and web content. The data was cleaned, filtered for quality, and deduplicated.

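
The actual cleaning pipeline is not published with this tokenizer. As a rough illustration of what "cleaned, filtered for quality, and deduplicated" can mean in practice, here is a minimal sketch. The function names `assamese_ratio` and `clean_corpus` and the thresholds `min_chars` and `min_ratio` are hypothetical choices for this example, not the values used to build the tokenizer.

```python
import unicodedata

# Bengali-Assamese Unicode block (U+0980 to U+09FF); Assamese letters
# such as ৰ and ৱ are encoded here.
ASSAMESE_START, ASSAMESE_END = 0x0980, 0x09FF


def assamese_ratio(sentence: str) -> float:
    """Fraction of non-space characters inside the Bengali-Assamese block."""
    chars = [c for c in sentence if not c.isspace()]
    if not chars:
        return 0.0
    in_block = sum(ASSAMESE_START <= ord(c) <= ASSAMESE_END for c in chars)
    return in_block / len(chars)


def clean_corpus(sentences, min_chars=10, min_ratio=0.7):
    """Normalize, filter, and deduplicate sentences.

    min_chars and min_ratio are illustrative thresholds (assumptions).
    The ratio is kept below 1.0 because punctuation such as the danda
    (U+0964) and ASCII digits lie outside the Bengali-Assamese block.
    """
    seen = set()
    kept = []
    for raw in sentences:
        # NFC normalization so visually identical strings compare equal.
        s = unicodedata.normalize("NFC", raw)
        # Collapse runs of whitespace and trim the ends.
        s = " ".join(s.split())
        if len(s) < min_chars or assamese_ratio(s) < min_ratio:
            continue
        if s in seen:  # exact-match deduplication
            continue
        seen.add(s)
        kept.append(s)
    return kept
```

Exact-match deduplication like this only catches identical sentences after normalization; near-duplicate detection would need a different technique and is not shown here.
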
## Usage

```python
import sentencepiece as spm

# Load the trained SentencePiece model shipped in this repository
sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")

text = "অসম এখন ধুনীয়া ৰাজ্য।"

ids = sp.EncodeAsIds(text)        # token IDs
pieces = sp.EncodeAsPieces(text)  # token strings
decoded = sp.DecodeIds(ids)       # back to text

print(f"Tokens: {len(pieces)}, IDs: {ids}")
print(f"Match: {decoded == text}")
```

Output:

```
Tokens: 5, IDs: [346, 344, 4628, 550, 282]
Match: True
```

## Training an Assamese language model

The tokenizer is the foundation. Here is how it fits into a complete training pipeline:

**Step 1: Tokenize your data**

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")

# Read the whole corpus as UTF-8 text
with open("corpus.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Convert the corpus into a flat list of token IDs
ids = sp.EncodeAsIds(text)
```

**Step 2: Train a model**

Feed the token IDs into a transformer architecture. The model learns to predict the next token in a sequence, which teaches it Assamese grammar and style.

**Step 3: Generate text**

```python
# `sp` is the processor loaded in Step 1
prompt = "অসম এখন"
prompt_ids = sp.EncodeAsIds(prompt)

# The model predicts subsequent tokens one at a time
# generated_ids = model.generate(prompt_ids)

# Convert the output back to Assamese
# generated_text = sp.DecodeIds(generated_ids)
```

The tokenizer stays the same throughout: the same model file is used for both training and inference.

## Files

| File | Description |
|------|-------------|
| `tokenizer.model` | The trained tokenizer model |
| `tokenizer.vocab` | Vocabulary of 32,000 tokens with scores |
| `demo.py` | Example script demonstrating usage |

## Author

**Anand Dey**

**Email: ananddey.nic@gmail.com**

## License

MIT