Upload folder using huggingface_hub

Files changed (6)

README.md +119 -0
corpus.txt +1 -0
demo-2.py +20 -0
demo.py +36 -0
tokenizer.model +3 -0
tokenizer.vocab +0 -0

README.md ADDED

	@@ -0,0 +1,119 @@

+---
+license: mit
+language:
+  - as
+tags:
+  - assamese
+  - tokenizer
+  - axomiya
+  - indic
+---
+# Assamese Tokenizer
+অসমীয়া ভাষাৰ বাবে এটি টোকেনাইজাৰ।
+A tokenizer for the **Assamese language** (অসমীয়া). It converts Assamese text into tokens, smaller units that AI models can process and learn from.
+## What is a tokenizer?
+Computers and AI models process numbers, not natural language. A tokenizer bridges this gap: it breaks text into smaller units called tokens and assigns each token a unique numeric ID.
+For example, **"অসম এখন ধুনীয়া ৰাজ্য।"** is split into 5 tokens (the closing danda `।` is a token of its own):
+`অসম` → `এখন` → `ধুনীয়া` → `ৰাজ্য` → `।`
+Each token has a numeric ID. A language model trained on these IDs learns which tokens follow which, capturing grammar, style, and meaning.
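To make the text-to-ID mapping concrete, here is a toy sketch in plain Python. It is not how this tokenizer works internally (the real vocabulary is learned with SentencePiece). `toy_tokenize` and `build_vocab` are hypothetical helpers that split on whitespace, treat the danda (।) as its own token, and number tokens in order of first appearance.

```python
import re


def toy_tokenize(text):
    # Toy rule (an assumption for illustration): runs of non-space,
    # non-danda characters are tokens, and each danda is its own token.
    return re.findall(r"[^\s।]+|।", text)


def build_vocab(tokens):
    # Assign IDs in order of first appearance: 0, 1, 2, ...
    vocab = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab


text = "অসম এখন ধুনীয়া ৰাজ্য।"
tokens = toy_tokenize(text)
vocab = build_vocab(tokens)
ids = [vocab[token] for token in tokens]

print(tokens)
print(ids)  # [0, 1, 2, 3, 4]
```

The real tokenizer assigns different IDs because its 32,000-entry vocabulary was fixed during training; only the idea of a stable token-to-ID lookup carries over.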
+## Why this tokenizer exists
+Most tokenizers are designed primarily for English or Hindi, so Assamese support is limited and often inadequate. This tokenizer was built **from scratch** for the Assamese language: it is tailored to the Assamese script, handles compound words, and covers the full character set.
+- **32,000-token vocabulary**: common words stay intact; rare words are split into smaller subword pieces
+- **Zero unknown tokens**: every Assamese character is recognized
+- **Lossless roundtrip**: encoding and then decoding reproduces the original text
+- **Assamese digits are tokenized individually**: `২০২৪` is split into separate digits rather than merged
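The roundtrip property in the list above can be spot-checked with a small helper. This is a minimal sketch: `check_roundtrip` is a hypothetical function that accepts any encode/decode pair, and the UTF-8 byte codec below is only a stand-in so the example runs without the model file.

```python
def check_roundtrip(encode, decode, texts):
    # Return the inputs that do NOT survive encode -> decode unchanged.
    failures = []
    for text in texts:
        if decode(encode(text)) != text:
            failures.append(text)
    return failures


# Stand-in codec (assumption for illustration): UTF-8 bytes as "token IDs".
def toy_encode(text):
    return list(text.encode("utf-8"))


def toy_decode(ids):
    return bytes(ids).decode("utf-8")


failures = check_roundtrip(toy_encode, toy_decode, ["অসম", "২০২৪", "Hello"])
print(failures)  # []
```

With the tokenizer loaded as `sp`, the same check would be `check_roundtrip(sp.EncodeAsIds, sp.DecodeIds, texts)`. Inputs with irregular whitespace may come back normalized, depending on how the model was trained, so it is worth including such cases in `texts`.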
+## Special tokens
+These tokens are used for chat and instruction-following models:
+`<|system|>` `<|user|>` `<|assistant|>` `<|endoftext|>`
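A minimal sketch of how these special tokens could frame a conversation before encoding. The role-token-then-content layout follows the chat sample in `demo.py`; appending `<|endoftext|>` at the end is an assumption, and `format_chat` is a hypothetical helper, not part of this repository.

```python
ROLE_TOKENS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}
END_TOKEN = "<|endoftext|>"


def format_chat(messages, add_end=True):
    # messages: list of (role, content) pairs in conversation order.
    # Each turn becomes its role token immediately followed by the content.
    text = "".join(ROLE_TOKENS[role] + content for role, content in messages)
    # Assumption: the end token closes the whole conversation.
    return text + END_TOKEN if add_end else text


example = format_chat([("user", "কেনে আছা?"), ("assistant", "মই ভালে আছোঁ।")])
print(example)
```

The resulting string can be passed to `sp.EncodeAsIds` like any other text. Whatever layout is chosen should be identical at training and inference time.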
+## Training data
+Trained on **12.5 million** Assamese sentences collected from public sources including news, books, Wikipedia, and web content. The data was cleaned, filtered for quality, and deduplicated.
+## Usage
+```python
+import sentencepiece as spm
+sp = spm.SentencePieceProcessor()
+sp.Load("tokenizer.model")
+text = "অসম এখন ধুনীয়া ৰাজ্য।"
+ids = sp.EncodeAsIds(text)
+pieces = sp.EncodeAsPieces(text)
+decoded = sp.DecodeIds(ids)
+print(f"Tokens: {len(pieces)}, IDs: {ids}")
+print(f"Match: {decoded == text}")
+```
+Output:
+```
+Tokens: 5, IDs: [346, 344, 4628, 550, 282]
+Match: True
+```
+## Training an Assamese language model
+The tokenizer is the foundation. Here is how it fits into a complete training pipeline:
+**Step 1 — Tokenize your data**
+```python
+import sentencepiece as spm
+sp = spm.SentencePieceProcessor()
+sp.Load("tokenizer.model")
+with open("corpus.txt", "r", encoding="utf-8") as f:
+    text = f.read()
+ids = sp.EncodeAsIds(text)
+```
+**Step 2 — Train a model**
+Feed the token IDs into a transformer architecture. The model learns to predict the next token in a sequence, which teaches it Assamese grammar and style.
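As a rough illustration of what "predict the next token" means for the training data, the sketch below slices a list of token IDs into (input, target) windows where each target is the input shifted by one position. `next_token_examples` and `context_size` are hypothetical names; a real pipeline would batch these windows as tensors for the chosen framework.

```python
def next_token_examples(ids, context_size):
    # Build (inputs, targets) pairs; targets are inputs shifted left by one,
    # so position i of targets is the token that follows position i of inputs.
    examples = []
    for start in range(len(ids) - context_size):
        inputs = ids[start:start + context_size]
        targets = ids[start + 1:start + context_size + 1]
        examples.append((inputs, targets))
    return examples


# IDs taken from the Usage output in this README.
ids = [346, 344, 4628, 550, 282]
examples = next_token_examples(ids, context_size=3)
for inputs, targets in examples:
    print(inputs, "->", targets)
# [346, 344, 4628] -> [344, 4628, 550]
# [344, 4628, 550] -> [4628, 550, 282]
```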
+**Step 3 — Generate text**
+```python
+prompt = "অসম এখন"
+prompt_ids = sp.EncodeAsIds(prompt)
+# The model predicts subsequent tokens one at a time
+# generated_ids = model.generate(prompt_ids)
+# Convert the output back to Assamese
+# generated_text = sp.DecodeIds(generated_ids)
+```
+The tokenizer remains the same throughout — it is used for both training and inference.
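The commented-out `model.generate` call in Step 3 stands for a loop like the one sketched here. Everything in it is an assumption for illustration: `generate` is a hypothetical helper, and `toy_next_token` is a lookup table standing in for a trained model's prediction.

```python
def generate(prompt_ids, next_token_fn, max_new_tokens, eos_id=None):
    # Greedy loop: ask for one next token at a time and append it,
    # stopping at the end-of-sequence ID or when no prediction is available.
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = next_token_fn(ids)
        if next_id is None or next_id == eos_id:
            break
        ids.append(next_id)
    return ids


# Stand-in "model": maps the last token ID to a fixed next token ID.
toy_table = {346: 344, 344: 4628, 4628: 550, 550: 282}


def toy_next_token(ids):
    return toy_table.get(ids[-1])


generated_ids = generate([346, 344], toy_next_token, max_new_tokens=10)
print(generated_ids)  # [346, 344, 4628, 550, 282]
```

With a real model, `generated_ids` would then go through `sp.DecodeIds` to get Assamese text back.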
+## Files
+| File | Description |
+|------|-------------|
+| `tokenizer.model` | The trained tokenizer model |
+| `tokenizer.vocab` | Vocabulary of 32,000 tokens with scores |
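+| `demo.py` | Example script demonstrating usage |
+| `demo-2.py` | Example script that tokenizes `corpus.txt` and prints token statistics |
+| `corpus.txt` | Short Assamese sample text used by `demo-2.py` |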
+## Author
+**Anand Dey**
+**Email:** ananddey.nic@gmail.com
+## License
+MIT

corpus.txt ADDED

	@@ -0,0 +1 @@

+ অসম ভাৰতৰ উত্তৰ-পূৰ্বাঞ্চলৰ এখন গুৰুত্বপূর্ণ ৰাজ্য। ইয়াৰ ৰাজধানী দিছপুৰ আৰু বৃহত্তম চহৰ গুৱাহাটী। ব্ৰহ্মপুত্ৰ নদী অসমৰ মাজেৰে বৈ গৈ ৰাজ্যখনৰ কৃষি, সংস্কৃতি আৰু অৰ্থনীতিৰ ওপৰত গভীৰ প্ৰভাৱ পেলাইছে।

demo-2.py ADDED

	@@ -0,0 +1,20 @@

+import os
+import sys
+import sentencepiece as spm
+sys.stdout.reconfigure(encoding="utf-8")
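+# Resolve files relative to this script (named base_dir to avoid shadowing the built-in dir()).
+base_dir = os.path.dirname(__file__) or "."
+sp = spm.SentencePieceProcessor()
+sp.Load(os.path.join(base_dir, "tokenizer.model"))
+with open(os.path.join(base_dir, "corpus.txt"), "r", encoding="utf-8") as f: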
+    text = f.read()
+ids = sp.EncodeAsIds(text)
+print(f"Total characters: {len(text):,}")
+print(f"Total tokens: {len(ids):,}")
+print(f"Unique tokens: {len(set(ids)):,}")
+print(f"IDs: {ids}")
+print(f"Has unknown tokens: {sp.unk_id() in ids}")
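The statistics printed by `demo-2.py` can be folded into one reusable helper that also reports characters per token, a common rough measure of how compactly a tokenizer represents text. This is a sketch: `token_stats` is a hypothetical function, and the `unk_id=0` default is an assumption that should be replaced with `sp.unk_id()` when the real model is loaded.

```python
def token_stats(text, ids, unk_id=0):
    # Summarize a tokenization: sizes, vocabulary usage, and unknown count.
    n_tokens = len(ids)
    return {
        "characters": len(text),
        "tokens": n_tokens,
        "unique_tokens": len(set(ids)),
        # Higher means fewer tokens are needed for the same text.
        "chars_per_token": len(text) / n_tokens if n_tokens else 0.0,
        "unknown_count": ids.count(unk_id),
    }


# Made-up IDs purely to exercise the function.
stats = token_stats("abcdefghij", [5, 7, 7, 0], unk_id=0)
print(stats)
```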

demo.py ADDED

	@@ -0,0 +1,36 @@

+"""
+Demo: Using the Assamese Unigram tokenizer.
+Run:  cd huggingface && python demo.py
+"""
+import os
+import sys
+import sentencepiece as spm
+sys.stdout.reconfigure(encoding="utf-8")
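+# Resolve files relative to this script (named base_dir to avoid shadowing the built-in dir()).
+base_dir = os.path.dirname(__file__) or "."
+sp = spm.SentencePieceProcessor()
+sp.Load(os.path.join(base_dir, "tokenizer.model"))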
+sentences = [
+    "অসম ভাৰতৰ উত্তৰ-পূৱ অঞ্চলৰ এখন ৰাজ্য।",
+    "২০২৪ চনত অসমৰ জনসংখ্যা প্ৰায় ৩.৫ কোটি।",
+    "<|user|>কেনে আছা?<|assistant|>মই ভালে আছোঁ।",
+    "Hello, how are you?",
+]
+for text in sentences:
+    ids = sp.EncodeAsIds(text)
+    pieces = sp.EncodeAsPieces(text)
+    decoded = sp.DecodeIds(ids)
+    roundtrip_ok = decoded == text
+    print(f"Input  : {text}")
+    print(f"Tokens : {len(pieces)}")
+    print(f"Pieces : {pieces}")
+    print(f"IDs    : {ids}")
+    print(f"Decoded: {decoded}")
+    print(f"Match  : {'Yes' if roundtrip_ok else 'No'}")
+    print()

tokenizer.model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:25667a360c140474df473373b41de68075903a7af7d533cac99e71734e679bd9
+size 1137327

tokenizer.vocab ADDED

The diff for this file is too large to render. See raw diff