ananddey committed on
Commit ca96bd4 · verified · 1 Parent(s): 28f9b62

Upload folder using huggingface_hub

Files changed (6)
  1. README.md +119 -0
  2. corpus.txt +1 -0
  3. demo-2.py +20 -0
  4. demo.py +36 -0
  5. tokenizer.model +3 -0
  6. tokenizer.vocab +0 -0
README.md ADDED
@@ -0,0 +1,119 @@
+ ---
+ license: mit
+ language:
+ - as
+ tags:
+ - assamese
+ - tokenizer
+ - axomiya
+ - indic
+ ---
+
+ # Assamese Tokenizer
+
+ অসমীয়া ভাষাৰ বাবে এটি টোকেনাইজাৰ।
+
+ A tokenizer for the **Assamese language** (অসমীয়া). It converts Assamese text into tokens, smaller units that AI models can process and learn from.
+
+ ## What is a tokenizer?
+
+ Computers and AI models process numerical data, not natural language. A tokenizer bridges this gap: it breaks sentences into smaller units called tokens and assigns each token a unique numeric identifier.
+
+ For example, **"অসম এখন ধুনীয়া ৰাজ্য।"** is split into 5 tokens:
+
+ `অসম` → `এখন` → `ধুনীয়া` → `ৰাজ্য` → `।`
+
+ Each token has a numeric ID. A language model trained on these IDs learns which tokens tend to follow which, capturing grammar, style, and meaning.
+
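The text-to-ID mapping described above can be illustrated with a toy example. This is only a sketch and not the Unigram model shipped in this repository: it splits on whitespace and the danda (।) and hands out IDs in order of first appearance. The names `toy_tokenize` and `build_vocab` are made up for the illustration.

```python
import re


def toy_tokenize(text):
    """Split on whitespace and treat the danda (।) as its own token."""
    return re.findall(r"[^\s।]+|।", text)


def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab


text = "অসম এখন ধুনীয়া ৰাজ্য।"
tokens = toy_tokenize(text)
vocab = build_vocab(tokens)
ids = [vocab[token] for token in tokens]

print(len(tokens), ids)
```

A trained subword tokenizer differs in two ways: its vocabulary is learned from a corpus rather than built on the fly, and words it has not seen are split into smaller known pieces instead of receiving a new ID.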
+ ## Why this tokenizer exists
+
+ Most tokenizers are designed for English or Hindi. Assamese support is limited and often inadequate. This tokenizer was built **from scratch** for the Assamese language: it handles the Assamese script and compound words, and it covers the full character set.
+
+ - **32,000 tokens**: common words remain intact; rare words split into smaller pieces
+ - **Zero unknown tokens**: every Assamese character is recognized
+ - **Lossless roundtrip**: encoding and then decoding reproduces the original text
+ - **Assamese digits are tokenized individually**: `২০২৪` is split into separate digits rather than merged
+
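The unknown-token and roundtrip properties in the list above can be checked mechanically. The helper below is a sketch: `check_tokenizer` is a hypothetical name, and it takes the encode and decode functions as arguments so it runs without the model file. The `toy_encode` and `toy_decode` stand-ins (one ID per code point) exist only to make the example executable.

```python
def check_tokenizer(encode, decode, unk_id, texts):
    """Return one report per text: token count, unknown flag, roundtrip flag."""
    reports = []
    for text in texts:
        ids = encode(text)
        reports.append({
            "text": text,
            "num_tokens": len(ids),
            "has_unknown": unk_id in ids,
            "roundtrip_ok": decode(ids) == text,
        })
    return reports


# Stand-in codec (an assumption for this sketch): one ID per Unicode code point.
def toy_encode(text):
    return [ord(ch) for ch in text]


def toy_decode(ids):
    return "".join(chr(i) for i in ids)


reports = check_tokenizer(toy_encode, toy_decode, unk_id=0, texts=["অসম", "২০২৪"])
for report in reports:
    print(report["num_tokens"], report["has_unknown"], report["roundtrip_ok"])
```

With the real tokenizer, the same helper would be called with `sp.EncodeAsIds`, `sp.DecodeIds`, and `sp.unk_id()`, the calls used in the demo scripts in this commit.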
+ ## Special tokens
+
+ These tokens are used for chat and instruction-following models:
+
+ `<|system|>` `<|user|>` `<|assistant|>` `<|endoftext|>`
+
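The README lists the special tokens but does not define a chat template. The sketch below assumes the layout seen in the `demo.py` sample sentence, where each role token is followed directly by the message text. Closing a finished conversation with `<|endoftext|>` is an additional assumption, and `format_chat` is a hypothetical helper name.

```python
ROLE_TOKENS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}
END_OF_TEXT = "<|endoftext|>"


def format_chat(messages, add_generation_prompt=False):
    """Concatenate (role, content) pairs into a single prompt string.

    Assumed layout: each turn is its role token followed directly by the text.
    """
    parts = [ROLE_TOKENS[role] + content for role, content in messages]
    if add_generation_prompt:
        # Leave the assistant turn open so a model can continue from here.
        parts.append(ROLE_TOKENS["assistant"])
    else:
        # Assumption: a finished conversation is closed with the end-of-text token.
        parts.append(END_OF_TEXT)
    return "".join(parts)


prompt = format_chat([("user", "কেনে আছা?")], add_generation_prompt=True)
print(prompt)
```

Whatever template is chosen, it has to be identical at training and inference time, since the model only learns the role boundaries it has actually seen.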
+ ## Training data
+
+ Trained on **12.5 million** Assamese sentences collected from public sources including news, books, Wikipedia, and web content. The data was cleaned, filtered for quality, and deduplicated.
+
+ ## Usage
+
+ ```python
+ import sentencepiece as spm
+
+ sp = spm.SentencePieceProcessor()
+ sp.Load("tokenizer.model")
+
+ text = "অসম এখন ধুনীয়া ৰাজ্য।"
+ ids = sp.EncodeAsIds(text)
+ pieces = sp.EncodeAsPieces(text)
+ decoded = sp.DecodeIds(ids)
+
+ print(f"Tokens: {len(pieces)}, IDs: {ids}")
+ print(f"Match: {decoded == text}")
+ ```
+
+ Output:
+ ```
+ Tokens: 5, IDs: [346, 344, 4628, 550, 282]
+ Match: True
+ ```
+
+ ## Training an Assamese language model
+
+ The tokenizer is the foundation. Here is how it fits into a complete training pipeline:
+
+ **Step 1: Tokenize your data**
+ ```python
+ import sentencepiece as spm
+
+ sp = spm.SentencePieceProcessor()
+ sp.Load("tokenizer.model")
+
+ with open("corpus.txt", "r", encoding="utf-8") as f:
+     text = f.read()
+
+ ids = sp.EncodeAsIds(text)
+ ```
+
+ **Step 2: Train a model**
+ Feed the token IDs into a transformer architecture. The model learns to predict the next token in a sequence, which teaches it Assamese grammar and style.
+
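Next-token prediction in Step 2 needs the flat ID list cut into input and target windows, where the target is the input shifted by one position. The sketch below shows one common way to do that in plain Python. The function name `make_training_pairs`, the `block_size` parameter, and the placeholder IDs are assumptions for illustration, not part of this repository.

```python
def make_training_pairs(ids, block_size):
    """Cut a token-ID list into (input, target) windows for next-token prediction.

    Each target window is its input window shifted left by one position.
    Trailing IDs that do not fill a complete window are dropped.
    """
    pairs = []
    for start in range(0, len(ids) - block_size, block_size):
        inputs = ids[start : start + block_size]
        targets = ids[start + 1 : start + block_size + 1]
        pairs.append((inputs, targets))
    return pairs


# Placeholder IDs; in practice this would be the list produced in Step 1.
ids = list(range(10))
pairs = make_training_pairs(ids, block_size=4)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

A real pipeline would turn these windows into batched tensors for a framework of choice, but the shift-by-one relationship between input and target stays the same.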
+ **Step 3: Generate text**
+ ```python
+ prompt = "অসম এখন"
+ prompt_ids = sp.EncodeAsIds(prompt)
+
+ # The model predicts subsequent tokens one at a time
+ # generated_ids = model.generate(prompt_ids)
+
+ # Convert the output back to Assamese
+ # generated_text = sp.DecodeIds(generated_ids)
+ ```
+
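The "one token at a time" loop that the commented-out `model.generate` call stands for can be sketched as greedy decoding. Everything here is illustrative: `greedy_generate` and `next_token_scores` are hypothetical names, and `toy_scores` is a stand-in for a trained model so the example runs on its own.

```python
def greedy_generate(next_token_scores, prompt_ids, max_new_tokens, eos_id=None):
    """Extend prompt_ids one token at a time, always taking the top-scoring token.

    next_token_scores is a hypothetical callable: it receives the IDs so far
    and returns one score per vocabulary entry for the next position.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = next_token_scores(ids)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        ids.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break  # stop once the end-of-sequence token is produced
    return ids


# Stand-in "model" over a 5-token vocabulary: prefers (last ID + 1) mod 5.
def toy_scores(ids):
    scores = [0.0] * 5
    scores[(ids[-1] + 1) % 5] = 1.0
    return scores


generated_ids = greedy_generate(toy_scores, prompt_ids=[1, 2], max_new_tokens=4, eos_id=0)
print(generated_ids)
```

With a real model, the resulting ID list would be passed to `sp.DecodeIds` to get Assamese text back. Sampling strategies such as temperature or top-k replace only the `max` step.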
+ The tokenizer stays the same throughout: the same model file is used for both training and inference.
+
+ ## Files
+
+ | File | Description |
+ |------|-------------|
+ | `tokenizer.model` | The trained tokenizer model |
+ | `tokenizer.vocab` | Vocabulary of 32,000 tokens with scores |
+ | `demo.py` | Example script demonstrating usage |
+
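The vocabulary file can be inspected without loading the model. The sketch below assumes the usual SentencePiece `.vocab` layout of one `piece<TAB>score` entry per line, listed in ID order. The sample lines are invented for the example and are not taken from the actual `tokenizer.vocab`; `parse_vocab_lines` is a hypothetical helper name.

```python
def parse_vocab_lines(lines):
    """Parse 'piece<TAB>score' lines into (piece, score) tuples.

    Assumes entries are listed in ID order, so the list index is the token ID.
    """
    vocab = []
    for line in lines:
        # Split on the last tab so a piece containing a tab would still parse.
        piece, _, score = line.rstrip("\n").rpartition("\t")
        vocab.append((piece, float(score)))
    return vocab


# Invented sample lines for illustration only.
sample = ["<unk>\t0\n", "<s>\t0\n", "</s>\t0\n", "\u2581অসম\t-7.5\n"]
vocab = parse_vocab_lines(sample)
print(len(vocab), vocab[3][1])
```

For the real file, the same function accepts an open file handle, for example one opened with `encoding="utf-8"`. The `\u2581` character (▁) is how SentencePiece marks a piece that begins a word.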
+ ## Author
+
+ **Anand Dey**
+ **Email:** ananddey.nic@gmail.com
+
+ ## License
+
+ MIT
corpus.txt ADDED
@@ -0,0 +1 @@
+ অসম ভাৰতৰ উত্তৰ-পূৰ্বাঞ্চলৰ এখন গুৰুত্বপূর্ণ ৰাজ্য। ইয়াৰ ৰাজধানী দিছপুৰ আৰু বৃহত্তম চহৰ গুৱাহাটী। ব্ৰহ্মপুত্ৰ নদী অসমৰ মাজেৰে বৈ গৈ ৰাজ্যখনৰ কৃষি, সংস্কৃতি আৰু অৰ্থনীতিৰ ওপৰত গভীৰ প্ৰভাৱ পেলাইছে।
demo-2.py ADDED
@@ -0,0 +1,20 @@
+ import os
+ import sys
+ import sentencepiece as spm
+
+ # Force UTF-8 output so Assamese text prints correctly on any console
+ sys.stdout.reconfigure(encoding="utf-8")
+
+ # Resolve files relative to this script (name avoids shadowing the built-in dir())
+ base_dir = os.path.dirname(__file__) or "."
+ sp = spm.SentencePieceProcessor()
+ sp.Load(os.path.join(base_dir, "tokenizer.model"))
+
+ with open(os.path.join(base_dir, "corpus.txt"), "r", encoding="utf-8") as f:
+     text = f.read()
+
+ ids = sp.EncodeAsIds(text)
+
+ print(f"Total characters: {len(text):,}")
+ print(f"Total tokens: {len(ids):,}")
+ print(f"Unique tokens: {len(set(ids)):,}")
+ print(f"IDs: {ids}")
+ print(f"Has unknown tokens: {sp.unk_id() in ids}")
demo.py ADDED
@@ -0,0 +1,36 @@
+ """
+ Demo: Using the Assamese Unigram tokenizer.
+
+ Run: cd huggingface && python demo.py
+ """
+
+ import os
+ import sys
+ import sentencepiece as spm
+
+ # Force UTF-8 output so Assamese text prints correctly on any console
+ sys.stdout.reconfigure(encoding="utf-8")
+
+ # Resolve the model relative to this script (name avoids shadowing the built-in dir())
+ base_dir = os.path.dirname(__file__) or "."
+ sp = spm.SentencePieceProcessor()
+ sp.Load(os.path.join(base_dir, "tokenizer.model"))
+
+ # Plain Assamese, Assamese digits, chat special tokens, and English
+ sentences = [
+     "অসম ভাৰতৰ উত্তৰ-পূৱ অঞ্চলৰ এখন ৰাজ্য।",
+     "২০২৪ চনত অসমৰ জনসংখ্যা প্ৰায় ৩.৫ কোটি।",
+     "<|user|>কেনে আছা?<|assistant|>মই ভালে আছোঁ।",
+     "Hello, how are you?",
+ ]
+
+ for text in sentences:
+     ids = sp.EncodeAsIds(text)
+     pieces = sp.EncodeAsPieces(text)
+     decoded = sp.DecodeIds(ids)
+     roundtrip_ok = decoded == text
+
+     print(f"Input : {text}")
+     print(f"Tokens : {len(pieces)}")
+     print(f"Pieces : {pieces}")
+     print(f"IDs : {ids}")
+     print(f"Decoded: {decoded}")
+     print(f"Match : {'Yes' if roundtrip_ok else 'No'}")
+     print()
tokenizer.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:25667a360c140474df473373b41de68075903a7af7d533cac99e71734e679bd9
+ size 1137327
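The three lines above are a Git LFS pointer, not the model itself: `oid` is the SHA-256 digest of the real file and `size` is its length in bytes. The sketch below shows how downloaded bytes could be checked against such a pointer. `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, and the example only parses the pointer text since the model bytes are not part of this page.

```python
import hashlib


def parse_lfs_pointer(pointer_text):
    """Parse 'key value' lines of a Git LFS pointer into a small dict."""
    fields = dict(line.split(" ", 1) for line in pointer_text.strip().splitlines())
    algo, _, digest = fields["oid"].partition(":")
    return {"algo": algo, "digest": digest, "size": int(fields["size"])}


def matches_pointer(data, pointer):
    """Return True if the bytes have the size and digest recorded in the pointer."""
    if len(data) != pointer["size"]:
        return False
    return hashlib.new(pointer["algo"], data).hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:25667a360c140474df473373b41de68075903a7af7d533cac99e71734e679bd9\n"
    "size 1137327\n"
)
print(pointer["algo"], pointer["size"])
```

After downloading, reading `tokenizer.model` in binary mode and passing the bytes to `matches_pointer` would confirm that the file is the actual model rather than a pointer stub.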
tokenizer.vocab ADDED
The diff for this file is too large to render. See raw diff