---
language:
- as
license: mit
library_name: transformers
tags:
- assamese
- tokenizer
- bpe
- llm
- nlp
pipeline_tag: text-generation
---

# Assamese Tokenizer

A high-performance custom BPE tokenizer built specifically for Assamese Large Language Models (LLMs).

This project was created as part of a larger effort to build a fully native Assamese AI ecosystem — including datasets, tokenization pipelines, and future GPT-style language models trained primarily on Assamese text.

---

# Why I Built This

Most existing multilingual tokenizers do not handle Assamese well.

Assamese is usually grouped together with Bengali or other Indic languages inside multilingual vocabularies. While this works at a basic level, it creates several problems:

- Poor subword segmentation
- Fragmented Assamese words
- Unnatural token boundaries
- Inefficient token compression
- Reduced language modeling quality
- Weak handling of Assamese morphology and suffix structures

Generic multilingual tokenizers are optimized for many languages simultaneously.
This tokenizer was built specifically for Assamese.

The goal is to:

- Preserve Assamese linguistic structure
- Improve token efficiency
- Reduce fragmentation
- Support large-scale Assamese language model training
- Create a tokenizer optimized for GPT-style autoregressive transformers
- Build foundational infrastructure for future Assamese AI systems

---

# Key Features

## 1. Custom Assamese BPE Vocabulary

This tokenizer uses Byte Pair Encoding (BPE) trained directly on Assamese text.

Features:

- Learns Assamese subwords automatically
- Captures common suffixes and morphemes
- Handles compound Assamese words efficiently
- Reduces vocabulary redundancy
- Improves token compression ratio

Vocabulary size:

```python
VOCAB_SIZE = 50_000
```
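
To make "learns Assamese subwords automatically" concrete, here is a toy pure-Python sketch of the BPE merge loop. It is an illustration only, not the training code behind this tokenizer; the function names `count_pairs`, `merge_pair`, and `learn_merges` are hypothetical, and the real trainer works at a far larger scale.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_merges(corpus, num_merges):
    """Greedily merge the most frequent adjacent pair, `num_merges` times."""
    # Start from whitespace-separated words split into single code points.
    words = Counter(tuple(word) for line in corpus for word in line.split())
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


# Toy corpus: the stem "ভাষা" appears alone and with the suffix "ৰ".
merges, words = learn_merges(["ভাষা ভাষা ভাষাৰ"], num_merges=3)
print(merges)
print(words)
```

On this toy corpus, three merges are enough for the shared stem to become a single symbol while the suffix stays separate, which is the kind of stem and suffix split a trained vocabulary tends to capture.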

---

## 2. SQLite Streaming Training Pipeline

One of the most important features of this project is the streaming training architecture.

Instead of:

- loading massive text files into RAM
- generating temporary files
- requiring huge memory usage

this tokenizer streams data directly from SQLite.

Benefits:

- Extremely memory efficient
- Scales to huge datasets
- Faster dataset management
- Easier preprocessing workflows
- Better handling of terabyte-scale corpora in the future

Streaming occurs in configurable batches:

```python
BATCH_SIZE = 50_000
```

This makes the tokenizer suitable for large Assamese corpus training.
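
A minimal sketch of batch streaming from SQLite, using only the standard library. The table name `documents` and the column name `text` are assumptions made for illustration; the actual schema used by this project is not documented here.

```python
import sqlite3

BATCH_SIZE = 50_000


def stream_batches(conn, batch_size=BATCH_SIZE, table="documents", column="text"):
    """Yield lists of text rows without loading the whole table into memory."""
    # `table` and `column` are hypothetical names. SQL identifiers cannot be
    # bound as parameters, so only pass trusted values here.
    cursor = conn.execute(f"SELECT {column} FROM {table}")
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        # Skip NULL or empty rows so the trainer only sees real text.
        yield [row[0] for row in rows if row[0]]


# Demonstration with an in-memory database and a tiny batch size.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (text TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?)",
    [("অসম",), ("ভাষা",), ("গান",)],
)
batches = list(stream_batches(conn, batch_size=2))
print([len(batch) for batch in batches])
```

Because the generator holds only one batch at a time, peak memory depends on `batch_size` rather than on the size of the corpus.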

---
## 2.1 Datasets Used to Build This Tokenizer

| Topic / Dataset                              | Tokens            | Approx. Scale     | Source |
|----------------------------------------------|------------------:|-------------------|--------|
| Poems Dataset                                | 92.6K             | 92.6 thousand     | Kaggle & Sosanko Sarmah (Contributor) |
| Song Lyrics Dataset                          | 4.5M              | 4.5 million       | Kaggle (Spotify API) |
| Story Dataset                                | 52.6B             | 52.6 billion      | HuggingFace Dataset |
| Crawled Data                                 | 7T                | 7 trillion        | Various Web Sources |
| CC-100 Dataset                               | 5.9M              | 5.9 million       | Common Crawl |
| Qwen3 Tokens                                 | 2B                | 2 billion         | Kaggle |
| Kaggle News Articles Dataset                 | 49.6B             | 49.6 billion      | Kaggle |
| IndicCorp v2 (AI4Bharat)                     | 37.8T             | 37.8 trillion     | AI4Bharat Dataset |
| Assamese Monolingual Corpus (MWire-Labs)     | 38.7B             | 38.7 billion      | MWire-Labs |
| DailyHunt Dataset                            | 184.2B            | 184.2 billion     | Rahular Varta Dataset |
| Wikipedia Dump (2019–2025)                   | 0.2T              | 200 billion       | Wikipedia |
| **Total**                                    | **45.3T**         | **45.333 trillion** | |

---
## 3. Unicode Normalization for Assamese

In Indic scripts, text that looks identical on screen can be encoded as different Unicode code point sequences.

This tokenizer applies NFC normalization:

```python
normalizers.NFC()
```

Benefits:

- Prevents token fragmentation
- Standardizes Unicode representations
- Improves vocabulary consistency
- Handles Assamese/Bengali script more reliably
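
The effect can be checked with Python's standard `unicodedata` module. This is a small sketch and not part of the tokenizer itself; the helper name `normalize_nfc` is hypothetical.

```python
import unicodedata


def normalize_nfc(text):
    """Return the NFC-normalized form of `text`."""
    return unicodedata.normalize("NFC", text)


# The vowel sign "ো" can be stored as one code point or as two.
composed = "\u09CB"          # BENGALI VOWEL SIGN O
decomposed = "\u09C7\u09BE"  # VOWEL SIGN E + VOWEL SIGN AA, renders the same

print(composed == decomposed)                 # raw strings differ
print(normalize_nfc(decomposed) == composed)  # NFC unifies them

# The letter "য়" also has two spellings: a single code point, or YA + NUKTA.
# NFC maps both to one canonical sequence, so they compare equal afterwards.
single = "\u09DF"
with_nukta = "\u09AF\u09BC"
print(normalize_nfc(single) == normalize_nfc(with_nukta))
```

Without this step, the two spellings of the same word would produce different token sequences and compete for separate vocabulary entries.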

---

## 4. GPT-Style Special Tokens

The tokenizer includes:

- `[UNK]`
- `[PAD]`
- `[BOS]`
- `[EOS]`

These are integrated using Hugging Face post-processing templates.

This design makes the tokenizer compatible with:

- Decoder-only transformers
- GPT-style training
- Autoregressive generation
- Custom Assamese language models

Example formatting:

```text
[BOS] Assamese sentence here [EOS]
```

---

## 5. Hugging Face Compatibility

The tokenizer is exported using:

```python
PreTrainedTokenizerFast
```

This allows direct compatibility with:

- Hugging Face Transformers
- PyTorch training pipelines
- Custom dataloaders
- AutoTokenizer
- Future Assamese LLM checkpoints

Load example:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./assamese_tokenizer")
```

---

## 6. Built-in Sanity Testing

The project includes automatic validation tests.

The test routine:

- Encodes Assamese sentences
- Decodes them back
- Verifies reconstruction integrity
- Displays token breakdowns

This ensures:

- Stable encoding
- Proper decoding
- Reliable tokenizer behavior
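
A round-trip check of this kind can be sketched as follows. This is not the project's actual test code: the helper `round_trip_report` is hypothetical, and a code-point-level stand-in codec is used so that the example runs without the trained tokenizer.

```python
import unicodedata


def round_trip_report(encode, decode, sentences):
    """Encode then decode each sentence and record whether the text survives."""
    results = []
    for sentence in sentences:
        # The tokenizer applies NFC, so compare against the normalized input.
        expected = unicodedata.normalize("NFC", sentence)
        ids = encode(sentence)
        restored = decode(ids)
        results.append(
            {"text": sentence, "num_tokens": len(ids), "ok": restored == expected}
        )
    return results


# Stand-in codec for demonstration only: one id per code point.
def toy_encode(text):
    return [ord(ch) for ch in unicodedata.normalize("NFC", text)]


def toy_decode(ids):
    return "".join(chr(i) for i in ids)


report = round_trip_report(toy_encode, toy_decode, ["অসমীয়া ভাষা এটি সুন্দৰ ভাষা।"])
for row in report:
    print(row["ok"], row["num_tokens"], row["text"])
```

With the real tokenizer loaded through `AutoTokenizer`, the two callables could be `lambda s: tokenizer.encode(s, add_special_tokens=False)` and `lambda ids: tokenizer.decode(ids, skip_special_tokens=True)`. Whether spacing around punctuation is restored exactly depends on the decoder configuration, so a strict equality check may need to be relaxed.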

---

# Technical Architecture

## Tokenizer Type

- Algorithm: BPE (Byte Pair Encoding)
- Model Style: GPT-style autoregressive LM
- Script: Assamese (Bengali-Assamese script)

---

## Pre-tokenization Strategy

This tokenizer intentionally uses simple whitespace pre-tokenization:

```python
pre_tokenizers.Whitespace()
```

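`Whitespace()` splits text on whitespace and also separates punctuation from word characters. The BPE model then learns subword merges automatically within each resulting piece.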

This avoids unnecessary early fragmentation while allowing the tokenizer to learn Assamese structure naturally from data.

---

# Training Pipeline Overview

```text
SQLite Database

Streaming Generator

Unicode Normalization

Whitespace Pre-tokenization

BPE Training

Special Token Processing

Hugging Face Export
```
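
The stages above could be wired together with the Hugging Face `tokenizers` library roughly as follows. This is a sketch under assumptions, not the project's training script: the function name `train_tokenizer` is hypothetical, the decoder configuration is left out because it is not described here, and the two-sentence corpus only stands in for the SQLite stream.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers
from tokenizers.processors import TemplateProcessing

SPECIAL_TOKENS = ["[UNK]", "[PAD]", "[BOS]", "[EOS]"]


def train_tokenizer(text_iterator, vocab_size=50_000):
    """Train a BPE tokenizer with NFC normalization and BOS/EOS wrapping."""
    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    tokenizer.normalizer = normalizers.NFC()
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=SPECIAL_TOKENS,
        show_progress=False,
    )
    # Accepts any iterator over strings (or over lists of strings),
    # so a streaming generator can be passed in directly.
    tokenizer.train_from_iterator(text_iterator, trainer=trainer)

    # Wrap every single sequence as: [BOS] ... [EOS]
    tokenizer.post_processor = TemplateProcessing(
        single="[BOS] $A [EOS]",
        special_tokens=[
            ("[BOS]", tokenizer.token_to_id("[BOS]")),
            ("[EOS]", tokenizer.token_to_id("[EOS]")),
        ],
    )
    return tokenizer


# Tiny stand-in corpus; a real run would stream batches from the database.
corpus = ["অসমীয়া ভাষা এটি সুন্দৰ ভাষা।", "অসম এখন সুন্দৰ ৰাজ্য।"]
tok = train_tokenizer(corpus, vocab_size=200)
enc = tok.encode("অসমীয়া ভাষা")
print(enc.tokens)
```

For export, the trained object can be wrapped with `PreTrainedTokenizerFast(tokenizer_object=tok, ...)` from `transformers` and saved with `save_pretrained`, which produces the `tokenizer.json` and `tokenizer_config.json` files.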

---

# Project Structure

```text
assamese_tokenizer/

├── tokenizer.json
├── tokenizer_config.json
└── README.md
```

---

# Generated Files

## tokenizer.json

Contains:

- Vocabulary
- Merge rules
- Decoder rules
- Normalization configuration
- Pre-tokenizer configuration
- Post-processing templates
- Special token definitions

This is the core tokenizer model.

---

## tokenizer_config.json

Contains:

- Hugging Face metadata
- Special token configuration
- Model max length
- Tokenizer settings

This enables easy loading with:

```python
AutoTokenizer.from_pretrained()
```

---

# Example Usage

## Loading the Tokenizer

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./assamese_tokenizer")
```

---

## Encoding Assamese Text

```python
text = "অসমীয়া ভাষা এটি সুন্দৰ ভাষা।"

encoded = tokenizer(text)
print(encoded["input_ids"])
```

---

## Decoding

```python
decoded = tokenizer.decode(encoded["input_ids"])
print(decoded)
```

---

# Design Philosophy

This tokenizer was not built as a generic multilingual tokenizer.

It was designed specifically for Assamese language modeling.

The focus is:

- linguistic preservation
- scalable infrastructure
- efficient training
- Assamese-first optimization
- future LLM compatibility

The long-term vision is to help build fully native Assamese AI systems.

---

# Future Goals

Planned improvements include:

- Training on larger Assamese corpora
- Improved punctuation handling
- Advanced normalization research
- Token compression benchmarking
- Comparison against multilingual tokenizers
- Native Assamese conversational models
- Public Hugging Face release
- Integration with custom Assamese GPT models

---

# Author

Ranjit Das

Developer focused on Assamese AI infrastructure, tokenization systems, large-scale datasets, and native language model development.

---

# License

This project is intended primarily for research and educational purposes. It is released under the MIT license and is free for all users, including for commercial use.