---
language:
- as
license: mit
library_name: transformers
tags:
- assamese
- tokenizer
- bpe
- llm
- nlp
pipeline_tag: text-generation
---

# Assamese Tokenizer

A high-performance custom BPE tokenizer built specifically for Assamese Large Language Models (LLMs).

This project was created as part of a larger effort to build a fully native Assamese AI ecosystem — including datasets, tokenization pipelines, and future GPT-style language models trained primarily on Assamese text.

---

# Why I Built This

Most existing multilingual tokenizers do not properly handle Assamese.

Assamese is usually grouped together with Bengali or other Indic languages inside multilingual vocabularies. While this works at a basic level, it creates several problems:

- Poor subword segmentation
- Fragmented Assamese words
- Unnatural token boundaries
- Inefficient token compression
- Reduced language modeling quality
- Weak handling of Assamese morphology and suffix structures

Generic multilingual tokenizers are optimized for many languages simultaneously.
This tokenizer was built specifically for Assamese.

The goal is to:

- Preserve Assamese linguistic structure
- Improve token efficiency
- Reduce fragmentation
- Support large-scale Assamese language model training
- Create a tokenizer optimized for GPT-style autoregressive transformers
- Build foundational infrastructure for future Assamese AI systems

---

# Key Features

## 1. Custom Assamese BPE Vocabulary

This tokenizer uses Byte Pair Encoding (BPE) trained directly on Assamese text.

Features:

- Learns Assamese subwords automatically
- Captures common suffixes and morphemes
- Handles compound Assamese words efficiently
- Reduces vocabulary redundancy
- Improves token compression ratio

Vocabulary size:

```python
VOCAB_SIZE = 50_000
```

---

## 2. SQLite Streaming Training Pipeline

One of the most important features of this project is the streaming training architecture.

Instead of:

- loading massive text files into RAM
- generating temporary files
- requiring huge memory usage

this tokenizer streams data directly from SQLite.

Benefits:

- Extremely memory efficient
- Scales to huge datasets
- Faster dataset management
- Easier preprocessing workflows
- Better handling of terabyte-scale corpora in the future

Streaming occurs in configurable batches:

```python
BATCH_SIZE = 50_000
```

This makes the tokenizer suitable for large Assamese corpus training.
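The streaming idea can be sketched with the standard-library `sqlite3` module. This is a minimal illustration, not the project's actual code: the table name `corpus`, the column name `text`, and the helper `stream_batches` are assumptions made for the example, and an in-memory database stands in for the real corpus file.

```python
import sqlite3
from typing import Iterator, List

BATCH_SIZE = 50_000  # batch size quoted in this README


def stream_batches(conn: sqlite3.Connection, batch_size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield lists of text rows without loading the whole table into memory."""
    # Hypothetical schema: a table `corpus` with one `text` column.
    cursor = conn.execute("SELECT text FROM corpus")
    while True:
        rows = cursor.fetchmany(batch_size)  # pulls at most `batch_size` rows
        if not rows:
            break
        # Skip NULL or empty rows so the trainer only sees real text.
        yield [row[0] for row in rows if row[0]]


# Tiny in-memory database standing in for the real corpus.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE corpus (text TEXT)")
conn.executemany(
    "INSERT INTO corpus (text) VALUES (?)",
    [("অসমীয়া ভাষা",), ("সুন্দৰ ভাষা",), ("অসম",)],
)

batches = list(stream_batches(conn, batch_size=2))
print([len(batch) for batch in batches])  # [2, 1]
```

Because the generator yields one batch at a time, peak memory depends on `batch_size` rather than on the size of the database.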

---
## 2.1 Datasets Used to Build This Tokenizer

| Dataset                                      | Tokens            | Approx. Scale       | Source |
|----------------------------------------------|------------------:|---------------------|--------|
| Poems Dataset                                | 92.6K             | 92.6 Thousand       | Kaggle & Sosanko Sarmah (Contributor) |
| Song Lyrics Dataset                          | 4.5M              | 4.5 Million         | Kaggle (Spotify API) |
| Story Dataset                                | 52.6B             | 52.6 Billion        | HuggingFace Dataset |
| Crawled Data                                 | 7T                | 7 Trillion          | Various Web Sources |
| CC-100 Dataset                               | 5.9M              | 5.9 Million         | Common Crawl |
| Qwen3 Tokens                                 | 2B                | 2 Billion           | Kaggle |
| Kaggle News Articles Dataset                 | 49.6B             | 49.6 Billion        | Kaggle |
| IndicCorp v2 (AI4Bharat)                     | 37.8T             | 37.8 Trillion       | AI4Bharat Dataset |
| Assamese Monolingual Corpus (MWire-Labs)     | 38.7B             | 38.7 Billion        | MWire-Labs |
| DailyHunt Dataset                            | 184.2B            | 184.2 Billion       | Rahular Varta Dataset |
| Wikipedia Dump (2019–2025)                   | 0.2T              | 200 Billion         | Wikipedia |
| **Total**                                    | **45.3T**         | **45.333 Trillion** | |

---
## 3. Unicode Normalization for Assamese

Indic scripts often contain visually identical Unicode sequences represented differently internally.

This tokenizer applies NFC normalization:

```python
normalizers.NFC()
```

Benefits:

- Prevents token fragmentation
- Standardizes Unicode representations
- Improves vocabulary consistency
- Handles Assamese/Bengali script more reliably
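To make the problem concrete, here is a small sketch using Python's standard `unicodedata` module, which applies the same NFC algorithm. The sample strings are chosen for illustration only; they are not taken from the training data.

```python
import unicodedata

# Vowel sign O typed as two code points (U+09C7 + U+09BE) vs. precomposed U+09CB.
decomposed_o = "\u0995\u09C7\u09BE"   # ক + ে + া
precomposed_o = "\u0995\u09CB"        # ক + ো

# Letter YYA as a single code point (U+09DF) vs. YA + nukta (U+09AF U+09BC).
single_yya = "\u09DF"
ya_plus_nukta = "\u09AF\u09BC"


def nfc(text: str) -> str:
    """Return the NFC-normalized form of `text`."""
    return unicodedata.normalize("NFC", text)


print(decomposed_o == precomposed_o)            # False: raw strings differ
print(nfc(decomposed_o) == nfc(precomposed_o))  # True: identical after NFC
print(nfc(single_yya) == nfc(ya_plus_nukta))    # True: identical after NFC
```

Without normalization, each pair would look the same on screen but reach the BPE trainer as different byte sequences, so the same word could end up with several competing token splits.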

---

## 4. GPT-Style Special Tokens

The tokenizer includes:

- `[UNK]`
- `[PAD]`
- `[BOS]`
- `[EOS]`

These are integrated using Hugging Face post-processing templates.

This design makes the tokenizer compatible with:

- Decoder-only transformers
- GPT-style training
- Autoregressive generation
- Custom Assamese language models

Example formatting:

```text
[BOS] Assamese sentence here [EOS]
```

---

## 5. Hugging Face Compatibility

The tokenizer is exported using:

```python
PreTrainedTokenizerFast
```

This allows direct compatibility with:

- Hugging Face Transformers
- PyTorch training pipelines
- Custom dataloaders
- AutoTokenizer
- Future Assamese LLM checkpoints

Load example:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./assamese_tokenizer")
```

---

## 6. Built-in Sanity Testing

The project includes automatic validation tests.

The tokenizer:

- Encodes Assamese sentences
- Decodes them back
- Verifies reconstruction integrity
- Displays token breakdowns

This ensures:

- Stable encoding
- Proper decoding
- Reliable tokenizer behavior

---

# Technical Architecture

## Tokenizer Type

- Algorithm: BPE (Byte Pair Encoding)
- Model Style: GPT-style autoregressive LM
- Script: Assamese (Bengali-Assamese script)

---

## Pre-tokenization Strategy

This tokenizer intentionally uses the simple `Whitespace` pre-tokenizer, which splits text on whitespace and also separates punctuation from word characters:

```python
pre_tokenizers.Whitespace()
```

The BPE model then learns subword merges automatically.

This avoids unnecessary early fragmentation while allowing the tokenizer to learn Assamese structure naturally from data.
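The effect of this pre-tokenizer can be inspected directly with the `tokenizers` library. The snippet below is only an illustration; the sample sentence is the one used later in this README, and the variable names are chosen for the example.

```python
from tokenizers import pre_tokenizers

text = "অসমীয়া ভাষা এটি সুন্দৰ ভাষা।"

# pre_tokenize_str returns (piece, (start, end)) pairs for a raw string.
pre_tokenizer = pre_tokenizers.Whitespace()
pieces = [piece for piece, _offsets in pre_tokenizer.pre_tokenize_str(text)]

print(pieces)
```

Each piece is a whole word or a run of punctuation such as the danda (`।`). BPE merges are learned only inside these pieces, never across them.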

---

# Training Pipeline Overview

```text
SQLite Database
      ↓
Streaming Generator
      ↓
Unicode Normalization
      ↓
Whitespace Pre-tokenization
      ↓
BPE Training
      ↓
Special Token Processing
      ↓
Hugging Face Export
```
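The stages above can be sketched end to end with the `tokenizers` library. This is a toy reconstruction under stated assumptions, not the project's training script: `corpus_batches` is a hypothetical stand-in for the SQLite streaming generator, the three sentences are made up for the example, and the vocabulary size is shrunk so the snippet runs in a moment.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, processors, trainers

SPECIAL_TOKENS = ["[UNK]", "[PAD]", "[BOS]", "[EOS]"]


def corpus_batches():
    """Hypothetical stand-in for the SQLite streaming generator."""
    yield ["অসমীয়া ভাষা এটি সুন্দৰ ভাষা।", "অসম এখন সুন্দৰ ৰাজ্য।"]
    yield ["অসমীয়া সাহিত্য আৰু সংস্কৃতি।"]


# BPE model with NFC normalization and whitespace pre-tokenization.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFC()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=200,  # toy value; this README uses 50_000
    special_tokens=SPECIAL_TOKENS,
    show_progress=False,
)
# train_from_iterator accepts an iterator of strings or of lists of strings.
tokenizer.train_from_iterator(corpus_batches(), trainer=trainer)

# Wrap every single sequence as "[BOS] ... [EOS]".
tokenizer.post_processor = processors.TemplateProcessing(
    single="[BOS] $A [EOS]",
    special_tokens=[
        ("[BOS]", tokenizer.token_to_id("[BOS]")),
        ("[EOS]", tokenizer.token_to_id("[EOS]")),
    ],
)

encoding = tokenizer.encode("অসমীয়া ভাষা")
print(encoding.tokens)
```

For the final export step, a trained `Tokenizer` like this one can be wrapped with `PreTrainedTokenizerFast(tokenizer_object=tokenizer, ...)` and written out with `save_pretrained`, which is presumably how the two files listed under Project Structure are produced.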

---

# Project Structure

```text
assamese_tokenizer/
│
├── tokenizer.json
├── tokenizer_config.json
└── README.md
```

---

# Generated Files

## tokenizer.json

Contains:

- Vocabulary
- Merge rules
- Decoder rules
- Normalization configuration
- Pre-tokenizer configuration
- Post-processing templates
- Special token definitions

This is the core tokenizer model.

---

## tokenizer_config.json

Contains:

- Hugging Face metadata
- Special token configuration
- Model max length
- Tokenizer settings

This enables easy loading with:

```python
AutoTokenizer.from_pretrained()
```

---

# Example Usage

## Loading the Tokenizer

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./assamese_tokenizer")
```

---

## Encoding Assamese Text

```python
text = "অসমীয়া ভাষা এটি সুন্দৰ ভাষা।"

encoded = tokenizer(text)
print(encoded["input_ids"])
```

---

## Decoding

```python
# skip_special_tokens=True drops [BOS]/[EOS] from the decoded string
decoded = tokenizer.decode(encoded["input_ids"], skip_special_tokens=True)
print(decoded)
```

---

# Design Philosophy

This tokenizer was not built as a generic multilingual tokenizer.

It was designed specifically for Assamese language modeling.

The focus is:

- linguistic preservation
- scalable infrastructure
- efficient training
- Assamese-first optimization
- future LLM compatibility

The long-term vision is to help build fully native Assamese AI systems.

---

# Future Goals

Planned improvements include:

- Larger Assamese corpora training
- Improved punctuation handling
- Advanced normalization research
- Token compression benchmarking
- Comparison against multilingual tokenizers
- Native Assamese conversational models
- Public Hugging Face release
- Integration with custom Assamese GPT models

---

# Author

Ranjit Das

Developer focused on Assamese AI infrastructure, tokenization systems, large-scale datasets, and native language model development.

---

# License

This project is released under the MIT license. It is intended for research and educational purposes and is free for all users, including for commercial use.