---
language:
- as
license: mit
library_name: transformers
tags:
- assamese
- tokenizer
- bpe
- llm
- nlp
pipeline_tag: text-generation
---
# Assamese Tokenizer
A high-performance custom BPE tokenizer built specifically for Assamese Large Language Models (LLMs).
This project was created as part of a larger effort to build a fully native Assamese AI ecosystem — including datasets, tokenization pipelines, and future GPT-style language models trained primarily on Assamese text.
---
# Why I Built This
Most existing multilingual tokenizers do not properly handle Assamese.
Assamese is usually grouped together with Bengali or other Indic languages inside multilingual vocabularies. While this works at a basic level, it creates several problems:
- Poor subword segmentation
- Fragmented Assamese words
- Unnatural token boundaries
- Inefficient token compression
- Reduced language modeling quality
- Weak handling of Assamese morphology and suffix structures
Generic multilingual tokenizers are optimized for many languages simultaneously.
This tokenizer was built specifically for Assamese.
The goal is to:
- Preserve Assamese linguistic structure
- Improve token efficiency
- Reduce fragmentation
- Support large-scale Assamese language model training
- Create a tokenizer optimized for GPT-style autoregressive transformers
- Build foundational infrastructure for future Assamese AI systems
---
# Key Features
## 1. Custom Assamese BPE Vocabulary
This tokenizer uses Byte Pair Encoding (BPE) trained directly on Assamese text.
Features:
- Learns Assamese subwords automatically
- Captures common suffixes and morphemes
- Handles compound Assamese words efficiently
- Reduces vocabulary redundancy
- Improves token compression ratio
Vocabulary size:
```python
VOCAB_SIZE = 50_000
```
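One way to put a number on the compression claim is tokens per whitespace-separated word (often called fertility): lower means less fragmentation. The following is a minimal sketch, not part of this project; the `tokens_per_word` helper and the toy character-level encoder are hypothetical and exist only to make the example runnable on its own.

```python
from typing import Callable, Iterable, List


def tokens_per_word(encode: Callable[[str], List[int]], sentences: Iterable[str]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for sentence in sentences:
        total_tokens += len(encode(sentence))
        total_words += len(sentence.split())
    return total_tokens / total_words if total_words else 0.0


def char_encode(text: str) -> List[int]:
    """Toy stand-in encoder: one token per non-space code point (worst-case baseline)."""
    return [ord(ch) for ch in text if not ch.isspace()]


sample = ["অসম", "অসম অসম"]
fertility = tokens_per_word(char_encode, sample)
print(fertility)
```

With the real tokenizer, the encoder could be passed as `lambda s: tokenizer(s, add_special_tokens=False)["input_ids"]` so that `[BOS]` and `[EOS]` do not inflate the count.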
---
## 2. SQLite Streaming Training Pipeline
One of the most important features of this project is the streaming training architecture.
Instead of:
- loading massive text files into RAM
- generating temporary files
- requiring huge memory usage
the training pipeline streams text directly from a SQLite database.
Benefits:
- Extremely memory efficient
- Scales to huge datasets
- Faster dataset management
- Easier preprocessing workflows
- Better handling of terabyte-scale corpora in the future
Streaming occurs in configurable batches:
```python
BATCH_SIZE = 50_000
```
This makes the tokenizer suitable for large Assamese corpus training.
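The batched streaming described above can be sketched with the standard-library `sqlite3` module. This is an illustration under stated assumptions, not the project's actual code: the `documents` table, its `text` column, and the `stream_batches` function are hypothetical names.

```python
import sqlite3
from typing import Iterator, List

BATCH_SIZE = 50_000  # batch size quoted in this README


def stream_batches(conn: sqlite3.Connection, batch_size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield lists of text rows so only one batch is held in memory at a time."""
    # Hypothetical schema: table `documents` with a `text` column.
    cursor = conn.execute(
        "SELECT text FROM documents WHERE text IS NOT NULL ORDER BY rowid"
    )
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield [row[0] for row in rows]


# Small in-memory demo database so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (text TEXT)")
conn.executemany(
    "INSERT INTO documents (text) VALUES (?)",
    [("অসমীয়া ভাষা",), ("সুন্দৰ ভাষা",), ("অসম",)],
)
batches = list(stream_batches(conn, batch_size=2))
print([len(batch) for batch in batches])  # [2, 1]
conn.close()
```

A generator like this can be handed directly to a trainer that accepts an iterator of string batches, so no intermediate text files are needed.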
---
## 2.1 Datasets Used to Build This Tokenizer
| Topic / Dataset | Tokens | Approx. Scale | Source |
|----------------------------------------------|------------------:|---------------|--------|
| Poems Dataset | 92.6K | 0.0000926B | Kaggle & Sosanko Sarmah (Contributor) |
| Song Lyrics Dataset | 4.5M | 0.0045B | Kaggle (Spotify API) |
| Story Dataset | 52.6B | 52.6 Billion | HuggingFace Dataset |
| Crawled Data | 7T | 7 Trillion | Various Web Sources |
| CC-100 Dataset | 5.9M | 0.0059B | Common Crawl |
| Qwen3 Tokens | 2B | 2 Billion | Kaggle |
| Kaggle News Articles Dataset | 49.6B | 49.6 Billion | Kaggle |
| IndicCorp v2 (AI4Bharat) | 37.8T | 37.8 Trillion | AI4Bharat Dataset |
| Assamese Monolingual Corpus (MWire-Labs) | 38.7B | 38.7 Billion | MWire-Labs |
| DailyHunt Dataset | 184.2B | 184.2 Billion | Rahular Varta Dataset |
| Wikipedia Dump (2019–2025) | 0.2T | 200 Billion | Wikipedia |
| | || |
| **Total** | **45.3T** | **45.333 Trillion** | |
---
## 3. Unicode Normalization for Assamese
In Indic scripts, text that looks identical on screen can be encoded as different Unicode code point sequences. For example, the vowel sign ো can be stored as one code point or as the pair ে + া.
This tokenizer applies NFC normalization:
```python
normalizers.NFC()
```
Benefits:
- Prevents token fragmentation
- Standardizes Unicode representations
- Improves vocabulary consistency
- Handles Assamese/Bengali script more reliably
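The effect of NFC can be seen with Python's standard `unicodedata` module, which applies the same Unicode normalization form. This is a small illustration only; the `nfc` helper and the chosen example strings are not taken from the project.

```python
import unicodedata


def nfc(text: str) -> str:
    """Apply Unicode NFC normalization."""
    return unicodedata.normalize("NFC", text)


# The vowel sign "o" written as two parts (U+09C7 + U+09BE) or as one code point (U+09CB).
two_part_o = "\u0995\u09c7\u09be"
single_o = "\u0995\u09cb"

# The letter YYA as one code point (U+09DF) or as YA + nukta (U+09AF + U+09BC).
ya_precomposed = "\u09df"
ya_with_nukta = "\u09af\u09bc"

print(two_part_o == single_o)                      # False: different code points
print(nfc(two_part_o) == nfc(single_o))            # True: same sequence after NFC
print(nfc(ya_precomposed) == nfc(ya_with_nukta))   # True: same sequence after NFC
```

Without this step, the two spellings of the same word would produce different token sequences and compete for vocabulary slots.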
---
## 4. GPT-Style Special Tokens
The tokenizer includes:
- `[UNK]`
- `[PAD]`
- `[BOS]`
- `[EOS]`
`[BOS]` and `[EOS]` are attached to each encoded sequence through a Hugging Face post-processing template.
This design makes the tokenizer compatible with:
- Decoder-only transformers
- GPT-style training
- Autoregressive generation
- Custom Assamese language models
Example formatting:
```text
[BOS] Assamese sentence here [EOS]
```
---
## 5. Hugging Face Compatibility
The tokenizer is exported using:
```python
PreTrainedTokenizerFast
```
This allows direct compatibility with:
- Hugging Face Transformers
- PyTorch training pipelines
- Custom dataloaders
- AutoTokenizer
- Future Assamese LLM checkpoints
Load example:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("./assamese_tokenizer")
```
---
## 6. Built-in Sanity Testing
The project includes automatic validation tests.
Each test:
- Encodes Assamese sentences
- Decodes them back
- Verifies reconstruction integrity
- Displays token breakdowns
This ensures:
- Stable encoding
- Proper decoding
- Reliable tokenizer behavior
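A round-trip check of this kind could look like the sketch below. It is written against plain `encode` and `decode` callables so it runs on its own; the `find_round_trip_failures` helper and the toy code-point codec are hypothetical and are not the project's test code.

```python
import unicodedata
from typing import Callable, Iterable, List, Tuple


def find_round_trip_failures(
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    sentences: Iterable[str],
) -> List[Tuple[str, str]]:
    """Return (expected, actual) pairs for sentences that do not survive encode -> decode."""
    failures = []
    for sentence in sentences:
        # Compare against the NFC form, since the tokenizer normalizes its input.
        expected = unicodedata.normalize("NFC", sentence)
        actual = decode(encode(sentence))
        if actual != expected:
            failures.append((expected, actual))
    return failures


# Toy stand-in codec: one id per code point, with NFC applied on the way in.
def toy_encode(text: str) -> List[int]:
    return [ord(ch) for ch in unicodedata.normalize("NFC", text)]


def toy_decode(ids: List[int]) -> str:
    return "".join(chr(i) for i in ids)


failures = find_round_trip_failures(
    toy_encode, toy_decode, ["অসমীয়া ভাষা এটি সুন্দৰ ভাষা।"]
)
print(failures)
```

With the real tokenizer, `decode` would typically be called with `skip_special_tokens=True`. Depending on the decoder configuration, which this README does not specify, spacing around punctuation may differ from the input, so a strict equality check may need to be relaxed.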
---
# Technical Architecture
## Tokenizer Type
- Algorithm: BPE (Byte Pair Encoding)
- Model Style: GPT-style autoregressive LM
- Script: Assamese (Bengali-Assamese script)
---
## Pre-tokenization Strategy
This tokenizer intentionally uses the simple `Whitespace` pre-tokenizer, which splits text on whitespace and separates punctuation from words:
```python
pre_tokenizers.Whitespace()
```
The BPE model then learns subword merges automatically.
This avoids unnecessary early fragmentation while allowing the tokenizer to learn Assamese structure naturally from data.
---
# Training Pipeline Overview
```text
SQLite Database
      ↓
Streaming Generator
      ↓
Unicode Normalization
      ↓
Whitespace Pre-tokenization
      ↓
BPE Training
      ↓
Special Token Processing
      ↓
Hugging Face Export
```
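The stages from normalization through special token processing can be sketched with the Hugging Face `tokenizers` library. This is a minimal reconstruction from the components named in this README, not the author's training script: the `train_bpe` function, the toy corpus, and the tiny vocabulary size are assumptions made so the example runs quickly.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, processors, trainers

SPECIAL_TOKENS = ["[UNK]", "[PAD]", "[BOS]", "[EOS]"]


def train_bpe(batches, vocab_size=50_000):
    """Train a BPE tokenizer from an iterator of string batches (hypothetical helper)."""
    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    tokenizer.normalizer = normalizers.NFC()
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=SPECIAL_TOKENS,
        show_progress=False,
    )
    tokenizer.train_from_iterator(batches, trainer=trainer)

    # Wrap every single sequence as "[BOS] ... [EOS]".
    tokenizer.post_processor = processors.TemplateProcessing(
        single="[BOS] $A [EOS]",
        special_tokens=[
            ("[BOS]", tokenizer.token_to_id("[BOS]")),
            ("[EOS]", tokenizer.token_to_id("[EOS]")),
        ],
    )
    return tokenizer


# Toy corpus: one batch of two sentences. A real run would stream batches from SQLite.
toy_batches = [["অসমীয়া ভাষা এটি সুন্দৰ ভাষা।", "অসম এখন সুন্দৰ ঠাই।"]]
tokenizer = train_bpe(toy_batches, vocab_size=200)

encoding = tokenizer.encode("সুন্দৰ ভাষা")
print(encoding.tokens)
```

Setting the post-processor after training means the special token ids are looked up from the trained vocabulary rather than hard-coded.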
---
# Project Structure
```text
assamese_tokenizer/
├── tokenizer.json
├── tokenizer_config.json
└── README.md
```
---
# Generated Files
## tokenizer.json
Contains:
- Vocabulary
- Merge rules
- Decoder rules
- Normalization configuration
- Pre-tokenizer configuration
- Post-processing templates
- Special token definitions
This is the core tokenizer model.
---
## tokenizer_config.json
Contains:
- Hugging Face metadata
- Special token configuration
- Model max length
- Tokenizer settings
This enables easy loading with:
```python
AutoTokenizer.from_pretrained()
```
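For reference, files like these are what `PreTrainedTokenizerFast.save_pretrained` writes when it wraps a `tokenizers` object. The sketch below is an assumption about how such an export could be done, not the project's export code; the `export_tokenizer` helper, the toy corpus, and the temporary output directory are hypothetical.

```python
import os
import tempfile

from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast


def export_tokenizer(tokenizer: Tokenizer, output_dir: str) -> PreTrainedTokenizerFast:
    """Wrap a trained `tokenizers` object and save it in Hugging Face format."""
    fast = PreTrainedTokenizerFast(
        tokenizer_object=tokenizer,
        unk_token="[UNK]",
        pad_token="[PAD]",
        bos_token="[BOS]",
        eos_token="[EOS]",
    )
    fast.save_pretrained(output_dir)
    return fast


# Tiny stand-in tokenizer so the sketch runs on its own.
toy = Tokenizer(models.BPE(unk_token="[UNK]"))
toy.pre_tokenizer = pre_tokenizers.Whitespace()
toy_trainer = trainers.BpeTrainer(
    vocab_size=100,
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
    show_progress=False,
)
toy.train_from_iterator(["অসমীয়া ভাষা এটি সুন্দৰ ভাষা।"], trainer=toy_trainer)

with tempfile.TemporaryDirectory() as output_dir:
    export_tokenizer(toy, output_dir)
    saved_files = sorted(os.listdir(output_dir))

print(saved_files)
```

Depending on the installed `transformers` version, additional files may be written alongside `tokenizer.json` and `tokenizer_config.json`.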
---
# Example Usage
## Loading the Tokenizer
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("./assamese_tokenizer")
```
---
## Encoding Assamese Text
```python
text = "অসমীয়া ভাষা এটি সুন্দৰ ভাষা।"
encoded = tokenizer(text)
print(encoded["input_ids"])
```
---
## Decoding
```python
decoded = tokenizer.decode(encoded["input_ids"])
print(decoded)
```
---
# Design Philosophy
This tokenizer was not built as a generic multilingual tokenizer.
It was designed specifically for Assamese language modeling.
The focus is:
- linguistic preservation
- scalable infrastructure
- efficient training
- Assamese-first optimization
- future LLM compatibility
The long-term vision is to help build fully native Assamese AI systems.
---
# Future Goals
Planned improvements include:
- Larger Assamese corpora training
- Improved punctuation handling
- Advanced normalization research
- Token compression benchmarking
- Comparison against multilingual tokenizers
- Native Assamese conversational models
- Public Hugging Face release
- Integration with custom Assamese GPT models
---
# Author
Ranjit Das
Developer focused on Assamese AI infrastructure, tokenization systems, large-scale datasets, and native language model development.
---
# License
This project is released under the MIT license. It is intended for research and educational purposes and is free for all users, including for commercial use.