Assamese Tokenizer

A high-performance custom BPE tokenizer built specifically for Assamese Large Language Models (LLMs).

This project is part of a larger effort to build a fully native Assamese AI ecosystem, including datasets, tokenization pipelines, and future GPT-style language models trained primarily on Assamese text.


Why I Built This

Most existing multilingual tokenizers do not properly handle Assamese.

Assamese is usually grouped together with Bengali or other Indic languages inside multilingual vocabularies. While this works at a basic level, it creates several problems:

  • Poor subword segmentation
  • Fragmented Assamese words
  • Unnatural token boundaries
  • Inefficient token compression
  • Reduced language modeling quality
  • Weak handling of Assamese morphology and suffix structures

Generic multilingual tokenizers are optimized for many languages simultaneously. This tokenizer was built specifically for Assamese.

The goal is to:

  • Preserve Assamese linguistic structure
  • Improve token efficiency
  • Reduce fragmentation
  • Support large-scale Assamese language model training
  • Create a tokenizer optimized for GPT-style autoregressive transformers
  • Build foundational infrastructure for future Assamese AI systems

Key Features

1. Custom Assamese BPE Vocabulary

This tokenizer uses Byte Pair Encoding (BPE) trained directly on Assamese text.

Features:

  • Learns Assamese subwords automatically
  • Captures common suffixes and morphemes
  • Handles compound Assamese words efficiently
  • Reduces vocabulary redundancy
  • Improves token compression ratio

Vocabulary size:

VOCAB_SIZE = 50_000
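
To make the idea of learned merges concrete, here is a minimal pure-Python sketch of BPE merge learning. It is an illustration of the general algorithm only, not the training code used for this tokenizer; the function name learn_bpe_merges and the sample words are assumptions chosen for the example.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words.

    Hypothetical helper for illustration; real training uses an optimized library.
    """
    # Each word starts as a tuple of single code points, weighted by frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


# Words sharing the stem "ঘৰ" with different case suffixes.
merges, segmented = learn_bpe_merges(["ঘৰ", "ঘৰৰ", "ঘৰত", "ঘৰলৈ"], num_merges=3)
for left, right in merges:
    print(repr(left), "+", repr(right))
```

Frequent stems and suffixes end up as single vocabulary entries because their character pairs are merged early; the vocabulary size bounds how many such merges are learned.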

2. SQLite Streaming Training Pipeline

One of the most important features of this project is the streaming training architecture.

Instead of:

  • loading massive text files into RAM
  • generating temporary files
  • requiring huge memory usage

the training pipeline streams text directly from a SQLite database.

Benefits:

  • Extremely memory efficient
  • Scales to huge datasets
  • Faster dataset management
  • Easier preprocessing workflows
  • Better handling of terabyte-scale corpora in the future

Streaming occurs in configurable batches:

BATCH_SIZE = 50_000

This makes the pipeline suitable for training on large Assamese corpora.
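
The batching described above could look roughly like the following sketch, which uses only the standard-library sqlite3 module. The table name documents and column name text are assumptions for illustration; the actual database schema is not documented here.

```python
import sqlite3

BATCH_SIZE = 50_000  # batch size quoted above


def stream_batches(conn, batch_size=BATCH_SIZE):
    """Yield lists of non-empty text rows, at most `batch_size` rows per list."""
    # "documents" and "text" are hypothetical table/column names.
    cursor = conn.execute("SELECT text FROM documents")
    while True:
        rows = cursor.fetchmany(batch_size)  # only one batch is held in memory
        if not rows:
            break
        batch = [row[0] for row in rows if row[0]]  # skip NULL/empty rows
        if batch:
            yield batch


# Small in-memory demo database standing in for the real corpus.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (text TEXT)")
conn.executemany(
    "INSERT INTO documents (text) VALUES (?)",
    [("অসমীয়া ভাষা",), ("গুৱাহাটী",), (None,)],
)
for batch in stream_batches(conn, batch_size=2):
    print(len(batch), "rows in batch")
conn.close()
```

A generator like this can be handed to a trainer that accepts an iterator of strings or of lists of strings, so the full corpus never has to be loaded at once.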


2.1 Datasets Used to Build This Tokenizer

Topic / Dataset | Tokens | Approx. Scale | Source
--------------- | ------ | ------------- | ------
Poems Dataset | 92.6K | 0.0000926B | Kaggle & Sosanko Sarmah (Contributor)
Song Lyrics Dataset | 4.5M | 0.0045B | Kaggle (Spotify API)
Story Dataset | 52.6B | 52.6 Billion | HuggingFace Dataset
Crawled Data | 7T | 7 Trillion | Various Web Sources
CC-100 Dataset | 5.9M | 0.0059B | Common Crawl
Qwen3 Tokens | 2B | 2 Billion | Kaggle
Kaggle News Articles Dataset | 49.6B | 49.6 Billion | Kaggle
IndicCorp v2 (AI4Bharat) | 37.8T | 37.8 Trillion | AI4Bharat Dataset
Assamese Monolingual Corpus (MWire-Labs) | 38.7B | 38.7 Billion | MWire-Labs
DailyHunt Dataset | 184.2B | 184.2 Billion | Rahular Varta Dataset
Wikipedia Dump (2019–2025) | 0.2T | 200 Billion | Wikipedia
Total | 45.3T | 45.333 Trillion |

3. Unicode Normalization for Assamese

In Indic scripts, text that looks identical on screen can be encoded as different Unicode code point sequences.

This tokenizer applies NFC normalization:

normalizers.NFC()

Benefits:

  • Prevents token fragmentation
  • Standardizes Unicode representations
  • Improves vocabulary consistency
  • Handles Assamese/Bengali script more reliably
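
The effect of NFC can be checked with Python's standard unicodedata module, which applies the same Unicode normalization form. The snippet below is a small illustration, not part of the tokenizer itself; the helper name nfc is an assumption.

```python
import unicodedata


def nfc(text):
    """Return `text` in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


# The vowel sign O can be stored as one code point (U+09CB) or as two
# (U+09C7 followed by U+09BE). Both render identically.
composed = "\u0995\u09CB"          # ক + ো
decomposed = "\u0995\u09C7\u09BE"  # ক + ে + া

print(composed == decomposed)            # False: different code points
print(nfc(composed) == nfc(decomposed))  # True: NFC maps both to one form

# The letter য় exists as U+09DF and as U+09AF + U+09BC (nukta).
# U+09DF is excluded from composition, so NFC yields the two-code-point form.
print([hex(ord(ch)) for ch in nfc("\u09DF")])
```

Without this step, the two spellings of the same word would be learned as unrelated token sequences.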

4. GPT-Style Special Tokens

The tokenizer includes:

  • [UNK]
  • [PAD]
  • [BOS]
  • [EOS]

These are integrated using Hugging Face post-processing templates.

This design makes the tokenizer compatible with:

  • Decoder-only transformers
  • GPT-style training
  • Autoregressive generation
  • Custom Assamese language models

Example formatting:

[BOS] Assamese sentence here [EOS]
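
The following pure-Python sketch shows what this formatting amounts to at the id level, plus right-padding a batch with [PAD]. The id assignment is an assumption for illustration (special tokens numbered in the order listed above); the ids in the released tokenizer may differ.

```python
SPECIAL_TOKENS = ["[UNK]", "[PAD]", "[BOS]", "[EOS]"]

# Hypothetical ids: special tokens take ids 0..3 in list order.
TOKEN_TO_ID = {token: i for i, token in enumerate(SPECIAL_TOKENS)}


def add_bos_eos(token_ids):
    """Wrap one sequence of ids as [BOS] ... [EOS]."""
    return [TOKEN_TO_ID["[BOS]"], *token_ids, TOKEN_TO_ID["[EOS]"]]


def pad_batch(sequences):
    """Right-pad every sequence with [PAD] to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    pad_id = TOKEN_TO_ID["[PAD]"]
    return [seq + [pad_id] * (width - len(seq)) for seq in sequences]


batch = [add_bos_eos([10, 11]), add_bos_eos([10, 11, 12, 13])]
for row in pad_batch(batch):
    print(row)
```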

5. Hugging Face Compatibility

The tokenizer is exported using:

PreTrainedTokenizerFast

This allows direct compatibility with:

  • Hugging Face Transformers
  • PyTorch training pipelines
  • Custom dataloaders
  • AutoTokenizer
  • Future Assamese LLM checkpoints

Load example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./assamese_tokenizer")

6. Built-in Sanity Testing

The project includes automatic validation tests.

Each test:

  • Encodes Assamese sentences
  • Decodes them back
  • Verifies reconstruction integrity
  • Displays token breakdowns

This ensures:

  • Stable encoding
  • Proper decoding
  • Reliable tokenizer behavior
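
A round-trip check of this kind can be sketched as below. This is not the project's actual test code: check_round_trip is a hypothetical helper, and CodePointTokenizer is a stand-in with encode and decode methods so the example runs without the tokenizer files.

```python
def check_round_trip(tokenizer, sentences):
    """Return (original, restored) pairs where decode(encode(text)) != text."""
    failures = []
    for text in sentences:
        ids = tokenizer.encode(text)
        restored = tokenizer.decode(ids)
        if restored != text:
            failures.append((text, restored))
    return failures


class CodePointTokenizer:
    """Stand-in tokenizer for the demo: one id per Unicode code point."""

    def encode(self, text):
        return [ord(ch) for ch in text]

    def decode(self, ids):
        return "".join(chr(i) for i in ids)


sentences = ["অসমীয়া ভাষা এটি সুন্দৰ ভাষা।"]
failures = check_round_trip(CodePointTokenizer(), sentences)
print("failures:", failures)
```

When running this against the real tokenizer, two details probably matter: special tokens should be skipped when decoding, and the comparison should use the NFC-normalized input, since normalization can legitimately change the raw code points.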

Technical Architecture

Tokenizer Type

  • Algorithm: BPE (Byte Pair Encoding)
  • Model Style: GPT-style autoregressive LM
  • Script: Assamese (Bengali-Assamese script)

Pre-tokenization Strategy

This tokenizer intentionally uses simple whitespace pre-tokenization:

pre_tokenizers.Whitespace()

The BPE model then learns subword merges automatically.

This avoids unnecessary early fragmentation while allowing the tokenizer to learn Assamese structure naturally from data.


Training Pipeline Overview

SQLite Database
      ↓
Streaming Generator
      ↓
Unicode Normalization
      ↓
Whitespace Pre-tokenization
      ↓
BPE Training
      ↓
Special Token Processing
      ↓
Hugging Face Export

Project Structure

assamese_tokenizer/
│
├── tokenizer.json
├── tokenizer_config.json
└── README.md

Generated Files

tokenizer.json

Contains:

  • Vocabulary
  • Merge rules
  • Decoder rules
  • Normalization configuration
  • Pre-tokenizer configuration
  • Post-processing templates
  • Special token definitions

This is the core tokenizer model.


tokenizer_config.json

Contains:

  • Hugging Face metadata
  • Special token configuration
  • Model max length
  • Tokenizer settings

This enables easy loading with:

AutoTokenizer.from_pretrained()

Example Usage

Loading the Tokenizer

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./assamese_tokenizer")

Encoding Assamese Text

text = "অসমীয়া ভাষা এটি সুন্দৰ ভাষা।"

encoded = tokenizer(text)
print(encoded["input_ids"])

Decoding

decoded = tokenizer.decode(encoded["input_ids"])
print(decoded)
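
Measuring Token Efficiency

One simple way to quantify fragmentation is the average number of tokens per whitespace-separated word (lower means less fragmentation). The sketch below is an assumption about how such a measurement could be done, not a benchmark from this project; tokens_per_word is a hypothetical helper and the demo uses a stand-in encoder.

```python
def tokens_per_word(encode, texts):
    """Average number of tokens produced per whitespace-separated word.

    `encode` is any callable mapping a string to a list of tokens or ids.
    """
    total_tokens = sum(len(encode(text)) for text in texts)
    total_words = sum(len(text.split()) for text in texts)
    return total_tokens / total_words if total_words else 0.0


def char_level_encode(text):
    """Stand-in encoder: one token per non-space code point (worst case)."""
    return [ch for ch in text if not ch.isspace()]


texts = ["অসমীয়া ভাষা এটি সুন্দৰ ভাষা।"]
print(tokens_per_word(char_level_encode, texts))
```

With the loaded tokenizer, the encoder could be passed as lambda t: tokenizer(t, add_special_tokens=False)["input_ids"], and the same function can be applied to a multilingual tokenizer for comparison.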

Design Philosophy

This tokenizer was not built as a generic multilingual tokenizer.

It was designed specifically for Assamese language modeling.

The focus is:

  • linguistic preservation
  • scalable infrastructure
  • efficient training
  • Assamese-first optimization
  • future LLM compatibility

The long-term vision is to help build fully native Assamese AI systems.


Future Goals

Planned improvements include:

  • Training on larger Assamese corpora
  • Improved punctuation handling
  • Advanced normalization research
  • Token compression benchmarking
  • Comparison against multilingual tokenizers
  • Native Assamese conversational models
  • Public Hugging Face release
  • Integration with custom Assamese GPT models

Author

Ranjit Das

Developer focused on Assamese AI infrastructure, tokenization systems, large-scale datasets, and native language model development.


License

This project is intended for research and educational purposes. It is free for all users, including for commercial use.
