Assamese Tokenizer
A high-performance custom BPE tokenizer built specifically for Assamese Large Language Models (LLMs).
This project is part of a larger effort to build a fully native Assamese AI ecosystem, including datasets, tokenization pipelines, and future GPT-style language models trained primarily on Assamese text.
Why I Built This
Most existing multilingual tokenizers do not properly handle Assamese.
Assamese is usually grouped together with Bengali or other Indic languages inside multilingual vocabularies. While this works at a basic level, it creates several problems:
- Poor subword segmentation
- Fragmented Assamese words
- Unnatural token boundaries
- Inefficient token compression
- Reduced language modeling quality
- Weak handling of Assamese morphology and suffix structures
Generic multilingual tokenizers are optimized for many languages simultaneously. This tokenizer was built specifically for Assamese.
The goal is to:
- Preserve Assamese linguistic structure
- Improve token efficiency
- Reduce fragmentation
- Support large-scale Assamese language model training
- Create a tokenizer optimized for GPT-style autoregressive transformers
- Build foundational infrastructure for future Assamese AI systems
Key Features
1. Custom Assamese BPE Vocabulary
This tokenizer uses Byte Pair Encoding (BPE) trained directly on Assamese text.
Features:
- Learns Assamese subwords automatically
- Captures common suffixes and morphemes
- Handles compound Assamese words efficiently
- Reduces vocabulary redundancy
- Improves token compression ratio
Vocabulary size:
VOCAB_SIZE = 50_000
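To make the idea of learned merges concrete, here is a toy, pure-Python sketch of BPE merge learning. It is not the training code used for this tokenizer (that relies on the Hugging Face tokenizers library); the function name `learn_bpe_merges` and the sample words are illustrative assumptions.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Toy BPE: repeatedly fuse the most frequent adjacent symbol pair."""
    # Each distinct word becomes a tuple of code points, weighted by frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the best pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

# Hypothetical sample: inflected forms of "ভাষা" share a stem the merges can capture.
sample_words = ["ভাষা", "ভাষাৰ", "ভাষাত", "অসম", "অসমীয়া"]
merges, segmented = learn_bpe_merges(sample_words, num_merges=5)
for left, right in merges:
    print(repr(left), "+", repr(right))
```

Each merge adds one entry to the vocabulary, so in a real run the vocabulary size bounds how many merges the trainer performs.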
2. SQLite Streaming Training Pipeline
One of the most important features of this project is the streaming training architecture.
Instead of:
- loading massive text files into RAM
- generating temporary files
- requiring huge memory usage
the training pipeline streams text directly from a SQLite database.
Benefits:
- Extremely memory efficient
- Scales to huge datasets
- Faster dataset management
- Easier preprocessing workflows
- Better handling of terabyte-scale corpora in the future
Streaming occurs in configurable batches:
BATCH_SIZE = 50_000
This makes the tokenizer suitable for large Assamese corpus training.
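A minimal sketch of such a streaming generator follows, using Python's standard `sqlite3` module. The table name `documents`, the column name `text`, and the function name `stream_text_batches` are assumptions for illustration; the actual schema is not documented here.

```python
import sqlite3

BATCH_SIZE = 50_000  # batch size stated above

def stream_text_batches(conn, batch_size=BATCH_SIZE, table="documents", column="text"):
    """Yield lists of text rows so the full table never sits in memory."""
    # Table and column names are hypothetical; adjust them to the real schema.
    cursor = conn.execute(f'SELECT "{column}" FROM "{table}"')
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        # Drop NULL or empty rows before handing the batch to the trainer.
        batch = [text for (text,) in rows if text]
        if batch:
            yield batch

# Demo with an in-memory database standing in for the real corpus.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (text TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?)",
    [("অসমীয়া ভাষা",), ("সুন্দৰ ভাষা",), ("অসম",)],
)
batches = list(stream_text_batches(conn, batch_size=2))
print([len(batch) for batch in batches])  # [2, 1]
```

A generator like this can be passed to `Tokenizer.train_from_iterator`, which accepts an iterator over strings or over lists of strings.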
2.1 Datasets Used to Build This Tokenizer
| Topic / Dataset | Tokens | Approx. Scale (tokens) | Source |
|---|---|---|---|
| Poems Dataset | 92.6K | 0.0000926B | Kaggle & Sosanko Sarmah (Contributor) |
| Song Lyrics Dataset | 4.5M | 0.0045B | Kaggle (Spotify API) |
| Story Dataset | 52.6B | 52.6 Billion | HuggingFace Dataset |
| Crawled Data | 7T | 7 Trillion | Various Web Sources |
| CC-100 Dataset | 5.9M | 0.0059B | Common Crawl |
| Qwen3 Tokens | 2B | 2 Billion | Kaggle |
| Kaggle News Articles Dataset | 49.6B | 49.6 Billion | Kaggle |
| IndicCorp v2 (AI4Bharat) | 37.8T | 37.8 Trillion | AI4Bharat Dataset |
| Assamese Monolingual Corpus (MWire-Labs) | 38.7B | 38.7 Billion | MWire-Labs |
| DailyHunt Dataset | 184.2B | 184.2 Billion | Rahular Varta Dataset |
| Wikipedia Dump (2019–2025) | 0.2T | 200 Billion | Wikipedia |
| Total | 45.3T | 45.333 Trillion | |
3. Unicode Normalization for Assamese
Indic scripts often contain visually identical Unicode sequences represented differently internally.
This tokenizer applies NFC normalization:
normalizers.NFC()
Benefits:
- Prevents token fragmentation
- Standardizes Unicode representations
- Improves vocabulary consistency
- Handles Assamese/Bengali script more reliably
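The effect can be illustrated with Python's standard `unicodedata` module, which applies the same Unicode NFC algorithm. The two spellings of "অসমীয়া" below are constructed for illustration and are not taken from the training data.

```python
import unicodedata

# "য়" can be stored as one code point (U+09DF) or as য (U+09AF) + nukta (U+09BC).
word_single = "অসমী\u09dfা"
word_split = "অসমী\u09af\u09bcা"

print(word_single == word_split)  # False: same appearance, different code points

nfc_single = unicodedata.normalize("NFC", word_single)
nfc_split = unicodedata.normalize("NFC", word_split)
print(nfc_single == nfc_split)  # True: NFC maps both to one canonical sequence

# The vowel sign "ো" typed as two parts (U+09C7 + U+09BE) composes to U+09CB.
composed_o = unicodedata.normalize("NFC", "\u09c7\u09be")
print(composed_o == "\u09cb")  # True
```

Without this step, the two spellings would reach the BPE trainer as different strings and could end up as separate vocabulary entries.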
4. GPT-Style Special Tokens
The tokenizer includes:
[UNK] [PAD] [BOS] [EOS]
These are integrated using Hugging Face post-processing templates.
This design makes the tokenizer compatible with:
- Decoder-only transformers
- GPT-style training
- Autoregressive generation
- Custom Assamese language models
Example formatting:
[BOS] Assamese sentence here [EOS]
5. Hugging Face Compatibility
The tokenizer is exported using:
PreTrainedTokenizerFast
This allows direct compatibility with:
- Hugging Face Transformers
- PyTorch training pipelines
- Custom dataloaders
- AutoTokenizer
- Future Assamese LLM checkpoints
Load example:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("./assamese_tokenizer")
6. Built-in Sanity Testing
The project includes automatic validation tests.
The test routine:
- Encodes Assamese sentences
- Decodes them back
- Verifies reconstruction integrity
- Displays token breakdowns
This ensures:
- Stable encoding
- Proper decoding
- Reliable tokenizer behavior
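A round-trip check of this kind could look like the sketch below. The helper `check_roundtrip` and the code-point stand-in tokenizer are assumptions made so the example runs without downloading anything; the project's own test code is not shown in this README.

```python
import unicodedata

def check_roundtrip(encode, decode, sentences):
    """Return (sentence, ids, ok) triples; ok means decode(encode(s)) matched."""
    results = []
    for sentence in sentences:
        ids = encode(sentence)
        restored = decode(ids)
        # Compare against the NFC form, since the tokenizer normalizes its input.
        expected = unicodedata.normalize("NFC", sentence)
        results.append((sentence, ids, restored == expected))
    return results

# Stand-in tokenizer: one id per code point (illustration only).
def toy_encode(text):
    return [ord(ch) for ch in unicodedata.normalize("NFC", text)]

def toy_decode(ids):
    return "".join(chr(i) for i in ids)

# With the real tokenizer, one would pass something like:
#   encode=lambda s: tokenizer.encode(s, add_special_tokens=False)
#   decode=lambda ids: tokenizer.decode(ids, skip_special_tokens=True)
report = check_roundtrip(toy_encode, toy_decode, ["অসমীয়া ভাষা এটি সুন্দৰ ভাষা।"])
for sentence, ids, ok in report:
    print(ok, len(ids), sentence)
```

With a real BPE tokenizer, spacing around punctuation in the decoded text depends on the configured decoder, so a strict equality check may need to be relaxed accordingly.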
Technical Architecture
Tokenizer Type
- Algorithm: BPE (Byte Pair Encoding)
- Model Style: GPT-style autoregressive LM
- Script: Assamese (Bengali-Assamese script)
Pre-tokenization Strategy
This tokenizer intentionally uses simple whitespace pre-tokenization:
pre_tokenizers.Whitespace()
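In the Hugging Face tokenizers library, Whitespace() splits on whitespace and also separates punctuation from word characters. The BPE model then learns subword merges automatically within each of these pieces, so a merge never spans a space or joins a word to punctuation such as the danda (।).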
This avoids unnecessary early fragmentation while allowing the tokenizer to learn Assamese structure naturally from data.
Training Pipeline Overview
SQLite Database
↓
Streaming Generator
↓
Unicode Normalization
↓
Whitespace Pre-tokenization
↓
BPE Training
↓
Special Token Processing
↓
Hugging Face Export
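The stages above map fairly directly onto the Hugging Face tokenizers API. The following is a minimal sketch under stated assumptions, not the project's actual training script: the function name `build_tokenizer`, the two sample sentences, and the tiny vocabulary size are illustrative.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, processors, trainers

SPECIAL_TOKENS = ["[UNK]", "[PAD]", "[BOS]", "[EOS]"]

def build_tokenizer(text_iterator, vocab_size=50_000):
    """Train a BPE tokenizer following the pipeline stages listed above."""
    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    tokenizer.normalizer = normalizers.NFC()                # Unicode normalization
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()   # whitespace pre-tokenization
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=SPECIAL_TOKENS,
        show_progress=False,
    )
    tokenizer.train_from_iterator(text_iterator, trainer=trainer)  # BPE training
    # Wrap every single sequence as "[BOS] ... [EOS]".
    tokenizer.post_processor = processors.TemplateProcessing(
        single="[BOS] $A [EOS]",
        special_tokens=[
            ("[BOS]", tokenizer.token_to_id("[BOS]")),
            ("[EOS]", tokenizer.token_to_id("[EOS]")),
        ],
    )
    return tokenizer

# Tiny in-memory corpus standing in for the SQLite stream (illustration only).
sample = ["অসমীয়া ভাষা এটি সুন্দৰ ভাষা।", "অসম এখন সুন্দৰ ৰাজ্য।"]
tok = build_tokenizer(sample, vocab_size=200)
encoding = tok.encode(sample[0])
print(encoding.tokens)
```

For the export stage, the trained object can be wrapped with `transformers.PreTrainedTokenizerFast(tokenizer_object=tok, ...)` and written out with `save_pretrained`, which would produce files like those described in the next sections.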
Project Structure
assamese_tokenizer/
│
├── tokenizer.json
├── tokenizer_config.json
└── README.md
Generated Files
tokenizer.json
Contains:
- Vocabulary
- Merge rules
- Decoder rules
- Normalization configuration
- Pre-tokenizer configuration
- Post-processing templates
- Special token definitions
This is the core tokenizer model.
tokenizer_config.json
Contains:
- Hugging Face metadata
- Special token configuration
- Model max length
- Tokenizer settings
This enables easy loading with:
AutoTokenizer.from_pretrained()
Example Usage
Loading the Tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("./assamese_tokenizer")
Encoding Assamese Text
text = "অসমীয়া ভাষা এটি সুন্দৰ ভাষা।"
encoded = tokenizer(text)
print(encoded["input_ids"])
Decoding
decoded = tokenizer.decode(encoded["input_ids"])
print(decoded)
Design Philosophy
This tokenizer was not built as a generic multilingual tokenizer.
It was designed specifically for Assamese language modeling.
The focus is:
- linguistic preservation
- scalable infrastructure
- efficient training
- Assamese-first optimization
- future LLM compatibility
The long-term vision is to help build fully native Assamese AI systems.
Future Goals
Planned improvements include:
- Larger Assamese corpora training
- Improved punctuation handling
- Advanced normalization research
- Token compression benchmarking
- Comparison against multilingual tokenizers
- Native Assamese conversational models
- Public Hugging Face release
- Integration with custom Assamese GPT models
Author
Ranjit Das
Developer focused on Assamese AI infrastructure, tokenization systems, large-scale datasets, and native language model development.
License
This project is intended for research and educational purposes. It is free for all users, including for commercial use.