Assamese Tokenizer
A high-performance custom BPE tokenizer built specifically for Assamese Large Language Models (LLMs).
This project was created as part of a larger effort to build a fully native Assamese AI ecosystem, including datasets, tokenization pipelines, and future GPT-style language models trained primarily on Assamese text.
Why I Built This
Most existing multilingual tokenizers do not properly handle Assamese.
Assamese is usually grouped together with Bengali or other Indic languages inside multilingual vocabularies. While this works at a basic level, it creates several problems:
- Poor subword segmentation
- Fragmented Assamese words
- Unnatural token boundaries
- Inefficient token compression
- Reduced language modeling quality
- Weak handling of Assamese morphology and suffix structures
Generic multilingual tokenizers are optimized for many languages simultaneously. This tokenizer was built specifically for Assamese.
The goal is to:
- Preserve Assamese linguistic structure
- Improve token efficiency
- Reduce fragmentation
- Support large-scale Assamese language model training
- Create a tokenizer optimized for GPT-style autoregressive transformers
- Build foundational infrastructure for future Assamese AI systems
Key Features
1. Custom Assamese BPE Vocabulary
This tokenizer uses Byte Pair Encoding (BPE) trained directly on Assamese text.
Features:
- Learns Assamese subwords automatically
- Captures common suffixes and morphemes
- Handles compound Assamese words efficiently
- Reduces vocabulary redundancy
- Improves token compression ratio
Vocabulary size:
VOCAB_SIZE = 50_000
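To make the idea of "learning subwords automatically" concrete, here is a minimal pure-Python sketch of the BPE merge loop. It is an illustration of the general algorithm only, not the training code used for this tokenizer; the toy word counts and the function names (`count_pairs`, `merge_pair`, `learn_merges`) are assumptions made for the example.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_merges(word_freqs, num_merges):
    """Repeatedly merge the most frequent adjacent pair."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

# Hypothetical toy counts: a stem plus two common case suffixes.
toy_counts = {"ভাষা": 5, "ভাষাৰ": 3, "ভাষাত": 2}
merges, segmented = learn_merges(toy_counts, num_merges=3)
print(merges)
print(segmented)
```

After three merges the shared stem becomes a single symbol and the suffixes stay as separate symbols, which is the behavior the feature list above describes.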
2. SQLite Streaming Training Pipeline
One of the most important features of this project is the streaming training architecture.
Instead of:
- loading massive text files into RAM
- generating temporary files
- requiring huge memory usage
this tokenizer streams data directly from SQLite.
Benefits:
- Extremely memory efficient
- Scales to huge datasets
- Faster dataset management
- Easier preprocessing workflows
- Better handling of terabyte-scale corpora in the future
Streaming occurs in configurable batches:
BATCH_SIZE = 50_000
This makes the tokenizer suitable for large Assamese corpus training.
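The streaming idea can be sketched with Python's built-in `sqlite3` module. This is a minimal sketch, not the project's actual code: the table name `documents`, the column name `text`, and the helper `stream_text_batches` are hypothetical.

```python
import sqlite3

BATCH_SIZE = 50_000  # batch size quoted in the text above

def stream_text_batches(conn, query, batch_size=BATCH_SIZE):
    """Yield lists of text rows without loading the whole table into memory."""
    cursor = conn.execute(query)
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        # Keep only non-empty text values from the first selected column.
        yield [row[0] for row in rows if row[0]]

# Demo on an in-memory database with a hypothetical `documents` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (text TEXT)")
conn.executemany(
    "INSERT INTO documents (text) VALUES (?)",
    [("অসমীয়া ভাষা",), ("এটি সুন্দৰ ভাষা",), ("অসম",), ("ভাষা",), ("সুন্দৰ",)],
)
batch_sizes = [
    len(batch)
    for batch in stream_text_batches(conn, "SELECT text FROM documents", batch_size=2)
]
print(batch_sizes)  # [2, 2, 1]
```

Because the generator holds only one batch at a time, memory use is bounded by the batch size rather than by the size of the corpus.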
2.1 Datasets Used to Build This Tokenizer
| Topic / Dataset | Tokens | Approx. Scale | Source |
|---|---|---|---|
| Poems Dataset | 92.6K | 0.0000926B | Kaggle & Sosanko Sarmah (Contributor) |
| Song Lyrics Dataset | 4.5M | 0.0045B | Kaggle (Spotify API) |
| Story Dataset | 52.6B | 52.6 Billion | HuggingFace Dataset |
| Crawled Data | 7T | 7 Trillion | Various Web Sources |
| CC-100 Dataset | 5.9M | 0.0059B | Common Crawl |
| Qwen3 Tokens | 2B | 2 Billion | Kaggle |
| Kaggle News Articles Dataset | 49.6B | 49.6 Billion | Kaggle |
| IndicCorp v2 (AI4Bharat) | 37.8T | 37.8 Trillion | AI4Bharat Dataset |
| Assamese Monolingual Corpus (MWire-Labs) | 38.7B | 38.7 Billion | MWire-Labs |
| DailyHunt Dataset | 184.2B | 184.2 Billion | Rahular Varta Dataset |
| Wikipedia Dump (2019–2025) | 0.2T | 200 Billion | Wikipedia |
| Total | 45.3T | approx. 45.3 Trillion | |
3. Unicode Normalization for Assamese
In Indic scripts, text that looks identical on screen can often be encoded as different Unicode code point sequences.
This tokenizer applies NFC normalization:
normalizers.NFC()
Benefits:
- Prevents token fragmentation
- Standardizes Unicode representations
- Improves vocabulary consistency
- Handles Assamese/Bengali script more reliably
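The effect of NFC can be shown with Python's standard `unicodedata` module. This sketch uses the standard library rather than `normalizers.NFC()` so that it runs without extra dependencies; the example strings were chosen for illustration.

```python
import unicodedata

# The vowel sign O can be stored as one code point or as two.
composed = "\u0995\u09cb"          # KA + VOWEL SIGN O
decomposed = "\u0995\u09c7\u09be"  # KA + VOWEL SIGN E + VOWEL SIGN AA

print(composed == decomposed)  # False: the raw code points differ
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# The letter YYA also has two encodings; NFC maps both to one sequence.
yya_single = "\u09df"        # precomposed YYA
yya_nukta = "\u09af\u09bc"   # YA + NUKTA
same_after_nfc = (
    unicodedata.normalize("NFC", yya_single)
    == unicodedata.normalize("NFC", yya_nukta)
)
print(same_after_nfc)  # True
```

Without this step, the two encodings of the same word would be counted as different strings during BPE training and could end up as different tokens.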
4. GPT-Style Special Tokens
The tokenizer includes:
[UNK] [PAD] [BOS] [EOS]
These are integrated using Hugging Face post-processing templates.
This design makes the tokenizer compatible with:
- Decoder-only transformers
- GPT-style training
- Autoregressive generation
- Custom Assamese language models
Example formatting:
[BOS] Assamese sentence here [EOS]
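A minimal sketch of how a post-processing template adds these tokens, using `TemplateProcessing` from the `tokenizers` library. To keep the IDs predictable, the example uses a tiny fixed `WordLevel` vocabulary instead of the real BPE model; the vocabulary and the IDs are assumptions for illustration.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, processors

# Hypothetical toy vocabulary; the real tokenizer uses a trained BPE model.
vocab = {"[UNK]": 0, "[PAD]": 1, "[BOS]": 2, "[EOS]": 3, "অসম": 4, "নগৰ": 5}

demo = Tokenizer(models.WordLevel(vocab=vocab, unk_token="[UNK]"))
demo.pre_tokenizer = pre_tokenizers.Whitespace()

# Wrap every single sequence as: [BOS] <tokens> [EOS]
demo.post_processor = processors.TemplateProcessing(
    single="[BOS] $A [EOS]",
    special_tokens=[("[BOS]", vocab["[BOS]"]), ("[EOS]", vocab["[EOS]"])],
)

encoding = demo.encode("অসম নগৰ")
print(encoding.tokens)  # ['[BOS]', 'অসম', 'নগৰ', '[EOS]']
print(encoding.ids)     # [2, 4, 5, 3]
```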
5. Hugging Face Compatibility
The tokenizer is exported using:
PreTrainedTokenizerFast
This allows direct compatibility with:
- Hugging Face Transformers
- PyTorch training pipelines
- Custom dataloaders
- AutoTokenizer
- Future Assamese LLM checkpoints
Load example:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("./assamese_tokenizer")
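For context, an export of this kind might look like the sketch below. It assumes a `tokenizers.Tokenizer` object is already available; here a tiny stand-in is trained on a single sentence, and the toy vocabulary size and temporary output directory are assumptions. The exact set of files written by `save_pretrained` can vary between Transformers versions.

```python
import tempfile
from pathlib import Path

from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import AutoTokenizer, PreTrainedTokenizerFast

SPECIAL_TOKENS = ["[UNK]", "[PAD]", "[BOS]", "[EOS]"]

# Tiny stand-in tokenizer; the real one is trained on the full corpus.
raw = Tokenizer(models.BPE(unk_token="[UNK]"))
raw.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    vocab_size=100, special_tokens=SPECIAL_TOKENS, show_progress=False
)
raw.train_from_iterator(["অসমীয়া ভাষা এটি সুন্দৰ ভাষা।"], trainer=trainer)

# Wrap the raw tokenizer and declare which tokens play which role.
wrapped = PreTrainedTokenizerFast(
    tokenizer_object=raw,
    unk_token="[UNK]",
    pad_token="[PAD]",
    bos_token="[BOS]",
    eos_token="[EOS]",
)

out_dir = Path(tempfile.mkdtemp()) / "assamese_tokenizer"
wrapped.save_pretrained(str(out_dir))

# The saved directory can then be loaded as in the example above.
reloaded = AutoTokenizer.from_pretrained(str(out_dir))
print(reloaded.pad_token_id)
```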
6. Built-in Sanity Testing
The project includes automatic validation tests.
The tokenizer:
- Encodes Assamese sentences
- Decodes them back
- Verifies reconstruction integrity
- Displays token breakdowns
This ensures:
- Stable encoding
- Proper decoding
- Reliable tokenizer behavior
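A round-trip check of this kind could be sketched as follows. The helper `roundtrip_report` is hypothetical, and the demo uses a trivial character-level codec as a stand-in so the snippet runs without the trained tokenizer.

```python
import unicodedata

def roundtrip_report(encode, decode, sentences):
    """Encode and decode each sentence and record whether it survives intact."""
    results = []
    for sentence in sentences:
        ids = encode(sentence)
        restored = decode(ids)
        # Compare against the NFC form, since the tokenizer normalizes input.
        expected = unicodedata.normalize("NFC", sentence)
        results.append((sentence, len(ids), restored == expected))
    return results

# Stand-in codec: one ID per code point of the NFC-normalized text.
def char_encode(text):
    return [ord(ch) for ch in unicodedata.normalize("NFC", text)]

def char_decode(ids):
    return "".join(chr(i) for i in ids)

sentences = ["অসমীয়া ভাষা এটি সুন্দৰ ভাষা।"]
report = roundtrip_report(char_encode, char_decode, sentences)
for sentence, n_tokens, ok in report:
    print(n_tokens, ok, sentence)
```

With the real tokenizer, the two callables could be something like `lambda s: tokenizer.encode(s, add_special_tokens=False)` and `tokenizer.decode`. Whether decoding reproduces the input exactly, including spacing around punctuation, depends on the decoder configured in `tokenizer.json`.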
Technical Architecture
Tokenizer Type
- Algorithm: BPE (Byte Pair Encoding)
- Model Style: GPT-style autoregressive LM
- Script: Assamese (Bengali-Assamese script)
Pre-tokenization Strategy
This tokenizer intentionally uses simple whitespace pre-tokenization:
pre_tokenizers.Whitespace()
The BPE model then learns subword merges automatically.
This avoids unnecessary early fragmentation while allowing the tokenizer to learn Assamese structure naturally from data.
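A short sketch of what this pre-tokenizer does on its own. In the `tokenizers` library, `Whitespace` splits with the pattern `\w+|[^\w\s]+`, so it separates punctuation from words in addition to splitting on spaces. The sample sentences are chosen for illustration.

```python
from tokenizers import pre_tokenizers

pre_tok = pre_tokenizers.Whitespace()

# pre_tokenize_str returns (piece, (start, end)) tuples; keep only the pieces.
latin = [piece for piece, _span in pre_tok.pre_tokenize_str("Hello, world!")]
print(latin)  # ['Hello', ',', 'world', '!']

text = "অসমীয়া ভাষা এটি সুন্দৰ ভাষা।"
assamese = [piece for piece, _span in pre_tok.pre_tokenize_str(text)]
print(assamese)
```

BPE merges are then learned inside each piece, so a merge never crosses a word or punctuation boundary.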
Training Pipeline Overview
SQLite Database
↓
Streaming Generator
↓
Unicode Normalization
↓
Whitespace Pre-tokenization
↓
BPE Training
↓
Special Token Processing
↓
Hugging Face Export
Project Structure
assamese_tokenizer/
│
├── tokenizer.json
├── tokenizer_config.json
└── README.md
Generated Files
tokenizer.json
Contains:
- Vocabulary
- Merge rules
- Decoder rules
- Normalization configuration
- Pre-tokenizer configuration
- Post-processing templates
- Special token definitions
This is the core tokenizer model.
tokenizer_config.json
Contains:
- Hugging Face metadata
- Special token configuration
- Model max length
- Tokenizer settings
This enables easy loading with:
AutoTokenizer.from_pretrained()
Example Usage
Loading the Tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("./assamese_tokenizer")
Encoding Assamese Text
text = "অসমীয়া ভাষা এটি সুন্দৰ ভাষা।"
encoded = tokenizer(text)
print(encoded["input_ids"])
Decoding
decoded = tokenizer.decode(encoded["input_ids"])
print(decoded)
Design Philosophy
This tokenizer was not built as a generic multilingual tokenizer.
It was designed specifically for Assamese language modeling.
The focus is:
- linguistic preservation
- scalable infrastructure
- efficient training
- Assamese-first optimization
- future LLM compatibility
The long-term vision is to help build fully native Assamese AI systems.
Future Goals
Planned improvements include:
- Larger Assamese corpora training
- Improved punctuation handling
- Advanced normalization research
- Token compression benchmarking
- Comparison against multilingual tokenizers
- Native Assamese conversational models
- Public Hugging Face release
- Integration with custom Assamese GPT models
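For the benchmarking and comparison goals, one simple metric is the average number of tokens per whitespace-separated word (lower means better compression). The sketch below is only an illustration of that metric: the function `tokens_per_word` and the two stand-in encoders are assumptions, and no results for this tokenizer are implied.

```python
def tokens_per_word(encode, sentences):
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = sum(len(encode(sentence)) for sentence in sentences)
    total_words = sum(len(sentence.split()) for sentence in sentences)
    return total_tokens / total_words

# Stand-in encoders marking the two extremes.
def word_level(sentence):
    return sentence.split()                            # one token per word

def char_level(sentence):
    return [ch for ch in sentence if not ch.isspace()]  # one token per code point

sentences = ["অসমীয়া ভাষা এটি সুন্দৰ ভাষা।"]
word_score = tokens_per_word(word_level, sentences)
char_score = tokens_per_word(char_level, sentences)
print(word_score, char_score)
```

To compare real tokenizers, each could be passed in as `lambda s: tokenizer.encode(s, add_special_tokens=False)` and evaluated on the same held-out Assamese sentences.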
Author
Ranjit Das
Developer focused on Assamese AI infrastructure, tokenization systems, large-scale datasets, and native language model development.
License
This project is intended for research and educational purposes. It is free for all users, including for commercial use.