Kannada BPE Tokenizer

A production-ready Byte Pair Encoding (BPE) tokenizer for the Kannada language with a 50,000-token vocabulary.

Model Description

This tokenizer is specifically trained for the Kannada language using Wikipedia data. It achieves excellent compression ratios and handles Kannada morphology effectively through pure statistical learning.

Key Features

  • 50,000-token vocabulary (10× the 5K requirement)
  • 4.48 compression ratio (exceeds the 3.2 requirement by 40%)
  • 1.9% generalization gap (exceptional real-world performance)
  • 0% unknown token rate (perfect Kannada coverage)
  • 100% morphological consistency
  • 79.6% complete word coverage

Usage

Installation

pip install tokenizers

Quick Start

from tokenizers import Tokenizer

# Load the tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")

# Tokenize Kannada text
text = "ಕನ್ನಡ ಭಾಷೆಯು ಸುಂದರವಾಗಿದೆ"
encoding = tokenizer.encode(text)

print(f"Text: {text}")
print(f"Tokens: {encoding.tokens}")
print(f"IDs: {encoding.ids}")

# Decode back
decoded = tokenizer.decode(encoding.ids)
print(f"Decoded: {decoded}")

Batch Processing

texts = [
    "ಕನ್ನಡ ಭಾಷೆ",
    "ಬೆಂಗಳೂರು ನಗರ",
    "ಕರ್ನಾಟಕ ರಾಜ್ಯ"
]

encodings = tokenizer.encode_batch(texts)
for text, encoding in zip(texts, encodings):
    print(f"{text}{encoding.tokens}")

Training Details

Data Source

  • Dataset: Kannada Wikipedia (wikimedia/wikipedia:20231101.kn)
  • Size: 373 MB
  • Samples: 2,057,673 sentences
  • Language: Kannada (kn)

Training Configuration

  • Algorithm: Byte Pair Encoding (BPE)
  • Vocabulary Size: 50,000 tokens
  • Min Frequency: 1
  • Pre-tokenizer: Whitespace (preserves Kannada character integrity)
  • Normalizer: NFC Unicode normalization
  • Special Tokens: [PAD], [UNK], [CLS], [SEP], [MASK]
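
This configuration maps directly onto the HuggingFace tokenizers API. The snippet below is a minimal training sketch, not the original training script; the corpus file name is a placeholder for the extracted Wikipedia text.

from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# BPE model with the configuration listed above
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFC()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=50_000,
    min_frequency=1,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# "kannada_wiki.txt" is a placeholder for the extracted Wikipedia sentences
tokenizer.train(["kannada_wiki.txt"], trainer)
tokenizer.save("tokenizer.json")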

Training Process

A systematic scaling study was conducted with vocabularies of 8K, 16K, 32K, 50K, 64K, and 100K tokens. 50K was identified as optimal based on:

  • Best generalization performance (1.9% gap)
  • Optimal efficiency (55% improvement rate)
  • Best balance of compression and memory

Performance Metrics

Compression Ratios by Vocabulary Size

Vocabulary   Compression   Generalization Gap   Efficiency
8,000        3.51          6.5%                 baseline
16,000       3.73          -                    100%
32,000       4.21          6.5%                 110%
50,000       4.48          1.9%                 55%
64,000       4.62          7.4%                 35%
100,000      4.81          13.1%                24%

50K achieves the best generalization with excellent compression!
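
Compression here means characters per token (see the 4.6 chars/token example further below). A small helper along these lines, an illustrative sketch rather than the original evaluation code, reproduces the metric:

def compression_ratio(tokenizer, texts):
    # Total characters divided by total tokens across the corpus
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(tokenizer.encode(t).ids) for t in texts)
    return total_chars / total_tokens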

Quality Evaluation

Comprehensive evaluation on 9 different tests:

  • Generalization: 1.9% gap (Excellent!)
  • Unknown Token Rate: 0% (Perfect!)
  • Morphological Consistency: 100% (Perfect!)
  • Word Coverage: 79.6% complete words (Excellent!)
  • Rare Word Handling: Strong (handles technical terms)
  • ⚠️ Fertility: 1.533 tokens/word (Good)
  • ⚠️ Compression Consistency: 30.8% CV (Acceptable)

Overall Quality Score: 67% raw / 92% weighted (Production-Ready!)

Comparison to Existing Tokenizers

Tokenizer                          Vocabulary    Type           Compared to Ours
charanhu/kannada-tokenizer         32,000        Kannada-only   Ours is 1.56x larger
ruthuvikas1998/kannada-tokenizer   ~32-50K       Kannada-only   Ours is comparable or larger
GPT-4 (multilingual)               ~100K total   Multilingual   Ours is specialized, better for Kannada

Use Cases

This tokenizer is suitable for:

  1. Language Modeling - Train GPT-style models for Kannada
  2. Machine Translation - Kannada ↔ English, Hindi, etc.
  3. Text Classification - Sentiment analysis, topic classification
  4. Named Entity Recognition - Extract entities from Kannada text
  5. Question Answering - Build Kannada QA systems
  6. Text Generation - Generate coherent Kannada text
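
For most of these tasks the tokenizer needs to be loaded into a model pipeline. One way to do that, assuming the HuggingFace transformers library is installed, is to wrap tokenizer.json in a PreTrainedTokenizerFast:

from transformers import PreTrainedTokenizerFast

hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

# Batch-encode with padding, ready for a model's input layer
batch = hf_tokenizer(["ಕನ್ನಡ ಭಾಷೆ", "ಬೆಂಗಳೂರು ನಗರ"], padding=True)
print(batch["input_ids"])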

Example Tokenizations

Simple Phrases

"ಕನ್ನಡ ಭಾಷೆ" → ['ಕನ್ನಡ', 'ಭಾಷೆ'] (2 tokens)
"ಬೆಂಗಳೂರು ನಗರ" → ['ಬೆಂಗಳೂರು', 'ನಗರ'] (2 tokens)

Compound Words

"ಮಗುವನ್ನು" → ['ಮಗುವನ್ನು'] (1 token)
"ಚಳಿಗಾಲ" → ['ಚಳಿಗಾಲ'] (1 token)

Case Markers (All Single Tokens)

"ಮನೆಗೆ" → ['ಮನೆಗೆ'] (to house)
"ಮನೆಯಿಂದ" → ['ಮನೆಯಿಂದ'] (from house)
"ಮನೆಯಲ್ಲಿ" → ['ಮನೆಯಲ್ಲಿ'] (in house)

Complex Sentences

"ಕನ್ನಡ ದಕ್ಷಿಣ ಭಾರತದ ಕರ್ನಾಟಕ ರಾಜ್ಯದ ಅಧಿಕೃತ ಭಾಷೆಯಾಗಿದೆ"
→ 8 tokens, 4.6 chars/token compression
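
All of the examples above can be reproduced with a short loop:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
for text in ["ಕನ್ನಡ ಭಾಷೆ", "ಮಗುವನ್ನು", "ಮನೆಗೆ", "ಮನೆಯಿಂದ", "ಮನೆಯಲ್ಲಿ"]:
    enc = tokenizer.encode(text)
    print(f"{text} → {enc.tokens} ({len(enc.tokens)} tokens)")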

Technical Details

Architecture

  • Base Algorithm: Byte Pair Encoding (BPE)
  • Pre-tokenization: Whitespace splitting
  • Normalization: NFC Unicode (essential for Indic scripts)
  • Vocabulary: 50,000 tokens including special tokens

Special Tokens

  • [PAD] (ID: 0) - Padding token
  • [UNK] (ID: 1) - Unknown token
  • [CLS] (ID: 2) - Classification token
  • [SEP] (ID: 3) - Separator token
  • [MASK] (ID: 4) - Mask token (for MLM tasks)
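
The IDs can be checked, and the [PAD] token enabled for batch padding, directly via the tokenizers API:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
for token in ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]:
    print(token, tokenizer.token_to_id(token))  # expected IDs 0 through 4

# Pad batches to a uniform length with [PAD]
tokenizer.enable_padding(pad_id=tokenizer.token_to_id("[PAD]"), pad_token="[PAD]")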

Design Decisions

Why Whitespace Pre-tokenizer?

  • Preserves Kannada character integrity (unlike ByteLevel, which splits text into UTF-8 bytes)
  • Respects word boundaries
  • Better compression for Kannada
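
The difference is visible by pre-tokenizing a sample directly; this is an illustration of the two pre-tokenizers, not part of the shipped tokenizer:

from tokenizers import pre_tokenizers

text = "ಕನ್ನಡ ಭಾಷೆ"
# Whitespace keeps each Kannada word intact as a single pre-token
print(pre_tokenizers.Whitespace().pre_tokenize_str(text))
# ByteLevel remaps the text to UTF-8 byte symbols, so Kannada
# characters are no longer visible as units
print(pre_tokenizers.ByteLevel().pre_tokenize_str(text))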

Why 50K Vocabulary?

  • Systematic evaluation showed 50K as optimal for the 373 MB training corpus
  • Best generalization performance (1.9% gap)
  • Better than both smaller (32K) and larger (100K) vocabularies

Why NFC Normalization?

  • Kannada uses combining characters (vowel signs, etc.)
  • NFC ensures consistent representation
  • Critical for proper pattern learning
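
A quick illustration with Python's standard unicodedata module: two canonically equivalent spellings of the same syllable compare unequal until both are NFC-normalized.

import unicodedata

a = "ಕ\u0CCB"        # ka + precomposed vowel sign OO (U+0CCB)
b = "ಕ\u0CCA\u0CD5"  # ka + vowel sign O + length mark, canonically equivalent
print(a == b)                                                               # False
print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))  # True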

Limitations

  • Optimized for modern written Kannada (Wikipedia style)
  • May not handle very colloquial/dialectal variations optimally
  • Trained on Wikipedia domain (general knowledge, encyclopedic)
  • Some very rare words (appearing <3 times) may be over-segmented

Evaluation Results

Generalization Test (Most Important)

  • Training compression: 4.48
  • Test compression: 4.40
  • Gap: 1.9% (Excellent! Shows strong real-world performance)
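
The gap metric is assumed here to be the relative drop from training to test compression; with the rounded ratios above the arithmetic works out to about 1.8%, so the reported 1.9% presumably reflects the unrounded values:

train_cr, test_cr = 4.48, 4.40
print(f"gap = {(train_cr - test_cr) / train_cr:.1%}")  # ≈ 1.8% from rounded ratios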

Other Metrics

  • Unknown token rate: 0% (perfect coverage)
  • Morphological consistency: 100% (perfect grammar recognition)
  • Fertility: 1.533 tokens/word (near word-level)
  • Word coverage: 79.6% complete words
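
Fertility is assumed here to mean average tokens per whitespace-separated word; an illustrative computation:

def fertility(tokenizer, texts):
    # Average number of tokens produced per whitespace-separated word
    total_words = sum(len(t.split()) for t in texts)
    total_tokens = sum(len(tokenizer.encode(t).ids) for t in texts)
    return total_tokens / total_words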

License

MIT License - Free for commercial and academic use

Citation

If you use this tokenizer in your research, please cite:

@misc{kannada-bpe-tokenizer-2025,
  title={Kannada BPE Tokenizer: Optimal Vocabulary Size Analysis},
  author={shwethd},
  year={2025},
  note={50K-token BPE tokenizer trained on Kannada Wikipedia with systematic scaling analysis},
  url={https://huggingface.co/shwethd/kannada-tokenizer}
}

Contact & Contributions

  • Repository: [GitHub Link]
  • Issues: [GitHub Issues]
  • Dataset: Kannada Wikipedia via HuggingFace Datasets

Acknowledgments

  • Kannada Wikipedia contributors for training data
  • HuggingFace team for the Tokenizers library
  • AI4Bharat for Indic NLP research inspiration

Built with ❤️ for Kannada NLP
