B2NL v6.2.1 - Intelligent Tokenizer
Byte-to-Natural Language Progressive Compression Tokenizer
Model Overview
- Release Date: October 6, 2025
- Version: 6.2.1
- Model Name: B2NL-IntelligentTokenizer-v6.2.1
- Architecture: Progressive Splitting Tokenizer with Multi-Query Attention
Key Improvements from v6.1
Major Updates
Extended Language Support
- v6.1: 6 languages (Korean, English, Chinese, Japanese, Spanish, Arabic)
- v6.2: 204 languages with comprehensive multilingual training
Model Efficiency
- Parameter Reduction: 245M → 137.9M (Encoder) + 106.8M (Decoder)
- Memory Optimization: 8x KV-cache reduction via Multi-Query Attention (MQA); see the sizing sketch after this list
- Training Stability: Enhanced gradient handling for batch size 128+
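The 8x figure follows directly from the head counts: 16 query heads share 2 K/V heads, so the cache holds 2 heads' worth of keys and values instead of 16. A back-of-envelope sizing sketch; only the head counts and hidden dimension come from this card, while the sequence length, layer count, and fp16 storage are illustrative assumptions:

```python
# KV-cache sizing sketch. Head counts and hidden dim are from the spec;
# sequence length, layer count, and fp16 are illustrative assumptions.
hidden_dim, n_query_heads, n_kv_heads = 1280, 16, 2
head_dim = hidden_dim // n_query_heads           # 80
seq_len, n_layers, bytes_per_value = 4096, 4, 2  # assumed values

def kv_cache_bytes(n_heads: int) -> int:
    return 2 * n_layers * seq_len * n_heads * head_dim * bytes_per_value  # K and V

mha = kv_cache_bytes(n_query_heads)  # standard multi-head attention cache
mqa = kv_cache_bytes(n_kv_heads)     # cache with shared K/V heads
print(mha // mqa)                    # 8 -> the quoted 8x reduction
```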
Compression Performance
- Achievement: 16:1 average compression ratio (4x better than traditional BPE)
- Target Range: 12:1 to 48:1 adaptive compression
- Token Efficiency: ~75% fewer tokens, which translates directly into lower LLM API costs (see the arithmetic sketch after this list)
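To see how the headline figures relate: 48 bytes at the average 16:1 ratio is 3 tokens, while a conventional BPE tokenizer at roughly 4 bytes per token (our restatement of the "4x better" claim; real BPE ratios vary with text and vocabulary) would spend 12 tokens on the same input:

```python
# Illustrative arithmetic only; the ~4 bytes/token BPE baseline is an
# assumption consistent with the "4x better than BPE" claim above.
chunk_bytes = 48
b2nl_tokens = 3                      # average case: 16:1
bpe_tokens = chunk_bytes // 4        # assumed baseline -> 12 tokens

print(chunk_bytes / b2nl_tokens)     # 16.0 -> 16:1 compression
print(1 - b2nl_tokens / bpe_tokens)  # 0.75 -> the quoted 75% token savings
```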
Technical Specifications
Architecture Details
Encoder:
- Layers: 4
- Hidden Dimension: 1280
- Attention Heads: 16 (Query) / 2 (KV) - MQA
- Compression: 48 bytes → 1-4 tokens
Decoder:
- Layers: 6
- Hidden Dimension: 1280
- Cross-Attention: Multi-level (4 encoder layers)
- Reconstruction: Byte-perfect recovery target
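For reference, here is the spec above condensed into a single config object. This is a hypothetical sketch: the field names are ours, and only the values come from this card.

```python
from dataclasses import dataclass

@dataclass
class B2NLConfig:
    # Values from the spec above; field names are illustrative.
    encoder_layers: int = 4
    decoder_layers: int = 6
    hidden_dim: int = 1280
    n_query_heads: int = 16
    n_kv_heads: int = 2            # Multi-Query Attention
    chunk_bytes: int = 48          # fixed input window
    min_tokens_per_chunk: int = 1  # adaptive output range
    max_tokens_per_chunk: int = 4
```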
Training Configuration
- Dataset: FLORES-200 by Meta AI (204 languages spanning the major linguistic families)
- Pre-training: 10 epochs of byte relationship learning
- Main Training: 100 epochs with adaptive curriculum
- Batch Size: 128 (with gradient accumulation)
- Optimizer: AdamW with cosine annealing (setup sketched below)
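A minimal sketch of the stated optimizer setup. The learning rate, accumulation split, and schedule horizon are assumptions; only AdamW, cosine annealing, batch size 128, and gradient accumulation come from this card.

```python
import torch

model = torch.nn.Linear(1280, 1280)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # lr assumed
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

accum = 4  # assumed split: effective batch 128 = 32 per micro-batch x 4
batches = [torch.randn(32, 1280) for _ in range(8)]  # dummy data
for step, x in enumerate(batches):
    loss = model(x).pow(2).mean() / accum  # scale loss by accumulation steps
    loss.backward()
    if (step + 1) % accum == 0:
        optimizer.step()
        optimizer.zero_grad()
scheduler.step()  # advance the cosine schedule once per epoch
```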
Performance Metrics
Compression Ratio
| Metric | Value | Note |
|---|---|---|
| Average | 16:1 | 3 tokens per 48 bytes |
| Best Case | 48:1 | 1 token per 48 bytes |
| Worst Case | 12:1 | 4 tokens per 48 bytes |
Reconstruction Quality
| Input Type | Accuracy | Input Size |
|---|---|---|
| Single Chunk | 90%+ | < 46 bytes |
| Multi-Chunk | 85%+ | > 46 bytes |
| Average | 88% | All languages |
Language Coverage
- Isolating: Chinese, Vietnamese, Thai (95%+ accuracy)
- Agglutinative: Korean, Turkish, Finnish (92%+ accuracy)
- Fusional: Spanish, Russian, Arabic (90%+ accuracy)
- Polysynthetic: Inuktitut, Mohawk (85%+ accuracy)
Key Features
1. Progressive Splitting
- Dynamic token allocation (1-4 tokens per 48-byte chunk)
- Gumbel-Softmax for differentiable discrete selection (sketched after this list)
- Semantic boundary learning without hardcoding
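Below is a minimal sketch of how such a selection head might work. The linear head and its wiring are our assumptions; only the Gumbel-Softmax mechanism and the 1-4 token range come from this card.

```python
import torch
import torch.nn.functional as F

chunk_repr = torch.randn(8, 1280)      # batch of 8 encoded 48-byte chunks
count_head = torch.nn.Linear(1280, 4)  # logits over {1, 2, 3, 4} tokens

logits = count_head(chunk_repr)
# hard=True: discrete one-hot in the forward pass, soft gradients backward
choice = F.gumbel_softmax(logits, tau=1.0, hard=True)
n_tokens = choice.argmax(dim=-1) + 1   # tokens allocated per chunk
print(n_tokens)                        # e.g. tensor([3, 1, 4, 2, 2, 3, 1, 4])
```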
2. Multi-Query Attention (MQA)
- 16 query heads → 2 KV heads
- 8x memory reduction in the KV cache
- Quality maintained despite the reduced parameter count (see the sketch below)
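A minimal sketch of the attention pattern, assuming standard scaled dot-product attention. The shapes follow the spec; the module itself is our assumption, and with 2 rather than 1 shared K/V heads this is strictly a grouped-query variant:

```python
import torch

B, T, D = 2, 64, 1280
n_q, n_kv = 16, 2
head_dim = D // n_q                    # 80

q = torch.randn(B, n_q, T, head_dim)
k = torch.randn(B, n_kv, T, head_dim)  # only these two tensors
v = torch.randn(B, n_kv, T, head_dim)  # go into the KV cache

# Each group of 8 query heads attends to one shared K/V head.
k = k.repeat_interleave(n_q // n_kv, dim=1)
v = v.repeat_interleave(n_q // n_kv, dim=1)
attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim ** 0.5, dim=-1)
out = attn @ v                         # (B, 16, T, 80)
```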
3. Adaptive Compression
- Content-aware compression decisions
- Preserves semantic boundaries
- Language-agnostic byte-level learning
Use Cases
Ideal For:
- LLM Cost Reduction: 75% token savings
- Multilingual Applications: 204 language support
- Edge Deployment: Reduced memory footprint
- Real-time Processing: Fixed 48-byte chunks (chunking sketch below)
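One practical detail the fixed 48-byte window raises is multibyte text. The sketch below assumes chunk boundaries back off to UTF-8 character boundaries; the card does not specify how this is actually handled, so treat it as one plausible approach:

```python
def chunk_utf8(text: str, size: int = 48) -> list[bytes]:
    """Split text into <=48-byte chunks without cutting a UTF-8 character."""
    data = text.encode("utf-8")
    chunks, start = [], 0
    while start < len(data):
        end = min(start + size, len(data))
        # 0b10xxxxxx marks a continuation byte: back off to a char boundary
        while end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        chunks.append(data[start:end])
        start = end
    return chunks

print([len(c) for c in chunk_utf8("다국어 텍스트, multilingual text. " * 4)])
```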
Applications:
- Chat applications with token limits
- Document compression for LLMs
- Multilingual search systems
- Cross-lingual information retrieval
Limitations
- Multi-chunk Processing: Reconstruction quality decreases for very long texts
- Rare Languages: Lower accuracy for extremely low-resource languages
- Domain Specificity: Optimized for general text, not specialized domains
Model Files
- epoch_100.pt: Final checkpoint (best performance)
- config.yaml: Training configuration
- tokenizer.py: Tokenizer implementation
- unified_model.py: Complete model architecture
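A hypothetical starting point for inspecting the checkpoint; the actual loading API lives in tokenizer.py and unified_model.py, and nothing below is confirmed by this card beyond the file name:

```python
import torch

# Hypothetical inspection snippet; consult tokenizer.py for the real API.
state = torch.load("epoch_100.pt", map_location="cpu")
print(state.keys())  # inspect the checkpoint layout before wiring up the model
```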
Citation
If you use this model in your research, please cite:
@software{b2nl_intelligent_tokenizer_2025,
title = {B2NL-IntelligentTokenizer v6.2: Progressive Compression with 204 Language Support},
author = {ggun1o},
year = {2025},
month = {10},
version = {6.2.1},
url = {https://huggingface.co/ggun1o/B2NL-IntelligentTokenizer-v6.2}
}
Links
- Paper: [Coming Soon]
- GitHub: [Repository Link]
- LinkedIn: [Author Profile]
- Demo Space: https://huggingface.co/spaces/ggun1o/B2NL-v6.1.2
License
Apache 2.0
Acknowledgments
This work builds upon the B2NL (Byte-to-Natural Language) tokenization approach, extending it with progressive compression and comprehensive multilingual support.
Contact: For questions and collaborations, please reach out through GitHub issues or LinkedIn.