B2NL v6.2.1 - Intelligent Tokenizer

Byte-to-Natural Language Progressive Compression Tokenizer

Model Overview

  • Release Date: October 6, 2025
  • Version: 6.2.1
  • Model Name: B2NL-IntelligentTokenizer-v6.2.1
  • Architecture: Progressive Splitting Tokenizer with Multi-Query Attention

Key Improvements from v6.1

🚀 Major Updates

  1. Extended Language Support

    • v6.1: 6 languages (Korean, English, Chinese, Japanese, Spanish, Arabic)
    • v6.2: 204 languages with comprehensive multilingual training
  2. Model Efficiency

    • Parameter Reduction: 245M → 137.9M (Encoder) + 106.8M (Decoder)
    • Memory Optimization: KV cache with 8x reduction through Multi-Query Attention (MQA)
    • Training Stability: Enhanced gradient handling for batch size 128+
  3. Compression Performance

    • Achievement: 16:1 compression ratio (4x better than traditional BPE)
    • Target Range: 12:1 to 48:1 adaptive compression
    • Token Efficiency: 75% reduction in LLM API costs

Technical Specifications

Architecture Details

Encoder:
  - Layers: 4
  - Hidden Dimension: 1280
  - Attention Heads: 16 (Query) / 2 (KV) - MQA
  - Compression: 48 bytes → 1-4 tokens

Decoder:
  - Layers: 6
  - Hidden Dimension: 1280
  - Cross-Attention: Multi-level (4 encoder layers)
  - Reconstruction: Byte-perfect recovery target
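
For orientation, the same specification can be written as a small Python configuration sketch; the field names below are illustrative and are not the actual identifiers used in unified_model.py:

from dataclasses import dataclass

@dataclass
class EncoderConfig:
    num_layers: int = 4
    hidden_dim: int = 1280
    num_query_heads: int = 16      # MQA: 16 query heads ...
    num_kv_heads: int = 2          # ... share 2 key/value heads
    chunk_bytes: int = 48          # each 48-byte chunk compresses to 1-4 tokens
    max_tokens_per_chunk: int = 4

@dataclass
class DecoderConfig:
    num_layers: int = 6
    hidden_dim: int = 1280
    encoder_levels: int = 4        # cross-attention over all 4 encoder layers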

Training Configuration

  • Dataset: FLORES-200 by Meta AI (204 languages from all linguistic families)
  • Pre-training: 10 epochs of byte relationship learning
  • Main Training: 100 epochs with adaptive curriculum
  • Batch Size: 128 (with gradient accumulation)
  • Optimizer: AdamW with cosine annealing
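
A minimal training-loop sketch matching the stated setup (AdamW, cosine annealing over 100 epochs, effective batch size 128 via gradient accumulation). The stand-in model, data, learning rate, and clipping value are assumptions, not the contents of config.yaml:

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(48, 48)                          # stand-in for the B2NL encoder-decoder
data = TensorDataset(torch.randn(1024, 48), torch.randn(1024, 48))
loader = DataLoader(data, batch_size=16)           # micro-batch of 16

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)  # lr is assumed
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)   # 100-epoch cosine schedule

accum_steps = 8                                    # 16 x 8 = effective batch size 128
for epoch in range(100):
    for step, (x, y) in enumerate(loader):
        loss = nn.functional.mse_loss(model(x), y) / accum_steps
        loss.backward()
        if (step + 1) % accum_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient stability at large batch sizes
            optimizer.step()
            optimizer.zero_grad()
    scheduler.step()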

Performance Metrics

Compression Ratio

Metric     | Value | Note
Average    | 16:1  | 3 tokens per 48 bytes
Best Case  | 48:1  | 1 token per 48 bytes
Worst Case | 12:1  | 4 tokens per 48 bytes
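
The ratios above follow directly from the fixed 48-byte chunk size, and the 75% token-savings figure follows from comparing against a BPE tokenizer at roughly 4 bytes per token (that BPE estimate is an assumption used only for illustration):

CHUNK_BYTES = 48

def compression_ratio(tokens_per_chunk: int) -> float:
    """Bytes of input represented per output token."""
    return CHUNK_BYTES / tokens_per_chunk

print(compression_ratio(1))   # 48.0 -> best case 48:1
print(compression_ratio(3))   # 16.0 -> average 16:1
print(compression_ratio(4))   # 12.0 -> worst case 12:1

bpe_tokens = CHUNK_BYTES / 4  # ~12 tokens per 48 bytes for a typical BPE tokenizer (assumed)
b2nl_tokens = 3               # average case from the table above
print(1 - b2nl_tokens / bpe_tokens)   # 0.75 -> ~75% fewer tokens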

Reconstruction Quality

Input Type   | Accuracy | Sample Set
Single Chunk | 90%+     | < 46 bytes
Multi-Chunk  | 85%+     | > 46 bytes
Average      | 88%      | All languages

Language Coverage

  • Isolating: Chinese, Vietnamese, Thai (95%+ accuracy)
  • Agglutinative: Korean, Turkish, Finnish (92%+ accuracy)
  • Fusional: Spanish, Russian, Arabic (90%+ accuracy)
  • Polysynthetic: Inuktitut, Mohawk (85%+ accuracy)

Key Features

1. Progressive Splitting

  • Dynamic token allocation (1-4 tokens per 48-byte chunk)
  • Gumbel-Softmax for differentiable discrete selection
  • Semantic boundary learning without hardcoding
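
A minimal sketch of how Gumbel-Softmax can drive the 1-4 token decision in a differentiable way; the projection head below is hypothetical and is not the module used in unified_model.py:

import torch
import torch.nn.functional as F

hidden_dim, num_choices = 1280, 4               # choose 1-4 tokens per 48-byte chunk
split_head = torch.nn.Linear(hidden_dim, num_choices)   # hypothetical split-decision head

chunk_repr = torch.randn(2, hidden_dim)         # pooled representations of two chunks
logits = split_head(chunk_repr)

# hard=True yields one-hot choices in the forward pass while gradients flow
# through the relaxed (soft) distribution in the backward pass.
choice = F.gumbel_softmax(logits, tau=1.0, hard=True)   # shape (2, 4)
num_tokens = choice.argmax(dim=-1) + 1                  # 1..4 tokens for each chunk
print(num_tokens)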

2. Multi-Query Attention (MQA)

  • 16 query heads → 2 KV heads
  • 8x memory reduction in KV cache
  • Maintained quality with reduced parameters
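
A shape-level sketch of the attention layout described above (16 query heads sharing 2 cached key/value heads); this is illustrative and not the project's implementation:

import torch
from torch import nn
import torch.nn.functional as F

d_model, n_q_heads, n_kv_heads = 1280, 16, 2
head_dim = d_model // n_q_heads                  # 80

q_proj = nn.Linear(d_model, n_q_heads * head_dim)
k_proj = nn.Linear(d_model, n_kv_heads * head_dim)   # only 2 heads of K ...
v_proj = nn.Linear(d_model, n_kv_heads * head_dim)   # ... and V are ever cached

x = torch.randn(1, 48, d_model)                  # one 48-position chunk
q = q_proj(x).view(1, 48, n_q_heads, head_dim).transpose(1, 2)
k = k_proj(x).view(1, 48, n_kv_heads, head_dim).transpose(1, 2)
v = v_proj(x).view(1, 48, n_kv_heads, head_dim).transpose(1, 2)

# Each group of 16 / 2 = 8 query heads reuses the same key/value head.
k = k.repeat_interleave(n_q_heads // n_kv_heads, dim=1)
v = v.repeat_interleave(n_q_heads // n_kv_heads, dim=1)
out = F.scaled_dot_product_attention(q, k, v)    # (1, 16, 48, 80)

print(n_q_heads // n_kv_heads)                   # 8 -> KV cache is 8x smaller than with 16 KV heads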

3. Adaptive Compression

  • Content-aware compression decisions
  • Preserves semantic boundaries
  • Language-agnostic byte-level learning

Use Cases

Ideal For:

  • LLM Cost Reduction: 75% token savings
  • Multilingual Applications: 204 language support
  • Edge Deployment: Reduced memory footprint
  • Real-time Processing: Fixed 48-byte chunks
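
For real-time use, input is assumed to be split into fixed 48-byte chunks before encoding; the helper below is an illustrative sketch, not the actual preprocessing in tokenizer.py:

def chunk_bytes(text: str, chunk_size: int = 48) -> list[bytes]:
    # Fixed-size byte split. A chunk boundary may fall inside a multibyte
    # UTF-8 character; a byte-level model consumes raw bytes, so this sketch
    # makes no attempt to align chunks to character boundaries.
    data = text.encode("utf-8")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

chunks = chunk_bytes("안녕하세요, world! " * 5)
print([len(c) for c in chunks])   # every chunk is at most 48 bytes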

Applications:

  • Chat applications with token limits
  • Document compression for LLMs
  • Multilingual search systems
  • Cross-lingual information retrieval

Limitations

  1. Multi-chunk Processing: Reconstruction quality decreases for very long texts
  2. Rare Languages: Lower accuracy for extremely low-resource languages
  3. Domain Specificity: Optimized for general text, not specialized domains

Model Files

  • epoch_100.pt: Final checkpoint (best performance)
  • config.yaml: Training configuration
  • tokenizer.py: Tokenizer implementation
  • unified_model.py: Complete model architecture
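
A hedged loading sketch: the file names come from the list above, but the class and method names in the comments are placeholders, not the actual API exposed by tokenizer.py / unified_model.py:

import torch

state = torch.load("epoch_100.pt", map_location="cpu")   # inspect the released checkpoint
print(type(state))

# A typical usage pattern would then look roughly like (hypothetical names):
#   from unified_model import UnifiedModel
#   model = UnifiedModel(config="config.yaml")
#   model.load_state_dict(state)               # or a nested key such as state["model_state"]
#   tokens = model.encode("Hello, world!")
#   text = model.decode(tokens)                # byte-perfect recovery target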

Citation

If you use this model in your research, please cite:

@software{b2nl_intelligent_tokenizer_2025,
  title = {B2NL-IntelligentTokenizer v6.2: Progressive Compression with 204 Language Support},
  author = {ggun1o},
  year = {2025},
  month = {10},
  version = {6.2.1},
  url = {https://huggingface.co/ggun1o/B2NL-IntelligentTokenizer-v6.2}
}

License

Apache 2.0

Acknowledgments

This work builds upon the B2NL (Byte-to-Natural Language) tokenization approach, extending it with progressive compression and comprehensive multilingual support.


Contact: For questions and collaborations, please reach out through GitHub issues or LinkedIn.