---
language: en
license: apache-2.0
tags:
- i3-architecture
- hybrid-model
- rwkv-mamba
- custom_code
datasets:
- agentlans/high-quality-english-sentences
- roneneldan/TinyStories
- starhopp3r/TinyChat
- Salesforce/wikitext
library_name: transformers
pipeline_tag: text-generation
---

# i3-200M - Hybrid Architecture Language Model

![Gemini_Generated_Image_ore10zore10zore1](https://cdn-uploads.huggingface.co/production/uploads/6615494716917dfdc645c44e/2TolPRQRuypbh7-LTXMe6.png)

## Model Description

The **i3-200M Model** (aka Redherring) is a language model with a novel hybrid architecture that combines convolutional/recurrent layers with full attention layers for efficient language modeling. The architecture blends RWKV-style time-mixing with Mamba state-space dynamics in the early layers, followed by standard multi-head attention in the deeper layers.

> [!NOTE]
> To try the model, use the demo [here](https://huggingface.co/spaces/FlameF0X/i3-200m).

## Model Statistics

- **Total Parameters**: ~169.85M
- **Architecture**: 10 hybrid (RWKV-Mamba) layers + 6 full attention layers = 16 total layers
- **Vocabulary Size**: 32,000 tokens
- **Hidden Dimension (d_model)**: 512
- **Attention Heads**: 16
- **State Dimension (d_state)**: 32
- **Max Sequence Length**: 256
- **Tokenization**: BPE

### Architecture Breakdown

```
Layers 1-10: RWKV-Mamba Hybrid Blocks (Recurrent/Conv)
├─ RWKVMambaHybrid (Time-mixing + State-space)
└─ Feed-Forward Network (4x expansion)

Layers 11-16: Full Attention Blocks
├─ Multi-Head Attention (16 heads)
└─ Feed-Forward Network (4x expansion)
```

## Comparison with i3-80M

| Feature | i3-22M | i3-80M | i3-200M (This Model) |
|---------|--------|--------|----------------------|
| **Parameters** | 22.6M | 82.77M | 169.85M |
| **Architecture** | 24 Hybrid Layers | 10 Hybrid + 6 Attention Layers | 10 Hybrid + 6 Attention Layers |
| **Hidden Dimension** | 512 | 512 | 512 |
| **Vocabulary Size** | 4,466 | 35,560 | 32,000 |
| **Training Dataset** | TinyChat only | TinyStories + TinyChat + HQ Sentences | TinyStories + TinyChat + HQ Sentences + Wikitext |
| **Total Tokens** | ~1M conversations | ~3M+ tokens | N/A |
| **Final Loss** | ~2.0 | ~2.0 | 1.6 |
| **Final Perplexity** | 7.29-9.70 | 7.29-10.0 | 5.2 |
| **Training Time** | ~17 hours | ~2-4 hours | ~1-2 hours |
| **Attention Layers** | None (Pure Hybrid) | 6 Full Attention Layers | 6 Full Attention Layers |

### Key Improvements Over i3-80M

To be added.

### Key Features

1. **Hybrid Architecture**: Combines the efficiency of recurrent/convolutional processing with the power of attention
   - Early layers use the RWKV-Mamba hybrid for efficient sequence processing
   - Later layers use full multi-head attention for complex pattern recognition
2. **Memory-Optimized Training**:
   - Streaming vocabulary building (no full text storage)
   - Vocabulary caching (build once, reuse)
   - Efficient chunk frequency counting
   - Automatic memory cleanup
3. **Multi-Dataset Pre-training**: Trained on diverse text sources for robust language understanding
   - TinyStories: narrative and storytelling
   - TinyChat: conversational dynamics
   - High-Quality English Sentences: linguistic diversity
   - Wikitext: encyclopedic text
4. **BPE Tokenization**:
   - Total tokens processed: **N/A**
   - Handles unknown tokens gracefully via dedicated special tokens
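Since the card lists `library_name: transformers`, `pipeline_tag: text-generation`, and the `custom_code` tag, the following is a minimal generation sketch under the assumption that the model loads through transformers' `trust_remote_code` path and that the repo id is `FlameF0X/i3-200m` (matching the demo Space). Adjust to the repository's actual loading instructions if they differ.

```python
# Minimal usage sketch; repo id and loading path are assumptions, not the
# card's documented instructions.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "FlameF0X/i3-200m"  # assumed repo id, matching the demo Space

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")

# Keep prompt + generated text within the model's 256-token context window.
outputs = model.generate(**inputs, max_length=256, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```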
## Training Details

### Training Configuration

- **Datasets**:
  - `agentlans/high-quality-english-sentences`
  - `roneneldan/TinyStories`
  - `starhopp3r/TinyChat`
  - `Salesforce/wikitext`
- **Training Steps**: 250 iterations
- **Batch Size**: 4 (with gradient accumulation support)
- **Learning Rate**: 4e-4 (with warmup and cosine decay)
- **Optimizer**: AdamW with gradient clipping (max norm: 1.0)
- **Hardware**: NVIDIA P100 (16GB VRAM)
- **Training Time**: ~1-2 hours
- **Framework**: PyTorch

### Training Dynamics

- **GPU Utilization**: stable during training (percentage: N/A)
- **GPU Memory**: ~20% allocated (~4GB / 12GB; note: these reported figures are inconsistent with each other)
- **Power Usage**: ~250W average
- **Throughput**: ~300 tokens/sec

### Performance Metrics

| Metric | Initial | Final |
|--------|---------|-------|
| Training Loss | ~10.0 | 1.6 |
| Perplexity | ~4000+ | 5.2 |

## Technical Innovations

1. **RWKV-Mamba Hybrid Recurrence**: Combines RWKV's time-mixing with Mamba's state-space dynamics
   - Linear complexity for long sequences
   - Efficient recurrent processing
   - State-space modeling for temporal dependencies
2. **Hierarchical Processing**:
   - Lower layers focus on local patterns (conv/recurrent)
   - Upper layers capture global dependencies (attention)
3. **Memory Efficiency**:
   - Streaming tokenization during vocab building
   - No full dataset storage in RAM
   - Automatic cleanup of intermediate data

## Model Files

- `pytorch_model.bin`: Model weights
- `config.json`: Model configuration
- `tokenizer.json`: Tokenizer vocabulary

## Limitations

- Trained on English text only
- Limited to a 256-token context window
- May require fine-tuning for specific downstream tasks
- Conversational style influenced by the TinyChat dataset

## Model Series

- [i3-22M](https://huggingface.co/FlameF0X/i3-22m) - Original model with a pure hybrid architecture
- [i3-80M](https://huggingface.co/FlameF0X/i3-80m) - Scaled version with attention layers and multi-dataset training
- **i3-200M** (this model)

## Citation

```bibtex
@article{mamba,
  title={Mamba: Linear-Time Sequence Modeling with Selective State Spaces},
  author={Gu, Albert and Dao, Tri},
  journal={arXiv preprint arXiv:2312.00752},
  year={2023}
}

@article{RWKV,
  title={RWKV: Reinventing RNNs for the Transformer Era},
  author={Peng, Bo and others},
  journal={arXiv preprint arXiv:2305.13048},
  year={2023}
}
```
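For readers who want a concrete picture of the "time-mixing plus state-space recurrence" idea described under Technical Innovations, below is a minimal, illustrative PyTorch sketch. The class name `ToyRWKVMambaHybrid`, the shapes, and the simplified non-selective diagonal scan are assumptions made for exposition only; they do not reflect the actual `RWKVMambaHybrid` implementation used in i3-200M.

```python
# Illustrative-only sketch combining an RWKV-style token-shift time-mix with a
# Mamba-like diagonal state-space scan. NOT the actual i3-200M code.
import torch
import torch.nn as nn


class ToyRWKVMambaHybrid(nn.Module):
    def __init__(self, d_model: int = 512, d_state: int = 32):
        super().__init__()
        # RWKV-style time-mixing: learned per-channel interpolation between
        # the current token and the previous token.
        self.mix = nn.Parameter(torch.full((d_model,), 0.5))
        self.in_proj = nn.Linear(d_model, d_model)
        # Simplified diagonal state-space parameters (decay, input, output).
        self.A = nn.Parameter(torch.rand(d_model, d_state) * -1.0)  # negative log-decay
        self.B = nn.Parameter(torch.randn(d_model, d_state) * 0.1)
        self.C = nn.Parameter(torch.randn(d_model, d_state) * 0.1)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, d_model = x.shape
        # Token shift: previous token, zeros at the first position.
        x_prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        mixed = self.mix * x + (1.0 - self.mix) * x_prev            # time-mixing
        u = self.in_proj(mixed)

        decay = torch.exp(self.A)                                    # in (0, 1)
        state = torch.zeros(batch, d_model, self.A.shape[1], device=x.device)
        ys = []
        for t in range(seq_len):                                     # recurrent scan
            state = decay * state + self.B * u[:, t].unsqueeze(-1)
            ys.append((state * self.C).sum(-1))                      # per-channel readout
        y = torch.stack(ys, dim=1)                                   # (batch, seq_len, d_model)
        return x + self.out_proj(y)                                  # residual connection


# Quick shape check under the card's stated dimensions (d_model=512, d_state=32).
block = ToyRWKVMambaHybrid(d_model=512, d_state=32)
print(block(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```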