---
language: en
license: apache-2.0
tags:
- i3-architecture
- hybrid-model
- rwkv-mamba
- custom_code
datasets:
- agentlans/high-quality-english-sentences
- roneneldan/TinyStories
- starhopp3r/TinyChat
- Salesforce/wikitext
library_name: transformers
pipeline_tag: text-generation
---

# i3-200M - Hybrid Architecture Language Model

![Gemini_Generated_Image_ore10zore10zore1](https://cdn-uploads.huggingface.co/production/uploads/6615494716917dfdc645c44e/2TolPRQRuypbh7-LTXMe6.png)

## Model Description

The **i3-200M Model** (aka Redherring) is a language model with a novel hybrid architecture that combines convolutional/recurrent layers with full attention layers for efficient language modeling. The architecture blends RWKV-style time-mixing with Mamba state-space dynamics in the early layers, followed by standard multi-head attention in the deeper layers.

> [!NOTE]
> To try the model, use the demo [here](https://huggingface.co/spaces/FlameF0X/i3-200m).

## Model Statistics

- **Total Parameters**: ~169.85M
- **Architecture**: 10 hybrid (RWKV-Mamba) layers + 6 full attention layers = 16 total layers
- **Vocabulary Size**: 32,000 tokens
- **Hidden Dimension (d_model)**: 512
- **Attention Heads**: 16
- **State Dimension (d_state)**: 32
- **Max Sequence Length**: 256
- **Tokenization**: BPE

### Architecture Breakdown

```
Layers 1-10: RWKV-Mamba Hybrid Blocks (Recurrent/Conv)
├─ RWKVMambaHybrid (Time-mixing + State-space)
└─ Feed-Forward Network (4x expansion)

Layers 11-16: Full Attention Blocks
├─ Multi-Head Attention (16 heads)
└─ Feed-Forward Network (4x expansion)
```

## Comparison with i3-80M

| Feature | i3-22M | i3-80M | i3-200M (This Model) |
|---------|--------|--------|----------------------|
| **Parameters** | 22.6M | 82.77M | 169.85M |
| **Architecture** | 24 Hybrid Layers | 10 Hybrid + 6 Attention Layers | 10 Hybrid + 6 Attention Layers |
| **Hidden Dimension** | 512 | 512 | 512 |
| **Vocabulary Size** | 4,466 | 35,560 | 32,000 |
| **Training Dataset** | TinyChat only | TinyStories + TinyChat + HQ Sentences | TinyStories + TinyChat + HQ Sentences + Wikitext |
| **Total Tokens** | ~1M conversations | ~3M+ tokens | N/A |
| **Final Loss** | ~2.0 | ~2.0 | 1.6 |
| **Final Perplexity** | 7.29-9.70 | 7.29-10.0 | 5.2 |
| **Training Time** | ~17 hours | ~2-4 hours | ~1-2 hours |
| **Attention Layers** | None (Pure Hybrid) | 6 Full Attention Layers | 6 Full Attention Layers |

### Key Improvements Over i3-80M

To be added.

### Key Features

1. **Hybrid Architecture**: Combines the efficiency of recurrent/convolutional processing with the power of attention
   - Early layers use the RWKV-Mamba hybrid for efficient sequence processing
   - Later layers use full multi-head attention for complex pattern recognition
2. **Memory-Optimized Training**:
   - Streaming vocabulary building (no full text storage)
   - Vocabulary caching (build once, reuse)
   - Efficient chunk frequency counting
   - Automatic memory cleanup
3. **Multi-Dataset Pre-training**: Trained on diverse text sources for robust language understanding
   - TinyStories: narrative and storytelling
   - TinyChat: conversational dynamics
   - High-Quality English Sentences: linguistic diversity
   - Wikitext: encyclopedic text
4. **BPE Tokenization**:
   - Total tokens processed: **N/A**
   - Handles unknown tokens gracefully via dedicated special tokens
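Since the card lists `library_name: transformers`, `pipeline_tag: text-generation`, and the `custom_code` tag, the following is a minimal generation sketch under the assumption that the model loads through transformers' `trust_remote_code` path and that the repo id is `FlameF0X/i3-200m` (matching the demo Space). Adjust to the repository's actual loading instructions if they differ.

```python
# Minimal usage sketch; repo id and loading path are assumptions, not the
# card's documented instructions.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "FlameF0X/i3-200m"  # assumed repo id, matching the demo Space

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")

# Keep prompt + generated text within the model's 256-token context window.
outputs = model.generate(**inputs, max_length=256, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```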
## Training Details

### Training Configuration

- **Datasets**:
  - `agentlans/high-quality-english-sentences`
  - `roneneldan/TinyStories`
  - `starhopp3r/TinyChat`
  - `Salesforce/wikitext`
- **Training Steps**: 250 iterations
- **Batch Size**: 4 (with gradient accumulation support)
- **Learning Rate**: 4e-4 (with warmup and cosine decay)
- **Optimizer**: AdamW with gradient clipping (max norm: 1.0)
- **Hardware**: NVIDIA P100 (16GB VRAM)
- **Training Time**: ~1-2 hours
- **Framework**: PyTorch

### Training Dynamics

- **GPU Utilization**: stable during training (percentage: N/A)
- **GPU Memory**: ~20% allocated (~4GB / 12GB; note: these reported figures are inconsistent with each other)
- **Power Usage**: ~250W average
- **Throughput**: ~300 tokens/sec

### Performance Metrics

| Metric | Initial | Final |
|--------|---------|-------|
| Training Loss | ~10.0 | 1.6 |
| Perplexity | ~4000+ | 5.2 |

## Technical Innovations

1. **RWKV-Mamba Hybrid Recurrence**: Combines RWKV's time-mixing with Mamba's state-space dynamics
   - Linear complexity for long sequences
   - Efficient recurrent processing
   - State-space modeling for temporal dependencies
2. **Hierarchical Processing**:
   - Lower layers focus on local patterns (conv/recurrent)
   - Upper layers capture global dependencies (attention)
3. **Memory Efficiency**:
   - Streaming tokenization during vocab building
   - No full dataset storage in RAM
   - Automatic cleanup of intermediate data

## Model Files

- `pytorch_model.bin`: Model weights
- `config.json`: Model configuration
- `tokenizer.json`: Tokenizer vocabulary

## Limitations

- Trained on English text only
- Limited to a 256-token context window
- May require fine-tuning for specific downstream tasks
- Conversational style influenced by the TinyChat dataset

## Model Series

- [i3-22M](https://huggingface.co/FlameF0X/i3-22m) - Original model with a pure hybrid architecture
- [i3-80M](https://huggingface.co/FlameF0X/i3-80m) - Scaled version with attention layers and multi-dataset training
- **i3-200M** (this model)

## Citation

```bibtex
@article{mamba,
  title={Mamba: Linear-Time Sequence Modeling with Selective State Spaces},
  author={Gu, Albert and Dao, Tri},
  journal={arXiv preprint arXiv:2312.00752},
  year={2023}
}

@article{RWKV,
  title={RWKV: Reinventing RNNs for the Transformer Era},
  author={Peng, Bo and others},
  journal={arXiv preprint arXiv:2305.13048},
  year={2023}
}
```
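For readers who want a concrete picture of the "time-mixing plus state-space recurrence" idea described under Technical Innovations, below is a minimal, illustrative PyTorch sketch. The class name `ToyRWKVMambaHybrid`, the shapes, and the simplified non-selective diagonal scan are assumptions made for exposition only; they do not reflect the actual `RWKVMambaHybrid` implementation used in i3-200M.

```python
# Illustrative-only sketch combining an RWKV-style token-shift time-mix with a
# Mamba-like diagonal state-space scan. NOT the actual i3-200M code.
import torch
import torch.nn as nn


class ToyRWKVMambaHybrid(nn.Module):
    def __init__(self, d_model: int = 512, d_state: int = 32):
        super().__init__()
        # RWKV-style time-mixing: learned per-channel interpolation between
        # the current token and the previous token.
        self.mix = nn.Parameter(torch.full((d_model,), 0.5))
        self.in_proj = nn.Linear(d_model, d_model)
        # Simplified diagonal state-space parameters (decay, input, output).
        self.A = nn.Parameter(torch.rand(d_model, d_state) * -1.0)  # negative log-decay
        self.B = nn.Parameter(torch.randn(d_model, d_state) * 0.1)
        self.C = nn.Parameter(torch.randn(d_model, d_state) * 0.1)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, d_model = x.shape
        # Token shift: previous token, zeros at the first position.
        x_prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        mixed = self.mix * x + (1.0 - self.mix) * x_prev            # time-mixing
        u = self.in_proj(mixed)

        decay = torch.exp(self.A)                                    # in (0, 1)
        state = torch.zeros(batch, d_model, self.A.shape[1], device=x.device)
        ys = []
        for t in range(seq_len):                                     # recurrent scan
            state = decay * state + self.B * u[:, t].unsqueeze(-1)
            ys.append((state * self.C).sum(-1))                      # per-channel readout
        y = torch.stack(ys, dim=1)                                   # (batch, seq_len, d_model)
        return x + self.out_proj(y)                                  # residual connection


# Quick shape check under the card's stated dimensions (d_model=512, d_state=32).
block = ToyRWKVMambaHybrid(d_model=512, d_state=32)
print(block(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```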