---
language: en
license: apache-2.0
tags:
- i3-architecture
- hybrid-model
- rwkv-mamba
- custom_code
datasets:
- agentlans/high-quality-english-sentences
- roneneldan/TinyStories
- starhopp3r/TinyChat
library_name: transformers
pipeline_tag: text-generation
---
# i3-80M - Hybrid Architecture Language Model
## Model Description
The **i3-80M Model** is a language model built on a novel hybrid architecture that combines convolutional/recurrent layers with full attention layers for efficient language modeling. The early layers blend RWKV-style time-mixing with Mamba state-space dynamics; the deeper layers use standard multi-head attention.
This is the second model in the i3 series, scaling up from the original [i3-22M](https://huggingface.co/FlameF0X/i3-22m) with improved architecture and multi-dataset training.
> [!NOTE]
> To try the model, use the hosted demo [here](https://huggingface.co/spaces/FlameF0X/i3-80m).
>
> [Read this page in Romanian :)](https://huggingface.co/FlameF0X/i3-80m/blob/main/CITE%C8%98TEM%C4%82.md)
## Model Statistics
- **Total Parameters**: ~82.77M (82,765,160)
- **Architecture**: 10 Hybrid (RWKV-Mamba) + 6 Full Attention Layers = 16 Total Layers
- **Vocabulary Size**: 35,560 tokens (variable-length chunks with `<UNK>` token)
- **Hidden Dimension (d_model)**: 512
- **Attention Heads**: 16
- **State Dimension (d_state)**: 32
- **Max Sequence Length**: 256
- **Tokenization**: Memory-efficient variable-length chunking (2-3 characters)
### Architecture Breakdown
```
Layers 1-10: RWKV-Mamba Hybrid Blocks (Recurrent/Conv)
├─ RWKVMambaHybrid (Time-mixing + State-space)
└─ Feed-Forward Network (4x expansion)
Layers 11-16: Full Attention Blocks
├─ Multi-Head Attention (16 heads)
└─ Feed-Forward Network (4x expansion)
```
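As a rough illustration of this layout, the sketch below stacks 10 "hybrid" blocks followed by 6 attention blocks with the dimensions listed above (d_model = 512, 16 heads, 4x FFN). The block internals are simplified stand-ins, not the model's actual custom code; in particular, a causal depthwise convolution only approximates the RWKV-Mamba hybrid layer.

```python
# Simplified stand-ins only: the real RWKVMambaHybrid layer lives in the model's
# custom code; here a causal depthwise convolution plays its role.
import torch
import torch.nn as nn

D_MODEL, N_HEADS, N_HYBRID, N_ATTN = 512, 16, 10, 6

def ffn(d_model):
    # Feed-forward network with 4x expansion, as in the breakdown above
    return nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                         nn.Linear(4 * d_model, d_model))

class HybridBlock(nn.Module):
    """Placeholder for the RWKV-Mamba hybrid block (recurrent/conv stage)."""
    def __init__(self, d_model):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        # Depthwise conv; trimming the output keeps only left (causal) context
        self.mix = nn.Conv1d(d_model, d_model, kernel_size=3, padding=2, groups=d_model)
        self.ffn = ffn(d_model)

    def forward(self, x):                      # x: (batch, seq, d_model)
        h = self.mix(self.norm1(x).transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        x = x + h
        return x + self.ffn(self.norm2(x))

class AttentionBlock(nn.Module):
    """Standard pre-norm multi-head attention block (16 heads) plus FFN."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = ffn(d_model)

    def forward(self, x):
        h = self.norm1(x)
        causal = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool,
                                       device=x.device), diagonal=1)
        x = x + self.attn(h, h, h, attn_mask=causal, need_weights=False)[0]
        return x + self.ffn(self.norm2(x))

# 10 hybrid blocks followed by 6 attention blocks = 16 layers in total
stack = nn.Sequential(*[HybridBlock(D_MODEL) for _ in range(N_HYBRID)],
                      *[AttentionBlock(D_MODEL, N_HEADS) for _ in range(N_ATTN)])
print(stack(torch.randn(1, 256, D_MODEL)).shape)   # torch.Size([1, 256, 512])
```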
## Comparison with i3-22M
| Feature | i3-22M | i3-80M (This Model) |
|---------|--------|---------------------|
| **Parameters** | 22.6M | 82.77M |
| **Architecture** | 24 Hybrid Layers | 10 Hybrid + 6 Attention Layers |
| **Hidden Dimension** | 512 | 512 |
| **Vocabulary Size** | 4,466 | 35,560 |
| **Training Dataset** | TinyChat only | TinyStories + TinyChat + HQ Sentences |
| **Training Data** | ~1M conversations | ~3M+ tokens |
| **Final Loss** | ~2.0 | ~2.0 |
| **Final Perplexity** | 7.29-9.70 | 7.29-10.0 |
| **Training Time** | ~17 hours | ~2-4 hours |
| **Attention Layers** | None (Pure Hybrid) | 6 Full Attention Layers |
### Key Improvements Over i3-22M
1. **Attention Layers**: Adds full multi-head attention in the upper layers to capture long-range dependencies
2. **Larger Vocabulary**: 8x larger vocabulary (35,560 vs 4,466) for better token coverage
3. **Multi-Dataset Training**: Trained on 3 diverse datasets vs single dataset
4. **Better Generalization**: Exposure to narratives (TinyStories), conversations (TinyChat), and formal text (HQ Sentences)
5. **Enhanced Unknown Token Handling**: Robust `<UNK>` token system for out-of-vocabulary words
### When to Use Each Model
**Use i3-22M if you need:**
- Smaller model size (~22M params)
- Pure conversational focus (TinyChat specialized)
- Lower memory footprint
- Faster inference
**Use i3-80M if you need:**
- Better general-purpose text generation
- Stronger attention-based reasoning (6 attention layers)
- Larger vocabulary coverage
- Multi-domain text understanding (stories, chat, formal text)
### Key Features
1. **Hybrid Architecture**: Combines the efficiency of recurrent/convolutional processing with the power of attention
- Early layers use RWKV-Mamba hybrid for efficient sequence processing
- Later layers use full multi-head attention for complex pattern recognition
2. **Memory-Optimized Training**:
- Streaming vocabulary building (no full text storage)
- Vocabulary caching (build once, reuse)
- Efficient chunk frequency counting
- Automatic memory cleanup
3. **Multi-Dataset Pre-training**: Trained on diverse text sources for robust language understanding
- TinyStories: Narrative and storytelling
- TinyChat: Conversational dynamics
- High-Quality English Sentences: Linguistic diversity
4. **Smart Tokenization**: Variable-length chunking (2-3 chars) with common trigram optimization
- Total tokens processed: **3,000,000+**
- Handles unknown tokens gracefully with an `<UNK>` token
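The actual chunking rules and vocabulary live in `chunk_vocab_combined.json` and the model's custom tokenizer code. The snippet below is only a guess at the general idea (greedy matching of 3-character, then 2-character chunks, with an `<UNK>` fallback); `build_vocab` and `encode` are hypothetical helpers, not the repository's API.

```python
# Rough illustration of variable-length (2-3 character) chunk tokenization
# with an <UNK> fallback; details may differ from the real tokenizer.
from collections import Counter

def build_vocab(texts, max_size=35_560):
    counts = Counter()
    for text in texts:
        i = 0
        while i < len(text):
            counts[text[i:i + 3]] += 1   # count candidate trigrams
            counts[text[i:i + 2]] += 1   # and bigrams
            i += 2
    vocab = {"<UNK>": 0}
    for chunk, _ in counts.most_common(max_size - 1):
        vocab.setdefault(chunk, len(vocab))
    return vocab

def encode(text, vocab):
    ids, i = [], 0
    while i < len(text):
        # Prefer the 3-char chunk, fall back to 2 chars, else emit <UNK>
        for size in (3, 2):
            chunk = text[i:i + size]
            if chunk in vocab:
                ids.append(vocab[chunk])
                i += size
                break
        else:
            ids.append(vocab["<UNK>"])
            i += 1
    return ids

vocab = build_vocab(["once upon a time there was a tiny model"])
print(encode("once upon a time", vocab))
```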
## Training Details
### Training Configuration
- **Datasets**:
- `agentlans/high-quality-english-sentences`
- `roneneldan/TinyStories`
- `starhopp3r/TinyChat`
- **Training Steps**: 5,000 iterations
- **Batch Size**: 4 (with gradient accumulation support)
- **Learning Rate**: 3e-4 (with warmup and cosine decay)
- **Optimizer**: AdamW with gradient clipping (max norm: 1.0)
- **Hardware**: NVIDIA P100 (16GB VRAM)
- **Training Time**: ~2-4 hours
- **Framework**: PyTorch
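A minimal sketch of this optimization recipe (AdamW, linear warmup into cosine decay, gradient clipping at max norm 1.0) is shown below. The model, data loader, and warmup length are placeholders; the actual training script is not reproduced in this card.

```python
# Sketch of the training loop described above; `model` and `get_batch` are dummy
# stand-ins so the example runs, and the warmup length is an assumption.
import math
import torch
import torch.nn as nn

model = nn.Linear(512, 512)                                        # placeholder model
def get_batch(batch_size, seq_len):                                # placeholder data loader
    x = torch.randn(batch_size, seq_len, 512)
    return x, x

max_steps, warmup_steps, base_lr = 5_000, 200, 3e-4
optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr)

def lr_at(step):
    if step < warmup_steps:                                        # linear warmup
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))      # cosine decay to 0

for step in range(max_steps):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)
    inputs, targets = get_batch(batch_size=4, seq_len=256)
    loss = nn.functional.mse_loss(model(inputs), targets)          # stand-in for the LM loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    optimizer.zero_grad()
```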
### Training Dynamics
- **GPU Utilization**: Stable at ~15-20% during training
- **GPU Memory**: ~18% allocated (~2.2GB / 12GB)
- **Power Usage**: ~40W average
- **Throughput**: ~100-550 tokens/sec
### Performance Metrics
| Metric | Initial | Final |
|--------|---------|-------|
| Training Loss | ~10.0 | ~1.7 |
| Perplexity | ~4000+ | ~6 |
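Assuming perplexity here is reported as the exponential of the cross-entropy loss, the final values are consistent:

```python
import math
print(round(math.exp(1.7), 2))   # ~5.47, in line with the reported final perplexity of ~6
```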

> [!NOTE]
> I don't know why the logging starts at step 4.6k.

**i3-22m** and **i3-80m** comparison

The model shows strong convergence with stable training dynamics and efficient GPU utilization.
## Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer (the custom architecture requires trust_remote_code)
model = AutoModelForCausalLM.from_pretrained("FlameF0X/i3-80m", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("FlameF0X/i3-80m", trust_remote_code=True)

# Generate text
prompt = "hello"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    inputs.input_ids,
    max_length=100,
    do_sample=True,   # sampling must be enabled for temperature/top_k to take effect
    temperature=0.8,
    top_k=40,
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
## Technical Innovations
1. **RWKV-Mamba Hybrid Recurrence**: Combines RWKV's time-mixing with Mamba's state-space dynamics (a toy sketch follows this list)
- Linear complexity for long sequences
- Efficient recurrent processing
- State-space modeling for temporal dependencies
2. **Hierarchical Processing**:
- Lower layers focus on local patterns (conv/recurrent)
- Upper layers capture global dependencies (attention)
3. **Memory Efficiency**:
- Streaming tokenization during vocab building
- No full dataset storage in RAM
- Automatic cleanup of intermediate data
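For intuition, the toy layer below combines the two ingredients named in item 1 in their simplest form: an RWKV-style token shift / time-mixing step and a diagonal, Mamba-like state-space recurrence with linear time complexity. It is a conceptual sketch under those assumptions, not the `RWKVMambaHybrid` layer shipped with the model.

```python
# Conceptual sketch only; the shipped RWKVMambaHybrid layer differs in detail.
import torch
import torch.nn as nn

class ToyTimeMixSSM(nn.Module):
    def __init__(self, d_model=512, d_state=32):
        super().__init__()
        self.mu = nn.Parameter(torch.full((d_model,), 0.5))   # time-mix weight between x_t and x_{t-1}
        self.in_proj = nn.Linear(d_model, d_state)
        self.out_proj = nn.Linear(d_state, d_model)
        self.decay = nn.Parameter(torch.zeros(d_state))       # per-channel state decay (sigmoid -> (0, 1))

    def forward(self, x):                                     # x: (batch, seq, d_model)
        shifted = nn.functional.pad(x, (0, 0, 1, 0))[:, :-1]  # token shift: x_{t-1}, zeros at t = 0
        mixed = self.mu * x + (1 - self.mu) * shifted         # RWKV-style time-mixing
        u = self.in_proj(mixed)
        a = torch.sigmoid(self.decay)
        state = torch.zeros(x.size(0), u.size(-1), device=x.device)
        outs = []
        for t in range(x.size(1)):                            # linear-time recurrence: h_t = a * h_{t-1} + u_t
            state = a * state + u[:, t]
            outs.append(state)
        return self.out_proj(torch.stack(outs, dim=1))

layer = ToyTimeMixSSM()
print(layer(torch.randn(2, 256, 512)).shape)                  # torch.Size([2, 256, 512])
```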
## Model Files
- `pytorch_model.bin`: Model weights
- `config.json`: Model configuration
- `chunk_vocab_combined.json`: Tokenizer vocabulary
## Training Tracking
This model was tracked using Weights & Biases (WandB) with comprehensive metrics:
- Real-time loss and perplexity tracking
- Gradient norm monitoring
- Learning rate scheduling visualization
- Generation samples logged to tables
- Model checkpoints as artifacts
- System resource monitoring
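The exact run configuration is not published here, but logging of this kind with the W&B Python API generally looks like the snippet below; the project name, metric keys, and loss values are made up for illustration.

```python
# Hypothetical W&B logging example; names and values are placeholders.
import math
import wandb

run = wandb.init(project="i3-80m", config={"d_model": 512, "n_layers": 16, "lr": 3e-4})
for step, loss in enumerate([10.0, 4.0, 2.5, 1.9, 1.7]):          # placeholder loss curve
    run.log({"train/loss": loss, "train/perplexity": math.exp(loss)}, step=step)
run.finish()
```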
## Limitations
- Trained on English text only
- Limited to 256 token context window
- May require fine-tuning for specific downstream tasks
- Conversational style influenced by TinyChat dataset
## Model Series
- [i3-22M](https://huggingface.co/FlameF0X/i3-22m) - Original model with pure hybrid architecture
- **i3-80M** (This model) - Scaled version with attention layers and multi-dataset training
## Citation
```bibtex
@misc{i3-80m,
  author = {FlameF0X},
  title = {i3-80M: Hybrid Architecture Language Model},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/FlameF0X/i3-80m}}
}
```
```bibtex
@article{mamba,
  title = {Mamba: Linear-Time Sequence Modeling with Selective State Spaces},
  author = {Gu, Albert and Dao, Tri},
  journal = {arXiv preprint arXiv:2312.00752},
  year = {2023}
}

@article{RWKV,
  title = {RWKV: Reinventing RNNs for the Transformer Era},
  author = {Peng, Bo and others},
  journal = {arXiv preprint arXiv:2305.13048},
  year = {2023}
}
```