---
language: en
license: apache-2.0
tags:
- i3-architecture
- hybrid-model
- rwkv-mamba
- custom_code
datasets:
- agentlans/high-quality-english-sentences
- roneneldan/TinyStories
- starhopp3r/TinyChat
library_name: transformers
pipeline_tag: text-generation
---

# i3-80M - Hybrid Architecture Language Model

## Model Description

The **i3-80M** model uses a novel hybrid architecture that combines convolutional/recurrent layers with full attention layers for efficient language modeling. It blends RWKV-style time-mixing with Mamba state-space dynamics in the early layers, followed by standard multi-head attention in the deeper layers.

This is the second model in the i3 series, scaling up from the original [i3-22M](https://huggingface.co/FlameF0X/i3-22m) with improved architecture and multi-dataset training.

> [!NOTE]
> To try the model, use the demo [here](https://huggingface.co/spaces/FlameF0X/i3-80m).
> 
> [Read this in Romanian :)](https://huggingface.co/FlameF0X/i3-80m/blob/main/CITE%C8%98TEM%C4%82.md)

## Model Statistics

- **Total Parameters**: ~82.77M (82,765,160)
- **Architecture**: 10 Hybrid (RWKV-Mamba) + 6 Full Attention Layers = 16 Total Layers
- **Vocabulary Size**: 35,560 tokens (variable-length chunks with an `<UNK>` token)
- **Hidden Dimension (d_model)**: 512
- **Attention Heads**: 16
- **State Dimension (d_state)**: 32
- **Max Sequence Length**: 256
- **Tokenization**: Memory-efficient variable-length chunking (2-3 characters)

### Architecture Breakdown
```
Layers 1-10:  RWKV-Mamba Hybrid Blocks (Recurrent/Conv)
              ├─ RWKVMambaHybrid (Time-mixing + State-space)
              └─ Feed-Forward Network (4x expansion)

Layers 11-16: Full Attention Blocks
              ├─ Multi-Head Attention (16 heads)
              └─ Feed-Forward Network (4x expansion)
```
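For readers who think in code, here is a minimal PyTorch sketch of that layout. It is an illustration only, not the released implementation: the class names (`HybridBlock`, `AttentionBlock`, `FeedForward`) are made up, the RWKV-Mamba mixing module is stubbed out, and normalization and embedding layers are omitted.

```python
import torch.nn as nn

D_MODEL, N_HEADS, N_HYBRID, N_ATTN = 512, 16, 10, 6  # values from the statistics above

class FeedForward(nn.Sequential):
    """Position-wise FFN with 4x expansion, as in the breakdown above."""
    def __init__(self, d_model):
        super().__init__(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                         nn.Linear(4 * d_model, d_model))

class HybridBlock(nn.Module):
    """Layers 1-10: RWKV-Mamba hybrid mixing followed by an FFN."""
    def __init__(self, d_model):
        super().__init__()
        self.mix = nn.Identity()  # placeholder for the real RWKVMambaHybrid module
        self.ffn = FeedForward(d_model)

    def forward(self, x):
        x = x + self.mix(x)       # time-mixing / state-space step (stubbed out here)
        return x + self.ffn(x)

class AttentionBlock(nn.Module):
    """Layers 11-16: multi-head self-attention followed by an FFN."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = FeedForward(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = x + attn_out
        return x + self.ffn(x)

backbone = nn.Sequential(
    *[HybridBlock(D_MODEL) for _ in range(N_HYBRID)],
    *[AttentionBlock(D_MODEL, N_HEADS) for _ in range(N_ATTN)],
)
```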

## Comparison with i3-22M

| Feature | i3-22M | i3-80M (This Model) |
|---------|--------|---------------------|
| **Parameters** | 22.6M | 82.77M |
| **Architecture** | 24 Hybrid Layers | 10 Hybrid + 6 Attention Layers |
| **Hidden Dimension** | 512 | 512 |
| **Vocabulary Size** | 4,466 | 35,560 |
| **Training Dataset** | TinyChat only | TinyStories + TinyChat + HQ Sentences |
| **Training Data** | ~1M conversations | ~3M+ tokens |
| **Final Loss** | ~2.0 | ~2.0 |
| **Final Perplexity** | 7.29-9.70 | 7.29-10.0 |
| **Training Time** | ~17 hours | ~2-4 hours |
| **Attention Layers** | None (Pure Hybrid) | 6 Full Attention Layers |

### Key Improvements Over i3-22M

1. **Hybrid Architecture**: Introduces full multi-head attention in upper layers for better long-range dependencies
2. **Larger Vocabulary**: 8x larger vocabulary (35,560 vs 4,466) for better token coverage
3. **Multi-Dataset Training**: Trained on 3 diverse datasets vs single dataset
4. **Better Generalization**: Exposure to narratives (TinyStories), conversations (TinyChat), and formal text (HQ Sentences)
5. **Enhanced Unknown Token Handling**: Robust `<UNK>` token system for out-of-vocabulary words

### When to Use Each Model

**Use i3-22M if you need:**
- Smaller model size (~22M params)
- Pure conversational focus (TinyChat specialized)
- Lower memory footprint
- Faster inference

**Use i3-80M if you need:**
- Better general-purpose text generation
- Stronger attention-based reasoning (6 attention layers)
- Larger vocabulary coverage
- Multi-domain text understanding (stories, chat, formal text)

### Key Features

1. **Hybrid Architecture**: Combines the efficiency of recurrent/convolutional processing with the power of attention
   - Early layers use RWKV-Mamba hybrid for efficient sequence processing
   - Later layers use full multi-head attention for complex pattern recognition

2. **Memory-Optimized Training**: 
   - Streaming vocabulary building (no full text storage)
   - Vocabulary caching (build once, reuse)
   - Efficient chunk frequency counting
   - Automatic memory cleanup

3. **Multi-Dataset Pre-training**: Trained on diverse text sources for robust language understanding
   - TinyStories: Narrative and storytelling
   - TinyChat: Conversational dynamics
   - High-Quality English Sentences: Linguistic diversity

4. **Smart Tokenization**: Variable-length chunking (2-3 chars) with common trigram optimization
   - Total tokens processed: **3,000,000+**
   - Handles unknown tokens gracefully with an `<UNK>` token
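As a rough illustration of the chunking idea (not the released tokenizer, whose actual chunk inventory lives in `chunk_vocab_combined.json`), the toy function below greedily prefers known 3-character chunks, falls back to 2-character chunks, and emits `<UNK>` for anything not in the vocabulary. The vocabulary and the single-character skip on `<UNK>` are made up for the example.

```python
def chunk_tokenize(text, vocab, unk="<UNK>"):
    """Greedy variable-length chunking: prefer known trigrams, then bigrams, else <UNK>."""
    tokens, i = [], 0
    while i < len(text):
        if text[i:i + 3] in vocab:        # common trigram found
            tokens.append(text[i:i + 3]); i += 3
        elif text[i:i + 2] in vocab:      # fall back to a 2-character chunk
            tokens.append(text[i:i + 2]); i += 2
        else:                             # out-of-vocabulary: emit <UNK> and skip one char
            tokens.append(unk); i += 1
    return tokens

toy_vocab = {"hel", "lo ", "wor", "ld", "he", "ll", "o "}
print(chunk_tokenize("hello world", toy_vocab))  # ['hel', 'lo ', 'wor', 'ld']
```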

## Training Details

### Training Configuration

- **Datasets**: 
  - `agentlans/high-quality-english-sentences`
  - `roneneldan/TinyStories`
  - `starhopp3r/TinyChat`
- **Training Steps**: 5,000 iterations
- **Batch Size**: 4 (with gradient accumulation support)
- **Learning Rate**: 3e-4 (with warmup and cosine decay)
- **Optimizer**: AdamW with gradient clipping (max norm: 1.0)
- **Hardware**: NVIDIA P100 (16GB VRAM)
- **Training Time**: ~2-4 hours
- **Framework**: PyTorch
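The optimizer and schedule described above can be written down in a few lines. This is a hedged sketch only: the warmup length and the exact decay shape are assumptions, since the training script is not reproduced here.

```python
import math
import torch

MAX_STEPS, PEAK_LR, WARMUP = 5_000, 3e-4, 200  # warmup length is an assumption

def lr_at(step):
    """Linear warmup to the peak LR, then cosine decay towards zero."""
    if step < WARMUP:
        return PEAK_LR * step / WARMUP
    progress = (step - WARMUP) / max(1, MAX_STEPS - WARMUP)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

model = torch.nn.Linear(512, 512)  # stand-in for the i3-80M module
optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR)

for step in range(MAX_STEPS):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)
    # ... forward pass and loss.backward() go here ...
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip at 1.0
    optimizer.step()
    optimizer.zero_grad()
```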

### Training Dynamics

- **GPU Utilization**: Stable at ~15-20% during training
- **GPU Memory**: ~18% allocated (~2.2GB / 12GB)
- **Power Usage**: ~40W average
- **Throughput**: ~100-550 tokens/sec

### Performance Metrics

| Metric | Initial | Final |
|--------|---------|-------|
| Training Loss | ~10.0 | ~1.7 |
| Perplexity | ~4000+ | ~6 |
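The two rows are linked: perplexity is simply the exponential of the cross-entropy loss (in nats), so the final figures are mutually consistent.

```python
import math

final_loss = 1.7
print(math.exp(final_loss))  # ~5.5, matching the ~6 final perplexity above
```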

![image](https://cdn-uploads.huggingface.co/production/uploads/6615494716917dfdc645c44e/ugtJGyEkQfbGieURP2W78.png)
> [!NOTE]
> I don't know why the logging starts at step 4.6k.

How do **i3-22m** and **i3-80m** compare?

![image](https://cdn-uploads.huggingface.co/production/uploads/6615494716917dfdc645c44e/utj6B7AE_gMMI9jnHc37Z.png)

The model shows strong convergence with stable training dynamics and efficient GPU utilization.

## Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer (trust_remote_code is required because i3 ships custom model code)
model = AutoModelForCausalLM.from_pretrained("FlameF0X/i3-80m", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("FlameF0X/i3-80m", trust_remote_code=True)

# Generate text
prompt = "hello"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    inputs.input_ids,
    max_length=100,
    do_sample=True,   # sampling must be enabled for temperature/top_k to take effect
    temperature=0.8,
    top_k=40
)
generated_text = tokenizer.decode(outputs[0])
print(generated_text)
```


## Technical Innovations

1. **RWKV-Mamba Hybrid Recurrence**: Combines RWKV's time-mixing with Mamba's state-space dynamics (a toy version is sketched after this list)
   - Linear complexity for long sequences
   - Efficient recurrent processing
   - State-space modeling for temporal dependencies

2. **Hierarchical Processing**: 
   - Lower layers focus on local patterns (conv/recurrent)
   - Upper layers capture global dependencies (attention)

3. **Memory Efficiency**: 
   - Streaming tokenization during vocab building
   - No full dataset storage in RAM
   - Automatic cleanup of intermediate data
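To make the first point concrete, here is a toy, single-sequence recurrence that combines an RWKV-style exponentially decayed time-mix with a Mamba-style diagonal state-space update in one O(T) loop. It is illustrative only; the real `RWKVMambaHybrid` module is learned, gated, and batched, and every parameter name and shape below is made up.

```python
import torch

def toy_hybrid_recurrence(x, decay, A, B, C):
    """Toy linear-time recurrence: an RWKV-style decayed running blend of past
    tokens feeds a Mamba-style diagonal state-space update.
    x: (T, d_model), decay: (d_model,), A: (d_state,), B: (d_state, d_model),
    C: (d_model, d_state). Not the i3 implementation."""
    mix = torch.zeros(x.shape[1])
    state = torch.zeros(A.shape[0])
    outputs = []
    for t in range(x.shape[0]):
        w = torch.sigmoid(decay)
        mix = w * mix + (1 - w) * x[t]   # RWKV-flavoured time mixing (exponential decay)
        state = A * state + B @ mix      # Mamba-flavoured diagonal state update
        outputs.append(C @ state)        # project the state back to model dimension
    return torch.stack(outputs)

T, d_model, d_state = 8, 16, 4            # toy sizes; the real model uses 512 / 32
y = toy_hybrid_recurrence(
    torch.randn(T, d_model),
    torch.randn(d_model),                 # per-channel decay (learned in a real model)
    torch.rand(d_state) * 0.9,            # diagonal state transition
    torch.randn(d_state, d_model) * 0.1,  # input projection
    torch.randn(d_model, d_state) * 0.1,  # output projection
)
print(y.shape)  # torch.Size([8, 16])
```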

## Model Files

- `pytorch_model.bin`: Model weights
- `config.json`: Model configuration
- `chunk_vocab_combined.json`: Tokenizer vocabulary

## Training Tracking

This model was tracked using Weights & Biases (WandB) with comprehensive metrics:
- Real-time loss and perplexity tracking
- Gradient norm monitoring
- Learning rate scheduling visualization
- Generation samples logged to tables
- Model checkpoints as artifacts
- System resource monitoring
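A minimal example of this kind of tracking with the standard `wandb` API is shown below; the project name, metric keys, and dummy loss values are illustrative, not taken from the actual run.

```python
import math
import wandb

# Offline mode lets the sketch run without a WandB account; remove it to sync online.
run = wandb.init(project="i3-80m", mode="offline",
                 config={"d_model": 512, "n_layers": 16, "peak_lr": 3e-4})

for step in range(5):                # stand-in for the 5,000 real training steps
    loss = 10.0 - 1.6 * step         # dummy value; the real training loss goes here
    run.log({"train/loss": loss, "train/perplexity": math.exp(loss)}, step=step)

run.finish()
```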

## Limitations

- Trained on English text only
- Limited to 256 token context window
- May require fine-tuning for specific downstream tasks
- Conversational style influenced by TinyChat dataset

## Model Series

- [i3-22M](https://huggingface.co/FlameF0X/i3-22m) - Original model with pure hybrid architecture
- **i3-80M** (This model) - Scaled version with attention layers and multi-dataset training

## Citation
```bibtex
@misc{i3-80m,
  author = {FlameF0X},
  title = {i3-80M: Hybrid Architecture Language Model},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/FlameF0X/i3-80m}}
}
```
```bibtex
@article{mamba,
  title={Mamba: Linear-Time Sequence Modeling with Selective State Spaces},
  author={Gu, Albert and Dao, Tri},
  journal={arXiv preprint arXiv:2312.00752},
  year={2023}
}
@article{RWKV,
  title={RWKV: Reinventing RNNs for the Transformer Era},
  author={Peng, Bo and others},
  journal={arXiv preprint arXiv:2305.13048},
  year={2023}
}

```