FlameF0X committed · d67ca95 · verified · 1 parent: 91911b7

Update README.md

Files changed (1): README.md (+165 -1)
---
language: en
license: apache-2.0
tags:
- conversational
- efficient
- i3-architecture
- hybrid-model
- rwkv-mamba
datasets:
- agentlans/high-quality-english-sentences
- roneneldan/TinyStories
- starhopp3r/TinyChat
library_name: transformers
pipeline_tag: text-generation
---

# i3-80M - Hybrid Architecture Language Model

## Model Description

**i3-80M** is a language model with a hybrid architecture that combines convolutional/recurrent layers with full attention layers for efficient language modeling. The early layers blend RWKV-style time-mixing with Mamba state-space dynamics; the deeper layers use standard multi-head attention.

## Model Statistics

- **Total Parameters**: ~80M
- **Architecture**: 10 hybrid (RWKV-Mamba) layers + 6 full-attention layers = 16 layers total
- **Vocabulary Size**: 35,560 tokens (variable-length chunks plus an `<UNK>` token)
- **Hidden Dimension (d_model)**: 512
- **Attention Heads**: 16
- **State Dimension (d_state)**: 32
- **Max Sequence Length**: 256
- **Tokenization**: memory-efficient variable-length chunking (2-3 characters)
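
Collected into a single (hypothetical) configuration, the statistics above map to roughly the following; the key names are illustrative assumptions, not necessarily those used in the repo's actual `config.json`:

```python
# Illustrative only: key names are assumptions, not the repo's config.json schema.
i3_80m_config = {
    "d_model": 512,            # hidden dimension
    "n_heads": 16,             # attention heads in the full-attention layers
    "d_state": 32,             # state dimension for the hybrid recurrence
    "n_hybrid_layers": 10,     # RWKV-Mamba hybrid blocks
    "n_attention_layers": 6,   # full multi-head attention blocks
    "vocab_size": 35_560,      # variable-length chunk vocabulary, incl. <UNK>
    "max_seq_len": 256,        # context window
}
```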

### Architecture Breakdown
```
Layers 1-10: RWKV-Mamba Hybrid Blocks (Recurrent/Conv)
├─ RWKVMambaHybrid (Time-mixing + State-space)
└─ Feed-Forward Network (4x expansion)

Layers 11-16: Full Attention Blocks
├─ Multi-Head Attention (16 heads)
└─ Feed-Forward Network (4x expansion)
```
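
As a rough PyTorch sketch of this layout (class names and block internals are illustrative assumptions; a depthwise causal convolution stands in for the hybrid recurrence, which is sketched separately under Technical Innovations):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN with the 4x expansion described above."""
    def __init__(self, d_model):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        return self.net(x)

class HybridBlock(nn.Module):
    """Stand-in for the RWKVMambaHybrid block. A depthwise causal convolution
    replaces the real time-mixing/state-space update, which lives in the
    repo's custom modeling code."""
    def __init__(self, d_model, d_state=32):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.mix = nn.Conv1d(d_model, d_model, kernel_size=3, padding=2, groups=d_model)
        self.ffn = FeedForward(d_model)

    def forward(self, x):                       # x: (batch, seq, d_model)
        h = self.norm1(x).transpose(1, 2)       # Conv1d wants (batch, d_model, seq)
        h = self.mix(h)[..., : x.size(1)]       # trim right pad -> causal conv
        x = x + h.transpose(1, 2)
        return x + self.ffn(self.norm2(x))

class AttentionBlock(nn.Module):
    """Full multi-head self-attention block (layers 11-16)."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = FeedForward(d_model)

    def forward(self, x):
        seq = x.size(1)
        causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=x.device), 1)
        h = self.norm1(x)
        a, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + a
        return x + self.ffn(self.norm2(x))

class I3Stack(nn.Module):
    """10 hybrid blocks followed by 6 attention blocks, matching the diagram."""
    def __init__(self, d_model=512, n_heads=16, d_state=32):
        super().__init__()
        self.layers = nn.ModuleList(
            [HybridBlock(d_model, d_state) for _ in range(10)]
            + [AttentionBlock(d_model, n_heads) for _ in range(6)]
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
```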

### Key Features

1. **Hybrid Architecture**: Combines the efficiency of recurrent/convolutional processing with the power of attention.
   - Early layers use the RWKV-Mamba hybrid for efficient sequence processing
   - Later layers use full multi-head attention for complex pattern recognition

2. **Memory-Optimized Training** (see the sketch after this list):
   - Streaming vocabulary building (no full text storage)
   - Vocabulary caching (build once, reuse)
   - Efficient chunk frequency counting
   - Automatic memory cleanup

3. **Multi-Dataset Pre-training**: Trained on diverse text sources for robust language understanding.
   - TinyStories: narrative and storytelling
   - TinyChat: conversational dynamics
   - High-Quality English Sentences: linguistic diversity

4. **Smart Tokenization**: Variable-length chunking (2-3 characters) with common-trigram optimization.
   - Total tokens processed: **3,000,000+**
   - Handles unknown tokens gracefully via the `<UNK>` token
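
A minimal sketch of what streaming vocabulary building with variable-length chunking might look like; the function names and the exact greedy chunking rule are assumptions, not the repo's actual tokenizer code:

```python
from collections import Counter
from typing import Iterable, Iterator

def chunk_text(text: str) -> Iterator[str]:
    """Greedy variable-length chunking: emit 3-char chunks, falling back to
    2 (or 1 for a lone trailing character). Illustrative rule only; the
    repo's trigram-aware chunker may differ."""
    i = 0
    while i < len(text):
        remaining = len(text) - i
        size = 3 if remaining >= 3 else remaining
        yield text[i:i + size]
        i += size

def build_vocab(lines: Iterable[str], max_size: int = 35_560) -> dict[str, int]:
    """Stream lines once, counting chunk frequencies without storing the corpus."""
    counts = Counter()
    for line in lines:
        counts.update(chunk_text(line))
    vocab = {"<UNK>": 0}                       # reserve id 0 for unknown chunks
    for chunk, _ in counts.most_common(max_size - 1):
        vocab[chunk] = len(vocab)
    return vocab

def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Map chunks to ids, falling back to <UNK> for unseen chunks."""
    return [vocab.get(c, vocab["<UNK>"]) for c in chunk_text(text)]
```

Under this rule, `encode("hello", vocab)` splits the text into the chunks `hel` and `lo`, matching the 2-3 character scheme described above, and only the `Counter` of chunk frequencies is held in memory while streaming.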

## Training Details

### Training Configuration

- **Datasets**:
  - `agentlans/high-quality-english-sentences`
  - `roneneldan/TinyStories`
  - `starhopp3r/TinyChat`
- **Training Steps**: 5,000 iterations
- **Batch Size**: 4 (with gradient accumulation support)
- **Learning Rate**: 3e-4, with warmup and cosine decay (sketched below)
- **Optimizer**: AdamW with gradient clipping (max norm: 1.0)
- **Hardware**: NVIDIA GeForce RTX 3060 (12 GB VRAM)
- **Training Time**: ~17 hours
- **Framework**: PyTorch
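
A sketch of this optimization setup in PyTorch; the warmup length is an assumption, since the README does not state it:

```python
import math
import torch

def make_optimizer_and_scheduler(model, total_steps=5_000, warmup_steps=500,
                                 peak_lr=3e-4):
    """AdamW + linear warmup into cosine decay; warmup_steps is an assumption."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)              # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay to 0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# Inside the training loop, gradients are clipped before each step:
# loss.backward()
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# optimizer.step(); scheduler.step(); optimizer.zero_grad()
```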

### Training Dynamics

- **GPU Utilization**: stable at ~15-20% during training
- **GPU Memory**: ~18% allocated (~2.2 GB / 12 GB)
- **Power Usage**: ~40 W average
- **Throughput**: ~100-550 tokens/sec

### Performance Metrics

| Metric        | Initial | Final | Best |
|---------------|---------|-------|------|
| Training Loss | ~6.0    | ~2.0  | 1.98 |
| Perplexity    | 400+    | ~7-10 | 7.29 |

The model shows strong convergence with stable training dynamics and efficient GPU utilization. The two rows are consistent, since perplexity is the exponential of the cross-entropy loss: exp(1.98) ≈ 7.2.

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer; the custom i3 architecture ships its own
# modeling code, so trust_remote_code is required.
model = AutoModelForCausalLM.from_pretrained("FlameF0X/i3-80m", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("FlameF0X/i3-80m", trust_remote_code=True)

# Generate text
prompt = "hello"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    inputs.input_ids,
    max_length=100,
    do_sample=True,   # sampling must be enabled for temperature/top_k to apply
    temperature=0.8,
    top_k=40,
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

For custom usage with the original training code, see [user.py](https://huggingface.co/FlameF0X/i3-80m/blob/main/user.py).

## Technical Innovations

1. **RWKV-Mamba Hybrid Recurrence**: Combines RWKV's time-mixing with Mamba's state-space dynamics (see the sketch after this list).
   - Linear complexity in sequence length
   - Efficient recurrent processing
   - State-space modeling for temporal dependencies

2. **Hierarchical Processing**:
   - Lower layers focus on local patterns (conv/recurrent)
   - Upper layers capture global dependencies (attention)

3. **Memory Efficiency**:
   - Streaming tokenization during vocabulary building
   - No full-dataset storage in RAM
   - Automatic cleanup of intermediate data
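
The exact recurrence lives in the repo's custom modeling code; the following is a minimal per-timestep sketch of the two ingredients named above, RWKV-style time-mixing and a diagonal state-space update. Every tensor name and shape here is an illustrative assumption:

```python
import torch

def hybrid_recurrence_step(x_t, x_prev, h_prev, mu, A, B, C):
    """One illustrative timestep combining the two mechanisms.

    x_t, x_prev: current and previous inputs, shape (d_model,)
    h_prev:      recurrent state, shape (d_model, d_state)
    mu:          learned time-mixing coefficients in (0, 1), shape (d_model,)
    A, B, C:     diagonal state-space parameters, shape (d_model, d_state)
    """
    # RWKV-style time-mixing: blend the current token with the previous one.
    x_mix = mu * x_t + (1.0 - mu) * x_prev

    # Mamba-style diagonal state update: h_t = A * h_{t-1} + B * x_t.
    h_t = A * h_prev + B * x_mix.unsqueeze(-1)

    # Readout: contract over the state dimension.
    y_t = (C * h_t).sum(dim=-1)
    return y_t, h_t

# Scanning the step over a sequence costs O(seq_len * d_model * d_state):
d_model, d_state = 512, 32
h = torch.zeros(d_model, d_state)
mu = torch.sigmoid(torch.randn(d_model))
A = torch.rand(d_model, d_state)             # entries in (0, 1) keep the state stable
B, C = torch.randn(d_model, d_state), torch.randn(d_model, d_state)
x_prev = torch.zeros(d_model)
for x_t in torch.randn(8, d_model):          # toy sequence of length 8
    y_t, h = hybrid_recurrence_step(x_t, x_prev, h, mu, A, B, C)
    x_prev = x_t
```

Because each step touches only the current input and a fixed-size state, the cost grows linearly with sequence length, in contrast to the quadratic cost of full self-attention.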

## Model Files

- `pytorch_model.bin`: model weights
- `config.json`: model configuration
- `chunk_vocab_combined.json`: tokenizer vocabulary (see the loading example below)
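
The vocabulary file can be inspected directly; a flat chunk-to-id JSON mapping is assumed here, based on the tokenizer description:

```python
import json

# Assumed layout: a flat {chunk: id} mapping that includes the <UNK> entry.
with open("chunk_vocab_combined.json", encoding="utf-8") as f:
    vocab = json.load(f)

print(len(vocab))           # expected: 35,560 entries
print(vocab.get("<UNK>"))   # id reserved for unknown chunks
```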

## Training Tracking

Training was tracked with Weights & Biases (WandB), logging (a sketch follows the list):

- real-time loss and perplexity
- gradient norms
- the learning-rate schedule
- generation samples (as tables)
- model checkpoints (as artifacts)
- system resource usage
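
A sketch of this kind of instrumentation with the `wandb` API; the project name and metric keys are illustrative:

```python
import math
import wandb

# mode="offline" lets this run without a WandB account; names are illustrative.
run = wandb.init(project="i3-80m", mode="offline",
                 config={"d_model": 512, "lr": 3e-4, "steps": 5_000})

for step in range(3):                        # stand-in for the training loop
    loss = 2.0                               # would come from the forward pass
    run.log({"loss": loss, "perplexity": math.exp(loss)}, step=step)

# Generation samples as tables, checkpoints as artifacts:
# run.log({"samples": wandb.Table(columns=["step", "prompt", "text"], data=rows)})
# artifact = wandb.Artifact("i3-80m-checkpoint", type="model")
# artifact.add_file("pytorch_model.bin")
# run.log_artifact(artifact)
run.finish()
```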

## Limitations

- Trained on English text only
- Limited to a 256-token context window
- May require fine-tuning for specific downstream tasks
- Conversational style is influenced by the TinyChat dataset

## Citation

```bibtex
@misc{i3-80m,
  author = {Daniel Fox},
  title = {i3-80M: Hybrid Architecture Language Model},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/FlameF0X/i3-80m}}
}
```