FlameF0X committed
Commit 5b7abc0 · verified · 1 Parent(s): 6617ea1

Update README.md

Files changed (1): README.md +0 -171
README.md CHANGED
@@ -1,171 +0,0 @@
---
language: en
license: mit
tags:
- conversational
- efficient
- i3-architecture
- hybrid-model
- rwkv-mamba
datasets:
- agentlans/high-quality-english-sentences
- roneneldan/TinyStories
- starhopp3r/TinyChat
library_name: transformers
pipeline_tag: text-generation
---

# i3-80M - Hybrid Architecture Language Model

## Model Description

The **i3-80M** model is a hybrid-architecture language model that combines convolutional/recurrent layers with full-attention layers for efficient language modeling. The early layers blend RWKV-style time-mixing with Mamba state-space dynamics, and the deeper layers use standard multi-head attention.

## Model Statistics

- **Total Parameters**: ~80M
- **Architecture**: 10 hybrid (RWKV-Mamba) layers + 6 full-attention layers = 16 layers total
- **Vocabulary Size**: 35,560 tokens (variable-length chunks plus an `<UNK>` token)
- **Hidden Dimension (d_model)**: 512
- **Attention Heads**: 16
- **State Dimension (d_state)**: 32
- **Max Sequence Length**: 256 tokens
- **Tokenization**: memory-efficient variable-length chunking (2-3 characters)

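As a rough sanity check on the ~80M figure, the embedding, feed-forward, and attention projections alone account for most of it; the hybrid-mixing and normalization parameters, whose exact sizes are not stated in this card, make up the remainder. The split below assumes an untied LM head and standard 4x feed-forward layers:

```python
# Back-of-the-envelope parameter budget from the figures above; assumes an
# untied LM head and 4x feed-forward layers, and ignores hybrid-mixing and norm weights.
vocab_size, d_model, n_layers, n_attn_layers = 35_560, 512, 16, 6

embeddings = 2 * vocab_size * d_model             # token embedding + LM head
ffn = n_layers * 2 * d_model * (4 * d_model)      # up/down projections in all 16 layers
attn = n_attn_layers * 4 * d_model * d_model      # q, k, v, out projections in 6 layers

print(f"{(embeddings + ffn + attn) / 1e6:.1f}M")  # 76.3M before the mixing weights
```
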
### Architecture Breakdown
```
Layers 1-10: RWKV-Mamba Hybrid Blocks (Recurrent/Conv)
├─ RWKVMambaHybrid (Time-mixing + State-space)
└─ Feed-Forward Network (4x expansion)

Layers 11-16: Full Attention Blocks
├─ Multi-Head Attention (16 heads)
└─ Feed-Forward Network (4x expansion)
```

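A minimal PyTorch sketch of this 10 + 6 layer layout is below, assuming pre-norm residual blocks. The `HybridBlock` internals are a stand-in (a causal depthwise convolution in place of the actual RWKV-Mamba time-mixing and state-space scan, with `d_state` omitted), so this illustrates the layer layout, not the repository's implementation.

```python
import torch
import torch.nn as nn

D_MODEL, N_HEADS, N_HYBRID, N_ATTN, SEQ_LEN = 512, 16, 10, 6, 256

class FeedForward(nn.Module):
    """Position-wise MLP with the 4x expansion used in both block types."""
    def __init__(self, d_model):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        return self.net(x)

class HybridBlock(nn.Module):
    """Placeholder for RWKVMambaHybrid: a causal depthwise conv stands in for
    the time-mixing / state-space scan (the d_state machinery is omitted)."""
    def __init__(self, d_model):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.mix = nn.Conv1d(d_model, d_model, kernel_size=3, padding=2, groups=d_model)
        self.ffn = FeedForward(d_model)

    def forward(self, x):
        # left-padded depthwise conv, truncated back to the input length (causal)
        h = self.mix(self.norm1(x).transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        x = x + h
        return x + self.ffn(self.norm2(x))

class AttentionBlock(nn.Module):
    """Standard pre-norm multi-head attention block with a causal mask."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = FeedForward(d_model)

    def forward(self, x):
        mask = torch.triu(
            torch.full((x.size(1), x.size(1)), float("-inf"), device=x.device), diagonal=1
        )
        h = self.norm1(x)
        h, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + h
        return x + self.ffn(self.norm2(x))

# 10 hybrid blocks followed by 6 attention blocks, as in the diagram above
backbone = nn.Sequential(
    *[HybridBlock(D_MODEL) for _ in range(N_HYBRID)],
    *[AttentionBlock(D_MODEL, N_HEADS) for _ in range(N_ATTN)],
)
print(backbone(torch.randn(1, SEQ_LEN, D_MODEL)).shape)  # torch.Size([1, 256, 512])
```
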
### Key Features

1. **Hybrid Architecture**: Combines the efficiency of recurrent/convolutional processing with the power of attention
   - Early layers use the RWKV-Mamba hybrid for efficient sequence processing
   - Later layers use full multi-head attention for complex pattern recognition

2. **Memory-Optimized Training**:
   - Streaming vocabulary building (no full text storage)
   - Vocabulary caching (build once, reuse)
   - Efficient chunk frequency counting
   - Automatic memory cleanup

3. **Multi-Dataset Pre-training**: Trained on diverse text sources for robust language understanding
   - TinyStories: narrative and storytelling
   - TinyChat: conversational dynamics
   - High-Quality English Sentences: linguistic diversity

4. **Smart Tokenization**: Variable-length chunking (2-3 characters) with common-trigram optimization (see the sketch after this list)
   - Total tokens processed: **3,000,000+**
   - Handles unknown tokens gracefully with an `<UNK>` token

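A minimal sketch of greedy 2-3 character chunk tokenization with an `<UNK>` fallback is shown below. The `chunk_tokenize` helper and the toy vocabulary are illustrative assumptions; the actual vocabulary lives in `chunk_vocab_combined.json` and its construction may differ in detail.

```python
# Greedy longest-match chunking: try a known 3-char chunk, then 2-char, else <UNK>/char.
def chunk_tokenize(text, vocab, unk_id=0):
    ids, i = [], 0
    while i < len(text):
        for size in (3, 2):
            piece = text[i : i + size]
            if len(piece) == size and piece in vocab:
                ids.append(vocab[piece])
                i += size
                break
        else:
            ids.append(vocab.get(text[i], unk_id))  # single character or <UNK>
            i += 1
    return ids

toy_vocab = {"<UNK>": 0, "hel": 1, "lo ": 2, "wo": 3, "rl": 4, "d": 5}
print(chunk_tokenize("hello world", toy_vocab))  # [1, 2, 3, 4, 5]
```
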
## Training Details

### Training Configuration

- **Datasets**:
  - `agentlans/high-quality-english-sentences`
  - `roneneldan/TinyStories`
  - `starhopp3r/TinyChat`
- **Training Steps**: 5,000 iterations
- **Batch Size**: 4 (with gradient accumulation support)
- **Learning Rate**: 3e-4 (with warmup and cosine decay; see the sketch below)
- **Optimizer**: AdamW with gradient clipping (max norm: 1.0)
- **Hardware**: NVIDIA GeForce RTX 3060 (12GB VRAM)
- **Training Time**: ~17 hours
- **Framework**: PyTorch

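The optimizer and schedule settings above can be wired up as in the following sketch. The warmup length and exact schedule shape are assumptions (the card only states "warmup and cosine decay"), and the dummy parameter and loss stand in for the real model and LM loss.

```python
import math
import torch

TOTAL_STEPS, WARMUP_STEPS, PEAK_LR = 5_000, 250, 3e-4   # warmup length is an assumed value

def lr_lambda(step):
    """Linear warmup to the peak LR, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

params = [torch.nn.Parameter(torch.ones(1))]              # stand-in for model.parameters()
optimizer = torch.optim.AdamW(params, lr=PEAK_LR)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(TOTAL_STEPS):
    loss = (params[0] ** 2).sum()                         # dummy loss in place of the LM loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)  # gradient clipping at max norm 1.0
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
    if step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS - 1):
        print(step, f"{scheduler.get_last_lr()[0]:.2e}")  # trace the warmup/decay shape
```
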
### Training Dynamics

- **GPU Utilization**: stable at ~15-20% during training
- **GPU Memory**: ~18% allocated (~2.2 GB of 12 GB)
- **Power Usage**: ~40 W average
- **Throughput**: ~100-550 tokens/sec

### Performance Metrics

| Metric        | Initial | Final | Best |
|---------------|---------|-------|------|
| Training Loss | ~6.0    | ~2.0  | 1.98 |
| Perplexity    | ~400+   | ~7-10 | 7.29 |

The model shows strong convergence with stable training dynamics and efficient GPU utilization.

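Assuming perplexity is computed as the exponential of the mean cross-entropy loss (the usual convention), the loss and perplexity rows are mutually consistent:

```python
import math

print(math.exp(1.98))  # ≈ 7.24, in line with the reported best perplexity of 7.29
print(math.log(7.29))  # ≈ 1.99, i.e. a loss just under 2.0
```
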
## Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("FlameF0X/i3-80m")
tokenizer = AutoTokenizer.from_pretrained("FlameF0X/i3-80m")

# Generate text (sampling must be enabled for temperature/top_k to take effect)
prompt = "hello"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    inputs.input_ids,
    max_length=100,
    do_sample=True,
    temperature=0.8,
    top_k=40,
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

For custom usage with the original training code, check [user.py](https://huggingface.co/FlameF0X/i3-80m/blob/main/user.py).

## Technical Innovations

1. **RWKV-Mamba Hybrid Recurrence**: combines RWKV's time-mixing with Mamba's state-space dynamics (an illustrative recurrence is sketched after this list)
   - Linear complexity in sequence length
   - Efficient recurrent processing
   - State-space modeling for temporal dependencies

2. **Hierarchical Processing**:
   - Lower layers focus on local patterns (conv/recurrent)
   - Upper layers capture global dependencies (attention)

3. **Memory Efficiency**:
   - Streaming tokenization during vocabulary building
   - No full-dataset storage in RAM
   - Automatic cleanup of intermediate data

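The toy recurrence below shows why the recurrent path scales linearly with sequence length: a single decayed state is updated once per token, loosely in the spirit of RWKV-style time decay and a diagonal state-space update. It is a simplification for illustration, not the model's actual `RWKVMambaHybrid` implementation.

```python
import torch

def recurrent_mix(x, decay):
    """x: (batch, seq, dim); decay: (dim,) with values in (0, 1).
    One state update per token, so the cost is O(seq)."""
    state = torch.zeros(x.size(0), x.size(2))
    out = []
    for t in range(x.size(1)):
        state = decay * state + (1.0 - decay) * x[:, t]  # decayed running state
        out.append(state)
    return torch.stack(out, dim=1)

x = torch.randn(2, 256, 512)
y = recurrent_mix(x, decay=torch.sigmoid(torch.randn(512)))
print(y.shape)  # torch.Size([2, 256, 512])
```
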
## Model Files

- `pytorch_model.bin`: model weights
- `config.json`: model configuration
- `chunk_vocab_combined.json`: tokenizer vocabulary

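One way to fetch these files locally is via `huggingface_hub`. The layout of `chunk_vocab_combined.json` is assumed here to map chunk strings to integer ids; that layout is not documented in this card.

```python
import json
from huggingface_hub import hf_hub_download

# Download the tokenizer vocabulary from the model repo and inspect it.
vocab_path = hf_hub_download("FlameF0X/i3-80m", "chunk_vocab_combined.json")
with open(vocab_path, encoding="utf-8") as f:
    vocab = json.load(f)
print(len(vocab))  # expected on the order of 35,560 entries
```
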
## Training Tracking

This model was tracked using Weights & Biases (WandB) with comprehensive metrics:

- Real-time loss and perplexity tracking
- Gradient-norm monitoring
- Learning-rate schedule visualization
- Generation samples logged to tables
- Model checkpoints saved as artifacts
- System resource monitoring

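A condensed sketch of that kind of logging loop is below. The project name, metric keys, and placeholder values are assumptions for illustration, not the ones used in the original run.

```python
import math
import os
import wandb

run = wandb.init(project="i3-80m", mode="offline",
                 config={"lr": 3e-4, "steps": 5000, "batch_size": 4})
samples = wandb.Table(columns=["step", "prompt", "generation"])

for step in range(3):                      # stands in for the 5,000 training steps
    loss = 2.0                             # placeholder for the real training loss
    run.log({"train/loss": loss,
             "train/perplexity": math.exp(loss),
             "train/grad_norm": 0.5,
             "train/lr": 3e-4}, step=step)

samples.add_data(3, "hello", "(generated text)")
run.log({"generation_samples": samples})

if os.path.exists("pytorch_model.bin"):    # log the checkpoint as an artifact if present
    artifact = wandb.Artifact("i3-80m-checkpoint", type="model")
    artifact.add_file("pytorch_model.bin")
    run.log_artifact(artifact)
run.finish()
```
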
## Limitations

- Trained on English text only
- Limited to a 256-token context window
- May require fine-tuning for specific downstream tasks
- Conversational style is influenced by the TinyChat dataset

## Citation
```bibtex
@misc{i3-80m,
  author       = {Daniel Fox},
  title        = {i3-80M: Hybrid Architecture Language Model},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/FlameF0X/i3-80m}}
}
```