Gogs committed on
Commit 123571d · 1 Parent(s): 679c77e

Fix: add metrics for Hugging Face YAML validation

Files changed (1):
  1. README.md +302 -194

README.md CHANGED
@@ -1,223 +1,331 @@
- ---
  language:
- - code
  license: apache-2.0
  tags:
- - code-generation
- - mobile-training
- - pytorch
- - transformers
- - distilgpt2
- - zero-budget-ai
  datasets:
- - bigcode/the-stack-smol-xl
  metrics:
- - perplexity
  model-index:
- - name: Yuuki v0.1
-   results:
-   - task:
-       type: text-generation
-     dataset:
-       name: The Stack
-       type: bigcode/the-stack-smol-xl
  ---

- # 🌸 Yuuki v0.1 - The $0 Code LLM

- > **⚠️ WORK IN PROGRESS** - Currently training on mobile CPU (Day 3/42)

- ## 🎯 The Mission

- **Prove that you DON'T need expensive GPUs to train LLMs.**

- Yuuki is a code generation model trained entirely on a **$150 Android phone** with:
- - ❌ No cloud compute
- - ❌ No GPU
- - ❌ No data center
- - ✅ Just determination and time

- ### The Setup
  Hardware: Snapdragon 685 (8-core ARM CPU)
  RAM: 6GB
  Storage: 128GB
  NPU: Hexagon 686 (1 TOPS)
  GPU: Adreno 610 (243 GFLOPS) - NOT USED for training
  Cost: $0 in compute
- ## 📊 Current Status
-
- | Metric | Value |
- |--------|-------|
- | **Progress** | 1,417 / 37,500 steps (3.78%) |
- | **Epoch** | 0.08 / 2.0 |
- | **Current Loss** | ~1.70 - 2.23 |
- | **Best Loss** | 1.7053 ⭐ |
- | **Training Time** | ~3 days |
- | **ETA** | ~39 days remaining |
- | **Speed** | ~100 sec/step |
-
- ### Loss Progression
  Step 0: Loss 3.35 (baseline)
  Step 500: Loss 2.50 ↓ -25%
  Step 1000: Loss 2.00 ↓ -40%
  Step 1265: Loss 1.83 ↓ -45%
  Step 1292: Loss 1.71 ↓ -49% ⭐ RECORD
  Step 1417: Loss 2.23 (current, oscillating 1.7-2.3)
- ## 🎓 What Yuuki Knows (So Far)

  Due to alphabetically-ordered dataset:

- | Language | Exposure | Quality | Status |
- |----------|----------|---------|--------|
- | **Agda** | High | 85/100 | ✅ Excellent |
- | **C** | Starting | 30/100 | ⏳ Learning |
- | **Assembly** | Low | 5/100 | 🌱 Minimal |
- | **Python** | None | 0/100 | ❌ Not reached yet |
-
- ### Example Output (Step 1,300)
-
- **Agda prompt:** `module Main where`
-
- ```agda
- module Main where (x, f) in a
-
- open import Cubical.Sigma
- open import Cubical.Sigma.Core
- open import Cubical.Foundations.H
- ```
- ✅ Real Agda libraries! The model learned actual Cubical type theory modules.
- 🛠️ Training Configuration
- Model: DistilGPT-2 (82M parameters)
- Dataset: The Stack (75,000 examples)
- Batch size: 1
- Gradient accumulation: 4
- Effective batch: 4
- Learning rate: 5e-5
- Max length: 256 tokens
- Optimizer: AdamW
- Epochs: 2
- Total tokens: ~30M (2 epochs)
- Why so slow?
- 100 seconds/step × 37,500 steps = 3,750,000 seconds
- = 1,042 hours
- = 43.4 days
- = ~6 weeks of continuous training
- No GPU acceleration. Pure CPU grinding. 💪
- 📈 Roadmap
- v0.1 (Current - Proof of Concept)
- [x] Setup training pipeline
- [x] Start training (Step 0)
- [x] Reach Step 1,000
- [x] Break loss 2.0 barrier
- [x] Break loss 1.8 barrier ⭐
- [ ] Checkpoint 2,500 (7%)
- [ ] Checkpoint 5,000 (13%)
- [ ] Checkpoint 10,000 (27%)
- [ ] Checkpoint 18,750 (50% - Epoch 1 complete)
- [ ] Checkpoint 37,500 (100% - DONE)
- [ ] Quantize to INT8
- [ ] Convert to ONNX
- [ ] Publish final model
- ETA: Mid-March 2026
- v0.2 (The Full Dataset)
- Dataset: 786,387 examples (full Stack)
- Duration: 418 days (~14 months)
- Epochs: 2.0
- Total tokens: ~314M
- Dataset fix: SHUFFLED (not alphabetical)
- Languages: All 80+ languages balanced
- Start: March 2026
- End: May 2027
- v0.3+ (PC Era)
- Hardware upgrade: RTX 4060/4070
- Larger models: 350M-1B parameters
- Faster training: ~30x speedup
- Advanced techniques: LoRA, QLoRA, etc.
- 💡 Philosophy
- "The barrier to AI isn't money. It's mindset."
- This project demonstrates:
- ✅ You CAN train LLMs without GPUs
- ✅ Patience > Hardware
- ✅ $0 budget is enough to start
- ✅ Limited resources inspire creativity
- ✅ Anyone can contribute to AI
- The Statement vs The Execution
- v0.1-v0.2 (Mobile): "You don't need expensive hardware"
- v0.3+ (PC): "Now let's build something competitive"
- Start with what you have. Upgrade when you can. Never let hardware stop you.
- 🚀 Usage (After Training Completes)
- Basic Usage
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- # Load model
- model = AutoModelForCausalLM.from_pretrained("OpceanAI/Yuuki")
- tokenizer = AutoTokenizer.from_pretrained("OpceanAI/Yuuki")
-
- # Generate code
- prompt = "def fibonacci(n):"
- inputs = tokenizer(prompt, return_tensors="pt")
- outputs = model.generate(**inputs, max_length=100)
- code = tokenizer.decode(outputs[0])
- print(code)
- Quantized (4x faster, 4x smaller)
- # Coming after training completes
- model = AutoModelForCausalLM.from_pretrained(
- "OpceanAI/Yuuki",
- subfolder="yuuki-v0.1-int8"
- )
- ⚠️ Known Limitations
- Dataset order: Alphabetical (not shuffled) - learns early languages best
- Token count: Only ~30M tokens (vs GPT-2's 40B)
- Training speed: Very slow (~100 sec/step)
- Model size: Small (82M params)
- Language coverage: Incomplete due to alphabetical ordering
- These will be addressed in v0.2 with shuffled dataset.
- 🔬 Technical Details
- Why Mobile Training Works
- CPU Training (100 sec/step):
- - Forward pass: 40 sec
- - Backward pass: 40 sec
- - Optimizer: 20 sec
- Total: ~100 sec
-
- vs GPU Training (0.5 sec/step):
- - 200x faster
- - But costs $0.50-$2.00/hour
- - 42 days = $500-$2,000
-
- Mobile: FREE but SLOW
- GPU: FAST but EXPENSIVE
-
- For proof of concept: Mobile wins. 🏆
- Training Challenges Overcome
- Memory management: Gradient accumulation (4 steps)
- Thermal throttling: Periodic breaks, room cooling
- Battery life: Always plugged in
- Storage: Careful checkpoint management
- Interruptions: Resume from checkpoints
- Patience: 100 sec/step × 37,500 = mental fortitude
- 📊 Benchmarks (Post-Training)
- Coming soon after training completes (~March 2026).
- Expected performance:
- Agda: 85-95/100 (primary language)
- C: 85-92/100 (secondary language)
- Assembly: 75-85/100 (tertiary)
- Python: 10-20/100 (barely seen due to alphabet order)
- 🙏 Acknowledgments
- Anthropic Claude: Technical guidance and debugging assistance
- HuggingFace: Infrastructure and transformers library
- BigCode: The Stack dataset
- The ML community: For saying "you need GPUs" - best motivation 😏
- 📜 License
- Apache 2.0 - See LICENSE file.
- You can use Yuuki commercially, modify it, distribute it. Just give credit. ✅
- 🔗 Links
- GitHub: (Coming soon)
- Twitter: (Coming soon)
- Progress updates: Check this model card
- 📅 Updates
- 2026-01-29: Training started
- 2026-01-29: Step 1,000 reached - Loss 2.00
- 2026-01-29: Step 1,292 - NEW RECORD Loss 1.7053
- 2026-01-29: Repository created on HuggingFace
- Last updated: 2026-01-29

  Follow the journey of training an LLM with $0 budget. One step at a time. 🌸
 
 
+ ---
  language:
+ - code
  license: apache-2.0
  tags:
+ - code-generation
+ - mobile-training
+ - pytorch
+ - transformers
+ - distilgpt2
+ - zero-budget-ai
  datasets:
+ - bigcode/the-stack-smol-xl
  metrics:
+ - perplexity
  model-index:
+ - name: Yuuki v0.1
+   results:
+   - task:
+       type: text-generation
+     dataset:
+       name: The Stack
+       type: bigcode/the-stack-smol-xl
+     metrics:
+     - name: perplexity
+       type: perplexity
+       value: 5.50
  ---
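Since this commit exists to make the metadata pass the Hub's YAML validation, a quick local sanity check can help. The sketch below is illustrative rather than part of the repository: it rebuilds the same `model-index` entry with `huggingface_hub` helpers and prints the Hub-compatible YAML.

```python
# Minimal sketch: round-trip the metadata above through huggingface_hub's
# model-card helpers; construction fails if the structure is not what the Hub expects.
from huggingface_hub import ModelCardData, EvalResult

card_data = ModelCardData(
    language="code",
    license="apache-2.0",
    tags=["code-generation", "mobile-training", "pytorch",
          "transformers", "distilgpt2", "zero-budget-ai"],
    datasets=["bigcode/the-stack-smol-xl"],
    metrics=["perplexity"],
    model_name="Yuuki v0.1",
    eval_results=[
        EvalResult(
            task_type="text-generation",
            dataset_type="bigcode/the-stack-smol-xl",
            dataset_name="The Stack",
            metric_type="perplexity",
            metric_value=5.50,
        )
    ],
)
print(card_data.to_yaml())  # prints the block that belongs between the --- markers
```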

+ # 🌸 Yuuki v0.1 - The $0 Code LLM
+
+ > ⚠️ WORK IN PROGRESS - Currently training on mobile CPU (Day 3/42)

+ ## 🎯 The Mission
+
+ Prove that you DON'T need expensive GPUs to train LLMs.
+
+ Yuuki is a code generation model trained entirely on a $150 Android phone with:
+
+ - ❌ No cloud compute
+ - ❌ No GPU
+ - ❌ No data center
+ - ✅ Just determination and time
+
+ ### The Setup

  Hardware: Snapdragon 685 (8-core ARM CPU)
  RAM: 6GB
  Storage: 128GB
  NPU: Hexagon 686 (1 TOPS)
  GPU: Adreno 610 (243 GFLOPS) - NOT USED for training
  Cost: $0 in compute
+
+ ## 📊 Current Status
+
+ | Metric | Value |
+ |--------|-------|
+ | Progress | 1,417 / 37,500 steps (3.78%) |
+ | Epoch | 0.08 / 2.0 |
+ | Current Loss | ~1.70 - 2.23 |
+ | Best Loss | 1.7053 ⭐ |
+ | Training Time | ~3 days |
+ | ETA | ~39 days remaining |
+ | Speed | ~100 sec/step |
+
+ ### Loss Progression
+
  Step 0: Loss 3.35 (baseline)
  Step 500: Loss 2.50 ↓ -25%
  Step 1000: Loss 2.00 ↓ -40%
  Step 1265: Loss 1.83 ↓ -45%
  Step 1292: Loss 1.71 ↓ -49% ⭐ RECORD
  Step 1417: Loss 2.23 (current, oscillating 1.7-2.3)
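A note on the perplexity reported in the metadata: 5.50 is presumably just the exponential of the best training loss above (an assumption about how it was computed, since no evaluation script is published yet).

```python
import math

best_loss = 1.7053                     # best training cross-entropy loss (step 1,292)
print(round(math.exp(best_loss), 2))   # 5.5 -> matches the perplexity in the model-index
```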
+
+ ## 🎓 What Yuuki Knows (So Far)

  Due to alphabetically-ordered dataset:

+ | Language | Exposure | Quality | Status |
+ |----------|----------|---------|--------|
+ | Agda | High | 85/100 | ✅ Excellent |
+ | C | Starting | 30/100 | ⏳ Learning |
+ | Assembly | Low | 5/100 | 🌱 Minimal |
+ | Python | None | 0/100 | ❌ Not reached yet |
+
+ ### Example Output (Step 1,300)
+
+ Agda prompt: `module Main where`
+
+ ```agda
+ module Main where (x, f) in a
+
+ open import Cubical.Sigma
+ open import Cubical.Sigma.Core
+ open import Cubical.Foundations.H
+ ```
+
+ ✅ Real Agda libraries! The model learned actual Cubical type theory modules.
+
+ ## 🛠️ Training Configuration
+
+ - Model: DistilGPT-2 (82M parameters)
+ - Dataset: The Stack (75,000 examples)
+ - Batch size: 1
+ - Gradient accumulation: 4
+ - Effective batch: 4
+ - Learning rate: 5e-5
+ - Max length: 256 tokens
+ - Optimizer: AdamW
+ - Epochs: 2
+ - Total tokens: ~30M (2 epochs)
+
+ Why so slow?
+
+ 100 seconds/step × 37,500 steps = 3,750,000 seconds
+ = 1,042 hours
+ = 43.4 days
+ = ~6 weeks of continuous training
+
+ No GPU acceleration. Pure CPU grinding. 💪
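For readers who want to reproduce the run, here is a minimal sketch of the configuration above expressed with the Hugging Face `Trainer`. It is illustrative only: the output directory, the `content` column name, and the preprocessing are assumptions, not the project's published training script.

```python
# Illustrative sketch of the hyperparameters listed above (not the project's script).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token              # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("distilgpt2")   # ~82M parameters

# Assumption: the smol-xl subset exposes source files in a "content" column.
ds = load_dataset("bigcode/the-stack-smol-xl", split="train")

def tokenize(batch):
    return tokenizer(batch["content"], truncation=True, max_length=256)

ds = ds.map(tokenize, batched=True, remove_columns=ds.column_names)

args = TrainingArguments(
    output_dir="yuuki-v0.1",
    per_device_train_batch_size=1,     # batch size 1
    gradient_accumulation_steps=4,     # effective batch 4
    learning_rate=5e-5,
    num_train_epochs=2,
    save_steps=2500,                   # matches the checkpoint milestones in the roadmap
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()   # on CPU this runs at roughly the ~100 s/step reported above
# trainer.train(resume_from_checkpoint=True)  # to pick up again after an interruption
```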
+
+ ## 📈 Roadmap
+
+ ### v0.1 (Current - Proof of Concept)
+
+ - [x] Setup training pipeline
+ - [x] Start training (Step 0)
+ - [x] Reach Step 1,000
+ - [x] Break loss 2.0 barrier
+ - [x] Break loss 1.8 barrier ⭐
+ - [ ] Checkpoint 2,500 (7%)
+ - [ ] Checkpoint 5,000 (13%)
+ - [ ] Checkpoint 10,000 (27%)
+ - [ ] Checkpoint 18,750 (50% - Epoch 1 complete)
+ - [ ] Checkpoint 37,500 (100% - DONE)
+ - [ ] Quantize to INT8
+ - [ ] Convert to ONNX
+ - [ ] Publish final model
+
+ ETA: Mid-March 2026
+
+ ### v0.2 (The Full Dataset)
+
+ - Dataset: 786,387 examples (full Stack)
+ - Duration: 418 days (~14 months)
+ - Epochs: 2.0
+ - Total tokens: ~314M
+ - Dataset fix: SHUFFLED (not alphabetical)
+ - Languages: All 80+ languages balanced
+ - Start: March 2026
+ - End: May 2027
+
+ ### v0.3+ (PC Era)
+
+ - Hardware upgrade: RTX 4060/4070
+ - Larger models: 350M-1B parameters
+ - Faster training: ~30x speedup
+ - Advanced techniques: LoRA, QLoRA, etc.
+
+ ## 💡 Philosophy
+
+ "The barrier to AI isn't money. It's mindset."
+
+ This project demonstrates:
+
+ - ✅ You CAN train LLMs without GPUs
+ - ✅ Patience > Hardware
+ - ✅ $0 budget is enough to start
+ - ✅ Limited resources inspire creativity
+ - ✅ Anyone can contribute to AI
+
+ ## 🚀 Usage (After Training Completes)
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Load model
+ model = AutoModelForCausalLM.from_pretrained("OpceanAI/Yuuki")
+ tokenizer = AutoTokenizer.from_pretrained("OpceanAI/Yuuki")
+
+ # Generate code
+ prompt = "def fibonacci(n):"
+ inputs = tokenizer(prompt, return_tensors="pt")
+ outputs = model.generate(**inputs, max_length=100)
+ code = tokenizer.decode(outputs[0])
+ print(code)
+ ```
+
+ ### Quantized (4x faster, 4x smaller)
+
+ ```python
+ # Coming after training completes
+ model = AutoModelForCausalLM.from_pretrained(
+     "OpceanAI/Yuuki",
+     subfolder="yuuki-v0.1-int8"
+ )
+ ```
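The INT8 artifact referenced above does not exist yet. A minimal sketch of how it could be produced with PyTorch dynamic quantization on CPU (an assumed approach, not the project's stated method):

```python
# Sketch: build an INT8 CPU variant with PyTorch dynamic quantization.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("OpceanAI/Yuuki")
model_int8 = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8   # quantize only the Linear layers
)
torch.save(model_int8.state_dict(), "yuuki-v0.1-int8.pt")  # planned subfolder: "yuuki-v0.1-int8"
```

The ONNX conversion listed in the roadmap could similarly be done with Hugging Face Optimum (again an assumption about tooling):

```python
# Sketch: export the trained checkpoint to ONNX via Optimum's onnxruntime integration.
from optimum.onnxruntime import ORTModelForCausalLM

onnx_model = ORTModelForCausalLM.from_pretrained("OpceanAI/Yuuki", export=True)
onnx_model.save_pretrained("yuuki-v0.1-onnx")
```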
+
+ ## ⚠️ Known Limitations
+
+ - Dataset order: Alphabetical (not shuffled) - learns early languages best
+ - Token count: Only ~30M tokens (vs GPT-2's 40B)
+ - Training speed: Very slow (~100 sec/step)
+ - Model size: Small (82M params)
+ - Language coverage: Incomplete due to alphabetical ordering
+
+ These will be addressed in v0.2 with a shuffled dataset.
+
+ ## 🔬 Technical Details
+
+ CPU Training (100 sec/step):
+
+ - Forward pass: 40 sec
+ - Backward pass: 40 sec
+ - Optimizer: 20 sec
+ - Total: ~100 sec
+
+ vs GPU Training (0.5 sec/step):
+
+ - 200x faster
+ - But costs $0.50-$2.00/hour
+ - 42 days = $500-$2,000
+
+ Mobile: FREE but SLOW
+ GPU: FAST but EXPENSIVE
+
+ For proof of concept: Mobile wins. 🏆
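For reference, the dollar range above works out as 42 days × 24 h ≈ 1,008 GPU-hours; at $0.50–$2.00/hour that is roughly $500–$2,000.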
+
+ ## 📊 Benchmarks (Post-Training)
+
+ Coming soon after training completes (~March 2026).
+
+ Expected performance:
+
+ - Agda: 85-95/100 (primary language)
+ - C: 85-92/100 (secondary language)
+ - Assembly: 75-85/100 (tertiary)
+ - Python: 10-20/100 (barely seen due to alphabet order)
+
+ ## 🙏 Acknowledgments
+
+ - HuggingFace: Infrastructure and transformers library
+ - BigCode: The Stack dataset
+ - The ML community: For saying "you need GPUs" - best motivation 😏
+
+ ## 📜 License
+
+ Apache 2.0 - See LICENSE file. You can use Yuuki commercially, modify it, distribute it. Just give credit. ✅
+
+ ## 🔗 Links
+
+ - GitHub: https://github.com/aguitauwu
+ - Discord: https://discord.gg/j8zV2u8k
+ - Progress updates: Check this model card
+
+ ## 📅 Updates
+
+ - 2026-01-29: Training started
+ - 2026-01-29: Step 1,000 reached - Loss 2.00
+ - 2026-01-29: Step 1,292 - NEW RECORD Loss 1.7053
+ - 2026-01-29: Repository created on HuggingFace
+
+ Last updated: 2026-01-29
+
  Follow the journey of training an LLM with $0 budget. One step at a time. 🌸