---
license: apache-2.0
datasets:
- bigcode/the-stack
language:
- es
- en
base_model:
- openai-community/gpt2
pipeline_tag: text-generation
tags:
- code
new_version: OpceanAI/Yuuki-the-best-model
library_name: transformers
---

## ⚠️ Notice on Current Model Scope

Please note that **Yuuki**, in its current state, represents **approximately 3.7%** of the total training planned for **version v0.1**.

At this stage, Yuuki should be considered an **early and incomplete snapshot** of the model. The full **v0.1** model, which will include the remaining training stages, additional refinements, and stabilization, will be published at a later date.

As such, performance, behavior, or capability assessments based on the current version of Yuuki **do not reflect** the final characteristics of the v0.1 model.

Further updates will be provided as development progresses.

# 💸 Yuuki v0.1 - The $0 Code LLM

> ⚠️ WORK IN PROGRESS - Currently training on mobile CPU (Day 3/42)

## 🎯 The Mission

Prove that you DON'T need expensive GPUs to train LLMs.

Yuuki is a code generation model trained entirely on a $150 Android phone with:

- ❌ No cloud compute
- ❌ No GPU
- ❌ No data center
- ✅ Just determination and time

## The Setup

- Hardware: Snapdragon 685 (8-core ARM CPU)
- RAM: 6GB
- Storage: 128GB
- NPU: Hexagon 686 (1 TOPS)
- GPU: Adreno 610 (243 GFLOPS) - NOT USED for training
- Cost: $0 in compute

## 📊 Current Status

| Metric | Value |
|---|---|
| Progress | 1,417 / 37,500 steps (3.78%) |
| Epoch | 0.08 / 2.0 |
| Current Loss | ~1.70 - 2.23 |
| Best Loss | 1.7053 ✅ |
| Training Time | ~3 days |
| ETA | ~39 days remaining |
| Speed | ~100 sec/step |

### Loss Progression

| Step | Loss | Change |
|---|---|---|
| 0 | 3.35 | (baseline) |
| 500 | 2.50 | ↓ 25% |
| 1,000 | 2.00 | ↓ 40% |
| 1,265 | 1.83 | ↓ 45% |
| 1,292 | 1.71 | ↓ 49% (RECORD) |
| 1,417 | 2.23 | (current, oscillating 1.7-2.3) |

## 📚 What Yuuki Knows (So Far)

Because the dataset is processed in alphabetical order:

| Language | Exposure | Quality | Status |
|---|---|---|---|
| Agda | High | 85/100 | ✅ Excellent |
| C | Starting | 30/100 | ⏳ Learning |
| Assembly | Low | 5/100 | 🌱 Minimal |
| Python | None | 0/100 | ❌ Not reached yet |

### Example Output (Step 1,300)

Agda prompt: `module Main where`

    module Main where (x, f) in a

    open import Cubical.Sigma
    open import Cubical.Sigma.Core
    open import Cubical.Foundations.H

✅ Real Agda libraries! The model learned actual Cubical type theory modules.

## 🛠️ Training Configuration

- Model: DistilGPT-2 (82M parameters)
- Dataset: The Stack (75,000 examples)
- Batch size: 1
- Gradient accumulation: 4
- Effective batch: 4
- Learning rate: 5e-5
- Max length: 256 tokens
- Optimizer: AdamW
- Epochs: 2
- Total tokens: ~30M (2 epochs)
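
For reference, here is a minimal sketch of what this configuration could look like with the `transformers` `Trainer` API. The actual training script is not published, so the output path, dataset handling, and the `tokenized_dataset` name below are assumptions, not Yuuki's real code:

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Base model from the configuration above (82M-parameter DistilGPT-2)
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 family has no pad token

args = TrainingArguments(
    output_dir="yuuki-v0.1",          # hypothetical path
    per_device_train_batch_size=1,    # one example at a time in 6GB RAM
    gradient_accumulation_steps=4,    # effective batch size of 4
    learning_rate=5e-5,
    num_train_epochs=2,
    use_cpu=True,                     # no GPU on the phone
)

# trainer = Trainer(model=model, args=args, train_dataset=tokenized_dataset)
# trainer.train()
```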

### Why so slow?

    100 seconds/step × 37,500 steps = 3,750,000 seconds
    = 1,042 hours
    = 43.4 days
    = ~6 weeks of continuous training

No GPU acceleration. Pure CPU grinding. 💪

## 📍 Roadmap

### v0.1 (Current - Proof of Concept)

- [x] Setup training pipeline
- [x] Start training (Step 0)
- [x] Reach Step 1,000
- [x] Break loss 2.0 barrier
- [x] Break loss 1.8 barrier ✅
- [ ] Checkpoint 2,500 (7%)
- [ ] Checkpoint 5,000 (13%)
- [ ] Checkpoint 10,000 (27%)
- [ ] Checkpoint 18,750 (50% - Epoch 1 complete)
- [ ] Checkpoint 37,500 (100% - DONE)
- [ ] Quantize to INT8 (one possible route is sketched below)
- [ ] Convert to ONNX (see the same sketch)
- [ ] Publish final model

ETA: Mid-March 2026
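
Neither of the last two items is implemented yet. As a hedged illustration only, here is one way they could be done with `torch` dynamic quantization and the `optimum` exporter; none of this is Yuuki's actual pipeline:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("OpceanAI/Yuuki")

# INT8: dynamic quantization of the linear layers, aimed at CPU inference
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# ONNX: one option is the optimum library's exporter (an assumption, not
# the project's confirmed tool):
# from optimum.onnxruntime import ORTModelForCausalLM
# onnx_model = ORTModelForCausalLM.from_pretrained("OpceanAI/Yuuki", export=True)
```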

### v0.2 (The Full Dataset)

- Dataset: 786,387 examples (full Stack)
- Duration: 418 days (~14 months)
- Epochs: 2.0
- Total tokens: ~314M
- Dataset fix: SHUFFLED (not alphabetical)
- Languages: All 80+ languages balanced
- Start: March 2026
- End: May 2027

### v0.3+ (PC Era)

- Hardware upgrade: RTX 4060/4070
- Larger models: 350M-1B parameters
- Faster training: ~30x speedup
- Advanced techniques: LoRA, QLoRA, etc.

## 💡 Philosophy

> "The barrier to AI isn't money. It's mindset."

This project demonstrates:

- ✅ You CAN train LLMs without GPUs
- ✅ Patience > Hardware
- ✅ $0 budget is enough to start
- ✅ Limited resources inspire creativity
- ✅ Anyone can contribute to AI

## 🚀 Usage (After Training Completes)

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load model
    model = AutoModelForCausalLM.from_pretrained("OpceanAI/Yuuki")
    tokenizer = AutoTokenizer.from_pretrained("OpceanAI/Yuuki")

    # Generate code
    prompt = "def fibonacci(n):"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=100)
    code = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(code)

### Quantized (4x faster, 4x smaller)

Coming after training completes:

    model = AutoModelForCausalLM.from_pretrained(
        "OpceanAI/Yuuki",
        subfolder="yuuki-v0.1-int8"
    )

## ⚠️ Known Limitations

- Dataset order: Alphabetical (not shuffled) - learns early languages best
- Token count: Only ~30M tokens (vs GPT-2's 40B)
- Training speed: Very slow (~100 sec/step)
- Model size: Small (82M params)
- Language coverage: Incomplete due to alphabetical ordering

These will be addressed in v0.2 with a shuffled dataset.

## 🔬 Technical Details

CPU Training (100 sec/step):

- Forward pass: 40 sec
- Backward pass: 40 sec
- Optimizer: 20 sec
- Total: ~100 sec
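
As a hedged illustration of where that time goes, a micro-benchmark along these lines could reproduce the breakdown; the model name and prompt are assumptions, and real numbers will vary with the phone's load:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("distilgpt2")
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# A single causal-LM training example: labels are the inputs themselves
batch = tokenizer("def main():", return_tensors="pt")
batch["labels"] = batch["input_ids"].clone()

t0 = time.perf_counter()
loss = model(**batch).loss      # forward pass
t1 = time.perf_counter()
loss.backward()                 # backward pass
t2 = time.perf_counter()
optimizer.step()                # optimizer update
optimizer.zero_grad()
t3 = time.perf_counter()

print(f"forward {t1 - t0:.2f}s, backward {t2 - t1:.2f}s, optimizer {t3 - t2:.2f}s")
```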

vs GPU Training (0.5 sec/step):

- 200x faster
- But costs $0.50-$2.00/hour
- 42 days = $500-$2,000

Mobile: FREE but SLOW. GPU: FAST but EXPENSIVE.

For a proof of concept: Mobile wins. 🏆

## 📈 Benchmarks (Post-Training)

Coming soon after training completes (~March 2026). Expected performance:

- Agda: 85-95/100 (primary language)
- C: 85-92/100 (secondary language)
- Assembly: 75-85/100 (tertiary)
- Python: 10-20/100 (barely seen due to alphabetical order)

## 🙏 Acknowledgments

- HuggingFace: Infrastructure and the transformers library
- BigCode: The Stack dataset
- The ML community: For saying "you need GPUs" - best motivation 😄

## 📄 License

Apache 2.0 - See the LICENSE file. You can use Yuuki commercially, modify it, and distribute it. Just give credit. ✅

## 🔗 Links

- GitHub: https://github.com/aguitauwu
- Discord: https://discord.gg/j8zV2u8k
- Progress updates: Check this model card

## 📅 Updates

- 2026-01-29: Training started
- 2026-01-29: Step 1,000 reached - Loss 2.00
- 2026-01-29: Step 1,292 - NEW RECORD Loss 1.7053
- 2026-01-29: Repository created on HuggingFace

Last updated: 2026-01-29

Follow the journey of training an LLM with $0 budget. One step at a time. 💸