|
|
--- |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- bigcode/the-stack |
|
|
language: |
|
|
- en |
|
|
- es |
|
|
base_model: |
|
|
- openai-community/gpt2 |
|
|
pipeline_tag: text-generation |
|
|
library_name: pytorch |
|
|
tags: |
|
|
- code |
|
|
- transformers |
|
|
metrics: |
|
|
- perplexity |
|
|
--- |
|
|
|
|
|
# 🌸 Yuuki – Code Generation Model Trained on a Phone
|
|
|
|
|
> **A multilingual code generation model trained entirely on a smartphone by a single person.** |
|
|
|
|
|
--- |
|
|
|
|
|
## ⚠️ Disclaimer
|
|
|
|
|
This is the **best Yuuki model available at this moment**. The next release will be **Yuuki v0.1**; once that version is published, planning for **v0.2** will begin.
|
|
|
|
|
**Important notes:** |
|
|
- 📱 This model is being trained **entirely on a smartphone** by a **single person**
|
|
- 📄 A **research paper** will be published soon exploring whether it's possible to train a code generation model on a mobile device
|
|
- 🚧 This is an **early-stage research project**, not a production-ready model
|
|
|
|
|
--- |
|
|
|
|
|
## 📱 Best Initial Yuuki Model (Early Snapshot)
|
|
|
|
|
This version of Yuuki represents the **strongest initial model** of the Yuuki project so far. |
|
|
|
|
|
While still early in training, this snapshot already demonstrates that: |
|
|
|
|
|
- ✅ The training pipeline is **functional**
|
|
- ✅ The dataset is being **correctly learned**
|
|
- ✅ The model is capable of generating **real, structured code-like outputs**
|
|
- ✅ Early language specialization (due to dataset order) is **clearly observable**
|
|
|
|
|
This is not a polished or production-ready model, but it is the **best starting point** Yuuki has achieved and a **solid foundation** for future versions.
|
|
|
|
|
Below are real generation samples from the current checkpoint, shown **transparently without filtering**. |
|
|
|
|
|
--- |
|
|
|
|
|
## 📊 Comparative Evaluation – Checkpoint 1400 vs Checkpoint 2000
|
|
|
|
|
| Metric | Checkpoint 1400 | Checkpoint 2000 | |
|
|
|--------|-----------------|-----------------| |
|
|
| **Training Progress** | 1,400 / 37,500 (3.7%) | 2,000 / 37,500 (5.3%) | |
|
|
| **Avg Loss** | 1.70 – 2.23 | 1.69 – 2.31 |
|
|
| **Training Speed** | ~100 sec / step | ~86 sec / step | |
|
|
| **Model Size** | 988 MB | 988 MB | |
|
|
| **Evaluated Languages** | Agda, C, Assembly, JS, Python | Agda, C, Assembly, JS, Python | |
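The progress percentages in the table follow directly from the step counts, and since this card lists perplexity as a metric, the reported average losses can also be converted to perplexity. The conversion below is a sketch that assumes the loss is mean cross-entropy in nats (the usual convention, though the card does not state it):

```python
import math

TOTAL_STEPS = 37_500

# Progress percentages from step counts (matches the table: 3.7% and 5.3%)
for steps in (1_400, 2_000):
    print(f"step {steps}: {steps / TOTAL_STEPS * 100:.1f}% of training")

# Perplexity = exp(mean cross-entropy loss), assuming the loss is in nats
for loss in (1.69, 2.31):
    print(f"loss {loss:.2f} -> perplexity {math.exp(loss):.2f}")
```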
|
|
|
|
|
--- |
|
|
|
|
|
## 🧪 Language Evaluation Scores
|
|
|
|
|
### 📉 Checkpoint 1400
|
|
|
|
|
| Language | Score | |
|
|
|----------|-------| |
|
|
| Agda | 20 | |
|
|
| C | 8 | |
|
|
| Assembly | 2 | |
|
|
| **Average** | **~10 / 100** | |
|
|
|
|
|
### 📈 Checkpoint 2000
|
|
|
|
|
| Language | Score | |
|
|
|----------|-------| |
|
|
| Agda | 55 | |
|
|
| C | 20 | |
|
|
| Assembly | 15 | |
|
|
| **Average** | **24.6 / 100** | |
|
|
|
|
|
> 📈 **+146% average improvement** despite minimal additional training (≈1.6% more of the total steps).
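The headline improvement figure can be checked directly from the two reported averages:

```python
cp1400_avg = 10.0   # ~10 / 100, checkpoint 1400 average score
cp2000_avg = 24.6   # 24.6 / 100, checkpoint 2000 average score

# Relative improvement between the two checkpoints
improvement = (cp2000_avg - cp1400_avg) / cp1400_avg * 100
print(f"+{improvement:.0f}%")  # +146%
```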
|
|
|
|
|
--- |
|
|
|
|
|
## 🔧 Code Quality Evolution
|
|
|
|
|
### 🔻 Checkpoint 1400 – Early, Noisy, Unstable
|
|
|
|
|
#### [Test 1/5] Agda |
|
|
**Prompt:** `module Main where` |
|
|
|
|
|
```agda |
|
|
BLAS : Bad memory unallocation! : 50 0x72ec90a000 |
|
|
module Main where |
|
|
|
|
|
open import Data.PropositionalEquality using (PropositionalEquality; |
|
|
open import Data.Nat |
|
|
|
|
|
data import Data.PropositionalEquality using (Propos; |
|
|
open-Equality -PropositionalEquality) using (Prop-} (Prop-Prop-ElAdvertisements) |
|
|
``` |
|
|
❌ Severe corruption, mixed tokens, runtime noise.
|
|
|
|
|
--- |
|
|
|
|
|
#### [Test 2/5] C |
|
|
**Prompt:** `int main() {` |
|
|
|
|
|
```c |
|
|
int main() { |
|
|
prostitution of '/*' (prostitution of '*' |
|
|
prost = $0) |
|
|
``` |
|
|
❌ No syntactic coherence, random token associations.
|
|
|
|
|
--- |
|
|
|
|
|
#### [Test 3/5] Assembly |
|
|
**Prompt:** `mov eax,` |
|
|
|
|
|
```asm |
|
|
mov eax, |
|
|
|
|
|
the International Press S. |
|
|
``` |
|
|
❌ No instruction continuation.
|
|
|
|
|
--- |
|
|
|
|
|
#### [Test 4/5] Generic |
|
|
**Prompt:** `function test() {` |
|
|
|
|
|
```javascript |
|
|
function test() { |
|
|
all test(1) { |
|
|
all a,b : a |
|
|
``` |
|
|
❌ Structural confusion, repeated patterns.
|
|
|
|
|
--- |
|
|
|
|
|
#### [Test 5/5] Python |
|
|
**Prompt:** `def hello():` |
|
|
|
|
|
```python |
|
|
def hello(): |
|
|
|
|
|
open program |
|
|
open import HUnit.H.P.B.P.P-E |
|
|
``` |
|
|
❌ Python not learned, heavy cross-language contamination.
|
|
|
|
|
--- |
|
|
|
|
|
### 🔺 Checkpoint 2000 – Structured, Language-Aware (Best Initial Yuuki)
|
|
|
|
|
#### [Test 1/5] Agda |
|
|
**Prompt:** `module Main where` |
|
|
|
|
|
```agda |
|
|
module Main where |
|
|
|
|
|
open import Function |
|
|
-- |
|
|
open import Data.Nat |
|
|
open import Function |
|
|
|
|
|
open import Data.Nat |
|
|
open import Data.Unit |
|
|
open import Data.Nat |
|
|
open import Data.Nat.Dec |
|
|
open import Data.Properties.Nat |
|
|
-- Dec |
|
|
open import Data.Nat |
|
|
open import Data.Properties |
|
|
open import Data.Base |
|
|
open import Data.Nat.Properties |
|
|
open import Data.Unary |
|
|
|
|
|
open import data |
|
|
``` |
|
|
✅ Shows strong familiarity with real Agda modules and imports.
|
|
|
|
|
--- |
|
|
|
|
|
#### [Test 2/5] C |
|
|
**Prompt:** `int main() {` |
|
|
|
|
|
```c |
|
|
int main() { |
|
|
no-print "no" |
|
|
no-print "no" |
|
|
print "no-print " |
|
|
print "no" |
|
|
print "no" |
|
|
print "no" |
|
|
fault-print "no" |
|
|
} |
|
|
run run |
|
|
``` |
|
|
⚠️ Syntax is unstable, but structure and intent are present.
|
|
|
|
|
--- |
|
|
|
|
|
#### [Test 3/5] Assembly |
|
|
**Prompt:** `mov eax,` |
|
|
|
|
|
```asm |
|
|
mov eax, "v2.2" |
|
|
ov eax, "v1" |
|
|
ov eax, "v2.2" |
|
|
ov eax, "v2._v1" |
|
|
ov eax, "v3" |
|
|
``` |
|
|
🌱 Demonstrates token-level familiarity with assembly-like patterns.
|
|
|
|
|
--- |
|
|
|
|
|
#### [Test 4/5] Generic |
|
|
**Prompt:** `function test() {` |
|
|
|
|
|
```javascript |
|
|
function test() { |
|
|
for (i=0;i=0; i++, 1) { |
|
|
(a=0; i++) { |
|
|
.-1 . A |
|
|
``` |
|
|
⚠️ Logic is broken, but syntactic scaffolding is emerging.
|
|
|
|
|
--- |
|
|
|
|
|
#### [Test 5/5] Python |
|
|
**Prompt:** `def hello():` |
|
|
|
|
|
```python |
|
|
def hello(): |
|
|
"Hello! |
|
|
tell-tell-it -o:Hello! |
|
|
if not exist ( |
|
|
echo "Hello! |
|
|
``` |
|
|
❌ Python not yet learned (expected due to alphabetical dataset order).
|
|
|
|
|
--- |
|
|
|
|
|
## 🧠 Interpretation
|
|
|
|
|
These outputs confirm that Yuuki: |
|
|
|
|
|
- 📚 **Learns real libraries** and language-specific tokens
|
|
- 🏗️ **Shows early structure** before correctness
|
|
- 📊 **Reflects dataset ordering effects** honestly
|
|
- 📈 **Improves gradually**, not magically
|
|
|
|
|
This behavior is **expected and healthy** at ~5% total training. |
|
|
|
|
|
--- |
|
|
|
|
|
## 🧠 Key Takeaway
|
|
|
|
|
Between **3.7% and 5.3%** training progress, Yuuki shows:
|
|
|
|
|
- ✅ Major qualitative gains
|
|
- ✅ Clear specialization trends
|
|
- ✅ Rapid early learning despite CPU-only constraints
|
|
|
|
|
This validates the project's core claim: |
|
|
|
|
|
> **Progress is real, measurable, and reproducible, even at $0 cost.**
|
|
|
|
|
--- |
|
|
|
|
|
## 📄 License
|
|
|
|
|
This project is licensed under the **Apache 2.0 License**. |
|
|
|
|
|
``` |
|
|
Licensed under the Apache License, Version 2.0 (the "License"); |
|
|
you may not use this file except in compliance with the License. |
|
|
You may obtain a copy of the License at |
|
|
|
|
|
https://huggingface.co/OpceanAI/Yuuki-the-best-model/blob/main/LICENSE |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## 🔗 Links
|
|
|
|
|
- 🤗 [Hugging Face Model](https://huggingface.co/OpceanAI/Yuuki-the-best-model)
|
|
- 📄 Research Paper (Coming Soon)
|
|
- [Training code](https://github.com/YuuKi-OS/yuuki-training) |
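
For readers who want to try the checkpoint themselves, here is a minimal, hypothetical usage sketch. It assumes the repository publishes standard GPT-2-style weights loadable through the Hugging Face `transformers` library (the card itself only specifies `pytorch`); the repo id is taken from the model link above, and `generate_sample` is an illustrative helper, not part of the project:

```python
# Hypothetical usage sketch -- assumes the checkpoint loads as standard
# GPT-2-style weights via the Hugging Face transformers library.
REPO_ID = "OpceanAI/Yuuki-the-best-model"

# The five prompts used in the evaluation tests above
PROMPTS = [
    "module Main where",   # Agda
    "int main() {",        # C
    "mov eax,",            # Assembly
    "function test() {",   # JS
    "def hello():",        # Python
]

def generate_sample(prompt: str, max_new_tokens: int = 64) -> str:
    """Generate a continuation for one prompt (downloads the model)."""
    from transformers import AutoModelForCausalLM, AutoTokenizer  # lazy import
    tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
    model = AutoModelForCausalLM.from_pretrained(REPO_ID)
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)

if __name__ == "__main__":
    for prompt in PROMPTS:
        print(generate_sample(prompt))
```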
|
|
|
|
|
--- |
|
|
|
|
|
<p align="center"> |
|
|
<i>Built with patience, a phone, and zero budget.</i><br> |
|
|
<b>🌸 Yuuki Project</b><br>
|
|
</p> |