---
license: apache-2.0
datasets:
- bigcode/the-stack
language:
- en
- es
base_model:
- openai-community/gpt2
pipeline_tag: text-generation
library_name: pytorch
tags:
- code
- transformers
metrics:
- perplexity
---
# 🌸 Yuuki — Code Generation Model Trained on a Phone
> **A multilingual code generation model trained entirely on a smartphone by a single person.**
---
## ⚠️ Disclaimer
This is the **best Yuuki model available at the moment**. The next release will be **Yuuki v0.1**; once that version is published, planning for **v0.2** will begin.
**Important notes:**
- 📱 This model is being trained **entirely on a smartphone** by a **single person**
- 📄 A **research paper** will be published soon exploring whether it's possible to train a code generation model on a mobile device
- 🚧 This is an **early-stage research project**, not a production-ready model
---
## 🌱 Best Initial Yuuki Model (Early Snapshot)
This version of Yuuki represents the **strongest initial model** of the Yuuki project so far.
While still early in training, this snapshot already demonstrates that:
- ✅ The training pipeline is **functional**
- ✅ The dataset is being **correctly learned**
- ✅ The model is capable of generating **real, structured, code-like outputs**
- ✅ Early language specialization (due to dataset order) is **clearly observable**
This is not a polished or production-ready model, but it is the **best starting point** Yuuki has achieved and a **solid foundation** for future versions.
Below are real generation samples from the current checkpoint, shown **transparently without filtering**.
---
## 📊 Comparative Evaluation — Checkpoint 1400 vs Checkpoint 2000
| Metric | Checkpoint 1400 | Checkpoint 2000 |
|--------|-----------------|-----------------|
| **Training Progress** | 1,400 / 37,500 (3.7%) | 2,000 / 37,500 (5.3%) |
| **Avg Loss** | 1.70 – 2.23 | 1.69 – 2.31 |
| **Training Speed** | ~100 sec / step | ~86 sec / step |
| **Model Size** | 988 MB | 988 MB |
| **Evaluated Languages** | Agda, C, Assembly, JS, Python | Agda, C, Assembly, JS, Python |
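The frontmatter lists perplexity as the tracked metric while the table reports average loss and step time. A minimal sketch, assuming the reported "Avg Loss" is mean token cross-entropy in nats (so perplexity is simply `exp(loss)`), converting the checkpoint-2000 loss range to perplexity and estimating the wall-clock cost of the full 37,500-step schedule at the reported ~86 s/step:

```python
import math

# Assumption: "Avg Loss" is mean token cross-entropy in nats,
# so perplexity = exp(loss).
loss_low, loss_high = 1.69, 2.31   # checkpoint-2000 range from the table
ppl_low, ppl_high = math.exp(loss_low), math.exp(loss_high)
print(f"perplexity ~ {ppl_low:.1f} to {ppl_high:.1f}")  # roughly 5.4 to 10.1

# Back-of-envelope wall-clock estimate for the full 37,500-step schedule
# at the checkpoint-2000 speed of ~86 s/step.
total_steps, sec_per_step = 37_500, 86
days = total_steps * sec_per_step / 86_400  # 86,400 seconds per day
print(f"full run at this speed: ~{days:.0f} days")  # ~37 days
```

These are rough figures, since loss fluctuates within the reported range and step time varies with device thermals.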
---
## 🧪 Language Evaluation Scores
### 📍 Checkpoint 1400
| Language | Score |
|----------|-------|
| Agda | 20 |
| C | 8 |
| Assembly | 2 |
| **Average** | **~10 / 100** |
### πŸ“ Checkpoint 2000
| Language | Score |
|----------|-------|
| Agda | 55 |
| C | 20 |
| Assembly | 15 |
| **Average** | **24.6 / 100** |
> 📈 **+146% average improvement** despite minimal additional training (600 extra steps, ≈1.6% of the full schedule).
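The headline figure can be checked directly from the reported averages. A minimal sketch, assuming the improvement is the relative change between the two checkpoint averages:

```python
# Assumption: improvement is the relative change between the reported
# averages (~10/100 at step 1400, 24.6/100 at step 2000).
avg_1400, avg_2000 = 10.0, 24.6
improvement = (avg_2000 - avg_1400) / avg_1400 * 100
print(f"+{improvement:.0f}%")  # +146%

# Extra training between the two checkpoints, as a share of the full schedule.
extra_share = (2_000 - 1_400) / 37_500 * 100
print(f"{extra_share:.1f}% of the schedule")  # 1.6%
```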
---
## 🧠 Code Quality Evolution
### 🔻 Checkpoint 1400 — Early, Noisy, Unstable
#### [Test 1/5] Agda
**Prompt:** `module Main where`
```agda
BLAS : Bad memory unallocation! : 50 0x72ec90a000
module Main where
open import Data.PropositionalEquality using (PropositionalEquality;
open import Data.Nat
data import Data.PropositionalEquality using (Propos;
open-Equality -PropositionalEquality) using (Prop-} (Prop-Prop-ElAdvertisements)
```
❌ Severe corruption, mixed tokens, runtime noise.
---
#### [Test 2/5] C
**Prompt:** `int main() {`
```c
int main() {
prostitution of '/*' (prostitution of '*'
prost = $0)
```
❌ No syntactic coherence, random token associations.
---
#### [Test 3/5] Assembly
**Prompt:** `mov eax,`
```asm
mov eax,
the International Press S.
```
❌ No instruction continuation.
---
#### [Test 4/5] Generic
**Prompt:** `function test() {`
```javascript
function test() {
all test(1) {
all a,b : a
```
❌ Structural confusion, repeated patterns.
---
#### [Test 5/5] Python
**Prompt:** `def hello():`
```python
def hello():
open program
open import HUnit.H.P.B.P.P-E
```
❌ Python not learned, heavy cross-language contamination.
---
### 🔺 Checkpoint 2000 — Structured, Language-Aware (Best Initial Yuuki)
#### [Test 1/5] Agda
**Prompt:** `module Main where`
```agda
module Main where
open import Function
--
open import Data.Nat
open import Function
open import Data.Nat
open import Data.Unit
open import Data.Nat
open import Data.Nat.Dec
open import Data.Properties.Nat
-- Dec
open import Data.Nat
open import Data.Properties
open import Data.Base
open import Data.Nat.Properties
open import Data.Unary
open import data
```
✅ Reproduces Agda's module/import structure, though module names are repetitive and partly invented.
---
#### [Test 2/5] C
**Prompt:** `int main() {`
```c
int main() {
no-print "no"
no-print "no"
print "no-print "
print "no"
print "no"
print "no"
fault-print "no"
}
run run
```
⚠️ Syntax is unstable, but structure and intent are present.
---
#### [Test 3/5] Assembly
**Prompt:** `mov eax,`
```asm
mov eax, "v2.2"
ov eax, "v1"
ov eax, "v2.2"
ov eax, "v2._v1"
ov eax, "v3"
```
🌱 Demonstrates token-level familiarity with assembly-like patterns.
---
#### [Test 4/5] Generic
**Prompt:** `function test() {`
```javascript
function test() {
for (i=0;i=0; i++, 1) {
(a=0; i++) {
.-1 . A
```
⚠️ Logic is broken, but syntactic scaffolding is emerging.
---
#### [Test 5/5] Python
**Prompt:** `def hello():`
```python
def hello():
"Hello!
tell-tell-it -o:Hello!
if not exist (
echo "Hello!
```
❌ Python not yet learned (expected due to alphabetical dataset order).
---
## 🧠 Interpretation
These outputs confirm that Yuuki:
- 📚 **Learns real libraries** and language-specific tokens
- 🏗️ **Shows early structure** before correctness
- 📊 **Reflects dataset ordering effects** honestly
- 📈 **Improves gradually**, not magically
This behavior is **expected and healthy** at ~5% of total training.
---
## 🧠 Key Takeaway
Between **3.7% → 5.3%** training progress, Yuuki shows:
- ✅ Major qualitative gains
- ✅ Clear specialization trends
- ✅ Rapid early learning despite CPU-only constraints
This validates the project's core claim:
> **Progress is real, measurable, and reproducible, even at $0 cost.**
---
## 📜 License
This project is licensed under the **Apache 2.0 License**.
```
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
https://huggingface.co/OpceanAI/Yuuki-the-best-model/blob/main/LICENSE
```
---
## 🔗 Links
- 🤗 [Hugging Face Model](https://huggingface.co/OpceanAI/Yuuki-the-best-model)
- 📄 Research Paper (Coming Soon)
- [Training code](https://github.com/YuuKi-OS/yuuki-training)
---
<p align="center">
<i>Built with patience, a phone, and zero budget.</i><br>
<b>🌸 Yuuki Project</b>
</p>