feat: highlight Qwen 3.5-9B QLoRA scaling probe alongside 0.8B distilled student
README.md CHANGED

```diff
@@ -144,7 +144,7 @@ Invalid writes are rolled back instantly. The reward comes from actual match out
 
 ### 1. Policy Learning (Teacher → Student → RL)
 
-The agent learns *how* to engineer. Teacher distillation solves the cold-start problem, SFT teaches the workflow, and GRPO optimizes for real match reward.
+The agent learns *how* to engineer. Teacher distillation solves the cold-start problem, SFT teaches the workflow, and GRPO optimizes for real match reward. We trained both **Qwen 3.5-0.8B** (distilled student) and **Qwen 3.5-9B** (QLoRA GRPO scaling probe on H100) — the 0.8B student proved that even a tiny model can learn the full engineering workflow when properly bootstrapped, while the 9B experiments validated that the environment and reward design scale to larger models.
 
 ### 2. Codex Swarm (Autonomous Engine Search)
 
```