qtzx06 committed
Commit ac6b2ee · 1 Parent(s): 7b15ef1

feat: highlight Qwen 3.5-9B QLoRA scaling probe alongside 0.8B distilled student

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -144,7 +144,7 @@ Invalid writes are rolled back instantly. The reward comes from actual match out
 
  ### 1. Policy Learning (Teacher → Student → RL)
 
- The agent learns *how* to engineer. Teacher distillation solves the cold-start problem, SFT teaches the workflow, and GRPO optimizes for real match reward.
+ The agent learns *how* to engineer. Teacher distillation solves the cold-start problem, SFT teaches the workflow, and GRPO optimizes for real match reward. We trained both **Qwen 3.5-0.8B** (distilled student) and **Qwen 3.5-9B** (QLoRA GRPO scaling probe on H100) — the 0.8B student proved that even a tiny model can learn the full engineering workflow when properly bootstrapped, while the 9B experiments validated that the environment and reward design scale to larger models.
 
  ### 2. Codex Swarm (Autonomous Engine Search)
 