Kwaipilot/HiPO-8B

@@ -41,22 +41,17 @@ HIPO has two main components:
 # Experimental Findings
-**Think-on Only Training (Overthinking).**
-Training the model solely on Think-on data causes it to reason on all problems, regardless of difficulty — a typical case of *overthinking*.
-**GRPO on Cold-Start(on).**
-Applying GRPO improves accuracy by **+3.1%**, but fails to reduce token length or thinking rate. Instead, token length on simpler datasets even increases to achieve higher accuracy.
 **Think-on/Think-off Mix.**
-Training on a mixed dataset boosts accuracy by **+4.0%** compared to Cold-Start(on), while significantly reducing token length (**–10.8%**) and thinking rate (**–22%**). Adding GRPO here brings little additional gain.
 **HiPO Advantage.**
-With HiPO, the Cold-Start model achieves the best performance:
-- **Accuracy: +6.2%**
-- **Token length: –30%**
-- **Thinking rate: –39%**
-Overall, HiPO outperforms existing methods in both **efficiency** and **accuracy**.
 ![Kim 2025-09-26 145349](https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/_qzxhMRTL_NTfaGb13LHc.png)

 # Experimental Findings
+**Think-on Only (Overthinking).**
+Training only on Think-on data makes the model reason on every problem regardless of difficulty, wasting tokens on easy ones.
+**GRPO.**
+Improves accuracy (**+3.1%**) but does not reduce token length or thinking rate; token length on simpler datasets even increases.
 **Think-on/Think-off Mix.**
+Raises accuracy (**+4.0%**) over Cold-Start(on) while reducing token length (**–10.8%**) and thinking rate (**–22%**).
 **HiPO Advantage.**
+Achieves the best results: **+6.2% accuracy**, **–30% token length**, **–39% thinking rate**, outperforming existing methods in both **efficiency** and **accuracy**.
 ![Kim 2025-09-26 145349](https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/_qzxhMRTL_NTfaGb13LHc.png)