Update README.md

README.md

---
license: apache-2.0
base_model:
- Qwen/Qwen3-8B
---

<div align="center">

# HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs

<img src="https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/KIYEa1c_WJEWPpeS0L_k1.png" width="60%" alt="Kwaipilot"/>

---

<a href="https://huggingface.co/Kwaipilot/HIPO-8B" target="_blank">
<img alt="Hugging Face" src="https://img.shields.io/badge/HuggingFace-fcd022?style=for-the-badge&logo=huggingface&logoColor=000&labelColor"/>
</a>
<a href="https://arxiv.org/abs/2504.14286" target="_blank">
<img alt="arXiv" src="https://img.shields.io/badge/arXiv-2504.14286-b31b1b.svg?style=for-the-badge"/>
</a>

<br>

<a href="https://arxiv.org/abs/2507.08297"></a>

</div>

This work is a companion to our earlier report [**KAT-V1: Kwai-AutoThink Technical Report**](https://arxiv.org/abs/2507.08297), where we first introduced the **AutoThink paradigm** for controllable reasoning. While KAT-V1 outlined the overall framework of **SFT + RL** for adaptive reasoning, this paper provides the **detailed algorithmic design** of that training recipe.

# Overview

We introduce **HiPO (Hybrid Policy Optimization for Dynamic Reasoning in LLMs)**, a novel RL framework designed to enable models to decide when to “think” (i.e., Think-on) and when to skip reasoning (i.e., Think-off), thereby striking a balance between correctness and efficiency.

HiPO has two main components:



# Experimental Findings

**Think-on Only Training (Overthinking).**
Training the model solely on Think-on data causes it to reason on all problems, regardless of difficulty, a typical case of *overthinking*.

**GRPO on Cold-Start(on).**
Applying GRPO improves accuracy by **+3.1%**, but fails to reduce token length or thinking rate. Instead, token length on the simpler datasets even increases in pursuit of higher accuracy.

**Think-on/Think-off Mix.**
Training on a mixed dataset boosts accuracy by **+4.0%** compared to Cold-Start(on), while significantly reducing token length (**–10.8%**) and thinking rate (**–22%**). Adding GRPO here brings little additional gain.

**HiPO Advantage.**
With HiPO, the Cold-Start model achieves the best performance:
- **Accuracy: +6.2%**
- **Token length: –30%**
- **Thinking rate: –39%**

Overall, HiPO outperforms existing methods in both **efficiency** and **accuracy**.



# Data Format

**HiPO** produces responses in a **structured template** that makes the reasoning path explicit and machine-parsable. Two modes are supported:


# Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
# ...
print("content:\n", content)
```
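
The diff view elides the body of the snippet above (everything between the import and the final `print`). As a reference sketch only, the standard Qwen3-8B quickstart pattern, which HIPO-8B is based on, would look roughly as follows; the prompt, `max_new_tokens`, and the `</think>` token id `151668` follow the Qwen3 model card and are assumptions here, not taken from the elided lines:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Kwaipilot/HIPO-8B"

# Load tokenizer and model (dtype/device choices are illustrative)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "How many r's are in \"strawberry\"?"}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=4096)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# Split thinking content from the final answer at the </think> token
# (id 151668 in the Qwen3 vocabulary); Think-off yields an empty block.
try:
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:\n", thinking_content)
print("content:\n", content)
```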
***

# Citation

```
@article{Zhan2025HiPO,
  ...
}
```