arieldeng committed on
Commit a02de28 · verified · 1 Parent(s): 365dea7

Update README.md

Files changed (1)
  1. README.md +41 -22
README.md CHANGED
@@ -3,28 +3,31 @@ license: apache-2.0
3
  base_model:
4
  - Qwen/Qwen3-8B
5
  ---
 
6
  <div align="center">
7
-
8
- # HIPO: HYBRID POLICY OPTIMIZATION FOR DYNAMIC REASONING IN LLMS
9
 
10
- </div>
11
 
12
- <div align="center">
13
- <img src="https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/KIYEa1c_WJEWPpeS0L_k1.png" width="60%" alt="Kwaipilot" />
14
- </div>
15
 
16
- <hr>
17
 
18
- <div align="center" style="line-height: 1;">
19
- <a href="https://huggingface.co/Kwaipilot/HIPO-8B" target="_blank">
20
- <img alt="Hugging Face" src="https://img.shields.io/badge/HuggingFace-fcd022?style=for-the-badge&logo=huggingface&logoColor=000&labelColor"/>
21
- </a>
22
-
23
- <a href="https://arxiv.org/abs/2504.14286" target="_blank">
24
- <img alt="arXiv" src="https://img.shields.io/badge/arXiv-2504.14286-b31b1b.svg?style=for-the-badge"/>
25
- </a>
26
 
27
- ## Overview
28
 
29
  We introduce **HIPO (Hybrid Policy Optimization for Dynamic Reasoning in LLMs)**, a novel RL framework designed to enable models to decide when to “think” (i.e., Think-on) and when to skip reasoning (i.e., Think-off), thereby striking a balance between correctness and efficiency.
30
 
@@ -36,20 +39,36 @@ HIPO has two main components:
36
  ![Kim 2025-09-26 145531](https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/ZUk76mhDiVITfUsLcvv6F.png)
37
 
38
 
39
- ## Evaluation Results
40
 
41
  ![Kim 2025-09-26 145349](https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/_qzxhMRTL_NTfaGb13LHc.png)
42
 
43
 
44
- ## Data Format
45
 
46
- **HiPO** produces responses in a **structured template** that makes the reasoning path explicit and machine-parsable.
47
- Two modes are supported:
48
 
49
  ![Kim 2025-09-26 145842](https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/FXfAAN0WVpsaOn1wROInL.png)
50
 
51
 
52
- ## Quick Start
53
 
54
  ```python
55
  from transformers import AutoTokenizer, AutoModelForCausalLM
@@ -91,7 +110,7 @@ print("content:\n", content)
91
 
92
  ***
93
 
94
- ## Citation
95
 
96
  ```
97
  @article{Zhan2025HiPO,
 
3
  base_model:
4
  - Qwen/Qwen3-8B
5
  ---
6
+
7
  <div align="center">
 
 
8
 
9
+ # HIPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs
10
 
11
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/KIYEa1c_WJEWPpeS0L_k1.png" width="60%" alt="Kwaipilot"/>
 
 
12
 
13
+ ---
14
+
15
+ <a href="https://huggingface.co/Kwaipilot/HIPO-8B" target="_blank">
16
+ <img alt="Hugging Face" src="https://img.shields.io/badge/HuggingFace-fcd022?style=for-the-badge&logo=huggingface&logoColor=000&labelColor"/>
17
+ </a>
18
+ <a href="https://arxiv.org/abs/2504.14286" target="_blank">
19
+ <img alt="arXiv" src="https://img.shields.io/badge/arXiv-2504.14286-b31b1b.svg?style=for-the-badge"/>
20
+ </a>
21
 
22
+ <br>
23
 
24
+ <a href="https://arxiv.org/abs/2507.08297"></a>
25
+
26
+ </div>
27
+
28
+ This work is a companion to our earlier report [**KAT-V1: Kwai-AutoThink Technical Report**](https://arxiv.org/abs/2507.08297), where we first introduced the **AutoThink paradigm** for controllable reasoning. While KAT-V1 outlined the overall framework of **SFT + RL** for adaptive reasoning, this paper provides the **detailed algorithmic design** of that training recipe.
29
+
30
+ # Overview
31
 
32
  We introduce **HIPO (Hybrid Policy Optimization for Dynamic Reasoning in LLMs)**, a novel RL framework designed to enable models to decide when to “think” (i.e., Think-on) and when to skip reasoning (i.e., Think-off), thereby striking a balance between correctness and efficiency.
33
 
 
39
  ![Kim 2025-09-26 145531](https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/ZUk76mhDiVITfUsLcvv6F.png)
40
 
41
 
42
+ # Experimental Findings
43
+
44
+ **Think-on Only Training (Overthinking).**
45
+ Training the model solely on Think-on data causes it to reason on all problems, regardless of difficulty — a typical case of *overthinking*.
46
+
47
+ **GRPO on Cold-Start(on).**
48
+ Applying GRPO improves accuracy by **+3.1%**, but fails to reduce token length or thinking rate. Instead, token length on simpler datasets even increases to achieve higher accuracy.
49
+
50
+ **Think-on/Think-off Mix.**
51
+ Training on a mixed dataset boosts accuracy by **+4.0%** compared to Cold-Start(on), while significantly reducing token length (**–10.8%**) and thinking rate (**–22%**). Adding GRPO here brings little additional gain.
52
+
53
+ **HiPO Advantage.**
54
+ With HiPO, the Cold-Start model achieves the best performance:
55
+ - **Accuracy: +6.2%**
56
+ - **Token length: –30%**
57
+ - **Thinking rate: –39%**
58
+
59
+ Overall, HiPO outperforms existing methods in both **efficiency** and **accuracy**.
60
 
61
  ![Kim 2025-09-26 145349](https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/_qzxhMRTL_NTfaGb13LHc.png)
62
 
63
 
64
+ # Data Format
65
 
66
+ **HiPO** produces responses in a **structured template** that makes the reasoning path explicit and machine-parsable. Two modes are supported:
 
67
 
68
  ![Kim 2025-09-26 145842](https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/FXfAAN0WVpsaOn1wROInL.png)
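
The template itself appears only in the image above, so the following is a minimal sketch of how such a machine-parsable response might be split into its parts. The tag names `think` and `answer`, the `ParsedResponse` type, and the `parse_response` helper are all hypothetical and are not taken from the HiPO release; substitute the tags shown in the image.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical tag names (assumption): the real template is defined in the image.
THINK_TAG = "think"
ANSWER_TAG = "answer"


@dataclass
class ParsedResponse:
    mode: str                 # "think_on" or "think_off"
    reasoning: Optional[str]  # None when no reasoning block is present
    answer: str


def _extract(tag: str, text: str) -> Optional[str]:
    """Return the stripped contents of the first <tag>...</tag> block, if any."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_response(text: str) -> ParsedResponse:
    """Split a structured response into mode, reasoning, and answer."""
    reasoning = _extract(THINK_TAG, text)
    answer = _extract(ANSWER_TAG, text)
    if answer is None:
        raise ValueError("no answer block found")
    # A missing or empty reasoning block is treated as Think-off.
    mode = "think_on" if reasoning else "think_off"
    return ParsedResponse(mode=mode, reasoning=reasoning or None, answer=answer)
```

Under these assumptions, `parse_response("<answer>Paris</answer>")` would be classified as Think-off, while a response that also carries a non-empty reasoning block would be classified as Think-on.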
69
 
70
 
71
+ # Quick Start
72
 
73
  ```python
74
  from transformers import AutoTokenizer, AutoModelForCausalLM
 
110
 
111
  ***
112
 
113
+ # Citation
114
 
115
  ```
116
  @article{Zhan2025HiPO,