---
license: apache-2.0
base_model:
- Qwen/Qwen3-8B
library_name: transformers
---
<div align="center">
# HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs
<img src="https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/KIYEa1c_WJEWPpeS0L_k1.png" width="60%" alt="Kwaipilot"/>
---
<a href="https://huggingface.co/Kwaipilot/HIPO-8B" target="_blank">
<img alt="Hugging Face" src="https://img.shields.io/badge/HuggingFace-fcd022?style=for-the-badge&logo=huggingface&logoColor=000&labelColor"/>
</a>
<a href="https://arxiv.org/abs/2509.23967" target="_blank">
<img alt="arXiv" src="https://img.shields.io/badge/arXiv-2509.23967-b31b1b.svg?style=for-the-badge"/>
</a>
<br>
<a href="https://arxiv.org/abs/2509.23967"></a>
</div>
This work is a companion to our earlier KAT-V1 report, where we first introduced the **AutoThink paradigm** for controllable reasoning. While KAT-V1 outlined the overall **SFT + RL** framework for adaptive reasoning, this paper, [**HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs**](https://arxiv.org/abs/2509.23967), provides the **detailed algorithmic design** of that training recipe.
***
# Overview
We introduce **HiPO (Hybrid Policy Optimization for Dynamic Reasoning in LLMs)**, a novel RL framework that lets a model decide when to "think" (Think-on) and when to skip reasoning (Think-off), balancing correctness against efficiency.
HiPO has two main components:
- **Hybrid Data Pipeline** – Collects both think-on and think-off responses, categorizes queries by difficulty, and uses a strong model (e.g., DeepSeek-V3) to generate explanations that justify mode choices.
- **Hybrid Reward System** – Combines rewards for both modes, with bias adjustment to prevent overuse of long reasoning and mode-aware advantage functions to align decisions with performance gains.
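
To make the data-pipeline idea above more concrete, here is a minimal sketch of labeling a query with a preferred mode by comparing the pass rates of sampled think-on and think-off responses. The names `Sample`, `pass_rate`, and `choose_mode`, the `margin` threshold, and the tie-break toward think-off are all assumptions made for illustration; they are not taken from the paper, and the actual pipeline may differ.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    mode: str      # "think_on" or "think_off"
    correct: bool  # whether the final answer was judged correct


def pass_rate(samples, mode):
    """Fraction of correct responses among the samples generated in `mode`."""
    in_mode = [s for s in samples if s.mode == mode]
    if not in_mode:
        return 0.0
    return sum(s.correct for s in in_mode) / len(in_mode)


def choose_mode(samples, margin=0.1):
    """Hypothetical rule: pick think-on only if it beats think-off by more than `margin`."""
    gain = pass_rate(samples, "think_on") - pass_rate(samples, "think_off")
    return "think_on" if gain > margin else "think_off"


# Toy data (not from the paper): reasoning helps on the hard query only.
hard_query = (
    [Sample("think_on", True)] * 3 + [Sample("think_on", False)]
    + [Sample("think_off", True)] + [Sample("think_off", False)] * 3
)
easy_query = [Sample("think_on", True)] * 4 + [Sample("think_off", True)] * 4

print(choose_mode(hard_query))
print(choose_mode(easy_query))
```

Breaking ties toward think-off reflects the stated goal of avoiding unnecessary reasoning; a real pipeline would likely also weigh response length and use the generated explanations.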

# Experimental Findings
**Think-on Only (Overthinking).**
Training only on Think-on data makes the model reason on all problems, causing inefficiency.
**GRPO.**
Improves accuracy by **+3.1%**, but increases token length on simple tasks.
**Think-on/Think-off Mix.**
Yields higher accuracy (**+4.0%**) while reducing token length (**–10.8%**) and thinking rate (**–22%**).
**HiPO Advantage.**
Achieves the best results: **+6.2% accuracy**, **–30% token length**, **–39% thinking rate**, outperforming existing methods in both **efficiency** and **accuracy**.
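
The three quantities reported above (accuracy, token length, thinking rate) can be computed from per-response records roughly as sketched below. The record fields, the helper names `summarize` and `relative_change`, and the reading of "thinking rate" as the fraction of responses produced in Think-on mode are assumptions for this example. The numbers are toy data, not results from the paper, and whether the reported deltas are absolute or relative is not specified here.

```python
def summarize(records):
    """Aggregate per-response records.

    Each record is a dict with keys:
      'correct'  (bool) - final answer judged correct
      'tokens'   (int)  - number of generated tokens
      'think_on' (bool) - response used the Think-on mode
    """
    n = len(records)
    return {
        "accuracy": sum(r["correct"] for r in records) / n,
        "mean_tokens": sum(r["tokens"] for r in records) / n,
        "thinking_rate": sum(r["think_on"] for r in records) / n,
    }


def relative_change(new, base):
    """Relative change of `new` versus `base`, in percent."""
    return 100.0 * (new - base) / base


# Toy baseline: always thinks, long outputs.
baseline = [
    {"correct": True, "tokens": 1000, "think_on": True},
    {"correct": False, "tokens": 1000, "think_on": True},
    {"correct": True, "tokens": 1000, "think_on": True},
    {"correct": False, "tokens": 1000, "think_on": True},
]
# Toy adaptive model: thinks on half of the queries.
adaptive = [
    {"correct": True, "tokens": 1000, "think_on": True},
    {"correct": True, "tokens": 200, "think_on": False},
    {"correct": True, "tokens": 200, "think_on": False},
    {"correct": False, "tokens": 1000, "think_on": True},
]

base, new = summarize(baseline), summarize(adaptive)
for key in ("mean_tokens", "thinking_rate"):
    print(key, relative_change(new[key], base[key]))
print("accuracy (absolute change)", new["accuracy"] - base["accuracy"])
```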

# Data Format
**HiPO** produces responses in a **structured template** that makes the reasoning path explicit and machine-parsable. Two modes are supported: Think-on and Think-off.

# Quick Start
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Kwaipilot/HiPO-8B"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768,
    do_sample=True,  # sampling must be enabled for temperature/top_p to take effect
    temperature=0.6,
    top_p=0.95,
)

# keep only the newly generated tokens (drop the prompt tokens)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")

print("prompt:\n", prompt)
print("content:\n", content)
```
***
# Citation
```
@article{Zhan2025HiPO,
  title={HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs},
  author={Ken Deng and Zizheng Zhan and Wen Xiang and Wenqiang Zhu and others},
  year={2025},
  journal={arXiv preprint arXiv:2509.23967},
  url={https://arxiv.org/abs/2509.23967}
}
```