Kwaipilot/HiPO-8B

+---
+license: apache-2.0
+base_model:
+- Qwen/Qwen3-8B
+---
+<div align="center">
+# HIPO: HYBRID POLICY OPTIMIZATION FOR DYNAMIC REASONING IN LLMS
+</div>
+<div align="center">
+  <img src="https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/KIYEa1c_WJEWPpeS0L_k1.png" width="60%" alt="Kwaipilot" />
+</div>
+<hr>
+<div align="center" style="line-height: 1;">
+  <a href="https://huggingface.co/Kwaipilot/HIPO-8B" target="_blank">
+    <img alt="Hugging Face" src="https://img.shields.io/badge/HuggingFace-fcd022?style=for-the-badge&logo=huggingface&logoColor=000&labelColor"/>
+  </a>
+  <a href="https://arxiv.org/abs/2504.14286" target="_blank">
+    <img alt="arXiv" src="https://img.shields.io/badge/arXiv-2504.14286-b31b1b.svg?style=for-the-badge"/>
+  </a>
+</div>
+## Overview
+We introduce **HiPO (Hybrid Policy Optimization for Dynamic Reasoning in LLMs)**, a reinforcement learning (RL) framework that lets a model decide when to "think" (Think-on) and when to skip reasoning (Think-off), balancing correctness against efficiency.
+HiPO has two main components:
+- **Hybrid Data Pipeline** – Collects both think-on and think-off responses, categorizes queries by difficulty, and uses a strong model (e.g., DeepSeek-V3) to generate explanations that justify mode choices.
+- **Hybrid Reward System** – Combines rewards for both modes, with bias adjustment to prevent overuse of long reasoning and mode-aware advantage functions to align decisions with performance gains.
+![HiPO framework](https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/ZUk76mhDiVITfUsLcvv6F.png)
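The hybrid reward idea can be illustrated with a small sketch. This is not the authors' implementation: the function names `mode_bias` and `hybrid_advantages`, the coefficient `omega`, the rule for when the bias applies, and the group-normalization step are all assumptions made for illustration. It shows one possible way a bias term could favor think-off responses when long reasoning brings little accuracy gain, followed by a simple group-normalized advantage. The paper's mode-aware advantage functions are not reproduced here.

```python
from statistics import mean, pstdev


def mode_bias(rewards_on, rewards_off, omega=0.1):
    """Hypothetical bonus for think-off responses.

    Assumption: the bonus is a fraction `omega` of the mean think-on reward,
    granted only when think-on beats think-off by no more than that bonus.
    """
    if not rewards_on or not rewards_off:
        return 0.0
    mean_on, mean_off = mean(rewards_on), mean(rewards_off)
    bonus = omega * mean_on
    gap = mean_on - mean_off
    return bonus if 0.0 <= gap <= bonus else 0.0


def hybrid_advantages(samples, omega=0.1, eps=1e-6):
    """Group-normalized advantages over (mode, reward) pairs.

    `samples` is a list of ("on" | "off", reward) tuples for one query.
    """
    on = [r for m, r in samples if m == "on"]
    off = [r for m, r in samples if m == "off"]
    bias = mode_bias(on, off, omega)
    # Apply the bias to think-off responses only.
    adjusted = [r + bias if m == "off" else r for m, r in samples]
    mu, sigma = mean(adjusted), pstdev(adjusted)
    return [(r - mu) / (sigma + eps) for r in adjusted]


# Both modes are equally accurate here, so think-off receives the bonus.
samples = [("on", 1.0), ("on", 0.0), ("off", 1.0), ("off", 0.0)]
advantages = hybrid_advantages(samples)
```

In this toy group the correct think-off response ends up with a higher advantage than the correct think-on response, which is the qualitative effect the bias adjustment is meant to have.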
+## Evaluation Results
+![Evaluation results](https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/_qzxhMRTL_NTfaGb13LHc.png)
+## Data Format
+**HiPO** produces responses in a **structured template** that makes the reasoning path explicit and machine-parsable.
+Two modes are supported, think-on and think-off:
+![Response template for the think-on and think-off modes](https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/FXfAAN0WVpsaOn1wROInL.png)
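Because the template is meant to be machine-parsable, a response can be split into its sections with a few lines of code. The sketch below assumes each section is wrapped in XML-style `<tag>...</tag>` markers; the tag names `judge`, `think`, and `answer` in the sample string are placeholders, not confirmed by this card, so substitute the names shown in the figure above. `parse_sections` is a hypothetical helper, not part of any library.

```python
import re

# Matches a top-level <tag>...</tag> section; nested tags stay inside the body.
SECTION = re.compile(r"<(?P<tag>\w+)>(?P<body>.*?)</(?P=tag)>", re.DOTALL)


def parse_sections(response: str) -> dict[str, str]:
    """Map each top-level section tag to its stripped text."""
    return {m.group("tag"): m.group("body").strip() for m in SECTION.finditer(response)}


# Placeholder tag names, used only to exercise the parser.
sample = "<judge>Simple arithmetic; no reasoning needed.</judge>\n<answer>4</answer>"
sections = parse_sections(sample)

# Assumption: a response counts as think-on when it contains a reasoning section.
mode = "think-on" if "think" in sections else "think-off"
print(mode, sections["answer"])
```

The same function works on the `content` string produced by the Quick Start code, provided the tag names match the real template.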
+## Quick Start
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+model_name = "Kwaipilot/HiPO-8B"
+# load the tokenizer and the model
+tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    torch_dtype="auto",
+    device_map="auto"
+)
+# prepare the model input
+prompt = "Give me a short introduction to large language model."
+messages = [
+    {"role": "user", "content": prompt}
+]
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True
+)
+model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+# conduct text completion
+generated_ids = model.generate(
+    **model_inputs,
+    max_new_tokens=65536,
+    do_sample=True,  # temperature and top_p only take effect when sampling is enabled
+    temperature=0.6,
+    top_p=0.95,
+)
+output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
+content = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
+print("prompt:\n", prompt)
+print("content:\n", content)
+```
+***
+## Citation
+```
+@article{Zhan2025HiPO,
+  title={HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs},
+  author={Ken Deng and Zizheng Zhan and Wen Xiang and Wenqiang Zhu and others},
+  year={2025},
+  journal={arXiv preprint arXiv:2507.08297},
+  number={arXiv:2507.08297},
+  url={https://arxiv.org/abs/2507.08297}
+}
+```