shunxing1234 committed · Commit 16d3ec8 · verified · 1 Parent(s): ec72e5f

Update README.md

Files changed (1): README.md (+105 -3)
---
license: apache-2.0
base_model:
- Qwen/Qwen3-8B
---
<div align="center">

# HIPO: HYBRID POLICY OPTIMIZATION FOR DYNAMIC REASONING IN LLMs

</div>

<div align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/KIYEa1c_WJEWPpeS0L_k1.png" width="60%" alt="Kwaipilot" />
</div>

<hr>

<div align="center" style="line-height: 1;">
  <a href="https://huggingface.co/Kwaipilot/HIPO-8B" target="_blank">
    <img alt="Hugging Face" src="https://img.shields.io/badge/HuggingFace-fcd022?style=for-the-badge&logo=huggingface&logoColor=000&labelColor"/>
  </a>

  <a href="https://arxiv.org/abs/2504.14286" target="_blank">
    <img alt="arXiv" src="https://img.shields.io/badge/arXiv-2504.14286-b31b1b.svg?style=for-the-badge"/>
  </a>
</div>

## Overview

We introduce **HiPO (Hybrid Policy Optimization for Dynamic Reasoning in LLMs)**, a novel RL framework designed to enable models to decide when to “think” (i.e., Think-on) and when to skip reasoning (i.e., Think-off), thereby striking a balance between correctness and efficiency.

HiPO has two main components:

- **Hybrid Data Pipeline** – Collects both think-on and think-off responses, categorizes queries by difficulty, and uses a strong model (e.g., DeepSeek-V3) to generate explanations that justify mode choices.
- **Hybrid Reward System** – Combines rewards for both modes, with bias adjustment to prevent overuse of long reasoning and mode-aware advantage functions to align mode decisions with actual performance gains (a minimal sketch follows this list).
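
The reward logic can be pictured with a short sketch. The following is a minimal illustration only, assuming a binary accuracy reward; the function names, the `think_bias` constant, and the per-mode group baseline are assumptions made for illustration, not the paper's exact formulation.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    correct: bool   # answer matched the reference
    think_on: bool  # model chose the Think-on (explicit reasoning) mode

def hybrid_reward(r: Rollout, think_bias: float = 0.1) -> float:
    """Accuracy reward with a bias term that slightly discounts Think-on
    responses, discouraging overuse of long reasoning (hypothetical values)."""
    reward = 1.0 if r.correct else 0.0
    if r.think_on:
        reward -= think_bias  # bias adjustment against unnecessary reasoning
    return reward

def mode_aware_advantages(group: list[Rollout]) -> list[float]:
    """Advantages computed against a same-mode baseline, so a mode choice is
    only reinforced when it improves over that mode's own average."""
    advantages = []
    for r in group:
        peers = [hybrid_reward(x) for x in group if x.think_on == r.think_on]
        baseline = sum(peers) / len(peers)
        advantages.append(hybrid_reward(r) - baseline)
    return advantages
```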

![HiPO framework overview](https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/ZUk76mhDiVITfUsLcvv6F.png)


## Evaluation Results

![Evaluation results](https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/_qzxhMRTL_NTfaGb13LHc.png)


## Data Format

**HiPO** produces responses in a **structured template** that makes the reasoning path explicit and machine-parsable.
Two modes are supported:

![Data format template](https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/FXfAAN0WVpsaOn1wROInL.png)
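
The exact markers are shown in the template figure above. As a rough illustration, assuming Think-on responses wrap the reasoning segment in `<think>...</think>` tags as in the Qwen3 base model (an assumption; consult the figure for the exact markers), a response can be split into its reasoning and answer parts like this:

```python
import re

def split_response(text: str) -> tuple[str | None, str]:
    """Split a model response into (reasoning, answer).

    Assumes Think-on responses carry a <think>...</think> block
    (inherited from the Qwen3 base model) and Think-off responses
    do not; adjust the markers to match the actual template.
    """
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if m is None:                    # Think-off: answer only
        return None, text.strip()
    reasoning = m.group(1).strip()   # Think-on: explicit reasoning path
    answer = text[m.end():].strip()
    return reasoning, answer

# Example: a Think-off response yields (None, "4").
print(split_response("4"))
```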


## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Kwaipilot/HIPO-8B"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion (do_sample=True so temperature/top_p take effect)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=65536,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
print("prompt:\n", prompt)
print("content:\n", content)
```
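
Depending on the mode the model selects, `content` may or may not begin with a reasoning block; a parser like the `split_response` sketch in the Data Format section can separate the two.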

***

## Citation

```bibtex
@article{Zhan2025HiPO,
  title={HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs},
  author={Deng, Ken and Zhan, Zizheng and Xiang, Wen and Zhu, Wenqiang and others},
  journal={arXiv preprint arXiv:2507.08297},
  year={2025},
  url={https://arxiv.org/abs/2507.08297}
}
```