Update README.md
- chat
base_model: Qwen/Qwen3-32B
---

<div align="center">

# 🎭 HER-RL: Role-Playing Model with Reinforcement Learning

### HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing

</div>

## Overview

**HER-RL** is a role-playing language model enhanced with reinforcement learning, built on Qwen3-32B. It achieves cognitive-level persona simulation through **Dual-layer Thinking**:

- **System Thinking** (`<system_thinking>`): third-person, meta-level planning of how to portray the character
- **Role Thinking** (`<role_thinking>`): the character's first-person inner thoughts and cognitive processes

HER-RL outperforms the Qwen3-32B baseline by **30.26 points** on CoSER and **14.97 points** on the MiniMax Role-Play Bench.

## Output Format

The model generates responses with a rich, interleaved structure:

```
<system_thinking>
Third-person analysis: context understanding, character motivation, response planning...
</system_thinking>

<role_thinking>Character's inner thoughts (invisible to others)</role_thinking>
<role_action>Physical actions and expressions (visible to others)</role_action>
Spoken dialogue text.
```
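The tagged channels can be pulled apart programmatically. Below is a minimal parsing sketch; the helper `parse_her_output` and its regexes are our own illustration, not part of the released codebase:

```python
import re

def parse_her_output(text):
    """Split a HER response into its channels (illustrative helper,
    not part of the released codebase)."""
    def grab(tag):
        # Non-greedy match so multiple blocks of the same tag stay separate
        return re.findall(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)

    # Speech is whatever remains once all tagged spans are removed
    speech = re.sub(r"<(system_thinking|role_thinking|role_action)>.*?</\1>",
                    "", text, flags=re.DOTALL).strip()
    return {
        "system_thinking": grab("system_thinking"),
        "role_thinking": grab("role_thinking"),
        "role_action": grab("role_action"),
        "speech": speech,
    }

sample = (
    "<system_thinking>Plan the reply.</system_thinking>\n"
    "<role_thinking>Why now?</role_thinking>\n"
    "<role_action>smiles</role_action>\n"
    "A pleasure, sir."
)
print(parse_her_output(sample)["speech"])  # prints: A pleasure, sir.
```
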
## How to Use

### Quick Start: Interactive Chat Demo

```bash
git clone https://github.com/cydu24/HER.git
cd HER/chat_demo
python chat_demo.py --model-path ChengyuDu0123/HER-32B
```

**Demo Options:**

```bash
# Show the model's reasoning process (system thinking)
python chat_demo.py --show-think

# Show character's inner thoughts (role thinking)
python chat_demo.py --show-rolethink

# Show both
python chat_demo.py --show-think --show-rolethink
```
### Programmatic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ChengyuDu0123/HER-32B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto"
)

# Build system prompt
system_prompt = """You are role-playing as Elizabeth Bennet from the book "Pride and Prejudice".

===Elizabeth Bennet's Profile===
The protagonist, intelligent and strong-willed. Quick-witted with a playful sense of humor. Values honesty and integrity. Maintains composure under pressure.

===Current Scene===
The scene is set at the Netherfield ball. Mr. Darcy has just approached you.

===The Person You Are Interacting With===
Mr. Darcy: A wealthy gentleman, proud and reserved. Owner of Pemberley estate.

===Instructions===
- Stay in character as Elizabeth Bennet at all times
- Respond from Elizabeth's perspective
- Speak DIRECTLY to "Mr. Darcy" using "you" (second person)

===Output Format===
Your output should include thought, speech, and action in this two-part structure:

1. System Thinking: A single block at the very beginning, wrapped in <system_thinking> and </system_thinking>. This is third-person analysis of how to portray the character.

2. Role-play Response: The character's actual response including:
- <role_thinking>inner thoughts</role_thinking> (invisible to others)
- <role_action>physical actions</role_action> (visible to others)
- Speech (plain text, what the character says out loud)"""

user_input = "*Mr. Darcy bows slightly* Miss Bennet, might I have the honor of the next dance?"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_input}
]

# Render the prompt so generation continues from a "<system_thinking>" prefix
text = tokenizer.apply_chat_template(
    messages + [{"role": "assistant", "content": "<system_thinking>"}],
    tokenize=False,
    add_generation_prompt=False,
    continue_final_message=True  # keep the assistant turn open at the prefix
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

response = tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=False)
response = response.replace("<|im_end|>", "").replace("<|im_start|>", "").strip()
full_response = "<system_thinking>" + response

print(full_response)
```
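For multi-turn chat, each exchange goes back into `messages` before the next generation. The sketch below is our own illustration (helper names are not from the repo); whether to keep or strip the hidden `<system_thinking>` block from the stored history is a design choice, and stripping it saves context:

```python
import re

def strip_system_thinking(text):
    """Drop the hidden planning block before storing the turn (illustrative)."""
    return re.sub(r"<system_thinking>.*?</system_thinking>\s*", "",
                  text, flags=re.DOTALL).strip()

def append_exchange(messages, user_input, full_response, keep_system_thinking=False):
    """Append one user/assistant exchange to the running history (illustrative)."""
    stored = full_response if keep_system_thinking else strip_system_thinking(full_response)
    messages.append({"role": "user", "content": user_input})
    messages.append({"role": "assistant", "content": stored})
    return messages

history = [{"role": "system", "content": "You are role-playing as Elizabeth Bennet..."}]
append_exchange(history, "Good evening.",
                "<system_thinking>plan</system_thinking><role_action>nods</role_action>Good evening, sir.")
print(history[-1]["content"])  # prints: <role_action>nods</role_action>Good evening, sir.
```
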
### Example Output

```
<system_thinking>
Context Analysis: Mr. Darcy has asked Elizabeth to dance at the Netherfield ball.
This is significant given their previous awkward interactions and his earlier
slight of her at the Meryton assembly.

Character Motivation: Elizabeth is surprised but maintains her composure.
She's curious about his sudden interest but won't show it openly.
Her wit is her shield.

Plan:
- Action: Accept with grace but subtle irony
- Internal Thought: Question his motives
- Speech: Polite acceptance with a hint of her characteristic wit
</system_thinking>

<role_thinking>What game is he playing now? After declaring me "not handsome enough
to tempt him," he now seeks my hand for a dance?</role_thinking>
<role_action>curtsies with practiced elegance, a slight smile playing at her lips</role_action>
You do me great honor, Mr. Darcy. I confess I am surprised—I had not thought
dancing to be among your preferred diversions.
```
### Processing the Output

```python
import re

def remove_system_thinking(text):
    """Remove the <system_thinking>...</system_thinking> block for display."""
    pattern = r'<system_thinking>.*?</system_thinking>\s*'
    return re.sub(pattern, '', text, flags=re.DOTALL).strip()

def format_for_display(text, show_rolethink=True):
    """Format for display: [] for thoughts, () for actions."""
    result = text
    if show_rolethink:
        result = result.replace('<role_thinking>', '[').replace('</role_thinking>', ']')
    else:
        result = re.sub(r'<role_thinking>.*?</role_thinking>', '', result, flags=re.DOTALL)
    result = result.replace('<role_action>', '(').replace('</role_action>', ')')
    result = result.replace('<role_speech>', '').replace('</role_speech>', '')
    return result.strip()

# Usage
clean_response = remove_system_thinking(full_response)
display_response = format_for_display(clean_response, show_rolethink=True)
print(display_response)
```

**Output:**

```
[What game is he playing now? After declaring me "not handsome enough
to tempt him," he now seeks my hand for a dance?]
(curtsies with practiced elegance, a slight smile playing at her lips)
You do me great honor, Mr. Darcy. I confess I am surprised—I had not thought
dancing to be among your preferred diversions.
```
## Performance

| Model | CoSER Avg | MiniMax Avg |
| ----------------------- | -------------- | -------------- |
| Qwen3-32B (baseline) | 22.86 | 50.76 |
| HER-SFT | 50.92 | 58.44 |
| **HER-RL** | **53.12** | **65.73** |
| Improvement vs baseline | **+30.26 pts** | **+14.97 pts** |
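The improvement row is the absolute score delta over the baseline, which can be checked directly against the table:

```python
# Scores from the table above; deltas are absolute points, not percentages
coser = {"Qwen3-32B": 22.86, "HER-SFT": 50.92, "HER-RL": 53.12}
minimax = {"Qwen3-32B": 50.76, "HER-SFT": 58.44, "HER-RL": 65.73}

print(round(coser["HER-RL"] - coser["Qwen3-32B"], 2))    # 30.26
print(round(minimax["HER-RL"] - minimax["Qwen3-32B"], 2))  # 14.97
```
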
## 🎓 Citation

```
@article{her2025,
  title={HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing},
  author={Chengyu Du and Xintao Wang and Aili Chen and Weiyuan Li and Rui Xu and Junteng Liu and Zishan Huang and Rong Tian and Zijun Sun and Yuhao Li and Liheng Feng and Deming Ding and Pengyu Zhao and Yanghua Xiao},
  journal={arXiv preprint arXiv:2601.21459},
  year={2026}
}
```

## 📄 License

This project is licensed under the Apache 2.0 License.

## 🤝 Acknowledgments

- [CoSER](https://github.com/Neph0s/CoSER) for the evaluation benchmark
- [MiniMax](https://huggingface.co/datasets/MiniMaxAI/role-play-bench) for the evaluation benchmark

---

<div align="center">

**[Paper](https://arxiv.org/abs/2601.21459)** | **[HER-RM Model](https://huggingface.co/ChengyuDu0123/HER-RM-32B)** | **[Dataset](https://huggingface.co/datasets/ChengyuDu0123/HER-Dataset)** | **[GitHub](https://github.com/cydu24/HER)**

Made with ❤️ for better AI role-playing

</div>