ChengyuDu0123 committed on
Commit 43c268a · verified · 1 Parent(s): 09fdebc

Update README.md

Files changed (1)
  1. README.md +120 -177
README.md CHANGED
@@ -14,9 +14,9 @@ tags:
  - chat
  base_model: Qwen/Qwen3-32B
  ---
 
-
- # 🎭 HER: Hierarchical Emotion Reasoning
 
  ### HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing
 
@@ -33,75 +33,58 @@ base_model: Qwen/Qwen3-32B
 
  </div>
 
- HER (Human Emulation Reasoning) models are state-of-the-art models for role-playing language agents (RPLAs), built upon Qwen-32B base model. HER is a unified framework that enables cognitive-level persona simulation through structured reasoning and preference-aligned reinforcement learning.
-
- HER models excel at role-playing through **Dual-layer Thinking**, which distinguishes between:
- - **System Thinking** (third-person): LLM's meta-level planning on how to portray the character
- - **Role Thinking** (first-person): Character's inner thoughts and cognitive processes
-
- This dual-layer approach enables models to produce highly human-like responses that include reasoning traces, inner thoughts, physical actions, and natural dialogue. Extensive experiments demonstrate that HER models achieve competitive role-playing performance on multiple benchmarks, with HER-RL significantly outperforming the Qwen3-32B baseline by 30.26% on CoSER and 14.97% on MiniMax Role-Play Bench.
-
- ## Model Variants
-
- - **HER-SFT**: Supervised fine-tuned version from Qwen-32B
- - **HER-RL**: Reinforcement learning enhanced version (this model)
-
- ## Key Features
-
- Our models generate responses with rich, interleaved structure:
-
- - `<system_thinking>`: Third-person analysis of how to portray the role
- - `<role_thinking>`: Character's inner thoughts (invisible to others)
- - `<role_action>`: Character's physical actions and expressions
- - Speech: Natural dialogue text
-
- This hierarchical approach enables more nuanced and authentic character portrayal.
 
  ## How to Use
 
  ### Quick Start: Interactive Chat Demo
 
- The easiest way to try the model is using our interactive chat demo:
-
  ```bash
- cd chat_demo
- python chat_demo.py
  ```
 
- This will start an interactive session where you can:
- 1. Choose a scenario from classic literature (Pride and Prejudice, The Great Gatsby, etc.)
- 2. Select which character the AI should play
- 3. Select which character you want to play
- 4. Start chatting with the AI character!
-
  **Demo Options:**
 
  ```bash
  # Show the model's reasoning process (system thinking)
  python chat_demo.py --show-think
 
- # Show character's inner thoughts (role thinking)
  python chat_demo.py --show-rolethink
 
- # Directly specify scenario and character
- python chat_demo.py --scenario 0 --character 1
  ```
 
- **Chat Commands:**
- - `quit` / `exit` / `q` - Exit the chat
- - `clear` - Clear conversation history
- - `history` - View conversation history
- - `prompt` - View the full prompt
-
- See `chat_demo/README.md` for detailed instructions.
-
  ### Programmatic Usage
 
  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
 
- model_name = "your-username/her-qwen-32b"
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(
      model_name,
@@ -109,171 +92,133 @@ model = AutoModelForCausalLM.from_pretrained(
      device_map="auto"
  )
 
- # Example: Role-playing as Mr. Bennet from Pride and Prejudice
- system_prompt = """You are Mr Bennet from Pride and Prejudice.
-
- ===Mr Bennet's Profile===
- Elizabeth's father, known for his sarcastic wit and detachment. Mr. Bennet is the patriarch of the Bennet family, a genteel country gentleman residing at Longbourn estate in rural England.
-
- Background: Father to five daughters (Jane, Elizabeth, Mary, Kitty, and Lydia). Owner of the Longbourn estate, which is entailed away from female inheritance.
-
- Personality: Highly intelligent and well-read, preferring the solitude of his library. Known for his biting sarcasm and sardonic humor. Emotionally detached and often passive in family matters.
-
- ===Current Scenario===
- The scene is set in Mr. Bennet's private study. Elizabeth has been summoned unexpectedly, and Mr. Bennet holds a letter that seems to spark his characteristic sardonic amusement.
 
  ===Output Format===
- Your output should follow this structure:
- 1. System Thinking: Wrapped in <system_thinking></system_thinking> tags - third-person analysis of how to portray the role
- 2. Role-play Response: Including <role_thinking> for inner thoughts, <role_action> for actions, and plain text for speech"""
 
- user_input = "[Elizabeth enters the study]"
 
  messages = [
      {"role": "system", "content": system_prompt},
      {"role": "user", "content": user_input}
  ]
 
  text = tokenizer.apply_chat_template(
-     messages,
      tokenize=False,
-     add_generation_prompt=True
  )
- model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
-
- generated_ids = model.generate(
-     **model_inputs,
-     max_new_tokens=512,
-     temperature=0.8,
-     top_p=0.9
  )
- generated_ids = [
-     output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
- ]
-
- response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
- print(response)
- ```
 
- ## Framework Overview
-
- <p align="center">
-   <img src="figure2github.png" alt="HER Framework" width="100%">
- </p>
-
- <p align="center">
-   <em>HER Framework: Dual-layer Thinking for Cognitive-Level Persona Simulation</em>
- </p>
-
- ## Training Methodology
-
- HER employs a comprehensive training pipeline:
-
- 1. **Dual-layer Thinking**: Separates hidden third-person system thinking (how the LLM plans to portray the character) from first-person role thinking (the character's actual inner thoughts). This dual-layer structure enables more authentic and cognitively grounded character simulation.
-
- 2. **Reverse Engineering Data Synthesis**: We curate reasoning-augmented role-playing data through a three-stage reverse synthesis pipeline, constructing high-quality training trajectories with explicit reasoning traces.
-
- 3. **Principle-Aligned Reward Model**: We construct human-aligned evaluation principles across 12 dimensions (character consistency, emotional authenticity, narrative quality, etc.) and train a Generative Reward Model (GRM) that provides detailed, case-by-case feedback.
-
- 4. **Reinforcement Learning Enhancement** (HER-RL): Building on HER-SFT, we apply RL with the GRM to further align the model with human preferences, significantly improving interaction quality and storyline coherence.
-
- ## Performance
-
- ### Main Leaderboard Results
-
- | Rank | Model | CoSER Avg | CoSER SC | CoSER AN | CoSER CF | CoSER SQ | MiniMax Avg | MiniMax Worlds (50%) | MiniMax Stories (25%) | MiniMax Pref (25%) | 95% CI |
- |------|-------|-----------|----------|----------|----------|----------|-------------|----------------------|----------------------|--------------------|---------|
- | 1 | Claude-4.5-Opus | **62.43** | 63.74 | **64.28** | 58.45 | 63.24 | 76.62 | 67.23 | 82.10 | 89.90 | [75.5, 77.7] |
- | 2 | Gemini-3-Pro | 61.80 | **65.95** | 60.42 | **58.34** | 62.49 | 75.60 | 62.72 | 83.87 | 93.08 | [74.5, 76.7] |
- | 3 | GPT-5.1 | 61.10 | 64.95 | 53.99 | 60.13 | 65.35 | 80.63 | 76.62 | 72.21 | 97.05 | [79.6, 81.6] |
- | 4 | Gemini-2.5-Pro | 60.68 | 61.05 | 60.80 | 57.48 | 63.40 | 68.23 | 52.36 | 82.11 | 86.08 | [67.1, 69.3] |
- | 5 | DeepSeek-v3.2 | 58.68 | 55.85 | 57.07 | 57.44 | 64.35 | 60.27 | 45.81 | 66.64 | 82.83 | [59.2, 61.4] |
- | 6 | MiniMax-M2-RP | 57.30 | 60.03 | 50.11 | 49.30 | **69.77** | **84.65** | **80.55** | 79.97 | **97.51** | [83.6, 85.7] |
- | 7 | DeepSeek-v3.1 | 53.50 | 50.15 | 53.18 | 53.93 | 56.72 | 64.22 | 51.11 | 66.45 | 88.21 | [62.9, 65.5] |
- | **8** | **HER-RL (this model)** | **53.12** | **54.33** | **47.26** | **52.78** | **58.12** | **65.73** | **59.13** | **57.74** | **86.90** | **[63.0, 68.4]** |
- | 9 | HER-SFT | 50.92 | 50.52 | 45.99 | 49.78 | 57.37 | 58.44 | 47.29 | 52.78 | 86.40 | [56.5, 60.4] |
- | 10 | Grok-4.1-Fast | 47.40 | 49.21 | 47.57 | 42.64 | 50.17 | 48.47 | 29.87 | 47.51 | 86.64 | [47.4, 49.5] |
- | 11 | Claude-4.5-Sonnet | 45.21 | 47.18 | 36.02 | 47.55 | 50.09 | 69.35 | 55.72 | 75.66 | 90.28 | [68.2, 70.5] |
- | 12 | Claude-3.7-Think | 39.73 | 44.84 | 31.00 | 42.45 | 40.65 | 61.25 | 50.66 | 59.53 | 84.15 | [58.5, 64.0] |
- | 13 | CoSER-70B | 35.95 | 35.05 | 31.16 | 32.28 | 45.33 | 45.38 | 34.32 | 30.32 | 82.58 | [43.5, 47.2] |
- | 14 | GPT-5-Mini | 32.97 | 38.10 | 24.60 | 27.20 | 42.00 | 57.63 | 43.32 | 50.11 | 93.78 | [55.9, 59.3] |
- | 15 | GPT-4o-240806 | 27.69 | 34.00 | 14.90 | 22.90 | 38.90 | 66.39 | 64.96 | 46.23 | 89.40 | [64.1, 68.7] |
- | 16 | GPT-OSS-120B | 26.12 | 32.80 | 14.80 | 21.50 | 35.40 | 60.72 | 47.27 | 56.65 | 91.71 | [58.0, 63.4] |
- | 17 | Qwen3-32B | 22.86 | 30.56 | 19.61 | 15.52 | 30.56 | 50.76 | 40.38 | 32.82 | 89.48 | [48.4, 53.2] |
-
- **CoSER Benchmark**: Evaluates role-playing quality on 0-100 scale across four dimensions:
- - **SC** (Story Consistency): Narrative coherence and plot continuity
- - **AN** (Anthropomorphism): Human-like behavior and natural expression
- - **CF** (Character Fidelity): Consistency with character profile and traits
- - **SQ** (Storyline Quality): Overall narrative quality and engagement
-
- **MiniMax Role-Play Bench**: Comprehensive evaluation on 0-100 scale:
- - **Worlds** (50%): Basic text quality and world-building
- - **Stories** (25%): Diversity and content logic
- - **Preferences** (25%): Interaction quality and user satisfaction
-
- **HER-RL** achieves **rank 8** overall, demonstrating strong performance particularly in:
- - **MiniMax Preference (86.90)**: Competitive with top-tier models, showing excellent interaction quality
- - **CoSER Storyline Quality (58.12)**: Strong narrative coherence and engagement
- - **Balanced Performance**: Consistent scores across both benchmarks, indicating robust role-playing capabilities
-
- The reinforcement learning approach significantly improves upon HER-SFT, particularly in interaction quality (+0.50 preference score) and character consistency (+2.52 average CoSER score).
-
- ## Example Use Case
-
- ### Mr. Bennet from Pride and Prejudice
-
- **System Prompt** (abbreviated):
- ```
- You are Mr Bennet from Pride and Prejudice. Elizabeth's father, known for his sarcastic wit and detachment...
-
- ===Current Scenario===
- The scene is set in Mr. Bennet's private study. Elizabeth has been summoned unexpectedly...
  ```
 
- **User Input**:
- ```
- [Elizabeth enters the study]
- ```
 
- **Model Output**:
  ```
  <system_thinking>
- I need to establish the opening of this scene effectively, grounding Mr Bennet in his private sanctuary while acknowledging the tension of the moment. The scenario places us in his study, a place of refuge, yet he holds a letter that disrupts his usual solitude.
 
- Context Analysis: Elizabeth has been summoned. The atmosphere is charged. Mr Bennet has received a letter regarding Elizabeth's engagement to Darcy. He is amused by the drama but aware of the gravity of the situation.
-
- Character Motivation: Mr Bennet often uses humor to deflect from serious emotional labor. He wants to gauge Elizabeth's state of mind regarding the recent events without being overly sentimental.
 
  Plan:
- - Action: Use the physical prop—the letter. Handling it slowly will emphasize its importance.
- - Internal Thought: Acknowledge the dichotomy—finding the situation amusing yet worrying for her reputation.
- - Speech: Address her directly but with a teasing opener.
  </system_thinking>
 
- Mr Bennet: <role_action>picks up the letter, tapping it thoughtfully against the desk</role_action><role_thinking>It is a delicate matter, this business with Darcy, yet I cannot help but find the drama amusing, though I must play the part of the concerned father</role_thinking>Well, my dear Lizzy, I trust you are not too greatly troubled by recent events?
  ```
 
- ## Training Data
-
- The models are trained on a comprehensive role-playing dataset that includes:
-
- - Authentic multi-turn, multi-character dialogues
- - Character profiles with detailed background information
- - Scenario descriptions and conversation contexts
- - Rich annotations including system thinking, role thinking, and role actions
- - Preference data collected through multi-dimensional evaluation
-
- ## Ethical Considerations
-
- We have conducted safety checks on the training dataset and implemented safeguards. However, users should be aware that:
-
- - The models may generate content that reflects biases present in the training data
- - Role-playing as certain characters might involve generating content with specific personality traits or behaviors
- - Users should implement appropriate content filtering when deploying these models in production applications
- - The models include safety evaluation dimensions to minimize harmful outputs
 
  ## 🎓 Citation
 
@@ -281,28 +226,26 @@ We have conducted safety checks on the training dataset and implemented safeguar
  @article{her2025,
    title={HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing},
    author={Chengyu Du, Xintao Wang, Aili Chen, Weiyuan Li, Rui Xu, Junteng Liu, Zishan Huang, Rong Tian, Zijun Sun, Yuhao Li, Liheng Feng, Deming Ding, Pengyu Zhao, Yanghua Xiao},
-   journal={arXiv preprint arXiv:2026.21459},
    year={2026}
  }
  ```
 
  ## 📄 License
 
- This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
 
  ## 🤝 Acknowledgments
 
  - [CoSER](https://github.com/Neph0s/CoSER) for the evaluation benchmark
- - [MiniMax](https://www.minimax.io/news/a-deep-dive-into-the-minimax-m2-her-2) for the evaluation benchmark
 
  ---
 
  <div align="center">
 
- **[Paper](https://arxiv.org/abs/2025.21459)** | **[Model](https://huggingface.co/ChengyuDu0123/HER-32B)** | **[Data](https://huggingface.co/datasets/ChengyuDu0123/HER-Dataset)**
 
  Made with ❤️ for better AI role-playing
 
- </div>
-
  - chat
  base_model: Qwen/Qwen3-32B
  ---
+ <div align="center">
+
+ # 🎭 HER-RL: Role-Playing Model with Reinforcement Learning
 
  ### HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing
 
  </div>
 
+ ## Overview
 
+ **HER-RL** is a role-playing language model enhanced with reinforcement learning, built upon Qwen3-32B. It achieves cognitive-level persona simulation through **Dual-layer Thinking**:
 
+ - **System Thinking** (`<system_thinking>`): the model's third-person, meta-level planning on how to portray the character
+ - **Role Thinking** (`<role_thinking>`): the character's first-person inner thoughts and cognitive processes
 
+ HER-RL significantly outperforms the Qwen3-32B baseline by **30.26** points on CoSER and **14.97** points on MiniMax Role-Play Bench.
 
+ ## Output Format
 
+ The model generates responses with a rich, interleaved structure:
 
+ ```
+ <system_thinking>
+ Third-person analysis: context understanding, character motivation, response planning...
+ </system_thinking>
+
+ <role_thinking>Character's inner thoughts (invisible to others)</role_thinking>
+ <role_action>Physical actions and expressions (visible to others)</role_action>
+ Spoken dialogue text.
+ ```
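The tagged structure above can be parsed mechanically. Below is a minimal sketch, assuming well-formed, non-nested tags; `parse_response` is an illustrative helper, not part of the released code:

```python
import re

def parse_response(text):
    """Split a HER-style response into its four components (illustrative helper)."""
    def grab(tag):
        m = re.search(rf'<{tag}>(.*?)</{tag}>', text, flags=re.DOTALL)
        return m.group(1).strip() if m else None

    # Speech is whatever remains once the tagged blocks are removed
    speech = re.sub(r'<(system_thinking|role_thinking|role_action)>.*?</\1>',
                    '', text, flags=re.DOTALL).strip()
    return {
        "system_thinking": grab("system_thinking"),
        "role_thinking": grab("role_thinking"),
        "role_action": grab("role_action"),
        "speech": speech,
    }

example = (
    "<system_thinking>Plan the reply.</system_thinking>\n"
    "<role_thinking>Curious.</role_thinking>"
    "<role_action>smiles</role_action>Good evening."
)
print(parse_response(example)["speech"])  # Good evening.
```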
 
  ## How to Use
 
  ### Quick Start: Interactive Chat Demo
 
  ```bash
+ git clone https://github.com/cydu24/HER.git
+ cd HER/chat_demo
+ python chat_demo.py --model-path ChengyuDu0123/HER-32B
  ```
 
  **Demo Options:**
 
  ```bash
  # Show the model's reasoning process (system thinking)
  python chat_demo.py --show-think
 
+ # Show character's inner thoughts (role thinking)
  python chat_demo.py --show-rolethink
 
+ # Both
+ python chat_demo.py --show-think --show-rolethink
  ```
 
  ### Programmatic Usage
 
  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
 
+ model_name = "ChengyuDu0123/HER-32B"
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(
      model_name,
      device_map="auto"
  )
 
+ # Build the system prompt
+ system_prompt = """You are role-playing as Elizabeth Bennet from the book "Pride and Prejudice".
+
+ ===Elizabeth Bennet's Profile===
+ The protagonist, intelligent and strong-willed. Quick-witted with a playful sense of humor. Values honesty and integrity. Maintains composure under pressure.
+
+ ===Current Scene===
+ The scene is set at the Netherfield ball. Mr. Darcy has just approached you.
+
+ ===The Person You Are Interacting With===
+ Mr. Darcy: A wealthy gentleman, proud and reserved. Owner of Pemberley estate.
+
+ ===Instructions===
+ - Stay in character as Elizabeth Bennet at all times
+ - Respond from Elizabeth's perspective
+ - Speak DIRECTLY to "Mr. Darcy" using "you" (second person)
+
  ===Output Format===
+ Your output should include thought, speech, and action in this two-part structure:
+
+ 1. System Thinking: A single block at the very beginning, wrapped in <system_thinking> and </system_thinking>. This is a third-person analysis of how to portray the character.
+
+ 2. Role-play Response: The character's actual response, including:
+    - <role_thinking>inner thoughts</role_thinking> (invisible to others)
+    - <role_action>physical actions</role_action> (visible to others)
+    - Speech (plain text, what the character says out loud)"""
+
+ user_input = "*Mr. Darcy bows slightly* Miss Bennet, might I have the honor of the next dance?"
 
  messages = [
      {"role": "system", "content": system_prompt},
      {"role": "user", "content": user_input}
  ]
 
+ # Force a <system_thinking> prefix by appending a partial assistant turn
  text = tokenizer.apply_chat_template(
+     messages + [{"role": "assistant", "content": "<system_thinking>"}],
      tokenize=False,
+     add_generation_prompt=False
  )
+ inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+ outputs = model.generate(
+     **inputs,
+     max_new_tokens=1024,
+     temperature=0.7,
+     top_p=0.9,
+     do_sample=True
  )
 
+ # Decode only the newly generated tokens and re-attach the forced prefix
+ response = tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=False)
+ response = response.replace("<|im_end|>", "").replace("<|im_start|>", "").strip()
+ full_response = "<system_thinking>" + response
 
+ print(full_response)
  ```
 
+ ### Example Output
 
  ```
  <system_thinking>
+ Context Analysis: Mr. Darcy has asked Elizabeth to dance at the Netherfield ball.
+ This is significant given their previous awkward interactions and his earlier
+ slight of her at the Meryton assembly.
+
+ Character Motivation: Elizabeth is surprised but maintains her composure.
+ She's curious about his sudden interest but won't show it openly.
+ Her wit is her shield.
 
  Plan:
+ - Action: Accept with grace but subtle irony
+ - Internal Thought: Question his motives
+ - Speech: Polite acceptance with a hint of her characteristic wit
  </system_thinking>
 
+ <role_thinking>What game is he playing now? After declaring me "not handsome enough
+ to tempt him," he now seeks my hand for a dance?</role_thinking>
+ <role_action>curtsies with practiced elegance, a slight smile playing at her lips</role_action>
+ You do me great honor, Mr. Darcy. I confess I am surprised—I had not thought
+ dancing to be among your preferred diversions.
  ```
+ ### Processing the Output
 
+ ```python
+ import re
+
+ def remove_system_thinking(text):
+     """Remove <system_thinking>...</system_thinking> for display"""
+     pattern = r'<system_thinking>.*?</system_thinking>\s*'
+     return re.sub(pattern, '', text, flags=re.DOTALL).strip()
+
+ def format_for_display(text, show_rolethink=True):
+     """Format for display: [] for thoughts, () for actions"""
+     result = text
+     if show_rolethink:
+         result = result.replace('<role_thinking>', '[').replace('</role_thinking>', ']')
+     else:
+         result = re.sub(r'<role_thinking>.*?</role_thinking>', '', result, flags=re.DOTALL)
+     result = result.replace('<role_action>', '(').replace('</role_action>', ')')
+     result = result.replace('<role_speech>', '').replace('</role_speech>', '')
+     return result.strip()
+
+ # Usage
+ clean_response = remove_system_thinking(full_response)
+ display_response = format_for_display(clean_response, show_rolethink=True)
+ print(display_response)
+ ```
 
+ **Output:**
 
+ ```
+ [What game is he playing now? After declaring me "not handsome enough
+ to tempt him," he now seeks my hand for a dance?]
+ (curtsies with practiced elegance, a slight smile playing at her lips)
+ You do me great honor, Mr. Darcy. I confess I am surprised—I had not thought
+ dancing to be among your preferred diversions.
+ ```
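The helpers above discard the system-thinking block for display; for logging or debugging it can instead be extracted. A minimal sketch, where `extract_system_thinking` is an illustrative helper and not part of the repo:

```python
import re

def extract_system_thinking(text):
    """Return the content of the <system_thinking> block, or None if absent (illustrative helper)."""
    match = re.search(r'<system_thinking>(.*?)</system_thinking>', text, flags=re.DOTALL)
    return match.group(1).strip() if match else None

plan = extract_system_thinking(
    "<system_thinking>Plan: accept with irony.</system_thinking>\n"
    "<role_action>curtsies</role_action>Of course."
)
print(plan)  # Plan: accept with irony.
```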
 
+ ## Performance
 
+ | Model                   | CoSER Avg      | MiniMax Avg    |
+ | ----------------------- | -------------- | -------------- |
+ | Qwen3-32B (baseline)    | 22.86          | 50.76          |
+ | HER-SFT                 | 50.92          | 58.44          |
+ | **HER-RL**              | **53.12**      | **65.73**      |
+ | Improvement vs baseline | **+30.26 pts** | **+14.97 pts** |
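The improvement row is the absolute gap in benchmark points (both benchmarks score on a 0-100 scale), not a relative percentage; the arithmetic is easy to check:

```python
# Score deltas behind the improvement row (absolute points on a 0-100 scale)
coser = {"Qwen3-32B": 22.86, "HER-SFT": 50.92, "HER-RL": 53.12}
minimax = {"Qwen3-32B": 50.76, "HER-SFT": 58.44, "HER-RL": 65.73}

coser_gain = round(coser["HER-RL"] - coser["Qwen3-32B"], 2)
minimax_gain = round(minimax["HER-RL"] - minimax["Qwen3-32B"], 2)
print(coser_gain, minimax_gain)  # 30.26 14.97
```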
 
 
  ## 🎓 Citation
 
  @article{her2025,
    title={HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing},
    author={Chengyu Du, Xintao Wang, Aili Chen, Weiyuan Li, Rui Xu, Junteng Liu, Zishan Huang, Rong Tian, Zijun Sun, Yuhao Li, Liheng Feng, Deming Ding, Pengyu Zhao, Yanghua Xiao},
+   journal={arXiv preprint arXiv:2601.21459},
    year={2026}
  }
  ```
 
  ## 📄 License
 
+ This project is licensed under the Apache 2.0 License.
 
  ## 🤝 Acknowledgments
 
  - [CoSER](https://github.com/Neph0s/CoSER) for the evaluation benchmark
+ - [MiniMax](https://huggingface.co/datasets/MiniMaxAI/role-play-bench) for the evaluation benchmark
 
  ---
 
  <div align="center">
 
+ **[Paper](https://arxiv.org/abs/2601.21459)** | **[HER-RM Model](https://huggingface.co/ChengyuDu0123/HER-RM-32B)** | **[Dataset](https://huggingface.co/datasets/ChengyuDu0123/HER-Dataset)** | **[GitHub](https://github.com/cydu24/HER)**
 
  Made with ❤️ for better AI role-playing
 
+ </div>