ADOHAHA123 commited on
Commit
42ecfc6
·
verified ·
1 Parent(s): c5a870c

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +1 -295
README.md CHANGED
@@ -1,291 +1,3 @@
1
- ---
2
- language:
3
- - zh
4
- - en
5
- license: apache-2.0
6
- library_name: transformers
7
- pipeline_tag: text-generation
8
- tags:
9
- - roleplay
10
- - dialogue
11
- - multi-turn
12
- - qwen
13
- - sft
14
- - chat
15
- base_model: Qwen/Qwen3-32B
16
- ---
17
-
18
- # HER-SFT-Qwen-32B
19
-
20
- <p align="center">
21
- <a href="https://arxiv.org/abs/xxxx.xxxxx"><img src="https://img.shields.io/badge/Paper-arXiv-red?logo=arxiv" alt="Paper"></a>
22
- <a href="https://huggingface.co/datasets/ADOHAHA123/HER-Dataset"><img src="https://img.shields.io/badge/🤗%20Dataset-HER--Dataset-yellow" alt="Dataset"></a>
23
- <a href="https://huggingface.co/ADOHAHA123/HER-Qwen3-32B"><img src="https://img.shields.io/badge/🤗%20Model-HER--RL-blue" alt="HER-RL"></a>
24
- <a href="https://huggingface.co/ADOHAHA123/HER-SFT-Qwen3-32B"><img src="https://img.shields.io/badge/🤗%20Model-HER--SFT-green" alt="HER-SFT"></a>
25
- <a href="https://github.com/your-username/HER"><img src="https://img.shields.io/badge/GitHub-Code-black?logo=github" alt="GitHub"></a>
26
- </p>
27
-
28
- HER-SFT (Human Emulation Reasoning - Supervised Fine-Tuning) is a role-playing language model built upon Qwen-32B base model. This is the supervised fine-tuned version of HER, trained on reasoning-augmented role-playing data constructed through reverse engineering synthesis.
29
-
30
- HER-SFT serves as the foundation for HER-RL (the reinforcement learning enhanced version) and demonstrates strong role-playing capabilities through **Dual-layer Thinking**:
31
- - **System Thinking** (third-person): LLM's meta-level planning on how to portray the character
32
- - **Role Thinking** (first-person): Character's inner thoughts and cognitive processes
33
-
34
- This model achieves competitive performance on role-playing benchmarks, with HER-RL further improving upon it through preference-aligned reinforcement learning.
35
-
36
- ## Model Information
37
-
38
- - **Base Model**: Qwen-32B
39
- - **Training Method**: Supervised Fine-Tuning (SFT)
40
- - **Training Data**: Reasoning-augmented role-playing dialogues with dual-layer thinking
41
- - **Model Size**: 32B parameters
42
- - **Enhanced Version**: [HER-RL](../her-rl-qwen-32b) (RL-enhanced version available)
43
-
44
- ## Key Features
45
-
46
- Our model generates responses with rich, interleaved structure:
47
-
48
- - `<system_thinking>`: Third-person analysis of how to portray the role
49
- - `<role_thinking>`: Character's inner thoughts (invisible to others)
50
- - `<role_action>`: Character's physical actions and expressions
51
- - Speech: Natural dialogue text
52
-
53
- This hierarchical approach enables more nuanced and authentic character portrayal.
54
-
55
- ## How to Use
56
-
57
- ### Quick Start: Interactive Chat Demo
58
-
59
- The easiest way to try the model is using our interactive chat demo:
60
-
61
- ```bash
62
- cd chat_demo
63
- python chat_demo.py
64
- ```
65
-
66
- This will start an interactive session where you can:
67
- 1. Choose a scenario from classic literature (Pride and Prejudice, The Great Gatsby, etc.)
68
- 2. Select which character the AI should play
69
- 3. Select which character you want to play
70
- 4. Start chatting with the AI character!
71
-
72
- **Demo Options:**
73
-
74
- ```bash
75
- # Show the model's reasoning process (system thinking)
76
- python chat_demo.py --show-think
77
-
78
- # Show character's inner thoughts (role thinking)
79
- python chat_demo.py --show-rolethink
80
-
81
- # Directly specify scenario and character
82
- python chat_demo.py --scenario 0 --character 1
83
-
84
- # Use simple built-in scenarios
85
- python chat_demo.py --simple
86
- ```
87
-
88
- **Chat Commands:**
89
- - `quit` / `exit` / `q` - Exit the chat
90
- - `clear` - Clear conversation history
91
- - `history` - View conversation history
92
- - `prompt` - View the full prompt
93
-
94
- See `chat_demo/README.md` for detailed instructions.
95
-
96
- ### Programmatic Usage
97
-
98
- ```python
99
- from transformers import AutoModelForCausalLM, AutoTokenizer
100
-
101
- model_name = "your-username/her-sft-qwen-32b"
102
- tokenizer = AutoTokenizer.from_pretrained(model_name)
103
- model = AutoModelForCausalLM.from_pretrained(
104
- model_name,
105
- torch_dtype="auto",
106
- device_map="auto"
107
- )
108
-
109
- # Example: Role-playing as Elizabeth Bennet from Pride and Prejudice
110
- system_prompt = """You are Elizabeth Bennet from Pride and Prejudice.
111
-
112
- ===Elizabeth Bennet's Profile===
113
- The protagonist, intelligent and strong-willed. Quick-witted with a playful sense of humor. Values honesty and integrity. Maintains composure under pressure while harboring deep emotions beneath the surface.
114
-
115
- Background: Second of five daughters in the Bennet family. Known for her intelligence, independence, and refusal to conform to societal expectations.
116
-
117
- Personality: Quick-witted with a playful sense of humor. Values honesty and integrity. Maintains composure under pressure.
118
-
119
- ===Current Scenario===
120
- The scene is set in Mr. Bennet's private study. Elizabeth has been summoned unexpectedly after Lady Catherine's confrontational visit, where she refused to promise not to marry Mr. Darcy. The tension is palpable as Mr. Bennet holds a mysterious letter.
121
-
122
- ===Output Format===
123
- Your output should follow this structure:
124
- 1. System Thinking: Wrapped in <system_thinking></system_thinking> tags - third-person analysis of how to portray the role
125
- 2. Role-play Response: Including <role_thinking> for inner thoughts, <role_action> for actions, and plain text for speech"""
126
-
127
- user_input = "Well, my dear Lizzy, I trust you are not too greatly troubled by recent events?"
128
-
129
- messages = [
130
- {"role": "system", "content": system_prompt},
131
- {"role": "user", "content": user_input}
132
- ]
133
-
134
- text = tokenizer.apply_chat_template(
135
- messages,
136
- tokenize=False,
137
- add_generation_prompt=True
138
- )
139
- model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
140
-
141
- generated_ids = model.generate(
142
- **model_inputs,
143
- max_new_tokens=512,
144
- temperature=0.8,
145
- top_p=0.9
146
- )
147
- generated_ids = [
148
- output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
149
- ]
150
-
151
- response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
152
- print(response)
153
- ```
154
-
155
- ## Framework Overview
156
-
157
- <p align="center">
158
- <img src="figure2github.png" alt="HER Framework" width="100%">
159
- </p>
160
-
161
- <p align="center">
162
- <em>HER Framework: Dual-layer Thinking for Cognitive-Level Persona Simulation</em>
163
- </p>
164
-
165
- ## Training Methodology
166
-
167
- HER-SFT employs a comprehensive training pipeline:
168
-
169
- 1. **Dual-layer Thinking**: Separates hidden third-person system thinking (how the LLM plans to portray the character) from first-person role thinking (the character's actual inner thoughts). This dual-layer structure enables more authentic and cognitively grounded character simulation.
170
-
171
- 2. **Reverse Engineering Data Synthesis**: We curate reasoning-augmented role-playing data through a three-stage reverse synthesis pipeline:
172
- - Stage 1: Role thinking enhancement - Enriching characters' psychological activities
173
- - Stage 2: Pattern diversification - Creating diverse response patterns
174
- - Stage 3: System thinking generation - Adding meta-level planning analysis
175
-
176
- 3. **Multi-layered Response Generation**: Models learn to generate system thinking (planning), role thinking (inner thoughts), role actions, and speech in an interleaved manner, avoiding monotonous patterns.
177
-
178
- ## Performance
179
-
180
- ### Main Leaderboard Results
181
-
182
- | Rank | Model | CoSER Avg | CoSER SC | CoSER AN | CoSER CF | CoSER SQ | MiniMax Avg | MiniMax Worlds (50%) | MiniMax Stories (25%) | MiniMax Pref (25%) | 95% CI |
183
- |------|-------|-----------|----------|----------|----------|----------|-------------|----------------------|----------------------|--------------------|---------|
184
- | 1 | Claude-4.5-Opus | **62.43** | 63.74 | **64.28** | 58.45 | 63.24 | 76.62 | 67.23 | 82.10 | 89.90 | [75.5, 77.7] |
185
- | 2 | Gemini-3-Pro | 61.80 | **65.95** | 60.42 | **58.34** | 62.49 | 75.60 | 62.72 | 83.87 | 93.08 | [74.5, 76.7] |
186
- | 3 | GPT-5.1 | 61.10 | 64.95 | 53.99 | 60.13 | 65.35 | 80.63 | 76.62 | 72.21 | 97.05 | [79.6, 81.6] |
187
- | 4 | Gemini-2.5-Pro | 60.68 | 61.05 | 60.80 | 57.48 | 63.40 | 68.23 | 52.36 | 82.11 | 86.08 | [67.1, 69.3] |
188
- | 5 | DeepSeek-v3.2 | 58.68 | 55.85 | 57.07 | 57.44 | 64.35 | 60.27 | 45.81 | 66.64 | 82.83 | [59.2, 61.4] |
189
- | 6 | MiniMax-M2-RP | 57.30 | 60.03 | 50.11 | 49.30 | **69.77** | **84.65** | **80.55** | 79.97 | **97.51** | [83.6, 85.7] |
190
- | 7 | DeepSeek-v3.1 | 53.50 | 50.15 | 53.18 | 53.93 | 56.72 | 64.22 | 51.11 | 66.45 | 88.21 | [62.9, 65.5] |
191
- | 8 | HER-RL | 53.12 | 54.33 | 47.26 | 52.78 | 58.12 | 65.73 | 59.13 | 57.74 | 86.90 | [63.0, 68.4] |
192
- | **9** | **HER-SFT (this model)** | **50.92** | **50.52** | **45.99** | **49.78** | **57.37** | **58.44** | **47.29** | **52.78** | **86.40** | **[56.5, 60.4]** |
193
- | 10 | Grok-4.1-Fast | 47.40 | 49.21 | 47.57 | 42.64 | 50.17 | 48.47 | 29.87 | 47.51 | 86.64 | [47.4, 49.5] |
194
- | 11 | Claude-4.5-Sonnet | 45.21 | 47.18 | 36.02 | 47.55 | 50.09 | 69.35 | 55.72 | 75.66 | 90.28 | [68.2, 70.5] |
195
- | 12 | Claude-3.7-Think | 39.73 | 44.84 | 31.00 | 42.45 | 40.65 | 61.25 | 50.66 | 59.53 | 84.15 | [58.5, 64.0] |
196
- | 13 | CoSER-70B | 35.95 | 35.05 | 31.16 | 32.28 | 45.33 | 45.38 | 34.32 | 30.32 | 82.58 | [43.5, 47.2] |
197
- | 14 | GPT-5-Mini | 32.97 | 38.10 | 24.60 | 27.20 | 42.00 | 57.63 | 43.32 | 50.11 | 93.78 | [55.9, 59.3] |
198
- | 15 | GPT-4o-240806 | 27.69 | 34.00 | 14.90 | 22.90 | 38.90 | 66.39 | 64.96 | 46.23 | 89.40 | [64.1, 68.7] |
199
- | 16 | GPT-OSS-120B | 26.12 | 32.80 | 14.80 | 21.50 | 35.40 | 60.72 | 47.27 | 56.65 | 91.71 | [58.0, 63.4] |
200
- | 17 | Qwen3-32B | 22.86 | 30.56 | 19.61 | 15.52 | 30.56 | 50.76 | 40.38 | 32.82 | 89.48 | [48.4, 53.2] |
201
-
202
- **CoSER Benchmark**: Evaluates role-playing quality on 0-100 scale across four dimensions:
203
- - **SC** (Story Consistency): Narrative coherence and plot continuity
204
- - **AN** (Anthropomorphism): Human-like behavior and natural expression
205
- - **CF** (Character Fidelity): Consistency with character profile and traits
206
- - **SQ** (Storyline Quality): Overall narrative quality and engagement
207
-
208
- **MiniMax Role-Play Bench**: Comprehensive evaluation on 0-100 scale:
209
- - **Worlds** (50%): Basic text quality and world-building
210
- - **Stories** (25%): Diversity and content logic
211
- - **Preferences** (25%): Interaction quality and user satisfaction
212
-
213
- **HER-SFT** achieves **rank 9** overall, demonstrating strong baseline performance:
214
- - **Significant improvement over base model**: +28.06 CoSER average vs Qwen3-32B baseline (+122.8% relative improvement)
215
- - **Strong preference scores (86.40)**: Competitive interaction quality, nearly matching HER-RL (86.90)
216
- - **Solid foundation for RL**: HER-RL builds upon this SFT model to achieve further gains (+2.20 CoSER average)
217
-
218
- The supervised fine-tuning establishes a strong baseline, with reinforcement learning providing incremental improvements in character consistency and narrative coherence.
219
-
220
- ## Example Use Case
221
-
222
- ### Elizabeth Bennet from Pride and Prejudice
223
-
224
- **System Prompt** (abbreviated):
225
- ```
226
- You are Elizabeth Bennet from Pride and Prejudice. The protagonist, intelligent and strong-willed...
227
-
228
- ===Current Scenario===
229
- The scene is set in Mr. Bennet's private study. Elizabeth has been summoned after Lady Catherine's visit...
230
- ```
231
-
232
- **User Input**:
233
- ```
234
- Well, my dear Lizzy, I trust you are not too greatly troubled by recent events?
235
- ```
236
-
237
- **Model Output**:
238
- ```
239
- <system_thinking>
240
- I need to generate a response for Elizabeth Bennet at the beginning of this scene.
241
-
242
- Context Analysis: Mr. Bennet has just spoken, picking up a letter and asking if I am "greatly troubled" by recent events. His tone is sardonic, yet there is underlying tension. The scenario implies the aftermath of Lady Catherine's visit.
243
-
244
- Character State (Elizabeth):
245
- - Internal: Feeling a mix of relief (that I stood my ground) and anxiety (about the consequences)
246
- - External: Need to appear composed, cannot let him see me crumble
247
-
248
- Goal: Answer honestly but carefully. Convey that I am holding up well.
249
-
250
- Plan:
251
- - Role Thinking: Reflect on the complexity of feelings
252
- - Role Action: Use physical action to show composure
253
- - Speech: Direct answer showing resilience
254
- </system_thinking>
255
-
256
- Elizabeth Bennet: <role_thinking>His tone is light, but the air feels heavy. I cannot let him see how much Lady Catherine's intrusion still stings, nor how anxious I am about what he knows</role_thinking><role_action>takes a steadying breath, smoothing the folds of her dress to mask a tremor</role_action>I believe I can manage, Father. Though I must admit, I am curious about what this letter contains.
257
- ```
258
-
259
- ## Training Data
260
-
261
- The model is trained on a comprehensive role-playing dataset that includes:
262
-
263
- - Authentic multi-turn, multi-character dialogues from classic literature
264
- - Character profiles with detailed background information
265
- - Scenario descriptions and conversation contexts
266
- - Rich annotations including system thinking, role thinking, and role actions
267
- - Diverse response patterns avoiding monotonous structures
268
-
269
- ## Ethical Considerations
270
-
271
- We have conducted safety checks on the training dataset and implemented safeguards. However, users should be aware that:
272
-
273
- - The models may generate content that reflects biases present in the training data
274
- - Role-playing as certain characters might involve generating content with specific personality traits or behaviors
275
- - Users should implement appropriate content filtering when deploying these models in production applications
276
- - The models include safety evaluation dimensions to minimize harmful outputs
277
-
278
- ## Model Comparison
279
-
280
- | Feature | HER-SFT (this model) | HER-RL |
281
- |---------|---------------------|---------|
282
- | Training Method | Supervised Fine-Tuning | SFT + Reinforcement Learning |
283
- | CoSER Average | 50.92 | 53.12 (+2.20) |
284
- | MiniMax Average | 58.44 | 65.73 (+7.29) |
285
- | Character Consistency | Good | Better (+2.00 CF) |
286
- | Interaction Quality | 86.40 Pref | 86.90 Pref (+0.50) |
287
- | Best Use Case | General role-playing | Preference-aligned interactions |
288
-
289
  ## Citation
290
 
291
  If you use HER-SFT in your research, please cite our paper:
@@ -303,10 +15,4 @@ If you use HER-SFT in your research, please cite our paper:
303
 
304
  Apache-2.0
305
 
306
- ## Acknowledgments
307
-
308
- This model is based on Qwen-32B developed by Alibaba Cloud. We thank the Qwen team for their excellent base model.
309
-
310
- ## Related Models
311
-
312
- - **[HER-RL](../her-rl-qwen-32b)**: Enhanced version with reinforcement learning for improved character consistency and preference alignment
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  ## Citation
2
 
3
  If you use HER-SFT in your research, please cite our paper:
 
15
 
16
  Apache-2.0
17
 
18
+ ## Acknowledgments