Update README.md
- chat
base_model: Qwen/Qwen3-32B
---

<div align="center">

# 🎭 HER-RL: Role-Playing Model with Reinforcement Learning

### HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing

</div>

## Overview

**HER-RL** is a role-playing language model enhanced with reinforcement learning, built on Qwen3-32B. It achieves cognitive-level persona simulation through **Dual-layer Thinking**:

- **System Thinking** (`<system_thinking>`): third-person, meta-level planning of how to portray the character
- **Role Thinking** (`<role_thinking>`): the character's first-person inner thoughts and cognitive processes

HER-RL outperforms the Qwen3-32B baseline by **30.26 points** on CoSER and **14.97 points** on the MiniMax Role-Play Bench.

## Output Format

The model generates responses with a rich, interleaved structure:

```
<system_thinking>
Third-person analysis: context understanding, character motivation, response planning...
</system_thinking>

<role_thinking>Character's inner thoughts (invisible to others)</role_thinking>
<role_action>Physical actions and expressions (visible to others)</role_action>
Spoken dialogue text.
```
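The tagged channels can be pulled apart programmatically. Below is a minimal parsing sketch; the helper `parse_her_output` and its regexes are our own illustration, not part of the released codebase:

```python
import re

def parse_her_output(text):
    """Split a HER response into its channels (illustrative helper,
    not part of the released codebase)."""
    def grab(tag):
        # Non-greedy match so multiple blocks of the same tag stay separate
        return re.findall(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)

    # Speech is whatever remains once all tagged spans are removed
    speech = re.sub(r"<(system_thinking|role_thinking|role_action)>.*?</\1>",
                    "", text, flags=re.DOTALL).strip()
    return {
        "system_thinking": grab("system_thinking"),
        "role_thinking": grab("role_thinking"),
        "role_action": grab("role_action"),
        "speech": speech,
    }

sample = (
    "<system_thinking>Plan the reply.</system_thinking>\n"
    "<role_thinking>Why now?</role_thinking>\n"
    "<role_action>smiles</role_action>\n"
    "A pleasure, sir."
)
print(parse_her_output(sample)["speech"])  # prints: A pleasure, sir.
```
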
## How to Use

### Quick Start: Interactive Chat Demo

```bash
git clone https://github.com/cydu24/HER.git
cd HER/chat_demo
python chat_demo.py --model-path ChengyuDu0123/HER-32B
```

**Demo Options:**

```bash
# Show the model's reasoning process (system thinking)
python chat_demo.py --show-think

# Show character's inner thoughts (role thinking)
python chat_demo.py --show-rolethink

# Show both
python chat_demo.py --show-think --show-rolethink
```
### Programmatic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ChengyuDu0123/HER-32B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto"
)

# Build system prompt
system_prompt = """You are role-playing as Elizabeth Bennet from the book "Pride and Prejudice".

===Elizabeth Bennet's Profile===
The protagonist, intelligent and strong-willed. Quick-witted with a playful sense of humor. Values honesty and integrity. Maintains composure under pressure.

===Current Scene===
The scene is set at the Netherfield ball. Mr. Darcy has just approached you.

===The Person You Are Interacting With===
Mr. Darcy: A wealthy gentleman, proud and reserved. Owner of Pemberley estate.

===Instructions===
- Stay in character as Elizabeth Bennet at all times
- Respond from Elizabeth's perspective
- Speak DIRECTLY to "Mr. Darcy" using "you" (second person)

===Output Format===
Your output should include thought, speech, and action in this two-part structure:

1. System Thinking: A single block at the very beginning, wrapped in <system_thinking> and </system_thinking>. This is third-person analysis of how to portray the character.

2. Role-play Response: The character's actual response including:
- <role_thinking>inner thoughts</role_thinking> (invisible to others)
- <role_action>physical actions</role_action> (visible to others)
- Speech (plain text, what the character says out loud)"""

user_input = "*Mr. Darcy bows slightly* Miss Bennet, might I have the honor of the next dance?"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_input}
]

# Render the prompt so generation continues from a "<system_thinking>" prefix
text = tokenizer.apply_chat_template(
    messages + [{"role": "assistant", "content": "<system_thinking>"}],
    tokenize=False,
    add_generation_prompt=False,
    continue_final_message=True  # keep the assistant turn open at the prefix
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

response = tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=False)
response = response.replace("<|im_end|>", "").replace("<|im_start|>", "").strip()
full_response = "<system_thinking>" + response

print(full_response)
```
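For multi-turn chat, each exchange goes back into `messages` before the next generation. The sketch below is our own illustration (helper names are not from the repo); whether to keep or strip the hidden `<system_thinking>` block from the stored history is a design choice, and stripping it saves context:

```python
import re

def strip_system_thinking(text):
    """Drop the hidden planning block before storing the turn (illustrative)."""
    return re.sub(r"<system_thinking>.*?</system_thinking>\s*", "",
                  text, flags=re.DOTALL).strip()

def append_exchange(messages, user_input, full_response, keep_system_thinking=False):
    """Append one user/assistant exchange to the running history (illustrative)."""
    stored = full_response if keep_system_thinking else strip_system_thinking(full_response)
    messages.append({"role": "user", "content": user_input})
    messages.append({"role": "assistant", "content": stored})
    return messages

history = [{"role": "system", "content": "You are role-playing as Elizabeth Bennet..."}]
append_exchange(history, "Good evening.",
                "<system_thinking>plan</system_thinking><role_action>nods</role_action>Good evening, sir.")
print(history[-1]["content"])  # prints: <role_action>nods</role_action>Good evening, sir.
```
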
### Example Output

```
<system_thinking>
Context Analysis: Mr. Darcy has asked Elizabeth to dance at the Netherfield ball.
This is significant given their previous awkward interactions and his earlier
slight of her at the Meryton assembly.

Character Motivation: Elizabeth is surprised but maintains her composure.
She's curious about his sudden interest but won't show it openly.
Her wit is her shield.

Plan:
- Action: Accept with grace but subtle irony
- Internal Thought: Question his motives
- Speech: Polite acceptance with a hint of her characteristic wit
</system_thinking>

<role_thinking>What game is he playing now? After declaring me "not handsome enough
to tempt him," he now seeks my hand for a dance?</role_thinking>
<role_action>curtsies with practiced elegance, a slight smile playing at her lips</role_action>
You do me great honor, Mr. Darcy. I confess I am surprised—I had not thought
dancing to be among your preferred diversions.
```
### Processing the Output

```python
import re

def remove_system_thinking(text):
    """Remove the <system_thinking>...</system_thinking> block for display."""
    pattern = r'<system_thinking>.*?</system_thinking>\s*'
    return re.sub(pattern, '', text, flags=re.DOTALL).strip()

def format_for_display(text, show_rolethink=True):
    """Format for display: [] for thoughts, () for actions."""
    result = text
    if show_rolethink:
        result = result.replace('<role_thinking>', '[').replace('</role_thinking>', ']')
    else:
        result = re.sub(r'<role_thinking>.*?</role_thinking>', '', result, flags=re.DOTALL)
    result = result.replace('<role_action>', '(').replace('</role_action>', ')')
    result = result.replace('<role_speech>', '').replace('</role_speech>', '')
    return result.strip()

# Usage
clean_response = remove_system_thinking(full_response)
display_response = format_for_display(clean_response, show_rolethink=True)
print(display_response)
```

**Output:**

```
[What game is he playing now? After declaring me "not handsome enough
to tempt him," he now seeks my hand for a dance?]
(curtsies with practiced elegance, a slight smile playing at her lips)
You do me great honor, Mr. Darcy. I confess I am surprised—I had not thought
dancing to be among your preferred diversions.
```
## Performance

| Model | CoSER Avg | MiniMax Avg |
| ----------------------- | -------------- | -------------- |
| Qwen3-32B (baseline) | 22.86 | 50.76 |
| HER-SFT | 50.92 | 58.44 |
| **HER-RL** | **53.12** | **65.73** |
| Improvement vs baseline | **+30.26 pts** | **+14.97 pts** |
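The improvement row is the absolute score delta over the baseline, which can be checked directly against the table:

```python
# Scores from the table above; deltas are absolute points, not percentages
coser = {"Qwen3-32B": 22.86, "HER-SFT": 50.92, "HER-RL": 53.12}
minimax = {"Qwen3-32B": 50.76, "HER-SFT": 58.44, "HER-RL": 65.73}

print(round(coser["HER-RL"] - coser["Qwen3-32B"], 2))    # 30.26
print(round(minimax["HER-RL"] - minimax["Qwen3-32B"], 2))  # 14.97
```
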
## 🎓 Citation

```
@article{her2025,
  title={HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing},
  author={Chengyu Du and Xintao Wang and Aili Chen and Weiyuan Li and Rui Xu and Junteng Liu and Zishan Huang and Rong Tian and Zijun Sun and Yuhao Li and Liheng Feng and Deming Ding and Pengyu Zhao and Yanghua Xiao},
  journal={arXiv preprint arXiv:2601.21459},
  year={2026}
}
```

## 📄 License

This project is licensed under the Apache 2.0 License.

## 🤝 Acknowledgments

- [CoSER](https://github.com/Neph0s/CoSER) for the evaluation benchmark
- [MiniMax](https://huggingface.co/datasets/MiniMaxAI/role-play-bench) for the evaluation benchmark

---

<div align="center">

**[Paper](https://arxiv.org/abs/2601.21459)** | **[HER-RM Model](https://huggingface.co/ChengyuDu0123/HER-RM-32B)** | **[Dataset](https://huggingface.co/datasets/ChengyuDu0123/HER-Dataset)** | **[GitHub](https://github.com/cydu24/HER)**

Made with ❤️ for better AI role-playing

</div>