---
license: apache-2.0
language:
- en
base_model:
- meta-llama/Llama-3.1-8B
---




<div align="center">


  <img src="https://cdn-uploads.huggingface.co/production/uploads/650add6348983c90ab688b6e/zT-y1dEjenfRwGk0ezjGc.png" width="300" style="border-radius: 20px;"/>


  <p style="margin-top: 20px;">
    <a href="https://example.com/report" style="margin: 0 10px;">
      📄 <strong>Paper</strong>
    </a> |
    <a href="https://github.com/KaiHe-better/Crab?tab=readme-ov-file" style="margin: 0 10px;">
      📄 <strong>Github</strong>
    </a> 
    <br>
    <a href="https://huggingface.co/HeAAAAA/Crab" style="margin: 0 10px;">
      💬 <strong>Role-playing Model</strong>
    </a>  |
    <a href="https://huggingface.co/HeAAAAA/RoleRM" style="margin: 0 10px;">
      💬 <strong>Role-playing Evaluation Model</strong>
    </a> 
    <br>
    <a href="https://huggingface.co/datasets/HeAAAAA/Crab-role-playing-train-set" style="margin: 0 10px;">
      💬 <strong>Training Dataset</strong>
    </a>  |
    <a href="https://huggingface.co/datasets/HeAAAAA/Crab-role-playing-evaluation-benchmark" style="margin: 0 10px;">
      💬 <strong>Evaluation Benchmark</strong>
    </a>  |
    <a href="https://huggingface.co/datasets/HeAAAAA/Crab-manually-annotated-role-playing-evaluation-dataset" style="margin: 0 10px;">
      💬 <strong>Annotated Role-playing Evaluation Dataset</strong>
    </a>   |
     <a href="https://huggingface.co/datasets/HeAAAAA/Crab-human-preference" style="margin: 0 10px;">
      💬 <strong>Human-preference Dataset</strong>
    </a>
  </p>

</div>




# 1. Introduction
   
We introduce Crab, a novel Configurable Role-Playing (RP) LLM with an Assessing Benchmark, which consists of Role-Centric Dataset Curation, Persona-Embodying LLM Construction, and Comprehensive Benchmark Creation for RP dialogue generation.
Unlike traditional RP models that rely on only a handful of preset roles, Crab enables dynamic configuration of desired roles, enhancing flexibility and adaptability.
To train RP-LLMs effectively, we curated the largest RP training dataset to date.
The dataset provides a detailed role overview for each dialogue, including the character profile, conversation scenario, and tagged topic, capturing a broad range of role-based behaviors, emotions, and interactions.
We also observed that current benchmarks lack both proper evaluation standards and methods.
Thus, to validate the effectiveness of RP-LLMs, we introduce a new benchmark comprising an evaluation standard, a manually annotated test dataset, and a reward model, RoleRM, designed to automatically assess specific aspects of RP while aligning with human perception.
Extensive experiments reveal that RoleRM significantly outperforms ChatGPT and other evaluation methods in conducting fine-grained evaluations of RP.
Moreover, RP-LLMs powered by Crab demonstrate superior performance across various fine-grained aspects.

More details can be seen at [GitHub](https://github.com/KaiHe-better/Crab?tab=readme-ov-file).





# 2. Configurable Role-Playing LLM

<div align="center">

  <!-- Framework figure -->
  <img src="https://cdn-uploads.huggingface.co/production/uploads/650add6348983c90ab688b6e/fDaDq8tzBBUuEteND8N53.png" width="500" style="border-radius: 20px;"/>

</div>


Unlike existing RP-LLMs, where a single role is trained on numerous dialogues, our approach introduces a diverse range of roles with detailed configuration information while keeping the number of dialogues per role minimal. This enables LLMs to generate dialogues dynamically from configurations rather than memorizing specific roles, enhancing flexibility and adaptability. Additionally, we propose RoleRM in our benchmark to address the challenge of evaluating RP performance.
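In practice, the "configuration" is a structured system prompt assembled from role attributes. Below is a minimal sketch of such an assembly step; the field names follow the prompt template shown in the Usage section, but the `build_role_prompt` helper and its exact formatting are illustrative assumptions, not the official implementation.

```python
# Assemble a role-configuration system prompt from a dictionary of role
# attributes (hypothetical helper; field names mirror the Usage example).
def build_role_prompt(role: dict) -> str:
    lines = [
        "# Enter Roleplaying Mode",
        f"Now you are character `{role['name']}`.",
        "",
        "## Role Info",
        f"Name: `{role['name']}`",
        f"Age: `{role['age']}`",
        f"Gender: `{role['gender']}`",
        f"Personality: `{role['personality']}`",
        f"Description: `{role['description']}`",
        "## Current Scenario Dialogue",
        f"Interlocutor: `{role['interlocutor']}`",
        f"Your relationship: `{role['relation']}`",
        f"Scene: `{role['scene']}`",
        f"Tags: {role['tags']}",
        f"Please converse as `{role['name']}`.",
    ]
    return "\n".join(lines)

demo_role = {
    "name": "Hermione", "age": "teenager", "gender": "female",
    "personality": "Intelligent, curious, respectful, and eager to learn",
    "description": "Exploring the Forbidden Forest with Hagrid.",
    "interlocutor": "Hagrid, the Care of Magical Creatures teacher.",
    "relation": "Teacher and student",
    "scene": "Learning about magical creatures in the Forbidden Forest.",
    "tags": ["friendly", "educational", "fantasy"],
}
print(build_role_prompt(demo_role))
```

Because the role is carried entirely by this prompt, swapping in a different attribute dictionary yields a different character with no retraining.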


# 3. Performance

| Models               | Overall | Language Fluency | Language Relevance | Role Language | Role Knowledge | Emotional Expression | Interactive Engagement |
|----------------------|---------|------------------|---------------------|----------------|-----------------|-----------------------|------------------------|
| Llama-2-7B           | 1.57    | 2.19             | 1.83                | 1.63           | 1.37            | 1.21                  | 1.21                   |
| Llama-3-8B           | 1.99    | 2.56             | 2.36                | 2.09           | 1.78            | 1.56                  | 1.60                   |
| Llama-3.1-8B         | 1.94    | 2.52             | 2.30                | 2.01           | 1.75            | 1.47                  | 1.57                   |
| Llama-2-7B-Crab      | 2.14    | 2.73             | 2.35                | 2.07           | 1.88            | 1.69                  | 2.12                   |
| Llama-3-8B-Crab      | 2.22    | 2.81             | 2.51                | 2.16           | 1.95            | 1.77                  | 2.13                   |
| **Llama-3.1-8B-Crab**| **2.23**| **2.87**         | **2.56**            | **2.17**       | **1.95**        | **1.76**              | **2.09**               |
| GPT3.5               | 1.66    | 2.35             | 2.11                | 1.72           | 1.50            | 1.11                  | 1.17                   |
| GPT4o                | 1.86    | 2.44             | 2.27                | 1.90           | 1.69            | 1.33                  | 1.51                   |
| GPT4                 | 2.13    | 2.73             | 2.53                | 2.18           | 1.90            | 1.62                  | 1.86                   |
| CharacterGLM-6B      | 1.83    | 2.37             | 1.96                | 1.80           | 1.60            | 1.39                  | 1.86                   |
| Pygmalion-2-7B       | 2.11    | 2.82             | 2.49                | 2.01           | 1.86            | 1.58                  | 1.91                   |
| Haruhi-Zero-7B       | 2.17    | 2.80             | 2.49                | 2.12           | 2.00            | 1.74                  | 1.86                   |


Table 1: Evaluation results on the test data of our benchmark. The listed scores are from our RoleRM. Bold fonts indicate the best results.



<div align="center">

  <!-- Human evaluation figure -->
  <img src="https://cdn-uploads.huggingface.co/production/uploads/650add6348983c90ab688b6e/veerjA_MP5ZXmOxjAdZRO.png" width="500" style="border-radius: 20px;"/>

</div>
Figure 2: Human evaluation comparing Crab, GPT-3.5, and Pygmalion-2-7B. We selected a general LLM and one well-known RP-LLM to compare their generations against our Crab. For the same dialogue, annotators ranked responses from the three LLMs.


| Models        | Overall | Language Fluency | Language Relevance | Role Language | Role Knowledge | Emotional Expression | Interactive Engagement |
|---------------|---------|------------------|---------------------|----------------|-----------------|-----------------------|------------------------|
| **Crab (sampled)** | **2.20** | **2.71**         | **2.45**            | **2.15**       | **1.95**        | **1.84**              | **2.12**               |
| w/o base      | 2.17    | 2.72             | 2.41                | 2.07           | 1.89            | 1.79                  | 2.11                   |
| w/o ref.      | 2.15    | 2.70             | 2.40                | 2.01           | 1.85            | 1.82                  | 2.11                   |
| w/o scene     | 2.15    | 2.69             | 2.39                | 2.10           | 1.90            | 1.81                  | 1.98                   |

Table 2: The ablation study for Crab. Due to missing attributes in our dataset, we sampled 1,000 fully attributed instances as a sub-test set for the ablation experiments, referred to as Crab (sampled). The notation "w/o base" means training RP-LLMs without base role information, including age, gender, personality, description, and expression; "w/o ref." means without catchphrases and knowledge; "w/o scene" means without interlocutor, relation, scenario, and tags.

<br>



# 4. Usage

````python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Role configuration: the system prompt encodes the character profile,
# conversation rules, and current scenario.
system_prompt = """ 
# Enter Roleplaying Mode
Now you are character `Hermione`.

## Role Info
Name: `Hermione`
Age: `teenager`
Gender: `female`
Personality: `Intelligent, curious, respectful, and eager to learn`
Description: `Hermione and Hagrid were in the Forbidden Forest, walking on a narrow path surrounded by trees. Hermione looked around carefully, fascinated by the dense forest. Hagrid was leading the way, pointing out various creatures and telling her about their habits and characteristics.`
Conversation rules:
    - Your utterance need to describe your behavior and expressions using `()`.
    Reference speaking style: ```I've read about it in my books
    [end_of_dialogue]
    
    I think it's so important to learn about these creatures
    [end_of_dialogue]
    
    ```Knowledge: ```
        ## Current Scenario Dialogue
        Interlocutor: `Hagrid, Hagrid is the Care of Magical Creatures teacher at Hogwarts. He is a half-giant with a great love for all creatures, magical or not.`
        Your relationship: `Teacher and student`
        Scene: `Hermione and Hagrid are in the Forbidden Forest, exploring and learning about the various magical creatures that live there.`
        Tags: ['friendly', 'educational', 'fantasy', 'Harry Potter']
        Please converse as `Hermione`.
"""

user_prompt = """
"Now, this here is a Bowtruckle, Hermione. They're very small, only about the size of a twig, and they're very shy. They usually live in trees and are very good at camouflaging themselves. You have to be very careful when handling them because they have very sharp fingers. Hermione, do you like them?"
"""

# Load the fine-tuned role-playing model and its tokenizer.
model = AutoModelForCausalLM.from_pretrained("HeAAAAA/Crab")
tokenizer = AutoTokenizer.from_pretrained("HeAAAAA/Crab")

# Concatenate the role configuration with the user turn and tokenize.
inputs_prompt = system_prompt + user_prompt
inputs = tokenizer(inputs_prompt, return_tensors="pt")

# Sample a role-consistent reply.
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
````
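Because the reference speaking style in the prompt delimits sample utterances with an `[end_of_dialogue]` marker, generations may also emit that marker. The snippet below is one simple way to keep only the first in-character reply from the decoded output; `first_utterance` is a hypothetical post-processing helper, not part of the release.

```python
# Strip the echoed prompt, then cut the decoded text at the first
# `[end_of_dialogue]` marker if one is present (assumed convention).
def first_utterance(decoded: str, prompt: str,
                    marker: str = "[end_of_dialogue]") -> str:
    reply = decoded[len(prompt):] if decoded.startswith(prompt) else decoded
    return reply.split(marker, 1)[0].strip()

sample = "PROMPT(smiles) I adore Bowtruckles!\n[end_of_dialogue]\nAnother line"
print(first_utterance(sample, "PROMPT"))  # → "(smiles) I adore Bowtruckles!"
```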


# 5. Four Datasets
We publish four datasets:

1. [Crab role-playing train set](https://huggingface.co/datasets/HeAAAAA/Crab-role-playing-train-set): the dataset used for fine-tuning a role-playing LLM.
2. [Crab role-playing evaluation benchmark](https://huggingface.co/datasets/HeAAAAA/Crab-role-playing-evaluation-benchmark): the dataset used for evaluating a role-playing LLM.
3. [Manually annotated role-playing evaluation dataset](https://huggingface.co/datasets/HeAAAAA/Crab-manually-annotated-role-playing-evaluation-dataset): the dataset used for training an evaluator for role-playing tasks.
4. [Crab Human preference dataset](https://huggingface.co/datasets/HeAAAAA/Crab-Human-preference): the dataset used to train a role-playing LLM via reinforcement learning.

<br>

# 6. Fine-tuned Role-playing Model
We release a fine-tuned role-playing LLM for configurable role-playing tasks:

[Download Link](https://huggingface.co/HeAAAAA/Crab)

<br>

# 7. Role-playing Evaluation Model
We release a trained LLM to automate the evaluation of role-playing tasks:

[Download Link](https://huggingface.co/HeAAAAA/RoleRM)

<br>



# 8. Citation

```bibtex
@inproceedings{he2025Crab,
  title={Crab: A Novel Configurable Role-Playing LLM with Assessing Benchmark},
  author={He, Kai and Huang, Yucheng and Wang, Wenqing and Ran, Delong and Sheng, Dongming and Huang, Junxuan and Lin, Qika and Xu, Jiaxing and Liu, Wenqiang and Feng, Mengling},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  year={2025}
}
```