---
license: apache-2.0
language:
- en
base_model:
- meta-llama/Llama-3.1-8B
---

<div align="center">

<!-- Logo -->
<img src="https://cdn-uploads.huggingface.co/production/uploads/650add6348983c90ab688b6e/zT-y1dEjenfRwGk0ezjGc.png" width="300" style="border-radius: 20px;"/>

<!-- Links -->
<p style="margin-top: 20px;">
  <a href="https://example.com/report" style="margin: 0 10px;">
    📄 <strong>Paper</strong>
  </a> |
  <a href="https://github.com/KaiHe-better/Crab?tab=readme-ov-file" style="margin: 0 10px;">
    💻 <strong>GitHub</strong>
  </a> |
  <a href="https://huggingface.co/HeAAAAA/RoleRM" style="margin: 0 10px;">
    💬 <strong>Role-playing Evaluation Model</strong>
  </a>
  <br>
  <a href="https://huggingface.co/datasets/HeAAAAA/Crab-role-playing-train-set" style="margin: 0 10px;">
    💬 <strong>Training Dataset</strong>
  </a> |
  <a href="https://huggingface.co/datasets/HeAAAAA/Crab-role-playing-evaluation-benchmark" style="margin: 0 10px;">
    💬 <strong>Evaluation Benchmark</strong>
  </a> |
  <a href="https://huggingface.co/datasets/HeAAAAA/Crab-manually-annotated-role-playing-evaluation-dataset" style="margin: 0 10px;">
    💬 <strong>Annotated Role-playing Evaluation Dataset</strong>
  </a>
</p>

</div>

# 1. Introduction

We introduce Crab, a novel Configurable Role-Playing (RP) LLM with an Assessing Benchmark, comprising Role-Centric Dataset Curation, Persona-Embodying LLM Construction, and Comprehensive Benchmark Creation for RP dialogue generation.
Unlike traditional RP models that support only a handful of preset roles, Crab allows the desired role to be configured dynamically, improving flexibility and adaptability.
To train RP-LLMs effectively, we curated the largest RP training dataset.
The dataset provides a detailed role overview for each dialogue, including a character profile, conversation scenario, and tagged topic, capturing a broad range of role-based behaviors, emotions, and interactions.
We also observed that current benchmarks lack both proper evaluation standards and methods.
Thus, to validate the effectiveness of RP-LLMs, we introduce a new benchmark containing an evaluation standard, a manually annotated test dataset, and a reward model, RoleRM, designed to automatically assess specific aspects of RP while aligning with human perception.
Extensive experiments reveal that RoleRM significantly outperforms ChatGPT and other evaluation methods in conducting fine-grained evaluations of RP.
Moreover, RP-LLMs powered by Crab demonstrate superior performance across various fine-grained aspects.

More details are available on [GitHub](https://github.com/KaiHe-better/Crab?tab=readme-ov-file).

# 2. Configurable Role-Playing LLM

<div align="center">

<img src="https://cdn-uploads.huggingface.co/production/uploads/650add6348983c90ab688b6e/fDaDq8tzBBUuEteND8N53.png" width="500" style="border-radius: 20px;"/>

</div>

Unlike existing RP-LLMs, where a single role is trained with numerous dialogues, our approach introduces a diverse range of roles with detailed configuration information while keeping the number of dialogues per role minimal. This enables LLMs to generate dialogues dynamically from configurations rather than memorizing specific roles, enhancing flexibility and adaptability. Additionally, we propose RoleRM in our benchmark to address the challenge of evaluating RP performance.

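The configuration-driven setup described above can be sketched as a system-prompt builder. This is a hypothetical illustration only: the attribute names mirror the configuration groups used in the ablation study below (base, ref., scene), but the actual prompt template used to train Crab is defined in the GitHub repository, not here.

```python
# Hypothetical sketch: flatten a role configuration into a system prompt.
# Attribute names ("age", "catchphrases", "scenario", ...) follow the
# ablation groups described in this README; the real template may differ.

def build_role_prompt(role: dict) -> str:
    """Assemble a role-configuration dict into a single prompt string."""
    lines = [f"You are role-playing as {role['name']}."]
    # "base" attributes: age, gender, personality, description, expression
    for key in ("age", "gender", "personality", "description", "expression"):
        if key in role:
            lines.append(f"{key.capitalize()}: {role[key]}")
    # "ref." attributes: catchphrases (and, similarly, knowledge)
    if role.get("catchphrases"):
        lines.append("Catchphrases: " + "; ".join(role["catchphrases"]))
    # "scene" attributes: interlocutor, relation, scenario
    for key in ("interlocutor", "relation", "scenario"):
        if key in role:
            lines.append(f"{key.capitalize()}: {role[key]}")
    lines.append("Stay in character throughout the conversation.")
    return "\n".join(lines)

role = {
    "name": "Sherlock Holmes",
    "age": "40",
    "personality": "observant, aloof",
    "catchphrases": ["Elementary."],
    "scenario": "A client visits 221B Baker Street.",
}
print(build_role_prompt(role))
```

The point of the design is that the role lives entirely in the prompt, so swapping the dict swaps the character without any retraining.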
# 3. Performance

| Models | Overall | Language Fluency | Language Relevance | Role Language | Role Knowledge | Emotional Expression | Interactive Engagement |
|----------------------|---------|------------------|---------------------|----------------|-----------------|-----------------------|------------------------|
| Llama-2-7B | 1.57 | 2.19 | 1.83 | 1.63 | 1.37 | 1.21 | 1.21 |
| Llama-3-8B | 1.99 | 2.56 | 2.36 | 2.09 | 1.78 | 1.56 | 1.60 |
| Llama-3.1-8B | 1.94 | 2.52 | 2.30 | 2.01 | 1.75 | 1.47 | 1.57 |
| Llama-2-7B-Crab | 2.14 | 2.73 | 2.35 | 2.07 | 1.88 | 1.69 | 2.12 |
| Llama-3-8B-Crab | 2.22 | 2.81 | 2.51 | 2.16 | 1.95 | 1.77 | 2.13 |
| **Llama-3.1-8B-Crab**| **2.23**| **2.87** | **2.56** | **2.17** | **1.95** | **1.76** | **2.09** |
| GPT3.5 | 1.66 | 2.35 | 2.11 | 1.72 | 1.50 | 1.11 | 1.17 |
| GPT4o | 1.86 | 2.44 | 2.27 | 1.90 | 1.69 | 1.33 | 1.51 |
| GPT4 | 2.13 | 2.73 | 2.53 | 2.18 | 1.90 | 1.62 | 1.86 |
| CharacterGLM-6B | 1.83 | 2.37 | 1.96 | 1.80 | 1.60 | 1.39 | 1.86 |
| Pygmalion-2-7B | 2.11 | 2.82 | 2.49 | 2.01 | 1.86 | 1.58 | 1.91 |
| Haruhi-Zero-7B | 2.17 | 2.80 | 2.49 | 2.12 | 2.00 | 1.74 | 1.86 |

Table 1: Evaluation results on the test data of our benchmark. The listed scores are produced by our RoleRM. Bold indicates the best results.

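A quick consistency check on Table 1: the Overall column appears to be the unweighted mean of the six fine-grained aspect scores (small deviations come from rounding the displayed values to two decimals). This is an observation from the table, not a formula documented here.

```python
# Verify that Overall ~= mean of the six aspect scores for a few rows of
# Table 1. Tolerance of 0.01 absorbs two-decimal rounding of the aspects.

rows = {
    # model: (overall, fluency, relevance, role_lang, role_know, emotion, engage)
    "Llama-2-7B":        (1.57, 2.19, 1.83, 1.63, 1.37, 1.21, 1.21),
    "Llama-3.1-8B-Crab": (2.23, 2.87, 2.56, 2.17, 1.95, 1.76, 2.09),
    "GPT4":              (2.13, 2.73, 2.53, 2.18, 1.90, 1.62, 1.86),
}

for model, (overall, *aspects) in rows.items():
    mean = sum(aspects) / len(aspects)
    assert abs(mean - overall) <= 0.01, model
    print(f"{model}: mean of aspects = {mean:.4f}, reported overall = {overall}")
```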
<div align="center">

<img src="https://cdn-uploads.huggingface.co/production/uploads/650add6348983c90ab688b6e/veerjA_MP5ZXmOxjAdZRO.png" width="500" style="border-radius: 20px;"/>

</div>

Figure 2: Human evaluation comparing Crab, GPT-3.5, and Pygmalion-2-7B. We selected a general LLM and one well-known RP-LLM to compare their generations against our Crab. For the same dialogue, annotators ranked the responses from the three LLMs.

| Models | Overall | Language Fluency | Language Relevance | Role Language | Role Knowledge | Emotional Expression | Interactive Engagement |
|---------------|---------|------------------|---------------------|----------------|-----------------|-----------------------|------------------------|
| **Crab (sampled)** | **2.20** | **2.71** | **2.45** | **2.15** | **1.95** | **1.84** | **2.12** |
| w/o base | 2.17 | 2.72 | 2.41 | 2.07 | 1.89 | 1.79 | 2.11 |
| w/o ref. | 2.15 | 2.70 | 2.40 | 2.01 | 1.85 | 1.82 | 2.11 |
| w/o scene | 2.15 | 2.69 | 2.39 | 2.10 | 1.90 | 1.81 | 1.98 |

Table 2: Ablation study for Crab. Because some instances in our dataset have missing attributes, we sampled 1,000 fully attributed instances as a sub-test set for the ablation experiments, referred to as Crab (sampled). "w/o base" means training RP-LLMs without the base role information (age, gender, personality, description, and expression); "w/o ref." means without catchphrases and knowledge; "w/o scene" means without interlocutor, relation, scenario, and tags.

# 4. Three Datasets

We release three datasets: the Crab role-playing train set, the Crab role-playing evaluation benchmark, and a manually annotated role-playing evaluation dataset (which can be used to train a role-playing evaluation model).

- Crab role-playing train set: https://huggingface.co/datasets/HeAAAAA/Crab-role-playing-train-set
- Crab role-playing evaluation benchmark: https://huggingface.co/datasets/HeAAAAA/Crab-role-playing-evaluation-benchmark
- Crab manually annotated role-playing evaluation dataset: https://huggingface.co/datasets/HeAAAAA/Crab-manually-annotated-role-playing-evaluation-dataset

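As a sketch of how a training record with a role overview might be consumed, the snippet below pairs the overview with the dialogue as a supervised example. The field names (`profile`, `scenario`, `topic`, `dialogue`) are assumptions based on the role-overview description in this README; inspect the dataset card for the actual schema before relying on them.

```python
# Hypothetical sketch: turn one training record into a (context, target)
# pair. Field names are ASSUMED, not taken from the dataset schema.
# Loading the real data would typically start with:
#   from datasets import load_dataset
#   ds = load_dataset("HeAAAAA/Crab-role-playing-train-set")

def record_to_example(record: dict) -> dict:
    """Pair the role overview (context) with the dialogue (target)."""
    context = (
        f"Profile: {record['profile']}\n"
        f"Scenario: {record['scenario']}\n"
        f"Topic: {record['topic']}"
    )
    return {"context": context, "target": record["dialogue"]}

example = record_to_example({
    "profile": "A retired sea captain, gruff but kind.",
    "scenario": "A stranger asks for directions at the harbor.",
    "topic": "local history",
    "dialogue": "Aye, the old lighthouse? Follow the pier north...",
})
print(example["context"])
```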
# 5. Role-playing Evaluation Model

We also release RoleRM, a trained model that automates the evaluation of role-playing tasks: https://huggingface.co/HeAAAAA/RoleRM

# 6. Citation

```bibtex
@misc{he2025crab,
      title={Crab: A Novel Configurable Role-Playing LLM with Assessing Benchmark},
      author={Kai He and Yucheng Huang and Wenqing Wang and others},
      year={2025},
}
```