sunweiwei committed on
Commit
cdc13de
·
verified ·
1 Parent(s): d81b600

Add model card README

  1. README.md +91 -0
README.md ADDED
@@ -0,0 +1,91 @@
+ ---
+ license: apache-2.0
+ base_model:
+ - Qwen/Qwen3-VL-8B-Instruct
+ pipeline_tag: text-generation
+ library_name: transformers
+ tags:
+ - human-simulation
+ - role-play
+ - social-intelligence
+ ---
+
+ # Ditto-8B
+
+ **Ditto-8B** is an 8B open-weight model for **human behavior simulation**, covering theory of
+ mind, character role-play, social skills, learner simulation, user simulation, and persona
+ simulation.
+
+ - 📄 Paper: [Reinforcing Human Behavior Simulation via Verbal Feedback](https://arxiv.org/abs/2605.20506)
+ - 💻 Code: https://github.com/sunnweiwei/OdysSim
+ - 📊 Data (SOUL): https://huggingface.co/datasets/sunweiwei/Soul
+
+ ## Method
+
+ Ditto-8B is trained with **DITTO**, a reinforcement learning method that uses **verbal
+ feedback** as the learning signal. After each output, the model receives descriptive feedback
+ and produces an improved version; both are jointly optimized with GRPO. This distills the
+ verbal guidance into the policy, so **no feedback is needed at inference time**.
+
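One way to picture the joint optimization described above: GRPO scores a group of sampled outputs for the same prompt and normalizes each reward against the group's mean and standard deviation. The sketch below is illustrative only. The reward values, the choice to place initial and feedback-revised outputs in a single group, and the helper name `group_relative_advantages` are all assumptions, not details confirmed by this model card.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for one prompt: two initial outputs and two outputs
# revised after verbal feedback (the values are made up for illustration).
initial_rewards = [0.2, 0.4]
revised_rewards = [0.6, 0.8]

advantages = group_relative_advantages(initial_rewards + revised_rewards)
for label, adv in zip(["initial", "initial", "revised", "revised"], advantages):
    print(f"{label}: {adv:+.2f}")
```

Under these made-up rewards the revised outputs receive positive advantages and the initial ones negative, which is the kind of signal that would push a policy toward the improved behavior without needing feedback at inference time.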
+ ## Results
+
+ Primary metric for each benchmark (higher is better).
+
+ | Dim | Benchmark | GPT 5.5 | Gemini 3.1 Pro | Claude Opus 4.7 | Qwen 3.6 Plus | Others* | Qwen3 8B Inst | Ditto-8B |
+ |---|---|---|---|---|---|---|---|---|
+ | CONV | UserLLM | 65.3 | 67.7 | 57.6 | 72.1 | 44.6 | 46.0 | 91.5 |
+ | CONV | MirrorBench | 56.7 | 48.3 | 63.7 | 48.0 | 45.4 | 54.0 | 73.4 |
+ | CONV | Humanual-Chat | 28.2 | 21.0 | 22.6 | 22.2 | 25.8 | 24.7 | 21.0 |
+ | CONV | SimArena-Doc | 83.4 | 83.0 | 83.5 | 82.4 | 83.5 | 83.6 | 84.4 |
+ | SS | Sotopia-Hard | 31.9 | 27.8 | 32.4 | 28.3 | 31.7 | 27.7 | 45.8 |
+ | COG | Fantom | 93.0 | 93.0 | 80.0 | 89.0 | 70.0 | 23.0 | 92.0 |
+ | COG | Hitom | 82.0 | 86.0 | 93.0 | 73.0 | 56.0 | 62.0 | 79.0 |
+ | COG | Paratomi | 99.0 | 97.0 | 90.0 | 94.0 | 75.0 | 67.0 | 95.0 |
+ | COG | Social-R1 | 69.0 | 79.0 | 67.0 | 67.0 | 47.0 | 54.0 | 50.0 |
+ | ROLE | Coser | 66.2 | 62.1 | 66.5 | 55.9 | 30.3 | 43.5 | 64.4 |
+ | ROLE | Lifechoices | 91.0 | 84.0 | 92.0 | 79.0 | 67.0 | 70.0 | 70.0 |
+ | ROLE | Twinvoice | 74.0 | 86.0 | 83.0 | 71.0 | 40.0 | 42.0 | 71.0 |
+ | ROLE | BehaviorChain | 95.0 | 92.0 | 96.0 | 85.0 | 36.0 | 41.0 | 44.0 |
+ | ROLE | SimArena-Math | 68.5 | 71.5 | 68.7 | 70.9 | 70.5 | 68.9 | 69.6 |
+ | ROLE | Mistakes | 72.0 | 73.0 | 74.0 | 67.0 | 56.0 | 27.0 | 36.0 |
+ | ROLE | Humanual-Email | 50.1 | 46.9 | 50.4 | 47.9 | 42.8 | 43.7 | 40.8 |
+ | ROLE | Humanual-News | 40.2 | 42.3 | 41.3 | 41.8 | 33.1 | 32.5 | 27.5 |
+ | ROLE | Humanual-Politics | 42.0 | 32.5 | 43.5 | 31.6 | 34.2 | 33.2 | 29.7 |
+ | EVAL | AlignX | 71.2 | 73.4 | 71.6 | 69.8 | 66.8 | 68.6 | 67.4 |
+ | EVAL | Humanllm | 45.7 | 46.9 | 44.2 | 42.7 | 35.2 | 34.1 | 33.1 |
+ | EVAL | Socsci210 | 77.2 | 78.0 | 77.2 | 74.5 | 75.2 | 73.6 | 72.5 |
+ | EVAL | Humanual-Book | 57.6 | 62.4 | 61.4 | 58.4 | 50.2 | 53.6 | 53.4 |
+ | EVAL | Humanual-Opinion | 39.8 | 36.0 | 46.2 | 34.2 | 37.4 | 37.2 | 30.3 |
+
+ \* *Others*: best result among other specialized human-simulation models (HumanLM-8B, Sotopia-RL-7B, UserLM-8B, Coser-8B).
+
+ > **Note.** The released Ditto-8B is a single generalist distilled from a set of task-specific DITTO experts via rejection sampling on the training set.
+
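The note above mentions distillation via rejection sampling. A generic version of that idea can be sketched as follows: sample several candidate responses per training prompt from an expert, score them, and keep only the best candidate when it clears a threshold. Everything here is hypothetical and for illustration: `rejection_sample`, `toy_expert`, `toy_score`, the number of samples, and the threshold are assumptions, and the card does not say how the actual experts were sampled or scored.

```python
from itertools import count


def rejection_sample(prompts, sample_fn, score_fn, num_samples=3, threshold=0.75):
    """Keep the best-scoring candidate per prompt if it reaches the threshold."""
    kept = []
    for prompt in prompts:
        candidates = [sample_fn(prompt) for _ in range(num_samples)]
        scored = [(score_fn(prompt, c), c) for c in candidates]
        best_score, best = max(scored, key=lambda pair: pair[0])
        if best_score >= threshold:
            kept.append({"prompt": prompt, "response": best})
    return kept


# Toy stand-ins for an expert model and a scorer (both hypothetical).
_draft_ids = count()


def toy_expert(prompt):
    return f"{prompt} [draft {next(_draft_ids) % 5}]"


def toy_score(prompt, response):
    draft_id = int(response.split("draft ")[1].rstrip("]"))
    return draft_id / 4  # pretend higher draft ids are better


kept = rejection_sample(["p1", "p2"], toy_expert, toy_score)
print(kept)
```

The retained prompt/response pairs would then serve as supervised fine-tuning data for the single generalist model; prompts whose candidates all score below the threshold contribute nothing.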
+ ## Usage
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Load the tokenizer and model; dtype and device placement are chosen automatically.
+ model_name = "sunweiwei/Ditto-8B"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
+
+ # Format a single-turn conversation with the model's chat template.
+ messages = [{"role": "user", "content": "Hello!"}]
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer(text, return_tensors="pt").to(model.device)
+
+ # Generate a reply and decode only the newly generated tokens.
+ outputs = model.generate(**inputs, max_new_tokens=512)
+ print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
+ ```
+
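For simulation tasks, the `messages` list in the snippet above would typically carry a description of the person being simulated plus the conversation so far. The helper below is a minimal sketch of one way to assemble such a list. The function name `build_persona_messages`, the system-prompt wording, and the example persona are assumptions; this card does not document the prompt format Ditto-8B was trained on, so check the linked code repository before relying on it.

```python
def build_persona_messages(persona, history):
    """Prepend a persona description as a system message to the chat history."""
    system_prompt = (
        "You are simulating the following person. Stay in character.\n\n"
        f"{persona}"
    )
    return [{"role": "system", "content": system_prompt}, *history]


# Hypothetical persona and conversation turn, for illustration only.
messages = build_persona_messages(
    persona="A first-year student who is new to algebra and often mixes up signs.",
    history=[{"role": "user", "content": "Can you solve 2x - 3 = 7 for me?"}],
)
print(messages)
```

The resulting list can be passed to `tokenizer.apply_chat_template` in place of the single-turn `messages` used above.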
+ ## Citation
+
+ ```bibtex
+ @article{sun2026ditto,
+ title = {Reinforcing Human Behavior Simulation via Verbal Feedback},
+ author = {Sun, Weiwei and Zhou, Xuhui and Liu, Jiarui and Du, Weihua and Sun, Haojia and Xie, Yiqing and Ma, Qianou and Chen, Sihao and Wan, Mengting and Yang, Longqi and Zhou, Pei and Wu, Sherry and Welleck, Sean and Neubig, Graham and Yang, Yiming and Sap, Maarten},
+ year = {2026},
+ eprint = {2605.20506},
+ archivePrefix = {arXiv},
+ url = {http://arxiv.org/abs/2605.20506}
+ }
+ ```