pipeline_tag: image-text-to-image
tags:
- subject-driven
---

<p align="center">
<img src="assets/logo.png" alt="Scone" width="400"/>
</p>
<h3 align="center">
Scone: Bridging Composition and Distinction in Subject-Driven Image Generation <br/>via Unified Understanding-Generation Modeling
</h3>
<p align="center">
<a href="https://arxiv.org/abs/2512.12675"><img alt="Build" src="https://img.shields.io/badge/arXiv%20paper-Scone-b31b1b.svg"></a>
<a href="https://github.com/Ryann-Ran/Scone"><img alt="Build" src="https://img.shields.io/github/stars/Ryann-Ran/Scone"></a>
<a href="https://huggingface.co/Ryann829/Scone"><img src="https://img.shields.io/static/v1?label=%F0%9F%A4%97%20Hugging%20Face&message=Model&color=green"></a>
<a href="https://huggingface.co/datasets/Ryann829/Scone-S2I-57K"><img src="https://img.shields.io/static/v1?label=%F0%9F%A4%97%20Hugging%20Face&message=Data&color=yellow"></a>
<a href="https://huggingface.co/datasets/Ryann829/SconeEval"><img src="https://img.shields.io/static/v1?label=%F0%9F%A4%97%20Hugging%20Face&message=Benchmark&color=yellow"></a>
</p>

# 🔧 Environment setup

```bash
git clone https://github.com/Ryann-Ran/Scone.git
cd Scone
conda create -n scone python=3.10 -y
conda activate scone
pip install -r requirements.txt
pip install flash_attn==2.5.8 --no-build-isolation
```


# 🔍 Inference and Evaluation

## Scone model preparation

Download the Scone model checkpoint from [Hugging Face](https://huggingface.co/Ryann829/Scone):

```bash
# pip install -U huggingface_hub
hf download Ryann829/Scone --local-dir ./ckpts/Scone
```
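
If you prefer the Python API, the same snapshot can be fetched with `huggingface_hub.snapshot_download`. This is a minimal sketch (the `fetch_scone` helper name is ours, not part of the repo); the default target directory matches the CLI command above:

```python
def fetch_scone(local_dir: str = "./ckpts/Scone") -> str:
    """Download the full Ryann829/Scone snapshot and return its local path."""
    # Lazy import so the helper can be defined even before
    # huggingface_hub is installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id="Ryann829/Scone", local_dir=local_dir)

# Usage (downloads several GB of checkpoint files):
# fetch_scone()
```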

## Single case inference

Run the inference script:

```bash
bash scripts/inference_single_case.sh
```

**Example Output:** images sampled at 1024×1024 resolution with seed 1234 (not applicable to the GPT-4o and Gemini-2.5-Flash-Image APIs).

<table>
<tr>
<th style="width: 150px; text-align: center;">Ref. 1</th>
<th style="width: 150px; text-align: center;">Ref. 2</th>
<th style="width: 200px; text-align: center;">Instruction</th>
<th style="width: 150px; text-align: center;">Scone (Ours)</th>
<th style="width: 150px; text-align: center;">GPT-4o</th>
<th style="width: 150px; text-align: center;">Gemini-2.5-Flash-Image</th>
<th style="width: 150px; text-align: center;">UNO</th>
<th style="width: 150px; text-align: center;">Qwen-Image-Edit-2509</th>
<th style="width: 150px; text-align: center;">BAGEL</th>
<th style="width: 150px; text-align: center;">OmniGen2</th>
<th style="width: 150px; text-align: center;">Echo-4o</th>
</tr>
<tr>
<td style="width: 150px; text-align: center;"><img src="./example_images/image_object.png" style="width: 150px; height: auto; display: block; margin: 0 auto;" /></td>
<td style="width: 150px; text-align: center;"><img src="./example_images/image_character.png" style="width: 150px; height: auto; display: block; margin: 0 auto;" /></td>
<td style="width: 200px; text-align: center;">The man from image 2 holds the object which has a blue-and-red top in image 1 in a coffee shop.</td>
<td style="width: 150px; text-align: center;"><img src="./example_results/Scone.png" style="width: 150px; height: auto; display: block; margin: 0 auto;" /></td>
<td style="width: 150px; text-align: center;"><img src="./example_results/GPT-4o.png" style="width: 150px; height: auto; display: block; margin: 0 auto;" /></td>
<td style="width: 150px; text-align: center;"><img src="./example_results/Gemini-2.5-Flash-Image.png" style="width: 150px; height: auto; display: block; margin: 0 auto;" /></td>
<td style="width: 150px; text-align: center;"><img src="./example_results/UNO.png" style="width: 150px; height: auto; display: block; margin: 0 auto;" /></td>
<td style="width: 150px; text-align: center;"><img src="./example_results/Qwen-Image-Edit-2509.png" style="width: 150px; height: auto; display: block; margin: 0 auto;" /></td>
<td style="width: 150px; text-align: center;"><img src="./example_results/BAGEL.png" style="width: 150px; height: auto; display: block; margin: 0 auto;" /></td>
<td style="width: 150px; text-align: center;"><img src="./example_results/OmniGen2.png" style="width: 150px; height: auto; display: block; margin: 0 auto;" /></td>
<td style="width: 150px; text-align: center;"><img src="./example_results/Echo-4o.png" style="width: 150px; height: auto; display: block; margin: 0 auto;" /></td>
</tr>
</table>


# 📊 Performance

## OmniContext benchmark

<table border="1" style="border-collapse: collapse; width: 100%;">
<thead>
<tr>
<th rowspan="2">Method</th>
<th colspan="2" style="text-align: center;">SINGLE ↑</th>
<th colspan="3" style="text-align: center;">MULTIPLE ↑</th>
<th colspan="3" style="text-align: center;">SCENE ↑</th>
<th rowspan="2" style="text-align: center;">Average ↑</th>
</tr>
<tr>
<th style="text-align: center;">Character</th>
<th style="text-align: center;">Object</th>
<th style="text-align: center;">Character</th>
<th style="text-align: center;">Object</th>
<th style="text-align: center;">Char. + Obj.</th>
<th style="text-align: center;">Character</th>
<th style="text-align: center;">Object</th>
<th style="text-align: center;">Char. + Obj.</th>
</tr>
</thead>
<tbody>
<tr><td colspan="10" style="background-color: #ffefe6; text-align: center; font-weight: bold; font-style: italic;">Closed-Source Model</td></tr>
<tr><td>Gemini-2.5-Flash-Image</td><td>8.79</td><td><strong>9.12</strong></td><td>8.27</td><td>8.60</td><td>7.71</td><td>7.63</td><td>7.65</td><td>6.81</td><td>8.07</td></tr>
<tr><td>GPT-4o<sup>*</sup></td><td><strong>8.96</strong></td><td>8.91</td><td><strong>8.90</strong></td><td><strong>8.95</strong></td><td><strong>8.81</strong></td><td><strong>8.92</strong></td><td><strong>8.40</strong></td><td><strong>8.44</strong></td><td><strong>8.78</strong></td></tr>
<tr><td colspan="10" style="background-color: #e0eef9; text-align: center; font-weight: bold; font-style: italic;">Generation Model</td></tr>
<tr><td>FLUX.1 Kontext [dev]</td><td>8.07</td><td>7.97</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
<tr><td>UNO</td><td>7.15</td><td>6.72</td><td>3.56</td><td>6.46</td><td>4.90</td><td>2.72</td><td>4.89</td><td>4.76</td><td>5.14</td></tr>
<tr><td>USO</td><td>8.03</td><td>7.55</td><td>3.32</td><td>6.10</td><td>4.56</td><td>2.77</td><td>5.38</td><td>5.09</td><td>5.35</td></tr>
<tr><td>UniWorld-V2</td><td>8.45</td><td><strong>8.44</strong></td><td>7.87</td><td>8.22</td><td><strong>7.95</strong></td><td><strong>5.36</strong></td><td>7.47</td><td><strong>6.98</strong></td><td>7.59</td></tr>
<tr><td>Qwen-Image-Edit-2509</td><td><strong>8.56</strong></td><td>8.41</td><td><strong>7.92</strong></td><td><strong>8.37</strong></td><td>7.79</td><td>5.23</td><td><strong>7.70</strong></td><td>6.86</td><td><strong>7.60</strong></td></tr>
<tr><td colspan="10" style="background-color: #E6E6FA; text-align: center; font-weight: bold; font-style: italic;">Unified Model</td></tr>
<tr><td>BAGEL</td><td>7.00</td><td>7.04</td><td>5.32</td><td>6.69</td><td>6.74</td><td>3.94</td><td>5.77</td><td>5.73</td><td>6.03</td></tr>
<tr><td>OmniGen2</td><td>8.17</td><td>7.63</td><td>7.26</td><td>7.03</td><td>7.56</td><td>7.02</td><td>6.90</td><td>6.64</td><td>7.28</td></tr>
<tr><td>Echo-4o</td><td><strong>8.34</strong></td><td>8.27</td><td>8.13</td><td><strong>8.14</strong></td><td>8.11</td><td><strong>7.07</strong></td><td>7.73</td><td><strong>7.77</strong></td><td>7.95</td></tr>
<tr><td><strong>Scone (Ours)</strong></td><td><strong>8.34</strong></td><td><strong>8.52</strong></td><td><strong>8.24</strong></td><td><strong>8.14</strong></td><td><strong>8.30</strong></td><td>7.06</td><td><strong>7.88</strong></td><td>7.63</td><td><strong>8.01</strong></td></tr>
</tbody>
</table>

> - *: GPT-4o responded to 365-370 of the 409 test cases due to OpenAI safety restrictions.
> - To mitigate randomness, we perform 3 rounds of sampling at 1024×1024 resolution and score each round 3 times, yielding 9 group results; the final score is the average of these results.


## SconeEval benchmark

<table border="1" style="border-collapse: collapse; width: 100%;">
<thead>
<tr>
<th rowspan="3">Method</th>
<th colspan="2">Composition ↑</th>
<th colspan="4">Distinction ↑</th>
<th colspan="4">Distinction & Composition ↑</th>
<th colspan="3">Average ↑</th>
</tr>
<tr>
<th>Single</th>
<th>Multi</th>
<th colspan="2">Cross</th>
<th colspan="2">Intra</th>
<th colspan="2">Cross</th>
<th colspan="2">Intra</th>
<th rowspan="2">COM</th>
<th rowspan="2">DIS</th>
<th rowspan="2">Overall</th>
</tr>
<tr>
<th>COM</th>
<th>COM</th>
<th>COM</th>
<th>DIS</th>
<th>COM</th>
<th>DIS</th>
<th>COM</th>
<th>DIS</th>
<th>COM</th>
<th>DIS</th>
</tr>
</thead>
<tbody>
<tr><td colspan="14" style="background-color: #ffefe6; text-align: center; font-weight: bold; font-style: italic;">Closed-Source Model</td></tr>
<tr><td>Gemini-2.5-Flash-Image</td><td>8.87</td><td>7.94</td><td>9.12</td><td><strong>9.15</strong></td><td>9.00</td><td>8.50</td><td>8.27</td><td><strong>8.87</strong></td><td>8.17</td><td>8.85</td><td>8.56</td><td>8.84</td><td>8.70</td></tr>
<tr><td>GPT-4o<sup>*</sup></td><td><strong>8.92</strong></td><td><strong>8.51</strong></td><td><strong>9.18</strong></td><td>8.55</td><td><strong>9.45</strong></td><td><strong>9.01</strong></td><td><strong>8.83</strong></td><td>8.49</td><td><strong>8.99</strong></td><td><strong>9.56</strong></td><td><strong>8.98</strong></td><td><strong>8.90</strong></td><td><strong>8.94</strong></td></tr>
<tr><td colspan="14" style="background-color: #e0eef9; text-align: center; font-weight: bold; font-style: italic;">Generation Model</td></tr>
<tr><td>FLUX.1 Kontext [dev]</td><td>7.92</td><td>-</td><td>7.93</td><td>8.45</td><td>6.20</td><td>6.11</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
<tr><td>USO</td><td>8.03</td><td>5.19</td><td>7.96</td><td>8.50</td><td>7.14</td><td>6.51</td><td>5.10</td><td>6.25</td><td>5.07</td><td>5.57</td><td>6.41</td><td>6.71</td><td>6.56</td></tr>
<tr><td>UNO</td><td>7.53</td><td>5.38</td><td>7.27</td><td>7.90</td><td>6.76</td><td>6.53</td><td>5.27</td><td>7.02</td><td>5.61</td><td>6.27</td><td>6.31</td><td>6.93</td><td>6.62</td></tr>
<tr><td>UniWorld-V2<br>(Edit-R1-Qwen-Image-Edit-2509)</td><td>8.41</td><td><strong>7.16</strong></td><td>8.63</td><td>8.24</td><td><strong>7.44</strong></td><td>6.77</td><td>7.52</td><td>8.03</td><td><strong>7.70</strong></td><td><strong>7.24</strong></td><td><strong>7.81</strong></td><td>7.57</td><td>7.69</td></tr>
<tr><td>Qwen-Image-Edit-2509</td><td><strong>8.54</strong></td><td>6.85</td><td><strong>8.85</strong></td><td><strong>8.57</strong></td><td>7.32</td><td><strong>6.86</strong></td><td><strong>7.53</strong></td><td><strong>8.13</strong></td><td>7.49</td><td>7.02</td><td>7.76</td><td><strong>7.65</strong></td><td><strong>7.70</strong></td></tr>
<tr><td colspan="14" style="background-color: #E6E6FA; text-align: center; font-weight: bold; font-style: italic;">Unified Model</td></tr>
<tr><td>BAGEL</td><td>7.14</td><td>5.55</td><td>7.49</td><td>7.95</td><td>6.93</td><td>6.21</td><td>6.44</td><td>7.38</td><td>6.87</td><td>7.27</td><td>6.74</td><td>7.20</td><td>6.97</td></tr>
<tr><td>OmniGen2</td><td>8.00</td><td>6.59</td><td>8.31</td><td>8.99</td><td>6.99</td><td>6.80</td><td>7.28</td><td>8.30</td><td>7.14</td><td>7.13</td><td>7.39</td><td>7.81</td><td>7.60</td></tr>
<tr><td>Echo-4o</td><td><strong>8.58</strong></td><td><strong>7.73</strong></td><td>8.36</td><td>8.33</td><td>7.74</td><td>7.18</td><td>7.87</td><td>8.72</td><td>8.01</td><td>8.33</td><td>8.05</td><td>8.14</td><td>8.09</td></tr>
<tr><td><strong>Scone (Ours)</strong></td><td>8.52</td><td>7.40</td><td><strong>8.98</strong></td><td><strong>9.73</strong></td><td><strong>7.97</strong></td><td><strong>7.74</strong></td><td><strong>8.20</strong></td><td><strong>9.25</strong></td><td><strong>8.21</strong></td><td><strong>8.44</strong></td><td><strong>8.21</strong></td><td><strong>8.79</strong></td><td><strong>8.50</strong></td></tr>
</tbody>
</table>

> - *: GPT-4o responded to 365-370 of the 409 test cases due to OpenAI safety restrictions.
> - To mitigate randomness, we perform 3 rounds of sampling at 1024×1024 resolution and score each round 3 times, yielding 9 group results; the final score is the average of these results.
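
The averaging protocol in the notes above (3 sampling rounds, each scored 3 times, yielding 9 group results) can be sketched as follows; the scores here are hypothetical, purely to illustrate the arithmetic:

```python
# Hypothetical group scores: 3 sampling rounds, each scored 3 times.
rounds = [
    [8.1, 8.0, 8.2],  # round 1 scorings
    [7.9, 8.0, 8.1],  # round 2 scorings
    [8.0, 8.2, 8.0],  # round 3 scorings
]
# Flatten to the 9 group results, then average for the final score.
group_results = [score for round_scores in rounds for score in round_scores]
final_score = sum(group_results) / len(group_results)
print(round(final_score, 2))  # → 8.06
```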


# 🚰 Citation

If you find Scone helpful, please consider giving the repo a star ⭐.

If you find this project useful for your research, please consider citing our paper:

```bibtex
@misc{wang2025sconebridgingcompositiondistinction,
  title={Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling},
  author={Yuran Wang and Bohan Zeng and Chengzhuo Tong and Wenxuan Liu and Yang Shi and Xiaochen Ma and Hao Liang and Yuanxing Zhang and Wentao Zhang},
  year={2025},
  eprint={2512.12675},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.12675},
}
```