Ryann829 committed · verified
Commit 2f4f092 · 1 Parent(s): 7c2fa6c

Update README.md

Files changed (1):
  1. README.md +6 -235

README.md CHANGED
@@ -76,7 +76,7 @@ pip install flash_attn==2.5.8 --no-build-isolation
 
 ## Data and base model preparation
 
-1. Download our **22K refined single-candidate data** and **35K multi-candidate data** from [Scone-S2I-57K](https://huggingface.co/datasets/Ryann829/Scone-S2I-57K). The 70K base single-canididate data are sampled from open-source datasets like [X2I](https://huggingface.co/datasets/yzwang/X2I-subject-driven), [MUSAR-Gen](https://huggingface.co/datasets/guozinan/MUSAR-Gen), [UNO-1M](https://huggingface.co/datasets/bytedance-research/UNO-1M), and [Echo-4o-Image](https://huggingface.co/datasets/Yejy53/Echo-4o-Image). Please refer to the dataset links for more details.
+1. Download our **22K refined single-candidate data** and **35K multi-candidate data** from [Scone-S2I-57K](https://huggingface.co/datasets/Ryann829/Scone-S2I-57K). The 70K base single-candidate data are sampled from open-source datasets like [X2I](https://huggingface.co/datasets/yzwang/X2I-subject-driven), [MUSAR-Gen](https://huggingface.co/datasets/guozinan/MUSAR-Gen), [UNO-1M](https://huggingface.co/datasets/bytedance-research/UNO-1M), and [Echo-4o-Image](https://huggingface.co/datasets/Yejy53/Echo-4o-Image). Please refer to the dataset links for more details.
 
 ```bash
 cd Scone
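As a side note on the data-preparation step touched by this hunk: the referenced dataset lives on the Hugging Face Hub, so it can be fetched programmatically with `huggingface_hub`. A minimal sketch — the `local_dir` path is an illustrative assumption, not something this commit or the README specifies:

```python
# Minimal sketch: download the Scone-S2I-57K dataset referenced in the hunk above.
# The local_dir path is an assumption for illustration; adjust it to your own layout.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Ryann829/Scone-S2I-57K",
    repo_type="dataset",  # dataset repo, not a model repo
    local_dir="data/Scone-S2I-57K",
)
```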
@@ -365,232 +365,6 @@ bash scripts/inference_single_case.sh
 > - To mitigate randomness, we perform 3 rounds of sampling at 1024x1024 resolution, scoring 3 times per round, yielding 9 group results. The final score is the average of these results.
 
 
-## SconeEval benchmark
-<table border="1" style="border-collapse: collapse; width: 100%;">
-<thead>
-<tr>
-<th rowspan="3">Method</th>
-<th colspan="2">Composition ↑</th>
-<th colspan="4">Distinction ↑</th>
-<th colspan="4">Distinction & Composition ↑</th>
-<th colspan="3">Average ↑</th>
-</tr>
-<tr>
-<th>Single</th>
-<th>Multi</th>
-<th colspan="2">Cross</th>
-<th colspan="2">Intra</th>
-<th colspan="2">Cross</th>
-<th colspan="2">Intra</th>
-<th rowspan="2">COM</th>
-<th rowspan="2">DIS</th>
-<th rowspan="2">Overall</th>
-</tr>
-<tr>
-<th>COM</th>
-<th>COM</th>
-<th>COM</th>
-<th>DIS</th>
-<th>COM</th>
-<th>DIS</th>
-<th>COM</th>
-<th>DIS</th>
-<th>COM</th>
-<th>DIS</th>
-</tr>
-</thead>
-<tbody>
-<tr>
-<td colspan="14" style="background-color: #ffefe6; text-align: center; font-weight: bold; font-style: italic;">Closed-Source Model</td>
-</tr>
-<tr>
-<td>Gemini-2.5-Flash-Image</td>
-<td>8.87</td>
-<td>7.94</td>
-<td>9.12</td>
-<td><strong>9.15</strong></td>
-<td>9.00</td>
-<td>8.50</td>
-<td>8.27</td>
-<td><strong>8.87</strong></td>
-<td>8.17</td>
-<td>8.85</td>
-<td>8.56</td>
-<td>8.84</td>
-<td>8.70</td>
-</tr>
-<tr>
-<td>GPT-4o<sup>*</sup></td>
-<td><strong>8.92</strong></td>
-<td><strong>8.51</strong></td>
-<td><strong>9.18</strong></td>
-<td>8.55</td>
-<td><strong>9.45</strong></td>
-<td><strong>9.01</strong></td>
-<td><strong>8.83</strong></td>
-<td>8.49</td>
-<td><strong>8.99</strong></td>
-<td><strong>9.56</strong></td>
-<td><strong>8.98</strong></td>
-<td><strong>8.90</strong></td>
-<td><strong>8.94</strong></td>
-</tr>
-<tr>
-<td colspan="14" style="background-color: #e0eef9; text-align: center; font-weight: bold; font-style: italic;">Generation Model</td>
-</tr>
-<tr>
-<td>FLUX.1 Kontext [dev]</td>
-<td>7.92</td>
-<td>-</td>
-<td>7.93</td>
-<td>8.45</td>
-<td>6.20</td>
-<td>6.11</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-<td>-</td>
-</tr>
-<tr>
-<td>USO</td>
-<td>8.03</td>
-<td>5.19</td>
-<td>7.96</td>
-<td>8.50</td>
-<td>7.14</td>
-<td>6.51</td>
-<td>5.10</td>
-<td>6.25</td>
-<td>5.07</td>
-<td>5.57</td>
-<td>6.41</td>
-<td>6.71</td>
-<td>6.56</td>
-</tr>
-<tr>
-<td>UNO</td>
-<td>7.53</td>
-<td>5.38</td>
-<td>7.27</td>
-<td>7.90</td>
-<td>6.76</td>
-<td>6.53</td>
-<td>5.27</td>
-<td>7.02</td>
-<td>5.61</td>
-<td>6.27</td>
-<td>6.31</td>
-<td>6.93</td>
-<td>6.62</td>
-</tr>
-<tr>
-<td>UniWorld-V2<br>(Edit-R1-Qwen-Image-Edit-2509)</td>
-<td>8.41</td>
-<td><strong>7.16</strong></td>
-<td>8.63</td>
-<td>8.24</td>
-<td><strong>7.44</strong></td>
-<td>6.77</td>
-<td>7.52</td>
-<td>8.03</td>
-<td><strong>7.70</strong></td>
-<td><strong>7.24</strong></td>
-<td><strong>7.81</strong></td>
-<td>7.57</td>
-<td>7.69</td>
-</tr>
-<tr>
-<td>Qwen-Image-Edit-2509</td>
-<td><strong>8.54</strong></td>
-<td>6.85</td>
-<td><strong>8.85</strong></td>
-<td><strong>8.57</strong></td>
-<td>7.32</td>
-<td><strong>6.86</strong></td>
-<td><strong>7.53</strong></td>
-<td><strong>8.13</strong></td>
-<td>7.49</td>
-<td>7.02</td>
-<td>7.76</td>
-<td><strong>7.65</strong></td>
-<td><strong>7.70</strong></td>
-</tr>
-<tr>
-<td colspan="14" style="background-color: #E6E6FA; text-align: center; font-weight: bold; font-style: italic;">Unified Model</td>
-</tr>
-<tr>
-<td>BAGEL</td>
-<td>7.14</td>
-<td>5.55</td>
-<td>7.49</td>
-<td>7.95</td>
-<td>6.93</td>
-<td>6.21</td>
-<td>6.44</td>
-<td>7.38</td>
-<td>6.87</td>
-<td>7.27</td>
-<td>6.74</td>
-<td>7.20</td>
-<td>6.97</td>
-</tr>
-<tr>
-<td>OmniGen2</td>
-<td>8.00</td>
-<td>6.59</td>
-<td>8.31</td>
-<td>8.99</td>
-<td>6.99</td>
-<td>6.80</td>
-<td>7.28</td>
-<td>8.30</td>
-<td>7.14</td>
-<td>7.13</td>
-<td>7.39</td>
-<td>7.81</td>
-<td>7.60</td>
-</tr>
-<tr>
-<td>Echo-4o</td>
-<td><strong>8.58</strong></td>
-<td><strong>7.73</strong></td>
-<td>8.36</td>
-<td>8.33</td>
-<td>7.74</td>
-<td>7.18</td>
-<td>7.87</td>
-<td>8.72</td>
-<td>8.01</td>
-<td>8.33</td>
-<td>8.05</td>
-<td>8.14</td>
-<td>8.09</td>
-</tr>
-<tr>
-<td><strong>Scone (Ours)</strong></td>
-<td>8.52</td>
-<td>7.40</td>
-<td><strong>8.98</strong></td>
-<td><strong>9.73</strong></td>
-<td><strong>7.97</strong></td>
-<td><strong>7.74</strong></td>
-<td><strong>8.20</strong></td>
-<td><strong>9.25</strong></td>
-<td><strong>8.21</strong></td>
-<td><strong>8.44</strong></td>
-<td><strong>8.21</strong></td>
-<td><strong>8.79</strong></td>
-<td><strong>8.50</strong></td>
-</tr>
-</tbody>
-</table>
-
-> - *: GPT-4o responded to 365~370 test cases out of the total 409 cases due to OpenAI safety restrictions.
-> - To mitigate randomness, we perform 3 rounds of sampling at 1024x1024 resolution, scoring 3 times per round, yielding 9 group results. The final score is the average of these results.
-
 
 ## SconeEval benchmark
 <p align="center">
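The evaluation note kept as context at the top of this hunk describes the scoring protocol: 3 sampling rounds, each scored 3 times, with the final score being the mean of the 9 group results. That reduces to a plain average; a minimal sketch with placeholder numbers (not real benchmark outputs):

```python
# Sketch of the scoring protocol from the note above: 3 sampling rounds,
# 3 scoring passes per round, final score = mean of the 9 group results.
# The score values below are placeholders for illustration only.
round_scores = [
    [8.5, 8.3, 8.6],  # round 1: three scoring passes
    [8.4, 8.7, 8.5],  # round 2
    [8.6, 8.2, 8.4],  # round 3
]
all_scores = [s for scores in round_scores for s in scores]
final_score = sum(all_scores) / len(all_scores)  # average over the 9 results
print(f"Final score: {final_score:.2f}")
```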
@@ -875,14 +649,11 @@ If you find Scone helpful, please consider giving the repo a star ⭐.
 
 If you find this project useful for your research, please consider citing our paper:
 ```bibtex
-@misc{wang2025sconebridgingcompositiondistinction,
-  title={Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling},
-  author={Yuran Wang and Bohan Zeng and Chengzhuo Tong and Wenxuan Liu and Yang Shi and Xiaochen Ma and Hao Liang and Yuanxing Zhang and Wentao Zhang},
-  year={2025},
-  eprint={2512.12675},
-  archivePrefix={arXiv},
-  primaryClass={cs.CV},
-  url={https://arxiv.org/abs/2512.12675},
+@article{wang2025scone,
+  title={Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling},
+  author={Wang, Yuran and Zeng, Bohan and Tong, Chengzhuo and Liu, Wenxuan and Shi, Yang and Ma, Xiaochen and Liang, Hao and Zhang, Yuanxing and Zhang, Wentao},
+  journal={arXiv preprint arXiv:2512.12675},
+  year={2025}
 }
 ```
 