We built [Step-Video-TI2V-Eval](https://github.com/stepfun-ai/Step-Video-T2V/blob/main/benchmark/Step-Video-T2V-Eval), a new benchmark designed for the text-driven image-to-video generation task. The dataset comprises 178 real-world and 120 anime-style prompt-image pairs, ensuring broad coverage of diverse user scenarios. To achieve comprehensive representation, we developed a fine-grained schema for data collection in both categories.
<p style="text-align: center;"><strong>Table 1: Comparison with baseline TI2V models using Step-Video-TI2V-Eval.</strong></p>
<table border="0" style="width: 100%; text-align: center; margin-top: 10px; border-collapse: collapse; border-radius: 8px; overflow: hidden;">
  <thead>
    <tr>
      <th style="width: 25%; padding: 10px;">vs. OSTopA</th>
      <th style="width: 25%; padding: 10px;">vs. OSTopB</th>
      <th style="width: 25%; padding: 10px;">vs. CSTopC</th>
      <th style="width: 25%; padding: 10px;">vs. CSTopD</th>
    </tr>
  </thead>
  <tbody>
    <tr><td>37-63-79</td><td>101-48-29</td><td>41-46-73</td><td>92-51-18</td></tr>
    <tr><td>40-35-44</td><td>94-16-10</td><td>52-35-47</td><td>87-18-17</td></tr>
    <tr><td>46-92-39</td><td>43-71-64</td><td>45-65-50</td><td>36-77-47</td></tr>
    <tr><td>42-61-18</td><td>50-35-35</td><td>29-62-43</td><td>37-63-23</td></tr>
    <tr><td>52-57-49</td><td>71-40-66</td><td>58-33-69</td><td>67-33-60</td></tr>
    <tr><td>75-17-28</td><td>67-30-24</td><td>78-17-39</td><td>68-41-14</td></tr>
    <tr>
      <td colspan="4" style="padding: 10px; font-weight: bold;">Total Score</td>
    </tr>
    <tr>
      <td>292-325-277</td>
      <td>426-240-228</td>
      <td>303-258-321</td>
      <td>387-283-179</td>
    </tr>
  </tbody>
</table>

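Reading each cell as win-tie-loss counts from pairwise comparisons (our reading; this section does not define the cell format), a Total Score entry is an element-wise sum of the six category rows above it. A minimal sketch:

```python
# Sketch: aggregating per-category "W-T-L" cells into a total score.
# Assumes each cell encodes wins-ties-losses from pairwise comparisons
# (an assumption; the cell format is not spelled out in this section).

def total_score(cells):
    """Sum a list of 'W-T-L' strings element-wise into one 'W-T-L' string."""
    wins = ties = losses = 0
    for cell in cells:
        w, t, l = (int(x) for x in cell.split("-"))
        wins += w
        ties += t
        losses += l
    return f"{wins}-{ties}-{losses}"

# The six per-category cells of the "vs. OSTopB" column:
column = ["101-48-29", "94-16-10", "43-71-64", "50-35-35", "71-40-66", "67-30-24"]
print(total_score(column))  # → 426-240-228, matching that column's Total Score
```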
[VBench](https://arxiv.org/html/2411.13503v1) is a comprehensive benchmark suite that deconstructs “video generation quality” into specific, hierarchical, and disentangled dimensions, each with tailored prompts and evaluation methods. We utilize the VBench-I2V benchmark to assess the performance of Step-Video-TI2V alongside other TI2V models.