---
title: VEFX-Code
emoji: 🎬
colorFrom: indigo
colorTo: pink
sdk: static
pinned: false
license: apache-2.0
short_description: VEFX-Bench reference code & inference utils
---

<div align="center">

# VEFX-Bench

### Benchmarking Generic Video Editing and Visual Effects

</div>

**VEFX-Bench** is a comprehensive benchmark for evaluating text-driven video editing and visual effects. It includes **5,049 annotated examples** spanning **9 categories** and **32 subcategories**. Edits are scored by **VEFX-Reward**, a VLM-based reward model that rates each edit on three dimensions on a 1–4 scale:

| Dimension | What it measures |
|---|---|
| **Instruction Following (IF)** | Does the edit accurately reflect the editing instruction? |
| **Render Quality (RQ)** | Visual clarity, temporal consistency, and physical plausibility |
| **Edit Exclusivity (EE)** | Were only the intended regions modified, without side-effects? |

---

## 🏆 Model Leaderboard

VEFX-Reward scores on a 1–4 scale. Models are ranked by **GeoAgg** (α=2 for IF, β=1 for RQ, γ=1 for EE). Higher is better.

> **📅 Updated: May 2, 2026.** For the latest results & submissions, visit the **[live leaderboard →](https://vefx-leaderboard.com/)**

| Rank | Model | Type | IF ↑ | RQ ↑ | EE ↑ | GeoAgg ↑ |
|:---:|---|---|:---:|:---:|:---:|:---:|
| 🥇 | **Kling o3 Omni** | Commercial | 3.033 | **3.588** | 3.043 | **3.057** |
| 🥈 | **Kling o1** | Commercial | **3.040** | 3.534 | 2.976 | 2.985 |
| 🥉 | **Runway Gen-4.5** | Commercial | 2.817 | 3.319 | 2.923 | 2.912 |
| 4 | Seedance 2.0 | Commercial | 2.811 | 3.421 | 3.088 | 2.766 |
| 5 | Grok Imagine | Commercial | 2.606 | 3.346 | **3.376** | 2.723 |
| 6 | Luma Ray 3 | Commercial | 2.702 | 3.403 | 2.705 | 2.717 |
| 7 | UniVideo | Open-source | 2.294 | 3.266 | 3.091 | 2.516 |
| 8 | Wan 2.6 | Commercial | 2.012 | 3.317 | 2.446 | 2.146 |
| 9 | Luma Ray 2 | Commercial | 2.038 | 2.532 | 1.363 | 1.804 |
| 10 | VACE | Open-source | 2.027 | 3.172 | 1.180 | 1.775 |

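The GeoAgg formula is not spelled out here. One plausible reading of the weights is a weighted geometric mean, `(IF^α · RQ^β · EE^γ)^(1/(α+β+γ))`, sketched below. The function name `geo_agg` and the normalization by the weight sum are assumptions, not taken from the VEFX-Reward code. Applying this form to the column averages in the table does not reproduce the GeoAgg column exactly, so treat it as an illustration of weighted geometric aggregation rather than the official definition.

```python
import math


def geo_agg(if_score, rq_score, ee_score, alpha=2.0, beta=1.0, gamma=1.0):
    """Weighted geometric mean of the three scores (assumed form, see text)."""
    total = alpha + beta + gamma
    # Work in log space: a weighted average of logs, then exponentiate.
    log_mean = (
        alpha * math.log(if_score)
        + beta * math.log(rq_score)
        + gamma * math.log(ee_score)
    ) / total
    return math.exp(log_mean)


# IF carries twice the weight of RQ or EE, so a weak IF lowers the
# aggregate more than an equally weak EE does.
print(round(geo_agg(2.0, 4.0, 4.0), 3))  # 2.828
print(round(geo_agg(4.0, 4.0, 2.0), 3))  # 3.364
```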
---

## 🎬 Demo Videos

Each demo shows the **original video** (left) alongside the **edited video** (right).

<table>
<tr>
<td align="center"><b>Attribute Change</b><br><sub>"Change the color of the red industrial trailer to a bright yellow while maintaining the texture and appearance of the metal surface."</sub></td>
<td align="center"><b>Object Removal</b><br><sub>"Remove the woman with the grey backpack walking on the right side of the frame."</sub></td>
</tr>
<tr>
<td align="center"><img src="assets/demo_attribute_change.gif" width="400"></td>
<td align="center"><img src="assets/demo_object_removal.gif" width="400"></td>
</tr>
<tr>
<td align="center"><b>Style Transfer</b><br><sub>"Restore the natural, realistic colors to the entire scene, replacing the current black and white style with a full-color rendition."</sub></td>
<td align="center"><b>Camera Motion</b><br><sub>"Perform a smooth zoom in on the distant snowy mountain peaks to create a more immersive view."</sub></td>
</tr>
<tr>
<td align="center"><img src="assets/demo_style_transfer.gif" width="400"></td>
<td align="center"><img src="assets/demo_camera_zoom.gif" width="400"></td>
</tr>
</table>

---

## 📊 Benchmark at a Glance

| | |
|---|---|
| 📝 **5,049** Annotated Examples | 🎬 **1,419** Source Videos |
| 📂 **9 / 32** Categories / Subcategories | 🤖 **10** Editing Systems |
| 📐 **3** Quality Dimensions (IF, RQ, EE) | 🧪 **300** Benchmark Test Pairs |

---

## 🤗 VEFX-Reward Models

| Model | Backbone | Params | HuggingFace | Status |
|---|---|---|---|---|
| **VEFX-Reward-4B** | Qwen3-VL-4B-Instruct | 4B | [VEFX-Reward/VEFX-Reward-4B](https://huggingface.co/VEFX-Reward/VEFX-Reward-4B) | ✅ Available |

---

## 📦 VEFX-Bench Dataset

The benchmark dataset is hosted on HuggingFace at **[VEFX-Reward/VEFX-Bench](https://huggingface.co/datasets/VEFX-Reward/VEFX-Bench)**.

| | |
|---|---|
| 🎬 **300** Source Videos (720p) | 📝 `prompts.json` with editing instructions |
| 📂 **9** Task Categories | 🗂️ `benchmark_meta.json` with category labels |

**Task Categories:** Style Transfer · Object Manipulation · Background Change · Color/Lighting · Motion/Animation · Text/Overlay · Composition · Removal/Inpainting · Complex/Multi-step

### Download and Evaluate

```python
from huggingface_hub import snapshot_download

# Download the benchmark dataset
snapshot_download(repo_id="VEFX-Reward/VEFX-Bench", repo_type="dataset", local_dir="./vefx_bench")
```

**Evaluation workflow:**
1. Download the 300 source videos and `prompts.json`
2. Apply your video editing model to each source video following its prompt
3. Save edited videos as `0000.mp4` through `0299.mp4` (matching source index)
4. Score with VEFX-Reward:

```python
import json
from vefx_reward import VEFXReward

model = VEFXReward("VEFX-Reward/VEFX-Reward-4B", device="cuda")

with open("vefx_bench/prompts.json") as f:
    prompts = json.load(f)

for idx, item in enumerate(prompts):
    scores = model.score(
        original_video=f"vefx_bench/{idx:04d}.mp4",
        edited_video=f"your_edits/{idx:04d}.mp4",
        instruction=item["instruction"],
    )
    print(f"[{idx:04d}] IF={scores['IF']:.2f}  RQ={scores['RQ']:.2f}  EE={scores['EE']:.2f}")
```
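
Before running the scoring loop, it can help to confirm that every expected output file exists. The sketch below assumes the naming from step 3 (`0000.mp4` through `0299.mp4`). The helper `missing_edits` and the `your_edits` directory are illustrative and not part of the `vefx_reward` package.

```python
from pathlib import Path


def missing_edits(edit_dir, count=300):
    """List expected file names (0000.mp4, 0001.mp4, ...) not found in edit_dir."""
    root = Path(edit_dir)
    expected = (f"{idx:04d}.mp4" for idx in range(count))
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_edits("your_edits")
    if missing:
        print(f"{len(missing)} edited videos missing, first: {missing[0]}")
    else:
        print("All 300 edited videos found.")
```

Running the check first avoids loading the model only to fail partway through on a missing file.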

---

## 🚀 Quick Start

### Installation

```bash
conda create -n vefx-bench python=3.10 -y
conda activate vefx-bench

# Install PyTorch first (match your CUDA version)
# See https://pytorch.org/get-started/locally/ for the right command
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124

# Install remaining dependencies
pip install -r requirements.txt

# Install the package
pip install -e .
```

> **Requirements:** Python ≥ 3.10, CUDA GPU, ~10 GB VRAM (bfloat16). Make sure your PyTorch CUDA version matches your driver.

### Score a Video Edit (Python API)

```python
from vefx_reward import VEFXReward

model = VEFXReward("VEFX-Reward/VEFX-Reward-4B", device="cuda")

scores = model.score(
    original_video="examples/sample_videos/object_removal_original.mp4",
    edited_video="examples/sample_videos/object_removal_edited.mp4",
    instruction="Remove the woman with the grey backpack walking on the right side of the frame.",
)
print(scores)
# {'IF': 2.34, 'RQ': 1.93, 'EE': 1.82, 'Overall': 6.09}
```

### CLI Usage

```bash
python examples/quick_start.py \
    --original examples/sample_videos/object_removal_original.mp4 \
    --edited examples/sample_videos/object_removal_edited.mp4 \
    --instruction "Remove the woman with the grey backpack walking on the right side of the frame."
```

### Score All Included Samples

The repo includes 4 sample video pairs with prompts. Score them all:

```python
import json
from vefx_reward import VEFXReward

model = VEFXReward("VEFX-Reward/VEFX-Reward-4B", device="cuda")

with open("examples/sample_videos/prompts.json") as f:
    samples = json.load(f)

for sample in samples:
    scores = model.score(
        original_video=f"examples/sample_videos/{sample['original']}",
        edited_video=f"examples/sample_videos/{sample['edited']}",
        instruction=sample["instruction"],
    )
    print(f"[{sample['category']}] IF={scores['IF']:.2f}  RQ={scores['RQ']:.2f}  EE={scores['EE']:.2f}")
```
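
To summarize a run, the per-sample dictionaries can be averaged per dimension. A minimal sketch, assuming each result carries the `IF`, `RQ`, and `EE` keys shown above. `mean_scores` is a hypothetical helper, and the numbers in the usage are placeholders, not real model output.

```python
from statistics import mean


def mean_scores(results, keys=("IF", "RQ", "EE")):
    """Average each score dimension over a list of per-sample score dicts."""
    if not results:
        raise ValueError("no results to average")
    return {key: mean(r[key] for r in results) for key in keys}


# Placeholder values for illustration only.
results = [
    {"IF": 3.0, "RQ": 3.5, "EE": 2.0},
    {"IF": 2.0, "RQ": 2.5, "EE": 4.0},
]
print(mean_scores(results))  # {'IF': 2.5, 'RQ': 3.0, 'EE': 3.0}
```

In the loop above, appending each `scores` dict to a list and passing that list to `mean_scores` would give one summary line per run.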

### Batch Scoring

Prepare a CSV with columns `original_video`, `edited_video`, `instruction`:

```bash
python examples/batch_scoring.py --csv edits.csv --output results.csv
```
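
The input CSV can be generated with Python's `csv` module, which quotes instructions that contain commas. A sketch, assuming the header row uses exactly the three column names above. `write_edits_csv` is an illustrative helper, not part of the repo.

```python
import csv

FIELDS = ["original_video", "edited_video", "instruction"]


def write_edits_csv(path, rows):
    """Write (original, edited, instruction) tuples to a CSV with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(FIELDS)
        writer.writerows(rows)


write_edits_csv("edits.csv", [
    (
        "examples/sample_videos/object_removal_original.mp4",
        "examples/sample_videos/object_removal_edited.mp4",
        "Remove the woman with the grey backpack walking on the right side of the frame.",
    ),
])
```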

### Multi-GPU Scoring

For large-scale evaluation across multiple GPUs:

```bash
python examples/multi_gpu_scoring.py --csv edits.csv --num_gpus 4 --output results.csv
```
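
How `multi_gpu_scoring.py` distributes work is not described here. As a general illustration of the idea, the sketch below splits rows round-robin into one shard per GPU. `shard_rows` is a hypothetical helper and may not match the script's actual strategy.

```python
def shard_rows(rows, num_gpus):
    """Split rows round-robin into num_gpus lists of near-equal size."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return [rows[gpu::num_gpus] for gpu in range(num_gpus)]


shards = shard_rows(list(range(10)), num_gpus=4)
for gpu, shard in enumerate(shards):
    print(f"cuda:{gpu} -> {shard}")
# cuda:0 -> [0, 4, 8]
# cuda:1 -> [1, 5, 9]
# cuda:2 -> [2, 6]
# cuda:3 -> [3, 7]
```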

---

## 📖 API Reference

### `VEFXReward`

```python
import torch
from vefx_reward import VEFXReward

model = VEFXReward(
    model_path="VEFX-Reward/VEFX-Reward-4B",  # HuggingFace ID or local path
    device="cuda",                           # "cuda", "cuda:0", "cpu"
    dtype=torch.bfloat16,                    # torch.bfloat16 or torch.float16
    fps=4.0,                                 # Video sampling rate
    max_frame_pixels=399360,                 # Max pixels per frame
)
```

#### `model.score(original_video, edited_video, instruction) → dict`

Score a single video edit. Returns `{'IF': float, 'RQ': float, 'EE': float, 'Overall': float}`.

#### `model.score_batch(original_videos, edited_videos, instructions) → list[dict]`

Score multiple edits sequentially. Each sample is processed independently to avoid out-of-memory (OOM) errors.
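
Sequential scoring amounts to calling the single-sample scorer once per triple. The following is a sketch of that pattern, not the library's implementation: `score_sequentially` is hypothetical and accepts any callable that takes the same keyword arguments as `model.score`.

```python
def score_sequentially(score_fn, original_videos, edited_videos, instructions):
    """Call score_fn once per (original, edited, instruction) triple, in order."""
    if not (len(original_videos) == len(edited_videos) == len(instructions)):
        raise ValueError("all three input lists must have the same length")
    return [
        score_fn(original_video=orig, edited_video=edit, instruction=text)
        for orig, edit, text in zip(original_videos, edited_videos, instructions)
    ]
```

With a loaded model, `score_sequentially(model.score, originals, edits, instructions)` would return one score dict per sample, in input order.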

---