# EVA: Efficient Reinforcement Learning for End-to-End Video Agent

[![Paper](https://img.shields.io/badge/Paper-Link-b31b1b.svg)](https://arxiv.org/abs/2603.22918)
[![GitHub](https://img.shields.io/badge/GitHub-Repository-black.svg)](https://github.com/wangruohui/EfficientVideoAgent)
[![Model](https://img.shields.io/badge/Model-Link-blue.svg)](https://huggingface.co/WRHC/EfficientVideoAgent/)

This repository contains the official evaluation code for the model proposed in our paper. The code is available on GitHub, and the model weights are available on Hugging Face.

![EVA Overview](fig1.png)
## 1. Paper and Model

- Paper Title: `EVA: Efficient Reinforcement Learning for End-to-End Video Agent`
- Paper Link: `https://arxiv.org/abs/2603.22918`
- GitHub Repository: `https://github.com/wangruohui/EfficientVideoAgent`
- Model Link: `https://huggingface.co/WRHC/EfficientVideoAgent/`
## 2. Reference Results

Reference result files are provided in this repository under `results-12k`.
You can compute accuracy with `accuracy.py`:

```bash
python accuracy.py <result_jsonl_path>
```
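Under the hood, the accuracy computation is just a match rate over the jsonl records. A minimal sketch, assuming each result line is a JSON object with `pred` and `answer` fields (the actual field names used by `accuracy.py` may differ):

```python
import json

def compute_accuracy(jsonl_path: str) -> float:
    """Fraction of samples whose predicted option matches the ground truth.

    Assumes each non-empty line is a JSON object with "pred" and "answer"
    fields holding option letters; the real schema in accuracy.py may differ.
    """
    correct = total = 0
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            total += 1
            # Compare option letters case-insensitively.
            if str(rec["pred"]).strip().upper() == str(rec["answer"]).strip().upper():
                correct += 1
    return correct / total if total else 0.0
```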
Main results:

| Dataset | Acc | Round | Token |
| --- | ---: | ---: | ---: |
| VideoMME | 60.15 | 2.42 | 16911 |
| LongVideoBench | 54.97 | 2.57 | 19042 |
| MLVU | 68.26 | 2.42 | 16570 |
| LSDBench | 49.31 | 2.48 | 13914 |
| VideoHolmes | 37.18 | 2.75 | 9085 |
| LVBench | 43.32 | 2.62 | 20412 |

`Token` counts both text tokens and image tokens.
## 3. Run Your Own Evaluation

### Step 1. Clone the Repository

```bash
git clone https://github.com/wangruohui/EfficientVideoAgent.git
cd EfficientVideoAgent
```
### Step 2. Download Model and Install Dependencies

1. Download the model weights from `https://huggingface.co/WRHC/EfficientVideoAgent/` to `hf_model/`:

   ```bash
   huggingface-cli download WRHC/EfficientVideoAgent --local-dir hf_model
   ```

2. Install FFmpeg following `https://www.ffmpeg.org/download.html`, and make sure `ffprobe` is on `PATH` and the FFmpeg shared libraries are on `LD_LIBRARY_PATH`.
3. Install the dependencies from `requirements.txt` (recommended: `uv`):

   ```bash
   uv venv .venv
   source .venv/bin/activate
   uv pip install -r requirements.txt
   ```
### Step 3. Download Evaluation Datasets and Update Dataset Paths

`eval-eva.py` reads dataset metadata from `DATASET_CONFIG`. Before running the evaluation, make sure each dataset is available locally and its paths are correct.

1. Download and extract the video datasets (VideoMME / LSDBench / LVBench / VideoHolmes / LongVideoBench / MLVU).
2. Annotation jsonl files are already provided in `data/*.jsonl` and have been normalized to a unified format.
3. In `eval-eva.py`, edit `DATASET_CONFIG`: only `video_root` needs to be changed to your local video directory.

Example:

```python
DATASET_CONFIG = {
    "videomme": {
        "jsonl": "data/videomme_test_wosubtitles_raw_list_full.jsonl",
        "video_root": "/path/to/VideoMME/video",
        "cache": "cache_videomme.jsonl",
        "result": "result_videomme.jsonl",
    },
}
```
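A quick sanity check over the configured paths can save a failed run. A minimal sketch (`check_dataset_config` is a hypothetical helper, not part of the repository; it assumes the `DATASET_CONFIG` shape shown above):

```python
import os

def check_dataset_config(config: dict) -> list[str]:
    """Return a list of problems found in a DATASET_CONFIG-style dict."""
    problems = []
    for name, cfg in config.items():
        # The annotation jsonl ships with the repo and must exist on disk.
        if not os.path.isfile(cfg["jsonl"]):
            problems.append(f"{name}: missing annotation file {cfg['jsonl']}")
        # video_root must point at your local copy of the raw videos.
        if not os.path.isdir(cfg["video_root"]):
            problems.append(f"{name}: video_root not found {cfg['video_root']}")
    return problems
```

An empty return value means every configured dataset is ready to evaluate.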
### Step 4. Serve the Model with vLLM (Multi-GPU Data Parallel)

```bash
vllm serve <MODEL_PATH_OR_HF_ID> \
    --data-parallel-size <NUM_GPUS> \
    --limit-mm-per-prompt '{"image": 9999, "video": 0}' \
    --mm-processor-cache-gb 20 \
    --attention-backend FLASH_ATTN \
    --allowed-local-media-path <LOCAL_MEDIA_ROOT>
```
**Reproducibility Note**

With vLLM, even at `temperature=0`, final accuracy can still fluctuate by a few tenths of a percentage point across runs.
### Step 5. Configure `eval-eva.py` Runtime Settings and Run Evaluation

Before running, edit the config section at the top of `eval-eva.py`:

- `BASE_URL`: OpenAI-compatible endpoint of your vLLM server (for example, `http://localhost:8000/v1`).
- `API_KEY`: API key used by the client (can be a dummy value for local vLLM setups if authentication is disabled).
- `MODEL_TOKENIZER_PATH`: tokenizer path; should point to the model weights downloaded from `https://huggingface.co/WRHC/EfficientVideoAgent/` in Step 2.
- `FRAME_TOOL_PATH`: path to the frame selection tool script (default: `select_frame_fallback.py`).
- `FRAME_SAVE_ROOT`: directory where extracted frames are saved during tool calls.
- `DATASET_CONFIG`: per-dataset I/O configuration.
  - `DATASET_CONFIG[*].video_root`: root directory containing the raw video files.
  - `DATASET_CONFIG[*].cache`: incremental cache file updated while the evaluation runs.
  - `DATASET_CONFIG[*].result`: final merged output file written at the end.

Also make sure that:

- the `FRAME_SAVE_ROOT` directory exists and is writable (or set it to a writable path);
- vLLM's `--allowed-local-media-path` covers your dataset `video_root` directories.
Run one dataset:

```bash
python eval-eva.py --dataset videomme
python eval-eva.py --dataset lsdbench
python eval-eva.py --dataset lvbench
python eval-eva.py --dataset videoholmes
python eval-eva.py --dataset longvideobench
python eval-eva.py --dataset mlvu
```

You can control the per-tool-call visual token budget via `-v/--max-visual-tokens`.
When a tool call exceeds this budget, `eval-eva.py` automatically reduces the resolution and frame count before extraction.

```bash
python eval-eva.py --dataset videomme -v 12000
python eval-eva.py --dataset videomme -v 32000
```
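The automatic reduction follows roughly this idea: shrink resolution first, then drop frames, until the estimated visual token count fits the budget. A hypothetical sketch, not the repository's actual implementation (it assumes one visual token per 28×28 pixel patch, as in ViT-style encoders; `eval-eva.py`'s real estimator may differ):

```python
def fit_visual_budget(num_frames: int, width: int, height: int,
                      max_visual_tokens: int, patch: int = 28) -> tuple[int, int, int]:
    """Reduce resolution, then frame count, until frames fit the token budget.

    Assumes one visual token per `patch` x `patch` pixel patch; the real
    estimator in eval-eva.py may count tokens differently.
    """
    def tokens(n: int, w: int, h: int) -> int:
        return n * (w // patch) * (h // patch)

    # First halve the resolution (down to a floor) while over budget.
    while tokens(num_frames, width, height) > max_visual_tokens and min(width, height) > 2 * patch:
        width, height = width // 2, height // 2
    # Then drop frames if reducing resolution alone is not enough.
    while tokens(num_frames, width, height) > max_visual_tokens and num_frames > 1:
        num_frames -= 1
    return num_frames, width, height
```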
Run all supported datasets with `batch.sh`:

```bash
bash batch.sh
```
## 4. Output Files and Cache/Resume Mechanism

- Output naming is controlled by `DATASET_CONFIG` in `eval-eva.py`.
- If the process is interrupted, rerunning the same command resumes from the cache and skips finished samples.
- By default, each dataset writes:
  - `cache_*.jsonl`: online cache, appended to sample by sample
  - `result_*.jsonl`: final merged output
- Useful options:
  - `--retry-error`: retry only failed/error cached samples
  - `--new-cache`: recreate the cache from scratch
  - `--output-dir`: redirect cache/result outputs to another directory
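The resume behavior amounts to replaying the append-only cache on startup. A hypothetical sketch (not the repository's code; it assumes each cache line carries a unique sample `id`, and the actual field names may differ):

```python
import json
import os

def load_finished_ids(cache_path: str) -> set:
    """Collect ids of samples already present in the incremental cache."""
    done = set()
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            for line in f:
                line = line.strip()
                if line:
                    done.add(json.loads(line)["id"])
    return done

def append_result(cache_path: str, record: dict) -> None:
    """Append one finished sample so an interrupted run can resume later."""
    with open(cache_path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

On restart, any sample whose id is in `load_finished_ids(...)` is skipped, which is why rerunning the same command picks up where the previous run stopped.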
## Citation

```bibtex
@misc{zhang2026evaefficientreinforcementlearning,
  title={EVA: Efficient Reinforcement Learning for End-to-End Video Agent},
  author={Yaolun Zhang and Ruohui Wang and Jiahao Wang and Yepeng Tang and Xuanyu Zheng and Haonan Duan and Hao Lu and Hanming Deng and Lewei Lu},
  year={2026},
  eprint={2603.22918},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.22918},
}
```