File size: 6,892 Bytes
01a1944 554d88f 01a1944 554d88f 74deae1 d311e03 74deae1 d311e03 554d88f 74deae1 01a1944 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 | ---
pipeline_tag: video-text-to-text
library_name: transformers
---
# EVA: Efficient Reinforcement Learning for End-to-End Video Agent
[](https://arxiv.org/abs/2603.22918)
[](https://huggingface.co/papers/2603.22918)
[](https://github.com/wangruohui/EfficientVideoAgent)
[](https://huggingface.co/WRHC/EfficientVideoAgent/)
This repository contains the model weights proposed in our paper [EVA: Efficient Reinforcement Learning for End-to-End Video Agent](https://arxiv.org/abs/2603.22918). Official evaluation codes are hosted on [GitHub](https://github.com/wangruohui/EfficientVideoAgent).
EVA (Efficient Video Agent) is an end-to-end framework that enables "planning-before-perception" through iterative summary-plan-action-reflection reasoning. Unlike passive recognizers, EVA autonomously decides what to watch, when to watch, and how to watch, achieving query-driven and efficient video understanding.

## 1. Paper and Model
- Paper Title: `EVA: Efficient Reinforcement Learning for End-to-End Video Agent`
- Paper Link: `https://arxiv.org/abs/2603.22918`
- GitHub Repository: `https://github.com/wangruohui/EfficientVideoAgent`
- Model Link: `https://huggingface.co/WRHC/EfficientVideoAgent/`
## 2. Reference Results
Reference result files are provided in this repository, under `results-12k`.
You can compute accuracy with `accuracy.py`:
```bash
python accuracy.py <result_jsonl_path>
```
Main results:
| Dataset | Acc | Round | Token |
| --- | ---: | ---: | ---: |
| VideoMME | 60.15 | 2.42 | 16911 |
| LongVideoBench | 54.97 | 2.57 | 19042 |
| MLVU | 68.26 | 2.42 | 16570 |
| LSDBench | 49.31 | 2.48 | 13914 |
| VideoHolmes | 37.18 | 2.75 | 9085 |
| LVBench | 43.32 | 2.62 | 20412 |
`Token` includes both text tokens and image tokens.
## 3. Run Your Own Evaluation
### Step 1. Clone the Repository
```bash
git clone https://github.com/wangruohui/EfficientVideoAgent.git
cd EfficientVideoAgent
```
### Step 2. Download Model and Install Dependencies
1. Download model weights from `https://huggingface.co/WRHC/EfficientVideoAgent/` to `hf_model/`:
```bash
huggingface-cli download WRHC/EfficientVideoAgent --local-dir hf_model
```
2. Install FFmpeg following `https://www.ffmpeg.org/download.html`, ensure `ffprobe` is in `PATH`, and ensure FFmpeg shared libraries are in `LD_LIBRARY_PATH`.
3. Install dependencies from `requirements.txt` (recommended: `uv`)
```bash
uv venv .venv
source .venv/bin/activate
uv pip install -r requirements.txt
```
### Step 3. Download Evaluation Datasets and Update Dataset Paths
`eval-eva.py` reads dataset meta from `DATASET_CONFIG`. Before running evaluation, make sure each dataset is available locally and paths are correct.
1. Download and extract video datasets (VideoMME / LSDBench / LVBench / VideoHolmes / LongVideoBench / MLVU).
2. Annotation jsonl files are already provided in `data/*.jsonl` and have been normalized to a unified format.
3. Edit `eval-eva.py` -> `DATASET_CONFIG`: only `video_root` needs to be changed to your local video directory.
Example:
```python
DATASET_CONFIG = {
"videomme": {
"jsonl": "data/videomme_test_wosubtitles_raw_list_full.jsonl",
"video_root": "/path/to/VideoMME/video",
"cache": "cache_videomme.jsonl",
"result": "result_videomme.jsonl",
},
}
```
### Step 4. Serve the Model with vLLM (Multi-GPU Data Parallel)
```bash
vllm serve <MODEL_PATH_OR_HF_ID> \
--data-parallel-size <NUM_GPUS> \
--limit-mm-per-prompt '{"image": 9999, "video":0}' \
--mm_processor_cache_gb 20 \
--attention-backend FLASH_ATTN \
--allowed-local-media-path <LOCAL_MEDIA_ROOT>
```
**Reproducibility Note**
With vLLM, even when `temperature=0`, final accuracy can still fluctuate by around `0.x%` across runs.
### Step 5. Configure `eval-eva.py` Runtime Settings and Run Evaluation
Before running, edit the config section at the top of `eval-eva.py`:
- `BASE_URL`: OpenAI-compatible endpoint for your vLLM server (for example, `http://localhost:8000/v1`).
- `API_KEY`: API key used by the client (can be a dummy value for local vLLM setups if authentication is disabled).
- `MODEL_TOKENIZER_PATH`: Tokenizer path, should pointing to downloaded hf model weights, i.e. `https://huggingface.co/WRHC/EfficientVideoAgent/` in step 2.
- `FRAME_TOOL_PATH`: path to the frame selection tool script (default is `select_frame_fallback.py`).
- `FRAME_SAVE_ROOT`: directory where extracted frames are saved during tool calls.
Also make sure:
- `FRAME_SAVE_ROOT` directory exists and is writable (or set it to a writable path).
- vLLM `--allowed-local-media-path` covers your dataset `video_root` directories.
- `DATASET_CONFIG`: per-dataset I/O configuration.
- `DATASET_CONFIG[*].video_root`: root directory containing raw video files.
- `DATASET_CONFIG[*].cache`: incremental cache file used during running.
- `DATASET_CONFIG[*].result`: final merged output file written at the end.
Run one dataset:
```bash
python eval-eva.py --dataset videomme
python eval-eva.py --dataset lsdbench
python eval-eva.py --dataset lvbench
python eval-eva.py --dataset videoholmes
python eval-eva.py --dataset longvideobench
python eval-eva.py --dataset mlvu
```
You can control per-tool-call visual token budget via `-v/--max-visual-tokens`.
When a tool call exceeds this budget, `eval-eva.py` automatically reduces resolution and frame count before extraction.
```bash
python eval-eva.py --dataset videomme -v 12000
python eval-eva.py --dataset videomme -v 32000
```
Run all supported datasets with `batch.sh`:
```bash
bash batch.sh
```
## 4. Output Files and Cache/Resume Mechanism
- Output naming is controlled by `DATASET_CONFIG` in `eval-eva.py`.
- If the process is interrupted, rerunning the same command resumes from cache and skips finished samples.
- By default, each dataset writes:
- `cache_*.jsonl`: online cache (appended sample-by-sample)
- `result_*.jsonl`: final merged output
- Useful options:
- `--retry-error`: retry only failed/error cached samples
- `--new-cache`: recreate cache from scratch
- `--output-dir`: redirect cache/result outputs to another directory
## Citation
```bibtex
@misc{zhang2026evaefficientreinforcementlearning,
title={EVA: Efficient Reinforcement Learning for End-to-End Video Agent},
author={Yaolun Zhang and Ruohui Wang and Jiahao Wang and Yepeng Tang and Xuanyu Zheng and Haonan Duan and Hao Lu and Hanming Deng and Lewei Lu},
year={2026},
eprint={2603.22918},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.22918},
}
``` |