Spaces:

Video-Reason
/

VBVR-Bench-Leaderboard

Running

File size: 7,452 Bytes

from dataclasses import dataclass
from enum import Enum

@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
    task0 = Task("anli_r1", "acc", "ANLI")
    task1 = Task("logiqa", "acc_norm", "LogiQA")

NUM_FEWSHOT = 0 # Change with your few shot
# ---------------------------------------------------



# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">VBVR-Bench Leaderboard</h1>"""

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
<a href="https://video-reason.com" target="_blank">
    <img alt="Code" src="https://img.shields.io/badge/Project%20-%20Homepage-4285F4" height="20" />
</a>
<a href="https://github.com/orgs/Video-Reason/repositories" target="_blank">
    <img alt="Code" src="https://img.shields.io/badge/VBVR-Code-100000?style=flat-square&logo=github&logoColor=white" height="20" />
</a>
  <a href="https://arxiv.org/abs/2602.20159" target="_blank">
      <img alt="arXiv" src="https://img.shields.io/badge/arXiv-VBVR-red?logo=arxiv" height="20" />
  </a>
<a href="https://huggingface.co/Video-Reason/VBVR-Dataset" target="_blank">
    <img alt="Leaderboard" src="https://img.shields.io/badge/%F0%9F%A4%97%20_VBVR_Dataset-Data-ffc107?color=ffc107&logoColor=white" height="20" />
</a>
<a href="https://huggingface.co/Video-Reason/VBVR-Bench-Data" target="_blank">
    <img alt="Leaderboard" src="https://img.shields.io/badge/%F0%9F%A4%97%20_VBVR_Bench-Data-ffc107?color=ffc107&logoColor=white" height="20" />
</a>
<a href="https://huggingface.co/Video-Reason/VBVR-Bench-Leaderboard" target="_blank">
    <img alt="Leaderboard" src="https://img.shields.io/badge/%F0%9F%A4%97%20_VBVR_Bench-Leaderboard-ffc107?color=ffc107&logoColor=white" height="20" />
</a>
**VBVR-Bench** is a comprehensive benchmark for evaluating **video reasoning capabilities**.

To systematically assess model reasoning capabilities, VBVR-Bench employs a **dual-split evaluation strategy** across **100 diverse tasks**:
- **In-Domain (ID)**: 50 tasks that overlap with training categories but differ in unseen parameter configurations and sample instances, testing *in-domain generalization*.
- **Out-of-Domain (OOD)**: 50 entirely novel tasks designed to measure *out-of-domain generalization*, testing whether models acquire transferable reasoning primitives rather than relying on task-specific memorization.

Each task consists of **5 test samples**, enabling statistically robust evaluation across diverse reasoning scenarios.

Use the column group selector below to customize which score groups are displayed.
"""

# Which evaluations are you running? how can people reproduce what you have?
LLM_BENCHMARKS_TEXT = f"""
## About VBVR-Bench

### Rule-Based Evaluation Framework

A key feature of VBVR-Bench is its fully **rule-based evaluation framework**. Most test tasks have a unique, verifiable correct answer, allowing interpretable evaluation based on spatial position, color, object identity, path, or logical outcome. Geometric, physical, and deductive constraints are also considered in the scoring rubrics.

Each of the 100 test tasks is paired with a dedicated evaluation rule, with scores on multiple aspects to compute a weighted, comprehensive score. Sub-criteria include:
- **Spatial Accuracy**: Correctness of object positions and arrangements
- **Trajectory Correctness**: Validity of movement paths
- **Temporal Consistency**: Smooth frame-by-frame progression
- **Logical Validity**: Adherence to task-specific reasoning constraints

### Example: Task G-45 (Key Door Matching)

A green dot agent must first locate a color-specified key and then navigate to the matching door within a grid maze. Performance is scored across four weighted dimensions:

| Dimension | Weight | Description |
|-----------|--------|-------------|
| Target Identification | 30% | Correct key and door selection without color confusion |
| Path Validity | 30% | Following allowed paths without wall collisions |
| Path Efficiency | 20% | Comparison to optimal BFS path |
| Animation Quality | 20% | Smooth movement and precise object alignment |

A perfect score requires all four dimensions to be satisfied.

### Key Benefits

- **Reproducibility and Determinism**: Fully deterministic evaluation avoiding stochastic variability or hallucinations associated with LLM-based judgments.
- **Granular Verifiability**: Each task is decomposed into interpretable vectors, allowing precise measurement of spatial, temporal, and logical correctness at the pixel or object-property level.
- **Transparent Diagnosis**: By explicitly encoding reasoning constraints, the benchmark not only ranks models but also reveals systematic capability gaps and cross-domain performance trends.

### Model Categories
- 👤 **Reference**: Human performance baseline
- 🟢 **Open-source**: Publicly available models  
- 🔵 **Proprietary**: Commercial/closed-source models
- ⭐ **Strong Baseline**: Data scaling strong baseline (VBVR-Wan2.2)
"""

EVALUATION_QUEUE_TEXT = """
## How to Submit Your Results

We welcome submissions from the research community! To submit your model's evaluation results to the VBVR-Bench leaderboard:

### Submission Process

📧 **Email your submission to: [C200210@e.ntu.edu.sg](mailto:C200210@e.ntu.edu.sg)**

Please include the following in your submission:

### Required Materials

1. **Model Information**
   - Model name and version
   - Model type (Open-source / Proprietary)
   - Link to model (if publicly available)
   - Brief model description

2. **Evaluation Results**
   - Complete evaluation scores in JSON format
   - Scores for all 100 tasks (50 ID + 50 OOD)
   - Category-wise breakdown (Abstraction, Knowledge, Perception, Spatiality, Transformation)

3. **Evaluation Logs**
   - Full evaluation logs for verification
   - Generated videos for a subset of tasks (optional but recommended)

4. **Technical Details**
   - Inference configuration (resolution, frame rate, etc.)
   - Hardware used for generation
   - Any preprocessing or postprocessing applied

We will review your submission and add it to the leaderboard within 1-2 weeks.
"""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""
@article{vbvr2026,
      title={A Very Big Video Reasoning Suite}, 
      author={Maijunxian Wang and Ruisi Wang and Juyi Lin and Ran Ji and Thaddäus Wiedemer and Qingying Gao and Dezhi Luo and Yaoyao Qian and Lianyu Huang and Zelong Hong and Jiahui Ge and Qianli Ma and Hang He and Yifan Zhou and Lingzi Guo and Lantao Mei and Jiachen Li and Hanwen Xing and Tianqi Zhao and Fengyuan Yu and Weihang Xiao and Yizheng Jiao and Jianheng Hou and Danyang Zhang and Pengcheng Xu and Boyang Zhong and Zehong Zhao and Gaoyun Fang and John Kitaoka and Yile Xu and Hua Xu and Kenton Blacutt and Tin Nguyen and Siyuan Song and Haoran Sun and Shaoyue Wen and Linyang He and Runming Wang and Yanzhi Wang and Mengyue Yang and Ziqiao Ma and Raphaël Millière and Freda Shi and Nuno Vasconcelos and Daniel Khashabi and Alan Yuille and Yilun Du and Ziming Liu and Bo Li and Dahua Lin and Ziwei Liu and Vikash Kumar and Yijiang Li and Lei Yang and Zhongang Cai and Hokin Deng},
  journal = {arXiv preprint arXiv:2602.20159},
  year = {2026}
}
"""