---
title: MedVidBench Leaderboard
emoji: 🔥
colorFrom: blue
colorTo: purple
sdk: gradio
app_file: app.py
pinned: true
license: apache-2.0
short_description: MedVidBench Benchmark Leaderboard - 8 medical video tasks
sdk_version: 5.50.0
tags:
- leaderboard
- medical
- video-understanding
- surgical-ai
---
MedVidBench Leaderboard
Interactive leaderboard for evaluating Video-Language Models on the MedVidBench benchmark - 8 medical video understanding tasks across 8 surgical datasets.
Live Demo: huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard
Paper: arXiv:2512.06581
Overview
This leaderboard provides a centralized platform for researchers to:
- Submit inference results on the MedVidBench test set
- Automatically evaluate across 8 diverse tasks
- Compare model performance on standardized metrics
- Track state-of-the-art progress in medical video understanding
Features
8 Medical Video Tasks
| Task | Metric | Description |
|---|---|---|
| TAL | mAP@0.5 | Temporal Action Localization - identify start/end times of surgical actions |
| STG | mIoU | Spatiotemporal Grounding - locate actions in space (bbox) and time |
| Next Action | Accuracy | Predict the next surgical step |
| DVC | LLM Judge | Dense Video Captioning - detailed segment descriptions |
| VS | LLM Judge | Video Summary - summarize entire surgical videos |
| RC | LLM Judge | Region Caption - describe regions indicated by bounding boxes |
| Skill Assessment | Accuracy | Evaluate surgical skill levels (JIGSAWS) |
| CVS Assessment | Accuracy | Critical View of Safety scoring (Cholec80_CVS) |
Automatic Evaluation
The leaderboard integrates directly with the MedVidBench evaluation pipeline:
- Validation: Checks results file format and sample count
- Execution: Runs `evaluate_all_pai.py` with dataset-agnostic grouping
- Parsing: Extracts task-specific metrics from evaluation output
- Ranking: Computes normalized average score across all tasks
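As a rough illustration of how these four steps fit together, here is a minimal Python sketch. Only the script name and the `--skip-llm-judge` flag come from this README; the `--results` flag and the stdout format are assumptions, and the real pipeline may differ.

```python
import json
import subprocess

def evaluate_submission(results_path: str, expected_samples: int = 6245) -> dict:
    """Validate a results file, run the evaluation script, and parse its metrics.

    Hedged sketch: the --results flag and the "name: value" stdout format are assumptions.
    """
    # Validation: the file must parse and contain the expected number of samples.
    with open(results_path) as f:
        results = json.load(f)
    samples = list(results.values()) if isinstance(results, dict) else results
    if len(samples) != expected_samples:
        raise ValueError(f"Expected {expected_samples} samples, got {len(samples)}")

    # Execution: run the evaluation script as a subprocess.
    proc = subprocess.run(
        ["python", "evaluate_all_pai.py", "--results", results_path, "--skip-llm-judge"],
        capture_output=True, text=True, check=True,
    )

    # Parsing: pull "name: value" pairs out of the script's output.
    metrics = {}
    for line in proc.stdout.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            try:
                metrics[name.strip()] = float(value.strip())
            except ValueError:
                continue

    # Ranking then uses the normalized average of these task scores (see Score Normalization).
    return metrics
```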
Test Set Statistics
- Total samples: 6,245
- Source datasets: 8 (AVOS, CholecT50, CholecTrack20, Cholec80_CVS, CoPESD, EgoSurgery, NurViD, JIGSAWS)
- Video frames: ~103,742
Submission Guide
1. Run Inference
Run your model on the MedVidBench test set (6,245 samples) to generate predictions for all 8 tasks.
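For illustration only, an inference loop that writes predictions in the prediction-only format described below might look like this; `test_set` and `model.generate` are placeholders for your own data loader and model, not part of MedVidBench.

```python
import json

def run_inference(test_set, model, output_path="predictions.json"):
    """Generate a prediction-only results file; test_set and model are placeholders."""
    predictions = []
    for sample in test_set:  # iterate in the original test-set order
        answer = model.generate(sample["question"], sample["video_path"])
        predictions.append({
            "id": sample["id"],           # "{video_id}&&{start_frame}&&{end_frame}&&{fps}"
            "qa_type": sample["qa_type"],
            "prediction": answer,
        })
    # Keep the original order: prediction-only submissions are matched to
    # ground truth by array index.
    with open(output_path, "w") as f:
        json.dump(predictions, f, indent=2)
```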
2. Expected Results Format
The leaderboard supports two formats for submission:
Format 1: Full Format (with Ground Truth)
```json
[
  {
    "question": "<video>\nQuestion text...",
    "response": "Your model's answer",
    "ground_truth": "Correct answer",
    "qa_type": "tal",
    "metadata": {
      "video_id": "...",
      "fps": "1.0",
      ...
    },
    "data_source": "AVOS",
    ...
  },
  ...
]
```
Format 2: Prediction-Only Format
```json
[
  {
    "id": "video_id&&start_frame&&end_frame&&fps",
    "qa_type": "tal",
    "prediction": "Your model's answer"
  },
  ...
]
```
Example:
```json
[
  {
    "id": "kcOqlifSukA&&22425&&25124&&1.0",
    "qa_type": "tal",
    "prediction": "22.0-78.0, 89.0-94.0 seconds."
  },
  {
    "id": "VsKw5d-4rq8&&13561&&16184&&1.0",
    "qa_type": "stg",
    "prediction": "[10, 20, 30, 40] 5.0-10.0 seconds."
  }
]
```
Key differences:
- Format 1: Uses `response` + `ground_truth` fields with full metadata (dictionary format indexed by string keys "0", "1", etc.)
- Format 2: Uses `id` + `prediction` fields only (list format; ground truth is merged automatically by index position, as sketched below)
- The `id` field format `{video_id}&&{start_frame}&&{end_frame}&&{fps}` is included for reference, but matching is done by array index
- Important: Predictions in Format 2 must be in the same order as the test set
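A minimal sketch of what index-based merging amounts to, assuming the hidden ground-truth file is a list in the same order as the test set (the leaderboard's actual merge code may differ):

```python
def merge_predictions_with_gt(predictions: list, ground_truth: list) -> list:
    """Attach each Format 2 prediction to its ground-truth entry by array index."""
    if len(predictions) != len(ground_truth):
        raise ValueError("Prediction count must match the test set exactly")
    merged = []
    for pred, gt in zip(predictions, ground_truth):
        # The id field is informational; split it only for a sanity check.
        video_id, start_frame, end_frame, fps = pred["id"].split("&&")
        gt_video = gt.get("metadata", {}).get("video_id")
        if gt_video is not None and gt_video != video_id:
            raise ValueError(f"Order mismatch: expected {gt_video}, got {video_id}")
        merged.append({**gt, "response": pred["prediction"], "qa_type": pred["qa_type"]})
    return merged
```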
Valid qa_types:
- `tal` - Temporal Action Localization
- `stg` - Spatiotemporal Grounding
- `next_action` - Next Action Prediction
- `dense_captioning` - Dense Video Captioning
- `video_summary` - Video Summary
- `region_caption` - Region Caption
- `skill_assessment` - Skill Assessment (JIGSAWS)
- `cvs_assessment` - CVS Assessment
3. Upload to Leaderboard
- Visit the leaderboard
- Go to the Submit Results tab
- Fill in:
- Model Name (e.g., "Qwen2.5-VL-7B-MedVidBench")
- Organization (e.g., "Your University")
- Contact (optional)
- Upload your results JSON file
- Click Submit to Leaderboard
The system will:
- Validate your file (format + sample count)
- Run automatic evaluation (~2-5 minutes with `--skip-llm-judge`, ~10-20 minutes with the LLM judge)
- Extract metrics for all 8 tasks
- Add your model to the leaderboard
Note: By default, DVC/VS/RC are evaluated with `--skip-llm-judge` for faster results (caption metrics will be 0.0). You can run LLM judge evaluation later using the button on the leaderboard page.
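Before uploading, a quick local pre-check can catch most validation failures. The sketch below mirrors the validation step for the prediction-only format; the sample count, field names, and qa_type values come from this README, while the function itself is just an example.

```python
import json

VALID_QA_TYPES = {
    "tal", "stg", "next_action", "dense_captioning",
    "video_summary", "region_caption", "skill_assessment", "cvs_assessment",
}
EXPECTED_SAMPLES = 6245

def check_submission(path: str) -> None:
    """Lightweight local pre-check for a Format 2 (prediction-only) file."""
    with open(path) as f:
        data = json.load(f)
    assert isinstance(data, list), "Format 2 must be a JSON array"
    assert len(data) == EXPECTED_SAMPLES, f"expected {EXPECTED_SAMPLES} samples, got {len(data)}"
    for i, item in enumerate(data):
        missing = {"id", "qa_type", "prediction"} - item.keys()
        assert not missing, f"sample {i} is missing fields: {missing}"
        assert item["qa_type"] in VALID_QA_TYPES, f"sample {i} has invalid qa_type {item['qa_type']!r}"

check_submission("predictions.json")
```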
4. Run LLM Judge Evaluation (Optional)
If your submission was evaluated with `--skip-llm-judge` (DVC_llm, VS_llm, RC_llm are all 0.0), you can compute these metrics later:
- Go to the Leaderboard tab
- Scroll to the "Run LLM Judge Evaluation" section
- Enter your model name (exact match)
- Click "Start Evaluation"
The system will:
- Start evaluation in the background (runs independently)
- Re-run evaluation for DVC/VS/RC tasks with LLM judge (GPT-4.1/Gemini)
- Automatically update your leaderboard entry when complete
- Preserve all other metrics (TAL, STG, NAP, SA, CVS)
Background Execution:
- You can close the browser after starting - evaluation continues running
- Come back later and click "Check Status" to see progress
- The leaderboard will be automatically updated when complete
Time: ~10-20 minutes depending on API rate limits
Availability: Only available when ALL three caption metrics are 0.0
How to Check Status:
- Enter the same model name
- Click "Check Status" button
- View recent logs and progress
- Or simply refresh the leaderboard to see if metrics are updated
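For the curious, background execution of this kind is typically a detached subprocess plus a small status file. The sketch below is an illustrative pattern, not the Space's actual implementation; the status-file layout and the `--tasks` flag are made up for this example.

```python
import json
import subprocess
from pathlib import Path

STATUS_DIR = Path("llm_judge_jobs")  # hypothetical per-model status directory

def start_llm_judge(model_name: str, results_path: str) -> None:
    """Launch LLM-judge evaluation in the background and record its PID."""
    STATUS_DIR.mkdir(exist_ok=True)
    log_path = STATUS_DIR / f"{model_name}.log"
    with open(log_path, "w") as log:
        proc = subprocess.Popen(
            ["python", "evaluate_all_pai.py", "--results", results_path,
             "--tasks", "dense_captioning,video_summary,region_caption"],
            stdout=log, stderr=subprocess.STDOUT,
        )
    (STATUS_DIR / f"{model_name}.json").write_text(
        json.dumps({"pid": proc.pid, "status": "running"})
    )

def check_status(model_name: str, tail_lines: int = 20) -> str:
    """Return the recorded status plus the last few log lines for a model."""
    status = (STATUS_DIR / f"{model_name}.json").read_text()
    log = (STATUS_DIR / f"{model_name}.log").read_text().splitlines()[-tail_lines:]
    return status + "\n" + "\n".join(log)
```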
Evaluation Metrics
Task-Specific Metrics
| Task | Metric Extracted | Details |
|---|---|---|
| TAL | `mAP@0.5` | Mean Average Precision at IoU=0.5 |
| STG | `mean_iou` | Mean Intersection over Union (spatial + temporal) |
| Next Action | Weighted Average Accuracy | Classification accuracy |
| DVC/VS/RC | Average LLM Judge Score | Average of R2, R4, R5, R7, R8 (1-5 scale) |
| Skill/CVS | `accuracy` | Classification accuracy |
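The overlap-based metrics (mAP@0.5 for TAL, mean IoU for STG) are built on segment IoU. The snippet below shows plain 1-D temporal IoU as an illustration; the official evaluation additionally handles spatial boxes for STG and multi-segment matching for mAP, so treat this as a building block rather than the full metric.

```python
def temporal_iou(pred: tuple, gt: tuple) -> float:
    """IoU between two (start, end) segments given in seconds."""
    inter_start = max(pred[0], gt[0])
    inter_end = min(pred[1], gt[1])
    intersection = max(0.0, inter_end - inter_start)
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0

# A predicted segment counts as a hit for mAP@0.5 when its IoU with a
# ground-truth segment is at least 0.5.
print(temporal_iou((22.0, 78.0), (20.0, 75.0)))  # ~0.914
```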
LLM Judge Details
For caption tasks (DVC, VS, RC), we use GPT-4.1 or Gemini-Pro with rubric-based scoring:
5 Key Aspects (1-5 scale each):
- R2: Relevance & Medical Terminology
- R4: Actionable Surgical Actions
- R5: Comprehensive Detail Level
- R7: Anatomical & Instrument Precision
- R8: Clinical Context & Coherence
Final score = Average of R2, R4, R5, R7, R8
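In code, the per-caption judge score is simply the mean of these five aspects; a small illustration (the rubric keys follow the list above, while the judge's actual output schema may differ):

```python
def caption_judge_score(rubric: dict) -> float:
    """Mean of the five rubric aspects, each on a 1-5 scale."""
    aspects = ["R2", "R4", "R5", "R7", "R8"]
    return sum(rubric[a] for a in aspects) / len(aspects)

print(caption_judge_score({"R2": 4, "R4": 3, "R5": 5, "R7": 4, "R8": 4}))  # 4.0
```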
Score Normalization
To compute the average score fairly across tasks:
- LLM Judge scores (1-5 scale) are normalized: `(score - 1) / 4` → [0, 1]
- Other metrics (already in the 0-1 range) remain unchanged
- Average = mean of all 8 normalized task scores
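Putting the normalization together, a minimal sketch of the final average, with made-up per-task scores purely for illustration (the task keys and values are not real results):

```python
LLM_JUDGE_TASKS = {"DVC", "VS", "RC"}  # scored on a 1-5 scale by the LLM judge

def normalized_average(task_scores: dict) -> float:
    """Map 1-5 judge scores to [0, 1] and average all eight task scores."""
    normalized = []
    for task, score in task_scores.items():
        if task in LLM_JUDGE_TASKS:
            normalized.append((score - 1.0) / 4.0)  # 1-5 scale -> [0, 1]
        else:
            normalized.append(score)                # already in [0, 1]
    return sum(normalized) / len(normalized)

# Made-up scores, for illustration only.
scores = {"TAL": 0.31, "STG": 0.28, "NextAction": 0.62, "DVC": 3.4,
          "VS": 3.1, "RC": 3.6, "Skill": 0.55, "CVS": 0.48}
print(round(normalized_average(scores), 3))  # 0.502
```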
Links
- Paper: https://arxiv.org/abs/2512.06581
- Project: https://yuhaosu.github.io/MedGRPO/
- Dataset: https://huggingface.co/datasets/UIIAmerica/MedVidBench
- GitHub: https://github.com/YuhaoSu/MedGRPO
- Leaderboard: https://huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard
Citation
```bibtex
@article{su2024medgrpo,
  title={MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding},
  author={Su, Yuhao and Choudhuri, Anwesa and Gao, Zhongpai and Planche, Benjamin and Nguyen, Van Nguyen and Zheng, Meng and Shen, Yuhan and Innanje, Arun and Chen, Terrence and Elhamifar, Ehsan and Wu, Ziyan},
  journal={arXiv preprint arXiv:2512.06581},
  year={2025}
}
```
License
- Leaderboard Code: Apache 2.0
- Dataset: CC BY-NC-SA 4.0 (Non-commercial, Share-alike)
Contact
For questions or issues:
- Open an issue on GitHub
- Visit the project page