---
title: MedVidBench Leaderboard
emoji: 🏥
colorFrom: blue
colorTo: purple
sdk: gradio
app_file: app.py
pinned: true
license: apache-2.0
short_description: MedVidBench Benchmark Leaderboard - 8 medical video tasks
sdk_version: 5.50.0
tags:
  - leaderboard
  - medical
  - video-understanding
  - surgical-ai
---

# MedVidBench Leaderboard

Interactive leaderboard for evaluating Video-Language Models on the **MedVidBench benchmark**: 8 medical video understanding tasks across 8 surgical datasets.

🏆 **Live Demo**: [huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard](https://huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard)

📄 **Paper**: [arXiv:2512.06581](https://arxiv.org/abs/2512.06581)

## Overview

This leaderboard provides a centralized platform for researchers to:

- **Submit** inference results on the MedVidBench test set
- **Automatically evaluate** across 8 diverse tasks
- **Compare** model performance on standardized metrics
- **Track** state-of-the-art progress in medical video understanding

## Features

### 🎯 8 Medical Video Tasks

| Task | Metric | Description |
|------|--------|-------------|
| **TAL** | mAP@0.5 | Temporal Action Localization - identify start/end times of surgical actions |
| **STG** | mIoU | Spatiotemporal Grounding - locate actions in space (bounding box) and time |
| **Next Action** | Accuracy | Predict the next surgical step |
| **DVC** | LLM Judge | Dense Video Captioning - detailed segment descriptions |
| **VS** | LLM Judge | Video Summary - summarize entire surgical videos |
| **RC** | LLM Judge | Region Caption - describe regions indicated by bounding boxes |
| **Skill Assessment** | Accuracy | Evaluate surgical skill levels (JIGSAWS) |
| **CVS Assessment** | Accuracy | Critical View of Safety scoring |

### ⚙️ Automatic Evaluation

The leaderboard integrates directly with the MedVidBench evaluation pipeline:

- **Validation**: Checks the results file format and sample count
- **Execution**: Runs `evaluate_all_pai.py` with dataset-agnostic grouping
- **Parsing**: Extracts task-specific metrics from the evaluation output
- **Ranking**: Computes a normalized average score across all tasks

### 📊 Test Set Statistics

- **Total samples**: 6,245
- **Source datasets**: 8 (AVOS, CholecT50, CholecTrack20, Cholec80_CVS, CoPESD, EgoSurgery, NurViD, JIGSAWS)
- **Video frames**: ~103,742

## Submission Guide

### 1. Run Inference

Run your model on the MedVidBench test set (6,245 samples) to generate predictions for all 8 tasks.

### 2. Expected Results Format

The leaderboard supports **two formats** for submission:

#### Format 1: Full Format (with Ground Truth)

```json
[
  {
    "question": "