MedGRPO Team Claude Sonnet 4.5 committed on
Commit c137994 · 1 Parent(s): 15c8be4

Update README with comprehensive MedGRPO documentation


- Added detailed submission guide with expected results format
- Documented 8 tasks with metrics and descriptions
- Explained automatic evaluation pipeline integration
- Added LLM judge scoring details (R2, R4, R5, R7, R8)
- Included score normalization methodology
- Updated HuggingFace Space metadata (emoji, tags, description)
- Added test set statistics and task distribution
- Comprehensive citation and links section

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Files changed (1)
  1. README.md +160 -28
README.md CHANGED
@@ -1,48 +1,180 @@
  ---
  title: MedGRPO Leaderboard
- emoji: 🥇
- colorFrom: green
- colorTo: indigo
  sdk: gradio
  app_file: app.py
  pinned: true
  license: apache-2.0
- short_description: Leaderboard for Video-LLMs on the MedGRPO Benchmark
- sdk_version: 5.43.1
  tags:
  - leaderboard
  ---

- # Start the configuration

- Most of the variables to change for a default leaderboard are in `src/env.py` (replace the path for your leaderboard) and `src/about.py` (for tasks).

- Results files should have the following format and be stored as json files:
  ```json
- {
-     "config": {
-         "model_dtype": "torch.float16", # or torch.bfloat16 or 8bit or 4bit
-         "model_name": "path of the model on the hub: org/model",
-         "model_sha": "revision on the hub",
    },
-     "results": {
-         "task_name": {
-             "metric_name": score,
-         },
-         "task_name2": {
-             "metric_name": score,
-         }
-     }
  }
  ```

- Request files are created automatically by this tool.

- If you encounter problem on the space, don't hesitate to restart it to remove the create eval-queue, eval-queue-bk, eval-results and eval-results-bk created folder.

- # Code logic for more complex edits

- You'll find
- - the main table' columns names and properties in `src/display/utils.py`
- - the logic to read all results and request files, then convert them in dataframe lines, in `src/leaderboard/read_evals.py`, and `src/populate.py`
- - the logic to allow or filter submissions in `src/submission/submit.py` and `src/submission/check_validity.py`
 
  ---
  title: MedGRPO Leaderboard
+ emoji: 🏥
+ colorFrom: blue
+ colorTo: purple
  sdk: gradio
  app_file: app.py
  pinned: true
  license: apache-2.0
+ short_description: MedGRPO Benchmark Leaderboard - 8 medical video tasks
+ sdk_version: 5.50.0
  tags:
  - leaderboard
+ - medical
+ - video-understanding
+ - surgical-ai
  ---

+ # MedGRPO Leaderboard

+ Interactive leaderboard for evaluating Video-Language Models on the **MedGRPO benchmark** - 8 medical video understanding tasks across 8 surgical datasets.
+
+ 🏆 **Live Demo**: [huggingface.co/spaces/UIIAmerica/MedGRPO-Leaderboard](https://huggingface.co/spaces/UIIAmerica/MedGRPO-Leaderboard)
+
+ 📄 **Paper**: [arXiv:2512.06581](https://arxiv.org/abs/2512.06581)
+
+ ## Overview
+
+ This leaderboard provides a centralized platform for researchers to:
+ - **Submit** inference results on the MedGRPO test set
+ - **Automatically evaluate** across 8 diverse tasks
+ - **Compare** model performance on standardized metrics
+ - **Track** state-of-the-art progress in medical video understanding
+
+ ## Features
+
+ ### 🎯 8 Medical Video Tasks
+
+ | Task | Metric | Description |
+ |------|--------|-------------|
+ | **TAL** | mAP@0.5 | Temporal Action Localization - identify start/end times of surgical actions |
+ | **STG** | mIoU | Spatiotemporal Grounding - locate actions in space (bbox) and time |
+ | **Next Action** | Accuracy | Predict the next surgical step |
+ | **DVC** | LLM Judge | Dense Video Captioning - detailed segment descriptions |
+ | **VS** | LLM Judge | Video Summary - summarize entire surgical videos |
+ | **RC** | LLM Judge | Region Caption - describe regions indicated by bounding boxes |
+ | **Skill Assessment** | Accuracy | Evaluate surgical skill levels (JIGSAWS) |
49
+
50
+ ### βš™οΈ Automatic Evaluation
51
+
52
+ The leaderboard integrates directly with the MedGRPO evaluation pipeline:
53
+ - **Validation**: Checks results file format and sample count
54
+ - **Execution**: Runs `evaluate_all_pai.py` with dataset-agnostic grouping
55
+ - **Parsing**: Extracts task-specific metrics from evaluation output
56
+ - **Ranking**: Computes normalized average score across all tasks
57
+
58
+ ### πŸ“Š Test Set Statistics
59
+
60
+ - **Total samples**: 6,245
61
+ - **Source datasets**: 8 (AVOS, CholecT50, CholecTrack20, Cholec80_CVS, CoPESD, EgoSurgery, NurViD, JIGSAWS)
62
+ - **Video frames**: ~103,742
63
+
64
+ ## Submission Guide
65
+
66
+ ### 1. Run Inference
67
+
68
+ Run your model on the MedGRPO test set (6,245 samples) to generate predictions for all 8 tasks.
69
+
70
+ ### 2. Expected Results Format
71
+
72
+ Your results file should be a JSON with this structure:
73
 
 
74
  ```json
75
+ [
76
+ {
77
+ "question": "<video>\nQuestion text...",
78
+ "response": "Your model's answer",
79
+ "ground_truth": "Correct answer",
80
+ "qa_type": "tal",
81
+ "metadata": {
82
+ "video_id": "...",
83
+ "fps": "1.0",
84
+ ...
85
  },
86
+ "data_source": "AVOS",
87
+ ...
88
+ },
89
+ ...
90
+ ]
91
+ ```
92
+
93
+ **Valid qa_types**:
94
+ - `tal` - Temporal Action Localization
95
+ - `stg` - Spatiotemporal Grounding
96
+ - `next_action` - Next Action Prediction
97
+ - `dense_captioning` - Dense Video Captioning
98
+ - `video_summary` - Video Summary
99
+ - `region_caption` - Region Caption
100
+ - `skill_assessment` - Skill Assessment (JIGSAWS)
101
+ - `cvs_assessment` - CVS Assessment
102
+
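+ Before uploading, you can sanity-check the file locally. The snippet below is an optional, illustrative check (not part of the official pipeline) that only mirrors the format rules above; it assumes your file is named `results.json`.
+
+ ```python
+ import json
+ from collections import Counter
+
+ VALID_QA_TYPES = {
+     "tal", "stg", "next_action", "dense_captioning",
+     "video_summary", "region_caption", "skill_assessment", "cvs_assessment",
+ }
+ REQUIRED_KEYS = {"question", "response", "ground_truth", "qa_type", "metadata", "data_source"}
+
+ with open("results.json") as f:
+     samples = json.load(f)
+
+ assert isinstance(samples, list), "results file must be a JSON array"
+ assert len(samples) == 6245, f"expected 6,245 samples, got {len(samples)}"
+ for s in samples:
+     assert REQUIRED_KEYS <= s.keys(), f"missing keys: {REQUIRED_KEYS - s.keys()}"
+     assert s["qa_type"] in VALID_QA_TYPES, f"invalid qa_type: {s['qa_type']}"
+
+ print(Counter(s["qa_type"] for s in samples))  # per-task sample counts
+ ```
+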
+ ### 3. Upload to Leaderboard
+
+ 1. Visit the [leaderboard](https://huggingface.co/spaces/UIIAmerica/MedGRPO-Leaderboard)
+ 2. Go to the **Submit Results** tab
+ 3. Fill in:
+    - **Model Name** (e.g., "Qwen2.5-VL-7B-MedGRPO")
+    - **Organization** (e.g., "Your University")
+    - **Contact** (optional)
+ 4. Upload your results JSON file
+ 5. Click **Submit to Leaderboard**
+
+ The system will:
+ - Validate your file (format + sample count)
+ - Run automatic evaluation (~5-10 minutes)
+ - Extract metrics for all 8 tasks
+ - Add your model to the leaderboard
+
+ ## Evaluation Metrics
+
+ ### Task-Specific Metrics
+
+ | Task | Metric Extracted | Details |
+ |------|------------------|---------|
+ | TAL | `mAP@0.5` | Mean Average Precision at IoU=0.5 |
+ | STG | `mean_iou` | Mean Intersection over Union (spatial + temporal) |
+ | Next Action | `Weighted Average Accuracy` | Classification accuracy |
+ | DVC/VS/RC | `Average LLM Judge Score` | Average of R2, R4, R5, R7, R8 (1-5 scale) |
+ | Skill/CVS | `accuracy` | Classification accuracy |
+
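+ For reference, the IoU used in the TAL and STG metrics is the standard overlap-over-union ratio; the helper below is an illustrative definition for the temporal case only, not the benchmark's exact implementation.
+
+ ```python
+ def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
+     """IoU of two time segments given as (start, end) in seconds."""
+     inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
+     union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
+     return inter / union if union > 0 else 0.0
+
+ # A prediction counts as a true positive for mAP@0.5 when IoU >= 0.5.
+ print(temporal_iou((10.0, 20.0), (12.0, 22.0)))  # 0.666...
+ ```
+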
+ ### LLM Judge Details
+
+ For caption tasks (DVC, VS, RC), we use **GPT-4.1** or **Gemini-Pro** with rubric-based scoring:
+
+ **5 Key Aspects** (1-5 scale each):
+ - **R2**: Relevance & Medical Terminology
+ - **R4**: Actionable Surgical Actions
+ - **R5**: Comprehensive Detail Level
+ - **R7**: Anatomical & Instrument Precision
+ - **R8**: Clinical Context & Coherence
+
+ **Final score** = Average of R2, R4, R5, R7, R8
+
+ ### Score Normalization
+
+ To compute the **average score** fairly across tasks:
+ 1. **LLM Judge scores** (1-5 scale) are normalized: `(score - 1) / 4` → [0, 1]
+ 2. **Other metrics** (already 0-1) remain unchanged
+ 3. **Average** = mean of all 8 normalized task scores
+
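+ The normalization is simple to reproduce locally; here is a minimal sketch in which the task names and scores are placeholders:
+
+ ```python
+ def normalize_judge(score: float) -> float:
+     """Map an LLM-judge score on the 1-5 scale to [0, 1]."""
+     return (score - 1) / 4
+
+ # Placeholder per-task results: caption tasks carry the average of
+ # R2, R4, R5, R7, R8 (1-5 scale); all other metrics are already in [0, 1].
+ judge_tasks = {"dvc": 3.8, "vs": 3.5, "rc": 4.1}
+ other_tasks = {"tal": 0.42, "stg": 0.51, "next_action": 0.63,
+                "skill_assessment": 0.70, "cvs_assessment": 0.66}
+
+ normalized = {task: normalize_judge(s) for task, s in judge_tasks.items()}
+ normalized.update(other_tasks)
+ average = sum(normalized.values()) / len(normalized)  # mean of all 8 task scores
+ print(round(average, 4))
+ ```
+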
+ ## Links
+
+ - 📄 **Paper**: [https://arxiv.org/abs/2512.06581](https://arxiv.org/abs/2512.06581)
+ - 🌐 **Project**: [https://yuhaosu.github.io/MedGRPO/](https://yuhaosu.github.io/MedGRPO/)
+ - 💾 **Dataset**: [https://huggingface.co/datasets/UIIAmerica/MedGRPO](https://huggingface.co/datasets/UIIAmerica/MedGRPO)
+ - 💻 **GitHub**: [https://github.com/YuhaoSu/MedGRPO](https://github.com/YuhaoSu/MedGRPO)
+ - 🏆 **Leaderboard**: [https://huggingface.co/spaces/UIIAmerica/MedGRPO-Leaderboard](https://huggingface.co/spaces/UIIAmerica/MedGRPO-Leaderboard)
+
+ ## Citation
+
+ ```bibtex
+ @article{su2024medgrpo,
+   title={MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding},
+   author={Su, Yuhao and Choudhuri, Anwesa and Gao, Zhongpai and Planche, Benjamin and Nguyen, Van Nguyen and Zheng, Meng and Shen, Yuhan and Innanje, Arun and Chen, Terrence and Elhamifar, Ehsan and Wu, Ziyan},
+   journal={arXiv preprint arXiv:2512.06581},
+   year={2025}
  }
  ```

+ ## License

+ - **Leaderboard Code**: Apache 2.0
+ - **Dataset**: CC BY-NC-SA 4.0 (Non-commercial, Share-alike)

+ ## Contact

+ For questions or issues:
+ - Open an issue on [GitHub](https://github.com/YuhaoSu/MedGRPO)
+ - Visit the [project page](https://yuhaosu.github.io/MedGRPO/)