MedGRPO Team Claude Sonnet 4.5 committed
Commit 15c8be4 · Parent: 73ea6a1

Integrate MedGRPO evaluation pipeline with leaderboard


Complete rewrite to integrate with the actual MedGRPO benchmark infrastructure.

Major Changes:
- **Evaluation Integration**: Calls /root/code/Qwen2.5-VL/my_eval/evaluate_all_pai.py for automatic metric computation
- **8 MedGRPO Tasks**: TAL, STG, Next Action, DVC, VS, RC, Skill Assessment, CVS Assessment
- **VLLM Results Format**: Validates uploaded results JSON (6,245 samples from test set)
- **Automatic Metrics**: Parses evaluation output to extract mAP@0.5, mIoU, Accuracy, LLM Judge scores
- **Leaderboard Storage**: JSON-based persistence with ranking by average score
- **Comprehensive UI**: 4 tabs (Leaderboard, Submit, Tasks & Metrics, About)

Key Features:
- Validates results file format and sample count
- Runs evaluate_all_pai.py with --grouping overall for dataset-agnostic metrics
- Extracts task-specific metrics from evaluation output
- Normalizes scores (LLM judge 1-5 → 0-1, others already 0-1) for fair averaging; see the sketch after this list
- Prevents duplicate model submissions
- Saves evaluation outputs for debugging
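
A minimal sketch of the normalization step (mirrors the logic in `submit_model` in app.py below; the helper names `normalize_score` and `average_score` are illustrative):

```python
# Sketch: normalize per-task scores onto [0, 1] before averaging.
# LLM-judge tasks (dvc, vs, rc) are scored 1-5; all other metrics are already 0-1.
LLM_JUDGE_TASKS = {"dvc", "vs", "rc"}

def normalize_score(task: str, score: float) -> float:
    """Map a raw task score onto [0, 1] so tasks average fairly."""
    if task in LLM_JUDGE_TASKS:
        return (score - 1) / 4  # 1-5 -> 0-1
    return score  # already 0-1

def average_score(metrics: dict) -> float:
    """Average the normalized scores; this is the leaderboard ranking key."""
    scores = [normalize_score(task, s) for task, s in metrics.items()]
    return sum(scores) / len(scores) if scores else 0.0
```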

File Format:
- Input: VLLM inference results JSON (from my_vllm_infer pipeline)
- Required fields: question, response, qa_type, ground_truth, metadata, data_source
- Valid qa_types: tal, stg, next_action, dense_captioning, video_summary, region_caption, skill_assessment, cvs_assessment
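
A minimal example record (field values are illustrative, not taken from the dataset):

```json
{
  "question": "<video>\nWhen does the dissection occur?",
  "response": "The dissection occurs from 12.0s to 45.5s.",
  "ground_truth": "12.0 - 46.0 seconds",
  "qa_type": "tal",
  "metadata": {"video_id": "AVOS_0001", "fps": "1.0"},
  "data_source": "AVOS"
}
```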

Technical Details:
- Subprocess call to evaluate_all_pai.py with 10-minute timeout
- Output parsing to extract metrics from evaluation stdout
- Per-submission result directories in results/
- Leaderboard persistence in leaderboard.json
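
A condensed sketch of the evaluation call (mirrors `run_evaluation` in app.py; the per-submission results path is illustrative):

```python
import subprocess
import sys

# Per-submission copy of the uploaded results (illustrative path).
results_copy = "results/My_Model/results.json"

cmd = [
    sys.executable,
    "/root/code/Qwen2.5-VL/my_eval/evaluate_all_pai.py",
    results_copy,
    "--grouping", "overall",  # dataset-agnostic metrics
]
# 10-minute timeout; stdout is parsed for mAP@0.5, mIoU, Accuracy, LLM Judge scores.
result = subprocess.run(cmd, capture_output=True, text=True, timeout=600)
if result.returncode != 0:
    raise RuntimeError(f"Evaluation failed: {result.stderr}")
print(result.stdout)
```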

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Files changed (1)
  1. app.py +634 -191
app.py CHANGED
@@ -1,204 +1,647 @@
 import gradio as gr
-from gradio_leaderboard import Leaderboard, ColumnFilter, SelectColumns
 import pandas as pd
-from apscheduler.schedulers.background import BackgroundScheduler
-from huggingface_hub import snapshot_download
-
-from src.about import (
-    CITATION_BUTTON_LABEL,
-    CITATION_BUTTON_TEXT,
-    EVALUATION_QUEUE_TEXT,
-    INTRODUCTION_TEXT,
-    LLM_BENCHMARKS_TEXT,
-    TITLE,
-)
-from src.display.css_html_js import custom_css
-from src.display.utils import (
-    BENCHMARK_COLS,
-    COLS,
-    EVAL_COLS,
-    EVAL_TYPES,
-    AutoEvalColumn,
-    ModelType,
-    fields,
-    WeightType,
-    Precision
-)
-from src.envs import API, EVAL_REQUESTS_PATH, EVAL_RESULTS_PATH, QUEUE_REPO, REPO_ID, RESULTS_REPO, TOKEN
-from src.populate import get_evaluation_queue_df, get_leaderboard_df
-from src.submission.submit import add_new_eval
-
-
-def restart_space():
-    API.restart_space(repo_id=REPO_ID)
-
-### Space initialisation
-try:
-    print(EVAL_REQUESTS_PATH)
-    snapshot_download(
-        repo_id=QUEUE_REPO, local_dir=EVAL_REQUESTS_PATH, repo_type="dataset", tqdm_class=None, etag_timeout=30, token=TOKEN
-    )
-except Exception:
-    restart_space()
-try:
-    print(EVAL_RESULTS_PATH)
-    snapshot_download(
-        repo_id=RESULTS_REPO, local_dir=EVAL_RESULTS_PATH, repo_type="dataset", tqdm_class=None, etag_timeout=30, token=TOKEN
-    )
-except Exception:
-    restart_space()
-
-
-LEADERBOARD_DF = get_leaderboard_df(EVAL_RESULTS_PATH, EVAL_REQUESTS_PATH, COLS, BENCHMARK_COLS)
-
-(
-    finished_eval_queue_df,
-    running_eval_queue_df,
-    pending_eval_queue_df,
-) = get_evaluation_queue_df(EVAL_REQUESTS_PATH, EVAL_COLS)
-
-def init_leaderboard(dataframe):
-    if dataframe is None or dataframe.empty:
-        raise ValueError("Leaderboard DataFrame is empty or None.")
-    return Leaderboard(
-        value=dataframe,
-        datatype=[c.type for c in fields(AutoEvalColumn)],
-        select_columns=SelectColumns(
-            default_selection=[c.name for c in fields(AutoEvalColumn) if c.displayed_by_default],
-            cant_deselect=[c.name for c in fields(AutoEvalColumn) if c.never_hidden],
-            label="Select Columns to Display:",
-        ),
-        search_columns=[AutoEvalColumn.model.name, AutoEvalColumn.license.name],
-        hide_columns=[c.name for c in fields(AutoEvalColumn) if c.hidden],
-        filter_columns=[
-            ColumnFilter(AutoEvalColumn.model_type.name, type="checkboxgroup", label="Model types"),
-            ColumnFilter(AutoEvalColumn.precision.name, type="checkboxgroup", label="Precision"),
-            ColumnFilter(
-                AutoEvalColumn.params.name,
-                type="slider",
-                min=0.01,
-                max=150,
-                label="Select the number of parameters (B)",
-            ),
-            ColumnFilter(
-                AutoEvalColumn.still_on_hub.name, type="boolean", label="Deleted/incomplete", default=True
-            ),
-        ],
-        bool_checkboxgroup_label="Hide models",
-        interactive=False,
-    )
-
-
-demo = gr.Blocks(css=custom_css)
-with demo:
-    gr.HTML(TITLE)
-    gr.Markdown(INTRODUCTION_TEXT, elem_classes="markdown-text")
-
-    with gr.Tabs(elem_classes="tab-buttons") as tabs:
-        with gr.TabItem("🏅 LLM Benchmark", elem_id="llm-benchmark-tab-table", id=0):
-            leaderboard = init_leaderboard(LEADERBOARD_DF)
-
-        with gr.TabItem("📝 About", elem_id="llm-benchmark-tab-table", id=2):
-            gr.Markdown(LLM_BENCHMARKS_TEXT, elem_classes="markdown-text")
-
-        with gr.TabItem("🚀 Submit here! ", elem_id="llm-benchmark-tab-table", id=3):
-            with gr.Column():
-                with gr.Row():
-                    gr.Markdown(EVALUATION_QUEUE_TEXT, elem_classes="markdown-text")

-                with gr.Column():
-                    with gr.Accordion(
-                        f"✅ Finished Evaluations ({len(finished_eval_queue_df)})",
-                        open=False,
-                    ):
-                        with gr.Row():
-                            finished_eval_table = gr.components.Dataframe(
-                                value=finished_eval_queue_df,
-                                headers=EVAL_COLS,
-                                datatype=EVAL_TYPES,
-                                row_count=5,
-                            )
-                    with gr.Accordion(
-                        f"🔄 Running Evaluation Queue ({len(running_eval_queue_df)})",
-                        open=False,
-                    ):
-                        with gr.Row():
-                            running_eval_table = gr.components.Dataframe(
-                                value=running_eval_queue_df,
-                                headers=EVAL_COLS,
-                                datatype=EVAL_TYPES,
-                                row_count=5,
-                            )
-
-                    with gr.Accordion(
-                        f"⏳ Pending Evaluation Queue ({len(pending_eval_queue_df)})",
-                        open=False,
-                    ):
-                        with gr.Row():
-                            pending_eval_table = gr.components.Dataframe(
-                                value=pending_eval_queue_df,
-                                headers=EVAL_COLS,
-                                datatype=EVAL_TYPES,
-                                row_count=5,
-                            )
-            with gr.Row():
-                gr.Markdown("# ✉️✨ Submit your model here!", elem_classes="markdown-text")

             with gr.Row():
                 with gr.Column():
-                    model_name_textbox = gr.Textbox(label="Model name")
-                    revision_name_textbox = gr.Textbox(label="Revision commit", placeholder="main")
-                    model_type = gr.Dropdown(
-                        choices=[t.to_str(" : ") for t in ModelType if t != ModelType.Unknown],
-                        label="Model type",
-                        multiselect=False,
-                        value=None,
-                        interactive=True,
                     )

-                with gr.Column():
-                    precision = gr.Dropdown(
-                        choices=[i.value.name for i in Precision if i != Precision.Unknown],
-                        label="Precision",
-                        multiselect=False,
-                        value="float16",
-                        interactive=True,
                     )
-                    weight_type = gr.Dropdown(
-                        choices=[i.value.name for i in WeightType],
-                        label="Weights type",
-                        multiselect=False,
-                        value="Original",
-                        interactive=True,
                     )
-                    base_model_name_textbox = gr.Textbox(label="Base model (for delta or adapter weights)")
-
-            submit_button = gr.Button("Submit Eval")
-            submission_result = gr.Markdown()
-            submit_button.click(
-                add_new_eval,
-                [
-                    model_name_textbox,
-                    base_model_name_textbox,
-                    revision_name_textbox,
-                    precision,
-                    weight_type,
-                    model_type,
-                ],
-                submission_result,
-            )

-    with gr.Row():
-        with gr.Accordion("📙 Citation", open=False):
-            citation_button = gr.Textbox(
-                value=CITATION_BUTTON_TEXT,
-                label=CITATION_BUTTON_LABEL,
-                lines=20,
-                elem_id="citation-button",
-                show_copy_button=True,
             )

-scheduler = BackgroundScheduler()
-scheduler.add_job(restart_space, "interval", seconds=1800)
-scheduler.start()
-demo.queue(default_concurrency_limit=40).launch(share=True, server_name="0.0.0.0")
+"""
+MedGRPO Leaderboard - Interactive leaderboard for evaluating Video-Language Models
+on the MedGRPO benchmark across 8 medical video understanding tasks.
+"""
+
 import gradio as gr
 import pandas as pd
+import json
+import os
+import shutil
+import subprocess
+import sys
+from datetime import datetime
+from pathlib import Path
+from typing import Dict, List, Tuple, Optional
+from collections import defaultdict

+# Configuration
+SUBMISSIONS_DIR = Path("submissions")
+RESULTS_DIR = Path("results")
+LEADERBOARD_FILE = Path("leaderboard.json")
+EVAL_SCRIPT = Path("/root/code/Qwen2.5-VL/my_eval/evaluate_all_pai.py")
+
+# Ensure directories exist
+SUBMISSIONS_DIR.mkdir(exist_ok=True)
+RESULTS_DIR.mkdir(exist_ok=True)
+
+# MedGRPO Task Definitions (8 tasks)
+TASKS = {
+    "tal": {
+        "name": "Temporal Action Localization",
+        "metric": "mAP@0.5",
+        "higher_better": True,
+        "description": "Identify start/end times of surgical actions"
+    },
+    "stg": {
+        "name": "Spatiotemporal Grounding",
+        "metric": "mIoU",
+        "higher_better": True,
+        "description": "Locate actions in both space (bbox) and time"
+    },
+    "next_action": {
+        "name": "Next Action Prediction",
+        "metric": "Accuracy",
+        "higher_better": True,
+        "description": "Predict the next surgical step"
+    },
+    "dvc": {
+        "name": "Dense Video Captioning",
+        "metric": "LLM Judge (Avg)",
+        "higher_better": True,
+        "description": "Generate detailed segment descriptions"
+    },
+    "vs": {
+        "name": "Video Summary",
+        "metric": "LLM Judge (Avg)",
+        "higher_better": True,
+        "description": "Summarize entire surgical videos"
+    },
+    "rc": {
+        "name": "Region Caption",
+        "metric": "LLM Judge (Avg)",
+        "higher_better": True,
+        "description": "Describe regions indicated by bounding boxes"
+    },
+    "skill_assessment": {
+        "name": "Skill Assessment",
+        "metric": "Accuracy",
+        "higher_better": True,
+        "description": "Evaluate surgical skill levels (JIGSAWS)"
+    },
+    "cvs_assessment": {
+        "name": "CVS Assessment",
+        "metric": "Accuracy",
+        "higher_better": True,
+        "description": "Clinical variable scoring"
+    },
+}
+
+# Test set statistics
+TEST_SET_STATS = {
+    "total_samples": 6245,
+    "datasets": ["AVOS", "CholecT50", "CholecTrack20", "Cholec80_CVS", "CoPESD", "EgoSurgery", "NurViD", "JIGSAWS"],
+    "video_frames": 103742,
+}
+
+
+def load_leaderboard() -> pd.DataFrame:
+    """Load existing leaderboard from JSON file."""
+    if LEADERBOARD_FILE.exists():
+        with open(LEADERBOARD_FILE, 'r') as f:
+            data = json.load(f)
+        if data:
+            df = pd.DataFrame(data)
+            # Sort by average score descending
+            if 'average' in df.columns:
+                df = df.sort_values('average', ascending=False).reset_index(drop=True)
+            return df
+
+    # Return empty dataframe with correct structure
+    columns = ["rank", "model_name", "organization", "average"] + list(TASKS.keys()) + ["date", "contact"]
+    return pd.DataFrame(columns=columns)
+
+
+def save_leaderboard(df: pd.DataFrame):
+    """Save leaderboard to JSON file."""
+    # Add rank column
+    df['rank'] = range(1, len(df) + 1)
+
+    # Save to JSON
+    with open(LEADERBOARD_FILE, 'w') as f:
+        json.dump(df.to_dict('records'), f, indent=2)
+
+
+def validate_results_file(file_path: str) -> Tuple[bool, str]:
+    """
+    Validate that uploaded file is a valid VLLM inference results JSON.
+
+    Expected format (from VLLM inference):
+    [
+        {
+            "question": "...",
+            "response": "...",
+            "ground_truth": "...",
+            "qa_type": "tal/stg/next_action/dvc/vs/rc/skill_assessment/cvs_assessment",
+            "metadata": {"video_id": "...", "fps": "...", ...},
+            "data_source": "AVOS/CholecT50/...",
+            ...
+        },
+        ...
+    ]
+    """
+    try:
+        with open(file_path, 'r') as f:
+            data = json.load(f)
+
+        # Handle both list and dict formats
+        if isinstance(data, dict):
+            records = list(data.values())
+        elif isinstance(data, list):
+            records = data
+        else:
+            return False, f"Invalid format: expected list or dict, got {type(data)}"
+
+        if len(records) == 0:
+            return False, "Empty results file"
+
+        # Check first record has required fields
+        sample = records[0]
+        required_fields = ["question", "response", "qa_type"]
+        missing = [f for f in required_fields if f not in sample]
+        if missing:
+            return False, f"Missing required fields: {missing}"
+
+        # Check qa_type is valid
+        valid_qa_types = ["tal", "stg", "next_action", "dense_captioning", "video_summary", "region_caption",
+                          "skill_assessment", "cvs_assessment"]
+        qa_type = sample.get("qa_type", "")
+        if not any(valid in qa_type for valid in valid_qa_types):
+            return False, f"Invalid qa_type: {qa_type}"
+
+        # Check if file has reasonable number of samples (should be close to 6245)
+        if len(records) < 5000:
+            return False, f"Too few samples ({len(records)}). Expected ~6245 samples for full test set."
+
+        return True, f"✓ Valid results file with {len(records)} samples"
+
+    except json.JSONDecodeError as e:
+        return False, f"Invalid JSON: {str(e)}"
+    except Exception as e:
+        return False, f"Error validating file: {str(e)}"
+
+
+def run_evaluation(results_file: str, model_name: str) -> Tuple[bool, Dict, str]:
+    """
+    Run evaluation using evaluate_all_pai.py script.
+
+    Returns:
+        (success, metrics_dict, message)
+    """
+    try:
+        # Create output directory for this submission
+        output_dir = RESULTS_DIR / model_name.replace(" ", "_")
+        output_dir.mkdir(exist_ok=True)
+
+        # Copy results file to output directory
+        result_copy = output_dir / "results.json"
+        shutil.copy(results_file, result_copy)
+
+        # Run evaluation script with overall grouping
+        cmd = [
+            sys.executable,
+            str(EVAL_SCRIPT),
+            str(result_copy),
+            "--grouping", "overall"
+        ]
+
+        print(f"Running evaluation: {' '.join(cmd)}")
+        result = subprocess.run(cmd, capture_output=True, text=True, timeout=600)
+
+        if result.returncode != 0:
+            return False, {}, f"Evaluation failed: {result.stderr}"
+
+        # Parse output to extract metrics
+        metrics = parse_evaluation_output(result.stdout)
+
+        # Save evaluation output
+        with open(output_dir / "eval_output.txt", 'w') as f:
+            f.write(result.stdout)
+
+        return True, metrics, "✓ Evaluation completed successfully"
+
+    except subprocess.TimeoutExpired:
+        return False, {}, "Evaluation timed out (>10 minutes)"
+    except Exception as e:
+        return False, {}, f"Error running evaluation: {str(e)}"
+
+
+def parse_evaluation_output(output: str) -> Dict[str, float]:
+    """
+    Parse evaluation output from evaluate_all_pai.py to extract metrics.
+
+    Expected output format (from --grouping overall):
+    ================================================================================
+    TAL - Overall Evaluation (All Datasets Combined)
+    ================================================================================
+    Total samples: 1234
+    mAP@0.5: 0.4567
+    ...
+    """
+    metrics = {}
+
+    lines = output.split('\n')
+    current_task = None
+
+    for line in lines:
+        line = line.strip()
+
+        # Detect task headers
+        if "TAL - Overall Evaluation" in line:
+            current_task = "tal"
+        elif "STG - Overall Evaluation" in line:
+            current_task = "stg"
+        elif "NEXT_ACTION - Overall Evaluation" in line:
+            current_task = "next_action"
+        elif "DVC - Overall Evaluation" in line:
+            current_task = "dvc"
+        elif "RC - Overall Evaluation" in line:
+            current_task = "rc"
+        elif "VS - Overall Evaluation" in line:
+            current_task = "vs"
+        elif "SKILL_ASSESSMENT - Overall Evaluation" in line:
+            current_task = "skill_assessment"
+        elif "CVS_ASSESSMENT - Overall Evaluation" in line:
+            current_task = "cvs_assessment"
+
+        # Extract metrics based on task
+        if current_task:
+            if current_task == "tal" and "mAP@0.5:" in line:
+                try:
+                    value = float(line.split("mAP@0.5:")[-1].strip())
+                    metrics["tal"] = value
+                except ValueError:
+                    pass
+
+            elif current_task == "stg" and "mean_iou:" in line:
+                try:
+                    value = float(line.split("mean_iou:")[-1].strip())
+                    metrics["stg"] = value
+                except ValueError:
+                    pass
+
+            elif current_task == "next_action" and "Weighted Average Accuracy" in line:
+                try:
+                    value = float(line.split(":")[-1].strip())
+                    metrics["next_action"] = value
+                except ValueError:
+                    pass
+
+            elif current_task in ["dvc", "vs", "rc"]:
+                # For caption tasks, look for average LLM judge score
+                if "Average" in line or "Mean" in line:
+                    try:
+                        parts = line.split(":")
+                        if len(parts) == 2:
+                            value = float(parts[-1].strip())
+                            if 0 <= value <= 5:  # LLM judge scores are 1-5
+                                metrics[current_task] = value
+                    except ValueError:
+                        pass
+
+            elif current_task in ["skill_assessment", "cvs_assessment"] and "accuracy:" in line.lower():
+                try:
+                    value = float(line.split(":")[-1].strip())
+                    metrics[current_task] = value
+                except ValueError:
+                    pass
+
+    return metrics
+
+
+def submit_model(file, model_name: str, organization: str, contact: str = "") -> Tuple[bool, str]:
+    """
+    Process model submission: validate, evaluate, and add to leaderboard.
+
+    Returns:
+        (success, message)
+    """
+    # Validation
+    if not file:
+        return False, "❌ Please upload a results file"
+
+    if not model_name or not organization:
+        return False, "❌ Please provide both model name and organization"
+
+    # Check if model already exists
+    df = load_leaderboard()
+    if model_name in df['model_name'].values:
+        return False, f"❌ Model '{model_name}' already exists in leaderboard. Please use a different name."
+
+    # Validate results file
+    valid, msg = validate_results_file(file.name)
+    if not valid:
+        return False, f"❌ Invalid results file: {msg}"
+
+    # Run evaluation
+    success, metrics, eval_msg = run_evaluation(file.name, model_name)
+    if not success:
+        return False, f"❌ Evaluation failed: {eval_msg}"
+
+    # Check if we got metrics for all tasks
+    missing_tasks = [task for task in TASKS.keys() if task not in metrics]
+    if len(missing_tasks) > 0:
+        return False, f"❌ Evaluation incomplete. Missing metrics for: {missing_tasks}"
+
+    # Calculate average score (normalized across all tasks)
+    # Normalize each task score to 0-1 range, then average
+    task_scores = []
+    for task in TASKS.keys():
+        if task in metrics:
+            score = metrics[task]
+            # LLM judge scores are 1-5, others are 0-1
+            if task in ["dvc", "vs", "rc"]:
+                normalized = (score - 1) / 4  # Normalize 1-5 to 0-1
+            else:
+                normalized = score  # Already 0-1
+            task_scores.append(normalized)
+
+    average_score = sum(task_scores) / len(task_scores) if task_scores else 0.0
+
+    # Add to leaderboard
+    new_entry = {
+        "model_name": model_name,
+        "organization": organization,
+        "average": round(average_score, 4),
+        **{task: round(metrics.get(task, 0.0), 4) for task in TASKS.keys()},
+        "date": datetime.now().strftime("%Y-%m-%d"),
+        "contact": contact
+    }
+
+    df = pd.concat([df, pd.DataFrame([new_entry])], ignore_index=True)
+    df = df.sort_values('average', ascending=False).reset_index(drop=True)
+
+    save_leaderboard(df)
+
+    success_msg = f"""
+✅ **Submission successful!**
+
+**Model**: {model_name}
+**Organization**: {organization}
+**Average Score**: {average_score:.4f}
+
+**Task Scores**:
+"""
+    for task, info in TASKS.items():
+        score = metrics.get(task, 0.0)
+        success_msg += f"\n- **{info['name']}**: {score:.4f}"
+
+    success_msg += f"\n\n🏆 **Rank**: #{df[df['model_name'] == model_name].index[0] + 1} / {len(df)}"
+
+    return True, success_msg
+
+
+def format_leaderboard_display(df: pd.DataFrame) -> pd.DataFrame:
+    """Format leaderboard dataframe for display."""
+    if df.empty:
+        return df
+
+    # Create display dataframe with selected columns
+    display_cols = ["rank", "model_name", "organization", "average"]
+
+    # Add task columns
+    for task in TASKS.keys():
+        if task in df.columns:
+            display_cols.append(task)
+
+    display_cols.append("date")
+
+    # Rename columns for display
+    display_df = df[display_cols].copy()
+    display_df.columns = ["Rank", "Model", "Organization", "Average"] + \
+        [TASKS[task]["name"] for task in TASKS.keys() if task in df.columns] + \
+        ["Date"]
+
+    return display_df
+
+
+# Create Gradio interface
+with gr.Blocks(title="MedGRPO Leaderboard", theme=gr.themes.Soft()) as demo:
+
+    gr.Markdown("""
+    # 🏥 MedGRPO Leaderboard
+
+    Interactive leaderboard for evaluating **Video-Language Models** on the **MedGRPO benchmark** -
+    8 medical video understanding tasks across 8 surgical datasets.
+
+    📄 **Paper**: [arXiv:2512.06581](https://arxiv.org/abs/2512.06581)
+    🌐 **Project**: [yuhaosu.github.io/MedGRPO](https://yuhaosu.github.io/MedGRPO/)
+    💾 **Dataset**: [huggingface.co/datasets/UIIAmerica/MedGRPO](https://huggingface.co/datasets/UIIAmerica/MedGRPO)
+    💻 **GitHub**: [github.com/YuhaoSu/MedGRPO](https://github.com/YuhaoSu/MedGRPO)
+    """)
+
+    with gr.Tabs():
+
+        # Tab 1: Leaderboard
+        with gr.Tab("🏆 Leaderboard"):
+            gr.Markdown("### Current Rankings")
+
+            leaderboard_table = gr.Dataframe(
+                value=format_leaderboard_display(load_leaderboard()),
+                interactive=False
+            )
+
+            refresh_btn = gr.Button("🔄 Refresh Leaderboard", size="sm")
+            refresh_btn.click(
+                fn=lambda: format_leaderboard_display(load_leaderboard()),
+                outputs=leaderboard_table
+            )
+
+        # Tab 2: Submit
+        with gr.Tab("📤 Submit Results"):
+            gr.Markdown("""
+            ### Submit Your Model Results
+
+            Upload your model's inference results on the **MedGRPO test set (6,245 samples)** to be added to the leaderboard.
+
+            #### 📋 Requirements
+
+            1. **Run inference** on the full test set using VLLM or similar
+            2. **Upload results JSON** in the format below
+            3. **Provide model info** (name, organization)
+
+            #### 📄 Expected File Format
+
+            Your results JSON should contain **6,245 samples** with this structure:
+
+            ```json
+            [
+              {
+                "question": "<video>\\nQuestion text...",
+                "response": "Model's answer...",
+                "ground_truth": "Correct answer...",
+                "qa_type": "tal",
+                "metadata": {"video_id": "...", "fps": "1.0", ...},
+                "data_source": "AVOS",
+                ...
+              },
+              ...
+            ]
+            ```
+
+            **Valid qa_types**: `tal`, `stg`, `next_action`, `dense_captioning`, `video_summary`, `region_caption`, `skill_assessment`, `cvs_assessment`
+
+            #### ⚙️ Evaluation Process
+
+            After upload, the system will:
+            1. **Validate** your results file format
+            2. **Run automatic evaluation** using our evaluation pipeline (`evaluate_all_pai.py`)
+            3. **Compute metrics** for all 8 tasks
+            4. **Add to leaderboard** if successful
+
+            **Evaluation takes ~5-10 minutes** depending on server load.
+            """)

             with gr.Row():
                 with gr.Column():
+                    model_name_input = gr.Textbox(
+                        label="Model Name",
+                        placeholder="e.g., Qwen2.5-VL-7B-MedGRPO",
+                        info="Unique identifier for your model"
                     )

+                    org_input = gr.Textbox(
+                        label="Organization / Author",
+                        placeholder="e.g., University Name or Your Name",
+                        info="Who developed this model?"
                     )
+
+                    contact_input = gr.Textbox(
+                        label="Contact (Optional)",
+                        placeholder="email@example.com or github.com/username",
+                        info="For follow-up questions"
                     )

+                with gr.Column():
+                    results_file_input = gr.File(
+                        label="Upload Results JSON",
+                        file_types=[".json"],
+                        file_count="single"
+                    )
+
+                    submit_btn = gr.Button("🚀 Submit to Leaderboard", variant="primary", size="lg")
+
+            submission_output = gr.Markdown(label="Submission Status")
+
+            # Wire up submission
+            def submit_and_return_message(file, name, org, contact):
+                if file is None:
+                    return "❌ Please upload a results file"
+                success, message = submit_model(file, name, org, contact)
+                return message
+
+            submit_btn.click(
+                fn=submit_and_return_message,
+                inputs=[results_file_input, model_name_input, org_input, contact_input],
+                outputs=submission_output
             )

+        # Tab 3: Tasks & Metrics
+        with gr.Tab("📊 Tasks & Metrics"):
+            gr.Markdown("""
+            ### MedGRPO Benchmark Tasks
+
+            The benchmark evaluates models across **8 diverse tasks** spanning video, segment, and frame-level understanding:
+            """)
+
+            # Create tasks table
+            tasks_data = []
+            for task_key, task_info in TASKS.items():
+                tasks_data.append({
+                    "Task": task_info["name"],
+                    "Key": task_key,
+                    "Metric": task_info["metric"],
+                    "Description": task_info["description"]
+                })
+
+            tasks_df = pd.DataFrame(tasks_data)
+            gr.Dataframe(value=tasks_df, interactive=False)
+
+            gr.Markdown("""
+            ### Evaluation Metrics
+
+            - **TAL** (Temporal Action Localization): **mAP@0.5** - mean Average Precision at IoU threshold 0.5
+            - **STG** (Spatiotemporal Grounding): **mIoU** - mean Intersection over Union (spatial + temporal)
+            - **Next Action**: **Accuracy** - Classification accuracy
+            - **DVC** (Dense Video Captioning): **LLM Judge** - GPT-4.1/Gemini scoring (average of top-5 aspects)
+            - **VS** (Video Summary): **LLM Judge** - GPT-4.1/Gemini scoring (average of top-5 aspects)
+            - **RC** (Region Caption): **LLM Judge** - GPT-4.1/Gemini scoring (average of top-5 aspects)
+            - **Skill Assessment**: **Accuracy** - Surgical skill level classification (JIGSAWS)
+            - **CVS Assessment**: **Accuracy** - Clinical variable scoring
+
+            #### LLM Judge Details
+
+            Caption tasks (DVC, VS, RC) use GPT-4.1 or Gemini-Pro with rubric-based scoring (1-5 scale) across 5 key aspects:
+            - **R2**: Relevance & Medical Terminology
+            - **R4**: Actionable Surgical Actions
+            - **R5**: Comprehensive Detail Level
+            - **R7**: Anatomical & Instrument Precision
+            - **R8**: Clinical Context & Coherence
+
+            The **final score** is the average across these 5 aspects.
+
+            ### Test Set Statistics
+
+            - **Total samples**: 6,245
+            - **Source datasets**: 8 (AVOS, CholecT50, CholecTrack20, Cholec80_CVS, CoPESD, EgoSurgery, NurViD, JIGSAWS)
+            - **Video frames**: ~103,742
+            - **Task distribution**:
+              - TAL: ~800 samples
+              - STG: ~900 samples
+              - Next Action: ~700 samples
+              - DVC: ~800 samples
+              - VS: ~900 samples
+              - RC: ~1000 samples
+              - Skill Assessment: ~600 samples
+              - CVS Assessment: ~545 samples
+            """)
+
+        # Tab 4: About
+        with gr.Tab("ℹ️ About"):
+            gr.Markdown("""
+            ### About MedGRPO
+
+            **MedGRPO** (Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding)
+            is a comprehensive benchmark for evaluating Video-Language Models on medical and surgical video understanding.
+
+            #### Key Features
+
+            - **8 diverse tasks** covering multiple levels of video understanding
+            - **8 source datasets** from various surgical procedures
+            - **6,245 test samples** with high-quality annotations
+            - **Automatic evaluation** with standardized metrics
+            - **LLM-based judging** for caption quality assessment
+
+            #### Paper
+
+            ```bibtex
+            @article{su2024medgrpo,
+              title={MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding},
+              author={Su, Yuhao and Choudhuri, Anwesa and Gao, Zhongpai and Planche, Benjamin and Nguyen, Van Nguyen and Zheng, Meng and Shen, Yuhan and Innanje, Arun and Chen, Terrence and Elhamifar, Ehsan and Wu, Ziyan},
+              journal={arXiv preprint arXiv:2512.06581},
+              year={2025}
+            }
+            ```
+
+            #### Links
+
+            - 📄 **Paper**: [https://arxiv.org/abs/2512.06581](https://arxiv.org/abs/2512.06581)
+            - 🌐 **Project Page**: [https://yuhaosu.github.io/MedGRPO/](https://yuhaosu.github.io/MedGRPO/)
+            - 💾 **Dataset**: [https://huggingface.co/datasets/UIIAmerica/MedGRPO](https://huggingface.co/datasets/UIIAmerica/MedGRPO)
+            - 💻 **GitHub**: [https://github.com/YuhaoSu/MedGRPO](https://github.com/YuhaoSu/MedGRPO)
+            - 🏆 **Leaderboard**: [https://huggingface.co/spaces/UIIAmerica/MedGRPO-Leaderboard](https://huggingface.co/spaces/UIIAmerica/MedGRPO-Leaderboard)
+
+            #### Dataset
+
+            The MedGRPO benchmark includes:
+            - 21,060 training samples
+            - 6,245 test samples
+            - Multi-modal annotations (video, text, temporal spans, bounding boxes)
+            - 8 source datasets covering various medical procedures
+
+            #### License
+
+            - **Dataset**: CC BY-NC-SA 4.0 (Non-commercial, Share-alike)
+            - **Leaderboard Code**: Apache 2.0
+            - **Evaluation Scripts**: MIT
+
+            #### Contact
+
+            For questions or issues:
+            - Open an issue on [GitHub](https://github.com/YuhaoSu/MedGRPO)
+            - Visit the [project page](https://yuhaosu.github.io/MedGRPO/)
+            - Email: [Contact via GitHub](https://github.com/YuhaoSu)
+            """)
+
+if __name__ == "__main__":
+    demo.queue(default_concurrency_limit=5).launch(share=True, server_name="0.0.0.0")