MedVidBench Leaderboard Evaluation
Auto-detection wrapper for evaluating predictions with automatic ground-truth merging.
Overview
This evaluation system supports two input formats:
- Merged format (already contains ground-truth): like results.json
- Prediction-only format (needs ground-truth): like user submissions
The wrapper automatically detects which format you're using and handles ground-truth merging if needed.
Quick Start
# Evaluate predictions (auto-detects format)
python evaluate_predictions.py <predictions_file>
# Evaluate specific tasks
python evaluate_predictions.py <predictions_file> --tasks tal stg
# Only analyze file structure
python evaluate_predictions.py <predictions_file> --analyze-only
# Use overall grouping (aggregate all datasets)
python evaluate_predictions.py <predictions_file> --grouping overall
Input Formats
Format 1: Prediction-Only (User Submission Format)
[
{
"id": "kcOqlifSukA&&22425&&25124&&1.0",
"qa_type": "tal",
"prediction": "22.0-78.0, 89.0-94.0 seconds."
},
...
]
ID Format: video_id&&start_frame&&end_frame&&fps
Required fields:
- id: Unique identifier matching ground-truth
- qa_type: Task type (tal, stg, dvc, next_action, rc, vs, skill_assessment, cvs_assessment)
- prediction: Model's prediction text
What happens: The wrapper automatically merges with ground_truth.json to create complete evaluation records.
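Each id packs the video identifier, clip frame range, and fps into a single &&-delimited string. A minimal sketch of splitting it apart (illustrative only; the wrapper's internal parsing may differ):

```python
def parse_submission_id(record_id: str) -> dict:
    """Split 'video_id&&start_frame&&end_frame&&fps' into its components (illustrative)."""
    parts = record_id.split("&&")
    if len(parts) != 4:
        raise ValueError(f"Invalid ID format: {record_id}")
    video_id, start_frame, end_frame, fps = parts
    return {
        "video_id": video_id,
        "start_frame": int(start_frame),
        "end_frame": int(end_frame),
        "fps": float(fps),
    }

# parse_submission_id("kcOqlifSukA&&22425&&25124&&1.0")
# -> {'video_id': 'kcOqlifSukA', 'start_frame': 22425, 'end_frame': 25124, 'fps': 1.0}
```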
Format 2: Merged (Complete Format)
{
"0": {
"metadata": {
"video_id": "kcOqlifSukA",
"fps": "1.0",
"input_video_start_frame": "22425",
"input_video_end_frame": "25124"
},
"qa_type": "tal",
"struc_info": [...],
"question": "...",
"gnd": "0.0-10.0 seconds.",
"answer": "22.0-78.0, 89.0-94.0 seconds.",
"data_source": "AVOS"
},
...
}
Required fields:
- metadata: Video metadata (video_id, fps, frame range)
- qa_type: Task type
- struc_info: Ground-truth structured information
- question: Question text
- gnd: Ground-truth answer
- answer: Model prediction
- data_source: Dataset name
What happens: The wrapper uses the file directly for evaluation.
Ground-Truth File
Location: /root/code/MedVidBench-Leaderboard/data/ground_truth.json
Structure: Array of records, each containing:
- Complete metadata (video_id, fps, frame range)
- struc_info: Structured ground-truth (spans for TAL/STG, boxes for RC, etc.)
- Ground-truth answer
- Dataset source
Note: This file is NOT public. Users submit prediction-only files, which are merged server-side.
Supported Tasks
| Task | qa_type | Metrics |
|---|---|---|
| TAL | tal | Recall@0.3/0.5, mIoU@0.3/0.5 |
| STG | stg | IoU@0.3/0.5/0.7, mIoU |
| DVC | dense_captioning, dense_captioning_gpt, dense_captioning_gemini, dc | CIDER, METEOR, Precision, Recall, F1, SODA_c |
| Next Action | next_action | Accuracy (per-dataset) |
| RC | region_caption, region_caption_gpt, region_caption_gemini | LLM Judge (GPT-4.1/Gemini) |
| VS | video_summary, video_summary_gpt, video_summary_gemini | LLM Judge (GPT-4.1/Gemini) |
| Skill Assessment | skill_assessment | Accuracy, Macro F1, Weighted F1 |
| CVS Assessment | cvs_assessment | Accuracy, Precision, Recall, F1 |
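The TAL and STG metrics are all built on temporal intersection-over-union between predicted and ground-truth spans. As a rough reference, here is a minimal span-IoU sketch; the per-task evaluation scripts define the authoritative computation:

```python
def temporal_iou(pred: tuple, gt: tuple) -> float:
    """IoU between two (start, end) spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Recall@0.5 counts a ground-truth span as a hit when some predicted span reaches IoU >= 0.5
print(temporal_iou((22.0, 78.0), (20.0, 80.0)))  # 56/60 ≈ 0.93 -> hit at both 0.3 and 0.5
```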
Usage Examples
Example 1: Evaluate User Submission (Prediction-Only)
# User submits predictions in prediction-only format
python evaluate_predictions.py user_predictions.json
# Output:
# [EvaluationWrapper] Detected: Prediction-only format (id, qa_type, prediction)
# [EvaluationWrapper] Merging with ground-truth...
# [EvaluationWrapper] Successfully merged 6245/6245 predictions
# ... [evaluation results] ...
Example 2: Evaluate Internal Results (Already Merged)
# Internal evaluation with complete data
python evaluate_predictions.py results.json
# Output:
# [EvaluationWrapper] Detected: Predictions already contain ground-truth
# [EvaluationWrapper] Using predictions file directly for evaluation
# ... [evaluation results] ...
Example 3: Specific Tasks Only
# Evaluate only TAL and STG tasks
python evaluate_predictions.py predictions.json --tasks tal stg
# Evaluate captioning tasks with LLM judge
python evaluate_predictions.py predictions.json --tasks rc vs dvc
Example 4: Different Grouping Modes
# Per-dataset results (default)
python evaluate_predictions.py predictions.json --grouping per-dataset
# Overall results (aggregate all datasets)
python evaluate_predictions.py predictions.json --grouping overall
Example 5: Skip LLM Judge (Use Pre-computed Scores)
# Skip LLM judge evaluation for caption tasks
# Useful when LLM scores are already pre-computed in the predictions
python evaluate_predictions.py predictions.json --skip-llm-judge
Example 6: Analyze File Structure
# Only analyze what tasks/datasets are present
python evaluate_predictions.py predictions.json --analyze-only
# Output:
# Found QA types:
# tal: 1637 records
# stg: 780 records
# ...
# Found datasets:
# AVOS: 321 records
# CholecT50: 409 records
# ...
Command-Line Options
python evaluate_predictions.py PREDICTIONS_FILE [OPTIONS]
Required:
PREDICTIONS_FILE Path to predictions JSON (merged or prediction-only format)
Optional:
--ground-truth PATH Path to ground-truth JSON (default: data/ground_truth.json)
--tasks TASK [TASK ...] Specific tasks to evaluate (default: all available)
Choices: dvc, tal, next_action, stg, rc, vs,
skill_assessment, cvs_assessment
--grouping {per-dataset,overall}
Grouping strategy (default: per-dataset)
- per-dataset: Results per dataset
- overall: Aggregate all datasets
--analyze-only Only analyze file structure, no evaluation
--skip-llm-judge Skip LLM judge for caption tasks (use pre-computed scores)
-h, --help Show help message
Output Format
Per-Dataset Grouping (Default)
================================================================================
EVALUATION RESULTS - PER DATASET
================================================================================
AVOS:
TAL:
recall@0.3: 0.45
meanIoU@0.3: 0.42
recall@0.5: 0.32
meanIoU@0.5: 0.28
CholecT50:
TAL:
recall@0.3: 0.52
...
Overall Grouping
================================================================================
EVALUATION RESULTS - OVERALL (Dataset-Agnostic)
================================================================================
TAL - Overall Evaluation (All Datasets Combined)
Total samples: 1637
recall@0.3: 0.48
meanIoU@0.3: 0.45
recall@0.5: 0.35
meanIoU@0.5: 0.30
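The two grouping modes differ only in whether records are pooled before metrics are computed. A rough sketch of the idea, using the merged-format keys shown earlier (the orchestrator in evaluate_all_pai.py is authoritative):

```python
from collections import defaultdict

def group_records(merged: dict, grouping: str = "per-dataset") -> dict:
    """Bucket merged-format records per data_source, or into one 'overall' pool."""
    groups = defaultdict(list)
    for rec in merged.values():  # merged format is a dict keyed by record index
        key = rec.get("data_source", "unknown") if grouping == "per-dataset" else "overall"
        groups[key].append(rec)
    return dict(groups)

# per-dataset: recall@0.3 computed separately for AVOS, CholecT50, ...
# overall: all TAL samples pooled first, then a single recall@0.3 reported
```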
Workflow for User Submissions
1. User downloads benchmark: /root/code/MedVidBench/cleaned_test_data_11_04.json
   - Contains questions but NO ground-truth (struc_info removed)
2. User runs inference: generates predictions for each sample
3. User submits predictions in prediction-only format (see the sketch below):
   [ { "id": "<from benchmark>", "qa_type": "<from benchmark>", "prediction": "<model output>" }, ... ]
4. Server evaluates: python evaluate_predictions.py user_submission.json
   - Auto-detects format
   - Merges with server-side ground-truth
   - Runs evaluation
   - Returns metrics
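For step 3, a hedged sketch of assembling a prediction-only submission from the released benchmark. run_model is a placeholder for the user's own inference, and the assumption that benchmark records expose id, qa_type, and question fields directly should be verified against the actual file:

```python
import json

def run_model(question: str) -> str:
    """Placeholder for the user's own inference call."""
    raise NotImplementedError

def build_submission(benchmark_path: str, out_path: str) -> None:
    with open(benchmark_path) as f:
        benchmark = json.load(f)
    submission = [
        {
            "id": rec["id"],                      # assumed field names -- check the benchmark schema
            "qa_type": rec["qa_type"],
            "prediction": run_model(rec["question"]),
        }
        for rec in benchmark
    ]
    with open(out_path, "w") as f:
        json.dump(submission, f, indent=2)

# build_submission("cleaned_test_data_11_04.json", "user_predictions.json")
```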
File Structure
evaluation/
├── README.md                   # This file
├── evaluate_predictions.py     # Main wrapper (auto-detection + merging)
├── evaluate_all_pai.py         # Core evaluation orchestrator
├── eval_tal.py                 # TAL evaluation
├── eval_stg.py                 # STG evaluation
├── eval_dvc.py                 # Dense captioning evaluation
├── eval_next_action.py         # Next action evaluation
├── eval_caption_llm_judge.py   # RC/VS LLM judge evaluation
├── eval_skill_assessment.py    # Skill assessment evaluation
└── eval_cvs_assessment.py      # CVS assessment evaluation
Key Features
Auto-Detection Logic
The wrapper detects the format by checking for these indicators (see the sketch after the lists):
Prediction-only format:
- Has id field (video_id&&start&&end&&fps)
- Has prediction field
- Missing gnd or struc_info
Merged format:
- Has question field
- Has gnd field (ground-truth)
- Has struc_info field (structured GT)
- Has metadata dict
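In essence the detection boils down to a few field checks on a sample record. A simplified sketch (evaluate_predictions.py implements the real logic):

```python
def detect_format(records) -> str:
    """Classify a loaded predictions file as 'prediction-only' or 'merged' (simplified)."""
    sample = records[0] if isinstance(records, list) else next(iter(records.values()))
    if "gnd" in sample or "struc_info" in sample or "metadata" in sample:
        return "merged"
    if "id" in sample and "prediction" in sample:
        return "prediction-only"
    raise ValueError("Unrecognized predictions format")
```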
Ground-Truth Merging
When prediction-only format is detected (see the sketch after the list):
1. Load predictions and ground-truth
2. Build index: {id -> ground_truth_record}
3. For each prediction:
   - Look up ground-truth by ID
   - Merge into complete record
   - Add data_source from ground-truth
4. Save to temporary file
5. Run evaluation
6. Clean up temporary file
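A rough sketch of the merge step, assuming each ground-truth record can be indexed by the same id string (in practice the wrapper may rebuild the id from each record's metadata instead); evaluate_predictions.py is authoritative:

```python
import json

def merge_predictions(pred_path: str, gt_path: str) -> dict:
    """Join prediction-only records with ground-truth records by id (illustrative only)."""
    with open(pred_path) as f:
        predictions = json.load(f)      # list of {id, qa_type, prediction}
    with open(gt_path) as f:
        ground_truth = json.load(f)     # assumed: list of records carrying an 'id'
    gt_index = {rec["id"]: rec for rec in ground_truth}

    merged, unmatched = {}, []
    for i, pred in enumerate(predictions):
        gt = gt_index.get(pred["id"])
        if gt is None:
            unmatched.append(pred["id"])
            continue
        record = dict(gt)               # metadata, struc_info, gnd, data_source, ...
        record["answer"] = pred["prediction"]
        merged[str(i)] = record
    if unmatched:
        print(f"WARNING: {len(unmatched)} predictions not found in ground-truth")
    return merged
```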
Dataset Detection
Datasets are detected from the following, in order (see the sketch below):
1. data_source field (primary, leaderboard format)
2. dataset field (fallback)
3. dataset_name field (fallback)
4. Video ID patterns (last resort):
   - YouTube IDs (11 chars with letters) → AVOS
   - *_part* pattern → CoPESD
   - video* pattern → CholecT50
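A schematic version of that fallback chain; the regexes here are guesses at the patterns described above, not the wrapper's exact rules:

```python
import re

def detect_dataset(record: dict) -> str:
    """Resolve a record's dataset via field fallbacks, then video-ID heuristics (illustrative)."""
    for field in ("data_source", "dataset", "dataset_name"):
        if record.get(field):
            return record[field]
    video_id = record.get("metadata", {}).get("video_id", "")
    if re.fullmatch(r"[A-Za-z0-9_-]{11}", video_id) and re.search(r"[A-Za-z]", video_id):
        return "AVOS"          # YouTube-style 11-char ID
    if "_part" in video_id:
        return "CoPESD"
    if video_id.startswith("video"):
        return "CholecT50"
    return "unknown"
```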
Error Handling
Missing Ground-Truth
# If ground-truth file not found
[EvaluationWrapper] ERROR: Ground-truth file not found: /path/to/ground_truth.json
Solution: Specify correct path with --ground-truth
Unmatched Predictions
[EvaluationWrapper] WARNING: 10 predictions not found in ground-truth
[EvaluationWrapper] First 5 unmatched IDs: [...]
Cause: Prediction IDs don't match ground-truth IDs
Solution: Check ID format (video_id&&start&&end&&fps must match exactly)
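Before submitting, prediction IDs can be sanity-checked against the released benchmark file, which carries the same id strings as the server-side ground-truth. A hedged sketch, assuming the benchmark is a list of records exposing an id field:

```python
import json

def check_ids(pred_path: str, benchmark_path: str) -> None:
    """Report prediction IDs that do not appear in the released benchmark."""
    with open(pred_path) as f:
        pred_ids = {rec["id"] for rec in json.load(f)}
    with open(benchmark_path) as f:
        bench_ids = {rec["id"] for rec in json.load(f)}   # assumed list of records with 'id'
    missing = sorted(pred_ids - bench_ids)
    if missing:
        print(f"{len(missing)} prediction IDs not in benchmark, e.g. {missing[:5]}")
    else:
        print("All prediction IDs match the benchmark.")
```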
Invalid ID Format
ValueError: Invalid ID format: <id_string>
Cause: ID doesn't follow video_id&&start&&end&&fps format
Solution: Fix ID format in predictions
API Keys for LLM Judge
For RC/VS evaluation with LLM judge:
export OPENAI_API_KEY="your-key" # For GPT-4.1
export GOOGLE_API_KEY="your-key" # For Gemini
# Then run evaluation
python evaluate_predictions.py predictions.json --tasks rc vs
Cost: ~$0.012 per RC/VS sample (GPT-4.1)
Verification Checklist
Before evaluating submissions:
# 1. Check file format
python evaluate_predictions.py submission.json --analyze-only
# 2. Verify ground-truth file exists
ls -lh /root/code/MedVidBench-Leaderboard/data/ground_truth.json
# 3. Run evaluation on a sample (first 100 records; jq keeps the slice valid JSON)
jq '.[:100]' submission.json > sample.json
python evaluate_predictions.py sample.json
# 4. If successful, run full evaluation
python evaluate_predictions.py submission.json
Performance
- Small files (100 samples): ~5-10 seconds
- Full benchmark (6245 samples): ~2-5 minutes
- TAL/STG: ~30 seconds per dataset
- Next Action: ~20 seconds per dataset
- DVC: ~1-2 minutes (metric computation)
- RC/VS with LLM judge: ~5-10 minutes (API calls)
Notes
- Ground-truth file contains 1414 records (subset for leaderboard testing)
- Full benchmark has 6245 records across 8 datasets
- Temporary files are automatically cleaned up after evaluation
- LLM judge can be skipped with --skip-llm-judge if scores are pre-computed