MedVidBench Leaderboard Evaluation

A wrapper that auto-detects the format of a predictions file and merges in ground-truth automatically when needed.

Overview

This evaluation system supports two input formats:

  1. Merged format (already contains ground-truth): Like results.json
  2. Prediction-only format (needs ground-truth): Like user submissions

The wrapper automatically detects which format you're using and handles ground-truth merging if needed.

Quick Start

# Evaluate predictions (auto-detects format)
python evaluate_predictions.py <predictions_file>

# Evaluate specific tasks
python evaluate_predictions.py <predictions_file> --tasks tal stg

# Only analyze file structure
python evaluate_predictions.py <predictions_file> --analyze-only

# Use overall grouping (aggregate all datasets)
python evaluate_predictions.py <predictions_file> --grouping overall

Input Formats

Format 1: Prediction-Only (User Submission Format)

[
  {
    "id": "kcOqlifSukA&&22425&&25124&&1.0",
    "qa_type": "tal",
    "prediction": "22.0-78.0, 89.0-94.0 seconds."
  },
  ...
]

ID Format: video_id&&start_frame&&end_frame&&fps
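
For reference, a minimal sketch of parsing this ID format (the parse_id helper is illustrative, not part of the wrapper's API):

def parse_id(record_id):
    # "kcOqlifSukA&&22425&&25124&&1.0" -> ("kcOqlifSukA", 22425, 25124, 1.0)
    parts = record_id.split("&&")
    if len(parts) != 4:
        raise ValueError(f"Invalid ID format: {record_id}")
    video_id, start_frame, end_frame, fps = parts
    return video_id, int(start_frame), int(end_frame), float(fps)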

Required fields:

  • id: Unique identifier matching ground-truth
  • qa_type: Task type (tal, stg, dvc, next_action, rc, vs, skill_assessment, cvs_assessment)
  • prediction: Model's prediction text

What happens: The wrapper automatically merges with ground_truth.json to create complete evaluation records.

Format 2: Merged (Complete Format)

{
  "0": {
    "metadata": {
      "video_id": "kcOqlifSukA",
      "fps": "1.0",
      "input_video_start_frame": "22425",
      "input_video_end_frame": "25124"
    },
    "qa_type": "tal",
    "struc_info": [...],
    "question": "...",
    "gnd": "0.0-10.0 seconds.",
    "answer": "22.0-78.0, 89.0-94.0 seconds.",
    "data_source": "AVOS"
  },
  ...
}

Required fields:

  • metadata: Video metadata (video_id, fps, frame range)
  • qa_type: Task type
  • struc_info: Ground-truth structured information
  • question: Question text
  • gnd: Ground-truth answer
  • answer: Model prediction
  • data_source: Dataset name

What happens: The wrapper uses the file directly for evaluation.

Ground-Truth File

Location: /root/code/MedVidBench-Leaderboard/data/ground_truth.json

Structure: Array of records, each containing:

  • Complete metadata (video_id, fps, frame range)
  • struc_info: Structured ground-truth (spans for TAL/STG, boxes for RC, etc.)
  • Ground-truth answer
  • Dataset source

Note: This file is NOT public. Users submit prediction-only files, which are merged server-side.

Supported Tasks

Task              qa_type                                                     Metrics
----------------  ----------------------------------------------------------  --------------------------------------------
TAL               tal                                                         Recall@0.3/0.5, mIoU@0.3/0.5
STG               stg                                                         IoU@0.3/0.5/0.7, mIoU
DVC               dense_captioning, dense_captioning_gpt,                     CIDEr, METEOR, Precision, Recall, F1, SODA_c
                  dense_captioning_gemini, dc
Next Action       next_action                                                 Accuracy (per-dataset)
RC                region_caption, region_caption_gpt, region_caption_gemini   LLM Judge (GPT-4.1/Gemini)
VS                video_summary, video_summary_gpt, video_summary_gemini      LLM Judge (GPT-4.1/Gemini)
Skill Assessment  skill_assessment                                            Accuracy, Macro F1, Weighted F1
CVS Assessment    cvs_assessment                                              Accuracy, Precision, Recall, F1

Usage Examples

Example 1: Evaluate User Submission (Prediction-Only)

# User submits predictions in prediction-only format
python evaluate_predictions.py user_predictions.json

# Output:
# [EvaluationWrapper] ✓ Detected: Prediction-only format (id, qa_type, prediction)
# [EvaluationWrapper] Merging with ground-truth...
# [EvaluationWrapper] ✓ Successfully merged 6245/6245 predictions
# ... [evaluation results] ...

Example 2: Evaluate Internal Results (Already Merged)

# Internal evaluation with complete data
python evaluate_predictions.py results.json

# Output:
# [EvaluationWrapper] ✓ Detected: Predictions already contain ground-truth
# [EvaluationWrapper] Using predictions file directly for evaluation
# ... [evaluation results] ...

Example 3: Specific Tasks Only

# Evaluate only TAL and STG tasks
python evaluate_predictions.py predictions.json --tasks tal stg

# Evaluate captioning tasks with LLM judge
python evaluate_predictions.py predictions.json --tasks rc vs dvc

Example 4: Different Grouping Modes

# Per-dataset results (default)
python evaluate_predictions.py predictions.json --grouping per-dataset

# Overall results (aggregate all datasets)
python evaluate_predictions.py predictions.json --grouping overall

Example 5: Skip LLM Judge (Use Pre-computed Scores)

# Skip LLM judge evaluation for caption tasks
# Useful when LLM scores are already pre-computed in the predictions
python evaluate_predictions.py predictions.json --skip-llm-judge

Example 6: Analyze File Structure

# Only analyze what tasks/datasets are present
python evaluate_predictions.py predictions.json --analyze-only

# Output:
# Found QA types:
#   tal: 1637 records
#   stg: 780 records
#   ...
# Found datasets:
#   AVOS: 321 records
#   CholecT50: 409 records
#   ...

Command-Line Options

python evaluate_predictions.py PREDICTIONS_FILE [OPTIONS]

Required:
  PREDICTIONS_FILE          Path to predictions JSON (merged or prediction-only format)

Optional:
  --ground-truth PATH       Path to ground-truth JSON (default: data/ground_truth.json)
  --tasks TASK [TASK ...]  Specific tasks to evaluate (default: all available)
                           Choices: dvc, tal, next_action, stg, rc, vs,
                                   skill_assessment, cvs_assessment
  --grouping {per-dataset,overall}
                           Grouping strategy (default: per-dataset)
                           - per-dataset: Results per dataset
                           - overall: Aggregate all datasets
  --analyze-only           Only analyze file structure, no evaluation
  --skip-llm-judge         Skip LLM judge for caption tasks (use pre-computed scores)
  -h, --help              Show help message

Output Format

Per-Dataset Grouping (Default)

================================================================================
EVALUATION RESULTS - PER DATASET
================================================================================

AVOS:
  TAL:
    recall@0.3: 0.45
    meanIoU@0.3: 0.42
    recall@0.5: 0.32
    meanIoU@0.5: 0.28

CholecT50:
  TAL:
    recall@0.3: 0.52
    ...

Overall Grouping

================================================================================
EVALUATION RESULTS - OVERALL (Dataset-Agnostic)
================================================================================

TAL - Overall Evaluation (All Datasets Combined)
Total samples: 1637

  recall@0.3: 0.48
  meanIoU@0.3: 0.45
  recall@0.5: 0.35
  meanIoU@0.5: 0.30

Workflow for User Submissions

  1. User downloads benchmark: /root/code/MedVidBench/cleaned_test_data_11_04.json

    • Contains questions but NO ground-truth (struc_info removed)
  2. User runs inference: Generates predictions for each sample

  3. User submits predictions in the prediction-only format (a sketch of assembling this file follows the list)

    [
      {
        "id": "<from benchmark>",
        "qa_type": "<from benchmark>",
        "prediction": "<model output>"
      },
      ...
    ]
    
  4. Server evaluates:

    python evaluate_predictions.py user_submission.json
    
    • Auto-detects format
    • Merges with server-side ground-truth
    • Runs evaluation
    • Returns metrics
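
To illustrate step 3, a minimal sketch of assembling a submission from the benchmark file. The benchmark schema (an id and qa_type per record) is assumed from the template above, and run_model is a placeholder for your own inference function:

import json

def build_submission(benchmark_path, out_path, run_model):
    # run_model(record) -> prediction string; supplied by you.
    # Assumes each benchmark record carries "id" and "qa_type" fields.
    with open(benchmark_path) as f:
        benchmark = json.load(f)
    submission = [
        {"id": rec["id"], "qa_type": rec["qa_type"], "prediction": run_model(rec)}
        for rec in benchmark
    ]
    with open(out_path, "w") as f:
        json.dump(submission, f, indent=2)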

File Structure

evaluation/
├── README.md                    # This file
├── evaluate_predictions.py      # Main wrapper (auto-detection + merging)
├── evaluate_all_pai.py          # Core evaluation orchestrator
├── eval_tal.py                  # TAL evaluation
├── eval_stg.py                  # STG evaluation
├── eval_dvc.py                  # Dense captioning evaluation
├── eval_next_action.py          # Next action evaluation
├── eval_caption_llm_judge.py    # RC/VS LLM judge evaluation
├── eval_skill_assessment.py     # Skill assessment evaluation
└── eval_cvs_assessment.py       # CVS assessment evaluation

Key Features

Auto-Detection Logic

The wrapper detects the input format by checking for these indicators (a code sketch follows the two lists):

Prediction-only format:

  • Has id field (video_id&&start&&end&&fps)
  • Has prediction field
  • Missing gnd or struc_info

Merged format:

  • Has question field
  • Has gnd field (ground-truth)
  • Has struc_info field (structured GT)
  • Has metadata dict
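
In code, the detection heuristic amounts to something like this minimal sketch, assuming a JSON array for prediction-only files and an index-keyed dict for merged files, as shown above; the wrapper's actual checks may be stricter:

def detect_format(records):
    # records = json.load(open(path)). Merged files are dicts keyed by index;
    # prediction-only files are JSON arrays.
    sample = next(iter(records.values())) if isinstance(records, dict) else records[0]
    if "gnd" in sample or "struc_info" in sample:
        return "merged"
    if "id" in sample and "prediction" in sample:
        return "prediction-only"
    raise ValueError("Unrecognized predictions format")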

Ground-Truth Merging

When the prediction-only format is detected, the wrapper proceeds as follows (a code sketch follows the list):

  1. Load predictions and ground-truth
  2. Build index: {id -> ground_truth_record}
  3. For each prediction:
    • Look up ground-truth by ID
    • Merge into complete record
    • Add data_source from ground-truth
  4. Save to temporary file
  5. Run evaluation
  6. Clean up temporary file
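
A minimal sketch of steps 1-4, assuming each ground-truth record carries the same "id" key as submissions (the real wrapper adds more validation; data_source arrives via the ground-truth record itself):

import json
import os
import tempfile

def merge_predictions(pred_path, gt_path):
    # Assumes GT records carry the same "id" key as submissions.
    with open(pred_path) as f:
        preds = json.load(f)                         # step 1: load predictions
    with open(gt_path) as f:
        gt = json.load(f)                            #         and ground-truth
    index = {rec["id"]: rec for rec in gt}           # step 2: id -> GT record
    merged, unmatched = {}, []
    for i, pred in enumerate(preds):                 # step 3: merge each prediction
        rec = index.get(pred["id"])
        if rec is None:
            unmatched.append(pred["id"])
            continue
        merged[str(i)] = {**rec, "answer": pred["prediction"]}
    fd, tmp_path = tempfile.mkstemp(suffix=".json")  # step 4: save to a temp file
    with os.fdopen(fd, "w") as f:
        json.dump(merged, f, indent=2)
    return tmp_path, unmatched                       # caller evaluates, then deletes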

Dataset Detection

Datasets are detected from the following, in priority order (see the sketch after this list):

  1. data_source field (primary, leaderboard format)
  2. dataset field (fallback)
  3. dataset_name field (fallback)
  4. Video ID patterns (last resort):
    • YouTube IDs (11 chars with letters) → AVOS
    • *_part* pattern → CoPESD
    • video* pattern → CholecT50
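
The fallback could look roughly like this heuristic sketch; the priority order and patterns come from the list above, but the exact regexes are assumptions:

import re

def guess_dataset(record):
    # Prefer explicit fields, then fall back to video-ID patterns (heuristic).
    for key in ("data_source", "dataset", "dataset_name"):
        if record.get(key):
            return record[key]
    video_id = record.get("metadata", {}).get("video_id", "")
    if re.fullmatch(r"[A-Za-z0-9_-]{11}", video_id) and re.search(r"[A-Za-z]", video_id):
        return "AVOS"                    # looks like a YouTube ID
    if "_part" in video_id:
        return "CoPESD"
    if video_id.startswith("video"):
        return "CholecT50"
    return "unknown"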

Error Handling

Missing Ground-Truth

# If ground-truth file not found
[EvaluationWrapper] ❌ ERROR: Ground-truth file not found: /path/to/ground_truth.json

Solution: Specify correct path with --ground-truth

Unmatched Predictions

[EvaluationWrapper] ⚠️  WARNING: 10 predictions not found in ground-truth
[EvaluationWrapper]   First 5 unmatched IDs: [...]

Cause: Prediction IDs don't match ground-truth IDs

Solution: Check ID format (video_id&&start&&end&&fps must match exactly)
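
A quick way to catch mismatches before submitting is to compare submission IDs against the benchmark file (assuming both are JSON arrays whose records carry an id field):

import json

# Assumes both files are JSON arrays of records with an "id" field.
with open("submission.json") as f:
    sub_ids = {r["id"] for r in json.load(f)}
with open("cleaned_test_data_11_04.json") as f:
    bench_ids = {r["id"] for r in json.load(f)}

missing = sorted(sub_ids - bench_ids)
print(f"{len(missing)} submission IDs not found in the benchmark")
print("First 5:", missing[:5])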

Invalid ID Format

ValueError: Invalid ID format: <id_string>

Cause: ID doesn't follow video_id&&start&&end&&fps format

Solution: Fix ID format in predictions

API Keys for LLM Judge

For RC/VS evaluation with LLM judge:

export OPENAI_API_KEY="your-key"      # For GPT-4.1
export GOOGLE_API_KEY="your-key"      # For Gemini

# Then run evaluation
python evaluate_predictions.py predictions.json --tasks rc vs

Cost: ~$0.012 per RC/VS sample (GPT-4.1)

Verification Checklist

Before evaluating submissions:

# 1. Check file format
python evaluate_predictions.py submission.json --analyze-only

# 2. Verify ground-truth file exists
ls -lh /root/code/MedVidBench-Leaderboard/data/ground_truth.json

# 3. Run evaluation on a sample (first 100 records; head would cut mid-record,
#    so slice the JSON array with jq instead)
jq '.[:100]' submission.json > sample.json
python evaluate_predictions.py sample.json

# 4. If successful, run full evaluation
python evaluate_predictions.py submission.json

Performance

  • Small files (100 samples): ~5-10 seconds
  • Full benchmark (6245 samples): ~2-5 minutes
    • TAL/STG: ~30 seconds per dataset
    • Next Action: ~20 seconds per dataset
    • DVC: ~1-2 minutes (metric computation)
    • RC/VS with LLM judge: ~5-10 minutes (API calls)

Notes

  • Ground-truth file contains 1414 records (subset for leaderboard testing)
  • Full benchmark has 6245 records across 8 datasets
  • Temporary files are automatically cleaned up after evaluation
  • LLM judge can be skipped with --skip-llm-judge if scores pre-computed