MedVidBench Leaderboard Evaluation

A wrapper that auto-detects the format of a predictions file and merges in ground-truth automatically when needed.

Overview

This evaluation system supports two input formats:

  1. Merged format (already contains ground-truth): Like results.json
  2. Prediction-only format (needs ground-truth): Like user submissions

The wrapper automatically detects which format you're using and handles ground-truth merging if needed.

Quick Start

# Evaluate predictions (auto-detects format)
python evaluate_predictions.py <predictions_file>

# Evaluate specific tasks
python evaluate_predictions.py <predictions_file> --tasks tal stg

# Only analyze file structure
python evaluate_predictions.py <predictions_file> --analyze-only

# Use overall grouping (aggregate all datasets)
python evaluate_predictions.py <predictions_file> --grouping overall

Input Formats

Format 1: Prediction-Only (User Submission Format)

[
  {
    "id": "kcOqlifSukA&&22425&&25124&&1.0",
    "qa_type": "tal",
    "prediction": "22.0-78.0, 89.0-94.0 seconds."
  },
  ...
]

ID Format: video_id&&start_frame&&end_frame&&fps
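
For reference, a minimal sketch of parsing this ID format (the parse_id helper is illustrative, not part of the wrapper's API):

def parse_id(record_id):
    # "kcOqlifSukA&&22425&&25124&&1.0" -> ("kcOqlifSukA", 22425, 25124, 1.0)
    parts = record_id.split("&&")
    if len(parts) != 4:
        raise ValueError(f"Invalid ID format: {record_id}")
    video_id, start_frame, end_frame, fps = parts
    return video_id, int(start_frame), int(end_frame), float(fps)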

Required fields:

  • id: Unique identifier matching ground-truth
  • qa_type: Task type (tal, stg, dvc, next_action, rc, vs, skill_assessment, cvs_assessment)
  • prediction: Model's prediction text

What happens: The wrapper automatically merges with ground_truth.json to create complete evaluation records.

Format 2: Merged (Complete Format)

{
  "0": {
    "metadata": {
      "video_id": "kcOqlifSukA",
      "fps": "1.0",
      "input_video_start_frame": "22425",
      "input_video_end_frame": "25124"
    },
    "qa_type": "tal",
    "struc_info": [...],
    "question": "...",
    "gnd": "0.0-10.0 seconds.",
    "answer": "22.0-78.0, 89.0-94.0 seconds.",
    "data_source": "AVOS"
  },
  ...
}

Required fields:

  • metadata: Video metadata (video_id, fps, frame range)
  • qa_type: Task type
  • struc_info: Ground-truth structured information
  • question: Question text
  • gnd: Ground-truth answer
  • answer: Model prediction
  • data_source: Dataset name

What happens: The wrapper uses the file directly for evaluation.

Ground-Truth File

Location: /root/code/MedVidBench-Leaderboard/data/ground_truth.json

Structure: Array of records, each containing:

  • Complete metadata (video_id, fps, frame range)
  • struc_info: Structured ground-truth (spans for TAL/STG, boxes for RC, etc.)
  • Ground-truth answer
  • Dataset source

Note: This file is NOT public. Users submit prediction-only files, which are merged server-side.

Supported Tasks

Task              qa_type                                                     Metrics
----------------  ----------------------------------------------------------  --------------------------------------------
TAL               tal                                                         Recall@0.3/0.5, mIoU@0.3/0.5
STG               stg                                                         IoU@0.3/0.5/0.7, mIoU
DVC               dense_captioning, dense_captioning_gpt,                     CIDEr, METEOR, Precision, Recall, F1, SODA_c
                  dense_captioning_gemini, dc
Next Action       next_action                                                 Accuracy (per-dataset)
RC                region_caption, region_caption_gpt, region_caption_gemini   LLM Judge (GPT-4.1/Gemini)
VS                video_summary, video_summary_gpt, video_summary_gemini      LLM Judge (GPT-4.1/Gemini)
Skill Assessment  skill_assessment                                            Accuracy, Macro F1, Weighted F1
CVS Assessment    cvs_assessment                                              Accuracy, Precision, Recall, F1

Usage Examples

Example 1: Evaluate User Submission (Prediction-Only)

# User submits predictions in prediction-only format
python evaluate_predictions.py user_predictions.json

# Output:
# [EvaluationWrapper] ✓ Detected: Prediction-only format (id, qa_type, prediction)
# [EvaluationWrapper] Merging with ground-truth...
# [EvaluationWrapper] ✓ Successfully merged 6245/6245 predictions
# ... [evaluation results] ...

Example 2: Evaluate Internal Results (Already Merged)

# Internal evaluation with complete data
python evaluate_predictions.py results.json

# Output:
# [EvaluationWrapper] ✓ Detected: Predictions already contain ground-truth
# [EvaluationWrapper] Using predictions file directly for evaluation
# ... [evaluation results] ...

Example 3: Specific Tasks Only

# Evaluate only TAL and STG tasks
python evaluate_predictions.py predictions.json --tasks tal stg

# Evaluate captioning tasks with LLM judge
python evaluate_predictions.py predictions.json --tasks rc vs dvc

Example 4: Different Grouping Modes

# Per-dataset results (default)
python evaluate_predictions.py predictions.json --grouping per-dataset

# Overall results (aggregate all datasets)
python evaluate_predictions.py predictions.json --grouping overall

Example 5: Skip LLM Judge (Use Pre-computed Scores)

# Skip LLM judge evaluation for caption tasks
# Useful when LLM scores are already pre-computed in the predictions
python evaluate_predictions.py predictions.json --skip-llm-judge

Example 6: Analyze File Structure

# Only analyze what tasks/datasets are present
python evaluate_predictions.py predictions.json --analyze-only

# Output:
# Found QA types:
#   tal: 1637 records
#   stg: 780 records
#   ...
# Found datasets:
#   AVOS: 321 records
#   CholecT50: 409 records
#   ...

Command-Line Options

python evaluate_predictions.py PREDICTIONS_FILE [OPTIONS]

Required:
  PREDICTIONS_FILE          Path to predictions JSON (merged or prediction-only format)

Optional:
  --ground-truth PATH       Path to ground-truth JSON (default: data/ground_truth.json)
  --tasks TASK [TASK ...]  Specific tasks to evaluate (default: all available)
                           Choices: dvc, tal, next_action, stg, rc, vs,
                                   skill_assessment, cvs_assessment
  --grouping {per-dataset,overall}
                           Grouping strategy (default: per-dataset)
                           - per-dataset: Results per dataset
                           - overall: Aggregate all datasets
  --analyze-only           Only analyze file structure, no evaluation
  --skip-llm-judge         Skip LLM judge for caption tasks (use pre-computed scores)
  -h, --help              Show help message

Output Format

Per-Dataset Grouping (Default)

================================================================================
EVALUATION RESULTS - PER DATASET
================================================================================

AVOS:
  TAL:
    recall@0.3: 0.45
    meanIoU@0.3: 0.42
    recall@0.5: 0.32
    meanIoU@0.5: 0.28

CholecT50:
  TAL:
    recall@0.3: 0.52
    ...

Overall Grouping

================================================================================
EVALUATION RESULTS - OVERALL (Dataset-Agnostic)
================================================================================

TAL - Overall Evaluation (All Datasets Combined)
Total samples: 1637

  recall@0.3: 0.48
  meanIoU@0.3: 0.45
  recall@0.5: 0.35
  meanIoU@0.5: 0.30

Workflow for User Submissions

  1. User downloads benchmark: /root/code/MedVidBench/cleaned_test_data_11_04.json

    • Contains questions but NO ground-truth (struc_info removed)
  2. User runs inference: Generates predictions for each sample

  3. User submits predictions in the prediction-only format (a sketch of assembling this file follows the list)

    [
      {
        "id": "<from benchmark>",
        "qa_type": "<from benchmark>",
        "prediction": "<model output>"
      },
      ...
    ]
    
  4. Server evaluates:

    python evaluate_predictions.py user_submission.json
    
    • Auto-detects format
    • Merges with server-side ground-truth
    • Runs evaluation
    • Returns metrics
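
To illustrate step 3, a minimal sketch of assembling a submission from the benchmark file. The benchmark schema (an id and qa_type per record) is assumed from the template above, and run_model is a placeholder for your own inference function:

import json

def build_submission(benchmark_path, out_path, run_model):
    # run_model(record) -> prediction string; supplied by you.
    # Assumes each benchmark record carries "id" and "qa_type" fields.
    with open(benchmark_path) as f:
        benchmark = json.load(f)
    submission = [
        {"id": rec["id"], "qa_type": rec["qa_type"], "prediction": run_model(rec)}
        for rec in benchmark
    ]
    with open(out_path, "w") as f:
        json.dump(submission, f, indent=2)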

File Structure

evaluation/
├── README.md                    # This file
├── evaluate_predictions.py      # Main wrapper (auto-detection + merging)
├── evaluate_all_pai.py          # Core evaluation orchestrator
├── eval_tal.py                  # TAL evaluation
├── eval_stg.py                  # STG evaluation
├── eval_dvc.py                  # Dense captioning evaluation
├── eval_next_action.py          # Next action evaluation
├── eval_caption_llm_judge.py    # RC/VS LLM judge evaluation
├── eval_skill_assessment.py     # Skill assessment evaluation
└── eval_cvs_assessment.py       # CVS assessment evaluation

Key Features

Auto-Detection Logic

The wrapper detects the input format by checking for these indicators (a code sketch follows the two lists):

Prediction-only format:

  • Has id field (video_id&&start&&end&&fps)
  • Has prediction field
  • Missing gnd or struc_info

Merged format:

  • Has question field
  • Has gnd field (ground-truth)
  • Has struc_info field (structured GT)
  • Has metadata dict
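
In code, the detection heuristic amounts to something like this minimal sketch, assuming a JSON array for prediction-only files and an index-keyed dict for merged files, as shown above; the wrapper's actual checks may be stricter:

def detect_format(records):
    # records = json.load(open(path)). Merged files are dicts keyed by index;
    # prediction-only files are JSON arrays.
    sample = next(iter(records.values())) if isinstance(records, dict) else records[0]
    if "gnd" in sample or "struc_info" in sample:
        return "merged"
    if "id" in sample and "prediction" in sample:
        return "prediction-only"
    raise ValueError("Unrecognized predictions format")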

Ground-Truth Merging

When the prediction-only format is detected, the wrapper proceeds as follows (a code sketch follows the list):

  1. Load predictions and ground-truth
  2. Build index: {id -> ground_truth_record}
  3. For each prediction:
    • Look up ground-truth by ID
    • Merge into complete record
    • Add data_source from ground-truth
  4. Save to temporary file
  5. Run evaluation
  6. Clean up temporary file
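
A minimal sketch of steps 1-4, assuming each ground-truth record carries the same "id" key as submissions (the real wrapper adds more validation; data_source arrives via the ground-truth record itself):

import json
import os
import tempfile

def merge_predictions(pred_path, gt_path):
    # Assumes GT records carry the same "id" key as submissions.
    with open(pred_path) as f:
        preds = json.load(f)                         # step 1: load predictions
    with open(gt_path) as f:
        gt = json.load(f)                            #         and ground-truth
    index = {rec["id"]: rec for rec in gt}           # step 2: id -> GT record
    merged, unmatched = {}, []
    for i, pred in enumerate(preds):                 # step 3: merge each prediction
        rec = index.get(pred["id"])
        if rec is None:
            unmatched.append(pred["id"])
            continue
        merged[str(i)] = {**rec, "answer": pred["prediction"]}
    fd, tmp_path = tempfile.mkstemp(suffix=".json")  # step 4: save to a temp file
    with os.fdopen(fd, "w") as f:
        json.dump(merged, f, indent=2)
    return tmp_path, unmatched                       # caller evaluates, then deletes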

Dataset Detection

Datasets are detected from the following, in priority order (see the sketch after this list):

  1. data_source field (primary, leaderboard format)
  2. dataset field (fallback)
  3. dataset_name field (fallback)
  4. Video ID patterns (last resort):
    • YouTube IDs (11 chars with letters) → AVOS
    • *_part* pattern → CoPESD
    • video* pattern → CholecT50
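
The fallback could look roughly like this heuristic sketch; the priority order and patterns come from the list above, but the exact regexes are assumptions:

import re

def guess_dataset(record):
    # Prefer explicit fields, then fall back to video-ID patterns (heuristic).
    for key in ("data_source", "dataset", "dataset_name"):
        if record.get(key):
            return record[key]
    video_id = record.get("metadata", {}).get("video_id", "")
    if re.fullmatch(r"[A-Za-z0-9_-]{11}", video_id) and re.search(r"[A-Za-z]", video_id):
        return "AVOS"                    # looks like a YouTube ID
    if "_part" in video_id:
        return "CoPESD"
    if video_id.startswith("video"):
        return "CholecT50"
    return "unknown"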

Error Handling

Missing Ground-Truth

# If ground-truth file not found
[EvaluationWrapper] ❌ ERROR: Ground-truth file not found: /path/to/ground_truth.json

Solution: Specify correct path with --ground-truth

Unmatched Predictions

[EvaluationWrapper] ⚠️  WARNING: 10 predictions not found in ground-truth
[EvaluationWrapper]   First 5 unmatched IDs: [...]

Cause: Prediction IDs don't match ground-truth IDs

Solution: Check ID format (video_id&&start&&end&&fps must match exactly)
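
A quick way to catch mismatches before submitting is to compare submission IDs against the benchmark file (assuming both are JSON arrays whose records carry an id field):

import json

# Assumes both files are JSON arrays of records with an "id" field.
with open("submission.json") as f:
    sub_ids = {r["id"] for r in json.load(f)}
with open("cleaned_test_data_11_04.json") as f:
    bench_ids = {r["id"] for r in json.load(f)}

missing = sorted(sub_ids - bench_ids)
print(f"{len(missing)} submission IDs not found in the benchmark")
print("First 5:", missing[:5])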

Invalid ID Format

ValueError: Invalid ID format: <id_string>

Cause: ID doesn't follow video_id&&start&&end&&fps format

Solution: Fix ID format in predictions

API Keys for LLM Judge

For RC/VS evaluation with LLM judge:

export OPENAI_API_KEY="your-key"      # For GPT-4.1
export GOOGLE_API_KEY="your-key"      # For Gemini

# Then run evaluation
python evaluate_predictions.py predictions.json --tasks rc vs

Cost: ~$0.012 per RC/VS sample (GPT-4.1)

Verification Checklist

Before evaluating submissions:

# 1. Check file format
python evaluate_predictions.py submission.json --analyze-only

# 2. Verify ground-truth file exists
ls -lh /root/code/MedVidBench-Leaderboard/data/ground_truth.json

# 3. Run evaluation on a sample (first 100 records; head would cut mid-record,
#    so slice the JSON array with jq instead)
jq '.[:100]' submission.json > sample.json
python evaluate_predictions.py sample.json

# 4. If successful, run full evaluation
python evaluate_predictions.py submission.json

Performance

  • Small files (100 samples): ~5-10 seconds
  • Full benchmark (6245 samples): ~2-5 minutes
    • TAL/STG: ~30 seconds per dataset
    • Next Action: ~20 seconds per dataset
    • DVC: ~1-2 minutes (metric computation)
    • RC/VS with LLM judge: ~5-10 minutes (API calls)

Notes

  • Ground-truth file contains 1414 records (subset for leaderboard testing)
  • Full benchmark has 6245 records across 8 datasets
  • Temporary files are automatically cleaned up after evaluation
  • LLM judge can be skipped with --skip-llm-judge if scores pre-computed