# MedVidBench Leaderboard Evaluation

Auto-detection wrapper for evaluating predictions with automatic ground-truth merging.

## Overview

This evaluation system supports two input formats:

1. **Merged format** (already contains ground-truth): like `results.json`
2. **Prediction-only format** (needs ground-truth): like user submissions

The wrapper automatically detects which format you're using and handles ground-truth merging if needed.

## Quick Start

```bash
# Evaluate predictions (auto-detects format)
python evaluate_predictions.py predictions.json

# Evaluate specific tasks
python evaluate_predictions.py predictions.json --tasks tal stg

# Only analyze file structure
python evaluate_predictions.py predictions.json --analyze-only

# Use overall grouping (aggregate all datasets)
python evaluate_predictions.py predictions.json --grouping overall
```

## Input Formats

### Format 1: Prediction-Only (User Submission Format)

```json
[
  {
    "id": "kcOqlifSukA&&22425&&25124&&1.0",
    "qa_type": "tal",
    "prediction": "22.0-78.0, 89.0-94.0 seconds."
  },
  ...
]
```

**ID format**: `video_id&&start_frame&&end_frame&&fps`

**Required fields**:

- `id`: Unique identifier matching the ground-truth
- `qa_type`: Task type (`tal`, `stg`, `dvc`, `next_action`, `rc`, `vs`, `skill_assessment`, `cvs_assessment`)
- `prediction`: The model's prediction text

**What happens**: The wrapper automatically merges the predictions with `ground_truth.json` to create complete evaluation records.
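Since mismatched IDs are the failure mode the wrapper warns about (see Error Handling below), it can help to sanity-check the composite ID format before submitting. A minimal sketch (`parse_submission_id` is a hypothetical helper, not part of the wrapper):

```python
# Minimal sketch: validate a prediction-only record's composite ID.
# (Illustrative only -- not part of evaluate_predictions.py.)
def parse_submission_id(record_id: str) -> dict:
    """Split 'video_id&&start_frame&&end_frame&&fps' into its parts."""
    parts = record_id.split("&&")
    if len(parts) != 4:
        raise ValueError(f"Invalid ID format: {record_id}")
    video_id, start_frame, end_frame, fps = parts
    return {
        "video_id": video_id,
        "start_frame": int(start_frame),
        "end_frame": int(end_frame),
        "fps": float(fps),
    }

print(parse_submission_id("kcOqlifSukA&&22425&&25124&&1.0"))
# {'video_id': 'kcOqlifSukA', 'start_frame': 22425, 'end_frame': 25124, 'fps': 1.0}
```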
### Format 2: Merged (Complete Format)

```json
{
  "0": {
    "metadata": {
      "video_id": "kcOqlifSukA",
      "fps": "1.0",
      "input_video_start_frame": "22425",
      "input_video_end_frame": "25124"
    },
    "qa_type": "tal",
    "struc_info": [...],
    "question": "...",
    "gnd": "0.0-10.0 seconds.",
    "answer": "22.0-78.0, 89.0-94.0 seconds.",
    "data_source": "AVOS"
  },
  ...
}
```

**Required fields**:

- `metadata`: Video metadata (video_id, fps, frame range)
- `qa_type`: Task type
- `struc_info`: Ground-truth structured information
- `question`: Question text
- `gnd`: Ground-truth answer
- `answer`: Model prediction
- `data_source`: Dataset name

**What happens**: The wrapper uses the file directly for evaluation.

## Ground-Truth File

**Location**: `/root/code/MedVidBench-Leaderboard/data/ground_truth.json`

**Structure**: Array of records, each containing:

- Complete metadata (video_id, fps, frame range)
- `struc_info`: Structured ground-truth (spans for TAL/STG, boxes for RC, etc.)
- Ground-truth answer
- Dataset source

**Note**: This file is NOT public. Users submit prediction-only files, which are merged server-side.

## Supported Tasks

| Task | qa_type | Metrics |
|------|---------|---------|
| **TAL** | `tal` | Recall@0.3/0.5, mIoU@0.3/0.5 |
| **STG** | `stg` | IoU@0.3/0.5/0.7, mIoU |
| **DVC** | `dense_captioning`, `dense_captioning_gpt`, `dense_captioning_gemini`, `dc` | CIDEr, METEOR, Precision, Recall, F1, SODA_c |
| **Next Action** | `next_action` | Accuracy (per dataset) |
| **RC** | `region_caption`, `region_caption_gpt`, `region_caption_gemini` | LLM judge (GPT-4.1/Gemini) |
| **VS** | `video_summary`, `video_summary_gpt`, `video_summary_gemini` | LLM judge (GPT-4.1/Gemini) |
| **Skill Assessment** | `skill_assessment` | Accuracy, Macro F1, Weighted F1 |
| **CVS Assessment** | `cvs_assessment` | Accuracy, Precision, Recall, F1 |

## Usage Examples

### Example 1: Evaluate a User Submission (Prediction-Only)

```bash
# User submits predictions in prediction-only format
python evaluate_predictions.py user_predictions.json

# Output:
# [EvaluationWrapper] ✓ Detected: Prediction-only format (id, qa_type, prediction)
# [EvaluationWrapper] Merging with ground-truth...
# [EvaluationWrapper] ✓ Successfully merged 6245/6245 predictions
# ... [evaluation results] ...
```

### Example 2: Evaluate Internal Results (Already Merged)

```bash
# Internal evaluation with complete data
python evaluate_predictions.py results.json

# Output:
# [EvaluationWrapper] ✓ Detected: Predictions already contain ground-truth
# [EvaluationWrapper] Using predictions file directly for evaluation
# ... [evaluation results] ...
```

### Example 3: Specific Tasks Only

```bash
# Evaluate only the TAL and STG tasks
python evaluate_predictions.py predictions.json --tasks tal stg

# Evaluate the captioning tasks with the LLM judge
python evaluate_predictions.py predictions.json --tasks rc vs dvc
```

### Example 4: Different Grouping Modes

```bash
# Per-dataset results (default)
python evaluate_predictions.py predictions.json --grouping per-dataset

# Overall results (aggregate all datasets)
python evaluate_predictions.py predictions.json --grouping overall
```

### Example 5: Skip the LLM Judge (Use Pre-computed Scores)

```bash
# Skip LLM judge evaluation for caption tasks
# (useful when LLM scores are already pre-computed in the predictions)
python evaluate_predictions.py predictions.json --skip-llm-judge
```

### Example 6: Analyze File Structure

```bash
# Only analyze which tasks/datasets are present
python evaluate_predictions.py predictions.json --analyze-only

# Output:
# Found QA types:
#   tal: 1637 records
#   stg: 780 records
#   ...
# Found datasets:
#   AVOS: 321 records
#   CholecT50: 409 records
#   ...
```

## Command-Line Options

```
python evaluate_predictions.py PREDICTIONS_FILE [OPTIONS]

Required:
  PREDICTIONS_FILE          Path to predictions JSON (merged or prediction-only format)

Optional:
  --ground-truth PATH       Path to ground-truth JSON
                            (default: data/ground_truth.json)
  --tasks TASK [TASK ...]   Specific tasks to evaluate (default: all available)
                            Choices: dvc, tal, next_action, stg, rc, vs,
                            skill_assessment, cvs_assessment
  --grouping {per-dataset,overall}
                            Grouping strategy (default: per-dataset)
                            - per-dataset: results per dataset
                            - overall: aggregate all datasets
  --analyze-only            Only analyze file structure; no evaluation
  --skip-llm-judge          Skip LLM judge for caption tasks (use pre-computed scores)
  -h, --help                Show help message
```

## Output Format

### Per-Dataset Grouping (Default)

```
================================================================================
EVALUATION RESULTS - PER DATASET
================================================================================

AVOS:
  TAL:
    recall@0.3: 0.45    meanIoU@0.3: 0.42
    recall@0.5: 0.32    meanIoU@0.5: 0.28

CholecT50:
  TAL:
    recall@0.3: 0.52
    ...
```

### Overall Grouping

```
================================================================================
EVALUATION RESULTS - OVERALL (Dataset-Agnostic)
================================================================================

TAL - Overall Evaluation (All Datasets Combined)
  Total samples: 1637
  recall@0.3: 0.48    meanIoU@0.3: 0.45
  recall@0.5: 0.35    meanIoU@0.5: 0.30
```
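The two grouping modes differ only in how records are bucketed before metrics are computed. A minimal sketch of that bucketing (`group_records` is a hypothetical helper, not the wrapper's actual API):

```python
# Minimal sketch: how --grouping might bucket merged records before scoring.
# (Illustrative only; the actual logic lives in evaluate_all_pai.py.)
from collections import defaultdict

def group_records(records: list[dict], grouping: str = "per-dataset") -> dict:
    """Bucket merged records by (dataset, task) or by task alone."""
    buckets = defaultdict(list)
    for rec in records:
        if grouping == "per-dataset":
            key = (rec["data_source"], rec["qa_type"])  # e.g. ("AVOS", "tal")
        else:  # "overall": all datasets combined per task
            key = rec["qa_type"]
        buckets[key].append(rec)
    return buckets
```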
## Workflow for User Submissions

1. **User downloads the benchmark**: `/root/code/MedVidBench/cleaned_test_data_11_04.json`
   - Contains questions but NO ground-truth (`struc_info` removed)
2. **User runs inference**: generates a prediction for each sample
3. **User submits predictions** in the prediction-only format:
   ```json
   [
     {
       "id": "",
       "qa_type": "",
       "prediction": ""
     },
     ...
   ]
   ```
4. **Server evaluates**:
   ```bash
   python evaluate_predictions.py user_submission.json
   ```
   - Auto-detects the format
   - Merges with the server-side ground-truth
   - Runs the evaluation
   - Returns metrics

## File Structure

```
evaluation/
├── README.md                   # This file
├── evaluate_predictions.py     # Main wrapper (auto-detection + merging)
├── evaluate_all_pai.py         # Core evaluation orchestrator
├── eval_tal.py                 # TAL evaluation
├── eval_stg.py                 # STG evaluation
├── eval_dvc.py                 # Dense captioning evaluation
├── eval_next_action.py         # Next action evaluation
├── eval_caption_llm_judge.py   # RC/VS LLM judge evaluation
├── eval_skill_assessment.py    # Skill assessment evaluation
└── eval_cvs_assessment.py      # CVS assessment evaluation
```

## Key Features

### Auto-Detection Logic

The wrapper detects the format by checking for these indicators:

**Prediction-only format**:
- Has an `id` field (`video_id&&start&&end&&fps`)
- Has a `prediction` field
- Missing `gnd` or `struc_info`

**Merged format**:
- Has a `question` field
- Has a `gnd` field (ground-truth)
- Has a `struc_info` field (structured GT)
- Has a `metadata` dict

### Ground-Truth Merging

When the prediction-only format is detected, the wrapper:

1. Loads the predictions and the ground-truth
2. Builds an index: `{id -> ground_truth_record}`
3. For each prediction:
   - Looks up the ground-truth by ID
   - Merges it into a complete record
   - Adds `data_source` from the ground-truth
4. Saves the result to a temporary file
5. Runs the evaluation
6. Cleans up the temporary file

### Dataset Detection

Datasets are detected from, in order:

1. The `data_source` field (primary, leaderboard format)
2. The `dataset` field (fallback)
3. The `dataset_name` field (fallback)
4. Video ID patterns (last resort):
   - YouTube IDs (11 characters with letters) → AVOS
   - `*_part*` pattern → CoPESD
   - `video*` pattern → CholecT50
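Conceptually, the merging step above reduces to an index lookup plus a dictionary merge. A minimal sketch, assuming each ground-truth record carries the same composite `id` field and that the model's prediction becomes the merged record's `answer` (illustrative only; the real implementation lives in `evaluate_predictions.py`):

```python
# Minimal sketch of the ground-truth merge described above (illustrative only).
import json

def merge_with_ground_truth(pred_path: str, gt_path: str) -> list[dict]:
    """Merge prediction-only records with ground-truth records by ID."""
    with open(pred_path) as f:
        predictions = json.load(f)
    with open(gt_path) as f:
        ground_truth = json.load(f)

    # Step 2: build the {id -> ground_truth_record} index
    # (assumes each ground-truth record exposes the composite id).
    gt_index = {rec["id"]: rec for rec in ground_truth}

    merged, unmatched = [], []
    for pred in predictions:
        gt = gt_index.get(pred["id"])
        if gt is None:
            unmatched.append(pred["id"])
            continue
        # Step 3: merge into a complete record; the prediction text
        # fills the `answer` field of the merged format.
        merged.append({**gt, "answer": pred["prediction"]})

    if unmatched:
        print(f"WARNING: {len(unmatched)} predictions not found in ground-truth")
    return merged
```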
## Error Handling

### Missing Ground-Truth

```bash
# If the ground-truth file is not found
[EvaluationWrapper] ❌ ERROR: Ground-truth file not found: /path/to/ground_truth.json
```

**Solution**: Specify the correct path with `--ground-truth`.

### Unmatched Predictions

```bash
[EvaluationWrapper] ⚠️ WARNING: 10 predictions not found in ground-truth
[EvaluationWrapper] First 5 unmatched IDs: [...]
```

**Cause**: Prediction IDs don't match the ground-truth IDs.
**Solution**: Check the ID format (`video_id&&start&&end&&fps` must match exactly).

### Invalid ID Format

```bash
ValueError: Invalid ID format:
```

**Cause**: The ID doesn't follow the `video_id&&start&&end&&fps` format.
**Solution**: Fix the ID format in the predictions.

## API Keys for LLM Judge

For RC/VS evaluation with the LLM judge:

```bash
export OPENAI_API_KEY="your-key"   # For GPT-4.1
export GOOGLE_API_KEY="your-key"   # For Gemini

# Then run the evaluation
python evaluate_predictions.py predictions.json --tasks rc vs
```

**Cost**: ~$0.012 per RC/VS sample (GPT-4.1)

## Verification Checklist

Before evaluating submissions:

```bash
# 1. Check the file format
python evaluate_predictions.py submission.json --analyze-only

# 2. Verify the ground-truth file exists
ls -lh /root/code/MedVidBench-Leaderboard/data/ground_truth.json

# 3. Run the evaluation on a sample (first 100 records; note that
#    `head` would cut the JSON array mid-record, so slice it properly)
python -c "import json; json.dump(json.load(open('submission.json'))[:100], open('sample.json', 'w'))"
python evaluate_predictions.py sample.json

# 4. If successful, run the full evaluation
python evaluate_predictions.py submission.json
```

## Performance

- **Small files** (100 samples): ~5-10 seconds
- **Full benchmark** (6245 samples): ~2-5 minutes
  - TAL/STG: ~30 seconds per dataset
  - Next Action: ~20 seconds per dataset
  - DVC: ~1-2 minutes (metric computation)
  - RC/VS with LLM judge: ~5-10 minutes (API calls)

## Notes

- The ground-truth file contains **1414 records** (a subset for leaderboard testing)
- The full benchmark has **6245 records** across 8 datasets
- Temporary files are automatically cleaned up after evaluation
- The LLM judge can be skipped with `--skip-llm-judge` if scores are pre-computed
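As a final pre-upload check against the counts above, a short script can confirm that every record is well-formed and that the totals look right. A minimal sketch, assuming the prediction-only format and a hypothetical `user_predictions.json`:

```python
# Minimal pre-submission sanity check (illustrative, not part of the repo).
import json
from collections import Counter

with open("user_predictions.json") as f:  # hypothetical submission file
    preds = json.load(f)

# Every record needs the three required prediction-only fields.
missing = [r for r in preds if not {"id", "qa_type", "prediction"} <= r.keys()]
print(f"{len(preds)} records, {len(missing)} with missing fields")
# Expect 6245 records for the full benchmark.
print(Counter(r["qa_type"] for r in preds))  # per-task record counts
```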