MedGRPO Team committed on
Commit a605ebb · 1 Parent(s): 04f5f37
evaluation/README.md ADDED
@@ -0,0 +1,409 @@
+ # MedVidBench Leaderboard Evaluation
+
+ Auto-detection wrapper for evaluating predictions with automatic ground-truth merging.
+
+ ## Overview
+
+ This evaluation system supports two input formats:
+
+ 1. **Merged format** (already contains ground-truth): like `results.json`
+ 2. **Prediction-only format** (needs ground-truth): like user submissions
+
+ The wrapper automatically detects which format you're using and handles ground-truth merging if needed.
+
+ ## Quick Start
+
+ ```bash
+ # Evaluate predictions (auto-detects format)
+ python evaluate_predictions.py <predictions_file>
+
+ # Evaluate specific tasks
+ python evaluate_predictions.py <predictions_file> --tasks tal stg
+
+ # Only analyze file structure
+ python evaluate_predictions.py <predictions_file> --analyze-only
+
+ # Use overall grouping (aggregate all datasets)
+ python evaluate_predictions.py <predictions_file> --grouping overall
+ ```
+
+ ## Input Formats
+
+ ### Format 1: Prediction-Only (User Submission Format)
+
+ ```json
+ [
+   {
+     "id": "kcOqlifSukA&&22425&&25124&&1.0",
+     "qa_type": "tal",
+     "prediction": "22.0-78.0, 89.0-94.0 seconds."
+   },
+   ...
+ ]
+ ```
+
+ **ID format**: `video_id&&start_frame&&end_frame&&fps`
+
+ **Required fields**:
+ - `id`: Unique identifier matching the ground-truth
+ - `qa_type`: Task type (tal, stg, dvc, next_action, rc, vs, skill_assessment, cvs_assessment)
+ - `prediction`: The model's prediction text
+
+ **What happens**: The wrapper automatically merges with `ground_truth.json` to create complete evaluation records.
+
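The composite ID can be split back into its components with a few lines of Python; this sketch mirrors the `parse_id` helper shipped in `evaluate_predictions.py` (all four components are kept as strings, matching that helper):

```python
def parse_id(id_str: str) -> dict:
    """Split 'video_id&&start_frame&&end_frame&&fps' into its components."""
    parts = id_str.split("&&")
    if len(parts) != 4:
        raise ValueError(f"Invalid ID format: {id_str}")
    return {
        "video_id": parts[0],
        "input_video_start_frame": parts[1],
        "input_video_end_frame": parts[2],
        "fps": parts[3],
    }

# Example from the submission format above
components = parse_id("kcOqlifSukA&&22425&&25124&&1.0")
```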
+ ### Format 2: Merged (Complete Format)
+
+ ```json
+ {
+   "0": {
+     "metadata": {
+       "video_id": "kcOqlifSukA",
+       "fps": "1.0",
+       "input_video_start_frame": "22425",
+       "input_video_end_frame": "25124"
+     },
+     "qa_type": "tal",
+     "struc_info": [...],
+     "question": "...",
+     "gnd": "0.0-10.0 seconds.",
+     "answer": "22.0-78.0, 89.0-94.0 seconds.",
+     "data_source": "AVOS"
+   },
+   ...
+ }
+ ```
+
+ **Required fields**:
+ - `metadata`: Video metadata (video_id, fps, frame range)
+ - `qa_type`: Task type
+ - `struc_info`: Ground-truth structured information
+ - `question`: Question text
+ - `gnd`: Ground-truth answer
+ - `answer`: Model prediction
+ - `data_source`: Dataset name
+
+ **What happens**: The wrapper uses the file directly for evaluation.
+
+ ## Ground-Truth File
+
+ **Location**: `/root/code/MedVidBench-Leaderboard/data/ground_truth.json`
+
+ **Structure**: Array of records, each containing:
+ - Complete metadata (video_id, fps, frame range)
+ - `struc_info`: Structured ground-truth (spans for TAL/STG, boxes for RC, etc.)
+ - Ground-truth answer
+ - Dataset source
+
+ **Note**: This file is NOT public. Users submit prediction-only files, which are merged server-side.
+
+ ## Supported Tasks
+
+ | Task | qa_type | Metrics |
+ |------|---------|---------|
+ | **TAL** | `tal` | Recall@0.3/0.5, mIoU@0.3/0.5 |
+ | **STG** | `stg` | IoU@0.3/0.5/0.7, mIoU |
+ | **DVC** | `dense_captioning`, `dense_captioning_gpt`, `dense_captioning_gemini`, `dc` | CIDEr, METEOR, Precision, Recall, F1, SODA_c |
+ | **Next Action** | `next_action` | Accuracy (per-dataset) |
+ | **RC** | `region_caption`, `region_caption_gpt`, `region_caption_gemini` | LLM judge (GPT-4.1/Gemini) |
+ | **VS** | `video_summary`, `video_summary_gpt`, `video_summary_gemini` | LLM judge (GPT-4.1/Gemini) |
+ | **Skill Assessment** | `skill_assessment` | Accuracy, Macro F1, Weighted F1 |
+ | **CVS Assessment** | `cvs_assessment` | Accuracy, Precision, Recall, F1 |
+
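The TAL and STG metrics in the table are built on temporal IoU between predicted and ground-truth spans. A minimal illustrative sketch follows; the authoritative matching protocol lives in `eval_tal.py` and `eval_stg.py`:

```python
def temporal_iou(pred: tuple, gt: tuple) -> float:
    """IoU between two (start, end) spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A prediction counts toward Recall@0.3 when its IoU with a
# ground-truth span reaches at least 0.3.
score = temporal_iou((22.0, 78.0), (0.0, 10.0))  # disjoint spans
```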
+ ## Usage Examples
+
+ ### Example 1: Evaluate a User Submission (Prediction-Only)
+
+ ```bash
+ # User submits predictions in prediction-only format
+ python evaluate_predictions.py user_predictions.json
+
+ # Output:
+ # [EvaluationWrapper] ✓ Detected: Prediction-only format (id, qa_type, prediction)
+ # [EvaluationWrapper] Merging with ground-truth...
+ # [EvaluationWrapper] ✓ Successfully merged 6245/6245 predictions
+ # ... [evaluation results] ...
+ ```
+
+ ### Example 2: Evaluate Internal Results (Already Merged)
+
+ ```bash
+ # Internal evaluation with complete data
+ python evaluate_predictions.py results.json
+
+ # Output:
+ # [EvaluationWrapper] ✓ Detected: Predictions already contain ground-truth
+ # [EvaluationWrapper] Using predictions file directly for evaluation
+ # ... [evaluation results] ...
+ ```
+
+ ### Example 3: Specific Tasks Only
+
+ ```bash
+ # Evaluate only the TAL and STG tasks
+ python evaluate_predictions.py predictions.json --tasks tal stg
+
+ # Evaluate captioning tasks with the LLM judge
+ python evaluate_predictions.py predictions.json --tasks rc vs dvc
+ ```
+
+ ### Example 4: Different Grouping Modes
+
+ ```bash
+ # Per-dataset results (default)
+ python evaluate_predictions.py predictions.json --grouping per-dataset
+
+ # Overall results (aggregate all datasets)
+ python evaluate_predictions.py predictions.json --grouping overall
+ ```
+
+ ### Example 5: Skip the LLM Judge (Use Pre-computed Scores)
+
+ ```bash
+ # Skip LLM judge evaluation for caption tasks
+ # Useful when LLM scores are already pre-computed in the predictions
+ python evaluate_predictions.py predictions.json --skip-llm-judge
+ ```
+
+ ### Example 6: Analyze File Structure
+
+ ```bash
+ # Only analyze which tasks/datasets are present
+ python evaluate_predictions.py predictions.json --analyze-only
+
+ # Output:
+ # Found QA types:
+ #   tal: 1637 records
+ #   stg: 780 records
+ #   ...
+ # Found datasets:
+ #   AVOS: 321 records
+ #   CholecT50: 409 records
+ #   ...
+ ```
+
+ ## Command-Line Options
+
+ ```
+ python evaluate_predictions.py PREDICTIONS_FILE [OPTIONS]
+
+ Required:
+   PREDICTIONS_FILE          Path to predictions JSON (merged or prediction-only format)
+
+ Optional:
+   --ground-truth PATH       Path to ground-truth JSON (default: data/ground_truth.json)
+   --tasks TASK [TASK ...]   Specific tasks to evaluate (default: all available)
+                             Choices: dvc, tal, next_action, stg, rc, vs,
+                                      skill_assessment, cvs_assessment
+   --grouping {per-dataset,overall}
+                             Grouping strategy (default: per-dataset)
+                             - per-dataset: Results per dataset
+                             - overall: Aggregate all datasets
+   --analyze-only            Only analyze file structure, no evaluation
+   --skip-llm-judge          Skip LLM judge for caption tasks (use pre-computed scores)
+   -h, --help                Show help message
+ ```
+
+ ## Output Format
+
+ ### Per-Dataset Grouping (Default)
+
+ ```
+ ================================================================================
+ EVALUATION RESULTS - PER DATASET
+ ================================================================================
+
+ AVOS:
+   TAL:
+     recall@0.3: 0.45
+     meanIoU@0.3: 0.42
+     recall@0.5: 0.32
+     meanIoU@0.5: 0.28
+
+ CholecT50:
+   TAL:
+     recall@0.3: 0.52
+     ...
+ ```
+
+ ### Overall Grouping
+
+ ```
+ ================================================================================
+ EVALUATION RESULTS - OVERALL (Dataset-Agnostic)
+ ================================================================================
+
+ TAL - Overall Evaluation (All Datasets Combined)
+ Total samples: 1637
+
+   recall@0.3: 0.48
+   meanIoU@0.3: 0.45
+   recall@0.5: 0.35
+   meanIoU@0.5: 0.30
+ ```
+
+ ## Workflow for User Submissions
+
+ 1. **User downloads the benchmark**: `/root/code/MedVidBench/cleaned_test_data_11_04.json`
+    - Contains questions but NO ground-truth (struc_info removed)
+
+ 2. **User runs inference**: Generates a prediction for each sample
+
+ 3. **User submits predictions** in the prediction-only format:
+    ```json
+    [
+      {
+        "id": "<from benchmark>",
+        "qa_type": "<from benchmark>",
+        "prediction": "<model output>"
+      },
+      ...
+    ]
+    ```
+
+ 4. **Server evaluates**:
+    ```bash
+    python evaluate_predictions.py user_submission.json
+    ```
+    - Auto-detects the format
+    - Merges with the server-side ground-truth
+    - Runs the evaluation
+    - Returns metrics
+
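The steps above can be sketched as a small submission builder. This is illustrative only: it assumes the public benchmark records expose `id` and `qa_type` fields, and `run_model` is a placeholder for your own inference function:

```python
import json

def build_submission(benchmark_records, run_model):
    """Pair each benchmark record with a model output in prediction-only format."""
    return [
        {"id": rec["id"], "qa_type": rec["qa_type"], "prediction": run_model(rec)}
        for rec in benchmark_records
    ]

# Demo with a stub model that always returns the same span
records = [{"id": "kcOqlifSukA&&22425&&25124&&1.0", "qa_type": "tal"}]
submission = build_submission(records, lambda rec: "22.0-78.0 seconds.")
payload = json.dumps(submission, indent=2)  # write this out as user_submission.json
```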
+ ## File Structure
+
+ ```
+ evaluation/
+ ├── README.md                  # This file
+ ├── evaluate_predictions.py    # Main wrapper (auto-detection + merging)
+ ├── evaluate_all_pai.py        # Core evaluation orchestrator
+ ├── eval_tal.py                # TAL evaluation
+ ├── eval_stg.py                # STG evaluation
+ ├── eval_dvc.py                # Dense captioning evaluation
+ ├── eval_next_action.py        # Next action evaluation
+ ├── eval_caption_llm_judge.py  # RC/VS LLM judge evaluation
+ ├── eval_skill_assessment.py   # Skill assessment evaluation
+ └── eval_cvs_assessment.py     # CVS assessment evaluation
+ ```
+
+ ## Key Features
+
+ ### Auto-Detection Logic
+
+ The wrapper detects the format by checking for these indicators:
+
+ **Prediction-only format**:
+ - Has an `id` field (`video_id&&start&&end&&fps`)
+ - Has a `prediction` field
+ - Missing `gnd` or `struc_info`
+
+ **Merged format**:
+ - Has a `question` field
+ - Has a `gnd` field (ground-truth)
+ - Has a `struc_info` field (structured GT)
+ - Has a `metadata` dict
+
+ ### Ground-Truth Merging
+
+ When the prediction-only format is detected:
+
+ 1. Load predictions and ground-truth
+ 2. Build an index: `{id -> ground_truth_record}`
+ 3. For each prediction:
+    - Look up the ground-truth by ID
+    - Merge into a complete record
+    - Add `data_source` from the ground-truth
+ 4. Save to a temporary file
+ 5. Run the evaluation
+ 6. Clean up the temporary file
+
+ ### Dataset Detection
+
+ Datasets are detected from:
+ 1. **`data_source` field** (primary, leaderboard format)
+ 2. `dataset` field (fallback)
+ 3. `dataset_name` field (fallback)
+ 4. Video ID patterns (last resort):
+    - YouTube IDs (11 chars with letters) → AVOS
+    - `*_part*` pattern → CoPESD
+    - `video*` pattern → CholecT50
+
+ ## Error Handling
+
+ ### Missing Ground-Truth
+
+ ```bash
+ # If the ground-truth file is not found
+ [EvaluationWrapper] ❌ ERROR: Ground-truth file not found: /path/to/ground_truth.json
+ ```
+
+ **Solution**: Specify the correct path with `--ground-truth`
+
+ ### Unmatched Predictions
+
+ ```bash
+ [EvaluationWrapper] ⚠️ WARNING: 10 predictions not found in ground-truth
+ [EvaluationWrapper] First 5 unmatched IDs: [...]
+ ```
+
+ **Cause**: Prediction IDs don't match the ground-truth IDs
+
+ **Solution**: Check the ID format (`video_id&&start&&end&&fps` must match exactly)
+
+ ### Invalid ID Format
+
+ ```bash
+ ValueError: Invalid ID format: <id_string>
+ ```
+
+ **Cause**: The ID doesn't follow the `video_id&&start&&end&&fps` format
+
+ **Solution**: Fix the ID format in the predictions
+
+ ## API Keys for LLM Judge
+
+ For RC/VS evaluation with the LLM judge:
+
+ ```bash
+ export OPENAI_API_KEY="your-key"   # For GPT-4.1
+ export GOOGLE_API_KEY="your-key"   # For Gemini
+
+ # Then run the evaluation
+ python evaluate_predictions.py predictions.json --tasks rc vs
+ ```
+
+ **Cost**: ~$0.012 per RC/VS sample (GPT-4.1)
+
+ ## Verification Checklist
+
+ Before evaluating submissions:
+
+ ```bash
+ # 1. Check the file format
+ python evaluate_predictions.py submission.json --analyze-only
+
+ # 2. Verify the ground-truth file exists
+ ls -lh /root/code/MedVidBench-Leaderboard/data/ground_truth.json
+
+ # 3. Run the evaluation on a sample (first 100 records; `head` would
+ #    truncate the JSON array, so slice it with Python instead)
+ python -c "import json; json.dump(json.load(open('submission.json'))[:100], open('sample.json', 'w'))"
+ python evaluate_predictions.py sample.json
+
+ # 4. If successful, run the full evaluation
+ python evaluate_predictions.py submission.json
+ ```
+
+ ## Performance
+
+ - **Small files** (100 samples): ~5-10 seconds
+ - **Full benchmark** (6245 samples): ~2-5 minutes
+   - TAL/STG: ~30 seconds per dataset
+   - Next Action: ~20 seconds per dataset
+   - DVC: ~1-2 minutes (metric computation)
+   - RC/VS with LLM judge: ~5-10 minutes (API calls)
+
+ ## Notes
+
+ - The ground-truth file contains **1414 records** (subset for leaderboard testing)
+ - The full benchmark has **6245 records** across 8 datasets
+ - Temporary files are automatically cleaned up after evaluation
+ - The LLM judge can be skipped with `--skip-llm-judge` if scores are pre-computed
evaluation/eval_dvc.py CHANGED
@@ -112,7 +112,8 @@ def group_records_by_dataset(data):
         if not any(x in qa_type.lower() for x in ['dense_captioning', 'dense_caption', 'dc']):
             continue
 
-        dataset = record.get('dataset', record.get('dataset_name', record.get('metadata', {}).get('dataset', 'Unknown')))
+        # Check data_source first (leaderboard format), then fall back to dataset/dataset_name
+        dataset = record.get('data_source', record.get('dataset', record.get('dataset_name', record.get('metadata', {}).get('dataset', 'Unknown'))))
         video_id = record.get('video_id', record.get('metadata', {}).get('video_id', ''))
 
         if dataset == 'Unknown' and video_id:
evaluation/eval_next_action.py CHANGED
@@ -494,7 +494,8 @@ def group_records_by_dataset(data):
         if 'next_action' not in qa_type.lower():
             continue
 
-        dataset = record.get('dataset', record.get('dataset_name', record.get('metadata', {}).get('dataset', 'Unknown')))
+        # Check data_source first (leaderboard format), then fall back to dataset/dataset_name
+        dataset = record.get('data_source', record.get('dataset', record.get('dataset_name', record.get('metadata', {}).get('dataset', 'Unknown'))))
         video_id = record.get('video_id', record.get('metadata', {}).get('video_id', ''))
 
         if dataset == 'Unknown' and video_id:
evaluation/eval_stg.py CHANGED
@@ -210,7 +210,8 @@ def group_records_by_dataset(data):
         if 'stg' not in qa_type.lower():
             continue
 
-        dataset = record.get('dataset', record.get('dataset_name', record.get('metadata', {}).get('dataset', 'Unknown')))
+        # Check data_source first (leaderboard format), then fall back to dataset/dataset_name
+        dataset = record.get('data_source', record.get('dataset', record.get('dataset_name', record.get('metadata', {}).get('dataset', 'Unknown'))))
         video_id = record.get('video_id', record.get('metadata', {}).get('video_id', ''))
 
         if dataset == 'Unknown' and video_id:
evaluation/eval_tal.py CHANGED
@@ -139,7 +139,8 @@ def group_records_by_dataset(data):
 
         # Detect dataset from video_id or other fields
         video_id = record.get('video_id', '')
-        dataset = record.get('dataset', record.get('dataset_name', 'Unknown'))
+        # Check data_source first (used in leaderboard format), then fall back to dataset/dataset_name
+        dataset = record.get('data_source', record.get('dataset', record.get('dataset_name', 'Unknown')))
 
         if dataset == 'Unknown' and video_id:
             # Try to infer from video_id patterns
evaluation/evaluate_predictions.py ADDED
@@ -0,0 +1,279 @@
+ """Auto-detect prediction format and evaluate with ground-truth merging if needed."""
+
+ import json
+ import sys
+ import argparse
+ import os
+ import tempfile
+ from pathlib import Path
+
+ # Add evaluation directory to path to import evaluate_all_pai
+ eval_dir = Path(__file__).parent
+ sys.path.insert(0, str(eval_dir))
+
+ import evaluate_all_pai
+
+
+ def detect_has_ground_truth(data):
+     """Detect whether the prediction file already contains ground-truth.
+
+     Args:
+         data: Loaded JSON data (dict or list)
+
+     Returns:
+         bool: True if ground-truth is present, False otherwise
+     """
+     # Handle both dict and list formats
+     if isinstance(data, dict):
+         if not data:
+             return False
+         # Check the first record
+         first_key = next(iter(data))
+         sample = data[first_key]
+     elif isinstance(data, list):
+         if not data:
+             return False
+         sample = data[0]
+     else:
+         return False
+
+     # Check for ground-truth indicators.
+     # results.json format has: question, gnd, answer, struc_info, metadata, qa_type, data_source
+     has_question = 'question' in sample
+     has_gnd = 'gnd' in sample
+     has_struc_info = 'struc_info' in sample
+
+     # predictions_only.json format has: id, qa_type, prediction
+     has_id = 'id' in sample
+     has_prediction = 'prediction' in sample
+
+     # If it has the id + prediction shape, it's prediction-only
+     if has_id and has_prediction and not has_gnd:
+         return False
+
+     # If it has question + gnd + struc_info, it's already merged
+     if has_question and has_gnd and has_struc_info:
+         return True
+
+     # Default: assume merging is needed if the format is unclear
+     return False
+
+
+ def parse_id(id_str):
+     """Parse an ID string into its components.
+
+     Format: video_id&&start_frame&&end_frame&&fps
+     Example: "kcOqlifSukA&&22425&&25124&&1.0"
+
+     Returns:
+         dict: {'video_id': str, 'input_video_start_frame': str,
+                'input_video_end_frame': str, 'fps': str}
+     """
+     parts = id_str.split('&&')
+     if len(parts) != 4:
+         raise ValueError(f"Invalid ID format: {id_str}")
+
+     return {
+         'video_id': parts[0],
+         'input_video_start_frame': parts[1],
+         'input_video_end_frame': parts[2],
+         'fps': parts[3]
+     }
+
+
+ def merge_with_ground_truth(predictions_file, ground_truth_file):
+     """Merge a prediction-only file with the ground-truth.
+
+     Args:
+         predictions_file: Path to predictions JSON (id, qa_type, prediction format)
+         ground_truth_file: Path to ground-truth JSON
+
+     Returns:
+         dict: Merged data in results.json format
+     """
+     print(f"[EvaluationWrapper] Loading predictions from {predictions_file}")
+     with open(predictions_file, 'r') as f:
+         predictions = json.load(f)
+
+     print(f"[EvaluationWrapper] Loading ground-truth from {ground_truth_file}")
+     with open(ground_truth_file, 'r') as f:
+         ground_truth = json.load(f)
+
+     # Build a lookup index for the ground-truth
+     print("[EvaluationWrapper] Building ground-truth index...")
+     gt_index = {}
+     for record in ground_truth:
+         metadata = record.get('metadata', {})
+         # Create the key from metadata, mirroring the submission ID format
+         key = f"{metadata.get('video_id')}&&{metadata.get('input_video_start_frame')}&&{metadata.get('input_video_end_frame')}&&{metadata.get('fps')}"
+         gt_index[key] = record
+
+     print(f"[EvaluationWrapper] Ground-truth index size: {len(gt_index)} records")
+     print(f"[EvaluationWrapper] Predictions to merge: {len(predictions)} records")
+
+     # Merge predictions with ground-truth
+     merged = {}
+     matched_count = 0
+     unmatched_ids = []
+
+     for i, pred in enumerate(predictions):
+         pred_id = pred.get('id')
+         if not pred_id:
+             print(f"[EvaluationWrapper] ⚠️ WARNING: Prediction {i} missing 'id' field, skipping")
+             continue
+
+         # Look up the ground-truth record
+         if pred_id not in gt_index:
+             unmatched_ids.append(pred_id)
+             continue
+
+         gt_record = gt_index[pred_id]
+
+         # Create the merged record (ensure data_source is properly set)
+         data_source = gt_record.get('data_source', 'Unknown')
+         # Fall back to dataset_name if data_source is missing
+         if data_source == 'Unknown' or not data_source:
+             data_source = gt_record.get('dataset_name', 'Unknown')
+
+         merged_record = {
+             'metadata': gt_record.get('metadata', {}),
+             'qa_type': pred.get('qa_type'),
+             'struc_info': gt_record.get('struc_info', []),
+             'question': gt_record.get('question', ''),
+             'gnd': gt_record.get('answer', ''),       # Ground-truth answer
+             'answer': pred.get('prediction', ''),     # Model prediction
+             'data_source': data_source
+         }
+
+         # Use sequential keys like results.json
+         merged[str(i)] = merged_record
+         matched_count += 1
+
+     print(f"[EvaluationWrapper] ✓ Successfully merged {matched_count}/{len(predictions)} predictions")
+
+     if unmatched_ids:
+         print(f"[EvaluationWrapper] ⚠️ WARNING: {len(unmatched_ids)} predictions not found in ground-truth")
+         if len(unmatched_ids) <= 5:
+             print(f"[EvaluationWrapper] Unmatched IDs: {unmatched_ids}")
+         else:
+             print(f"[EvaluationWrapper] First 5 unmatched IDs: {unmatched_ids[:5]}")
+
+     return merged
+
+
+ def main():
+     """Main function with command-line interface."""
+     parser = argparse.ArgumentParser(
+         description="Evaluate predictions with automatic ground-truth merging"
+     )
+     parser.add_argument("predictions_file",
+                         help="Path to predictions JSON file (merged or prediction-only format)")
+     parser.add_argument("--ground-truth",
+                         default="/root/code/MedVidBench-Leaderboard/data/ground_truth.json",
+                         help="Path to ground-truth JSON file (default: data/ground_truth.json)")
+     parser.add_argument("--tasks", nargs="+",
+                         choices=["dvc", "tal", "next_action", "stg", "rc", "vs",
+                                  "skill_assessment", "cvs_assessment", "gemini_structured", "gpt_structured"],
+                         help="Specific tasks to evaluate (default: all available tasks)")
+     parser.add_argument("--grouping", choices=["per-dataset", "overall"], default="per-dataset",
+                         help="Grouping strategy: 'per-dataset' or 'overall' (default: per-dataset)")
+     parser.add_argument("--analyze-only", action="store_true",
+                         help="Only analyze the file structure without running evaluations")
+     parser.add_argument("--skip-llm-judge", action="store_true",
+                         help="Skip LLM judge evaluation for caption tasks (use when LLM scores are pre-computed)")
+
+     args = parser.parse_args()
+
+     # Load predictions
+     print(f"[EvaluationWrapper] Loading predictions from {args.predictions_file}")
+     with open(args.predictions_file, 'r') as f:
+         predictions_data = json.load(f)
+
+     # Auto-detect the format
+     has_ground_truth = detect_has_ground_truth(predictions_data)
+
+     if has_ground_truth:
+         print("[EvaluationWrapper] ✓ Detected: Predictions already contain ground-truth")
+         print("[EvaluationWrapper] Using predictions file directly for evaluation")
+         eval_file = args.predictions_file
+     else:
+         print("[EvaluationWrapper] ✓ Detected: Prediction-only format (id, qa_type, prediction)")
+         print("[EvaluationWrapper] Merging with ground-truth...")
+
+         # Check that the ground-truth file exists
+         if not os.path.exists(args.ground_truth):
+             print(f"[EvaluationWrapper] ❌ ERROR: Ground-truth file not found: {args.ground_truth}")
+             sys.exit(1)
+
+         # Merge predictions with the ground-truth
+         merged_data = merge_with_ground_truth(args.predictions_file, args.ground_truth)
+
+         # Save the merged data to a temporary file
+         with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
+             json.dump(merged_data, f, indent=2)
+             eval_file = f.name
+
+         print(f"[EvaluationWrapper] ✓ Merged data saved to temporary file: {eval_file}")
+
+     # Call evaluate_all_pai with the appropriate file
+     print(f"\n[EvaluationWrapper] {'='*80}")
+     print(f"[EvaluationWrapper] Starting evaluation with evaluate_all_pai.py")
+     print(f"[EvaluationWrapper] {'='*80}\n")
+
+     # Set sys.argv for evaluate_all_pai
+     eval_args = [eval_file]
+     if args.tasks:
+         eval_args.extend(["--tasks"] + args.tasks)
+     if args.grouping:
+         eval_args.extend(["--grouping", args.grouping])
+     if args.analyze_only:
+         eval_args.append("--analyze-only")
+     if args.skip_llm_judge:
+         eval_args.append("--skip-llm-judge")
+
+     original_argv = sys.argv
+     sys.argv = ["evaluate_all_pai.py"] + eval_args
+
+     try:
+         # Run the evaluation
+         if args.analyze_only:
+             qa_type_counts, dataset_counts = evaluate_all_pai.analyze_output_file(eval_file)
+             # Determine which tasks are available
+             available_tasks = []
+             if any("dense_captioning" in qa_type or qa_type == "dc" for qa_type in qa_type_counts):
+                 available_tasks.append("dvc")
+             if qa_type_counts.get("tal", 0) > 0:
+                 available_tasks.append("tal")
+             if qa_type_counts.get("next_action", 0) > 0:
+                 available_tasks.append("next_action")
+             if qa_type_counts.get("stg", 0) > 0:
+                 available_tasks.append("stg")
+             if any("region_caption" in qa_type for qa_type in qa_type_counts):
+                 available_tasks.append("rc")
+             if any("video_summary" in qa_type for qa_type in qa_type_counts):
+                 available_tasks.append("vs")
+             if qa_type_counts.get("skill_assessment", 0) > 0:
+                 available_tasks.append("skill_assessment")
+             if qa_type_counts.get("cvs_assessment", 0) > 0:
+                 available_tasks.append("cvs_assessment")
+
+             evaluate_all_pai.print_evaluation_results_csv(eval_file, available_tasks)
+         else:
+             silent_eval = (args.grouping == "overall")
+             evaluate_all_pai.run_evaluation(
+                 eval_file,
+                 args.tasks,
+                 grouping=args.grouping,
+                 silent_eval=silent_eval,
+                 skip_llm_judge=args.skip_llm_judge
+             )
+     finally:
+         sys.argv = original_argv
+
+     # Clean up the temporary file if we created one
+     if not has_ground_truth and os.path.exists(eval_file):
+         os.unlink(eval_file)
+         print(f"\n[EvaluationWrapper] ✓ Cleaned up temporary file: {eval_file}")
+
+
+ if __name__ == "__main__":
+     main()
evaluation/test_evaluation.sh ADDED
@@ -0,0 +1,140 @@
+ #!/bin/bash
+ # Comprehensive test script for the MedVidBench evaluation system
+
+ set -e  # Exit on error (test commands run inside `if` so failures are handled explicitly)
+
+ echo "============================================================"
+ echo "Testing MedVidBench Leaderboard Evaluation System"
+ echo "============================================================"
+
+ cd /root/code/MedVidBench-Leaderboard/evaluation
+
+ # Color codes
+ GREEN='\033[0;32m'
+ RED='\033[0;31m'
+ BLUE='\033[0;34m'
+ NC='\033[0m'  # No Color
+
+ # Test 1: Analyze-only mode with results.json
+ echo -e "\n${BLUE}Test 1: Analyze-only mode (complete format)${NC}"
+ if python evaluate_all_pai.py ../data/results.json --analyze-only > /dev/null 2>&1; then
+     echo -e "${GREEN}✓ PASSED: Analyze-only mode works${NC}"
+ else
+     echo -e "${RED}✗ FAILED: Analyze-only mode${NC}"
+     exit 1
+ fi
+
+ # Test 2: TAL evaluation (per-dataset)
+ echo -e "\n${BLUE}Test 2: TAL evaluation (per-dataset grouping)${NC}"
+ if python evaluate_all_pai.py ../data/results.json --tasks tal --grouping per-dataset > /tmp/tal_per_dataset.log 2>&1; then
+     # Check that the evaluation actually ran
+     if grep -q "recall@0.3" /tmp/tal_per_dataset.log; then
+         echo -e "${GREEN}✓ PASSED: TAL per-dataset evaluation${NC}"
+     else
+         echo -e "${RED}✗ FAILED: TAL evaluation did not produce results${NC}"
+         exit 1
+     fi
+ else
+     echo -e "${RED}✗ FAILED: TAL per-dataset evaluation${NC}"
+     exit 1
+ fi
+
+ # Test 3: STG evaluation (per-dataset)
+ echo -e "\n${BLUE}Test 3: STG evaluation (per-dataset grouping)${NC}"
+ if python evaluate_all_pai.py ../data/results.json --tasks stg --grouping per-dataset > /tmp/stg_per_dataset.log 2>&1; then
+     echo -e "${GREEN}✓ PASSED: STG per-dataset evaluation${NC}"
+ else
+     echo -e "${RED}✗ FAILED: STG per-dataset evaluation${NC}"
+     exit 1
+ fi
+
+ # Test 4: TAL evaluation (overall grouping)
+ echo -e "\n${BLUE}Test 4: TAL evaluation (overall grouping)${NC}"
+ if python evaluate_all_pai.py ../data/results.json --tasks tal --grouping overall > /tmp/tal_overall.log 2>&1; then
+     # Check for overall evaluation output
+     if grep -q "Overall Evaluation" /tmp/tal_overall.log; then
+         echo -e "${GREEN}✓ PASSED: TAL overall evaluation${NC}"
+     else
+         echo -e "${RED}✗ FAILED: TAL overall evaluation did not produce results${NC}"
+         exit 1
+     fi
+ else
+     echo -e "${RED}✗ FAILED: TAL overall evaluation${NC}"
+     exit 1
+ fi
+
+ # Test 5: Multiple tasks
+ echo -e "\n${BLUE}Test 5: Multiple tasks (TAL + STG)${NC}"
+ if python evaluate_all_pai.py ../data/results.json --tasks tal stg --grouping per-dataset > /tmp/multi_tasks.log 2>&1; then
+     echo -e "${GREEN}✓ PASSED: Multiple tasks evaluation${NC}"
+ else
+     echo -e "${RED}✗ FAILED: Multiple tasks evaluation${NC}"
+     exit 1
+ fi
+
+ # Test 6: Auto-detection wrapper with merged format
+ echo -e "\n${BLUE}Test 6: Evaluate predictions wrapper (merged format)${NC}"
+ if python evaluate_predictions.py ../data/results.json --tasks tal --analyze-only > /tmp/wrapper_merged.log 2>&1; then
+     # Check for the detection message
+     if grep -q "already contain ground-truth" /tmp/wrapper_merged.log; then
+         echo -e "${GREEN}✓ PASSED: Wrapper correctly detected merged format${NC}"
+     else
+         echo -e "${RED}✗ FAILED: Wrapper did not detect merged format${NC}"
+         exit 1
+     fi
+ else
+     echo -e "${RED}✗ FAILED: Wrapper with merged format${NC}"
+     exit 1
+ fi
+
+ # Test 7: Auto-detection wrapper with prediction-only format
+ echo -e "\n${BLUE}Test 7: Evaluate predictions wrapper (prediction-only format)${NC}"
+ if [ -f ../data/sample_predictions.json ]; then
+     if python evaluate_predictions.py ../data/sample_predictions.json --tasks tal > /tmp/wrapper_pred_only.log 2>&1; then
+         # Check for the merging message
+         if grep -q "Merging with ground-truth" /tmp/wrapper_pred_only.log; then
+             echo -e "${GREEN}✓ PASSED: Wrapper correctly detected prediction-only format and merged${NC}"
+         else
+             echo -e "${RED}✗ FAILED: Wrapper did not detect prediction-only format${NC}"
+             exit 1
+         fi
+     else
+         echo -e "${RED}✗ FAILED: Wrapper with prediction-only format${NC}"
+         exit 1
+     fi
+ else
+     echo -e "${BLUE}⊘ SKIPPED: sample_predictions.json not found${NC}"
+ fi
+
+ # Test 8: Dataset detection
+ echo -e "\n${BLUE}Test 8: Dataset detection (check for AVOS, not Unknown)${NC}"
+ python evaluate_predictions.py ../data/results.json --tasks tal > /tmp/dataset_detection.log 2>&1 || true
+ if grep -q "AVOS:" /tmp/dataset_detection.log && ! grep -q "Unknown:" /tmp/dataset_detection.log; then
+     echo -e "${GREEN}✓ PASSED: Datasets correctly detected (AVOS found, no Unknown)${NC}"
+ else
+     echo -e "${RED}✗ FAILED: Dataset detection issue (check for Unknown datasets)${NC}"
+     # This is a warning, not a failure
+ fi
+
+ # Summary
+ echo -e "\n============================================================"
+ echo -e "${GREEN}All Tests Passed!${NC}"
+ echo -e "============================================================"
+ echo ""
+ echo "Test logs saved to /tmp:"
+ echo "  - tal_per_dataset.log"
+ echo "  - stg_per_dataset.log"
+ echo "  - tal_overall.log"
+ echo "  - multi_tasks.log"
+ echo "  - wrapper_merged.log"
+ echo "  - wrapper_pred_only.log"
+ echo "  - dataset_detection.log"
+ echo ""
+ echo "System is ready for user submissions!"