UII-AI/MedVidBench-Leaderboard

MedGRPO Team committed on Apr 7

Commit e2b1040 (1 parent: a9b5dcf)

fix issues

Files changed (11)

README.md +5 -4
app.py +22 -19
evaluation/eval_caption_llm_judge.py +1 -1
evaluation/eval_cvs_assessment.py +7 -0
evaluation/eval_dvc.py +177 -19
evaluation/eval_next_action.py +15 -0
evaluation/eval_skill_assessment.py +6 -0
evaluation/eval_stg.py +8 -0
evaluation/eval_tal.py +2 -1
evaluation/evaluate_all_pai.py +93 -233
evaluation/evaluate_predictions.py +10 -7

README.md CHANGED

@@ -221,9 +221,10 @@ To compute the **average score** fairly across tasks:
 ## Links
 - 📄 **Paper**: [https://arxiv.org/abs/2512.06581](https://arxiv.org/abs/2512.06581)
-- 🌐 **Project**: [https://gaozhongpai.github.io/MedGRPO-Page/](https://gaozhongpai.github.io/MedGRPO-Page/)
 - 💾 **Dataset**: [https://huggingface.co/datasets/UIIAmerica/MedVidBench](https://huggingface.co/datasets/UIIAmerica/MedVidBench)
-- 💻 **GitHub**: [https://github.com/Gaozhongpai/MedGRPO](https://github.com/Gaozhongpai/MedGRPO)
 - 🏆 **Leaderboard**: [https://huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard](https://huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard)
 ## Citation
@@ -245,5 +246,5 @@ To compute the **average score** fairly across tasks:
 ## Contact
 For questions or issues:
-- Open an issue on [GitHub](https://github.com/Gaozhongpai/MedGRPO)
-- Visit the [project page](https://gaozhongpai.github.io/MedGRPO-Page/)

 ## Links
 - 📄 **Paper**: [https://arxiv.org/abs/2512.06581](https://arxiv.org/abs/2512.06581)
+- 🌐 **Project**: [https://uii-america.github.io/MedGRPO/](https://uii-america.github.io/MedGRPO/)
 - 💾 **Dataset**: [https://huggingface.co/datasets/UIIAmerica/MedVidBench](https://huggingface.co/datasets/UIIAmerica/MedVidBench)
+- 💻 **GitHub**: [https://github.com/UII-America/MedGRPO-Code](https://github.com/UII-America/MedGRPO-Code)
+- 🎮 **Demo**: [https://huggingface.co/spaces/UIIAmerica/MedGRPO-Demo](https://huggingface.co/spaces/UIIAmerica/MedGRPO-Demo)
 - 🏆 **Leaderboard**: [https://huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard](https://huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard)
 ## Citation
 ## Contact
 For questions or issues:
+- Open an issue on [GitHub](https://github.com/UII-America/MedGRPO-Code)
+- Visit the [project page](https://uii-america.github.io/MedGRPO/)

app.py CHANGED

@@ -865,22 +865,23 @@ def parse_evaluation_output(output: str) -> Dict[str, float]:
         line = line.strip()
         # Detect task headers
-        if "TAL" in line and "Overall" in line:
             current_task = "tal"
         elif "STG" in line and "Overall" in line:
             current_task = "stg"
-        elif "NEXT_ACTION" in line and "Overall" in line or "Next Action" in line:
             current_task = "next_action"
-        elif "DVC" in line and "Overall" in line or "Dense Video Captioning" in line:
             current_task = "dvc"
-        elif "RC" in line and "Overall" in line or "Region Caption" in line:
             current_task = "rc"
-        elif "VS" in line and "Overall" in line or "Video Summary" in line:
             current_task = "vs"
-        elif "SKILL" in line and "Overall" in line or "Skill Assessment" in line:
-            current_task = "skill_assessment"
-        elif "CVS" in line and "Overall" in line or "CVS Assessment" in line:
-            current_task = "cvs_assessment"
         # Detect IoU sections for TAL (new format)
         if current_task == "tal":
@@ -951,16 +952,16 @@ def parse_evaluation_output(output: str) -> Dict[str, float]:
             # VS: Extract LLM score
             elif current_task == "vs" and ("score" in line.lower() or "average" in line.lower()):
                 try:
-                    value = float(line.split(":")[-1].strip())
-                    metrics["vs_llm"] = value
                 except:
                     pass
             # RC: Extract LLM score
             elif current_task == "rc" and ("score" in line.lower() or "average" in line.lower()):
                 try:
-                    value = float(line.split(":")[-1].strip())
-                    metrics["rc_llm"] = value
                 except:
                     pass
@@ -1652,9 +1653,10 @@ with gr.Blocks(title="MedVidBench Leaderboard", theme=gr.themes.Soft()) as demo:
     8 medical video understanding tasks across 8 surgical datasets.
     📄 **Paper**: [MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding](https://arxiv.org/abs/2512.06581)
-    🌐 **Project**: [gaozhongpai.github.io/MedGRPO-Page](https://gaozhongpai.github.io/MedGRPO-Page/)
     💾 **Dataset**: [huggingface.co/datasets/UIIAmerica/MedVidBench](https://huggingface.co/datasets/UIIAmerica/MedVidBench)
-    💻 **GitHub**: [github.com/Gaozhongpai/MedGRPO](https://github.com/Gaozhongpai/MedGRPO)
     """)
     with gr.Tabs():
@@ -1931,9 +1933,10 @@ with gr.Blocks(title="MedVidBench Leaderboard", theme=gr.themes.Soft()) as demo:
             #### Links
             - 📄 **Paper**: [https://arxiv.org/abs/2512.06581](https://arxiv.org/abs/2512.06581)
-            - 🌐 **Project Page**: [https://gaozhongpai.github.io/MedGRPO-Page/](https://gaozhongpai.github.io/MedGRPO-Page/)
             - 💾 **Dataset**: [https://huggingface.co/datasets/UIIAmerica/MedVidBench](https://huggingface.co/datasets/UIIAmerica/MedVidBench)
-            - 💻 **GitHub**: [https://github.com/Gaozhongpai/MedGRPO](https://github.com/Gaozhongpai/MedGRPO)
             - 🏆 **Leaderboard**: [https://huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard](https://huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard)
             #### Dataset
@@ -1953,8 +1956,8 @@ with gr.Blocks(title="MedVidBench Leaderboard", theme=gr.themes.Soft()) as demo:
             #### Contact
             For questions or issues:
-            - Open an issue on [GitHub](https://github.com/Gaozhongpai/MedGRPO)
-            - Visit the [project page](https://gaozhongpai.github.io/MedGRPO-Page/)
             - Email: [Contact via GitHub](https://github.com/YuhaoSu)
             """)

         line = line.strip()
         # Detect task headers
+        # NOTE: Order matters — check CVS before VS (since "CVS" contains "VS")
+        if ("CVS" in line and "Overall" in line) or "CVS Assessment" in line:
+            current_task = "cvs_assessment"
+        elif ("SKILL" in line and "Overall" in line) or "Skill Assessment" in line:
+            current_task = "skill_assessment"
+        elif "TAL" in line and "Overall" in line:
             current_task = "tal"
         elif "STG" in line and "Overall" in line:
             current_task = "stg"
+        elif ("NEXT_ACTION" in line and "Overall" in line) or "Next Action" in line:
             current_task = "next_action"
+        elif ("DVC" in line and "Overall" in line) or "Dense Video Captioning" in line:
             current_task = "dvc"
+        elif ("RC" in line and "Overall" in line) or "Region Caption" in line:
             current_task = "rc"
+        elif ("VS" in line and "Overall" in line) or "Video Summary" in line:
             current_task = "vs"
         # Detect IoU sections for TAL (new format)
         if current_task == "tal":
             # VS: Extract LLM score
             elif current_task == "vs" and ("score" in line.lower() or "average" in line.lower()):
                 try:
+                    val_str = line.split(":")[-1].strip().split("(")[0].strip()
+                    metrics["vs_llm"] = float(val_str)
                 except:
                     pass
             # RC: Extract LLM score
             elif current_task == "rc" and ("score" in line.lower() or "average" in line.lower()):
                 try:
+                    val_str = line.split(":")[-1].strip().split("(")[0].strip()
+                    metrics["rc_llm"] = float(val_str)
                 except:
                     pass
     8 medical video understanding tasks across 8 surgical datasets.
     📄 **Paper**: [MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding](https://arxiv.org/abs/2512.06581)
+    🌐 **Project**: [uii-america.github.io/MedGRPO](https://uii-america.github.io/MedGRPO/)
     💾 **Dataset**: [huggingface.co/datasets/UIIAmerica/MedVidBench](https://huggingface.co/datasets/UIIAmerica/MedVidBench)
+    💻 **GitHub**: [github.com/UII-America/MedGRPO-Code](https://github.com/UII-America/MedGRPO-Code)
+    🎮 **Demo**: [huggingface.co/spaces/UIIAmerica/MedGRPO-Demo](https://huggingface.co/spaces/UIIAmerica/MedGRPO-Demo)
     """)
     with gr.Tabs():
             #### Links
             - 📄 **Paper**: [https://arxiv.org/abs/2512.06581](https://arxiv.org/abs/2512.06581)
+            - 🌐 **Project Page**: [https://uii-america.github.io/MedGRPO/](https://uii-america.github.io/MedGRPO/)
             - 💾 **Dataset**: [https://huggingface.co/datasets/UIIAmerica/MedVidBench](https://huggingface.co/datasets/UIIAmerica/MedVidBench)
+            - 💻 **GitHub**: [https://github.com/UII-America/MedGRPO-Code](https://github.com/UII-America/MedGRPO-Code)
+            - 🎮 **Demo**: [https://huggingface.co/spaces/UIIAmerica/MedGRPO-Demo](https://huggingface.co/spaces/UIIAmerica/MedGRPO-Demo)
             - 🏆 **Leaderboard**: [https://huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard](https://huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard)
             #### Dataset
             #### Contact
             For questions or issues:
+            - Open an issue on [GitHub](https://github.com/UII-America/MedGRPO-Code)
+            - Visit the [project page](https://uii-america.github.io/MedGRPO/)
             - Email: [Contact via GitHub](https://github.com/YuhaoSu)
             """)

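Note on the `parse_evaluation_output` change: `A and B or C` already parses as `(A and B) or C` in Python, so the added parentheses only make the intent explicit. The behavioural fix is the reordering, because `"VS" in line` is also true for any line containing "CVS". The score parsing change drops a trailing parenthetical before calling `float`. A minimal sketch of both points, assuming header lines shaped like `CVS - Overall` and score lines shaped like `Score: 3.75 (1-5 scale)` (both strings are illustrative, and `detect_task` and `parse_score` are hypothetical helper names, not functions in `app.py`):

```python
def detect_task(line):
    """Hypothetical stand-alone header check; CVS is tested before VS."""
    if ("CVS" in line and "Overall" in line) or "CVS Assessment" in line:
        return "cvs_assessment"
    if ("VS" in line and "Overall" in line) or "Video Summary" in line:
        return "vs"
    return None


def parse_score(line):
    """Take the text after the last colon and drop any trailing '(...)' suffix."""
    val_str = line.split(":")[-1].strip().split("(")[0].strip()
    return float(val_str)


header = "CVS - Overall"  # assumed header format
print("VS" in header)     # the substring check that made the old order misfire
print(detect_task(header))
print(detect_task("VS - Overall"))
print(parse_score("Score: 3.75 (1-5 scale)"))
```

With the old parsing, `float("3.75 (1-5 scale)")` raises `ValueError`, which the bare `except` swallows, so the metric would be silently missing.
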
evaluation/eval_caption_llm_judge.py CHANGED

@@ -163,7 +163,7 @@ def call_llm_judge_api(prediction: str, ground_truth: str, task_type: str, api_k
                     with progress_lock:
                         completed_calls += 1
-                        if completed_calls % 50 == 0:
                             print(f"  Progress: {completed_calls}/{total_calls} API calls completed")
                     return scores

                     with progress_lock:
                         completed_calls += 1
+                        if total_calls > 0 and completed_calls % 50 == 0:
                             print(f"  Progress: {completed_calls}/{total_calls} API calls completed")
                     return scores

evaluation/eval_cvs_assessment.py CHANGED

@@ -373,10 +373,17 @@ def main():
     print("CVS ASSESSMENT EVALUATION SUMMARY")
     print(f"{'='*60}")
     for dataset_name, results in all_results.items():
         if results:
             print(f"\n{dataset_name}:")
             print(f"  Overall Accuracy: {results['accuracy']:.4f} ({results['correct']}/{results['total']})")
 if __name__ == "__main__":

     print("CVS ASSESSMENT EVALUATION SUMMARY")
     print(f"{'='*60}")
+    all_bal_acc = []
     for dataset_name, results in all_results.items():
         if results:
             print(f"\n{dataset_name}:")
             print(f"  Overall Accuracy: {results['accuracy']:.4f} ({results['correct']}/{results['total']})")
+            all_bal_acc.append(results.get('component_balanced_accuracy', 0.0))
+    return {
+        'per_dataset': all_results,
+        'component_balanced_accuracy': np.mean(all_bal_acc) if all_bal_acc else 0.0
+    }
 if __name__ == "__main__":

evaluation/eval_dvc.py CHANGED

@@ -1,18 +1,30 @@
 """Dense Video Captioning evaluation using LLM judge + temporal F1.
 Temporal F1 algorithm matches Qwen2.5-VL/my_eval/eval_dvc.py exactly:
 - process_raw_output() + flatten_overlapping_segments() for parsing
 - Frame-based coordinates (multiply by FPS)
-- Many-to-many threshold matching across IoU (0.3, 0.5, 0.7, 0.9)
 - F1 = 2 * mean_precision * mean_recall / (mean_precision + mean_recall)
 """
 import json
 import re
 import sys
 import numpy as np
 from collections import defaultdict
-from eval_caption_llm_judge import evaluate_caption_task
 # =============================================================================
@@ -190,7 +202,7 @@ def compute_temporal_f1_single(predicted_segments, gt_segments, splits,
 # =============================================================================
-# Dataset grouping and evaluation
 # =============================================================================
 def group_records_by_dataset(data):
@@ -241,31 +253,176 @@ def _extract_gt_segments(record):
     return gnd
 def evaluate_dataset_dvc(dataset_name, records, skip_llm_judge=False):
     """Evaluate DVC for a specific dataset using caption quality + temporal F1."""
     print(f"\nEvaluating {dataset_name} ({len(records)} records)...")
-    # Step 1: Evaluate caption quality using LLM judge (unless skipped)
     if skip_llm_judge:
         print(f"  Skipping LLM judge caption evaluation (--skip-llm-judge flag)")
         caption_score = 0.0
         caption_method = 'skipped'
     else:
-        import tempfile
-        import os
-        temp_data = {str(i): record for i, record in enumerate(records)}
-        with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
-            json.dump(temp_data, f)
-            temp_file = f.name
-        try:
-            caption_result = evaluate_caption_task(temp_file, 'dense_captioning')
-            caption_score = caption_result['score']
-            caption_method = caption_result['method']
-        finally:
-            os.unlink(temp_file)
     # Step 2: Compute temporal F1 matching Qwen2.5-VL algorithm exactly
     all_f1_scores = []
@@ -389,6 +546,7 @@ def main():
                     all_f1_scores.append(metrics.get('temporal_f1', 0))
     return {
         'caption_score': np.mean(all_caption_scores) if all_caption_scores else 0.0,
         'temporal_f1': np.mean(all_f1_scores) if all_f1_scores else 0.0,
         'method': all_results[list(all_results.keys())[0]]['overall'].get('caption_method', 'unknown') if all_results else 'unknown'

 """Dense Video Captioning evaluation using LLM judge + temporal F1.
+LLM judge uses IoU-matched segment pairs (matching original Qwen2.5-VL/llm_judge/):
+- Match predicted segments to GT segments at IoU thresholds (0.3, 0.5, 0.7)
+- Only judge matched pairs individually (not concatenated)
+- Average across matched pairs, then across thresholds
 Temporal F1 algorithm matches Qwen2.5-VL/my_eval/eval_dvc.py exactly:
 - process_raw_output() + flatten_overlapping_segments() for parsing
 - Frame-based coordinates (multiply by FPS)
+- Many-to-many threshold matching across IoU (0.3, 0.5, 0.7)
 - F1 = 2 * mean_precision * mean_recall / (mean_precision + mean_recall)
 """
 import json
+import os
 import re
 import sys
+import time
 import numpy as np
 from collections import defaultdict
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from threading import Lock
+from eval_caption_llm_judge import (
+    call_llm_judge_api, BEST5_ASPECTS, OPENAI_AVAILABLE,
+    compute_semantic_similarity_fallback
+)
 # =============================================================================
 # =============================================================================
+# Dataset grouping and evaluation (LlamaFactory specific)
 # =============================================================================
 def group_records_by_dataset(data):
     return gnd
+DVC_IOU_THRESHOLDS = [0.3, 0.5, 0.7]
+DVC_MAX_WORKERS = 20
+# Thread-safe progress counter for DVC LLM judge
+_dvc_progress_lock = Lock()
+_dvc_completed = 0
+_dvc_total = 0
+def _segment_iou(seg1, seg2):
+    """Compute IoU for two temporal segments (dicts with 'start' and 'end')."""
+    intersection = max(0, min(seg1['end'], seg2['end']) - max(seg1['start'], seg2['start']))
+    union = (seg1['end'] - seg1['start']) + (seg2['end'] - seg2['start']) - intersection
+    return intersection / union if union > 0 else 0.0
+def _match_captions_at_threshold(pred_segments, gt_segments, threshold):
+    """Match predicted to ground truth segments at a specific IoU threshold.
+    Returns list of (pred_caption, gt_caption) pairs.
+    """
+    matched_pairs = []
+    for pred_seg in pred_segments:
+        best_iou = 0.0
+        best_gt_caption = None
+        for gt_seg in gt_segments:
+            current_iou = _segment_iou(pred_seg, gt_seg)
+            if current_iou >= threshold and current_iou > best_iou:
+                best_iou = current_iou
+                best_gt_caption = gt_seg['caption']
+        if best_gt_caption is not None:
+            matched_pairs.append((pred_seg['caption'], best_gt_caption))
+    return matched_pairs
+def _evaluate_dvc_caption_iou_matched(records, api_key):
+    """Evaluate DVC captions using IoU-matched segment pairs + LLM judge.
+    Matches the original Qwen2.5-VL/llm_judge/ approach:
+    1. Parse pred and GT into segments
+    2. Match at IoU thresholds (0.3, 0.5, 0.7)
+    3. Judge each matched pair individually
+    4. Average across pairs, then across thresholds
+    """
+    global _dvc_completed, _dvc_total
+    # Phase 1: Match all samples at all thresholds
+    print(f"  Phase 1: Matching segments at IoU thresholds {DVC_IOU_THRESHOLDS}...")
+    all_matched = []
+    for record in records:
+        pred_text = record.get('answer', '')
+        gt_text = record.get('gnd', '')
+        pred_segments = process_raw_output(pred_text)
+        gt_segments = _extract_gt_segments(record)
+        if not isinstance(gt_segments, list):
+            continue
+        # Ensure gt_segments are dicts with caption
+        gt_segs = [g for g in gt_segments if isinstance(g, dict) and 'start' in g and 'end' in g and 'caption' in g]
+        if not pred_segments or not gt_segs:
+            continue
+        matched_pairs = {}
+        for threshold in DVC_IOU_THRESHOLDS:
+            pairs = _match_captions_at_threshold(pred_segments, gt_segs, threshold)
+            matched_pairs[threshold] = pairs
+        all_matched.append(matched_pairs)
+    total_pairs = sum(sum(len(pairs) for pairs in m.values()) for m in all_matched)
+    print(f"  ✓ Matched {len(all_matched)} samples, {total_pairs} total pairs across all thresholds")
+    if total_pairs == 0:
+        return 0.0, 'llm_judge_iou_matched', 0.0
+    # Phase 2: Evaluate all matched pairs in parallel
+    _dvc_total = total_pairs
+    _dvc_completed = 0
+    print(f"  Phase 2: Evaluating {total_pairs} pairs with LLM Judge ({DVC_MAX_WORKERS} workers)...")
+    # Collect all tasks: (sample_idx, threshold, pred_caption, gt_caption)
+    tasks = []
+    for sample_idx, matched_pairs in enumerate(all_matched):
+        for threshold in DVC_IOU_THRESHOLDS:
+            for pred_cap, gt_cap in matched_pairs[threshold]:
+                tasks.append((sample_idx, threshold, pred_cap, gt_cap))
+    # Store results per threshold
+    threshold_scores = {t: {aspect: [] for aspect in BEST5_ASPECTS} for t in DVC_IOU_THRESHOLDS}
+    api_successes = 0
+    def _judge_pair(pred_cap, gt_cap):
+        global _dvc_completed
+        result = call_llm_judge_api(pred_cap, gt_cap, 'dense_captioning', api_key)
+        with _dvc_progress_lock:
+            _dvc_completed += 1
+            if _dvc_completed % 50 == 0:
+                print(f"    Progress: {_dvc_completed}/{_dvc_total} API calls completed")
+        return result
+    with ThreadPoolExecutor(max_workers=DVC_MAX_WORKERS) as executor:
+        future_to_task = {
+            executor.submit(_judge_pair, pred_cap, gt_cap): (sample_idx, threshold)
+            for sample_idx, threshold, pred_cap, gt_cap in tasks
+        }
+        for future in as_completed(future_to_task):
+            _, threshold = future_to_task[future]
+            try:
+                result = future.result()
+                if result.get('api_success', False):
+                    for aspect in BEST5_ASPECTS:
+                        threshold_scores[threshold][aspect].append(result[aspect])
+                    api_successes += 1
+            except Exception as e:
+                print(f"    ⚠ Error: {e}")
+    # Phase 3: Aggregate — average per threshold, then across thresholds
+    per_threshold_avg = {}
+    for threshold in DVC_IOU_THRESHOLDS:
+        aspect_avgs = {}
+        for aspect in BEST5_ASPECTS:
+            scores = threshold_scores[threshold][aspect]
+            aspect_avgs[aspect] = np.mean(scores) if scores else 0.0
+        valid = [v for v in aspect_avgs.values() if v > 0]
+        per_threshold_avg[threshold] = np.mean(valid) if valid else 0.0
+    # Overall: average across thresholds
+    valid_thresholds = [v for v in per_threshold_avg.values() if v > 0]
+    overall_score = np.mean(valid_thresholds) if valid_thresholds else 0.0
+    success_rate = api_successes / total_pairs if total_pairs > 0 else 0.0
+    print(f"  ✓ LLM Judge completed: {api_successes}/{total_pairs} successful")
+    for t in DVC_IOU_THRESHOLDS:
+        print(f"    IoU@{t}: {per_threshold_avg[t]:.3f}")
+    print(f"    Overall (threshold-averaged): {overall_score:.3f}")
+    return overall_score, 'llm_judge_iou_matched', success_rate
 def evaluate_dataset_dvc(dataset_name, records, skip_llm_judge=False):
     """Evaluate DVC for a specific dataset using caption quality + temporal F1."""
     print(f"\nEvaluating {dataset_name} ({len(records)} records)...")
+    # Step 1: Evaluate caption quality using IoU-matched LLM judge
     if skip_llm_judge:
         print(f"  Skipping LLM judge caption evaluation (--skip-llm-judge flag)")
         caption_score = 0.0
         caption_method = 'skipped'
     else:
+        api_key = os.getenv('OPENAI_API_KEY')
+        if api_key and OPENAI_AVAILABLE:
+            caption_score, caption_method, _ = _evaluate_dvc_caption_iou_matched(records, api_key)
+        else:
+            print(f"  ⚠ No API key, using semantic similarity fallback")
+            import tempfile
+            temp_data = {str(i): record for i, record in enumerate(records)}
+            with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
+                json.dump(temp_data, f)
+                temp_file = f.name
+            try:
+                caption_score = compute_semantic_similarity_fallback(temp_data, 'dense_captioning')
+                caption_method = 'semantic_similarity'
+            finally:
+                os.unlink(temp_file)
     # Step 2: Compute temporal F1 matching Qwen2.5-VL algorithm exactly
     all_f1_scores = []
                     all_f1_scores.append(metrics.get('temporal_f1', 0))
     return {
+        'per_dataset': all_results,
         'caption_score': np.mean(all_caption_scores) if all_caption_scores else 0.0,
         'temporal_f1': np.mean(all_f1_scores) if all_f1_scores else 0.0,
         'method': all_results[list(all_results.keys())[0]]['overall'].get('caption_method', 'unknown') if all_results else 'unknown'
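
Note on the IoU-matched caption judging: the number of caption pairs sent to the LLM judge shrinks as the threshold rises, since each predicted segment is paired with its best-overlapping ground-truth segment only if that overlap reaches the threshold. A small worked sketch of that matching on toy segments follows. The functions mirror `_segment_iou` and `_match_captions_at_threshold` from the diff under shortened, hypothetical names, and the segment values and captions are invented for illustration:

```python
def segment_iou(seg1, seg2):
    """Temporal IoU of two segments given as dicts with 'start' and 'end'."""
    intersection = max(0, min(seg1["end"], seg2["end"]) - max(seg1["start"], seg2["start"]))
    union = (seg1["end"] - seg1["start"]) + (seg2["end"] - seg2["start"]) - intersection
    return intersection / union if union > 0 else 0.0


def match_at_threshold(pred_segments, gt_segments, threshold):
    """Pair each prediction with its best ground-truth segment at or above threshold."""
    pairs = []
    for pred in pred_segments:
        best_iou, best_caption = 0.0, None
        for gt in gt_segments:
            iou = segment_iou(pred, gt)
            if iou >= threshold and iou > best_iou:
                best_iou, best_caption = iou, gt["caption"]
        if best_caption is not None:
            pairs.append((pred["caption"], best_caption))
    return pairs


# Toy data (illustrative only): p1 overlaps g1 with IoU 0.8, p2 overlaps g2 with IoU 0.6.
preds = [
    {"start": 0, "end": 10, "caption": "p1"},
    {"start": 14, "end": 20, "caption": "p2"},
]
gts = [
    {"start": 0, "end": 8, "caption": "g1"},
    {"start": 10, "end": 20, "caption": "g2"},
]

for t in (0.3, 0.5, 0.7):
    print(t, match_at_threshold(preds, gts, t))
```

As written in the diff, the matching is per prediction rather than one-to-one, so nothing prevents two predictions from pairing with the same ground-truth segment.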

evaluation/eval_next_action.py CHANGED

@@ -684,6 +684,9 @@ def main():
     print("NEXT ACTION EVALUATION SUMMARY")
     print(f"{'='*80}")
     for dataset_name, fps_results in all_results.items():
         if fps_results:
             print(f"\n{dataset_name}:")
@@ -694,6 +697,18 @@ def main():
                         print(f"    {metric_name}: {value:.4f}")
                     else:
                         print(f"    samples: {value}")
 if __name__ == "__main__":

     print("NEXT ACTION EVALUATION SUMMARY")
     print(f"{'='*80}")
+    all_accuracies = []
+    total_correct = 0
+    total_samples = 0
     for dataset_name, fps_results in all_results.items():
         if fps_results:
             print(f"\n{dataset_name}:")
                         print(f"    {metric_name}: {value:.4f}")
                     else:
                         print(f"    samples: {value}")
+            if 'overall' in fps_results:
+                acc = fps_results['overall'].get('accuracy', 0.0)
+                count = fps_results['overall'].get('count', 0)
+                all_accuracies.append(acc)
+                total_correct += int(acc * count)
+                total_samples += count
+    return {
+        'per_dataset': all_results,
+        'accuracy': total_correct / total_samples if total_samples > 0 else 0.0,
+        'macro_accuracy': np.mean(all_accuracies) if all_accuracies else 0.0
+    }
 if __name__ == "__main__":
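
Note on the new return value: `accuracy` is a sample-weighted (micro) average reconstructed from per-dataset accuracy and count, while `macro_accuracy` weights every dataset equally. The sketch below shows how the two diverge on unbalanced datasets, and why recovering a correct-count with `int(acc * count)` can undercount by one when the product lands just below an integer in floating point. The dataset names and numbers are invented, and using `round` is a suggestion rather than what the commit does:

```python
# Hypothetical per-dataset summaries (values invented for illustration).
per_dataset = {
    "dataset_a": {"accuracy": 0.9, "count": 100},
    "dataset_b": {"accuracy": 0.5, "count": 10},
}

total_correct = 0
total_samples = 0
accuracies = []
for stats in per_dataset.values():
    accuracies.append(stats["accuracy"])
    # round() instead of int() avoids truncating values such as 28.999999999999996
    total_correct += round(stats["accuracy"] * stats["count"])
    total_samples += stats["count"]

micro = total_correct / total_samples if total_samples > 0 else 0.0
macro = sum(accuracies) / len(accuracies) if accuracies else 0.0
print(f"micro={micro:.4f} macro={macro:.4f}")

# Truncation pitfall: the product is slightly below 29 in binary floating point.
print(int(0.29 * 100), round(0.29 * 100))
```

Returning the raw `correct` and `total` counts from each dataset, if they are available there, would avoid the reconstruction entirely.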

evaluation/eval_skill_assessment.py CHANGED

@@ -421,6 +421,12 @@ def main():
             # Show overall skill level accuracy
             print(f"  Overall Skill Level Accuracy: {results['accuracy']:.4f} ({results['correct']}/{results['total']})")
 if __name__ == "__main__":
     main()

             # Show overall skill level accuracy
             print(f"  Overall Skill Level Accuracy: {results['accuracy']:.4f} ({results['correct']}/{results['total']})")
+    all_bal_acc = [r.get('aspect_balanced_accuracy', 0.0) for r in all_results.values() if r]
+    return {
+        'per_dataset': all_results,
+        'aspect_balanced_accuracy': np.mean(all_bal_acc) if all_bal_acc else 0.0
+    }
 if __name__ == "__main__":
     main()

evaluation/eval_stg.py CHANGED

@@ -354,6 +354,7 @@ def main():
     print("STG EVALUATION SUMMARY")
     print(f"{'='*80}")
     for dataset_name, fps_results in all_results.items():
         if fps_results:
             print(f"\n{dataset_name}:")
@@ -364,6 +365,13 @@ def main():
                         print(f"    {metric_name}: {value:.4f}")
                     else:
                         print(f"    samples: {value}")
 if __name__ == "__main__":

     print("STG EVALUATION SUMMARY")
     print(f"{'='*80}")
+    all_ious = []
     for dataset_name, fps_results in all_results.items():
         if fps_results:
             print(f"\n{dataset_name}:")
                         print(f"    {metric_name}: {value:.4f}")
                     else:
                         print(f"    samples: {value}")
+            if 'overall' in fps_results:
+                all_ious.append(fps_results['overall'].get('mean_iou', 0.0))
+    return {
+        'per_dataset': all_results,
+        'mean_iou': np.mean(all_ious) if all_ious else 0.0
+    }
 if __name__ == "__main__":

evaluation/eval_tal.py CHANGED

@@ -309,8 +309,9 @@ def main():
                 if 'meanIoU@0.5' in metrics:
                     all_miou_05.append(metrics['meanIoU@0.5'])
-    # Return overall aggregated results
     return {
         'meanIoU@0.3': np.mean(all_miou_03) if all_miou_03 else 0.0,
         'meanIoU@0.5': np.mean(all_miou_05) if all_miou_05 else 0.0
     }

                 if 'meanIoU@0.5' in metrics:
                     all_miou_05.append(metrics['meanIoU@0.5'])
+    # Return per-dataset results for caching + macro averages
     return {
+        'per_dataset': all_results,
         'meanIoU@0.3': np.mean(all_miou_03) if all_miou_03 else 0.0,
         'meanIoU@0.5': np.mean(all_miou_05) if all_miou_05 else 0.0
     }

evaluation/evaluate_all_pai.py CHANGED

@@ -442,278 +442,138 @@ def print_evaluation_results_csv_internal(output_file, tasks, evaluation_results
 def print_overall_evaluation_results(output_file, tasks, all_task_results, skip_llm_judge=False):
-    """Print evaluation results in overall mode (dataset-agnostic).
-    For each task, computes metrics by processing individual samples across
-    all datasets together, rather than averaging per-dataset metrics.
     """
     print(f"\n{'='*80}")
     print(f"EVALUATION RESULTS - OVERALL (Dataset-Agnostic)")
     print(f"{'='*80}")
-    # Load the data to re-process at individual level
-    with open(output_file, "r") as f:
-        data = json.load(f)
-    # Handle both dict and list formats
-    if isinstance(data, dict):
-        records = list(data.values())
-    elif isinstance(data, list):
-        records = data
-    else:
-        print(f"Unexpected data format: {type(data)}")
-        return
-    # For each task, collect all records across datasets and re-evaluate
     for task_name in sorted(tasks):
         print(f"\n{'='*80}")
         print(f"{task_name.upper()} - Overall Evaluation (All Datasets Combined)")
         print(f"{'='*80}")
-        # Filter records for this task
-        task_records = []
-        for record in records:
-            qa_type = record.get("qa_type", "unknown")
-            # Map qa_type to task name
-            mapped_task = None
-            if any("dense_captioning" in qa_type or qa_type == "dc" for _ in [qa_type]):
-                mapped_task = "dvc"
-            elif qa_type == "tal":
-                mapped_task = "tal"
-            elif qa_type == "next_action":
-                mapped_task = "next_action"
-            elif qa_type == "stg":
-                mapped_task = "stg"
-            elif "region_caption" in qa_type:
-                mapped_task = "rc"
-            elif "video_summary" in qa_type:
-                mapped_task = "vs"
-            elif qa_type == "skill_assessment":
-                mapped_task = "skill_assessment"
-            elif qa_type == "cvs_assessment":
-                mapped_task = "cvs_assessment"
-            if mapped_task == task_name:
-                task_records.append(record)
-        if not task_records:
-            print(f"No records found for {task_name}")
             continue
-        print(f"Total samples: {len(task_records)}")
-        # Re-run evaluation on all records together
-        # Import and call the appropriate evaluation function
         try:
             if task_name == "tal":
-                # Import the eval module
-                module = load_eval_module("eval_tal")
-                # Create a temporary dict with sequential keys
-                temp_data = {str(i): record for i, record in enumerate(task_records)}
-                # Get grouped records
-                dataset_records_dict = module.group_records_by_dataset(temp_data)
-                # Combine all records across datasets
-                all_records = []
-                for ds_records in dataset_records_dict.values():
-                    all_records.extend(ds_records)
-                # Evaluate as single dataset
-                results = module.evaluate_dataset_tal("Overall", all_records)
-                # Print results
-                for iou_key, metrics in results.items():
-                    if isinstance(metrics, dict):
-                        print(f"\n{iou_key}:")
-                        for metric_name, value in metrics.items():
-                            print(f"  {metric_name}: {value:.4f}")
-                    else:
-                        print(f"{iou_key}: {metrics:.4f}")
             elif task_name == "stg":
-                module = load_eval_module("eval_stg")
-                temp_data = {str(i): record for i, record in enumerate(task_records)}
-                dataset_records_dict = module.group_records_by_dataset(temp_data)
-                all_records = []
-                for ds_records in dataset_records_dict.values():
-                    all_records.extend(ds_records)
-                results = module.evaluate_dataset_stg("Overall", all_records)
-                # Extract overall metrics
-                if 'overall' in results:
-                    mean_iou = results['overall'].get('mean_iou', 0.0)
-                    print(f"\nmean_iou: {mean_iou:.4f}")
-                else:
-                    # Compute from per-FPS metrics if overall not available
                     all_ious = []
-                    for fps_key, metrics in results.items():
-                        if isinstance(metrics, dict) and 'mIoU' in metrics:
-                            count = metrics.get('count', 0)
-                            miou = metrics.get('mIoU', 0)
-                            all_ious.extend([miou] * int(count))
-                    if all_ious:
-                        import numpy as np
-                        overall_miou = np.mean(all_ious)
-                        print(f"\nmean_iou: {overall_miou:.4f}")
-                    else:
-                        print(f"\nmean_iou: 0.0000")
             elif task_name in ["rc", "vs"]:
-                # Use server-side LLM judge for caption evaluation
-                module = load_eval_module("eval_caption_llm_judge")
-                task_type = "region_caption" if task_name == "rc" else "video_summary"
-                # Save task records to temp file for evaluation
-                import tempfile
-                with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
-                    json.dump(task_records, f)
-                    temp_file = f.name
-                try:
-                    result = module.evaluate_caption_task(temp_file, task_type)
-                    print(f"Method: {result['method']}")
-                    print(f"Score: {result['score']:.4f} ({result['scale']} scale)")
-                    if 'aspect_scores' in result:
                         print("Aspect Scores:")
-                        for aspect, score in sorted(result['aspect_scores'].items()):
                             print(f"  {aspect}: {score:.3f}")
-                finally:
-                    os.unlink(temp_file)
             elif task_name == "next_action":
-                module = load_eval_module("eval_next_action")
-                temp_data = {str(i): record for i, record in enumerate(task_records)}
-                dataset_records_dict = module.group_records_by_dataset(temp_data)
-                # For next_action, we need to evaluate per dataset (different action lists)
-                # then aggregate the results - but suppress per-dataset output
-                all_accuracies = []
-                total_correct = 0
-                total_samples = 0
-                # Suppress output during per-dataset evaluation
-                import io
-                import contextlib
-                for dataset_name, ds_records in dataset_records_dict.items():
-                    if ds_records:
-                        # Silently evaluate each dataset
-                        # Suppress SentenceTransformer/safetensors warnings at fd level
-                        import logging, os
-                        logging.disable(logging.WARNING)
-                        old_fd_out = os.dup(1)
-                        old_fd_err = os.dup(2)
-                        devnull = os.open(os.devnull, os.O_WRONLY)
-                        os.dup2(devnull, 1)
-                        os.dup2(devnull, 2)
-                        try:
-                            with contextlib.redirect_stdout(io.StringIO()), contextlib.redirect_stderr(io.StringIO()):
-                                ds_results = module.evaluate_dataset_next_action(dataset_name, ds_records)
-                        finally:
-                            os.dup2(old_fd_out, 1)
-                            os.dup2(old_fd_err, 2)
-                            os.close(old_fd_out)
-                            os.close(old_fd_err)
-                            os.close(devnull)
-                            logging.disable(logging.NOTSET)
-                        if "overall" in ds_results:
-                            accuracy = ds_results["overall"].get("accuracy", 0.0)
-                            # Use actual evaluated count, not input count (some records may be skipped)
-                            evaluated_count = ds_results["overall"].get("count", len(ds_records))
-                            all_accuracies.append(accuracy)
-                            total_correct += int(accuracy * evaluated_count)
-                            total_samples += evaluated_count
-                # Print only final aggregate metrics
-                if all_accuracies:
-                    macro_avg = sum(all_accuracies) / len(all_accuracies)
-                    weighted_avg = total_correct / total_samples if total_samples > 0 else 0.0
-                    print(f"\nMacro Average Accuracy (across {len(all_accuracies)} datasets): {macro_avg:.4f}")
-                    print(f"Weighted Average Accuracy (across {total_samples} samples): {weighted_avg:.4f}")
                 else:
-                    # Fallback: compute overall accuracy directly from all records
-                    print(f"\nNext Action Metrics:")
-                    all_correct = 0
-                    all_total = 0
-                    for dataset_name, ds_records in dataset_records_dict.items():
-                        if ds_records:
-                            with contextlib.redirect_stdout(io.StringIO()):
-                                ds_results = module.evaluate_dataset_next_action(dataset_name, ds_records)
-                            # Extract accuracy from any FPS key
-                            for fps_key, metrics in ds_results.items():
-                                if isinstance(metrics, dict) and 'accuracy' in metrics:
-                                    accuracy = metrics['accuracy']
-                                    count = metrics.get('count', len(ds_records))
-                                    all_correct += int(accuracy * count)
-                                    all_total += count
-                                    break
-                    if all_total > 0:
-                        overall_acc = all_correct / all_total
-                        print(f"  accuracy: {overall_acc:.4f}")
             elif task_name == "dvc":
-                module = load_eval_module("eval_dvc")
-                temp_data = {str(i): record for i, record in enumerate(task_records)}
-                dataset_records_dict = module.group_records_by_dataset(temp_data)
-                # Combine all records across datasets
-                all_records = []
-                for ds_records in dataset_records_dict.values():
-                    all_records.extend(ds_records)
-                # Evaluate as single dataset (pass skip_llm_judge flag)
-                results = module.evaluate_dataset_dvc("Overall", all_records, skip_llm_judge=skip_llm_judge)
-                # Print results
                 print(f"\nDense Video Captioning Metrics:")
-                if 'overall' in results:
-                    overall_metrics = results['overall']
-                    for metric_name, value in overall_metrics.items():
-                        if isinstance(value, (int, float)):
-                            print(f"  {metric_name}: {value:.4f}")
             elif task_name == "cvs_assessment":
-                module = load_eval_module("eval_cvs_assessment")
-                temp_data = {str(i): record for i, record in enumerate(task_records)}
-                dataset_records_dict = module.group_records_by_dataset(temp_data)
-                # Combine all records across datasets
-                all_records = []
-                for ds_records in dataset_records_dict.values():
-                    all_records.extend(ds_records)
-                # Evaluate combined
-                results = module.evaluate_cvs_assessment(all_records)
-                # Print results
-                print(f"\nCVS Assessment Metrics:")
-                if "overall" in results:
-                    for metric_name, value in results["overall"].items():
-                        if isinstance(value, (int, float)):
-                            print(f"  {metric_name}: {value:.4f}")
                 else:
-                    for metric_name, value in results.items():
-                        if isinstance(value, (int, float)):
-                            print(f"  {metric_name}: {value:.4f}")
             elif task_name == "skill_assessment":
-                module = load_eval_module("eval_skill_assessment")
-                temp_data = {str(i): record for i, record in enumerate(task_records)}
-                dataset_records_dict = module.group_records_by_dataset(temp_data)
-                # Combine all records across datasets
-                all_records = []
-                for ds_records in dataset_records_dict.values():
-                    all_records.extend(ds_records)
-                # Evaluate combined
-                results = module.evaluate_skill_assessment(all_records)
-                # Print results
-                print(f"\nSkill Assessment Metrics:")
-                if "overall" in results:
-                    for metric_name, value in results["overall"].items():
-                        if isinstance(value, (int, float)):
-                            print(f"  {metric_name}: {value:.4f}")
                 else:
-                    for metric_name, value in results.items():
-                        if isinstance(value, (int, float)):
-                            print(f"  {metric_name}: {value:.4f}")
             else:
                 print(f"Overall evaluation not implemented for {task_name} yet")
         except Exception as e:
-            print(f"Error running overall evaluation for {task_name}: {e}")
             import traceback
             traceback.print_exc()

 def print_overall_evaluation_results(output_file, tasks, all_task_results, skip_llm_judge=False):
+    """Print evaluation results in overall mode using cached per-dataset results.
+    Aggregates per-dataset results from _run_task_eval (pooled across all datasets)
+    so that each data point is only evaluated once.
     """
+    import numpy as np
     print(f"\n{'='*80}")
     print(f"EVALUATION RESULTS - OVERALL (Dataset-Agnostic)")
     print(f"{'='*80}")
     for task_name in sorted(tasks):
         print(f"\n{'='*80}")
         print(f"{task_name.upper()} - Overall Evaluation (All Datasets Combined)")
         print(f"{'='*80}")
+        cached = all_task_results.get(task_name, {})
+        if not cached:
+            print(f"No results found for {task_name}")
             continue
         try:
             if task_name == "tal":
+                per_dataset = cached.get('per_dataset', {})
+                if per_dataset:
+                    # Pool per-group meanIoU across datasets and FPS groups, weighting each group by its sample count
+                    all_miou_03 = []
+                    all_miou_05 = []
+                    for ds_name, fps_results in per_dataset.items():
+                        for fps_key, metrics in fps_results.items():
+                            if isinstance(metrics, dict) and 'meanIoU@0.3' in metrics:
+                                count = metrics.get('count', 1)
+                                all_miou_03.extend([metrics['meanIoU@0.3']] * count)
+                                all_miou_05.extend([metrics['meanIoU@0.5']] * count)
+                    print(f"\n  mIoU@0.3: {np.mean(all_miou_03):.4f}" if all_miou_03 else "\n  mIoU@0.3: 0.0000")
+                    print(f"  mIoU@0.5: {np.mean(all_miou_05):.4f}" if all_miou_05 else "  mIoU@0.5: 0.0000")
+                else:
+                    print(f"  mIoU@0.3: {cached.get('meanIoU@0.3', 0.0):.4f}")
+                    print(f"  mIoU@0.5: {cached.get('meanIoU@0.5', 0.0):.4f}")
             elif task_name == "stg":
+                per_dataset = cached.get('per_dataset', {})
+                if per_dataset:
+                    # Pool per-dataset mean IoUs across datasets, weighting each by its sample count
                     all_ious = []
+                    for ds_name, fps_results in per_dataset.items():
+                        if 'overall' in fps_results:
+                            count = fps_results['overall'].get('valid_records', 1)
+                            miou = fps_results['overall'].get('mean_iou', 0.0)
+                            all_ious.extend([miou] * count)
+                        else:
+                            for fps_key, metrics in fps_results.items():
+                                if isinstance(metrics, dict) and 'mIoU' in metrics:
+                                    count = metrics.get('count', 1)
+                                    all_ious.extend([metrics['mIoU']] * count)
+                    print(f"\nmean_iou: {np.mean(all_ious):.4f}" if all_ious else "\nmean_iou: 0.0000")
+                else:
+                    print(f"\nmean_iou: {cached.get('mean_iou', 0.0):.4f}")
             elif task_name in ["rc", "vs"]:
+                # LLM judge — use cached results directly (already pooled)
+                if 'score' in cached:
+                    print(f"Method: {cached['method']}")
+                    print(f"Score: {cached['score']:.4f} ({cached['scale']} scale)")
+                    if 'aspect_scores' in cached:
                         print("Aspect Scores:")
+                        for aspect, score in sorted(cached['aspect_scores'].items()):
                             print(f"  {aspect}: {score:.3f}")
+                else:
+                    print(f"No LLM judge results available")
             elif task_name == "next_action":
+                per_dataset = cached.get('per_dataset', {})
+                if per_dataset:
+                    # Recover per-dataset correct counts from accuracy * count, then pool across datasets
+                    total_correct = 0
+                    total_samples = 0
+                    for ds_name, fps_results in per_dataset.items():
+                        if 'overall' in fps_results:
+                            acc = fps_results['overall'].get('accuracy', 0.0)
+                            count = fps_results['overall'].get('count', 0)
+                            total_correct += round(acc * count)
+                            total_samples += count
+                    if total_samples > 0:
+                        print(f"\n  accuracy: {total_correct / total_samples:.4f}")
+                    else:
+                        print(f"\n  accuracy: 0.0000")
                 else:
+                    print(f"\n  accuracy: {cached.get('accuracy', 0.0):.4f}")
             elif task_name == "dvc":
+                per_dataset = cached.get('per_dataset', {})
                 print(f"\nDense Video Captioning Metrics:")
+                if per_dataset:
+                    # Pool caption_score and temporal_f1 weighted by sample count
+                    total_caption = 0.0
+                    total_f1 = 0.0
+                    total_count = 0
+                    for ds_name, ds_results in per_dataset.items():
+                        if ds_results and 'overall' in ds_results:
+                            overall = ds_results['overall']
+                            count = overall.get('count', 0)
+                            total_caption += overall.get('caption_score', 0.0) * count
+                            total_f1 += overall.get('temporal_f1', 0.0) * count
+                            total_count += count
+                    if total_count > 0:
+                        print(f"  caption_score: {total_caption / total_count:.4f}")
+                        print(f"  temporal_f1: {total_f1 / total_count:.4f}")
+                else:
+                    for metric_name in ['caption_score', 'temporal_f1']:
+                        if metric_name in cached and isinstance(cached[metric_name], (int, float)):
+                            print(f"  {metric_name}: {cached[metric_name]:.4f}")
             elif task_name == "cvs_assessment":
+                per_dataset = cached.get('per_dataset', {})
+                if per_dataset:
+                    print(f"\n  component_balanced_accuracy: {cached.get('component_balanced_accuracy', 0.0):.4f}")
                 else:
+                    print(f"\n  component_balanced_accuracy: {cached.get('component_balanced_accuracy', 0.0):.4f}")
             elif task_name == "skill_assessment":
+                per_dataset = cached.get('per_dataset', {})
+                if per_dataset:
+                    print(f"\n  aspect_balanced_accuracy: {cached.get('aspect_balanced_accuracy', 0.0):.4f}")
                 else:
+                    print(f"\n  aspect_balanced_accuracy: {cached.get('aspect_balanced_accuracy', 0.0):.4f}")
             else:
                 print(f"Overall evaluation not implemented for {task_name} yet")
         except Exception as e:
+            print(f"Error printing overall evaluation for {task_name}: {e}")
             import traceback
             traceback.print_exc()
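
The pooling done by the new `print_overall_evaluation_results` is a count-weighted mean over cached per-dataset results. The sketch below is one way to express that arithmetic in isolation. It assumes each dataset entry carries an `overall` dict with a metric value and a `count`. The helper names `pooled_mean` and `pooled_accuracy` and the sample numbers are hypothetical and not part of this commit.

```python
def pooled_mean(per_dataset, metric, count_key="count"):
    """Count-weighted mean of `metric` over per-dataset 'overall' results."""
    weighted_sum = 0.0
    total = 0
    for ds_results in per_dataset.values():
        overall = (ds_results or {}).get("overall")
        if not overall:
            continue  # dataset without an 'overall' entry contributes nothing
        count = int(overall.get(count_key, 0))
        weighted_sum += overall.get(metric, 0.0) * count
        total += count
    return weighted_sum / total if total > 0 else 0.0


def pooled_accuracy(per_dataset):
    """Recover per-dataset correct counts with round(), then pool them."""
    correct = 0
    total = 0
    for ds_results in per_dataset.values():
        overall = (ds_results or {}).get("overall")
        if not overall:
            continue
        count = int(overall.get("count", 0))
        # round() rather than int(): accuracy * count can land just below the
        # true integer in floating point, and int() would then drop a sample.
        correct += round(overall.get("accuracy", 0.0) * count)
        total += count
    return correct / total if total > 0 else 0.0


# Hypothetical cached results for two datasets (illustrative values only).
example = {
    "dataset_a": {"overall": {"accuracy": 0.57, "mean_iou": 0.40, "count": 100}},
    "dataset_b": {"overall": {"accuracy": 0.50, "mean_iou": 0.60, "count": 300}},
}

print(f"mean_iou: {pooled_mean(example, 'mean_iou'):.4f}")
print(f"accuracy: {pooled_accuracy(example):.4f}")
```

This also illustrates why the new code uses `round(acc * count)` where the old code used `int(accuracy * evaluated_count)`: with an accuracy of 0.57 over 100 samples, truncation recovers 56 correct predictions instead of 57.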

evaluation/evaluate_predictions.py CHANGED

@@ -174,7 +174,12 @@ def _parse_metrics_from_output(output):
         line = line.strip()
         # Detect task sections
-        if "TAL" in line and "Overall" in line:
             current_task = "tal"
         elif "STG" in line and "Overall" in line:
             current_task = "stg"
@@ -186,10 +191,6 @@ def _parse_metrics_from_output(output):
             current_task = "rc"
         elif ("VS" in line and "Overall" in line) or "Video Summary" in line:
             current_task = "vs"
-        elif ("SKILL" in line and "Overall" in line) or "Skill Assessment" in line:
-            current_task = "skill_assessment"
-        elif ("CVS" in line and "Overall" in line) or "CVS Assessment" in line:
-            current_task = "cvs_assessment"
         if current_task == "tal":
             if "IoU_0.3:" in line:
@@ -226,10 +227,12 @@ def _parse_metrics_from_output(output):
                     metrics["dvc_f1"] = float(line.split(":")[-1].strip())
             elif current_task == "vs" and ("score" in line.lower() or "average" in line.lower()):
-                metrics["vs_llm"] = float(line.split(":")[-1].strip())
             elif current_task == "rc" and ("score" in line.lower() or "average" in line.lower()):
-                metrics["rc_llm"] = float(line.split(":")[-1].strip())
             elif current_task == "skill_assessment" and "aspect_balanced_accuracy" in line.lower():
                 metrics["sa_acc"] = float(line.split(":")[1].split("(")[0].strip())

         line = line.strip()
         # Detect task sections
+        # NOTE: Order matters — check CVS before VS (since "CVS" contains "VS")
+        if ("CVS" in line and "Overall" in line) or "CVS Assessment" in line:
+            current_task = "cvs_assessment"
+        elif ("SKILL" in line and "Overall" in line) or "Skill Assessment" in line:
+            current_task = "skill_assessment"
+        elif "TAL" in line and "Overall" in line:
             current_task = "tal"
         elif "STG" in line and "Overall" in line:
             current_task = "stg"
             current_task = "rc"
         elif ("VS" in line and "Overall" in line) or "Video Summary" in line:
             current_task = "vs"
         if current_task == "tal":
             if "IoU_0.3:" in line:
                     metrics["dvc_f1"] = float(line.split(":")[-1].strip())
             elif current_task == "vs" and ("score" in line.lower() or "average" in line.lower()):
+                val_str = line.split(":")[-1].strip().split("(")[0].strip()
+                metrics["vs_llm"] = float(val_str)
             elif current_task == "rc" and ("score" in line.lower() or "average" in line.lower()):
+                val_str = line.split(":")[-1].strip().split("(")[0].strip()
+                metrics["rc_llm"] = float(val_str)
             elif current_task == "skill_assessment" and "aspect_balanced_accuracy" in line.lower():
                 metrics["sa_acc"] = float(line.split(":")[1].split("(")[0].strip())
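
The two parser fixes in `_parse_metrics_from_output` can be checked in isolation. The sketch below is a minimal approximation, not the actual function: `detect_task` and `parse_score` are hypothetical helpers, only the CVS and VS branches are reproduced, and the sample score line (including the `1-5` scale text) is an assumed format modelled on the `Score: ... (... scale)` print statement.

```python
def detect_task(line):
    """Return the task key for a section-header line, or None if no match."""
    # CVS must be tested before VS: the substring "VS" also occurs in "CVS",
    # so a VS-first check would misclassify CVS headers as video summary.
    if ("CVS" in line and "Overall" in line) or "CVS Assessment" in line:
        return "cvs_assessment"
    if ("VS" in line and "Overall" in line) or "Video Summary" in line:
        return "vs"
    return None


def parse_score(line):
    """Parse the number after the last colon, dropping a trailing '(...)' note."""
    val_str = line.split(":")[-1].strip().split("(")[0].strip()
    return float(val_str)


header = "CVS_ASSESSMENT - Overall Evaluation (All Datasets Combined)"
score_line = "Score: 3.2500 (1-5 scale)"  # assumed output format

print(detect_task(header))      # cvs_assessment
print(parse_score(score_line))  # 3.25
```

Without the `split("(")[0]` step, `float("3.2500 (1-5 scale)")` raises `ValueError`, which is the failure the added `val_str` lines avoid.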