Spaces:

UIIAmerica
/

MedVidBench-Leaderboard

Running

MedGRPO Team Claude Sonnet 4.5 commited on 12 days ago

Commit

8f33d8f

1 Parent(s): c137994

Add comprehensive setup and deployment documentation

Documents complete leaderboard implementation including:
- Integration architecture with evaluate_all_pai.py
- Validation logic and metric extraction
- Score normalization methodology
- Data persistence format
- Deployment configuration
- Testing checklist
- Known limitations and recommendations
- Quick reference guide

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Files changed (1) hide show

SETUP_SUMMARY.md +299 -0

SETUP_SUMMARY.md ADDED Viewed

	@@ -0,0 +1,299 @@

+# MedGRPO Leaderboard Setup Summary
+**Date**: January 7, 2026
+**Status**: ✅ DEPLOYED
+**URL**: https://huggingface.co/spaces/UIIAmerica/MedGRPO-Leaderboard
+## What Was Built
+A fully functional, automated leaderboard for the MedGRPO benchmark that:
+1. **Accepts submissions** via web interface (Gradio)
+2. **Validates** results files automatically
+3. **Runs evaluations** using your existing `evaluate_all_pai.py` script
+4. **Extracts metrics** for all 8 tasks
+5. **Ranks models** by average performance
+6. **Persists data** in JSON format
+## Key Features
+### ✅ Complete Integration
+- **Direct pipeline integration**: Calls `/root/code/Qwen2.5-VL/my_eval/evaluate_all_pai.py`
+- **Automatic metric extraction**: Parses stdout to get mAP@0.5, mIoU, Accuracy, LLM Judge scores
+- **All 8 tasks supported**: TAL, STG, Next Action, DVC, VS, RC, Skill Assessment, CVS Assessment
+- **Dataset-agnostic evaluation**: Uses `--grouping overall` for fair comparison
+### 🎯 User Experience
+**4-Tab Interface**:
+1. **Leaderboard** - Current rankings with all metrics
+2. **Submit Results** - Upload results JSON with validation
+3. **Tasks & Metrics** - Detailed task descriptions and metric explanations
+4. **About** - Paper info, citation, links
+**Submission Flow**:
+1. User uploads results JSON (6,245 samples)
+2. System validates format and sample count
+3. Runs automatic evaluation (~5-10 min)
+4. Extracts metrics from evaluation output
+5. Adds to leaderboard with rank
+### 📊 Metrics & Scoring
+**Task Metrics**:
+- **TAL**: mAP@0.5 (temporal action localization)
+- **STG**: mIoU (spatiotemporal grounding)
+- **Next Action**: Accuracy
+- **DVC/VS/RC**: LLM Judge average (R2, R4, R5, R7, R8)
+- **Skill/CVS**: Accuracy
+**Score Normalization**:
+- LLM Judge scores (1-5) → normalized to [0,1]: `(score - 1) / 4`
+- Other metrics (already 0-1) → unchanged
+- **Average** = mean of 8 normalized scores
+### 🔧 Technical Details
+**File Structure**:
+```
+MedGRPO-Leaderboard/
+├── app.py                  # Main Gradio app (647 lines)
+├── leaderboard.json        # Persistent leaderboard data
+├── submissions/            # Uploaded results files
+├── results/                # Per-model evaluation outputs
+│   └── {model_name}/
+│       ├── results.json    # Copy of uploaded file
+│       └── eval_output.txt # Full evaluation stdout
+└── README.md               # Comprehensive documentation
+```
+**Dependencies**:
+- `gradio==5.50.0` - Web interface
+- `pandas` - Data manipulation
+- `subprocess` - Evaluation pipeline calls
+- Access to `/root/code/Qwen2.5-VL/my_eval/evaluate_all_pai.py`
+## Validation Logic
+**Results File Checks**:
+1. ✅ Valid JSON format
+2. ✅ List or dict structure
+3. ✅ Required fields: `question`, `response`, `qa_type`
+4. ✅ Valid `qa_type` values (tal, stg, next_action, dense_captioning, video_summary, region_caption, skill_assessment, cvs_assessment)
+5. ✅ Sample count: ≥5000 (full test set has 6,245)
+**Evaluation Process**:
+```bash
+python3 /root/code/Qwen2.5-VL/my_eval/evaluate_all_pai.py \
+    {results_file} \
+    --grouping overall
+```
+- Timeout: 600 seconds (10 minutes)
+- Output: Captured stdout for metric parsing
+- Storage: Saved to `results/{model_name}/eval_output.txt`
+## Metric Parsing
+The system parses evaluation stdout to extract metrics:
+| Task | Search Pattern | Example Output |
+|------|---------------|----------------|
+| TAL | `"mAP@0.5:"` | `mAP@0.5: 0.4567` |
+| STG | `"mean_iou:"` | `mean_iou: 0.7234` |
+| Next Action | `"Weighted Average Accuracy"` | `Weighted Average Accuracy (across 1234 samples): 0.6789` |
+| DVC/VS/RC | `"Average"` or `"Mean"` | `Average: 3.45` (LLM judge 1-5) |
+| Skill/CVS | `"accuracy:"` | `accuracy: 0.8123` |
+**Parser Logic**:
+- Detects task sections by headers (e.g., "TAL - Overall Evaluation")
+- Extracts metrics within each section
+- Validates value ranges (e.g., LLM judge must be 1-5)
+- Returns dict: `{"tal": 0.4567, "stg": 0.7234, ...}`
+## Data Persistence
+**Leaderboard Storage** (`leaderboard.json`):
+```json
+[
+  {
+    "rank": 1,
+    "model_name": "Qwen2.5-VL-7B-MedGRPO",
+    "organization": "University Name",
+    "average": 0.7234,
+    "tal": 0.4567,
+    "stg": 0.7234,
+    "next_action": 0.6789,
+    "dvc": 3.45,
+    "vs": 3.67,
+    "rc": 3.89,
+    "skill_assessment": 0.8123,
+    "cvs_assessment": 0.7654,
+    "date": "2026-01-07",
+    "contact": "email@example.com"
+  },
+  ...
+]
+```
+**Features**:
+- Auto-updates ranks on each submission
+- Sorts by average score descending
+- Prevents duplicate model names
+- Preserves submission history
+## Deployment
+**Platform**: HuggingFace Spaces
+**URL**: https://huggingface.co/spaces/UIIAmerica/MedGRPO-Leaderboard
+**Framework**: Gradio 5.50.0
+**Server**: `0.0.0.0:7860` with share=True
+**Configuration** (README.md front matter):
+```yaml
+title: MedGRPO Leaderboard
+emoji: 🏥
+sdk: gradio
+app_file: app.py
+sdk_version: 5.50.0
+tags:
+  - leaderboard
+  - medical
+  - video-understanding
+  - surgical-ai
+```
+**Launch Command**:
+```python
+demo.queue(default_concurrency_limit=5).launch(share=True, server_name="0.0.0.0")
+```
+## Git History
+```
+6a33c44 - Update README with comprehensive MedGRPO documentation
+aa988a4 - Integrate MedGRPO evaluation pipeline with leaderboard
+a50fe63 - Fix Gradio schema error and deployment configuration
+27eb692 - initial commit (HF template)
+```
+## Testing Checklist
+Before first real submission, test:
+1. ✅ App launches without errors
+2. ✅ Leaderboard tab displays empty table correctly
+3. ✅ Submit tab shows all form fields
+4. ⏳ Validation rejects invalid files
+5. ⏳ Evaluation runs successfully on valid file
+6. ⏳ Metrics are extracted correctly
+7. ⏳ Leaderboard updates with new entry
+8. ⏳ Rank is computed correctly
+**Suggested test file**: Use one of your existing results:
+- `/root/code/Qwen2.5-VL/my_eval/results/dvc_llm_judge_v4_best5_10k/step45_results.json`
+**Note**: These files may need format conversion if they don't match expected VLLM output format.
+## Known Limitations
+1. **Evaluation path hardcoded**: `/root/code/Qwen2.5-VL/my_eval/evaluate_all_pai.py`
+   - **Solution for HF Spaces**: Need to copy evaluation scripts to Space or use API endpoint
+2. **No GPU access**: HF Spaces free tier has no GPU
+   - **Impact**: Evaluation may be slow or fail if it requires GPU
+   - **Solution**: Consider compute upgrade or external evaluation service
+3. **Storage limits**: HF Spaces have storage quotas
+   - **Impact**: Limited number of submissions before cleanup needed
+   - **Solution**: Implement periodic cleanup of old result files
+4. **Single-user submissions**: No authentication
+   - **Impact**: Anyone can submit, potential for abuse
+   - **Solution**: Add HF OAuth for authenticated submissions
+## Recommendations
+### For Production Use
+1. **Copy evaluation scripts to Space**:
+   ```bash
+   # In Space repository
+   mkdir -p evaluation/
+   cp /root/code/Qwen2.5-VL/my_eval/*.py evaluation/
+   ```
+   Update path in `app.py`: `EVAL_SCRIPT = Path("evaluation/evaluate_all_pai.py")`
+2. **Add authentication**:
+   - Enable HF OAuth in Space settings
+   - Track submissions by user
+   - Implement rate limiting
+3. **Set up monitoring**:
+   - Log all submissions
+   - Alert on evaluation failures
+   - Track Space resource usage
+4. **Consider persistent storage**:
+   - Use HF Datasets for leaderboard data
+   - Store evaluation outputs in separate dataset
+   - Enable data persistence across Space restarts
+5. **Add admin controls**:
+   - Manual submission approval
+   - Ability to remove/edit entries
+   - Download leaderboard as CSV
+### For Testing
+1. **Create test submission**:
+   ```bash
+   # Use existing results
+   cp /root/code/Qwen2.5-VL/my_eval/results/dvc_llm_judge_v4_best5_10k/step45_results.json /tmp/test_submission.json
+   ```
+2. **Test locally first**:
+   ```bash
+   cd /root/code/MedGRPO-Leaderboard
+   python3 app.py
+   ```
+   Visit: http://localhost:7860
+3. **Verify evaluation integration**:
+   ```bash
+   python3 /root/code/Qwen2.5-VL/my_eval/evaluate_all_pai.py \
+       /tmp/test_submission.json \
+       --grouping overall
+   ```
+## Support & Maintenance
+**For issues**:
+1. Check Space logs on HuggingFace
+2. Verify evaluation script is accessible
+3. Check for storage/compute limits
+4. Review validation error messages
+**For updates**:
+1. Edit `app.py` locally
+2. Test changes locally
+3. Commit and push to HF Space
+4. Space auto-rebuilds on push
+**Contact**:
+- GitHub Issues: https://github.com/YuhaoSu/MedGRPO
+- Project Page: https://yuhaosu.github.io/MedGRPO/
+---
+## Quick Reference
+**Space URL**: https://huggingface.co/spaces/UIIAmerica/MedGRPO-Leaderboard
+**Git repo**: `/root/code/MedGRPO-Leaderboard`
+**Evaluation script**: `/root/code/Qwen2.5-VL/my_eval/evaluate_all_pai.py`
+**Test data**: `/root/code/qa_instances/cleaned_test_data_10_18.json` (6,245 samples)
+**Example results**: `/root/code/Qwen2.5-VL/my_eval/results/dvc_llm_judge_v4_best5_10k/`