Fix evaluate_all_pai to pass --skip-llm-judge to task main() functions 18339c0 MedGRPO Team committed on Jan 13
Fix DVC evaluation to compute temporal F1 when --skip-llm-judge is set dd1b9c6 MedGRPO Team Claude Sonnet 4.5 committed on Jan 13
Add semantic similarity matching for Next Action evaluation a66b9a4 MedGRPO Team Claude Sonnet 4.5 committed on Jan 13
Add placeholder for STG metrics when evaluation returns empty 6d00cf8 MedGRPO Team committed on Jan 13
Show evaluation metrics even in silent mode (--grouping overall) 2323031 MedGRPO Team committed on Jan 13
Implement secure ground truth with prediction-only submission format fe743f5 MedGRPO Team committed on Jan 8
Add support for pre-computed LLM judge scores 31817d3 MedGRPO Team Claude Sonnet 4.5 committed on Jan 7
Update evaluation metrics and leaderboard display a36b7fe MedGRPO Team Claude Sonnet 4.5 committed on Jan 7
Optimize evaluation scripts - remove optional files f986294 MedGRPO Team Claude Sonnet 4.5 committed on Jan 7
Copy evaluation scripts to leaderboard and clean up template code ba8d0d4 MedGRPO Team Claude Sonnet 4.5 committed on Jan 7