Fix evaluate_all_pai to pass --skip-llm-judge to task main() functions 18339c0 MedGRPO Team committed on about 14 hours ago
Fix eval_dvc.py main() to support --skip-llm-judge flag 5f41159 MedGRPO Team committed on about 14 hours ago
Fix DVC evaluation to compute temporal F1 when --skip-llm-judge is set dd1b9c6 MedGRPO Team Claude Sonnet 4.5 committed on about 14 hours ago
Add semantic similarity matching for Next Action evaluation a66b9a4 MedGRPO Team Claude Sonnet 4.5 committed on about 15 hours ago
Fix TAL overall metrics computation and extraction c8f4cad MedGRPO Team committed on about 15 hours ago
Fix STG evaluation to extract bbox_dict from struc_info 0b29eca MedGRPO Team committed on about 15 hours ago
Add placeholder for STG metrics when evaluation returns empty 6d00cf8 MedGRPO Team committed on about 15 hours ago
Fix STG and next_action metric printing in overall mode 049c07c MedGRPO Team committed on about 15 hours ago
Show evaluation metrics even in silent mode (--grouping overall) 2323031 MedGRPO Team committed on about 15 hours ago
Filter DVC/VS/RC from tasks list when --skip-llm-judge is set 2bd924c MedGRPO Team committed on about 16 hours ago
Implement secure ground truth with prediction-only submission format fe743f5 MedGRPO Team committed on 6 days ago
Add support for pre-computed LLM judge scores 31817d3 MedGRPO Team Claude Sonnet 4.5 committed on 7 days ago
Update evaluation metrics and leaderboard display a36b7fe MedGRPO Team Claude Sonnet 4.5 committed on 7 days ago
Consolidate evaluation scripts and remove hardcoded paths 3ea8a3a MedGRPO Team committed on 7 days ago
Optimize evaluation scripts - remove optional files f986294 MedGRPO Team Claude Sonnet 4.5 committed on 7 days ago
Copy evaluation scripts to leaderboard and clean up template code ba8d0d4 MedGRPO Team Claude Sonnet 4.5 committed on 7 days ago