chore: Apply Bug #2 and Bug #3 strict min/max bound clamping to prevent out-of-range scores and fix Windows encoding ee547a6 immortalindeed committed on Apr 12
Final strict spec-compliance polish: score precision, empty rewards, updated test assertions 6284048 immortalindeed committed on Apr 12
Fix syntax of [END] STDOUT line to perfectly match Hackathon mandatory format with score= parameter f96532b immortalindeed committed on Apr 11
Fix: abort [END] lines use rewards=0.01 instead of empty rewards= to prevent evaluator 0.0 score 723407b immortalindeed committed on Apr 11
Spec-compliance overhaul: remove difficulty_multiplier, weighted blend scoring, dep_hard fix, [END] format f3fd4ef immortalindeed committed on Apr 11
Fix dep_hard Counter bug, add fatal error handling, update README with 14-model benchmark 3466d21 immortalindeed committed on Apr 10
Fix state machine bugs and switch to average scoring for discriminative benchmarking cd5104a immortalindeed committed on Apr 10
Fix score aggregation: use max(rewards) for discriminative multi-turn scoring fe9aa5c immortalindeed committed on Apr 10
Remove rate limiter (blocks evaluator) and fix score aggregation to clamped sum 3dfb5fe immortalindeed committed on Apr 10
fix(benchmark): Hardening multi-agent environment and strict score compliance 6f95f2a immortalindeed committed on Apr 9
Clamp scores strictly to (0.01, 0.99) to pass OpenEnv Phase 2 continuous environment score verification checks 829f543 immortalindeed committed on Apr 8
Revert incorrect log parsing changes and fix reward summation logic d270d2a immortalindeed committed on Apr 8
Fix benchmark output saving: add results dir and print errors 9bb611a immortalindeed committed on Apr 8