chore: Apply Bug #2 and Bug #3 strict min/max bound clamping to prevent out-of-range scores and fix Windows encoding ee547a6 immortalindeed committed on Apr 12
Sync Web UI [END] logs to use the exact formal score= specification format f63920a immortalindeed committed on Apr 12
Fix Phase 2 OpenEnv validation traps: add grader paths to openenv.yaml and safe parameterless defaults 699f953 immortalindeed committed on Apr 11
Spec-compliance overhaul: remove difficulty_multiplier, weighted blend scoring, dep_hard fix, [END] format f3fd4ef immortalindeed committed on Apr 11
Fix dep_hard Counter bug, add fatal error handling, update README with 14-model benchmark 3466d21 immortalindeed committed on Apr 10
Major grading overhaul: difficulty multiplier, tighter scoring, mastery removal, precision penalties 72b3e8d immortalindeed committed on Apr 10
Fix state machine bugs and switch to average scoring for discriminative benchmarking cd5104a immortalindeed committed on Apr 10
Fix score aggregation: use max(rewards) for discriminative multi-turn scoring fe9aa5c immortalindeed committed on Apr 10
Remove rate limiter (blocks evaluator) and fix score aggregation to clamped sum 3dfb5fe immortalindeed committed on Apr 10
fix(benchmark): Hardening multi-agent environment and strict score compliance 6f95f2a immortalindeed committed on Apr 9
Clamp scores strictly to (0.01, 0.99) to pass OpenEnv Phase 2 continuous environment score verification checks 829f543 immortalindeed committed on Apr 8
Ensure /step returns info object perfectly matching OpenEnv spec 09576c0 immortalindeed committed on Apr 8
Fix benchmark output saving: add results dir and print errors 9bb611a immortalindeed committed on Apr 8