fix: clamp all rewards and scores to [0.10, 0.90] d3b224f samrat-rm Claude Sonnet 4.6 committed 8 days ago
fix: clamp all score paths to (0.01, 0.99), fix reward field name, add per-task score line bf98c78 samrat-rm committed 8 days ago
fix: enforce reward bounds (0.01–0.99) and 2 decimal precision across grader, env, and inference 3781ce7 samrat-rm committed 8 days ago
fix: rewards field shows single composite score instead of step CSV f74015b samrat-rm committed 8 days ago
chore: restrict stdout to START/STEP/END for eval compliance 87b840b samrat-rm committed 8 days ago
fix: comply with openenv stdout spec, preserve inspection data in history, sharpen medium-tier label rules 77f9568 samrat-rm committed 8 days ago
fix: updating the logs to align with the evaluation standards 2ae2b18 samrat-rm committed 8 days ago
fix: tighten label rules for underfitting, overfitting, and vanishing gradients 25fff92 samrat-rm committed 8 days ago
fix: harden label rules to prevent missing_regularization misfires 3eeca00 samrat-rm committed 8 days ago
feat(inference): add label decision rules and stop rules to system prompt d8e7a25 samrat-rm Claude Sonnet 4.6 committed 9 days ago
feat: upgrading the system and user prompt, upgrading the _make_env() function dc7aeea samrat-rm committed 9 days ago
feat: adding [END] log for each episode and error handling for websocket a310ad6 samrat-rm committed 9 days ago
fix: adding reasoning in episode run loop and refactor comments 775f9bc samrat-rm committed 9 days ago
feat: implementing judge LLM which contributes to 15% of scoring ae1e803 samrat-rm committed 9 days ago
refactor: WhyDidItFailAction and WhyDidItFailObservation classes 87037e2 samrat-rm committed 11 days ago