WhyDidItFail / inference.py

Commit History

chore: removing logs
1dce05c

samrat-rm committed on

fix: remove exception
43c1c2a

samrat-rm committed on

feat: rewards upgrade
61e83f1

samrat-rm committed on

feat: updating prompt and reward condition
f58e721

samrat-rm committed on

chore: updating the [END] log
87f9568

samrat-rm committed on

feat: LLM agent model change
3ef6b97

samrat-rm committed on

fix: clamp all rewards and scores to [0.10, 0.90]
d3b224f

samrat-rm Claude Sonnet 4.6 committed on

fix: clamp all score paths to (0.01, 0.99), fix reward field name, add per-task score line
bf98c78

samrat-rm committed on

fix: reduced the number of scenarios in inference
196955c

samrat-rm committed on

fix: logs in inference
c348367

samrat-rm committed on

chore: doc string update and remove unused import
0252dc5

samrat-rm committed on

fix: score condition
d933934

samrat-rm committed on

fix: score format and range
a583a04

samrat-rm committed on

fix: score range
26f0b41

samrat-rm committed on

fix: enforce reward bounds (0.01–0.99) and 2 decimal precision across grader, env, and inference
3781ce7

samrat-rm committed on

fix: reward scores are updated to be between 0 and 1
c130122

samrat-rm committed on

chore: code cleanup
2014a9f

samrat-rm committed on

chore: logs format update
e7b5e0d

samrat-rm committed on

chore: updating logs
faf4fb8

samrat-rm committed on

fix: rewards field shows single composite score instead of step CSV
f74015b

samrat-rm committed on

chore: 2 scenarios per difficulty - inference
a884ccb

samrat-rm committed on

chore: only one scenario is run per difficulty
d6bb519

samrat-rm committed on

chore: restrict stdout to START/STEP/END for eval compliance
87b840b

samrat-rm committed on

chore: clean up all the unnecessary comments
afa4b9d

samrat-rm committed on

chore: updating logs
051a1af

samrat-rm committed on

fix: comply with openenv stdout spec, preserve inspection data in history, sharpen medium-tier label rules
77f9568

samrat-rm committed on

fix: error handling for episode run loop
aa1c27d

samrat-rm committed on

fix: updating the logs to align with the evaluation standards
2ae2b18

samrat-rm committed on

fix: error handling for HF_TOKEN env
5ca33a3

samrat-rm committed on

fix: tighten label rules for underfitting, overfitting, and vanishing gradients
25fff92

samrat-rm committed on

feat: add judge fallback
53f3a58

samrat-rm committed on

refactor: moving llm judge inside server dir
149177d

samrat-rm committed on

fix: harden label rules to prevent missing_regularization misfires
3eeca00

samrat-rm committed on

feat(inference): add label decision rules and stop rules to system prompt
d8e7a25

samrat-rm Claude Sonnet 4.6 committed on

feat: add local agent support
e8c7211

samrat-rm committed on

feat: upgrading the system and user prompt, upgrading the _make_env() function
dc7aeea

samrat-rm committed on

feat: adding [END] log for each episode and error handling for websocket
a310ad6

samrat-rm committed on

chore: docstring update
7cc0ee9

samrat-rm committed on

fix: adding reasoning in episode run loop and refactor comments
775f9bc

samrat-rm committed on

feat: implementing judge LLM which contributes to 15% of scoring
ae1e803

samrat-rm committed on

feat: 3 modes of difficulty and updating the logs
66d62a2

samrat-rm committed on

feat: init inference
b3d65f0

samrat-rm committed on

refactor: WhyDidItFailAction and WhyDidItFailObservation classes
87037e2

samrat-rm committed on

Initial commit
b37875f

samrat-rm committed on