Spaces: samrat-rm / WhyDidItFail

WhyDidItFail / inference.py

Commit History

chore: removing logs

1dce05c

samrat-rm committed 8 days ago

fix: remove exception

43c1c2a

samrat-rm committed 8 days ago

feat: rewards upgrade

61e83f1

samrat-rm committed 8 days ago

feat: updating prompt and reward condition

f58e721

samrat-rm committed 8 days ago

chore: updating the [END] log

87f9568

samrat-rm committed 8 days ago

feat: LLM agent model change

3ef6b97

samrat-rm committed 8 days ago

fix: clamp all rewards and scores to [0.10, 0.90]

d3b224f

samrat-rm and Claude Sonnet 4.6 committed 8 days ago

fix: clamp all score paths to (0.01, 0.99), fix reward field name, add per-task score line

bf98c78

samrat-rm committed 8 days ago

fix: reduced the number of scenarios in inference

196955c

samrat-rm committed 8 days ago

fix: logs in inference

c348367

samrat-rm committed 8 days ago

chore: doc string update and remove unused import

0252dc5

samrat-rm committed 8 days ago

fix: score condition

d933934

samrat-rm committed 8 days ago

fix: score format and range

a583a04

samrat-rm committed 8 days ago

fix: score range

26f0b41

samrat-rm committed 8 days ago

fix: enforce reward bounds (0.01–0.99) and 2 decimal precision across grader, env, and inference

3781ce7

samrat-rm committed 8 days ago

fix: reward scores are updated to be between 0 and 1

c130122

samrat-rm committed 8 days ago

chore: code cleanup

2014a9f

samrat-rm committed 8 days ago

chore: logs format update

e7b5e0d

samrat-rm committed 8 days ago

chore: updating logs

faf4fb8

samrat-rm committed 8 days ago

fix: rewards field shows single composite score instead of step CSV

f74015b

samrat-rm committed 8 days ago

chore: 2 scenarios per difficulty - inference

a884ccb

samrat-rm committed 8 days ago

chore: only one scenario is run per difficulty

d6bb519

samrat-rm committed 8 days ago

chore: restrict stdout to START/STEP/END for eval compliance

87b840b

samrat-rm committed 8 days ago

chore: clean up all the unnecessary comments

afa4b9d

samrat-rm committed 8 days ago

chore: updating logs

051a1af

samrat-rm committed 8 days ago

fix: comply with openenv stdout spec, preserve inspection data in history, sharpen medium-tier label rules

77f9568

samrat-rm committed 8 days ago

fix: error handling for episode run loop

aa1c27d

samrat-rm committed 8 days ago

fix: updating the logs to align with the evaluation standards

2ae2b18

samrat-rm committed 8 days ago

fix: error handling for HF_TOKEN env

5ca33a3

samrat-rm committed 8 days ago

fix: tighten label rules for underfitting, overfitting, and vanishing gradients

25fff92

samrat-rm committed 8 days ago

feat: add judge fallback

53f3a58

samrat-rm committed 8 days ago

refactor: moving llm judge inside server dir

149177d

samrat-rm committed 8 days ago

fix: harden label rules to prevent missing_regularization misfires

3eeca00

samrat-rm committed 8 days ago

feat(inference): add label decision rules and stop rules to system prompt

d8e7a25

samrat-rm and Claude Sonnet 4.6 committed 9 days ago

feat: add local agent support

e8c7211

samrat-rm committed 9 days ago

feat: upgrading the system and user prompt, upgrading the _make_env() function

dc7aeea

samrat-rm committed 9 days ago

feat: adding [END] log for each episode and error handling for websocket

a310ad6

samrat-rm committed 9 days ago

chore: docstring update

7cc0ee9

samrat-rm committed 9 days ago

fix: adding reasoning in episode run loop and refactor comments

775f9bc

samrat-rm committed 9 days ago

feat: implementing judge LLM which contributes to 15% of scoring

ae1e803

samrat-rm committed 9 days ago

feat: 3 modes of difficulty and updating the logs

66d62a2

samrat-rm committed 9 days ago

feat: init inference

b3d65f0

samrat-rm committed 10 days ago

refactor: WhyDidItFailAction and WhyDidItFailObservation classes

87037e2

samrat-rm committed 11 days ago

Initial commit

b37875f

samrat-rm committed 11 days ago

