WhyDidItFail

Commit History

chore: removing logs

1dce05c

samrat-rm committed 8 days ago

fix: remove exception

43c1c2a

samrat-rm committed 8 days ago

feat: rewards upgrade

61e83f1

samrat-rm committed 8 days ago

fix: implementing strict prompt conditions for scores/reward to be in 0.0–1.0 range

89b370c

samrat-rm committed 8 days ago

feat: updating prompt and reward condition

f58e721

samrat-rm committed 8 days ago

chore: updating the [END] log

87f9568

samrat-rm committed 8 days ago

feat: LLM agent model change

3ef6b97

samrat-rm committed 8 days ago

chore: update doc string

6b279f6

samrat-rm committed 8 days ago

fix: clamp all rewards and scores to [0.10, 0.90]

d3b224f

samrat-rm and Claude Sonnet 4.6 committed 8 days ago

fix: clamp all score paths to (0.01, 0.99), fix reward field name, add per-task score line

bf98c78

samrat-rm committed 8 days ago

fix: harden grader prompts to prevent out-of-range scores

7a56cd1

samrat-rm committed 8 days ago

fix: reduced the number of scenarios in inference

196955c

samrat-rm committed 8 days ago

fix: logs in inference

c348367

samrat-rm committed 8 days ago

chore: doc string update and remove unused import

0252dc5

samrat-rm committed 8 days ago

fix: score condition

d933934

samrat-rm committed 8 days ago

fix: score format and range

a583a04

samrat-rm committed 8 days ago

fix: score range

26f0b41

samrat-rm committed 8 days ago

fix: enforce reward bounds (0.01–0.99) and 2 decimal precision across grader, env, and inference

3781ce7

samrat-rm committed 8 days ago

fix: reward scores are updated to be between 0 and 1

c130122

samrat-rm committed 8 days ago

chore: code cleanup

2014a9f

samrat-rm committed 8 days ago

chore: logs format update

e7b5e0d

samrat-rm committed 8 days ago

chore: updating logs

faf4fb8

samrat-rm committed 8 days ago

fix: rewards field shows single composite score instead of step CSV

f74015b

samrat-rm committed 8 days ago

chore: 2 scenarios per difficulty - inference

a884ccb

samrat-rm committed 8 days ago

chore: only one scenario is run per difficulty

d6bb519

samrat-rm committed 8 days ago

chore: restrict stdout to START/STEP/END for eval compliance

87b840b

samrat-rm committed 8 days ago

fix: cast fallback action_type to Literal for Pylance compliance and remove image from repo root

26630c7

samrat-rm committed 8 days ago

docs: updating readme with state changes and test

f0681d9

samrat-rm committed 8 days ago

feat: implement WhyDidItFailState for full OpenEnv state compliance

ff8ce5f

samrat-rm committed 8 days ago

docs: Update README.md

15f091b

samrat-rm committed 8 days ago

docs: updating the readme with AI usage disclosure section

608b10a

samrat-rm committed 8 days ago

docs: adding detailed docs for agent_prompt, grade and scenarios

6338fc0

samrat-rm committed 8 days ago

docs: expand setup section, fix factual errors, add features summary

e6bf1cd

samrat-rm committed 8 days ago

feat: updating the readme

bece8d8

samrat-rm committed 8 days ago

fix: minor refactor of env class name

8e889bd

samrat-rm committed 8 days ago

chore: clean up all the unnecessary comments

afa4b9d

samrat-rm committed 8 days ago

Merge pull request #2 from samrat-rm/feat/llm-judge

696784a

samrat-rm committed 8 days ago

chore: updating logs

051a1af

samrat-rm committed 8 days ago

feat: openEnv playground UI basic implementation

f7c4516

samrat-rm committed 8 days ago

feat: updated the yaml file with tasks for evaluation

c6913d5

samrat-rm committed 8 days ago

fix: comply with openenv stdout spec, preserve inspection data in history, sharpen medium-tier label rules

77f9568

samrat-rm committed 8 days ago

fix: error handling for episode run loop

aa1c27d

samrat-rm committed 8 days ago

fix: updating the logs to align with the evaluation standards

2ae2b18

samrat-rm committed 8 days ago

fix: error handling for HF_TOKEN env

5ca33a3

samrat-rm committed 8 days ago

feat: update the readme.md

8f1e681

samrat-rm committed 8 days ago

fix: tighten label rules for underfitting, overfitting, and vanishing gradients

25fff92

samrat-rm committed 8 days ago

fix: normalize underfitting gradient norms and guard vague-answer penalty

909dfde

samrat-rm committed 8 days ago

feat: add judge fallback

53f3a58

samrat-rm committed 8 days ago

refactor: moving llm judge inside server dir

149177d

samrat-rm committed 8 days ago

feat: add playground static file serving

aac6b30

samrat-rm committed 8 days ago
