Improvement Evaluation Artifacts

This folder contains deterministic baseline-vs-trained-style evaluation outputs for all 15 CommitmentOS tasks.

These outputs are not the real LLM checkpoint comparison. For that, see section "B) True LLM Learning Eval" in the root README and the artifacts/evals_llm/ folder.

eval_protocol.json: fixed protocol (task set, seed, max steps, decode config)
baseline_eval.json: per-task baseline rollouts
trained_eval.json: per-task improved/trained-style rollouts (same protocol)
improved_eval.json: alias of trained outputs for backward compatibility
comparison.csv: task-by-task delta table
summary.json: aggregate metrics (mean/median deltas, difficulty splits, steps, success)
case_study_hard_011.md: concise before/after narrative for one hard scenario
reward_by_task.svg: visual comparison of final reward by task
violations_before_after.svg: visual comparison of commitment violations
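
As a rough illustration of how the task-by-task deltas in comparison.csv relate to the aggregate metrics in summary.json, the sketch below computes mean and median reward deltas from a small inline table. The column names (`task_id`, `difficulty`, `baseline_reward`, `trained_reward`), the task ids other than `hard_011`, and every value are assumptions made up for the example; check the real file's header before adapting it.

```python
import csv
import io
import statistics

# Illustrative rows only: the column names and values are assumptions,
# not the real contents of comparison.csv.
SAMPLE_CSV = """task_id,difficulty,baseline_reward,trained_reward
easy_001,easy,0.50,0.75
medium_006,medium,0.25,0.75
hard_011,hard,0.00,0.50
"""


def summarize_deltas(csv_text):
    """Return count, mean, and median of (trained - baseline) reward per task."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    deltas = [
        float(row["trained_reward"]) - float(row["baseline_reward"])
        for row in rows
    ]
    return {
        "n_tasks": len(deltas),
        "mean_delta": statistics.mean(deltas),
        "median_delta": statistics.median(deltas),
    }


summary = summarize_deltas(SAMPLE_CSV)
print(summary)
```

To run this against the real file, the text of comparison.csv could be read from disk and passed to `summarize_deltas`, provided the column names are adjusted to match.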

To regenerate these artifacts:

cd commitment_os
python3 evaluation/evaluate_improvement.py
python3 evaluation/plot_improvement.py
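
After the scripts finish, a quick sanity check is to confirm that every file listed in this README exists. The sketch below is a hypothetical helper, not part of the repository: the file names come from the list above, while the folder to check is an assumption the reader supplies, since the scripts' output location is not stated here.

```python
import tempfile
from pathlib import Path

# File names as listed in this README.
EXPECTED_ARTIFACTS = [
    "eval_protocol.json",
    "baseline_eval.json",
    "trained_eval.json",
    "improved_eval.json",
    "comparison.csv",
    "summary.json",
    "case_study_hard_011.md",
    "reward_by_task.svg",
    "violations_before_after.svg",
]


def missing_artifacts(folder):
    """Return the expected artifact names that are not files in `folder`."""
    folder = Path(folder)
    return [name for name in EXPECTED_ARTIFACTS if not (folder / name).is_file()]


# Demo on a temporary folder containing only one of the expected files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "summary.json").write_text("{}")
    demo_missing = missing_artifacts(tmp)

print(demo_missing)
```

Calling `missing_artifacts` on the actual artifacts folder should return an empty list once both scripts have run successfully.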