jayantaggarwal-sketch
Sync latest project updates to Hugging Face Space.
d53a65c

Improvement Evaluation Artifacts

This folder contains deterministic baseline-vs-trained-style evaluation outputs for all 15 CommitmentOS tasks.

This is not the same as the real LLM checkpoint comparison; see root README section B) True LLM Learning Eval and artifacts/evals_llm/.

Files

  • eval_protocol.json: fixed protocol (task set, seed, max steps, decode config)
  • baseline_eval.json: per-task baseline rollouts
  • trained_eval.json: per-task improved/trained-style rollouts (same protocol)
  • improved_eval.json: alias of trained outputs for backward compatibility
  • comparison.csv: task-by-task delta table
  • summary.json: aggregate metrics (mean/median deltas, difficulty splits, steps, success)
  • case_study_hard_011.md: concise before/after narrative for one hard scenario
  • reward_by_task.svg: visual comparison of final reward by task
  • violations_before_after.svg: visual comparison of commitment violations

Reproduce

cd commitment_os
python3 evaluation/evaluate_improvement.py
python3 evaluation/plot_improvement.py