Spaces:
Sleeping
Sleeping
| # Improvement Evaluation Artifacts | |
| This folder contains deterministic baseline-vs-trained-style evaluation outputs for all 15 CommitmentOS tasks. | |
| This is **not** the same as the real LLM checkpoint comparison; see root **README** section **B) True LLM Learning Eval** and `artifacts/evals_llm/`. | |
| ## Files | |
| - `eval_protocol.json`: fixed protocol (task set, seed, max steps, decode config) | |
| - `baseline_eval.json`: per-task baseline rollouts | |
| - `trained_eval.json`: per-task improved/trained-style rollouts (same protocol) | |
| - `improved_eval.json`: alias of trained outputs for backward compatibility | |
| - `comparison.csv`: task-by-task delta table | |
| - `summary.json`: aggregate metrics (mean/median deltas, difficulty splits, steps, success) | |
| - `case_study_hard_011.md`: concise before/after narrative for one hard scenario | |
| - `reward_by_task.svg`: visual comparison of final reward by task | |
| - `violations_before_after.svg`: visual comparison of commitment violations | |
| ## Reproduce | |
| ```bash | |
| cd commitment_os | |
| python3 evaluation/evaluate_improvement.py | |
| python3 evaluation/plot_improvement.py | |
| ``` | |