Improvement Evaluation Artifacts
This folder contains deterministic baseline-vs-trained-style evaluation outputs for all 15 CommitmentOS tasks.
These outputs are not the real LLM checkpoint comparison; for that, see section "B) True LLM Learning Eval" in the root README and artifacts/evals_llm/.
Files
eval_protocol.json: fixed protocol (task set, seed, max steps, decode config)
baseline_eval.json: per-task baseline rollouts
trained_eval.json: per-task improved/trained-style rollouts (same protocol)
improved_eval.json: alias of trained outputs for backward compatibility
comparison.csv: task-by-task delta table
summary.json: aggregate metrics (mean/median deltas, difficulty splits, steps, success)
case_study_hard_011.md: concise before/after narrative for one hard scenario
reward_by_task.svg: visual comparison of final reward by task
violations_before_after.svg: visual comparison of commitment violations
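To make the relationship between the per-task rollouts, the delta table, and the aggregate metrics concrete, here is a minimal sketch of how per-task deltas and mean/median deltas could be derived. It is not the code in evaluation/evaluate_improvement.py. The function name `summarize_deltas`, the dictionary layout (task id mapped to final reward), the key names, and the task id `easy_001` are assumptions for illustration; the real JSON and CSV schemas may differ.

```python
import statistics


def summarize_deltas(baseline, trained):
    """Compute per-task reward deltas and their mean/median.

    Both arguments map a task id to a final reward. This layout is
    hypothetical; the actual baseline_eval.json / trained_eval.json
    files may be structured differently.
    """
    # Compare only tasks present in both runs, in a stable order.
    task_ids = sorted(set(baseline) & set(trained))
    rows = [
        {
            "task_id": task_id,
            "baseline": baseline[task_id],
            "trained": trained[task_id],
            "delta": trained[task_id] - baseline[task_id],
        }
        for task_id in task_ids
    ]
    deltas = [row["delta"] for row in rows]
    summary = {
        "num_tasks": len(rows),
        "mean_delta": statistics.mean(deltas) if deltas else 0.0,
        "median_delta": statistics.median(deltas) if deltas else 0.0,
    }
    return rows, summary


# Made-up rewards for illustration only; these are not real evaluation results.
example_baseline = {"easy_001": 0.5, "hard_011": 0.25}
example_trained = {"easy_001": 0.75, "hard_011": 0.75}
rows, summary = summarize_deltas(example_baseline, example_trained)
```

In this sketch, `rows` plays the role of a task-by-task delta table and `summary` the role of the aggregate metrics; difficulty splits, step counts, and success rates are left out.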
Reproduce
cd commitment_os
python3 evaluation/evaluate_improvement.py
python3 evaluation/plot_improvement.py
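
After running both scripts, a quick check that every file listed above exists can catch a partial run. The sketch below is an optional helper, not part of the repository; `missing_artifacts` is a hypothetical name, and the folder path is passed in by the caller because the output location is not stated here.

```python
from pathlib import Path

# File names taken from the Files list in this README.
EXPECTED_FILES = [
    "eval_protocol.json",
    "baseline_eval.json",
    "trained_eval.json",
    "improved_eval.json",
    "comparison.csv",
    "summary.json",
    "case_study_hard_011.md",
    "reward_by_task.svg",
    "violations_before_after.svg",
]


def missing_artifacts(folder):
    """Return the expected file names that are absent from `folder`."""
    root = Path(folder)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling `missing_artifacts` with the path to this folder returns an empty list when all nine files are present.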