# Improvement Evaluation Artifacts
This folder contains deterministic baseline-vs-trained-style evaluation outputs for all 15 CommitmentOS tasks.
These outputs are **not** the real LLM checkpoint comparison; for that, see the root **README**, section **B) True LLM Learning Eval**, and `artifacts/evals_llm/`.
## Files
- `eval_protocol.json`: fixed protocol (task set, seed, max steps, decode config)
- `baseline_eval.json`: per-task baseline rollouts
- `trained_eval.json`: per-task improved/trained-style rollouts (same protocol)
- `improved_eval.json`: alias of trained outputs for backward compatibility
- `comparison.csv`: task-by-task delta table
- `summary.json`: aggregate metrics (mean/median deltas, difficulty splits, steps, success)
- `case_study_hard_011.md`: concise before/after narrative for one hard scenario
- `reward_by_task.svg`: visual comparison of final reward by task
- `violations_before_after.svg`: visual comparison of commitment violations
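
As a rough illustration of the aggregate metrics listed for `summary.json`, the sketch below computes per-task reward deltas and their mean and median. The input shape (a mapping from task ID to final reward) and the function name `summarize_deltas` are assumptions made for this example; the actual JSON schema written by the evaluation scripts is not documented here and may differ.

```python
from statistics import mean, median


def summarize_deltas(baseline, trained):
    """Compute trained-minus-baseline reward deltas.

    Both arguments are hypothetical dicts mapping task ID -> final reward;
    this is NOT necessarily the layout of baseline_eval.json / trained_eval.json.
    """
    # Only compare tasks that appear in both runs.
    shared = sorted(set(baseline) & set(trained))
    if not shared:
        raise ValueError("no tasks in common between baseline and trained runs")

    deltas = {task: trained[task] - baseline[task] for task in shared}
    values = list(deltas.values())
    return {
        "per_task_delta": deltas,
        "mean_delta": mean(values),
        "median_delta": median(values),
    }


if __name__ == "__main__":
    # Illustrative numbers only, not real evaluation results.
    baseline = {"easy_001": 0.5, "hard_011": 0.25}
    trained = {"easy_001": 0.75, "hard_011": 1.0}
    print(summarize_deltas(baseline, trained))
```

Restricting the comparison to tasks present in both runs keeps the deltas meaningful if one run is incomplete.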
## Reproduce
```bash
cd commitment_os
python3 evaluation/evaluate_improvement.py
python3 evaluation/plot_improvement.py
```
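
After running the two commands above, a quick check that every file from the list in this README is present can catch a partial run. The following is a minimal sketch: the helper name `missing_artifacts` is hypothetical, and the folder path is left as a parameter because this README does not state where the scripts write their output.

```python
from pathlib import Path

# File names taken from the "Files" list in this README.
EXPECTED_FILES = [
    "eval_protocol.json",
    "baseline_eval.json",
    "trained_eval.json",
    "improved_eval.json",
    "comparison.csv",
    "summary.json",
    "case_study_hard_011.md",
    "reward_by_task.svg",
    "violations_before_after.svg",
]


def missing_artifacts(folder):
    """Return the expected artifact names that are not files in `folder`."""
    root = Path(folder)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    import sys

    # Pass the artifacts folder as the first argument (defaults to the current directory).
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    missing = missing_artifacts(target)
    if missing:
        print("Missing artifacts:", ", ".join(missing))
    else:
        print("All expected artifacts are present.")
```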