"""Evaluation harness: compare checkpoints across tiers and print the Markdown table.

TODO (Phase 15): implement the CLI per PROJECT.md Section 20. Results must reproduce
within 2% across 50-rollout evaluations (PROJECT.md Section 20.4).
"""
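
The docstring above describes a harness that is not implemented yet. As a rough illustration only, the sketch below shows one possible shape for it: run a fixed number of rollouts per checkpoint and tier, average the scores, print a Markdown table, and compare two runs against a 2% tolerance. Everything named here is hypothetical and not taken from PROJECT.md: `dummy_rollout`, `evaluate`, `markdown_table`, `within_tolerance`, the CLI flags, and the tier names. It also assumes scores lie in [0, 1] and that "within 2%" means an absolute difference of 0.02 in mean score; Section 20.4 may define it differently.

```python
"""Hypothetical sketch of the evaluation harness; not the project's actual CLI."""
import argparse
import random
from typing import Callable, Dict, List, Optional, Sequence

# Assumption: one rollout yields a scalar score in [0, 1].
RolloutFn = Callable[[str, str, random.Random], float]
Results = Dict[str, Dict[str, float]]


def dummy_rollout(checkpoint: str, tier: str, rng: random.Random) -> float:
    """Placeholder for a real episode: returns a random score in [0, 1)."""
    return rng.random()


def evaluate(checkpoints: Sequence[str], tiers: Sequence[str],
             rollout: RolloutFn = dummy_rollout,
             n_rollouts: int = 50, seed: int = 0) -> Results:
    """Return {checkpoint: {tier: mean score}} over n_rollouts per cell."""
    results: Results = {}
    for ckpt in checkpoints:
        results[ckpt] = {}
        for tier in tiers:
            # A string seed per cell keeps every cell deterministic and
            # independent of the order in which cells are evaluated.
            rng = random.Random(f"{seed}:{ckpt}:{tier}")
            scores = [rollout(ckpt, tier, rng) for _ in range(n_rollouts)]
            results[ckpt][tier] = sum(scores) / n_rollouts
    return results


def markdown_table(results: Results, tiers: Sequence[str]) -> str:
    """Render one row per checkpoint and one column per tier."""
    header = "| checkpoint | " + " | ".join(tiers) + " |"
    separator = "|" + " --- |" * (len(tiers) + 1)
    rows: List[str] = []
    for ckpt, by_tier in results.items():
        cells = " | ".join(f"{by_tier[tier]:.3f}" for tier in tiers)
        rows.append(f"| {ckpt} | {cells} |")
    return "\n".join([header, separator, *rows])


def within_tolerance(a: Results, b: Results, tol: float = 0.02) -> bool:
    """True if both runs cover the same cells and each differs by at most tol.

    Assumption: tol is an absolute difference in mean score.
    """
    if a.keys() != b.keys():
        return False
    for ckpt in a:
        if a[ckpt].keys() != b[ckpt].keys():
            return False
        if any(abs(a[ckpt][t] - b[ckpt][t]) > tol for t in a[ckpt]):
            return False
    return True


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Parse the (hypothetical) flags, evaluate, and print the table."""
    parser = argparse.ArgumentParser(description="Compare checkpoints across tiers.")
    parser.add_argument("--checkpoints", nargs="+", required=True)
    parser.add_argument("--tiers", nargs="+", default=["tier1", "tier2"])
    parser.add_argument("--rollouts", type=int, default=50)
    parser.add_argument("--seed", type=int, default=0)
    args = parser.parse_args(argv)
    results = evaluate(args.checkpoints, args.tiers,
                       n_rollouts=args.rollouts, seed=args.seed)
    print(markdown_table(results, args.tiers))
    return 0
```

Calling `main(["--checkpoints", "ckpt_a", "ckpt_b"])` prints a table with one row per checkpoint. With a seeded placeholder rollout, two runs on the same seed match exactly; a real harness would more likely call `within_tolerance` on runs with different seeds to check the 2% requirement.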