""" Training infrastructure for the two-phase RL system. Modules: policies — pluggable policy interface (random / LLM API) pool_b_baseline — null-context P2 baseline runner (Stage-3 prereq) behavioral_metrics — stopping/ordering/calibration/breadth metrics trajectory_dataset — load/save trajectories, HF-Dataset adapter belief_aux_loss — calibration regression + consistency loss segment_grpo — framework-agnostic segment-level GRPO loss curriculum — Stage-2 → Stage-3 → Stage-4 driver variance_gate — Pool-B variance check + r_cross weight warmup ablations — four ablation runners (claims 1-4) report — paper tables + behavioral plots """