Spaces:
Sleeping
Sleeping
RL+LLM Wrapper Sweep
This sweep keeps both checkpoints fixed:
- RL weights stay fixed.
- LLM weights stay fixed.
- Only the inference-time
target_only_softwrapper settings change.
Recommended First Sweep
Run the default cheap preset on one city, one scenario, and three seeds:
python scripts/sweep_rl_llm_wrapper.py \
--rl-checkpoint artifacts/dqn_shared/best_validation.pt \
--llm-model-path artifacts/district_llm_adapter_v3/main_run/adapter \
--preset strength_targets_gating \
--split val \
--cities city_0001 \
--scenarios normal \
--seeds 7 11 13 \
--episodes-per-seed 1 \
--max-episode-seconds 300 \
--guidance-refresh-steps 10 \
--queue-threshold 150 \
--imbalance-threshold 20 \
--fallback-policy no_op \
--output-dir artifacts/rl_llm_wrapper_sweep/first_pass
This preset sweeps a small curated grid over:
bias_strengthin{0.025, 0.05, 0.075}max_intersections_affectedin{1, 2}gating_modein{always_on, incident_or_spillback, queue_or_imbalance}guidance_persistence_steps = 5enable_bias_decay = false
It also includes baseline_current_soft as a reference row.
Cheaper Probe
If you only want the fastest possible read on strength sensitivity:
python scripts/sweep_rl_llm_wrapper.py \
--rl-checkpoint artifacts/dqn_shared/best_validation.pt \
--llm-model-path artifacts/district_llm_adapter_v3/main_run/adapter \
--preset strength_only \
--cities city_0001 \
--scenarios normal \
--seeds 7 11 13 \
--episodes-per-seed 1 \
--max-episode-seconds 300 \
--output-dir artifacts/rl_llm_wrapper_sweep/strength_only
Broader Conservative Follow-Up
After the first pass identifies a promising strength/gating region:
python scripts/sweep_rl_llm_wrapper.py \
--rl-checkpoint artifacts/dqn_shared/best_validation.pt \
--llm-model-path artifacts/district_llm_adapter_v3/main_run/adapter \
--preset full_conservative \
--cities city_0001 \
--scenarios normal \
--seeds 7 11 13 \
--episodes-per-seed 1 \
--max-episode-seconds 300 \
--output-dir artifacts/rl_llm_wrapper_sweep/full_conservative
Outputs
Each sweep writes:
config.jsonsweep_results.csvsweep_results.parquetwhen parquet support is availablepaired_episode_metrics.csvranking.jsonsummary_report.json- optional
step_metrics.* - optional
guidance_traces.jsonl
What To Inspect
Start with:
summary_report.jsonranking.jsonpaired_episode_metrics.csv
Key fields:
mean_return_delta_vs_rl_onlymean_throughput_delta_vs_rl_onlymean_avg_queue_delta_vs_rl_onlymean_avg_wait_delta_vs_rl_onlymean_percent_steps_with_active_guidancemean_avg_num_affected_intersectionsmean_num_steps_guidance_blocked_by_gate
Interpretation
The most promising configs should usually look like:
- small negative or positive
mean_return_delta_vs_rl_only - low
mean_avg_num_affected_intersections - moderate or low
mean_percent_steps_with_active_guidance - low fallback / invalid guidance counts
If the best configs cluster around:
- lower
bias_strength max_intersections_affected = 1- gated modes like
incident_or_spillbackorqueue_or_imbalance
then the wrapper was still too active and guidance needs to remain a rare local prior.