---
title: Agentic Traffic
emoji: 🏢
colorFrom: green
colorTo: purple
sdk: docker
pinned: false
short_description: Agentic AI to control traffic lights
app_port: 7860
language:
- en
base_model:
- meta-llama/Llama-3.1-8B-Instruct
pipeline_tag: reinforcement-learning
---

| # traffic-llm |
|
|
| CityFlow-based traffic-control project with intersection-level multi-agent DQN training and district-aware policy variants. |
|
|
| A huge thank you to @tokev (Kevin Truong) for helping me with this project.
|
|
| ## OpenEnv UI |
|
|
| For the deployed OpenEnv web interface: |
|
|
| - Click `Reset` before using `Step`. |
| - Leave `Use Llm` unchecked for the fast, stable DQN-only path. |
| - Use `District Actions` = `{}` for a valid no-op step payload. |
| - Only enable `Use Llm` when you explicitly want district-level LLM guidance on top of the DQN executor. |
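
For scripted use, the DQN-only no-op step described above could be serialized roughly as follows. This is a minimal sketch: the key names `use_llm` and `district_actions` are assumptions inferred from the UI labels, not confirmed API fields, and `make_step_payload` is a hypothetical helper.

```python
import json
from typing import Optional


def make_step_payload(use_llm: bool = False,
                      district_actions: Optional[dict] = None) -> str:
    """Serialize a step request. Key names are assumed from the UI labels."""
    payload = {
        "use_llm": use_llm,
        # An empty mapping is the valid no-op described in the list above.
        "district_actions": district_actions if district_actions is not None else {},
    }
    return json.dumps(payload, sort_keys=True)


noop = make_step_payload()
print(noop)
```

Keeping `use_llm` false by default mirrors the recommendation to stay on the fast DQN-only path unless district-level guidance is explicitly wanted.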
|
|
| ## Training |
|
|
| The default local-policy trainer now uses parameter-shared dueling Double DQN with prioritized replay and n-step returns: |
|
|
| ```bash |
| python3 -m training.train_local_policy train |
| ``` |
|
|
| By default, this command:
|
|
| - Trains against `data/generated` using the splits in `data/splits`.
| - Writes checkpoints to `artifacts/dqn_shared`.
| - Enables TensorBoard logging and shows `tqdm` progress bars.
| - Uses parallel CPU rollout workers.
| - Runs validation and saves a checkpoint every 40 updates.
|
|
| For a broader but still manageable validation pass: |
|
|
| ```bash |
| python3 -m training.train_local_policy train --max-val-cities 3 --val-scenarios-per-city 7 |
| ``` |
|
|
| That evaluates 3 validation cities across all 7 scenario types, which gives 3 × 7 = 21 learned-policy validation episodes per evaluation. If the random and fixed baselines are also enabled, each of them runs the same 21 episodes, for 63 episodes in total.
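
The episode counts can be checked with a few lines of Python. The helper name `validation_episodes` is hypothetical and only restates the arithmetic; it is not part of the trainer.

```python
def validation_episodes(num_cities: int, scenarios_per_city: int,
                        num_baselines: int = 0) -> int:
    """Episodes per evaluation pass: one per (city, scenario) pair,
    repeated for the learned policy and for each enabled baseline."""
    policies = 1 + num_baselines
    return num_cities * scenarios_per_city * policies


learned_only = validation_episodes(3, 7)                     # 21
with_baselines = validation_episodes(3, 7, num_baselines=2)  # 63
print(learned_only, with_baselines)
```

This makes the cost of a wider validation pass easy to estimate before changing `--max-val-cities` or `--val-scenarios-per-city`.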
|
|
| Phase-3-style full training with the same 40-update eval/checkpoint cadence: |
|
|
| ```bash |
| python3 -m training.train_local_policy train \ |
| --max-train-cities 70 \ |
| --max-val-cities 3 \ |
| --val-scenarios-per-city 7 \ |
| --policy-arch single_head_with_district_feature \ |
| --reward-variant wait_queue_throughput |
| ``` |
|
|
| Useful ablations: |
|
|
| ```bash |
| python3 -m training.train_local_policy train --policy-arch multi_head --reward-variant current |
| python3 -m training.train_local_policy train --policy-arch single_head --reward-variant current |
| python3 -m training.train_local_policy train --policy-arch single_head_with_district_feature --reward-variant wait_queue_throughput |
| ``` |
|
|
| For a fast phase-1 overfit run on one fixed world: |
|
|
| ```bash |
| python3 -m training.train_local_policy train \ |
| --total-updates 25 \ |
| --train-city-id city_0072 \ |
| --train-scenario-name normal \ |
| --overfit-val-on-train-scenario \ |
| --fast-overfit \ |
| --policy-arch single_head_with_district_feature \ |
| --reward-variant wait_queue_throughput |
| ``` |
|
|
| To create or refresh dataset splits: |
|
|
| ```bash |
| python3 -m training.train_local_policy make-splits |
| ``` |
|
|
| To evaluate the best checkpoint: |
|
|
| ```bash |
| python3 -m training.train_local_policy evaluate \ |
| --checkpoint artifacts/dqn_shared/best_validation.pt \ |
| --split val |
| ``` |
|
|
| To evaluate a heuristic baseline directly: |
|
|
| ```bash |
| python3 -m training.train_local_policy evaluate --baseline queue_greedy --split val |
| ``` |
|
|
| ## TensorBoard |
|
|
| TensorBoard logs are written to `artifacts/dqn_shared/tensorboard` by default. |
|
|
| ```bash |
| tensorboard --logdir artifacts/dqn_shared/tensorboard |
| ``` |
|
|
| ## District LLM |
|
|
| The district LLM stack lives under `district_llm/`. It uses the learned DQN local controller as the low-level executor and derives district-scale SFT labels automatically from DQN rollout windows. By default, district-model fine-tuning uses only DQN-derived rows.
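
To give a feel for what a "rollout window" means here, the sketch below splits a per-step rollout into consecutive windows of `decision_interval` steps and summarizes each one. It is only an illustration under stated assumptions: the `queue` field, the `rollout_windows` helper, and the summary statistic are hypothetical, and the actual label derivation in `district_llm/` is not shown.

```python
from statistics import mean
from typing import Dict, List


def rollout_windows(steps: List[Dict[str, float]],
                    decision_interval: int) -> List[Dict[str, float]]:
    """Split a per-step rollout into consecutive windows and summarize each."""
    rows = []
    for start in range(0, len(steps), decision_interval):
        window = steps[start:start + decision_interval]
        rows.append({
            "start_step": start,
            "num_steps": len(window),  # the last window may be shorter
            "mean_queue": mean(s["queue"] for s in window),
        })
    return rows


# 25 simulated steps with a steadily growing queue (made-up values).
steps = [{"queue": float(t)} for t in range(25)]
rows = rollout_windows(steps, decision_interval=10)
for row in rows:
    print(row)
```

With `decision_interval=10`, each full window would correspond to one district-level decision, which is the granularity the `--decision-interval` flag below controls.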
|
|
| Generate district-LLM data from a learned checkpoint: |
|
|
| ```bash |
| python3 -m district_llm.generate_dataset \ |
| --controller rl_checkpoint \ |
| --checkpoint artifacts/dqn_shared/best_validation.pt \ |
| --episodes 100 \ |
| --decision-interval 10 \ |
| --use-checkpoint-env-config \ |
| --output data/district_llm_train.jsonl |
| ``` |
|
|
| Generate from fixed or heuristic baselines, or from several teachers at once with repeated `--teacher-spec` flags:
|
|
| ```bash |
| python3 -m district_llm.generate_dataset --controller fixed --episodes 50 --decision-interval 10 --output data/district_llm_fixed.jsonl |
| python3 -m district_llm.generate_dataset --controller queue_greedy --episodes 50 --decision-interval 10 --output data/district_llm_heuristic.jsonl |
| python3 -m district_llm.generate_dataset --teacher-spec fixed --teacher-spec random --episodes 50 --decision-interval 10 --output data/district_llm_multi_teacher.jsonl |
| ``` |
|
|
| Train a first-pass district model with Unsloth/QLoRA: |
|
|
| ```bash |
| python3 -m training.train_district_llm \ |
| --dataset data/district_llm_train.jsonl \ |
| --output-dir artifacts/district_llm_qwen \ |
| --model-name Qwen/Qwen2.5-7B-Instruct \ |
| --load-in-4bit \ |
| --lora-rank 16 \ |
| --max-seq-length 1024 \ |
| --max-steps 1000 |
| ``` |
|
|
| Run single-sample inference: |
|
|
| ```bash |
| python3 -m district_llm.inference \ |
| --model artifacts/district_llm_qwen \ |
| --city-id city_0006 \ |
| --scenario-name accident \ |
| --district-id d_00 |
| ``` |
|
|
| Run the OpenEnv-compatible district wrapper on top of the current DQN stack: |
|
|
| ```bash |
| uvicorn openenv_app.app:app --reload |
| ``` |
|
|
| ## Algorithm |
|
|
| - Training algorithm: parameter-shared dueling Double DQN. |
| - Replay: prioritized replay over per-intersection transitions gathered from full CityFlow worlds. |
| - Return target: n-step bootstrap target with target-network updates. |
| - Execution: all controllable intersections act simultaneously every RL decision interval. |
| - Action space: `0 = hold current phase`, `1 = switch to next green phase`. |
| - Safety: `min_green_time` is enforced in the environment and exposed through action masking. |
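
The return target and action masking listed above can be sketched in plain Python. This is a simplified, unbatched illustration of an n-step Double DQN target, not the trainer's code: the function names and Q-values are hypothetical, and it assumes `hold` is always a legal action so the mask is never empty.

```python
from typing import List, Sequence

HOLD, SWITCH = 0, 1  # action ids from the list above


def masked_argmax(q_values: Sequence[float], mask: Sequence[bool]) -> int:
    """Index of the best action among those the mask allows."""
    allowed = [a for a, ok in enumerate(mask) if ok]
    return max(allowed, key=lambda a: q_values[a])


def n_step_double_dqn_target(rewards: List[float], gamma: float,
                             online_q_next: Sequence[float],
                             target_q_next: Sequence[float],
                             next_mask: Sequence[bool], done: bool) -> float:
    """Discounted n-step return plus a bootstrapped tail."""
    n = len(rewards)
    ret = sum((gamma ** k) * r for k, r in enumerate(rewards))
    if done:
        return ret  # no bootstrap past the end of an episode
    # Double DQN: select the action with the online network ...
    best = masked_argmax(online_q_next, next_mask)
    # ... and evaluate that action with the target network.
    return ret + (gamma ** n) * target_q_next[best]


# SWITCH is masked out, as it would be before min_green_time has elapsed.
target = n_step_double_dqn_target(
    rewards=[-1.0, -0.5, -0.25], gamma=0.5,
    online_q_next=[1.0, 3.0], target_q_next=[2.0, 10.0],
    next_mask=[True, False], done=False,
)
print(target)  # -1.0625
```

Applying the mask when selecting the bootstrap action keeps the target from crediting a switch that the environment would reject.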
|
|
| Policy architecture modes: |
|
|
| - `multi_head`: shared trunk with district-type-specific Q heads. |
| - `single_head`: one shared Q head for all intersections, with district type removed from the observation. |
| - `single_head_with_district_feature`: one shared Q head for all intersections, with district type left in the observation as an explicit feature. |
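
One way to picture the three modes is by what they do to the observation and which Q head they pick. The sketch below is illustrative only: the district type names are placeholders, the one-hot encoding is an assumption, and it also assumes `multi_head` keeps the district feature in the observation. The dueling combination uses the standard `Q = V + A - mean(A)` form.

```python
from typing import List, Sequence

# Hypothetical district types; the real ones come from the dataset metadata.
DISTRICT_TYPES = ["residential", "commercial", "industrial"]


def build_observation(base: Sequence[float], district_type: str,
                      policy_arch: str) -> List[float]:
    """Append a one-hot district feature unless the mode removes it."""
    obs = list(base)
    if policy_arch != "single_head":
        obs += [1.0 if district_type == d else 0.0 for d in DISTRICT_TYPES]
    return obs


def head_index(district_type: str, policy_arch: str) -> int:
    """Pick a district-specific head for multi_head, else the shared head 0."""
    if policy_arch == "multi_head":
        return DISTRICT_TYPES.index(district_type)
    return 0


def dueling_q(value: float, advantages: Sequence[float]) -> List[float]:
    """Standard dueling aggregation: Q = V + A - mean(A)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


obs = build_observation([0.2, 0.4], "commercial", "single_head_with_district_feature")
print(obs)
print(head_index("industrial", "multi_head"))
print(dueling_q(1.0, [0.5, -0.5]))
```

Comparing `single_head` against `single_head_with_district_feature` isolates the value of the district feature itself, while `multi_head` tests whether separate heads add anything beyond that feature.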
|
|
| Reward variants: |
|
|
| - `current`: backward-compatible waiting and queue penalty. |
| - `normalized_wait_queue`: normalized queue and waiting reduction reward. |
| - `wait_queue_throughput`: normalized queue/wait reduction plus throughput bonus and imbalance penalty. |
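
As a rough sketch of how the `wait_queue_throughput` terms might combine, the function below adds normalized queue and wait reductions to a throughput bonus and subtracts an imbalance penalty. Every scale, weight, and the imbalance definition here is an assumption chosen for illustration; the real reward logic lives in `env/`.

```python
from typing import Sequence


def wait_queue_throughput_reward(prev_queue: float, queue: float,
                                 prev_wait: float, wait: float,
                                 throughput: float,
                                 lane_queues: Sequence[float],
                                 queue_scale: float = 20.0,
                                 wait_scale: float = 100.0,
                                 throughput_weight: float = 0.1,
                                 imbalance_weight: float = 0.1) -> float:
    """Hypothetical composite reward; all scales and weights are made up."""
    queue_term = (prev_queue - queue) / queue_scale  # positive if queue shrank
    wait_term = (prev_wait - wait) / wait_scale      # positive if wait shrank
    # Imbalance: spread between the longest and shortest lane queue.
    imbalance = 0.0
    if lane_queues:
        imbalance = (max(lane_queues) - min(lane_queues)) / queue_scale
    return (queue_term + wait_term
            + throughput_weight * throughput
            - imbalance_weight * imbalance)


reward = wait_queue_throughput_reward(
    prev_queue=12, queue=8, prev_wait=150, wait=100,
    throughput=3, lane_queues=[6, 2],
)
print(round(reward, 4))
```

Rewarding reductions rather than absolute levels keeps the signal comparable across small and large intersections, which matters when one set of parameters is shared by all of them.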
|
|
| ## Smoke Test |
|
|
| To sanity-check one generated scenario with the real CityFlow environment: |
|
|
| ```bash |
| python3 scripts/smoke_test_env.py --city-id city_0001 --scenario-name normal --policy random |
| ``` |
|
|
| ## Project Layout
|
|
| - `agents/`: heuristic local policies and simple baselines. |
| - `env/`: CityFlow environment, topology parsing, observation building, and reward logic. |
| - `training/`: dataset utilities, replay-based DQN training, evaluation helpers, TensorBoard logging, and CLIs. |
| - `data/`: generated synthetic cities, split files, and dataset generation utilities. |
| - `scripts/`: utility scripts, including the CityFlow smoke test. |
| - `third_party/`: vendored dependencies, including CityFlow source. |
| - `district_llm/`: district LLM stack, including dataset generation and inference.
| - `openenv_app/`: OpenEnv-compatible app served with `uvicorn`.
|
|
| ## Notes |
|
|
| - The generated dataset is assumed to already exist under `data/generated`. |
| - District membership comes from `district_map.json`. |
| - District types come from `metadata.json`. |
| - Runtime training and evaluation require the `cityflow` Python module to be installed in the active environment. |
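
A quick preflight check along these lines can catch the two most common setup problems before a long run. `missing_prerequisites` is a hypothetical helper, not part of the repository, and it only checks the items named in the notes above.

```python
import importlib.util
from pathlib import Path
from typing import List


def missing_prerequisites(repo_root: str = ".") -> List[str]:
    """Return human-readable problems; an empty list means both checks passed."""
    problems = []
    # find_spec returns None when a top-level module cannot be found.
    if importlib.util.find_spec("cityflow") is None:
        problems.append("Python module `cityflow` is not installed")
    if not (Path(repo_root) / "data" / "generated").is_dir():
        problems.append("dataset directory data/generated is missing")
    return problems


for problem in missing_prerequisites():
    print("missing:", problem)
```

Run it from the repository root so that the relative `data/generated` path resolves correctly.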