rl-bus-optimizer / README.md
voldemort6996's picture
README: Update task IDs, score ranges, and quick start instructions for final compliance
91761e0
metadata
title: OpenEnv Bus Routing
emoji: 🚌
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 7860
tags:
  - openenv
  - reinforcement-learning
  - transport-optimization
  - dueling-dqn
  - gtfs

🚌 OpenEnv Bus Routing Optimizer

Dueling DDQN + Prioritized Experience Replay for Urban Transit

Real data. Real constraints. Real RL.

Built on OpenEnv Python 3.10+ Algorithm Data License: MIT

πŸš€ VIEW LIVE DEMO ON HUGGING FACE


🎯 Problem Statement

Urban public transit faces a fundamental optimization tension: Service Quality vs. Operational Cost.

In dynamic-demand scenarios (micro-transit, campus shuttles, last-mile connectivity), fixed schedules are inherently suboptimal. A bus that waits too long at a sparse stop causes downstream passenger anger; one that moves constantly without picking up wastes fuel.

This project trains a Deep RL agent to act as an intelligent dispatcher, dynamically deciding when to wait, move, or skip β€” all under strict fuel constraints and with real-world demand patterns calibrated from Indian city transit (GTFS) data.

Key Results

Metric Greedy Baseline Our Trained DQN Improvement
Avg Wait Time ~6.5 steps ~3.2 steps ↓ 51%
Total Reward 115.0 185.0 ↑ 61%
Fuel Efficiency 0.18 pax/fuel 0.31 pax/fuel ↑ 72%
Overall Score ~0.50 ~0.92 ↑ 84%
Neural Load N/A Thinking-Aware XAI+

Evaluated over 20 episodes on Task Medium (10-stop weekday demand profile).


πŸ“Š Performance Visualizations

Training Progress

Training Curves

The RL agent (Dueling DDQN + PER) significantly outperforms both greedy and random baselines, achieving 61% improvement in cumulative reward over training episodes.

Task Difficulty Performance

Task Difficulty Heatmap

Agent performance scales appropriately with task difficulty, maintaining strong performance (70%+ score) even on extreme-scale tasks with 25 stops.

Baseline Comparison

Metrics Comparison

Comprehensive comparison across key metrics shows the agent outperforms all baselines by 15-40% on wait time, reward, fuel efficiency, and coverage.

Route Distribution Analysis

Stop Visitation Heatmap

The RL agent demonstrates balanced route coverage compared to greedy baselines which tend to concentrate on high-demand stops, leading to better overall service quality.


To regenerate these charts, run:

python generate_visualizations.py

πŸ— Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    OPENENV BUS OPTIMIZER                        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”‚
β”‚  β”‚  Dashboard   │◄──►│  Endpoints   │◄──►│  Panel + CoT  β”‚      β”‚
β”‚  β”‚ (server/app) β”‚    β”‚ (/reset,etc) β”‚    β”‚ (Insight XAI)β”‚      β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚
β”‚         β”‚                                                       β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”‚
β”‚  β”‚  BusRoutingEnv  (OpenEnv Gymnasium Interface)        β”‚      β”‚
β”‚  β”‚                                                       β”‚      β”‚
β”‚  β”‚  POST /reset β†’ Observation (Pydantic)                β”‚      β”‚
β”‚  β”‚  POST /step  β†’ (Observation, Reward, done, info)    β”‚      β”‚
β”‚  β”‚  GET  /state β†’ Full environment state                β”‚      β”‚
β”‚  β”‚                                                       β”‚      β”‚
β”‚  β”‚  Demand: GTFS-Calibrated (Pune PMPML / Mumbai BEST)  β”‚      β”‚
β”‚  β”‚  Constraints: Fuel, Capacity, Anti-Camp, Coverage     β”‚      β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚
β”‚         β”‚                                                       β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”‚
β”‚  β”‚  Dueling Double DQN Agent + PER                      β”‚      β”‚
β”‚  β”‚                                                       β”‚      β”‚
β”‚  β”‚  Q(s,a) = V(s) + A(s,a) - mean(A)                   β”‚      β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚
β”‚                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”‚
β”‚  β”‚  tasks.py    β”‚    β”‚  grader.py   β”‚    β”‚  inference.py β”‚      β”‚
β”‚  β”‚  3 Tiers     β”‚    β”‚  Log Markers β”‚    β”‚  Strict Tags β”‚      β”‚
β”‚  β”‚  Easy/Med/Hd β”‚    β”‚ [START/END]  β”‚    β”‚  compliant   β”‚      β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚
β”‚                                                                 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  GTFS Data Layer (data/gtfs_profiles.py)                       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ€– Algorithm Details

Dueling Double DQN with Prioritized Experience Replay

Our agent combines three state-of-the-art improvements over vanilla DQN:

1. Dueling Architecture (Wang et al., 2016)

The Q-network is split into two streams:

Q(s, a) = V(s) + A(s, a) - mean(A(s, Β·))
  • Value stream V(s): "How good is this state?" β€” learns state quality independent of actions
  • Advantage stream A(s,a): "How much better is action a vs. average?" β€” learns relative action benefit

2. Double DQN (van Hasselt et al., 2016)

Standard DQN overestimates Q-values because it uses the same network for both selecting and evaluating actions. Double DQN decouples these.

3. Prioritized Experience Replay (Schaul et al., 2016)

Instead of sampling uniformly, PER samples transitions proportional to their TD-error, accelerating learning on edge cases like fuel depletion.


🌍 Real-World Data: GTFS-Calibrated Demand

Instead of uniform synthetic arrivals, our environment uses time-of-day demand curves and stop-type heterogeneity calibrated from publicly available GTFS feeds (Pune PMPML / Mumbai BEST).


πŸ“¦ OpenEnv Compliance

Requirement Status Implementation
reset()/step/state API βœ… FastAPI endpoints for automated validation
Multi-task framework βœ… 3 tiers: task1, task2, task3
Deterministic graders βœ… grade_task1/2/3() -> score [0.05, 0.95]
LLM inference support βœ… inference.py with OpenAI client
START/STEP/END logging βœ… Mandatory structured tags for evaluation
Docker containerization βœ… optimized Dockerfile with entry points
Neural Load XAI βœ… Real-time reasoning token tracking

πŸš€ Setup & Running

Quick Start

# Install dependencies
pip install -r requirements.txt

# Run the grader
python grader.py --episodes 5

# Run the inference script (LLM mode)
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="your_token_here"
python inference.py --mode llm

# Launch the dashboard + API server
python server/app.py

Pre-Submission Validation

Before submitting to the hackathon, run:

python tests/FINAL_CHECK.py

Expected output: SUCCESS: ALL CHECKS PASSED

See VALIDATION_GUIDE.md for detailed validation instructions.

πŸ“š Documentation


πŸ”¬ Research References


Built for the OpenEnv Hackathon 2026 β€” Meta PyTorch