---
title: Adaptive Traffic Controller
emoji: 🚦
colorFrom: blue
colorTo: red
sdk: docker
app_port: 7860
tags:
  - openenv
  - llm-agent
  - traffic-control
license: mit
---
# Adaptive Backend Traffic Controller

An OpenEnv-compatible environment where an LLM agent prevents backend server crashes by intelligently throttling incoming traffic in real time. The agent observes server metrics, reasons about the situation, and picks the optimal throttling action – no training required, just prompting.

Built for the Scaler × Meta PyTorch Hackathon.
## Why This Matters – Traditional vs LLM Approach

### The Problem

Backend servers crash under traffic spikes. This is a real-world problem every tech company faces – Black Friday sales, viral content, DDoS attacks.

### Traditional Approaches (and their limits)
| Approach | How it works | Limitation |
|---|---|---|
| Static rate limiting | Fixed threshold (e.g. 100 req/s max) | Can't adapt – wastes capacity at low load, still crashes during unexpected spikes |
| Auto-scaling | Spin up more servers when load increases | Slow (minutes to provision), expensive, doesn't help with instant spikes |
| Rule-based throttling | If CPU > 80% then drop 50% | Hardcoded thresholds – different servers need different rules, no learning |
| PID controllers | Feedback loop adjusting admission rate | Requires manual tuning per deployment, poor at handling non-linear dynamics |
### Our Approach: LLM as Adaptive Controller
An LLM agent that:
- Reads real-time server metrics (CPU, memory, latency, queue) as natural language
- Reasons about the situation ("traffic is 160% of capacity, need to throttle aggressively")
- Adapts to any server configuration without code changes – just tell it the capacity in the prompt
- Generalizes – the same agent works for a 50 req/s server or a 500 req/s server
This environment lets anyone test whether their LLM can make these split-second decisions correctly.
## How the Environment Works

```text
                    ┌─────────────────┐
Traffic Spikes ────►│   Environment   │────► Server Metrics (CPU, latency, queue)
                    │ (simulator.py)  │                      │
                    └────────▲────────┘                      ▼
                             │                      ┌────────────────┐
                             │                      │   LLM Agent    │
                             │                      │ (inference.py) │
                             │                      └───────┬────────┘
                             │                              │
                             └──────────────────────────────┘
         Action: allow_all / throttle_70 / throttle_40 / drop_aggressive
```
- Environment generates traffic patterns (spikes, ramps, sustained overload)
- Agent observes server state each step and picks a throttling action
- Environment simulates the effect – CPU changes, latency spikes, queue builds
- Grader scores the agent: did it survive? Was latency acceptable? How much traffic got through?
The agent must balance two competing goals:
- Maximize throughput – let as much traffic through as possible (users want fast responses)
- Prevent crashes – don't overload the server (crashes = total failure, score = 0)
## Overview
The environment simulates a backend server receiving variable traffic. The agent observes system metrics every time step and chooses a throttling action to keep the server healthy. The server's physics are modelled realistically: CPU and memory track load linearly, latency spikes superlinearly, and sustained overload causes crashes.
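The load model described above can be sketched roughly as follows. This is an illustrative assumption, not the actual `simulator.py` code – the function name and constants are made up, but it mirrors the stated behavior (CPU tracks load linearly, latency grows superlinearly, overload past the crash threshold kills the server):

```python
def simulate_step(allowed_rate: float, capacity: float = 100.0,
                  base_latency: float = 50.0, crash_load_ratio: float = 1.3):
    """Hypothetical one-step server model (illustrative, not simulator.py)."""
    load = allowed_rate / capacity               # fraction of capacity in use
    cpu = min(1.0, 0.1 + 0.8 * load)             # CPU tracks load roughly linearly
    latency = base_latency * (1.0 + load ** 2)   # latency grows superlinearly
    crashed = load >= crash_load_ratio           # overload beyond threshold crashes
    return cpu, latency, crashed

# At 80% of capacity the server is healthy; at 140 req/s it is past 1.3x and crashes.
cpu, latency, crashed = simulate_step(80.0)
```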
Observation Space
| Field | Type | Range | Description |
|---|---|---|---|
cpu_usage |
float | 0.0 β 1.0 | CPU utilization fraction |
memory_usage |
float | 0.0 β 1.0 | Memory utilization fraction |
request_rate |
float | β₯ 0 | Incoming requests per second |
queue_length |
int | 0 β 500 | Pending requests in backlog |
avg_latency |
float | β₯ 0 | Average response latency (ms) |
step |
int | β₯ 0 | Current episode step |
crashed |
bool | β | Whether the server crashed this step |
Action Space
| Action | Accept Rate | Description |
|---|---|---|
allow_all |
100% | Safe load β accept all requests |
throttle_70 |
70% | Moderate load β drop 30% |
throttle_40 |
40% | High load β drop 60% |
drop_aggressive |
20% | Imminent crash β drop 80% |
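The accept rates translate directly into admitted traffic. A small hypothetical helper (the mapping values come from the table; the function itself is illustrative):

```python
# Fixed accept rates per action, from the Action Space table.
ACCEPT_RATES = {
    "allow_all": 1.0,
    "throttle_70": 0.7,
    "throttle_40": 0.4,
    "drop_aggressive": 0.2,
}

def allowed_traffic(action: str, incoming_rate: float) -> float:
    """Requests per second actually admitted to the server."""
    return incoming_rate * ACCEPT_RATES[action]

allowed_traffic("throttle_70", 160.0)  # roughly 112 req/s admitted
```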
## Tasks

### Task Easy – Single Spike

- Traffic: 40 req/s baseline → 160 req/s spike at step 10 for 5 steps → back to 40
- Episode length: 30 steps
- Scoring:
  - `1.0` – no crash AND avg latency < 300 ms
  - `0.5` – no crash, but avg latency ≥ 300 ms
  - `0.0` – any crash
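The easy-task rule is simple enough to write down directly. A sketch of the scoring logic (an assumed helper, not the actual `graders.py` implementation):

```python
def grade_easy(crashed: bool, avg_latency_ms: float) -> float:
    """Easy task: pass/partial/fail based on crash and average latency."""
    if crashed:
        return 0.0                       # any crash zeroes the score
    return 1.0 if avg_latency_ms < 300 else 0.5

grade_easy(False, 250.0)  # -> 1.0
```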
### Task Medium – Multiple Spikes

- Traffic: 50 req/s baseline with 3 spikes of 150 req/s at steps 5, 15, 25 (3 steps each)
- Episode length: 40 steps
- Scoring:
  - `(steps_without_crash / total_steps) × latency_factor`
  - `latency_factor` = 1.0 at ≤ 200 ms, 0.5 at ≥ 600 ms, linear in between
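The linear interpolation for the medium task can be sketched as follows (assumed helper names; not the actual `graders.py` code):

```python
def latency_factor(avg_latency_ms: float) -> float:
    """1.0 at <= 200 ms, 0.5 at >= 600 ms, linear in between."""
    if avg_latency_ms <= 200:
        return 1.0
    if avg_latency_ms >= 600:
        return 0.5
    return 1.0 - 0.5 * (avg_latency_ms - 200) / 400

def grade_medium(steps_without_crash: int, total_steps: int,
                 avg_latency_ms: float) -> float:
    return (steps_without_crash / total_steps) * latency_factor(avg_latency_ms)
```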
### Task Hard – Sustained Overload

- Traffic: ramps 60 → 200 req/s over 20 steps, stays at 200 for 20 steps, drops to 80
- Episode length: 50 steps
- Scoring:
  - `throughput_ratio × 0.7 + queue_factor × 0.3`
  - `throughput_ratio` = total allowed / total incoming
  - `queue_factor` = fraction of steps with queue < 100
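The hard-task formula combines the two terms directly. A sketch under the definitions above (assumed helper, not the actual `graders.py` code):

```python
def grade_hard(total_allowed: float, total_incoming: float,
               steps_with_small_queue: int, total_steps: int) -> float:
    """Hard task: weighted mix of throughput and queue stability."""
    throughput_ratio = total_allowed / total_incoming
    queue_factor = steps_with_small_queue / total_steps  # steps with queue < 100
    return throughput_ratio * 0.7 + queue_factor * 0.3
```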
## API Endpoints

| Method | Path | Description |
|---|---|---|
| POST | `/reset` | Reset environment, returns initial state |
| POST | `/step` | Execute action, returns state/reward/done/info |
| GET | `/state` | Current server state |
| GET | `/tasks` | List all 3 tasks |
| GET | `/openenv.yaml` | OpenEnv specification |
| GET | `/health` | Liveness probe |
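The endpoints above can be driven with nothing but the standard library. A minimal sketch (the URL and payload shapes follow the docs in this README; `make_request` and `run_episode` are illustrative helpers, and `client.py` wraps the same endpoints more conveniently):

```python
import json
import urllib.request

BASE = "http://localhost:7860"  # environment URL (see Setup)

def make_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for an environment endpoint."""
    return urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def post(path: str, payload: dict) -> dict:
    with urllib.request.urlopen(make_request(path, payload)) as resp:
        return json.load(resp)

def run_episode(task_id: str = "task_easy", action: str = "throttle_70") -> float:
    """Play one episode with a fixed action; returns the final score."""
    post("/reset", {"task_id": task_id})
    while True:
        result = post("/step", {"action": action})
        if result["done"]:
            return result["info"]["final_score"]
```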
## Configurable Environment

The environment is fully configurable via the `/reset` endpoint. Pass a `config` object to simulate different server profiles:

```bash
curl -X POST localhost:7860/reset -H "Content-Type: application/json" \
  -d '{"task_id": "task_easy", "config": {"server_capacity": 200, "base_latency": 30}}'
```
| Parameter | Default | Description |
|---|---|---|
| `server_capacity` | 100.0 | Max requests/sec the server can handle |
| `base_latency` | 50.0 | Response time at zero load (ms) |
| `crash_load_ratio` | 1.3 | Server crashes at this multiple of capacity |
| `max_queue` | 500 | Maximum pending request queue size |
| `traffic_scale` | 1.0 | Multiplier for traffic patterns (2.0 = double traffic) |

The LLM agent adapts automatically – the system prompt includes the configured capacity so the model knows the server's limits.
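A capacity-aware system prompt might be assembled like this. The wording is purely illustrative – the actual prompt lives in `inference.py` and may differ:

```python
def build_system_prompt(server_capacity: float,
                        crash_load_ratio: float = 1.3) -> str:
    """Hypothetical sketch: embed the configured limits into the prompt."""
    return (
        f"You control admission to a backend server that handles up to "
        f"{server_capacity:.0f} req/s and crashes near "
        f"{server_capacity * crash_load_ratio:.0f} req/s of admitted load. "
        "Each step, reply with exactly one action: allow_all, throttle_70, "
        "throttle_40, or drop_aggressive."
    )
```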
## Setup

### Local (Python)

```bash
pip install -r requirements.txt

# Start the environment + Gradio UI
python app.py

# Smoke tests
curl -s localhost:7860/health
curl -s -X POST localhost:7860/reset -H "Content-Type: application/json" \
  -d '{"task_id": "task_easy"}' | python -m json.tool
curl -s -X POST localhost:7860/step -H "Content-Type: application/json" \
  -d '{"action": "throttle_70"}' | python -m json.tool
curl -s localhost:7860/tasks | python -m json.tool
curl -s localhost:7860/openenv.yaml
```
### Docker

```bash
docker build -t traffic-controller .
docker run -p 7860:7860 traffic-controller
```
## Running Inference

Set the three required environment variables, then run `inference.py`:

```bash
export API_BASE_URL="https://api-inference.huggingface.co/models/<your-model>/v1"
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
export HF_TOKEN="hf_..."
export ENV_URL="http://localhost:7860"  # optional, defaults to this

python inference.py
```
Expected output:

```text
Environment URL : http://localhost:7860
Model           : meta-llama/Llama-3.1-8B-Instruct
API base        : https://api-inference.huggingface.co/...
Health check OK

=== TASK_EASY ===
Starting task_easy (max_steps=30)
step= 1  action=allow_all  reward=+0.950  latency= 56.5ms  queue=  0  cpu=0.54
...
task_easy done – total_reward=27.3, score=1.000

=== RESULTS ===
task_easy   : 1.000
task_medium : 0.875
task_hard   : 0.623
Overall     : 0.833
```
## Baseline Scores

Measured on the deterministic simulator. Scores are in 0.0 – 1.0.
| Agent | task_easy | task_medium | task_hard | Overall |
|---|---|---|---|---|
| Always allow_all (naive) | 0.000 💥 | 0.833 | 0.300 💥 | 0.378 |
| Always drop_aggressive (conservative) | 1.000 | 1.000 | 0.440 | 0.813 |
| Adaptive agent (scales to config) | 1.000 | 1.000 | 0.500 | 0.833 |
| LLM agent (target) | β₯ 0.9 | β₯ 0.9 | β₯ 0.6 | β₯ 0.8 |
💥 = server crash occurred during episode

**Key insight:** The hard task is the differentiator – naive and conservative agents score ≤ 0.44 because sustained 200 req/s overload requires balancing throughput (don't drop too much) against stability (don't let load crash the server). A smart LLM agent should outperform all rule-based baselines here.
## Infrastructure
- Port: 7860 (HuggingFace Spaces)
- CPU: 2 vCPU
- Memory: 8 GB
- GPU: not required
- Inference timeout: < 20 minutes total
## Project Structure

```text
.
├── app.py            # Gradio UI + mounts FastAPI endpoints
├── environment.py    # FastAPI app + episode logic
├── simulator.py      # Backend physics (latency, CPU, memory, crash)
├── models.py         # Pydantic models (state, action, config, request/response)
├── tasks.py          # Traffic patterns + task metadata
├── graders.py        # Per-task scoring functions (0.0–1.0)
├── inference.py      # LLM agent runner (OpenAI client)
├── client.py         # Python EnvClient for programmatic access
├── __init__.py       # Exports Action, ServerState, EnvClient
├── openenv.yaml      # OpenEnv spec
├── pyproject.toml    # Package metadata
├── Dockerfile
├── requirements.txt
└── README.md
```
## Interactive Demo (UI Guide)

The HF Space serves a Gradio dashboard alongside the API. Here's what you can do:

### Controls

- Traffic Scenario – pick `task_easy` (single spike), `task_medium` (3 spikes), or `task_hard` (sustained ramp to 200 req/s)
- Agent Strategy – compare the Adaptive Agent vs baselines (Always Allow crashes, Always Throttle 40% wastes capacity)
- Server Configuration – sliders to customize:
  - Server Capacity (20–500 req/s)
  - Base Latency (10–200 ms)
  - Crash Threshold (1.1x–2.0x)
  - Traffic Scale (0.5x–3.0x)
- LLM Agent – plug in your own API key + model to test a real LLM as the controller
### Dashboard Charts (6 panels)

- Traffic: Incoming vs Allowed – red line = raw traffic, green = what the agent lets through, dashed = server capacity
- Agent Actions – color-coded bars (green = allow, yellow = 70%, orange = 40%, red = drop)
- CPU & Memory – server health with danger line at 80%
- Avg Latency – response time with danger line at 400 ms
- Queue Length – pending request backlog
- Cumulative Reward – running score (higher = better)
### Agent Reasoning Log

Below the charts, a step-by-step log shows the agent's thinking at each step:

```text
Step 10 ⚠️ | Traffic: 160 req/s | Action: 🔴 drop_aggressive → allowed 32 req/s
  > Rate 160 req/s = 320% of capacity (50). Drop aggressively!
  > CPU: 68% | Latency: 70ms | Queue: 0 | Reward: +0.200
```
## Building an Agent for This Environment

### Quick Start (Python)

```python
from client import EnvClient

with EnvClient("http://localhost:7860") as env:
    data = env.reset("task_easy")
    state = data["state"]
    while True:
        # Your agent logic here
        action = decide(state)  # returns "allow_all", "throttle_70", etc.
        result = env.step(action)
        state = result["state"]
        reward = result["reward"]
        if result["done"]:
            print(f"Score: {result['info']['final_score']}")
            break
```
### What Makes a Good Agent

- React to `request_rate` – this shows upcoming traffic, not past traffic. If it exceeds capacity, throttle BEFORE the server overloads.
- Don't over-throttle – `drop_aggressive` prevents crashes but kills throughput. Use it only when truly needed.
- Watch the queue – if `queue_length` is building up, the server is falling behind. Throttle more until it drains.
- Adapt to config – different servers have different capacities. Read the config from the `/reset` response and adjust thresholds accordingly.
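A rule-of-thumb controller implementing these heuristics might look like the sketch below. The thresholds are illustrative, not tuned; a `decide` function with this shape also slots into the Quick Start loop above:

```python
def decide(state: dict, capacity: float = 100.0) -> str:
    """Heuristic controller: throttle harder as load and queue grow.
    Thresholds are illustrative assumptions, not tuned values."""
    load = state["request_rate"] / capacity
    queue = state["queue_length"]
    if load >= 1.5 or queue > 200:
        return "drop_aggressive"   # imminent crash territory
    if load >= 1.2 or queue > 100:
        return "throttle_40"       # well over capacity or falling behind
    if load >= 0.9 or queue > 20:
        return "throttle_70"       # approaching capacity
    return "allow_all"             # safe load

decide({"request_rate": 160, "queue_length": 0})  # -> "drop_aggressive"
```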
### Reward Function

```text
reward = throughput_reward - latency_penalty * 0.5 - queue_penalty * 0.3

throughput_reward = allowed_requests / incoming_requests   (0.0 – 1.0)
latency_penalty   = (avg_latency - 200) / 800              (0 at 200 ms, 1 at 1000 ms)
queue_penalty     = queue_length / max_queue               (0.0 – 1.0)
crash             = -10.0                                  (instant game over)
```

The agent is rewarded for letting traffic through, penalized for high latency and queue buildup, and heavily punished for crashing.
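The reward formula transcribes directly into code. Two details below are assumptions on my part rather than stated in this README: clamping `latency_penalty` to [0, 1], and treating zero incoming traffic as full throughput:

```python
def compute_reward(allowed: float, incoming: float, avg_latency: float,
                   queue_length: int, max_queue: int = 500,
                   crashed: bool = False) -> float:
    """Sketch of the per-step reward described above."""
    if crashed:
        return -10.0                              # instant game over
    throughput_reward = allowed / incoming if incoming > 0 else 1.0
    # Clamping to [0, 1] is an assumption; the README only gives the endpoints.
    latency_penalty = min(1.0, max(0.0, (avg_latency - 200) / 800))
    queue_penalty = queue_length / max_queue
    return throughput_reward - latency_penalty * 0.5 - queue_penalty * 0.3

compute_reward(100, 100, 200, 0)  # -> 1.0 (full throughput, no penalties)
```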