LLM-Driven Last-Mile Delivery Dispatch Environment

Problem Statement

Build an OpenEnv environment that simulates a last-mile delivery dispatch system in a simplified city grid. A fixed fleet of delivery agents must be assigned to dynamically arriving orders. Each order has a pickup location, drop location, reward value, and delivery deadline. Some regions of the city may also be demand hotspots, and some zones may have fixed congestion that increases travel cost.

At every decision step, an LLM acts as the central dispatcher. It observes the current environment state and decides which idle agents should be assigned to which active orders. The objective is to maximize total earned reward while minimizing lateness, missed orders, invalid assignments, and poor fleet utilization.

This environment is designed to represent a simplified version of real-world dispatch systems used in logistics and food delivery platforms. The challenge is not low-level route control, but high-level sequential decision-making under limited resources, time pressure, and spatial constraints.

Implementation note:

the environment is intentionally designed for LLM-style decision making
the submission runtime should still remain self-contained and reproducible if external model credentials are unavailable
optional external model execution may be supported, but the environment should remain meaningful and runnable without relying on paid provider credits
the HTTP API should remain equally usable by non-LLM evaluators such as heuristics, planners, or scripted policies

One-Line Summary

An LLM-driven delivery dispatch simulator where a model assigns limited agents to dynamic orders under time, distance, deadline, reward, and congestion constraints.

Environment Design

The city is represented as a 2D grid.

Each episode contains:

a fixed grid layout
a fixed number of delivery agents
a seeded, reproducible schedule of incoming orders
optional hotspot zones where more orders originate
optional fixed congestion zones that increase movement cost

The environment should remain reproducible for grading. The same seed should recreate the same world instance. If no seed is provided, the environment may generate one internally, but it should return the actual used_seed so the run can be replayed exactly.
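The seeding contract above can be sketched as follows. This is a minimal illustration, not the project's generator: the function name `make_order_schedule`, its parameters, and the uniform sampling are all assumptions made for the example. Only the `used_seed` idea and the order field names come from the spec.

```python
import random
import secrets


def make_order_schedule(seed=None, n_orders=5, grid_size=8, horizon=40):
    """Return (used_seed, schedule); the same seed always yields the same schedule."""
    # If the caller supplies no seed, pick one and report it back for exact replay.
    used_seed = seed if seed is not None else secrets.randbelow(2**31)
    rng = random.Random(used_seed)  # private RNG, so global random state is untouched
    schedule = []
    for i in range(n_orders):
        schedule.append({
            "order_id": f"o{i + 1}",
            "created_at": rng.randrange(horizon),
            "pickup_location": [rng.randrange(grid_size), rng.randrange(grid_size)],
            "drop_location": [rng.randrange(grid_size), rng.randrange(grid_size)],
        })
    # Stable sort by arrival time keeps the result deterministic.
    schedule.sort(key=lambda order: order["created_at"])
    return used_seed, schedule
```

Calling the function twice with the same seed returns identical schedules. Calling it without a seed returns a `used_seed` that can be passed back in to replay the run.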

Core Entities

Agents

Each agent has:

agent_id
location: (x, y)
status: idle | busy
busy_until
assigned_order_id | null

Rules:

an agent can handle at most one order at a time
only idle agents can be assigned new orders
agents become available again when their current job is completed

Orders

Each order has:

order_id
created_at
pickup_location: (x, y)
drop_location: (x, y)
reward_value
deadline
status: unassigned | assigned | completed | expired

Rules:

an order can be assigned to only one agent
an order expires if it is not completed by the cutoff defined in the scenario's expiration rule
expired orders cannot be reassigned

Zones

Each cell in the grid may be one of:

normal: movement cost 1
congested: movement cost 2

Optional metadata:

hotspot: true | false

Hotspots affect order generation frequency, not movement directly.
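One possible way to model the agent and order entities in code is shown below. The field names follow the lists above. The dataclass layout, the default values, and the helper `can_assign` are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Cell = Tuple[int, int]  # (x, y) grid coordinate


@dataclass
class Agent:
    agent_id: str
    location: Cell
    status: str = "idle"  # "idle" | "busy"
    busy_until: int = 0
    assigned_order_id: Optional[str] = None


@dataclass
class Order:
    order_id: str
    created_at: int
    pickup_location: Cell
    drop_location: Cell
    reward_value: float
    deadline: int
    status: str = "unassigned"  # "unassigned" | "assigned" | "completed" | "expired"


def can_assign(agent: Agent, order: Order) -> bool:
    """Entity rules: only idle agents take work, and only unassigned orders are open."""
    return agent.status == "idle" and order.status == "unassigned"
```

Because expired orders never return to `unassigned`, the same check also covers the rule that expired orders cannot be reassigned.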

State Representation

The state() output should be compact, structured, and easy for an LLM to read.

Observation design principle:

expose raw world state and outcome signals
do not expose planner-side helper hints such as nearest-agent suggestions, future schedules, or direct feasibility labels

Recommended state schema:

{
  "time": 12,
  "grid": {
    "width": 15,
    "height": 15,
    "congested_zones": [[6, 6], [6, 7], [7, 6], [7, 7]],
    "hotspots": [[11, 11], [12, 11], [11, 12]]
  },
  "agents": [
    {
      "agent_id": "a1",
      "location": [2, 3],
      "status": "idle",
      "busy_until": 12,
      "assigned_order_id": null
    }
  ],
  "orders": [
    {
      "order_id": "o4",
      "created_at": 10,
      "pickup_location": [11, 12],
      "drop_location": [13, 14],
      "reward_value": 18,
      "deadline": 20,
      "status": "unassigned"
    }
  ],
  "scenario_info": {
    "name": "hotspot_congestion",
    "episode_horizon": 40,
    "used_seed": 7
  }
}

Action Space

At each step, the dispatcher returns assignment decisions in strict JSON.

Recommended format:

{
  "assignments": [
    {"agent_id": "a1", "order_id": "o4"},
    {"agent_id": "a2", "order_id": "o2"}
  ]
}

Action rules:

only idle agents may be assigned
each agent may appear at most once
each order may appear at most once
omitted idle agents remain idle
malformed, duplicate, or infeasible assignments are ignored and penalized lightly
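The action rules above could be enforced with a validator along these lines. This is a sketch under stated assumptions: the function name `validate_assignments` is hypothetical, a repeated agent or order keeps the first occurrence and rejects the rest, and only identifier-level checks are made. Deadline-based feasibility is left out because the spec does not define it precisely.

```python
def validate_assignments(action, idle_agent_ids, open_order_ids):
    """Split a dispatcher action into (valid_pairs, invalid_count)."""
    items = action.get("assignments") if isinstance(action, dict) else None
    if not isinstance(items, list):
        return [], 1  # the whole action is malformed

    valid, invalid = [], 0
    used_agents, used_orders = set(), set()
    for item in items:
        agent_id = item.get("agent_id") if isinstance(item, dict) else None
        order_id = item.get("order_id") if isinstance(item, dict) else None
        ok = (
            isinstance(agent_id, str)
            and isinstance(order_id, str)
            and agent_id in idle_agent_ids      # only idle agents may be assigned
            and order_id in open_order_ids      # only currently open orders
            and agent_id not in used_agents     # each agent at most once
            and order_id not in used_orders     # each order at most once
        )
        if ok:
            valid.append((agent_id, order_id))
            used_agents.add(agent_id)
            used_orders.add(order_id)
        else:
            invalid += 1  # ignored, and counted for the light penalty
    return valid, invalid
```

Idle agents that do not appear in `valid` simply stay idle, which matches the rule for omitted agents.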

Step Semantics

Each step() does the following:

receives the dispatcher's assignment action
validates all assignments
assigns valid (agent, order) pairs
computes each assigned job's total completion time
updates agents to busy
advances the simulator to the next event
resolves completed jobs
expires overdue orders
injects any new scheduled orders
returns updated state, reward, done flag, and info

Episodes are also bounded by a configurable max_decision_steps limit. When that limit is reached:

no further decisions are accepted
future unseen orders are ignored
already assigned orders are deterministically rolled forward and scored
visible unassigned orders are terminally expired and penalized
the environment returns done = true together with a final summary

Movement and Travel Cost

Base movement uses grid travel.

For v1, use shortest-path travel cost over the grid where:

entering a normal cell costs 1
entering a congested cell costs 2

Delivery time for an assigned order:

job_time = travel(agent -> pickup) + travel(pickup -> drop) + service_time

Use:

service_time = 1

To keep implementation simple and valid:

use deterministic shortest-path cost
no random traffic
no dynamic road closures
no stochastic delays
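The travel rules above map naturally onto Dijkstra's algorithm with the cost charged on entering a cell. The sketch below assumes 4-connected movement (up, down, left, right), which the spec does not state explicitly. The function names `travel_cost` and `job_time` are also illustrative.

```python
import heapq

SERVICE_TIME = 1


def travel_cost(start, goal, width, height, congested):
    """Cheapest travel cost from start to goal; entering a congested cell costs 2."""
    congested = {tuple(cell) for cell in congested}
    start, goal = tuple(start), tuple(goal)
    best = {start: 0}
    frontier = [(0, start)]
    while frontier:
        cost, (x, y) = heapq.heappop(frontier)
        if (x, y) == goal:
            return cost
        if cost > best.get((x, y), float("inf")):
            continue  # stale queue entry
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < width and 0 <= ny < height:
                new_cost = cost + (2 if (nx, ny) in congested else 1)
                if new_cost < best.get((nx, ny), float("inf")):
                    best[(nx, ny)] = new_cost
                    heapq.heappush(frontier, (new_cost, (nx, ny)))
    raise ValueError("goal is not reachable inside the grid")


def job_time(agent_loc, pickup, drop, width, height, congested):
    """job_time = travel(agent -> pickup) + travel(pickup -> drop) + service_time."""
    return (
        travel_cost(agent_loc, pickup, width, height, congested)
        + travel_cost(pickup, drop, width, height, congested)
        + SERVICE_TIME
    )
```

For example, on an empty 3x3 grid an agent at (0, 0) serving a pickup at (2, 0) and a drop at (2, 2) has a job time of 2 + 2 + 1 = 5.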

Time Advancement

Use event-based advancement.

After each dispatch action, advance time to the earliest of:

next agent becoming available
next order arrival
episode horizon reached

This keeps the episode efficient and avoids unnecessary empty steps.
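The event-based clock can be sketched as a minimum over the three candidate times listed above. The function name and argument names are hypothetical. Events at or before the current time are skipped here so the clock always moves forward, which is an assumption rather than something the spec states.

```python
def next_event_time(now, busy_until_times, future_arrival_times, horizon):
    """Return the earliest upcoming event time, capped at the episode horizon."""
    candidates = [horizon]
    candidates += [t for t in busy_until_times if t > now]      # agents freeing up
    candidates += [t for t in future_arrival_times if t > now]  # scheduled orders
    return min(candidates)
```

With the clock at 12, agents busy until 15 and 18, and orders arriving at 14 and 30, the simulator would jump straight to time 14.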

Reward Function

Reward should be shaped and interpretable.

Recommended components:

+ reward_value for on-time delivery
+ early_bonus if delivered well before deadline
- lateness_penalty for late completion
- missed_penalty for expired orders
- invalid_action_penalty for invalid assignments
- idle_penalty when idle agents exist and feasible orders remain

Suggested concrete version:

on-time completion: +reward_value
early completion bonus: +0.1 * reward_value if completed with slack >= 3, where slack = deadline - completion_time
late completion: +0.3 * reward_value - lateness, where lateness = completion_time - deadline
expired order: -0.5 * reward_value
invalid assignment: -1
avoidable idle penalty: -0.5

This creates meaningful feedback without making the score hard to interpret.
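The per-order part of the suggested reward can be written as a single function. This sketch makes three assumptions: slack is measured as deadline minus completion time, lateness as completion time minus deadline, and finishing exactly at the deadline counts as on time. The function name `order_reward` is illustrative, and the step-level penalties (invalid assignment, avoidable idle) are not covered.

```python
def order_reward(reward_value, deadline, completed_at=None):
    """Reward for one resolved order; completed_at=None means the order expired."""
    if completed_at is None:
        return -0.5 * reward_value  # expired order
    slack = deadline - completed_at
    if slack >= 0:
        bonus = 0.1 * reward_value if slack >= 3 else 0.0  # early completion bonus
        return reward_value + bonus
    lateness = -slack
    return 0.3 * reward_value - lateness  # late completion
```

An order worth 20 with deadline 20 would earn 22 if completed at time 15, 20 if completed at time 19, 4 if completed at time 22, and -10 if it expires.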

Objective

Maximize cumulative episode reward.

A strong dispatcher should:

choose assignments that finish more orders on time
avoid wasting agents on low-value or infeasible jobs
handle congestion-aware tradeoffs
position the fleet effectively in hotspot-heavy scenarios

Scenario Suite

Use 3 deterministic tasks.

Task 1: Low Demand

Purpose:

verify core assignment logic
reward simple nearest-feasible reasoning

Setup:

grid: 8x8
agents: 3
orders: 8-10
congestion: none or minimal
deadlines: generous
hotspot effect: low

Expected behavior:

most orders should be serviceable
mistakes come mostly from poor assignment choices
hotspot structure should be mostly stable

Task 2: High Demand

Purpose:

test prioritization under scarcity
introduce moderate but steadier world evolution

Setup:

grid: 10x10 or 12x12
agents: 3-4
orders: 18-25
congestion: a few fixed zones
deadlines: moderate
hotspot effect: medium

Expected behavior:

not all orders can be served
the dispatcher must prioritize based on reward, distance, and urgency

Task 3: Hotspot + Congestion

Purpose:

test strategic dispatch in the richest setting

Setup:

grid: 15x15
agents: 4-5
orders: 20-28
hotspot zones: concentrated demand in selected regions
congestion zones: fixed cells with movement cost 2
deadlines: mixed, with some tight high-value orders

Expected behavior:

the LLM must trade off:
- high-value but congested orders
- urgent nearby orders
- long-term fleet positioning around hotspots

Scenario Generation Rules

To preserve reproducibility:

use fixed seeds or fully predefined schedules
use deterministic order arrival times
use deterministic congestion maps
use fixed hotspot coordinates per scenario

Hotspots should influence where pickups appear more often, especially in Task 3 (Hotspot + Congestion).

To support robustness testing without turning the benchmark into noise:

the environment may accept a seed
different seeds should perturb timings and spatial patterns within bounded ranges
the same seed should produce the same episode

Grading

Each task gets a normalized score in [0, 1].

Recommended formula:

score = clamp((agent_reward - baseline_reward) / (target_reward - baseline_reward), 0, 1)

Where:

baseline_reward is produced by a naive deterministic policy
target_reward is produced by a stronger heuristic dispatch policy
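The normalization formula translates directly into code. The guard against a non-positive reward span is an added assumption: the spec does not say what should happen if the target policy fails to beat the baseline.

```python
def normalized_score(agent_reward, baseline_reward, target_reward):
    """score = clamp((agent - baseline) / (target - baseline), 0, 1)."""
    span = target_reward - baseline_reward
    if span <= 0:
        # Assumption: treat a degenerate anchor pair as a configuration error.
        raise ValueError("target_reward must exceed baseline_reward")
    return max(0.0, min(1.0, (agent_reward - baseline_reward) / span))
```

For instance, an agent reward of 57 with a baseline of 24 and a target of 68 gives (57 - 24) / (68 - 24) = 0.75.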

Baseline Policy

Simple rule-based dispatcher:

sort active orders by earliest deadline
assign nearest idle agent greedily

Target Policy

Stronger heuristic dispatcher:

score orders using reward, travel cost, deadline slack, and congestion-adjusted feasibility
greedily assign the best feasible matches

Example heuristic:

priority = 1.5 * reward_value - 1.0 * travel_cost - 2.0 * urgency_penalty

This gives stable anchors for normalization and makes scores interpretable across tasks.
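A greedy matcher built on the example priority formula might look like the sketch below. Several parts are assumptions rather than the project's actual target policy: `urgency_penalty` is interpreted as projected lateness, Manhattan distance stands in for the congestion-aware shortest-path cost, ties are broken by identifier, and no pair is filtered out as infeasible.

```python
SERVICE_TIME = 1


def manhattan(a, b):
    """Stand-in travel cost; a real policy would use the congestion-aware cost."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def greedy_dispatch(idle_agents, orders, now, travel=manhattan):
    """idle_agents: {agent_id: (x, y)}; orders: list of order dicts from state()."""
    scored = []
    for agent_id, location in idle_agents.items():
        for order in orders:
            cost = (travel(location, order["pickup_location"])
                    + travel(order["pickup_location"], order["drop_location"]))
            finish = now + cost + SERVICE_TIME
            urgency_penalty = max(0, finish - order["deadline"])  # assumed meaning
            priority = (1.5 * order["reward_value"]
                        - 1.0 * cost
                        - 2.0 * urgency_penalty)
            scored.append((priority, agent_id, order["order_id"]))

    # Highest priority first; ties broken by ids so the result is deterministic.
    scored.sort(key=lambda s: (-s[0], s[1], s[2]))

    used_agents, used_orders, assignments = set(), set(), []
    for _, agent_id, order_id in scored:
        if agent_id not in used_agents and order_id not in used_orders:
            assignments.append({"agent_id": agent_id, "order_id": order_id})
            used_agents.add(agent_id)
            used_orders.add(order_id)
    return {"assignments": assignments}
```

The return value uses the same `assignments` structure as the action format, so a scripted policy like this can be submitted through the same interface as an LLM dispatcher.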

Per-Task Grader Output

Recommended output:

{
  "task_id": "hotspot_congestion",
  "raw_reward": 57.0,
  "baseline_reward": 24.0,
  "target_reward": 68.0,
  "score": 0.75
}

Terminal Resolution

When the configured decision-step budget is exhausted:

assigned orders that would still finish before their service cutoff are resolved and scored
assigned orders that would miss cutoff are expired
visible unassigned orders are expired immediately
not-yet-visible future orders are ignored

This keeps episode length bounded while preserving deterministic final scoring.

Overall Score

Recommended weighted average:

low demand: 0.2
high demand: 0.3
hotspot + congestion: 0.5

This makes the hardest and most realistic task matter most.
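The weighted average can be computed as shown below. The task identifiers `low_demand` and `high_demand` are assumed names for illustration; `hotspot_congestion` matches the scenario name used in the example outputs.

```python
TASK_WEIGHTS = {
    "low_demand": 0.2,          # assumed task id
    "high_demand": 0.3,         # assumed task id
    "hotspot_congestion": 0.5,
}


def overall_score(task_scores):
    """Weighted average of per-task normalized scores, each in [0, 1]."""
    return sum(weight * task_scores[task] for task, weight in TASK_WEIGHTS.items())
```

Because the weights sum to 1, the overall score also stays in [0, 1]. Per-task scores of 1.0, 0.5, and 0.75 would combine to 0.2 + 0.15 + 0.375 = 0.725.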

Submission Requirements Alignment

This spec is designed to support:

real-world environment framing
clear step(), reset(), state() behavior
3 distinct tasks
reproducible graders
meaningful reward shaping
lightweight execution under hackathon constraints

Why This Version Is Good

This version is strong because it stays:

realistic
deterministic
judge-friendly
fast enough to validate
rich enough to show non-trivial LLM reasoning

It also avoids the common trap of overbuilding simulation complexity before the environment core is solid.