fleetmind / PROJECT_SPEC.md

Rishav

Document non-LLM evaluator workflow

2e7762d 3 months ago


	# LLM-Driven Last-Mile Delivery Dispatch Environment

	## Problem Statement

	Build an `OpenEnv` environment that simulates a last-mile delivery dispatch system in a simplified city grid. A fixed fleet of delivery agents must be assigned to dynamically arriving orders. Each order has a pickup location, drop location, reward value, and delivery deadline. Some regions of the city may also be demand hotspots, and some zones may have fixed congestion that increases travel cost.

	At every decision step, an LLM acts as the central dispatcher. It observes the current environment state and decides which idle agents should be assigned to which active orders. The objective is to maximize total earned reward while minimizing lateness, missed orders, invalid assignments, and poor fleet utilization.

	This environment is designed to represent a simplified version of real-world dispatch systems used in logistics and food delivery platforms. The challenge is not low-level route control, but high-level sequential decision-making under limited resources, time pressure, and spatial constraints.

	Implementation note:
	- the environment is intentionally designed for LLM-style decision making
	- the submission runtime should still remain self-contained and reproducible if external model credentials are unavailable
	- optional external model execution may be supported, but the environment should remain meaningful and runnable without relying on paid provider credits
	- the HTTP API should remain equally usable by non-LLM evaluators such as heuristics, planners, or scripted policies

	## One-Line Summary

	An LLM-driven delivery dispatch simulator where a model assigns limited agents to dynamic orders under time, distance, deadline, reward, and congestion constraints.

	## Environment Design

	The city is represented as a 2D grid.

	Each episode contains:
	- a fixed grid layout
	- a fixed number of delivery agents
	- a seeded, reproducible schedule of incoming orders
	- optional hotspot zones where more orders originate
	- optional fixed congestion zones that increase movement cost

	The environment should remain reproducible for grading. The same seed should recreate the same world instance.
	If no seed is provided, the environment may generate one internally, but it should return the actual `used_seed` so the run can be replayed exactly.
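The seed handling described above could look like the following minimal sketch. The names `resolve_seed` and `sample_agent_positions` are hypothetical and not part of the spec; the only behavior taken from the text is that a missing seed is generated internally and reported back as `used_seed`.

```python
import random


def resolve_seed(seed=None):
    """Return the seed to use, drawing a fresh one when none is supplied."""
    if seed is None:
        # Assumption: any 32-bit integer is an acceptable seed value.
        seed = random.SystemRandom().randrange(2**32)
    return seed


def sample_agent_positions(seed, n_agents, width, height):
    """Place agents on the grid and report the seed that was actually used."""
    used_seed = resolve_seed(seed)
    rng = random.Random(used_seed)  # private RNG, independent of global state
    positions = [(rng.randrange(width), rng.randrange(height)) for _ in range(n_agents)]
    return used_seed, positions
```

Replaying a run then only requires passing the returned `used_seed` back in.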

	## Core Entities

	### Agents

	Each agent has:
	- `agent_id`
	- `location: (x, y)`
	- `status: idle | busy`
	- `busy_until`
	- `assigned_order_id | null`

	Rules:
	- an agent can handle at most one order at a time
	- only idle agents can be assigned new orders
	- agents become available again when their current job is completed

	### Orders

	Each order has:
	- `order_id`
	- `created_at`
	- `pickup_location: (x, y)`
	- `drop_location: (x, y)`
	- `reward_value`
	- `deadline`
	- `status: unassigned | assigned | completed | expired`

	Rules:
	- an order can be assigned to only one agent
	- an order expires if it is not completed by the time set by the scenario's expiration rule
	- expired orders cannot be reassigned
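As an illustration, the agent and order records and their assignment rules could be modeled with plain dataclasses. This is a sketch under assumptions: the field names follow the lists above, while the `assign` helper and its error handling are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Cell = Tuple[int, int]


@dataclass
class Agent:
    agent_id: str
    location: Cell
    status: str = "idle"  # "idle" or "busy"
    busy_until: int = 0
    assigned_order_id: Optional[str] = None


@dataclass
class Order:
    order_id: str
    created_at: int
    pickup_location: Cell
    drop_location: Cell
    reward_value: float
    deadline: int
    status: str = "unassigned"  # unassigned, assigned, completed, or expired


def assign(agent: Agent, order: Order, finish_time: int) -> None:
    """Bind one idle agent to one unassigned order (hypothetical helper)."""
    if agent.status != "idle":
        raise ValueError(f"agent {agent.agent_id} is not idle")
    if order.status != "unassigned":
        raise ValueError(f"order {order.order_id} is not open for assignment")
    agent.status = "busy"
    agent.busy_until = finish_time
    agent.assigned_order_id = order.order_id
    order.status = "assigned"
```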

	### Zones

	Each cell in the grid may be one of:
	- `normal`: movement cost `1`
	- `congested`: movement cost `2`

	Optional metadata:
	- `hotspot: true | false`

	Hotspots affect order generation frequency, not movement directly.

	## State Representation

	The `state()` output should be compact, structured, and easy for an LLM to read.

	Observation design principle:
	- expose raw world state and outcome signals
	- do not expose planner-side helper hints such as nearest-agent suggestions, future schedules, or direct feasibility labels

	Recommended state schema:

	```json
	{
	  "time": 12,
	  "grid": {
	    "width": 15,
	    "height": 15,
	    "congested_zones": [[6, 6], [6, 7], [7, 6], [7, 7]],
	    "hotspots": [[11, 11], [12, 11], [11, 12]]
	  },
	  "agents": [
	    {
	      "agent_id": "a1",
	      "location": [2, 3],
	      "status": "idle",
	      "busy_until": 12,
	      "assigned_order_id": null
	    }
	  ],
	  "orders": [
	    {
	      "order_id": "o4",
	      "created_at": 10,
	      "pickup_location": [11, 12],
	      "drop_location": [13, 14],
	      "reward_value": 18,
	      "deadline": 20,
	      "status": "unassigned"
	    }
	  ],
	  "scenario_info": {
	    "name": "hotspot_congestion",
	    "episode_horizon": 40,
	    "used_seed": 7
	  }
	}
	```

	## Action Space

	At each step, the dispatcher returns assignment decisions in strict JSON.

	Recommended format:

	```json
	{
	  "assignments": [
	    {"agent_id": "a1", "order_id": "o4"},
	    {"agent_id": "a2", "order_id": "o2"}
	  ]
	}
	```

	Action rules:
	- only idle agents may be assigned
	- each agent may appear at most once
	- each order may appear at most once
	- omitted idle agents remain idle
	- malformed, duplicate, or infeasible assignments are ignored and penalized lightly
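The action rules above can be sketched as a validation pass. This is a minimal illustration, not the environment's actual validator: the function name `validate_assignments` is hypothetical, and two details are assumptions, namely that the first occurrence of a duplicated agent or order is kept, and that a wholly malformed action counts as a single invalid entry.

```python
def validate_assignments(action, idle_agent_ids, open_order_ids):
    """Split a dispatcher action into accepted (agent, order) pairs and an invalid count."""
    accepted = []
    invalid = 0
    used_agents, used_orders = set(), set()

    entries = action.get("assignments") if isinstance(action, dict) else None
    if not isinstance(entries, list):
        return accepted, 1  # assumption: a malformed action counts once

    for entry in entries:
        if not isinstance(entry, dict):
            invalid += 1
            continue
        agent_id = entry.get("agent_id")
        order_id = entry.get("order_id")
        ok = (
            isinstance(agent_id, str)
            and isinstance(order_id, str)
            and agent_id in idle_agent_ids      # only idle agents may be assigned
            and order_id in open_order_ids      # only unassigned, unexpired orders
            and agent_id not in used_agents     # each agent at most once
            and order_id not in used_orders     # each order at most once
        )
        if not ok:
            invalid += 1
            continue
        used_agents.add(agent_id)
        used_orders.add(order_id)
        accepted.append((agent_id, order_id))
    return accepted, invalid
```

The returned invalid count can then be multiplied by whatever per-assignment penalty the reward function uses.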

	## Step Semantics

	Each `step()` does the following:
	1. receives the dispatcher's assignment action
	2. validates all assignments
	3. assigns valid `(agent, order)` pairs
	4. computes each assigned job's total completion time
	5. updates agents to `busy`
	6. advances the simulator to the next event
	7. resolves completed jobs
	8. expires overdue orders
	9. injects any new scheduled orders
	10. returns updated state, reward, done flag, and info

	Episodes are also bounded by a configurable `max_decision_steps` limit. When that limit is reached:
	- no further decisions are accepted
	- future unseen orders are ignored
	- already assigned orders are deterministically rolled forward and scored
	- visible unassigned orders are terminally expired and penalized
	- the environment returns a final summary with `done = true`

	## Movement and Travel Cost

	Base movement uses grid travel.

	For `v1`, use shortest-path travel cost over the grid where:
	- entering a `normal` cell costs `1`
	- entering a `congested` cell costs `2`

	Delivery time for an assigned order:

	```text
	job_time = travel(agent -> pickup) + travel(pickup -> drop) + service_time
	```

	Use:
	- `service_time = 1`

	To keep implementation simple and valid:
	- use deterministic shortest-path cost
	- no random traffic
	- no dynamic road closures
	- no stochastic delays
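One way to compute the deterministic travel cost is Dijkstra's algorithm over the grid, charging the cost of the cell being entered. The sketch below assumes 4-neighbour movement (the spec only says "grid travel") and uses hypothetical function names.

```python
import heapq


def travel_cost(start, goal, width, height, congested):
    """Cheapest path cost; entering a congested cell costs 2, any other cell 1."""
    if start == goal:
        return 0
    best = {start: 0}
    frontier = [(0, start)]
    while frontier:
        cost, (x, y) = heapq.heappop(frontier)
        if (x, y) == goal:
            return cost
        if cost > best[(x, y)]:
            continue  # stale queue entry
        # Assumption: agents move in the four cardinal directions only.
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < width and 0 <= ny < height:
                new_cost = cost + (2 if (nx, ny) in congested else 1)
                if new_cost < best.get((nx, ny), float("inf")):
                    best[(nx, ny)] = new_cost
                    heapq.heappush(frontier, (new_cost, (nx, ny)))
    raise ValueError("goal is not reachable inside the grid")


def job_time(agent_loc, pickup, drop, width, height, congested, service_time=1):
    """job_time = travel(agent -> pickup) + travel(pickup -> drop) + service_time."""
    to_pickup = travel_cost(agent_loc, pickup, width, height, congested)
    to_drop = travel_cost(pickup, drop, width, height, congested)
    return to_pickup + to_drop + service_time
```

Because every input is fixed per scenario, the same query always returns the same cost, which keeps grading reproducible.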

	## Time Advancement

	Use event-based advancement.

	After each dispatch action, advance time to the earliest of:
	- next agent becoming available
	- next order arrival
	- episode horizon reached

	This keeps the episode efficient and avoids unnecessary empty steps.
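A minimal sketch of the event-based clock follows. The function name and argument shapes are hypothetical; considering only events strictly after the current time is an assumption made so that the clock always moves forward.

```python
def next_event_time(now, busy_until_times, arrival_times, horizon):
    """Return the earliest upcoming event time, capped at the episode horizon."""
    candidates = [horizon]
    candidates += [t for t in busy_until_times if t > now]  # agents freeing up
    candidates += [t for t in arrival_times if t > now]     # scheduled order arrivals
    return min(candidates)
```

For example, at `now = 12` with agents freeing up at `19` and `25`, arrivals at `15` and `30`, and a horizon of `40`, the clock jumps to `15`.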

	## Reward Function

	Reward should be shaped and interpretable.

	Recommended components:
	- `+ reward_value` for on-time delivery
	- `+ early_bonus` if delivered well before deadline
	- `- lateness_penalty` for late completion
	- `- missed_penalty` for expired orders
	- `- invalid_action_penalty` for invalid assignments
	- `- idle_penalty` when idle agents exist and feasible orders remain

	Suggested concrete version:
	- on-time completion: `+reward_value`
	- early completion bonus: `+0.1 * reward_value` if completed with slack `>= 3`
	- late completion: `+0.3 * reward_value - lateness`
	- expired order: `-0.5 * reward_value`
	- invalid assignment: `-1`
	- avoidable idle penalty: `-0.5`

	This creates meaningful feedback without making the score hard to interpret.
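The suggested concrete version can be sketched as below. Several readings are assumptions rather than statements from the spec: slack is taken as `deadline - completed_at`, lateness as `completed_at - deadline`, and finishing exactly at the deadline counts as on time. The function and constant names are hypothetical.

```python
INVALID_ASSIGNMENT_PENALTY = -1.0
AVOIDABLE_IDLE_PENALTY = -0.5


def completion_reward(reward_value, deadline, completed_at, early_slack=3):
    """Reward for a delivered order under the suggested concrete version."""
    slack = deadline - completed_at
    if slack >= 0:
        # On time, plus a 10% bonus when finished with enough slack.
        bonus = 0.1 * reward_value if slack >= early_slack else 0.0
        return reward_value + bonus
    lateness = -slack
    return 0.3 * reward_value - lateness


def expiry_penalty(reward_value):
    """Penalty for an order that expired without being delivered."""
    return -0.5 * reward_value
```

Note that under this reading a very late delivery can score below zero, since `lateness` grows without bound while the `0.3 * reward_value` term is fixed.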

	## Objective

	Maximize cumulative episode reward.

	A strong dispatcher should:
	- choose assignments that finish more orders on time
	- avoid wasting agents on low-value or infeasible jobs
	- handle congestion-aware tradeoffs
	- position the fleet effectively in hotspot-heavy scenarios

	## Scenario Suite

	Use 3 deterministic tasks.

	### Task 1: Low Demand

	Purpose:
	- verify core assignment logic
	- reward simple nearest-feasible reasoning

	Setup:
	- grid: `8x8`
	- agents: `3`
	- orders: `8-10`
	- congestion: none or minimal
	- deadlines: generous
	- hotspot effect: low

	Expected behavior:
	- most orders should be serviceable
	- mistakes come mostly from poor assignment choices
	- hotspot structure should be mostly stable

	### Task 2: High Demand

	Purpose:
	- test prioritization under scarcity
	- introduce moderate but steadier world evolution

	Setup:
	- grid: `10x10` or `12x12`
	- agents: `3-4`
	- orders: `18-25`
	- congestion: a few fixed zones
	- deadlines: moderate
	- hotspot effect: medium

	Expected behavior:
	- not all orders can be served
	- the dispatcher must prioritize based on reward, distance, and urgency

	### Task 3: Hotspot + Congestion

	Purpose:
	- test strategic dispatch in the richest setting

	Setup:
	- grid: `15x15`
	- agents: `4-5`
	- orders: `20-28`
	- hotspot zones: concentrated demand in selected regions
	- congestion zones: fixed cells with movement cost `2`
	- deadlines: mixed, with some tight high-value orders

	Expected behavior:
	- the LLM must trade off:
	  - high-value but congested orders
	  - urgent nearby orders
	  - long-term fleet positioning around hotspots

	## Scenario Generation Rules

	To preserve reproducibility:
	- use fixed seeds or fully predefined schedules
	- use deterministic order arrival times
	- use deterministic congestion maps
	- use fixed hotspot coordinates per scenario

	Hotspots should influence where pickups appear more often, especially in the hardest task (Task 3).

	To support robustness testing without turning the benchmark into noise:
	- the environment may accept a `seed`
	- different seeds should perturb timings and spatial patterns within bounded ranges
	- the same seed should produce the same episode
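A seeded order schedule could be generated along these lines. This is a sketch under stated assumptions: `generate_orders`, `hotspot_prob`, and the numeric ranges for rewards and deadlines are all hypothetical placeholders, not values from the spec.

```python
import random


def generate_orders(seed, n_orders, width, height, horizon, hotspots, hotspot_prob=0.6):
    """Build a reproducible order schedule; pickups are biased toward hotspot cells."""
    rng = random.Random(seed)  # private RNG so the same seed gives the same episode
    orders = []
    for _ in range(n_orders):
        created_at = rng.randrange(horizon)
        if hotspots and rng.random() < hotspot_prob:
            pickup = tuple(rng.choice(hotspots))
        else:
            pickup = (rng.randrange(width), rng.randrange(height))
        orders.append({
            "created_at": created_at,
            "pickup_location": pickup,
            "drop_location": (rng.randrange(width), rng.randrange(height)),
            "reward_value": rng.randint(5, 20),         # placeholder range
            "deadline": created_at + rng.randint(5, 15),  # placeholder range
            "status": "unassigned",
        })
    # Number orders in arrival order so IDs are stable for a given seed.
    orders.sort(key=lambda o: o["created_at"])
    for index, order in enumerate(orders, start=1):
        order["order_id"] = f"o{index}"
    return orders
```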

	## Grading

	Each task gets a normalized score in `[0, 1]`.

	Recommended formula:

	```text
	score = clamp((agent_reward - baseline_reward) / (target_reward - baseline_reward), 0, 1)
	```

	Where:
	- `baseline_reward` is produced by a naive deterministic policy
	- `target_reward` is produced by a stronger heuristic dispatch policy
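The normalization formula translates directly into code. The function name is hypothetical, and rejecting a non-positive span between the two anchors is an added safeguard, not something the spec requires.

```python
def normalized_score(agent_reward, baseline_reward, target_reward):
    """clamp((agent - baseline) / (target - baseline), 0, 1)."""
    span = target_reward - baseline_reward
    if span <= 0:
        # Assumption: the target policy must strictly beat the baseline.
        raise ValueError("target_reward must exceed baseline_reward")
    return max(0.0, min(1.0, (agent_reward - baseline_reward) / span))
```

With an agent reward of `57.0`, a baseline of `24.0`, and a target of `68.0`, this gives `33 / 44 = 0.75`.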

	### Baseline Policy

	Simple rule-based dispatcher:
	- sort active orders by earliest deadline
	- assign nearest idle agent greedily
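A minimal sketch of this baseline follows. It uses Manhattan distance as a stand-in for "nearest", which is an assumption (the congestion-aware shortest-path cost could be substituted), and it breaks ties by ID so the output stays deterministic. The function names are hypothetical.

```python
def manhattan(a, b):
    """Grid distance ignoring congestion (assumed proxy for 'nearest')."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def baseline_dispatch(idle_agents, open_orders):
    """idle_agents: {agent_id: (x, y)}; open_orders: list of order dicts."""
    free = dict(idle_agents)
    assignments = []
    # Earliest deadline first, ties broken by order ID.
    for order in sorted(open_orders, key=lambda o: (o["deadline"], o["order_id"])):
        if not free:
            break
        pickup = order["pickup_location"]
        nearest = min(free, key=lambda a: (manhattan(free[a], pickup), a))
        assignments.append({"agent_id": nearest, "order_id": order["order_id"]})
        del free[nearest]
    return {"assignments": assignments}
```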

	### Target Policy

	Stronger heuristic dispatcher:
	- score orders using reward, travel cost, deadline slack, and congestion-adjusted feasibility
	- greedily assign the best feasible matches

	Example heuristic:

	```text
	priority = 1.5 * reward_value - 1.0 * travel_cost - 2.0 * urgency_penalty
	```

	This gives stable anchors for normalization and makes scores interpretable across tasks.
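The example heuristic and the greedy matching step could be sketched as follows. The spec does not define `urgency_penalty`, so it is left as an input here rather than guessed; `greedy_match` is a hypothetical helper that takes precomputed scores for feasible pairs only.

```python
def priority(reward_value, travel_cost, urgency_penalty):
    """Example heuristic from the spec; higher is better."""
    return 1.5 * reward_value - 1.0 * travel_cost - 2.0 * urgency_penalty


def greedy_match(pair_scores):
    """pair_scores: {(agent_id, order_id): score} for feasible pairs only."""
    used_agents, used_orders = set(), set()
    chosen = []
    # Highest score first; ties broken by the pair itself for determinism.
    ranked = sorted(pair_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    for (agent_id, order_id), _score in ranked:
        if agent_id in used_agents or order_id in used_orders:
            continue
        used_agents.add(agent_id)
        used_orders.add(order_id)
        chosen.append((agent_id, order_id))
    return chosen
```

Greedy matching is not guaranteed to be optimal across all pairs, which is acceptable for a normalization anchor that only needs to be stable and stronger than the baseline.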

	## Per-Task Grader Output

	Recommended output:

	```json
	{
	  "task_id": "hotspot_congestion",
	  "raw_reward": 57.0,
	  "baseline_reward": 24.0,
	  "target_reward": 68.0,
	  "score": 0.75
	}
	```

	## Terminal Resolution

	When the configured decision-step budget is exhausted:
	- assigned orders that would still finish before their service cutoff are resolved and scored
	- assigned orders that would miss cutoff are expired
	- visible unassigned orders are expired immediately
	- not-yet-visible future orders are ignored

	This keeps episode length bounded while preserving deterministic final scoring.
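The terminal rules could be applied in a single pass over the visible orders, as in this sketch. The per-order `cutoff` field and the `finish_times` mapping are hypothetical, and treating a finish time equal to the cutoff as on time is an assumption. Future orders never appear in the input, so they are ignored by construction.

```python
def resolve_terminal(visible_orders, finish_times):
    """Return {order_id: final_status} once the decision-step budget is spent."""
    final = {}
    for order in visible_orders:
        order_id = order["order_id"]
        if order["status"] == "assigned":
            # Roll the in-flight job forward deterministically.
            on_time = finish_times[order_id] <= order["cutoff"]
            final[order_id] = "completed" if on_time else "expired"
        elif order["status"] == "unassigned":
            final[order_id] = "expired"
        else:
            final[order_id] = order["status"]  # already completed or expired
    return final
```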

	## Overall Score

	Recommended weighted average:
	- low demand: `0.2`
	- high demand: `0.3`
	- hotspot + congestion: `0.5`

	This makes the hardest and most realistic task matter most.
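The weighted average can be written as a short helper. The task IDs `low_demand` and `high_demand` are hypothetical; only `hotspot_congestion` appears elsewhere in this spec.

```python
TASK_WEIGHTS = {
    "low_demand": 0.2,
    "high_demand": 0.3,
    "hotspot_congestion": 0.5,
}


def overall_score(task_scores):
    """Weighted average of per-task normalized scores, each in [0, 1]."""
    return sum(weight * task_scores[task] for task, weight in TASK_WEIGHTS.items())
```

For instance, per-task scores of `1.0`, `0.5`, and `0.75` combine to `0.2 + 0.15 + 0.375 = 0.725`.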

	## Submission Requirements Alignment

	This spec is designed to support:
	- real-world environment framing
	- clear `step()`, `reset()`, `state()` behavior
	- 3 distinct tasks
	- reproducible graders
	- meaningful reward shaping
	- lightweight execution under hackathon constraints

	## Why This Version Is Good

	This version is strong because it stays:
	- realistic
	- deterministic
	- judge-friendly
	- fast enough to validate
	- rich enough to show non-trivial LLM reasoning

	It also avoids the common trap of overbuilding simulation complexity before the environment core is solid.