title: Fleetmind V3 OpenEnv
sdk: docker
app_port: 7860
Fleetmind
Can an AI run a city's delivery fleet when tomorrow's demand is hidden?
Fleetmind is a benchmark for real-world orchestration.
Fleetmind is a delivery benchmark for long-horizon decision making under uncertainty. An agent sees the current city state: active zones, visible demand, courier availability, and local congestion. It must decide how to rebalance the fleet before the next wave of demand arrives.
That is the core tension:
- the agent must reason from partial signals
- the environment reveals demand round by round
- the benchmark measures whether the fleet was positioned well over time
This makes Fleetmind a benchmark about anticipation, not just reaction.
Real-World Orchestrator
Fleetmind is designed around a simple but important question:
Can an LLM behave like a real operational orchestrator instead of just a reactive assistant?
In Fleetmind, the model is not answering a static question. It is:
- allocating scarce courier capacity
- reacting to shifting demand
- balancing immediate service against future positioning
- operating under uncertainty the way real dispatch systems do
That framing is a big part of what makes the benchmark compelling.
Visual Overview
flowchart LR
A["Visible city state"] --> B["LLM dispatcher"]
B --> C["Fleet reallocation decision"]
C --> D["Couriers move across zones"]
D --> E["Orders get served or missed"]
E --> F["Next demand wave arrives"]
F --> A
Why Fleetmind Is Interesting
Most agent benchmarks reward good local moves. Fleetmind is designed to reward good positioning.
The hard part is not clicking the obvious best action right now. The hard part is deciding:
- whether a current spike is real demand or a decoy
- which zones deserve extra courier coverage before the evidence is obvious
- when to preserve flexibility instead of overcommitting
- how to trade immediate service against future coverage
In other words: the agent sees the city, but not the future.
What We Built
Fleetmind V3 is a fully playable OpenEnv-style benchmark with:
- a clean `reset` / `state` / `step` API
- public task tiers:
  - `easy_dispatch`
  - `medium_dispatch`
  - `hard_dispatch`
- hidden curated case banks behind public seeds
- deterministically graded episodes
- deterministic validation and Docker packaging
- a live Hugging Face Space deployment
Submission-facing shell:
- [app.py](/C:/Users/risha/Documents/New project/app.py)
- [openenv.yaml](/C:/Users/risha/Documents/New project/openenv.yaml)
- [inference.py](/C:/Users/risha/Documents/New project/inference.py)
- [validate_submission.py](/C:/Users/risha/Documents/New project/validate_submission.py)
Core benchmark implementation:
- [api.py](/C:/Users/risha/Documents/New project/src/delivery_dispatch_v3/api.py)
- [environment.py](/C:/Users/risha/Documents/New project/src/delivery_dispatch_v3/environment.py)
- [generator.py](/C:/Users/risha/Documents/New project/src/delivery_dispatch_v3/generator.py)
- [solver.py](/C:/Users/risha/Documents/New project/src/delivery_dispatch_v3/solver.py)
- [grading.py](/C:/Users/risha/Documents/New project/src/delivery_dispatch_v3/grading.py)
- [seed_catalog.py](/C:/Users/risha/Documents/New project/src/delivery_dispatch_v3/seed_catalog.py)
The Benchmark Design
Fleetmind uses a delivery-domain benchmark family with:
- `K` delivery zones
- `M` couriers
- `T` decision rounds
- visible current demand by zone
- hidden future demand patterns
- optional congestion and premium windows
At each round, the agent chooses a target courier allocation across zones.
The agent sees:
- current visible orders
- per-order rewards
- congestion multipliers
- courier counts by zone
- remaining rounds
Splitting the episode into visible current state and hidden future demand lets us build tasks that are:
- hard for the agent
- practical to evaluate at benchmark scale
That was a central design goal.
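As an illustration of the allocation decision described in this section, here is a minimal sketch of a purely reactive baseline that spreads couriers in proportion to visible demand. It is not the benchmark's solver or grader. The function name `proportional_allocation` and the shape of its `visible_demand` input (a zone-to-order-count mapping) are assumptions made for the example; only the `target_allocations` output shape follows the action format documented in this README.

```python
def proportional_allocation(visible_demand, total_couriers):
    """Split couriers across zones in proportion to visible demand.

    Hypothetical helper: `visible_demand` maps zone_id -> integer order count
    and must contain at least one zone. Uses largest-remainder rounding so the
    counts always sum to `total_couriers`.
    """
    zones = list(visible_demand)
    # With no visible demand anywhere, fall back to an even spread.
    if any(visible_demand.values()):
        weights = dict(visible_demand)
    else:
        weights = {zone: 1 for zone in zones}
    total_weight = sum(weights.values())

    counts, remainders = {}, {}
    for zone in zones:
        # Integer division keeps the arithmetic exact.
        counts[zone], remainders[zone] = divmod(
            total_couriers * weights[zone], total_weight
        )

    # Hand the couriers lost to rounding to the zones with the largest remainders.
    leftover = total_couriers - sum(counts.values())
    ranked = sorted(zones, key=lambda zone: remainders[zone], reverse=True)
    for zone in ranked[:leftover]:
        counts[zone] += 1

    return {
        "target_allocations": [
            {"zone_id": zone, "courier_count": counts[zone]} for zone in zones
        ]
    }
```

A policy like this only reacts to what is currently visible, so it is useful mainly as a reference point: by design, the harder tiers are meant to punish exactly this kind of short-term greed.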
Why It Feels Different
Fleetmind is not trying to be a flashy simulator.
It is intentionally shaped around a realistic orchestration loop:
- observe the city
- move capacity
- live with the downstream consequences
- adapt on the next round
That means success depends on operational judgment, not just finding a locally attractive move.
Difficulty Tiers
Fleetmind keeps the interface simple, and scales difficulty through hidden structure rather than simulator clutter.
Easy
- visible demand is fairly informative
- sensible rebalancing works reasonably well
- strong one-shot policies can do well
Medium
- current demand starts to mislead
- short-term greed leaves value on the table
- repositioning becomes strategically important
Hard
- future demand stays ambiguous for longer
- overcommitting early creates downstream regret
- the agent must hedge, infer, and adapt over time
Public API
Endpoints:
- `GET /health`
- `POST /reset`
- `GET /state`
- `POST /step`
POST /reset behavior:
- with `task_id`, starts a fresh episode in that tier
- with `seed`, deterministically maps the public seed to a hidden curated case
- with no `seed`, randomly selects a hidden curated case
- with no `task_id`, randomly chooses one of the public tasks
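The reset options above could be exercised with a small client like the sketch below. It assumes that `task_id` and `seed` are sent as fields of a JSON request body, which this README does not confirm. `build_reset_payload` and `post_json` are hypothetical helper names, and the base URL in the usage comment is an assumption based on the `app_port` value in the Space metadata.

```python
import json
import urllib.request


def build_reset_payload(task_id=None, seed=None):
    """Build a reset body, omitting any field the caller leaves unset.

    Assumption: the server reads `task_id` and `seed` from a JSON body and
    treats a missing field as "choose randomly", as described above.
    """
    payload = {}
    if task_id is not None:
        payload["task_id"] = task_id
    if seed is not None:
        payload["seed"] = seed
    return payload


def post_json(base_url, path, payload):
    """POST a JSON payload and return the decoded JSON response."""
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage against a locally running container (URL and seed are assumptions):
# observation = post_json(
#     "http://localhost:7860", "/reset", build_reset_payload("hard_dispatch", seed=3)
# )
```

Omitting a field rather than sending `null` keeps the "random selection" cases distinct from an explicit value such as seed `0`.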
The observation includes:
- `task_id`
- `round_index`
- `remaining_rounds`
- per-zone courier and demand state
- feedback from the last step
- `scenario_info` with fleet limits and episode hints
Example action format:
{
"target_allocations": [
{"zone_id": "north", "courier_count": 2},
{"zone_id": "east", "courier_count": 1},
{"zone_id": "south", "courier_count": 1},
{"zone_id": "west", "courier_count": 2}
]
}
Rules:
- include every zone exactly once
- counts must sum to the total courier count
- invalid or over-cap rebalances are penalized and ignored
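An agent can check the first two rules locally before submitting an action. The sketch below is a client-side helper, not the environment's own validation. `validate_action` is a hypothetical name, and the caller is assumed to know the zone ids and total courier count from the observation. The over-cap rule is not checked here because this README does not spell out how the cap is defined.

```python
from collections import Counter


def validate_action(action, zone_ids, total_couriers):
    """Return a list of rule violations; an empty list means the action passes."""
    errors = []
    entries = action.get("target_allocations", [])
    seen = Counter(entry["zone_id"] for entry in entries)

    # Rule: include every zone exactly once.
    for zone in zone_ids:
        if seen[zone] != 1:
            errors.append(f"zone {zone!r} appears {seen[zone]} times, expected 1")
    for zone in seen:
        if zone not in zone_ids:
            errors.append(f"unknown zone {zone!r}")

    # Rule: counts must sum to the total courier count.
    total = sum(entry["courier_count"] for entry in entries)
    if total != total_couriers:
        errors.append(f"counts sum to {total}, expected {total_couriers}")

    return errors


example = {
    "target_allocations": [
        {"zone_id": "north", "courier_count": 2},
        {"zone_id": "east", "courier_count": 1},
        {"zone_id": "south", "courier_count": 1},
        {"zone_id": "west", "courier_count": 2},
    ]
}
# The example action above is valid for a four-zone city with six couriers.
print(validate_action(example, ["north", "east", "south", "west"], 6))
```

Because invalid actions are penalized and ignored, catching a missing zone or a bad sum before calling `step` avoids wasting a round.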
Live Demo
Hugging Face Space:
Health endpoint:
Try It
If you want to explore Fleetmind through the codebase or the live Space, the main entrypoints are:
- [app.py](/C:/Users/risha/Documents/New project/app.py)
- [inference.py](/C:/Users/risha/Documents/New project/inference.py)
- [openenv.yaml](/C:/Users/risha/Documents/New project/openenv.yaml)
- [validate_submission.py](/C:/Users/risha/Documents/New project/validate_submission.py)
The benchmark exposes a simple episode loop:
- `reset`
- `state`
- `step`
That makes it easy to plug in:
- LLM agents
- scripted heuristics
- black-box evaluation agents
- external orchestrator policies
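A generic driver for that episode loop might look like the following sketch. It makes two assumptions that this README does not confirm: that `step` returns the next observation as a dictionary, and that an episode is over once `remaining_rounds` reaches zero. `run_episode`, `reset_fn`, `step_fn`, and `policy` are hypothetical names. The callables would typically wrap the HTTP endpoints, while `policy` is any function from observation to action.

```python
def run_episode(reset_fn, step_fn, policy, max_rounds=1000):
    """Drive one episode and return the list of (action, observation) pairs.

    Assumptions: `reset_fn()` and `step_fn(action)` both return an observation
    dict that carries `remaining_rounds`; the episode ends when it hits zero.
    `max_rounds` is a safety cap in case that field never reaches zero.
    """
    observation = reset_fn()
    history = []
    for _ in range(max_rounds):
        if observation.get("remaining_rounds", 0) <= 0:
            break
        action = policy(observation)
        observation = step_fn(action)
        history.append((action, observation))
    return history
```

Keeping the policy as a plain callable means an LLM agent, a scripted heuristic, and a black-box evaluator can all share the same driver.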
Why It Matters
A lot of real work does not look like question answering. It looks like:
- monitoring a changing system
- reallocating limited resources
- acting before the full picture is visible
- absorbing the cost of bad early decisions
Fleetmind brings that style of decision making into a compact benchmark loop.
Project Map
Important files:
- [README.md](/C:/Users/risha/Documents/New project/README.md)
- [HACKATHON_REQUIREMENTS.md](/C:/Users/risha/Documents/New project/HACKATHON_REQUIREMENTS.md)
- [PROJECT_SPEC.md](/C:/Users/risha/Documents/New project/PROJECT_SPEC.md)
- [src/delivery_dispatch_v3](/C:/Users/risha/Documents/New project/src/delivery_dispatch_v3)
- [docs/v3_blackbox_subagent_contract.md](/C:/Users/risha/Documents/New project/docs/v3_blackbox_subagent_contract.md)
What Makes It Cool
Fleetmind is not just "delivery dispatch with JSON."
It turns delivery operations into an agent benchmark where the model has to think like a dispatcher:
- where should capacity move before demand becomes obvious?
- which signals are real and which are decoys?
- when is it better to hedge than to commit?
That makes Fleetmind a benchmark for real-world orchestration behavior, not just single-step response quality.