---
title: Fleetmind V3 OpenEnv
sdk: docker
app_port: 7860
---

# Fleetmind

**Can an AI run a city's delivery fleet when tomorrow's demand is hidden?**

**Fleetmind is a benchmark for real-world orchestration.**

Fleetmind is a delivery benchmark for long-horizon decision making under uncertainty. An agent sees the current city state: active zones, visible demand, courier availability, and local congestion. It must decide how to rebalance the fleet before the next wave of demand arrives.

That is the core tension:

- the agent must reason from partial signals
- the environment reveals demand round by round
- the benchmark measures whether the fleet was positioned well over time

This makes Fleetmind a benchmark about anticipation, not just reaction.

## Real-World Orchestrator

Fleetmind is designed around a simple but important question:

**Can an LLM behave like a real operational orchestrator instead of just a reactive assistant?**

In Fleetmind, the model is not answering a static question. It is:

- allocating scarce courier capacity
- reacting to shifting demand
- balancing immediate service against future positioning
- operating under uncertainty the way real dispatch systems do

That framing is a big part of what makes the benchmark compelling.

## Visual Overview

```mermaid
flowchart LR
    A["Visible city state"] --> B["LLM dispatcher"]
    B --> C["Fleet reallocation decision"]
    C --> D["Couriers move across zones"]
    D --> E["Orders get served or missed"]
    E --> F["Next demand wave arrives"]
    F --> A
```

## Why Fleetmind Is Interesting

Most agent benchmarks reward good local moves. Fleetmind is designed to reward good positioning.

The hard part is not clicking the obvious best action right now.
The hard part is deciding:

- whether a current spike is real demand or a decoy
- which zones deserve extra courier coverage before the evidence is obvious
- when to preserve flexibility instead of overcommitting
- how to trade immediate service against future coverage

In other words: the agent sees the city, but not the future.

## What We Built

Fleetmind V3 is a fully playable OpenEnv-style benchmark with:

- a clean `reset / state / step` API
- public task tiers:
  - `easy_dispatch`
  - `medium_dispatch`
  - `hard_dispatch`
- hidden curated case banks behind public seeds
- deterministic graded episodes
- deterministic validation and Docker packaging
- a live Hugging Face Space deployment

Submission-facing shell:

- [app.py](app.py)
- [openenv.yaml](openenv.yaml)
- [inference.py](inference.py)
- [validate_submission.py](validate_submission.py)

Core benchmark implementation:

- [api.py](src/delivery_dispatch_v3/api.py)
- [environment.py](src/delivery_dispatch_v3/environment.py)
- [generator.py](src/delivery_dispatch_v3/generator.py)
- [solver.py](src/delivery_dispatch_v3/solver.py)
- [grading.py](src/delivery_dispatch_v3/grading.py)
- [seed_catalog.py](src/delivery_dispatch_v3/seed_catalog.py)

## The Benchmark Design

Fleetmind uses a delivery-domain benchmark family with:

- `K` delivery zones
- `M` couriers
- `T` decision rounds
- visible current demand by zone
- hidden future demand patterns
- optional congestion and premium windows

At each round, the agent chooses a target courier allocation across zones.
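
To make the per-round decision concrete, here is a minimal sketch of a purely reactive baseline that splits the `M` couriers across zones in proportion to visible demand. The function name and the `visible_demand` mapping are hypothetical and not part of the Fleetmind codebase; the sketch also ignores any per-round rebalance cap the environment may enforce. Only the output fields (`zone_id`, `courier_count`) follow the action format documented in this README.

```python
from typing import Dict, List


def proportional_allocation(visible_demand: Dict[str, int], total_couriers: int) -> List[dict]:
    """Split couriers across zones in proportion to visible demand.

    Hypothetical helper: uses largest-remainder rounding so the integer
    counts always sum to total_couriers.
    """
    zones = sorted(visible_demand)
    if not zones:
        return []

    total_demand = sum(visible_demand.values())
    if total_demand == 0:
        # No visible demand anywhere: spread couriers as evenly as possible.
        shares = {z: total_couriers / len(zones) for z in zones}
    else:
        shares = {z: total_couriers * visible_demand[z] / total_demand for z in zones}

    # Start from the floor of each share, then hand out what is left.
    counts = {z: int(shares[z]) for z in zones}
    leftover = total_couriers - sum(counts.values())

    # Zones with the largest fractional part get the remaining couriers first.
    by_remainder = sorted(zones, key=lambda z: shares[z] - counts[z], reverse=True)
    for z in by_remainder[:leftover]:
        counts[z] += 1

    return [{"zone_id": z, "courier_count": counts[z]} for z in zones]


if __name__ == "__main__":
    demand = {"north": 2, "east": 1, "south": 1, "west": 1}
    print({"target_allocations": proportional_allocation(demand, 6)})
```

A policy like this only reacts to what is currently visible, so it is best read as a floor to compare against rather than a strong agent: it has no way to anticipate hidden future demand.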
The agent sees:

- current visible orders
- per-order rewards
- congestion multipliers
- courier counts by zone
- remaining rounds

This lets us build tasks that are:

- hard for the agent
- practical to evaluate at benchmark scale

That was a central design goal.

## Why It Feels Different

Fleetmind is not trying to be a flashy simulator. It is intentionally shaped around a realistic orchestration loop:

- observe the city
- move capacity
- live with the downstream consequences
- adapt on the next round

That means success depends on operational judgment, not just finding a locally attractive move.

## Difficulty Tiers

Fleetmind keeps the interface simple and scales difficulty through hidden structure rather than simulator clutter.

### Easy

- visible demand is fairly informative
- sensible rebalancing works reasonably well
- strong one-shot policies can do well

### Medium

- current demand starts to mislead
- short-term greed leaves value on the table
- repositioning becomes strategically important

### Hard

- future demand stays ambiguous for longer
- overcommitting early creates downstream regret
- the agent must hedge, infer, and adapt over time

## Public API

Endpoints:

- `GET /health`
- `POST /reset`
- `GET /state`
- `POST /step`

`POST /reset` behavior:

- with `task_id`, starts a fresh episode in that tier
- with `seed`, deterministically maps the public seed to a hidden curated case
- with no `seed`, randomly selects a hidden curated case
- with no `task_id`, randomly chooses one of the public tasks

The observation includes:

- `task_id`
- `round_index`
- `remaining_rounds`
- per-zone courier and demand state
- feedback from the last step
- `scenario_info` with fleet limits and episode hints

Example action format:

```json
{
  "target_allocations": [
    {"zone_id": "north", "courier_count": 2},
    {"zone_id": "east", "courier_count": 1},
    {"zone_id": "south", "courier_count": 1},
    {"zone_id": "west", "courier_count": 2}
  ]
}
```

Rules:

- include every zone exactly once
- counts must sum to the total courier count
- invalid or over-cap rebalances are penalized and ignored

## Live Demo

Hugging Face Space:

- [rishavutk/fleetmind](https://huggingface.co/spaces/rishavutk/fleetmind)

Health endpoint:

- [Space health](https://rishavutk-fleetmind.hf.space/health)

## Try It

If you want to explore Fleetmind through the codebase or the live Space, the main entrypoints are:

- [app.py](app.py)
- [inference.py](inference.py)
- [openenv.yaml](openenv.yaml)
- [validate_submission.py](validate_submission.py)

The benchmark exposes a simple episode loop:

- `reset`
- `state`
- `step`

That makes it easy to plug in:

- LLM agents
- scripted heuristics
- black-box evaluation agents
- external orchestrator policies

## Why It Matters

A lot of real work does not look like question answering. It looks like:

- monitoring a changing system
- reallocating limited resources
- acting before the full picture is visible
- absorbing the cost of bad early decisions

Fleetmind brings that style of decision making into a compact benchmark loop.

## Project Map

Important files:

- [README.md](README.md)
- [HACKATHON_REQUIREMENTS.md](HACKATHON_REQUIREMENTS.md)
- [PROJECT_SPEC.md](PROJECT_SPEC.md)
- [src/delivery_dispatch_v3](src/delivery_dispatch_v3)
- [docs/v3_blackbox_subagent_contract.md](docs/v3_blackbox_subagent_contract.md)

## What Makes It Cool

Fleetmind is not just "delivery dispatch with JSON."

It turns delivery operations into an agent benchmark where the model has to think like a dispatcher:

- where should capacity move before demand becomes obvious?
- which signals are real and which are decoys?
- when is it better to hedge than to commit?

That makes Fleetmind a benchmark for real-world orchestration behavior, not just single-step response quality.
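
Because invalid actions are penalized and ignored, an agent may want to check the action rules from the Public API section client-side before calling `POST /step`. The following is a minimal sketch of such a check. `validate_action` is a hypothetical helper, not part of the Fleetmind codebase, and the non-negative-integer check is an assumption that the README does not state. It does not cover the over-cap rule, since the cap itself is not documented here.

```python
from collections import Counter
from typing import Iterable, List


def validate_action(action: dict, zone_ids: Iterable[str], total_couriers: int) -> List[str]:
    """Return a list of rule violations; an empty list means the action passes.

    Hypothetical client-side check covering two documented rules:
    every zone appears exactly once, and counts sum to the fleet size.
    """
    errors: List[str] = []
    allocations = action.get("target_allocations", [])

    seen = Counter(a.get("zone_id") for a in allocations)
    expected = set(zone_ids)

    missing = expected - set(seen)
    unknown = set(seen) - expected
    duplicated = {z for z, n in seen.items() if n > 1}

    if missing:
        errors.append(f"missing zones: {sorted(missing)}")
    if unknown:
        errors.append(f"unknown zones: {sorted(unknown, key=str)}")
    if duplicated:
        errors.append(f"duplicated zones: {sorted(duplicated, key=str)}")

    counts = [a.get("courier_count") for a in allocations]
    # Assumption: counts must be non-negative integers.
    if not all(isinstance(c, int) and c >= 0 for c in counts):
        errors.append("courier_count must be a non-negative integer")
    elif sum(counts) != total_couriers:
        errors.append(f"counts sum to {sum(counts)}, expected {total_couriers}")

    return errors


if __name__ == "__main__":
    action = {
        "target_allocations": [
            {"zone_id": "north", "courier_count": 2},
            {"zone_id": "east", "courier_count": 1},
            {"zone_id": "south", "courier_count": 1},
            {"zone_id": "west", "courier_count": 2},
        ]
    }
    print(validate_action(action, ["north", "east", "south", "west"], 6))
```

Returning a list of messages rather than raising on the first problem makes it easy to feed every violation back to an LLM agent in a single retry prompt.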