| --- |
| title: Fleetmind V3 OpenEnv |
| sdk: docker |
| app_port: 7860 |
| --- |
| |
| # Fleetmind |
|
|
| **Can an AI run a city's delivery fleet when tomorrow's demand is hidden?** |
|
|
| **Fleetmind is a benchmark for real-world orchestration.** |
|
|
| Fleetmind is a delivery benchmark for long-horizon decision making under uncertainty. An agent sees the current city state: active zones, visible demand, courier availability, and local congestion. It must decide how to rebalance the fleet before the next wave of demand arrives. |
|
|
| That is the core tension: |
| - the agent must reason from partial signals |
| - the environment reveals demand round by round |
| - the benchmark measures whether the fleet was positioned well over time |
|
|
| This makes Fleetmind a benchmark about anticipation, not just reaction. |
|
|
| ## Real-World Orchestrator |
|
|
| Fleetmind is designed around a simple but important question: |
|
|
| **Can an LLM behave like a real operational orchestrator instead of just a reactive assistant?** |
|
|
| In Fleetmind, the model is not answering a static question. It is: |
| - allocating scarce courier capacity |
| - reacting to shifting demand |
| - balancing immediate service against future positioning |
| - operating under uncertainty the way real dispatch systems do |
|
|
| That framing is a big part of what makes the benchmark compelling. |
|
|
| ## Visual Overview |
|
|
| ```mermaid |
| flowchart LR |
| A["Visible city state"] --> B["LLM dispatcher"] |
| B --> C["Fleet reallocation decision"] |
| C --> D["Couriers move across zones"] |
| D --> E["Orders get served or missed"] |
| E --> F["Next demand wave arrives"] |
| F --> A |
| ``` |
|
|
| ## Why Fleetmind Is Interesting |
|
|
| Most agent benchmarks reward good local moves. Fleetmind is designed to reward good positioning. |
|
|
| The hard part is not clicking the obvious best action right now. The hard part is deciding: |
| - whether a current spike is real demand or a decoy |
| - which zones deserve extra courier coverage before the evidence is obvious |
| - when to preserve flexibility instead of overcommitting |
| - how to trade immediate service against future coverage |
|
|
| In other words: the agent sees the city, but not the future. |
|
|
| ## What We Built |
|
|
| Fleetmind V3 is a fully playable OpenEnv-style benchmark with: |
| - a clean `reset / state / step` API |
| - public task tiers: |
| - `easy_dispatch` |
| - `medium_dispatch` |
| - `hard_dispatch` |
| - hidden curated case banks behind public seeds |
| - deterministic graded episodes |
| - deterministic validation and Docker packaging |
| - a live Hugging Face Space deployment |
|
|
| Submission-facing shell: |
| - [app.py](/C:/Users/risha/Documents/New project/app.py) |
| - [openenv.yaml](/C:/Users/risha/Documents/New project/openenv.yaml) |
| - [inference.py](/C:/Users/risha/Documents/New project/inference.py) |
| - [validate_submission.py](/C:/Users/risha/Documents/New project/validate_submission.py) |
|
|
| Core benchmark implementation: |
| - [api.py](/C:/Users/risha/Documents/New project/src/delivery_dispatch_v3/api.py) |
| - [environment.py](/C:/Users/risha/Documents/New project/src/delivery_dispatch_v3/environment.py) |
| - [generator.py](/C:/Users/risha/Documents/New project/src/delivery_dispatch_v3/generator.py) |
| - [solver.py](/C:/Users/risha/Documents/New project/src/delivery_dispatch_v3/solver.py) |
| - [grading.py](/C:/Users/risha/Documents/New project/src/delivery_dispatch_v3/grading.py) |
| - [seed_catalog.py](/C:/Users/risha/Documents/New project/src/delivery_dispatch_v3/seed_catalog.py) |
|
|
| ## The Benchmark Design |
|
|
| Fleetmind uses a delivery-domain benchmark family with: |
| - `K` delivery zones |
| - `M` couriers |
| - `T` decision rounds |
| - visible current demand by zone |
| - hidden future demand patterns |
| - optional congestion and premium windows |
|
|
| At each round, the agent chooses a target courier allocation across zones. |
|
|
| The agent sees: |
| - current visible orders |
| - per-order rewards |
| - congestion multipliers |
| - courier counts by zone |
| - remaining rounds |
|
|
| This lets us build tasks that are: |
| - hard for the agent |
| - practical to evaluate at benchmark scale |
|
|
| That was a central design goal. |
|
|
| ## Why It Feels Different |
|
|
| Fleetmind is not trying to be a flashy simulator. |
|
|
| It is intentionally shaped around a realistic orchestration loop: |
| - observe the city |
| - move capacity |
| - live with the downstream consequences |
| - adapt on the next round |
|
|
| That means success depends on operational judgment, not just finding a locally attractive move. |
|
|
| ## Difficulty Tiers |
|
|
| Fleetmind keeps the interface simple, and scales difficulty through hidden structure rather than simulator clutter. |
|
|
| ### Easy |
| - visible demand is fairly informative |
| - sensible rebalancing works reasonably well |
| - strong one-shot policies can do well |
|
|
| ### Medium |
| - current demand starts to mislead |
| - short-term greed leaves value on the table |
| - repositioning becomes strategically important |
|
|
| ### Hard |
| - future demand stays ambiguous for longer |
| - overcommitting early creates downstream regret |
| - the agent must hedge, infer, and adapt over time |
|
|
| ## Public API |
|
|
| Endpoints: |
| - `GET /health` |
| - `POST /reset` |
| - `GET /state` |
| - `POST /step` |
|
|
| `POST /reset` behavior: |
| - with `task_id`, starts a fresh episode in that tier |
| - with `seed`, deterministically maps the public seed to a hidden curated case |
| - with no `seed`, randomly selects a hidden curated case |
| - with no `task_id`, randomly chooses one of the public tasks |
|
|
| The observation includes: |
| - `task_id` |
| - `round_index` |
| - `remaining_rounds` |
| - per-zone courier and demand state |
| - feedback from the last step |
| - `scenario_info` with fleet limits and episode hints |
|
|
| Example action format: |
|
|
| ```json |
| { |
| "target_allocations": [ |
| {"zone_id": "north", "courier_count": 2}, |
| {"zone_id": "east", "courier_count": 1}, |
| {"zone_id": "south", "courier_count": 1}, |
| {"zone_id": "west", "courier_count": 2} |
| ] |
| } |
| ``` |
|
|
| Rules: |
| - include every zone exactly once |
| - counts must sum to the total courier count |
| - invalid or over-cap rebalances are penalized and ignored |
|
|
| ## Live Demo |
|
|
| Hugging Face Space: |
| - [rishavutk/fleetmind](https://huggingface.co/spaces/rishavutk/fleetmind) |
|
|
| Health endpoint: |
| - [Space health](https://rishavutk-fleetmind.hf.space/health) |
|
|
| ## Try It |
|
|
| If you want to explore Fleetmind through the codebase or the live Space, the main entrypoints are: |
| - [app.py](/C:/Users/risha/Documents/New project/app.py) |
| - [inference.py](/C:/Users/risha/Documents/New project/inference.py) |
| - [openenv.yaml](/C:/Users/risha/Documents/New project/openenv.yaml) |
| - [validate_submission.py](/C:/Users/risha/Documents/New project/validate_submission.py) |
|
|
| The benchmark exposes a simple episode loop: |
| - `reset` |
| - `state` |
| - `step` |
|
|
| That makes it easy to plug in: |
| - LLM agents |
| - scripted heuristics |
| - black-box evaluation agents |
| - external orchestrator policies |
|
|
| ## Why It Matters |
|
|
| A lot of real work does not look like question answering. It looks like: |
| - monitoring a changing system |
| - reallocating limited resources |
| - acting before the full picture is visible |
| - absorbing the cost of bad early decisions |
|
|
| Fleetmind brings that style of decision making into a compact benchmark loop. |
|
|
| ## Project Map |
|
|
| Important files: |
| - [README.md](/C:/Users/risha/Documents/New project/README.md) |
| - [HACKATHON_REQUIREMENTS.md](/C:/Users/risha/Documents/New project/HACKATHON_REQUIREMENTS.md) |
| - [PROJECT_SPEC.md](/C:/Users/risha/Documents/New project/PROJECT_SPEC.md) |
| - [src/delivery_dispatch_v3](/C:/Users/risha/Documents/New project/src/delivery_dispatch_v3) |
| - [docs/v3_blackbox_subagent_contract.md](/C:/Users/risha/Documents/New project/docs/v3_blackbox_subagent_contract.md) |
|
|
| ## What Makes It Cool |
|
|
| Fleetmind is not just "delivery dispatch with JSON." |
|
|
| It turns delivery operations into an agent benchmark where the model has to think like a dispatcher: |
| - where should capacity move before demand becomes obvious? |
| - which signals are real and which are decoys? |
| - when is it better to hedge than to commit? |
|
|
| That makes Fleetmind a benchmark for real-world orchestration behavior, not just single-step response quality. |
|
|