
Rishav

Refresh Fleetmind README

2a81ca3 3 months ago


metadata

title: Fleetmind V3 OpenEnv
sdk: docker
app_port: 7860

Fleetmind

Can an AI run a city's delivery fleet when tomorrow's demand is hidden?

Fleetmind is a benchmark for real-world orchestration.

Fleetmind is a delivery benchmark for long-horizon decision making under uncertainty. An agent sees the current city state: active zones, visible demand, courier availability, and local congestion. It must decide how to rebalance the fleet before the next wave of demand arrives.

That is the core tension:

the agent must reason from partial signals
the environment reveals demand round by round
the benchmark measures whether the fleet was positioned well over time

This makes Fleetmind a benchmark about anticipation, not just reaction.

Real-World Orchestrator

Fleetmind is designed around a simple but important question:

Can an LLM behave like a real operational orchestrator instead of just a reactive assistant?

In Fleetmind, the model is not answering a static question. It is:

allocating scarce courier capacity
reacting to shifting demand
balancing immediate service against future positioning
operating under uncertainty the way real dispatch systems do

That framing is what separates Fleetmind from static question-answering benchmarks.

Visual Overview

flowchart LR
    A["Visible city state"] --> B["LLM dispatcher"]
    B --> C["Fleet reallocation decision"]
    C --> D["Couriers move across zones"]
    D --> E["Orders get served or missed"]
    E --> F["Next demand wave arrives"]
    F --> A

Why Fleetmind Is Interesting

Most agent benchmarks reward good local moves. Fleetmind is designed to reward good positioning.

The hard part is not picking the obvious best action right now. The hard part is deciding:

whether a current spike is real demand or a decoy
which zones deserve extra courier coverage before the evidence is obvious
when to preserve flexibility instead of overcommitting
how to trade immediate service against future coverage

In other words: the agent sees the city, but not the future.

What We Built

Fleetmind V3 is a fully playable OpenEnv-style benchmark with:

a clean reset / state / step API
public task tiers:
- easy_dispatch
- medium_dispatch
- hard_dispatch
hidden curated case banks behind public seeds
deterministic graded episodes
deterministic validation and Docker packaging
a live Hugging Face Space deployment
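To illustrate what "hidden curated case banks behind public seeds" can mean in practice, here is a minimal sketch of one way to map a public seed to a hidden case deterministically. It is an assumption for illustration only: the function name `pick_case_index`, the hashing scheme, and the `bank_size` parameter are hypothetical, and the mapping Fleetmind actually uses may differ.

```python
import hashlib


def pick_case_index(task_id: str, seed: int, bank_size: int) -> int:
    """Map a (task_id, seed) pair to a stable index into a case bank.

    Hypothetical scheme: hash the pair and reduce modulo the bank size,
    so the same public seed always selects the same hidden case.
    """
    if bank_size <= 0:
        raise ValueError("bank_size must be positive")
    digest = hashlib.sha256(f"{task_id}:{seed}".encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then wrap into range.
    return int.from_bytes(digest[:8], "big") % bank_size
```

Because the index depends only on the inputs, repeated runs with the same seed are graded on the same episode without exposing the case bank itself.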

Submission-facing shell:

[app.py](app.py)
[openenv.yaml](openenv.yaml)
[inference.py](inference.py)
[validate_submission.py](validate_submission.py)

Core benchmark implementation:

[api.py](src/delivery_dispatch_v3/api.py)
[environment.py](src/delivery_dispatch_v3/environment.py)
[generator.py](src/delivery_dispatch_v3/generator.py)
[solver.py](src/delivery_dispatch_v3/solver.py)
[grading.py](src/delivery_dispatch_v3/grading.py)
[seed_catalog.py](src/delivery_dispatch_v3/seed_catalog.py)

The Benchmark Design

Fleetmind defines a family of delivery scenarios built from:

K delivery zones
M couriers
T decision rounds
visible current demand by zone
hidden future demand patterns
optional congestion and premium windows

At each round, the agent chooses a target courier allocation across zones.

The agent sees:

current visible orders
per-order rewards
congestion multipliers
courier counts by zone
remaining rounds

This compact design lets us build tasks that are:

hard for the agent
practical to evaluate at benchmark scale

That was a central design goal.
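As a reference point, a purely reactive baseline would split couriers in proportion to the demand that is currently visible. The sketch below is an illustration under stated assumptions, not part of the Fleetmind codebase: the function name `proportional_allocation` and the plain `zone -> demand` dictionary are hypothetical stand-ins for whatever the real observation provides. Rounding uses the largest-remainder method so the counts always sum to the fleet size.

```python
from typing import Dict


def proportional_allocation(visible_demand: Dict[str, int], total_couriers: int) -> Dict[str, int]:
    """Split couriers across zones in proportion to visible demand."""
    zones = sorted(visible_demand)
    if not zones:
        return {}
    total_demand = sum(visible_demand.values())
    if total_demand == 0:
        # No demand signal at all: fall back to an even spread.
        shares = {z: total_couriers / len(zones) for z in zones}
    else:
        shares = {z: total_couriers * visible_demand[z] / total_demand for z in zones}
    # Start from the floor of each share.
    alloc = {z: int(shares[z]) for z in zones}
    leftover = total_couriers - sum(alloc.values())
    # Give leftover couriers to the largest fractional remainders,
    # breaking ties by zone name so the result is deterministic.
    by_remainder = sorted(zones, key=lambda z: (-(shares[z] - alloc[z]), z))
    for z in by_remainder[:leftover]:
        alloc[z] += 1
    return alloc
```

A policy like this only reacts to what is already visible, so it makes a useful floor when comparing agents that try to anticipate the next demand wave.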

Why It Feels Different

Fleetmind is not trying to be a flashy simulator.

It is intentionally shaped around a realistic orchestration loop:

observe the city
move capacity
live with the downstream consequences
adapt on the next round

That means success depends on operational judgment, not just finding a locally attractive move.

Difficulty Tiers

Fleetmind keeps the interface simple and scales difficulty through hidden structure rather than simulator clutter.

Easy

visible demand is fairly informative
sensible rebalancing works reasonably well
strong one-shot policies can do well

Medium

current demand starts to mislead
short-term greed leaves value on the table
repositioning becomes strategically important

Hard

future demand stays ambiguous for longer
overcommitting early creates downstream regret
the agent must hedge, infer, and adapt over time

Public API

Endpoints:

GET /health
POST /reset
GET /state
POST /step

POST /reset behavior:

with task_id, starts a fresh episode in that tier
with seed, deterministically maps the public seed to a hidden curated case
with no seed, randomly selects a hidden curated case
with no task_id, randomly chooses one of the public tasks
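The reset options above could be exercised with a small request builder like the one below. This is a sketch under assumptions: it supposes that `task_id` and `seed` are sent as a JSON body, that the seed is an integer, and that the server is reachable at a local `BASE_URL` on the Space's `app_port` of 7860. The helper name `build_reset_request` is hypothetical.

```python
import json
from typing import Optional
from urllib.request import Request

# Assumption: a local Docker run listening on the app_port from the metadata.
BASE_URL = "http://localhost:7860"


def build_reset_request(
    task_id: Optional[str] = None,
    seed: Optional[int] = None,
    base_url: str = BASE_URL,
) -> Request:
    """Build (but do not send) a POST /reset request.

    Omitted fields are left out of the payload so the server can apply
    its documented defaults (random task, random hidden case).
    """
    payload = {}
    if task_id is not None:
        payload["task_id"] = task_id
    if seed is not None:
        payload["seed"] = seed
    return Request(
        url=f"{base_url}/reset",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it. Building the request separately keeps the payload easy to inspect and test without a running server.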

The observation includes:

task_id
round_index
remaining_rounds
per-zone courier and demand state
feedback from the last step
scenario_info with fleet limits and episode hints

Example action format:

{
  "target_allocations": [
    {"zone_id": "north", "courier_count": 2},
    {"zone_id": "east", "courier_count": 1},
    {"zone_id": "south", "courier_count": 1},
    {"zone_id": "west", "courier_count": 2}
  ]
}

Rules:

include every zone exactly once
counts must sum to the total courier count
invalid or over-cap rebalances are penalized and ignored
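Since invalid actions are penalized, an agent may want to check the first two rules locally before calling the step endpoint. The following is a minimal sketch of such a check. The function `validate_action` and its arguments are hypothetical, and it does not cover the over-cap rule because the cap semantics are not spelled out here.

```python
from typing import Any, Dict, List


def validate_action(action: Dict[str, Any], zone_ids: List[str], total_couriers: int) -> List[str]:
    """Return a list of rule violations; an empty list means the action looks valid."""
    errors: List[str] = []
    entries = action.get("target_allocations", [])
    seen = [entry.get("zone_id") for entry in entries]

    # Rule: include every zone exactly once.
    for zone_id in zone_ids:
        occurrences = seen.count(zone_id)
        if occurrences != 1:
            errors.append(f"zone {zone_id!r} appears {occurrences} times, expected exactly once")
    for zone_id in seen:
        if zone_id not in zone_ids:
            errors.append(f"unknown zone {zone_id!r}")

    # Rule: counts must sum to the total courier count.
    counts = [entry.get("courier_count") for entry in entries]
    if any(not isinstance(c, int) or c < 0 for c in counts):
        errors.append("courier_count must be a non-negative integer")
    elif sum(counts) != total_couriers:
        errors.append(f"counts sum to {sum(counts)}, expected {total_couriers}")
    return errors
```

Returning messages instead of raising lets an LLM agent feed the violations back into its next attempt.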

Live Demo

Hugging Face Space:

rishavutk/fleetmind

Health endpoint:

Space health

Try It

If you want to explore Fleetmind through the codebase or the live Space, the main entrypoints are:

[app.py](app.py)
[inference.py](inference.py)
[openenv.yaml](openenv.yaml)
[validate_submission.py](validate_submission.py)

The benchmark exposes a simple episode loop:

reset
state
step

That makes it easy to plug in:

LLM agents
scripted heuristics
black-box evaluation agents
external orchestrator policies
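The loop below sketches how any of these can be plugged in as a plain function from observation to action. It is a sketch under assumptions: `FakeEnv` is an in-memory stand-in rather than the real environment, the `zones` field name for per-zone state is hypothetical, and `hold_position` and `run_episode` are illustrative names. Only `round_index`, `remaining_rounds`, and the `target_allocations` action shape come from the API description.

```python
from typing import Any, Callable, Dict, List

Observation = Dict[str, Any]
Action = Dict[str, List[Dict[str, Any]]]


class FakeEnv:
    """In-memory stand-in with reset/step, used only to make the loop runnable."""

    def __init__(self, rounds: int = 3) -> None:
        self._rounds = rounds
        self._round = 0
        self.actions: List[Action] = []

    def _observe(self) -> Observation:
        return {
            "round_index": self._round,
            "remaining_rounds": self._rounds - self._round,
            # Hypothetical field name for the per-zone state.
            "zones": [
                {"zone_id": "north", "courier_count": 1},
                {"zone_id": "south", "courier_count": 1},
            ],
        }

    def reset(self) -> Observation:
        self._round = 0
        self.actions = []
        return self._observe()

    def step(self, action: Action) -> Observation:
        self.actions.append(action)
        self._round += 1
        return self._observe()


def hold_position(obs: Observation) -> Action:
    """Trivial scripted policy: keep every courier where it already is."""
    return {
        "target_allocations": [
            {"zone_id": z["zone_id"], "courier_count": z["courier_count"]}
            for z in obs["zones"]
        ]
    }


def run_episode(env: FakeEnv, policy: Callable[[Observation], Action]) -> int:
    """Run reset, then step until no rounds remain; return the number of steps taken."""
    obs = env.reset()
    steps = 0
    while obs["remaining_rounds"] > 0:
        obs = env.step(policy(obs))
        steps += 1
    return steps
```

Swapping `hold_position` for an LLM-backed function, or `FakeEnv` for an HTTP client around the real endpoints, leaves `run_episode` unchanged.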

Why It Matters

A lot of real work does not look like question answering. It looks like:

monitoring a changing system
reallocating limited resources
acting before the full picture is visible
absorbing the cost of bad early decisions

Fleetmind brings that style of decision making into a compact benchmark loop.

Project Map

Important files:

[README.md](README.md)
[HACKATHON_REQUIREMENTS.md](HACKATHON_REQUIREMENTS.md)
[PROJECT_SPEC.md](PROJECT_SPEC.md)
[src/delivery_dispatch_v3](src/delivery_dispatch_v3)
[docs/v3_blackbox_subagent_contract.md](docs/v3_blackbox_subagent_contract.md)

What Makes It Cool

Fleetmind is not just "delivery dispatch with JSON."

It turns delivery operations into an agent benchmark where the model has to think like a dispatcher:

where should capacity move before demand becomes obvious?
which signals are real and which are decoys?
when is it better to hedge than to commit?

That makes Fleetmind a benchmark for real-world orchestration behavior, not just single-step response quality.