Rishav

Add OpenEnv deployment compatibility

dc3a41c 3 months ago

Black-Box Agent Evaluation Prompt

Use this prompt when evaluating an external agent against Fleetmind without giving it source-code access.

Goal

Maximize cumulative reward in the live environment by interacting only through the HTTP API.

Core Principle

Treat Fleetmind as a black-box environment.

You may use any reasoning or planning tools available to you, including calculations, helper code, temporary scripts, or policy notes, but you must not inspect the environment source code, repository files, or hidden implementation details. The HTTP API is the only allowed interface to the environment itself.

Allowed Endpoints

GET /health
POST /reset
GET /state
POST /step
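
The four endpoints above could be wrapped in a small client along the following lines. This is a minimal sketch, not Fleetmind's own client: `BASE_URL`, `build_request`, and `call` are hypothetical names, and the sketch assumes that POST bodies and all responses are JSON, which this prompt does not state.

```python
import json
from urllib import request

# Assumption: placeholder address; replace with the real deployment URL.
BASE_URL = "http://localhost:8000"

# The only (method, path) pairs this prompt allows.
ALLOWED = {
    ("GET", "/health"),
    ("POST", "/reset"),
    ("GET", "/state"),
    ("POST", "/step"),
}


def build_request(method, path, payload=None, base_url=BASE_URL):
    """Build a urllib Request, refusing anything outside the allowed endpoints."""
    if (method, path) not in ALLOWED:
        raise ValueError(f"{method} {path} is not an allowed endpoint")
    data = None
    headers = {}
    if method == "POST":
        # Assumption: POST bodies are JSON objects.
        data = json.dumps(payload or {}).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return request.Request(base_url + path, data=data, headers=headers, method=method)


def call(method, path, payload=None, base_url=BASE_URL, timeout=10.0):
    """Send the request and decode the response, assuming a JSON body."""
    req = build_request(method, path, payload, base_url)
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Rejecting any other method or path inside the client makes it harder for an agent to drift outside the permitted interface by accident.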

Recommended Evaluation Flow

Call GET /health to confirm the service is live.
Start a fresh episode with POST /reset.
Play the episode entirely through repeated POST /step calls until done = true.
Use GET /state only when needed for recovery, inspection, or consistency checks.
Base every decision solely on API observations and returned feedback.
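
The flow above can be sketched as a single episode loop. The field names `reward` and `observation` are assumptions for illustration (only `done` appears in this prompt), and `api` and `policy` are hypothetical callables supplied by the evaluator: `api(method, path, payload)` returns the decoded response, and `policy(observation)` returns the action payload for the next step.

```python
from typing import Any, Callable, Dict

Observation = Dict[str, Any]


def run_episode(
    api: Callable[..., Dict[str, Any]],
    policy: Callable[[Observation], Dict[str, Any]],
    max_steps: int = 10_000,
) -> float:
    """Play one episode through the HTTP API and return the cumulative reward."""
    api("GET", "/health")                 # confirm the service is live
    obs = api("POST", "/reset", {})       # start a fresh episode
    total_reward = 0.0
    for _ in range(max_steps):            # safety cap in case done never arrives
        action = policy(obs)
        result = api("POST", "/step", action)
        # Assumption: each step response carries a numeric "reward" field.
        total_reward += float(result.get("reward", 0.0))
        if result.get("done"):
            break
        # Assumption: the next observation sits under "observation";
        # fall back to the whole response if it does not.
        obs = result.get("observation", result)
    return total_reward
```

`GET /state` is deliberately absent from the loop, which matches the recommendation to use it only for recovery, inspection, or consistency checks.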

Agent Freedom

The agent is allowed to:

compute distances, route costs, or heuristics externally
write temporary helper scripts or planning code
keep notes or policy summaries across episodes
retry on new seeds and compare strategies

The agent is not allowed to:

inspect local repository files or source code
rely on hidden future schedules or undisclosed reward logic
modify the environment implementation
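
As an illustration of the external computation permitted above, a helper might pick the closest vehicle to a pickup point. This is only a sketch: it assumes observations expose positions as `(x, y)` pairs and that Manhattan distance is a reasonable cost proxy, neither of which this prompt confirms. `manhattan` and `nearest_vehicle` are hypothetical names.

```python
from typing import Dict, Optional, Tuple

Point = Tuple[float, float]


def manhattan(a: Point, b: Point) -> float:
    """Grid distance between two points (assumed cost proxy)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def nearest_vehicle(vehicles: Dict[str, Point], pickup: Point) -> Optional[str]:
    """Return the id of the closest vehicle, or None if no vehicle is available.

    Ids are sorted first so that ties resolve to the smallest id.
    """
    if not vehicles:
        return None
    return min(sorted(vehicles), key=lambda vid: manhattan(vehicles[vid], pickup))
```

A heuristic like this uses only values returned by the API, so it stays within the black-box rule.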

Suggested Curriculum

If you are evaluating learning or strategy improvement across multiple runs:

start with low_demand
move to high_demand
finish on hotspot_congestion

This keeps the progression aligned with the environment's intended easy -> medium -> hard ladder.
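
One way to organize such a progression is sketched below. How a task or seed is selected through the API is not specified in this prompt, so that detail is left to a hypothetical `run_task(task, seed)` callable that plays one episode and returns its cumulative reward. `run_curriculum` and `episodes_per_task` are likewise illustrative names.

```python
from typing import Callable, Dict, List

# Task names taken from the curriculum above, in easy -> medium -> hard order.
CURRICULUM = ["low_demand", "high_demand", "hotspot_congestion"]


def run_curriculum(
    run_task: Callable[[str, int], float],
    episodes_per_task: int = 3,
) -> Dict[str, List[float]]:
    """Run each task in order and collect per-episode cumulative rewards."""
    results: Dict[str, List[float]] = {}
    for task in CURRICULUM:
        # Assumption: seeds are small integers chosen by the evaluator.
        results[task] = [run_task(task, seed) for seed in range(episodes_per_task)]
    return results
```

Keeping per-episode rewards for each task, rather than a single average, makes it easier to compare strategies across seeds.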

Final Report

At the end of each evaluation run, report:

final cumulative reward
the policy or strategy you followed
key assignment and rejection decisions
what the API feedback taught you
what felt confusing, too easy, too derived, or gameable
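
For evaluators who want machine-readable reports, the items above could be captured in a small record. This is a sketch only: the class and field names are hypothetical, and this prompt does not prescribe a report format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class EvaluationReport:
    """One evaluation run, mirroring the report items listed above."""

    cumulative_reward: float
    strategy: str
    key_decisions: List[str] = field(default_factory=list)          # assignments and rejections
    lessons_from_feedback: List[str] = field(default_factory=list)  # what the API feedback taught
    concerns: List[str] = field(default_factory=list)               # confusing or gameable aspects

    def to_json(self) -> str:
        """Serialize the report as indented JSON."""
        return json.dumps(asdict(self), indent=2)
```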