title: RL-Env Warehouse Fulfillment
emoji: 🏭
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
pinned: false

RL-Env: Warehouse Fulfillment Environment

A progressively challenging OpenEnv-style warehouse fulfillment environment simulating realistic pharmacy micro-fulfillment workflows with obstacles, weight constraints, stamina management, and budget optimization.

Overview

Agents control a warehouse robot navigating a 7×7 grid to fulfill customer orders. Tasks range from simple single-item pickups to complex multi-constraint challenges requiring strategic planning across obstacles, battery management, item weights, stamina conservation, and profit optimization.

Requirements Coverage

Core Mechanics

Actions

  • Navigation: turn_left, turn_right, move_forward
  • Operations: scan_bin, pick_item, pack_item
  • Resources: recharge (battery), rest (stamina)
  • Utility: wait

Advanced Mechanics (task-dependent)

  1. Obstacles: Impassable cells block direct paths, so agents must route around them
  2. Item Weight & Carry Capacity: Items have weight (1-4 units); heavier items drain more battery while moving
  3. Stamina System: Movement costs stamina; once stamina is depleted, each move costs double the battery
  4. Money & Profit Targets: Items have dollar values; correct packs earn money, wrong packs lose money
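
The interaction between item weight and stamina can be pictured with a small cost function. This is a minimal sketch under assumed numbers: the base cost of 1 battery unit per move and the linear weight surcharge are hypothetical, and `move_battery_cost` is not a function from `grid_env`. Only the doubling rule for depleted stamina comes from the list above.

```python
def move_battery_cost(carried_weight: int, stamina: int, base_cost: int = 1) -> int:
    """Battery drained by one move_forward under assumed rules.

    Assumptions (not taken from grid_env): a base cost of 1 unit per move
    and one extra unit per unit of carried weight.
    """
    if carried_weight < 0:
        raise ValueError("carried_weight must be non-negative")
    cost = base_cost + carried_weight      # heavier loads drain more battery
    if stamina <= 0:                       # depleted stamina doubles the cost
        cost *= 2
    return cost


print(move_battery_cost(carried_weight=0, stamina=5))  # 1
print(move_battery_cost(carried_weight=3, stamina=5))  # 4
print(move_battery_cost(carried_weight=3, stamina=0))  # 8
```

Under these assumed numbers, carrying a full load on empty stamina is eight times as expensive as an unloaded move, which is why resting before a heavy haul can pay off.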

Environment Interface

from grid_env import WarehouseFulfillmentEnv

env = WarehouseFulfillmentEnv(task_id="easy_single_pick", seed=7)

# Reset environment
observation = env.reset(task_id="easy_single_pick", seed=7)

# Step with action
observation, reward, done, info = env.step("move_forward")

# Get full state
state = env.state()

All data types (WarehouseAction, WarehouseObservation, WarehouseReward, WarehouseState) are Pydantic models.
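
The calls above can be combined into a rollout loop. The sketch below assumes only the `reset`/`step` shapes shown in this section; `run_episode`, `StubEnv`, and the `reward_to_float` parameter are hypothetical helpers introduced for illustration, and the stub exists solely so the loop runs without `grid_env` installed.

```python
import random

# Action names as listed under "Actions" in this README.
ACTIONS = [
    "turn_left", "turn_right", "move_forward",
    "scan_bin", "pick_item", "pack_item",
    "recharge", "rest", "wait",
]


def run_episode(env, policy, task_id, seed, max_steps, reward_to_float=float):
    """Roll out one episode; return (total_reward, steps_taken)."""
    observation = env.reset(task_id=task_id, seed=seed)
    total = 0.0
    for step in range(1, max_steps + 1):
        action = policy(observation)
        observation, reward, done, info = env.step(action)
        total += reward_to_float(reward)
        if done:
            return total, step
    return total, max_steps


class StubEnv:
    """Stand-in with the same reset/step shape, used only to exercise the loop."""

    def reset(self, task_id, seed):
        self.steps = 0
        return {"task_id": task_id}

    def step(self, action):
        self.steps += 1
        done = self.steps >= 3          # pretend the episode ends after 3 steps
        return {"last_action": action}, -0.01, done, {}


rng = random.Random(7)
total, steps = run_episode(
    StubEnv(), lambda obs: rng.choice(ACTIONS),
    task_id="easy_single_pick", seed=7, max_steps=40,
)
print(steps)  # 3
```

With the real environment, `StubEnv()` would be replaced by a `WarehouseFulfillmentEnv` instance. If `step` returns a `WarehouseReward` model instead of a plain number, pass a `reward_to_float` callable that extracts the numeric field; that field's name is not specified here.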

Tasks

The environment includes 8 progressively challenging tasks across 4 difficulty levels:

Easy

  • easy_single_pick: Fulfill one urgent thermometer order (40 steps, battery 36)

Medium

  • medium_multi_item: Two-line order with scan verification (60 steps, battery 34)
  • obstacle_course: Navigate around 6 obstacles to fulfill two-item order (70 steps, battery 40)

Hard

  • hard_restock_priority: Three-line order with battery management (85 steps, battery 24)
  • heavy_lifting: Weight-constrained picking; items weigh 1-4 units, carry capacity 3 (90 steps, battery 32)
  • stamina_run: Stamina management; movement drains stamina, rest to recover (80 steps, battery 36, stamina 12)

Expert

  • budget_run: Profit-driven fulfillment; earn $15+ from valued items (70 steps, battery 30, target $15)
  • gauntlet: All mechanics combined, with obstacles + weight + stamina + $20 profit target (120 steps, battery 28, stamina 10, carry capacity 3)

Each grader returns a deterministic, rubric-based score in [0.0, 1.0] computed from completion, efficiency, and constraint satisfaction.
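
To make the idea of a rubric score concrete, here is one way such a grader could combine its three ingredients. This is purely illustrative: the weights, the `rubric_score` and `step_efficiency` names, and the definition of efficiency are assumptions, not the contents of `graders.py`.

```python
def rubric_score(completion: float, efficiency: float, constraints: float,
                 weights=(0.6, 0.25, 0.15)) -> float:
    """Weighted rubric clamped to [0.0, 1.0]; the weights are illustrative only."""
    parts = (completion, efficiency, constraints)
    if any(not 0.0 <= p <= 1.0 for p in parts):
        raise ValueError("each component must lie in [0.0, 1.0]")
    score = sum(w * p for w, p in zip(weights, parts))
    return max(0.0, min(1.0, score))


def step_efficiency(steps_used: int, step_limit: int) -> float:
    """Fraction of the step budget left unused (hypothetical definition)."""
    return max(0.0, 1.0 - steps_used / step_limit)


# A completed easy_single_pick run that used 20 of its 40 steps.
print(round(rubric_score(1.0, step_efficiency(20, 40), 1.0), 3))  # 0.875
```

Because the function has no randomness and depends only on the episode outcome, the same trajectory always yields the same score, which is the property the graders are described as having.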

Reward Design

The reward function provides dense trajectory-wide feedback:

Positive Rewards:

  • Correct scans: +0.12
  • Correct picks: +0.20
  • Correct packs: +0.35
  • Completion bonus: +0.50
  • Timely recharge/rest: +0.06 to +0.08

Penalties:

  • Invalid actions: -0.08 to -0.10
  • Wrong picks: -0.18
  • Wrong packs: -0.15
  • Obstacle collisions: -0.12
  • Overweight attempts: -0.12
  • Waiting: -0.01 per step

Money System (expert tasks):

  • Correct packs earn item value in dollars
  • Wrong packs lose 50% of item value
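
As a worked example of the numbers above, the sketch below totals the reward for a short trajectory and tracks money across packs. The event names and both helper functions are hypothetical labels chosen for this example; the values are copied from the lists above. Entries given as ranges (invalid actions, timely recharge/rest) are left out because their exact values are not stated.

```python
# Per-event rewards copied from the lists above.
EVENT_REWARD = {
    "correct_scan": 0.12,
    "correct_pick": 0.20,
    "correct_pack": 0.35,
    "completion": 0.50,
    "wrong_pick": -0.18,
    "wrong_pack": -0.15,
    "obstacle_collision": -0.12,
    "overweight_attempt": -0.12,
    "wait": -0.01,
}


def total_reward(events):
    """Sum the per-event rewards of a trajectory."""
    return sum(EVENT_REWARD[name] for name in events)


def money_after_packs(packs):
    """packs: iterable of (item_value_in_dollars, was_correct)."""
    money = 0.0
    for value, correct in packs:
        money += value if correct else -0.5 * value   # wrong pack loses 50%
    return money


clean_run = ["correct_scan", "correct_pick", "correct_pack", "completion"]
print(round(total_reward(clean_run), 2))                 # 1.17
print(money_after_packs([(10.0, True), (8.0, False)]))   # 6.0
```

A clean single-item run therefore collects 0.12 + 0.20 + 0.35 + 0.50 = 1.17 before any movement-related penalties, and one wrong pack of an $8 item cancels $4 of earnings.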

Installation & Usage

Install Dependencies

pip install -r requirements.txt

Run Baseline with OpenAI

The baseline runner uses the OpenAI Python SDK with multi-seed evaluation for robust scoring:

export OPENAI_API_KEY="your_api_key_here"
export OPENAI_MODEL="gpt-4o-mini"

# Single-seed evaluation (backward compatible)
export EVAL_SEEDS="7"
python3 -m grid_env.baseline

# Multi-seed evaluation (default: 5 seeds)
export EVAL_SEEDS="7,42,123,456,789"
python3 -m grid_env.baseline
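
For reference, a comma-separated `EVAL_SEEDS` value can be turned into a list of integers as sketched below. How `grid_env.baseline` actually parses the variable is not shown in this README; `parse_seeds` is a hypothetical helper, and the fallback simply mirrors the five seeds used above.

```python
import os


def parse_seeds(raw: str) -> list[int]:
    """Turn a comma-separated string such as "7,42,123" into a list of ints."""
    return [int(part) for part in raw.split(",") if part.strip()]


# Fall back to the five seeds shown above when the variable is unset.
seeds = parse_seeds(os.environ.get("EVAL_SEEDS", "7,42,123,456,789"))
print(seeds)
```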

Run Inference Script

For comprehensive evaluation across all 8 tasks with multi-seed support:

# Create .env file with:
# HF_TOKEN=your_api_key
# API_BASE_URL=https://api.openai.com/v1
# MODEL_NAME=gpt-4o-mini
# EVAL_SEEDS=7,42,123,456,789

# Load .env and run
set -a && source .env && set +a
python3 inference.py

Multi-seed evaluation runs each task across multiple random seeds and reports:

  • Mean score ± standard deviation
  • Min/max scores across seeds
  • Success rate across seeds

Averaging over seeds reduces the risk of overfitting to a specific warehouse layout and gives more reliable performance metrics.
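
The reported statistics can be computed from a list of per-seed scores as in the sketch below. It is an approximation of the idea, not the code in `inference.py`: the `summarize_scores` name, the use of the population standard deviation, and the success threshold of 1.0 are all assumptions.

```python
from statistics import mean, pstdev


def summarize_scores(scores, success_threshold=1.0):
    """Aggregate per-seed scores in [0.0, 1.0] into the reported statistics."""
    if not scores:
        raise ValueError("need at least one score")
    return {
        "mean": mean(scores),
        "std": pstdev(scores),            # population std; a single seed gives 0.0
        "min": min(scores),
        "max": max(scores),
        "success_rate": sum(s >= success_threshold for s in scores) / len(scores),
    }


summary = summarize_scores([1.0, 0.5, 1.0, 0.0, 1.0])
print(summary["success_rate"])  # 0.6
```

In this example three of five seeds succeed, yet the minimum score is 0.0, which is exactly the kind of layout-dependent failure a single-seed run could hide.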

Testing

Run the full test suite (205 tests):

pytest tests/ -v

Multi-Seed Evaluation Demo

See how multi-seed evaluation works without requiring an API key:

python3 example_multiseed.py

This runs a random policy across multiple seeds and shows score variance.

Server Deployment

The environment includes a FastAPI server for remote access:

# Local development
uvicorn grid_env.Server.app:app --reload

# Docker deployment
docker build -t warehouse-env .
docker run -p 8000:8000 warehouse-env

API endpoints:

  • GET /health: Server status
  • GET /tasks: List all tasks
  • POST /reset: Reset environment
  • POST /step: Execute action
  • GET /state: Get current state
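
A client for these endpoints can be sketched with the standard library alone. The JSON field names in the payloads (`task_id`, `seed`, `action`) are assumptions modelled on the Python interface earlier in this README; the server's real request models may differ. `build_request` and `call` are hypothetical helpers, and the base URL assumes the local port 8000 used above.

```python
import json
import urllib.request


def build_request(base_url, method, path, payload=None):
    """Build (but do not send) a request for one of the endpoints above."""
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        base_url.rstrip("/") + path, data=data, headers=headers, method=method
    )


def call(request, timeout=10.0):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


BASE_URL = "http://localhost:8000"
# Payload field names below are assumptions; check the server's request models.
reset_req = build_request(BASE_URL, "POST", "/reset",
                          {"task_id": "easy_single_pick", "seed": 7})
step_req = build_request(BASE_URL, "POST", "/step", {"action": "move_forward"})
health_req = build_request(BASE_URL, "GET", "/health")
print(reset_req.get_method(), reset_req.full_url)  # POST http://localhost:8000/reset
```

With the server running, `call(health_req)` would send the request and return the decoded JSON body.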

Validation

If openenv is installed, validate the manifest:

openenv validate grid_env/openv.yaml

Project Structure

grid_env/
├── env.py           # Core environment logic
├── tasks.py         # Task definitions (8 tasks)
├── graders.py       # Rubric-based graders
├── models.py        # Pydantic data models
├── baseline.py      # OpenAI baseline runner
├── tools.py         # Action tool definitions
├── world.py         # World state wrapper
├── client.py        # Client facade
└── Server/
    ├── app.py           # FastAPI server
    └── warehouse_env.py # Service wrapper

tests/               # 205 test cases
inference.py         # LLM inference runner

License

MIT License - Copyright 2026 Soham Bose