---
title: RL-Env Warehouse Fulfillment
emoji: 🏭
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
pinned: false
---

# RL-Env: Warehouse Fulfillment Environment

An OpenEnv-style warehouse fulfillment environment of increasing difficulty. It simulates pharmacy micro-fulfillment workflows with obstacles, weight constraints, stamina management, and budget optimization.

## Overview

Agents control a warehouse robot navigating a 7×7 grid to fulfill customer orders. Tasks range from simple single-item pickups to complex multi-constraint challenges requiring strategic planning across obstacles, battery management, item weights, stamina conservation, and profit optimization.

## Requirements Coverage

- **Real-world task**: Implemented in [`env.py`](grid_env/env.py)
- **Eight graded tasks**: Defined in [`tasks.py`](grid_env/tasks.py) with progressive difficulty
- **Deterministic graders**: Implemented in [`graders.py`](grid_env/graders.py)
- **Typed OpenEnv models**: Implemented in [`models.py`](grid_env/models.py)
- **OpenEnv manifests**: [`openenv.yaml`](openenv.yaml) and [`openv.yaml`](grid_env/openv.yaml)
- **OpenAI baseline runner**: [`baseline.py`](grid_env/baseline.py)

## Core Mechanics

### Actions
- **Navigation**: `turn_left`, `turn_right`, `move_forward`
- **Operations**: `scan_bin`, `pick_item`, `pack_item`
- **Resources**: `recharge` (battery), `rest` (stamina)
- **Utility**: `wait`

### Advanced Mechanics (task-dependent)
1. **Obstacles**: Impassable cells blocking direct paths — agents must route around them
2. **Item Weight & Carry Capacity**: Items have weight (1-4 units); heavier items drain more battery while moving
3. **Stamina System**: Movement costs stamina; once stamina is depleted, each move costs double the battery
4. **Money & Profit Targets**: Items have dollar values; correct packs earn money, wrong packs lose money
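
The weight and stamina rules above can be sketched as a per-move battery cost. This is only an illustration: the README states that heavier loads drain more battery and that depleted stamina doubles the battery cost, but the base cost of 1 and the linear weight surcharge below are assumptions, as are the function names. The carry capacity of 3 matches the `heavy_lifting` and `gauntlet` tasks.

```python
def move_battery_cost(carried_weight: int, stamina: int, base_cost: int = 1) -> int:
    """Hypothetical battery cost of one move_forward (not taken from env.py)."""
    if carried_weight < 0:
        raise ValueError("carried_weight must be non-negative")
    # Assumption: each carried weight unit adds one battery unit per move.
    cost = base_cost + carried_weight
    # From the README: with stamina depleted, movement costs double battery.
    if stamina <= 0:
        cost *= 2
    return cost


def can_pick(carried_weight: int, item_weight: int, carry_capacity: int = 3) -> bool:
    """True if picking the item keeps the load within carry capacity."""
    return carried_weight + item_weight <= carry_capacity


print(move_battery_cost(carried_weight=2, stamina=5))  # rested robot
print(move_battery_cost(carried_weight=2, stamina=0))  # exhausted robot
print(can_pick(carried_weight=2, item_weight=2))       # exceeds capacity 3
```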

## Environment Interface

```python
from grid_env import WarehouseFulfillmentEnv

env = WarehouseFulfillmentEnv(task_id="easy_single_pick", seed=7)

# Reset environment
observation = env.reset(task_id="easy_single_pick", seed=7)

# Step with action
observation, reward, done, info = env.step("move_forward")

# Get full state
state = env.state()
```

All data types (`WarehouseAction`, `WarehouseObservation`, `WarehouseReward`, `WarehouseState`) are Pydantic models.
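
A minimal episode loop built on the `reset`/`step` interface might look like the sketch below. `run_episode`, `StepEnv`, and `CountdownEnv` are hypothetical names introduced here; `CountdownEnv` is a stand-in so the snippet runs without `grid_env` installed, and a real `WarehouseFulfillmentEnv` instance would be passed in its place.

```python
import random
from typing import Any, Callable, Protocol

# Action names as listed under "Actions".
ACTIONS = [
    "turn_left", "turn_right", "move_forward",
    "scan_bin", "pick_item", "pack_item",
    "recharge", "rest", "wait",
]


class StepEnv(Protocol):
    """Structural type for anything exposing the reset/step interface."""
    def reset(self, task_id: str, seed: int) -> Any: ...
    def step(self, action: str) -> tuple[Any, Any, bool, dict]: ...


def run_episode(env: StepEnv, policy: Callable[[Any], str],
                task_id: str, seed: int, max_steps: int = 40) -> int:
    """Run one episode and return the number of steps taken."""
    observation = env.reset(task_id=task_id, seed=seed)
    for steps in range(1, max_steps + 1):
        observation, reward, done, info = env.step(policy(observation))
        if done:
            return steps
    return max_steps


class CountdownEnv:
    """Stand-in environment (hypothetical): ends after a fixed number of steps."""
    def __init__(self, horizon: int = 5) -> None:
        self.horizon = horizon
        self.remaining = horizon

    def reset(self, task_id: str, seed: int) -> dict:
        self.remaining = self.horizon
        return {"remaining": self.remaining}

    def step(self, action: str) -> tuple[dict, float, bool, dict]:
        self.remaining -= 1
        return {"remaining": self.remaining}, 0.0, self.remaining <= 0, {}


rng = random.Random(7)
steps_taken = run_episode(CountdownEnv(5), lambda obs: rng.choice(ACTIONS),
                          task_id="easy_single_pick", seed=7)
print(steps_taken)
```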

## Tasks

The environment includes **8 progressively challenging tasks** across 4 difficulty levels:

### Easy
- **`easy_single_pick`**: Fulfill one urgent thermometer order (40 steps, battery 36)

### Medium
- **`medium_multi_item`**: Two-line order with scan verification (60 steps, battery 34)
- **`obstacle_course`**: Navigate around 6 obstacles to fulfill two-item order (70 steps, battery 40)

### Hard
- **`hard_restock_priority`**: Three-line order with battery management (85 steps, battery 24)
- **`heavy_lifting`**: Weight-constrained picking — items weigh 1-4 units, carry capacity 3 (90 steps, battery 32)
- **`stamina_run`**: Stamina management — movement drains stamina; rest to recover (80 steps, battery 36, stamina 12)

### Expert
- **`budget_run`**: Profit-driven fulfillment — earn $15+ from valued items (70 steps, battery 30, target $15)
- **`gauntlet`**: All mechanics combined — obstacles + weight + stamina + $20 profit target (120 steps, battery 28, stamina 10, carry capacity 3)

Each grader returns a deterministic rubric-based score in `[0.0, 1.0]` based on completion, efficiency, and constraint satisfaction.
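
For quick reference, the step and battery budgets listed above can be collected in a small table. The `TaskBudget` dataclass and `tasks_at` helper are hypothetical and only restate the numbers in this section; the authoritative definitions live in `tasks.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskBudget:
    task_id: str
    difficulty: str
    max_steps: int
    battery: int


# Values copied from the task list above.
TASK_BUDGETS = [
    TaskBudget("easy_single_pick", "easy", 40, 36),
    TaskBudget("medium_multi_item", "medium", 60, 34),
    TaskBudget("obstacle_course", "medium", 70, 40),
    TaskBudget("hard_restock_priority", "hard", 85, 24),
    TaskBudget("heavy_lifting", "hard", 90, 32),
    TaskBudget("stamina_run", "hard", 80, 36),
    TaskBudget("budget_run", "expert", 70, 30),
    TaskBudget("gauntlet", "expert", 120, 28),
]


def tasks_at(difficulty: str) -> list[str]:
    """Return the task ids for one difficulty level, in listed order."""
    return [t.task_id for t in TASK_BUDGETS if t.difficulty == difficulty]


tightest = min(TASK_BUDGETS, key=lambda t: t.battery)
print(tasks_at("expert"))
print(tightest.task_id, tightest.battery)
```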

## Reward Design

The reward function provides dense trajectory-wide feedback:

**Positive Rewards:**
- Correct scans: +0.12
- Correct picks: +0.20
- Correct packs: +0.35
- Completion bonus: +0.50
- Timely recharge/rest: +0.06 to +0.08

**Penalties:**
- Invalid actions: -0.08 to -0.10
- Wrong picks: -0.18
- Wrong packs: -0.15
- Obstacle collisions: -0.12
- Overweight attempts: -0.12
- Waiting: -0.01 per step
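
As a worked example, the fixed-value entries above can be summed over a short trajectory. The event labels and the `trajectory_return` helper are hypothetical, and the sketch assumes step rewards simply add up; entries given as ranges (invalid actions, recharge/rest) are left out because their exact values depend on context not stated here.

```python
# Fixed-value rewards and penalties from the lists above.
REWARDS = {
    "correct_scan": 0.12,
    "correct_pick": 0.20,
    "correct_pack": 0.35,
    "completion": 0.50,
    "wrong_pick": -0.18,
    "wrong_pack": -0.15,
    "obstacle_collision": -0.12,
    "overweight_attempt": -0.12,
    "wait": -0.01,
}


def trajectory_return(events: list[str]) -> float:
    """Sum the per-event rewards, rounded to cents to hide float noise."""
    return round(sum(REWARDS[event] for event in events), 2)


# A clean single-item fulfillment: scan, pick, pack, finish.
clean = ["correct_scan", "correct_pick", "correct_pack", "completion"]
# The same run preceded by two waits and one obstacle collision.
sloppy = ["wait", "wait", "obstacle_collision"] + clean

print(trajectory_return(clean))   # 0.12 + 0.20 + 0.35 + 0.50
print(trajectory_return(sloppy))  # clean total - 0.02 - 0.12
```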

**Money System (expert tasks):**
- Correct packs earn the item's value in dollars
- Wrong packs lose 50% of the item's value
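
The two money rules can be written as a small payout function. The helper names and the item values in the example are hypothetical; only the full-value and 50%-loss rules and the $15 target of `budget_run` come from this README.

```python
def pack_payout(item_value: float, correct: bool) -> float:
    """Dollar change for one pack: full value if correct, minus half if wrong."""
    return item_value if correct else -0.5 * item_value


def meets_target(payouts: list[float], target: float) -> bool:
    """True once accumulated money reaches the task's profit target."""
    return sum(payouts) >= target


# Hypothetical run: two correct packs ($10, $8) and one wrong pack of a $6 item.
payouts = [pack_payout(10, True), pack_payout(8, True), pack_payout(6, False)]
total = sum(payouts)  # 10 + 8 - 3
print(total, meets_target(payouts, target=15))
```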

## Installation & Usage

### Install Dependencies

```bash
pip install -r requirements.txt
```

### Run Baseline with OpenAI

The baseline runner uses the OpenAI Python SDK with **multi-seed evaluation** for robust scoring:

```bash
export OPENAI_API_KEY="your_api_key_here"
export OPENAI_MODEL="gpt-4o-mini"

# Single-seed evaluation (backward compatible)
export EVAL_SEEDS="7"
python3 -m grid_env.baseline

# Multi-seed evaluation (default: 5 seeds)
export EVAL_SEEDS="7,42,123,456,789"
python3 -m grid_env.baseline
```

### Run Inference Script

For comprehensive evaluation across all 8 tasks with multi-seed support:

```bash
# Create .env file with:
# HF_TOKEN=your_api_key
# API_BASE_URL=https://api.openai.com/v1
# MODEL_NAME=gpt-4o-mini
# EVAL_SEEDS=7,42,123,456,789

# Load .env and run
set -a && source .env && set +a
python3 inference.py
```

**Multi-seed evaluation** runs each task across multiple random seeds and reports:
- Mean score ± standard deviation
- Min/max scores across seeds
- Success rate across seeds

This reduces the risk of overfitting to specific warehouse layouts and gives more reliable performance metrics.
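
The reported statistics can be computed from per-seed scores roughly as follows. This is a sketch under assumptions: `summarize` is a hypothetical helper, the population standard deviation is used (the actual scripts may use the sample version), and a seed counts as a success only at a score of 1.0, which may differ from the project's own success criterion.

```python
from statistics import mean, pstdev


def summarize(scores: list[float], success_threshold: float = 1.0) -> dict[str, float]:
    """Aggregate per-seed grader scores for one task."""
    if not scores:
        raise ValueError("at least one score is required")
    successes = sum(score >= success_threshold for score in scores)
    return {
        "mean": mean(scores),
        "std": pstdev(scores),
        "min": min(scores),
        "max": max(scores),
        "success_rate": successes / len(scores),
    }


# Hypothetical scores for one task across five seeds.
summary = summarize([0.5, 1.0, 0.75, 1.0, 0.25])
print(f"{summary['mean']:.2f} ± {summary['std']:.2f}")
print(summary["min"], summary["max"], summary["success_rate"])
```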

## Testing

Run the full test suite (205 tests):

```bash
pytest tests/ -v
```

### Multi-Seed Evaluation Demo

See how multi-seed evaluation works without requiring an API key:

```bash
python3 example_multiseed.py
```

This runs a random policy across multiple seeds and shows score variance.

## Server Deployment

The environment includes a FastAPI server for remote access:

```bash
# Local development
uvicorn grid_env.Server.app:app --reload

# Docker deployment
docker build -t warehouse-env .
docker run -p 8000:8000 warehouse-env
```

API endpoints:
- `GET /health` — Server status
- `GET /tasks` — List all tasks
- `POST /reset` — Reset environment
- `POST /step` — Execute action
- `GET /state` — Get current state

## Validation

If `openenv` is installed, validate the manifest:

```bash
openenv validate grid_env/openv.yaml
```

## Project Structure

```
grid_env/
├── env.py           # Core environment logic
├── tasks.py         # Task definitions (8 tasks)
├── graders.py       # Rubric-based graders
├── models.py        # Pydantic data models
├── baseline.py      # OpenAI baseline runner
├── tools.py         # Action tool definitions
├── world.py         # World state wrapper
├── client.py        # Client facade
└── Server/
    ├── app.py           # FastAPI server
    └── warehouse_env.py # Service wrapper

tests/               # 205 test cases
inference.py         # LLM inference runner
```

## License

MIT License - Copyright 2026 Soham Bose