---
title: RL-Env Warehouse Fulfillment
emoji: 🏭
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
pinned: false
---
# RL-Env: Warehouse Fulfillment Environment
A progressively challenging OpenEnv-style warehouse fulfillment environment simulating realistic pharmacy micro-fulfillment workflows with obstacles, weight constraints, stamina management, and budget optimization.
## Overview
Agents control a warehouse robot navigating a 7×7 grid to fulfill customer orders. Tasks range from simple single-item pickups to complex multi-constraint challenges requiring strategic planning across obstacles, battery management, item weights, stamina conservation, and profit optimization.
## Requirements Coverage
- **Real-world task**: Implemented in [`env.py`](grid_env/env.py)
- **Eight graded tasks**: Defined in [`tasks.py`](grid_env/tasks.py) with progressive difficulty
- **Deterministic graders**: Implemented in [`graders.py`](grid_env/graders.py)
- **Typed OpenEnv models**: Implemented in [`models.py`](grid_env/models.py)
- **OpenEnv manifests**: [`openenv.yaml`](openenv.yaml) and [`openv.yaml`](grid_env/openv.yaml)
- **OpenAI baseline runner**: [`baseline.py`](grid_env/baseline.py)
## Core Mechanics
### Actions
- **Navigation**: `turn_left`, `turn_right`, `move_forward`
- **Operations**: `scan_bin`, `pick_item`, `pack_item`
- **Resources**: `recharge` (battery), `rest` (stamina)
- **Utility**: `wait`
### Advanced Mechanics (task-dependent)
1. **Obstacles**: Impassable cells block direct paths, so agents must route around them
2. **Item Weight & Carry Capacity**: Items have weight (1-4 units); heavier items drain more battery while moving
3. **Stamina System**: Movement costs stamina; when depleted, movement costs double battery
4. **Money & Profit Targets**: Items have dollar values; correct packs earn money, wrong packs lose money
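The weight and stamina rules above can be sketched as a small cost function. This is a minimal illustration and not the code in `env.py`: the function names, the base cost of 1, and the "one extra battery unit per carried weight unit" rule are assumptions chosen for the example. Only the doubling on depleted stamina and the carry capacity of 3 come from this README.

```python
def move_battery_cost(carried_weight: int, stamina: int, base_cost: int = 1) -> int:
    """Battery drained by one `move_forward` (illustrative formula only)."""
    # Assumption: each carried weight unit adds one unit of battery drain.
    cost = base_cost + carried_weight
    # From the README: with stamina depleted, movement costs double battery.
    if stamina <= 0:
        cost *= 2
    return cost


def can_pick(carried_weight: int, item_weight: int, carry_capacity: int = 3) -> bool:
    """A pick fits only if the total load stays within the carry capacity."""
    return carried_weight + item_weight <= carry_capacity
```

Under these assumptions, carrying two weight units with stamina at zero costs `(1 + 2) * 2 = 6` battery per move, which is why resting before a long haul can be cheaper than pushing on.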
## Environment Interface
```python
from grid_env import WarehouseFulfillmentEnv
env = WarehouseFulfillmentEnv(task_id="easy_single_pick", seed=7)
# Reset environment
observation = env.reset(task_id="easy_single_pick", seed=7)
# Step with action
observation, reward, done, info = env.step("move_forward")
# Get full state
state = env.state()
```
All data types (`WarehouseAction`, `WarehouseObservation`, `WarehouseReward`, `WarehouseState`) are Pydantic models.
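A full episode is a loop over `reset` and `step`. The helper below is a sketch that works with any object exposing the interface shown above; `run_episode` and `policy` are hypothetical names, not part of `grid_env`. It stores rewards as returned rather than summing them, because the fields of `WarehouseReward` are not documented here.

```python
from typing import Any, Callable, List, Tuple


def run_episode(
    env: Any,
    policy: Callable[[Any], str],
    max_steps: int,
    **reset_kwargs: Any,
) -> Tuple[List[Tuple[str, Any]], bool]:
    """Roll out one episode; return the (action, reward) trajectory and the done flag."""
    observation = env.reset(**reset_kwargs)
    trajectory: List[Tuple[str, Any]] = []
    done = False
    while not done and len(trajectory) < max_steps:
        action = policy(observation)  # e.g. "move_forward"
        observation, reward, done, info = env.step(action)
        trajectory.append((action, reward))
    return trajectory, done


# Possible usage with the real environment (not executed here):
#   env = WarehouseFulfillmentEnv(task_id="easy_single_pick", seed=7)
#   trajectory, done = run_episode(env, lambda obs: "scan_bin", max_steps=40,
#                                  task_id="easy_single_pick", seed=7)
```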
## Tasks
The environment includes **8 progressively challenging tasks** across 4 difficulty levels:
### Easy
- **`easy_single_pick`**: Fulfill one urgent thermometer order (40 steps, battery 36)
### Medium
- **`medium_multi_item`**: Two-line order with scan verification (60 steps, battery 34)
- **`obstacle_course`**: Navigate around 6 obstacles to fulfill two-item order (70 steps, battery 40)
### Hard
- **`hard_restock_priority`**: Three-line order with battery management (85 steps, battery 24)
- **`heavy_lifting`**: Weight-constrained picking: items weigh 1-4 units, carry capacity 3 (90 steps, battery 32)
- **`stamina_run`**: Stamina management: movement drains stamina, rest to recover (80 steps, battery 36, stamina 12)
### Expert
- **`budget_run`**: Profit-driven fulfillment: earn $15+ from valued items (70 steps, battery 30, target $15)
- **`gauntlet`**: All mechanics combined: obstacles + weight + stamina + $20 profit target (120 steps, battery 28, stamina 10, carry capacity 3)
Each grader returns a deterministic rubric-based score in `[0.0, 1.0]` based on completion, efficiency, and constraint satisfaction.
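To make the idea of a rubric-based score concrete, here is one way such a grader could combine its three components. The weights `(0.6, 0.25, 0.15)` and the function name are hypothetical; the actual rubrics live in `graders.py` and may be structured differently.

```python
def rubric_score(
    completion: float,
    efficiency: float,
    constraints_ok: float,
    weights: tuple = (0.6, 0.25, 0.15),  # hypothetical weights, not from graders.py
) -> float:
    """Combine three sub-scores into a single score in [0.0, 1.0]."""
    parts = (completion, efficiency, constraints_ok)
    # Clamp every component so out-of-range inputs cannot push the score out of range.
    clamped = [min(1.0, max(0.0, p)) for p in parts]
    total = sum(w * p for w, p in zip(weights, clamped)) / sum(weights)
    return round(min(1.0, max(0.0, total)), 4)
```

The function has no randomness or hidden state, so the same inputs always yield the same score, which is the property "deterministic" refers to.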
## Reward Design
The reward function provides dense trajectory-wide feedback:
**Positive Rewards:**
- Correct scans: +0.12
- Correct picks: +0.20
- Correct packs: +0.35
- Completion bonus: +0.50
- Timely recharge/rest: +0.06 to +0.08
**Penalties:**
- Invalid actions: -0.08 to -0.10
- Wrong picks: -0.18
- Wrong packs: -0.15
- Obstacle collisions: -0.12
- Overweight attempts: -0.12
- Waiting: -0.01 per step
**Money System (expert tasks):**
- Correct packs earn item value in dollars
- Wrong packs lose 50% of item value
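The fixed-value entries above can be written as a lookup table and summed over a trajectory. This is a sketch for reasoning about reward magnitudes: the event names are hypothetical labels, and the ranged entries (invalid actions, timely recharge/rest) are left out because the README gives only their bounds.

```python
# Fixed per-event rewards copied from the lists above; keys are illustrative names.
REWARDS = {
    "correct_scan": 0.12,
    "correct_pick": 0.20,
    "correct_pack": 0.35,
    "completion": 0.50,
    "wrong_pick": -0.18,
    "wrong_pack": -0.15,
    "obstacle_collision": -0.12,
    "overweight_attempt": -0.12,
    "wait": -0.01,
}


def trajectory_reward(events):
    """Sum the reward of each event, rounded to avoid float noise."""
    return round(sum(REWARDS[event] for event in events), 2)


def pack_money(item_value: float, correct: bool) -> float:
    """Money change for one pack: full value if correct, minus 50% if wrong."""
    return item_value if correct else -0.5 * item_value
```

For example, a clean single-item run of scan, pick, pack, and completion totals `0.12 + 0.20 + 0.35 + 0.50 = 1.17` before any movement or waiting penalties.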
## Installation & Usage
### Install Dependencies
```bash
pip install -r requirements.txt
```
### Run Baseline with OpenAI
The baseline runner uses the OpenAI Python SDK with **multi-seed evaluation** for robust scoring:
```bash
export OPENAI_API_KEY="your_api_key_here"
export OPENAI_MODEL="gpt-4o-mini"
# Single-seed evaluation (backward compatible)
export EVAL_SEEDS="7"
python3 -m grid_env.baseline
# Multi-seed evaluation (default: 5 seeds)
export EVAL_SEEDS="7,42,123,456,789"
python3 -m grid_env.baseline
```
### Run Inference Script
For comprehensive evaluation across all 8 tasks with multi-seed support:
```bash
# Create .env file with:
# HF_TOKEN=your_api_key
# API_BASE_URL=https://api.openai.com/v1
# MODEL_NAME=gpt-4o-mini
# EVAL_SEEDS=7,42,123,456,789
# Load .env and run
set -a && source .env && set +a
python3 inference.py
```
**Multi-seed evaluation** runs each task across multiple random seeds and reports:
- Mean score ± standard deviation
- Min/max scores across seeds
- Success rate across seeds
This prevents overfitting to specific warehouse layouts and provides more reliable performance metrics.
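The reported statistics can be reproduced from a list of per-seed scores with the standard library. The sketch below makes two assumptions that this README does not confirm: it uses the population standard deviation, and it counts a seed as a success when its score reaches a `success_threshold` of 1.0. `summarize_scores` is a hypothetical helper name.

```python
from statistics import mean, pstdev


def summarize_scores(scores, success_threshold=1.0):
    """Aggregate per-seed grader scores into summary statistics."""
    return {
        "mean": mean(scores),
        "std": pstdev(scores),  # assumption: population standard deviation
        "min": min(scores),
        "max": max(scores),
        # Assumption: a seed counts as a success at or above the threshold.
        "success_rate": sum(s >= success_threshold for s in scores) / len(scores),
    }
```

A large gap between `min` and `max` for a task suggests the policy depends on a particular layout rather than on the task itself.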
## Testing
Run the full test suite (205 tests):
```bash
pytest tests/ -v
```
### Multi-Seed Evaluation Demo
See how multi-seed evaluation works without requiring an API key:
```bash
python3 example_multiseed.py
```
This runs a random policy across multiple seeds and shows score variance.
## Server Deployment
The environment includes a FastAPI server for remote access:
```bash
# Local development
uvicorn grid_env.Server.app:app --reload
# Docker deployment
docker build -t warehouse-env .
docker run -p 8000:8000 warehouse-env
```
API endpoints:
- `GET /health`: Server status
- `GET /tasks`: List all tasks
- `POST /reset`: Reset environment
- `POST /step`: Execute action
- `GET /state`: Get current state
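A client for these endpoints needs only the standard library. The sketch below assumes the server listens on `http://localhost:8000` (the port published in the Docker command) and that the POST endpoints accept JSON bodies. The body field names used in the usage comments (`task_id`, `seed`, `action`) are guesses based on the Python interface, so check `grid_env/Server/app.py` for the real request schema.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8000"  # assumption: default local deployment


def build_request(path, payload=None, base_url=BASE_URL):
    """Build a GET request when payload is None, otherwise a JSON POST."""
    if payload is None:
        return Request(base_url + path, method="GET")
    body = json.dumps(payload).encode("utf-8")
    return Request(
        base_url + path,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def call(path, payload=None, base_url=BASE_URL):
    """Send the request and decode the JSON response."""
    with urlopen(build_request(path, payload, base_url), timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Possible usage against a running server (field names are assumptions):
#   call("/health")
#   call("/reset", {"task_id": "easy_single_pick", "seed": 7})
#   call("/step", {"action": "move_forward"})
```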
## Validation
If `openenv` is installed, validate the manifest:
```bash
openenv validate grid_env/openv.yaml
```
## Project Structure
```
grid_env/
├── env.py                  # Core environment logic
├── tasks.py                # Task definitions (8 tasks)
├── graders.py              # Rubric-based graders
├── models.py               # Pydantic data models
├── baseline.py             # OpenAI baseline runner
├── tools.py                # Action tool definitions
├── world.py                # World state wrapper
├── client.py               # Client facade
└── Server/
    ├── app.py              # FastAPI server
    └── warehouse_env.py    # Service wrapper
tests/                      # 205 test cases
inference.py                # LLM inference runner
```
## License
MIT License - Copyright 2026 Soham Bose