---
title: Inventory Reasoning Environment
emoji: 📦
colorFrom: blue
colorTo: green
sdk: docker
tags:
- openenv
- reinforcement-learning
- inventory-optimization
- long-horizon
license: apache-2.0
---
# Inventory Simulations
Stochastic inventory simulation comparing rule-based agents against an LLM agent (Qwen2.5-72B via the HF Inference API, optionally replaced by a locally GRPO-fine-tuned Qwen2.5-3B). The simulation runs over a 2-year (730-day) horizon with a live Gradio UI on HF Spaces.
## Overview
The environment simulates day-to-day inventory decisions under stochastic demand. At each step an agent sets a **reorder point**: the inventory level that triggers a replenishment order. The goal is to maximize fill rate (≥ 95%) while minimizing holding costs, write-offs, and stockout penalties.
A P&L-based reward function is used throughout:
```
daily_reward = (revenue - holding_cost - stockout_penalty - order_cost - writeoff_cost)
               / baseline_profit
```
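A hedged Python sketch of this reward, assuming illustrative per-unit holding and stockout rates (only `SELLING_PRICE`, `UNIT_COST`, and `FIXED_ORDER_COST` come from `config.py`; `baseline_profit` normalizes the sum as in the formula above):

```python
SELLING_PRICE = 10.0      # from config.py
UNIT_COST = 4.0           # from config.py
FIXED_ORDER_COST = 50.0   # from config.py
HOLDING_COST = 0.05       # assumed cost per unit held per day
STOCKOUT_PENALTY = 2.0    # assumed penalty per unit of lost sales

def daily_reward(sold, on_hand, lost, ordered, written_off, baseline_profit):
    """One day's P&L, normalized by a baseline profit figure."""
    revenue = sold * SELLING_PRICE
    holding_cost = on_hand * HOLDING_COST
    stockout_penalty = lost * STOCKOUT_PENALTY
    order_cost = ordered * UNIT_COST + (FIXED_ORDER_COST if ordered > 0 else 0.0)
    writeoff_cost = written_off * UNIT_COST
    return (revenue - holding_cost - stockout_penalty
            - order_cost - writeoff_cost) / baseline_profit
```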
## Agents
| Agent | Strategy |
|---|---|
| **Historical Mean (Baseline)** | ROP = mean historical demand × lead time |
| **Safety Stock** | Adds normal-quantile safety buffer on top of historical mean |
| **Forecast** | Uses future distribution means + safety stock on forecast error |
| **Monte Carlo** | Samples lead-time demand distributions; uses service-level quantile |
| **LLM (Qwen2.5-72B)** | Calls HF Inference API every 5 days; reasons over demand trend, pending orders, fill rate; outputs `reorder_point` as JSON |
The LLM agent can be fine-tuned locally with GRPO (`agent/train_grpo.py`) to produce a LoRA adapter that replaces the base Qwen model.
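Hedged sketches of the rule-based policies above (the z-value, service level, and function names are illustrative, not the exact `agent_environment.py` implementation):

```python
import statistics
from math import sqrt

LEAD_TIME = 3  # days, matching config.py

def historical_mean_rop(history):
    """Baseline: ROP = mean historical demand x lead time."""
    return statistics.mean(history) * LEAD_TIME

def safety_stock_rop(history, z=1.65):
    """Mean lead-time demand plus a normal-quantile safety buffer."""
    mu, sigma = statistics.mean(history), statistics.stdev(history)
    return mu * LEAD_TIME + z * sigma * sqrt(LEAD_TIME)

def monte_carlo_rop(sample_daily_demand, service_level=0.95, n=5000):
    """Sample n lead-time demand totals; take the service-level quantile."""
    totals = sorted(sum(sample_daily_demand() for _ in range(LEAD_TIME))
                    for _ in range(n))
    return totals[int(service_level * (n - 1))]
```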
## Demand Environments
| Environment | Distribution |
|---|---|
| `GammaPoisson` | 90/10 mixture of Gamma and Poisson |
| `GammaGammaHighVariance` | 50/50 mixture of two Gamma distributions (bimodal) |
| `SpikingDemand` | Gamma with occasional demand spikes |
| `SingleGammaLowVariance` | Single Gamma, low variance |
All environments apply **seasonality multipliers** (by month and weekday) to the base scale.
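As an illustration, the multiplier lookup might look like this (the specific multiplier values are invented; the real ones live in `demand_environment.py`):

```python
# Illustrative seasonality tables; months/weekdays not listed default to 1.0.
MONTH_MULT = {12: 1.3, 7: 0.9}    # assumed December peak, July dip
WEEKDAY_MULT = {5: 1.2, 6: 1.2}   # assumed weekend uplift (Sat=5, Sun=6)

def seasonal_scale(base_scale, month, weekday):
    """Apply month and weekday multipliers to the base distribution scale."""
    return base_scale * MONTH_MULT.get(month, 1.0) * WEEKDAY_MULT.get(weekday, 1.0)
```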
## Project Structure
```
├── app.py                     # Gradio UI (Baseline + LLM tabs)
├── config.py                  # Global constants
├── reward.py                  # Unified P&L reward function
├── demand_environment.py      # Demand distribution classes
├── demand_calculator.py       # Per-day demand sampling
├── inventory_manager.py       # Inventory state, reordering, write-offs
├── order_processor.py         # Order queue with stochastic lead time
├── performance_tracker.py     # Fill rate, write-offs, lost sales
├── agent_environment.py       # Rule-based agent classes
├── server/inventory_env.py    # FastAPI HTTP environment (OpenEnv API)
├── client/inventory_client.py # Async Python client for the HTTP env
├── agent/
│   ├── train_grpo.py          # GRPO fine-tuning of Qwen2.5-3B-Instruct
│   ├── finetune_agent.py      # Local inference with optional LoRA adapter
│   └── llm_agent_runner.py    # CLI runner for the LLM agent
└── Dockerfile                 # HF Spaces container
```
## Key Parameters (`config.py`)
| Parameter | Default | Description |
|---|---|---|
| `SIM_DAYS` | 730 | Total simulation horizon (days) |
| `HISTO_DAYS` | 365 | Warm-up history before decisions begin |
| `LEAD_TIME` | 3 | Order-to-delivery delay (days) |
| `LEAD_TIME_JITTER` | 1 | Stochastic jitter (±) on lead time |
| `WRITE_OFF_RATE` | 0.00143 | Daily spoilage fraction |
| `SELLING_PRICE` | 10.0 | Revenue per unit sold |
| `UNIT_COST` | 4.0 | Cost per unit ordered |
| `FIXED_ORDER_COST` | 50.0 | Fixed cost per order placed |
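For example, `LEAD_TIME` and `LEAD_TIME_JITTER` combine into a per-order delivery delay; a sketch of how the jitter might be sampled (the uniform distribution and the 1-day floor are assumptions, the actual sampling lives in `order_processor.py`):

```python
import random

LEAD_TIME = 3         # from config.py
LEAD_TIME_JITTER = 1  # from config.py

def sample_lead_time(rng=random):
    """Delivery delay in days with uniform +/- jitter, floored at 1 day (assumed)."""
    return max(1, LEAD_TIME + rng.randint(-LEAD_TIME_JITTER, LEAD_TIME_JITTER))
```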
## GRPO Training (LLM fine-tuning)
Requires a GPU (run on Colab/Kaggle). Start the HTTP environment server first, then:
```bash
# Terminal 1 β€” start the environment server
uvicorn server.inventory_env:app --port 7860
# Terminal 2 β€” train
python agent/train_grpo.py \
--base-model Qwen/Qwen2.5-3B-Instruct \
--base-url http://localhost:7860 \
--n-iterations 5 \
--episodes-per-iter 20 \
--output-dir ./grpo_inventory
```
Each iteration: collect rollouts → compute P&L rewards → GRPO update → save LoRA adapter.
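At the core of the GRPO update is a group-relative advantage: rewards are normalized within each group of rollouts for the same prompt. A minimal illustrative sketch (not the `train_grpo.py` implementation):

```python
import statistics

def grpo_advantages(group_rewards):
    """Advantage of each rollout = (reward - group mean) / group std.

    Rollouts that beat their group's average get positive advantage;
    a zero-variance group (all rewards equal) yields all-zero advantages.
    """
    mu = statistics.mean(group_rewards)
    sigma = statistics.pstdev(group_rewards) or 1.0  # guard against div-by-zero
    return [(r - mu) / sigma for r in group_rewards]
```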
## OpenEnv HTTP API
| Endpoint | Method | Description |
|---|---|---|
| `/reset?env_type={0-3}` | POST | Start a new episode, returns `InventoryObservation` |
| `/step` | POST | Send `{"reorder_point": float, "reasoning": str}`, returns `StepResult` |
| `/state` | GET | Episode metadata (day, fill_rate, done) |
**env_type**: 0=GammaPoisson, 1=GammaGammaHighVariance, 2=SpikingDemand, 3=SingleGammaLowVariance
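Since the endpoints are plain HTTP, an agent in any language can drive them. A standard-library sketch of building the `/step` request (the payload shape follows the table above; `step_request` is a hypothetical helper, not part of the project):

```python
import json
from urllib.request import Request, urlopen

ENV_TYPES = {0: "GammaPoisson", 1: "GammaGammaHighVariance",
             2: "SpikingDemand", 3: "SingleGammaLowVariance"}

def step_request(base_url, reorder_point, reasoning=""):
    """Build the POST /step request; body matches the API table above."""
    body = json.dumps({"reorder_point": float(reorder_point),
                       "reasoning": reasoning}).encode()
    return Request(f"{base_url}/step", data=body,
                   headers={"Content-Type": "application/json"}, method="POST")

# Sending it (requires a running server):
#   with urlopen(step_request("http://localhost:7860", 42.0)) as resp:
#       result = json.load(resp)
```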
### Connecting an external agent
```python
import asyncio

from client.inventory_client import InventoryEnvClient, InventoryAction

async def run_agent():
    async with InventoryEnvClient("https://ademarteau-rl-inventory-simulations.hf.space") as env:
        obs = await env.reset(env_type=0)
        while obs.days_remaining > 0:
            # Simple safety-stock policy over the 30-day demand stats
            rop = obs.demand_mean_30d * 3 + obs.demand_std_30d * 1.65
            result = await env.step(InventoryAction(reorder_point=rop))
            obs = result.observation
        print(f"Final fill rate: {obs.fill_rate_so_far:.3f}")

asyncio.run(run_agent())
```
### Local setup
```bash
pip install -r requirements.txt
# Run the Gradio UI
python app.py
# Or run just the HTTP environment server
uvicorn server.inventory_env:app --reload --port 7860
```
### Docker
```bash
docker build -t inventory-sim .
docker run -p 7860:7860 inventory-sim
```