---
title: Inventory Reasoning Environment
emoji: 📦
colorFrom: blue
colorTo: green
sdk: docker
tags:
  - openenv
  - reinforcement-learning
  - inventory-optimization
  - long-horizon
license: apache-2.0
---
# Inventory Simulations
Stochastic inventory simulation comparing rule-based agents against an LLM agent (Qwen2.5-72B) trained with GRPO reinforcement learning. Runs over a 2-year (730-day) horizon with a live Gradio UI on HF Spaces.
## Overview
The environment simulates day-to-day inventory decisions under stochastic demand. At each step an agent sets a reorder point: the inventory level that triggers a replenishment order. The goal is to maximize fill rate (≥ 95%) while minimizing holding costs, write-offs, and stockout penalties.
A P&L-based reward function is used throughout:
```
daily_reward = (revenue - holding_cost - stockout_penalty - order_cost - writeoff_cost)
               / baseline_profit
```
## Agents
| Agent | Strategy |
|---|---|
| Historical Mean (Baseline) | ROP = mean historical demand × lead time |
| Safety Stock | Adds a normal-quantile safety buffer on top of the historical mean |
| Forecast | Uses future distribution means + safety stock on forecast error |
| Monte Carlo | Samples lead-time demand distributions; uses the service-level quantile |
| LLM (Qwen2.5-72B) | Calls the HF Inference API every 5 days; reasons over demand trend, pending orders, fill rate; outputs `reorder_point` as JSON |
The LLM agent can be fine-tuned locally with GRPO (`agent/train_grpo.py`) to produce a LoRA adapter that replaces the base Qwen model.
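For intuition, the Safety Stock policy from the table can be sketched as follows. Names and the exact formula are illustrative; the real agent classes live in `agent_environment.py`:

```python
from statistics import NormalDist

def safety_stock_rop(mean_daily_demand: float, std_daily_demand: float,
                     lead_time: int, service_level: float = 0.95) -> float:
    """Reorder point = expected lead-time demand plus a normal-quantile
    safety buffer covering demand variability over the lead time."""
    z = NormalDist().inv_cdf(service_level)       # ~1.645 for a 95% service level
    lead_time_mean = mean_daily_demand * lead_time
    lead_time_std = std_daily_demand * lead_time ** 0.5
    return lead_time_mean + z * lead_time_std

# 100 units/day mean, 20 units/day std, 3-day lead time -> ROP around 357
print(round(safety_stock_rop(100.0, 20.0, 3), 1))
```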
## Demand Environments
| Environment | Distribution |
|---|---|
| `GammaPoisson` | 90/10 mixture of Gamma and Poisson |
| `GammaGammaHighVariance` | 50/50 mixture of two Gamma distributions (bimodal) |
| `SpikingDemand` | Gamma with occasional demand spikes |
| `SingleGammaLowVariance` | Single Gamma, low variance |
All environments apply seasonality multipliers (by month and weekday) to the base scale.
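As an illustration, the seasonal adjustment amounts to multiplying the base scale by per-month and per-weekday factors. The multiplier values below are invented; the real tables live in `demand_environment.py`:

```python
# Hypothetical seasonality tables (actual values live in demand_environment.py)
MONTH_MULT = {12: 1.3, 1: 0.9}      # e.g. December peak, January lull
WEEKDAY_MULT = {5: 1.2, 6: 1.1}     # e.g. busier weekends (Monday = 0)

def seasonal_scale(base_scale: float, month: int, weekday: int) -> float:
    """Scale the base demand parameter by month and weekday multipliers."""
    return base_scale * MONTH_MULT.get(month, 1.0) * WEEKDAY_MULT.get(weekday, 1.0)

print(seasonal_scale(100.0, month=12, weekday=5))  # 100 * 1.3 * 1.2 = 156.0
```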
## Project Structure
```
├── app.py                      # Gradio UI (Baseline + LLM tabs)
├── config.py                   # Global constants
├── reward.py                   # Unified P&L reward function
├── demand_environment.py       # Demand distribution classes
├── demand_calculator.py        # Per-day demand sampling
├── inventory_manager.py        # Inventory state, reordering, write-offs
├── order_processor.py          # Order queue with stochastic lead time
├── performance_tracker.py      # Fill rate, write-offs, lost sales
├── agent_environment.py        # Rule-based agent classes
├── server/inventory_env.py     # FastAPI HTTP environment (OpenEnv API)
├── client/inventory_client.py  # Async Python client for the HTTP env
├── agent/
│   ├── train_grpo.py           # GRPO fine-tuning of Qwen2.5-3B-Instruct
│   ├── finetune_agent.py       # Local inference with optional LoRA adapter
│   └── llm_agent_runner.py     # CLI runner for the LLM agent
└── Dockerfile                  # HF Spaces container
```
## Key Parameters (`config.py`)
| Parameter | Default | Description |
|---|---|---|
| `SIM_DAYS` | 730 | Total simulation horizon (days) |
| `HISTO_DAYS` | 365 | Warm-up history before decisions begin |
| `LEAD_TIME` | 3 | Order-to-delivery delay (days) |
| `LEAD_TIME_JITTER` | 1 | ± stochastic jitter on lead time |
| `WRITE_OFF_RATE` | 0.00143 | Daily spoilage fraction |
| `SELLING_PRICE` | 10.0 | Revenue per unit sold |
| `UNIT_COST` | 4.0 | Cost per unit ordered |
| `FIXED_ORDER_COST` | 50.0 | Fixed cost per order placed |
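Taken together, `LEAD_TIME` and `LEAD_TIME_JITTER` mean each order arrives after a stochastic delay. A minimal sketch of one plausible sampling scheme (the actual logic lives in `order_processor.py` and may differ):

```python
import random

LEAD_TIME = 3         # base order-to-delivery delay (days)
LEAD_TIME_JITTER = 1  # +/- uniform integer jitter

def sample_lead_time(rng: random.Random) -> int:
    """Uniform integer jitter of +/- LEAD_TIME_JITTER around LEAD_TIME,
    clamped so delivery always takes at least one day."""
    jitter = rng.randint(-LEAD_TIME_JITTER, LEAD_TIME_JITTER)
    return max(1, LEAD_TIME + jitter)

rng = random.Random(0)
print([sample_lead_time(rng) for _ in range(5)])  # values in {2, 3, 4}
```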
## GRPO Training (LLM fine-tuning)
Requires a GPU (run on Colab/Kaggle). Start the HTTP environment server first, then:
```shell
# Terminal 1 - start the environment server
uvicorn server.inventory_env:app --port 7860

# Terminal 2 - train
python agent/train_grpo.py \
  --base-model Qwen/Qwen2.5-3B-Instruct \
  --base-url http://localhost:7860 \
  --n-iterations 5 \
  --episodes-per-iter 20 \
  --output-dir ./grpo_inventory
```
Each iteration: collect rollouts → compute P&L rewards → GRPO update → save LoRA adapter.
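The "GRPO update" step scores each episode's reward relative to its rollout group. A simplified sketch of that group-relative advantage computation (not the actual `agent/train_grpo.py` code):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: center each rollout's P&L reward on the
    group mean and scale by the group standard deviation."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Three rollouts from one iteration: below-average, average, above-average
print(group_relative_advantages([0.8, 1.0, 1.2]))  # approximately [-1.0, 0.0, 1.0]
```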
## OpenEnv HTTP API
| Endpoint | Method | Description |
|---|---|---|
| `/reset?env_type={0-3}` | POST | Start a new episode; returns `InventoryObservation` |
| `/step` | POST | Send `{"reorder_point": float, "reasoning": str}`; returns `StepResult` |
| `/state` | GET | Episode metadata (day, fill_rate, done) |
`env_type`: 0 = GammaPoisson, 1 = GammaGammaHighVariance, 2 = SpikingDemand, 3 = SingleGammaLowVariance
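For reference, the `/step` exchange is plain JSON. The request shape matches the table above; the `StepResult` field names in this sketch are illustrative, not the server's actual schema:

```python
import json

# Request body for POST /step (shape from the API table above)
action = {"reorder_point": 350.0, "reasoning": "demand trending up; cover lead time"}
body = json.dumps(action)

# Hypothetical StepResult response; actual field names may differ
raw = '{"reward": 0.91, "done": false, "observation": {"day": 42, "fill_rate_so_far": 0.96}}'
result = json.loads(raw)

print(json.loads(body)["reorder_point"], result["observation"]["day"])  # 350.0 42
```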
### Connecting an external agent
```python
import asyncio
from client.inventory_client import InventoryEnvClient, InventoryAction

async def run_agent():
    async with InventoryEnvClient("https://ademarteau-rl-inventory-simulations.hf.space") as env:
        obs = await env.reset(env_type=0)
        while obs.days_remaining > 0:
            # Simple safety-stock policy: 3-day mean demand + 95% buffer
            rop = obs.demand_mean_30d * 3 + obs.demand_std_30d * 1.65
            result = await env.step(InventoryAction(reorder_point=rop))
            obs = result.observation
        print(f"Final fill rate: {obs.fill_rate_so_far:.3f}")

asyncio.run(run_agent())
```
## Local setup
```shell
pip install -r requirements.txt

# Run the Gradio UI
python app.py

# Or run just the HTTP environment server
uvicorn server.inventory_env:app --reload --port 7860
```
## Docker
```shell
docker build -t inventory-sim .
docker run -p 7860:7860 inventory-sim
```