---
title: Inventory Reasoning Environment
emoji: 📦
colorFrom: blue
colorTo: green
sdk: docker
tags:
  - openenv
  - reinforcement-learning
  - inventory-optimization
  - long-horizon
license: apache-2.0
---

# Inventory Simulations

A stochastic inventory simulation that pits rule-based agents against an LLM agent (Qwen2.5-72B) trained with GRPO reinforcement learning. Episodes run over a 2-year (730-day) horizon, with a live Gradio UI on HF Spaces.

## Overview

The environment simulates day-to-day inventory decisions under stochastic demand. At each step an agent sets a reorder point, the inventory level that triggers a replenishment order. The goal is to maximize fill rate (≥95%) while minimizing holding costs, write-offs, and stockout penalties.

A P&L-based reward function is used throughout:

```text
daily_reward = (revenue - holding_cost - stockout_penalty
                - order_cost - writeoff_cost) / baseline_profit
```
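The formula can be sketched in Python. The parameter names below are illustrative, not the actual signature in `reward.py`:

```python
def daily_reward(revenue, holding_cost, stockout_penalty,
                 order_cost, writeoff_cost, baseline_profit):
    """Normalize the day's P&L by a baseline profit so rewards are
    comparable across demand environments (1.0 = baseline-level day)."""
    pnl = revenue - holding_cost - stockout_penalty - order_cost - writeoff_cost
    return pnl / baseline_profit

# A day exactly at baseline profit scores 1.0:
r = daily_reward(revenue=500.0, holding_cost=40.0, stockout_penalty=0.0,
                 order_cost=50.0, writeoff_cost=10.0, baseline_profit=400.0)
# pnl = 400.0, so r = 1.0
```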

## Agents

| Agent | Strategy |
|---|---|
| Historical Mean (Baseline) | ROP = mean historical demand × lead time |
| Safety Stock | Adds a normal-quantile safety buffer on top of the historical mean |
| Forecast | Uses future distribution means + safety stock on forecast error |
| Monte Carlo | Samples lead-time demand distributions; uses the service-level quantile |
| LLM (Qwen2.5-72B) | Calls the HF Inference API every 5 days; reasons over demand trend, pending orders, fill rate; outputs `reorder_point` as JSON |

The LLM agent can be fine-tuned locally with GRPO (agent/train_grpo.py) to produce a LoRA adapter that replaces the base Qwen model.
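The Safety Stock rule from the table can be sketched as follows. The function name and the i.i.d.-days assumption for lead-time variance are illustrative; the actual rule lives in `agent_environment.py`:

```python
from statistics import NormalDist

def safety_stock_rop(mean_daily_demand, std_daily_demand,
                     lead_time, service_level=0.95):
    """Reorder point = expected lead-time demand plus a normal-quantile
    safety buffer, as in the Safety Stock agent."""
    z = NormalDist().inv_cdf(service_level)              # ~1.645 at 95%
    lead_time_mean = mean_daily_demand * lead_time
    lead_time_std = std_daily_demand * lead_time ** 0.5  # assumes i.i.d. days
    return lead_time_mean + z * lead_time_std

rop = safety_stock_rop(mean_daily_demand=100.0, std_daily_demand=20.0, lead_time=3)
# 300 + 1.645 * 20 * sqrt(3) ≈ 357
```

With zero demand variance the buffer vanishes and the rule collapses to the Historical Mean baseline (mean demand × lead time).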

## Demand Environments

| Environment | Distribution |
|---|---|
| GammaPoisson | 90/10 mixture of Gamma and Poisson |
| GammaGammaHighVariance | 50/50 mixture of two Gamma distributions (bimodal) |
| SpikingDemand | Gamma with occasional demand spikes |
| SingleGammaLowVariance | Single Gamma, low variance |

All environments apply seasonality multipliers (by month and weekday) to the base scale.
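The seasonality mechanism can be sketched like this. The multiplier values and function name are made up for illustration; the real tables live in the demand environment classes:

```python
import random

# Illustrative multipliers only -- e.g. a December peak and a weekend uplift.
MONTH_MULT = {12: 1.4, 1: 0.8}
WEEKDAY_MULT = {5: 1.2, 6: 1.2}   # Sat=5, Sun=6

def seasonal_gamma_demand(base_scale, shape, month, weekday, rng=random):
    """Scale the base Gamma scale parameter by month and weekday
    multipliers before sampling the day's demand."""
    scale = base_scale * MONTH_MULT.get(month, 1.0) * WEEKDAY_MULT.get(weekday, 1.0)
    return rng.gammavariate(shape, scale)
```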

## Project Structure

```text
├── app.py                     # Gradio UI (Baseline + LLM tabs)
├── config.py                  # Global constants
├── reward.py                  # Unified P&L reward function
├── demand_environment.py      # Demand distribution classes
├── demand_calculator.py       # Per-day demand sampling
├── inventory_manager.py       # Inventory state, reordering, write-offs
├── order_processor.py         # Order queue with stochastic lead time
├── performance_tracker.py     # Fill rate, write-offs, lost sales
├── agent_environment.py       # Rule-based agent classes
├── server/inventory_env.py    # FastAPI HTTP environment (OpenEnv API)
├── client/inventory_client.py # Async Python client for the HTTP env
├── agent/
│   ├── train_grpo.py          # GRPO fine-tuning of Qwen2.5-3B-Instruct
│   ├── finetune_agent.py      # Local inference with optional LoRA adapter
│   └── llm_agent_runner.py    # CLI runner for the LLM agent
└── Dockerfile                 # HF Spaces container
```

## Key Parameters (`config.py`)

| Parameter | Default | Description |
|---|---|---|
| `SIM_DAYS` | 730 | Total simulation horizon (days) |
| `HISTO_DAYS` | 365 | Warm-up history before decisions begin |
| `LEAD_TIME` | 3 | Order-to-delivery delay (days) |
| `LEAD_TIME_JITTER` | 1 | ± stochastic jitter on the lead time |
| `WRITE_OFF_RATE` | 0.00143 | Daily spoilage fraction |
| `SELLING_PRICE` | 10.0 | Revenue per unit sold |
| `UNIT_COST` | 4.0 | Cost per unit ordered |
| `FIXED_ORDER_COST` | 50.0 | Fixed cost per order placed |
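To get a feel for the default economics, a quick back-of-the-envelope calculation (derived from the table above, not from `config.py` itself):

```python
SELLING_PRICE = 10.0
UNIT_COST = 4.0
WRITE_OFF_RATE = 0.00143

margin_per_unit = SELLING_PRICE - UNIT_COST      # 6.0 gross margin per unit sold

# A unit sitting in stock spoils at 0.143% per day, so over a full
# year the surviving fraction of untouched stock is:
survival_1y = (1 - WRITE_OFF_RATE) ** 365
# ≈ 0.59, i.e. roughly 40% of stock that never sells is written off per year
```

This is why over-ordering is punished: idle inventory bleeds value even before holding costs are counted.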

## GRPO Training (LLM fine-tuning)

Requires a GPU (run on Colab/Kaggle). Start the HTTP environment server first, then:

```bash
# Terminal 1 -- start the environment server
uvicorn server.inventory_env:app --port 7860

# Terminal 2 -- train
python agent/train_grpo.py \
    --base-model Qwen/Qwen2.5-3B-Instruct \
    --base-url http://localhost:7860 \
    --n-iterations 5 \
    --episodes-per-iter 20 \
    --output-dir ./grpo_inventory
```

Each iteration: collect rollouts → compute P&L rewards → GRPO update → save the LoRA adapter.
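The core of a GRPO update is standardizing each rollout's reward within its group before using it as an advantage. A minimal sketch of that step (the actual training loop is in `agent/train_grpo.py`):

```python
from statistics import mean, stdev

def grpo_advantages(group_rewards, eps=1e-8):
    """GRPO-style advantages: standardize episode rewards within a
    group of rollouts sampled from the same policy, so above-average
    rollouts get positive advantages and below-average get negative."""
    mu = mean(group_rewards)
    sigma = stdev(group_rewards) if len(group_rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in group_rewards]

advs = grpo_advantages([0.9, 1.1, 1.0, 0.8])
# advantages sum to ~0; the 1.1 rollout is reinforced, the 0.8 one penalized
```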

## OpenEnv HTTP API

| Endpoint | Method | Description |
|---|---|---|
| `/reset?env_type={0-3}` | POST | Start a new episode; returns `InventoryObservation` |
| `/step` | POST | Send `{"reorder_point": float, "reasoning": str}`; returns `StepResult` |
| `/state` | GET | Episode metadata (day, fill_rate, done) |

`env_type`: 0 = GammaPoisson, 1 = GammaGammaHighVariance, 2 = SpikingDemand, 3 = SingleGammaLowVariance
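For clients that talk to the API directly rather than through the async client below, the request shapes look like this (helper names here are hypothetical; only the endpoint paths and the `/step` body come from the API table above):

```python
import json

BASE_URL = "http://localhost:7860"  # or the hosted Space URL

def reset_url(env_type):
    """URL for starting a new episode; env_type selects the demand model."""
    assert env_type in (0, 1, 2, 3)
    return f"{BASE_URL}/reset?env_type={env_type}"

def step_body(reorder_point, reasoning=""):
    """JSON body expected by POST /step."""
    return json.dumps({"reorder_point": float(reorder_point),
                       "reasoning": reasoning})
```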

### Connecting an external agent

```python
import asyncio
from client.inventory_client import InventoryEnvClient, InventoryAction

async def run_agent():
    async with InventoryEnvClient("https://ademarteau-rl-inventory-simulations.hf.space") as env:
        obs = await env.reset(env_type=0)
        while obs.days_remaining > 0:
            # Simple safety-stock heuristic: lead-time demand + z * std
            rop = obs.demand_mean_30d * 3 + obs.demand_std_30d * 1.65
            result = await env.step(InventoryAction(reorder_point=rop))
            obs = result.observation
        print(f"Final fill rate: {obs.fill_rate_so_far:.3f}")

asyncio.run(run_agent())
```

## Local setup

```bash
pip install -r requirements.txt

# Run the Gradio UI
python app.py

# Or run just the HTTP environment server
uvicorn server.inventory_env:app --reload --port 7860
```

## Docker

```bash
docker build -t inventory-sim .
docker run -p 7860:7860 inventory-sim
```