---
title: Shopify Store Audit
emoji: 🛒
colorFrom: green
colorTo: blue
sdk: docker
app_port: 8000
tags:
- openenv
---
# Shopify Store Audit & Remediation: OpenEnv Environment
Train AI agents to find and fix real e-commerce store issues through the Shopify Admin API.
## Motivation
Store auditing is a **$5K–$15K consulting service** that Shopify merchants regularly pay for. Every store accumulates issues: missing product descriptions, broken pricing, SEO gaps, inventory discrepancies, empty collections, stuck orders. This environment uses **real Shopify product data** (45 products from actual CSV exports) and lets AI agents learn to diagnose and fix these issues through API operations that map 1:1 to Shopify Admin GraphQL mutations.
**Why this matters for the agent community:**
- **Real data**: 45 products loaded from real Shopify CSV exports (apparel + jewelry catalogs)
- **184 discoverable issues**: auto-scanned from real data quality gaps + synthetic injections
- **Randomised episodes**: different issues sampled each reset (seeded for reproducibility)
- **Shaped rewards**: discovery, partial fix, efficiency bonus, regression & repetition penalties
- **Genuine difficulty progression**: hint level scales from guided to fully autonomous exploration
- **19 API commands** (7 queries, 12 mutations) mirroring real Shopify Admin GraphQL operations
## How It Works
The environment loads real Shopify product exports (`apparel.csv`, `jewelery.csv`) containing 45 products across apparel, bags, footwear, jewelry, outdoor gear, and home goods. An `IssuePool` scans the catalog and discovers **real data quality issues** (0/45 products have SEO titles, 0/45 have image alt text, 20/20 jewelry products have no SKUs, plus handle typos and formatting artifacts). Synthetic issues (corrupted prices, draft products, negative inventory) are generated on top.
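On each `reset(seed=N)`, the pool randomly samples 8, 12, or 20 issues, depending on the task (easy, medium, or hard). A different seed yields a different set of issues, while the same seed reproduces the same episode. The agent must discover and fix the sampled issues through API commands.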
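
The seeded sampling described above can be sketched as follows. This is a minimal illustration and not the environment's actual code: the `Issue` dataclass, the `sample_issues` helper, and the toy pool are hypothetical names introduced here. Only the pool size (184) and the per-task issue counts come from this README.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """Hypothetical issue record; the real IssuePool fields may differ."""
    issue_id: str
    category: str


def sample_issues(pool: list[Issue], num_issues: int, seed: int) -> list[Issue]:
    """Draw `num_issues` distinct issues; the same seed gives the same draw."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    return rng.sample(pool, num_issues)


# Toy pool standing in for the 184 scanned and injected issues.
pool = [Issue(f"issue-{i}", "seo" if i % 2 else "pricing") for i in range(184)]

easy_a = sample_issues(pool, 8, seed=42)
easy_b = sample_issues(pool, 8, seed=42)  # identical to easy_a
hard = sample_issues(pool, 20, seed=42)   # larger draw for the hard task
```

Using a dedicated `random.Random` instance rather than the module-level functions keeps episodes reproducible even if other code consumes random numbers in between.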
### Difficulty Tiers
The three tasks aren't just "more items"; they differ in **how much the agent is told**:
| Task | Issues | Steps | `query_store_health` returns | Agent must... |
|------|--------|-------|------------------------------|---------------|
| **Easy** | 8 | 25 | Each issue + suggested command name | Fill in the right params |
| **Medium** | 12 | 35 | Issue descriptions only | Figure out which command AND params |
| **Hard** | 20 | 50 | Only category counts (e.g. "16 SEO issues") | Explore, discover, diagnose, and fix |
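
As a rough illustration of the table above, the sketch below shows how a diagnostic response could shrink as the hint level rises. The function name `store_health`, the hint-level strings, the response keys, and the sample issues are all assumptions made for this example; the real `query_store_health` payload may be shaped differently.

```python
from collections import Counter

# Hypothetical issue records: (description, category, suggested command).
ISSUES = [
    ("Product 'ayers-chambray' has no SEO title", "seo", "update_product_seo"),
    ("Product 'ayers-chambray' has no image alt text", "seo", "update_image_alt_text"),
    ("Variant price is corrupted", "pricing", "update_variant"),
]


def store_health(issues, hint_level: str) -> dict:
    """Return a diagnostic payload whose detail depends on the difficulty tier."""
    if hint_level == "easy":
        # Each issue plus the command that fixes it.
        return {
            "issues": [
                {"description": desc, "suggested_command": cmd}
                for desc, _, cmd in issues
            ]
        }
    if hint_level == "medium":
        # Descriptions only; the agent must choose the command itself.
        return {"issues": [{"description": desc} for desc, _, _ in issues]}
    if hint_level == "hard":
        # Only per-category counts; the agent must explore to find specifics.
        return {"category_counts": dict(Counter(cat for _, cat, _ in issues))}
    raise ValueError(f"unknown hint level: {hint_level!r}")
```

With these three sample issues, the hard tier would report only that there are two SEO issues and one pricing issue, leaving the agent to locate them with `query_products` and `query_product`.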
### Reward Function
The reward is shaped from multiple signals, so the agent receives a learning signal throughout the episode rather than only at the end:
| Signal | Reward | When |
|--------|--------|------|
| **Full fix** | `+1/N` | Issue fully resolved (N = total issues) |
| **Partial fix** | `+0.03` | Mutation targets the right resource but wrong value |
| **Discovery** | `+0.02` | First query of a resource that has an issue |
| **Efficiency bonus** | `+0.01` | Fixing without querying that resource first |
| **Query cost** | `-0.005` | Exploration has a small cost |
| **Failed mutation** | `-0.01` | Wrong resource or field targeted |
| **Repetition** | `-0.02` | Exact same command+params sent again |
| **Regression** | `-0.15` | Broke something that was previously correct |
This means a weak agent that explores but fails to fix still earns discovery rewards. A careless agent that breaks things gets punished. A perfect agent earns close to 1.0.
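
To make the arithmetic concrete, here is a small worked example using the constants from the table. It assumes the signals simply add up per event; whether the environment combines or clips them differently is not stated here, and the event names and the `episode_return` helper are hypothetical.

```python
# Reward constants copied from the table above.
FIXED_SIGNALS = {
    "partial_fix": 0.03,
    "discovery": 0.02,
    "efficiency_bonus": 0.01,
    "query_cost": -0.005,
    "failed_mutation": -0.01,
    "repetition": -0.02,
    "regression": -0.15,
}


def episode_return(events: list[str], total_issues: int) -> float:
    """Sum per-event rewards, assuming the signals are additive."""
    signals = dict(FIXED_SIGNALS)
    signals["full_fix"] = 1.0 / total_issues  # +1/N per fully resolved issue
    return sum(signals[event] for event in events)


# Easy task (N = 8): two queries, one discovery, one full fix, one repeated command.
events = ["query_cost", "discovery", "query_cost", "full_fix", "repetition"]
total = episode_return(events, total_issues=8)
print(round(total, 3))  # 0.115
```

The single full fix (+0.125) dominates the small exploration costs, while one regression (-0.15) would more than cancel it.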
## Action Space
Actions are JSON objects with a `command` and `params`:
```json
{"command": "update_product_seo", "params": {"product_id": "ayers-chambray", "seo_title": "Ayres Chambray | Store"}}
```
| Command | Type | Description |
|---------|------|-------------|
| `query_products` | Query | List/filter products (params: `status`, `search`, `product_type`, `limit`) |
| `query_product` | Query | Get product detail (params: `product_id`) |
| `query_collections` | Query | List all collections |
| `query_collection` | Query | Get collection detail (params: `collection_id`) |
| `query_inventory` | Query | Get inventory levels (params: `product_id`, `location_id`) |
| `query_orders` | Query | List orders (params: `fulfillment_status`) |
| `query_store_health` | Query | Diagnostic overview (detail varies by difficulty) |
| `update_product` | Mutation | Update product fields (description, status, tags) |
| `update_variant` | Mutation | Update variant (price, compare_at_price, sku) |
| `update_product_seo` | Mutation | Set SEO title/description |
| `update_image_alt_text` | Mutation | Set image alt text |
| `add_product_image` | Mutation | Add image to a product |
| `update_collection` | Mutation | Update collection fields/rules |
| `add_product_to_collection` | Mutation | Add product to collection |
| `remove_product_from_collection` | Mutation | Remove product from collection |
| `adjust_inventory` | Mutation | Set inventory quantity at location |
| `update_metafield` | Mutation | Set metafield value |
| `publish_product` | Mutation | Set product status to active |
| `update_order` | Mutation | Update order fulfillment status |
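
A client may want to check an action before sending it. The sketch below parses the JSON form shown above and rejects unknown commands. The command names are taken from the table; the `parse_action` and `is_mutation` helpers are hypothetical and not part of the environment's API.

```python
import json

QUERY_COMMANDS = {
    "query_products", "query_product", "query_collections", "query_collection",
    "query_inventory", "query_orders", "query_store_health",
}
MUTATION_COMMANDS = {
    "update_product", "update_variant", "update_product_seo",
    "update_image_alt_text", "add_product_image", "update_collection",
    "add_product_to_collection", "remove_product_from_collection",
    "adjust_inventory", "update_metafield", "publish_product", "update_order",
}
ALL_COMMANDS = QUERY_COMMANDS | MUTATION_COMMANDS


def parse_action(raw: str) -> tuple[str, dict]:
    """Parse a JSON action and check its command against the table above."""
    action = json.loads(raw)
    if not isinstance(action, dict):
        raise ValueError("action must be a JSON object")
    command = action.get("command")
    params = action.get("params", {})
    if command not in ALL_COMMANDS:
        raise ValueError(f"unknown command: {command!r}")
    if not isinstance(params, dict):
        raise ValueError("params must be a JSON object")
    return command, params


def is_mutation(command: str) -> bool:
    """True for commands that change store state."""
    return command in MUTATION_COMMANDS


raw = '{"command": "update_product_seo", "params": {"product_id": "ayers-chambray", "seo_title": "Ayres Chambray | Store"}}'
command, params = parse_action(raw)
```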
## Observation Space
| Field | Type | Description |
|-------|------|-------------|
| `message` | `str` | Human-readable result description |
| `data` | `dict` | Structured API response data |
| `issues_remaining` | `int` | Unfixed issues count |
| `issues_fixed` | `int` | Issues fixed so far |
| `total_issues` | `int` | Total issues in task |
| `store_health_score` | `float` | Store health (0.0–1.0) |
| `available_commands` | `list[str]` | Available commands |
| `task_name` | `str` | Current task ID |
| `done` | `bool` | Whether episode has ended |
| `reward` | `float` | Step reward (shaped, multi-signal) |
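
For client-side code, the fields above can be mirrored in a small typed container. This is a sketch using a standard-library dataclass rather than the Pydantic models in `models.py`; the `from_dict` helper, the `fraction_fixed` property, and every value in the sample payload are made up for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class Observation:
    """Client-side mirror of the observation fields listed above."""
    message: str
    data: dict
    issues_remaining: int
    issues_fixed: int
    total_issues: int
    store_health_score: float
    available_commands: list[str]
    task_name: str
    done: bool
    reward: float

    @classmethod
    def from_dict(cls, payload: dict) -> "Observation":
        # Ignore keys this sketch does not know about.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in payload.items() if k in known})

    @property
    def fraction_fixed(self) -> float:
        """Share of the episode's issues resolved so far."""
        return self.issues_fixed / self.total_issues if self.total_issues else 0.0


# Invented sample payload, not real environment output.
obs = Observation.from_dict({
    "message": "SEO title updated",
    "data": {"product_id": "ayers-chambray"},
    "issues_remaining": 6,
    "issues_fixed": 2,
    "total_issues": 8,
    "store_health_score": 0.25,
    "available_commands": ["query_products", "update_product_seo"],
    "task_name": "product_listing_qa",
    "done": False,
    "reward": 0.125,
    "extra_key": "ignored",
})
```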
## Baseline Scores
| Task | Model | Score | Steps | Behavior |
|------|-------|-------|-------|----------|
| `product_listing_qa` | gpt-4o | **99%** | 16/25 | Reads hints, fixes all 8 issues efficiently |
| `seo_collection_optimization` | gpt-4o | **99%** | 28/35 | Investigates then fixes, figures out commands from descriptions |
| `full_store_audit` | gpt-4o | **1%** | 50/50 | Gets stuck: can't reason from category counts to specific fixes |
The hard task genuinely challenges frontier models. An agent trained via RL on this environment would need to learn exploration strategies that gpt-4o doesn't exhibit out of the box.
## Setup Instructions
### Prerequisites
- Python 3.10+
- Docker
- `openenv-core` (`pip install openenv-core`)
### Local Development
```bash
cd /path/to/project
pip install -e .
# Start server
uvicorn server.app:app --host 0.0.0.0 --port 8000
# Test
curl http://localhost:8000/health
curl http://localhost:8000/tasks
curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{}'
```
### Docker
```bash
docker build -t shopify-store-audit .
docker run -p 8000:8000 shopify-store-audit
```
### Run Inference
```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o"
export HF_TOKEN="your-api-key"
export ENV_URL="http://localhost:8000"
python inference.py
```
### Validate
```bash
openenv validate
```
## Shopify API Mapping
Every environment command maps to a real Shopify Admin GraphQL operation. The table below lists the main mappings:
| Environment Command | Shopify GraphQL Equivalent |
|---|---|
| `update_product` | `productUpdate` mutation |
| `update_variant` | `productVariantUpdate` mutation |
| `update_product_seo` | `productUpdate` (seo fields) |
| `update_image_alt_text` | `productImageUpdate` mutation |
| `add_product_image` | `productCreateMedia` mutation |
| `update_collection` | `collectionUpdate` mutation |
| `add_product_to_collection` | `collectionAddProducts` mutation |
| `adjust_inventory` | `inventoryAdjustQuantities` mutation |
| `update_metafield` | `metafieldsSet` mutation |
| `publish_product` | `publishablePublish` mutation |
| `update_order` | `orderUpdate` mutation |
| `query_products` | `products` query |
| `query_inventory` | `inventoryLevels` query |
Agents trained here learn patterns directly transferable to real Shopify store management via [Shopify MCP](https://github.com/Shopify/shopify-mcp) or Shopify CLI.
## Architecture
```
├── apparel.csv, jewelery.csv   # Real Shopify product exports (45 products)
├── models.py                   # Pydantic Action & Observation types
├── client.py                   # EnvClient for WebSocket connection
├── openenv.yaml                # OpenEnv spec metadata
├── pyproject.toml              # Dependencies
├── Dockerfile                  # Container definition
├── inference.py                # Baseline agent (runs all 3 tasks)
├── test_live.py                # WebSocket integration test
└── server/
    ├── app.py                                # FastAPI + /tasks + /grade endpoints
    ├── shopify_store_audit_environment.py    # Environment (reset/step/state)
    ├── store.py                              # CSV loader, IssuePool, ShopifyStore CRUD
    ├── tasks.py                              # TaskConfig (num_issues, hint_level, categories)
    └── graders.py                            # Per-task grading functions
```
## Extensibility
The architecture supports connecting to a **real Shopify store** via the Admin GraphQL API. The `ShopifyStore` class can be subclassed with a `LiveShopifyStore` that makes real API calls instead of in-memory mutations. Environment variables `SHOPIFY_STORE_URL` and `SHOPIFY_ACCESS_TOKEN` would enable live mode. The action space and observation format remain identical, so the agent doesn't know which mode it's in.
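
One possible shape for that switch is sketched below. The class bodies are simplified stand-ins rather than the code in `server/store.py`, the `make_store` factory is a hypothetical name, and the live subclass deliberately makes no network calls.

```python
import os
from collections.abc import Mapping


class ShopifyStore:
    """Simplified stand-in for the in-memory store."""

    def __init__(self) -> None:
        self.products = {"ayers-chambray": {"seo_title": None}}

    def update_product_seo(self, product_id: str, seo_title: str) -> dict:
        self.products[product_id]["seo_title"] = seo_title
        return {"product_id": product_id, "seo_title": seo_title}


class LiveShopifyStore(ShopifyStore):
    """Hypothetical live variant with the same interface as ShopifyStore."""

    def __init__(self, store_url: str, access_token: str) -> None:
        super().__init__()
        self.store_url = store_url
        self.access_token = access_token

    def update_product_seo(self, product_id: str, seo_title: str) -> dict:
        # A real implementation would send a productUpdate mutation here.
        raise NotImplementedError("live mode is not implemented in this sketch")


def make_store(env: Mapping[str, str] = os.environ) -> ShopifyStore:
    """Pick live mode only when both environment variables are set."""
    url = env.get("SHOPIFY_STORE_URL")
    token = env.get("SHOPIFY_ACCESS_TOKEN")
    if url and token:
        return LiveShopifyStore(url, token)
    return ShopifyStore()
```

Because both classes expose the same methods, the environment's `step` logic could call `store.update_product_seo(...)` without knowing which backend it holds.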
## License
MIT