---
title: Shopify Store Audit
emoji: 🛒
colorFrom: green
colorTo: blue
sdk: docker
app_port: 8000
tags:
  - openenv
---

# Shopify Store Audit & Remediation: OpenEnv Environment

Train AI agents to find and fix real e-commerce store issues through the Shopify Admin API.

## Motivation

Store auditing is a **$5K–$15K consulting service** that Shopify merchants regularly pay for. Every store accumulates issues: missing product descriptions, broken pricing, SEO gaps, inventory discrepancies, empty collections, stuck orders. This environment uses **real Shopify product data** (45 products from actual CSV exports) and lets AI agents learn to diagnose and fix these issues through API operations that map 1:1 to Shopify Admin GraphQL operations.

**Why this matters for the agent community:**

- **Real data**: 45 products loaded from real Shopify CSV exports (apparel + jewelry catalogs)
- **184 discoverable issues**: auto-scanned from real data-quality gaps, plus synthetic injections
- **Randomised episodes**: different issues sampled on each reset (seeded for reproducibility)
- **Shaped rewards**: discovery, partial fix, efficiency bonus, regression and repetition penalties
- **Genuine difficulty progression**: hint level scales from guided to fully autonomous exploration
- **19 API commands** (7 queries, 12 mutations) mirroring real Shopify Admin GraphQL operations

## How It Works

The environment loads real Shopify product exports (`apparel.csv`, `jewelery.csv`) containing 45 products across apparel, bags, footwear, jewelry, outdoor gear, and home goods. An `IssuePool` scans the catalog and discovers **real data quality issues** (0/45 products have SEO titles, 0/45 have image alt text, 20/20 jewelry products have no SKUs, plus handle typos and formatting artifacts). Synthetic issues (corrupted prices, draft products, negative inventory) are generated on top.

On each `reset(seed=N)`, the pool randomly samples 8, 12, or 20 issues depending on the task. A different seed produces a different set of issues.
The agent must discover and fix them through API commands.

### Difficulty Tiers

The three tasks aren't just "more items"; they differ in **how much the agent is told**:

| Task | Issues | Steps | `query_store_health` returns | Agent must... |
|------|--------|-------|------------------------------|---------------|
| **Easy** | 8 | 25 | Each issue + suggested command name | Fill in the right params |
| **Medium** | 12 | 35 | Issue descriptions only | Figure out which command AND params |
| **Hard** | 20 | 50 | Only category counts (e.g. "16 SEO issues") | Explore, discover, diagnose, and fix |

### Reward Function

Multi-signal shaped reward that provides gradient throughout the episode:

| Signal | Reward | When |
|--------|--------|------|
| **Full fix** | `+1/N` | Issue fully resolved (N = total issues) |
| **Partial fix** | `+0.03` | Mutation targets the right resource but wrong value |
| **Discovery** | `+0.02` | First query of a resource that has an issue |
| **Efficiency bonus** | `+0.01` | Fixing without querying that resource first |
| **Query cost** | `-0.005` | Exploration has a small cost |
| **Failed mutation** | `-0.01` | Wrong resource or field targeted |
| **Repetition** | `-0.02` | Exact same command+params sent again |
| **Regression** | `-0.15` | Broke something that was previously correct |

This means a weak agent that explores but fails to fix still earns discovery rewards. A careless agent that breaks things gets punished. A perfect agent earns close to 1.0.
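As a rough illustration of how the signals in the reward table could combine on a single step, here is a minimal Python sketch. `StepOutcome` and `step_reward` are hypothetical names, not part of the environment's code, and summing the signals additively is an assumption; only the numeric values come from the table.

```python
from dataclasses import dataclass

# Values copied from the reward table.
PARTIAL_FIX = 0.03
DISCOVERY = 0.02
EFFICIENCY_BONUS = 0.01
QUERY_COST = -0.005
FAILED_MUTATION = -0.01
REPETITION = -0.02
REGRESSION = -0.15


@dataclass
class StepOutcome:
    """Hypothetical summary of what one step did (not the environment's real type)."""
    is_query: bool = False
    discovered_issue: bool = False
    full_fix: bool = False
    partial_fix: bool = False
    fixed_without_query: bool = False
    failed_mutation: bool = False
    repeated: bool = False
    regression: bool = False


def step_reward(outcome: StepOutcome, total_issues: int) -> float:
    """Add up every signal that fired on this step (additive combination is assumed)."""
    reward = 0.0
    if outcome.is_query:
        reward += QUERY_COST
    if outcome.discovered_issue:
        reward += DISCOVERY
    if outcome.full_fix:
        reward += 1.0 / total_issues  # the +1/N term
        if outcome.fixed_without_query:
            reward += EFFICIENCY_BONUS
    elif outcome.partial_fix:
        reward += PARTIAL_FIX
    if outcome.failed_mutation:
        reward += FAILED_MUTATION
    if outcome.repeated:
        reward += REPETITION
    if outcome.regression:
        reward += REGRESSION
    return reward


# Easy task (N = 8): a first query that reveals an issue, then a full fix.
print(round(step_reward(StepOutcome(is_query=True, discovered_issue=True), 8), 3))  # 0.015
print(round(step_reward(StepOutcome(full_fix=True), 8), 3))  # 0.125
```

Under these assumptions, a single regression (`-0.15`) wipes out more than one full fix on the easy task (`+0.125`), which matches the intent that careless agents are punished.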
## Action Space

Actions are JSON objects with a `command` and `params`:

```json
{"command": "update_product_seo", "params": {"product_id": "ayers-chambray", "seo_title": "Ayres Chambray | Store"}}
```

| Command | Type | Description |
|---------|------|-------------|
| `query_products` | Query | List/filter products (params: `status`, `search`, `product_type`, `limit`) |
| `query_product` | Query | Get product detail (params: `product_id`) |
| `query_collections` | Query | List all collections |
| `query_collection` | Query | Get collection detail (params: `collection_id`) |
| `query_inventory` | Query | Get inventory levels (params: `product_id`, `location_id`) |
| `query_orders` | Query | List orders (params: `fulfillment_status`) |
| `query_store_health` | Query | Diagnostic overview (detail varies by difficulty) |
| `update_product` | Mutation | Update product fields (description, status, tags) |
| `update_variant` | Mutation | Update variant (price, compare_at_price, sku) |
| `update_product_seo` | Mutation | Set SEO title/description |
| `update_image_alt_text` | Mutation | Set image alt text |
| `add_product_image` | Mutation | Add image to a product |
| `update_collection` | Mutation | Update collection fields/rules |
| `add_product_to_collection` | Mutation | Add product to collection |
| `remove_product_from_collection` | Mutation | Remove product from collection |
| `adjust_inventory` | Mutation | Set inventory quantity at location |
| `update_metafield` | Mutation | Set metafield value |
| `publish_product` | Mutation | Set product status to active |
| `update_order` | Mutation | Update order fulfillment status |

## Observation Space

| Field | Type | Description |
|-------|------|-------------|
| `message` | `str` | Human-readable result description |
| `data` | `dict` | Structured API response data |
| `issues_remaining` | `int` | Unfixed issues count |
| `issues_fixed` | `int` | Issues fixed so far |
| `total_issues` | `int` | Total issues in task |
| `store_health_score` | `float` | Store health (0.0–1.0) |
| `available_commands` | `list[str]` | Available commands |
| `task_name` | `str` | Current task ID |
| `done` | `bool` | Whether episode has ended |
| `reward` | `float` | Step reward (shaped, multi-signal) |

## Baseline Scores

| Task | Model | Score | Steps | Behavior |
|------|-------|-------|-------|----------|
| `product_listing_qa` | gpt-4o | **99%** | 16/25 | Reads hints, fixes all 8 issues efficiently |
| `seo_collection_optimization` | gpt-4o | **99%** | 28/35 | Investigates then fixes, figures out commands from descriptions |
| `full_store_audit` | gpt-4o | **1%** | 50/50 | Gets stuck: can't reason from category counts to specific fixes |

The hard task genuinely challenges frontier models. An agent trained via RL on this environment would need to learn exploration strategies that gpt-4o doesn't exhibit out of the box.

## Setup Instructions

### Prerequisites

- Python 3.10+
- Docker
- `openenv-core` (`pip install openenv-core`)

### Local Development

```bash
cd /path/to/project
pip install -e .

# Start server
uvicorn server.app:app --host 0.0.0.0 --port 8000

# Test
curl http://localhost:8000/health
curl http://localhost:8000/tasks
curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{}'
```

### Docker

```bash
docker build -t shopify-store-audit .
docker run -p 8000:8000 shopify-store-audit
```

### Run Inference

```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o"
export HF_TOKEN="your-api-key"
export ENV_URL="http://localhost:8000"
python inference.py
```

### Validate

```bash
openenv validate
```

## Shopify API Mapping

Every environment command maps to a real Shopify Admin GraphQL operation:

| Environment Command | Shopify GraphQL Equivalent |
|---|---|
| `update_product` | `productUpdate` mutation |
| `update_variant` | `productVariantUpdate` mutation |
| `update_product_seo` | `productUpdate` (seo fields) |
| `update_image_alt_text` | `productImageUpdate` mutation |
| `add_product_image` | `productCreateMedia` mutation |
| `update_collection` | `collectionUpdate` mutation |
| `add_product_to_collection` | `collectionAddProducts` mutation |
| `adjust_inventory` | `inventoryAdjustQuantities` mutation |
| `update_metafield` | `metafieldsSet` mutation |
| `publish_product` | `publishablePublish` mutation |
| `update_order` | `orderUpdate` mutation |
| `query_products` | `products` query |
| `query_inventory` | `inventoryLevels` query |

Agents trained here learn patterns directly transferable to real Shopify store management via [Shopify MCP](https://github.com/Shopify/shopify-mcp) or Shopify CLI.
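The mapping table can also be kept as a plain lookup, which is handy when logging which GraphQL operation an agent's action corresponds to. The sketch below is illustrative only: `COMMAND_TO_GRAPHQL` and `graphql_operation` are hypothetical names, the operation names are copied from the table, and commands the table does not list return `None`.

```python
from typing import Optional

# Operation names copied from the mapping table; nothing here calls Shopify.
COMMAND_TO_GRAPHQL = {
    "update_product": "productUpdate",
    "update_variant": "productVariantUpdate",
    "update_product_seo": "productUpdate",
    "update_image_alt_text": "productImageUpdate",
    "add_product_image": "productCreateMedia",
    "update_collection": "collectionUpdate",
    "add_product_to_collection": "collectionAddProducts",
    "adjust_inventory": "inventoryAdjustQuantities",
    "update_metafield": "metafieldsSet",
    "publish_product": "publishablePublish",
    "update_order": "orderUpdate",
    "query_products": "products",
    "query_inventory": "inventoryLevels",
}


def graphql_operation(action: dict) -> Optional[str]:
    """Return the GraphQL operation name for an action, or None if the table omits it."""
    return COMMAND_TO_GRAPHQL.get(action.get("command"))


# The example action from the Action Space section.
action = {
    "command": "update_product_seo",
    "params": {"product_id": "ayers-chambray", "seo_title": "Ayres Chambray | Store"},
}
print(graphql_operation(action))  # productUpdate
```

Returning `None` rather than raising keeps the lookup usable for commands such as `query_orders`, which the table does not cover.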
## Architecture

```
├── apparel.csv, jewelery.csv               # Real Shopify product exports (45 products)
├── models.py                               # Pydantic Action & Observation types
├── client.py                               # EnvClient for WebSocket connection
├── openenv.yaml                            # OpenEnv spec metadata
├── pyproject.toml                          # Dependencies
├── Dockerfile                              # Container definition
├── inference.py                            # Baseline agent (runs all 3 tasks)
├── test_live.py                            # WebSocket integration test
└── server/
    ├── app.py                              # FastAPI + /tasks + /grade endpoints
    ├── shopify_store_audit_environment.py  # Environment (reset/step/state)
    ├── store.py                            # CSV loader, IssuePool, ShopifyStore CRUD
    ├── tasks.py                            # TaskConfig (num_issues, hint_level, categories)
    └── graders.py                          # Per-task grading functions
```

## Extensibility

The architecture supports connecting to a **real Shopify store** via the Admin GraphQL API. The `ShopifyStore` class can be subclassed with a `LiveShopifyStore` that makes real API calls instead of in-memory mutations. Environment variables `SHOPIFY_STORE_URL` and `SHOPIFY_ACCESS_TOKEN` would enable live mode. The action space and observation format remain identical; the agent doesn't know which mode it's in.

## License

MIT