---
title: Shopify Store Audit
emoji: π
colorFrom: green
colorTo: blue
sdk: docker
app_port: 8000
tags:
  - openenv
---

# Shopify Store Audit & Remediation: OpenEnv Environment

Train AI agents to find and fix real e-commerce store issues through the Shopify Admin API.

## Motivation

Store auditing is a **$5K to $15K consulting service** that Shopify merchants regularly pay for. Every store accumulates issues: missing product descriptions, broken pricing, SEO gaps, inventory discrepancies, empty collections, stuck orders. This environment uses **real Shopify product data** (45 products from actual CSV exports) and lets AI agents learn to diagnose and fix these issues through API operations modeled on Shopify Admin GraphQL queries and mutations.

**Why this matters for the agent community:**

- **Real data**: 45 products loaded from real Shopify CSV exports (apparel + jewelry catalogs)
- **184 discoverable issues**: auto-scanned from real data quality gaps, plus synthetic injections
- **Randomised episodes**: different issues sampled on each reset (seeded for reproducibility)
- **Shaped rewards**: discovery, partial fix, efficiency bonus, regression and repetition penalties
- **Genuine difficulty progression**: the hint level scales from guided to fully autonomous exploration
- **19 API commands** (7 queries, 12 mutations) mirroring real Shopify Admin GraphQL operations

## How It Works

The environment loads real Shopify product exports (`apparel.csv`, `jewelery.csv`) containing 45 products across apparel, bags, footwear, jewelry, outdoor gear, and home goods. An `IssuePool` scans the catalog and discovers **real data quality issues**: 0/45 products have SEO titles, 0/45 have image alt text, and 20/20 jewelry products have no SKUs, in addition to handle typos and formatting artifacts. Synthetic issues (corrupted prices, draft products, negative inventory) are generated on top.

On each `reset(seed=N)`, the pool randomly samples 8, 12, or 20 issues depending on the task (easy, medium, or hard). A different seed yields a different set of issues. The agent must discover and fix them through API commands.

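
The seeded sampling described above can be sketched as follows. This is a minimal illustration, not the environment's actual `IssuePool` code: the `Issue` dataclass, the `sample_issues` helper, and the placeholder pool contents are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """Hypothetical issue record; the real pool's fields may differ."""
    issue_id: str
    category: str


def sample_issues(pool, num_issues, seed):
    """Draw num_issues distinct issues using a private, seeded RNG."""
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    return rng.sample(pool, num_issues)


# Placeholder pool with the same size as the 184 issues the README reports.
pool = [Issue(f"issue-{i}", "seo" if i % 2 else "pricing") for i in range(184)]

easy_a = sample_issues(pool, 8, seed=42)
easy_b = sample_issues(pool, 8, seed=42)
print(easy_a == easy_b)  # True: the same seed reproduces the same episode
```

Using a private `random.Random` instance rather than the module-level functions keeps episodes reproducible even if other code also draws random numbers.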
### Difficulty Tiers

The three tasks are not just "more items"; they differ in **how much the agent is told**:

| Task | Issues | Steps | `query_store_health` returns | Agent must... |
|------|--------|-------|------------------------------|---------------|
| **Easy** | 8 | 25 | Each issue + suggested command name | Fill in the right params |
| **Medium** | 12 | 35 | Issue descriptions only | Figure out which command AND params |
| **Hard** | 20 | 50 | Only category counts (e.g. "16 SEO issues") | Explore, discover, diagnose, and fix |

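
One way to picture the three hint levels is a single function that shapes the diagnostic payload differently per tier. The sketch below is an assumption-laden illustration: the `render_health` name, the hint-level strings, and the issue dictionary keys are hypothetical and not taken from the environment's code.

```python
from collections import Counter


def render_health(issues, hint_level):
    """Shape the query_store_health payload by hint level (hypothetical names)."""
    if hint_level == "guided":  # easy: issue plus suggested command
        return [
            {"description": i["description"], "suggested_command": i["command"]}
            for i in issues
        ]
    if hint_level == "descriptions":  # medium: descriptions only
        return [{"description": i["description"]} for i in issues]
    if hint_level == "counts":  # hard: category counts only
        return dict(Counter(i["category"] for i in issues))
    raise ValueError(f"unknown hint level: {hint_level}")


issues = [
    {"description": "Missing SEO title", "command": "update_product_seo", "category": "seo"},
    {"description": "Missing alt text", "command": "update_image_alt_text", "category": "seo"},
    {"description": "Negative price", "command": "update_variant", "category": "pricing"},
]
print(render_health(issues, "counts"))  # {'seo': 2, 'pricing': 1}
```

In the hard tier the agent receives nothing that names a product, which is why it has to follow up with the other query commands.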
### Reward Function

The reward is shaped from multiple signals so that it provides gradient throughout the episode:

| Signal | Reward | When |
|--------|--------|------|
| **Full fix** | `+1/N` | Issue fully resolved (N = total issues) |
| **Partial fix** | `+0.03` | Mutation targets the right resource but wrong value |
| **Discovery** | `+0.02` | First query of a resource that has an issue |
| **Efficiency bonus** | `+0.01` | Fixing without querying that resource first |
| **Query cost** | `-0.005` | Exploration has a small cost |
| **Failed mutation** | `-0.01` | Wrong resource or field targeted |
| **Repetition** | `-0.02` | Exact same command+params sent again |
| **Regression** | `-0.15` | Broke something that was previously correct |

This means a weak agent that explores but fails to fix still earns discovery rewards. A careless agent that breaks things gets punished. A perfect agent earns close to 1.0.

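
To make the arithmetic concrete, here is a small sketch that adds up the signals from the table for a single step. It assumes that signals triggered in the same step simply sum (for example, a query that both costs `-0.005` and earns the `+0.02` discovery reward); the README does not state how the environment combines them, and the `step_reward` helper and signal names are hypothetical.

```python
# Fixed-size signals from the reward table; "full_fix" is handled separately
# because it depends on the number of issues in the task.
REWARDS = {
    "partial_fix": 0.03,
    "discovery": 0.02,
    "efficiency_bonus": 0.01,
    "query_cost": -0.005,
    "failed_mutation": -0.01,
    "repetition": -0.02,
    "regression": -0.15,
}


def step_reward(signals, total_issues):
    """Sum the rewards of all signals fired in one step (additivity is assumed)."""
    reward = 0.0
    for signal in signals:
        if signal == "full_fix":
            reward += 1.0 / total_issues
        else:
            reward += REWARDS[signal]
    return reward


# Easy task (N = 8): a first query that hits a resource with an issue.
print(round(step_reward(["query_cost", "discovery"], 8), 4))      # 0.015
# Easy task: fixing an issue without having queried the resource first.
print(round(step_reward(["full_fix", "efficiency_bonus"], 8), 4))  # 0.135
```

Under this reading, fixing all 8 easy issues contributes exactly 1.0 from full fixes, and the query costs and bonuses move the total slightly around that value.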
## Action Space

Actions are JSON objects with a `command` and `params`:

```json
{"command": "update_product_seo", "params": {"product_id": "ayers-chambray", "seo_title": "Ayres Chambray | Store"}}
```

| Command | Type | Description |
|---------|------|-------------|
| `query_products` | Query | List/filter products (params: `status`, `search`, `product_type`, `limit`) |
| `query_product` | Query | Get product detail (params: `product_id`) |
| `query_collections` | Query | List all collections |
| `query_collection` | Query | Get collection detail (params: `collection_id`) |
| `query_inventory` | Query | Get inventory levels (params: `product_id`, `location_id`) |
| `query_orders` | Query | List orders (params: `fulfillment_status`) |
| `query_store_health` | Query | Diagnostic overview (detail varies by difficulty) |
| `update_product` | Mutation | Update product fields (description, status, tags) |
| `update_variant` | Mutation | Update variant (price, compare_at_price, sku) |
| `update_product_seo` | Mutation | Set SEO title/description |
| `update_image_alt_text` | Mutation | Set image alt text |
| `add_product_image` | Mutation | Add image to a product |
| `update_collection` | Mutation | Update collection fields/rules |
| `add_product_to_collection` | Mutation | Add product to collection |
| `remove_product_from_collection` | Mutation | Remove product from collection |
| `adjust_inventory` | Mutation | Set inventory quantity at location |
| `update_metafield` | Mutation | Set metafield value |
| `publish_product` | Mutation | Set product status to active |
| `update_order` | Mutation | Update order fulfillment status |

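
An agent-side helper for building these JSON actions might look like the sketch below. The command names come from the table above; the `make_action` function itself is a hypothetical convenience and not part of the environment's client.

```python
import json

QUERY_COMMANDS = {
    "query_products", "query_product", "query_collections", "query_collection",
    "query_inventory", "query_orders", "query_store_health",
}
MUTATION_COMMANDS = {
    "update_product", "update_variant", "update_product_seo",
    "update_image_alt_text", "add_product_image", "update_collection",
    "add_product_to_collection", "remove_product_from_collection",
    "adjust_inventory", "update_metafield", "publish_product", "update_order",
}


def make_action(command, **params):
    """Serialize an action, rejecting command names the table does not list."""
    if command not in QUERY_COMMANDS | MUTATION_COMMANDS:
        raise ValueError(f"unknown command: {command}")
    return json.dumps({"command": command, "params": params})


action = make_action(
    "update_product_seo",
    product_id="ayers-chambray",
    seo_title="Ayres Chambray | Store",
)
print(action)
```

Rejecting unknown commands on the client side saves a step that the environment would otherwise count against the episode's step budget.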
## Observation Space

| Field | Type | Description |
|-------|------|-------------|
| `message` | `str` | Human-readable result description |
| `data` | `dict` | Structured API response data |
| `issues_remaining` | `int` | Unfixed issues count |
| `issues_fixed` | `int` | Issues fixed so far |
| `total_issues` | `int` | Total issues in task |
| `store_health_score` | `float` | Store health (0.0 to 1.0) |
| `available_commands` | `list[str]` | Available commands |
| `task_name` | `str` | Current task ID |
| `done` | `bool` | Whether episode has ended |
| `reward` | `float` | Step reward (shaped, multi-signal) |

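
For agent code that consumes observations, the fields above can be mirrored in a small typed container. The project's own `models.py` uses Pydantic; the sketch below deliberately uses a standard-library dataclass to stay dependency-free, and the `from_dict` and `progress` helpers are hypothetical additions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Mirror of the observation fields listed in the table."""
    message: str
    data: dict
    issues_remaining: int
    issues_fixed: int
    total_issues: int
    store_health_score: float
    available_commands: list
    task_name: str
    done: bool
    reward: float

    @classmethod
    def from_dict(cls, payload):
        """Build an Observation from a decoded JSON payload, ignoring extra keys."""
        return cls(**{name: payload[name] for name in cls.__dataclass_fields__})

    @property
    def progress(self):
        """Fraction of issues fixed so far (0.0 when the task has no issues)."""
        return self.issues_fixed / self.total_issues if self.total_issues else 0.0


payload = {
    "message": "SEO title updated", "data": {}, "issues_remaining": 6,
    "issues_fixed": 2, "total_issues": 8, "store_health_score": 0.4,
    "available_commands": ["query_products"], "task_name": "product_listing_qa",
    "done": False, "reward": 0.125, "extra_key": "ignored",
}
obs = Observation.from_dict(payload)
print(obs.progress)  # 0.25
```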
## Baseline Scores

| Task | Model | Score | Steps | Behavior |
|------|-------|-------|-------|----------|
| `product_listing_qa` | gpt-4o | **99%** | 16/25 | Reads hints, fixes all 8 issues efficiently |
| `seo_collection_optimization` | gpt-4o | **99%** | 28/35 | Investigates then fixes, figures out commands from descriptions |
| `full_store_audit` | gpt-4o | **1%** | 50/50 | Gets stuck; cannot reason from category counts to specific fixes |

The hard task genuinely challenges a frontier model: gpt-4o uses all 50 steps and scores 1%. An agent trained via RL on this environment would need to learn exploration strategies that gpt-4o does not exhibit out of the box.

## Setup Instructions

### Prerequisites

- Python 3.10+
- Docker
- `openenv-core` (`pip install openenv-core`)

### Local Development

```bash
cd /path/to/project
pip install -e .
# Start server (runs in the foreground)
uvicorn server.app:app --host 0.0.0.0 --port 8000
# Test (from a second terminal)
curl http://localhost:8000/health
curl http://localhost:8000/tasks
curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{}'
```

### Docker

```bash
docker build -t shopify-store-audit .
docker run -p 8000:8000 shopify-store-audit
```

### Run Inference

```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o"
export HF_TOKEN="your-api-key"
export ENV_URL="http://localhost:8000"
python inference.py
```

### Validate

```bash
openenv validate
```

## Shopify API Mapping

Every environment command is modeled on a real Shopify Admin GraphQL operation; the table covers 13 of them:

| Environment Command | Shopify GraphQL Equivalent |
|---|---|
| `update_product` | `productUpdate` mutation |
| `update_variant` | `productVariantUpdate` mutation |
| `update_product_seo` | `productUpdate` (seo fields) |
| `update_image_alt_text` | `productImageUpdate` mutation |
| `add_product_image` | `productCreateMedia` mutation |
| `update_collection` | `collectionUpdate` mutation |
| `add_product_to_collection` | `collectionAddProducts` mutation |
| `adjust_inventory` | `inventoryAdjustQuantities` mutation |
| `update_metafield` | `metafieldsSet` mutation |
| `publish_product` | `publishablePublish` mutation |
| `update_order` | `orderUpdate` mutation |
| `query_products` | `products` query |
| `query_inventory` | `inventoryLevels` query |

Agents trained here learn patterns that transfer directly to real Shopify store management via [Shopify MCP](https://github.com/Shopify/shopify-mcp) or Shopify CLI.

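
The mapping table can also be kept as data, for example when translating a trained agent's actions into calls against a real store. The dictionary below copies the table's entries verbatim; the `shopify_operation` helper is a hypothetical name, and commands the table does not list return `None` rather than a guessed operation.

```python
# Environment command -> Shopify Admin GraphQL operation name, as listed in the table.
SHOPIFY_OPERATION = {
    "update_product": "productUpdate",
    "update_variant": "productVariantUpdate",
    "update_product_seo": "productUpdate",  # same mutation, seo fields
    "update_image_alt_text": "productImageUpdate",
    "add_product_image": "productCreateMedia",
    "update_collection": "collectionUpdate",
    "add_product_to_collection": "collectionAddProducts",
    "adjust_inventory": "inventoryAdjustQuantities",
    "update_metafield": "metafieldsSet",
    "publish_product": "publishablePublish",
    "update_order": "orderUpdate",
    "query_products": "products",
    "query_inventory": "inventoryLevels",
}


def shopify_operation(command):
    """Return the GraphQL operation for a command, or None if the table omits it."""
    return SHOPIFY_OPERATION.get(command)


print(shopify_operation("update_product_seo"))  # productUpdate
print(shopify_operation("query_orders"))        # None
```

Note that two environment commands (`update_product` and `update_product_seo`) share one GraphQL mutation, so the mapping is many-to-one in that case.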
## Architecture

```
├── apparel.csv, jewelery.csv               # Real Shopify product exports (45 products)
├── models.py                               # Pydantic Action & Observation types
├── client.py                               # EnvClient for WebSocket connection
├── openenv.yaml                            # OpenEnv spec metadata
├── pyproject.toml                          # Dependencies
├── Dockerfile                              # Container definition
├── inference.py                            # Baseline agent (runs all 3 tasks)
├── test_live.py                            # WebSocket integration test
└── server/
    ├── app.py                              # FastAPI + /tasks + /grade endpoints
    ├── shopify_store_audit_environment.py  # Environment (reset/step/state)
    ├── store.py                            # CSV loader, IssuePool, ShopifyStore CRUD
    ├── tasks.py                            # TaskConfig (num_issues, hint_level, categories)
    └── graders.py                          # Per-task grading functions
```

## Extensibility

The architecture supports connecting to a **real Shopify store** via the Admin GraphQL API. The `ShopifyStore` class could be subclassed as a `LiveShopifyStore` that makes real API calls instead of in-memory mutations. The environment variables `SHOPIFY_STORE_URL` and `SHOPIFY_ACCESS_TOKEN` would enable live mode. The action space and observation format would remain identical, so the agent cannot tell which mode it is in.

## License

MIT
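
A rough sketch of how such a live mode could be wired is shown below. Everything in it is hypothetical: the `ShopifyStore` here is a one-method stand-in rather than the real class in `server/store.py`, the `make_store` factory and the injected `transport` callable are illustrative names, and the variables passed to the transport are placeholders rather than the real `productUpdate` input shape. No network call is made.

```python
import os


class ShopifyStore:
    """Hypothetical minimal stand-in for the in-memory store."""

    def __init__(self):
        self.products = {}

    def update_product(self, product_id, **fields):
        # In-memory mutation: merge the new fields into the product record.
        self.products.setdefault(product_id, {}).update(fields)
        return self.products[product_id]


class LiveShopifyStore(ShopifyStore):
    """Same interface, but delegates to an injected transport callable."""

    def __init__(self, store_url, access_token, transport):
        super().__init__()
        self.store_url = store_url
        self.access_token = access_token
        # transport(url, token, operation, variables) -> dict; supplied by the caller.
        self.transport = transport

    def update_product(self, product_id, **fields):
        variables = {"id": product_id, **fields}  # placeholder payload shape
        return self.transport(self.store_url, self.access_token, "productUpdate", variables)


def make_store(env=os.environ, transport=None):
    """Pick live mode only when both variables and a transport are available."""
    url = env.get("SHOPIFY_STORE_URL")
    token = env.get("SHOPIFY_ACCESS_TOKEN")
    if url and token and transport is not None:
        return LiveShopifyStore(url, token, transport)
    return ShopifyStore()


store = make_store(env={})  # no credentials, so this is the in-memory store
print(type(store).__name__)  # ShopifyStore
```

Because both classes expose the same method signatures, the environment's `step` logic would not need to know which store it is driving.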