metadata
title: Shopify Store Audit
emoji: 🛒
colorFrom: green
colorTo: blue
sdk: docker
app_port: 8000
tags:
  - openenv

Shopify Store Audit & Remediation: OpenEnv Environment

Train AI agents to find and fix real e-commerce store issues through the Shopify Admin API.

Motivation

Store auditing is a $5K–$15K consulting service that Shopify merchants regularly pay for. Every store accumulates issues: missing product descriptions, broken pricing, SEO gaps, inventory discrepancies, empty collections, stuck orders. This environment uses real Shopify product data (45 products from actual CSV exports) and lets AI agents learn to diagnose and fix them through API operations that map 1:1 to Shopify Admin GraphQL mutations.

Why this matters for the agent community:

  • Real data: 45 products loaded from real Shopify CSV exports (apparel + jewelry catalogs)
  • 184 discoverable issues: auto-scanned from real data quality gaps + synthetic injections
  • Randomised episodes: different issues sampled each reset (seeded for reproducibility)
  • Shaped rewards: discovery, partial fix, efficiency bonus, regression & repetition penalties
  • Genuine difficulty progression: hint level scales from guided to fully autonomous exploration
  • 19 API commands (7 queries, 12 mutations) mirroring real Shopify Admin GraphQL operations

How It Works

The environment loads real Shopify product exports (apparel.csv, jewelery.csv) containing 45 products across apparel, bags, footwear, jewelry, outdoor gear, and home goods. An IssuePool scans the catalog and discovers real data quality issues (0/45 products have SEO titles, 0/45 have image alt text, 20/20 jewelry products have no SKUs, plus handle typos and formatting artifacts). Synthetic issues (corrupted prices, draft products, negative inventory) are generated on top.

On each reset(seed=N), the pool randomly samples 8, 12, or 20 issues depending on the task (easy, medium, or hard). A different seed yields a different set of issues, and the same seed reproduces the same episode. The agent must discover and fix them through API commands.
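The seeded sampling described above can be illustrated with a minimal sketch. It is not the IssuePool implementation: the function name `sample_issues`, the task keys, and the placeholder issue IDs are assumptions made for illustration, and only the pool size (184) and the per-task counts (8, 12, 20) come from this README.

```python
import random

# Per-task issue counts from this README; the keys are hypothetical labels.
ISSUES_PER_TASK = {"easy": 8, "medium": 12, "hard": 20}


def sample_issues(pool, task, seed):
    """Return a reproducible random subset of the issue pool for one episode."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    return rng.sample(pool, ISSUES_PER_TASK[task])


# Stand-in pool: 184 placeholder IDs instead of real scanned issues.
pool = [f"issue-{i:03d}" for i in range(184)]

episode_a = sample_issues(pool, "easy", seed=1)
episode_b = sample_issues(pool, "easy", seed=1)  # same seed, same issues
episode_c = sample_issues(pool, "easy", seed=2)  # new seed, new issues
```

Using a dedicated `random.Random(seed)` instance rather than `random.seed()` keeps episodes reproducible even when other code draws from the global generator.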

Difficulty Tiers

The three tasks aren't just "more items"; they differ in how much the agent is told:

| Task | Issues | Max steps | query_store_health returns | Agent must... |
|------|--------|-----------|----------------------------|---------------|
| Easy | 8 | 25 | Each issue + suggested command name | Fill in the right params |
| Medium | 12 | 35 | Issue descriptions only | Figure out which command AND params |
| Hard | 20 | 50 | Only category counts (e.g. "16 SEO issues") | Explore, discover, diagnose, and fix |
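To make the three hint levels concrete, here is a hedged sketch of how a health report could be shaped per tier. The function `render_store_health`, the issue dictionaries, and their keys are hypothetical and do not reflect the actual response format of query_store_health.

```python
from collections import Counter


def render_store_health(issues, hint_level):
    """Shape a diagnostic report according to the task's hint level."""
    if hint_level == "easy":
        # Each issue plus the name of the command that fixes it.
        return [
            {"description": i["description"], "suggested_command": i["command"]}
            for i in issues
        ]
    if hint_level == "medium":
        # Descriptions only; the agent must pick the command itself.
        return [{"description": i["description"]} for i in issues]
    if hint_level == "hard":
        # Only how many issues exist per category.
        return dict(Counter(i["category"] for i in issues))
    raise ValueError(f"unknown hint level: {hint_level}")


# Made-up issues for illustration only.
issues = [
    {"description": "Product has no SEO title", "command": "update_product_seo", "category": "seo"},
    {"description": "Image has no alt text", "command": "update_image_alt_text", "category": "seo"},
    {"description": "Variant price is corrupted", "command": "update_variant", "category": "pricing"},
]

hard_view = render_store_health(issues, "hard")
```

On the hard tier the agent sees only aggregate counts, so it has to query individual products to locate each issue.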

Reward Function

Multi-signal shaped reward that provides gradient throughout the episode:

| Signal | Reward | When |
|--------|--------|------|
| Full fix | +1/N | Issue fully resolved (N = total issues) |
| Partial fix | +0.03 | Mutation targets the right resource but wrong value |
| Discovery | +0.02 | First query of a resource that has an issue |
| Efficiency bonus | +0.01 | Fixing without querying that resource first |
| Query cost | -0.005 | Exploration has a small cost |
| Failed mutation | -0.01 | Wrong resource or field targeted |
| Repetition | -0.02 | Exact same command+params sent again |
| Regression | -0.15 | Broke something that was previously correct |

This means a weak agent that explores but fails to fix still earns discovery rewards. A careless agent that breaks things gets punished. A perfect agent earns close to 1.0.
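As a worked example, the signals can be tallied for a short episode. This sketch assumes the signals simply add up; how the environment actually combines, caps, or normalises them is not specified here, and the event names and `episode_return` helper are hypothetical.

```python
# Per-signal values copied from the reward table; event names are hypothetical.
REWARDS = {
    "partial_fix": 0.03,
    "discovery": 0.02,
    "efficiency_bonus": 0.01,
    "query_cost": -0.005,
    "failed_mutation": -0.01,
    "repetition": -0.02,
    "regression": -0.15,
}


def episode_return(events, total_issues):
    """Sum the shaped rewards for a list of events (assumed to be additive)."""
    total = 0.0
    for event in events:
        if event == "full_fix":
            total += 1.0 / total_issues  # +1/N per fully resolved issue
        else:
            total += REWARDS[event]
    return total


# One query that discovers an issue, one full fix, then a repeated command,
# on an 8-issue task: -0.005 + 0.02 + 0.125 - 0.02
events = ["query_cost", "discovery", "full_fix", "repetition"]
total = episode_return(events, total_issues=8)
print(round(total, 3))  # 0.12
```

A single regression (-0.15) outweighs a full fix on the 8-issue task (+0.125), which is why careless mutations are costly.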

Action Space

Actions are JSON objects with a command and params:

{"command": "update_product_seo", "params": {"product_id": "ayers-chambray", "seo_title": "Ayres Chambray | Store"}}

| Command | Type | Description |
|---------|------|-------------|
| query_products | Query | List/filter products (params: status, search, product_type, limit) |
| query_product | Query | Get product detail (params: product_id) |
| query_collections | Query | List all collections |
| query_collection | Query | Get collection detail (params: collection_id) |
| query_inventory | Query | Get inventory levels (params: product_id, location_id) |
| query_orders | Query | List orders (params: fulfillment_status) |
| query_store_health | Query | Diagnostic overview (detail varies by difficulty) |
| update_product | Mutation | Update product fields (description, status, tags) |
| update_variant | Mutation | Update variant (price, compare_at_price, sku) |
| update_product_seo | Mutation | Set SEO title/description |
| update_image_alt_text | Mutation | Set image alt text |
| add_product_image | Mutation | Add image to a product |
| update_collection | Mutation | Update collection fields/rules |
| add_product_to_collection | Mutation | Add product to collection |
| remove_product_from_collection | Mutation | Remove product from collection |
| adjust_inventory | Mutation | Set inventory quantity at location |
| update_metafield | Mutation | Set metafield value |
| publish_product | Mutation | Set product status to active |
| update_order | Mutation | Update order fulfillment status |
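A client can guard against typos in command names before sending an action. The helper below is a hypothetical client-side sketch, not part of the environment; only the command names and the `command`/`params` JSON shape come from this README, and how the server itself handles unknown commands is not covered here.

```python
import json

# Command names from the table above.
QUERY_COMMANDS = {
    "query_products", "query_product", "query_collections", "query_collection",
    "query_inventory", "query_orders", "query_store_health",
}
MUTATION_COMMANDS = {
    "update_product", "update_variant", "update_product_seo",
    "update_image_alt_text", "add_product_image", "update_collection",
    "add_product_to_collection", "remove_product_from_collection",
    "adjust_inventory", "update_metafield", "publish_product", "update_order",
}
ALL_COMMANDS = QUERY_COMMANDS | MUTATION_COMMANDS


def make_action(command, **params):
    """Serialize an action as JSON, rejecting command names not in the table."""
    if command not in ALL_COMMANDS:
        raise ValueError(f"unknown command: {command}")
    return json.dumps({"command": command, "params": params})


action = make_action("query_product", product_id="ayers-chambray")
```

Commands without parameters, such as query_store_health, serialize with an empty `params` object.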

Observation Space

| Field | Type | Description |
|-------|------|-------------|
| message | str | Human-readable result description |
| data | dict | Structured API response data |
| issues_remaining | int | Unfixed issues count |
| issues_fixed | int | Issues fixed so far |
| total_issues | int | Total issues in task |
| store_health_score | float | Store health (0.0–1.0) |
| available_commands | list[str] | Available commands |
| task_name | str | Current task ID |
| done | bool | Whether episode has ended |
| reward | float | Step reward (shaped, multi-signal) |
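For agents that want typed access to these fields, the table can be mirrored in a small container. This sketch uses a standard-library dataclass to stay dependency-free, whereas the project's models.py uses Pydantic; the `progress` helper and the sample payload are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Field names and types follow the observation table above."""
    message: str
    data: dict
    issues_remaining: int
    issues_fixed: int
    total_issues: int
    store_health_score: float
    available_commands: list[str]
    task_name: str
    done: bool
    reward: float


def progress(obs):
    """Fraction of the task's issues fixed so far (0.0 when there are none)."""
    return obs.issues_fixed / obs.total_issues if obs.total_issues else 0.0


# Made-up payload; real values come from the environment's step response.
payload = {
    "message": "SEO title updated",
    "data": {"product_id": "ayers-chambray"},
    "issues_remaining": 6,
    "issues_fixed": 2,
    "total_issues": 8,
    "store_health_score": 0.25,
    "available_commands": ["query_products", "update_product_seo"],
    "task_name": "product_listing_qa",
    "done": False,
    "reward": 0.125,
}

obs = Observation(**payload)
```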

Baseline Scores

| Task | Model | Score | Steps | Behavior |
|------|-------|-------|-------|----------|
| product_listing_qa | gpt-4o | 99% | 16/25 | Reads hints, fixes all 8 issues efficiently |
| seo_collection_optimization | gpt-4o | 99% | 28/35 | Investigates then fixes, figures out commands from descriptions |
| full_store_audit | gpt-4o | 1% | 50/50 | Gets stuck: can't reason from category counts to specific fixes |

The hard task is where the gpt-4o baseline breaks down: with only category counts to go on, it uses all 50 steps and scores 1%. An agent trained via RL on this environment would need to learn exploration strategies that gpt-4o doesn't exhibit out of the box.

Setup Instructions

Prerequisites

  • Python 3.10+
  • Docker
  • openenv-core (pip install openenv-core)

Local Development

cd /path/to/project
pip install -e .

# Start server
uvicorn server.app:app --host 0.0.0.0 --port 8000

# Test
curl http://localhost:8000/health
curl http://localhost:8000/tasks
curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{}'

Docker

docker build -t shopify-store-audit .
docker run -p 8000:8000 shopify-store-audit

Run Inference

export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o"
export HF_TOKEN="your-api-key"
export ENV_URL="http://localhost:8000"

python inference.py

Validate

openenv validate

Shopify API Mapping

Every environment command is modeled on a real Shopify Admin GraphQL operation. The table lists the mapping for 13 of them:

| Environment Command | Shopify GraphQL Equivalent |
|---------------------|----------------------------|
| update_product | productUpdate mutation |
| update_variant | productVariantUpdate mutation |
| update_product_seo | productUpdate (seo fields) |
| update_image_alt_text | productImageUpdate mutation |
| add_product_image | productCreateMedia mutation |
| update_collection | collectionUpdate mutation |
| add_product_to_collection | collectionAddProducts mutation |
| adjust_inventory | inventoryAdjustQuantities mutation |
| update_metafield | metafieldsSet mutation |
| publish_product | publishablePublish mutation |
| update_order | orderUpdate mutation |
| query_products | products query |
| query_inventory | inventoryLevels query |

The command patterns agents learn here are intended to transfer to real Shopify store management via Shopify MCP or Shopify CLI.

Architecture

├── apparel.csv, jewelery.csv    # Real Shopify product exports (45 products)
├── models.py                    # Pydantic Action & Observation types
├── client.py                    # EnvClient for WebSocket connection
├── openenv.yaml                 # OpenEnv spec metadata
├── pyproject.toml               # Dependencies
├── Dockerfile                   # Container definition
├── inference.py                 # Baseline agent (runs all 3 tasks)
├── test_live.py                 # WebSocket integration test
└── server/
    ├── app.py                   # FastAPI + /tasks + /grade endpoints
    ├── shopify_store_audit_environment.py  # Environment (reset/step/state)
    ├── store.py                 # CSV loader, IssuePool, ShopifyStore CRUD
    ├── tasks.py                 # TaskConfig (num_issues, hint_level, categories)
    └── graders.py               # Per-task grading functions

Extensibility

The architecture supports connecting to a real Shopify store via the Admin GraphQL API. The ShopifyStore class can be subclassed with a LiveShopifyStore that makes real API calls instead of in-memory mutations. Environment variables SHOPIFY_STORE_URL and SHOPIFY_ACCESS_TOKEN would enable live mode. The action space and observation format remain identical, so the agent doesn't know which mode it's in.
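One way the mode switch could look is sketched below. Everything here is an assumption for illustration: the `ShopifyStore` shown is a stripped-down stand-in for the class in server/store.py, its `update_product` signature is invented, and `make_store` is a hypothetical factory. The live subclass deliberately makes no network calls.

```python
import os


class ShopifyStore:
    """Stand-in for the in-memory store; the real class is in server/store.py."""

    def __init__(self):
        self.products = {}

    def update_product(self, product_id, **fields):
        # In-memory mutation: merge the new fields into the product record.
        self.products.setdefault(product_id, {}).update(fields)
        return self.products[product_id]


class LiveShopifyStore(ShopifyStore):
    """Hypothetical live variant with the same interface as the stand-in."""

    def __init__(self, store_url, access_token):
        super().__init__()
        self.store_url = store_url
        self.access_token = access_token

    def update_product(self, product_id, **fields):
        # A real implementation would send a productUpdate mutation here.
        raise NotImplementedError("live Admin GraphQL call not implemented")


def make_store(env=None):
    """Pick live mode only when both environment variables are set."""
    env = os.environ if env is None else env
    url = env.get("SHOPIFY_STORE_URL")
    token = env.get("SHOPIFY_ACCESS_TOKEN")
    if url and token:
        return LiveShopifyStore(url, token)
    return ShopifyStore()


store = make_store({})  # no credentials, so the in-memory store is used
```

Because both classes expose the same methods, the environment code that calls the store would not need to branch on the mode.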

License

MIT