---
title: Shopify Store Audit
emoji: 🛒
colorFrom: green
colorTo: blue
sdk: docker
app_port: 8000
tags:
  - openenv
---

# Shopify Store Audit & Remediation — OpenEnv Environment

Train AI agents to find and fix real e-commerce store issues through the Shopify Admin API.

## Motivation

Store auditing is a **$5K–$15K consulting service** that Shopify merchants regularly pay for. Every store accumulates issues: missing product descriptions, broken pricing, SEO gaps, inventory discrepancies, empty collections, stuck orders. This environment uses **real Shopify product data** (45 products from actual CSV exports) and lets AI agents learn to diagnose and fix them through API operations that map 1:1 to Shopify Admin GraphQL mutations.

**Why this matters for the agent community:**
- **Real data** — 45 products loaded from real Shopify CSV exports (apparel + jewelry catalogs)
- **184 discoverable issues** — auto-scanned from real data quality gaps + synthetic injections
- **Randomised episodes** — different issues sampled each reset (seeded for reproducibility)
- **Shaped rewards** — discovery, partial fix, efficiency bonus, regression & repetition penalties
- **Genuine difficulty progression** — hint level scales from guided to fully autonomous exploration
- **19 API commands** (7 queries, 12 mutations) modeled on Shopify Admin GraphQL queries and mutations

## How It Works

The environment loads real Shopify product exports (`apparel.csv`, `jewelery.csv`) containing 45 products across apparel, bags, footwear, jewelry, outdoor gear, and home goods. An `IssuePool` scans the catalog and discovers **real data quality issues** (0/45 products have SEO titles, 0/45 have image alt text, 20/20 jewelry products have no SKUs, plus handle typos and formatting artifacts). Synthetic issues (corrupted prices, draft products, negative inventory) are generated on top.

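On each `reset(seed=N)`, the pool randomly samples 8, 12, or 20 issues for the easy, medium, or hard task respectively. Sampling is seeded: the same seed reproduces the same issues, and a different seed yields a different set. The agent must discover and fix them through API commands.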
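
The seeded sampling described above can be sketched in a few lines. This is a minimal illustration, not the actual `IssuePool` code: the function name `sample_issues`, the `ISSUES_PER_TASK` table, and the string stand-ins for issues are all assumptions made for the example.

```python
import random

# Issue counts per tier, taken from the README; the dict itself is hypothetical.
ISSUES_PER_TASK = {"easy": 8, "medium": 12, "hard": 20}


def sample_issues(pool, task, seed):
    """Return a reproducible random subset of the issue pool for one episode."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    return rng.sample(pool, ISSUES_PER_TASK[task])


# Stand-in for the 184 scanned issues.
pool = [f"issue-{i}" for i in range(184)]

first = sample_issues(pool, "easy", seed=7)
again = sample_issues(pool, "easy", seed=7)   # same seed, same episode
other = sample_issues(pool, "easy", seed=8)   # different seed, different draw
```

Using a dedicated `random.Random(seed)` instance rather than `random.seed()` keeps episodes reproducible even when other code in the process also draws random numbers.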

### Difficulty Tiers

The three tasks aren't just "more items" — they differ in **how much the agent is told**:

| Task | Issues | Steps | `query_store_health` returns | Agent must... |
|------|--------|-------|------------------------------|---------------|
| **Easy** | 8 | 25 | Each issue + suggested command name | Fill in the right params |
| **Medium** | 12 | 35 | Issue descriptions only | Figure out which command AND params |
| **Hard** | 20 | 50 | Only category counts (e.g. "16 SEO issues") | Explore, discover, diagnose, and fix |
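
To make the three hint levels concrete, here is a hedged sketch of how a `query_store_health` response could be rendered per tier. The function name, the issue dictionary keys (`description`, `command`, `category`), and the output strings are assumptions for illustration; the real response format may differ.

```python
from collections import Counter


def render_store_health(issues, hint_level):
    """Format open issues according to a (hypothetical) hint level."""
    if hint_level == "easy":
        # Full guidance: description plus the command that fixes it.
        return [f"{i['description']} (try: {i['command']})" for i in issues]
    if hint_level == "medium":
        # Descriptions only; the agent has to choose the command itself.
        return [i["description"] for i in issues]
    # Hard: per-category counts only, with no resource names.
    counts = Counter(i["category"] for i in issues)
    return [f"{n} {category} issues" for category, n in sorted(counts.items())]


sample = [
    {"description": "Missing SEO title", "command": "update_product_seo", "category": "seo"},
    {"description": "Negative inventory", "command": "adjust_inventory", "category": "inventory"},
]
easy_view = render_store_health(sample, "easy")
hard_view = render_store_health(sample, "hard")
```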

### Reward Function

The reward is shaped from several signals, so the agent gets feedback throughout the episode rather than only at the end:

| Signal | Reward | When |
|--------|--------|------|
| **Full fix** | `+1/N` | Issue fully resolved (N = total issues) |
| **Partial fix** | `+0.03` | Mutation targets the right resource but wrong value |
| **Discovery** | `+0.02` | First query of a resource that has an issue |
| **Efficiency bonus** | `+0.01` | Fixing without querying that resource first |
| **Query cost** | `-0.005` | Exploration has a small cost |
| **Failed mutation** | `-0.01` | Wrong resource or field targeted |
| **Repetition** | `-0.02` | Exact same command+params sent again |
| **Regression** | `-0.15` | Broke something that was previously correct |

This means a weak agent that explores but fails to fix still earns discovery rewards. A careless agent that breaks things gets punished. A perfect agent earns close to 1.0.
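
The table can be read as a simple lookup. The sketch below uses the values from the table; how signals combine within a single step is not specified here, so the example assumes they simply add. The names `REWARDS` and `step_reward` are hypothetical.

```python
# Fixed-size signals from the reward table.
REWARDS = {
    "partial_fix": 0.03,
    "discovery": 0.02,
    "efficiency_bonus": 0.01,
    "query_cost": -0.005,
    "failed_mutation": -0.01,
    "repetition": -0.02,
    "regression": -0.15,
}


def step_reward(signals, total_issues):
    """Sum the signals triggered by one step (assumed to be additive)."""
    reward = 0.0
    for signal in signals:
        if signal == "full_fix":
            reward += 1.0 / total_issues  # +1/N, N = issues in this task
        else:
            reward += REWARDS[signal]
    return reward


# Easy task (N = 8): a query that uncovers an issue, then the fix itself.
query_step = step_reward(["query_cost", "discovery"], total_issues=8)
fix_step = step_reward(["full_fix"], total_issues=8)
```

Under this assumption a discovering query nets a small positive reward (cost plus discovery), while a full fix on the easy task is worth one eighth of the episode total.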

## Action Space

Actions are JSON objects with a `command` and `params`:

```json
{"command": "update_product_seo", "params": {"product_id": "ayers-chambray", "seo_title": "Ayres Chambray | Store"}}
```

| Command | Type | Description |
|---------|------|-------------|
| `query_products` | Query | List/filter products (params: `status`, `search`, `product_type`, `limit`) |
| `query_product` | Query | Get product detail (params: `product_id`) |
| `query_collections` | Query | List all collections |
| `query_collection` | Query | Get collection detail (params: `collection_id`) |
| `query_inventory` | Query | Get inventory levels (params: `product_id`, `location_id`) |
| `query_orders` | Query | List orders (params: `fulfillment_status`) |
| `query_store_health` | Query | Diagnostic overview (detail varies by difficulty) |
| `update_product` | Mutation | Update product fields (description, status, tags) |
| `update_variant` | Mutation | Update variant (price, compare_at_price, sku) |
| `update_product_seo` | Mutation | Set SEO title/description |
| `update_image_alt_text` | Mutation | Set image alt text |
| `add_product_image` | Mutation | Add image to a product |
| `update_collection` | Mutation | Update collection fields/rules |
| `add_product_to_collection` | Mutation | Add product to collection |
| `remove_product_from_collection` | Mutation | Remove product from collection |
| `adjust_inventory` | Mutation | Set inventory quantity at location |
| `update_metafield` | Mutation | Set metafield value |
| `publish_product` | Mutation | Set product status to active |
| `update_order` | Mutation | Update order fulfillment status |
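
A client can guard against typos by checking the command name before sending an action. The helper below is an illustrative sketch: `make_action` is a hypothetical name, and the command sets are copied from the table above rather than read from the server.

```python
import json

QUERY_COMMANDS = {
    "query_products", "query_product", "query_collections", "query_collection",
    "query_inventory", "query_orders", "query_store_health",
}
MUTATION_COMMANDS = {
    "update_product", "update_variant", "update_product_seo",
    "update_image_alt_text", "add_product_image", "update_collection",
    "add_product_to_collection", "remove_product_from_collection",
    "adjust_inventory", "update_metafield", "publish_product", "update_order",
}


def make_action(command, **params):
    """Serialize an action as JSON, rejecting commands not in the table."""
    if command not in QUERY_COMMANDS | MUTATION_COMMANDS:
        raise ValueError(f"unknown command: {command}")
    return json.dumps({"command": command, "params": params})


action = make_action("query_product", product_id="ayers-chambray")
```

In a running episode the `available_commands` field of the observation would be a better source for the valid names than a hard-coded set.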

## Observation Space

| Field | Type | Description |
|-------|------|-------------|
| `message` | `str` | Human-readable result description |
| `data` | `dict` | Structured API response data |
| `issues_remaining` | `int` | Unfixed issues count |
| `issues_fixed` | `int` | Issues fixed so far |
| `total_issues` | `int` | Total issues in task |
| `store_health_score` | `float` | Store health (0.0–1.0) |
| `available_commands` | `list[str]` | Available commands |
| `task_name` | `str` | Current task ID |
| `done` | `bool` | Whether episode has ended |
| `reward` | `float` | Step reward (shaped, multi-signal) |
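
For agent code that consumes observations, the fields above can be mirrored in a small typed container. The project defines its own Pydantic types in `models.py`; the dataclass below is only a standard-library stand-in, and the helpers `from_payload` and `progress` are hypothetical.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Observation:
    message: str
    data: dict
    issues_remaining: int
    issues_fixed: int
    total_issues: int
    store_health_score: float
    available_commands: list = field(default_factory=list)
    task_name: str = ""
    done: bool = False
    reward: float = 0.0


def from_payload(payload):
    """Build an Observation from a dict, ignoring keys not listed in the table."""
    known = {f.name for f in fields(Observation)}
    return Observation(**{k: v for k, v in payload.items() if k in known})


def progress(obs):
    """Fraction of this episode's issues already fixed."""
    return obs.issues_fixed / obs.total_issues if obs.total_issues else 0.0
```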

## Baseline Scores

| Task | Model | Score | Steps | Behavior |
|------|-------|-------|-------|----------|
| `product_listing_qa` | gpt-4o | **99%** | 16/25 | Reads hints, fixes all 8 issues efficiently |
| `seo_collection_optimization` | gpt-4o | **99%** | 28/35 | Investigates then fixes, figures out commands from descriptions |
| `full_store_audit` | gpt-4o | **1%** | 50/50 | Gets stuck — can't reason from category counts to specific fixes |

The hard task is where the gpt-4o baseline breaks down: it uses all 50 steps and scores 1%. An agent trained via RL on this environment would need to learn exploration strategies that gpt-4o doesn't exhibit out of the box.

## Setup Instructions

### Prerequisites
- Python 3.10+
- Docker
- `openenv-core` (`pip install openenv-core`)

### Local Development

```bash
cd /path/to/project
pip install -e .

# Start server
uvicorn server.app:app --host 0.0.0.0 --port 8000

# Test
curl http://localhost:8000/health
curl http://localhost:8000/tasks
curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{}'
```

### Docker

```bash
docker build -t shopify-store-audit .
docker run -p 8000:8000 shopify-store-audit
```

### Run Inference

```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o"
export HF_TOKEN="your-api-key"
export ENV_URL="http://localhost:8000"

python inference.py
```

### Validate

```bash
openenv validate
```

## Shopify API Mapping

Environment commands are modeled on real Shopify Admin GraphQL operations. The main equivalents are:

| Environment Command | Shopify GraphQL Equivalent |
|---|---|
| `update_product` | `productUpdate` mutation |
| `update_variant` | `productVariantUpdate` mutation |
| `update_product_seo` | `productUpdate` (seo fields) |
| `update_image_alt_text` | `productImageUpdate` mutation |
| `add_product_image` | `productCreateMedia` mutation |
| `update_collection` | `collectionUpdate` mutation |
| `add_product_to_collection` | `collectionAddProducts` mutation |
| `adjust_inventory` | `inventoryAdjustQuantities` mutation |
| `update_metafield` | `metafieldsSet` mutation |
| `publish_product` | `publishablePublish` mutation |
| `update_order` | `orderUpdate` mutation |
| `query_products` | `products` query |
| `query_inventory` | `inventoryLevels` query |
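
The mapping table can also be kept as a lookup, which makes it easy to see which action-space commands have no listed equivalent yet. This is a sketch built only from the two tables in this README; the names `GRAPHQL_EQUIVALENT` and `unmapped` are hypothetical.

```python
# Command -> Shopify GraphQL operation, copied from the mapping table.
GRAPHQL_EQUIVALENT = {
    "update_product": "productUpdate",
    "update_variant": "productVariantUpdate",
    "update_product_seo": "productUpdate",
    "update_image_alt_text": "productImageUpdate",
    "add_product_image": "productCreateMedia",
    "update_collection": "collectionUpdate",
    "add_product_to_collection": "collectionAddProducts",
    "adjust_inventory": "inventoryAdjustQuantities",
    "update_metafield": "metafieldsSet",
    "publish_product": "publishablePublish",
    "update_order": "orderUpdate",
    "query_products": "products",
    "query_inventory": "inventoryLevels",
}

# Commands from the action-space table that the mapping table does not list.
OTHER_COMMANDS = [
    "query_product", "query_collections", "query_collection",
    "query_orders", "query_store_health", "remove_product_from_collection",
]


def unmapped(commands):
    """Return the commands with no GraphQL equivalent in the table."""
    return sorted(c for c in commands if c not in GRAPHQL_EQUIVALENT)


missing = unmapped(list(GRAPHQL_EQUIVALENT) + OTHER_COMMANDS)
```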

Agents trained here learn patterns directly transferable to real Shopify store management via [Shopify MCP](https://github.com/Shopify/shopify-mcp) or Shopify CLI.

## Architecture

```
├── apparel.csv, jewelery.csv    # Real Shopify product exports (45 products)
├── models.py                    # Pydantic Action & Observation types
├── client.py                    # EnvClient for WebSocket connection
├── openenv.yaml                 # OpenEnv spec metadata
├── pyproject.toml               # Dependencies
├── Dockerfile                   # Container definition
├── inference.py                 # Baseline agent (runs all 3 tasks)
├── test_live.py                 # WebSocket integration test
└── server/
    ├── app.py                   # FastAPI + /tasks + /grade endpoints
    ├── shopify_store_audit_environment.py  # Environment (reset/step/state)
    ├── store.py                 # CSV loader, IssuePool, ShopifyStore CRUD
    ├── tasks.py                 # TaskConfig (num_issues, hint_level, categories)
    └── graders.py               # Per-task grading functions
```

## Extensibility

The architecture supports connecting to a **real Shopify store** via the Admin GraphQL API. The `ShopifyStore` class can be subclassed with a `LiveShopifyStore` that makes real API calls instead of in-memory mutations. Environment variables `SHOPIFY_STORE_URL` and `SHOPIFY_ACCESS_TOKEN` would enable live mode. The action space and observation format remain identical — the agent doesn't know which mode it's in.
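
A live-mode subclass might look roughly like the sketch below. Everything here is an assumption made for illustration: the `ShopifyStore` stand-in and its `update_product_seo` method do not reflect the real class in `server/store.py`, the injected `send` callable replaces an HTTP client so the example runs offline, and the API version string is a placeholder. A real store would also expect a global product ID rather than a handle.

```python
import os


class ShopifyStore:
    """Stand-in for the in-memory store (hypothetical interface)."""

    def __init__(self):
        self.products = {}

    def update_product_seo(self, product_id, seo_title):
        self.products.setdefault(product_id, {})["seo_title"] = seo_title
        return {"ok": True}


PRODUCT_UPDATE = """
mutation productUpdate($input: ProductInput!) {
  productUpdate(input: $input) {
    product { id }
    userErrors { field message }
  }
}
"""


class LiveShopifyStore(ShopifyStore):
    """Sends the same command to the Admin GraphQL API instead of mutating memory."""

    def __init__(self, send, api_version="2024-10"):
        super().__init__()
        self.send = send  # callable(url, headers, payload) -> dict, e.g. an HTTP POST
        self.api_version = api_version  # placeholder; pick a version your store supports

    def update_product_seo(self, product_id, seo_title):
        base = os.environ["SHOPIFY_STORE_URL"].rstrip("/")
        url = f"{base}/admin/api/{self.api_version}/graphql.json"
        headers = {
            "X-Shopify-Access-Token": os.environ["SHOPIFY_ACCESS_TOKEN"],
            "Content-Type": "application/json",
        }
        variables = {"input": {"id": product_id, "seo": {"title": seo_title}}}
        return self.send(url, headers, {"query": PRODUCT_UPDATE, "variables": variables})
```

Injecting `send` keeps the class testable without network access and leaves the choice of HTTP library to the caller.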

## License

MIT