---
title: Shopify Store Audit
emoji: 🛒
colorFrom: green
colorTo: blue
sdk: docker
app_port: 8000
tags:
  - openenv
---

# Shopify Store Audit & Remediation: OpenEnv Environment

Train AI agents to find and fix real e-commerce store issues through the Shopify Admin API.

## Motivation

Store auditing is a **$5K–$15K consulting service** that Shopify merchants regularly pay for. Every store accumulates issues: missing product descriptions, broken pricing, SEO gaps, inventory discrepancies, empty collections, stuck orders. This environment uses **real Shopify product data** (45 products from actual CSV exports) and lets AI agents learn to diagnose and fix these issues through API operations that map 1:1 to Shopify Admin GraphQL mutations.

**Why this matters for the agent community:**
- **Real data**: 45 products loaded from real Shopify CSV exports (apparel + jewelry catalogs)
- **184 discoverable issues**: auto-scanned from real data quality gaps + synthetic injections
- **Randomised episodes**: different issues sampled each reset (seeded for reproducibility)
- **Shaped rewards**: discovery, partial fix, efficiency bonus, regression & repetition penalties
- **Genuine difficulty progression**: hint level scales from guided to fully autonomous exploration
- **19 API commands** (7 queries, 12 mutations) mirroring real Shopify Admin GraphQL operations

## How It Works

The environment loads real Shopify product exports (`apparel.csv`, `jewelery.csv`) containing 45 products across apparel, bags, footwear, jewelry, outdoor gear, and home goods. An `IssuePool` scans the catalog and discovers **real data quality issues** (0/45 products have SEO titles, 0/45 have image alt text, 20/20 jewelry products have no SKUs, plus handle typos and formatting artifacts). Synthetic issues (corrupted prices, draft products, negative inventory) are generated on top.

On each `reset(seed=N)`, the pool randomly samples 8, 12, or 20 issues depending on the task (easy, medium, or hard). A different seed yields a different set of issues, and the same seed reproduces the same set. The agent must discover and fix them through API commands.
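
The seeded sampling described above can be sketched as follows. This is a minimal illustration, not the code in `server/store.py`: the `POOL` contents, the `ISSUES_PER_TASK` keys, and the `sample_issues` helper are all hypothetical names chosen for the example.

```python
import random

# Hypothetical stand-in for the scanned pool: 184 (issue_id, category) pairs.
CATEGORIES = ["seo", "pricing", "inventory", "content"]
POOL = [(f"issue-{i:03d}", cat) for i, cat in enumerate(CATEGORIES * 46)]

# Issue counts per difficulty tier, taken from the table in this README.
# The tier keys themselves are illustrative, not the real task IDs.
ISSUES_PER_TASK = {"easy": 8, "medium": 12, "hard": 20}


def sample_issues(task: str, seed: int) -> list:
    """Draw the episode's issues without replacement, reproducibly per seed."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    return rng.sample(POOL, ISSUES_PER_TASK[task])


if __name__ == "__main__":
    first = sample_issues("easy", seed=42)
    again = sample_issues("easy", seed=42)
    print(first == again)  # same seed, same episode
```

Using a local `random.Random(seed)` rather than `random.seed(seed)` keeps episodes reproducible even if other code in the process also draws random numbers.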

### Difficulty Tiers

The three tasks aren't just "more items"; they differ in **how much the agent is told**:

| Task | Issues | Steps | `query_store_health` returns | Agent must... |
|------|--------|-------|------------------------------|---------------|
| **Easy** | 8 | 25 | Each issue + suggested command name | Fill in the right params |
| **Medium** | 12 | 35 | Issue descriptions only | Figure out which command AND params |
| **Hard** | 20 | 50 | Only category counts (e.g. "16 SEO issues") | Explore, discover, diagnose, and fix |
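
To make the three hint levels concrete, here is a rough sketch of how a `query_store_health` payload could vary by tier. The issue dictionary keys (`description`, `command`, `category`), the tier names, and the `render_store_health` function are assumptions for illustration; the real response shape may differ.

```python
from collections import Counter


def render_store_health(issues, hint_level):
    """Return a diagnostic payload whose detail depends on the hint level."""
    if hint_level == "easy":
        # Each issue plus the suggested command name.
        return [f"{i['description']} -> try {i['command']}" for i in issues]
    if hint_level == "medium":
        # Issue descriptions only; the agent picks the command itself.
        return [i["description"] for i in issues]
    if hint_level == "hard":
        # Only per-category counts, e.g. {"seo": 16}.
        return dict(Counter(i["category"] for i in issues))
    raise ValueError(f"unknown hint level: {hint_level}")


if __name__ == "__main__":
    demo = [
        {"description": "Missing SEO title", "command": "update_product_seo", "category": "seo"},
        {"description": "Price is zero", "command": "update_variant", "category": "pricing"},
    ]
    for level in ("easy", "medium", "hard"):
        print(level, render_store_health(demo, level))
```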

### Reward Function

Multi-signal shaped reward that provides gradient throughout the episode:

| Signal | Reward | When |
|--------|--------|------|
| **Full fix** | `+1/N` | Issue fully resolved (N = total issues) |
| **Partial fix** | `+0.03` | Mutation targets the right resource but wrong value |
| **Discovery** | `+0.02` | First query of a resource that has an issue |
| **Efficiency bonus** | `+0.01` | Fixing without querying that resource first |
| **Query cost** | `-0.005` | Exploration has a small cost |
| **Failed mutation** | `-0.01` | Wrong resource or field targeted |
| **Repetition** | `-0.02` | Exact same command+params sent again |
| **Regression** | `-0.15` | Broke something that was previously correct |

This means a weak agent that explores but fails to fix still earns discovery rewards. A careless agent that breaks things gets punished. A perfect agent earns close to 1.0.
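
As a worked illustration of the table, the sketch below sums the signals triggered by a single step. The event names and the assumption that signals simply add up are mine; the README does not say how the environment combines them, or whether the total is clipped.

```python
# Fixed per-event rewards copied from the table above.
# "full_fix" is handled separately because it depends on N.
EVENT_REWARDS = {
    "partial_fix": 0.03,
    "discovery": 0.02,
    "efficiency_bonus": 0.01,
    "query": -0.005,
    "failed_mutation": -0.01,
    "repetition": -0.02,
    "regression": -0.15,
}


def step_reward(events, total_issues):
    """Sum the reward signals for one step (additive combination is assumed)."""
    reward = 0.0
    for event in events:
        if event == "full_fix":
            reward += 1.0 / total_issues  # +1/N per fully resolved issue
        else:
            reward += EVENT_REWARDS[event]
    return reward


if __name__ == "__main__":
    # One full fix on the 8-issue task.
    print(step_reward(["full_fix"], total_issues=8))
    # A first query that lands on an affected resource: cost plus discovery.
    print(step_reward(["query", "discovery"], total_issues=8))
```

Under these assumptions a full fix on the 8-issue task is worth 1/8 = 0.125, so a single regression (-0.15) costs more than one fix earns.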

## Action Space

Actions are JSON objects with a `command` and `params`:

```json
{"command": "update_product_seo", "params": {"product_id": "ayers-chambray", "seo_title": "Ayres Chambray | Store"}}
```

| Command | Type | Description |
|---------|------|-------------|
| `query_products` | Query | List/filter products (params: `status`, `search`, `product_type`, `limit`) |
| `query_product` | Query | Get product detail (params: `product_id`) |
| `query_collections` | Query | List all collections |
| `query_collection` | Query | Get collection detail (params: `collection_id`) |
| `query_inventory` | Query | Get inventory levels (params: `product_id`, `location_id`) |
| `query_orders` | Query | List orders (params: `fulfillment_status`) |
| `query_store_health` | Query | Diagnostic overview (detail varies by difficulty) |
| `update_product` | Mutation | Update product fields (description, status, tags) |
| `update_variant` | Mutation | Update variant (price, compare_at_price, sku) |
| `update_product_seo` | Mutation | Set SEO title/description |
| `update_image_alt_text` | Mutation | Set image alt text |
| `add_product_image` | Mutation | Add image to a product |
| `update_collection` | Mutation | Update collection fields/rules |
| `add_product_to_collection` | Mutation | Add product to collection |
| `remove_product_from_collection` | Mutation | Remove product from collection |
| `adjust_inventory` | Mutation | Set inventory quantity at location |
| `update_metafield` | Mutation | Set metafield value |
| `publish_product` | Mutation | Set product status to active |
| `update_order` | Mutation | Update order fulfillment status |
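
A minimal sketch of building and validating an action before sending it. The project defines its own Pydantic `Action` type in `models.py`; the dataclass below is a standard-library stand-in written for this example, and its validation rule (reject unknown command names) is an assumption.

```python
import json
from dataclasses import asdict, dataclass, field

# Command names copied from the table above.
QUERY_COMMANDS = {
    "query_products", "query_product", "query_collections", "query_collection",
    "query_inventory", "query_orders", "query_store_health",
}
MUTATION_COMMANDS = {
    "update_product", "update_variant", "update_product_seo",
    "update_image_alt_text", "add_product_image", "update_collection",
    "add_product_to_collection", "remove_product_from_collection",
    "adjust_inventory", "update_metafield", "publish_product", "update_order",
}


@dataclass
class Action:
    """Illustrative stand-in for the Action model: a command plus its params."""
    command: str
    params: dict = field(default_factory=dict)

    def __post_init__(self):
        if self.command not in QUERY_COMMANDS | MUTATION_COMMANDS:
            raise ValueError(f"unknown command: {self.command}")

    def to_json(self) -> str:
        return json.dumps(asdict(self))


if __name__ == "__main__":
    action = Action("query_product", {"product_id": "ayers-chambray"})
    print(action.to_json())
```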

## Observation Space

| Field | Type | Description |
|-------|------|-------------|
| `message` | `str` | Human-readable result description |
| `data` | `dict` | Structured API response data |
| `issues_remaining` | `int` | Unfixed issues count |
| `issues_fixed` | `int` | Issues fixed so far |
| `total_issues` | `int` | Total issues in task |
| `store_health_score` | `float` | Store health (0.0–1.0) |
| `available_commands` | `list[str]` | Available commands |
| `task_name` | `str` | Current task ID |
| `done` | `bool` | Whether episode has ended |
| `reward` | `float` | Step reward (shaped, multi-signal) |
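
On the agent side, an observation can be parsed into a typed object. The sketch below mirrors the fields in the table with a plain dataclass; it is not the Pydantic model from `models.py`, and the default values and the `from_dict` helper are assumptions.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Observation:
    """Illustrative mirror of the observation fields listed above."""
    message: str = ""
    data: dict = field(default_factory=dict)
    issues_remaining: int = 0
    issues_fixed: int = 0
    total_issues: int = 0
    store_health_score: float = 0.0
    available_commands: list = field(default_factory=list)
    task_name: str = ""
    done: bool = False
    reward: float = 0.0

    @classmethod
    def from_dict(cls, raw: dict) -> "Observation":
        # Ignore keys this sketch does not know about instead of failing.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in raw.items() if k in known})


if __name__ == "__main__":
    obs = Observation.from_dict(
        {"message": "ok", "issues_fixed": 3, "total_issues": 8, "extra": 1}
    )
    print(obs.issues_fixed, obs.total_issues, obs.done)
```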

## Baseline Scores

| Task | Model | Score | Steps | Behavior |
|------|-------|-------|-------|----------|
| `product_listing_qa` | gpt-4o | **99%** | 16/25 | Reads hints, fixes all 8 issues efficiently |
| `seo_collection_optimization` | gpt-4o | **99%** | 28/35 | Investigates then fixes, figures out commands from descriptions |
| `full_store_audit` | gpt-4o | **1%** | 50/50 | Gets stuck: can't reason from category counts to specific fixes |

The hard task genuinely challenges frontier models. An agent trained via RL on this environment would need to learn exploration strategies that gpt-4o doesn't exhibit out of the box.

## Setup Instructions

### Prerequisites
- Python 3.10+
- Docker
- `openenv-core` (`pip install openenv-core`)

### Local Development

```bash
cd /path/to/project
pip install -e .

# Start server
uvicorn server.app:app --host 0.0.0.0 --port 8000

# Test
curl http://localhost:8000/health
curl http://localhost:8000/tasks
curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{}'
```
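
The same `/reset` call can be made from Python with only the standard library. This is a sketch under assumptions: `build_post` and `send` are hypothetical helpers, and the empty JSON body simply mirrors the `curl` example above.

```python
import json
import urllib.request


def build_post(base_url, path, payload=None):
    """Build a JSON POST request without sending it."""
    body = json.dumps(payload or {}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10.0):
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example, with the server from the previous step running:
#   observation = send(build_post("http://localhost:8000", "/reset"))
```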

### Docker

```bash
docker build -t shopify-store-audit .
docker run -p 8000:8000 shopify-store-audit
```

### Run Inference

```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o"
export HF_TOKEN="your-api-key"
export ENV_URL="http://localhost:8000"

python inference.py
```

### Validate

```bash
openenv validate
```

## Shopify API Mapping

Every environment command maps to a real Shopify Admin GraphQL operation:

| Environment Command | Shopify GraphQL Equivalent |
|---|---|
| `update_product` | `productUpdate` mutation |
| `update_variant` | `productVariantUpdate` mutation |
| `update_product_seo` | `productUpdate` (seo fields) |
| `update_image_alt_text` | `productImageUpdate` mutation |
| `add_product_image` | `productCreateMedia` mutation |
| `update_collection` | `collectionUpdate` mutation |
| `add_product_to_collection` | `collectionAddProducts` mutation |
| `adjust_inventory` | `inventoryAdjustQuantities` mutation |
| `update_metafield` | `metafieldsSet` mutation |
| `publish_product` | `publishablePublish` mutation |
| `update_order` | `orderUpdate` mutation |
| `query_products` | `products` query |
| `query_inventory` | `inventoryLevels` query |

Agents trained here learn patterns directly transferable to real Shopify store management via [Shopify MCP](https://github.com/Shopify/shopify-mcp) or Shopify CLI.
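
For code that needs this mapping, the table can be expressed as a lookup. The dictionary below copies only the rows listed above; `graphql_operation` is a hypothetical helper, and returning `None` for commands the table does not cover is a design choice made for this sketch.

```python
# Environment command -> Shopify GraphQL operation name, copied from the table.
GRAPHQL_OPERATIONS = {
    "update_product": "productUpdate",
    "update_variant": "productVariantUpdate",
    "update_product_seo": "productUpdate",  # via the seo fields
    "update_image_alt_text": "productImageUpdate",
    "add_product_image": "productCreateMedia",
    "update_collection": "collectionUpdate",
    "add_product_to_collection": "collectionAddProducts",
    "adjust_inventory": "inventoryAdjustQuantities",
    "update_metafield": "metafieldsSet",
    "publish_product": "publishablePublish",
    "update_order": "orderUpdate",
    "query_products": "products",
    "query_inventory": "inventoryLevels",
}


def graphql_operation(command):
    """Return the GraphQL operation for a command, or None if not in the table."""
    return GRAPHQL_OPERATIONS.get(command)


if __name__ == "__main__":
    print(graphql_operation("update_metafield"))
    print(graphql_operation("query_orders"))
```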

## Architecture

```
├── apparel.csv, jewelery.csv    # Real Shopify product exports (45 products)
├── models.py                    # Pydantic Action & Observation types
├── client.py                    # EnvClient for WebSocket connection
├── openenv.yaml                 # OpenEnv spec metadata
├── pyproject.toml               # Dependencies
├── Dockerfile                   # Container definition
├── inference.py                 # Baseline agent (runs all 3 tasks)
├── test_live.py                 # WebSocket integration test
└── server/
    ├── app.py                   # FastAPI + /tasks + /grade endpoints
    ├── shopify_store_audit_environment.py  # Environment (reset/step/state)
    ├── store.py                 # CSV loader, IssuePool, ShopifyStore CRUD
    ├── tasks.py                 # TaskConfig (num_issues, hint_level, categories)
    └── graders.py               # Per-task grading functions
```

## Extensibility

The architecture supports connecting to a **real Shopify store** via the Admin GraphQL API. The `ShopifyStore` class can be subclassed with a `LiveShopifyStore` that makes real API calls instead of in-memory mutations. Environment variables `SHOPIFY_STORE_URL` and `SHOPIFY_ACCESS_TOKEN` would enable live mode. The action space and observation format remain identical; the agent doesn't know which mode it's in.
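
One way the subclassing idea could look, as a sketch only. The `ShopifyStore` below is a tiny stand-in for the class in `server/store.py`, its `update_product` signature is assumed, and `make_store` is a hypothetical factory. The live class deliberately raises instead of making a network call.

```python
import os


class ShopifyStore:
    """Stand-in for the in-memory store (the real one lives in server/store.py)."""

    def __init__(self):
        self.products = {}

    def update_product(self, product_id, **changes):
        # In-memory mutation: merge the changed fields into the product record.
        self.products.setdefault(product_id, {}).update(changes)
        return self.products[product_id]


class LiveShopifyStore(ShopifyStore):
    """Same interface, but mutations would go to the Admin GraphQL API."""

    def __init__(self, store_url, access_token):
        super().__init__()
        self.store_url = store_url
        self.access_token = access_token

    def update_product(self, product_id, **changes):
        # Placeholder: a real implementation would send a productUpdate mutation.
        raise NotImplementedError("live productUpdate call not implemented in this sketch")


def make_store(env=None):
    """Pick live mode only when both environment variables are set."""
    env = os.environ if env is None else env
    url = env.get("SHOPIFY_STORE_URL")
    token = env.get("SHOPIFY_ACCESS_TOKEN")
    if url and token:
        return LiveShopifyStore(url, token)
    return ShopifyStore()
```

Because both classes expose the same methods, the environment's `step` logic would not need to change between modes.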

## License

MIT