# APIShift Data Pyramid

Three layers of training and evaluation data.

## Layer 1: Real scraped OpenAPI version pairs

Scraped from public Git histories of five widely used vendor APIs using
`scripts/scrape_specs.py`. All spec files are extracted verbatim from
tagged commits in the vendors' official OpenAPI repos.

| Provider | Source repo | Spec format | Versions scraped | (v1, v2) pairs |
|----------|-------------|-------------|------------------|----------------|
| Stripe   | stripe/openapi | JSON (spec3.sdk.json) | ~80 | ~240 |
| GitHub   | github/rest-api-description | YAML | ~50 | ~150 |
| Twilio   | twilio/twilio-oai | YAML (3 product lines) | ~40×3 | ~360 |
| Slack    | slackapi/slack-api-specs | JSON | ~30 | ~80 |
| OpenAI   | openai/openai-openapi | YAML | ~20 | ~50 |

**Total: ~200 spec versions → ~840+ real (v1, v2) pairs**

Pairs are generated in three ways for each provider:
- **Adjacent** (n→n+1): realistic "I stayed one version behind" case
- **Skip-one** (n→n+2): "I missed a release" case
- **Long-range** (random gap of 5 to 20 versions): "I am very far behind" case
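
The three strategies above can be sketched as follows. This is a minimal illustration, not the code in `scripts/build_pair_index.py`: the function name `build_pairs`, the tuple layout, and the choice of one long-range pair per starting version are assumptions made for the example.

```python
import random


def build_pairs(versions, seed=42, min_gap=5, max_gap=20):
    """Return (kind, v1, v2) tuples for one provider's version labels.

    Hypothetical helper: assumes the labels sort into release order.
    """
    ordered = sorted(versions)   # fixed order before any sampling
    rng = random.Random(seed)    # local RNG so the result is repeatable
    pairs = []
    for i, v1 in enumerate(ordered):
        if i + 1 < len(ordered):
            pairs.append(("adjacent", v1, ordered[i + 1]))
        if i + 2 < len(ordered):
            pairs.append(("skip_one", v1, ordered[i + 2]))
        # Long-range: one random gap in [min_gap, max_gap], clipped to
        # the number of versions that remain after position i.
        furthest = len(ordered) - 1 - i
        if furthest >= min_gap:
            gap = rng.randint(min_gap, min(max_gap, furthest))
            pairs.append(("long_range", v1, ordered[i + gap]))
    return pairs


# Illustrative labels only; real version tags come from the scraped repos.
versions = [f"v{n:03d}" for n in range(1, 31)]
pairs = build_pairs(versions)
```

Under these assumptions, 30 versions yield 29 adjacent, 28 skip-one, and 25 long-range pairs, roughly three pairs per scraped version.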

Spec files live in `scenarios/layer1_real/<provider>/`.
The global index is `scenarios/layer1_real/_global_index.json`.

To reproduce:
```bash
python scripts/scrape_specs.py --out scenarios/layer1_real
python scripts/extract_client_samples.py --out scenarios/layer1_real
python scripts/build_pair_index.py --out scenarios/layer1_real
```

## Layer 2: Synthetic perturbation

`scenarios/layer2_synthetic/mutator.py` takes a real OpenAPI spec and
applies one or more typed mutations from a 12-class taxonomy:

| Mutation class | Severity |
|----------------|---------|
| field_renamed | low |
| type_narrowed | medium |
| required_field_added | medium |
| endpoint_removed | high |
| enum_narrowed | high |
| response_shape_changed | medium |
| auth_scheme_changed | high |
| field_removed | high |
| param_required_added | medium |
| default_changed | low |
| method_changed | high |
| status_code_removed | high |

Each mutation produces a `(mutated_spec, ground_truth_change_record)` pair
so the grader has reliable labels.
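
As a rough illustration of what one such pair could look like, here is a sketch of a `field_renamed` mutation. It is not the code in `mutator.py`: the function `rename_field` and the keys of the change record are hypothetical, and the spec fragment is a made-up example.

```python
import copy


def rename_field(spec, schema_name, old, new):
    """Apply a field_renamed mutation to one component schema.

    Hypothetical helper; returns (mutated_spec, ground_truth_change_record).
    """
    mutated = copy.deepcopy(spec)  # never modify the base spec in place
    schema = mutated["components"]["schemas"][schema_name]
    props = schema["properties"]
    props[new] = props.pop(old)
    # Keep the `required` list consistent with the renamed property.
    if old in schema.get("required", []):
        schema["required"] = [new if f == old else f for f in schema["required"]]
    record = {  # record layout is an assumption for this example
        "mutation_class": "field_renamed",
        "severity": "low",
        "location": f"#/components/schemas/{schema_name}/properties/{old}",
        "old": old,
        "new": new,
    }
    return mutated, record


# Made-up spec fragment used only to exercise the sketch.
base = {
    "components": {
        "schemas": {
            "Invoice": {
                "type": "object",
                "required": ["amount"],
                "properties": {"amount": {"type": "integer"}},
            }
        }
    }
}
mutated, record = rename_field(base, "Invoice", "amount", "amount_due")
```

Because the base spec is deep-copied, the same real spec can serve as the starting point for many mutations, and the record carries the label the grader checks against.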

When `layer1_real/` is populated, the mutator picks a random real spec from
any of the five providers as the mutation base. This gives 200+ unique
starting points instead of one hand-typed seed, for a combinatorial
expansion to 40,000+ unique synthetic scenarios.
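
One way the base selection could stay reproducible is to sort the candidate specs before drawing from a seeded generator. The sketch below assumes this approach; `pick_base_spec` and the file names are hypothetical, and the seed value 42 is the one this document states for its deterministic runs.

```python
import random


def pick_base_spec(spec_paths, rng):
    """Pick one base spec, independent of the order the paths arrive in.

    Hypothetical helper: returns None when no real specs are available,
    leaving the fallback to the caller.
    """
    candidates = sorted(spec_paths)  # listing order must not affect the draw
    return rng.choice(candidates) if candidates else None


# Illustrative paths only; actual file names depend on the scrape.
specs = [
    "stripe/spec_001.json",
    "github/spec_001.yaml",
    "twilio/spec_001.yaml",
    "slack/spec_001.json",
    "openai/spec_001.yaml",
]
first = pick_base_spec(specs, random.Random(42))
second = pick_base_spec(list(reversed(specs)), random.Random(42))
```

Both calls return the same path, because the sort removes any dependence on input order and the generator starts from the same seed.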

Used for: bulk training (difficulty weights sampled by the CurriculumAgent).

## Layer 3: Held-out evaluation

A locked set of famous real migrations the agent never sees during training:

- Stripe v22 → v23 (webhook signature change)
- GitHub PAT deprecation (token → bearer)
- Twilio Messages.json → /v2/messaging restructure
- Slack auth.revoke → admin.tokens.revoke deprecation
- Stripe v18 invoice field rename

The migrations listed above are the five inline seed scenarios in
`scenarios/library.py`. The held-out set also includes the 20 most
migration-rich adjacent pairs from the scraped set, which were carved out
into `scenarios/layer3_holdout/` after the scrape.

Used for: the final benchmark reported in the README.

## Real-data discipline

Every Layer 1 and Layer 3 scenario links back to a real public Git commit
in a vendor's official OpenAPI repo. Layer 2 mutations are typed against the
documented OpenAPI breaking change taxonomy. Nothing is fabricated.
Deterministic seeds (per-provider sort + fixed random seed 42) make every
training run reproducible from a fresh clone.