APIShift Data Pyramid
Three layers of training and evaluation data.
Layer 1: Real scraped OpenAPI version pairs
Scraped from public Git histories of five widely used vendor APIs using
scripts/scrape_specs.py. All spec files are extracted verbatim from
tagged commits in the vendors' official OpenAPI repos.
| Provider | Source repo | Spec format | Versions scraped | (v1,v2) pairs |
|---|---|---|---|---|
| Stripe | stripe/openapi | JSON (spec3.sdk.json) | ~80 | ~240 |
| GitHub | github/rest-api-description | YAML | ~50 | ~150 |
| Twilio | twilio/twilio-oai | YAML (3 product lines) | ~40×3 | ~360 |
| Slack | slackapi/slack-api-specs | JSON | ~30 | ~80 |
| OpenAI | openai/openai-openapi | YAML | ~20 | ~50 |
Total: ~200 spec versions → ~840+ real (v1, v2) pairs
Pairs are generated three ways for each provider's version sequence:
- Adjacent (n→n+1): realistic "I stayed one version behind" case
- Skip-one (n→n+2): "I missed a release" case
- Long-range (random gap of 5-20 versions): "I am very far behind" case
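The three pairing modes can be sketched as follows. This is a minimal illustration, not the code in scripts/build_pair_index.py: the function name `build_pairs`, the tuple layout, and the choice of one long-range pair per starting version are all assumptions.

```python
import random


def build_pairs(versions, seed=42, min_gap=5, max_gap=20):
    """Return (v1, v2, kind) tuples for an ordered list of version labels.

    Hypothetical helper: one long-range pair per starting version is an
    assumption, not something the scraper is documented to do.
    """
    rng = random.Random(seed)  # fixed seed so long-range picks repeat
    pairs = []
    n = len(versions)
    for i in range(n):
        if i + 1 < n:
            pairs.append((versions[i], versions[i + 1], "adjacent"))
        if i + 2 < n:
            pairs.append((versions[i], versions[i + 2], "skip_one"))
        # A long-range pair needs at least min_gap later versions.
        if i + min_gap < n:
            gap = rng.randint(min_gap, min(max_gap, n - 1 - i))
            pairs.append((versions[i], versions[i + gap], "long_range"))
    return pairs


# Example: a provider with 30 ordered versions.
versions = [f"v{k}" for k in range(1, 31)]
pairs = build_pairs(versions)
print(len(pairs))
```

Under these assumptions, 30 versions yield 29 adjacent, 28 skip-one, and 25 long-range pairs.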
Spec files live in scenarios/layer1_real/<provider>/.
The global index is scenarios/layer1_real/_global_index.json.
To reproduce:
python scripts/scrape_specs.py --out scenarios/layer1_real
python scripts/extract_client_samples.py --out scenarios/layer1_real
python scripts/build_pair_index.py --out scenarios/layer1_real
Layer 2: Synthetic perturbation
scenarios/layer2_synthetic/mutator.py takes a real OpenAPI spec and
applies one or more typed mutations from a 12-class taxonomy:
| Mutation class | Severity |
|---|---|
| field_renamed | low |
| type_narrowed | medium |
| required_field_added | medium |
| endpoint_removed | high |
| enum_narrowed | high |
| response_shape_changed | medium |
| auth_scheme_changed | high |
| field_removed | high |
| param_required_added | medium |
| default_changed | low |
| method_changed | high |
| status_code_removed | high |
Each mutation produces a (mutated_spec, ground_truth_change_record) pair
so the grader has reliable labels.
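As an illustration of what one typed mutation and its ground-truth record might look like, here is a minimal sketch of `field_renamed`. The function name `rename_field` and the keys of the change record are hypothetical; the actual record format in mutator.py may differ.

```python
import copy


def rename_field(spec, schema_name, old, new):
    """Apply a field_renamed mutation; return (mutated_spec, change_record)."""
    mutated = copy.deepcopy(spec)  # never modify the real spec in place
    schema = mutated["components"]["schemas"][schema_name]
    props = schema["properties"]
    props[new] = props.pop(old)
    # Keep the "required" list consistent with the renamed property.
    if "required" in schema:
        schema["required"] = [new if r == old else r for r in schema["required"]]
    # Hypothetical record layout: enough for a grader to check a diagnosis.
    record = {
        "mutation_class": "field_renamed",
        "severity": "low",
        "schema": schema_name,
        "old_name": old,
        "new_name": new,
    }
    return mutated, record


# Tiny stand-in for a real scraped spec.
spec = {
    "openapi": "3.0.0",
    "components": {
        "schemas": {
            "Invoice": {
                "type": "object",
                "required": ["id", "email"],
                "properties": {
                    "id": {"type": "string"},
                    "email": {"type": "string"},
                },
            }
        }
    },
}
mutated, record = rename_field(spec, "Invoice", "email", "customer_email")
```

Copying the spec before mutating keeps the Layer 1 base untouched, so the same real spec can seed many mutations.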
When layer1_real/ is populated, the mutator picks a random real spec from
any of the five providers as the mutation base. This gives 200+ unique starting
points instead of one hand-typed seed, for a combinatorial expansion to
40,000+ unique synthetic scenarios.
Used for: bulk training (difficulty weights sampled by the CurriculumAgent).
Layer 3: Held-out evaluation
A locked set of famous real migrations the agent never sees during training:
- Stripe v22 → v23 (webhook signature change)
- GitHub PAT deprecation (token → bearer)
- Twilio Messages.json → /v2/messaging restructure
- Slack auth.revoke → admin.tokens.revoke deprecation
- Stripe v18 invoice field rename
The five migrations above are the inline seed scenarios in scenarios/library.py.
The held-out set also contains the 20 most migration-rich adjacent pairs from the
scraped set, which were carved out into scenarios/layer3_holdout/ after the scrape.
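One way the carve-out could work is sketched below. The ranking metric is an assumption: "migration-rich" is approximated here by a hypothetical `change_count` field, and the `id` and `kind` keys are likewise illustrative rather than taken from the real pair index.

```python
def carve_holdout(pairs, k=20):
    """Split pairs into (train, holdout) using the top-k richest adjacent pairs."""
    adjacent = [p for p in pairs if p["kind"] == "adjacent"]
    # Sort by change count (descending), then by id, so ties break the same way
    # on every run.
    ranked = sorted(adjacent, key=lambda p: (-p["change_count"], p["id"]))
    holdout_ids = {p["id"] for p in ranked[:k]}
    holdout = [p for p in pairs if p["id"] in holdout_ids]
    train = [p for p in pairs if p["id"] not in holdout_ids]
    return train, holdout


# Toy index: 30 adjacent pairs and 10 skip-one pairs with made-up change counts.
pairs = [
    {"id": f"adj-{i:02d}", "kind": "adjacent", "change_count": i % 7}
    for i in range(30)
] + [
    {"id": f"skip-{i:02d}", "kind": "skip_one", "change_count": 9}
    for i in range(10)
]
train, holdout = carve_holdout(pairs)
```

The split is by pair id only, so no held-out pair can reappear in the training index.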
Used for: the final benchmark reported in the README.
Real-data discipline
Every Layer 1 and Layer 3 scenario links back to a real public Git commit in a vendor's official OpenAPI repo. Layer 2 mutations are typed against the documented OpenAPI breaking change taxonomy. Nothing is fabricated. Deterministic seeds (per-provider sort + fixed random seed 42) make every training run reproducible from a fresh clone.
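The combination of a per-provider sort and a fixed seed can be sketched like this. The function name `pick_bases` and the input shape (a dict of provider name to spec paths) are assumptions for illustration; only the sort-then-seed idea and the seed value 42 come from the text above.

```python
import random


def pick_bases(specs_by_provider, count, seed=42):
    """Pick `count` mutation bases reproducibly from the scraped specs."""
    # Sort providers and their spec paths so dict or filesystem ordering
    # cannot change the outcome.
    ordered = [
        path
        for provider in sorted(specs_by_provider)
        for path in sorted(specs_by_provider[provider])
    ]
    rng = random.Random(seed)  # private generator; global random state is untouched
    return [rng.choice(ordered) for _ in range(count)]


# The same files listed in two different orders produce the same picks.
run_a = pick_bases({"stripe": ["s2.json", "s1.json"], "github": ["g1.yaml"]}, 5)
run_b = pick_bases({"github": ["g1.yaml"], "stripe": ["s1.json", "s2.json"]}, 5)
```

Sorting before sampling matters because a seeded generator is only reproducible when it draws from an identically ordered list.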