# APIShift Data Pyramid

Three layers of training and evaluation data.

## Layer 1: Real scraped OpenAPI version pairs

Scraped from the public Git histories of five widely used vendor APIs using `scripts/scrape_specs.py`. All spec files are extracted verbatim from tagged commits in the vendors' official OpenAPI repos.

| Provider | Source repo | Spec format | Versions scraped | (v1, v2) pairs |
|----------|-------------|-------------|------------------|----------------|
| Stripe | stripe/openapi | JSON (spec3.sdk.json) | ~80 | ~240 |
| GitHub | github/rest-api-description | YAML | ~50 | ~150 |
| Twilio | twilio/twilio-oai | YAML (3 product lines) | ~40×3 | ~360 |
| Slack | slackapi/slack-api-specs | JSON | ~30 | ~80 |
| OpenAI | openai/openai-openapi | YAML | ~20 | ~50 |

**Total: ~200 spec versions → ~840+ real (v1, v2) pairs**

Pairs are generated three ways from each provider's ordered version list:

- **Adjacent** (n→n+1): the realistic "I stayed one version behind" case
- **Skip-one** (n→n+2): the "I missed a release" case
- **Long-range** (random gap of 5 to 20 versions): the "I am very far behind" case

Spec files live in `scenarios/layer1_real//`. The global index is `scenarios/layer1_real/_global_index.json`.
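
The three pairing strategies above can be sketched as follows. This is a minimal illustration and not the contents of `scripts/build_pair_index.py`: the function name `build_pairs`, its parameters, and the `kind` labels are hypothetical, and it assumes the versions are already sorted oldest to newest.

```python
import random


def build_pairs(versions, seed=42, min_gap=5, max_gap=20):
    """Return (v1, v2, kind) tuples for an ordered list of version labels.

    Hypothetical helper: one adjacent, one skip-one, and one long-range
    pair is emitted per starting version wherever the list is long enough.
    """
    rng = random.Random(seed)  # fixed seed so the long-range gaps are reproducible
    pairs = []
    last = len(versions) - 1
    for i, v1 in enumerate(versions):
        if i + 1 <= last:
            pairs.append((v1, versions[i + 1], "adjacent"))
        if i + 2 <= last:
            pairs.append((v1, versions[i + 2], "skip_one"))
        if i + min_gap <= last:
            # Clamp the upper bound so the target index stays inside the list.
            gap = rng.randint(min_gap, min(max_gap, last - i))
            pairs.append((v1, versions[i + gap], "long_range"))
    return pairs


versions = [f"v{i}" for i in range(80)]
pairs = build_pairs(versions)
```

Under these assumptions a provider with 80 versions yields 232 pairs (79 adjacent, 78 skip-one, 75 long-range), which is in line with the roughly three pairs per version shown in the table.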

To reproduce:

```bash
python scripts/scrape_specs.py --out scenarios/layer1_real
python scripts/extract_client_samples.py --out scenarios/layer1_real
python scripts/build_pair_index.py --out scenarios/layer1_real
```

## Layer 2: Synthetic perturbation

`scenarios/layer2_synthetic/mutator.py` takes a real OpenAPI spec and applies one or more typed mutations from a 12-class taxonomy:

| Mutation class | Severity |
|----------------|----------|
| field_renamed | low |
| type_narrowed | medium |
| required_field_added | medium |
| endpoint_removed | high |
| enum_narrowed | high |
| response_shape_changed | medium |
| auth_scheme_changed | high |
| field_removed | high |
| param_required_added | medium |
| default_changed | low |
| method_changed | high |
| status_code_removed | high |

Each mutation produces a `(mutated_spec, ground_truth_change_record)` pair so the grader has reliable labels.

When `layer1_real/` is populated, the mutator picks a random real spec from any of the five providers as the mutation base. This gives 200+ unique starting points instead of one hand-typed seed, for a combinatorial expansion to 40,000+ unique synthetic scenarios.

Used for: bulk training (difficulty weights sampled by the CurriculumAgent).

## Layer 3: Held-out evaluation

A locked set of famous real migrations the agent never sees during training:

- Stripe v22 → v23 (webhook signature change)
- GitHub PAT deprecation (token → bearer)
- Twilio Messages.json → /v2/messaging restructure
- Slack auth.revoke → admin.tokens.revoke deprecation
- Stripe v18 invoice field rename

These are the five inline seed scenarios in `scenarios/library.py`, plus the 20 most migration-rich adjacent pairs from the scraped set, which were carved out into `scenarios/layer3_holdout/` after the scrape.

Used for: the final benchmark reported in the README.

## Real-data discipline

Every Layer 1 and Layer 3 scenario links back to a real public Git commit in a vendor's official OpenAPI repo.
Layer 2 mutations are typed against the documented OpenAPI breaking change taxonomy. Nothing is fabricated. Deterministic seeds (per-provider sort + fixed random seed 42) make every training run reproducible from a fresh clone.
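
To make the Layer 2 contract and the seeding concrete, here is a minimal sketch of a single `field_renamed` mutation that returns a `(mutated_spec, ground_truth_change_record)` pair. It is not the code in `scenarios/layer2_synthetic/mutator.py`: the function `rename_field`, the `_v2` suffix, and the keys of the change record are assumptions made for illustration, and the tiny `spec` dict stands in for a real scraped spec.

```python
import copy
import random


def rename_field(spec, rng):
    """Apply one field_renamed mutation; return (mutated_spec, change_record).

    Hypothetical sketch: only looks at components.schemas.*.properties.
    """
    mutated = copy.deepcopy(spec)  # never modify the base spec in place
    schemas = mutated.get("components", {}).get("schemas", {})
    # Sort candidates so the pick depends only on the seed, not on dict order.
    candidates = sorted(
        (schema_name, field)
        for schema_name, schema in schemas.items()
        for field in schema.get("properties", {})
    )
    if not candidates:
        raise ValueError("spec has no schema properties to rename")

    schema_name, old_name = rng.choice(candidates)
    new_name = f"{old_name}_v2"  # assumed naming scheme; assumes no collision
    schema = schemas[schema_name]
    # Rebuild the mapping so the renamed field keeps its original position.
    schema["properties"] = {
        (new_name if name == old_name else name): body
        for name, body in schema["properties"].items()
    }
    if "required" in schema:
        schema["required"] = [
            new_name if name == old_name else name for name in schema["required"]
        ]

    record = {  # assumed record layout, used as the grader's label
        "mutation_class": "field_renamed",
        "severity": "low",
        "schema": schema_name,
        "old_name": old_name,
        "new_name": new_name,
    }
    return mutated, record


spec = {
    "openapi": "3.0.0",
    "components": {
        "schemas": {
            "Invoice": {
                "type": "object",
                "required": ["amount"],
                "properties": {
                    "amount": {"type": "integer"},
                    "currency": {"type": "string"},
                },
            }
        }
    },
}

mutated, record = rename_field(spec, random.Random(42))
```

Passing a fresh `random.Random(42)` on every call, together with the sorted candidate list, means the same base spec always produces the same mutation and the same change record.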