
APIShift Data Pyramid

APIShift organizes its training and evaluation data into three layers.

Layer 1: Real scraped OpenAPI version pairs

Scraped from public Git histories of five widely used vendor APIs using scripts/scrape_specs.py. All spec files are extracted verbatim from tagged commits in the vendors' official OpenAPI repos.

| Provider | Source repo | Spec format | Versions scraped | (v1, v2) pairs |
|---|---|---|---|---|
| Stripe | stripe/openapi | JSON (spec3.sdk.json) | ~80 | ~240 |
| GitHub | github/rest-api-description | YAML | ~50 | ~150 |
| Twilio | twilio/twilio-oai | YAML (3 product lines) | ~40×3 | ~360 |
| Slack | slackapi/slack-api-specs | JSON | ~30 | ~80 |
| OpenAI | openai/openai-openapi | YAML | ~20 | ~50 |

Total: ~200 spec versions → ~840+ real (v1, v2) pairs

Pairs are generated in three ways:

  • Adjacent (n→n+1): realistic "I stayed one version behind" case
  • Skip-one (n→n+2): "I missed a release" case
  • Long-range (random gap of 5 to 20 versions): "I am very far behind" case
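
The three pairing strategies can be sketched as follows. This is an illustration only, not the code in scripts/build_pair_index.py: the function name `build_pairs`, the `(kind, v1, v2)` tuple shape, and the choice to draw one long-range partner per starting version are all assumptions.

```python
import random

def build_pairs(versions, seed=42, min_gap=5, max_gap=20):
    """Return (kind, v1, v2) tuples for one provider's ordered version list.

    Hypothetical helper: the names and tuple layout are illustrative.
    """
    rng = random.Random(seed)  # local RNG so the result is reproducible
    pairs = []
    n = len(versions)
    for i in range(n):
        if i + 1 < n:
            pairs.append(("adjacent", versions[i], versions[i + 1]))
        if i + 2 < n:
            pairs.append(("skip_one", versions[i], versions[i + 2]))
        # Long-range: draw a gap in [min_gap, max_gap], capped by what remains.
        remaining = n - 1 - i
        if remaining >= min_gap:
            gap = rng.randint(min_gap, min(max_gap, remaining))
            pairs.append(("long_range", versions[i], versions[i + gap]))
    return pairs

versions = [f"v{i}" for i in range(1, 31)]  # 30 placeholder version tags
pairs = build_pairs(versions)
```

Under these assumptions, a provider with N ordered versions yields N-1 adjacent pairs, N-2 skip-one pairs, and one long-range pair for every version that has at least five successors.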

Spec files live in scenarios/layer1_real/<provider>/. The global index is scenarios/layer1_real/_global_index.json.

To reproduce:

python scripts/scrape_specs.py --out scenarios/layer1_real
python scripts/extract_client_samples.py --out scenarios/layer1_real
python scripts/build_pair_index.py --out scenarios/layer1_real

Layer 2: Synthetic perturbation

scenarios/layer2_synthetic/mutator.py takes a real OpenAPI spec and applies one or more typed mutations from a 12-class taxonomy:

| Mutation class | Severity |
|---|---|
| field_renamed | low |
| type_narrowed | medium |
| required_field_added | medium |
| endpoint_removed | high |
| enum_narrowed | high |
| response_shape_changed | medium |
| auth_scheme_changed | high |
| field_removed | high |
| param_required_added | medium |
| default_changed | low |
| method_changed | high |
| status_code_removed | high |

Each mutation produces a (mutated_spec, ground_truth_change_record) pair so the grader has reliable labels.
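
To make the (mutated_spec, ground_truth_change_record) idea concrete, here is a minimal sketch of one mutation class, field_renamed. It is not taken from mutator.py: the function `rename_field`, the record keys, and the tiny `Invoice` schema are hypothetical and only show the shape such a pair might take.

```python
import copy

def rename_field(spec, schema_name, old, new):
    """Apply a hypothetical field_renamed mutation to one component schema.

    Returns (mutated_spec, change_record); the input spec is left untouched.
    """
    mutated = copy.deepcopy(spec)  # never modify the base spec in place
    schema = mutated["components"]["schemas"][schema_name]
    props = schema["properties"]
    if old not in props:
        raise KeyError(f"{schema_name} has no property {old!r}")
    # Rebuild the dict so the renamed property keeps its original position.
    schema["properties"] = {(new if k == old else k): v for k, v in props.items()}
    if old in schema.get("required", []):
        schema["required"] = [new if r == old else r for r in schema["required"]]
    # Ground-truth label the grader could compare an agent's answer against.
    record = {
        "mutation_class": "field_renamed",
        "severity": "low",
        "location": f"#/components/schemas/{schema_name}/properties/{old}",
        "old": old,
        "new": new,
    }
    return mutated, record

base = {
    "openapi": "3.0.0",
    "components": {"schemas": {"Invoice": {
        "type": "object",
        "required": ["id", "amount"],
        "properties": {"id": {"type": "string"}, "amount": {"type": "integer"}},
    }}},
}
mutated_spec, change_record = rename_field(base, "Invoice", "amount", "amount_due")
```

Because the record is emitted by the same function that performs the edit, the label cannot drift from the actual change.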

When layer1_real/ is populated, the mutator picks a random real spec from any of the five providers as the mutation base. This gives 200+ unique starting points instead of one hand-typed seed, for a combinatorial expansion to 40,000+ unique synthetic scenarios.

Used for: bulk training (difficulty weights sampled by the CurriculumAgent).

Layer 3: Held-out evaluation

A locked set of famous real migrations the agent never sees during training:

  • Stripe v22 → v23 (webhook signature change)
  • GitHub PAT deprecation (token → bearer)
  • Twilio Messages.json → /v2/messaging restructure
  • Slack auth.revoke → admin.tokens.revoke deprecation
  • Stripe v18 invoice field rename

These are the five inline seed scenarios in scenarios/library.py, plus the 20 most migration-rich adjacent pairs from the scraped set that were carved out into scenarios/layer3_holdout/ after the scrape.
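
The carve-out of scraped pairs might look like the sketch below. How "migration-rich" is actually measured is not stated here, so the `change_count` field, the record layout, and the function `carve_holdout` are assumptions made for illustration.

```python
def carve_holdout(pairs, k=20):
    """Split pair records into (train, holdout).

    Ranks adjacent pairs by a hypothetical change_count score and holds out
    the top k; ties are broken by provider and version so the split is stable.
    """
    adjacent = [p for p in pairs if p["kind"] == "adjacent"]
    ranked = sorted(
        adjacent,
        key=lambda p: (-p["change_count"], p["provider"], p["v1"], p["v2"]),
    )
    holdout = ranked[:k]
    held = {(p["provider"], p["v1"], p["v2"]) for p in holdout}
    # Everything that is not a held-out pair stays available for training.
    train = [p for p in pairs if (p["provider"], p["v1"], p["v2"]) not in held]
    return train, holdout

pairs = [
    {"provider": "stripe", "kind": "adjacent", "v1": "v1", "v2": "v2", "change_count": 3},
    {"provider": "stripe", "kind": "adjacent", "v1": "v2", "v2": "v3", "change_count": 9},
    {"provider": "slack", "kind": "adjacent", "v1": "v1", "v2": "v2", "change_count": 9},
    {"provider": "stripe", "kind": "skip_one", "v1": "v1", "v2": "v3", "change_count": 12},
]
train, holdout = carve_holdout(pairs, k=2)
```

This sketch only removes the held-out pairs themselves. A stricter split could also drop skip-one and long-range pairs that span a held-out transition; whether the real carve-out does so is not specified here.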

Used for: the final benchmark reported in the README.

Real-data discipline

Every Layer 1 and Layer 3 scenario links back to a real public Git commit in a vendor's official OpenAPI repo. Layer 2 mutations are typed against the documented OpenAPI breaking-change taxonomy. Nothing is fabricated. Deterministic seeds (per-provider sort plus a fixed random seed of 42) make every training run reproducible from a fresh clone.
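
The reason for sorting before seeding is that filesystem listing order can differ between machines, and a seeded RNG only reproduces a draw when its input order is identical. A minimal sketch, assuming a hypothetical mapping from provider name to spec file names (the function `deterministic_sample` is illustrative, not part of the repo):

```python
import random

def deterministic_sample(specs_by_provider, k, seed=42):
    """Draw k spec names reproducibly, regardless of input ordering."""
    ordered = []
    for provider in sorted(specs_by_provider):  # stable provider order
        ordered.extend(sorted(specs_by_provider[provider]))  # stable per-provider order
    rng = random.Random(seed)  # fixed seed, isolated from global RNG state
    return rng.sample(ordered, k)

# Same content, different listing order, as two machines might see it.
run_a = {"stripe": ["s2.json", "s1.json"], "slack": ["k1.json", "k2.json"]}
run_b = {"slack": ["k2.json", "k1.json"], "stripe": ["s1.json", "s2.json"]}
sample_a = deterministic_sample(run_a, k=2)
sample_b = deterministic_sample(run_b, k=2)
```

Both calls return the same selection because the sorted list handed to the RNG is identical in each case.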