# APIShift Data Pyramid
Three layers of training and evaluation data.
## Layer 1: Real scraped OpenAPI version pairs
Scraped from public Git histories of five widely used vendor APIs using
`scripts/scrape_specs.py`. All spec files are extracted verbatim from
tagged commits in the vendors' official OpenAPI repos.
| Provider | Source repo | Spec format | Versions scraped | (v1,v2) pairs |
|----------|-------------|-------------|-----------------|---------------|
| Stripe | stripe/openapi | JSON (spec3.sdk.json) | ~80 | ~240 |
| GitHub | github/rest-api-description | YAML | ~50 | ~150 |
| Twilio | twilio/twilio-oai | YAML (3 product lines) | ~40×3 | ~360 |
| Slack | slackapi/slack-api-specs | JSON | ~30 | ~80 |
| OpenAI | openai/openai-openapi | YAML | ~20 | ~50 |
**Total: ~200 spec versions → ~840+ real (v1, v2) pairs**
Pairs are generated in three ways:
- **Adjacent** (n→n+1): realistic "I stayed one version behind" case
- **Skip-one** (n→n+2): "I missed a release" case
- **Long-range** (random gap of 5 to 20 versions): "I am very far behind" case
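
The three pairing modes above can be sketched as follows. This is a minimal illustration, not the code in `scripts/build_pair_index.py`: the function name `build_pairs`, its parameters, and the `(kind, v1, v2)` tuple layout are all assumptions made for this example.

```python
import random

def build_pairs(versions, seed=42, min_gap=5, max_gap=20):
    """Return (kind, v1, v2) tuples for an ordered list of version tags.

    Hypothetical helper: the real pair index may be built differently.
    """
    rng = random.Random(seed)  # fixed seed keeps the long-range picks repeatable
    pairs = []
    n = len(versions)
    for i in range(n):
        if i + 1 < n:
            pairs.append(("adjacent", versions[i], versions[i + 1]))
        if i + 2 < n:
            pairs.append(("skip_one", versions[i], versions[i + 2]))
        # A long-range pair needs at least min_gap later versions to exist.
        largest_gap = min(max_gap, n - 1 - i)
        if largest_gap >= min_gap:
            gap = rng.randint(min_gap, largest_gap)
            pairs.append(("long_range", versions[i], versions[i + gap]))
    return pairs

versions = [f"v{k}" for k in range(1, 11)]  # ten made-up version tags
pairs = build_pairs(versions)
```

In this sketch, versions near the end of the history simply produce no long-range pair, which is one possible reason a provider's pair count can fall slightly below three times its version count.
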
Spec files live in `scenarios/layer1_real/<provider>/`.
The global index is `scenarios/layer1_real/_global_index.json`.
To reproduce:
```bash
python scripts/scrape_specs.py --out scenarios/layer1_real
python scripts/extract_client_samples.py --out scenarios/layer1_real
python scripts/build_pair_index.py --out scenarios/layer1_real
```
## Layer 2: Synthetic perturbation
`scenarios/layer2_synthetic/mutator.py` takes a real OpenAPI spec and
applies one or more typed mutations from a 12-class taxonomy:
| Mutation class | Severity |
|----------------|---------|
| field_renamed | low |
| type_narrowed | medium |
| required_field_added | medium |
| endpoint_removed | high |
| enum_narrowed | high |
| response_shape_changed | medium |
| auth_scheme_changed | high |
| field_removed | high |
| param_required_added | medium |
| default_changed | low |
| method_changed | high |
| status_code_removed | high |
Each mutation produces a `(mutated_spec, ground_truth_change_record)` pair
so the grader has reliable labels.
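
As a rough illustration of what one typed mutation and its label might look like, the sketch below applies a `field_renamed` change to a tiny spec. The function `rename_field`, the keys of the change record, and the sample `Invoice` schema are assumptions for this example and may not match `mutator.py`.

```python
import copy

def rename_field(spec, schema_name, old, new):
    """Apply a field_renamed mutation; return (mutated_spec, change_record).

    Hypothetical helper, not the mutator's actual interface.
    """
    mutated = copy.deepcopy(spec)  # leave the real base spec untouched
    schema = mutated["components"]["schemas"][schema_name]
    props = schema["properties"]
    props[new] = props.pop(old)
    # Keep the "required" list in step with the renamed property.
    if "required" in schema:
        schema["required"] = [new if r == old else r for r in schema["required"]]
    record = {
        "mutation_class": "field_renamed",
        "severity": "low",
        "schema": schema_name,
        "old_name": old,
        "new_name": new,
    }
    return mutated, record

spec = {
    "openapi": "3.0.0",
    "components": {"schemas": {"Invoice": {
        "type": "object",
        "required": ["amount"],
        "properties": {"amount": {"type": "integer"}, "memo": {"type": "string"}},
    }}},
}
mutated, record = rename_field(spec, "Invoice", "amount", "amount_due")
```

Returning the record alongside the mutated spec means the label is produced by the same code path that made the change, so the two cannot drift apart.
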
When `layer1_real/` is populated, the mutator picks a random real spec from
any of the five providers as the mutation base. This gives 200+ unique
starting points instead of one hand-typed seed, for a combinatorial expansion
to 40,000+ unique synthetic scenarios.
Used for: bulk training (difficulty weights sampled by the CurriculumAgent).
## Layer 3: Held-out evaluation
A locked set of famous real migrations the agent never sees during training:
- Stripe v22 → v23 (webhook signature change)
- GitHub PAT deprecation (token → bearer)
- Twilio Messages.json → /v2/messaging restructure
- Slack auth.revoke → admin.tokens.revoke deprecation
- Stripe v18 invoice field rename
These are the five inline seed scenarios in `scenarios/library.py`, plus the
20 most migration-rich adjacent pairs from the scraped set that were carved
out into `scenarios/layer3_holdout/` after the scrape.
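
One way to keep the carved-out pairs away from training is to filter the training pool against the held-out set. The sketch below assumes pairs are represented as `(provider, v1, v2)` tuples; that layout, the function name `remove_holdout`, and the sample entries are illustrative assumptions only.

```python
def remove_holdout(train_pairs, holdout_pairs):
    """Drop every training pair that also appears in the held-out set."""
    held = set(holdout_pairs)
    return [p for p in train_pairs if p not in held]

# Made-up pair tuples; real entries would come from the pair index.
all_pairs = [
    ("stripe", "v21", "v22"),
    ("stripe", "v22", "v23"),
    ("slack", "v5", "v6"),
]
holdout = [("stripe", "v22", "v23")]
train = remove_holdout(all_pairs, holdout)

# Guard against leakage: no held-out pair may remain in the training pool.
assert not set(train) & set(holdout)
```
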
Used for: the final benchmark reported in the README.
## Real-data discipline
Every Layer 1 and Layer 3 scenario links back to a real public Git commit
in a vendor's official OpenAPI repo. Layer 2 mutations are typed against the
documented OpenAPI breaking change taxonomy. Nothing is fabricated.
Deterministic seeds (per-provider sort + fixed random seed 42) make every
training run reproducible from a fresh clone.
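
The combination of sorting and a fixed seed can be sketched like this. The helper `sample_bases` and the dictionary layout are assumptions for illustration; only the seed value 42 and the per-provider sort come from the description above.

```python
import random

def sample_bases(specs_by_provider, k, seed=42):
    """Pick k (provider, spec) bases in a way that ignores input ordering."""
    # Sort providers and spec names so directory listing order cannot
    # change which items the seeded generator selects.
    pool = [
        (provider, spec)
        for provider in sorted(specs_by_provider)
        for spec in sorted(specs_by_provider[provider])
    ]
    rng = random.Random(seed)  # local generator; global random state is untouched
    return rng.sample(pool, k)

# The same content listed in two different orders.
listing_a = {"stripe": ["v2", "v1"], "github": ["v1", "v3", "v2"]}
listing_b = {"github": ["v2", "v1", "v3"], "stripe": ["v1", "v2"]}
picks_a = sample_bases(listing_a, 3)
picks_b = sample_bases(listing_b, 3)
```

Sorting before sampling matters because a seeded generator only reproduces the same choices when it is handed the same sequence in the same order.
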