apishift-env / docs /DATA_PYRAMID.md

yaswanth169

Initial APIShift env push

3040bf7 verified about 1 month ago


	# APIShift Data Pyramid

	Three layers of training and evaluation data.

	## Layer 1 — Real scraped OpenAPI version pairs

	Scraped from public Git histories of five widely used vendor APIs using
	`scripts/scrape_specs.py`. All spec files are extracted verbatim from
	tagged commits in the vendors' official OpenAPI repos.

	| Provider | Source repo | Spec format | Versions scraped | (v1, v2) pairs |
	|----------|-------------|-------------|------------------|----------------|
	| Stripe | stripe/openapi | JSON (spec3.sdk.json) | ~80 | ~240 |
	| GitHub | github/rest-api-description | YAML | ~50 | ~150 |
	| Twilio | twilio/twilio-oai | YAML (3 product lines) | ~40×3 | ~360 |
	| Slack | slackapi/slack-api-specs | JSON | ~30 | ~80 |
	| OpenAI | openai/openai-openapi | YAML | ~20 | ~50 |

	Total: ~200 spec versions → ~840+ real (v1, v2) pairs. The per-provider
	figures above are approximate, so the rows do not sum exactly to these totals.

	Pairs are generated at three version distances:
	- Adjacent (n→n+1): the realistic "I stayed one version behind" case
	- Skip-one (n→n+2): the "I missed a release" case
	- Long-range (random gap of 5 to 20 versions): the "I am very far behind" case
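The three pairing strategies above can be sketched as follows. This is a minimal illustration, not the logic of `scripts/build_pair_index.py`: the function name `build_pairs`, the tuple layout, and the choice of one long-range partner per starting version are assumptions.

```python
import random

def build_pairs(versions, seed=42, min_gap=5, max_gap=20):
    """Return (kind, v1, v2) tuples for an ordered list of version tags.

    Hypothetical sketch: one adjacent, one skip-one and one long-range
    pair per starting version, wherever the history is long enough.
    """
    rng = random.Random(seed)  # local RNG so repeated calls give the same pairs
    pairs = []
    n = len(versions)
    for i in range(n):
        if i + 1 < n:
            pairs.append(("adjacent", versions[i], versions[i + 1]))
        if i + 2 < n:
            pairs.append(("skip_one", versions[i], versions[i + 2]))
        if i + min_gap < n:
            # Clamp the gap so the target index stays inside the history.
            gap = rng.randint(min_gap, min(max_gap, n - 1 - i))
            pairs.append(("long_range", versions[i], versions[i + gap]))
    return pairs

# Placeholder tags standing in for a provider with 30 scraped versions.
versions = [f"v{k}" for k in range(1, 31)]
pairs = build_pairs(versions)
print(len(pairs))
```

Under these assumptions a 30-version history yields 29 adjacent, 28 skip-one and 25 long-range pairs, 82 in total, which is in the same range as the pair counts listed in the table.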

	Spec files live in `scenarios/layer1_real/<provider>/`.
	The global index is `scenarios/layer1_real/_global_index.json`.

	To reproduce:
	```bash
	python scripts/scrape_specs.py --out scenarios/layer1_real
	python scripts/extract_client_samples.py --out scenarios/layer1_real
	python scripts/build_pair_index.py --out scenarios/layer1_real
	```

	## Layer 2 — Synthetic perturbation

	`scenarios/layer2_synthetic/mutator.py` takes a real OpenAPI spec and
	applies one or more typed mutations from a 12-class taxonomy:

	| Mutation class | Severity |
	|----------------|----------|
	| field_renamed | low |
	| type_narrowed | medium |
	| required_field_added | medium |
	| endpoint_removed | high |
	| enum_narrowed | high |
	| response_shape_changed | medium |
	| auth_scheme_changed | high |
	| field_removed | high |
	| param_required_added | medium |
	| default_changed | low |
	| method_changed | high |
	| status_code_removed | high |
	Each mutation produces a `(mutated_spec, ground_truth_change_record)` pair
	so the grader has reliable labels.
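As an illustration of what one typed mutation and its label might look like, here is a minimal sketch of an `endpoint_removed` mutation on a spec loaded as a plain dict. The function `remove_endpoint`, the record keys, and the tiny example spec are all hypothetical and are not taken from `mutator.py`.

```python
import copy

def remove_endpoint(spec, path):
    """Apply a hypothetical `endpoint_removed` mutation to an OpenAPI dict.

    Returns (mutated_spec, ground_truth_change_record).
    """
    mutated = copy.deepcopy(spec)  # leave the base spec untouched
    removed = mutated["paths"].pop(path)
    record = {
        "mutation_class": "endpoint_removed",
        "severity": "high",  # severity from the taxonomy table
        "path": path,
        "methods": sorted(removed.keys()),
    }
    return mutated, record

# Made-up miniature spec, used only to exercise the function.
base = {
    "openapi": "3.0.0",
    "paths": {
        "/v1/charges": {"get": {}, "post": {}},
        "/v1/refunds": {"post": {}},
    },
}
mutated, record = remove_endpoint(base, "/v1/refunds")
print(record)
```

Returning the record alongside the mutated copy means the label is produced by the same step that makes the change, so the two cannot drift apart.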

	When `layer1_real/` is populated, the mutator picks a random real spec from
	any of the five providers as the mutation base. This gives 200+ unique
	starting points instead of one hand-typed seed, and combining base specs
	with mutation classes expands to 40,000+ unique synthetic scenarios.
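The text does not state how the 40,000+ figure is derived. One way to get a feel for the scale is to count combinations of a base spec with an unordered set of distinct mutation classes. That counting model is an assumption made only for this sketch.

```python
from math import comb

BASE_SPECS = 200        # "200+ unique starting points" from the text
MUTATION_CLASSES = 12   # size of the mutation taxonomy

def scenario_count(max_mutations):
    """Count (base spec, set of mutation classes) combinations.

    Assumes each scenario applies between 1 and `max_mutations` distinct
    classes and that order does not matter (an illustrative model only).
    """
    per_base = sum(comb(MUTATION_CLASSES, k) for k in range(1, max_mutations + 1))
    return BASE_SPECS * per_base

for k in (1, 2, 3):
    print(k, scenario_count(k))
# 1 2400
# 2 15600
# 3 59600
```

Under this model, allowing up to three mutation classes per scenario already exceeds 40,000 combinations. Varying which field or endpoint each mutation targets would multiply the count further.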

	Used for: bulk training (difficulty weights sampled by the CurriculumAgent).

	## Layer 3 — Held-out evaluation

	A locked set of famous real migrations the agent never sees during training:

	- Stripe v22 → v23 (webhook signature change)
	- GitHub PAT deprecation (token → bearer)
	- Twilio Messages.json → /v2/messaging restructure
	- Slack auth.revoke → admin.tokens.revoke deprecation
	- Stripe v18 invoice field rename

	These are the five inline seed scenarios in `scenarios/library.py`, plus the
	20 most migration-rich adjacent pairs from the scraped set that were carved
	out into `scenarios/layer3_holdout/` after the scrape.
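The carve-out step could look roughly like the sketch below. How "migration-rich" is measured is not specified in this document, so the `breaking_changes` score, the record layout, and the function name `carve_holdout` are all assumptions for illustration.

```python
def carve_holdout(pairs, k=20):
    """Split pair records into (holdout, train).

    Hypothetical sketch: rank adjacent pairs by a `breaking_changes`
    score and move the top k out of the training pool.
    """
    adjacent = [p for p in pairs if p["kind"] == "adjacent"]
    # Sort by score descending, then by id so ties resolve deterministically.
    ranked = sorted(adjacent, key=lambda p: (-p["breaking_changes"], p["id"]))
    holdout = ranked[:k]
    holdout_ids = {p["id"] for p in holdout}
    train = [p for p in pairs if p["id"] not in holdout_ids]
    return holdout, train

# Made-up pair records with synthetic scores.
pairs = [
    {
        "id": f"pair-{i:02d}",
        "kind": "adjacent" if i % 2 == 0 else "skip_one",
        "breaking_changes": (i * 7) % 23,
    }
    for i in range(60)
]
holdout, train = carve_holdout(pairs)
print(len(holdout), len(train))
```

Filtering the training pool by id keeps the two sets disjoint, which is the property the held-out benchmark depends on.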

	Used for: the final benchmark reported in the README.

	## Real-data discipline

	Every Layer 1 and Layer 3 scenario links back to a real public Git commit
	in a vendor's official OpenAPI repo. Layer 2 mutations are typed against the
	documented OpenAPI breaking-change taxonomy. Nothing is fabricated.
	Deterministic ordering (a per-provider sort) and a fixed random seed (42)
	make every training run reproducible from a fresh clone.
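The combination of a per-provider sort and a fixed seed can be sketched as below. The function `sample_bases`, the dict-of-lists input, and the file names are hypothetical. The point is only that sorting before sampling removes any dependence on filesystem or dict ordering.

```python
import random

def sample_bases(specs_by_provider, n, seed=42):
    """Pick n (provider, spec_name) bases reproducibly.

    Hypothetical sketch: sort providers and their specs first, then draw
    from a locally seeded RNG so the picks do not depend on input order.
    """
    rng = random.Random(seed)
    ordered = [
        (provider, name)
        for provider in sorted(specs_by_provider)
        for name in sorted(specs_by_provider[provider])
    ]
    return [rng.choice(ordered) for _ in range(n)]

# The same specs listed in two different orders (placeholder names).
specs = {"stripe": ["spec_b.json", "spec_a.json"], "github": ["spec_c.yaml"]}
shuffled = {"github": ["spec_c.yaml"], "stripe": ["spec_a.json", "spec_b.json"]}
print(sample_bases(specs, 5) == sample_bases(shuffled, 5))
```

Using a local `random.Random(seed)` instead of seeding the global module keeps the picks stable even if other code draws random numbers in between.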