# AutoFarm Data Source Registry
This document records the public sources used by the active data bootstrap
pipeline and the local cache that supports reproducible rebuilds.
## Provenance Status
- `exact`: the upstream endpoint or asset is known and documented.
- `page-only`: the upstream source page is known, but the exact downloaded file
URL is not preserved locally.
- `repo-derived`: the local file is derived from another source already present
in the repository.
- `unresolved`: the exact upstream acquisition path is not recoverable from the
current repository state.
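
When the registry is checked programmatically, the four status labels can be treated as a closed vocabulary. The sketch below shows one possible way to do that in Python. The `ProvenanceStatus` enum and the `parse_status` helper are hypothetical names used for illustration only; nothing like them is known to exist in the repository.

```python
from enum import Enum


class ProvenanceStatus(str, Enum):
    """Hypothetical enum mirroring the status labels defined above."""

    EXACT = "exact"
    PAGE_ONLY = "page-only"
    REPO_DERIVED = "repo-derived"
    UNRESOLVED = "unresolved"


def parse_status(cell: str) -> ProvenanceStatus:
    """Parse a registry table cell such as "`exact`" into a status value.

    Surrounding whitespace and backticks are stripped first. A label
    outside the documented vocabulary raises ValueError.
    """
    label = cell.strip().strip("`").strip().lower()
    return ProvenanceStatus(label)
```

Rejecting unknown labels early keeps a typo such as `page_only` from silently becoming a fifth status.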
## Assets Used By The Current Pipeline
| Local asset | Current use | Upstream source | Provenance status | Evidence |
|---|---|---|---|---|
| `data_local/downloads/usda_soil/` | `zone_state_bootstrap` cache | USDA Soil Data Access POST endpoint: <https://sdmdataaccess.sc.egov.usda.gov/Tabular/post.rest> | `exact` | Used by [`build_zone_state_bootstrap.py`](../scripts/build_zone_state_bootstrap.py). |
| live Open-Meteo archive queries | `zone_state_bootstrap` | Archive API: <https://archive-api.open-meteo.com/v1/archive> | `exact` | Used by [`build_zone_state_bootstrap.py`](../scripts/build_zone_state_bootstrap.py). |
| live Open-Meteo forecast queries | `zone_state_bootstrap` | Forecast API: <https://api.open-meteo.com/v1/forecast> | `exact` | Used by [`build_zone_state_bootstrap.py`](../scripts/build_zone_state_bootstrap.py). |
| live Open-Meteo elevation queries | `zone_state_bootstrap` | Elevation API: <https://api.open-meteo.com/v1/elevation> | `exact` | Used by [`build_zone_state_bootstrap.py`](../scripts/build_zone_state_bootstrap.py). |
| live SoilGrids fallback queries | `zone_state_bootstrap` fallback | SoilGrids REST query endpoint: <https://rest.isric.org/soilgrids/v2.0/properties/query> | `exact` | Written into the dataset card by [`build_zone_state_bootstrap.py`](../scripts/build_zone_state_bootstrap.py). |
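
For orientation, a request against the Open-Meteo archive endpoint listed above could be assembled roughly as follows. This is a minimal sketch, not the code in `build_zone_state_bootstrap.py`: the `build_archive_url` helper is hypothetical, and the coordinates, date range, and daily variables in the usage line are placeholder assumptions. The query parameter names (`latitude`, `longitude`, `start_date`, `end_date`, `daily`) follow the public Open-Meteo API; the sketch only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

# Archive endpoint as recorded in the registry table.
ARCHIVE_URL = "https://archive-api.open-meteo.com/v1/archive"


def build_archive_url(latitude, longitude, start_date, end_date, daily):
    """Return an archive query URL for one location and date range.

    Hypothetical helper: the parameters the pipeline actually sends are
    not documented in this registry.
    """
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "start_date": start_date,  # ISO date, e.g. "2023-01-01"
        "end_date": end_date,      # ISO date, inclusive
        "daily": ",".join(daily),  # comma-separated daily variables
    }
    return f"{ARCHIVE_URL}?{urlencode(params)}"


# Placeholder values for illustration only.
url = build_archive_url(
    41.5, -93.6, "2023-01-01", "2023-01-31",
    ["precipitation_sum", "temperature_2m_max"],
)
```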
## Rebuild Notes
When recreating the processed public-data outputs in a fresh environment:
1. run `python scripts/run_public_data_pipeline.py`,
2. verify that `data/processed/zone_state_bootstrap.parquet` exists,
3. confirm that `data/processed/zone_state_bootstrap.dataset_card.json`
records the active upstream sources.
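
Steps 2 and 3 can be automated with a short script. The sketch below is one way to do it, under stated assumptions: the `check_rebuild` helper is hypothetical, the dataset card's JSON schema is not documented here (so the check only searches the serialized card for each URL), and whether every URL, including the SoilGrids fallback, must appear in the card is itself an assumption.

```python
import json
from pathlib import Path

# Upstream endpoints listed in the registry table above.
EXPECTED_SOURCES = (
    "https://sdmdataaccess.sc.egov.usda.gov/Tabular/post.rest",
    "https://archive-api.open-meteo.com/v1/archive",
    "https://api.open-meteo.com/v1/forecast",
    "https://api.open-meteo.com/v1/elevation",
    "https://rest.isric.org/soilgrids/v2.0/properties/query",
)


def check_rebuild(root="."):
    """Return a list of problems found; an empty list means all checks passed."""
    processed = Path(root) / "data" / "processed"
    parquet = processed / "zone_state_bootstrap.parquet"
    card = processed / "zone_state_bootstrap.dataset_card.json"

    problems = []
    if not parquet.is_file():
        problems.append(f"missing {parquet}")
    if not card.is_file():
        problems.append(f"missing {card}")
        return problems

    try:
        payload = json.loads(card.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        problems.append(f"{card} is not valid JSON: {exc}")
        return problems

    # The card schema is unknown, so search the re-serialized JSON text
    # instead of looking up a specific key.
    text = json.dumps(payload)
    for source in EXPECTED_SOURCES:
        if source not in text:
            problems.append(f"dataset card does not mention {source}")
    return problems
```

Calling `check_rebuild()` from the repository root after the pipeline run returns an empty list when both files exist and the card mentions every listed endpoint.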