# AutoFarm Data Source Registry

This document records the public sources used by the active data bootstrap
pipeline and the local cache that supports reproducible rebuilds.

## Provenance Status

- `exact`: the upstream endpoint or asset is known and documented.
- `page-only`: the upstream source page is known, but the exact downloaded file
  URL is not preserved locally.
- `repo-derived`: the local file is derived from another source already present
  in the repository.
- `unresolved`: the exact upstream acquisition path is not recoverable from the
  current repository state.

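
The four status labels form a closed vocabulary, so a registry check can reject anything else. A minimal Python sketch, assuming a hypothetical `ProvenanceStatus` enum and `parse_status` helper (neither name comes from the repository):

```python
from enum import Enum


class ProvenanceStatus(str, Enum):
    """Hypothetical enum mirroring the four status labels defined above."""

    EXACT = "exact"
    PAGE_ONLY = "page-only"
    REPO_DERIVED = "repo-derived"
    UNRESOLVED = "unresolved"


def parse_status(cell: str) -> ProvenanceStatus:
    """Parse a table cell such as "`exact`" into a status.

    Raises ValueError for any label outside the four defined above.
    """
    # Drop surrounding whitespace and the Markdown backticks around the label.
    label = cell.strip().strip("`").strip()
    return ProvenanceStatus(label)
```

Looking a label up by value keeps the enum as the single place where the allowed statuses are listed.
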
## Assets Used By The Current Pipeline

| Local asset | Current use | Upstream source | Provenance status | Evidence |
|---|---|---|---|---|
| `data_local/downloads/usda_soil/` | `zone_state_bootstrap` cache | USDA Soil Data Access POST endpoint: <https://sdmdataaccess.sc.egov.usda.gov/Tabular/post.rest> | `exact` | Used by [`build_zone_state_bootstrap.py`](../scripts/build_zone_state_bootstrap.py). |
| live Open-Meteo archive queries | `zone_state_bootstrap` | Archive API: <https://archive-api.open-meteo.com/v1/archive> | `exact` | Used by [`build_zone_state_bootstrap.py`](../scripts/build_zone_state_bootstrap.py). |
| live Open-Meteo forecast queries | `zone_state_bootstrap` | Forecast API: <https://api.open-meteo.com/v1/forecast> | `exact` | Used by [`build_zone_state_bootstrap.py`](../scripts/build_zone_state_bootstrap.py). |
| live Open-Meteo elevation queries | `zone_state_bootstrap` | Elevation API: <https://api.open-meteo.com/v1/elevation> | `exact` | Used by [`build_zone_state_bootstrap.py`](../scripts/build_zone_state_bootstrap.py). |
| live SoilGrids fallback queries | `zone_state_bootstrap` fallback | SoilGrids REST query endpoint: <https://rest.isric.org/soilgrids/v2.0/properties/query> | `exact` | Written into the dataset card by [`build_zone_state_bootstrap.py`](../scripts/build_zone_state_bootstrap.py). |

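
For orientation, a request to the Open-Meteo Archive API is an HTTP GET with query parameters. The sketch below only builds the URL and sends nothing. The coordinates, date range, and the `temperature_2m_max` daily variable are illustrative assumptions, not the values `build_zone_state_bootstrap.py` sends, and `build_archive_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

ARCHIVE_URL = "https://archive-api.open-meteo.com/v1/archive"


def build_archive_url(latitude, longitude, start_date, end_date, daily):
    """Build an Open-Meteo archive request URL (hypothetical helper).

    Dates are ISO strings (YYYY-MM-DD); `daily` is a list of variable names.
    """
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "start_date": start_date,
        "end_date": end_date,
        # Several variables are sent as one comma-separated value.
        "daily": ",".join(daily),
        "timezone": "GMT",
    }
    # safe="," keeps the comma separator readable instead of %2C.
    return f"{ARCHIVE_URL}?{urlencode(params, safe=',')}"


# Illustrative values only.
url = build_archive_url(41.59, -93.62, "2023-01-01", "2023-01-31", ["temperature_2m_max"])
```
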
## Rebuild Notes

When recreating the processed public-data outputs in a fresh environment:

1. run `python scripts/run_public_data_pipeline.py`,
2. verify that `data/processed/zone_state_bootstrap.parquet` exists,
3. confirm that `data/processed/zone_state_bootstrap.dataset_card.json`
   records the active upstream sources.

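
Steps 2 and 3 can be scripted. The sketch below is a hypothetical `check_rebuild` helper, not part of the repository. It assumes the dataset card stores each upstream source as a URL string somewhere in its JSON, and it makes no assumption about the card's schema beyond that: it walks every string in the document and looks for the endpoint URLs listed in this registry.

```python
import json
from pathlib import Path

# Endpoint URLs copied from the asset table in this registry.
EXPECTED_SOURCES = (
    "https://sdmdataaccess.sc.egov.usda.gov/Tabular/post.rest",
    "https://archive-api.open-meteo.com/v1/archive",
    "https://api.open-meteo.com/v1/forecast",
    "https://api.open-meteo.com/v1/elevation",
    "https://rest.isric.org/soilgrids/v2.0/properties/query",
)


def iter_strings(node):
    """Yield every string key and value found in a parsed JSON document."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for key, value in node.items():
            yield key
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def check_rebuild(processed_dir="data/processed", expected_sources=EXPECTED_SOURCES):
    """Return a list of problems; an empty list means both checks passed."""
    processed = Path(processed_dir)
    parquet_path = processed / "zone_state_bootstrap.parquet"
    card_path = processed / "zone_state_bootstrap.dataset_card.json"
    problems = []

    # Step 2: the processed output must exist.
    if not parquet_path.is_file():
        problems.append(f"missing output: {parquet_path}")

    # Step 3: the dataset card must exist, parse, and mention each source.
    if not card_path.is_file():
        problems.append(f"missing dataset card: {card_path}")
        return problems
    try:
        card = json.loads(card_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        problems.append(f"dataset card is not valid JSON: {exc}")
        return problems

    strings = list(iter_strings(card))
    for url in expected_sources:
        if not any(url in text for text in strings):
            problems.append(f"dataset card does not record: {url}")
    return problems
```

Calling `check_rebuild()` from the repository root after the pipeline run returns an empty list when both files are present and every listed endpoint appears in the card. If the card records sources in some other form (for example by name rather than URL), the `expected_sources` argument would need to change accordingly.
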