
AutoFarm Data Source Registry

This document records the public sources used by the active data bootstrap pipeline and the local cache that supports reproducible rebuilds.

Provenance Status

  • exact: the upstream endpoint or asset is known and documented.
  • page-only: the upstream source page is known, but the exact downloaded file URL is not preserved locally.
  • repo-derived: the local file is derived from another source already present in the repository.
  • unresolved: the exact upstream acquisition path is not recoverable from the current repository state.
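The four status labels form a closed vocabulary, so a registry entry can be checked mechanically. The following is a minimal sketch, not code from the repository: `ProvenanceStatus`, `RegistryEntry`, and `parse_status` are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class ProvenanceStatus(Enum):
    """The four provenance labels defined in this document."""

    EXACT = "exact"
    PAGE_ONLY = "page-only"
    REPO_DERIVED = "repo-derived"
    UNRESOLVED = "unresolved"


@dataclass(frozen=True)
class RegistryEntry:
    """One registry row (hypothetical structure, reduced to three fields)."""

    local_asset: str
    upstream_source: str
    status: ProvenanceStatus


def parse_status(label: str) -> ProvenanceStatus:
    """Map a label to a status; raises ValueError for unknown labels."""
    return ProvenanceStatus(label.strip().lower())
```

Rejecting unknown labels at parse time keeps a typo such as `page_only` from silently becoming a fifth status.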

Assets Used By The Current Pipeline

| Local asset | Current use | Upstream source | Provenance status | Evidence |
| --- | --- | --- | --- | --- |
| data_local/downloads/usda_soil/ | zone_state_bootstrap cache | USDA Soil Data Access POST endpoint: https://sdmdataaccess.sc.egov.usda.gov/Tabular/post.rest | exact | Used by build_zone_state_bootstrap.py. |
| live Open-Meteo archive queries | zone_state_bootstrap | Archive API: https://archive-api.open-meteo.com/v1/archive | exact | Used by build_zone_state_bootstrap.py. |
| live Open-Meteo forecast queries | zone_state_bootstrap | Forecast API: https://api.open-meteo.com/v1/forecast | exact | Used by build_zone_state_bootstrap.py. |
| live Open-Meteo elevation queries | zone_state_bootstrap | Elevation API: https://api.open-meteo.com/v1/elevation | exact | Used by build_zone_state_bootstrap.py. |
| live SoilGrids fallback queries | zone_state_bootstrap fallback | SoilGrids REST query endpoint: https://rest.isric.org/soilgrids/v2.0/properties/query | exact | Written into the dataset card by build_zone_state_bootstrap.py. |
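As an illustration of how one of the live endpoints above is addressed, the sketch below builds a query URL for the Open-Meteo archive endpoint without sending a request. The function name `build_archive_url`, the coordinates, and the chosen daily variables are assumptions made for this example. The parameter names follow Open-Meteo's public API documentation, and the queries actually issued by build_zone_state_bootstrap.py are not shown in this document and may differ.

```python
from urllib.parse import urlencode

# Endpoint as listed in the registry table.
ARCHIVE_URL = "https://archive-api.open-meteo.com/v1/archive"


def build_archive_url(lat, lon, start_date, end_date, daily=("precipitation_sum",)):
    """Return an archive query URL (hypothetical helper; no request is sent).

    Dates are ISO strings (YYYY-MM-DD); `daily` lists daily variable names.
    """
    params = {
        "latitude": lat,
        "longitude": lon,
        "start_date": start_date,
        "end_date": end_date,
        "daily": ",".join(daily),
    }
    return f"{ARCHIVE_URL}?{urlencode(params)}"


# Example with illustrative coordinates and dates (not taken from the pipeline).
url = build_archive_url(41.5, -93.6, "2023-01-01", "2023-01-31")
```

Building the URL separately from sending it makes the exact request easy to log next to the cached response, which helps keep the provenance status `exact`.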

Rebuild Notes

When recreating the processed public-data outputs in a fresh environment:

  1. run python scripts/run_public_data_pipeline.py,
  2. verify that data/processed/zone_state_bootstrap.parquet exists,
  3. confirm that data/processed/zone_state_bootstrap.dataset_card.json records the active upstream sources.
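Steps 2 and 3 can be automated after the pipeline run. The following is a minimal sketch under stated assumptions: `check_rebuild` is a hypothetical helper, and it assumes each upstream URL from the registry appears verbatim somewhere in the dataset card. The card's actual schema is not documented here, so the check searches the serialized JSON instead of specific keys.

```python
import json
from pathlib import Path

# Upstream endpoints listed in this registry.
EXPECTED_SOURCES = (
    "https://sdmdataaccess.sc.egov.usda.gov/Tabular/post.rest",
    "https://archive-api.open-meteo.com/v1/archive",
    "https://api.open-meteo.com/v1/forecast",
    "https://api.open-meteo.com/v1/elevation",
    "https://rest.isric.org/soilgrids/v2.0/properties/query",
)


def check_rebuild(root: Path = Path(".")) -> list[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    processed = root / "data" / "processed"
    parquet = processed / "zone_state_bootstrap.parquet"
    card = processed / "zone_state_bootstrap.dataset_card.json"

    if not parquet.is_file():
        problems.append(f"missing output: {parquet}")
    if not card.is_file():
        problems.append(f"missing dataset card: {card}")
        return problems

    try:
        card_data = json.loads(card.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        problems.append(f"dataset card is not valid JSON: {exc}")
        return problems

    # Re-serialize so escaped characters in the file do not hide a URL.
    card_text = json.dumps(card_data)
    for source_url in EXPECTED_SOURCES:
        if source_url not in card_text:
            problems.append(f"dataset card does not mention: {source_url}")
    return problems
```

Running `check_rebuild()` from the repository root and printing any returned messages gives a quick pass/fail signal for a fresh rebuild.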