# data discovery & verification protocol

## purpose
This protocol defines a rigorous, reproducible process for exploring, verifying, and documenting external data sources (Hugging Face Datasets, BIDS repos, etc.) before they are integrated into the production codebase. It prevents "schema guessing" and keeps the codebase's strict types aligned with what the data actually contains.

## principles
1.  **No Assumptions**: Never assume column names, file formats, or data types. Verify them programmatically.
2.  **Isolation**: Keep discovery scripts separate from production code, and keep their outputs (downloaded samples, reports) out of source control.
3.  **Reproducibility**: The discovery process must be scriptable and reproducible, not a series of manual CLI commands.

## standard locations

### scripts
All discovery logic resides in:
```
scripts/discovery/
├── __init__.py
├── inspect_hf_dataset.py   # e.g., Generic HF inspector
├── verify_bids_layout.py   # e.g., BIDS validator
└── ...
```

### data & artifacts
All downloaded samples, temporary outputs, and schema reports reside in:
```
data/
├── isles24/             # Extracted ISLES24 data (IGNORED)
└── discovery/           # Schema reports, samples (IGNORED)
```

## discovery workflow

### 1. implementation
Write a focused script in `scripts/discovery/` that:
- Connects to the data source (e.g., HF Hub).
- Fetches *metadata* or a *minimal sample* (streaming mode preferred).
- Prints or logs:
    - Feature keys (column names).
    - Data types (Arrow types, Python types).
    - Non-null counts (if feasible).
    - A sample row structure.
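
The reporting step above can be sketched as a small helper. This is a minimal illustration, not the project's actual inspector: `summarize_rows` and its report layout are hypothetical names, and the commented wiring assumes the Hugging Face `datasets` package with a placeholder repo ID. It records Python types only; Arrow types would come from the dataset's own feature metadata.

```python
import itertools
import json
from collections import Counter
from typing import Any, Iterable, Mapping


def summarize_rows(rows: Iterable[Mapping[str, Any]], limit: int = 5) -> dict:
    """Collect keys, Python types, and non-null counts from the first `limit` rows."""
    types: dict[str, Counter] = {}
    non_null: Counter = Counter()
    sample = None
    seen = 0
    # islice keeps the fetch minimal, which matters for streaming sources.
    for row in itertools.islice(rows, limit):
        seen += 1
        if sample is None:
            # Keep a truncated repr of the first row so the report stays readable.
            sample = {key: repr(value)[:80] for key, value in row.items()}
        for key, value in row.items():
            types.setdefault(key, Counter())[type(value).__name__] += 1
            if value is not None:
                non_null[key] += 1
    return {
        "rows_inspected": seen,
        "columns": {
            key: {"python_types": dict(counts), "non_null": non_null[key]}
            for key, counts in types.items()
        },
        "sample_row": sample,
    }


# Hypothetical wiring, left commented out because it needs network access
# and the `datasets` package ("<repo-id>" is a placeholder, not a real ID):
#
#   from datasets import load_dataset
#   ds = load_dataset("<repo-id>", split="train", streaming=True)
#   print(json.dumps(summarize_rows(ds), indent=2))
```

Because the helper accepts any iterable of mappings, it can be exercised against in-memory rows without touching the network.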

### 2. execution
Run the script from the project root:
```bash
mkdir -p data/discovery  # git-ignored, so it may not exist in a fresh clone
uv run scripts/discovery/inspect_hf_dataset.py > data/discovery/schema_report.txt
```

### 3. verification
Manually review `data/discovery/schema_report.txt`.
- **Check**: Do column names match `CaseAdapter` expectations?
- **Check**: Are file fields stored as plain path strings or as structured objects?
- **Check**: Are required fields (DWI, ADC) actually present?
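
The manual review could be backed by a small comparison helper. The sketch below is an assumption-laden illustration: `find_schema_mismatches` is a hypothetical name, and the expected mapping would have to be written by hand from whatever `CaseAdapter` actually requires.

```python
from typing import Mapping


def find_schema_mismatches(
    observed: Mapping[str, str], expected: Mapping[str, str]
) -> list[str]:
    """Compare observed column -> type names against the expected ones."""
    problems = []
    for column, expected_type in expected.items():
        if column not in observed:
            problems.append(f"missing column: {column}")
        elif observed[column] != expected_type:
            problems.append(
                f"type mismatch for {column}: "
                f"expected {expected_type}, got {observed[column]}"
            )
    # Extra columns are usually harmless, but listing them avoids surprises.
    for column in sorted(observed.keys() - expected.keys()):
        problems.append(f"unexpected column: {column}")
    return problems
```

An empty result means the observed schema matches the expectations; anything else feeds into the remediation step.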

### 4. remediation
If the report contradicts the code/specs:
1.  Update the spec (`docs/specs/`) to reflect reality.
2.  Update the code (`src/.../adapter.py`) to handle the actual schema.
3.  Add a regression test if the edge case is complex.
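
A regression test for such an edge case might look like the following pytest-style sketch. `resolve_path` is a hypothetical stand-in for adapter logic, not the real `CaseAdapter` API, and the file path and the `{"path": ...}` mapping shape are made-up examples of a field that turned out not to be a plain string.

```python
from typing import Any, Mapping, Union


def resolve_path(field: Union[str, Mapping[str, Any]]) -> str:
    """Hypothetical adapter helper: accept a path string or a {"path": ...} mapping."""
    if isinstance(field, str):
        return field
    if isinstance(field, Mapping) and isinstance(field.get("path"), str):
        return field["path"]
    raise TypeError(f"unsupported file field: {type(field).__name__}")


def test_resolve_path_accepts_string():
    assert resolve_path("sub-01/dwi.nii.gz") == "sub-01/dwi.nii.gz"


def test_resolve_path_accepts_mapping():
    # Pins the shape observed during discovery so a schema change fails loudly.
    field = {"path": "sub-01/dwi.nii.gz", "bytes": None}
    assert resolve_path(field) == "sub-01/dwi.nii.gz"


def test_resolve_path_rejects_other_types():
    try:
        resolve_path(42)
    except TypeError:
        pass
    else:
        raise AssertionError("expected TypeError")
```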

## git configuration
Ensure `.gitignore` includes:
```gitignore
data/isles24/
data/discovery/
```