Miyu Horiuchi
# Hybrid Predictor
The hybrid predictor combines the strongest current model for each phenotype:
- Temperature: saved tabular XGBoost phenotype head
- pH: saved tabular XGBoost phenotype head
- Salt: saved tabular XGBoost phenotype head
- Oxygen: fold-0 all-task LoRA checkpoint

Run it with `scripts/39_predict_hybrid.py`:
```bash
PYTHONPATH=src uv run --python 3.11 --extra dev --extra embeddings python scripts/39_predict_hybrid.py \
  --features data/training_table.parquet \
  --marker-sequences data/marker_sequences.jsonl \
  --limit 25 \
  --device mps \
  --output artifacts/hybrid_predictions.parquet
```
Use `--device cuda` on a GPU host, `--device mps` on Apple Silicon, or omit the flag
to let PyTorch choose CUDA when available and CPU otherwise. CPU LoRA inference is
slow because ESM-2 encodes multiple marker proteins per genome.
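The fallback described above can be sketched as follows. This is a minimal illustration rather than the script's actual code: `resolve_device` is a hypothetical helper name, and the sketch assumes that omitting `--device` reaches the code as `None`.

```python
from typing import Optional


def resolve_device(requested: Optional[str] = None) -> str:
    """Return the requested device, or fall back to CUDA and then CPU."""
    if requested is not None:
        # An explicit flag such as "cuda" or "mps" is passed through unchanged.
        return requested
    try:
        import torch
    except ImportError:
        # Without PyTorch installed there is nothing to probe.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(resolve_device("mps"))  # explicit choice is passed through
print(resolve_device())       # "cuda" on a GPU host, otherwise "cpu"
```

Note that MPS is never picked automatically in this sketch, which is why the Apple Silicon examples pass `--device mps` explicitly.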
For larger uncultured-genome batches, use chunked output so progress is durable:
```bash
PYTHONPATH=src uv run --python 3.11 --extra dev --extra embeddings python scripts/39_predict_hybrid.py \
  --features artifacts/uncultured_predictions.parquet \
  --marker-sequences data/uncultured_marker_sequences.jsonl \
  --join left \
  --reuse-existing-tabular \
  --device mps \
  --batch-size 2 \
  --chunk-size 250 \
  --chunk-output-dir artifacts/hybrid_chunks \
  --resume-chunks \
  --progress-every 25 \
  --output artifacts/hybrid_predictions.parquet
```
`--resume-chunks` skips existing chunk files and combines all expected chunks into
the final output when the run finishes. `--reuse-existing-tabular` keeps previously
materialized temperature, pH, salt, and media outputs while replacing oxygen with
LoRA where marker sequences are available.
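The skip-and-combine behaviour of `--resume-chunks` can be illustrated with a small stdlib-only sketch. Everything here is an assumption made for illustration: `run_chunked`, the `chunk_00000.jsonl` naming scheme, the JSONL chunk format, and the write-then-rename step are not taken from the script.

```python
import json
import tempfile
from pathlib import Path


def run_chunked(rows, chunk_size, chunk_dir, predict, resume=True):
    """Predict in chunks, writing each chunk to disk before moving on."""
    chunk_dir = Path(chunk_dir)
    chunk_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, start in enumerate(range(0, len(rows), chunk_size)):
        # Hypothetical naming scheme; the real script may name chunks differently.
        path = chunk_dir / f"chunk_{index:05d}.jsonl"
        paths.append(path)
        if resume and path.exists():
            continue  # finished in an earlier run, so skip the expensive work
        results = [predict(row) for row in rows[start:start + chunk_size]]
        # Write under a temporary name first so an interrupted run never leaves
        # a half-written chunk that a later resume would treat as complete.
        tmp = path.with_suffix(".tmp")
        tmp.write_text("".join(json.dumps(r) + "\n" for r in results))
        tmp.replace(path)
    # Combine every expected chunk, whether written now or in a previous run.
    combined = []
    for path in paths:
        combined.extend(json.loads(line) for line in path.read_text().splitlines())
    return combined


calls = []


def fake_predict(row):
    # Stand-in for the real model call; records which rows were processed.
    calls.append(row["genome_accession"])
    return {"genome_accession": row["genome_accession"], "done": True}


rows = [{"genome_accession": f"genome_{i}"} for i in range(5)]
with tempfile.TemporaryDirectory() as chunk_dir:
    first = run_chunked(rows, 2, chunk_dir, fake_predict)
    second = run_chunked(rows, 2, chunk_dir, fake_predict)  # resumes: no new work
print(len(first), len(second), len(calls))  # 5 5 5
```

The second call finds all three chunk files on disk, so `fake_predict` is not invoked again and the combined result is rebuilt from the saved chunks.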
## Inputs
`--features` accepts feature rows as `.parquet`, `.csv`, `.json`, or `.jsonl`. By
default, every row must include a `genome_accession` column. The tabular heads load
their feature column order from `models/phenotype/feature_cols.json`.
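Column order matters because a tabular model expects features in the order it was trained on. The sketch below shows the idea under one loud assumption: that `feature_cols.json` holds a JSON array of column names, which this document does not confirm. `to_feature_vector` and the `feat_*` names are hypothetical.

```python
import json


def to_feature_vector(row: dict, feature_cols: list) -> list:
    """Order one feature row to match the saved feature column order."""
    missing = [col for col in feature_cols if col not in row]
    if missing:
        raise KeyError(f"feature row is missing columns: {missing}")
    return [row[col] for col in feature_cols]


# Stand-in for loading models/phenotype/feature_cols.json,
# assuming that file holds a JSON array of column names.
feature_cols = json.loads('["feat_a", "feat_b", "feat_c"]')

# Keys arrive in arbitrary order and may include extra, non-feature columns.
row = {"genome_accession": "GCF_004341595.1", "feat_c": 3.0, "feat_a": 1.0, "feat_b": 2.0}
print(to_feature_vector(row, feature_cols))  # [1.0, 2.0, 3.0]
```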
`--marker-sequences` must be a JSONL file with the same schema as
`data/marker_sequences.jsonl`:
```json
{
"bacdive_id": 1,
"genome_accession": "GCF_004341595.1",
"by_category": {
"temperature": ["..."],
"ph": ["..."],
"oxygen": ["..."],
"salt": ["..."],
"vitamin": ["..."],
"nitrogen": ["..."],
"carbon": ["..."],
"special": ["..."]
}
}
```
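A record like the one above can be sanity-checked before a long run. The following is a hedged sketch, not part of the repository: `check_marker_record` is a hypothetical helper, and because the document does not say whether every category or `bacdive_id` is required, it only rejects unknown categories and non-string sequences.

```python
import json

# Category keys taken from the example record above.
CATEGORIES = ("temperature", "ph", "oxygen", "salt",
              "vitamin", "nitrogen", "carbon", "special")


def check_marker_record(line: str) -> dict:
    """Parse one JSONL line and verify the basic shape of a marker record."""
    record = json.loads(line)
    if not isinstance(record.get("genome_accession"), str):
        raise ValueError("genome_accession must be a string")
    by_category = record.get("by_category")
    if not isinstance(by_category, dict):
        raise ValueError("by_category must be an object")
    for category, sequences in by_category.items():
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        if not isinstance(sequences, list) or not all(isinstance(s, str) for s in sequences):
            raise ValueError(f"{category} must be a list of sequence strings")
    return record


line = '{"genome_accession": "GCF_004341595.1", "by_category": {"oxygen": ["..."]}}'
print(check_marker_record(line)["genome_accession"])  # GCF_004341595.1
```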
For uncultured genomes, first prepare a marker-sequence JSONL for those same
`genome_accession` values. The existing `data/marker_sequences.jsonl` is keyed to
BacDive training rows, not the uncultured candidate table.
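Before launching a batch, it can help to see which feature rows actually have marker sequences, since only those can receive LoRA oxygen predictions. This is an illustrative sketch: `split_by_marker_coverage` is a hypothetical helper, and `EXAMPLE_0001` is a made-up placeholder accession.

```python
import json


def split_by_marker_coverage(feature_accessions, marker_lines):
    """Split accessions into those with and without a marker-sequence record."""
    with_markers = {
        json.loads(line)["genome_accession"]
        for line in marker_lines
        if line.strip()  # tolerate blank lines in the JSONL file
    }
    covered = [acc for acc in feature_accessions if acc in with_markers]
    missing = [acc for acc in feature_accessions if acc not in with_markers]
    return covered, missing


marker_lines = ['{"genome_accession": "GCF_004341595.1", "by_category": {}}', ""]
covered, missing = split_by_marker_coverage(
    ["GCF_004341595.1", "EXAMPLE_0001"], marker_lines
)
print(covered)  # ['GCF_004341595.1']
print(missing)  # ['EXAMPLE_0001']
```

In practice `marker_lines` would come from iterating over the opened marker-sequence JSONL file.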
## Output
The output includes:
- `pred_optimal_temperature_c` plus 80% interval columns
- `pred_optimal_ph` plus 80% interval columns
- `pred_salt_tolerance_pct` plus 80% interval columns
- `pred_oxygen_requirement`
- `pred_oxygen_requirement_confidence`
- one probability column per LoRA oxygen class

The default output is `artifacts/hybrid_predictions.parquet`; `.csv`, `.json`, and
`.jsonl` outputs are also supported.
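One plausible relationship between the oxygen columns is that the predicted class is the one with the highest probability and the confidence is that probability. This document does not state that, so treat the sketch below as an assumption; `summarize_oxygen` and the `class_*` labels are placeholders, not the model's real class names.

```python
def summarize_oxygen(probabilities: dict) -> tuple:
    """Return (label, confidence), assuming confidence is the top class probability."""
    label = max(probabilities, key=probabilities.get)
    return label, probabilities[label]


print(summarize_oxygen({"class_a": 0.7, "class_b": 0.2, "class_c": 0.1}))
# ('class_a', 0.7)
```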