# Hybrid Predictor

The hybrid predictor combines the strongest current model for each phenotype:

- Temperature: saved tabular XGBoost phenotype head
- pH: saved tabular XGBoost phenotype head
- Salt: saved tabular XGBoost phenotype head
- Oxygen: fold-0 all-task LoRA checkpoint

Script:

```bash
PYTHONPATH=src uv run --python 3.11 --extra dev --extra embeddings python scripts/39_predict_hybrid.py \
  --features data/training_table.parquet \
  --marker-sequences data/marker_sequences.jsonl \
  --limit 25 \
  --device mps \
  --output artifacts/hybrid_predictions.parquet
```

Use `--device cuda` on a GPU host, `--device mps` on Apple Silicon, or omit the flag
to let PyTorch choose CUDA when available and CPU otherwise. CPU LoRA inference is
slow because ESM-2 encodes multiple marker proteins per genome.
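
The fallback described above can be sketched as follows. This is a minimal illustration, not the script's actual code: `resolve_device` is a hypothetical helper name, and the guarded import is only there so the snippet also runs on a machine without PyTorch.

```python
def resolve_device(requested=None):
    """Return the device string to use for LoRA inference.

    Hypothetical helper: an explicit value (for example "mps") wins;
    otherwise prefer CUDA when PyTorch reports it, and fall back to CPU.
    """
    if requested is not None:
        return requested
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to probe, so report CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(resolve_device())       # "cuda" or "cpu", depending on the host
    print(resolve_device("mps"))  # explicit request is passed through
```

Note that in this sketch MPS is never chosen automatically, which matches the advice to pass `--device mps` explicitly on Apple Silicon.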

For larger uncultured-genome batches, use chunked output so that completed work survives an interrupted run:

```bash
PYTHONPATH=src uv run --python 3.11 --extra dev --extra embeddings python scripts/39_predict_hybrid.py \
  --features artifacts/uncultured_predictions.parquet \
  --marker-sequences data/uncultured_marker_sequences.jsonl \
  --join left \
  --reuse-existing-tabular \
  --device mps \
  --batch-size 2 \
  --chunk-size 250 \
  --chunk-output-dir artifacts/hybrid_chunks \
  --resume-chunks \
  --progress-every 25 \
  --output artifacts/hybrid_predictions.parquet
```

`--resume-chunks` skips existing chunk files and combines all expected chunks into
the final output when the run finishes. `--reuse-existing-tabular` keeps previously
materialized temperature, pH, salt, and media outputs while replacing oxygen with
LoRA where marker sequences are available.
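
The resume behaviour can be pictured with the sketch below. It is an illustration under stated assumptions, not the script's implementation: the `chunk_00000.jsonl` naming scheme, the JSONL chunk format, and the `run_chunked` and `predict` names are all hypothetical.

```python
import json
from pathlib import Path


def chunk_path(chunk_dir, index):
    # Hypothetical naming scheme; the real script's file names may differ.
    return Path(chunk_dir) / f"chunk_{index:05d}.jsonl"


def run_chunked(rows, chunk_size, chunk_dir, predict, resume=True):
    """Predict rows in chunks, skip chunks already on disk, then combine.

    Returns (combined_rows, number_of_chunks_computed_in_this_call).
    """
    chunk_dir = Path(chunk_dir)
    chunk_dir.mkdir(parents=True, exist_ok=True)
    n_chunks = (len(rows) + chunk_size - 1) // chunk_size
    computed = 0

    for i in range(n_chunks):
        path = chunk_path(chunk_dir, i)
        if resume and path.exists():
            continue  # this chunk was finished by an earlier run
        batch = rows[i * chunk_size:(i + 1) * chunk_size]
        results = [predict(row) for row in batch]
        # Write to a temporary file and rename it, so an interrupted run
        # never leaves a half-written chunk that would later be skipped.
        tmp = path.with_suffix(".tmp")
        with tmp.open("w") as fh:
            for result in results:
                fh.write(json.dumps(result) + "\n")
        tmp.replace(path)
        computed += 1

    # Combine every expected chunk, whether old or freshly written.
    combined = []
    for i in range(n_chunks):
        with chunk_path(chunk_dir, i).open() as fh:
            combined.extend(json.loads(line) for line in fh if line.strip())
    return combined, computed
```

Because the chunk boundaries depend on `chunk_size`, a resumed run in a scheme like this should reuse the same chunk size and the same input ordering as the run it continues.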

## Inputs

`--features` accepts `.parquet`, `.csv`, `.json`, or `.jsonl` feature rows. Rows must
include `genome_accession` by default. The tabular heads load their feature column
order from `models/phenotype/feature_cols.json`.
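
As a rough sketch of what "feature column order" means in practice, the snippet below arranges one input row into the order given by a column list. It assumes `feature_cols.json` is a plain JSON array of column names, which this document does not confirm; `load_feature_cols`, `order_features`, and the feature names in the usage example are hypothetical.

```python
import json
from pathlib import Path


def load_feature_cols(path):
    # Assumption: the file holds a JSON array of column names.
    return json.loads(Path(path).read_text())


def order_features(row, feature_cols, key="genome_accession"):
    """Return (accession, values) with values in the model's column order."""
    if key not in row:
        raise KeyError(f"row is missing required column {key!r}")
    missing = [col for col in feature_cols if col not in row]
    if missing:
        raise ValueError(f"row is missing feature columns: {missing}")
    # Cast to float so CSV strings and JSON numbers are handled alike.
    return row[key], [float(row[col]) for col in feature_cols]


if __name__ == "__main__":
    # Hypothetical feature names, for illustration only.
    cols = ["genome_size_mb", "gc_content"]
    row = {"genome_accession": "GCF_004341595.1",
           "gc_content": "0.52", "genome_size_mb": 4.1}
    print(order_features(row, cols))
```

Failing loudly on a missing column is a deliberate choice here: a silently reordered or shortened feature vector would still produce predictions, just wrong ones.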

`--marker-sequences` must be a JSONL file with the same schema as
`data/marker_sequences.jsonl`:

```json
{
  "bacdive_id": 1,
  "genome_accession": "GCF_004341595.1",
  "by_category": {
    "temperature": ["..."],
    "ph": ["..."],
    "oxygen": ["..."],
    "salt": ["..."],
    "vitamin": ["..."],
    "nitrogen": ["..."],
    "carbon": ["..."],
    "special": ["..."]
  }
}
```
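
A quick way to catch schema problems before a long run is to check each JSONL line against the shape shown above. The sketch below is one possible check, not part of the project: `validate_marker_record` is a hypothetical name, and treating all eight categories as required is an assumption drawn only from the example record.

```python
import json

# Categories taken from the example record above.
EXPECTED_CATEGORIES = {
    "temperature", "ph", "oxygen", "salt",
    "vitamin", "nitrogen", "carbon", "special",
}


def validate_marker_record(line):
    """Return a list of problems found in one JSONL line (empty if none)."""
    record = json.loads(line)
    problems = []
    if not isinstance(record.get("genome_accession"), str):
        problems.append("genome_accession must be a string")
    by_category = record.get("by_category")
    if not isinstance(by_category, dict):
        problems.append("by_category must be an object")
        return problems
    # Assumption: every category from the example is expected to be present.
    for name in sorted(EXPECTED_CATEGORIES - set(by_category)):
        problems.append(f"missing category: {name}")
    for name, seqs in by_category.items():
        if not isinstance(seqs, list) or not all(isinstance(s, str) for s in seqs):
            problems.append(f"category {name} must be a list of strings")
    return problems
```

The check leaves `bacdive_id` alone on purpose, since uncultured genomes may not have one.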

For uncultured genomes, first prepare a marker-sequence JSONL for those same
`genome_accession` values. The existing `data/marker_sequences.jsonl` is keyed to
BacDive training rows, not the uncultured candidate table.
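
Before launching a large batch, it can help to see which accessions in the feature table actually have marker sequences. The following is a small sketch of such a coverage check; `marker_coverage` is a hypothetical helper, and the accessions in the usage example are made up.

```python
import json


def marker_coverage(feature_accessions, marker_lines):
    """Split feature-table accessions by whether a marker record exists.

    `marker_lines` is any iterable of JSONL lines, such as an open file.
    Returns (covered, missing), both in the feature table's order.
    """
    available = {
        json.loads(line)["genome_accession"]
        for line in marker_lines
        if line.strip()
    }
    covered = [acc for acc in feature_accessions if acc in available]
    missing = [acc for acc in feature_accessions if acc not in available]
    return covered, missing


if __name__ == "__main__":
    # Made-up accessions, for illustration only.
    lines = ['{"genome_accession": "GCA_000000001.1", "by_category": {}}']
    print(marker_coverage(["GCA_000000001.1", "GCA_000000002.1"], lines))
```

A long `missing` list usually means the marker file was built for a different table, such as the BacDive training rows.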

## Output

The output includes the following columns:

- `pred_optimal_temperature_c` plus 80% interval columns
- `pred_optimal_ph` plus 80% interval columns
- `pred_salt_tolerance_pct` plus 80% interval columns
- `pred_oxygen_requirement`
- `pred_oxygen_requirement_confidence`
- one probability column per LoRA oxygen class

The default output is `artifacts/hybrid_predictions.parquet`; `.csv`, `.json`, and
`.jsonl` outputs are also supported.
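
As one example of consuming the output, the sketch below reads a CSV export and keeps only oxygen calls above a confidence threshold. It uses the column names listed above and assumes the output also carries `genome_accession`; the `0.8` threshold, the `confident_oxygen_calls` name, and the class labels in the usage example are hypothetical.

```python
import csv
import io


def confident_oxygen_calls(csv_file, min_confidence=0.8):
    """Return (accession, oxygen_class) pairs at or above the threshold.

    `csv_file` is an open text file or any file-like object. Rows with an
    empty confidence value are skipped rather than treated as zero.
    """
    kept = []
    for row in csv.DictReader(csv_file):
        raw = row.get("pred_oxygen_requirement_confidence", "")
        if not raw:
            continue
        if float(raw) >= min_confidence:
            kept.append((row["genome_accession"], row["pred_oxygen_requirement"]))
    return kept


if __name__ == "__main__":
    # Made-up rows and class labels, for illustration only.
    sample = io.StringIO(
        "genome_accession,pred_oxygen_requirement,pred_oxygen_requirement_confidence\n"
        "GCA_000000001.1,aerobe,0.93\n"
        "GCA_000000002.1,anaerobe,0.55\n"
    )
    print(confident_oxygen_calls(sample))
```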