Miyu Horiuchi

Hybrid Predictor

The hybrid predictor combines the strongest current model for each phenotype:

  • Temperature: saved tabular XGBoost phenotype head
  • pH: saved tabular XGBoost phenotype head
  • Salt: saved tabular XGBoost phenotype head
  • Oxygen: fold-0 all-task LoRA checkpoint

Script:

PYTHONPATH=src uv run --python 3.11 --extra dev --extra embeddings python scripts/39_predict_hybrid.py \
  --features data/training_table.parquet \
  --marker-sequences data/marker_sequences.jsonl \
  --limit 25 \
  --device mps \
  --output artifacts/hybrid_predictions.parquet

Use --device cuda on a GPU host, --device mps on Apple Silicon, or omit the flag to let PyTorch choose CUDA when available and CPU otherwise. CPU LoRA inference is slow because ESM-2 encodes multiple marker proteins per genome.
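The device fallback described above can be sketched as a small pure function. This is a minimal illustration, not the script's actual code: `resolve_device` is a hypothetical name, and raising an error for an explicitly requested but unavailable device is an assumption. In a real run the two booleans would come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
from typing import Optional


def resolve_device(requested: Optional[str], cuda_available: bool, mps_available: bool) -> str:
    """Return the device string to use, following the fallback described above."""
    if requested is None:
        # No --device flag: prefer CUDA, otherwise fall back to CPU.
        return "cuda" if cuda_available else "cpu"
    available = {"cpu": True, "cuda": cuda_available, "mps": mps_available}
    if requested not in available:
        raise ValueError(f"unknown device: {requested!r}")
    if not available[requested]:
        # Assumption: an explicit but unavailable device is an error,
        # not a silent fallback to CPU.
        raise ValueError(f"device {requested!r} is not available on this host")
    return requested


# Apple Silicon host with the flag omitted: MPS is present but not chosen.
print(resolve_device(None, cuda_available=False, mps_available=True))
```

Under this reading, omitting the flag on Apple Silicon falls back to CPU, so `--device mps` has to be passed explicitly there.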

For larger uncultured-genome batches, use chunked output so progress is durable:

PYTHONPATH=src uv run --python 3.11 --extra dev --extra embeddings python scripts/39_predict_hybrid.py \
  --features artifacts/uncultured_predictions.parquet \
  --marker-sequences data/uncultured_marker_sequences.jsonl \
  --join left \
  --reuse-existing-tabular \
  --device mps \
  --batch-size 2 \
  --chunk-size 250 \
  --chunk-output-dir artifacts/hybrid_chunks \
  --resume-chunks \
  --progress-every 25 \
  --output artifacts/hybrid_predictions.parquet

--resume-chunks skips existing chunk files and combines all expected chunks into the final output when the run finishes. --reuse-existing-tabular keeps previously materialized temperature, pH, salt, and media outputs while replacing oxygen with LoRA where marker sequences are available.
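The resume behaviour can be pictured with a short sketch. It is an illustration under stated assumptions rather than the script's implementation: `run_chunked`, the `chunk_00000.jsonl` naming scheme, the JSONL chunk format, and the temporary-file rename are all hypothetical choices made here for clarity.

```python
import json
import tempfile
from pathlib import Path


def run_chunked(rows, chunk_size, chunk_dir, predict, resume=True):
    """Predict in chunks, writing one JSONL file per chunk, then combine them."""
    chunk_dir = Path(chunk_dir)
    chunk_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    for start in range(0, len(rows), chunk_size):
        # Hypothetical naming scheme; the real script may name chunks differently.
        path = chunk_dir / f"chunk_{start // chunk_size:05d}.jsonl"
        chunk_paths.append(path)
        if resume and path.exists():
            continue  # finished in an earlier run, so skip the expensive step
        results = [predict(row) for row in rows[start:start + chunk_size]]
        # Write under a temporary name first so an interrupted run never leaves
        # a half-written file that a later resume would treat as complete.
        tmp = path.with_suffix(".tmp")
        tmp.write_text("".join(json.dumps(r) + "\n" for r in results))
        tmp.replace(path)
    # Combine every expected chunk, including ones written by earlier runs.
    combined = []
    for path in chunk_paths:
        with path.open() as handle:
            combined.extend(json.loads(line) for line in handle if line.strip())
    return combined


calls = []


def fake_predict(row):
    # Stand-in for the real model call; records which genomes were processed.
    calls.append(row["genome_accession"])
    return {**row, "pred_oxygen_requirement": "placeholder"}


rows = [{"genome_accession": f"G{i}"} for i in range(5)]
with tempfile.TemporaryDirectory() as tmp_dir:
    first = run_chunked(rows, 2, tmp_dir, fake_predict)
    second = run_chunked(rows, 2, tmp_dir, fake_predict)  # all chunks already exist

print(len(calls), first == second)
```

The second call does no prediction work because every chunk file is already present, which is what makes an interrupted batch cheap to restart.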

Inputs

--features accepts .parquet, .csv, .json, or .jsonl feature rows. By default, each row must include a genome_accession column. The tabular heads load their feature column order from models/phenotype/feature_cols.json.
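Column order matters because a saved tabular model expects features in the positions it was trained on. The sketch below shows one way to align a row to that order. It assumes feature_cols.json is a flat JSON list of column names, which this document does not confirm. The helper names and the two feature names (`gc_content`, `genome_length`) are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_feature_cols(path):
    """Read the ordered feature names. Assumes a flat JSON list of strings."""
    cols = json.loads(Path(path).read_text())
    if not isinstance(cols, list) or not all(isinstance(c, str) for c in cols):
        raise ValueError("expected a JSON list of column names")
    return cols


def to_feature_vector(row, feature_cols, id_col="genome_accession"):
    """Return (accession, values) with values in the saved column order."""
    if id_col not in row:
        raise KeyError(f"row is missing {id_col!r}")
    missing = [c for c in feature_cols if c not in row]
    if missing:
        # Assumption: a missing feature is an error rather than being imputed.
        raise KeyError(f"row {row[id_col]!r} is missing features: {missing}")
    return row[id_col], [row[c] for c in feature_cols]


# Stand-in for models/phenotype/feature_cols.json with hypothetical feature names.
with tempfile.TemporaryDirectory() as tmp_dir:
    cols_path = Path(tmp_dir) / "feature_cols.json"
    cols_path.write_text(json.dumps(["gc_content", "genome_length"]))
    feature_cols = load_feature_cols(cols_path)

# The row's own key order differs from the saved order on purpose.
row = {"genome_length": 4_200_000, "genome_accession": "GCF_004341595.1", "gc_content": 0.51}
accession, values = to_feature_vector(row, feature_cols)
print(accession, values)
```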

--marker-sequences must be a JSONL file with the same schema as data/marker_sequences.jsonl:

{
  "bacdive_id": 1,
  "genome_accession": "GCF_004341595.1",
  "by_category": {
    "temperature": ["..."],
    "ph": ["..."],
    "oxygen": ["..."],
    "salt": ["..."],
    "vitamin": ["..."],
    "nitrogen": ["..."],
    "carbon": ["..."],
    "special": ["..."]
  }
}

For uncultured genomes, first prepare a marker-sequence JSONL for those same genome_accession values. The existing data/marker_sequences.jsonl is keyed to BacDive training rows, not the uncultured candidate table.
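Before a long run it can help to check that a new marker-sequence file matches the schema and covers the genomes in the feature table. The following is a rough sketch with hypothetical helper names. It assumes all eight categories must be present as lists of strings and does not check bacdive_id, since this document does not say whether uncultured genomes need one. The sequence "MKT" and the accession "MAG_0001" are placeholders.

```python
import json

CATEGORIES = ("temperature", "ph", "oxygen", "salt", "vitamin", "nitrogen", "carbon", "special")


def check_marker_record(record):
    """Return a list of problems found in one parsed JSONL record (empty if none)."""
    problems = []
    if not isinstance(record.get("genome_accession"), str):
        problems.append("genome_accession must be a string")
    by_category = record.get("by_category")
    if not isinstance(by_category, dict):
        return problems + ["by_category must be an object"]
    for name in CATEGORIES:
        seqs = by_category.get(name)
        if not isinstance(seqs, list) or not all(isinstance(s, str) for s in seqs):
            problems.append(f"by_category.{name} must be a list of strings")
    return problems


def missing_accessions(feature_accessions, marker_lines):
    """Accessions in the feature table that have no marker-sequence record."""
    covered = {json.loads(line)["genome_accession"] for line in marker_lines if line.strip()}
    return sorted(set(feature_accessions) - covered)


good = {"genome_accession": "GCF_004341595.1", "by_category": {c: ["MKT"] for c in CATEGORIES}}
bad = {"genome_accession": "GCF_004341595.1", "by_category": {"oxygen": ["MKT"]}}

print(check_marker_record(good))
print(check_marker_record(bad))
print(missing_accessions(["GCF_004341595.1", "MAG_0001"], [json.dumps(good)]))
```

A non-empty result from `missing_accessions` identifies genomes that would get no LoRA oxygen prediction because they have no marker sequences.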

Output

The output includes:

  • pred_optimal_temperature_c plus 80% interval columns
  • pred_optimal_ph plus 80% interval columns
  • pred_salt_tolerance_pct plus 80% interval columns
  • pred_oxygen_requirement
  • pred_oxygen_requirement_confidence
  • one probability column per LoRA oxygen class

The default output is artifacts/hybrid_predictions.parquet; .csv, .json, and .jsonl outputs are also supported.
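As an example of consuming the output, the sketch below filters a .jsonl output down to confident oxygen calls using the two oxygen columns listed above. It makes several assumptions: that the output rows carry genome_accession, that the confidence is a number between 0 and 1, and that rows without a LoRA prediction have null oxygen fields. The function name, the 0.8 cutoff, the class labels, and the accessions are hypothetical.

```python
import json


def confident_oxygen_calls(lines, min_confidence=0.8):
    """Yield (accession, label, confidence) for rows at or above the cutoff."""
    for line in lines:
        if not line.strip():
            continue
        row = json.loads(line)
        label = row.get("pred_oxygen_requirement")
        confidence = row.get("pred_oxygen_requirement_confidence")
        if label is None or confidence is None:
            continue  # assumed to mean no oxygen prediction for this genome
        if confidence >= min_confidence:
            yield row["genome_accession"], label, confidence


# Hypothetical rows standing in for lines read from a .jsonl output file.
sample = [
    json.dumps({"genome_accession": "MAG_0001",
                "pred_oxygen_requirement": "aerobe",
                "pred_oxygen_requirement_confidence": 0.93}),
    json.dumps({"genome_accession": "MAG_0002",
                "pred_oxygen_requirement": "anaerobe",
                "pred_oxygen_requirement_confidence": 0.55}),
    json.dumps({"genome_accession": "MAG_0003",
                "pred_oxygen_requirement": None,
                "pred_oxygen_requirement_confidence": None}),
]
kept = list(confident_oxygen_calls(sample))
print(kept)
```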