# Hybrid Predictor

The hybrid predictor combines the strongest current model for each phenotype:

- Temperature: saved tabular XGBoost phenotype head
- pH: saved tabular XGBoost phenotype head
- Salt: saved tabular XGBoost phenotype head
- Oxygen: fold-0 all-task LoRA checkpoint

Script:

```bash
PYTHONPATH=src uv run --python 3.11 --extra dev --extra embeddings python scripts/39_predict_hybrid.py \
  --features data/training_table.parquet \
  --marker-sequences data/marker_sequences.jsonl \
  --limit 25 \
  --device mps \
  --output artifacts/hybrid_predictions.parquet
```

Use `--device cuda` on a GPU host, `--device mps` on Apple Silicon, or omit the flag to let PyTorch choose CUDA when available and CPU otherwise. CPU LoRA inference is slow because ESM-2 encodes multiple marker proteins per genome.

For larger uncultured-genome batches, use chunked output so progress is durable:

```bash
PYTHONPATH=src uv run --python 3.11 --extra dev --extra embeddings python scripts/39_predict_hybrid.py \
  --features artifacts/uncultured_predictions.parquet \
  --marker-sequences data/uncultured_marker_sequences.jsonl \
  --join left \
  --reuse-existing-tabular \
  --device mps \
  --batch-size 2 \
  --chunk-size 250 \
  --chunk-output-dir artifacts/hybrid_chunks \
  --resume-chunks \
  --progress-every 25 \
  --output artifacts/hybrid_predictions.parquet
```

`--resume-chunks` skips existing chunk files and combines all expected chunks into the final output when the run finishes. `--reuse-existing-tabular` keeps previously materialized temperature, pH, salt, and media outputs while replacing oxygen with LoRA where marker sequences are available.

## Inputs

`--features` accepts `.parquet`, `.csv`, `.json`, or `.jsonl` feature rows. Rows must include `genome_accession` by default. The tabular heads load their feature column order from `models/phenotype/feature_cols.json`.

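
Before a long run, it can be useful to confirm that a feature table carries the key column and every column the tabular heads expect. The following is a minimal pre-flight sketch, not part of the repository: it assumes `models/phenotype/feature_cols.json` is a flat JSON list of column names (the file's actual layout is not documented here), and the helper names `load_feature_cols` and `missing_columns` are hypothetical.

```python
import json
from pathlib import Path
from typing import Iterable, List


def load_feature_cols(path: str = "models/phenotype/feature_cols.json") -> List[str]:
    """Load ordered feature column names.

    Assumption: the file is a flat JSON list of strings.
    """
    cols = json.loads(Path(path).read_text())
    if not isinstance(cols, list) or not all(isinstance(c, str) for c in cols):
        raise ValueError(f"{path}: expected a JSON list of column names")
    return cols


def missing_columns(
    available: Iterable[str],
    feature_cols: Iterable[str],
    key: str = "genome_accession",
) -> List[str]:
    """Return required columns absent from `available`, key column first."""
    have = set(available)
    required = [key, *feature_cols]
    return [c for c in required if c not in have]


# Illustration with placeholder column names (feat_a, feat_b are not real features).
print(missing_columns(["genome_accession", "feat_a"], ["feat_a", "feat_b"]))
```

An empty result means the table has every required column; anything else lists what to add before invoking the script.
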
`--marker-sequences` must be a JSONL file with the same schema as `data/marker_sequences.jsonl`:

```json
{
  "bacdive_id": 1,
  "genome_accession": "GCF_004341595.1",
  "by_category": {
    "temperature": ["..."],
    "ph": ["..."],
    "oxygen": ["..."],
    "salt": ["..."],
    "vitamin": ["..."],
    "nitrogen": ["..."],
    "carbon": ["..."],
    "special": ["..."]
  }
}
```

For uncultured genomes, first prepare a marker-sequence JSONL for those same `genome_accession` values. The existing `data/marker_sequences.jsonl` is keyed to BacDive training rows, not the uncultured candidate table.

## Output

The output includes:

- `pred_optimal_temperature_c` plus 80% interval columns
- `pred_optimal_ph` plus 80% interval columns
- `pred_salt_tolerance_pct` plus 80% interval columns
- `pred_oxygen_requirement`
- `pred_oxygen_requirement_confidence`
- one probability column per LoRA oxygen class

The default output is `artifacts/hybrid_predictions.parquet`; `.csv`, `.json`, and `.jsonl` outputs are also supported.
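
As a rough illustration of consuming the oxygen columns, the sketch below tallies predicted oxygen classes from a `.jsonl` output, keeping only rows at or above a confidence threshold. It rests on several assumptions: each line is one JSON object per genome, `pred_oxygen_requirement_confidence` is a number between 0 and 1, and rows without a LoRA oxygen call carry a null value. The function name `summarize_oxygen` and the class labels in the sample are illustrative only.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_oxygen(lines: Iterable[str], min_confidence: float = 0.0) -> Counter:
    """Count predicted oxygen classes among rows at or above min_confidence."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        label = row.get("pred_oxygen_requirement")
        confidence = row.get("pred_oxygen_requirement_confidence")
        # Assumption: rows lacking marker sequences have no oxygen call (null).
        if label is None or confidence is None:
            continue
        if confidence >= min_confidence:
            counts[label] += 1
    return counts


# Sample rows; labels and the second and third accessions are made up.
sample = [
    '{"genome_accession": "GCF_004341595.1", "pred_oxygen_requirement": "aerobe", "pred_oxygen_requirement_confidence": 0.91}',
    '{"genome_accession": "EXAMPLE_2", "pred_oxygen_requirement": "anaerobe", "pred_oxygen_requirement_confidence": 0.55}',
    '{"genome_accession": "EXAMPLE_3", "pred_oxygen_requirement": null, "pred_oxygen_requirement_confidence": null}',
]
print(summarize_oxygen(sample, min_confidence=0.8))
```

For a real run, pass an open file handle, for example `summarize_oxygen(open(path))`, where `path` points at a `.jsonl` file written via `--output`.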