Hybrid Predictor
The hybrid predictor combines the strongest current model for each phenotype:
- Temperature: saved tabular XGBoost phenotype head
- pH: saved tabular XGBoost phenotype head
- Salt: saved tabular XGBoost phenotype head
- Oxygen: fold-0 all-task LoRA checkpoint
Script:
PYTHONPATH=src uv run --python 3.11 --extra dev --extra embeddings python scripts/39_predict_hybrid.py \
  --features data/training_table.parquet \
  --marker-sequences data/marker_sequences.jsonl \
  --limit 25 \
  --device mps \
  --output artifacts/hybrid_predictions.parquet
Use --device cuda on a GPU host, --device mps on Apple Silicon, or omit the flag
to let PyTorch choose CUDA when available and CPU otherwise. CPU LoRA inference is
slow because ESM-2 encodes multiple marker proteins per genome.
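The fallback described above (explicit flag wins, otherwise CUDA when available and CPU otherwise) can be sketched as follows. This is an illustration only, not the script's actual code: `resolve_device` and its parameters are hypothetical names, and the sketch treats a missing PyTorch install as "no CUDA" so it runs anywhere.

```python
def resolve_device(requested=None, cuda_available=None):
    """Pick a device string the way the text above describes.

    requested: value of a --device style flag, or None when omitted.
    cuda_available: override for testing; probed via PyTorch when None.
    """
    if requested is not None:
        # An explicit choice such as "cuda" or "mps" is used as given.
        return requested
    if cuda_available is None:
        try:
            import torch  # optional dependency for this sketch
            cuda_available = torch.cuda.is_available()
        except ImportError:
            cuda_available = False
    return "cuda" if cuda_available else "cpu"


if __name__ == "__main__":
    print(resolve_device("mps"))
    print(resolve_device(None, cuda_available=False))
```

Note that under this rule Apple Silicon is never selected automatically, which is why the example commands pass --device mps explicitly.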
For larger uncultured-genome batches, use chunked output so progress is durable:
PYTHONPATH=src uv run --python 3.11 --extra dev --extra embeddings python scripts/39_predict_hybrid.py \
  --features artifacts/uncultured_predictions.parquet \
  --marker-sequences data/uncultured_marker_sequences.jsonl \
  --join left \
  --reuse-existing-tabular \
  --device mps \
  --batch-size 2 \
  --chunk-size 250 \
  --chunk-output-dir artifacts/hybrid_chunks \
  --resume-chunks \
  --progress-every 25 \
  --output artifacts/hybrid_predictions.parquet
--resume-chunks skips existing chunk files and combines all expected chunks into
the final output when the run finishes. --reuse-existing-tabular keeps previously
materialized temperature, pH, salt, and media outputs while replacing oxygen with
LoRA where marker sequences are available.
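To make the resume behavior concrete, here is a minimal sketch of how chunk ranges could be planned and how already-written chunks could be skipped. The helper names and the `chunk_00000.parquet` naming scheme are assumptions for illustration; the real script may lay out its chunk files differently.

```python
from pathlib import Path
import tempfile


def plan_chunks(n_rows, chunk_size):
    """Return (start, stop) row ranges that cover n_rows in order."""
    return [(start, min(start + chunk_size, n_rows))
            for start in range(0, n_rows, chunk_size)]


def chunk_path(chunk_dir, index):
    # Hypothetical file naming; not taken from the actual script.
    return Path(chunk_dir) / f"chunk_{index:05d}.parquet"


def pending_chunks(n_rows, chunk_size, chunk_dir):
    """List (index, start, stop) for chunks whose file does not exist yet."""
    return [(i, start, stop)
            for i, (start, stop) in enumerate(plan_chunks(n_rows, chunk_size))
            if not chunk_path(chunk_dir, i).exists()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        chunk_path(tmp, 0).touch()  # pretend the first chunk already finished
        print(pending_chunks(600, 250, tmp))
        # prints [(1, 250, 500), (2, 500, 600)]
```

With 600 rows and a chunk size of 250, an interrupted run that completed only the first chunk would redo rows 250 to 599 and leave the first 250 untouched.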
Inputs
--features accepts .parquet, .csv, .json, or .jsonl feature rows. Rows must
include genome_accession by default. The tabular heads load their feature column
order from models/phenotype/feature_cols.json.
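Because the tabular heads depend on a saved column order, it can help to check a feature row against that order before running predictions. The sketch below assumes that feature_cols.json holds a flat JSON array of column names, which is not confirmed here; `load_feature_cols` and `order_row` are hypothetical helpers.

```python
import json
from pathlib import Path


def load_feature_cols(path="models/phenotype/feature_cols.json"):
    # Assumption: the file is a flat JSON array of column names.
    return json.loads(Path(path).read_text())


def order_row(row, feature_cols, id_col="genome_accession"):
    """Return (accession, values) with values in the saved column order."""
    if id_col not in row:
        raise KeyError(f"row is missing {id_col!r}")
    missing = [c for c in feature_cols if c not in row]
    if missing:
        raise KeyError(f"row is missing feature columns: {missing}")
    return row[id_col], [row[c] for c in feature_cols]


if __name__ == "__main__":
    # Illustrative column names only; the real feature set differs.
    cols = ["feature_a", "feature_b"]
    row = {"genome_accession": "GCF_004341595.1", "feature_b": 2.0, "feature_a": 1.0}
    print(order_row(row, cols))
```

Failing early on a missing column is usually preferable to silently feeding a model features in the wrong positions.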
--marker-sequences must be a JSONL file with the same schema as
data/marker_sequences.jsonl:
{
  "bacdive_id": 1,
  "genome_accession": "GCF_004341595.1",
  "by_category": {
    "temperature": ["..."],
    "ph": ["..."],
    "oxygen": ["..."],
    "salt": ["..."],
    "vitamin": ["..."],
    "nitrogen": ["..."],
    "carbon": ["..."],
    "special": ["..."]
  }
}
For uncultured genomes, first prepare a marker-sequence JSONL for those same
genome_accession values. The existing data/marker_sequences.jsonl is keyed to
BacDive training rows, not the uncultured candidate table.
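When preparing a new marker-sequence file, a quick structural check against the schema shown above can catch problems before a long inference run. The following is a minimal sketch using only the standard library; the function names are hypothetical, and it deliberately does not check bacdive_id, since it is unclear whether that field is required for uncultured genomes.

```python
import json


def validate_marker_record(record):
    """Return a list of problems found in one parsed JSONL record."""
    if not isinstance(record, dict):
        return ["record must be a JSON object"]
    problems = []
    if not isinstance(record.get("genome_accession"), str):
        problems.append("genome_accession must be a string")
    by_category = record.get("by_category")
    if not isinstance(by_category, dict):
        problems.append("by_category must be an object")
        return problems
    for category, sequences in by_category.items():
        is_str_list = isinstance(sequences, list) and all(
            isinstance(s, str) for s in sequences)
        if not is_str_list:
            problems.append(f"by_category[{category!r}] must be a list of strings")
    return problems


def validate_marker_jsonl(lines):
    """Yield (line_number, problems) for every non-blank line that fails."""
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            yield number, [f"invalid JSON: {exc.msg}"]
            continue
        problems = validate_marker_record(record)
        if problems:
            yield number, problems


if __name__ == "__main__":
    with open("data/uncultured_marker_sequences.jsonl") as handle:
        for number, problems in validate_marker_jsonl(handle):
            print(f"line {number}: {'; '.join(problems)}")
```

Each record must sit on a single line in the actual file; the pretty-printed form above is for readability only.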
Output
The output includes:
- pred_optimal_temperature_c plus 80% interval columns
- pred_optimal_ph plus 80% interval columns
- pred_salt_tolerance_pct plus 80% interval columns
- pred_oxygen_requirement
- pred_oxygen_requirement_confidence
- one probability column per LoRA oxygen class
The default output is artifacts/hybrid_predictions.parquet; .csv, .json, and
.jsonl outputs are also supported.