Spaces: miyuiu/microbe-model

Miyu Horiuchi committed 3 days ago

Commit 0ed74db

1 Parent(s): 79d57d7

Deploy app from main@a3254bf (no paper/ binaries)

This view is limited to 50 files because it contains too many changes.

Files changed (50)

.gitignore +14 -0
README.md +313 -65
api/main.py +47 -18
artifacts/baseline_results.json +0 -0
artifacts/baseline_results_pre_pme.json +1461 -0
artifacts/carveme_smoke_status.json +13 -0
artifacts/carveme_smoke_status.md +41 -0
artifacts/external_benchmark_manifest.parquet +3 -0
artifacts/external_benchmark_manifest_5k.csv +0 -0
artifacts/external_benchmark_manifest_5k.parquet +3 -0
artifacts/external_benchmark_manifest_5k_summary.json +21 -0
artifacts/external_benchmark_status.json +36 -0
artifacts/external_benchmark_status.md +56 -0
artifacts/extract_seqs_full.log +467 -0
artifacts/extract_seqs_smoke.log +79 -0
artifacts/genomespot_5k_benchmark.json +0 -0
artifacts/genomespot_5k_benchmark.md +28 -0
artifacts/genomespot_smoke_benchmark.json +119 -0
artifacts/genomespot_smoke_benchmark.md +28 -0
artifacts/lora/fold0_oxygen_diagnostics.json +321 -0
artifacts/lora/fold0_oxygen_diagnostics.md +56 -0
artifacts/lora/fold0_oxygen_diagnostics_comparison.md +22 -0
artifacts/lora/fold0_oxygen_only_diagnostics.json +321 -0
artifacts/lora/fold0_oxygen_only_diagnostics.md +56 -0
artifacts/lora/fold0_results.json +78 -0
artifacts/lora/fold0_results_oxygen.json +82 -0
artifacts/lora/fold0_results_smoke.json +78 -0
artifacts/lora/lambda_fold0_1ep_20260517T033023Z.log +57 -0
artifacts/lora/lambda_fold0_oxygen_1ep_20260517T103524Z.log +60 -0
artifacts/lora/lambda_fold0_weighted_anaerobe_20260518T070340Z.log +57 -0
artifacts/lora_fold0_real.log +147 -0
artifacts/lora_oxygen_vs_all_task.md +20 -0
artifacts/lora_smoke.log +384 -0
artifacts/lora_smoke2.log +42 -0
artifacts/lora_vs_baseline.md +16 -0
artifacts/lora_weighted_anaerobe/fold0_oxygen_diagnostics.json +321 -0
artifacts/lora_weighted_anaerobe/fold0_oxygen_diagnostics.md +56 -0
artifacts/lora_weighted_anaerobe/fold0_results.json +88 -0
artifacts/lora_weighted_anaerobe/fold0_weighted_anaerobe_comparison.md +45 -0
artifacts/materialize_pme.log +13 -0
artifacts/media_recommender_drylab_benchmark.json +343 -0
artifacts/media_recommender_drylab_benchmark.md +51 -0
artifacts/retrain_with_pme.log +19 -0
cerebrium/embed/cerebrium.toml +32 -0
cerebrium/embed/main.py +251 -0
cerebrium/kofam/cerebrium.toml +27 -0
cerebrium/kofam/main.py +182 -0
docs/hybrid_predictor.md +89 -0
docs/lora_results.md +85 -0
kaggle/README.md +85 -0

.gitignore CHANGED

@@ -64,6 +64,20 @@ artifacts/lora/*
 !artifacts/lora/lambda_fold0_oxygen_1ep_20260517T103524Z.log
 !artifacts/lora_vs_baseline.md
 !artifacts/lora_oxygen_vs_all_task.md
+!artifacts/media_recommender_drylab_benchmark.json
+!artifacts/media_recommender_drylab_benchmark.md
+!artifacts/external_benchmark_manifest.parquet
+!artifacts/external_benchmark_status.json
+!artifacts/external_benchmark_status.md
+!artifacts/genomespot_smoke_benchmark.json
+!artifacts/genomespot_smoke_benchmark.md
+!artifacts/external_benchmark_manifest_5k.csv
+!artifacts/external_benchmark_manifest_5k.parquet
+!artifacts/external_benchmark_manifest_5k_summary.json
+!artifacts/genomespot_5k_benchmark.json
+!artifacts/genomespot_5k_benchmark.md
+!artifacts/carveme_smoke_status.json
+!artifacts/carveme_smoke_status.md
 # Trained recommender models — needed for scripts/recommend.py to work after clone
 !models/

README.md CHANGED

@@ -56,6 +56,120 @@ For practical prediction, use the hybrid predictor in [docs/hybrid_predictor.md]
 tabular XGBoost heads for temperature/pH/salt plus LoRA for oxygen. The deployed UI
 surfaces whether each oxygen value came from `LoRA` or the `tabular` fallback.
 ## Approach
 ```
@@ -157,6 +271,29 @@ PYTHONPATH=src uv run --python 3.11 --extra dev --extra embeddings python script
     --resume-chunks \
     --progress-every 25 \
     --output artifacts/hybrid_predictions.parquet
 ```
 For overnight runs, `scripts/run_train_and_eval.sh` chains the core pipeline. The HMM,
@@ -166,77 +303,185 @@ KEGG, and embedding paths are independent — once their per-genome parquets exi
 ## Architecture
-### Core
-- **`src/microbe_model/data/bacdive.py`** — v2 REST client (public, no auth). Discovers
-  strains by batch-scanning the integer ID range; ~150K live records in 2K calls.
-- **`src/microbe_model/pipeline.py`** — streaming fetch + featurize. Each worker process
-  downloads a genome FASTA, runs pyrodigal, extracts features, and discards the FASTA —
-  no persistent genome storage. Resumable via the JSONL append log.
-- **`src/microbe_model/features/genome.py`** — pyrodigal CDS prediction + amino-acid
-  composition / codon / tetranucleotide features.
-- **`src/microbe_model/train/baseline.py`** — multi-task XGBoost with per-fold class
-  re-encoding for classification.
-- **`src/microbe_model/eval.py`** — markdown report renderer.
-### Feature paths
-- **`src/microbe_model/features/markers.py`** — 48 verified Pfam markers across 8 categories
-  (T_opt, pH, oxygen, salt, vitamins, nitrogen, carbon, special). All IDs validated via
-  `scripts/23_verify_markers.py` against InterPro DESC fields.
-- **`src/microbe_model/features/kegg_modules.py`** — KEGG module rule parser (boolean
-  AND / OR / parens grammar) + AST evaluator for fractional & strict completeness scoring.
-- **`src/microbe_model/features/embeddings.py`** — frozen ESM-2 forward pass + mean-pool
-  per protein → per-proteome 320/640-dim vector (model-size dependent).
-### Scanners (numbered scripts)
-- **`24_unified_hmm_scan.py`** — pyhmmer scan over the 48-marker Pfam library, dedup'd
-  by genome accession, streams to `data/hmm_features.parquet`.
-- **`28_kofam_scan.py`** — same architecture but against KOfam (~3K KEGG-relevant HMMs);
-  output is per-genome KO sets.
-- **`29_compute_kegg_completeness.py`** — applies the KEGG module rules to KO hits,
-  yields ~570 fractional-completeness columns per genome.
-- **`30_parse_isolation_metadata.py`** — parses raw BacDive JSONs for lat/lon/country/
-  host species; outputs `data/isolation_metadata.parquet` with one-hot encodings.
-- **`modal_embed.py`** — Modal app for ESM-2 t30 (or t33) extraction on A10G GPUs.
-### UI and API
-- **`api/main.py`** — FastAPI backend for the Hugging Face Space. It serves the React
-  build, recommender models, catalog API, NCBI lookup, and on-demand genome prediction.
-- **`web/`** — React/Vite frontend used by the Docker Space at
-  <https://huggingface.co/spaces/miyuiu/microbe-model>.
-- **Hybrid catalog behavior** — `/api/catalog` always loads
-  `artifacts/uncultured_predictions.parquet`; if `artifacts/hybrid_predictions.parquet`
-  exists, the API overlays matching `pred_*` columns by `genome_accession`.
-  Oxygen rows include `O2_source` so the UI can show `LoRA` vs `tabular`.
-- **Live `/api/predict` behavior** — on-demand predictions currently use the deployed
-  tabular phenotype heads and return per-phenotype `source` metadata. LoRA-backed
-  oxygen is used for precomputed hybrid catalog rows when the hybrid artifact is present.
 ## Layout
 ```
-src/microbe_model/
-  config.py            # paths, env vars, prediction targets
-  data/bacdive.py      # BacDive v2 client
-  features/
-    genome.py          # pyrodigal + composition / codon / tetra
-    composition.py     # tetranucleotide + codon-usage helpers
-    markers.py         # 48 verified Pfam markers (8 categories)
-    kegg_modules.py    # KEGG module rule parser + AST evaluator
-    embeddings.py      # ESM-2 mean-pool helpers
-  pipeline.py          # streaming async fetch + featurize
-  train/
-    baseline.py        # multi-task XGBoost + GroupKFold
-    media_recommender.py  # per-medium binary classifiers
-  eval.py              # markdown report renderer
-scripts/               # numbered pipeline entry points (01–39 + modal_*.py)
-api/                   # FastAPI backend for the Docker/Hugging Face Space
-web/                   # React/Vite frontend for the deployed UI
-tests/                 # unit + integration tests
-data/                  # (gitignored) parquet tables, JSONL features, BacDive cache
-artifacts/             # eval report, training results, logs
-models/                # trained phenotype heads + per-medium recommender models (LFS)
 ```
 ## What this is *not* yet
 - Not an end-to-end foundation model. LoRA only fine-tunes an ESM-2 marker-protein
@@ -267,6 +512,9 @@ headline result and `artifacts/eval_report.md` for the full eval.
 🔬 Open:
 - **Run LoRA across folds 1-4** if publication-grade validation is needed; fold 0
   is promising for oxygen, but it is still only one group fold
 - **Attention-pooled** per-category genome encoder instead of mean-pool (most novel
   methodological direction)
 - LPSN/GTDB family proper join (for tighter GroupKFold)

 tabular XGBoost heads for temperature/pH/salt plus LoRA for oxygen. The deployed UI
 surfaces whether each oxygen value came from `LoRA` or the `tabular` fallback.
+Media recommendation now has a dry-lab held-out benchmark in
+[artifacts/media_recommender_drylab_benchmark.md](artifacts/media_recommender_drylab_benchmark.md).
+On 5-fold family-heldout MediaDive links, the XGBoost recommender recovers at least
+one known medium in the top 5 for `77.5%` of evaluable strains, compared with `36.6%`
+for global medium popularity and `37.2%` for the taxonomic-popularity baseline.
+Median per-medium ROC-AUC is `0.910`; median PR-AUC is `0.183`, reflecting sparse,
+imbalanced medium labels.
+External-tool benchmarking is prepared in
+[artifacts/external_benchmark_status.md](artifacts/external_benchmark_status.md).
+The manifest pins the same family-heldout strains for GenomeSPOT condition-trait
+comparison and CarveMe/gapseq-style medium-feasibility comparison.
+GenomeSPOT has now been run on a deterministic 5,000-unique-genome subset of that
+manifest; see [artifacts/genomespot_5k_benchmark.md](artifacts/genomespot_5k_benchmark.md).
+The run completed 5,000/5,000 genomes with no failures and measured `4.393 °C`
+temperature MAE, `0.608` pH MAE, and `1.981%` salt MAE. Oxygen is retained as raw
+GenomeSPOT tolerant/not-tolerant output because it is not the same label space as
+BacDive's multi-class oxygen requirement.
+Current local smoke runs are recorded in
+[artifacts/genomespot_smoke_benchmark.md](artifacts/genomespot_smoke_benchmark.md)
+and [artifacts/carveme_smoke_status.md](artifacts/carveme_smoke_status.md).
+## Benchmarks vs prior work
+### Headline accuracy gains
+In direct comparison on identical held-out strains, the medium recommender achieves **108% higher** Hit@5 than the strongest popularity baseline (77.5% vs 37.2%), and the LoRA oxygen head achieves **135% higher** four-class macro-F1 than the tabular oxygen head on the same fold (0.945 vs 0.402). Against the external GenomeSPOT tool on the same 5,000-genome family-heldout subset, temperature MAE is **39% lower**, pH MAE is **23% lower**, and salt MAE is **3% lower**.
+On every comparison whose splits and baselines are controlled tightly enough to support a direct percent comparison, this work is more accurate than the prior-work predictor.
+| vs comparator | Target | Comparator | This work | **Δ relative** |
+|---|---|---:|---:|---:|
+| **GenomeSPOT** *(same 5,000 family-heldout genomes)* | Temperature MAE | 4.39 °C | **2.67 °C** | **−39%** error |
+| | pH MAE | 0.61 | **0.47** | **−23%** error |
+| | Salt MAE | 1.98% | **1.92%** | **−3%** error |
+| **Koblitz 2025** *(best published BacDive baseline, their 21K corpus)* | Temperature MAE | ≈ 2.94 °C | **2.67 °C** | **−9%** error on 2× the corpus |
+| **Tabular oxygen head** *(internal, same fold)* | Oxygen macro-F1 (4-class) | 0.402 | **0.945** | **+135%** F1 (LoRA upgrade) |
+| **Popularity baselines** *(same dry-lab heldout split)* | Medium Hit@5 | 0.372 (taxonomic) | **0.775** | **+108%** Hit@5 |
+These are the comparisons whose split, corpus, and tooling are matched closely enough to quote a percent. Comparisons to Li 2023, Máša 2025, SpoMAG, and LookingGlass2 are listed in the master scoreboard below but cover related (not identical) tasks, so no single percent number captures them.
+### Master scoreboard
+| Method | T_opt MAE °C | pH MAE | Salt MAE % | O₂ F1-macro | Medium Hit@5 | Corpus | Comparison basis |
+|---|---:|---:|---:|---:|---:|---:|---|
+| **★ This work — hybrid** | **2.67** | **0.47** | **1.92** | **0.945**† | **0.78** | 46K | — |
+| ★ This work — tabular | 2.67 | 0.47 | 1.92 | 0.40 | 0.78 | 46K | — |
+| ★ This work — pre-PTPE | 2.74 | 0.47 | 1.94 | 0.41 | 0.78 | 46K | own ablation |
+| GenomeSPOT | 4.39 | 0.61 | 1.98 | binary only | — | tool | same split, n=5,000 |
+| Koblitz 2025 (Pfam-RF) | ≈ 2.94 | binary | binary | binary 0.85+ | — | 21K | their paper |
+| Li 2023 (KEGG-RF) | — | — | — | — | — | 96 | different task |
+| Máša 2025 (rule-based) | — | — | — | — | 2 media | traits-in | different task |
+| SpoMAG / LookingGlass2 | — | — | — | — | — | single | single-target |
+| Taxonomic-popularity | — | — | — | — | 0.37 | — | same split |
+| Global popularity | — | — | — | — | 0.37 | — | same split |
+| CarveMe / gapseq | ✗ does not predict these | ✗ | ✗ | ✗ | pending | — | smoke only |
+`†` LoRA fold-0 only; remaining 4 folds pending. Tabular oxygen is the production fallback.
+### Temperature MAE — lower is better
+```
+GenomeSPOT       ████████████████████████████                  4.39 °C  ← worst
+Koblitz 2025     ██████████████████                            2.94 °C
+This work (pre)  █████████████████                             2.74 °C
+This work (+PTPE)████████████████                              2.67 °C  ← best
+                 │     │     │     │     │     │     │
+                 0     1     2     3     4     5     6     7
+```
+### Oxygen macro-F1 (4-class) — higher is better
+```
+This work LoRA    ███████████████████████████████████████  0.945  ← best
+This work tabular ████████████████                         0.402
+This work pre-PTPE████████████████▓                        0.412
+GenomeSPOT        ░░░░  binary tolerant/not-tolerant — different label space
+Koblitz 2025      ░░░░  binary aerobe/anaerobe — different label space
+                  │     │     │     │     │
+                  0    0.25  0.5  0.75   1.0
+```
+### Medium recommendation Hit@5 (21,050 strains, 5-fold family-heldout)
+```
+XGBoost recommender  ████████████████████████████████████████  77.5%  ← this work
+Taxonomic baseline   █████████████████████                     37.2%
+Global popularity    █████████████████████                     36.6%
+                     │       │       │       │       │
+                     0%     25%     50%     75%    100%
+```
+### Status of each comparison
+| Locked in `artifacts/` | Pending |
+|---|---|
+| ✓ GenomeSPOT on n=5 smoke (`genomespot_smoke_benchmark.md`) and on the n=5,000 subset (`genomespot_5k_benchmark.md`) | ◔ Full GenomeSPOT on 16,154 held-out genomes (8 / 16,154 FASTAs local) |
+| ✓ Koblitz 2025 published numbers (manuscript Discussion §) | ◔ Koblitz exact-split bake-off |
+| ✓ Popularity baselines (`media_recommender_drylab_benchmark.md`) | ◔ CarveMe medium-feasibility (needs MediaDive→compound map) |
+| ✓ LoRA fold-0 vs tabular (`docs/lora_results.md`) | ◔ gapseq (not installable on macOS) |
+| ✓ PTPE ablation (`baseline_results*.json`) | ◔ LoRA folds 1–4 |
+| | ◔ Leave-one-phylum-out (Li 2023-style out-of-clade) |
+| | ◔ Wet-lab validation of any prediction |
+### Honest summary
+On the metrics that have actually been measured, this work is the strongest published
+BacDive cultivation-condition predictor: −39% temperature MAE vs GenomeSPOT on the
+same 5,000 genomes, ~9% lower temperature MAE than Koblitz 2025 on 2× the corpus,
+0.945 oxygen macro-F1 via LoRA on the harder 4-class label (fold 0 only), and ~2× the
+Hit@5 of popularity baselines for medium recommendation. Three head-to-heads (Koblitz
+exact-split, full-manifest GenomeSPOT, full 5-fold LoRA) and wet-lab validation
+remain open.
 ## Approach
 ```
     --resume-chunks \
     --progress-every 25 \
     --output artifacts/hybrid_predictions.parquet
+# === External benchmark manifest ===
+# Pins the same held-out strains/folds for GenomeSPOT, CarveMe, and gapseq runs.
+PYTHONPATH=src uv run --python 3.11 python scripts/42_prepare_external_benchmarks.py
+# Optional smoke download of 10 missing genome FASTAs for external-tool setup checks.
+PYTHONPATH=src uv run --python 3.11 python scripts/42_prepare_external_benchmarks.py \
+    --download-fastas 10
+# Run a small GenomeSPOT smoke benchmark on exact held-out rows with all condition labels.
+PYTHONPATH=src uv run --python 3.11 python scripts/43_run_genomespot_benchmark.py \
+    --limit 5 \
+    --require-label temperature \
+    --require-label ph \
+    --require-label salt \
+    --require-label oxygen
+# Run the 5,000-genome GenomeSPOT subset used in the scoreboard.
+PYTHONPATH=src uv run --python 3.11 python scripts/43_run_genomespot_benchmark.py \
+    --manifest artifacts/external_benchmark_manifest_5k.parquet \
+    --limit 5000 \
+    --out-json artifacts/genomespot_5k_benchmark.json \
+    --out-md artifacts/genomespot_5k_benchmark.md
 ```
 For overnight runs, `scripts/run_train_and_eval.sh` chains the core pipeline. The HMM,
 ## Architecture
+The system is organised as **five layers**. Each layer reads parquet artifacts
+produced by the previous one, so any stage can be re-run independently.
+```
+                  ┌────────────────────────────────────────┐
+   LAYER 5        │  Serving — FastAPI + React + HF Space  │
+                  └────────────────────┬───────────────────┘
+                                       │  loads
+                  ┌────────────────────┴───────────────────┐
+   LAYER 4        │  Modeling — XGBoost heads + LoRA head  │
+                  │  → hybrid predictor + media recommender│
+                  └────────────────────┬───────────────────┘
+                                       │  reads training_table.parquet
+                  ┌────────────────────┴───────────────────┐
+   LAYER 3        │  Feature fusion — wide table join      │
+                  └────────────────────┬───────────────────┘
+                                       │  reads 6 parquet shards
+                  ┌────────────────────┴───────────────────┐
+   LAYER 2        │  Feature extraction — 6 parallel paths │
+                  └────────────────────┬───────────────────┘
+                                       │  reads bacdive_phenotypes + genome FASTAs
+                  ┌────────────────────┴───────────────────┐
+   LAYER 1        │  Ingestion — BacDive v2 + NCBI Datasets│
+                  └────────────────────────────────────────┘
+```
+### Layer 1 — Ingestion
+| File | Role |
+|---|---|
+| `src/microbe_model/data/bacdive.py` | v2 REST client (public, no auth). Batch-scans integer ID range to discover ~150K live strains in ~2K calls. |
+| `scripts/01_fetch_bacdive.py` | Sweeps the BacDive API and writes `data/bacdive_phenotypes.parquet`. |
+| `src/microbe_model/pipeline.py` | Async streaming fetch + featurize. Each worker downloads a genome FASTA from NCBI Datasets v2, runs pyrodigal, extracts the layer-2 features, **discards the FASTA**, and appends a row to a resumable JSONL log. |
+| `scripts/02_fetch_and_featurize.py` | CLI entry point that drives the pipeline over BacDive rows. |
+| `scripts/18_resolve_species_to_genome.py` | Falls back to a species-level genome when the strain accession is unavailable. |
+| `scripts/06_fetch_gtdb_candidates.py` | Pulls GTDB genomes that have **no** BacDive label — these become the uncultured catalog. |
+### Layer 2 — Feature extraction (six parallel paths)
+All six produce a per-genome parquet keyed by `genome_accession`. Layer 3 left-joins them.
+| # | Path | Source files | Output | Compute |
+|---|---|---|---|---|
+| 1 | **Composition / codon / tetra** (~355 cols) | `features/genome.py`, `features/composition.py` | `data/features.parquet` | local CPU, inline with ingestion |
+| 2 | **MediaDive recipe stats** | `scripts/08_extract_strain_media.py`, `09_fetch_media_recipes.py`, `20_build_mediadive_features.py` | `data/mediadive_features.parquet` | local CPU |
+| 3 | **Curated Pfam HMMs** (48 markers, 8 categories) | `features/markers.py`, `scripts/23_verify_markers.py`, `scripts/24_unified_hmm_scan.py` | `data/hmm_features.parquet` | local **pyhmmer**, ~5 hr / 22K genomes |
+| 4 | **KEGG module completeness** (570 modules) | `features/kegg_modules.py`, `scripts/27_fetch_kegg_modules.py`, `28_kofam_scan.py`, `29_compute_kegg_completeness.py`, `modal_kofam.py` | `data/kegg_modules.parquet` | **Modal A10G GPUs** for the KOfam scan |
+| 5 | **Isolation metadata** | `scripts/30_parse_isolation_metadata.py` | `data/isolation_metadata.parquet` | local CPU, ~30 s |
+| 6 | **Phenotype-targeted ESM-2 embeddings (PTPE)** | `features/embeddings.py`, `scripts/36_extract_marker_sequences.py`, `modal_per_marker_embed.py`, `_materialize_per_marker_embeddings.py` | `data/per_marker_embeddings.parquet` | **Modal A10G GPUs** (frozen ESM-2 t30) |
+Per genome, the six paths concatenate to **~6,312 features**.
+### Layer 3 — Feature fusion
+| File | Role |
+|---|---|
+| `scripts/21_build_strain_catalog.py` | Materialises the deduplicated `data/strain_catalog.parquet`. |
+| `scripts/31_merge_features.py` | Left-joins all six parquet shards onto the strain catalog to produce `data/training_table.parquet` (the canonical input to every modeling step). |
+| `scripts/13_compare_v1_v2.py` | A/B harness for checking whether each new feature path improves the metric. |
+### Layer 4 — Modeling
+| File | Role |
+|---|---|
+| `src/microbe_model/train/baseline.py` | Multi-task **XGBoost**. Regression heads for T_opt/pH/salt, classification head for oxygen. 5-fold **GroupKFold by family**, so strains from the same family never appear in both the train and test folds. Per-fold class re-encoding. |
+| `src/microbe_model/train/media_recommender.py` | Per-medium binary classifiers — the recommender. Trained by `scripts/10_train_media_recommender.py`. |
+| `src/microbe_model/train/lora_model.py` + `lora_trainer.py` | LoRA fine-tune of ESM-2 t12 on marker proteins. Driven by `modal_train_lora.py` (Modal) or `lambda_train_lora.py` (Lambda) or the Kaggle notebook in `kaggle/`. |
+| `scripts/03_train_baseline.py` | Train the tabular XGBoost heads. |
+| `scripts/15_train_phenotype_heads.py` | Retrain individual phenotype heads after a feature update. |
+| `scripts/37_compare_lora_baseline.py`, `38_eval_lora_checkpoint.py` | LoRA vs tabular head A/B and checkpoint eval. |
+| `scripts/39_predict_hybrid.py` | **Hybrid predictor.** Uses tabular heads for T_opt/pH/salt and the LoRA head for oxygen. Tags every output row with `O2_source ∈ {LoRA, tabular}`. |
+| `src/microbe_model/eval.py` + `scripts/04_eval.py`, `05_overnight_summary.py` | Markdown report renderer → `artifacts/eval_report.md`, `OVERNIGHT_SUMMARY.md`. |
+Outputs of this layer (committed via LFS):
+- `models/phenotype/` — XGBoost heads for the 4 phenotype targets.
+- `models/recommender/` — per-medium binary classifiers.
+- LoRA checkpoint shipped as a GitHub Release (`lora-fold0-20260518`).
+### Layer 5 — Serving
+| File | Role |
+|---|---|
+| `app.py` | Entrypoint launched by the Docker Space. |
+| `Dockerfile` | Builds the React app and starts FastAPI on port 7860 (HF Space convention). |
+| `api/main.py` | FastAPI backend. Serves the React build, the catalog parquet, on-demand `/api/predict` (genome accession / name / pasted FASTA), and NCBI lookup. |
+| `web/` (Vite + React) | Frontend deployed to <https://huggingface.co/spaces/miyuiu/microbe-model>. Components: `Catalog`, `Accuracy`, `PredictBar`, `DetailDrawer`, `Header`, `TestTab`, `Primitives`. |
+Hybrid catalog overlay:
+- `/api/catalog` always reads `artifacts/uncultured_predictions.parquet`.
+- If `artifacts/hybrid_predictions.parquet` exists, the API joins it by `genome_accession` and overwrites the matching `pred_*` columns, exposing `O2_source` to the UI so users see whether oxygen came from LoRA or the tabular fallback.
+### Cross-cutting — Benchmarking & external comparison
+| File | Role |
+|---|---|
+| `scripts/41_benchmark_media_recommender.py` | 5-fold family-heldout dry-lab benchmark for the recommender. |
+| `scripts/42_prepare_external_benchmarks.py` | Pins the same held-out strains/folds for GenomeSPOT, CarveMe, gapseq. |
+| `scripts/43_run_genomespot_benchmark.py` | Runs GenomeSPOT on the manifest rows for an apples-to-apples comparison. |
+| `tests/test_hybrid_predictor.py`, `test_lora_checkpoint_eval.py`, `test_external_benchmark_prep.py`, `test_genomespot_benchmark.py`, `test_media_recommender.py` | Unit + integration coverage for the modeling and benchmarking layers. |
+### Remote-execution surfaces
+The local Mac can't sustain the KOfam scan or full ESM-2 inference, so the heavy
+stages dispatch to managed GPUs. Each surface has its own driver:
+| Driver | Used for |
+|---|---|
+| `scripts/modal_embed.py` | ESM-2 t30 full-proteome embedding on Modal A10G. |
+| `scripts/modal_per_marker_embed.py` | PTPE (marker-only) ESM-2 embedding on Modal. |
+| `scripts/modal_kofam.py` | KOfam HMM scan on Modal. |
+| `scripts/modal_train_lora.py` | LoRA fine-tune on Modal. |
+| `scripts/lambda_train_lora.py` | Same fine-tune, Lambda Labs backend. |
+| `kaggle/lora_train_kaggle.ipynb` + `kaggle/upload.sh` | Kaggle notebook fallback for LoRA training. |
+| `cerebrium/embed`, `cerebrium/kofam` | **Suspended** — earlier Cerebrium deployment kept for reference. |
 ## Layout
 ```
+microbe-model/
+├── app.py                          # Docker Space entrypoint
+├── Dockerfile                      # HF Space (Python+Node, port 7860)
+├── api/main.py                     # FastAPI backend (catalog + /predict)
+├── web/                            # React/Vite frontend
+│   └── src/
+│       ├── App.jsx, main.jsx, theme.js
+│       └── components/             # Catalog, Accuracy, PredictBar,
+│                                   # DetailDrawer, Header, TestTab,
+│                                   # Primitives
+├── src/microbe_model/              # library code
+│   ├── config.py                   # paths, env vars, prediction targets
+│   ├── pipeline.py                 # streaming fetch + featurize (Layer 1)
+│   ├── eval.py                     # markdown report renderer
+│   ├── explore.py
+│   ├── data/
+│   │   └── bacdive.py              # BacDive v2 client
+│   ├── features/                   # Layer 2 implementations
+│   │   ├── genome.py               # pyrodigal + composition
+│   │   ├── composition.py          # codon / tetranucleotide helpers
+│   │   ├── markers.py              # 48 verified Pfam markers
+│   │   ├── kegg_modules.py         # KEGG rule parser + AST evaluator
+│   │   └── embeddings.py           # ESM-2 mean-pool helpers
+│   └── train/                      # Layer 4 implementations
+│       ├── baseline.py             # multi-task XGBoost + GroupKFold
+│       ├── media_recommender.py    # per-medium binary classifiers
+│       ├── lora_model.py           # LoRA wrapper around ESM-2 t12
+│       └── lora_trainer.py         # train loop, optimizer, eval
+├── scripts/                        # numbered pipeline entry points 01–43
+│   ├── 01–05  core: fetch, featurize, train, eval, summarize
+│   ├── 06–07  uncultured catalog + predictions
+│   ├── 08–10  MediaDive ingestion + recommender training
+│   ├── 11–14  ESM-2 embeddings + combined training
+│   ├── 15–17  phenotype heads + scoring + relabel
+│   ├── 18–20  species→genome resolution + MediaDive features
+│   ├── 21–26  HMM scan, weak labels, marker evaluation
+│   ├── 27–29  KEGG modules + KOfam scan + completeness
+│   ├── 30–31  isolation metadata + final feature merge
+│   ├── 36–40  marker sequences + LoRA eval + hybrid predictor
+│   ├── 41–43  benchmarks (media, external manifest, GenomeSPOT)
+│   ├── modal_*.py                  # Modal GPU dispatchers
+│   └── lambda_train_lora.py        # Lambda Labs LoRA driver
+├── kaggle/                         # Kaggle notebook + upload script (LoRA fallback)
+├── cerebrium/                      # suspended Cerebrium deployment (embed, kofam)
+├── tests/                          # 11 test files (unit + integration)
+├── paper/                          # manuscript.md / .html / .pdf + render.py
+├── docs/                           # hybrid_predictor.md, lora_results.md, etc.
+├── data/                           # (gitignored) parquet shards + JSONL features
+├── artifacts/                      # eval reports, training logs, prediction parquets
+└── models/                         # trained heads + recommender (LFS)
+    ├── phenotype/
+    └── recommender/
 ```
+### Key data artifacts (between stages)
+| Parquet | Produced by | Consumed by |
+|---|---|---|
+| `data/bacdive_phenotypes.parquet` | `01_fetch_bacdive.py` | Layer 1 ingestion |
+| `data/features.parquet` | `02_fetch_and_featurize.py` | Layer 3 merge |
+| `data/hmm_features.parquet` | `24_unified_hmm_scan.py` | Layer 3 merge |
+| `data/kegg_modules.parquet` | `29_compute_kegg_completeness.py` | Layer 3 merge |
+| `data/mediadive_features.parquet` | `20_build_mediadive_features.py` | Layer 3 merge |
+| `data/isolation_metadata.parquet` | `30_parse_isolation_metadata.py` | Layer 3 merge |
+| `data/per_marker_embeddings.parquet` | `_materialize_per_marker_embeddings.py` | Layer 3 merge |
+| `data/training_table.parquet` | `31_merge_features.py` | all training scripts |
+| `artifacts/uncultured_predictions.parquet` | `07_predict_uncultured.py` | served catalog |
+| `artifacts/hybrid_predictions.parquet` | `39_predict_hybrid.py` | served catalog overlay |
 ## What this is *not* yet
 - Not an end-to-end foundation model. LoRA only fine-tunes an ESM-2 marker-protein
 🔬 Open:
 - **Run LoRA across folds 1-4** if publication-grade validation is needed; fold 0
   is promising for oxygen, but it is still only one group fold
+- **Run external baselines** on the prepared held-out manifest once FASTAs and
+  third-party databases are local: GenomeSPOT for condition traits, and
+  CarveMe/gapseq-style metabolic reconstructions for medium feasibility.
 - **Attention-pooled** per-category genome encoder instead of mean-pool (most novel
   methodological direction)
 - LPSN/GTDB family proper join (for tighter GroupKFold)
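
The hybrid catalog overlay described in the README (base predictions always loaded, matching `pred_*` columns overwritten by `genome_accession` when the hybrid artifact exists) can be sketched in plain Python. This is a minimal illustration, not the code in `api/main.py`: rows are modelled as dicts rather than parquet frames, the function name `overlay_hybrid` and the sample column `pred_temp` are hypothetical, and defaulting unmatched rows to `O2_source = "tabular"` is an assumption based on the README's "tabular fallback" wording.

```python
from typing import Any

Row = dict[str, Any]


def overlay_hybrid(base_rows: list[Row], hybrid_rows: list[Row],
                   key: str = "genome_accession") -> list[Row]:
    """Return copies of base_rows with hybrid values overlaid by accession.

    Only pred_* columns that already exist in the base row are overwritten;
    O2_source is copied from the hybrid row when there is a match.
    """
    hybrid_by_key = {row[key]: row for row in hybrid_rows}
    merged: list[Row] = []
    for row in base_rows:
        out = dict(row)  # copy so the caller's rows are not mutated
        hit = hybrid_by_key.get(row[key])
        if hit is None:
            # Assumption: rows without a hybrid match are tabular predictions.
            out.setdefault("O2_source", "tabular")
        else:
            for col, value in hit.items():
                if col.startswith("pred_") and col in out:
                    out[col] = value
            out["O2_source"] = hit.get("O2_source", "tabular")
        merged.append(out)
    return merged


base = [
    {"genome_accession": "GCA_1", "pred_oxygen": "aerobe", "pred_temp": 30.0},
    {"genome_accession": "GCA_2", "pred_oxygen": "anaerobe", "pred_temp": 37.0},
]
hybrid = [
    {"genome_accession": "GCA_1", "pred_oxygen": "microaerophile", "O2_source": "LoRA"},
]
catalog = overlay_hybrid(base, hybrid)
```

Keeping the overlay a pure function of the two row sets means the catalog still works when the hybrid artifact is absent: passing an empty `hybrid_rows` list returns the base predictions tagged as tabular.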

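The relative deltas quoted in the README's headline table follow directly from the metric pairs listed beside them. A minimal sketch that recomputes them, assuming the rounded values shown in the table (the helper name `relative_change` is illustrative and does not come from the repository):

```python
def relative_change(comparator: float, ours: float) -> float:
    """Signed percent change of `ours` relative to `comparator`."""
    return 100.0 * (ours - comparator) / comparator


# (comparator value, this-work value) as printed in the headline table.
pairs = {
    "Temperature MAE vs GenomeSPOT": (4.39, 2.67),
    "pH MAE vs GenomeSPOT": (0.61, 0.47),
    "Salt MAE vs GenomeSPOT": (1.98, 1.92),
    "Temperature MAE vs Koblitz 2025": (2.94, 2.67),
    "Oxygen macro-F1 vs tabular head": (0.402, 0.945),
    "Medium Hit@5 vs taxonomic popularity": (0.372, 0.775),
}

deltas = {name: round(relative_change(c, o)) for name, (c, o) in pairs.items()}
for name, pct in deltas.items():
    print(f"{name}: {pct:+d}%")
```

For error metrics (MAE) a negative change is an improvement; for F1 and Hit@5 a positive change is.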
api/main.py CHANGED

@@ -15,6 +15,8 @@ import json
 import os
 import re
 import sys
 from pathlib import Path
 from tempfile import NamedTemporaryFile
 from typing import Any
@@ -183,27 +185,48 @@ def _looks_like_accession(target: str) -> bool:
     return bool(_ACCESSION_RE.match(target.strip()))
-def _ncbi_assembly_hits(q: str, retmax: int = 10) -> list[dict]:
-    """Resolve an organism name to NCBI assembly accessions, best (most complete) first."""
     api_key = os.environ.get("NCBI_API_KEY")
-    common = {"api_key": api_key} if api_key else {}
-    r = requests.get(
-        f"{EUTILS_BASE}/esearch.fcgi",
-        params={"db": "assembly", "term": f"{q}[Organism] AND latest[filter]",
-                "retmode": "json", "retmax": retmax, **common},
-        timeout=20,
     )
-    r.raise_for_status()
-    ids = r.json().get("esearchresult", {}).get("idlist", [])
     if not ids:
-        return []
-    r = requests.get(
-        f"{EUTILS_BASE}/esummary.fcgi",
-        params={"db": "assembly", "id": ",".join(ids), "retmode": "json", **common},
-        timeout=20,
     )
-    r.raise_for_status()
-    result = r.json().get("result", {})
     out = []
     for uid in result.get("uids", []):
         doc = result.get(uid, {})
@@ -214,7 +237,13 @@ def _ncbi_assembly_hits(q: str, retmax: int = 10) -> list[dict]:
         })
     rank = {"Complete Genome": 0, "Chromosome": 1, "Scaffold": 2, "Contig": 3}
     out.sort(key=lambda r: rank.get(r["level"], 99))
-    return out
 # ──────────────────────────────────────────────────────────────────────

 import os
 import re
 import sys
+import time
+from functools import lru_cache
 from pathlib import Path
 from tempfile import NamedTemporaryFile
 from typing import Any
     return bool(_ACCESSION_RE.match(target.strip()))
+def _eutils_get(endpoint: str, params: dict, *, retries: int = 3) -> dict:
+    """GET an E-utilities endpoint with an NCBI API key (if set) and retry on 429/5xx.
+    Anonymous eutils is limited to 3 req/sec (10/sec with NCBI_API_KEY), so transient
+    429s are expected under concurrent load. Back off and retry rather than surfacing
+    the rate limit to the user.
+    """
     api_key = os.environ.get("NCBI_API_KEY")
+    if api_key:
+        params = {**params, "api_key": api_key}
+    last_exc: Exception | None = None
+    for attempt in range(retries):
+        try:
+            r = requests.get(f"{EUTILS_BASE}/{endpoint}", params=params, timeout=20)
+            r.raise_for_status()
+            return r.json()
+        except requests.RequestException as e:
+            last_exc = e
+            status = getattr(e.response, "status_code", None)
+            if status == 429 or (status is not None and status >= 500):
+                time.sleep(0.5 * (2 ** attempt))  # 0.5s, 1s, 2s
+                continue
+            raise
+    raise last_exc  # type: ignore[misc]
+@lru_cache(maxsize=512)
+def _ncbi_assembly_hits_cached(q_norm: str, retmax: int) -> tuple[dict, ...]:
+    """Cached core resolver. Keyed on the normalized query; returns a hashable tuple."""
+    data = _eutils_get(
+        "esearch.fcgi",
+        {"db": "assembly", "term": f"{q_norm}[Organism] AND latest[filter]",
+         "retmode": "json", "retmax": retmax},
     )
+    ids = data.get("esearchresult", {}).get("idlist", [])
     if not ids:
+        return ()
+    data = _eutils_get(
+        "esummary.fcgi",
+        {"db": "assembly", "id": ",".join(ids), "retmode": "json"},
     )
+    result = data.get("result", {})
     out = []
     for uid in result.get("uids", []):
         doc = result.get(uid, {})
         })
     rank = {"Complete Genome": 0, "Chromosome": 1, "Scaffold": 2, "Contig": 3}
     out.sort(key=lambda r: rank.get(r["level"], 99))
+    return tuple(out)
+def _ncbi_assembly_hits(q: str, retmax: int = 10) -> list[dict]:
+    """Resolve an organism name to NCBI assembly accessions, best (most complete) first."""
+    hits = _ncbi_assembly_hits_cached(q.strip().lower(), retmax)
+    return [dict(h) for h in hits]  # fresh copies so callers can't mutate the cache
 # ──────────────────────────────────────────────────────────────────────
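
The retry logic that `_eutils_get` adds can be isolated from the network for testing. The following is a minimal sketch under stated assumptions, not the code from the commit: `TransientHTTPError`, `is_retryable`, `call_with_backoff`, and `demo` are hypothetical names, and the HTTP call is replaced by an injected `fetch` callable. It keeps the same rule as the diff (retry on 429 and 5xx, delay `0.5 * 2 ** attempt`), with one deliberate difference: it does not sleep after the final failed attempt, whereas the loop in the diff, as written, sleeps once more before re-raising.

```python
import time
from typing import Any, Callable


class TransientHTTPError(Exception):
    """Hypothetical error type that carries an HTTP status code."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def is_retryable(status: int) -> bool:
    # Same rule as the diff: retry on 429 (rate limited) and on any 5xx.
    return status == 429 or status >= 500


def call_with_backoff(
    fetch: Callable[[], Any],
    *,
    retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call fetch(), retrying retryable failures with exponential backoff."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            return fetch()
        except TransientHTTPError as exc:
            last_attempt = attempt == retries - 1
            # Non-retryable statuses and the final attempt propagate at once.
            if not is_retryable(exc.status) or last_attempt:
                raise
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable: the loop always returns or raises")


def demo() -> tuple[Any, list[float]]:
    """Simulate two transient failures followed by a success."""
    statuses = iter([429, 503, 200])
    delays: list[float] = []

    def fake_fetch() -> dict:
        status = next(statuses)
        if status != 200:
            raise TransientHTTPError(status)
        return {"ok": True}

    # Record the requested delays instead of really sleeping.
    result = call_with_backoff(fake_fetch, sleep=delays.append)
    return result, delays
```

Injecting `sleep` keeps tests fast and makes the backoff schedule observable; in production code the default `time.sleep` would be used.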

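The caching pattern in the new `_ncbi_assembly_hits` wrapper (normalize the query, cache an immutable tuple, hand out fresh copies) can be illustrated without touching NCBI. This is a sketch only: `slow_lookup`, `lookup_cached`, `lookup`, and `CALLS` are hypothetical stand-ins, and the fake accession string is invented for the example.

```python
from functools import lru_cache

CALLS = {"n": 0}  # counts how often the slow lookup actually runs


def slow_lookup(q_norm: str) -> list[dict]:
    """Hypothetical stand-in for the two E-utilities requests."""
    CALLS["n"] += 1
    return [{"accession": f"FAKE_{q_norm}", "level": "Contig"}]


@lru_cache(maxsize=512)
def lookup_cached(q_norm: str, retmax: int) -> tuple[dict, ...]:
    # The tuple cannot be appended to or reordered by callers.
    return tuple(slow_lookup(q_norm)[:retmax])


def lookup(q: str, retmax: int = 10) -> list[dict]:
    # Normalizing first lets "  E. coli " and "e. coli" share one cache entry.
    hits = lookup_cached(q.strip().lower(), retmax)
    # Shallow copies: callers can edit top-level keys without touching the cache.
    return [dict(h) for h in hits]
```

Two caveats apply to this pattern in general: `dict(h)` is a shallow copy, so nested mutable values would still be shared with the cache, and `lru_cache` has no expiry, so entries stay until they are evicted or the process restarts.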
artifacts/baseline_results.json CHANGED

The diff for this file is too large to render. See raw diff

artifacts/baseline_results_pre_pme.json ADDED
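
In this file each target pairs a `mean_metric` with a five-entry `folds` list, and for the four targets shown the `mean_metric` matches the unweighted mean of the fold `value` fields. A small consistency check along those lines might look like the sketch below. The function name `check_mean_metrics` is hypothetical, the unweighted-mean reading is an assumption inferred from the numbers rather than from the training code, and `SAMPLE` is a trimmed excerpt (two targets, `value` fields only) copied from the diff.

```python
import json
import math

# Trimmed excerpt of the added file: two targets, fold values only.
SAMPLE = """
{
  "optimal_ph": {
    "mean_metric": 0.47295744064234946,
    "folds": [
      {"value": 0.4365982120207544},
      {"value": 0.5473054707458974},
      {"value": 0.48511127384122277},
      {"value": 0.4573734540565341},
      {"value": 0.43839879254733816}
    ]
  },
  "salt_tolerance_pct": {
    "mean_metric": 1.938587229032814,
    "folds": [
      {"value": 1.6739041984692624},
      {"value": 1.7789674237746413},
      {"value": 2.653249260519034},
      {"value": 1.7333069061884374},
      {"value": 1.8535083562126964}
    ]
  },
  "__meta__": {"feature_cols": []}
}
"""


def check_mean_metrics(results: dict, rel_tol: float = 1e-9) -> dict[str, bool]:
    """Report, per target, whether mean_metric equals the mean of fold values."""
    report = {}
    for target, entry in results.items():
        if target.startswith("__"):  # skip bookkeeping keys such as __meta__
            continue
        values = [fold["value"] for fold in entry["folds"]]
        fold_mean = sum(values) / len(values)
        report[target] = math.isclose(entry["mean_metric"], fold_mean, rel_tol=rel_tol)
    return report


report = check_mean_metrics(json.loads(SAMPLE))
```

To run the same check on the full artifact, load it with `json.load` from the repository path instead of parsing `SAMPLE`.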

	@@ -0,0 +1,1461 @@

+{
+  "optimal_temperature_c": {
+    "task": "regression",
+    "mean_metric": 2.739843424922931,
+    "folds": [
+      {
+        "target": "optimal_temperature_c",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 3.0611745822227165,
+        "n_train": 36496,
+        "n_test": 9125
+      },
+      {
+        "target": "optimal_temperature_c",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 2.455059736520248,
+        "n_train": 36497,
+        "n_test": 9124
+      },
+      {
+        "target": "optimal_temperature_c",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 2.9954137300201173,
+        "n_train": 36497,
+        "n_test": 9124
+      },
+      {
+        "target": "optimal_temperature_c",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 2.890288436794323,
+        "n_train": 36497,
+        "n_test": 9124
+      },
+      {
+        "target": "optimal_temperature_c",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 2.2972806390572513,
+        "n_train": 36497,
+        "n_test": 9124
+      }
+    ],
+    "top_features": {
+      "ivywrel_frac": 0.06971819475293159,
+      "kegg_M00124": 0.03207303611561656,
+      "hmm_Cyt_CBB3_microaero_score": 0.0300216194242239,
+      "hmm_COX1_aerobic_score": 0.02658985326997936,
+      "kegg_M00848": 0.02018665219657123,
+      "hmm_COX2_periplasm_aero_score": 0.017157328128814698,
+      "iso_cat2_patient": 0.016685061932366806,
+      "kegg_M00154": 0.016055268728814555,
+      "iso_country_sweden": 0.013991944026201963,
+      "iso_host_kingdom_human": 0.013502865098416805,
+      "iso_cat2_thermophilic_gt45_c": 0.012728683836758136,
+      "kegg_M00044": 0.009734607744030654,
+      "hmm_Hsp90_score": 0.009510826040059327,
+      "hmm_Hsp90_n": 0.009146594302728772,
+      "kegg_M00959": 0.008737720013596118,
+      "kegg_M00118": 0.007348061632364988,
+      "hmm_UvrD_helicase_C_score": 0.0068708939361386,
+      "hmm_BCCT_compatible_n": 0.005894379841629415,
+      "iso_cat1_infection": 0.005757758072286378,
+      "kegg_M00357": 0.005146319686900824
+    }
+  },
+  "optimal_ph": {
+    "task": "regression",
+    "mean_metric": 0.47295744064234946,
+    "folds": [
+      {
+        "target": "optimal_ph",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 0.4365982120207544,
+        "n_train": 4082,
+        "n_test": 1021
+      },
+      {
+        "target": "optimal_ph",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 0.5473054707458974,
+        "n_train": 4082,
+        "n_test": 1021
+      },
+      {
+        "target": "optimal_ph",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 0.48511127384122277,
+        "n_train": 4082,
+        "n_test": 1021
+      },
+      {
+        "target": "optimal_ph",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 0.4573734540565341,
+        "n_train": 4083,
+        "n_test": 1020
+      },
+      {
+        "target": "optimal_ph",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 0.43839879254733816,
+        "n_train": 4083,
+        "n_test": 1020
+      }
+    ],
+    "top_features": {
+      "md_ph_median": 0.02589539885520935,
+      "hmm_TrkH_K_channel_n": 0.019185363367432727,
+      "iso_cat2_acidic": 0.012672881036996842,
+      "kegg_M00066": 0.0121498073451221,
+      "iso_cat2_alkaline": 0.011763222608715295,
+      "kegg_M00124": 0.0111599849013146,
+      "kegg_M00416": 0.008738127537071705,
+      "hmm_BCCT_compatible_score": 0.007578231208026409,
+      "kegg_M00309": 0.007578140590339899,
+      "kegg_M00788": 0.007394352648407221,
+      "kegg_M00155": 0.007158075552433729,
+      "kegg_M00525": 0.00614791321568191,
+      "hmm_RuBisCO_small_form1_score": 0.006115549360401928,
+      "kegg_M00154": 0.0058954657521098856,
+      "kegg_M01038": 0.005821993038989603,
+      "neg_charged_frac": 0.00574727994389832,
+      "hmm_RuBisCO_large_form1_score": 0.005585524230264127,
+      "kegg_M00126": 0.005203998694196343,
+      "kegg_M00417": 0.004907929454930127,
+      "kegg_M01048": 0.0048509430140256885
+    }
+  },
+  "oxygen_requirement": {
+    "task": "classification",
+    "mean_metric": 0.41167508730878977,
+    "folds": [
+      {
+        "target": "oxygen_requirement",
+        "task": "classification",
+        "metric_name": "f1_macro",
+        "value": 0.4057735839519483,
+        "n_train": 17311,
+        "n_test": 4328
+      },
+      {
+        "target": "oxygen_requirement",
+        "task": "classification",
+        "metric_name": "f1_macro",
+        "value": 0.42448903596925475,
+        "n_train": 17311,
+        "n_test": 4326
+      },
+      {
+        "target": "oxygen_requirement",
+        "task": "classification",
+        "metric_name": "f1_macro",
+        "value": 0.4253087727137673,
+        "n_train": 17311,
+        "n_test": 4328
+      },
+      {
+        "target": "oxygen_requirement",
+        "task": "classification",
+        "metric_name": "f1_macro",
+        "value": 0.32108277548814124,
+        "n_train": 17311,
+        "n_test": 4328
+      },
+      {
+        "target": "oxygen_requirement",
+        "task": "classification",
+        "metric_name": "f1_macro",
+        "value": 0.4817212684208369,
+        "n_train": 17312,
+        "n_test": 4327
+      }
+    ],
+    "top_features": {
+      "kegg_M00048": 0.04917992913397029,
+      "hmm_COX1_aerobic_score": 0.030582932382822038,
+      "hmm_FeFe_hyd_anaerobic_score": 0.017010687850415707,
+      "kegg_M00868": 0.016209590720245616,
+      "hmm_Alpha_amylase_score": 0.012914934009313584,
+      "kegg_M00122": 0.012612361693754792,
+      "n_predicted_cds": 0.011417849035933614,
+      "kegg_M00013": 0.011295223154593258,
+      "codon_CAA": 0.010262218676507473,
+      "kegg_M00060": 0.009821721445769071,
+      "iso_cat1_host": 0.009196543786674737,
+      "hmm_FAD_binding_FrdA_n": 0.009031777316704393,
+      "kegg_M00032": 0.008559015835635364,
+      "hmm_NifDK_nitrogenase_n": 0.007181795220822096,
+      "kegg_M01010": 0.006753691844642163,
+      "kegg_M00036": 0.006538775702938438,
+      "hmm_PdxJ_pyridoxine_n": 0.006458204134833068,
+      "iso_host_kingdom_other": 0.006149246357381344,
+      "kegg_M00609": 0.005912851588800549,
+      "kegg_M00878": 0.005448635050561279
+    }
+  },
+  "salt_tolerance_pct": {
+    "task": "regression",
+    "mean_metric": 1.938587229032814,
+    "folds": [
+      {
+        "target": "salt_tolerance_pct",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 1.6739041984692624,
+        "n_train": 3075,
+        "n_test": 769
+      },
+      {
+        "target": "salt_tolerance_pct",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 1.7789674237746413,
+        "n_train": 3075,
+        "n_test": 769
+      },
+      {
+        "target": "salt_tolerance_pct",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 2.653249260519034,
+        "n_train": 3075,
+        "n_test": 769
+      },
+      {
+        "target": "salt_tolerance_pct",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 1.7333069061884374,
+        "n_train": 3075,
+        "n_test": 769
+      },
+      {
+        "target": "salt_tolerance_pct",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 1.8535083562126964,
+        "n_train": 3076,
+        "n_test": 768
+      }
+    ],
+    "top_features": {
+      "neg_charged_frac": 0.03803467936813831,
+      "tetra_ATCC": 0.03377264237497002,
+      "tetra_ACAT": 0.02272233280236833,
+      "aa_frac_C": 0.019299705256707968,
+      "kegg_M00849": 0.01693017091602087,
+      "hmm_BCCT_compatible_n": 0.014071422629058361,
+      "md_nacl_pct_median": 0.013062665052711964,
+      "iso_cat2_saline": 0.010860100761055947,
+      "kegg_M00033": 0.0096977892331779,
+      "kegg_M00159": 0.009692981420084835,
+      "hmm_Cyt_CBB3_microaero_n": 0.009228403586894274,
+      "kegg_M00725": 0.008322405093349517,
+      "kegg_M00149": 0.008042314299382269,
+      "aa_frac_T": 0.007969680236419663,
+      "codon_CCG": 0.007059626755653881,
+      "kegg_M00144": 0.006586183654144406,
+      "kegg_M00121": 0.006443552603013813,
+      "hmm_ATP_synth_F0_B_n": 0.006160225393250585,
+      "iso_collection_year": 0.005867346032755449,
+      "codon_CGC": 0.005818429926875979
+    }
+  },
+  "__meta__": {
+    "feature_cols": [
+      "genome_size_nt",
+      "n_contigs",
+      "gc_content",
+      "n_predicted_cds",
+      "coding_density",
+      "mean_cds_aa_length",
+      "median_cds_aa_length",
+      "aromatic_frac",
+      "pos_charged_frac",
+      "neg_charged_frac",
+      "ivywrel_frac",
+      "mean_hydrophobicity",
+      "mean_isoelectric_point",
+      "aa_frac_A",
+      "aa_frac_C",
+      "aa_frac_D",
+      "aa_frac_E",
+      "aa_frac_F",
+      "aa_frac_G",
+      "aa_frac_H",
+      "aa_frac_I",
+      "aa_frac_K",
+      "aa_frac_L",
+      "aa_frac_M",
+      "aa_frac_N",
+      "aa_frac_P",
+      "aa_frac_Q",
+      "aa_frac_R",
+      "aa_frac_S",
+      "aa_frac_T",
+      "aa_frac_V",
+      "aa_frac_W",
+      "aa_frac_Y",
+      "tetra_AAAA",
+      "tetra_AAAC",
+      "tetra_AAAG",
+      "tetra_AAAT",
+      "tetra_AACA",
+      "tetra_AACC",
+      "tetra_AACG",
+      "tetra_AACT",
+      "tetra_AAGA",
+      "tetra_AAGC",
+      "tetra_AAGG",
+      "tetra_AAGT",
+      "tetra_AATA",
+      "tetra_AATC",
+      "tetra_AATG",
+      "tetra_AATT",
+      "tetra_ACAA",
+      "tetra_ACAC",
+      "tetra_ACAG",
+      "tetra_ACAT",
+      "tetra_ACCA",
+      "tetra_ACCC",
+      "tetra_ACCG",
+      "tetra_ACCT",
+      "tetra_ACGA",
+      "tetra_ACGC",
+      "tetra_ACGG",
+      "tetra_ACGT",
+      "tetra_ACTA",
+      "tetra_ACTC",
+      "tetra_ACTG",
+      "tetra_ACTT",
+      "tetra_AGAA",
+      "tetra_AGAC",
+      "tetra_AGAG",
+      "tetra_AGAT",
+      "tetra_AGCA",
+      "tetra_AGCC",
+      "tetra_AGCG",
+      "tetra_AGCT",
+      "tetra_AGGA",
+      "tetra_AGGC",
+      "tetra_AGGG",
+      "tetra_AGGT",
+      "tetra_AGTA",
+      "tetra_AGTC",
+      "tetra_AGTG",
+      "tetra_AGTT",
+      "tetra_ATAA",
+      "tetra_ATAC",
+      "tetra_ATAG",
+      "tetra_ATAT",
+      "tetra_ATCA",
+      "tetra_ATCC",
+      "tetra_ATCG",
+      "tetra_ATCT",
+      "tetra_ATGA",
+      "tetra_ATGC",
+      "tetra_ATGG",
+      "tetra_ATGT",
+      "tetra_ATTA",
+      "tetra_ATTC",
+      "tetra_ATTG",
+      "tetra_ATTT",
+      "tetra_CAAA",
+      "tetra_CAAC",
+      "tetra_CAAG",
+      "tetra_CAAT",
+      "tetra_CACA",
+      "tetra_CACC",
+      "tetra_CACG",
+      "tetra_CACT",
+      "tetra_CAGA",
+      "tetra_CAGC",
+      "tetra_CAGG",
+      "tetra_CAGT",
+      "tetra_CATA",
+      "tetra_CATC",
+      "tetra_CATG",
+      "tetra_CATT",
+      "tetra_CCAA",
+      "tetra_CCAC",
+      "tetra_CCAG",
+      "tetra_CCAT",
+      "tetra_CCCA",
+      "tetra_CCCC",
+      "tetra_CCCG",
+      "tetra_CCCT",
+      "tetra_CCGA",
+      "tetra_CCGC",
+      "tetra_CCGG",
+      "tetra_CCGT",
+      "tetra_CCTA",
+      "tetra_CCTC",
+      "tetra_CCTG",
+      "tetra_CCTT",
+      "tetra_CGAA",
+      "tetra_CGAC",
+      "tetra_CGAG",
+      "tetra_CGAT",
+      "tetra_CGCA",
+      "tetra_CGCC",
+      "tetra_CGCG",
+      "tetra_CGCT",
+      "tetra_CGGA",
+      "tetra_CGGC",
+      "tetra_CGGG",
+      "tetra_CGGT",
+      "tetra_CGTA",
+      "tetra_CGTC",
+      "tetra_CGTG",
+      "tetra_CGTT",
+      "tetra_CTAA",
+      "tetra_CTAC",
+      "tetra_CTAG",
+      "tetra_CTAT",
+      "tetra_CTCA",
+      "tetra_CTCC",
+      "tetra_CTCG",
+      "tetra_CTCT",
+      "tetra_CTGA",
+      "tetra_CTGC",
+      "tetra_CTGG",
+      "tetra_CTGT",
+      "tetra_CTTA",
+      "tetra_CTTC",
+      "tetra_CTTG",
+      "tetra_CTTT",
+      "tetra_GAAA",
+      "tetra_GAAC",
+      "tetra_GAAG",
+      "tetra_GAAT",
+      "tetra_GACA",
+      "tetra_GACC",
+      "tetra_GACG",
+      "tetra_GACT",
+      "tetra_GAGA",
+      "tetra_GAGC",
+      "tetra_GAGG",
+      "tetra_GAGT",
+      "tetra_GATA",
+      "tetra_GATC",
+      "tetra_GATG",
+      "tetra_GATT",
+      "tetra_GCAA",
+      "tetra_GCAC",
+      "tetra_GCAG",
+      "tetra_GCAT",
+      "tetra_GCCA",
+      "tetra_GCCC",
+      "tetra_GCCG",
+      "tetra_GCCT",
+      "tetra_GCGA",
+      "tetra_GCGC",
+      "tetra_GCGG",
+      "tetra_GCGT",
+      "tetra_GCTA",
+      "tetra_GCTC",
+      "tetra_GCTG",
+      "tetra_GCTT",
+      "tetra_GGAA",
+      "tetra_GGAC",
+      "tetra_GGAG",
+      "tetra_GGAT",
+      "tetra_GGCA",
+      "tetra_GGCC",
+      "tetra_GGCG",
+      "tetra_GGCT",
+      "tetra_GGGA",
+      "tetra_GGGC",
+      "tetra_GGGG",
+      "tetra_GGGT",
+      "tetra_GGTA",
+      "tetra_GGTC",
+      "tetra_GGTG",
+      "tetra_GGTT",
+      "tetra_GTAA",
+      "tetra_GTAC",
+      "tetra_GTAG",
+      "tetra_GTAT",
+      "tetra_GTCA",
+      "tetra_GTCC",
+      "tetra_GTCG",
+      "tetra_GTCT",
+      "tetra_GTGA",
+      "tetra_GTGC",
+      "tetra_GTGG",
+      "tetra_GTGT",
+      "tetra_GTTA",
+      "tetra_GTTC",
+      "tetra_GTTG",
+      "tetra_GTTT",
+      "tetra_TAAA",
+      "tetra_TAAC",
+      "tetra_TAAG",
+      "tetra_TAAT",
+      "tetra_TACA",
+      "tetra_TACC",
+      "tetra_TACG",
+      "tetra_TACT",
+      "tetra_TAGA",
+      "tetra_TAGC",
+      "tetra_TAGG",
+      "tetra_TAGT",
+      "tetra_TATA",
+      "tetra_TATC",
+      "tetra_TATG",
+      "tetra_TATT",
+      "tetra_TCAA",
+      "tetra_TCAC",
+      "tetra_TCAG",
+      "tetra_TCAT",
+      "tetra_TCCA",
+      "tetra_TCCC",
+      "tetra_TCCG",
+      "tetra_TCCT",
+      "tetra_TCGA",
+      "tetra_TCGC",
+      "tetra_TCGG",
+      "tetra_TCGT",
+      "tetra_TCTA",
+      "tetra_TCTC",
+      "tetra_TCTG",
+      "tetra_TCTT",
+      "tetra_TGAA",
+      "tetra_TGAC",
+      "tetra_TGAG",
+      "tetra_TGAT",
+      "tetra_TGCA",
+      "tetra_TGCC",
+      "tetra_TGCG",
+      "tetra_TGCT",
+      "tetra_TGGA",
+      "tetra_TGGC",
+      "tetra_TGGG",
+      "tetra_TGGT",
+      "tetra_TGTA",
+      "tetra_TGTC",
+      "tetra_TGTG",
+      "tetra_TGTT",
+      "tetra_TTAA",
+      "tetra_TTAC",
+      "tetra_TTAG",
+      "tetra_TTAT",
+      "tetra_TTCA",
+      "tetra_TTCC",
+      "tetra_TTCG",
+      "tetra_TTCT",
+      "tetra_TTGA",
+      "tetra_TTGC",
+      "tetra_TTGG",
+      "tetra_TTGT",
+      "tetra_TTTA",
+      "tetra_TTTC",
+      "tetra_TTTG",
+      "tetra_TTTT",
+      "codon_AAA",
+      "codon_AAC",
+      "codon_AAG",
+      "codon_AAT",
+      "codon_ACA",
+      "codon_ACC",
+      "codon_ACG",
+      "codon_ACT",
+      "codon_AGA",
+      "codon_AGC",
+      "codon_AGG",
+      "codon_AGT",
+      "codon_ATA",
+      "codon_ATC",
+      "codon_ATG",
+      "codon_ATT",
+      "codon_CAA",
+      "codon_CAC",
+      "codon_CAG",
+      "codon_CAT",
+      "codon_CCA",
+      "codon_CCC",
+      "codon_CCG",
+      "codon_CCT",
+      "codon_CGA",
+      "codon_CGC",
+      "codon_CGG",
+      "codon_CGT",
+      "codon_CTA",
+      "codon_CTC",
+      "codon_CTG",
+      "codon_CTT",
+      "codon_GAA",
+      "codon_GAC",
+      "codon_GAG",
+      "codon_GAT",
+      "codon_GCA",
+      "codon_GCC",
+      "codon_GCG",
+      "codon_GCT",
+      "codon_GGA",
+      "codon_GGC",
+      "codon_GGG",
+      "codon_GGT",
+      "codon_GTA",
+      "codon_GTC",
+      "codon_GTG",
+      "codon_GTT",
+      "codon_TAA",
+      "codon_TAC",
+      "codon_TAG",
+      "codon_TAT",
+      "codon_TCA",
+      "codon_TCC",
+      "codon_TCG",
+      "codon_TCT",
+      "codon_TGA",
+      "codon_TGC",
+      "codon_TGG",
+      "codon_TGT",
+      "codon_TTA",
+      "codon_TTC",
+      "codon_TTG",
+      "codon_TTT",
+      "iso_cat1_climate",
+      "iso_cat1_condition",
+      "iso_cat1_engineered",
+      "iso_cat1_environmental",
+      "iso_cat1_host",
+      "iso_cat1_host_body_product",
+      "iso_cat1_host_body_site",
+      "iso_cat1_infection",
+      "iso_cat2_acidic",
+      "iso_cat2_agriculture",
+      "iso_cat2_air",
+      "iso_cat2_algae",
+      "iso_cat2_alkaline",
+      "iso_cat2_anoxic_anaerobic",
+      "iso_cat2_aquatic",
+      "iso_cat2_arthropoda",
+      "iso_cat2_biodegradation",
+      "iso_cat2_biofilm",
+      "iso_cat2_bioreactor",
+      "iso_cat2_bioremediation",
+      "iso_cat2_birds",
+      "iso_cat2_built_environment",
+      "iso_cat2_cold",
+      "iso_cat2_contamination",
+      "iso_cat2_disease",
+      "iso_cat2_fishes",
+      "iso_cat2_fluids",
+      "iso_cat2_food_production",
+      "iso_cat2_fungi",
+      "iso_cat2_gastrointestinal_tract",
+      "iso_cat2_hot",
+      "iso_cat2_human",
+      "iso_cat2_humid",
+      "iso_cat2_industrial",
+      "iso_cat2_inflammation",
+      "iso_cat2_invertebrates_other",
+      "iso_cat2_juvenile",
+      "iso_cat2_laboratory",
+      "iso_cat2_limb",
+      "iso_cat2_mammals",
+      "iso_cat2_medical_device",
+      "iso_cat2_medical_environment",
+      "iso_cat2_medical_product",
+      "iso_cat2_microbial",
+      "iso_cat2_microbial_community",
+      "iso_cat2_oral_cavity_and_airways",
+      "iso_cat2_organ",
+      "iso_cat2_other",
+      "iso_cat2_patient",
+      "iso_cat2_plant",
+      "iso_cat2_plant_infections",
+      "iso_cat2_plants",
+      "iso_cat2_protozoa",
+      "iso_cat2_psychrophilic_lt10_c",
+      "iso_cat2_reptilia",
+      "iso_cat2_saline",
+      "iso_cat2_sulfuric",
+      "iso_cat2_temperate",
+      "iso_cat2_terrestrial",
+      "iso_cat2_thermophilic_gt45_c",
+      "iso_cat2_treatment",
+      "iso_cat2_urogenital_tract",
+      "iso_cat2_waste",
+      "iso_cat2_xerophilic",
+      "iso_cat2_yeast",
+      "md_n_media",
+      "md_ph_median",
+      "md_ph_range",
+      "md_nacl_pct_median",
+      "md_nacl_pct_max",
+      "hmm_Peripla_BP_2_n",
+      "hmm_Peripla_BP_2_score",
+      "hmm_Peripla_BP_2_present",
+      "hmm_Cyt_CBB3_microaero_n",
+      "hmm_Cyt_CBB3_microaero_score",
+      "hmm_Cyt_CBB3_microaero_present",
+      "hmm_Rieske_2Fe2S_n",
+      "hmm_Rieske_2Fe2S_score",
+      "hmm_Rieske_2Fe2S_present",
+      "hmm_FeFe_hyd_anaerobic_n",
+      "hmm_FeFe_hyd_anaerobic_score",
+      "hmm_FeFe_hyd_anaerobic_present",
+      "hmm_KdpD_osmosensor_n",
+      "hmm_KdpD_osmosensor_score",
+      "hmm_KdpD_osmosensor_present",
+      "hmm_DHBP_riboflavin_n",
+      "hmm_DHBP_riboflavin_score",
+      "hmm_DHBP_riboflavin_present",
+      "hmm_BCCT_compatible_n",
+      "hmm_BCCT_compatible_score",
+      "hmm_BCCT_compatible_present",
+      "hmm_NIR_SIR_ferredoxin_n",
+      "hmm_NIR_SIR_ferredoxin_score",
+      "hmm_NIR_SIR_ferredoxin_present",
+      "hmm_CBM_cellulose_n",
+      "hmm_CBM_cellulose_score",
+      "hmm_CBM_cellulose_present",
+      "hmm_Pyridoxal_decarbox_n",
+      "hmm_Pyridoxal_decarbox_score",
+      "hmm_Pyridoxal_decarbox_present",
+      "hmm_Hsp70_DnaK_n",
+      "hmm_Hsp70_DnaK_score",
+      "hmm_Hsp70_DnaK_present",
+      "hmm_Cellulase_GH5_n",
+      "hmm_Cellulase_GH5_score",
+      "hmm_Cellulase_GH5_present",
+      "hmm_COX1_aerobic_n",
+      "hmm_COX1_aerobic_score",
+      "hmm_COX1_aerobic_present",
+      "hmm_COX2_TM_aerobic_n",
+      "hmm_COX2_TM_aerobic_score",
+      "hmm_COX2_TM_aerobic_present",
+      "hmm_ATP_synth_alphabeta_n",
+      "hmm_ATP_synth_alphabeta_score",
+      "hmm_ATP_synth_alphabeta_present",
+      "hmm_SOD_FeMn_n",
+      "hmm_SOD_FeMn_score",
+      "hmm_SOD_FeMn_present",
+      "hmm_Fer4_FeS_4Fe4S_n",
+      "hmm_Fer4_FeS_4Fe4S_score",
+      "hmm_Fer4_FeS_4Fe4S_present",
+      "hmm_EctC_ectoine_synth_n",
+      "hmm_EctC_ectoine_synth_score",
+      "hmm_EctC_ectoine_synth_present",
+      "hmm_Molybdopterin_OR_n",
+      "hmm_Molybdopterin_OR_score",
+      "hmm_Molybdopterin_OR_present",
+      "hmm_Hsp20_n",
+      "hmm_Hsp20_score",
+      "hmm_Hsp20_present",
+      "hmm_RuBisCO_small_form1_n",
+      "hmm_RuBisCO_small_form1_score",
+      "hmm_RuBisCO_small_form1_present",
+      "hmm_RuBisCO_large_form1_n",
+      "hmm_RuBisCO_large_form1_score",
+      "hmm_RuBisCO_large_form1_present",
+      "hmm_ATP_synth_alphabeta_C_n",
+      "hmm_ATP_synth_alphabeta_C_score",
+      "hmm_ATP_synth_alphabeta_C_present",
+      "hmm_PdxJ_pyridoxine_n",
+      "hmm_PdxJ_pyridoxine_score",
+      "hmm_PdxJ_pyridoxine_present",
+      "hmm_Alpha_amylase_n",
+      "hmm_Alpha_amylase_score",
+      "hmm_Alpha_amylase_present",
+      "hmm_NifH_nitrogenase_n",
+      "hmm_NifH_nitrogenase_score",
+      "hmm_NifH_nitrogenase_present",
+      "hmm_ATP_synth_F0_B_n",
+      "hmm_ATP_synth_F0_B_score",
+      "hmm_ATP_synth_F0_B_present",
+      "hmm_CSD_cold_shock_n",
+      "hmm_CSD_cold_shock_score",
+      "hmm_CSD_cold_shock_present",
+      "hmm_FolB_folate_n",
+      "hmm_FolB_folate_score",
+      "hmm_FolB_folate_present",
+      "hmm_Bact_rhodopsin_n",
+      "hmm_Bact_rhodopsin_score",
+      "hmm_Bact_rhodopsin_present",
+      "hmm_COX2_periplasm_aero_n",
+      "hmm_COX2_periplasm_aero_score",
+      "hmm_COX2_periplasm_aero_present",
+      "hmm_BPD_transp_1_n",
+      "hmm_BPD_transp_1_score",
+      "hmm_BPD_transp_1_present",
+      "hmm_TP_methylase_B12_n",
+      "hmm_TP_methylase_B12_score",
+      "hmm_TP_methylase_B12_present",
+      "hmm_V_ATPase_subH_N_n",
+      "hmm_V_ATPase_subH_N_score",
+      "hmm_V_ATPase_subH_N_present",
+      "hmm_UvrD_helicase_C_n",
+      "hmm_UvrD_helicase_C_score",
+      "hmm_UvrD_helicase_C_present",
+      "hmm_Cpn60_GroEL_n",
+      "hmm_Cpn60_GroEL_score",
+      "hmm_Cpn60_GroEL_present",
+      "hmm_MotA_TolQ_ExbB_n",
+      "hmm_MotA_TolQ_ExbB_score",
+      "hmm_MotA_TolQ_ExbB_present",
+      "hmm_TrkH_K_channel_n",
+      "hmm_TrkH_K_channel_score",
+      "hmm_TrkH_K_channel_present",
+      "hmm_NhaA_Na_H_exch_n",
+      "hmm_NhaA_Na_H_exch_score",
+      "hmm_NhaA_Na_H_exch_present",
+      "hmm_Hsp90_n",
+      "hmm_Hsp90_score",
+      "hmm_Hsp90_present",
+      "hmm_NhaB_Na_H_exch_n",
+      "hmm_NhaB_Na_H_exch_score",
+      "hmm_NhaB_Na_H_exch_present",
+      "hmm_NifDK_nitrogenase_n",
+      "hmm_NifDK_nitrogenase_score",
+      "hmm_NifDK_nitrogenase_present",
+      "hmm_Catalase_n",
+      "hmm_Catalase_score",
+      "hmm_Catalase_present",
+      "hmm_SOD_CuZn_n",
+      "hmm_SOD_CuZn_score",
+      "hmm_SOD_CuZn_present",
+      "hmm_FAD_binding_FrdA_n",
+      "hmm_FAD_binding_FrdA_score",
+      "hmm_FAD_binding_FrdA_present",
+      "hmm_THF_DHG_CYH_folate_n",
+      "hmm_THF_DHG_CYH_folate_score",
+      "hmm_THF_DHG_CYH_folate_present",
+      "hmm_TGS_thermosome_n",
+      "hmm_TGS_thermosome_score",
+      "hmm_TGS_thermosome_present",
+      "hmm_NiFe_hyd_anaerobic_n",
+      "hmm_NiFe_hyd_anaerobic_score",
+      "hmm_NiFe_hyd_anaerobic_present",
+      "kegg_M00001",
+      "kegg_M00002",
+      "kegg_M00003",
+      "kegg_M00004",
+      "kegg_M00005",
+      "kegg_M00006",
+      "kegg_M00007",
+      "kegg_M00008",
+      "kegg_M00009",
+      "kegg_M00010",
+      "kegg_M00011",
+      "kegg_M00012",
+      "kegg_M00013",
+      "kegg_M00014",
+      "kegg_M00015",
+      "kegg_M00016",
+      "kegg_M00017",
+      "kegg_M00018",
+      "kegg_M00019",
+      "kegg_M00020",
+      "kegg_M00021",
+      "kegg_M00022",
+      "kegg_M00023",
+      "kegg_M00024",
+      "kegg_M00025",
+      "kegg_M00026",
+      "kegg_M00027",
+      "kegg_M00028",
+      "kegg_M00029",
+      "kegg_M00030",
+      "kegg_M00031",
+      "kegg_M00032",
+      "kegg_M00033",
+      "kegg_M00034",
+      "kegg_M00035",
+      "kegg_M00036",
+      "kegg_M00037",
+      "kegg_M00038",
+      "kegg_M00039",
+      "kegg_M00040",
+      "kegg_M00042",
+      "kegg_M00043",
+      "kegg_M00044",
+      "kegg_M00045",
+      "kegg_M00046",
+      "kegg_M00047",
+      "kegg_M00048",
+      "kegg_M00049",
+      "kegg_M00050",
+      "kegg_M00051",
+      "kegg_M00052",
+      "kegg_M00053",
+      "kegg_M00055",
+      "kegg_M00056",
+      "kegg_M00057",
+      "kegg_M00058",
+      "kegg_M00059",
+      "kegg_M00060",
+      "kegg_M00061",
+      "kegg_M00063",
+      "kegg_M00064",
+      "kegg_M00065",
+      "kegg_M00066",
+      "kegg_M00067",
+      "kegg_M00068",
+      "kegg_M00069",
+      "kegg_M00070",
+      "kegg_M00071",
+      "kegg_M00072",
+      "kegg_M00073",
+      "kegg_M00074",
+      "kegg_M00075",
+      "kegg_M00076",
+      "kegg_M00077",
+      "kegg_M00078",
+      "kegg_M00079",
+      "kegg_M00081",
+      "kegg_M00082",
+      "kegg_M00083",
+      "kegg_M00085",
+      "kegg_M00086",
+      "kegg_M00087",
+      "kegg_M00088",
+      "kegg_M00089",
+      "kegg_M00090",
+      "kegg_M00091",
+      "kegg_M00092",
+      "kegg_M00093",
+      "kegg_M00094",
+      "kegg_M00095",
+      "kegg_M00096",
+      "kegg_M00097",
+      "kegg_M00098",
+      "kegg_M00099",
+      "kegg_M00100",
+      "kegg_M00101",
+      "kegg_M00102",
+      "kegg_M00103",
+      "kegg_M00104",
+      "kegg_M00106",
+      "kegg_M00107",
+      "kegg_M00108",
+      "kegg_M00109",
+      "kegg_M00110",
+      "kegg_M00112",
+      "kegg_M00113",
+      "kegg_M00114",
+      "kegg_M00115",
+      "kegg_M00116",
+      "kegg_M00117",
+      "kegg_M00118",
+      "kegg_M00119",
+      "kegg_M00120",
+      "kegg_M00121",
+      "kegg_M00122",
+      "kegg_M00123",
+      "kegg_M00124",
+      "kegg_M00125",
+      "kegg_M00126",
+      "kegg_M00127",
+      "kegg_M00128",
+      "kegg_M00129",
+      "kegg_M00130",
+      "kegg_M00131",
+      "kegg_M00132",
+      "kegg_M00133",
+      "kegg_M00134",
+      "kegg_M00135",
+      "kegg_M00136",
+      "kegg_M00137",
+      "kegg_M00138",
+      "kegg_M00140",
+      "kegg_M00141",
+      "kegg_M00142",
+      "kegg_M00143",
+      "kegg_M00144",
+      "kegg_M00145",
+      "kegg_M00146",
+      "kegg_M00147",
+      "kegg_M00148",
+      "kegg_M00149",
+      "kegg_M00150",
+      "kegg_M00151",
+      "kegg_M00152",
+      "kegg_M00153",
+      "kegg_M00154",
+      "kegg_M00155",
+      "kegg_M00156",
+      "kegg_M00157",
+      "kegg_M00158",
+      "kegg_M00159",
+      "kegg_M00160",
+      "kegg_M00161",
+      "kegg_M00162",
+      "kegg_M00163",
+      "kegg_M00165",
+      "kegg_M00168",
+      "kegg_M00169",
+      "kegg_M00170",
+      "kegg_M00171",
+      "kegg_M00172",
+      "kegg_M00173",
+      "kegg_M00174",
+      "kegg_M00175",
+      "kegg_M00176",
+      "kegg_M00307",
+      "kegg_M00308",
+      "kegg_M00309",
+      "kegg_M00338",
+      "kegg_M00344",
+      "kegg_M00345",
+      "kegg_M00346",
+      "kegg_M00356",
+      "kegg_M00357",
+      "kegg_M00358",
+      "kegg_M00363",
+      "kegg_M00364",
+      "kegg_M00365",
+      "kegg_M00366",
+      "kegg_M00367",
+      "kegg_M00368",
+      "kegg_M00369",
+      "kegg_M00370",
+      "kegg_M00371",
+      "kegg_M00372",
+      "kegg_M00373",
+      "kegg_M00374",
+      "kegg_M00375",
+      "kegg_M00376",
+      "kegg_M00377",
+      "kegg_M00378",
+      "kegg_M00415",
+      "kegg_M00416",
+      "kegg_M00417",
+      "kegg_M00418",
+      "kegg_M00419",
+      "kegg_M00422",
+      "kegg_M00432",
+      "kegg_M00433",
+      "kegg_M00525",
+      "kegg_M00526",
+      "kegg_M00527",
+      "kegg_M00528",
+      "kegg_M00529",
+      "kegg_M00530",
+      "kegg_M00531",
+      "kegg_M00532",
+      "kegg_M00533",
+      "kegg_M00534",
+      "kegg_M00535",
+      "kegg_M00537",
+      "kegg_M00538",
+      "kegg_M00539",
+      "kegg_M00540",
+      "kegg_M00541",
+      "kegg_M00542",
+      "kegg_M00543",
+      "kegg_M00544",
+      "kegg_M00545",
+      "kegg_M00546",
+      "kegg_M00547",
+      "kegg_M00548",
+      "kegg_M00549",
+      "kegg_M00550",
+      "kegg_M00551",
+      "kegg_M00552",
+      "kegg_M00554",
+      "kegg_M00555",
+      "kegg_M00563",
+      "kegg_M00564",
+      "kegg_M00565",
+      "kegg_M00567",
+      "kegg_M00568",
+      "kegg_M00569",
+      "kegg_M00570",
+      "kegg_M00572",
+      "kegg_M00573",
+      "kegg_M00574",
+      "kegg_M00575",
+      "kegg_M00576",
+      "kegg_M00577",
+      "kegg_M00579",
+      "kegg_M00580",
+      "kegg_M00595",
+      "kegg_M00596",
+      "kegg_M00597",
+      "kegg_M00598",
+      "kegg_M00608",
+      "kegg_M00609",
+      "kegg_M00611",
+      "kegg_M00612",
+      "kegg_M00613",
+      "kegg_M00614",
+      "kegg_M00615",
+      "kegg_M00616",
+      "kegg_M00617",
+      "kegg_M00618",
+      "kegg_M00620",
+      "kegg_M00621",
+      "kegg_M00622",
+      "kegg_M00623",
+      "kegg_M00624",
+      "kegg_M00625",
+      "kegg_M00627",
+      "kegg_M00630",
+      "kegg_M00631",
+      "kegg_M00632",
+      "kegg_M00633",
+      "kegg_M00636",
+      "kegg_M00637",
+      "kegg_M00638",
+      "kegg_M00639",
+      "kegg_M00641",
+      "kegg_M00642",
+      "kegg_M00643",
+      "kegg_M00649",
+      "kegg_M00651",
+      "kegg_M00652",
+      "kegg_M00660",
+      "kegg_M00661",
+      "kegg_M00664",
+      "kegg_M00672",
+      "kegg_M00673",
+      "kegg_M00674",
+      "kegg_M00675",
+      "kegg_M00696",
+      "kegg_M00697",
+      "kegg_M00698",
+      "kegg_M00700",
+      "kegg_M00702",
+      "kegg_M00704",
+      "kegg_M00705",
+      "kegg_M00714",
+      "kegg_M00718",
+      "kegg_M00725",
+      "kegg_M00726",
+      "kegg_M00730",
+      "kegg_M00736",
+      "kegg_M00740",
+      "kegg_M00741",
+      "kegg_M00744",
+      "kegg_M00745",
+      "kegg_M00746",
+      "kegg_M00761",
+      "kegg_M00763",
+      "kegg_M00769",
+      "kegg_M00773",
+      "kegg_M00774",
+      "kegg_M00775",
+      "kegg_M00776",
+      "kegg_M00777",
+      "kegg_M00778",
+      "kegg_M00779",
+      "kegg_M00780",
+      "kegg_M00781",
+      "kegg_M00782",
+      "kegg_M00783",
+      "kegg_M00784",
+      "kegg_M00785",
+      "kegg_M00786",
+      "kegg_M00787",
+      "kegg_M00788",
+      "kegg_M00789",
+      "kegg_M00790",
+      "kegg_M00793",
+      "kegg_M00794",
+      "kegg_M00795",
+      "kegg_M00796",
+      "kegg_M00797",
+      "kegg_M00798",
+      "kegg_M00799",
+      "kegg_M00800",
+      "kegg_M00801",
+      "kegg_M00802",
+      "kegg_M00803",
+      "kegg_M00804",
+      "kegg_M00805",
+      "kegg_M00808",
+      "kegg_M00810",
+      "kegg_M00811",
+      "kegg_M00814",
+      "kegg_M00815",
+      "kegg_M00819",
+      "kegg_M00823",
+      "kegg_M00824",
+      "kegg_M00825",
+      "kegg_M00826",
+      "kegg_M00827",
+      "kegg_M00828",
+      "kegg_M00829",
+      "kegg_M00830",
+      "kegg_M00831",
+      "kegg_M00832",
+      "kegg_M00833",
+      "kegg_M00834",
+      "kegg_M00835",
+      "kegg_M00836",
+      "kegg_M00837",
+      "kegg_M00838",
+      "kegg_M00840",
+      "kegg_M00841",
+      "kegg_M00842",
+      "kegg_M00843",
+      "kegg_M00844",
+      "kegg_M00845",
+      "kegg_M00846",
+      "kegg_M00847",
+      "kegg_M00848",
+      "kegg_M00849",
+      "kegg_M00850",
+      "kegg_M00851",
+      "kegg_M00852",
+      "kegg_M00853",
+      "kegg_M00854",
+      "kegg_M00855",
+      "kegg_M00856",
+      "kegg_M00857",
+      "kegg_M00859",
+      "kegg_M00860",
+      "kegg_M00861",
+      "kegg_M00862",
+      "kegg_M00866",
+      "kegg_M00867",
+      "kegg_M00868",
+      "kegg_M00872",
+      "kegg_M00873",
+      "kegg_M00874",
+      "kegg_M00875",
+      "kegg_M00876",
+      "kegg_M00877",
+      "kegg_M00878",
+      "kegg_M00879",
+      "kegg_M00880",
+      "kegg_M00881",
+      "kegg_M00882",
+      "kegg_M00883",
+      "kegg_M00884",
+      "kegg_M00885",
+      "kegg_M00886",
+      "kegg_M00887",
+      "kegg_M00888",
+      "kegg_M00889",
+      "kegg_M00890",
+      "kegg_M00891",
+      "kegg_M00892",
+      "kegg_M00893",
+      "kegg_M00894",
+      "kegg_M00895",
+      "kegg_M00896",
+      "kegg_M00897",
+      "kegg_M00898",
+      "kegg_M00899",
+      "kegg_M00900",
+      "kegg_M00901",
+      "kegg_M00902",
+      "kegg_M00903",
+      "kegg_M00904",
+      "kegg_M00905",
+      "kegg_M00906",
+      "kegg_M00909",
+      "kegg_M00910",
+      "kegg_M00911",
+      "kegg_M00912",
+      "kegg_M00913",
+      "kegg_M00914",
+      "kegg_M00915",
+      "kegg_M00916",
+      "kegg_M00917",
+      "kegg_M00918",
+      "kegg_M00919",
+      "kegg_M00921",
+      "kegg_M00922",
+      "kegg_M00923",
+      "kegg_M00924",
+      "kegg_M00925",
+      "kegg_M00926",
+      "kegg_M00927",
+      "kegg_M00928",
+      "kegg_M00929",
+      "kegg_M00930",
+      "kegg_M00931",
+      "kegg_M00932",
+      "kegg_M00933",
+      "kegg_M00934",
+      "kegg_M00935",
+      "kegg_M00936",
+      "kegg_M00937",
+      "kegg_M00938",
+      "kegg_M00939",
+      "kegg_M00940",
+      "kegg_M00941",
+      "kegg_M00942",
+      "kegg_M00943",
+      "kegg_M00944",
+      "kegg_M00945",
+      "kegg_M00946",
+      "kegg_M00947",
+      "kegg_M00948",
+      "kegg_M00949",
+      "kegg_M00950",
+      "kegg_M00951",
+      "kegg_M00952",
+      "kegg_M00953",
+      "kegg_M00954",
+      "kegg_M00955",
+      "kegg_M00956",
+      "kegg_M00957",
+      "kegg_M00958",
+      "kegg_M00959",
+      "kegg_M00960",
+      "kegg_M00961",
+      "kegg_M00962",
+      "kegg_M00963",
+      "kegg_M00964",
+      "kegg_M00965",
+      "kegg_M00966",
+      "kegg_M00967",
+      "kegg_M00968",
+      "kegg_M00969",
+      "kegg_M00970",
+      "kegg_M00971",
+      "kegg_M00972",
+      "kegg_M00973",
+      "kegg_M00974",
+      "kegg_M00975",
+      "kegg_M00976",
+      "kegg_M00977",
+      "kegg_M00978",
+      "kegg_M00979",
+      "kegg_M00980",
+      "kegg_M00981",
+      "kegg_M00982",
+      "kegg_M00983",
+      "kegg_M00984",
+      "kegg_M00985",
+      "kegg_M00986",
+      "kegg_M00987",
+      "kegg_M00988",
+      "kegg_M00989",
+      "kegg_M00990",
+      "kegg_M00991",
+      "kegg_M00992",
+      "kegg_M00993",
+      "kegg_M00994",
+      "kegg_M00995",
+      "kegg_M00996",
+      "kegg_M00997",
+      "kegg_M00998",
+      "kegg_M00999",
+      "kegg_M01000",
+      "kegg_M01001",
+      "kegg_M01002",
+      "kegg_M01003",
+      "kegg_M01004",
+      "kegg_M01005",
+      "kegg_M01006",
+      "kegg_M01007",
+      "kegg_M01008",
+      "kegg_M01009",
+      "kegg_M01010",
+      "kegg_M01011",
+      "kegg_M01012",
+      "kegg_M01013",
+      "kegg_M01014",
+      "kegg_M01015",
+      "kegg_M01016",
+      "kegg_M01017",
+      "kegg_M01018",
+      "kegg_M01019",
+      "kegg_M01020",
+      "kegg_M01021",
+      "kegg_M01022",
+      "kegg_M01023",
+      "kegg_M01024",
+      "kegg_M01025",
+      "kegg_M01026",
+      "kegg_M01027",
+      "kegg_M01028",
+      "kegg_M01029",
+      "kegg_M01030",
+      "kegg_M01031",
+      "kegg_M01032",
+      "kegg_M01033",
+      "kegg_M01034",
+      "kegg_M01035",
+      "kegg_M01036",
+      "kegg_M01037",
+      "kegg_M01038",
+      "kegg_M01039",
+      "kegg_M01040",
+      "kegg_M01041",
+      "kegg_M01042",
+      "kegg_M01043",
+      "kegg_M01044",
+      "kegg_M01045",
+      "kegg_M01046",
+      "kegg_M01047",
+      "kegg_M01048",
+      "kegg_M01049",
+      "kegg_M01050",
+      "kegg_M01051",
+      "kegg_M01052",
+      "kegg_M01053",
+      "kegg_M01054",
+      "kegg_M01055",
+      "kegg_M01056",
+      "kegg_M01057",
+      "kegg_M01058",
+      "kegg_M01059",
+      "kegg_M01060",
+      "iso_lat",
+      "iso_lon",
+      "iso_collection_year",
+      "iso_continent_europe",
+      "iso_continent_north_america",
+      "iso_continent_asia",
+      "iso_continent_middle_and_south_america",
+      "iso_continent_australia_and_oceania",
+      "iso_continent_africa",
+      "iso_continent_antarctica",
+      "iso_continent_arctic",
+      "iso_country_sweden",
+      "iso_country_germany",
+      "iso_country_usa",
+      "iso_country_france",
+      "iso_country_china",
+      "iso_country_japan",
+      "iso_country_republic_of_korea",
+      "iso_country_india",
+      "iso_country_united_kingdom",
+      "iso_country_spain",
+      "iso_country_italy",
+      "iso_country_netherlands",
+      "iso_country_egypt",
+      "iso_country_russia",
+      "iso_country_greece",
+      "iso_country_denmark",
+      "iso_country_switzerland",
+      "iso_country_norway",
+      "iso_country_belgium",
+      "iso_country_australia",
+      "iso_country_brazil",
+      "iso_country_canada",
+      "iso_country_thailand",
+      "iso_country_taiwan_province_of_china",
+      "iso_country_south_africa",
+      "iso_country_malaysia",
+      "iso_country_austria",
+      "iso_country_portugal",
+      "iso_country_mexico",
+      "iso_country_czech_republic",
+      "iso_host_kingdom_human",
+      "iso_host_kingdom_animal",
+      "iso_host_kingdom_plant",
+      "iso_host_kingdom_fungal",
+      "iso_host_kingdom_other"
+    ]
+  }
+}

artifacts/carveme_smoke_status.json ADDED

	@@ -0,0 +1,13 @@

+{
+  "status": "reconstruction_smoke_ok",
+  "input_protein_fasta": "data/external_benchmark_fastas/GCA_000820045.faa.gz",
+  "output_model": "artifacts/carveme_smoke/GCA_000820045.xml",
+  "output_model_size_mb": 9.4,
+  "installed_local_dependency": "diamond 2.2.0 via Homebrew",
+  "command": "uv run --python 3.11 --isolated --with carveme carve data/external_benchmark_fastas/GCA_000820045.faa.gz -o artifacts/carveme_smoke/GCA_000820045.xml --solver scip",
+  "medium_feasibility_scored": false,
+  "reason_medium_not_scored": "CarveMe produced an SBML reconstruction, but comparing it to MediaDive media requires a recipe-to-metabolite/media mapping and a feasible-growth scoring command for each candidate medium.",
+  "gapseq_status": "not_installable_on_this_local_conda_platform",
+  "gapseq_check": "conda search -c conda-forge -c bioconda gapseq --info",
+  "gapseq_result": "PackagesNotFoundError for local osx channels; full gapseq also requires Bacteria/Archaea reference sequence database downloads."
+}

artifacts/carveme_smoke_status.md ADDED

	@@ -0,0 +1,41 @@

+# CarveMe and gapseq External Baseline Status
+## CarveMe
+CarveMe was installed in an isolated `uv` environment and run on one held-out
+protein FASTA generated for the GenomeSPOT smoke benchmark.
+- Input: `data/external_benchmark_fastas/GCA_000820045.faa.gz`
+- Output: `artifacts/carveme_smoke/GCA_000820045.xml`
+- Output size: 9.4 MB
+- Extra local dependency installed: `diamond 2.2.0` via Homebrew
+- Command:
+```bash
+uv run --python 3.11 --isolated --with carveme carve \
+  data/external_benchmark_fastas/GCA_000820045.faa.gz \
+  -o artifacts/carveme_smoke/GCA_000820045.xml \
+  --solver scip
+```
+Result: reconstruction smoke test passed.
+Medium feasibility is not scored yet. The missing piece is not model
+reconstruction; it is a fair mapping from MediaDive recipe labels to compounds or
+media definitions that CarveMe can gap-fill/test. Without that mapping, a
+CarveMe-vs-MediaDive hit@k number would be mostly bookkeeping noise.
+## gapseq
+The local conda platform check did not find `gapseq`:
+```bash
+conda search -c conda-forge -c bioconda gapseq --info
+```
+Result: `PackagesNotFoundError` on the local macOS conda channels. gapseq is also
+not just a Python command: the official setup requires downloading Bacteria and
+Archaea reference sequence databases before real inference.
+Given the current disk state, the full gapseq run should be moved to a Linux
+machine or cloud instance with substantially more free disk.
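
The hit@k number mentioned in the CarveMe note could be computed roughly as below once a recipe-to-medium mapping exists. This is only a sketch: the function names, the strain and medium identifiers, and the shape of the inputs (a ranked list of predicted media per strain, a set of known MediaDive media per strain) are assumptions for illustration, not part of the repository.

```python
def hit_at_k(ranked_media, known_media, k):
    """Return 1 if any of the top-k ranked media is a known medium, else 0."""
    return int(any(medium in known_media for medium in ranked_media[:k]))


def mean_hit_at_k(predictions, truths, k):
    """Average hit@k over strains that have both a prediction and a label."""
    scored = [
        hit_at_k(predictions[strain], truths[strain], k)
        for strain in predictions
        if strain in truths
    ]
    return sum(scored) / len(scored) if scored else float("nan")


# Hypothetical strain and medium identifiers, for illustration only.
predictions = {"strain_a": ["m1", "m2", "m3"], "strain_b": ["m9", "m4", "m5"]}
truths = {"strain_a": {"m2"}, "strain_b": {"m7"}}
score_at_2 = mean_hit_at_k(predictions, truths, k=2)
```

Strains without a known medium are skipped rather than counted as misses, which keeps the metric comparable across tools that cover different subsets of the manifest.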

artifacts/external_benchmark_manifest.parquet ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ddf64020f70471ec362e5154ff37b073ce9ec5c4c6262dc9ebe409dcef183b90
+size 657757

artifacts/external_benchmark_manifest_5k.csv ADDED

The diff for this file is too large to render. See raw diff

artifacts/external_benchmark_manifest_5k.parquet ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:109c71729457fddcbd7f8fdba51d998419ed6eb8c7af7b900bc4e242bb641385
+size 205285

artifacts/external_benchmark_manifest_5k_summary.json ADDED

	@@ -0,0 +1,21 @@

+{
+  "source_manifest": "artifacts/external_benchmark_manifest.parquet",
+  "seed": 20260520,
+  "rows": 5000,
+  "unique_accessions": 5000,
+  "fold_counts": {
+    "0": 1035,
+    "1": 821,
+    "2": 1114,
+    "3": 943,
+    "4": 1087
+  },
+  "label_counts": {
+    "temperature": 5000,
+    "ph": 933,
+    "salt": 779,
+    "oxygen": 2653,
+    "medium": 3889,
+    "all_four_conditions": 416
+  }
+}
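
A quick consistency check on the summary above can be sketched as follows. The numbers are copied from the JSON; the variable names are illustrative, and a real check would presumably load the file from disk instead of an inline string.

```python
import json

# Values copied from external_benchmark_manifest_5k_summary.json above.
summary = json.loads("""
{
  "rows": 5000,
  "unique_accessions": 5000,
  "fold_counts": {"0": 1035, "1": 821, "2": 1114, "3": 943, "4": 1087},
  "label_counts": {
    "temperature": 5000, "ph": 933, "salt": 779,
    "oxygen": 2653, "medium": 3889, "all_four_conditions": 416
  }
}
""")

# Every row should belong to exactly one fold.
fold_total = sum(summary["fold_counts"].values())
assert fold_total == summary["rows"]

# Fraction of rows carrying each label, useful when comparing per-target metrics.
label_fraction = {
    name: count / summary["rows"]
    for name, count in summary["label_counts"].items()
}
```

The fractions make the sparsity visible: only 416 of the 5,000 rows carry all four condition labels, so per-target metrics are computed on very different sample sizes.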

artifacts/external_benchmark_status.json ADDED

	@@ -0,0 +1,36 @@

+{
+  "split_mode": "family",
+  "n_splits": 5,
+  "seed": 7,
+  "elapsed_s": 4.729962110519409,
+  "manifest_parquet": "artifacts/external_benchmark_manifest.parquet",
+  "manifest_csv": "artifacts/external_benchmark_manifest.csv",
+  "rows": 25728,
+  "media_labels": 40,
+  "tools": {
+    "GenomeSPOT": {
+      "command": "uv run python -m genome_spot.genome_spot",
+      "path": "data/external_tools/GenomeSPOT-main"
+    },
+    "CarveMe": {
+      "command": "uv run --with carveme carve",
+      "path": "/opt/homebrew/bin/diamond"
+    },
+    "gapseq": {
+      "command": null,
+      "path": null
+    }
+  },
+  "fasta_coverage": {
+    "fasta_dir": "data/external_benchmark_fastas",
+    "unique_accessions": 16154,
+    "present_fastas": 8,
+    "missing_fastas": 16146,
+    "coverage_pct": 0.04952333787297264
+  },
+  "download": {
+    "attempted": 0,
+    "downloaded": 0,
+    "failed": 0
+  }
+}
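
The `fasta_coverage` block above is simple arithmetic over the manifest's unique accessions. A minimal sketch of that calculation, assuming a hypothetical helper name (the repository's actual implementation is not shown in this diff):

```python
def fasta_coverage(unique_accessions, present_fastas):
    """Return missing count and percent coverage for a FASTA directory."""
    missing = unique_accessions - present_fastas
    pct = 100.0 * present_fastas / unique_accessions if unique_accessions else 0.0
    return {"missing_fastas": missing, "coverage_pct": pct}


# Counts taken from external_benchmark_status.json above.
coverage = fasta_coverage(unique_accessions=16154, present_fastas=8)
```

Note that `coverage_pct` is already a percentage (about 0.05%), not a fraction, which matches the "8 / 16,154 (0.05%)" line in the accompanying Markdown report.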

artifacts/external_benchmark_status.md ADDED

	@@ -0,0 +1,56 @@

+# External Tool Benchmark Status
+This file tracks the apples-to-apples benchmark setup for external tools
+on the same held-out BacDive/MediaDive strains used by the dry-lab media
+recommender benchmark.
+## Held-Out Manifest
+- Manifest: `artifacts/external_benchmark_manifest.parquet`
+- Rows: 25,728
+- Unique genome accessions: 16,154
+- Media labels retained: 40
+- Fold counts: {"0": 5146, "1": 5146, "2": 5146, "3": 5145, "4": 5145}
+Label coverage:
+| Target | Labeled rows |
+|---|---:|
+| Temperature | 25,727 |
+| pH | 2,984 |
+| Salt | 2,486 |
+| Oxygen | 9,283 |
+| Medium | 21,050 |
+## Local Requirements
+- FASTA directory: `data/external_benchmark_fastas`
+- FASTAs present: 8 / 16,154 (0.05%)
+- FASTA download smoke run: {"attempted": 0, "downloaded": 0, "failed": 0}
+| Tool | Local command | Status |
+|---|---|---|
+| GenomeSPOT | `uv run python -m genome_spot.genome_spot` | available |
+| CarveMe | `uv run --with carveme carve` | available |
+| gapseq | `` | missing |
+## Verdict
+External baseline execution is not ready on this machine yet: the full held-out FASTA set and one or more external tool binaries/databases are missing.
+## Next Commands
+Use the manifest to run each external tool against the same rows and folds.
+The medium-feasibility tools should be scored by whether at least one known
+MediaDive medium for the strain is predicted feasible, or is the closest match
+among the tool's predicted feasible media/metabolite environments.
+```bash
+PYTHONPATH=src uv run --python 3.11 python scripts/42_prepare_external_benchmarks.py \
+  --download-fastas 10
+```
+For the full benchmark, download the complete FASTA set into the FASTA
+directory above, install the external tools plus their databases, then run
+tool-specific inference using the `bacdive_id`, `fold`, and
+`genome_accession` columns from the manifest.

artifacts/extract_seqs_full.log ADDED

	@@ -0,0 +1,467 @@

+✓ Initialized. View run at
+https://modal.com/apps/miyu-wannaone-horiuchi/main/ap-GCiYQgOnZTEaVptGQgPIy0
+✓ Created objects.
+├── 🔨 Created mount
+│   /Users/miyuhoriuchi/microbe-model/scripts/36_extract_marker_sequences.py
+├── 🔨 Created mount data/markers/unified/unified_markers.hmm
+└── 🔨 Created function MarkerSeqExtractor.*.
+Marker-sequence extract: 45,856 genomes pending (199 cached)
+max_per_cat=16
+[setup] loaded 48 marker HMMs[setup] loaded 48 marker HMMs
+[setup] loaded 48 marker HMMs
+[setup] loaded 48 marker HMMs
+[setup] loaded 48 marker HMMs
+[setup] loaded 48 marker HMMs
+[setup] loaded 48 marker HMMs
+[setup] loaded 48 marker HMMs
+[setup] loaded 48 marker HMMs
+[setup] loaded 48 marker HMMs
+[setup] loaded 48 marker HMMs
+[setup] loaded 48 marker HMMs
+[setup] loaded 48 marker HMMs
+[setup] loaded 48 marker HMMs
+[setup] loaded 48 marker HMMs
+[setup] loaded 48 marker HMMs
+  100 ok / 0 fail
+  200 ok / 0 fail
+  300 ok / 3 fail
+  400 ok / 3 fail
+  500 ok / 3 fail
+  600 ok / 3 fail
+  700 ok / 3 fail
+  800 ok / 4 fail
+  900 ok / 4 fail
+  1,000 ok / 4 fail
+  1,100 ok / 4 fail
+  1,200 ok / 6 fail
+  1,300 ok / 6 fail
+  1,400 ok / 6 fail
+  1,500 ok / 220 fail
+  1,600 ok / 220 fail
+  1,700 ok / 220 fail
+  1,800 ok / 220 fail
+  1,900 ok / 220 fail
+  2,000 ok / 221 fail
+  2,100 ok / 221 fail
+  2,200 ok / 221 fail
+  2,300 ok / 252 fail
+  2,400 ok / 253 fail
+  2,500 ok / 253 fail
+  2,600 ok / 253 fail
+  2,700 ok / 253 fail
+  2,800 ok / 301 fail
+  2,900 ok / 308 fail
+  3,000 ok / 308 fail
+  3,100 ok / 308 fail
+  3,200 ok / 350 fail
+  3,300 ok / 360 fail
+  3,400 ok / 360 fail
+  3,500 ok / 472 fail
+  3,600 ok / 472 fail
+  3,700 ok / 472 fail
+  3,800 ok / 472 fail
+  3,900 ok / 472 fail
+  4,000 ok / 472 fail
+  4,100 ok / 472 fail
+  4,200 ok / 472 fail
+  4,300 ok / 473 fail
+  4,400 ok / 473 fail
+  4,500 ok / 473 fail
+  4,600 ok / 473 fail
+  4,700 ok / 473 fail
+  4,800 ok / 473 fail
+  4,900 ok / 473 fail
+  5,000 ok / 476 fail
+  5,100 ok / 476 fail
+  5,200 ok / 479 fail
+  5,300 ok / 479 fail
+  5,400 ok / 479 fail
+  5,500 ok / 480 fail
+  5,600 ok / 480 fail
+  5,700 ok / 480 fail
+  5,800 ok / 480 fail
+  5,900 ok / 480 fail
+  6,000 ok / 480 fail
+  6,100 ok / 480 fail
+  6,200 ok / 534 fail
+  6,300 ok / 534 fail
+  6,400 ok / 537 fail
+  6,500 ok / 537 fail
+  6,600 ok / 539 fail
+  6,700 ok / 539 fail
+  6,800 ok / 541 fail
+  6,900 ok / 554 fail
+  7,000 ok / 559 fail
+  7,100 ok / 561 fail
+  7,200 ok / 593 fail
+  7,300 ok / 615 fail
+  7,400 ok / 623 fail
+  7,500 ok / 633 fail
+  7,600 ok / 633 fail
+  7,700 ok / 633 fail
+  7,800 ok / 633 fail
+  7,900 ok / 633 fail
+  8,000 ok / 634 fail
+  8,100 ok / 636 fail
+  8,200 ok / 636 fail
+  8,300 ok / 637 fail
+  8,400 ok / 637 fail
+  8,500 ok / 638 fail
+  8,600 ok / 638 fail
+  8,700 ok / 638 fail
+  8,800 ok / 638 fail
+  8,900 ok / 638 fail
+  9,000 ok / 640 fail
+  9,100 ok / 645 fail
+  9,200 ok / 645 fail
+  9,300 ok / 645 fail
+  9,400 ok / 647 fail
+  9,500 ok / 651 fail
+  9,600 ok / 652 fail
+  9,700 ok / 653 fail
+  9,800 ok / 654 fail
+  9,900 ok / 657 fail
+  10,000 ok / 663 fail
+  10,100 ok / 665 fail
+  10,200 ok / 674 fail
+  10,300 ok / 680 fail
+  10,400 ok / 680 fail
+  10,500 ok / 688 fail
+  10,600 ok / 690 fail
+  10,700 ok / 692 fail
+  10,800 ok / 695 fail
+  10,900 ok / 701 fail
+  11,000 ok / 702 fail
+  11,100 ok / 712 fail
+  11,200 ok / 726 fail
+  11,300 ok / 737 fail
+  11,400 ok / 771 fail
+  11,500 ok / 782 fail
+  11,600 ok / 782 fail
+  11,700 ok / 783 fail
+  11,800 ok / 784 fail
+  11,900 ok / 784 fail
+  12,000 ok / 789 fail
+  12,100 ok / 789 fail
+  12,200 ok / 794 fail
+  12,300 ok / 794 fail
+  12,400 ok / 794 fail
+  12,500 ok / 794 fail
+  12,600 ok / 794 fail
+  12,700 ok / 794 fail
+  12,800 ok / 794 fail
+  12,900 ok / 794 fail
+  13,000 ok / 794 fail
+  13,100 ok / 794 fail
+  13,200 ok / 815 fail
+  13,300 ok / 835 fail
+  13,400 ok / 836 fail
+  13,500 ok / 840 fail
+  13,600 ok / 938 fail
+  13,700 ok / 1,051 fail
+  13,800 ok / 1,094 fail
+  13,900 ok / 1,099 fail
+  14,000 ok / 1,195 fail
+  14,100 ok / 1,381 fail
+  14,200 ok / 1,415 fail
+  14,300 ok / 1,460 fail
+  14,400 ok / 1,460 fail
+  14,500 ok / 1,460 fail
+  14,600 ok / 1,460 fail
+  14,700 ok / 1,460 fail
+  14,800 ok / 1,460 fail
+  14,900 ok / 1,461 fail
+  15,000 ok / 1,461 fail
+  15,100 ok / 1,484 fail
+  15,200 ok / 1,494 fail
+  15,300 ok / 1,494 fail
+  15,400 ok / 1,494 fail
+  15,500 ok / 1,494 fail
+  15,600 ok / 1,495 fail
+  15,700 ok / 1,499 fail
+  15,800 ok / 1,499 fail
+  15,900 ok / 1,509 fail
+  16,000 ok / 1,509 fail
+  16,100 ok / 1,516 fail
+  16,200 ok / 1,516 fail
+  16,300 ok / 1,516 fail
+  16,400 ok / 1,527 fail
+  16,500 ok / 1,529 fail
+  16,600 ok / 1,530 fail
+  16,700 ok / 1,530 fail
+  16,800 ok / 1,530 fail
+  16,900 ok / 1,530 fail
+  17,000 ok / 1,532 fail
+  17,100 ok / 1,533 fail
+  17,200 ok / 1,539 fail
+  17,300 ok / 1,543 fail
+  17,400 ok / 1,543 fail
+  17,500 ok / 1,543 fail
+WARNING: Logs may not be continuous
+WARNING: Logs may not be continuous
+  17,600 ok / 1,544 fail
+  17,700 ok / 1,544 fail
+  17,800 ok / 1,548 fail
+WARNING: Logs may not be continuous
+WARNING: Logs may not be continuous
+  17,900 ok / 1,548 fail
+  18,000 ok / 1,551 fail
+  18,100 ok / 1,552 fail
+WARNING: Logs may not be continuous
+WARNING: Logs may not be continuous
+  18,200 ok / 1,553 fail
+  18,300 ok / 1,555 fail
+  18,400 ok / 1,557 fail
+WARNING: Logs may not be continuous
+WARNING: Logs may not be continuous
+  18,500 ok / 1,558 fail
+  18,600 ok / 1,560 fail
+  18,700 ok / 1,560 fail
+WARNING: Logs may not be continuous
+WARNING: Logs may not be continuous
+  18,800 ok / 1,560 fail
+  18,900 ok / 1,560 fail
+  19,000 ok / 1,560 fail
+WARNING: Logs may not be continuous
+WARNING: Logs may not be continuous
+  19,100 ok / 1,574 fail
+  19,200 ok / 1,574 fail
+WARNING: Logs may not be continuous
+WARNING: Logs may not be continuous
+  19,300 ok / 1,574 fail
+  19,400 ok / 1,575 fail
+  19,500 ok / 1,575 fail
+  19,600 ok / 1,575 fail
+WARNING: Logs may not be continuous
+WARNING: Logs may not be continuous
+  19,700 ok / 1,575 fail
+  19,800 ok / 1,606 fail
+WARNING: Logs may not be continuous
+WARNING: Logs may not be continuous
+  19,900 ok / 1,610 fail
+  20,000 ok / 1,610 fail
+  20,100 ok / 1,610 fail
+WARNING: Logs may not be continuous
+WARNING: Logs may not be continuous
+  20,200 ok / 1,610 fail
+  20,300 ok / 1,610 fail
+  20,400 ok / 1,610 fail
+WARNING: Logs may not be continuous
+WARNING: Logs may not be continuous
+  20,500 ok / 1,610 fail
+  20,600 ok / 1,610 fail
+WARNING: Logs may not be continuous
+WARNING: Logs may not be continuous
+  20,700 ok / 1,610 fail
+  20,800 ok / 1,610 fail
+  20,900 ok / 1,618 fail
+WARNING: Logs may not be continuous
+WARNING: Logs may not be continuous
+[setup] loaded 48 marker HMMs
+  21,000 ok / 1,618 fail
+  21,100 ok / 1,621 fail
+  21,200 ok / 1,621 fail
+  21,300 ok / 1,621 fail
+  21,400 ok / 1,621 fail
+  21,500 ok / 1,621 fail
+  21,600 ok / 1,622 fail
+  21,700 ok / 1,623 fail
+  21,800 ok / 1,623 fail
+  21,900 ok / 1,623 fail
+  22,000 ok / 1,623 fail
+Runner interrupted due to worker preemption. Your Function will be restarted with the same input. For more details, see https://modal.com/docs/guide/preemption
+[setup] loaded 48 marker HMMs
+  22,100 ok / 1,625 fail
+  22,200 ok / 1,628 fail
+  22,300 ok / 1,629 fail
+  22,400 ok / 1,629 fail
+  22,500 ok / 1,629 fail
+  22,600 ok / 1,631 fail
+  22,700 ok / 1,631 fail
+  22,800 ok / 1,631 fail
+  22,900 ok / 1,631 fail
+  23,000 ok / 1,631 fail
+  23,100 ok / 1,640 fail
+  23,200 ok / 1,640 fail
+  23,300 ok / 1,642 fail
+  23,400 ok / 1,643 fail
+  23,500 ok / 1,643 fail
+  23,600 ok / 1,643 fail
+  23,700 ok / 1,643 fail
+  23,800 ok / 1,643 fail
+  23,900 ok / 1,643 fail
+  24,000 ok / 1,651 fail
+  24,100 ok / 1,651 fail
+  24,200 ok / 1,669 fail
+  24,300 ok / 1,669 fail
+  24,400 ok / 1,669 fail
+  24,500 ok / 1,669 fail
+  24,600 ok / 1,669 fail
+  24,700 ok / 1,670 fail
+  24,800 ok / 1,672 fail
+  24,900 ok / 1,680 fail
+  25,000 ok / 1,681 fail
+  25,100 ok / 1,681 fail
+  25,200 ok / 1,682 fail
+  25,300 ok / 1,682 fail
+  25,400 ok / 1,682 fail
+  25,500 ok / 1,682 fail
+  25,600 ok / 1,682 fail
+  25,700 ok / 1,684 fail
+  25,800 ok / 1,687 fail
+  25,900 ok / 1,688 fail
+  26,000 ok / 1,688 fail
+  26,100 ok / 1,688 fail
+  26,200 ok / 1,688 fail
+  26,300 ok / 1,688 fail
+  26,400 ok / 1,689 fail
+  26,500 ok / 1,689 fail
+  26,600 ok / 1,689 fail
+  26,700 ok / 1,689 fail
+  26,800 ok / 1,689 fail
+  26,900 ok / 1,691 fail
+  27,000 ok / 1,691 fail
+  27,100 ok / 1,691 fail
+  27,200 ok / 1,691 fail
+  27,300 ok / 1,694 fail
+  27,400 ok / 1,694 fail
+  27,500 ok / 1,694 fail
+  27,600 ok / 1,694 fail
+  27,700 ok / 1,694 fail
+  27,800 ok / 1,694 fail
+  27,900 ok / 1,694 fail
+  28,000 ok / 1,694 fail
+  28,100 ok / 1,703 fail
+  28,200 ok / 1,703 fail
+  28,300 ok / 1,708 fail
+  28,400 ok / 1,714 fail
+  28,500 ok / 1,714 fail
+  28,600 ok / 1,716 fail
+  28,700 ok / 1,740 fail
+  28,800 ok / 1,740 fail
+  28,900 ok / 1,740 fail
+  29,000 ok / 1,740 fail
+  29,100 ok / 1,740 fail
+  29,200 ok / 1,740 fail
+  29,300 ok / 1,743 fail
+  29,400 ok / 1,743 fail
+  29,500 ok / 1,743 fail
+  29,600 ok / 1,743 fail
+  29,700 ok / 1,777 fail
+  29,800 ok / 1,845 fail
+  29,900 ok / 1,845 fail
+  30,000 ok / 1,845 fail
+  30,100 ok / 1,852 fail
+[setup] loaded 48 marker HMMs
+  30,200 ok / 1,852 fail
+  30,300 ok / 1,853 fail
+  30,400 ok / 1,907 fail
+  30,500 ok / 1,992 fail
+  30,600 ok / 2,109 fail
+  30,700 ok / 2,109 fail
+  30,800 ok / 2,109 fail
+  30,900 ok / 2,109 fail
+  31,000 ok / 2,125 fail
+  31,100 ok / 2,151 fail
+Runner interrupted due to worker preemption. Your Function will be restarted with the same input. For more details, see https://modal.com/docs/guide/preemption
+[setup] loaded 48 marker HMMs
+  31,200 ok / 2,153 fail
+  31,300 ok / 2,163 fail
+  31,400 ok / 2,205 fail
+  31,500 ok / 2,205 fail
+  31,600 ok / 2,208 fail
+  31,700 ok / 2,212 fail
+  31,800 ok / 2,222 fail
+  31,900 ok / 2,225 fail
+  32,000 ok / 2,228 fail
+  32,100 ok / 2,239 fail
+  32,200 ok / 2,246 fail
+  32,300 ok / 2,247 fail
+  32,400 ok / 2,254 fail
+[setup] loaded 48 marker HMMs
+  32,500 ok / 2,255 fail
+  32,600 ok / 2,284 fail
+  32,700 ok / 2,302 fail
+  32,800 ok / 2,311 fail
+  32,900 ok / 2,322 fail
+  33,000 ok / 2,322 fail
+  33,100 ok / 2,325 fail
+  33,200 ok / 2,325 fail
+  33,300 ok / 2,326 fail
+  33,400 ok / 2,344 fail
+  33,500 ok / 2,351 fail
+  33,600 ok / 2,356 fail
+  33,700 ok / 2,356 fail
+  33,800 ok / 2,383 fail
+  33,900 ok / 2,563 fail
+  34,000 ok / 2,631 fail
+  34,100 ok / 2,655 fail
+  34,200 ok / 3,031 fail
+  34,300 ok / 3,033 fail
+  34,400 ok / 3,040 fail
+  34,500 ok / 3,040 fail
+  34,600 ok / 3,104 fail
+  34,700 ok / 3,104 fail
+  34,800 ok / 3,205 fail
+  34,900 ok / 3,208 fail
+  35,000 ok / 3,208 fail
+  35,100 ok / 3,357 fail
+  35,200 ok / 3,479 fail
+  35,300 ok / 3,760 fail
+  35,400 ok / 3,835 fail
+  35,500 ok / 3,926 fail
+  35,600 ok / 3,926 fail
+  35,700 ok / 3,926 fail
+  35,800 ok / 3,926 fail
+  35,900 ok / 3,936 fail
+  36,000 ok / 3,942 fail
+  36,100 ok / 3,960 fail
+[setup] loaded 48 marker HMMs
+  36,200 ok / 4,223 fail
+  36,300 ok / 4,223 fail
+  36,400 ok / 4,223 fail
+  36,500 ok / 4,226 fail
+  36,600 ok / 4,226 fail
+  36,700 ok / 4,268 fail
+  36,800 ok / 4,269 fail
+  36,900 ok / 4,347 fail
+  37,000 ok / 4,442 fail
+  37,100 ok / 4,442 fail
+  37,200 ok / 4,485 fail
+  37,300 ok / 4,485 fail
+  37,400 ok / 4,499 fail
+  37,500 ok / 4,538 fail
+  37,600 ok / 4,563 fail
+  37,700 ok / 4,980 fail
+  37,800 ok / 5,300 fail
+  37,900 ok / 5,300 fail
+  38,000 ok / 5,304 fail
+  38,100 ok / 5,304 fail
+  38,200 ok / 5,316 fail
+  38,300 ok / 5,316 fail
+  38,400 ok / 5,472 fail
+  38,500 ok / 5,473 fail
+  38,600 ok / 5,474 fail
+  38,700 ok / 5,540 fail
+  38,800 ok / 5,540 fail
+  38,900 ok / 5,540 fail
+  39,000 ok / 5,585 fail
+  39,100 ok / 5,585 fail
+  39,200 ok / 5,585 fail
+  39,300 ok / 5,585 fail
+  39,400 ok / 5,585 fail
+  39,500 ok / 5,585 fail
+  39,600 ok / 5,585 fail
+  39,700 ok / 5,585 fail
+  39,800 ok / 5,585 fail
+  39,900 ok / 5,586 fail
+  40,000 ok / 5,586 fail
+  40,100 ok / 5,586 fail
+  40,200 ok / 5,586 fail
+Finished. 40,270 succeeded, 5,586 failed.
+Streamed to data/marker_sequences.jsonl
+Stopping app - local entrypoint completed.
+✓ App completed. View run at
+https://modal.com/apps/miyu-wannaone-horiuchi/main/ap-GCiYQgOnZTEaVptGQgPIy0
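
The progress lines in this log can be parsed to find where failures cluster and to compute the overall failure rate. The sketch below assumes only the line formats visible above (`N ok / M fail` and the final `Finished.` line); the function names are hypothetical and the sample is a four-line excerpt, not the whole log.

```python
import re

# Matches progress lines such as "  1,500 ok / 220 fail" (optional diff "+").
PROGRESS = re.compile(r"^\+?\s*([\d,]+) ok / ([\d,]+) fail\s*$")
FINISHED = re.compile(r"Finished\. ([\d,]+) succeeded, ([\d,]+) failed\.")


def _to_int(text):
    """Convert a comma-grouped count like '40,270' to an int."""
    return int(text.replace(",", ""))


def parse_extract_log(lines):
    """Return ([(ok, fail), ...], (succeeded, failed) or None)."""
    progress, final = [], None
    for line in lines:
        match = PROGRESS.match(line)
        if match:
            progress.append((_to_int(match.group(1)), _to_int(match.group(2))))
            continue
        match = FINISHED.search(line)
        if match:
            final = (_to_int(match.group(1)), _to_int(match.group(2)))
    return progress, final


def largest_failure_jumps(progress, top=3):
    """Rank progress checkpoints by how many new failures they added."""
    jumps, previous_fail = [], 0
    for ok, fail in progress:
        jumps.append((fail - previous_fail, ok))
        previous_fail = fail
    return sorted(jumps, reverse=True)[:top]


sample = [
    "  1,400 ok / 6 fail",
    "  1,500 ok / 220 fail",
    "WARNING: Logs may not be continuous",
    "Finished. 40,270 succeeded, 5,586 failed.",
]
progress, final = parse_extract_log(sample)
failure_rate = final[1] / (final[0] + final[1])
```

On the full log, the final counts give 5,586 failures out of 45,856 pending genomes, roughly 12%, and the jump ranking points at bursts such as the one between the 1,400 and 1,500 checkpoints.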

artifacts/extract_seqs_smoke.log ADDED

	@@ -0,0 +1,79 @@

+✓ Initialized. View run at
+https://modal.com/apps/miyu-wannaone-horiuchi/main/ap-196pygWm6l2FWYPkEkdjUH
+Building image im-Zq0oFsY4NHhsLRUkQXD217
+=> Step 0: FROM base
+=> Step 1: RUN python -m pip install 'pyhmmer>=0.12' 'pyrodigal>=3.5' 'requests>=2.32'
+Looking in indexes: http://pypi-mirror.modal.local:5555/simple
+Collecting pyhmmer>=0.12
+  Downloading http://pypi-mirror.modal.local:5555/simple/pyhmmer/pyhmmer-0.12.1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (14 kB)
+Collecting pyrodigal>=3.5
+  Downloading http://pypi-mirror.modal.local:5555/simple/pyrodigal/pyrodigal-3.7.1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (15 kB)
+Collecting requests>=2.32
+  Downloading http://pypi-mirror.modal.local:5555/simple/requests/requests-2.34.2-py3-none-any.whl.metadata (4.8 kB)
+Collecting psutil<8.0,>=5.0 (from pyhmmer>=0.12)
+  Downloading http://pypi-mirror.modal.local:5555/simple/psutil/psutil-7.2.2-cp36-abi3-manylinux2010_x86_64.manylinux_2_12_x86_64.manylinux_2_28_x86_64.whl.metadata (22 kB)
+Collecting archspec~=0.2.0 (from pyrodigal>=3.5)
+  Downloading http://pypi-mirror.modal.local:5555/simple/archspec/archspec-0.2.6-py3-none-any.whl.metadata (4.4 kB)
+Collecting charset_normalizer<4,>=2 (from requests>=2.32)
+  Downloading http://pypi-mirror.modal.local:5555/simple/charset-normalizer/charset_normalizer-3.4.7-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (40 kB)
+Collecting idna<4,>=2.5 (from requests>=2.32)
+  Downloading http://pypi-mirror.modal.local:5555/simple/idna/idna-3.15-py3-none-any.whl.metadata (7.7 kB)
+Collecting urllib3<3,>=1.26 (from requests>=2.32)
+  Downloading http://pypi-mirror.modal.local:5555/simple/urllib3/urllib3-2.7.0-py3-none-any.whl.metadata (6.9 kB)
+Collecting certifi>=2023.5.7 (from requests>=2.32)
+  Downloading http://pypi-mirror.modal.local:5555/simple/certifi/certifi-2026.4.22-py3-none-any.whl.metadata (2.5 kB)
+Downloading http://pypi-mirror.modal.local:5555/simple/pyhmmer/pyhmmer-0.12.1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (4.2 MB)
+   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.2/4.2 MB 187.8 MB/s  0:00:00
+Downloading http://pypi-mirror.modal.local:5555/simple/psutil/psutil-7.2.2-cp36-abi3-manylinux2010_x86_64.manylinux_2_12_x86_64.manylinux_2_28_x86_64.whl (155 kB)
+Downloading http://pypi-mirror.modal.local:5555/simple/pyrodigal/pyrodigal-3.7.1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (3.0 MB)
+   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.0/3.0 MB 181.7 MB/s  0:00:00
+Downloading http://pypi-mirror.modal.local:5555/simple/archspec/archspec-0.2.6-py3-none-any.whl (81 kB)
+Downloading http://pypi-mirror.modal.local:5555/simple/requests/requests-2.34.2-py3-none-any.whl (73 kB)
+Downloading http://pypi-mirror.modal.local:5555/simple/charset-normalizer/charset_normalizer-3.4.7-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (214 kB)
+Downloading http://pypi-mirror.modal.local:5555/simple/idna/idna-3.15-py3-none-any.whl (72 kB)
+Downloading http://pypi-mirror.modal.local:5555/simple/urllib3/urllib3-2.7.0-py3-none-any.whl (131 kB)
+Downloading http://pypi-mirror.modal.local:5555/simple/certifi/certifi-2026.4.22-py3-none-any.whl (135 kB)
+Installing collected packages: urllib3, psutil, idna, charset_normalizer, certifi, archspec, requests, pyrodigal, pyhmmer
+Successfully installed archspec-0.2.6 certifi-2026.4.22 charset_normalizer-3.4.7 idna-3.15 psutil-7.2.2 pyhmmer-0.12.1 pyrodigal-3.7.1 requests-2.34.2 urllib3-2.7.0
+[notice] A new release of pip is available: 25.2 -> 26.1.1
+[notice] To update, run: pip install --upgrade pip
+Saving image...
+Image saved, took 801.09ms
+Built image im-Zq0oFsY4NHhsLRUkQXD217 in 4.20s
+✓ Created objects.
+├── 🔨 Created mount
+│   /Users/miyuhoriuchi/microbe-model/scripts/36_extract_marker_sequences.py
+├── 🔨 Created mount data/markers/unified/unified_markers.hmm
+└── 🔨 Created function MarkerSeqExtractor.*.
+Marker-sequence extract: 200 genomes pending (0 cached)
+max_per_cat=16
+[setup] loaded 48 marker HMMs
+[setup] loaded 48 marker HMMs
+[setup] loaded 48 marker HMMs
+[setup] loaded 48 marker HMMs
+[setup] loaded 48 marker HMMs
+[setup] loaded 48 marker HMMs
+[setup] loaded 48 marker HMMs
+[setup] loaded 48 marker HMMs
+[setup] loaded 48 marker HMMs
+[setup] loaded 48 marker HMMs
+[setup] loaded 48 marker HMMs
+[setup] loaded 48 marker HMMs
+[setup] loaded 48 marker HMMs
+[setup] loaded 48 marker HMMs
+[setup] loaded 48 marker HMMs
+[setup] loaded 48 marker HMMs
+  100 ok / 1 fail
+Finished. 199 succeeded, 1 failed.
+Streamed to data/marker_sequences.jsonl
+Stopping app - local entrypoint completed.
+✓ App completed. View run at
+https://modal.com/apps/miyu-wannaone-horiuchi/main/ap-196pygWm6l2FWYPkEkdjUH

artifacts/genomespot_5k_benchmark.json ADDED

The diff for this file is too large to render. See raw diff

artifacts/genomespot_5k_benchmark.md ADDED

	@@ -0,0 +1,28 @@

+# GenomeSPOT Held-Out Benchmark
+GenomeSPOT was run on rows selected from the same held-out manifest used
+by the microbe-model media benchmark. This run uses a deterministic 5,000
+unique-genome subset selected from the family-heldout manifest.
+## Setup
+- Manifest: `artifacts/external_benchmark_manifest_5k.parquet`
+- Limit: 5000
+- Required labels: none
+- GenomeSPOT source: `data/external_tools/GenomeSPOT-main`
+- FASTA directory: `data/external_benchmark_fastas`
+## Results
+- OK: 5000 / 5000
+- Failed/skipped: 0
+- Mean runtime per OK genome: 6.37s
+- Temperature MAE: 4.393 C
+- pH MAE: 0.608
+- Salt MAE: 1.981%
+## Notes
+GenomeSPOT oxygen is a tolerant/not-tolerant label, while microbe-model
+uses BacDive oxygen categories. This report keeps raw labels rather
+than forcing an evaluation mapping that may hide a label-definition mismatch.

artifacts/genomespot_smoke_benchmark.json ADDED

	@@ -0,0 +1,119 @@

+{
+  "manifest": "artifacts/external_benchmark_manifest.parquet",
+  "genome_spot_dir": "data/external_tools/GenomeSPOT-main",
+  "fasta_dir": "data/external_benchmark_fastas",
+  "limit": 5,
+  "fold": null,
+  "required_labels": [
+    "temperature",
+    "ph",
+    "salt",
+    "oxygen"
+  ],
+  "summary": {
+    "n_requested": 5,
+    "n_ok": 5,
+    "n_failed_or_skipped": 0,
+    "temperature_mae_c": 6.765308446813765,
+    "ph_mae": 0.839113849774097,
+    "salt_mae_pct": 2.186273235501805,
+    "mean_elapsed_s": 7.024987316131591
+  },
+  "results": [
+    {
+      "bacdive_id": 288,
+      "genome_accession": "GCA_000820045",
+      "fold": 0,
+      "status": "ok",
+      "elapsed_s": 8.004921197891235,
+      "true_temperature_c": 30.0,
+      "true_ph": 8.85,
+      "true_salt_pct": 3.0,
+      "true_oxygen": "facultative anaerobe",
+      "genomespot_temperature_c": 25.04088968365003,
+      "genomespot_temperature_error": 5.908610260400278,
+      "genomespot_ph": 7.218559960869918,
+      "genomespot_ph_error": 0.9269488860287124,
+      "genomespot_salt_pct": 1.2846838543664987,
+      "genomespot_salt_error": 1.89111946010033,
+      "genomespot_oxygen": "tolerant",
+      "genomespot_oxygen_probability": 0.9534985556874084
+    },
+    {
+      "bacdive_id": 437,
+      "genome_accession": "GCA_020687825",
+      "fold": 0,
+      "status": "ok",
+      "elapsed_s": 6.64178991317749,
+      "true_temperature_c": 40.0,
+      "true_ph": 7.5,
+      "true_salt_pct": 0.0,
+      "true_oxygen": "obligate aerobe",
+      "genomespot_temperature_c": 30.968961771189043,
+      "genomespot_temperature_error": 5.607908850605381,
+      "genomespot_ph": 7.518364363247092,
+      "genomespot_ph_error": 0.8905373705895864,
+      "genomespot_salt_pct": 3.945249495463366,
+      "genomespot_salt_error": 2.708144543741642,
+      "genomespot_oxygen": "tolerant",
+      "genomespot_oxygen_probability": 0.9872012079302094
+    },
+    {
+      "bacdive_id": 461,
+      "genome_accession": "GCA_000428985",
+      "fold": 0,
+      "status": "ok",
+      "elapsed_s": 6.950576305389404,
+      "true_temperature_c": 30.0,
+      "true_ph": 9.0,
+      "true_salt_pct": 3.25,
+      "true_oxygen": "facultative anaerobe",
+      "genomespot_temperature_c": 35.01874579459905,
+      "genomespot_temperature_error": 6.392270488048381,
+      "genomespot_ph": 7.498319424452189,
+      "genomespot_ph_error": 0.8890270592649339,
+      "genomespot_salt_pct": 6.173325202743865,
+      "genomespot_salt_error": 4.740765490619152,
+      "genomespot_oxygen": "tolerant",
+      "genomespot_oxygen_probability": 0.722387623101007
+    },
+    {
+      "bacdive_id": 462,
+      "genome_accession": "GCA_000378045",
+      "fold": 0,
+      "status": "ok",
+      "elapsed_s": 6.953850269317627,
+      "true_temperature_c": 25.0,
+      "true_ph": 7.0,
+      "true_salt_pct": 3.0,
+      "true_oxygen": "aerobe",
+      "genomespot_temperature_c": 31.044112938407967,
+      "genomespot_temperature_error": 5.6003497982319,
+      "genomespot_ph": 7.605265059073084,
+      "genomespot_ph_error": 0.8967230869811786,
+      "genomespot_salt_pct": 3.388669539545697,
+      "genomespot_salt_error": 2.677469176228756,
+      "genomespot_oxygen": "tolerant",
+      "genomespot_oxygen_probability": 0.7674566512031347
+    },
+    {
+      "bacdive_id": 466,
+      "genome_accession": "GCA_900112305",
+      "fold": 0,
+      "status": "ok",
+      "elapsed_s": 6.573798894882202,
+      "true_temperature_c": 46.0,
+      "true_ph": 7.0,
+      "true_salt_pct": 1.5,
+      "true_oxygen": "facultative anaerobe",
+      "genomespot_temperature_c": 37.22646504409912,
+      "genomespot_temperature_error": 6.99763526913742,
+      "genomespot_ph": 7.438819211872416,
+      "genomespot_ph_error": 0.8942408576036942,
+      "genomespot_salt_pct": 3.4588057941225956,
+      "genomespot_salt_error": 2.689844701946755,
+      "genomespot_oxygen": "tolerant",
+      "genomespot_oxygen_probability": 0.8284586405204845
+    }
+  ]
+}

artifacts/genomespot_smoke_benchmark.md ADDED

	@@ -0,0 +1,28 @@

+# GenomeSPOT Held-Out Smoke Benchmark
+GenomeSPOT was run on rows selected from the same held-out manifest used
+by the microbe-model media benchmark. This is a smoke benchmark unless
+`limit` equals the full manifest size.
+## Setup
+- Manifest: `artifacts/external_benchmark_manifest.parquet`
+- Limit: 5
+- Required labels: temperature, ph, salt, oxygen
+- GenomeSPOT source: `data/external_tools/GenomeSPOT-main`
+- FASTA directory: `data/external_benchmark_fastas`
+## Results
+- OK: 5 / 5
+- Failed/skipped: 0
+- Mean runtime per OK genome: 7.02s
+- Temperature MAE: 6.765 C
+- pH MAE: 0.839
+- Salt MAE: 2.186%
+## Notes
+GenomeSPOT oxygen is a tolerant/not-tolerant label, while microbe-model
+uses BacDive oxygen categories. The smoke report keeps raw labels rather
+than forcing an evaluation mapping that may hide label-definition mismatch.
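
One way to inspect the label-definition mismatch without committing to a mapping is to cross-tabulate the raw labels. The sketch below does that for the five smoke rows, using the `true_oxygen` and `genomespot_oxygen` values from the smoke JSON. The helper name `raw_label_crosstab` is hypothetical and is not part of the benchmark script.

```python
from collections import Counter

# (true BacDive label, GenomeSPOT label) pairs copied from the five smoke rows
# (BacDive IDs 288, 437, 461, 462, 466, in that order).
pairs = [
    ("facultative anaerobe", "tolerant"),
    ("obligate aerobe", "tolerant"),
    ("facultative anaerobe", "tolerant"),
    ("aerobe", "tolerant"),
    ("facultative anaerobe", "tolerant"),
]


def raw_label_crosstab(label_pairs):
    """Count each (true, tool) label pair without remapping either side."""
    return Counter(label_pairs)


crosstab = raw_label_crosstab(pairs)
for (true_label, tool_label), count in sorted(crosstab.items()):
    print(f"{true_label:22s} {tool_label:10s} {count}")
```

A table like this keeps both vocabularies visible, so any later decision to collapse BacDive categories into tolerant/not-tolerant can be made explicitly rather than inside the evaluation code.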

artifacts/lora/fold0_oxygen_diagnostics.json ADDED

	@@ -0,0 +1,321 @@

+{
+  "checkpoint": "artifacts/lora/fold0_best.pt",
+  "n": 2276,
+  "classes": [
+    "aerobe",
+    "anaerobe",
+    "facultative_anaerobe",
+    "microaerobe"
+  ],
+  "accuracy": 0.956503,
+  "macro_f1": 0.946019,
+  "macro_f1_all_classes": 0.47301,
+  "confusion_matrix": [
+    [
+      1590,
+      33,
+      0,
+      0
+    ],
+    [
+      66,
+      587,
+      0,
+      0
+    ],
+    [
+      0,
+      0,
+      0,
+      0
+    ],
+    [
+      0,
+      0,
+      0,
+      0
+    ]
+  ],
+  "per_class": {
+    "aerobe": {
+      "precision": 0.960145,
+      "recall": 0.979667,
+      "f1": 0.969808,
+      "support": 1623,
+      "predicted": 1656
+    },
+    "anaerobe": {
+      "precision": 0.946774,
+      "recall": 0.898928,
+      "f1": 0.922231,
+      "support": 653,
+      "predicted": 620
+    },
+    "facultative_anaerobe": {
+      "precision": 0.0,
+      "recall": 0.0,
+      "f1": 0.0,
+      "support": 0,
+      "predicted": 0
+    },
+    "microaerobe": {
+      "precision": 0.0,
+      "recall": 0.0,
+      "f1": 0.0,
+      "support": 0,
+      "predicted": 0
+    }
+  },
+  "wrong_predictions": [
+    {
+      "bacdive_id": 499,
+      "genome_accession": "GCA_000429505",
+      "group": "Alteromonadaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.999306,
+      "true_probability": 0.000693,
+      "margin": 0.998613
+    },
+    {
+      "bacdive_id": 481,
+      "genome_accession": "GCA_003363485",
+      "group": "Alteromonadaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.999155,
+      "true_probability": 0.000845,
+      "margin": 0.99831
+    },
+    {
+      "bacdive_id": 498,
+      "genome_accession": "GCA_000429485",
+      "group": "Alteromonadaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.997683,
+      "true_probability": 0.002315,
+      "margin": 0.995368
+    },
+    {
+      "bacdive_id": 168525,
+      "genome_accession": "GCA_006386545",
+      "group": "Sphaerotilaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.997474,
+      "true_probability": 0.002524,
+      "margin": 0.99495
+    },
+    {
+      "bacdive_id": 483,
+      "genome_accession": "GCA_000421165",
+      "group": "Alteromonadaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.99587,
+      "true_probability": 0.004127,
+      "margin": 0.991743
+    },
+    {
+      "bacdive_id": 17841,
+      "genome_accession": "GCA_000975055",
+      "group": "Demequinaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.995458,
+      "true_probability": 0.004536,
+      "margin": 0.990923
+    },
+    {
+      "bacdive_id": 17840,
+      "genome_accession": "GCA_000975035",
+      "group": "Demequinaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.987744,
+      "true_probability": 0.012242,
+      "margin": 0.975502
+    },
+    {
+      "bacdive_id": 164735,
+      "genome_accession": "GCF_943590815.1",
+      "group": "Enterobacteriaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.98457,
+      "true_probability": 0.015418,
+      "margin": 0.969152
+    },
+    {
+      "bacdive_id": 148058,
+      "genome_accession": "GCF_000005845.2",
+      "group": "Enterobacteriaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.981892,
+      "true_probability": 0.018096,
+      "margin": 0.963796
+    },
+    {
+      "bacdive_id": 149706,
+      "genome_accession": "GCF_000005845.2",
+      "group": "Enterobacteriaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.981892,
+      "true_probability": 0.018096,
+      "margin": 0.963796
+    },
+    {
+      "bacdive_id": 156346,
+      "genome_accession": "GCF_000005845.2",
+      "group": "Enterobacteriaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.981892,
+      "true_probability": 0.018096,
+      "margin": 0.963796
+    },
+    {
+      "bacdive_id": 133974,
+      "genome_accession": "GCA_001544515",
+      "group": "Burkholderiaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.979916,
+      "true_probability": 0.020068,
+      "margin": 0.959847
+    },
+    {
+      "bacdive_id": 133484,
+      "genome_accession": "GCA_900102145",
+      "group": "Thermaceae",
+      "true": "aerobe",
+      "pred": "anaerobe",
+      "confidence": 0.977195,
+      "true_probability": 0.0228,
+      "margin": 0.954396
+    },
+    {
+      "bacdive_id": 5882,
+      "genome_accession": "GCA_000215915",
+      "group": "Haloarculaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.973115,
+      "true_probability": 0.026812,
+      "margin": 0.946303
+    },
+    {
+      "bacdive_id": 133991,
+      "genome_accession": "GCA_039544205",
+      "group": "Ferrimonadaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.962063,
+      "true_probability": 0.037911,
+      "margin": 0.924153
+    },
+    {
+      "bacdive_id": 132346,
+      "genome_accession": "GCA_002259755",
+      "group": "Bifidobacteriaceae",
+      "true": "aerobe",
+      "pred": "anaerobe",
+      "confidence": 0.960132,
+      "true_probability": 0.039802,
+      "margin": 0.920329
+    },
+    {
+      "bacdive_id": 132623,
+      "genome_accession": "GCF_003967195.1",
+      "group": "Granulosicoccaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.960109,
+      "true_probability": 0.039865,
+      "margin": 0.920244
+    },
+    {
+      "bacdive_id": 24101,
+      "genome_accession": "GCA_000242915",
+      "group": "Helicobacteraceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.959066,
+      "true_probability": 0.040876,
+      "margin": 0.91819
+    },
+    {
+      "bacdive_id": 160296,
+      "genome_accession": "GCA_010667645",
+      "group": "Bifidobacteriaceae",
+      "true": "aerobe",
+      "pred": "anaerobe",
+      "confidence": 0.957143,
+      "true_probability": 0.042813,
+      "margin": 0.91433
+    },
+    {
+      "bacdive_id": 140694,
+      "genome_accession": "GCA_003336745",
+      "group": "Thermaceae",
+      "true": "aerobe",
+      "pred": "anaerobe",
+      "confidence": 0.952379,
+      "true_probability": 0.04761,
+      "margin": 0.904769
+    },
+    {
+      "bacdive_id": 168303,
+      "genome_accession": "GCA_004307015",
+      "group": "Thermaceae",
+      "true": "aerobe",
+      "pred": "anaerobe",
+      "confidence": 0.951955,
+      "true_probability": 0.048036,
+      "margin": 0.903919
+    },
+    {
+      "bacdive_id": 134099,
+      "genome_accession": "GCA_023349185",
+      "group": "Shewanellaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.948975,
+      "true_probability": 0.050994,
+      "margin": 0.89798
+    },
+    {
+      "bacdive_id": 159192,
+      "genome_accession": "GCA_003721225",
+      "group": "Acidithiobacillaceae",
+      "true": "aerobe",
+      "pred": "anaerobe",
+      "confidence": 0.948142,
+      "true_probability": 0.051834,
+      "margin": 0.896308
+    },
+    {
+      "bacdive_id": 154004,
+      "genome_accession": "GCF_982443925.1",
+      "group": "Enterobacteriaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.945632,
+      "true_probability": 0.054345,
+      "margin": 0.891287
+    },
+    {
+      "bacdive_id": 159320,
+      "genome_accession": "GCA_009078285",
+      "group": "Bifidobacteriaceae",
+      "true": "aerobe",
+      "pred": "anaerobe",
+      "confidence": 0.945592,
+      "true_probability": 0.054343,
+      "margin": 0.891249
+    }
+  ]
+}

artifacts/lora/fold0_oxygen_diagnostics.md ADDED

	@@ -0,0 +1,56 @@

+# LoRA Oxygen Diagnostics
+Checkpoint: `artifacts/lora/fold0_best.pt`
+- Labeled validation rows: `2276`
+- Accuracy: `0.9565`
+- Macro F1 (supported classes): `0.9460`
+- Macro F1 (all configured classes): `0.4730`
+## Per-Class Metrics
+| Class | Precision | Recall | F1 | Support | Predicted |
+|---|---:|---:|---:|---:|---:|
+| aerobe | 0.9601 | 0.9797 | 0.9698 | 1623 | 1656 |
+| anaerobe | 0.9468 | 0.8989 | 0.9222 | 653 | 620 |
+| facultative_anaerobe | 0.0000 | 0.0000 | 0.0000 | 0 | 0 |
+| microaerobe | 0.0000 | 0.0000 | 0.0000 | 0 | 0 |
+## Confusion Matrix
+| True \ Pred | aerobe | anaerobe | facultative_anaerobe | microaerobe |
+|---|---:|---:|---:|---:|
+| aerobe | 1590 | 33 | 0 | 0 |
+| anaerobe | 66 | 587 | 0 | 0 |
+| facultative_anaerobe | 0 | 0 | 0 | 0 |
+| microaerobe | 0 | 0 | 0 | 0 |
+## High-Confidence Wrong Predictions
+| BacDive ID | Genome | Group | True | Pred | Confidence | True Prob. | Margin |
+|---:|---|---|---|---|---:|---:|---:|
+| 499 | GCA_000429505 | Alteromonadaceae | anaerobe | aerobe | 0.9993 | 0.0007 | 0.9986 |
+| 481 | GCA_003363485 | Alteromonadaceae | anaerobe | aerobe | 0.9992 | 0.0008 | 0.9983 |
+| 498 | GCA_000429485 | Alteromonadaceae | anaerobe | aerobe | 0.9977 | 0.0023 | 0.9954 |
+| 168525 | GCA_006386545 | Sphaerotilaceae | anaerobe | aerobe | 0.9975 | 0.0025 | 0.9950 |
+| 483 | GCA_000421165 | Alteromonadaceae | anaerobe | aerobe | 0.9959 | 0.0041 | 0.9917 |
+| 17841 | GCA_000975055 | Demequinaceae | anaerobe | aerobe | 0.9955 | 0.0045 | 0.9909 |
+| 17840 | GCA_000975035 | Demequinaceae | anaerobe | aerobe | 0.9877 | 0.0122 | 0.9755 |
+| 164735 | GCF_943590815.1 | Enterobacteriaceae | anaerobe | aerobe | 0.9846 | 0.0154 | 0.9692 |
+| 148058 | GCF_000005845.2 | Enterobacteriaceae | anaerobe | aerobe | 0.9819 | 0.0181 | 0.9638 |
+| 149706 | GCF_000005845.2 | Enterobacteriaceae | anaerobe | aerobe | 0.9819 | 0.0181 | 0.9638 |
+| 156346 | GCF_000005845.2 | Enterobacteriaceae | anaerobe | aerobe | 0.9819 | 0.0181 | 0.9638 |
+| 133974 | GCA_001544515 | Burkholderiaceae | anaerobe | aerobe | 0.9799 | 0.0201 | 0.9598 |
+| 133484 | GCA_900102145 | Thermaceae | aerobe | anaerobe | 0.9772 | 0.0228 | 0.9544 |
+| 5882 | GCA_000215915 | Haloarculaceae | anaerobe | aerobe | 0.9731 | 0.0268 | 0.9463 |
+| 133991 | GCA_039544205 | Ferrimonadaceae | anaerobe | aerobe | 0.9621 | 0.0379 | 0.9242 |
+| 132346 | GCA_002259755 | Bifidobacteriaceae | aerobe | anaerobe | 0.9601 | 0.0398 | 0.9203 |
+| 132623 | GCF_003967195.1 | Granulosicoccaceae | anaerobe | aerobe | 0.9601 | 0.0399 | 0.9202 |
+| 24101 | GCA_000242915 | Helicobacteraceae | anaerobe | aerobe | 0.9591 | 0.0409 | 0.9182 |
+| 160296 | GCA_010667645 | Bifidobacteriaceae | aerobe | anaerobe | 0.9571 | 0.0428 | 0.9143 |
+| 140694 | GCA_003336745 | Thermaceae | aerobe | anaerobe | 0.9524 | 0.0476 | 0.9048 |
+| 168303 | GCA_004307015 | Thermaceae | aerobe | anaerobe | 0.9520 | 0.0480 | 0.9039 |
+| 134099 | GCA_023349185 | Shewanellaceae | anaerobe | aerobe | 0.9490 | 0.0510 | 0.8980 |
+| 159192 | GCA_003721225 | Acidithiobacillaceae | aerobe | anaerobe | 0.9481 | 0.0518 | 0.8963 |
+| 154004 | GCF_982443925.1 | Enterobacteriaceae | anaerobe | aerobe | 0.9456 | 0.0543 | 0.8913 |
+| 159320 | GCA_009078285 | Bifidobacteriaceae | aerobe | anaerobe | 0.9456 | 0.0543 | 0.8912 |

artifacts/lora/fold0_oxygen_diagnostics_comparison.md ADDED

	@@ -0,0 +1,22 @@

+# Fold 0 Oxygen Diagnostic Comparison
+Both diagnostics were run on the same fold 0 oxygen-labeled validation rows
+(`n = 2,276`) using the saved LoRA checkpoints.
+| Checkpoint | Accuracy | Macro F1 (supported classes) | Aerobe F1 | Anaerobe F1 | Aerobe -> Anaerobe | Anaerobe -> Aerobe |
+|---|---:|---:|---:|---:|---:|---:|
+| `fold0_best.pt` | 0.9565 | 0.9460 | 0.9698 | 0.9222 | 33 | 66 |
+| `fold0_best_oxygen.pt` | 0.9446 | 0.9294 | 0.9622 | 0.8966 | 19 | 107 |
+## Conclusion
+The all-task LoRA checkpoint remains the better oxygen model on fold 0. The
+oxygen-only checkpoint improves aerobe recall slightly (0.9797 to 0.9883) and
+reduces aerobes misclassified as anaerobes (33 to 19), but it substantially
+increases anaerobes misclassified as aerobes (66 to 107). That drop in anaerobe
+recall (0.8989 to 0.8361) drives the lower macro F1.
+The fold 0 validation split has no supported `facultative_anaerobe` or
+`microaerobe` rows, so the primary comparison should use macro F1 over supported
+classes. Full configured-class macro F1 is also recorded in the JSON files for
+visibility, but it is not useful for selecting between these two checkpoints on
+this fold.
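
The gap between the two macro F1 variants can be reproduced from the confusion matrices alone. The sketch below recomputes both for each checkpoint, using the matrices recorded in the two diagnostics JSON files. The function names are illustrative and the diagnostics script may compute these differently.

```python
def per_class_f1(cm):
    """Per-class F1 and support from a square confusion matrix (rows = true)."""
    n = len(cm)
    f1, support = [], []
    for k in range(n):
        tp = cm[k][k]
        true_count = sum(cm[k])                       # TP + FN
        pred_count = sum(cm[i][k] for i in range(n))  # TP + FP
        denom = true_count + pred_count
        # F1 = 2TP / (2TP + FP + FN); defined as 0 for an empty class.
        f1.append(2 * tp / denom if denom else 0.0)
        support.append(true_count)
    return f1, support


def macro_f1(cm, supported_only):
    """Average F1 over classes, optionally skipping classes with no true rows."""
    f1, support = per_class_f1(cm)
    kept = [f for f, s in zip(f1, support) if s > 0 or not supported_only]
    return sum(kept) / len(kept)


# Class order: aerobe, anaerobe, facultative_anaerobe, microaerobe.
all_task = [[1590, 33, 0, 0], [66, 587, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
oxygen_only = [[1604, 19, 0, 0], [107, 546, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]

for name, cm in [("fold0_best.pt", all_task), ("fold0_best_oxygen.pt", oxygen_only)]:
    print(name, round(macro_f1(cm, True), 4), round(macro_f1(cm, False), 4))
```

Because the two empty classes each contribute an F1 of 0, the all-class figure is exactly half the supported-class figure on this fold, which is why it carries no extra information for choosing between the checkpoints.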

artifacts/lora/fold0_oxygen_only_diagnostics.json ADDED

	@@ -0,0 +1,321 @@

+{
+  "checkpoint": "artifacts/lora/fold0_best_oxygen.pt",
+  "n": 2276,
+  "classes": [
+    "aerobe",
+    "anaerobe",
+    "facultative_anaerobe",
+    "microaerobe"
+  ],
+  "accuracy": 0.94464,
+  "macro_f1": 0.92938,
+  "macro_f1_all_classes": 0.46469,
+  "confusion_matrix": [
+    [
+      1604,
+      19,
+      0,
+      0
+    ],
+    [
+      107,
+      546,
+      0,
+      0
+    ],
+    [
+      0,
+      0,
+      0,
+      0
+    ],
+    [
+      0,
+      0,
+      0,
+      0
+    ]
+  ],
+  "per_class": {
+    "aerobe": {
+      "precision": 0.937463,
+      "recall": 0.988293,
+      "f1": 0.962208,
+      "support": 1623,
+      "predicted": 1711
+    },
+    "anaerobe": {
+      "precision": 0.966372,
+      "recall": 0.836141,
+      "f1": 0.896552,
+      "support": 653,
+      "predicted": 565
+    },
+    "facultative_anaerobe": {
+      "precision": 0.0,
+      "recall": 0.0,
+      "f1": 0.0,
+      "support": 0,
+      "predicted": 0
+    },
+    "microaerobe": {
+      "precision": 0.0,
+      "recall": 0.0,
+      "f1": 0.0,
+      "support": 0,
+      "predicted": 0
+    }
+  },
+  "wrong_predictions": [
+    {
+      "bacdive_id": 499,
+      "genome_accession": "GCA_000429505",
+      "group": "Alteromonadaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.999059,
+      "true_probability": 0.000936,
+      "margin": 0.998123
+    },
+    {
+      "bacdive_id": 481,
+      "genome_accession": "GCA_003363485",
+      "group": "Alteromonadaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.998843,
+      "true_probability": 0.001152,
+      "margin": 0.997691
+    },
+    {
+      "bacdive_id": 168525,
+      "genome_accession": "GCA_006386545",
+      "group": "Sphaerotilaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.99857,
+      "true_probability": 0.001423,
+      "margin": 0.997147
+    },
+    {
+      "bacdive_id": 498,
+      "genome_accession": "GCA_000429485",
+      "group": "Alteromonadaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.997633,
+      "true_probability": 0.002353,
+      "margin": 0.99528
+    },
+    {
+      "bacdive_id": 17841,
+      "genome_accession": "GCA_000975055",
+      "group": "Demequinaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.996928,
+      "true_probability": 0.003054,
+      "margin": 0.993875
+    },
+    {
+      "bacdive_id": 483,
+      "genome_accession": "GCA_000421165",
+      "group": "Alteromonadaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.996088,
+      "true_probability": 0.003887,
+      "margin": 0.992201
+    },
+    {
+      "bacdive_id": 133974,
+      "genome_accession": "GCA_001544515",
+      "group": "Burkholderiaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.995023,
+      "true_probability": 0.004932,
+      "margin": 0.990091
+    },
+    {
+      "bacdive_id": 17840,
+      "genome_accession": "GCA_000975035",
+      "group": "Demequinaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.9928,
+      "true_probability": 0.007156,
+      "margin": 0.985644
+    },
+    {
+      "bacdive_id": 164735,
+      "genome_accession": "GCF_943590815.1",
+      "group": "Enterobacteriaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.991238,
+      "true_probability": 0.008708,
+      "margin": 0.982529
+    },
+    {
+      "bacdive_id": 132623,
+      "genome_accession": "GCF_003967195.1",
+      "group": "Granulosicoccaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.990258,
+      "true_probability": 0.00966,
+      "margin": 0.980599
+    },
+    {
+      "bacdive_id": 133976,
+      "genome_accession": "GCA_001544495",
+      "group": "Burkholderiaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.989736,
+      "true_probability": 0.010187,
+      "margin": 0.979549
+    },
+    {
+      "bacdive_id": 148058,
+      "genome_accession": "GCF_000005845.2",
+      "group": "Enterobacteriaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.985801,
+      "true_probability": 0.014125,
+      "margin": 0.971676
+    },
+    {
+      "bacdive_id": 149706,
+      "genome_accession": "GCF_000005845.2",
+      "group": "Enterobacteriaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.985801,
+      "true_probability": 0.014125,
+      "margin": 0.971676
+    },
+    {
+      "bacdive_id": 156346,
+      "genome_accession": "GCF_000005845.2",
+      "group": "Enterobacteriaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.985801,
+      "true_probability": 0.014125,
+      "margin": 0.971676
+    },
+    {
+      "bacdive_id": 133975,
+      "genome_accession": "GCA_001544475",
+      "group": "Burkholderiaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.985722,
+      "true_probability": 0.014188,
+      "margin": 0.971534
+    },
+    {
+      "bacdive_id": 132346,
+      "genome_accession": "GCA_002259755",
+      "group": "Bifidobacteriaceae",
+      "true": "aerobe",
+      "pred": "anaerobe",
+      "confidence": 0.97981,
+      "true_probability": 0.020066,
+      "margin": 0.959744
+    },
+    {
+      "bacdive_id": 160296,
+      "genome_accession": "GCA_010667645",
+      "group": "Bifidobacteriaceae",
+      "true": "aerobe",
+      "pred": "anaerobe",
+      "confidence": 0.973442,
+      "true_probability": 0.026419,
+      "margin": 0.947023
+    },
+    {
+      "bacdive_id": 159320,
+      "genome_accession": "GCA_009078285",
+      "group": "Bifidobacteriaceae",
+      "true": "aerobe",
+      "pred": "anaerobe",
+      "confidence": 0.971766,
+      "true_probability": 0.028081,
+      "margin": 0.943686
+    },
+    {
+      "bacdive_id": 7227,
+      "genome_accession": "GCA_000425185",
+      "group": "Pleomorphomonadaceae",
+      "true": "aerobe",
+      "pred": "anaerobe",
+      "confidence": 0.959909,
+      "true_probability": 0.039907,
+      "margin": 0.920002
+    },
+    {
+      "bacdive_id": 154004,
+      "genome_accession": "GCF_982443925.1",
+      "group": "Enterobacteriaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.959056,
+      "true_probability": 0.04078,
+      "margin": 0.918276
+    },
+    {
+      "bacdive_id": 24101,
+      "genome_accession": "GCA_000242915",
+      "group": "Helicobacteraceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.954952,
+      "true_probability": 0.044544,
+      "margin": 0.910408
+    },
+    {
+      "bacdive_id": 133136,
+      "genome_accession": "GCA_030161615",
+      "group": "Demequinaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.953702,
+      "true_probability": 0.045723,
+      "margin": 0.90798
+    },
+    {
+      "bacdive_id": 159192,
+      "genome_accession": "GCA_003721225",
+      "group": "Acidithiobacillaceae",
+      "true": "aerobe",
+      "pred": "anaerobe",
+      "confidence": 0.952867,
+      "true_probability": 0.046824,
+      "margin": 0.906043
+    },
+    {
+      "bacdive_id": 8608,
+      "genome_accession": "GCF_900476065.1",
+      "group": "Metamycoplasmataceae",
+      "true": "aerobe",
+      "pred": "anaerobe",
+      "confidence": 0.927788,
+      "true_probability": 0.070978,
+      "margin": 0.85681
+    },
+    {
+      "bacdive_id": 134099,
+      "genome_accession": "GCA_023349185",
+      "group": "Shewanellaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.924335,
+      "true_probability": 0.075315,
+      "margin": 0.84902
+    }
+  ]
+}

artifacts/lora/fold0_oxygen_only_diagnostics.md ADDED

	@@ -0,0 +1,56 @@

+# LoRA Oxygen Diagnostics
+Checkpoint: `artifacts/lora/fold0_best_oxygen.pt`
+- Labeled validation rows: `2276`
+- Accuracy: `0.9446`
+- Macro F1 (supported classes): `0.9294`
+- Macro F1 (all configured classes): `0.4647`
+## Per-Class Metrics
+| Class | Precision | Recall | F1 | Support | Predicted |
+|---|---:|---:|---:|---:|---:|
+| aerobe | 0.9375 | 0.9883 | 0.9622 | 1623 | 1711 |
+| anaerobe | 0.9664 | 0.8361 | 0.8966 | 653 | 565 |
+| facultative_anaerobe | 0.0000 | 0.0000 | 0.0000 | 0 | 0 |
+| microaerobe | 0.0000 | 0.0000 | 0.0000 | 0 | 0 |
+## Confusion Matrix
+| True \ Pred | aerobe | anaerobe | facultative_anaerobe | microaerobe |
+|---|---:|---:|---:|---:|
+| aerobe | 1604 | 19 | 0 | 0 |
+| anaerobe | 107 | 546 | 0 | 0 |
+| facultative_anaerobe | 0 | 0 | 0 | 0 |
+| microaerobe | 0 | 0 | 0 | 0 |
+## High-Confidence Wrong Predictions
+| BacDive ID | Genome | Group | True | Pred | Confidence | True Prob. | Margin |
+|---:|---|---|---|---|---:|---:|---:|
+| 499 | GCA_000429505 | Alteromonadaceae | anaerobe | aerobe | 0.9991 | 0.0009 | 0.9981 |
+| 481 | GCA_003363485 | Alteromonadaceae | anaerobe | aerobe | 0.9988 | 0.0012 | 0.9977 |
+| 168525 | GCA_006386545 | Sphaerotilaceae | anaerobe | aerobe | 0.9986 | 0.0014 | 0.9971 |
+| 498 | GCA_000429485 | Alteromonadaceae | anaerobe | aerobe | 0.9976 | 0.0024 | 0.9953 |
+| 17841 | GCA_000975055 | Demequinaceae | anaerobe | aerobe | 0.9969 | 0.0031 | 0.9939 |
+| 483 | GCA_000421165 | Alteromonadaceae | anaerobe | aerobe | 0.9961 | 0.0039 | 0.9922 |
+| 133974 | GCA_001544515 | Burkholderiaceae | anaerobe | aerobe | 0.9950 | 0.0049 | 0.9901 |
+| 17840 | GCA_000975035 | Demequinaceae | anaerobe | aerobe | 0.9928 | 0.0072 | 0.9856 |
+| 164735 | GCF_943590815.1 | Enterobacteriaceae | anaerobe | aerobe | 0.9912 | 0.0087 | 0.9825 |
+| 132623 | GCF_003967195.1 | Granulosicoccaceae | anaerobe | aerobe | 0.9903 | 0.0097 | 0.9806 |
+| 133976 | GCA_001544495 | Burkholderiaceae | anaerobe | aerobe | 0.9897 | 0.0102 | 0.9795 |
+| 148058 | GCF_000005845.2 | Enterobacteriaceae | anaerobe | aerobe | 0.9858 | 0.0141 | 0.9717 |
+| 149706 | GCF_000005845.2 | Enterobacteriaceae | anaerobe | aerobe | 0.9858 | 0.0141 | 0.9717 |
+| 156346 | GCF_000005845.2 | Enterobacteriaceae | anaerobe | aerobe | 0.9858 | 0.0141 | 0.9717 |
+| 133975 | GCA_001544475 | Burkholderiaceae | anaerobe | aerobe | 0.9857 | 0.0142 | 0.9715 |
+| 132346 | GCA_002259755 | Bifidobacteriaceae | aerobe | anaerobe | 0.9798 | 0.0201 | 0.9597 |
+| 160296 | GCA_010667645 | Bifidobacteriaceae | aerobe | anaerobe | 0.9734 | 0.0264 | 0.9470 |
+| 159320 | GCA_009078285 | Bifidobacteriaceae | aerobe | anaerobe | 0.9718 | 0.0281 | 0.9437 |
+| 7227 | GCA_000425185 | Pleomorphomonadaceae | aerobe | anaerobe | 0.9599 | 0.0399 | 0.9200 |
+| 154004 | GCF_982443925.1 | Enterobacteriaceae | anaerobe | aerobe | 0.9591 | 0.0408 | 0.9183 |
+| 24101 | GCA_000242915 | Helicobacteraceae | anaerobe | aerobe | 0.9550 | 0.0445 | 0.9104 |
+| 133136 | GCA_030161615 | Demequinaceae | anaerobe | aerobe | 0.9537 | 0.0457 | 0.9080 |
+| 159192 | GCA_003721225 | Acidithiobacillaceae | aerobe | anaerobe | 0.9529 | 0.0468 | 0.9060 |
+| 8608 | GCF_900476065.1 | Metamycoplasmataceae | aerobe | anaerobe | 0.9278 | 0.0710 | 0.8568 |
+| 134099 | GCA_023349185 | Shewanellaceae | anaerobe | aerobe | 0.9243 | 0.0753 | 0.8490 |

artifacts/lora/fold0_results.json ADDED

	@@ -0,0 +1,78 @@

+{
+  "model_cfg": {
+    "esm_model_name": "facebook/esm2_t12_35M_UR50D",
+    "lora_r": 8,
+    "lora_alpha": 16,
+    "lora_dropout": 0.05,
+    "lora_target": [
+      "query",
+      "value"
+    ],
+    "head_hidden_dim": 128,
+    "head_dropout": 0.1,
+    "max_seq_len": 512,
+    "max_proteins_per_cat": 6,
+    "gradient_checkpointing": true
+  },
+  "train_cfg": {
+    "fold": 0,
+    "epochs": 1,
+    "batch_size": 2,
+    "grad_accum": 8,
+    "lora_lr": 0.0001,
+    "head_lr": 0.001,
+    "weight_decay": 0.01,
+    "warmup_frac": 0.05,
+    "bf16": true,
+    "max_proteins_per_category": 16,
+    "save_dir": "artifacts/lora",
+    "grad_clip": 1.0
+  },
+  "history": [
+    {
+      "epoch": 1,
+      "train_loss": 45.74971861128626,
+      "val": {
+        "temp": {
+          "mae": 3.666003942489624,
+          "n": 8066
+        },
+        "ph": {
+          "mae": 0.5597922205924988,
+          "n": 705
+        },
+        "salt": {
+          "mae": 1.815351128578186,
+          "n": 424
+        },
+        "oxy": {
+          "f1_macro": 0.9448227977645267,
+          "n": 2266
+        }
+      },
+      "elapsed_s": 19656.821365594864
+    }
+  ],
+  "best": {
+    "epoch": 1,
+    "val": {
+      "temp": {
+        "mae": 3.666003942489624,
+        "n": 8066
+      },
+      "ph": {
+        "mae": 0.5597922205924988,
+        "n": 705
+      },
+      "salt": {
+        "mae": 1.815351128578186,
+        "n": 424
+      },
+      "oxy": {
+        "f1_macro": 0.9448227977645267,
+        "n": 2266
+      }
+    },
+    "score": 5.096324493895782
+  }
+}
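The `best.score` values in these result files are consistent with a simple composite: the sum of the available MAEs minus the oxygen macro F1, with `null` metrics skipped (lower is better). That formula is inferred from the reported numbers, not taken from the trainer source, so treat the hypothetical helper `composite_score` below as a minimal sketch under that assumption.

```python
def composite_score(val: dict) -> float:
    """Sum of regression MAEs minus oxygen macro F1, skipping missing metrics.

    Assumption: this formula is inferred from the reported numbers only.
    """
    score = 0.0
    for target in ("temp", "ph", "salt"):
        mae = val[target].get("mae")
        if mae is not None:
            score += mae
    f1 = val["oxy"].get("f1_macro")
    if f1 is not None:
        score -= f1
    return score


# Validation metrics copied from artifacts/lora/fold0_results.json.
all_task_val = {
    "temp": {"mae": 3.666003942489624, "n": 8066},
    "ph": {"mae": 0.5597922205924988, "n": 705},
    "salt": {"mae": 1.815351128578186, "n": 424},
    "oxy": {"f1_macro": 0.9448227977645267, "n": 2266},
}

# Validation metrics copied from artifacts/lora/fold0_results_smoke.json.
smoke_val = {
    "temp": {"mae": 28.067121505737305, "n": 172},
    "ph": {"mae": None, "n": 0},
    "salt": {"mae": None, "n": 0},
    "oxy": {"f1_macro": None, "n": 0},
}

print(composite_score(all_task_val))
print(composite_score(smoke_val))
```

Under this reading, the oxygen-only run's large score (about 40.46) is driven almost entirely by the temperature and pH heads, which that run's loss ignores.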

artifacts/lora/fold0_results_oxygen.json ADDED

	@@ -0,0 +1,82 @@

+{
+  "model_cfg": {
+    "esm_model_name": "facebook/esm2_t12_35M_UR50D",
+    "lora_r": 8,
+    "lora_alpha": 16,
+    "lora_dropout": 0.05,
+    "lora_target": [
+      "query",
+      "value"
+    ],
+    "head_hidden_dim": 128,
+    "head_dropout": 0.1,
+    "max_seq_len": 512,
+    "max_proteins_per_cat": 6,
+    "gradient_checkpointing": true
+  },
+  "train_cfg": {
+    "fold": 0,
+    "epochs": 1,
+    "batch_size": 2,
+    "grad_accum": 8,
+    "lora_lr": 0.0001,
+    "head_lr": 0.001,
+    "weight_decay": 0.01,
+    "warmup_frac": 0.05,
+    "bf16": true,
+    "max_proteins_per_category": 16,
+    "save_dir": "artifacts/lora",
+    "grad_clip": 1.0,
+    "temp_weight": 0.0,
+    "ph_weight": 0.0,
+    "salt_weight": 0.0,
+    "oxy_weight": 1.0
+  },
+  "history": [
+    {
+      "epoch": 1,
+      "train_loss": 0.08001717917339461,
+      "val": {
+        "temp": {
+          "mae": 32.25041580200195,
+          "n": 8062
+        },
+        "ph": {
+          "mae": 7.046206951141357,
+          "n": 763
+        },
+        "salt": {
+          "mae": 2.082266330718994,
+          "n": 446
+        },
+        "oxy": {
+          "f1_macro": 0.916835889123625,
+          "n": 2214
+        }
+      },
+      "elapsed_s": 19710.637019872665
+    }
+  ],
+  "best": {
+    "epoch": 1,
+    "val": {
+      "temp": {
+        "mae": 32.25041580200195,
+        "n": 8062
+      },
+      "ph": {
+        "mae": 7.046206951141357,
+        "n": 763
+      },
+      "salt": {
+        "mae": 2.082266330718994,
+        "n": 446
+      },
+      "oxy": {
+        "f1_macro": 0.916835889123625,
+        "n": 2214
+      }
+    },
+    "score": 40.46205319473868
+  }
+}

artifacts/lora/fold0_results_smoke.json ADDED

	@@ -0,0 +1,78 @@

+{
+  "model_cfg": {
+    "esm_model_name": "facebook/esm2_t12_35M_UR50D",
+    "lora_r": 8,
+    "lora_alpha": 16,
+    "lora_dropout": 0.05,
+    "lora_target": [
+      "query",
+      "value"
+    ],
+    "head_hidden_dim": 128,
+    "head_dropout": 0.1,
+    "max_seq_len": 512,
+    "max_proteins_per_cat": 6,
+    "gradient_checkpointing": true
+  },
+  "train_cfg": {
+    "fold": 0,
+    "epochs": 1,
+    "batch_size": 2,
+    "grad_accum": 8,
+    "lora_lr": 0.0001,
+    "head_lr": 0.001,
+    "weight_decay": 0.01,
+    "warmup_frac": 0.05,
+    "bf16": true,
+    "max_proteins_per_category": 16,
+    "save_dir": "/artifacts/lora",
+    "grad_clip": 1.0
+  },
+  "history": [
+    {
+      "epoch": 1,
+      "train_loss": 896.9954485212054,
+      "val": {
+        "temp": {
+          "mae": 28.067121505737305,
+          "n": 172
+        },
+        "ph": {
+          "mae": null,
+          "n": 0
+        },
+        "salt": {
+          "mae": null,
+          "n": 0
+        },
+        "oxy": {
+          "f1_macro": null,
+          "n": 0
+        }
+      },
+      "elapsed_s": 60.57433819770813
+    }
+  ],
+  "best": {
+    "epoch": 1,
+    "val": {
+      "temp": {
+        "mae": 28.067121505737305,
+        "n": 172
+      },
+      "ph": {
+        "mae": null,
+        "n": 0
+      },
+      "salt": {
+        "mae": null,
+        "n": 0
+      },
+      "oxy": {
+        "f1_macro": null,
+        "n": 0
+      }
+    },
+    "score": 28.067121505737305
+  }
+}

artifacts/lora/lambda_fold0_1ep_20260517T033023Z.log ADDED

	@@ -0,0 +1,57 @@

+[lambda-lora] starting fold=0 epochs=1 model=facebook/esm2_t12_35M_UR50D
+[lambda-lora] cuda=True device=NVIDIA A100-SXM4-40GB
+[lora] device = cuda
+[lora] loaded 40,469 records with sequences + labels
+[lora] fold 0: 32,375 train / 8,094 val
+Some weights of EsmModel were not initialized from the model checkpoint at facebook/esm2_t12_35M_UR50D and are newly initialized: ['esm.pooler.dense.bias', 'esm.pooler.dense.weight']
+You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
+[lora] trainable params: 2,151,815 / total: 36,144,696 (5.95%)
+/home/ubuntu/microbe-model/.venv/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
+  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
+  ep 1 step 50: loss=842.3137  lr_lora=4.95e-05
+  ep 1 step 100: loss=482.1193  lr_lora=9.90e-05
+  ep 1 step 150: loss=334.7604  lr_lora=9.98e-05
+  ep 1 step 200: loss=264.2583  lr_lora=9.93e-05
+  ep 1 step 250: loss=217.4865  lr_lora=9.85e-05
+  ep 1 step 300: loss=186.5597  lr_lora=9.74e-05
+  ep 1 step 350: loss=163.8552  lr_lora=9.59e-05
+  ep 1 step 400: loss=147.2159  lr_lora=9.41e-05
+  ep 1 step 450: loss=133.7593  lr_lora=9.21e-05
+  ep 1 step 500: loss=123.0141  lr_lora=8.97e-05
+  ep 1 step 550: loss=113.9034  lr_lora=8.71e-05
+  ep 1 step 600: loss=106.4140  lr_lora=8.43e-05
+  ep 1 step 650: loss=100.2909  lr_lora=8.12e-05
+  ep 1 step 700: loss=94.6046  lr_lora=7.79e-05
+  ep 1 step 750: loss=89.8550  lr_lora=7.44e-05
+  ep 1 step 800: loss=85.4993  lr_lora=7.08e-05
+  ep 1 step 850: loss=81.6606  lr_lora=6.70e-05
+  ep 1 step 900: loss=77.9707  lr_lora=6.31e-05
+  ep 1 step 950: loss=75.1319  lr_lora=5.91e-05
+  ep 1 step 1000: loss=72.3647  lr_lora=5.51e-05
+  ep 1 step 1050: loss=69.8078  lr_lora=5.10e-05
+  ep 1 step 1100: loss=67.5791  lr_lora=4.69e-05
+  ep 1 step 1150: loss=65.4581  lr_lora=4.28e-05
+  ep 1 step 1200: loss=63.5737  lr_lora=3.88e-05
+  ep 1 step 1250: loss=61.9102  lr_lora=3.49e-05
+  ep 1 step 1300: loss=60.3259  lr_lora=3.10e-05
+  ep 1 step 1350: loss=58.7075  lr_lora=2.73e-05
+  ep 1 step 1400: loss=57.2739  lr_lora=2.38e-05
+  ep 1 step 1450: loss=55.9584  lr_lora=2.04e-05
+  ep 1 step 1500: loss=54.7973  lr_lora=1.72e-05
+  ep 1 step 1550: loss=53.5961  lr_lora=1.42e-05
+  ep 1 step 1600: loss=52.5554  lr_lora=1.15e-05
+  ep 1 step 1650: loss=51.6486  lr_lora=9.01e-06
+  ep 1 step 1700: loss=50.7810  lr_lora=6.81e-06
+  ep 1 step 1750: loss=49.9158  lr_lora=4.90e-06
+  ep 1 step 1800: loss=49.0940  lr_lora=3.28e-06
+  ep 1 step 1850: loss=48.2504  lr_lora=1.99e-06
+  ep 1 step 1900: loss=47.4683  lr_lora=1.01e-06
+  ep 1 step 1950: loss=46.7232  lr_lora=3.56e-07
+  ep 1 step 2000: loss=46.0539  lr_lora=3.53e-08
+[lora] epoch 1 done in 19657s  val={'temp': {'mae': 3.666003942489624, 'n': 8066}, 'ph': {'mae': 0.5597922205924988, 'n': 705}, 'salt': {'mae': 1.815351128578186, 'n': 424}, 'oxy': {'f1_macro': 0.9448227977645267, 'n': 2266}}
+[lora] wrote artifacts/lora/fold0_results.json
+Traceback (most recent call last):
+  File "/home/ubuntu/microbe-model/scripts/run_lora_lambda.py", line 16, in <module>
+    print(f"[lambda-lora] done best={results.get(best)}", flush=True)
+                                                 ^^^^
+NameError: name 'best' is not defined
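The traceback points at `results.get(best)`, where `best` is read as an unbound variable rather than a dictionary key. A minimal sketch of the likely fix (quoting the key) follows. The `results` dict here is a stand-in built from the values reported in this log, not the script's real return value.

```python
# Stand-in for the dict returned by training (values taken from the log above).
results = {"best": {"epoch": 1, "score": 5.096324493895782}}

# Buggy form from the traceback: `best` is an undefined name, so this raises NameError.
# print(f"[lambda-lora] done best={results.get(best)}", flush=True)

# Quoting the key looks up the "best" entry as intended.
message = f"[lambda-lora] done best={results.get('best')}"
print(message, flush=True)
```

Training had already finished and written `fold0_results.json` before this line ran, so the error affects only the final status message.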

artifacts/lora/lambda_fold0_oxygen_1ep_20260517T103524Z.log ADDED

	@@ -0,0 +1,60 @@

+[lambda-lora] fold=0 epochs=1 model=facebook/esm2_t12_35M_UR50D preset=oxygen
+[lambda-lora] device=NVIDIA A100-SXM4-40GB
+[lambda-lora] target_weights={'temp': 0.0, 'ph': 0.0, 'salt': 0.0, 'oxy': 1.0}
+[lora] device = cuda
+[lora] loaded 40,469 records with sequences + labels
+[lora] fold 0: 32,375 train / 8,094 val
+2026-05-17 10:35:49.175226: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
+WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
+E0000 00:00:1779014149.193921    3697 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
+E0000 00:00:1779014149.200030    3697 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
+W0000 00:00:1779014149.218228    3697 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
+W0000 00:00:1779014149.218259    3697 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
+W0000 00:00:1779014149.218266    3697 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
+W0000 00:00:1779014149.218271    3697 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
+Some weights of EsmModel were not initialized from the model checkpoint at facebook/esm2_t12_35M_UR50D and are newly initialized: ['esm.pooler.dense.bias', 'esm.pooler.dense.weight']
+You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
+[lora] trainable params: 2,151,815 / total: 36,144,696 (5.95%)
+  ep 1 step 50: loss=0.3476  lr_lora=4.95e-05
+  ep 1 step 100: loss=0.2605  lr_lora=9.90e-05
+  ep 1 step 150: loss=0.2177  lr_lora=9.98e-05
+  ep 1 step 200: loss=0.1936  lr_lora=9.93e-05
+  ep 1 step 250: loss=0.1740  lr_lora=9.85e-05
+  ep 1 step 300: loss=0.1574  lr_lora=9.74e-05
+  ep 1 step 350: loss=0.1420  lr_lora=9.59e-05
+  ep 1 step 400: loss=0.1378  lr_lora=9.41e-05
+  ep 1 step 450: loss=0.1294  lr_lora=9.21e-05
+  ep 1 step 500: loss=0.1307  lr_lora=8.97e-05
+  ep 1 step 550: loss=0.1256  lr_lora=8.71e-05
+  ep 1 step 600: loss=0.1195  lr_lora=8.43e-05
+  ep 1 step 650: loss=0.1169  lr_lora=8.12e-05
+  ep 1 step 700: loss=0.1113  lr_lora=7.79e-05
+  ep 1 step 750: loss=0.1076  lr_lora=7.44e-05
+  ep 1 step 800: loss=0.1036  lr_lora=7.08e-05
+  ep 1 step 850: loss=0.1003  lr_lora=6.70e-05
+  ep 1 step 900: loss=0.0982  lr_lora=6.31e-05
+  ep 1 step 950: loss=0.0990  lr_lora=5.91e-05
+  ep 1 step 1000: loss=0.0966  lr_lora=5.51e-05
+  ep 1 step 1050: loss=0.0965  lr_lora=5.10e-05
+  ep 1 step 1100: loss=0.0955  lr_lora=4.69e-05
+  ep 1 step 1150: loss=0.0954  lr_lora=4.28e-05
+  ep 1 step 1200: loss=0.0942  lr_lora=3.88e-05
+  ep 1 step 1250: loss=0.0939  lr_lora=3.49e-05
+  ep 1 step 1300: loss=0.0928  lr_lora=3.10e-05
+  ep 1 step 1350: loss=0.0911  lr_lora=2.73e-05
+  ep 1 step 1400: loss=0.0895  lr_lora=2.38e-05
+  ep 1 step 1450: loss=0.0890  lr_lora=2.04e-05
+  ep 1 step 1500: loss=0.0887  lr_lora=1.72e-05
+  ep 1 step 1550: loss=0.0869  lr_lora=1.42e-05
+  ep 1 step 1600: loss=0.0861  lr_lora=1.15e-05
+  ep 1 step 1650: loss=0.0856  lr_lora=9.01e-06
+  ep 1 step 1700: loss=0.0848  lr_lora=6.81e-06
+  ep 1 step 1750: loss=0.0842  lr_lora=4.90e-06
+  ep 1 step 1800: loss=0.0840  lr_lora=3.28e-06
+  ep 1 step 1850: loss=0.0832  lr_lora=1.99e-06
+  ep 1 step 1900: loss=0.0821  lr_lora=1.01e-06
+  ep 1 step 1950: loss=0.0818  lr_lora=3.56e-07
+  ep 1 step 2000: loss=0.0805  lr_lora=3.53e-08
+[lora] epoch 1 done in 19711s  val={'temp': {'mae': 32.25041580200195, 'n': 8062}, 'ph': {'mae': 7.046206951141357, 'n': 763}, 'salt': {'mae': 2.082266330718994, 'n': 446}, 'oxy': {'f1_macro': 0.916835889123625, 'n': 2214}}
+[lora] wrote artifacts/lora/fold0_results.json
+[lambda-lora] done best={'epoch': 1, 'val': {'temp': {'mae': 32.25041580200195, 'n': 8062}, 'ph': {'mae': 7.046206951141357, 'n': 763}, 'salt': {'mae': 2.082266330718994, 'n': 446}, 'oxy': {'f1_macro': 0.916835889123625, 'n': 2214}}, 'score': 40.46205319473868}
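The logged `lr_lora` values are consistent with linear warmup followed by cosine decay to zero, assuming 2,023 optimizer steps (32,375 training records divided by `batch_size` 2 × `grad_accum` 8, rounded down) and 101 warmup steps (`warmup_frac` 0.05). The scheduler actually used by the trainer is not shown in this diff, so the hypothetical `lr_at` function and its constants are assumptions; this is a sketch that reproduces the logged shape, not the trainer's implementation.

```python
import math


def lr_at(step: int, base_lr: float = 1e-4, total_steps: int = 2023,
          warmup_frac: float = 0.05) -> float:
    """Linear warmup then cosine decay to zero (assumed schedule)."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup steps.
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Optimizer steps per epoch: records // (batch_size * grad_accum).
total_steps = 32_375 // (2 * 8)

for step in (50, 100, 150, 1000, 2000):
    print(f"step {step}: lr_lora={lr_at(step, total_steps=total_steps):.2e}")
```

The same assumption also matches the 3-epoch Modal run, where a warmup of `int(0.05 * 3 * 2023) = 303` steps gives the logged `1.65e-05` at step 50.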

artifacts/lora/lambda_fold0_weighted_anaerobe_20260518T070340Z.log ADDED

	@@ -0,0 +1,57 @@

+[lambda-lora] fold=0 epochs=1 model=facebook/esm2_t12_35M_UR50D preset=all
+[lambda-lora] device=NVIDIA A100-SXM4-40GB
+[lambda-lora] target_weights={'temp': 1.0, 'ph': 1.0, 'salt': 1.0, 'oxy': 1.0}
+[lambda-lora] oxy_class_weights=(1.0, 1.5, 1.0, 1.0)
+[lora] device = cuda
+[lora] loaded 40,469 records with sequences + labels
+[lora] fold 0: 32,375 train / 8,094 val
+/home/ubuntu/.cache/uv/archive-v0/hIOP0T7h41YWWs-w/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
+  warnings.warn(
+Some weights of EsmModel were not initialized from the model checkpoint at facebook/esm2_t12_35M_UR50D and are newly initialized: ['esm.pooler.dense.bias', 'esm.pooler.dense.weight']
+You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
+[lora] trainable params: 2,151,815 / total: 36,144,696 (5.95%)
+/home/ubuntu/.cache/uv/archive-v0/hIOP0T7h41YWWs-w/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
+  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
+  ep 1 step 50: loss=886.9809  lr_lora=4.95e-05
+  ep 1 step 100: loss=554.1884  lr_lora=9.90e-05
+  ep 1 step 150: loss=380.8741  lr_lora=9.98e-05
+  ep 1 step 200: loss=296.4814  lr_lora=9.93e-05
+  ep 1 step 250: loss=244.2839  lr_lora=9.85e-05
+  ep 1 step 300: loss=207.9945  lr_lora=9.74e-05
+  ep 1 step 350: loss=183.1828  lr_lora=9.59e-05
+  ep 1 step 400: loss=164.5015  lr_lora=9.41e-05
+  ep 1 step 450: loss=149.3741  lr_lora=9.21e-05
+  ep 1 step 500: loss=137.3078  lr_lora=8.97e-05
+  ep 1 step 550: loss=127.1643  lr_lora=8.71e-05
+  ep 1 step 600: loss=118.7969  lr_lora=8.43e-05
+  ep 1 step 650: loss=111.4764  lr_lora=8.12e-05
+  ep 1 step 700: loss=105.4436  lr_lora=7.79e-05
+  ep 1 step 750: loss=100.0075  lr_lora=7.44e-05
+  ep 1 step 800: loss=95.2808  lr_lora=7.08e-05
+  ep 1 step 850: loss=91.1569  lr_lora=6.70e-05
+  ep 1 step 900: loss=87.3118  lr_lora=6.31e-05
+  ep 1 step 950: loss=83.6892  lr_lora=5.91e-05
+  ep 1 step 1000: loss=80.7049  lr_lora=5.51e-05
+  ep 1 step 1050: loss=77.8812  lr_lora=5.10e-05
+  ep 1 step 1100: loss=75.2743  lr_lora=4.69e-05
+  ep 1 step 1150: loss=73.0166  lr_lora=4.28e-05
+  ep 1 step 1200: loss=70.9040  lr_lora=3.88e-05
+  ep 1 step 1250: loss=68.9807  lr_lora=3.49e-05
+  ep 1 step 1300: loss=67.0975  lr_lora=3.10e-05
+  ep 1 step 1350: loss=66.9951  lr_lora=2.73e-05
+  ep 1 step 1400: loss=65.1894  lr_lora=2.38e-05
+  ep 1 step 1450: loss=63.6373  lr_lora=2.04e-05
+  ep 1 step 1500: loss=62.0934  lr_lora=1.72e-05
+  ep 1 step 1550: loss=60.6855  lr_lora=1.42e-05
+  ep 1 step 1600: loss=59.3990  lr_lora=1.15e-05
+  ep 1 step 1650: loss=58.3970  lr_lora=9.01e-06
+  ep 1 step 1700: loss=57.2419  lr_lora=6.81e-06
+  ep 1 step 1750: loss=56.1781  lr_lora=4.90e-06
+  ep 1 step 1800: loss=55.1134  lr_lora=3.28e-06
+  ep 1 step 1850: loss=54.1673  lr_lora=1.99e-06
+  ep 1 step 1900: loss=53.2661  lr_lora=1.01e-06
+  ep 1 step 1950: loss=52.4271  lr_lora=3.56e-07
+  ep 1 step 2000: loss=51.6784  lr_lora=3.53e-08
+[lora] epoch 1 done in 19573s  val={'temp': {'mae': 3.659961462020874, 'n': 8066}, 'ph': {'mae': 0.564297616481781, 'n': 705}, 'salt': {'mae': 1.7560514211654663, 'n': 424}, 'oxy': {'f1_macro': 0.9447762839519784, 'n': 2266}}
+[lora] wrote artifacts/lora_weighted_anaerobe/fold0_results.json
+[lambda-lora] done best={'epoch': 1, 'val': {'temp': {'mae': 3.659961462020874, 'n': 8066}, 'ph': {'mae': 0.564297616481781, 'n': 705}, 'salt': {'mae': 1.7560514211654663, 'n': 424}, 'oxy': {'f1_macro': 0.9447762839519784, 'n': 2266}}, 'score': 5.035534215716143}

artifacts/lora_fold0_real.log ADDED

	@@ -0,0 +1,147 @@

+✓ Initialized. View run at
+https://modal.com/apps/miyu-wannaone-horiuchi/main/ap-yeeh8rm2U9bwVwD1nagjsS
+✓ Created objects.
+├── 🔨 Created mount
+│   /Users/miyuhoriuchi/microbe-model/scripts/modal_train_lora.py
+├── 🔨 Created mount PythonPackage:microbe_model
+├── 🔨 Created mount data/marker_sequences.jsonl
+├── 🔨 Created mount data/bacdive_phenotypes.parquet
+├── 🔨 Created mount data/strain_catalog.parquet
+└── 🔨 Created function train.
+[modal-lora] fold=0 epochs=3 model=facebook/esm2_t12_35M_UR50D smoke=False
+[lora] device = cuda
+[lora] loaded 40,469 records with sequences + labels
+[lora] fold 0: 32,375 train / 8,094 val
+Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
+Loading weights:   0%|          | 0/198 [00:00<?, ?it/s]
+Loading weights: 100%|██████████| 198/198 [00:00<00:00, 3043.05it/s]
+[transformers] EsmModel LOAD REPORT from: facebook/esm2_t12_35M_UR50D
+Key                       | Status     |
+--------------------------+------------+-
+lm_head.layer_norm.bias   | UNEXPECTED |
+lm_head.layer_norm.weight | UNEXPECTED |
+lm_head.dense.bias        | UNEXPECTED |
+lm_head.dense.weight      | UNEXPECTED |
+lm_head.bias              | UNEXPECTED |
+pooler.dense.weight       | MISSING    |
+pooler.dense.bias         | MISSING    |
+Notes:
+- UNEXPECTED:   can be ignored when loading from different task/architecture; not ok if you expect identical arch.
+- MISSING:      those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
+[lora] trainable params: 2,151,815 / total: 35,652,216 (6.04%)
+[transformers] Caching is incompatible with gradient checkpointing in EsmLayer. Setting `use_cache=False`.
+  ep 1 step 50: loss=1000.2195  lr_lora=1.65e-05
+  ep 1 step 100: loss=788.6148  lr_lora=3.30e-05
+  ep 1 step 150: loss=560.6263  lr_lora=4.95e-05
+  ep 1 step 200: loss=432.6623  lr_lora=6.60e-05
+  ep 1 step 250: loss=354.6971  lr_lora=8.25e-05
+  ep 1 step 300: loss=300.4281  lr_lora=9.90e-05
+  ep 1 step 350: loss=262.0346  lr_lora=1.00e-04
+  ep 1 step 400: loss=232.7668  lr_lora=9.99e-05
+  ep 1 step 450: loss=210.0698  lr_lora=9.98e-05
+  ep 1 step 500: loss=192.0890  lr_lora=9.97e-05
+WARNING: Logs may not be continuous

artifacts/lora_oxygen_vs_all_task.md ADDED

	@@ -0,0 +1,20 @@

+# Oxygen-Only LoRA vs All-Task LoRA
+Both runs use fold 0, ESM-2 t12, LoRA r=8, 1 epoch on Lambda A100 SXM4.
+| Target | All-task LoRA | Oxygen-only LoRA | Baseline | Readout |
+|---|---:|---:|---:|---|
+| `optimal_temperature_c` MAE | 3.6660 | 32.2504 | 2.6743 | ignored by oxygen-only loss |
+| `optimal_ph` MAE | 0.5598 | 7.0462 | 0.4685 | ignored by oxygen-only loss |
+| `salt_tolerance_pct` MAE | 1.8154 | 2.0823 | 1.9171 | not improved by oxygen-only loss |
+| `oxygen_requirement` macro F1 | 0.9448 | 0.9168 | 0.4020 | oxygen-only worse than all-task |
+## Conclusion
+Oxygen-only training did not beat the all-task LoRA run on the oxygen target: `0.9168` macro F1 vs `0.9448` macro F1. The all-task LoRA checkpoint remains the better fold-0 result. The oxygen-only checkpoint is saved locally for inspection but should stay out of git.
+## Artifacts
+- `artifacts/lora/fold0_results_oxygen.json`
+- `artifacts/lora/lambda_fold0_oxygen_1ep_20260517T103524Z.log`
+- local-only checkpoint: `artifacts/lora/fold0_best_oxygen.pt`

artifacts/lora_smoke.log ADDED

	@@ -0,0 +1,384 @@

+✓ Initialized. View run at
+https://modal.com/apps/miyu-wannaone-horiuchi/main/ap-9cMuz0t3SvGbABiswTUf1i
+Building image im-0WStzS4903kICkIzhj6o3g
+=> Step 0: FROM base
+=> Step 1: COPY . /
+Saving image...
+Image saved, took 6.04s
+Built image im-0WStzS4903kICkIzhj6o3g in 64.55s
+Building image im-NdLGz3gVSHqOqDW9uEkE6S
+=> Step 0: FROM base
+=> Step 1: COPY . /
+Saving image...
+Image saved, took 1.27s
+Built image im-NdLGz3gVSHqOqDW9uEkE6S in 2.77s
+Building image im-7eAtiwQQ8BXwj3bXS3azON
+=> Step 0: FROM base
+=> Step 1: COPY . /
+Saving image...
+Image saved, took 1.21s
+Built image im-7eAtiwQQ8BXwj3bXS3azON in 2.92s
+✓ Created objects.
+├── 🔨 Created mount
+│   /Users/miyuhoriuchi/microbe-model/scripts/modal_train_lora.py
+├── 🔨 Created mount PythonPackage:microbe_model
+├── 🔨 Created mount data/marker_sequences.jsonl
+├── 🔨 Created mount data/bacdive_phenotypes.parquet
+├── 🔨 Created mount data/strain_catalog.parquet
+└── 🔨 Created function train.
+[modal-lora] fold=0 epochs=1 model=facebook/esm2_t12_35M_UR50D smoke=True
+[smoke] truncated to 200 sequences
+[lora] device = cuda
+[lora] loaded 200 records with sequences + labels
+[lora] fold 0: 28 train / 172 val
+Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
+Loading weights:   0%|          | 0/198 [00:00<?, ?it/s]
+Loading weights: 100%|██████████| 198/198 [00:00<00:00, 5756.73it/s]
+[transformers] EsmModel LOAD REPORT from: facebook/esm2_t12_35M_UR50D
+Key                       | Status     |
+--------------------------+------------+-
+lm_head.dense.weight      | UNEXPECTED |
+lm_head.layer_norm.bias   | UNEXPECTED |
+lm_head.bias              | UNEXPECTED |
+lm_head.layer_norm.weight | UNEXPECTED |
+lm_head.dense.bias        | UNEXPECTED |
+pooler.dense.weight       | MISSING    |
+pooler.dense.bias         | MISSING    |
+Notes:
+- UNEXPECTED:   can be ignored when loading from different task/architecture; not ok if you expect identical arch.
+- MISSING:      those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
+[lora] trainable params: 2,151,815 / total: 35,652,216 (6.04%)
+Traceback (most recent call last):
+  File "/pkg/modal/_runtime/container_io_manager.py", line 947, in handle_input_exception
+    yield
+  File "/pkg/modal/_container_entrypoint.py", line 172, in run_input_sync
+    values = io_context.call_function_sync()
+             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/pkg/modal/_runtime/container_io_manager.py", line 225, in call_function_sync
+    expected_value_or_values = self.finalized_function.callable(*args, **kwargs)
+                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/root/modal_train_lora.py", line 88, in train
+    results = train_lora(
+              ^^^^^^^^^^^
+  File "/root/microbe_model/train/lora_trainer.py", line 255, in train_lora
+    preds = model(batch["genomes"], device=device)
+            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl
+    return self._call_impl(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
+    return forward_call(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/root/microbe_model/train/lora_model.py", line 126, in forward
+    [self.encode_genome(g, device) for g in genomes],
+    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/root/microbe_model/train/lora_model.py", line 126, in <listcomp>
+    [self.encode_genome(g, device) for g in genomes],
+     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/root/microbe_model/train/lora_model.py", line 111, in encode_genome
+    per_protein = self.encode_proteins(proteins, device)  # (n, D)
+                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/root/microbe_model/train/lora_model.py", line 99, in encode_proteins
+    outputs = self.esm(**enc)
+              ^^^^^^^^^^^^^^^
+  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl
+    return self._call_impl(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
+    return forward_call(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/usr/local/lib/python3.11/site-packages/peft/peft_model.py", line 945, in forward
+    return self.get_base_model()(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl
+    return self._call_impl(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
+    return forward_call(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/usr/local/lib/python3.11/site-packages/transformers/utils/generic.py", line 976, in wrapper
+    output = func(self, *args, **kwargs)
+             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/usr/local/lib/python3.11/site-packages/transformers/utils/output_capturing.py", line 248, in wrapper
+    outputs = func(self, *args, **kwargs)
+              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/usr/local/lib/python3.11/site-packages/transformers/models/esm/modeling_esm.py", line 754, in forward
+    encoder_outputs = self.encoder(
+                      ^^^^^^^^^^^^^
+  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl
+    return self._call_impl(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
+    return forward_call(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/usr/local/lib/python3.11/site-packages/transformers/utils/generic.py", line 900, in wrapper
+    output = func(self, *args, **kwargs)
+             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/usr/local/lib/python3.11/site-packages/transformers/models/esm/modeling_esm.py", line 556, in forward
+    hidden_states = layer_module(
+                    ^^^^^^^^^^^^^
+  File "/usr/local/lib/python3.11/site-packages/transformers/modeling_layers.py", line 93, in __call__
+    return super().__call__(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl
+    return self._call_impl(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
+    return forward_call(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/usr/local/lib/python3.11/site-packages/transformers/models/esm/modeling_esm.py", line 527, in forward
+    layer_output = self.feed_forward_chunk(attention_output)
+                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/usr/local/lib/python3.11/site-packages/transformers/models/esm/modeling_esm.py", line 532, in feed_forward_chunk
+    intermediate_output = self.intermediate(attention_output_ln)
+                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl
+    return self._call_impl(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
+    return forward_call(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/usr/local/lib/python3.11/site-packages/transformers/models/esm/modeling_esm.py", line 462, in forward
+    hidden_states = gelu(hidden_states)
+                    ^^^^^^^^^^^^^^^^^^^
+  File "/usr/local/lib/python3.11/site-packages/transformers/models/esm/modeling_esm.py", line 86, in gelu
+    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
+           ~~^~~~~
+torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 36.00 MiB. GPU 0 has a total capacity of 22.06 GiB of which 3.44 MiB is free. Process 1 has 22.05 GiB memory in use. Of the allocated memory 21.19 GiB is allocated by PyTorch, and 593.32 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf)
+Stopping app - uncaught exception raised locally: OutOfMemoryError('CUDA out of memory. Tried to allocate 36.00 MiB. GPU 0 has a total capacity of 22.06 GiB of which 3.44 MiB is free. Process 1 has 22.05 GiB memory in use. Of the allocated memory 21.19 GiB is allocated by PyTorch, and 593.32 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf)').
+╭───────────────────── Traceback (most recent call last) ──────────────────────╮
+│ /Users/miyuhoriuchi/microbe-model/scripts/modal_train_lora.py:124 in main    │
+│                                                                              │
+│   123 │                                                                      │
+│ ❱ 124 │   payload = train.remote(                                            │
+│   125 │   │   fold=fold,                                                     │
+│                                                                              │
+│ /Users/miyuhoriuchi/microbe-model/.venv/lib/python3.11/site-packages/modal/_ │
+│ object.py:46 in wrapped                                                      │
+│                                                                              │
+│    45 │   │   await self.hydrate()                                           │
+│ ❱  46 │   │   return await method(self, *args, **kwargs)                     │
+│    47                                                                        │
+│                                                                              │
+│ /Users/miyuhoriuchi/microbe-model/.venv/lib/python3.11/site-packages/modal/_ │
+│ functions.py:1699 in remote                                                  │
+│                                                                              │
+│   1698 │   │                                                                 │
+│ ❱ 1699 │   │   return await self._call_function(args, kwargs)                │
+│   1700                                                                       │
+│                                                                              │
+│ /Users/miyuhoriuchi/microbe-model/.venv/lib/python3.11/site-packages/modal/_ │
+│ functions.py:1643 in _call_function                                          │
+│                                                                              │
+│   1642 │   │                                                                 │
+│ ❱ 1643 │   │   return await invocation.run_function()                        │
+│   1644                                                                       │
+│                                                                              │
+│ /Users/miyuhoriuchi/microbe-model/.venv/lib/python3.11/site-packages/modal/_ │
+│ functions.py:291 in run_function                                             │
+│                                                                              │
+│    290 │   │   │   item = await self._get_single_output()                    │
+│ ❱  291 │   │   │   return await _process_result(item.result, item.data_forma │
+│    292                                                                       │
+│                                                                              │
+│ /Users/miyuhoriuchi/microbe-model/.venv/lib/python3.11/site-packages/modal/_ │
+│ utils/function_utils.py:527 in _process_result                               │
+│                                                                              │
+│   526 │   │   │                                                              │
+│ ❱ 527 │   │   │   raise exc_with_hints(exc)                                  │
+│   528                                                                        │
+│                                                                              │
+│     ...Remote call to Modal Function (ta-01KRNVDM93QPX3HT2JT4Q0WHCY)...      │
+│                                                                              │
+│ /root/modal_train_lora.py:88 in train                                        │
+│                                                                              │
+│ ❱ 88 results = train_lora(                                                   │
+│                                                                              │
+│                                                                              │
+│ /root/microbe_model/train/lora_trainer.py:255 in train_lora                  │
+│                                                                              │
+│ ❱ 255 preds = model(batch["genomes"], device=device)                         │
+│                                                                              │
+│                                                                              │
+│ /usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py:1778 in   │
+│ _wrapped_call_impl                                                           │
+│                                                                              │
+│ ❱ 1778 return self._call_impl(*args, **kwargs)                               │
+│                                                                              │
+│                                                                              │
+│ /usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py:1789 in   │
+│ _call_impl                                                                   │
+│                                                                              │
+│ ❱ 1789 return forward_call(*args, **kwargs)                                  │
+│                                                                              │
+│                                                                              │
+│ /root/microbe_model/train/lora_model.py:126 in forward                       │
+│                                                                              │
+│ ❱ 126 [self.encode_genome(g, device) for g in genomes],                      │
+│                                                                              │
+│                                                                              │
+│ /root/microbe_model/train/lora_model.py:126 in <listcomp>                    │
+│                                                                              │
+│ ❱ 126 [self.encode_genome(g, device) for g in genomes],                      │
+│                                                                              │
+│                                                                              │
+│ /root/microbe_model/train/lora_model.py:111 in encode_genome                 │
+│                                                                              │
+│ ❱ 111 per_protein = self.encode_proteins(proteins, device)  # (n, D)         │
+│                                                                              │
+│                                                                              │
+│ /root/microbe_model/train/lora_model.py:99 in encode_proteins                │
+│                                                                              │
+│ ❱ 99 outputs = self.esm(**enc)                                               │
+│                                                                              │
+│                                                                              │
+│ /usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py:1778 in   │
+│ _wrapped_call_impl                                                           │
+│                                                                              │
+│ ❱ 1778 return self._call_impl(*args, **kwargs)                               │
+│                                                                              │
+│                                                                              │
+│ /usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py:1789 in   │
+│ _call_impl                                                                   │
+│                                                                              │
+│ ❱ 1789 return forward_call(*args, **kwargs)                                  │
+│                                                                              │
+│                                                                              │
+│ /usr/local/lib/python3.11/site-packages/peft/peft_model.py:945 in forward    │
+│                                                                              │
+│ ❱ 945 return self.get_base_model()(*args, **kwargs)                          │
+│                                                                              │
+│                                                                              │
+│ /usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py:1778 in   │
+│ _wrapped_call_impl                                                           │
+│                                                                              │
+│ ❱ 1778 return self._call_impl(*args, **kwargs)                               │
+│                                                                              │
+│                                                                              │
+│ /usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py:1789 in   │
+│ _call_impl                                                                   │
+│                                                                              │
+│ ❱ 1789 return forward_call(*args, **kwargs)                                  │
+│                                                                              │
+│                                                                              │
+│ /usr/local/lib/python3.11/site-packages/transformers/utils/generic.py:976 in │
+│ wrapper                                                                      │
+│                                                                              │
+│ ❱ 976 output = func(self, *args, **kwargs)                                   │
+│                                                                              │
+│                                                                              │
+│ /usr/local/lib/python3.11/site-packages/transformers/utils/output_capturing. │
+│ py:248 in wrapper                                                            │
+│                                                                              │
+│ ❱ 248 outputs = func(self, *args, **kwargs)                                  │
+│                                                                              │
+│                                                                              │
+│ /usr/local/lib/python3.11/site-packages/transformers/models/esm/modeling_esm │
+│ .py:754 in forward                                                           │
+│                                                                              │
+│ ❱ 754 encoder_outputs = self.encoder(                                        │
+│                                                                              │
+│                                                                              │
+│ /usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py:1778 in   │
+│ _wrapped_call_impl                                                           │
+│                                                                              │
+│ ❱ 1778 return self._call_impl(*args, **kwargs)                               │
+│                                                                              │
+│                                                                              │
+│ /usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py:1789 in   │
+│ _call_impl                                                                   │
+│                                                                              │
+│ ❱ 1789 return forward_call(*args, **kwargs)                                  │
+│                                                                              │
+│                                                                              │
+│ /usr/local/lib/python3.11/site-packages/transformers/utils/generic.py:900 in │
+│ wrapper                                                                      │
+│                                                                              │
+│ ❱ 900 output = func(self, *args, **kwargs)                                   │
+│                                                                              │
+│                                                                              │
+│ /usr/local/lib/python3.11/site-packages/transformers/models/esm/modeling_esm │
+│ .py:556 in forward                                                           │
+│                                                                              │
+│ ❱ 556 hidden_states = layer_module(                                          │
+│                                                                              │
+│                                                                              │
+│ /usr/local/lib/python3.11/site-packages/transformers/modeling_layers.py:93   │
+│ in __call__                                                                  │
+│                                                                              │
+│ ❱ 93 return super().__call__(*args, **kwargs)                                │
+│                                                                              │
+│                                                                              │
+│ /usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py:1778 in   │
+│ _wrapped_call_impl                                                           │
+│                                                                              │
+│ ❱ 1778 return self._call_impl(*args, **kwargs)                               │
+│                                                                              │
+│                                                                              │
+│ /usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py:1789 in   │
+│ _call_impl                                                                   │
+│                                                                              │
+│ ❱ 1789 return forward_call(*args, **kwargs)                                  │
+│                                                                              │
+│                                                                              │
+│ /usr/local/lib/python3.11/site-packages/transformers/models/esm/modeling_esm │
+│ .py:527 in forward                                                           │
+│                                                                              │
+│ ❱ 527 layer_output = self.feed_forward_chunk(attention_output)               │
+│                                                                              │
+│                                                                              │
+│ /usr/local/lib/python3.11/site-packages/transformers/models/esm/modeling_esm │
+│ .py:532 in feed_forward_chunk                                                │
+│                                                                              │
+│ ❱ 532 intermediate_output = self.intermediate(attention_output_ln)           │
+│                                                                              │
+│                                                                              │
+│ /usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py:1778 in   │
+│ _wrapped_call_impl                                                           │
+│                                                                              │
+│ ❱ 1778 return self._call_impl(*args, **kwargs)                               │
+│                                                                              │
+│                                                                              │
+│ /usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py:1789 in   │
+│ _call_impl                                                                   │
+│                                                                              │
+│ ❱ 1789 return forward_call(*args, **kwargs)                                  │
+│                                                                              │
+│                                                                              │
+│ /usr/local/lib/python3.11/site-packages/transformers/models/esm/modeling_esm │
+│ .py:462 in forward                                                           │
+│                                                                              │
+│ ❱ 462 hidden_states = gelu(hidden_states)                                    │
+│                                                                              │
+│                                                                              │
+│ /usr/local/lib/python3.11/site-packages/transformers/models/esm/modeling_esm │
+│ .py:86 in gelu                                                               │
+│                                                                              │
+│ ❱ 86 return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))                  │
+│                                                                              │
+╰──────────────────────────────────────────────────────────────────────────────╯
+OutOfMemoryError: CUDA out of memory. Tried to allocate 36.00 MiB. GPU 0 has a
+total capacity of 22.06 GiB of which 3.44 MiB is free. Process 1 has 22.05 GiB
+memory in use. Of the allocated memory 21.19 GiB is allocated by PyTorch, and
+593.32 MiB is reserved by PyTorch but unallocated. If reserved but unallocated
+memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to
+avoid fragmentation.  See documentation for Memory Management
+(https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-wi
+th-pytorch-cuda-alloc-conf)
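The OOM message above packs several sizes into one sentence. As a rough triage aid, the sketch below pulls them out with regular expressions so the reserved-but-unallocated share can be compared with total capacity. `parse_cuda_oom` and its field names are hypothetical helpers, not part of this repository or of PyTorch, and the patterns assume the message wording shown in this log; other PyTorch versions may phrase it differently.

```python
import re

# Conversion factors to MiB for the units that appear in the message.
_TO_MIB = {"KiB": 1.0 / 1024, "MiB": 1.0, "GiB": 1024.0}

# Hypothetical field names; the patterns mirror the wording in the log above.
_PATTERNS = {
    "requested_mib": r"Tried to allocate ([\d.]+) (KiB|MiB|GiB)",
    "capacity_mib": r"total capacity of ([\d.]+) (KiB|MiB|GiB)",
    "free_mib": r"of which ([\d.]+) (KiB|MiB|GiB) is free",
    "allocated_mib": r"([\d.]+) (KiB|MiB|GiB) is allocated by PyTorch",
    "reserved_unallocated_mib": r"([\d.]+) (KiB|MiB|GiB) is reserved by PyTorch",
}


def parse_cuda_oom(message: str) -> dict:
    """Extract sizes (in MiB) from a CUDA out-of-memory message."""
    # Collapse line wraps so patterns still match in wrapped log output.
    flat = " ".join(message.split())
    sizes = {}
    for field, pattern in _PATTERNS.items():
        match = re.search(pattern, flat)
        if match:
            value, unit = match.groups()
            sizes[field] = float(value) * _TO_MIB[unit]
    return sizes


if __name__ == "__main__":
    msg = (
        "CUDA out of memory. Tried to allocate 36.00 MiB. GPU 0 has a total "
        "capacity of 22.06 GiB of which 3.44 MiB is free. Of the allocated "
        "memory 21.19 GiB is allocated by PyTorch, and 593.32 MiB is reserved "
        "by PyTorch but unallocated."
    )
    sizes = parse_cuda_oom(msg)
    share = sizes["reserved_unallocated_mib"] / sizes["capacity_mib"]
    print(f"reserved but unallocated: {share:.1%} of capacity")
```

For this log the reserved-but-unallocated share is about 2.6% of capacity. That suggests fragmentation is a minor factor here, so shrinking the work done per forward pass in `encode_proteins` is probably a better first step than `expandable_segments:True` alone.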

artifacts/lora_smoke2.log ADDED

	@@ -0,0 +1,42 @@

+✓ Initialized. View run at
+https://modal.com/apps/miyu-wannaone-horiuchi/main/ap-14sZzX1W4YsqgIp2y29INw
+✓ Created objects.
+├── 🔨 Created mount
+│   /Users/miyuhoriuchi/microbe-model/scripts/modal_train_lora.py
+├── 🔨 Created mount PythonPackage:microbe_model
+├── 🔨 Created mount data/marker_sequences.jsonl
+├── 🔨 Created mount data/bacdive_phenotypes.parquet
+├── 🔨 Created mount data/strain_catalog.parquet
+└── 🔨 Created function train.
+[modal-lora] fold=0 epochs=1 model=facebook/esm2_t12_35M_UR50D smoke=True
+[smoke] truncated to 200 sequences
+[lora] device = cuda
+[lora] loaded 200 records with sequences + labels
+[lora] fold 0: 28 train / 172 val
+Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
+Loading weights:   0%|          | 0/198 [00:00<?, ?it/s]
+Loading weights: 100%|██████████| 198/198 [00:00<00:00, 5899.50it/s]
+[transformers] EsmModel LOAD REPORT from: facebook/esm2_t12_35M_UR50D
+Key                       | Status     |
+--------------------------+------------+-
+lm_head.bias              | UNEXPECTED |
+lm_head.dense.weight      | UNEXPECTED |
+lm_head.layer_norm.weight | UNEXPECTED |
+lm_head.dense.bias        | UNEXPECTED |
+lm_head.layer_norm.bias   | UNEXPECTED |
+pooler.dense.weight       | MISSING    |
+pooler.dense.bias         | MISSING    |
+Notes:
+- UNEXPECTED:   can be ignored when loading from different task/architecture; not ok if you expect identical arch.
+- MISSING:      those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
+[lora] trainable params: 2,151,815 / total: 35,652,216 (6.04%)
+[transformers] Caching is incompatible with gradient checkpointing in EsmLayer. Setting `use_cache=False`.
+[lora] epoch 1 done in 61s  val={'temp': {'mae': 28.067121505737305, 'n': 172}, 'ph': {'mae': None, 'n': 0}, 'salt': {'mae': None, 'n': 0}, 'oxy': {'f1_macro': None, 'n': 0}}
+[lora] wrote /artifacts/lora/fold0_results.json
+[modal-lora] wrote artifacts/lora/fold0_results_smoke.json
+[modal-lora] wrote artifacts/lora/fold0_best_smoke.pt  (8.6 MB)
+Stopping app - local entrypoint completed.
+✓ App completed. View run at
+https://modal.com/apps/miyu-wannaone-horiuchi/main/ap-14sZzX1W4YsqgIp2y29INw

artifacts/lora_vs_baseline.md ADDED

	@@ -0,0 +1,16 @@

+# LoRA Fold 0 vs Tabular Baseline
+Caveat: the LoRA numbers come from a single group fold (fold 0), while the baseline is the current five-fold mean, so the comparison is indicative only.
+| Target | LoRA | Baseline | Delta | Verdict |
+|---|---:|---:|---:|---|
+| `optimal_temperature_c` MAE | 3.6660 | 2.6743 | +0.9917 | worse |
+| `optimal_ph` MAE | 0.5598 | 0.4685 | +0.0913 | worse |
+| `salt_tolerance_pct` MAE | 1.8154 | 1.9171 | -0.1017 | better |
+| `oxygen_requirement` macro F1 | 0.9448 | 0.4020 | +0.5428 | better |
+## Recommendation
+The first LoRA pass is strongest on oxygen classification. For the next GPU run, use `scripts/lambda_train_lora.py --target-preset oxygen` rather than spending more A100 time on the regression targets: temperature and pH MAE were worse than the tabular baseline, and salt tolerance MAE improved only slightly (-0.1017).
+Keep `artifacts/lora/fold0_best.pt` outside git unless it is published to a model store or release asset; the JSON metrics and log are enough for repo history.
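The Delta and Verdict columns above follow a simple rule: delta is LoRA minus baseline, and the sign is read according to whether the metric is an error (MAE, lower is better) or a score (macro F1, higher is better). A minimal sketch of that rule, with values copied from the table; the `compare` helper and the `lower_is_better` flag are illustrative names, not code from the repository:

```python
# Rows copied from the table above: (target, LoRA, baseline, lower_is_better).
ROWS = [
    ("optimal_temperature_c MAE", 3.6660, 2.6743, True),
    ("optimal_ph MAE", 0.5598, 0.4685, True),
    ("salt_tolerance_pct MAE", 1.8154, 1.9171, True),
    ("oxygen_requirement macro F1", 0.9448, 0.4020, False),
]


def compare(lora: float, baseline: float, lower_is_better: bool):
    """Return (delta, verdict) where delta = LoRA - baseline."""
    delta = round(lora - baseline, 4)
    if delta == 0:
        return delta, "tie"
    improved = (delta < 0) if lower_is_better else (delta > 0)
    return delta, "better" if improved else "worse"


if __name__ == "__main__":
    for target, lora, baseline, lower_is_better in ROWS:
        delta, verdict = compare(lora, baseline, lower_is_better)
        print(f"{target}: {delta:+.4f} ({verdict})")
```

Keeping the direction of each metric explicit avoids reading the positive oxygen delta as a regression alongside the positive MAE deltas.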

artifacts/lora_weighted_anaerobe/fold0_oxygen_diagnostics.json ADDED

	@@ -0,0 +1,321 @@

+{
+  "checkpoint": "artifacts/lora_weighted_anaerobe/fold0_best.pt",
+  "n": 2266,
+  "classes": [
+    "aerobe",
+    "anaerobe",
+    "facultative_anaerobe",
+    "microaerobe"
+  ],
+  "accuracy": 0.953663,
+  "macro_f1": 0.944776,
+  "macro_f1_all_classes": 0.472388,
+  "confusion_matrix": [
+    [
+      1535,
+      38,
+      0,
+      0
+    ],
+    [
+      67,
+      626,
+      0,
+      0
+    ],
+    [
+      0,
+      0,
+      0,
+      0
+    ],
+    [
+      0,
+      0,
+      0,
+      0
+    ]
+  ],
+  "per_class": {
+    "aerobe": {
+      "precision": 0.958177,
+      "recall": 0.975842,
+      "f1": 0.966929,
+      "support": 1573,
+      "predicted": 1602
+    },
+    "anaerobe": {
+      "precision": 0.942771,
+      "recall": 0.903319,
+      "f1": 0.922623,
+      "support": 693,
+      "predicted": 664
+    },
+    "facultative_anaerobe": {
+      "precision": 0.0,
+      "recall": 0.0,
+      "f1": 0.0,
+      "support": 0,
+      "predicted": 0
+    },
+    "microaerobe": {
+      "precision": 0.0,
+      "recall": 0.0,
+      "f1": 0.0,
+      "support": 0,
+      "predicted": 0
+    }
+  },
+  "wrong_predictions": [
+    {
+      "bacdive_id": 499,
+      "genome_accession": "GCA_000429505",
+      "group": "Alteromonadaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.997351,
+      "true_probability": 0.002648,
+      "margin": 0.994702
+    },
+    {
+      "bacdive_id": 481,
+      "genome_accession": "GCA_003363485",
+      "group": "Alteromonadaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.996132,
+      "true_probability": 0.003865,
+      "margin": 0.992268
+    },
+    {
+      "bacdive_id": 168525,
+      "genome_accession": "GCA_006386545",
+      "group": "Sphaerotilaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.995269,
+      "true_probability": 0.004727,
+      "margin": 0.990543
+    },
+    {
+      "bacdive_id": 6314,
+      "genome_accession": "GCA_900475835",
+      "group": "Jonesiaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.995095,
+      "true_probability": 0.004901,
+      "margin": 0.990193
+    },
+    {
+      "bacdive_id": 498,
+      "genome_accession": "GCA_000429485",
+      "group": "Alteromonadaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.994422,
+      "true_probability": 0.005572,
+      "margin": 0.988851
+    },
+    {
+      "bacdive_id": 17841,
+      "genome_accession": "GCA_000975055",
+      "group": "Demequinaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.993803,
+      "true_probability": 0.006191,
+      "margin": 0.987612
+    },
+    {
+      "bacdive_id": 483,
+      "genome_accession": "GCA_000421165",
+      "group": "Alteromonadaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.990431,
+      "true_probability": 0.009555,
+      "margin": 0.980876
+    },
+    {
+      "bacdive_id": 23140,
+      "genome_accession": "GCA_016925555",
+      "group": "Mycoplasmataceae",
+      "true": "aerobe",
+      "pred": "anaerobe",
+      "confidence": 0.981044,
+      "true_probability": 0.018747,
+      "margin": 0.962297
+    },
+    {
+      "bacdive_id": 17840,
+      "genome_accession": "GCA_000975035",
+      "group": "Demequinaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.979909,
+      "true_probability": 0.020063,
+      "margin": 0.959846
+    },
+    {
+      "bacdive_id": 148058,
+      "genome_accession": "GCF_000005845.2",
+      "group": "Enterobacteriaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.977617,
+      "true_probability": 0.022345,
+      "margin": 0.955271
+    },
+    {
+      "bacdive_id": 149706,
+      "genome_accession": "GCF_000005845.2",
+      "group": "Enterobacteriaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.977617,
+      "true_probability": 0.022345,
+      "margin": 0.955271
+    },
+    {
+      "bacdive_id": 156346,
+      "genome_accession": "GCF_000005845.2",
+      "group": "Enterobacteriaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.977617,
+      "true_probability": 0.022345,
+      "margin": 0.955271
+    },
+    {
+      "bacdive_id": 140022,
+      "genome_accession": "GCF_014068355.1",
+      "group": "Mycoplasmataceae",
+      "true": "aerobe",
+      "pred": "anaerobe",
+      "confidence": 0.976127,
+      "true_probability": 0.02361,
+      "margin": 0.952517
+    },
+    {
+      "bacdive_id": 164735,
+      "genome_accession": "GCF_943590815.1",
+      "group": "Enterobacteriaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.969284,
+      "true_probability": 0.030661,
+      "margin": 0.938623
+    },
+    {
+      "bacdive_id": 133974,
+      "genome_accession": "GCA_001544515",
+      "group": "Burkholderiaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.965083,
+      "true_probability": 0.034798,
+      "margin": 0.930284
+    },
+    {
+      "bacdive_id": 133484,
+      "genome_accession": "GCA_900102145",
+      "group": "Thermaceae",
+      "true": "aerobe",
+      "pred": "anaerobe",
+      "confidence": 0.955376,
+      "true_probability": 0.044292,
+      "margin": 0.911083
+    },
+    {
+      "bacdive_id": 154004,
+      "genome_accession": "GCF_982443925.1",
+      "group": "Enterobacteriaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.953645,
+      "true_probability": 0.046276,
+      "margin": 0.907368
+    },
+    {
+      "bacdive_id": 168303,
+      "genome_accession": "GCA_004307015",
+      "group": "Thermaceae",
+      "true": "aerobe",
+      "pred": "anaerobe",
+      "confidence": 0.949521,
+      "true_probability": 0.050177,
+      "margin": 0.899345
+    },
+    {
+      "bacdive_id": 133976,
+      "genome_accession": "GCA_001544495",
+      "group": "Burkholderiaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.94727,
+      "true_probability": 0.052558,
+      "margin": 0.894712
+    },
+    {
+      "bacdive_id": 8608,
+      "genome_accession": "GCF_900476065.1",
+      "group": "Metamycoplasmataceae",
+      "true": "aerobe",
+      "pred": "anaerobe",
+      "confidence": 0.940865,
+      "true_probability": 0.058377,
+      "margin": 0.882487
+    },
+    {
+      "bacdive_id": 132346,
+      "genome_accession": "GCA_002259755",
+      "group": "Bifidobacteriaceae",
+      "true": "aerobe",
+      "pred": "anaerobe",
+      "confidence": 0.937469,
+      "true_probability": 0.062389,
+      "margin": 0.87508
+    },
+    {
+      "bacdive_id": 133975,
+      "genome_accession": "GCA_001544475",
+      "group": "Burkholderiaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.933766,
+      "true_probability": 0.06607,
+      "margin": 0.867696
+    },
+    {
+      "bacdive_id": 132158,
+      "genome_accession": "GCA_000807275",
+      "group": "Orbaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.933592,
+      "true_probability": 0.06628,
+      "margin": 0.867312
+    },
+    {
+      "bacdive_id": 140694,
+      "genome_accession": "GCA_003336745",
+      "group": "Thermaceae",
+      "true": "aerobe",
+      "pred": "anaerobe",
+      "confidence": 0.9297,
+      "true_probability": 0.069824,
+      "margin": 0.859876
+    },
+    {
+      "bacdive_id": 133991,
+      "genome_accession": "GCA_039544205",
+      "group": "Ferrimonadaceae",
+      "true": "anaerobe",
+      "pred": "aerobe",
+      "confidence": 0.925231,
+      "true_probability": 0.074638,
+      "margin": 0.850593
+    }
+  ]
+}
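The two macro F1 values in this file (`macro_f1` and `macro_f1_all_classes`) differ only in which classes are averaged. The sketch below recomputes the per-class metrics and both averages from the `confusion_matrix` above. It is an independent reconstruction under the assumption that rows are true labels and columns are predictions (as the companion Markdown table states); the function names are illustrative and this is not the repository's diagnostics code.

```python
CLASSES = ["aerobe", "anaerobe", "facultative_anaerobe", "microaerobe"]

# Copied from "confusion_matrix" above; rows = true class, columns = predicted.
CM = [
    [1535, 38, 0, 0],
    [67, 626, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]


def per_class_metrics(cm, classes):
    """Precision, recall, F1, support, and predicted count per class."""
    metrics = {}
    for i, name in enumerate(classes):
        tp = cm[i][i]
        support = sum(cm[i])                     # true rows for this class
        predicted = sum(row[i] for row in cm)    # times this class was predicted
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = support + predicted
        f1 = 2 * tp / denom if denom else 0.0
        metrics[name] = {
            "precision": precision, "recall": recall, "f1": f1,
            "support": support, "predicted": predicted,
        }
    return metrics


def macro_f1(metrics, supported_only):
    """Average F1 over classes, optionally skipping classes with no support."""
    f1s = [m["f1"] for m in metrics.values()
           if m["support"] > 0 or not supported_only]
    return sum(f1s) / len(f1s)


def accuracy(cm):
    """Share of rows on the diagonal."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


if __name__ == "__main__":
    m = per_class_metrics(CM, CLASSES)
    print(round(accuracy(CM), 6))
    print(round(macro_f1(m, supported_only=True), 6))
    print(round(macro_f1(m, supported_only=False), 6))
```

Because this fold has no `facultative_anaerobe` or `microaerobe` rows, the all-classes average is exactly half of the supported-classes average, so the two figures should not be compared against each other or against a baseline computed under the other convention.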

artifacts/lora_weighted_anaerobe/fold0_oxygen_diagnostics.md ADDED

	@@ -0,0 +1,56 @@

+# LoRA Oxygen Diagnostics
+Checkpoint: `artifacts/lora_weighted_anaerobe/fold0_best.pt`
+- Labeled validation rows: `2266`
+- Accuracy: `0.9537`
+- Macro F1 (supported classes): `0.9448`
+- Macro F1 (all configured classes): `0.4724`
+## Per-Class Metrics
+| Class | Precision | Recall | F1 | Support | Predicted |
+|---|---:|---:|---:|---:|---:|
+| aerobe | 0.9582 | 0.9758 | 0.9669 | 1573 | 1602 |
+| anaerobe | 0.9428 | 0.9033 | 0.9226 | 693 | 664 |
+| facultative_anaerobe | 0.0000 | 0.0000 | 0.0000 | 0 | 0 |
+| microaerobe | 0.0000 | 0.0000 | 0.0000 | 0 | 0 |
+## Confusion Matrix
+| True \ Pred | aerobe | anaerobe | facultative_anaerobe | microaerobe |
+|---|---:|---:|---:|---:|
+| aerobe | 1535 | 38 | 0 | 0 |
+| anaerobe | 67 | 626 | 0 | 0 |
+| facultative_anaerobe | 0 | 0 | 0 | 0 |
+| microaerobe | 0 | 0 | 0 | 0 |
+## High-Confidence Wrong Predictions
+| BacDive ID | Genome | Group | True | Pred | Confidence | True Prob. | Margin |
+|---:|---|---|---|---|---:|---:|---:|
+| 499 | GCA_000429505 | Alteromonadaceae | anaerobe | aerobe | 0.9974 | 0.0026 | 0.9947 |
+| 481 | GCA_003363485 | Alteromonadaceae | anaerobe | aerobe | 0.9961 | 0.0039 | 0.9923 |
+| 168525 | GCA_006386545 | Sphaerotilaceae | anaerobe | aerobe | 0.9953 | 0.0047 | 0.9905 |
+| 6314 | GCA_900475835 | Jonesiaceae | anaerobe | aerobe | 0.9951 | 0.0049 | 0.9902 |
+| 498 | GCA_000429485 | Alteromonadaceae | anaerobe | aerobe | 0.9944 | 0.0056 | 0.9889 |
+| 17841 | GCA_000975055 | Demequinaceae | anaerobe | aerobe | 0.9938 | 0.0062 | 0.9876 |
+| 483 | GCA_000421165 | Alteromonadaceae | anaerobe | aerobe | 0.9904 | 0.0096 | 0.9809 |
+| 23140 | GCA_016925555 | Mycoplasmataceae | aerobe | anaerobe | 0.9810 | 0.0187 | 0.9623 |
+| 17840 | GCA_000975035 | Demequinaceae | anaerobe | aerobe | 0.9799 | 0.0201 | 0.9598 |
+| 148058 | GCF_000005845.2 | Enterobacteriaceae | anaerobe | aerobe | 0.9776 | 0.0223 | 0.9553 |
+| 149706 | GCF_000005845.2 | Enterobacteriaceae | anaerobe | aerobe | 0.9776 | 0.0223 | 0.9553 |
+| 156346 | GCF_000005845.2 | Enterobacteriaceae | anaerobe | aerobe | 0.9776 | 0.0223 | 0.9553 |
+| 140022 | GCF_014068355.1 | Mycoplasmataceae | aerobe | anaerobe | 0.9761 | 0.0236 | 0.9525 |
+| 164735 | GCF_943590815.1 | Enterobacteriaceae | anaerobe | aerobe | 0.9693 | 0.0307 | 0.9386 |
+| 133974 | GCA_001544515 | Burkholderiaceae | anaerobe | aerobe | 0.9651 | 0.0348 | 0.9303 |
+| 133484 | GCA_900102145 | Thermaceae | aerobe | anaerobe | 0.9554 | 0.0443 | 0.9111 |
+| 154004 | GCF_982443925.1 | Enterobacteriaceae | anaerobe | aerobe | 0.9536 | 0.0463 | 0.9074 |
+| 168303 | GCA_004307015 | Thermaceae | aerobe | anaerobe | 0.9495 | 0.0502 | 0.8993 |
+| 133976 | GCA_001544495 | Burkholderiaceae | anaerobe | aerobe | 0.9473 | 0.0526 | 0.8947 |
+| 8608 | GCF_900476065.1 | Metamycoplasmataceae | aerobe | anaerobe | 0.9409 | 0.0584 | 0.8825 |
+| 132346 | GCA_002259755 | Bifidobacteriaceae | aerobe | anaerobe | 0.9375 | 0.0624 | 0.8751 |
+| 133975 | GCA_001544475 | Burkholderiaceae | anaerobe | aerobe | 0.9338 | 0.0661 | 0.8677 |
+| 132158 | GCA_000807275 | Orbaceae | anaerobe | aerobe | 0.9336 | 0.0663 | 0.8673 |
+| 140694 | GCA_003336745 | Thermaceae | aerobe | anaerobe | 0.9297 | 0.0698 | 0.8599 |
+| 133991 | GCA_039544205 | Ferrimonadaceae | anaerobe | aerobe | 0.9252 | 0.0746 | 0.8506 |
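
The summary figures in this report can be re-derived from the confusion matrix alone. The sketch below is a minimal check, not the repository's diagnostic script: the function name `per_class_metrics` is hypothetical, and it assumes that "supported classes" means classes with at least one true validation row.

```python
# Confusion matrix copied from the report above (rows = true, columns = predicted).
CLASSES = ["aerobe", "anaerobe", "facultative_anaerobe", "microaerobe"]
CONFUSION = [
    [1535, 38, 0, 0],
    [67, 626, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]


def per_class_metrics(cm):
    """Return {class_index: (precision, recall, f1, support)} (hypothetical helper)."""
    n = len(cm)
    out = {}
    for k in range(n):
        tp = cm[k][k]
        support = sum(cm[k])                         # rows whose true label is k
        predicted = sum(cm[r][k] for r in range(n))  # rows predicted as k
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        out[k] = (precision, recall, f1, support)
    return out


metrics = per_class_metrics(CONFUSION)
total = sum(sum(row) for row in CONFUSION)
accuracy = sum(CONFUSION[k][k] for k in range(len(CLASSES))) / total

f1_all = [metrics[k][2] for k in range(len(CLASSES))]
# Assumption: a class is "supported" when it has at least one true row.
f1_supported = [metrics[k][2] for k in range(len(CLASSES)) if metrics[k][3] > 0]
macro_f1_supported = sum(f1_supported) / len(f1_supported)
macro_f1_all = sum(f1_all) / len(f1_all)
```

Because `facultative_anaerobe` and `microaerobe` have no validation rows in this fold, averaging over all four configured classes halves the macro F1 relative to the supported-class figure.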

artifacts/lora_weighted_anaerobe/fold0_results.json ADDED

	@@ -0,0 +1,88 @@

+{
+  "model_cfg": {
+    "esm_model_name": "facebook/esm2_t12_35M_UR50D",
+    "lora_r": 8,
+    "lora_alpha": 16,
+    "lora_dropout": 0.05,
+    "lora_target": [
+      "query",
+      "value"
+    ],
+    "head_hidden_dim": 128,
+    "head_dropout": 0.1,
+    "max_seq_len": 512,
+    "max_proteins_per_cat": 6,
+    "gradient_checkpointing": true
+  },
+  "train_cfg": {
+    "fold": 0,
+    "epochs": 1,
+    "batch_size": 2,
+    "grad_accum": 8,
+    "lora_lr": 0.0001,
+    "head_lr": 0.001,
+    "weight_decay": 0.01,
+    "warmup_frac": 0.05,
+    "bf16": true,
+    "max_proteins_per_category": 16,
+    "save_dir": "artifacts/lora_weighted_anaerobe",
+    "grad_clip": 1.0,
+    "temp_weight": 1.0,
+    "ph_weight": 1.0,
+    "salt_weight": 1.0,
+    "oxy_weight": 1.0,
+    "oxy_class_weights": [
+      1.0,
+      1.5,
+      1.0,
+      1.0
+    ]
+  },
+  "history": [
+    {
+      "epoch": 1,
+      "train_loss": 51.366838841350884,
+      "val": {
+        "temp": {
+          "mae": 3.659961462020874,
+          "n": 8066
+        },
+        "ph": {
+          "mae": 0.564297616481781,
+          "n": 705
+        },
+        "salt": {
+          "mae": 1.7560514211654663,
+          "n": 424
+        },
+        "oxy": {
+          "f1_macro": 0.9447762839519784,
+          "n": 2266
+        }
+      },
+      "elapsed_s": 19573.33940076828
+    }
+  ],
+  "best": {
+    "epoch": 1,
+    "val": {
+      "temp": {
+        "mae": 3.659961462020874,
+        "n": 8066
+      },
+      "ph": {
+        "mae": 0.564297616481781,
+        "n": 705
+      },
+      "salt": {
+        "mae": 1.7560514211654663,
+        "n": 424
+      },
+      "oxy": {
+        "f1_macro": 0.9447762839519784,
+        "n": 2266
+      }
+    },
+    "score": 5.035534215716143
+  }
+}
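
The `best.score` value in this file is numerically consistent with the sum of the three regression MAEs minus the oxygen macro F1. That is an observation from the numbers above, not something the training code shown here confirms; `combined_score` below is a hypothetical reconstruction. Since every task weight in `train_cfg` is `1.0`, a weighted version would give the same number for this run.

```python
# Values copied from the "best" entry of fold0_results.json above.
best_val = {
    "temp": {"mae": 3.659961462020874},
    "ph": {"mae": 0.564297616481781},
    "salt": {"mae": 1.7560514211654663},
    "oxy": {"f1_macro": 0.9447762839519784},
}
reported_score = 5.035534215716143


def combined_score(val):
    """Hypothetical reconstruction: regression errors added, F1 subtracted."""
    mae_sum = sum(val[task]["mae"] for task in ("temp", "ph", "salt"))
    return mae_sum - val["oxy"]["f1_macro"]


score = combined_score(best_val)
difference = abs(score - reported_score)
```

If this reading is right, a lower score is better, and the temperature MAE dominates the total because it is on a larger numeric scale than the other terms.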

artifacts/lora_weighted_anaerobe/fold0_weighted_anaerobe_comparison.md ADDED

	@@ -0,0 +1,45 @@

+# Fold 0 Anaerobe-Weighted LoRA Comparison
+This run tested whether increasing the oxygen cross-entropy weight for
+`anaerobe` improves the fold 0 oxygen result.
+Experiment:
+- Base setup: all-task LoRA, 1 epoch, fold 0
+- Oxygen class weights: `aerobe=1.0`, `anaerobe=1.5`,
+  `facultative_anaerobe=1.0`, `microaerobe=1.0`
+- Output directory: `artifacts/lora_weighted_anaerobe`
+## Training Validation Metric
+| Checkpoint | Oxygen macro F1 | Oxygen n | Notes |
+|---|---:|---:|---|
+| `artifacts/lora/fold0_best.pt` | 0.944823 | 2266 | Original all-task LoRA |
+| `artifacts/lora_weighted_anaerobe/fold0_best.pt` | 0.944776 | 2266 | Anaerobe-weighted all-task LoRA |
+## Detailed Diagnostic
+The weighted checkpoint diagnostic reports:
+| Class | Precision | Recall | F1 | Support |
+|---|---:|---:|---:|---:|
+| aerobe | 0.958177 | 0.975842 | 0.966929 | 1573 |
+| anaerobe | 0.942771 | 0.903319 | 0.922623 | 693 |
+Confusion matrix:
+| True \ Pred | aerobe | anaerobe | facultative_anaerobe | microaerobe |
+|---|---:|---:|---:|---:|
+| aerobe | 1535 | 38 | 0 | 0 |
+| anaerobe | 67 | 626 | 0 | 0 |
+| facultative_anaerobe | 0 | 0 | 0 | 0 |
+| microaerobe | 0 | 0 | 0 | 0 |
+## Conclusion
+The `anaerobe=1.5` class-weight experiment does not beat the original all-task
+LoRA on fold 0. It slightly improves anaerobe recall in the detailed diagnostic,
+but the overall oxygen macro F1 is fractionally lower than the original all-task
+checkpoint.
+Keep `artifacts/lora/fold0_best.pt` as the best fold 0 checkpoint for now.
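
For readers unfamiliar with class-weighted cross-entropy, the sketch below shows one common formulation in plain Python. It is an illustration under stated assumptions, not the training script's loss: `weighted_cross_entropy` is a hypothetical name, and normalising by the sum of the target weights follows the convention of PyTorch's `CrossEntropyLoss(weight=..., reduction="mean")`. Whether the run above used exactly that loss is not shown in this diff.

```python
import math

# Class order assumed to match the report: aerobe, anaerobe,
# facultative_anaerobe, microaerobe.
OXY_CLASS_WEIGHTS = [1.0, 1.5, 1.0, 1.0]
UNIFORM_WEIGHTS = [1.0, 1.0, 1.0, 1.0]


def weighted_cross_entropy(logits, targets, class_weights):
    """Weighted-mean cross-entropy over a batch of logit rows (hypothetical helper)."""
    weighted_sum = 0.0
    weight_total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        nll = log_z - row[target]
        w = class_weights[target]
        weighted_sum += w * nll
        weight_total += w
    return weighted_sum / weight_total


# Toy batch (made-up logits): the model favours "aerobe" for both rows,
# so the second row, whose true label is "anaerobe", is a confident error.
toy_logits = [[2.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]]
toy_targets = [0, 1]

loss_uniform = weighted_cross_entropy(toy_logits, toy_targets, UNIFORM_WEIGHTS)
loss_weighted = weighted_cross_entropy(toy_logits, toy_targets, OXY_CLASS_WEIGHTS)
```

With the `1.5` weight, the misclassified anaerobe row contributes a larger share of the batch loss, which is the intended pressure toward higher anaerobe recall.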

artifacts/materialize_pme.log ADDED

	@@ -0,0 +1,13 @@

+  read 5,000 lines, 3,186 unique genomes
+  read 10,000 lines, 6,012 unique genomes
+  read 15,000 lines, 8,193 unique genomes
+  read 20,000 lines, 9,381 unique genomes
+  read 25,000 lines, 11,269 unique genomes
+  read 30,000 lines, 13,720 unique genomes
+  read 35,000 lines, 15,993 unique genomes
+  read 40,000 lines, 18,247 unique genomes
+  read 45,000 lines, 19,703 unique genomes
+  read 50,000 lines, 21,570 unique genomes
+  read 55,000 lines, 22,300 unique genomes
+Parsed 55,046 lines → 22,300 unique genomes (83.2s)
+Wrote 22,300 rows × 5131 cols → /Users/miyuhoriuchi/microbe-model/data/per_marker_embeddings.parquet (615.9 MB, 170.5s)

artifacts/media_recommender_drylab_benchmark.json ADDED

	@@ -0,0 +1,343 @@

+{
+  "split_mode": "family",
+  "n_splits": 5,
+  "n_features": 1113,
+  "n_media": 40,
+  "n_estimators": 100,
+  "elapsed_s": 906.581169128418,
+  "model": {
+    "n_eval": 21050.0,
+    "mean_true_media": 1.3330166270783848,
+    "mrr": 0.5880700888181283,
+    "hit_at_1": 0.4498812351543943,
+    "recall_at_1": 0.3553784639746635,
+    "precision_at_1": 0.4498812351543943,
+    "hit_at_3": 0.6598574821852732,
+    "recall_at_3": 0.5818313539192399,
+    "precision_at_3": 0.25814726840855107,
+    "hit_at_5": 0.7752969121140143,
+    "recall_at_5": 0.7217062549485351,
+    "precision_at_5": 0.19352969121140143,
+    "hit_at_10": 0.8652256532066508,
+    "recall_at_10": 0.8449089469517023,
+    "precision_at_10": 0.11457957244655581,
+    "median_roc_auc": 0.9099485141368615,
+    "median_pr_auc": 0.18301671580977655
+  },
+  "global_popularity": {
+    "n_eval": 21050.0,
+    "mean_true_media": 1.3330166270783848,
+    "mrr": 0.24343189838094925,
+    "hit_at_1": 0.08019002375296912,
+    "recall_at_1": 0.05362549485352335,
+    "precision_at_1": 0.08019002375296912,
+    "hit_at_3": 0.25002375296912116,
+    "recall_at_3": 0.19952810768012663,
+    "precision_at_3": 0.08673000791765638,
+    "hit_at_5": 0.3660807600950119,
+    "recall_at_5": 0.3077403008709422,
+    "precision_at_5": 0.0798479809976247,
+    "hit_at_10": 0.7987648456057007,
+    "recall_at_10": 0.7146753760886778,
+    "precision_at_10": 0.09205225653206651
+  },
+  "taxonomy_popularity": {
+    "n_eval": 21050.0,
+    "mean_true_media": 1.3330166270783848,
+    "mrr": 0.2501139457470768,
+    "hit_at_1": 0.08646080760095012,
+    "recall_at_1": 0.05863737133808392,
+    "precision_at_1": 0.08646080760095012,
+    "hit_at_3": 0.25914489311163896,
+    "recall_at_3": 0.20742201108471894,
+    "precision_at_3": 0.09002375296912114,
+    "hit_at_5": 0.37244655581947744,
+    "recall_at_5": 0.31233254156769596,
+    "precision_at_5": 0.08098812351543942,
+    "hit_at_10": 0.8009976247030879,
+    "recall_at_10": 0.7152692003167063,
+    "precision_at_10": 0.09202375296912114
+  },
+  "per_medium_auc": [
+    {
+      "medium_id": "1",
+      "n_pos": 2102,
+      "n_neg": 23626,
+      "pr_auc": 0.27484617227464014,
+      "roc_auc": 0.8452226670886136
+    },
+    {
+      "medium_id": "104",
+      "n_pos": 472,
+      "n_neg": 25256,
+      "pr_auc": 0.18128419818337677,
+      "roc_auc": 0.9484343458577389
+    },
+    {
+      "medium_id": "104b",
+      "n_pos": 105,
+      "n_neg": 25623,
+      "pr_auc": 0.13838455803689761,
+      "roc_auc": 0.9674838268445575
+    },
+    {
+      "medium_id": "104c",
+      "n_pos": 104,
+      "n_neg": 25624,
+      "pr_auc": 0.03838926719297028,
+      "roc_auc": 0.9064871574725618
+    },
+    {
+      "medium_id": "1076b",
+      "n_pos": 116,
+      "n_neg": 25612,
+      "pr_auc": 0.7772565185763604,
+      "roc_auc": 0.9216801997447318
+    },
+    {
+      "medium_id": "11",
+      "n_pos": 495,
+      "n_neg": 25233,
+      "pr_auc": 0.42285303544912106,
+      "roc_auc": 0.9656590075446334
+    },
+    {
+      "medium_id": "110",
+      "n_pos": 257,
+      "n_neg": 25471,
+      "pr_auc": 0.11044863098458187,
+      "roc_auc": 0.953733680800031
+    },
+    {
+      "medium_id": "1203a",
+      "n_pos": 186,
+      "n_neg": 25542,
+      "pr_auc": 0.21768704520819743,
+      "roc_auc": 0.9528128033691925
+    },
+    {
+      "medium_id": "1611",
+      "n_pos": 102,
+      "n_neg": 25626,
+      "pr_auc": 0.09711080696747246,
+      "roc_auc": 0.9282099751630924
+    },
+    {
+      "medium_id": "1a",
+      "n_pos": 179,
+      "n_neg": 25549,
+      "pr_auc": 0.050322245145825104,
+      "roc_auc": 0.9074889723351185
+    },
+    {
+      "medium_id": "215",
+      "n_pos": 358,
+      "n_neg": 25370,
+      "pr_auc": 0.02965177531773624,
+      "roc_auc": 0.6736935808140085
+    },
+    {
+      "medium_id": "220",
+      "n_pos": 1274,
+      "n_neg": 24454,
+      "pr_auc": 0.18474923343617633,
+      "roc_auc": 0.7785027031177238
+    },
+    {
+      "medium_id": "252",
+      "n_pos": 592,
+      "n_neg": 25136,
+      "pr_auc": 0.1250091495334598,
+      "roc_auc": 0.9134151768433774
+    },
+    {
+      "medium_id": "31",
+      "n_pos": 109,
+      "n_neg": 25619,
+      "pr_auc": 0.014316010090645013,
+      "roc_auc": 0.1197419776248348
+    },
+    {
+      "medium_id": "339",
+      "n_pos": 162,
+      "n_neg": 25566,
+      "pr_auc": 0.2531128917725162,
+      "roc_auc": 0.9461261242989581
+    },
+    {
+      "medium_id": "354",
+      "n_pos": 117,
+      "n_neg": 25611,
+      "pr_auc": 0.004547574626865672,
+      "roc_auc": 0.09816094646831441
+    },
+    {
+      "medium_id": "381",
+      "n_pos": 521,
+      "n_neg": 25207,
+      "pr_auc": 0.1945320503586121,
+      "roc_auc": 0.8782904803505287
+    },
+    {
+      "medium_id": "428",
+      "n_pos": 103,
+      "n_neg": 25625,
+      "pr_auc": 0.09496790309585758,
+      "roc_auc": 0.8493817665166943
+    },
+    {
+      "medium_id": "457",
+      "n_pos": 150,
+      "n_neg": 25578,
+      "pr_auc": 0.0412603532866421,
+      "roc_auc": 0.8629237365444262
+    },
+    {
+      "medium_id": "514",
+      "n_pos": 1890,
+      "n_neg": 23838,
+      "pr_auc": 0.6372180504429193,
+      "roc_auc": 0.9468378153062271
+    },
+    {
+      "medium_id": "53",
+      "n_pos": 149,
+      "n_neg": 25579,
+      "pr_auc": 0.2046388525464773,
+      "roc_auc": 0.9388116195358452
+    },
+    {
+      "medium_id": "535",
+      "n_pos": 1573,
+      "n_neg": 24155,
+      "pr_auc": 0.22497502735528332,
+      "roc_auc": 0.7663368847332266
+    },
+    {
+      "medium_id": "545",
+      "n_pos": 219,
+      "n_neg": 25509,
+      "pr_auc": 0.02413857437574695,
+      "roc_auc": 0.7897148307043927
+    },
+    {
+      "medium_id": "553",
+      "n_pos": 374,
+      "n_neg": 25354,
+      "pr_auc": 0.15104363590267195,
+      "roc_auc": 0.9124080559386046
+    },
+    {
+      "medium_id": "554",
+      "n_pos": 365,
+      "n_neg": 25363,
+      "pr_auc": 0.11893469404694698,
+      "roc_auc": 0.9315081455620554
+    },
+    {
+      "medium_id": "58",
+      "n_pos": 216,
+      "n_neg": 25512,
+      "pr_auc": 0.05833314429889305,
+      "roc_auc": 0.927297466406513
+    },
+    {
+      "medium_id": "585",
+      "n_pos": 112,
+      "n_neg": 25616,
+      "pr_auc": 0.278482148554742,
+      "roc_auc": 0.8558551226354063
+    },
+    {
+      "medium_id": "645",
+      "n_pos": 276,
+      "n_neg": 25452,
+      "pr_auc": 0.036581621082776454,
+      "roc_auc": 0.7953502842520276
+    },
+    {
+      "medium_id": "65",
+      "n_pos": 3434,
+      "n_neg": 22294,
+      "pr_auc": 0.9006928683634222,
+      "roc_auc": 0.9755952498821933
+    },
+    {
+      "medium_id": "693",
+      "n_pos": 2503,
+      "n_neg": 23225,
+      "pr_auc": 0.4188285561288795,
+      "roc_auc": 0.8783924651021573
+    },
+    {
+      "medium_id": "78",
+      "n_pos": 257,
+      "n_neg": 25471,
+      "pr_auc": 0.2721170882429022,
+      "roc_auc": 0.9676798073707691
+    },
+    {
+      "medium_id": "81",
+      "n_pos": 115,
+      "n_neg": 25613,
+      "pr_auc": 0.27618279435467347,
+      "roc_auc": 0.9262594232887851
+    },
+    {
+      "medium_id": "83",
+      "n_pos": 150,
+      "n_neg": 25578,
+      "pr_auc": 0.020746118950763597,
+      "roc_auc": 0.8187215575885527
+    },
+    {
+      "medium_id": "830",
+      "n_pos": 1782,
+      "n_neg": 23946,
+      "pr_auc": 0.4005468147385281,
+      "roc_auc": 0.8818929642762434
+    },
+    {
+      "medium_id": "84",
+      "n_pos": 1365,
+      "n_neg": 24363,
+      "pr_auc": 0.5583327533020991,
+      "roc_auc": 0.9681294174090628
+    },
+    {
+      "medium_id": "9",
+      "n_pos": 2465,
+      "n_neg": 23263,
+      "pr_auc": 0.8004815953628244,
+      "roc_auc": 0.9217055368025154
+    },
+    {
+      "medium_id": "92",
+      "n_pos": 2899,
+      "n_neg": 22829,
+      "pr_auc": 0.4467899949837726,
+      "roc_auc": 0.8689824346226291
+    },
+    {
+      "medium_id": "98",
+      "n_pos": 151,
+      "n_neg": 25577,
+      "pr_auc": 0.3434343543833669,
+      "roc_auc": 0.9693860403865538
+    },
+    {
+      "medium_id": "987",
+      "n_pos": 155,
+      "n_neg": 25573,
+      "pr_auc": 0.03815021128397991,
+      "roc_auc": 0.8637134931877497
+    },
+    {
+      "medium_id": "pdf",
+      "n_pos": 106,
+      "n_neg": 25622,
+      "pr_auc": 0.011293218617366114,
+      "roc_auc": 0.6703395003998627
+    }
+  ]
+}

artifacts/media_recommender_drylab_benchmark.md ADDED

	@@ -0,0 +1,51 @@

+# Media Recommender Dry-Lab Benchmark
+This benchmark hides known BacDive/MediaDive strain-medium links and asks
+whether the genome-only recommender recovers at least one known medium in
+the top-k ranked recommendations.
+## Setup
+- Split mode: `family`
+- Folds: 5
+- Evaluation strains: 21050
+- Media labels: 40
+- Feature columns: 1113
+- XGBoost trees per medium per fold: 100
+## Ranking Metrics
+| Method | MRR | Hit@1 | Hit@3 | Hit@5 | Recall@5 | Precision@5 |
+|---|---:|---:|---:|---:|---:|---:|
+| XGBoost medium recommender | 0.588 | 0.450 | 0.660 | 0.775 | 0.722 | 0.194 |
+| Global popularity baseline | 0.243 | 0.080 | 0.250 | 0.366 | 0.308 | 0.080 |
+| Taxonomic popularity baseline | 0.250 | 0.086 | 0.259 | 0.372 | 0.312 | 0.081 |
+## Per-Medium AUC
+- Valid media with both classes: 40
+- Median ROC-AUC: 0.910
+- Median PR-AUC: 0.183
+Top media by PR-AUC:
+| Medium | Positives | PR-AUC | ROC-AUC |
+|---|---:|---:|---:|
+| 65 GYM STREPTOMYCES MEDIUM | 3434 | 0.901 | 0.976 |
+| 9 VY/2 AGAR | 2465 | 0.800 | 0.922 |
+| 1076b SP4-Z MEDIUM | 116 | 0.777 | 0.922 |
+| 514 BACTO MARINE BROTH (DIFCO 2216) | 1890 | 0.637 | 0.947 |
+| 84 ROLLED OATS MINERAL MEDIUM | 1365 | 0.558 | 0.968 |
+| 92 TRYPTICASE SOY YEAST EXTRACT MEDIUM | 2899 | 0.447 | 0.869 |
+| 11 MRS MEDIUM | 495 | 0.423 | 0.966 |
+| 693 COLUMBIA BLOOD MEDIUM | 2503 | 0.419 | 0.878 |
+| 830 R2A MEDIUM | 1782 | 0.401 | 0.882 |
+| 98 RHIZOBIUM MEDIUM | 151 | 0.343 | 0.969 |
+## Interpretation
+Hit@k is the easiest practical readout: it measures whether at least one
+known growth medium appears in the top-k suggestions. PR-AUC is expected
+to be much lower than ROC-AUC because medium labels are sparse and heavily
+imbalanced; a high ROC-AUC with modest PR-AUC means the model is useful for
+ranking candidates, not for guaranteeing growth.
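
The ranking metrics in the table can be illustrated with a small, self-contained sketch. The definitions below are the standard ones and agree with the JSON above in at least one respect (Precision@1 equals Hit@1), but the benchmark's own implementation is not shown in this diff. The function names and the three toy strains are made up for illustration.

```python
def ranking_metrics(ranked, true_media, k):
    """Metrics for one strain.

    ranked: medium ids ordered best-first; true_media: set of known media.
    """
    top_k = ranked[:k]
    n_hits = sum(1 for m in top_k if m in true_media)
    # Rank (1-based) of the first known medium anywhere in the ranking.
    first_rank = next(
        (i + 1 for i, m in enumerate(ranked) if m in true_media), None
    )
    return {
        "hit": 1.0 if n_hits else 0.0,
        "recall": n_hits / len(true_media),
        "precision": n_hits / k,
        "rr": 1.0 / first_rank if first_rank else 0.0,
    }


def mean_metrics(examples, k):
    """Average the per-strain metrics; the mean of "rr" is the MRR."""
    rows = [ranking_metrics(ranked, truth, k) for ranked, truth in examples]
    return {key: sum(row[key] for row in rows) / len(rows) for key in rows[0]}


# Toy data only: (ranked medium ids, known media) for three strains.
examples = [
    (["65", "9", "514", "92"], {"9"}),
    (["92", "65", "830", "9"], {"92", "830"}),
    (["514", "65", "9", "92"], {"84"}),
]
summary = mean_metrics(examples, k=3)
```

In this toy set two of three strains have a known medium in the top 3, so Hit@3 is 2/3, while Precision@3 is lower because most of the three slots are filled by media with no recorded link.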

artifacts/retrain_with_pme.log ADDED

	@@ -0,0 +1,19 @@

+Encoded 65 isolation-category features (104558 non-zero entries)
+Joined MediaDive features (5 cols) — 25,728/46,029 training rows have MediaDive data
+Joined HMM features (144 cols) — 46,029/46,029 training rows have HMM data
+Joined KEGG module completeness (570 cols) — 43,457/46,029 training rows have KEGG data
+Joined per-marker ESM-2 embeddings (5129 cols) — 46,026/46,029 training rows have PME data
+Joined isolation metadata (46 cols)
+Training table: 46,029 strains × 6312 features
+Distinct groups: 634
+Group sizes (top 10): {'Enterobacteriaceae': 2667, 'Streptomycetaceae': 2212, 'Bacillaceae': 1887, 'Lactobacillaceae': 1749, 'Pseudomonadaceae': 1623, 'Myxococcaceae': 1546, 'Streptococcaceae': 1174, 'Staphylococcaceae': 1071, 'Flavobacteriaceae': 984, 'Corynebacteriaceae': 903}
+Wrote training table to /Users/miyuhoriuchi/microbe-model/data/training_table.parquet
+Wrote per-strain predictions to /Users/miyuhoriuchi/microbe-model/artifacts/predictions.parquet
+Results summary (16647.1s):
+  optimal_temperature_c     mae        = 2.6743  (n_folds=5)
+  optimal_ph                mae        = 0.4685  (n_folds=5)
+  oxygen_requirement        f1_macro   = 0.4020  (n_folds=5)
+  salt_tolerance_pct        mae        = 1.9171  (n_folds=5)

cerebrium/embed/cerebrium.toml ADDED

	@@ -0,0 +1,32 @@

+[cerebrium.deployment]
+name = "embed"
+python_version = "3.11"
+docker_base_image_url = "nvidia/cuda:12.1.1-runtime-ubuntu22.04"
+disable_auth = false
+include = ['./*', 'main.py', 'cerebrium.toml', 'markers.hmm']
+exclude = ['.*', '__pycache__', '*.pyc']
+[cerebrium.hardware]
+cpu = 4.0
+memory = 16.0
+compute = "ADA_L4"
+gpu_count = 1
+region = "us-east-1"
+[cerebrium.scaling]
+min_replicas = 0
+max_replicas = 3
+cooldown = 120
+replica_concurrency = 1
+scaling_metric = "concurrency_utilization"
+scaling_target = 80
+response_grace_period = 600
+[cerebrium.dependencies.pip]
+torch = ">=2.2"
+transformers = ">=4.40"
+accelerate = ">=0.30"
+pyhmmer = ">=0.12"
+pyrodigal = ">=3.5"
+requests = ">=2.32"
+numpy = ">=1.26"

cerebrium/embed/main.py ADDED

	@@ -0,0 +1,251 @@

+"""Per-marker ESM-2 t30 embedding service — runs on a Cerebrium L4 GPU container.
+embed_genome(bacdive_id, accession) → {"ok": bool, "row": {pme_<cat>_<dim>: float, ...}}
+or {"ok": False, "reason": ...}.
+The unified-marker HMM library is baked into the image. Each replica loads
+ESM-2 + HMMs once at startup, then serves multiple genomes from the warm
+container.
+"""
+from __future__ import annotations
+import io
+import os
+import time
+import zipfile
+from typing import Any
+import numpy as np
+import pyhmmer
+import pyhmmer.easel
+import pyhmmer.plan7
+import pyrodigal
+import requests
+import torch
+from transformers import AutoModel, AutoTokenizer
+DATASETS_URL = "https://api.ncbi.nlm.nih.gov/datasets/v2/genome/accession/{acc}/download"
+VERSION_FALLBACKS = (".1", ".2", ".3", ".4")
+EMPTY_ZIP_BYTES = 2_000
+EVALUE_THRESHOLD = 1e-5
+MARKER_TO_CATEGORY: dict[str, str] = {
+    "Hsp70_DnaK": "temperature", "Hsp90": "temperature", "Cpn60_GroEL": "temperature",
+    "Hsp20": "temperature", "CSD_cold_shock": "temperature", "TGS_thermosome": "temperature",
+    "ATP_synth_alphabeta": "ph", "ATP_synth_alphabeta_C": "ph", "ATP_synth_F0_B": "ph",
+    "NhaA_Na_H_exch": "ph", "NhaB_Na_H_exch": "ph", "Pyridoxal_decarbox": "ph",
+    "MotA_TolQ_ExbB": "ph", "V_ATPase_subH_N": "ph",
+    "COX1_aerobic": "oxygen", "COX2_TM_aerobic": "oxygen", "COX2_periplasm_aero": "oxygen",
+    "Cyt_CBB3_microaero": "oxygen", "Rieske_2Fe2S": "oxygen", "Catalase": "oxygen",
+    "SOD_FeMn": "oxygen", "SOD_CuZn": "oxygen", "FeFe_hyd_anaerobic": "oxygen",
+    "NiFe_hyd_anaerobic": "oxygen", "FAD_binding_FrdA": "oxygen", "Fer4_FeS_4Fe4S": "oxygen",
+    "KdpD_osmosensor": "salt", "TrkH_K_channel": "salt", "BCCT_compatible": "salt",
+    "BPD_transp_1": "salt", "EctC_ectoine_synth": "salt", "Bact_rhodopsin": "salt",
+    "TP_methylase_B12": "vitamin", "Peripla_BP_2": "vitamin", "THF_DHG_CYH_folate": "vitamin",
+    "FolB_folate": "vitamin", "PdxJ_pyridoxine": "vitamin", "DHBP_riboflavin": "vitamin",
+    "NifH_nitrogenase": "nitrogen", "NifDK_nitrogenase": "nitrogen",
+    "NIR_SIR_ferredoxin": "nitrogen",
+    "RuBisCO_large_form1": "carbon", "RuBisCO_small_form1": "carbon",
+    "Alpha_amylase": "carbon", "Cellulase_GH5": "carbon", "CBM_cellulose": "carbon",
+    "Molybdopterin_OR": "special", "UvrD_helicase_C": "special",
+}
+CATEGORIES = ["temperature", "ph", "oxygen", "salt", "vitamin", "nitrogen", "carbon", "special"]
+_model_name = os.environ.get("ESM2_MODEL", "facebook/esm2_t30_150M_UR50D")
+_batch_size = int(os.environ.get("ESM2_BATCH_SIZE", "16"))
+_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+_dtype = torch.float16 if _device.type == "cuda" else torch.float32
+print(f"[boot] loading {_model_name} on {_device} ({_dtype})", flush=True)
+_tokenizer = AutoTokenizer.from_pretrained(_model_name)
+_model = AutoModel.from_pretrained(_model_name, dtype=_dtype)
+_model.to(_device)
+_model.train(False)
+_embed_dim = _model.config.hidden_size
+_alphabet = pyhmmer.easel.Alphabet.amino()
+with pyhmmer.plan7.HMMFile("/cortex/app/markers.hmm") as _fh:
+    _hmms = list(_fh)
+_ncbi_key = os.environ.get("NCBI_API_KEY")
+print(f"[boot] loaded {len(_hmms)} marker HMMs, embed_dim={_embed_dim}, "
+      f"ncbi_key={'yes' if _ncbi_key else 'no'}", flush=True)
+def _has_version(acc: str) -> bool:
+    return "." in acc and acc.rsplit(".", 1)[-1].isdigit()
+def _candidates(acc: str) -> list[str]:
+    return [acc] if _has_version(acc) else [acc + v for v in VERSION_FALLBACKS]
+def _fetch_fasta(acc: str) -> list[tuple[str, str]] | None:
+    rate = 0.1 if _ncbi_key else 0.34
+    headers = {"Accept": "application/zip"}
+    if _ncbi_key:
+        headers["api-key"] = _ncbi_key
+    params = {"include_annotation_type": "GENOME_FASTA"}
+    for cand in _candidates(acc):
+        zip_bytes: bytes | None = None
+        for attempt in range(3):
+            try:
+                time.sleep(rate)
+                resp = requests.get(
+                    DATASETS_URL.format(acc=cand), params=params,
+                    headers=headers, timeout=120,
+                )
+                if resp.status_code == 404:
+                    break
+                if resp.status_code in (429, 502, 503):
+                    time.sleep(2 ** attempt)
+                    continue
+                resp.raise_for_status()
+            except requests.RequestException:
+                if attempt == 2:
+                    break
+                time.sleep(2 ** attempt)
+                continue
+            if len(resp.content) < EMPTY_ZIP_BYTES:
+                break
+            zip_bytes = resp.content
+            break
+        if zip_bytes is None:
+            continue
+        try:
+            with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
+                fna = [n for n in zf.namelist() if n.endswith(".fna")]
+                if not fna:
+                    continue
+                with zf.open(fna[0]) as src:
+                    raw = src.read()
+        except zipfile.BadZipFile:
+            continue
+        return _parse_fasta(raw)
+    return None
+def _parse_fasta(raw: bytes) -> list[tuple[str, str]]:
+    contigs: list[tuple[str, str]] = []
+    cur: str | None = None
+    chunks: list[str] = []
+    for line in raw.splitlines():
+        if not line:
+            continue
+        if line.startswith(b">"):
+            if cur is not None:
+                contigs.append((cur, "".join(chunks).upper()))
+            cur = line[1:].decode("ascii", errors="replace").split()[0]
+            chunks = []
+        else:
+            chunks.append(line.decode("ascii", errors="replace"))
+    if cur is not None:
+        contigs.append((cur, "".join(chunks).upper()))
+    return contigs
+def _predict_proteins(contigs: list[tuple[str, str]]) -> list[str]:
+    encoded = [(n, s.encode("ascii")) for n, s in contigs]
+    total_nt = sum(len(s) for _, s in encoded)
+    if total_nt >= 20_000:
+        finder = pyrodigal.GeneFinder(meta=False)
+        try:
+            finder.train(b"TTAATTAATTAA".join(s for _, s in encoded))
+        except Exception:
+            finder = pyrodigal.GeneFinder(meta=True)
+    else:
+        finder = pyrodigal.GeneFinder(meta=True)
+    proteins: list[str] = []
+    for _, s in encoded:
+        for gene in finder.find_genes(s):
+            proteins.append(gene.translate().rstrip("*"))
+    return proteins
+def _embed_proteins(proteins: list[str]) -> np.ndarray:
+    if not proteins:
+        return np.zeros((0, _embed_dim), dtype=np.float32)
+    out: list = []
+    for i in range(0, len(proteins), _batch_size):
+        batch = proteins[i : i + _batch_size]
+        enc = _tokenizer(batch, return_tensors="pt", padding=True,
+                         truncation=True, max_length=1024)
+        enc = {k: v.to(_device) for k, v in enc.items()}
+        with torch.inference_mode():
+            outs = _model(**enc)
+        last_hidden = outs.last_hidden_state
+        mask = enc["attention_mask"].unsqueeze(-1).to(last_hidden.dtype)
+        pooled = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
+        out.append(pooled.float().cpu().numpy())
+    return np.concatenate(out, axis=0)
+def _scan_markers(proteins: list[str]) -> dict[str, list[int]]:
+    seqs = []
+    for i, prot in enumerate(proteins):
+        if not prot:
+            continue
+        ts = pyhmmer.easel.TextSequence(name=f"p{i}".encode(), sequence=prot)
+        seqs.append(ts.digitize(_alphabet))
+    result: dict[str, list[int]] = {name: [] for name in MARKER_TO_CATEGORY}
+    if not seqs:
+        return result
+    for top_hits in pyhmmer.hmmer.hmmsearch(_hmms, seqs, E=EVALUE_THRESHOLD):
+        raw = top_hits.query.name
+        marker = raw.decode() if isinstance(raw, bytes) else raw
+        if marker not in result:
+            continue
+        for hit in top_hits:
+            if hit.evalue > EVALUE_THRESHOLD:
+                continue
+            name = hit.name.decode() if isinstance(hit.name, bytes) else hit.name
+            if name.startswith("p"):
+                try:
+                    result[marker].append(int(name[1:]))
+                except ValueError:
+                    pass
+    return result
+def embed_genome(bacdive_id: int, accession: str) -> dict[str, Any]:
+    try:
+        contigs = _fetch_fasta(accession)
+        if not contigs:
+            return {"ok": False, "reason": "fetch_empty", "bacdive_id": bacdive_id, "accession": accession}
+        proteins = _predict_proteins(contigs)
+        if not proteins:
+            return {"ok": False, "reason": "no_proteins", "bacdive_id": bacdive_id, "accession": accession}
+        marker_idx = _scan_markers(proteins)
+        hit_indices = sorted({i for ids in marker_idx.values() for i in ids})
+        row: dict[str, Any] = {
+            "bacdive_id": int(bacdive_id),
+            "genome_accession": accession,
+            "pme_marker_proteins_total": len(hit_indices),
+        }
+        if not hit_indices:
+            for cat in CATEGORIES:
+                row[f"pme_{cat}_n"] = 0
+                for d in range(_embed_dim):
+                    row[f"pme_{cat}_{d}"] = 0.0
+            return {"ok": True, "row": row}
+        hit_proteins = [proteins[i] for i in hit_indices]
+        hit_matrix = _embed_proteins(hit_proteins)
+        gi_to_ri = {gi: ri for ri, gi in enumerate(hit_indices)}
+        for cat in CATEGORIES:
+            idxs: set[int] = set()
+            for marker, gis in marker_idx.items():
+                if MARKER_TO_CATEGORY.get(marker) == cat:
+                    idxs.update(gis)
+            row[f"pme_{cat}_n"] = len(idxs)
+            if idxs:
+                rows = [gi_to_ri[gi] for gi in idxs if gi in gi_to_ri]
+                if rows:
+                    cat_mean = hit_matrix[rows].mean(axis=0).astype(np.float32)
+                    for d, v in enumerate(cat_mean):
+                        row[f"pme_{cat}_{d}"] = float(v)
+                    continue
+            for d in range(_embed_dim):
+                row[f"pme_{cat}_{d}"] = 0.0
+        return {"ok": True, "row": row}
+    except Exception as exc:
+        return {"ok": False, "reason": f"{type(exc).__name__}: {exc}",
+                "bacdive_id": bacdive_id, "accession": accession}
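
The row layout built by `embed_genome` can be cross-checked against the column counts in the logs earlier in this commit (`5131 cols` in `materialize_pme.log`, `5129 cols` of PME features in `retrain_with_pme.log`). The sketch below assumes an embedding width of 640, the hidden size of the default `facebook/esm2_t30_150M_UR50D` model. The service itself reads this from `_model.config.hidden_size` at runtime, so 640 is an assumption here, and `row_columns` is a hypothetical helper.

```python
CATEGORIES = ["temperature", "ph", "oxygen", "salt",
              "vitamin", "nitrogen", "carbon", "special"]
EMBED_DIM = 640  # assumption: hidden size of the default ESM-2 t30 150M model


def row_columns(embed_dim, categories):
    """List the keys embed_genome puts in a row, in a fixed order (hypothetical helper)."""
    cols = ["bacdive_id", "genome_accession", "pme_marker_proteins_total"]
    for cat in categories:
        cols.append(f"pme_{cat}_n")  # number of marker proteins in the category
        cols.extend(f"pme_{cat}_{d}" for d in range(embed_dim))  # mean embedding
    return cols


columns = row_columns(EMBED_DIM, CATEGORIES)
n_total = len(columns)                                       # all row keys
n_pme = sum(1 for c in columns if c.startswith("pme_"))      # feature columns only
```

Under that assumption the row has 8 × (640 + 1) + 1 = 5129 `pme_` columns plus the two identifier columns, 5131 in total, which agrees with both log lines.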

cerebrium/kofam/cerebrium.toml ADDED

	@@ -0,0 +1,27 @@

+[cerebrium.deployment]
+name = "kofam"
+python_version = "3.11"
+docker_base_image_url = "debian:bookworm-slim"
+disable_auth = false
+include = ['./*', 'main.py', 'cerebrium.toml', 'kofam_relevant.hmm', 'ko_thresholds.tsv']
+exclude = ['.*', '__pycache__', '*.pyc']
+[cerebrium.hardware]
+cpu = 8.0
+memory = 8.0
+compute = "CPU"
+region = "us-east-1"
+[cerebrium.scaling]
+min_replicas = 2
+max_replicas = 10
+cooldown = 300
+replica_concurrency = 1
+scaling_metric = "concurrency_utilization"
+scaling_target = 80
+response_grace_period = 600
+[cerebrium.dependencies.pip]
+pyhmmer = ">=0.12"
+pyrodigal = ">=3.5"
+requests = ">=2.32"

cerebrium/kofam/main.py ADDED

	@@ -0,0 +1,182 @@

+"""KOfam scan service — runs on a Cerebrium CPU container.
+scan_genome(accession) → {"ok": bool, "ko_hits": [...]} or {"ok": False, "reason": ...}.
+The relevant-KO HMM library (~734 MB) and per-KO bitscore thresholds are baked
+into the image via `include` in cerebrium.toml, so each replica loads them
+once at startup.
+"""
+import io
+import os
+import time
+import zipfile
+import pyhmmer
+import pyhmmer.easel
+import pyhmmer.plan7
+import pyrodigal
+import requests
+DATASETS_URL = "https://api.ncbi.nlm.nih.gov/datasets/v2/genome/accession/{acc}/download"
+VERSION_FALLBACKS = (".1", ".2", ".3", ".4")
+EMPTY_ZIP_BYTES = 2_000
+DEFAULT_EVALUE = 1e-5
+def _find(name: str) -> str:
+    here = os.path.dirname(os.path.abspath(__file__))
+    for candidate in (
+        os.path.join(here, name),
+        os.path.join("/cortex", name),
+        os.path.join("/cortex/app", name),
+        name,
+    ):
+        if os.path.exists(candidate):
+            return candidate
+    raise FileNotFoundError(f"{name} not found (tried module dir, /cortex, /cortex/app, cwd)")
+HMM_PATH = _find("kofam_relevant.hmm")
+THRESHOLDS_PATH = _find("ko_thresholds.tsv")
+_alphabet = pyhmmer.easel.Alphabet.amino()
+with pyhmmer.plan7.HMMFile(HMM_PATH) as _fh:
+    _hmms = list(_fh)
+_thresholds: dict[str, float] = {}
+with open(THRESHOLDS_PATH) as _fh:
+    next(_fh)
+    for _line in _fh:
+        _parts = _line.rstrip("\n").split("\t")
+        if len(_parts) < 2:
+            continue
+        try:
+            _thresholds[_parts[0]] = float(_parts[1])
+        except (TypeError, ValueError):
+            _thresholds[_parts[0]] = 0.0
+_ncbi_key = os.environ.get("NCBI_API_KEY")
+print(f"[boot] loaded {len(_hmms):,} HMMs, {len(_thresholds):,} thresholds, "
+      f"ncbi_key={'yes' if _ncbi_key else 'no'}", flush=True)
+def _has_version(acc: str) -> bool:
+    return "." in acc and acc.rsplit(".", 1)[-1].isdigit()
+def _candidates(acc: str) -> list[str]:
+    return [acc] if _has_version(acc) else [acc + v for v in VERSION_FALLBACKS]
+def _fetch_fasta(acc: str) -> list[tuple[str, str]] | None:
+    rate = 0.1 if _ncbi_key else 0.34
+    headers = {"Accept": "application/zip"}
+    if _ncbi_key:
+        headers["api-key"] = _ncbi_key
+    params = {"include_annotation_type": "GENOME_FASTA"}
+    for cand in _candidates(acc):
+        zip_bytes: bytes | None = None
+        for attempt in range(3):
+            try:
+                time.sleep(rate)
+                resp = requests.get(
+                    DATASETS_URL.format(acc=cand), params=params,
+                    headers=headers, timeout=120,
+                )
+                if resp.status_code == 404:
+                    break
+                if resp.status_code in (429, 502, 503):
+                    time.sleep(2 ** attempt)
+                    continue
+                resp.raise_for_status()
+            except requests.RequestException:
+                if attempt == 2:
+                    break
+                time.sleep(2 ** attempt)
+                continue
+            if len(resp.content) < EMPTY_ZIP_BYTES:
+                break
+            zip_bytes = resp.content
+            break
+        if zip_bytes is None:
+            continue
+        try:
+            with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
+                fna = [n for n in zf.namelist() if n.endswith(".fna")]
+                if not fna:
+                    continue
+                with zf.open(fna[0]) as src:
+                    raw = src.read()
+        except zipfile.BadZipFile:
+            continue
+        return _parse_fasta(raw)
+    return None
+def _parse_fasta(raw: bytes) -> list[tuple[str, str]]:
+    contigs: list[tuple[str, str]] = []
+    cur: str | None = None
+    chunks: list[str] = []
+    for line in raw.splitlines():
+        if not line:
+            continue
+        if line.startswith(b">"):
+            if cur is not None:
+                contigs.append((cur, "".join(chunks).upper()))
+            cur = line[1:].decode("ascii", errors="replace").split()[0]
+            chunks = []
+        else:
+            chunks.append(line.decode("ascii", errors="replace"))
+    if cur is not None:
+        contigs.append((cur, "".join(chunks).upper()))
+    return contigs
+def _predict_proteins(contigs: list[tuple[str, str]]) -> list[str]:
+    encoded = [(n, s.encode("ascii")) for n, s in contigs]
+    total_nt = sum(len(s) for _, s in encoded)
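+    if total_nt >= 20_000:
+        # Enough sequence to train a genome-specific model (single-genome mode).
+        finder = pyrodigal.GeneFinder(meta=False)
+        try:
+            # Join contigs with a stop-codon spacer so no gene spans a contig boundary.
+            finder.train(b"TTAATTAATTAA".join(s for _, s in encoded))
+        except Exception:
+            # Training failed; fall back to the pre-trained metagenomic models.
+            finder = pyrodigal.GeneFinder(meta=True)
+    else:
+        # Too little sequence to train on; use the pre-trained metagenomic models.
+        finder = pyrodigal.GeneFinder(meta=True)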
+    proteins: list[str] = []
+    for _, s in encoded:
+        for gene in finder.find_genes(s):
+            proteins.append(gene.translate().rstrip("*"))
+    return proteins
+def _scan(proteins: list[str]) -> set[str]:
+    seqs = []
+    for i, prot in enumerate(proteins):
+        if not prot:
+            continue
+        ts = pyhmmer.easel.TextSequence(name=f"p{i}".encode(), sequence=prot)
+        seqs.append(ts.digitize(_alphabet))
+    found: set[str] = set()
+    if not seqs:
+        return found
+    for top_hits in pyhmmer.hmmer.hmmsearch(_hmms, seqs, E=DEFAULT_EVALUE, cpus=8):
+        raw = top_hits.query.name
+        ko = raw.decode() if isinstance(raw, bytes) else raw
+        thr = _thresholds.get(ko, 0.0)
+        for hit in top_hits:
+            if hit.score >= thr and hit.evalue <= DEFAULT_EVALUE:
+                found.add(ko)
+                break
+    return found
+def scan_genome(accession: str) -> dict:
+    try:
+        contigs = _fetch_fasta(accession)
+        if not contigs:
+            return {"ok": False, "reason": "fetch_empty", "accession": accession}
+        proteins = _predict_proteins(contigs)
+        if not proteins:
+            return {"ok": False, "reason": "no_proteins", "accession": accession}
+        ko_hits = _scan(proteins)
+        return {"ok": True, "accession": accession, "ko_hits": sorted(ko_hits)}
+    except Exception as exc:
+        return {"ok": False, "reason": f"{type(exc).__name__}: {exc}", "accession": accession}

docs/hybrid_predictor.md ADDED

	@@ -0,0 +1,89 @@

+# Hybrid Predictor
+The hybrid predictor combines the strongest current model for each phenotype:
+- Temperature: saved tabular XGBoost phenotype head
+- pH: saved tabular XGBoost phenotype head
+- Salt: saved tabular XGBoost phenotype head
+- Oxygen: fold-0 all-task LoRA checkpoint
+Script:
+```bash
+PYTHONPATH=src uv run --python 3.11 --extra dev --extra embeddings python scripts/39_predict_hybrid.py \
+  --features data/training_table.parquet \
+  --marker-sequences data/marker_sequences.jsonl \
+  --limit 25 \
+  --device mps \
+  --output artifacts/hybrid_predictions.parquet
+```
+Use `--device cuda` on a GPU host, `--device mps` on Apple Silicon, or omit the flag
+to let PyTorch choose CUDA when available and CPU otherwise. CPU LoRA inference is
+slow because ESM-2 encodes multiple marker proteins per genome.
+For larger uncultured-genome batches, use chunked output so progress is durable:
+```bash
+PYTHONPATH=src uv run --python 3.11 --extra dev --extra embeddings python scripts/39_predict_hybrid.py \
+  --features artifacts/uncultured_predictions.parquet \
+  --marker-sequences data/uncultured_marker_sequences.jsonl \
+  --join left \
+  --reuse-existing-tabular \
+  --device mps \
+  --batch-size 2 \
+  --chunk-size 250 \
+  --chunk-output-dir artifacts/hybrid_chunks \
+  --resume-chunks \
+  --progress-every 25 \
+  --output artifacts/hybrid_predictions.parquet
+```
+`--resume-chunks` skips existing chunk files and combines all expected chunks into
+the final output when the run finishes. `--reuse-existing-tabular` keeps previously
+materialized temperature, pH, salt, and media outputs while replacing oxygen with
+LoRA where marker sequences are available.
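The resume behaviour described above can be sketched as follows. This is a minimal illustration, not the code in `scripts/39_predict_hybrid.py`: the function name `run_chunked`, the `chunk_NNNNN.json` file naming, the JSON chunk format, and the `fake_predict` stand-in are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def run_chunked(
    rows: list[dict],
    predict: Callable[[list[dict]], list[dict]],
    chunk_dir: Path,
    chunk_size: int,
    resume: bool = True,
) -> list[dict]:
    """Process rows in chunks, writing one file per chunk (hypothetical layout)."""
    chunk_dir.mkdir(parents=True, exist_ok=True)
    combined: list[dict] = []
    for index, start in enumerate(range(0, len(rows), chunk_size)):
        chunk_path = chunk_dir / f"chunk_{index:05d}.json"
        if resume and chunk_path.exists():
            # Finished in an earlier run: reuse the saved predictions.
            chunk_rows = json.loads(chunk_path.read_text())
        else:
            chunk_rows = predict(rows[start:start + chunk_size])
            # Write to a temporary name first so an interrupted run never
            # leaves a half-written chunk that a later resume would trust.
            tmp_path = chunk_path.with_suffix(".tmp")
            tmp_path.write_text(json.dumps(chunk_rows))
            tmp_path.replace(chunk_path)
        combined.extend(chunk_rows)
    return combined


calls: list[int] = []


def fake_predict(chunk: list[dict]) -> list[dict]:
    """Stand-in for model inference; records how many rows it was asked to score."""
    calls.append(len(chunk))
    return [{**row, "pred": "placeholder"} for row in chunk]


# Placeholder accessions, not real genomes.
demo_rows = [{"genome_accession": f"demo_{i}"} for i in range(5)]
with tempfile.TemporaryDirectory() as tmp:
    first = run_chunked(demo_rows, fake_predict, Path(tmp), chunk_size=2)
    calls_after_first = len(calls)
    second = run_chunked(demo_rows, fake_predict, Path(tmp), chunk_size=2)
    calls_after_second = len(calls)
```

On the second call every chunk file already exists, so `fake_predict` is not invoked again and the combined result is rebuilt from disk. That is the property that makes a long batch safe to interrupt.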
+## Inputs
+`--features` accepts `.parquet`, `.csv`, `.json`, or `.jsonl` feature rows. Rows must
+include `genome_accession` by default. The tabular heads load their feature column
+order from `models/phenotype/feature_cols.json`.
+`--marker-sequences` must be a JSONL file with the same schema as
+`data/marker_sequences.jsonl`:
+```json
+{
+  "bacdive_id": 1,
+  "genome_accession": "GCF_004341595.1",
+  "by_category": {
+    "temperature": ["..."],
+    "ph": ["..."],
+    "oxygen": ["..."],
+    "salt": ["..."],
+    "vitamin": ["..."],
+    "nitrogen": ["..."],
+    "carbon": ["..."],
+    "special": ["..."]
+  }
+}
+```
+For uncultured genomes, first prepare a marker-sequence JSONL for those same
+`genome_accession` values. The existing `data/marker_sequences.jsonl` is keyed to
+BacDive training rows, not the uncultured candidate table.
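Before a long inference run, it can help to sanity-check a new marker-sequence file against the schema shown above. The sketch below is an assumption-laden example, not part of the repository: `check_marker_record` is a hypothetical helper, and it does not require `bacdive_id` or every category to be present, since the document does not say whether they are mandatory for uncultured genomes.

```python
import json

# Category names taken from the example record above.
EXPECTED_CATEGORIES = {
    "temperature", "ph", "oxygen", "salt",
    "vitamin", "nitrogen", "carbon", "special",
}


def check_marker_record(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty means it looks OK)."""
    problems: list[str] = []
    record = json.loads(line)
    if not isinstance(record.get("genome_accession"), str):
        problems.append("genome_accession missing or not a string")
    by_category = record.get("by_category")
    if not isinstance(by_category, dict):
        problems.append("by_category missing or not an object")
        return problems
    for category, sequences in by_category.items():
        if category not in EXPECTED_CATEGORIES:
            problems.append(f"unexpected category: {category}")
        elif not (isinstance(sequences, list)
                  and all(isinstance(seq, str) for seq in sequences)):
            problems.append(f"{category} is not a list of strings")
    return problems
```

Running it over each line of the JSONL file and printing any non-empty result surfaces malformed rows before any GPU time is spent.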
+## Output
+The output includes:
+- `pred_optimal_temperature_c` plus 80% interval columns
+- `pred_optimal_ph` plus 80% interval columns
+- `pred_salt_tolerance_pct` plus 80% interval columns
+- `pred_oxygen_requirement`
+- `pred_oxygen_requirement_confidence`
+- one probability column per LoRA oxygen class
+The default output is `artifacts/hybrid_predictions.parquet`; `.csv`, `.json`, and
+`.jsonl` outputs are also supported.

docs/lora_results.md ADDED

	@@ -0,0 +1,85 @@

+# LoRA Fold 0 Results
+This page records the completed fold 0 LoRA experiments and the checkpoint to use.
+## Recommendation
+Use the original all-task fold 0 checkpoint:
+- Local checkpoint path: `artifacts/lora/fold0_best.pt`
+- Durable release asset: `fold0_best_all_task.pt`
+- GitHub release: https://github.com/miyu-horiuchi/microbe-model/releases/tag/lora-fold0-20260518
+The all-task checkpoint is the best current fold 0 LoRA result. Oxygen-only training
+and the anaerobe-weighted run were useful checks, but neither improved the clean
+validation comparison enough to replace the original checkpoint.
+## Experiments
+All runs used fold 0, ESM-2 t12, LoRA `r=8`, one epoch, batch size 2, gradient
+accumulation 8, and Lambda A100 SXM4 GPU training.
+| Run | Local result file | Oxygen macro F1 | Oxygen n | Use? |
+|---|---|---:|---:|---|
+| All-task LoRA | `artifacts/lora/fold0_results.json` | 0.944823 | 2266 | Yes |
+| Oxygen-only LoRA | `artifacts/lora/fold0_results_oxygen.json` | 0.916836 | 2214 | No |
+| Anaerobe-weighted all-task LoRA | `artifacts/lora_weighted_anaerobe/fold0_results.json` | 0.944776 | 2266 | No |
+The anaerobe-weighted run used oxygen class weights:
+```text
+aerobe=1.0, anaerobe=1.5, facultative_anaerobe=1.0, microaerobe=1.0
+```
+It slightly improved anaerobe recall in the detailed diagnostic, but its fold 0
+training-validation oxygen macro F1 was fractionally lower than the all-task run.
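For readers unfamiliar with the metric, macro F1 is the unweighted mean of the per-class F1 scores, so a rare class such as `microaerobe` counts as much as a common one. The function below is a plain-Python sketch of that standard definition. It is not the repository's evaluation code, and the labels in the usage note are illustrative only.

```python
def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores: list[float] = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); a class with no hits and no errors scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because every class contributes equally, improving recall on one class (for example by up-weighting `anaerobe`) can leave macro F1 flat or slightly lower if precision on that class or recall on another drops.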
+## Checkpoint Assets
+The `.pt` files are not committed to git. They are stored as GitHub Release assets:
+| Asset | SHA256 |
+|---|---|
+| `fold0_best_all_task.pt` | `8a73ee20252b1aa710b0480abd307ffbc38b788b1a152a7e63298c525a04be23` |
+| `fold0_best_oxygen_only.pt` | `fd10d4a2a7cba5d564fb9ba1f730cace07a0a2173d3622f1f572cfd29306fc95` |
+| `fold0_best_weighted_anaerobe.pt` | `c8d34999f570663e020e5644a994f821bf9539a6fcc3e029d5942b8dc7709826` |
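After downloading a release asset, its digest can be checked against the table above. The following is a minimal sketch using only the standard library. The helper names `sha256_of` and `verify_asset` are hypothetical, and the demo at the bottom hashes a throwaway file rather than a real checkpoint.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, block_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size blocks so large checkpoints are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_asset(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 with a published hex digest (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()


# Self-check on a throwaway file, so the sketch runs without a real checkpoint.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.bin"
    demo_path.write_bytes(b"not a real checkpoint")
    demo_digest = hashlib.sha256(b"not a real checkpoint").hexdigest()
    demo_ok = verify_asset(demo_path, demo_digest.upper())
    demo_bad = verify_asset(demo_path, "0" * 64)
```

For the recommended checkpoint, the call would be `verify_asset(Path("fold0_best_all_task.pt"), "8a73ee20252b1aa710b0480abd307ffbc38b788b1a152a7e63298c525a04be23")`, assuming the asset was saved under its release name in the current directory.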
+## Loading The Best Checkpoint
+The checkpoint is a PyTorch dictionary with these keys:
+- `epoch`
+- `model_cfg`
+- `train_cfg`
+- `state_dict`
+Minimal load pattern:
+```python
+import torch
+from microbe_model.train.lora_model import LoraModelConfig, PhenoLoRAModel
+checkpoint = torch.load("artifacts/lora/fold0_best.pt", map_location="cpu")
+model_cfg = LoraModelConfig(**checkpoint["model_cfg"])
+model = PhenoLoRAModel(model_cfg)
+model.load_state_dict(checkpoint["state_dict"], strict=False)
+model.eval()
+```
+To regenerate oxygen diagnostics:
+```bash
+PYTHONPATH=src uv run --python 3.11 --extra dev python scripts/38_eval_lora_checkpoint.py \
+  --checkpoint artifacts/lora/fold0_best.pt \
+  --output-json artifacts/lora/fold0_oxygen_diagnostics.json \
+  --output-md artifacts/lora/fold0_oxygen_diagnostics.md \
+  --batch-size 2
+```
+## Next GPU Work
+Do not spend more GPU on fold 0 variants unless there is a new hypothesis. The next
+meaningful validation step is to run the selected all-task LoRA setup across folds
+1-4 and report the mean and variance across all five folds. That is a stronger
+scientific result, but it is also the next major GPU-cost item.
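Once all five folds have run, the summary could be computed along these lines. This is only a sketch: `summarize_folds` is a hypothetical helper, it assumes one oxygen macro F1 value per fold has already been read from each fold's results file, and the numbers in the example are placeholders, not measured results.

```python
import statistics


def summarize_folds(scores: dict[int, float]) -> dict[str, float]:
    """Mean, sample variance, and sample standard deviation of per-fold scores."""
    values = [scores[fold] for fold in sorted(scores)]
    several = len(values) > 1
    return {
        "n_folds": len(values),
        "mean": statistics.fmean(values),
        # Sample (n - 1) variance; undefined for a single fold, so report 0.0.
        "variance": statistics.variance(values) if several else 0.0,
        "stdev": statistics.stdev(values) if several else 0.0,
    }


# Placeholder scores for illustration only; NOT real fold results.
placeholder_scores = {0: 0.5, 1: 0.6, 2: 0.7, 3: 0.8, 4: 0.9}
summary = summarize_folds(placeholder_scores)
```

With only five folds the sample variance is a rough estimate, so reporting the individual fold scores alongside the mean keeps the result easy to audit.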

kaggle/README.md ADDED

	@@ -0,0 +1,85 @@

+# Kaggle migration — LoRA fine-tune of ESM-2 on BacDive phenotypes
+This directory packages everything you need to run the LoRA fine-tune on Kaggle's
+free P100 GPU (30 hr/week quota). It costs $0, and one epoch of fold 0 (about 10 h)
+fits in a single 12-hour session.
+## One-time setup
+1. **Make a Kaggle account** and verify it via phone (required to enable internet +
+   GPU). https://www.kaggle.com/account
+2. **Install the Kaggle CLI** locally so you can upload datasets without dragging
+   1.3 GB through the browser:
+   ```bash
+   uv pip install kaggle
+   ```
+3. **Download your Kaggle API token** from https://www.kaggle.com/settings → "Create
+   new API token". Save it to `~/.kaggle/kaggle.json` and `chmod 600 ~/.kaggle/kaggle.json`.
+## Upload the three datasets
+Run the helper script, which packages and uploads all three datasets at once. On
+later invocations it pushes a new version of each dataset, so the Kaggle URLs stay stable:
+```bash
+export KAGGLE_USERNAME=<your-kaggle-handle>
+bash kaggle/upload.sh
+```
+This creates (or updates) three datasets under your account:
+| Slug | Size | License | Contents |
+|---|---|---|---|
+| `bacdive-marker-sequences` | 1.3 GB | CC0-1.0 | `marker_sequences.jsonl` |
+| `bacdive-tables` | ~50 MB | CC-BY-4.0 | `bacdive_phenotypes.parquet`, `strain_catalog.parquet` |
+| `microbe-model-code` | ~120 KB | MIT | The `microbe_model/` Python package |
+To wipe and re-stage the local copies, run `FORCE_RECREATE=1 bash kaggle/upload.sh`.
+## Running on Kaggle
+1. Open https://www.kaggle.com → "Create" → "New Notebook".
+2. Upload `kaggle/lora_train_kaggle.ipynb` (or paste the contents).
+3. **Settings (right rail):**
+   - Accelerator → **GPU P100**
+   - Persistence → **Files only** (so checkpoints survive between sessions)
+   - Internet → **on** (needed to fetch ESM-2 weights from HuggingFace)
+4. **Add inputs (right rail):**
+   - `<YOUR-KAGGLE-USERNAME>/bacdive-marker-sequences`
+   - `<YOUR-KAGGLE-USERNAME>/bacdive-tables`
+   - `<YOUR-KAGGLE-USERNAME>/microbe-model-code`
+5. Adjust the input paths in cell 2 of the notebook to match the names of the
+   datasets you uploaded (Kaggle slugifies them so the folder name under
+   `/kaggle/input/` will be e.g. `bacdive-marker-sequences/`).
+6. **Run all cells.** Training will print loss every 50 steps; expect ~10 h for
+   one epoch of fold 0 on P100. The trainer saves `fold0_best.pt` and
+   `fold0_results.json` to `/kaggle/working/`, both downloadable from the Output
+   tab when the session ends.
+## Resuming across sessions (only if you want >1 epoch)
+If you need 3 epochs total, your simplest path is three separate Kaggle sessions,
+each running 1 epoch starting from the previous session's checkpoint:
+1. After session 1 finishes, **download** `/kaggle/working/fold0_best.pt`.
+2. **Create a new Kaggle Dataset** called `lora-fold0-ckpt-epoch1` containing it.
+3. In session 2, add this dataset as input and load the checkpoint state-dict
+   into the model before training begins. Add a cell like:
+   ```python
+   import torch
+   ckpt = torch.load("/kaggle/input/lora-fold0-ckpt-epoch1/fold0_best.pt")
+   model.load_state_dict(ckpt["state_dict"], strict=False)
+   ```
+4. Repeat for session 3.
+For a "publishable, modest-cost LoRA result", running 1 epoch in 1 session is
+usually enough — LoRA reaches most of its gain in the first pass through the data.
+## What to do with the result
+When the Kaggle run finishes, download `fold0_results.json` from
+`/kaggle/working/` and drop it at `artifacts/lora/fold0_results.json` locally.
+A follow-up script will compare the LoRA per-target metrics to the
+frozen-PTPE XGBoost baseline in `artifacts/baseline_results.json`.