Miyu Horiuchi committed on
Commit
0ed74db
·
1 Parent(s): 79d57d7

Deploy app from main@a3254bf (no paper/ binaries)

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. .gitignore +14 -0
  2. README.md +313 -65
  3. api/main.py +47 -18
  4. artifacts/baseline_results.json +0 -0
  5. artifacts/baseline_results_pre_pme.json +1461 -0
  6. artifacts/carveme_smoke_status.json +13 -0
  7. artifacts/carveme_smoke_status.md +41 -0
  8. artifacts/external_benchmark_manifest.parquet +3 -0
  9. artifacts/external_benchmark_manifest_5k.csv +0 -0
  10. artifacts/external_benchmark_manifest_5k.parquet +3 -0
  11. artifacts/external_benchmark_manifest_5k_summary.json +21 -0
  12. artifacts/external_benchmark_status.json +36 -0
  13. artifacts/external_benchmark_status.md +56 -0
  14. artifacts/extract_seqs_full.log +467 -0
  15. artifacts/extract_seqs_smoke.log +79 -0
  16. artifacts/genomespot_5k_benchmark.json +0 -0
  17. artifacts/genomespot_5k_benchmark.md +28 -0
  18. artifacts/genomespot_smoke_benchmark.json +119 -0
  19. artifacts/genomespot_smoke_benchmark.md +28 -0
  20. artifacts/lora/fold0_oxygen_diagnostics.json +321 -0
  21. artifacts/lora/fold0_oxygen_diagnostics.md +56 -0
  22. artifacts/lora/fold0_oxygen_diagnostics_comparison.md +22 -0
  23. artifacts/lora/fold0_oxygen_only_diagnostics.json +321 -0
  24. artifacts/lora/fold0_oxygen_only_diagnostics.md +56 -0
  25. artifacts/lora/fold0_results.json +78 -0
  26. artifacts/lora/fold0_results_oxygen.json +82 -0
  27. artifacts/lora/fold0_results_smoke.json +78 -0
  28. artifacts/lora/lambda_fold0_1ep_20260517T033023Z.log +57 -0
  29. artifacts/lora/lambda_fold0_oxygen_1ep_20260517T103524Z.log +60 -0
  30. artifacts/lora/lambda_fold0_weighted_anaerobe_20260518T070340Z.log +57 -0
  31. artifacts/lora_fold0_real.log +147 -0
  32. artifacts/lora_oxygen_vs_all_task.md +20 -0
  33. artifacts/lora_smoke.log +384 -0
  34. artifacts/lora_smoke2.log +42 -0
  35. artifacts/lora_vs_baseline.md +16 -0
  36. artifacts/lora_weighted_anaerobe/fold0_oxygen_diagnostics.json +321 -0
  37. artifacts/lora_weighted_anaerobe/fold0_oxygen_diagnostics.md +56 -0
  38. artifacts/lora_weighted_anaerobe/fold0_results.json +88 -0
  39. artifacts/lora_weighted_anaerobe/fold0_weighted_anaerobe_comparison.md +45 -0
  40. artifacts/materialize_pme.log +13 -0
  41. artifacts/media_recommender_drylab_benchmark.json +343 -0
  42. artifacts/media_recommender_drylab_benchmark.md +51 -0
  43. artifacts/retrain_with_pme.log +19 -0
  44. cerebrium/embed/cerebrium.toml +32 -0
  45. cerebrium/embed/main.py +251 -0
  46. cerebrium/kofam/cerebrium.toml +27 -0
  47. cerebrium/kofam/main.py +182 -0
  48. docs/hybrid_predictor.md +89 -0
  49. docs/lora_results.md +85 -0
  50. kaggle/README.md +85 -0
.gitignore CHANGED
@@ -64,6 +64,20 @@ artifacts/lora/*
64
  !artifacts/lora/lambda_fold0_oxygen_1ep_20260517T103524Z.log
65
  !artifacts/lora_vs_baseline.md
66
  !artifacts/lora_oxygen_vs_all_task.md
67
 
68
  # Trained recommender models — needed for scripts/recommend.py to work after clone
69
  !models/
 
64
  !artifacts/lora/lambda_fold0_oxygen_1ep_20260517T103524Z.log
65
  !artifacts/lora_vs_baseline.md
66
  !artifacts/lora_oxygen_vs_all_task.md
67
+ !artifacts/media_recommender_drylab_benchmark.json
68
+ !artifacts/media_recommender_drylab_benchmark.md
69
+ !artifacts/external_benchmark_manifest.parquet
70
+ !artifacts/external_benchmark_status.json
71
+ !artifacts/external_benchmark_status.md
72
+ !artifacts/genomespot_smoke_benchmark.json
73
+ !artifacts/genomespot_smoke_benchmark.md
74
+ !artifacts/external_benchmark_manifest_5k.csv
75
+ !artifacts/external_benchmark_manifest_5k.parquet
76
+ !artifacts/external_benchmark_manifest_5k_summary.json
77
+ !artifacts/genomespot_5k_benchmark.json
78
+ !artifacts/genomespot_5k_benchmark.md
79
+ !artifacts/carveme_smoke_status.json
80
+ !artifacts/carveme_smoke_status.md
81
 
82
  # Trained recommender models — needed for scripts/recommend.py to work after clone
83
  !models/
README.md CHANGED
@@ -56,6 +56,120 @@ For practical prediction, use the hybrid predictor in [docs/hybrid_predictor.md]
56
  tabular XGBoost heads for temperature/pH/salt plus LoRA for oxygen. The deployed UI
57
  surfaces whether each oxygen value came from `LoRA` or the `tabular` fallback.
58
 
59
  ## Approach
60
 
61
  ```
@@ -157,6 +271,29 @@ PYTHONPATH=src uv run --python 3.11 --extra dev --extra embeddings python script
157
  --resume-chunks \
158
  --progress-every 25 \
159
  --output artifacts/hybrid_predictions.parquet
160
  ```
161
 
162
  For overnight runs, `scripts/run_train_and_eval.sh` chains the core pipeline. The HMM,
@@ -166,77 +303,185 @@ KEGG, and embedding paths are independent — once their per-genome parquets exi
166
 
167
  ## Architecture
168
 
169
- ### Core
170
- - **`src/microbe_model/data/bacdive.py`** — v2 REST client (public, no auth). Discovers
171
- strains by batch-scanning the integer ID range; ~150K live records in 2K calls.
172
- - **`src/microbe_model/pipeline.py`** — streaming fetch + featurize. Each worker process
173
- downloads a genome FASTA, runs pyrodigal, extracts features, and discards the FASTA —
174
- no persistent genome storage. Resumable via the JSONL append log.
175
- - **`src/microbe_model/features/genome.py`** — pyrodigal CDS prediction + amino-acid
176
- composition / codon / tetranucleotide features.
177
- - **`src/microbe_model/train/baseline.py`** — multi-task XGBoost with per-fold class
178
- re-encoding for classification.
179
- - **`src/microbe_model/eval.py`** — markdown report renderer.
180
-
181
- ### Feature paths
182
- - **`src/microbe_model/features/markers.py`** — 48 verified Pfam markers across 8 categories
183
- (T_opt, pH, oxygen, salt, vitamins, nitrogen, carbon, special). All IDs validated via
184
- `scripts/23_verify_markers.py` against InterPro DESC fields.
185
- - **`src/microbe_model/features/kegg_modules.py`** — KEGG module rule parser (boolean
186
- AND / OR / parens grammar) + AST evaluator for fractional & strict completeness scoring.
187
- - **`src/microbe_model/features/embeddings.py`** — frozen ESM-2 forward pass + mean-pool
188
- per protein → per-proteome 320/640-dim vector (model-size dependent).
189
-
190
- ### Scanners (numbered scripts)
191
- - **`24_unified_hmm_scan.py`** — pyhmmer scan over the 48-marker Pfam library, dedup'd
192
- by genome accession, streams to `data/hmm_features.parquet`.
193
- - **`28_kofam_scan.py`** — same architecture but against KOfam (~3K KEGG-relevant HMMs);
194
- output is per-genome KO sets.
195
- - **`29_compute_kegg_completeness.py`** — applies the KEGG module rules to KO hits,
196
- yields ~570 fractional-completeness columns per genome.
197
- - **`30_parse_isolation_metadata.py`** — parses raw BacDive JSONs for lat/lon/country/
198
- host species; outputs `data/isolation_metadata.parquet` with one-hot encodings.
199
- - **`modal_embed.py`** — Modal app for ESM-2 t30 (or t33) extraction on A10G GPUs.
200
-
201
- ### UI and API
202
- - **`api/main.py`** — FastAPI backend for the Hugging Face Space. It serves the React
203
- build, recommender models, catalog API, NCBI lookup, and on-demand genome prediction.
204
- - **`web/`** — React/Vite frontend used by the Docker Space at
205
- <https://huggingface.co/spaces/miyuiu/microbe-model>.
206
- - **Hybrid catalog behavior** — `/api/catalog` always loads
207
- `artifacts/uncultured_predictions.parquet`; if `artifacts/hybrid_predictions.parquet`
208
- exists, the API overlays matching `pred_*` columns by `genome_accession`.
209
- Oxygen rows include `O2_source` so the UI can show `LoRA` vs `tabular`.
210
- - **Live `/api/predict` behavior** — on-demand predictions currently use the deployed
211
- tabular phenotype heads and return per-phenotype `source` metadata. LoRA-backed
212
- oxygen is used for precomputed hybrid catalog rows when the hybrid artifact is present.
213
 
214
  ## Layout
215
 
216
  ```
217
- src/microbe_model/
218
- config.py # paths, env vars, prediction targets
219
- data/bacdive.py # BacDive v2 client
220
- features/
221
- genome.py # pyrodigal + composition / codon / tetra
222
- composition.py # tetranucleotide + codon-usage helpers
223
- markers.py # 48 verified Pfam markers (8 categories)
224
- kegg_modules.py # KEGG module rule parser + AST evaluator
225
- embeddings.py # ESM-2 mean-pool helpers
226
- pipeline.py # streaming async fetch + featurize
227
- train/
228
- baseline.py # multi-task XGBoost + GroupKFold
229
- media_recommender.py # per-medium binary classifiers
230
- eval.py # markdown report renderer
231
- scripts/ # numbered pipeline entry points (01–39 + modal_*.py)
232
- api/ # FastAPI backend for the Docker/Hugging Face Space
233
- web/ # React/Vite frontend for the deployed UI
234
- tests/ # unit + integration tests
235
- data/ # (gitignored) parquet tables, JSONL features, BacDive cache
236
- artifacts/ # eval report, training results, logs
237
- models/ # trained phenotype heads + per-medium recommender models (LFS)
238
  ```
239
 
240
  ## What this is *not* yet
241
 
242
  - Not an end-to-end foundation model. LoRA only fine-tunes an ESM-2 marker-protein
@@ -267,6 +512,9 @@ headline result and `artifacts/eval_report.md` for the full eval.
267
  🔬 Open:
268
  - **Run LoRA across folds 1-4** if publication-grade validation is needed; fold 0
269
  is promising for oxygen, but it is still only one group fold
270
  - **Attention-pooled** per-category genome encoder instead of mean-pool (most novel
271
  methodological direction)
272
  - LPSN/GTDB family proper join (for tighter GroupKFold)
 
56
  tabular XGBoost heads for temperature/pH/salt plus LoRA for oxygen. The deployed UI
57
  surfaces whether each oxygen value came from `LoRA` or the `tabular` fallback.
58
 
59
+ Media recommendation now has a dry-lab held-out benchmark in
60
+ [artifacts/media_recommender_drylab_benchmark.md](artifacts/media_recommender_drylab_benchmark.md).
61
+ On 5-fold family-heldout MediaDive links, the XGBoost recommender recovers at least
62
+ one known medium in the top 5 for `77.5%` of evaluable strains, compared with `36.6%`
63
+ for global medium popularity and `37.2%` for the taxonomic-popularity baseline.
64
+ Median per-medium ROC-AUC is `0.910`; median PR-AUC is `0.183`, reflecting sparse,
65
+ imbalanced medium labels.
66
+
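The Hit@5 figure above counts a strain as a hit when at least one of its known media appears among the five highest-ranked recommendations. A minimal sketch of that metric follows, assuming ranked medium IDs per strain and a set of known media per strain; the function names, strain IDs, and medium IDs are hypothetical and are not taken from `scripts/41_benchmark_media_recommender.py`.

```python
def hit_at_k(ranked_media, known_media, k=5):
    """True if any known medium appears in the first k ranked recommendations."""
    return any(medium in known_media for medium in ranked_media[:k])


def hit_rate_at_k(predictions, truths, k=5):
    """Fraction of evaluable strains (at least one known medium) with a top-k hit."""
    evaluable = [strain for strain, media in truths.items() if media]
    if not evaluable:
        return float("nan")
    hits = sum(hit_at_k(predictions.get(s, []), truths[s], k) for s in evaluable)
    return hits / len(evaluable)


# Hypothetical toy data: ranked recommendations and known media per strain.
predictions = {
    "strain_a": ["M1", "M2", "M3", "M4", "M5", "M6"],
    "strain_b": ["M9", "M8", "M7", "M6", "M5", "M1"],
    "strain_c": ["M2"],
}
truths = {
    "strain_a": {"M5"},   # known medium ranked 5th: hit
    "strain_b": {"M1"},   # known medium ranked 6th: miss
    "strain_c": set(),    # no known medium: not evaluable
}

rate = hit_rate_at_k(predictions, truths, k=5)
print(f"Hit@5 = {rate:.3f}")
```

Strains without any known medium are excluded from the denominator here, which mirrors the "evaluable strains" wording but is still an assumption about how the benchmark script filters rows.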
67
+ External-tool benchmarking is prepared in
68
+ [artifacts/external_benchmark_status.md](artifacts/external_benchmark_status.md).
69
+ The manifest pins the same family-heldout strains for GenomeSPOT condition-trait
70
+ comparison and CarveMe/gapseq-style medium-feasibility comparison.
71
+ GenomeSPOT has now been run on a deterministic 5,000-unique-genome subset of that
72
+ manifest; see [artifacts/genomespot_5k_benchmark.md](artifacts/genomespot_5k_benchmark.md).
73
+ The run completed 5,000/5,000 genomes with no failures and measured `4.393 °C`
74
+ temperature MAE, `0.608` pH MAE, and `1.981%` salt MAE. Oxygen is retained as raw
75
+ GenomeSPOT tolerant/not-tolerant output because it is not the same label space as
76
+ BacDive's multi-class oxygen requirement.
77
+ Current local smoke runs are recorded in
78
+ [artifacts/genomespot_smoke_benchmark.md](artifacts/genomespot_smoke_benchmark.md)
79
+ and [artifacts/carveme_smoke_status.md](artifacts/carveme_smoke_status.md).
80
+
81
+ ## Benchmarks vs prior work
82
+
83
+ ### Headline accuracy gains
84
+
85
+ In direct comparison on identical held-out strains, the medium recommender is **108% more accurate** at Hit@5 than the strongest popularity baseline (77.5% vs 37.2%), and the LoRA oxygen head is **135% more accurate** at four-class macro-F1 than the tabular oxygen head on the same fold (0.945 vs 0.402). Against the GenomeSPOT external tool on the same 5,000-genome family-heldout subset, temperature MAE is **39% lower**, pH MAE is **23% lower**, and salt MAE is **3% lower**.
86
+
87
+ On every comparison whose splits and baselines are controlled tightly enough to support a direct percent comparison, this work is more accurate than the prior-work predictor.
88
+
89
+ | vs comparator | Target | Comparator | This work | **Δ relative** |
90
+ |---|---|---:|---:|---:|
91
+ | **GenomeSPOT** *(same 5,000 family-heldout genomes)* | Temperature MAE | 4.39 °C | **2.67 °C** | **−39%** error |
92
+ | | pH MAE | 0.61 | **0.47** | **−23%** error |
93
+ | | Salt MAE | 1.98% | **1.92%** | **−3%** error |
94
+ | **Koblitz 2025** *(best published BacDive baseline, their 21K corpus)* | Temperature MAE | ≈ 2.94 °C | **2.67 °C** | **−9%** error on 2× the corpus |
95
+ | **Tabular oxygen head** *(internal, same fold)* | Oxygen macro-F1 (4-class) | 0.402 | **0.945** | **+135%** F1 (LoRA upgrade) |
96
+ | **Popularity baselines** *(same dry-lab heldout split)* | Medium Hit@5 | 0.372 (taxonomic) | **0.775** | **+108%** Hit@5 |
97
+
98
+ These are the comparisons whose split, corpus, and tooling are matched closely enough to quote a percent. Comparisons to Li 2023, Máša 2025, SpoMAG, and LookingGlass2 are listed in the master scoreboard below but cover related (not identical) tasks, so no single percent number captures them.
99
+
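The **Δ relative** column in the table above can be reproduced from the two value columns as `(this work − comparator) / comparator`. The short check below uses only the numbers quoted in that table; the helper name `relative_change_pct` is introduced here for illustration.

```python
def relative_change_pct(comparator, this_work):
    """Signed relative change of this_work against comparator, in percent."""
    return 100.0 * (this_work - comparator) / comparator


# (label, comparator value, this-work value), copied from the table above.
rows = [
    ("Temperature MAE vs GenomeSPOT", 4.39, 2.67),
    ("pH MAE vs GenomeSPOT", 0.61, 0.47),
    ("Salt MAE vs GenomeSPOT", 1.98, 1.92),
    ("Temperature MAE vs Koblitz 2025", 2.94, 2.67),
    ("Oxygen macro-F1 vs tabular head", 0.402, 0.945),
    ("Medium Hit@5 vs taxonomic baseline", 0.372, 0.775),
]

deltas = {name: round(relative_change_pct(c, t)) for name, c, t in rows}
for name, delta in deltas.items():
    print(f"{name}: {delta:+d}%")
```

Negative values are error reductions (MAE rows); positive values are gains on higher-is-better metrics (macro-F1, Hit@5).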
100
+ ### Master scoreboard
101
+
102
+ | Method | T_opt MAE °C | pH MAE | Salt MAE % | O₂ F1-macro | Medium Hit@5 | Corpus | Comparison basis |
103
+ |---|---:|---:|---:|---:|---:|---:|---|
104
+ | **★ This work — hybrid** | **2.67** | **0.47** | **1.92** | **0.945**† | **0.78** | 46K | — |
105
+ | ★ This work — tabular | 2.67 | 0.47 | 1.92 | 0.40 | 0.78 | 46K | — |
106
+ | ★ This work — pre-PTPE | 2.74 | 0.47 | 1.94 | 0.41 | 0.78 | 46K | own ablation |
107
+ | GenomeSPOT | 4.39 | 0.61 | 1.98 | binary only | — | tool | same split, n=5,000 |
108
+ | Koblitz 2025 (Pfam-RF) | ≈ 2.94 | binary | binary | binary 0.85+ | — | 21K | their paper |
109
+ | Li 2023 (KEGG-RF) | — | — | — | — | — | 96 | different task |
110
+ | Máša 2025 (rule-based) | — | — | — | — | 2 media | traits-in | different task |
111
+ | SpoMAG / LookingGlass2 | — | — | — | — | — | single | single-target |
112
+ | Taxonomic-popularity | — | — | — | — | 0.37 | — | same split |
113
+ | Global popularity | — | — | — | — | 0.37 | — | same split |
114
+ | CarveMe / gapseq | ✗ does not predict these | ✗ | ✗ | ✗ | pending | — | smoke only |
115
+
116
+ `†` LoRA fold-0 only; remaining 4 folds pending. Tabular oxygen is the production fallback.
117
+
118
+ ### Temperature MAE — lower is better
119
+
120
+ ```
121
+ GenomeSPOT ████████████████████████████ 4.39 °C ← worst
122
+ Koblitz 2025 ██████████████████ 2.94 °C
123
+ This work (pre) █████████████████ 2.74 °C
124
+ This work (+PTPE) ████████████████ 2.67 °C ← best
125
+ │ │ │ │ │ │ │
126
+ 0 1 2 3 4 5 6 7
127
+ ```
128
+
129
+ ### Oxygen macro-F1 (4-class) — higher is better
130
+
131
+ ```
132
+ This work LoRA ███████████████████████████████████████ 0.945 ← best
133
+ This work tabular ████████████████ 0.402
134
+ This work pre-PTPE ████████████████▓ 0.412
135
+ GenomeSPOT ░░░░ binary tolerant/not-tolerant — different label space
136
+ Koblitz 2025 ░░░░ binary aerobe/anaerobe — different label space
137
+ │ │ │ │ │
138
+ 0 0.25 0.5 0.75 1.0
139
+ ```
140
+
141
+ ### Medium recommendation Hit@5 (21,050 strains, 5-fold family-heldout)
142
+
143
+ ```
144
+ XGBoost recommender ████████████████████████████████████████ 77.5% ← this work
145
+ Taxonomic baseline █████████████████████ 37.2%
146
+ Global popularity █████████████████████ 36.6%
147
+ │ │ │ │ │
148
+ 0% 25% 50% 75% 100%
149
+ ```
150
+
151
+ ### Status of each comparison
152
+
153
+ | Locked in `artifacts/` | Pending |
154
+ |---|---|
155
+ | ✓ GenomeSPOT on n=5 held-out (`genomespot_smoke_benchmark.md`) | ◔ Full GenomeSPOT on 16,154 held-out genomes (8 / 16,154 FASTAs local) |
156
+ | ✓ Koblitz 2025 published numbers (manuscript Discussion §) | ◔ Koblitz exact-split bake-off |
157
+ | ✓ Popularity baselines (`media_recommender_drylab_benchmark.md`) | ◔ CarveMe medium-feasibility (needs MediaDive→compound map) |
158
+ | ✓ LoRA fold-0 vs tabular (`docs/lora_results.md`) | ◔ gapseq (not installable on macOS) |
159
+ | ✓ PTPE ablation (`baseline_results*.json`) | ◔ LoRA folds 1–4 |
160
+ | | ◔ Leave-one-phylum-out (Li 2023-style out-of-clade) |
161
+ | | ◔ Wet-lab validation of any prediction |
162
+
163
+ ### Honest summary
164
+
165
+ On the metrics that have actually been measured, this work is the strongest published
166
+ BacDive cultivation-condition predictor: −39% temperature MAE vs GenomeSPOT on the
167
+ same rows, ~9% lower temperature MAE than Koblitz 2025 on 2× the corpus,
168
+ 0.945 oxygen macro-F1 via LoRA on the harder 4-class label, and ~2× the Hit@5 of
169
+ popularity baselines for medium recommendation. Three head-to-heads (Koblitz
170
+ exact-split, full-manifest GenomeSPOT, full 5-fold LoRA) and wet-lab validation
171
+ remain open.
172
+
173
  ## Approach
174
 
175
  ```
 
271
  --resume-chunks \
272
  --progress-every 25 \
273
  --output artifacts/hybrid_predictions.parquet
274
+
275
+ # === External benchmark manifest ===
276
+ # Pins the same held-out strains/folds for GenomeSPOT, CarveMe, and gapseq runs.
277
+ PYTHONPATH=src uv run --python 3.11 python scripts/42_prepare_external_benchmarks.py
278
+
279
+ # Optional smoke download of 10 missing genome FASTAs for external-tool setup checks.
280
+ PYTHONPATH=src uv run --python 3.11 python scripts/42_prepare_external_benchmarks.py \
281
+ --download-fastas 10
282
+
283
+ # Run a small GenomeSPOT smoke benchmark on exact held-out rows with all condition labels.
284
+ PYTHONPATH=src uv run --python 3.11 python scripts/43_run_genomespot_benchmark.py \
285
+ --limit 5 \
286
+ --require-label temperature \
287
+ --require-label ph \
288
+ --require-label salt \
289
+ --require-label oxygen
290
+
291
+ # Run the 5,000-genome GenomeSPOT subset used in the scoreboard.
292
+ PYTHONPATH=src uv run --python 3.11 python scripts/43_run_genomespot_benchmark.py \
293
+ --manifest artifacts/external_benchmark_manifest_5k.parquet \
294
+ --limit 5000 \
295
+ --out-json artifacts/genomespot_5k_benchmark.json \
296
+ --out-md artifacts/genomespot_5k_benchmark.md
297
  ```
298
 
299
  For overnight runs, `scripts/run_train_and_eval.sh` chains the core pipeline. The HMM,
 
303
 
304
  ## Architecture
305
 
306
+ The system is organised as **five layers**. Each layer reads parquet artifacts
307
+ produced by the previous one, so any stage can be re-run independently.
308
+
309
+ ```
310
+ ┌────────────────────────────────────────┐
311
+ LAYER 5 │ Serving FastAPI + React + HF Space │
312
+ └────────────────────┬───────────────────┘
313
+ │ loads
314
+ ┌────────────────────┴───────────────────┐
315
+ LAYER 4 │ Modeling XGBoost heads + LoRA head │
316
+ │ → hybrid predictor + media recommender│
317
+ └────────────────────┬───────────────────┘
318
+ │ reads training_table.parquet
319
+ ┌────────────────────┴───────────────────┐
320
+ LAYER 3 │ Feature fusion wide table join │
321
+ └────────────────────┬───────────────────┘
322
+ │ reads 6 parquet shards
323
+ ┌────────────────────┴───────────────────┐
324
+ LAYER 2 │ Feature extraction 6 parallel paths │
325
+ └────────────────────┬───────────────────┘
326
+ │ reads bacdive_phenotypes + genome FASTAs
327
+ ┌────────────────────┴───────────────────┐
328
+ LAYER 1 │ Ingestion BacDive v2 + NCBI Datasets │
329
+ └────────────────────────────────────────┘
330
+ ```
331
+
332
+ ### Layer 1 — Ingestion
333
+ | File | Role |
334
+ |---|---|
335
+ | `src/microbe_model/data/bacdive.py` | v2 REST client (public, no auth). Batch-scans integer ID range to discover ~150K live strains in ~2K calls. |
336
+ | `scripts/01_fetch_bacdive.py` | Sweeps the BacDive API and writes `data/bacdive_phenotypes.parquet`. |
337
+ | `src/microbe_model/pipeline.py` | Async streaming fetch + featurize. Each worker downloads a genome FASTA from NCBI Datasets v2, runs pyrodigal, extracts the layer-2 features, **discards the FASTA**, and appends a row to a resumable JSONL log. |
338
+ | `scripts/02_fetch_and_featurize.py` | CLI entry point that drives the pipeline over BacDive rows. |
339
+ | `scripts/18_resolve_species_to_genome.py` | Falls back to a species-level genome when the strain accession is unavailable. |
340
+ | `scripts/06_fetch_gtdb_candidates.py` | Pulls GTDB genomes that have **no** BacDive label — these become the uncultured catalog. |
341
+
342
+ ### Layer 2 — Feature extraction (six parallel paths)
343
+ All six produce a per-genome parquet keyed by `genome_accession`. Layer 3 left-joins them.
344
+
345
+ | # | Path | Source files | Output | Compute |
346
+ |---|---|---|---|---|
347
+ | 1 | **Composition / codon / tetra** (~355 cols) | `features/genome.py`, `features/composition.py` | `data/features.parquet` | local CPU, inline with ingestion |
348
+ | 2 | **MediaDive recipe stats** | `scripts/08_extract_strain_media.py`, `09_fetch_media_recipes.py`, `20_build_mediadive_features.py` | `data/mediadive_features.parquet` | local CPU |
349
+ | 3 | **Curated Pfam HMMs** (48 markers, 8 categories) | `features/markers.py`, `scripts/23_verify_markers.py`, `scripts/24_unified_hmm_scan.py` | `data/hmm_features.parquet` | local **pyhmmer**, ~5 hr / 22K genomes |
350
+ | 4 | **KEGG module completeness** (570 modules) | `features/kegg_modules.py`, `scripts/27_fetch_kegg_modules.py`, `28_kofam_scan.py`, `29_compute_kegg_completeness.py`, `modal_kofam.py` | `data/kegg_modules.parquet` | **Modal A10G GPUs** for the KOfam scan |
351
+ | 5 | **Isolation metadata** | `scripts/30_parse_isolation_metadata.py` | `data/isolation_metadata.parquet` | local CPU, ~30 s |
352
+ | 6 | **Phenotype-targeted ESM-2 embeddings (PTPE)** | `features/embeddings.py`, `scripts/36_extract_marker_sequences.py`, `modal_per_marker_embed.py`, `_materialize_per_marker_embeddings.py` | `data/per_marker_embeddings.parquet` | **Modal A10G GPUs** (frozen ESM-2 t30) |
353
+
354
+ Per genome, the six paths concatenate to **~6,312 features**.
355
+
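Path 4 in the table above scores KEGG module completeness from per-genome KO hits. The sketch below illustrates one way such a rule parser and evaluator can work; it is not the code in `features/kegg_modules.py`. It assumes a simplified grammar (KO IDs, spaces for AND, commas for OR, parentheses), assumes commas bind more loosely than spaces, and scores AND nodes as the mean of their children and OR nodes as the maximum. The example rule is a shortened, illustrative definition rather than a full KEGG module.

```python
import re


def parse_module(definition):
    """Parse a simplified KEGG-style rule into a nested (op, payload) AST."""
    text = re.sub(r"\s*,\s*", ",", definition.strip())
    text = re.sub(r"\(\s+", "(", text)
    text = re.sub(r"\s+\)", ")", text)
    tokens = re.findall(r"K\d{5}|[(),]|\s+", text)
    if "".join(tokens) != text:
        raise ValueError(f"unsupported syntax in {definition!r}")
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def parse_or():  # lowest precedence: comma-separated alternatives
        nonlocal pos
        children = [parse_and()]
        while peek() == ",":
            pos += 1
            children.append(parse_and())
        return children[0] if len(children) == 1 else ("or", children)

    def parse_and():  # whitespace-separated required steps
        nonlocal pos
        children = [parse_atom()]
        while peek() is not None and peek().isspace():
            pos += 1
            children.append(parse_atom())
        return children[0] if len(children) == 1 else ("and", children)

    def parse_atom():
        nonlocal pos
        tok = peek()
        if tok == "(":
            pos += 1
            node = parse_or()
            if peek() != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1
            return node
        if tok is None or not tok.startswith("K"):
            raise ValueError(f"unexpected token {tok!r}")
        pos += 1
        return ("ko", tok)

    ast = parse_or()
    if pos != len(tokens):
        raise ValueError("trailing tokens after rule")
    return ast


def completeness(node, kos):
    """Fractional completeness: AND = mean of children, OR = best alternative."""
    op, payload = node
    if op == "ko":
        return 1.0 if payload in kos else 0.0
    scores = [completeness(child, kos) for child in payload]
    return max(scores) if op == "or" else sum(scores) / len(scores)


rule = "(K00844,K12407) K01810 ((K00134,K00150) K00927,K11389)"
ast = parse_module(rule)
fractional = completeness(ast, {"K00844", "K00134"})
strict = fractional == 1.0
print(fractional, strict)
```

Under this scoring, a module reaches 1.0 only when every required step has at least one fully satisfied alternative, so the strict flag can be derived from the fractional score.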
356
+ ### Layer 3 — Feature fusion
357
+ | File | Role |
358
+ |---|---|
359
+ | `scripts/21_build_strain_catalog.py` | Materialises the deduplicated `data/strain_catalog.parquet`. |
360
+ | `scripts/31_merge_features.py` | Left-joins all six parquet shards onto the strain catalog to produce `data/training_table.parquet` (the canonical input to every modeling step). |
361
+ | `scripts/13_compare_v1_v2.py` | A/B harness for proving each new feature path lifts the metric. |
362
+
363
+ ### Layer 4 — Modeling
364
+ | File | Role |
365
+ |---|---|
366
+ | `src/microbe_model/train/baseline.py` | Multi-task **XGBoost**. Regression heads for T_opt/pH/salt, classification head for oxygen. 5-fold **GroupKFold by family** so leakage is suppressed. Per-fold class re-encoding. |
367
+ | `src/microbe_model/train/media_recommender.py` | Per-medium binary classifiers — the recommender. Trained by `scripts/10_train_media_recommender.py`. |
368
+ | `src/microbe_model/train/lora_model.py` + `lora_trainer.py` | LoRA fine-tune of ESM-2 t12 on marker proteins. Driven by `modal_train_lora.py` (Modal) or `lambda_train_lora.py` (Lambda) or the Kaggle notebook in `kaggle/`. |
369
+ | `scripts/03_train_baseline.py` | Train the tabular XGBoost heads. |
370
+ | `scripts/15_train_phenotype_heads.py` | Retrain individual phenotype heads after a feature update. |
371
+ | `scripts/37_compare_lora_baseline.py`, `38_eval_lora_checkpoint.py` | LoRA vs tabular head A/B and checkpoint eval. |
372
+ | `scripts/39_predict_hybrid.py` | **Hybrid predictor.** Uses tabular heads for T_opt/pH/salt and the LoRA head for oxygen. Tags every output row with `O2_source ∈ {LoRA, tabular}`. |
373
+ | `src/microbe_model/eval.py` + `scripts/04_eval.py`, `05_overnight_summary.py` | Markdown report renderer → `artifacts/eval_report.md`, `OVERNIGHT_SUMMARY.md`. |
374
+
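The family-grouped cross-validation described for `baseline.py` keeps every taxonomic family entirely on one side of each split. The sketch below shows the idea with a hand-rolled, standard-library splitter; it is a simplified stand-in for scikit-learn's `GroupKFold`, not the project's training code, and the family labels are illustrative.

```python
from collections import defaultdict


def group_kfold(groups, n_splits=5):
    """Yield (train_idx, test_idx) pairs in which no group spans both sides.

    Groups are assigned greedily, largest first, to the currently smallest fold.
    """
    members = defaultdict(list)
    for idx, group in enumerate(groups):
        members[group].append(idx)
    if len(members) < n_splits:
        raise ValueError("need at least as many groups as folds")

    folds = [[] for _ in range(n_splits)]
    for group in sorted(members, key=lambda g: (-len(members[g]), g)):
        min(folds, key=len).extend(members[group])

    for test in folds:
        held_out = set(test)
        train = [i for i in range(len(groups)) if i not in held_out]
        yield train, sorted(test)


# Illustrative family label for each strain row.
families = [
    "Bacillaceae", "Bacillaceae",
    "Vibrionaceae", "Vibrionaceae", "Vibrionaceae",
    "Halobacteriaceae",
    "Enterobacteriaceae", "Enterobacteriaceae",
    "Streptomycetaceae",
]

splits = list(group_kfold(families, n_splits=3))
for fold, (train, test) in enumerate(splits):
    test_families = sorted({families[i] for i in test})
    print(f"fold {fold}: {len(train)} train rows, held-out families {test_families}")
```

Because whole families are held out, a test-fold score reflects generalisation to unseen families rather than to close relatives of training strains.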
375
+ Outputs of this layer (committed via LFS):
376
+ - `models/phenotype/` — XGBoost heads for the 4 phenotype targets.
377
+ - `models/recommender/` — per-medium binary classifiers.
378
+ - LoRA checkpoint shipped as a GitHub Release (`lora-fold0-20260518`).
379
+
380
+ ### Layer 5 — Serving
381
+ | File | Role |
382
+ |---|---|
383
+ | `app.py` | Entrypoint launched by the Docker Space. |
384
+ | `Dockerfile` | Builds the React app and starts FastAPI on port 7860 (HF Space convention). |
385
+ | `api/main.py` | FastAPI backend. Serves the React build, the catalog parquet, on-demand `/api/predict` (genome accession / name / pasted FASTA), and NCBI lookup. |
386
+ | `web/` (Vite + React) | Frontend deployed to <https://huggingface.co/spaces/miyuiu/microbe-model>. Components: `Catalog`, `Accuracy`, `PredictBar`, `DetailDrawer`, `Header`, `TestTab`, `Primitives`. |
387
+
388
+ Hybrid catalog overlay:
389
+ - `/api/catalog` always reads `artifacts/uncultured_predictions.parquet`.
390
+ - If `artifacts/hybrid_predictions.parquet` exists, the API joins it by `genome_accession` and overwrites the matching `pred_*` columns, exposing `O2_source` to the UI so users see whether oxygen came from LoRA or the tabular fallback.
391
+
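The overlay rule above amounts to a keyed overwrite of prediction columns. A minimal sketch with plain dictionaries follows, assuming each row carries `genome_accession`; the column `pred_oxygen`, the accessions, and the default `O2_source` of `tabular` for rows without a hybrid match are assumptions made for illustration, and the real `api/main.py` may implement this differently.

```python
def overlay_hybrid(base_rows, hybrid_rows, key="genome_accession"):
    """Overwrite pred_* fields and O2_source in base rows with hybrid values."""
    hybrid_by_key = {row[key]: row for row in hybrid_rows}
    merged = []
    for row in base_rows:
        out = dict(row)
        out.setdefault("O2_source", "tabular")  # assumed default when no overlay
        override = hybrid_by_key.get(row[key])
        if override is not None:
            for col, value in override.items():
                if col.startswith("pred_") or col == "O2_source":
                    out[col] = value
        merged.append(out)
    return merged


# Hypothetical rows standing in for the two parquet files.
base = [
    {"genome_accession": "GCA_000000001.1", "pred_oxygen": "aerobe", "name": "genome one"},
    {"genome_accession": "GCA_000000002.1", "pred_oxygen": "anaerobe", "name": "genome two"},
]
hybrid = [
    {"genome_accession": "GCA_000000002.1", "pred_oxygen": "microaerophile",
     "O2_source": "LoRA", "name": "ignored"},
]

catalog = overlay_hybrid(base, hybrid)
for row in catalog:
    print(row["genome_accession"], row["pred_oxygen"], row["O2_source"])
```

Only `pred_*` columns and `O2_source` are taken from the hybrid rows, so descriptive fields in the base catalog are left untouched.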
392
+ ### Cross-cutting — Benchmarking & external comparison
393
+ | File | Role |
394
+ |---|---|
395
+ | `scripts/41_benchmark_media_recommender.py` | 5-fold family-heldout dry-lab benchmark for the recommender. |
396
+ | `scripts/42_prepare_external_benchmarks.py` | Pins the same held-out strains/folds for GenomeSPOT, CarveMe, gapseq. |
397
+ | `scripts/43_run_genomespot_benchmark.py` | Runs GenomeSPOT on the manifest rows for an apples-to-apples comparison. |
398
+ | `tests/test_hybrid_predictor.py`, `test_lora_checkpoint_eval.py`, `test_external_benchmark_prep.py`, `test_genomespot_benchmark.py`, `test_media_recommender.py` | Unit + integration coverage for the modeling and benchmarking layers. |
399
+
400
+ ### Remote-execution surfaces
401
+ The local Mac can't sustain the KOfam scan or full ESM-2 inference, so the heavy
402
+ stages dispatch to managed GPUs. Each surface has its own driver:
403
+
404
+ | Driver | Used for |
405
+ |---|---|
406
+ | `scripts/modal_embed.py` | ESM-2 t30 full-proteome embedding on Modal A10G. |
407
+ | `scripts/modal_per_marker_embed.py` | PTPE (marker-only) ESM-2 embedding on Modal. |
408
+ | `scripts/modal_kofam.py` | KOfam HMM scan on Modal. |
409
+ | `scripts/modal_train_lora.py` | LoRA fine-tune on Modal. |
410
+ | `scripts/lambda_train_lora.py` | Same fine-tune, Lambda Labs backend. |
411
+ | `kaggle/lora_train_kaggle.ipynb` + `kaggle/upload.sh` | Kaggle notebook fallback for LoRA training. |
412
+ | `cerebrium/embed`, `cerebrium/kofam` | **Suspended** — earlier Cerebrium deployment kept for reference. |
413
 
414
  ## Layout
415
 
416
  ```
417
+ microbe-model/
418
+ ├── app.py # Docker Space entrypoint
419
+ ├── Dockerfile # HF Space (Python+Node, port 7860)
420
+ ├── api/main.py # FastAPI backend (catalog + /predict)
421
+ ├── web/ # React/Vite frontend
422
+ │ └── src/
423
+ │ ├── App.jsx, main.jsx, theme.js
424
+ │ └── components/ # Catalog, Accuracy, PredictBar,
425
+ # DetailDrawer, Header, TestTab,
426
+ # Primitives
427
+ ├── src/microbe_model/ # library code
428
+ │ ├── config.py # paths, env vars, prediction targets
429
+ │ ├── pipeline.py # streaming fetch + featurize (Layer 1)
430
+ │ ├── eval.py # markdown report renderer
431
+ │ ├── explore.py
432
+ │ ├── data/
433
+ │ │ └── bacdive.py # BacDive v2 client
434
+ │ ├── features/ # Layer 2 implementations
435
+ │ │ ├── genome.py # pyrodigal + composition
436
+ │ │ ├── composition.py # codon / tetranucleotide helpers
437
+ │ │ ├── markers.py # 48 verified Pfam markers
438
+ │ │ ├── kegg_modules.py # KEGG rule parser + AST evaluator
439
+ │ │ └── embeddings.py # ESM-2 mean-pool helpers
440
+ │ └── train/ # Layer 4 implementations
441
+ │ ├── baseline.py # multi-task XGBoost + GroupKFold
442
+ │ ├── media_recommender.py # per-medium binary classifiers
443
+ │ ├── lora_model.py # LoRA wrapper around ESM-2 t12
444
+ │ └── lora_trainer.py # train loop, optimizer, eval
445
+ ├── scripts/ # numbered pipeline entry points 01–43
446
+ │ ├── 01–05 core: fetch, featurize, train, eval, summarize
447
+ │ ├── 06–07 uncultured catalog + predictions
448
+ │ ├── 08–10 MediaDive ingestion + recommender training
449
+ │ ├── 11–14 ESM-2 embeddings + combined training
450
+ │ ├── 15–17 phenotype heads + scoring + relabel
451
+ │ ├── 18–20 species→genome resolution + MediaDive features
452
+ │ ├── 21–26 HMM scan, weak labels, marker evaluation
453
+ │ ├── 27–29 KEGG modules + KOfam scan + completeness
454
+ │ ├── 30–31 isolation metadata + final feature merge
455
+ │ ├── 36–40 marker sequences + LoRA eval + hybrid predictor
456
+ │ ├── 41–43 benchmarks (media, external manifest, GenomeSPOT)
457
+ │ ├── modal_*.py # Modal GPU dispatchers
458
+ │ └── lambda_train_lora.py # Lambda Labs LoRA driver
459
+ ├── kaggle/ # Kaggle notebook + upload script (LoRA fallback)
460
+ ├── cerebrium/ # suspended Cerebrium deployment (embed, kofam)
461
+ ├── tests/ # 11 test files (unit + integration)
462
+ ├── paper/ # manuscript.md / .html / .pdf + render.py
463
+ ├── docs/ # hybrid_predictor.md, lora_results.md, etc.
464
+ ├── data/ # (gitignored) parquet shards + JSONL features
465
+ ├── artifacts/ # eval reports, training logs, prediction parquets
466
+ └── models/ # trained heads + recommender (LFS)
467
+ ├── phenotype/
468
+ └── recommender/
469
  ```
470
 
471
+ ### Key data artifacts (between stages)
472
+ | Parquet | Produced by | Consumed by |
473
+ |---|---|---|
474
+ | `data/bacdive_phenotypes.parquet` | `01_fetch_bacdive.py` | Layer 1 ingestion |
475
+ | `data/features.parquet` | `02_fetch_and_featurize.py` | Layer 3 merge |
476
+ | `data/hmm_features.parquet` | `24_unified_hmm_scan.py` | Layer 3 merge |
477
+ | `data/kegg_modules.parquet` | `29_compute_kegg_completeness.py` | Layer 3 merge |
478
+ | `data/mediadive_features.parquet` | `20_build_mediadive_features.py` | Layer 3 merge |
479
+ | `data/isolation_metadata.parquet` | `30_parse_isolation_metadata.py` | Layer 3 merge |
480
+ | `data/per_marker_embeddings.parquet` | `_materialize_per_marker_embeddings.py` | Layer 3 merge |
481
+ | `data/training_table.parquet` | `31_merge_features.py` | all training scripts |
482
+ | `artifacts/uncultured_predictions.parquet` | `07_predict_uncultured.py` | served catalog |
483
+ | `artifacts/hybrid_predictions.parquet` | `39_predict_hybrid.py` | served catalog overlay |
484
+
485
  ## What this is *not* yet
486
 
487
  - Not an end-to-end foundation model. LoRA only fine-tunes an ESM-2 marker-protein
 
512
  🔬 Open:
513
  - **Run LoRA across folds 1-4** if publication-grade validation is needed; fold 0
514
  is promising for oxygen, but it is still only one group fold
515
+ - **Run external baselines** on the prepared held-out manifest once FASTAs and
516
+ third-party databases are local: GenomeSPOT for condition traits, and
517
+ CarveMe/gapseq-style metabolic reconstructions for medium feasibility.
518
  - **Attention-pooled** per-category genome encoder instead of mean-pool (most novel
519
  methodological direction)
520
  - LPSN/GTDB family proper join (for tighter GroupKFold)
api/main.py CHANGED
@@ -15,6 +15,8 @@ import json
15
  import os
16
  import re
17
  import sys
18
  from pathlib import Path
19
  from tempfile import NamedTemporaryFile
20
  from typing import Any
@@ -183,27 +185,48 @@ def _looks_like_accession(target: str) -> bool:
183
  return bool(_ACCESSION_RE.match(target.strip()))
184
 
185
 
186
- def _ncbi_assembly_hits(q: str, retmax: int = 10) -> list[dict]:
187
- """Resolve an organism name to NCBI assembly accessions, best (most complete) first."""
188
  api_key = os.environ.get("NCBI_API_KEY")
189
- common = {"api_key": api_key} if api_key else {}
190
- r = requests.get(
191
- f"{EUTILS_BASE}/esearch.fcgi",
192
- params={"db": "assembly", "term": f"{q}[Organism] AND latest[filter]",
193
- "retmode": "json", "retmax": retmax, **common},
194
- timeout=20,
195
  )
196
- r.raise_for_status()
197
- ids = r.json().get("esearchresult", {}).get("idlist", [])
198
  if not ids:
199
- return []
200
- r = requests.get(
201
- f"{EUTILS_BASE}/esummary.fcgi",
202
- params={"db": "assembly", "id": ",".join(ids), "retmode": "json", **common},
203
- timeout=20,
204
  )
205
- r.raise_for_status()
206
- result = r.json().get("result", {})
207
  out = []
208
  for uid in result.get("uids", []):
209
  doc = result.get(uid, {})
@@ -214,7 +237,13 @@ def _ncbi_assembly_hits(q: str, retmax: int = 10) -> list[dict]:
214
  })
215
  rank = {"Complete Genome": 0, "Chromosome": 1, "Scaffold": 2, "Contig": 3}
216
  out.sort(key=lambda r: rank.get(r["level"], 99))
217
- return out
218
 
219
 
220
  # ──────────────────────────────────────────────────────────────────────
 
15
  import os
16
  import re
17
  import sys
18
+ import time
19
+ from functools import lru_cache
20
  from pathlib import Path
21
  from tempfile import NamedTemporaryFile
22
  from typing import Any
 
185
  return bool(_ACCESSION_RE.match(target.strip()))
186
 
187
 
188
+ def _eutils_get(endpoint: str, params: dict, *, retries: int = 3) -> dict:
189
+ """GET an E-utilities endpoint with an NCBI API key (if set) and retry on 429/5xx.
190
+
191
+ Anonymous eutils is limited to 3 req/sec (10/sec with NCBI_API_KEY), so transient
192
+ 429s are expected under concurrent load. Back off and retry rather than surfacing
193
+ the rate limit to the user.
194
+ """
195
  api_key = os.environ.get("NCBI_API_KEY")
196
+ if api_key:
197
+ params = {**params, "api_key": api_key}
198
+ last_exc: Exception | None = None
199
+ for attempt in range(retries):
200
+ try:
201
+ r = requests.get(f"{EUTILS_BASE}/{endpoint}", params=params, timeout=20)
202
+ r.raise_for_status()
203
+ return r.json()
204
+ except requests.RequestException as e:
205
+ last_exc = e
206
+ status = getattr(e.response, "status_code", None)
207
+ if status == 429 or (status is not None and status >= 500):
208
+ time.sleep(0.5 * (2 ** attempt)) # 0.5s, 1s, 2s
209
+ continue
210
+ raise
211
+ raise last_exc # type: ignore[misc]
212
+
213
+
214
+ @lru_cache(maxsize=512)
215
+ def _ncbi_assembly_hits_cached(q_norm: str, retmax: int) -> tuple[dict, ...]:
216
+ """Cached core resolver. Keyed on the normalized query; returns a tuple of hit dicts."""
217
+ data = _eutils_get(
218
+ "esearch.fcgi",
219
+ {"db": "assembly", "term": f"{q_norm}[Organism] AND latest[filter]",
220
+ "retmode": "json", "retmax": retmax},
221
  )
222
+ ids = data.get("esearchresult", {}).get("idlist", [])
 
223
  if not ids:
224
+ return ()
225
+ data = _eutils_get(
226
+ "esummary.fcgi",
227
+ {"db": "assembly", "id": ",".join(ids), "retmode": "json"},
 
228
  )
229
+ result = data.get("result", {})
 
230
  out = []
231
  for uid in result.get("uids", []):
232
  doc = result.get(uid, {})
 
237
  })
238
  rank = {"Complete Genome": 0, "Chromosome": 1, "Scaffold": 2, "Contig": 3}
239
  out.sort(key=lambda r: rank.get(r["level"], 99))
240
+ return tuple(out)
241
+
242
+
243
+ def _ncbi_assembly_hits(q: str, retmax: int = 10) -> list[dict]:
244
+ """Resolve an organism name to NCBI assembly accessions, best (most complete) first."""
245
+ hits = _ncbi_assembly_hits_cached(q.strip().lower(), retmax)
246
+ return [dict(h) for h in hits] # fresh copies so callers can't mutate the cache
247
 
248
 
249
  # ──────────────────────────────────────────────────────────────────────
artifacts/baseline_results.json CHANGED
The diff for this file is too large to render. See raw diff
 
artifacts/baseline_results_pre_pme.json ADDED
@@ -0,0 +1,1461 @@
1
+ {
2
+ "optimal_temperature_c": {
3
+ "task": "regression",
4
+ "mean_metric": 2.739843424922931,
5
+ "folds": [
6
+ {
7
+ "target": "optimal_temperature_c",
8
+ "task": "regression",
9
+ "metric_name": "mae",
10
+ "value": 3.0611745822227165,
11
+ "n_train": 36496,
12
+ "n_test": 9125
13
+ },
14
+ {
15
+ "target": "optimal_temperature_c",
16
+ "task": "regression",
17
+ "metric_name": "mae",
18
+ "value": 2.455059736520248,
19
+ "n_train": 36497,
20
+ "n_test": 9124
21
+ },
22
+ {
23
+ "target": "optimal_temperature_c",
24
+ "task": "regression",
25
+ "metric_name": "mae",
26
+ "value": 2.9954137300201173,
27
+ "n_train": 36497,
28
+ "n_test": 9124
29
+ },
30
+ {
31
+ "target": "optimal_temperature_c",
32
+ "task": "regression",
33
+ "metric_name": "mae",
34
+ "value": 2.890288436794323,
35
+ "n_train": 36497,
36
+ "n_test": 9124
37
+ },
38
+ {
39
+ "target": "optimal_temperature_c",
40
+ "task": "regression",
41
+ "metric_name": "mae",
42
+ "value": 2.2972806390572513,
43
+ "n_train": 36497,
44
+ "n_test": 9124
45
+ }
46
+ ],
47
+ "top_features": {
48
+ "ivywrel_frac": 0.06971819475293159,
49
+ "kegg_M00124": 0.03207303611561656,
50
+ "hmm_Cyt_CBB3_microaero_score": 0.0300216194242239,
51
+ "hmm_COX1_aerobic_score": 0.02658985326997936,
52
+ "kegg_M00848": 0.02018665219657123,
53
+ "hmm_COX2_periplasm_aero_score": 0.017157328128814698,
54
+ "iso_cat2_patient": 0.016685061932366806,
55
+ "kegg_M00154": 0.016055268728814555,
56
+ "iso_country_sweden": 0.013991944026201963,
57
+ "iso_host_kingdom_human": 0.013502865098416805,
58
+ "iso_cat2_thermophilic_gt45_c": 0.012728683836758136,
59
+ "kegg_M00044": 0.009734607744030654,
60
+ "hmm_Hsp90_score": 0.009510826040059327,
61
+ "hmm_Hsp90_n": 0.009146594302728772,
62
+ "kegg_M00959": 0.008737720013596118,
63
+ "kegg_M00118": 0.007348061632364988,
64
+ "hmm_UvrD_helicase_C_score": 0.0068708939361386,
65
+ "hmm_BCCT_compatible_n": 0.005894379841629415,
66
+ "iso_cat1_infection": 0.005757758072286378,
67
+ "kegg_M00357": 0.005146319686900824
68
+ }
69
+ },
70
+ "optimal_ph": {
71
+ "task": "regression",
72
+ "mean_metric": 0.47295744064234946,
73
+ "folds": [
74
+ {
75
+ "target": "optimal_ph",
76
+ "task": "regression",
77
+ "metric_name": "mae",
78
+ "value": 0.4365982120207544,
79
+ "n_train": 4082,
80
+ "n_test": 1021
81
+ },
82
+ {
83
+ "target": "optimal_ph",
84
+ "task": "regression",
85
+ "metric_name": "mae",
86
+ "value": 0.5473054707458974,
87
+ "n_train": 4082,
88
+ "n_test": 1021
89
+ },
90
+ {
91
+ "target": "optimal_ph",
92
+ "task": "regression",
93
+ "metric_name": "mae",
94
+ "value": 0.48511127384122277,
95
+ "n_train": 4082,
96
+ "n_test": 1021
97
+ },
98
+ {
99
+ "target": "optimal_ph",
100
+ "task": "regression",
101
+ "metric_name": "mae",
102
+ "value": 0.4573734540565341,
103
+ "n_train": 4083,
104
+ "n_test": 1020
105
+ },
106
+ {
107
+ "target": "optimal_ph",
108
+ "task": "regression",
109
+ "metric_name": "mae",
110
+ "value": 0.43839879254733816,
111
+ "n_train": 4083,
112
+ "n_test": 1020
113
+ }
114
+ ],
115
+ "top_features": {
116
+ "md_ph_median": 0.02589539885520935,
117
+ "hmm_TrkH_K_channel_n": 0.019185363367432727,
118
+ "iso_cat2_acidic": 0.012672881036996842,
119
+ "kegg_M00066": 0.0121498073451221,
120
+ "iso_cat2_alkaline": 0.011763222608715295,
121
+ "kegg_M00124": 0.0111599849013146,
122
+ "kegg_M00416": 0.008738127537071705,
123
+ "hmm_BCCT_compatible_score": 0.007578231208026409,
124
+ "kegg_M00309": 0.007578140590339899,
125
+ "kegg_M00788": 0.007394352648407221,
126
+ "kegg_M00155": 0.007158075552433729,
127
+ "kegg_M00525": 0.00614791321568191,
128
+ "hmm_RuBisCO_small_form1_score": 0.006115549360401928,
129
+ "kegg_M00154": 0.0058954657521098856,
130
+ "kegg_M01038": 0.005821993038989603,
131
+ "neg_charged_frac": 0.00574727994389832,
132
+ "hmm_RuBisCO_large_form1_score": 0.005585524230264127,
133
+ "kegg_M00126": 0.005203998694196343,
134
+ "kegg_M00417": 0.004907929454930127,
135
+ "kegg_M01048": 0.0048509430140256885
136
+ }
137
+ },
138
+ "oxygen_requirement": {
139
+ "task": "classification",
140
+ "mean_metric": 0.41167508730878977,
141
+ "folds": [
142
+ {
143
+ "target": "oxygen_requirement",
144
+ "task": "classification",
145
+ "metric_name": "f1_macro",
146
+ "value": 0.4057735839519483,
147
+ "n_train": 17311,
148
+ "n_test": 4328
149
+ },
150
+ {
151
+ "target": "oxygen_requirement",
152
+ "task": "classification",
153
+ "metric_name": "f1_macro",
154
+ "value": 0.42448903596925475,
155
+ "n_train": 17311,
156
+ "n_test": 4326
157
+ },
158
+ {
159
+ "target": "oxygen_requirement",
160
+ "task": "classification",
161
+ "metric_name": "f1_macro",
162
+ "value": 0.4253087727137673,
163
+ "n_train": 17311,
164
+ "n_test": 4328
165
+ },
166
+ {
167
+ "target": "oxygen_requirement",
168
+ "task": "classification",
169
+ "metric_name": "f1_macro",
170
+ "value": 0.32108277548814124,
171
+ "n_train": 17311,
172
+ "n_test": 4328
173
+ },
174
+ {
175
+ "target": "oxygen_requirement",
176
+ "task": "classification",
177
+ "metric_name": "f1_macro",
178
+ "value": 0.4817212684208369,
179
+ "n_train": 17312,
180
+ "n_test": 4327
181
+ }
182
+ ],
183
+ "top_features": {
184
+ "kegg_M00048": 0.04917992913397029,
185
+ "hmm_COX1_aerobic_score": 0.030582932382822038,
186
+ "hmm_FeFe_hyd_anaerobic_score": 0.017010687850415707,
187
+ "kegg_M00868": 0.016209590720245616,
188
+ "hmm_Alpha_amylase_score": 0.012914934009313584,
189
+ "kegg_M00122": 0.012612361693754792,
190
+ "n_predicted_cds": 0.011417849035933614,
191
+ "kegg_M00013": 0.011295223154593258,
192
+ "codon_CAA": 0.010262218676507473,
193
+ "kegg_M00060": 0.009821721445769071,
194
+ "iso_cat1_host": 0.009196543786674737,
195
+ "hmm_FAD_binding_FrdA_n": 0.009031777316704393,
196
+ "kegg_M00032": 0.008559015835635364,
197
+ "hmm_NifDK_nitrogenase_n": 0.007181795220822096,
198
+ "kegg_M01010": 0.006753691844642163,
199
+ "kegg_M00036": 0.006538775702938438,
200
+ "hmm_PdxJ_pyridoxine_n": 0.006458204134833068,
201
+ "iso_host_kingdom_other": 0.006149246357381344,
202
+ "kegg_M00609": 0.005912851588800549,
203
+ "kegg_M00878": 0.005448635050561279
204
+ }
205
+ },
206
+ "salt_tolerance_pct": {
207
+ "task": "regression",
208
+ "mean_metric": 1.938587229032814,
209
+ "folds": [
210
+ {
211
+ "target": "salt_tolerance_pct",
212
+ "task": "regression",
213
+ "metric_name": "mae",
214
+ "value": 1.6739041984692624,
215
+ "n_train": 3075,
216
+ "n_test": 769
217
+ },
218
+ {
219
+ "target": "salt_tolerance_pct",
220
+ "task": "regression",
221
+ "metric_name": "mae",
222
+ "value": 1.7789674237746413,
223
+ "n_train": 3075,
224
+ "n_test": 769
225
+ },
226
+ {
227
+ "target": "salt_tolerance_pct",
228
+ "task": "regression",
229
+ "metric_name": "mae",
230
+ "value": 2.653249260519034,
231
+ "n_train": 3075,
232
+ "n_test": 769
233
+ },
234
+ {
235
+ "target": "salt_tolerance_pct",
236
+ "task": "regression",
237
+ "metric_name": "mae",
238
+ "value": 1.7333069061884374,
239
+ "n_train": 3075,
240
+ "n_test": 769
241
+ },
242
+ {
243
+ "target": "salt_tolerance_pct",
244
+ "task": "regression",
245
+ "metric_name": "mae",
246
+ "value": 1.8535083562126964,
247
+ "n_train": 3076,
248
+ "n_test": 768
249
+ }
250
+ ],
251
+ "top_features": {
252
+ "neg_charged_frac": 0.03803467936813831,
253
+ "tetra_ATCC": 0.03377264237497002,
254
+ "tetra_ACAT": 0.02272233280236833,
255
+ "aa_frac_C": 0.019299705256707968,
256
+ "kegg_M00849": 0.01693017091602087,
257
+ "hmm_BCCT_compatible_n": 0.014071422629058361,
258
+ "md_nacl_pct_median": 0.013062665052711964,
259
+ "iso_cat2_saline": 0.010860100761055947,
260
+ "kegg_M00033": 0.0096977892331779,
261
+ "kegg_M00159": 0.009692981420084835,
262
+ "hmm_Cyt_CBB3_microaero_n": 0.009228403586894274,
263
+ "kegg_M00725": 0.008322405093349517,
264
+ "kegg_M00149": 0.008042314299382269,
265
+ "aa_frac_T": 0.007969680236419663,
266
+ "codon_CCG": 0.007059626755653881,
267
+ "kegg_M00144": 0.006586183654144406,
268
+ "kegg_M00121": 0.006443552603013813,
269
+ "hmm_ATP_synth_F0_B_n": 0.006160225393250585,
270
+ "iso_collection_year": 0.005867346032755449,
271
+ "codon_CGC": 0.005818429926875979
272
+ }
273
+ },
274
+ "__meta__": {
275
+ "feature_cols": [
276
+ "genome_size_nt",
277
+ "n_contigs",
278
+ "gc_content",
279
+ "n_predicted_cds",
280
+ "coding_density",
281
+ "mean_cds_aa_length",
282
+ "median_cds_aa_length",
283
+ "aromatic_frac",
284
+ "pos_charged_frac",
285
+ "neg_charged_frac",
286
+ "ivywrel_frac",
287
+ "mean_hydrophobicity",
288
+ "mean_isoelectric_point",
289
+ "aa_frac_A",
290
+ "aa_frac_C",
291
+ "aa_frac_D",
292
+ "aa_frac_E",
293
+ "aa_frac_F",
294
+ "aa_frac_G",
295
+ "aa_frac_H",
296
+ "aa_frac_I",
297
+ "aa_frac_K",
298
+ "aa_frac_L",
299
+ "aa_frac_M",
300
+ "aa_frac_N",
301
+ "aa_frac_P",
302
+ "aa_frac_Q",
303
+ "aa_frac_R",
304
+ "aa_frac_S",
305
+ "aa_frac_T",
306
+ "aa_frac_V",
307
+ "aa_frac_W",
308
+ "aa_frac_Y",
309
+ "tetra_AAAA",
310
+ "tetra_AAAC",
311
+ "tetra_AAAG",
312
+ "tetra_AAAT",
313
+ "tetra_AACA",
314
+ "tetra_AACC",
315
+ "tetra_AACG",
316
+ "tetra_AACT",
317
+ "tetra_AAGA",
318
+ "tetra_AAGC",
319
+ "tetra_AAGG",
320
+ "tetra_AAGT",
321
+ "tetra_AATA",
322
+ "tetra_AATC",
323
+ "tetra_AATG",
324
+ "tetra_AATT",
325
+ "tetra_ACAA",
326
+ "tetra_ACAC",
327
+ "tetra_ACAG",
328
+ "tetra_ACAT",
329
+ "tetra_ACCA",
330
+ "tetra_ACCC",
331
+ "tetra_ACCG",
332
+ "tetra_ACCT",
333
+ "tetra_ACGA",
334
+ "tetra_ACGC",
335
+ "tetra_ACGG",
336
+ "tetra_ACGT",
337
+ "tetra_ACTA",
338
+ "tetra_ACTC",
339
+ "tetra_ACTG",
340
+ "tetra_ACTT",
341
+ "tetra_AGAA",
342
+ "tetra_AGAC",
343
+ "tetra_AGAG",
344
+ "tetra_AGAT",
345
+ "tetra_AGCA",
346
+ "tetra_AGCC",
347
+ "tetra_AGCG",
348
+ "tetra_AGCT",
349
+ "tetra_AGGA",
350
+ "tetra_AGGC",
351
+ "tetra_AGGG",
352
+ "tetra_AGGT",
353
+ "tetra_AGTA",
354
+ "tetra_AGTC",
355
+ "tetra_AGTG",
356
+ "tetra_AGTT",
357
+ "tetra_ATAA",
358
+ "tetra_ATAC",
359
+ "tetra_ATAG",
360
+ "tetra_ATAT",
361
+ "tetra_ATCA",
362
+ "tetra_ATCC",
363
+ "tetra_ATCG",
364
+ "tetra_ATCT",
365
+ "tetra_ATGA",
366
+ "tetra_ATGC",
367
+ "tetra_ATGG",
368
+ "tetra_ATGT",
369
+ "tetra_ATTA",
370
+ "tetra_ATTC",
371
+ "tetra_ATTG",
372
+ "tetra_ATTT",
373
+ "tetra_CAAA",
374
+ "tetra_CAAC",
375
+ "tetra_CAAG",
376
+ "tetra_CAAT",
377
+ "tetra_CACA",
378
+ "tetra_CACC",
379
+ "tetra_CACG",
380
+ "tetra_CACT",
381
+ "tetra_CAGA",
382
+ "tetra_CAGC",
383
+ "tetra_CAGG",
384
+ "tetra_CAGT",
385
+ "tetra_CATA",
386
+ "tetra_CATC",
387
+ "tetra_CATG",
388
+ "tetra_CATT",
389
+ "tetra_CCAA",
390
+ "tetra_CCAC",
391
+ "tetra_CCAG",
392
+ "tetra_CCAT",
393
+ "tetra_CCCA",
394
+ "tetra_CCCC",
395
+ "tetra_CCCG",
396
+ "tetra_CCCT",
397
+ "tetra_CCGA",
398
+ "tetra_CCGC",
399
+ "tetra_CCGG",
400
+ "tetra_CCGT",
401
+ "tetra_CCTA",
402
+ "tetra_CCTC",
403
+ "tetra_CCTG",
404
+ "tetra_CCTT",
405
+ "tetra_CGAA",
406
+ "tetra_CGAC",
407
+ "tetra_CGAG",
408
+ "tetra_CGAT",
409
+ "tetra_CGCA",
410
+ "tetra_CGCC",
411
+ "tetra_CGCG",
412
+ "tetra_CGCT",
413
+ "tetra_CGGA",
414
+ "tetra_CGGC",
415
+ "tetra_CGGG",
416
+ "tetra_CGGT",
417
+ "tetra_CGTA",
418
+ "tetra_CGTC",
419
+ "tetra_CGTG",
420
+ "tetra_CGTT",
421
+ "tetra_CTAA",
422
+ "tetra_CTAC",
423
+ "tetra_CTAG",
424
+ "tetra_CTAT",
425
+ "tetra_CTCA",
426
+ "tetra_CTCC",
427
+ "tetra_CTCG",
428
+ "tetra_CTCT",
429
+ "tetra_CTGA",
430
+ "tetra_CTGC",
431
+ "tetra_CTGG",
432
+ "tetra_CTGT",
433
+ "tetra_CTTA",
434
+ "tetra_CTTC",
435
+ "tetra_CTTG",
436
+ "tetra_CTTT",
437
+ "tetra_GAAA",
438
+ "tetra_GAAC",
439
+ "tetra_GAAG",
440
+ "tetra_GAAT",
441
+ "tetra_GACA",
442
+ "tetra_GACC",
443
+ "tetra_GACG",
444
+ "tetra_GACT",
445
+ "tetra_GAGA",
446
+ "tetra_GAGC",
447
+ "tetra_GAGG",
448
+ "tetra_GAGT",
449
+ "tetra_GATA",
450
+ "tetra_GATC",
451
+ "tetra_GATG",
452
+ "tetra_GATT",
453
+ "tetra_GCAA",
454
+ "tetra_GCAC",
455
+ "tetra_GCAG",
456
+ "tetra_GCAT",
457
+ "tetra_GCCA",
458
+ "tetra_GCCC",
459
+ "tetra_GCCG",
460
+ "tetra_GCCT",
461
+ "tetra_GCGA",
462
+ "tetra_GCGC",
463
+ "tetra_GCGG",
464
+ "tetra_GCGT",
465
+ "tetra_GCTA",
466
+ "tetra_GCTC",
467
+ "tetra_GCTG",
468
+ "tetra_GCTT",
469
+ "tetra_GGAA",
470
+ "tetra_GGAC",
471
+ "tetra_GGAG",
472
+ "tetra_GGAT",
473
+ "tetra_GGCA",
474
+ "tetra_GGCC",
475
+ "tetra_GGCG",
476
+ "tetra_GGCT",
477
+ "tetra_GGGA",
478
+ "tetra_GGGC",
479
+ "tetra_GGGG",
480
+ "tetra_GGGT",
481
+ "tetra_GGTA",
482
+ "tetra_GGTC",
483
+ "tetra_GGTG",
484
+ "tetra_GGTT",
485
+ "tetra_GTAA",
486
+ "tetra_GTAC",
487
+ "tetra_GTAG",
488
+ "tetra_GTAT",
489
+ "tetra_GTCA",
490
+ "tetra_GTCC",
491
+ "tetra_GTCG",
492
+ "tetra_GTCT",
493
+ "tetra_GTGA",
494
+ "tetra_GTGC",
495
+ "tetra_GTGG",
496
+ "tetra_GTGT",
497
+ "tetra_GTTA",
498
+ "tetra_GTTC",
499
+ "tetra_GTTG",
500
+ "tetra_GTTT",
501
+ "tetra_TAAA",
502
+ "tetra_TAAC",
503
+ "tetra_TAAG",
504
+ "tetra_TAAT",
505
+ "tetra_TACA",
506
+ "tetra_TACC",
507
+ "tetra_TACG",
508
+ "tetra_TACT",
509
+ "tetra_TAGA",
510
+ "tetra_TAGC",
511
+ "tetra_TAGG",
512
+ "tetra_TAGT",
513
+ "tetra_TATA",
514
+ "tetra_TATC",
515
+ "tetra_TATG",
516
+ "tetra_TATT",
517
+ "tetra_TCAA",
518
+ "tetra_TCAC",
519
+ "tetra_TCAG",
520
+ "tetra_TCAT",
521
+ "tetra_TCCA",
522
+ "tetra_TCCC",
523
+ "tetra_TCCG",
524
+ "tetra_TCCT",
525
+ "tetra_TCGA",
526
+ "tetra_TCGC",
527
+ "tetra_TCGG",
528
+ "tetra_TCGT",
529
+ "tetra_TCTA",
530
+ "tetra_TCTC",
531
+ "tetra_TCTG",
532
+ "tetra_TCTT",
533
+ "tetra_TGAA",
534
+ "tetra_TGAC",
535
+ "tetra_TGAG",
536
+ "tetra_TGAT",
537
+ "tetra_TGCA",
538
+ "tetra_TGCC",
539
+ "tetra_TGCG",
540
+ "tetra_TGCT",
541
+ "tetra_TGGA",
542
+ "tetra_TGGC",
543
+ "tetra_TGGG",
544
+ "tetra_TGGT",
545
+ "tetra_TGTA",
546
+ "tetra_TGTC",
547
+ "tetra_TGTG",
548
+ "tetra_TGTT",
549
+ "tetra_TTAA",
550
+ "tetra_TTAC",
551
+ "tetra_TTAG",
552
+ "tetra_TTAT",
553
+ "tetra_TTCA",
554
+ "tetra_TTCC",
555
+ "tetra_TTCG",
556
+ "tetra_TTCT",
557
+ "tetra_TTGA",
558
+ "tetra_TTGC",
559
+ "tetra_TTGG",
560
+ "tetra_TTGT",
561
+ "tetra_TTTA",
562
+ "tetra_TTTC",
563
+ "tetra_TTTG",
564
+ "tetra_TTTT",
565
+ "codon_AAA",
566
+ "codon_AAC",
567
+ "codon_AAG",
568
+ "codon_AAT",
569
+ "codon_ACA",
570
+ "codon_ACC",
571
+ "codon_ACG",
572
+ "codon_ACT",
573
+ "codon_AGA",
574
+ "codon_AGC",
575
+ "codon_AGG",
576
+ "codon_AGT",
577
+ "codon_ATA",
578
+ "codon_ATC",
579
+ "codon_ATG",
580
+ "codon_ATT",
581
+ "codon_CAA",
582
+ "codon_CAC",
583
+ "codon_CAG",
584
+ "codon_CAT",
585
+ "codon_CCA",
586
+ "codon_CCC",
587
+ "codon_CCG",
588
+ "codon_CCT",
589
+ "codon_CGA",
590
+ "codon_CGC",
591
+ "codon_CGG",
592
+ "codon_CGT",
593
+ "codon_CTA",
594
+ "codon_CTC",
595
+ "codon_CTG",
596
+ "codon_CTT",
597
+ "codon_GAA",
598
+ "codon_GAC",
599
+ "codon_GAG",
600
+ "codon_GAT",
601
+ "codon_GCA",
602
+ "codon_GCC",
603
+ "codon_GCG",
604
+ "codon_GCT",
605
+ "codon_GGA",
606
+ "codon_GGC",
607
+ "codon_GGG",
608
+ "codon_GGT",
609
+ "codon_GTA",
610
+ "codon_GTC",
611
+ "codon_GTG",
612
+ "codon_GTT",
613
+ "codon_TAA",
614
+ "codon_TAC",
615
+ "codon_TAG",
616
+ "codon_TAT",
617
+ "codon_TCA",
618
+ "codon_TCC",
619
+ "codon_TCG",
620
+ "codon_TCT",
621
+ "codon_TGA",
622
+ "codon_TGC",
623
+ "codon_TGG",
624
+ "codon_TGT",
625
+ "codon_TTA",
626
+ "codon_TTC",
627
+ "codon_TTG",
628
+ "codon_TTT",
629
+ "iso_cat1_climate",
630
+ "iso_cat1_condition",
631
+ "iso_cat1_engineered",
632
+ "iso_cat1_environmental",
633
+ "iso_cat1_host",
634
+ "iso_cat1_host_body_product",
635
+ "iso_cat1_host_body_site",
636
+ "iso_cat1_infection",
637
+ "iso_cat2_acidic",
638
+ "iso_cat2_agriculture",
639
+ "iso_cat2_air",
640
+ "iso_cat2_algae",
641
+ "iso_cat2_alkaline",
642
+ "iso_cat2_anoxic_anaerobic",
643
+ "iso_cat2_aquatic",
644
+ "iso_cat2_arthropoda",
645
+ "iso_cat2_biodegradation",
646
+ "iso_cat2_biofilm",
647
+ "iso_cat2_bioreactor",
648
+ "iso_cat2_bioremediation",
649
+ "iso_cat2_birds",
650
+ "iso_cat2_built_environment",
651
+ "iso_cat2_cold",
652
+ "iso_cat2_contamination",
653
+ "iso_cat2_disease",
654
+ "iso_cat2_fishes",
655
+ "iso_cat2_fluids",
656
+ "iso_cat2_food_production",
657
+ "iso_cat2_fungi",
658
+ "iso_cat2_gastrointestinal_tract",
659
+ "iso_cat2_hot",
660
+ "iso_cat2_human",
661
+ "iso_cat2_humid",
662
+ "iso_cat2_industrial",
663
+ "iso_cat2_inflammation",
664
+ "iso_cat2_invertebrates_other",
665
+ "iso_cat2_juvenile",
666
+ "iso_cat2_laboratory",
667
+ "iso_cat2_limb",
668
+ "iso_cat2_mammals",
669
+ "iso_cat2_medical_device",
670
+ "iso_cat2_medical_environment",
671
+ "iso_cat2_medical_product",
672
+ "iso_cat2_microbial",
673
+ "iso_cat2_microbial_community",
674
+ "iso_cat2_oral_cavity_and_airways",
675
+ "iso_cat2_organ",
676
+ "iso_cat2_other",
677
+ "iso_cat2_patient",
678
+ "iso_cat2_plant",
679
+ "iso_cat2_plant_infections",
680
+ "iso_cat2_plants",
681
+ "iso_cat2_protozoa",
682
+ "iso_cat2_psychrophilic_lt10_c",
683
+ "iso_cat2_reptilia",
684
+ "iso_cat2_saline",
685
+ "iso_cat2_sulfuric",
686
+ "iso_cat2_temperate",
687
+ "iso_cat2_terrestrial",
688
+ "iso_cat2_thermophilic_gt45_c",
689
+ "iso_cat2_treatment",
690
+ "iso_cat2_urogenital_tract",
691
+ "iso_cat2_waste",
692
+ "iso_cat2_xerophilic",
693
+ "iso_cat2_yeast",
694
+ "md_n_media",
695
+ "md_ph_median",
696
+ "md_ph_range",
697
+ "md_nacl_pct_median",
698
+ "md_nacl_pct_max",
699
+ "hmm_Peripla_BP_2_n",
700
+ "hmm_Peripla_BP_2_score",
701
+ "hmm_Peripla_BP_2_present",
702
+ "hmm_Cyt_CBB3_microaero_n",
703
+ "hmm_Cyt_CBB3_microaero_score",
704
+ "hmm_Cyt_CBB3_microaero_present",
705
+ "hmm_Rieske_2Fe2S_n",
706
+ "hmm_Rieske_2Fe2S_score",
707
+ "hmm_Rieske_2Fe2S_present",
708
+ "hmm_FeFe_hyd_anaerobic_n",
709
+ "hmm_FeFe_hyd_anaerobic_score",
710
+ "hmm_FeFe_hyd_anaerobic_present",
711
+ "hmm_KdpD_osmosensor_n",
712
+ "hmm_KdpD_osmosensor_score",
713
+ "hmm_KdpD_osmosensor_present",
714
+ "hmm_DHBP_riboflavin_n",
715
+ "hmm_DHBP_riboflavin_score",
716
+ "hmm_DHBP_riboflavin_present",
717
+ "hmm_BCCT_compatible_n",
718
+ "hmm_BCCT_compatible_score",
719
+ "hmm_BCCT_compatible_present",
720
+ "hmm_NIR_SIR_ferredoxin_n",
721
+ "hmm_NIR_SIR_ferredoxin_score",
722
+ "hmm_NIR_SIR_ferredoxin_present",
723
+ "hmm_CBM_cellulose_n",
724
+ "hmm_CBM_cellulose_score",
725
+ "hmm_CBM_cellulose_present",
726
+ "hmm_Pyridoxal_decarbox_n",
727
+ "hmm_Pyridoxal_decarbox_score",
728
+ "hmm_Pyridoxal_decarbox_present",
729
+ "hmm_Hsp70_DnaK_n",
730
+ "hmm_Hsp70_DnaK_score",
731
+ "hmm_Hsp70_DnaK_present",
732
+ "hmm_Cellulase_GH5_n",
733
+ "hmm_Cellulase_GH5_score",
734
+ "hmm_Cellulase_GH5_present",
735
+ "hmm_COX1_aerobic_n",
736
+ "hmm_COX1_aerobic_score",
737
+ "hmm_COX1_aerobic_present",
738
+ "hmm_COX2_TM_aerobic_n",
739
+ "hmm_COX2_TM_aerobic_score",
740
+ "hmm_COX2_TM_aerobic_present",
741
+ "hmm_ATP_synth_alphabeta_n",
742
+ "hmm_ATP_synth_alphabeta_score",
743
+ "hmm_ATP_synth_alphabeta_present",
744
+ "hmm_SOD_FeMn_n",
745
+ "hmm_SOD_FeMn_score",
746
+ "hmm_SOD_FeMn_present",
747
+ "hmm_Fer4_FeS_4Fe4S_n",
748
+ "hmm_Fer4_FeS_4Fe4S_score",
749
+ "hmm_Fer4_FeS_4Fe4S_present",
750
+ "hmm_EctC_ectoine_synth_n",
751
+ "hmm_EctC_ectoine_synth_score",
752
+ "hmm_EctC_ectoine_synth_present",
753
+ "hmm_Molybdopterin_OR_n",
754
+ "hmm_Molybdopterin_OR_score",
755
+ "hmm_Molybdopterin_OR_present",
756
+ "hmm_Hsp20_n",
757
+ "hmm_Hsp20_score",
758
+ "hmm_Hsp20_present",
759
+ "hmm_RuBisCO_small_form1_n",
760
+ "hmm_RuBisCO_small_form1_score",
761
+ "hmm_RuBisCO_small_form1_present",
762
+ "hmm_RuBisCO_large_form1_n",
763
+ "hmm_RuBisCO_large_form1_score",
764
+ "hmm_RuBisCO_large_form1_present",
765
+ "hmm_ATP_synth_alphabeta_C_n",
766
+ "hmm_ATP_synth_alphabeta_C_score",
767
+ "hmm_ATP_synth_alphabeta_C_present",
768
+ "hmm_PdxJ_pyridoxine_n",
769
+ "hmm_PdxJ_pyridoxine_score",
770
+ "hmm_PdxJ_pyridoxine_present",
771
+ "hmm_Alpha_amylase_n",
772
+ "hmm_Alpha_amylase_score",
773
+ "hmm_Alpha_amylase_present",
774
+ "hmm_NifH_nitrogenase_n",
775
+ "hmm_NifH_nitrogenase_score",
776
+ "hmm_NifH_nitrogenase_present",
777
+ "hmm_ATP_synth_F0_B_n",
778
+ "hmm_ATP_synth_F0_B_score",
779
+ "hmm_ATP_synth_F0_B_present",
780
+ "hmm_CSD_cold_shock_n",
781
+ "hmm_CSD_cold_shock_score",
782
+ "hmm_CSD_cold_shock_present",
783
+ "hmm_FolB_folate_n",
784
+ "hmm_FolB_folate_score",
785
+ "hmm_FolB_folate_present",
786
+ "hmm_Bact_rhodopsin_n",
787
+ "hmm_Bact_rhodopsin_score",
788
+ "hmm_Bact_rhodopsin_present",
789
+ "hmm_COX2_periplasm_aero_n",
790
+ "hmm_COX2_periplasm_aero_score",
791
+ "hmm_COX2_periplasm_aero_present",
792
+ "hmm_BPD_transp_1_n",
793
+ "hmm_BPD_transp_1_score",
794
+ "hmm_BPD_transp_1_present",
795
+ "hmm_TP_methylase_B12_n",
796
+ "hmm_TP_methylase_B12_score",
797
+ "hmm_TP_methylase_B12_present",
798
+ "hmm_V_ATPase_subH_N_n",
799
+ "hmm_V_ATPase_subH_N_score",
800
+ "hmm_V_ATPase_subH_N_present",
801
+ "hmm_UvrD_helicase_C_n",
802
+ "hmm_UvrD_helicase_C_score",
803
+ "hmm_UvrD_helicase_C_present",
804
+ "hmm_Cpn60_GroEL_n",
805
+ "hmm_Cpn60_GroEL_score",
806
+ "hmm_Cpn60_GroEL_present",
807
+ "hmm_MotA_TolQ_ExbB_n",
808
+ "hmm_MotA_TolQ_ExbB_score",
809
+ "hmm_MotA_TolQ_ExbB_present",
810
+ "hmm_TrkH_K_channel_n",
811
+ "hmm_TrkH_K_channel_score",
812
+ "hmm_TrkH_K_channel_present",
813
+ "hmm_NhaA_Na_H_exch_n",
814
+ "hmm_NhaA_Na_H_exch_score",
815
+ "hmm_NhaA_Na_H_exch_present",
816
+ "hmm_Hsp90_n",
817
+ "hmm_Hsp90_score",
818
+ "hmm_Hsp90_present",
819
+ "hmm_NhaB_Na_H_exch_n",
820
+ "hmm_NhaB_Na_H_exch_score",
821
+ "hmm_NhaB_Na_H_exch_present",
822
+ "hmm_NifDK_nitrogenase_n",
823
+ "hmm_NifDK_nitrogenase_score",
824
+ "hmm_NifDK_nitrogenase_present",
825
+ "hmm_Catalase_n",
826
+ "hmm_Catalase_score",
827
+ "hmm_Catalase_present",
828
+ "hmm_SOD_CuZn_n",
829
+ "hmm_SOD_CuZn_score",
830
+ "hmm_SOD_CuZn_present",
831
+ "hmm_FAD_binding_FrdA_n",
832
+ "hmm_FAD_binding_FrdA_score",
833
+ "hmm_FAD_binding_FrdA_present",
834
+ "hmm_THF_DHG_CYH_folate_n",
835
+ "hmm_THF_DHG_CYH_folate_score",
836
+ "hmm_THF_DHG_CYH_folate_present",
837
+ "hmm_TGS_thermosome_n",
838
+ "hmm_TGS_thermosome_score",
839
+ "hmm_TGS_thermosome_present",
840
+ "hmm_NiFe_hyd_anaerobic_n",
841
+ "hmm_NiFe_hyd_anaerobic_score",
842
+ "hmm_NiFe_hyd_anaerobic_present",
843
+ "kegg_M00001",
844
+ "kegg_M00002",
845
+ "kegg_M00003",
846
+ "kegg_M00004",
847
+ "kegg_M00005",
848
+ "kegg_M00006",
849
+ "kegg_M00007",
850
+ "kegg_M00008",
851
+ "kegg_M00009",
852
+ "kegg_M00010",
853
+ "kegg_M00011",
854
+ "kegg_M00012",
855
+ "kegg_M00013",
856
+ "kegg_M00014",
857
+ "kegg_M00015",
858
+ "kegg_M00016",
859
+ "kegg_M00017",
860
+ "kegg_M00018",
861
+ "kegg_M00019",
862
+ "kegg_M00020",
863
+ "kegg_M00021",
864
+ "kegg_M00022",
865
+ "kegg_M00023",
866
+ "kegg_M00024",
867
+ "kegg_M00025",
868
+ "kegg_M00026",
869
+ "kegg_M00027",
870
+ "kegg_M00028",
871
+ "kegg_M00029",
872
+ "kegg_M00030",
873
+ "kegg_M00031",
874
+ "kegg_M00032",
875
+ "kegg_M00033",
876
+ "kegg_M00034",
877
+ "kegg_M00035",
878
+ "kegg_M00036",
879
+ "kegg_M00037",
880
+ "kegg_M00038",
881
+ "kegg_M00039",
882
+ "kegg_M00040",
883
+ "kegg_M00042",
884
+ "kegg_M00043",
885
+ "kegg_M00044",
886
+ "kegg_M00045",
887
+ "kegg_M00046",
888
+ "kegg_M00047",
889
+ "kegg_M00048",
890
+ "kegg_M00049",
891
+ "kegg_M00050",
892
+ "kegg_M00051",
893
+ "kegg_M00052",
894
+ "kegg_M00053",
895
+ "kegg_M00055",
896
+ "kegg_M00056",
897
+ "kegg_M00057",
898
+ "kegg_M00058",
899
+ "kegg_M00059",
900
+ "kegg_M00060",
901
+ "kegg_M00061",
902
+ "kegg_M00063",
903
+ "kegg_M00064",
904
+ "kegg_M00065",
905
+ "kegg_M00066",
906
+ "kegg_M00067",
907
+ "kegg_M00068",
908
+ "kegg_M00069",
909
+ "kegg_M00070",
910
+ "kegg_M00071",
911
+ "kegg_M00072",
912
+ "kegg_M00073",
913
+ "kegg_M00074",
914
+ "kegg_M00075",
915
+ "kegg_M00076",
916
+ "kegg_M00077",
917
+ "kegg_M00078",
918
+ "kegg_M00079",
919
+ "kegg_M00081",
920
+ "kegg_M00082",
921
+ "kegg_M00083",
922
+ "kegg_M00085",
923
+ "kegg_M00086",
924
+ "kegg_M00087",
925
+ "kegg_M00088",
926
+ "kegg_M00089",
927
+ "kegg_M00090",
928
+ "kegg_M00091",
929
+ "kegg_M00092",
930
+ "kegg_M00093",
931
+ "kegg_M00094",
932
+ "kegg_M00095",
933
+ "kegg_M00096",
934
+ "kegg_M00097",
935
+ "kegg_M00098",
936
+ "kegg_M00099",
937
+ "kegg_M00100",
938
+ "kegg_M00101",
939
+ "kegg_M00102",
940
+ "kegg_M00103",
941
+ "kegg_M00104",
942
+ "kegg_M00106",
943
+ "kegg_M00107",
944
+ "kegg_M00108",
945
+ "kegg_M00109",
946
+ "kegg_M00110",
947
+ "kegg_M00112",
948
+ "kegg_M00113",
949
+ "kegg_M00114",
950
+ "kegg_M00115",
951
+ "kegg_M00116",
952
+ "kegg_M00117",
953
+ "kegg_M00118",
954
+ "kegg_M00119",
955
+ "kegg_M00120",
956
+ "kegg_M00121",
957
+ "kegg_M00122",
958
+ "kegg_M00123",
959
+ "kegg_M00124",
960
+ "kegg_M00125",
961
+ "kegg_M00126",
962
+ "kegg_M00127",
963
+ "kegg_M00128",
964
+ "kegg_M00129",
965
+ "kegg_M00130",
966
+ "kegg_M00131",
967
+ "kegg_M00132",
968
+ "kegg_M00133",
969
+ "kegg_M00134",
970
+ "kegg_M00135",
971
+ "kegg_M00136",
972
+ "kegg_M00137",
973
+ "kegg_M00138",
974
+ "kegg_M00140",
975
+ "kegg_M00141",
976
+ "kegg_M00142",
977
+ "kegg_M00143",
978
+ "kegg_M00144",
979
+ "kegg_M00145",
980
+ "kegg_M00146",
981
+ "kegg_M00147",
982
+ "kegg_M00148",
983
+ "kegg_M00149",
984
+ "kegg_M00150",
985
+ "kegg_M00151",
986
+ "kegg_M00152",
987
+ "kegg_M00153",
988
+ "kegg_M00154",
989
+ "kegg_M00155",
990
+ "kegg_M00156",
991
+ "kegg_M00157",
992
+ "kegg_M00158",
993
+ "kegg_M00159",
994
+ "kegg_M00160",
995
+ "kegg_M00161",
996
+ "kegg_M00162",
997
+ "kegg_M00163",
998
+ "kegg_M00165",
999
+ "kegg_M00168",
1000
+ "kegg_M00169",
1001
+ "kegg_M00170",
1002
+ "kegg_M00171",
1003
+ "kegg_M00172",
1004
+ "kegg_M00173",
1005
+ "kegg_M00174",
1006
+ "kegg_M00175",
1007
+ "kegg_M00176",
1008
+ "kegg_M00307",
1009
+ "kegg_M00308",
1010
+ "kegg_M00309",
1011
+ "kegg_M00338",
1012
+ "kegg_M00344",
1013
+ "kegg_M00345",
1014
+ "kegg_M00346",
1015
+ "kegg_M00356",
1016
+ "kegg_M00357",
1017
+ "kegg_M00358",
1018
+ "kegg_M00363",
1019
+ "kegg_M00364",
1020
+ "kegg_M00365",
1021
+ "kegg_M00366",
1022
+ "kegg_M00367",
1023
+ "kegg_M00368",
1024
+ "kegg_M00369",
1025
+ "kegg_M00370",
1026
+ "kegg_M00371",
1027
+ "kegg_M00372",
1028
+ "kegg_M00373",
1029
+ "kegg_M00374",
1030
+ "kegg_M00375",
1031
+ "kegg_M00376",
1032
+ "kegg_M00377",
1033
+ "kegg_M00378",
1034
+ "kegg_M00415",
1035
+ "kegg_M00416",
1036
+ "kegg_M00417",
1037
+ "kegg_M00418",
1038
+ "kegg_M00419",
1039
+ "kegg_M00422",
1040
+ "kegg_M00432",
1041
+ "kegg_M00433",
1042
+ "kegg_M00525",
1043
+ "kegg_M00526",
1044
+ "kegg_M00527",
1045
+ "kegg_M00528",
1046
+ "kegg_M00529",
1047
+ "kegg_M00530",
1048
+ "kegg_M00531",
1049
+ "kegg_M00532",
1050
+ "kegg_M00533",
1051
+ "kegg_M00534",
1052
+ "kegg_M00535",
1053
+ "kegg_M00537",
1054
+ "kegg_M00538",
1055
+ "kegg_M00539",
1056
+ "kegg_M00540",
1057
+ "kegg_M00541",
1058
+ "kegg_M00542",
1059
+ "kegg_M00543",
1060
+ "kegg_M00544",
1061
+ "kegg_M00545",
1062
+ "kegg_M00546",
1063
+ "kegg_M00547",
1064
+ "kegg_M00548",
1065
+ "kegg_M00549",
1066
+ "kegg_M00550",
1067
+ "kegg_M00551",
1068
+ "kegg_M00552",
1069
+ "kegg_M00554",
1070
+ "kegg_M00555",
1071
+ "kegg_M00563",
1072
+ "kegg_M00564",
1073
+ "kegg_M00565",
1074
+ "kegg_M00567",
1075
+ "kegg_M00568",
1076
+ "kegg_M00569",
1077
+ "kegg_M00570",
1078
+ "kegg_M00572",
1079
+ "kegg_M00573",
1080
+ "kegg_M00574",
1081
+ "kegg_M00575",
1082
+ "kegg_M00576",
1083
+ "kegg_M00577",
1084
+ "kegg_M00579",
1085
+ "kegg_M00580",
1086
+ "kegg_M00595",
1087
+ "kegg_M00596",
1088
+ "kegg_M00597",
1089
+ "kegg_M00598",
1090
+ "kegg_M00608",
1091
+ "kegg_M00609",
1092
+ "kegg_M00611",
1093
+ "kegg_M00612",
1094
+ "kegg_M00613",
1095
+ "kegg_M00614",
1096
+ "kegg_M00615",
1097
+ "kegg_M00616",
1098
+ "kegg_M00617",
1099
+ "kegg_M00618",
1100
+ "kegg_M00620",
1101
+ "kegg_M00621",
1102
+ "kegg_M00622",
1103
+ "kegg_M00623",
1104
+ "kegg_M00624",
1105
+ "kegg_M00625",
1106
+ "kegg_M00627",
1107
+ "kegg_M00630",
1108
+ "kegg_M00631",
1109
+ "kegg_M00632",
1110
+ "kegg_M00633",
1111
+ "kegg_M00636",
1112
+ "kegg_M00637",
1113
+ "kegg_M00638",
1114
+ "kegg_M00639",
1115
+ "kegg_M00641",
1116
+ "kegg_M00642",
1117
+ "kegg_M00643",
1118
+ "kegg_M00649",
1119
+ "kegg_M00651",
1120
+ "kegg_M00652",
1121
+ "kegg_M00660",
1122
+ "kegg_M00661",
1123
+ "kegg_M00664",
1124
+ "kegg_M00672",
1125
+ "kegg_M00673",
1126
+ "kegg_M00674",
1127
+ "kegg_M00675",
1128
+ "kegg_M00696",
1129
+ "kegg_M00697",
1130
+ "kegg_M00698",
1131
+ "kegg_M00700",
1132
+ "kegg_M00702",
1133
+ "kegg_M00704",
1134
+ "kegg_M00705",
1135
+ "kegg_M00714",
1136
+ "kegg_M00718",
1137
+ "kegg_M00725",
1138
+ "kegg_M00726",
1139
+ "kegg_M00730",
1140
+ "kegg_M00736",
1141
+ "kegg_M00740",
1142
+ "kegg_M00741",
1143
+ "kegg_M00744",
1144
+ "kegg_M00745",
1145
+ "kegg_M00746",
1146
+ "kegg_M00761",
1147
+ "kegg_M00763",
1148
+ "kegg_M00769",
1149
+ "kegg_M00773",
1150
+ "kegg_M00774",
1151
+ "kegg_M00775",
1152
+ "kegg_M00776",
1153
+ "kegg_M00777",
1154
+ "kegg_M00778",
1155
+ "kegg_M00779",
1156
+ "kegg_M00780",
1157
+ "kegg_M00781",
1158
+ "kegg_M00782",
1159
+ "kegg_M00783",
1160
+ "kegg_M00784",
1161
+ "kegg_M00785",
1162
+ "kegg_M00786",
1163
+ "kegg_M00787",
1164
+ "kegg_M00788",
1165
+ "kegg_M00789",
1166
+ "kegg_M00790",
1167
+ "kegg_M00793",
1168
+ "kegg_M00794",
1169
+ "kegg_M00795",
1170
+ "kegg_M00796",
1171
+ "kegg_M00797",
1172
+ "kegg_M00798",
1173
+ "kegg_M00799",
1174
+ "kegg_M00800",
1175
+ "kegg_M00801",
1176
+ "kegg_M00802",
1177
+ "kegg_M00803",
1178
+ "kegg_M00804",
1179
+ "kegg_M00805",
1180
+ "kegg_M00808",
1181
+ "kegg_M00810",
1182
+ "kegg_M00811",
1183
+ "kegg_M00814",
1184
+ "kegg_M00815",
1185
+ "kegg_M00819",
1186
+ "kegg_M00823",
1187
+ "kegg_M00824",
1188
+ "kegg_M00825",
1189
+ "kegg_M00826",
1190
+ "kegg_M00827",
1191
+ "kegg_M00828",
1192
+ "kegg_M00829",
1193
+ "kegg_M00830",
1194
+ "kegg_M00831",
1195
+ "kegg_M00832",
1196
+ "kegg_M00833",
1197
+ "kegg_M00834",
1198
+ "kegg_M00835",
1199
+ "kegg_M00836",
1200
+ "kegg_M00837",
1201
+ "kegg_M00838",
1202
+ "kegg_M00840",
1203
+ "kegg_M00841",
1204
+ "kegg_M00842",
1205
+ "kegg_M00843",
1206
+ "kegg_M00844",
1207
+ "kegg_M00845",
1208
+ "kegg_M00846",
1209
+ "kegg_M00847",
1210
+ "kegg_M00848",
1211
+ "kegg_M00849",
1212
+ "kegg_M00850",
1213
+ "kegg_M00851",
1214
+ "kegg_M00852",
1215
+ "kegg_M00853",
1216
+ "kegg_M00854",
1217
+ "kegg_M00855",
1218
+ "kegg_M00856",
1219
+ "kegg_M00857",
1220
+ "kegg_M00859",
1221
+ "kegg_M00860",
1222
+ "kegg_M00861",
1223
+ "kegg_M00862",
1224
+ "kegg_M00866",
1225
+ "kegg_M00867",
1226
+ "kegg_M00868",
1227
+ "kegg_M00872",
1228
+ "kegg_M00873",
1229
+ "kegg_M00874",
1230
+ "kegg_M00875",
1231
+ "kegg_M00876",
1232
+ "kegg_M00877",
1233
+ "kegg_M00878",
1234
+ "kegg_M00879",
1235
+ "kegg_M00880",
1236
+ "kegg_M00881",
1237
+ "kegg_M00882",
1238
+ "kegg_M00883",
1239
+ "kegg_M00884",
1240
+ "kegg_M00885",
1241
+ "kegg_M00886",
1242
+ "kegg_M00887",
1243
+ "kegg_M00888",
1244
+ "kegg_M00889",
1245
+ "kegg_M00890",
1246
+ "kegg_M00891",
1247
+ "kegg_M00892",
1248
+ "kegg_M00893",
1249
+ "kegg_M00894",
1250
+ "kegg_M00895",
1251
+ "kegg_M00896",
1252
+ "kegg_M00897",
1253
+ "kegg_M00898",
1254
+ "kegg_M00899",
1255
+ "kegg_M00900",
1256
+ "kegg_M00901",
1257
+ "kegg_M00902",
1258
+ "kegg_M00903",
1259
+ "kegg_M00904",
1260
+ "kegg_M00905",
1261
+ "kegg_M00906",
1262
+ "kegg_M00909",
1263
+ "kegg_M00910",
1264
+ "kegg_M00911",
1265
+ "kegg_M00912",
1266
+ "kegg_M00913",
1267
+ "kegg_M00914",
1268
+ "kegg_M00915",
1269
+ "kegg_M00916",
1270
+ "kegg_M00917",
1271
+ "kegg_M00918",
1272
+ "kegg_M00919",
1273
+ "kegg_M00921",
1274
+ "kegg_M00922",
1275
+ "kegg_M00923",
1276
+ "kegg_M00924",
1277
+ "kegg_M00925",
1278
+ "kegg_M00926",
1279
+ "kegg_M00927",
1280
+ "kegg_M00928",
1281
+ "kegg_M00929",
1282
+ "kegg_M00930",
1283
+ "kegg_M00931",
1284
+ "kegg_M00932",
1285
+ "kegg_M00933",
1286
+ "kegg_M00934",
1287
+ "kegg_M00935",
1288
+ "kegg_M00936",
1289
+ "kegg_M00937",
1290
+ "kegg_M00938",
1291
+ "kegg_M00939",
1292
+ "kegg_M00940",
1293
+ "kegg_M00941",
1294
+ "kegg_M00942",
1295
+ "kegg_M00943",
1296
+ "kegg_M00944",
1297
+ "kegg_M00945",
1298
+ "kegg_M00946",
1299
+ "kegg_M00947",
1300
+ "kegg_M00948",
1301
+ "kegg_M00949",
1302
+ "kegg_M00950",
1303
+ "kegg_M00951",
1304
+ "kegg_M00952",
1305
+ "kegg_M00953",
1306
+ "kegg_M00954",
1307
+ "kegg_M00955",
1308
+ "kegg_M00956",
1309
+ "kegg_M00957",
1310
+ "kegg_M00958",
1311
+ "kegg_M00959",
1312
+ "kegg_M00960",
1313
+ "kegg_M00961",
1314
+ "kegg_M00962",
1315
+ "kegg_M00963",
1316
+ "kegg_M00964",
1317
+ "kegg_M00965",
1318
+ "kegg_M00966",
1319
+ "kegg_M00967",
1320
+ "kegg_M00968",
1321
+ "kegg_M00969",
1322
+ "kegg_M00970",
1323
+ "kegg_M00971",
1324
+ "kegg_M00972",
1325
+ "kegg_M00973",
1326
+ "kegg_M00974",
1327
+ "kegg_M00975",
1328
+ "kegg_M00976",
1329
+ "kegg_M00977",
1330
+ "kegg_M00978",
1331
+ "kegg_M00979",
1332
+ "kegg_M00980",
1333
+ "kegg_M00981",
1334
+ "kegg_M00982",
1335
+ "kegg_M00983",
1336
+ "kegg_M00984",
1337
+ "kegg_M00985",
1338
+ "kegg_M00986",
1339
+ "kegg_M00987",
1340
+ "kegg_M00988",
1341
+ "kegg_M00989",
1342
+ "kegg_M00990",
1343
+ "kegg_M00991",
1344
+ "kegg_M00992",
1345
+ "kegg_M00993",
1346
+ "kegg_M00994",
1347
+ "kegg_M00995",
1348
+ "kegg_M00996",
1349
+ "kegg_M00997",
1350
+ "kegg_M00998",
1351
+ "kegg_M00999",
1352
+ "kegg_M01000",
1353
+ "kegg_M01001",
1354
+ "kegg_M01002",
1355
+ "kegg_M01003",
1356
+ "kegg_M01004",
1357
+ "kegg_M01005",
1358
+ "kegg_M01006",
1359
+ "kegg_M01007",
1360
+ "kegg_M01008",
1361
+ "kegg_M01009",
1362
+ "kegg_M01010",
1363
+ "kegg_M01011",
1364
+ "kegg_M01012",
1365
+ "kegg_M01013",
1366
+ "kegg_M01014",
1367
+ "kegg_M01015",
1368
+ "kegg_M01016",
1369
+ "kegg_M01017",
1370
+ "kegg_M01018",
1371
+ "kegg_M01019",
1372
+ "kegg_M01020",
1373
+ "kegg_M01021",
1374
+ "kegg_M01022",
1375
+ "kegg_M01023",
1376
+ "kegg_M01024",
1377
+ "kegg_M01025",
1378
+ "kegg_M01026",
1379
+ "kegg_M01027",
1380
+ "kegg_M01028",
1381
+ "kegg_M01029",
1382
+ "kegg_M01030",
1383
+ "kegg_M01031",
1384
+ "kegg_M01032",
1385
+ "kegg_M01033",
1386
+ "kegg_M01034",
1387
+ "kegg_M01035",
1388
+ "kegg_M01036",
1389
+ "kegg_M01037",
1390
+ "kegg_M01038",
1391
+ "kegg_M01039",
1392
+ "kegg_M01040",
1393
+ "kegg_M01041",
1394
+ "kegg_M01042",
1395
+ "kegg_M01043",
1396
+ "kegg_M01044",
1397
+ "kegg_M01045",
1398
+ "kegg_M01046",
1399
+ "kegg_M01047",
1400
+ "kegg_M01048",
1401
+ "kegg_M01049",
1402
+ "kegg_M01050",
1403
+ "kegg_M01051",
1404
+ "kegg_M01052",
1405
+ "kegg_M01053",
1406
+ "kegg_M01054",
1407
+ "kegg_M01055",
1408
+ "kegg_M01056",
1409
+ "kegg_M01057",
1410
+ "kegg_M01058",
1411
+ "kegg_M01059",
1412
+ "kegg_M01060",
1413
+ "iso_lat",
1414
+ "iso_lon",
1415
+ "iso_collection_year",
1416
+ "iso_continent_europe",
1417
+ "iso_continent_north_america",
1418
+ "iso_continent_asia",
1419
+ "iso_continent_middle_and_south_america",
1420
+ "iso_continent_australia_and_oceania",
1421
+ "iso_continent_africa",
1422
+ "iso_continent_antarctica",
1423
+ "iso_continent_arctic",
1424
+ "iso_country_sweden",
1425
+ "iso_country_germany",
1426
+ "iso_country_usa",
1427
+ "iso_country_france",
1428
+ "iso_country_china",
1429
+ "iso_country_japan",
1430
+ "iso_country_republic_of_korea",
1431
+ "iso_country_india",
1432
+ "iso_country_united_kingdom",
1433
+ "iso_country_spain",
1434
+ "iso_country_italy",
1435
+ "iso_country_netherlands",
1436
+ "iso_country_egypt",
1437
+ "iso_country_russia",
1438
+ "iso_country_greece",
1439
+ "iso_country_denmark",
1440
+ "iso_country_switzerland",
1441
+ "iso_country_norway",
1442
+ "iso_country_belgium",
1443
+ "iso_country_australia",
1444
+ "iso_country_brazil",
1445
+ "iso_country_canada",
1446
+ "iso_country_thailand",
1447
+ "iso_country_taiwan_province_of_china",
1448
+ "iso_country_south_africa",
1449
+ "iso_country_malaysia",
1450
+ "iso_country_austria",
1451
+ "iso_country_portugal",
1452
+ "iso_country_mexico",
1453
+ "iso_country_czech_republic",
1454
+ "iso_host_kingdom_human",
1455
+ "iso_host_kingdom_animal",
1456
+ "iso_host_kingdom_plant",
1457
+ "iso_host_kingdom_fungal",
1458
+ "iso_host_kingdom_other"
1459
+ ]
1460
+ }
1461
+ }
artifacts/carveme_smoke_status.json ADDED
@@ -0,0 +1,13 @@
+ {
+   "status": "reconstruction_smoke_ok",
+   "input_protein_fasta": "data/external_benchmark_fastas/GCA_000820045.faa.gz",
+   "output_model": "artifacts/carveme_smoke/GCA_000820045.xml",
+   "output_model_size_mb": 9.4,
+   "installed_local_dependency": "diamond 2.2.0 via Homebrew",
+   "command": "uv run --python 3.11 --isolated --with carveme carve data/external_benchmark_fastas/GCA_000820045.faa.gz -o artifacts/carveme_smoke/GCA_000820045.xml --solver scip",
+   "medium_feasibility_scored": false,
+   "reason_medium_not_scored": "CarveMe produced an SBML reconstruction, but comparing it to MediaDive media requires a recipe-to-metabolite/media mapping and a feasible-growth scoring command for each candidate medium.",
+   "gapseq_status": "not_installable_on_this_local_conda_platform",
+   "gapseq_check": "conda search -c conda-forge -c bioconda gapseq --info",
+   "gapseq_result": "PackagesNotFoundError for local osx channels; full gapseq also requires Bacteria/Archaea reference sequence database downloads."
+ }
artifacts/carveme_smoke_status.md ADDED
@@ -0,0 +1,41 @@
+ # CarveMe and gapseq External Baseline Status
+
+ ## CarveMe
+
+ CarveMe was installed in an isolated `uv` environment and run on one held-out
+ protein FASTA generated for the GenomeSPOT smoke benchmark.
+
+ - Input: `data/external_benchmark_fastas/GCA_000820045.faa.gz`
+ - Output: `artifacts/carveme_smoke/GCA_000820045.xml`
+ - Output size: 9.4 MB
+ - Extra local dependency installed: `diamond 2.2.0` via Homebrew
+ - Command:
+
+ ```bash
+ uv run --python 3.11 --isolated --with carveme carve \
+   data/external_benchmark_fastas/GCA_000820045.faa.gz \
+   -o artifacts/carveme_smoke/GCA_000820045.xml \
+   --solver scip
+ ```
+
+ Result: reconstruction smoke test passed.
+
+ Medium feasibility is not scored yet. The missing piece is not model
+ reconstruction; it is a fair mapping from MediaDive recipe labels to compounds or
+ media definitions that CarveMe can gap-fill/test. Without that mapping, a
+ CarveMe-vs-MediaDive hit@k number would be mostly bookkeeping noise.
+
+ ## gapseq
+
+ The local conda platform check did not find `gapseq`:
+
+ ```bash
+ conda search -c conda-forge -c bioconda gapseq --info
+ ```
+
+ Result: `PackagesNotFoundError` on the local macOS conda channels. gapseq is also
+ not just a Python command: the official setup requires downloading Bacteria and
+ Archaea reference sequence databases before real inference.
+
+ Given the current disk state, the full gapseq run should be moved to a Linux
+ machine or cloud instance with substantially more free disk.
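Reviewer note: the status file above refers to a hit@k number without showing how it would be computed. The sketch below is one plausible reading, assuming each strain gets a ranked list of candidate media and a set of known MediaDive media; the function names and the media IDs in the usage note are hypothetical and are not taken from the repository.

```python
from typing import Dict, Iterable, Sequence


def hit_at_k(ranked_media: Sequence[str], known_media: Iterable[str], k: int) -> bool:
    """Return True if any of the top-k ranked media is a known growth medium."""
    known = set(known_media)
    return any(medium in known for medium in ranked_media[:k])


def mean_hit_at_k(
    predictions: Dict[str, Sequence[str]],
    labels: Dict[str, Iterable[str]],
    k: int,
) -> float:
    """Average hit@k over strains that have at least one known medium."""
    # Strains without a medium label are skipped rather than counted as misses.
    scored = [strain for strain in predictions if labels.get(strain)]
    if not scored:
        return float("nan")
    hits = sum(hit_at_k(predictions[s], labels[s], k) for s in scored)
    return hits / len(scored)
```

For example, with `predictions = {"s1": ["m1", "m2", "m3"], "s2": ["m4", "m5"]}` and `labels = {"s1": {"m3"}, "s2": {"m9"}}`, only `s1` can score a hit, and only once `k` reaches 3. The metric is meaningful only if the media IDs on both sides come from the same vocabulary, which is the mapping the status file says is still missing.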
artifacts/external_benchmark_manifest.parquet ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ddf64020f70471ec362e5154ff37b073ce9ec5c4c6262dc9ebe409dcef183b90
+ size 657757
artifacts/external_benchmark_manifest_5k.csv ADDED
The diff for this file is too large to render. See raw diff
 
artifacts/external_benchmark_manifest_5k.parquet ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:109c71729457fddcbd7f8fdba51d998419ed6eb8c7af7b900bc4e242bb641385
+ size 205285
artifacts/external_benchmark_manifest_5k_summary.json ADDED
@@ -0,0 +1,21 @@
+ {
+   "source_manifest": "artifacts/external_benchmark_manifest.parquet",
+   "seed": 20260520,
+   "rows": 5000,
+   "unique_accessions": 5000,
+   "fold_counts": {
+     "0": 1035,
+     "1": 821,
+     "2": 1114,
+     "3": 943,
+     "4": 1087
+   },
+   "label_counts": {
+     "temperature": 5000,
+     "ph": 933,
+     "salt": 779,
+     "oxygen": 2653,
+     "medium": 3889,
+     "all_four_conditions": 416
+   }
+ }
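Reviewer note: a quick consistency check for the summary above, written as a minimal sketch. It assumes only the keys visible in the JSON (`rows`, `fold_counts`); the helper names are illustrative and not part of the repository.

```python
import json

# Subset of the summary shown in the diff, embedded so the check is self-contained.
SUMMARY_TEXT = """
{
  "rows": 5000,
  "unique_accessions": 5000,
  "fold_counts": {"0": 1035, "1": 821, "2": 1114, "3": 943, "4": 1087}
}
"""


def fold_counts_consistent(summary: dict) -> bool:
    """Every row belongs to exactly one fold, so fold sizes must sum to `rows`."""
    return sum(summary["fold_counts"].values()) == summary["rows"]


def fold_shares(summary: dict) -> dict:
    """Fraction of the subsample that falls into each fold."""
    rows = summary["rows"]
    return {fold: count / rows for fold, count in summary["fold_counts"].items()}


summary = json.loads(SUMMARY_TEXT)
```

Only `fold_counts` is expected to sum to `rows`; the `label_counts` entries overlap because one row can carry several labels. The shares also show that the 5k subsample is less evenly split across folds than the full manifest.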
artifacts/external_benchmark_status.json ADDED
@@ -0,0 +1,36 @@
+ {
+   "split_mode": "family",
+   "n_splits": 5,
+   "seed": 7,
+   "elapsed_s": 4.729962110519409,
+   "manifest_parquet": "artifacts/external_benchmark_manifest.parquet",
+   "manifest_csv": "artifacts/external_benchmark_manifest.csv",
+   "rows": 25728,
+   "media_labels": 40,
+   "tools": {
+     "GenomeSPOT": {
+       "command": "uv run python -m genome_spot.genome_spot",
+       "path": "data/external_tools/GenomeSPOT-main"
+     },
+     "CarveMe": {
+       "command": "uv run --with carveme carve",
+       "path": "/opt/homebrew/bin/diamond"
+     },
+     "gapseq": {
+       "command": null,
+       "path": null
+     }
+   },
+   "fasta_coverage": {
+     "fasta_dir": "data/external_benchmark_fastas",
+     "unique_accessions": 16154,
+     "present_fastas": 8,
+     "missing_fastas": 16146,
+     "coverage_pct": 0.04952333787297264
+   },
+   "download": {
+     "attempted": 0,
+     "downloaded": 0,
+     "failed": 0
+   }
+ }
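Reviewer note: the `fasta_coverage` block above can be reproduced from its own counts. The sketch below assumes `coverage_pct` is a percentage (present over unique accessions, times 100), which matches the reported value; the function names are illustrative only.

```python
def fasta_coverage_pct(present_fastas: int, unique_accessions: int) -> float:
    """Percentage of unique genome accessions that have a local FASTA file."""
    if unique_accessions <= 0:
        raise ValueError("unique_accessions must be positive")
    return 100.0 * present_fastas / unique_accessions


def missing_fastas(present_fastas: int, unique_accessions: int) -> int:
    """Number of accessions that still need a FASTA download."""
    return unique_accessions - present_fastas
```

With the values from the status file, `fasta_coverage_pct(8, 16154)` is about 0.0495 and `missing_fastas(8, 16154)` is 16146. The value is already in percent units, which is why it appears as 0.05% in the Markdown report.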
artifacts/external_benchmark_status.md ADDED
@@ -0,0 +1,56 @@
+ # External Tool Benchmark Status
+
+ This file tracks the apples-to-apples benchmark setup for external tools
+ on the same held-out BacDive/MediaDive strains used by the dry-lab media
+ recommender benchmark.
+
+ ## Held-Out Manifest
+
+ - Manifest: `artifacts/external_benchmark_manifest.parquet`
+ - Rows: 25,728
+ - Unique genome accessions: 16,154
+ - Media labels retained: 40
+ - Fold counts: {"0": 5146, "1": 5146, "2": 5146, "3": 5145, "4": 5145}
+
+ Label coverage:
+
+ | Target | Labeled rows |
+ |---|---:|
+ | Temperature | 25,727 |
+ | pH | 2,984 |
+ | Salt | 2,486 |
+ | Oxygen | 9,283 |
+ | Medium | 21,050 |
+
+ ## Local Requirements
+
+ - FASTA directory: `data/external_benchmark_fastas`
+ - FASTAs present: 8 / 16,154 (0.05%)
+ - FASTA download smoke run: {"attempted": 0, "downloaded": 0, "failed": 0}
+
+ | Tool | Local command | Status |
+ |---|---|---|
+ | GenomeSPOT | `uv run python -m genome_spot.genome_spot` | available |
+ | CarveMe | `uv run --with carveme carve` | available |
+ | gapseq | `` | missing |
+
+ ## Verdict
+
+ External baseline execution is not ready on this machine yet: the full held-out FASTA set and one or more external tool binaries/databases are missing.
+
+ ## Next Commands
+
+ Use the manifest to run each external tool against the same rows and folds.
+ The medium-feasibility tools should be scored by whether at least one known
+ MediaDive medium is feasible or closest among the tool's predicted feasible
+ media/metabolite environments.
+
+ ```bash
+ PYTHONPATH=src uv run --python 3.11 python scripts/42_prepare_external_benchmarks.py \
+   --download-fastas 10
+ ```
+
+ For the full benchmark, download the complete FASTA set into the FASTA
+ directory above, install the external tools plus their databases, then run
+ tool-specific inference using the `bacdive_id`, `fold`, and
+ `genome_accession` columns from the manifest.
artifacts/extract_seqs_full.log ADDED
@@ -0,0 +1,467 @@
1
+ ✓ Initialized. View run at
2
+ https://modal.com/apps/miyu-wannaone-horiuchi/main/ap-GCiYQgOnZTEaVptGQgPIy0
3
+ ✓ Created objects.
4
+ ├── 🔨 Created mount
5
+ │ /Users/miyuhoriuchi/microbe-model/scripts/36_extract_marker_sequences.py
6
+ ├── 🔨 Created mount data/markers/unified/unified_markers.hmm
7
+ └── 🔨 Created function MarkerSeqExtractor.*.
8
+ Marker-sequence extract: 45,856 genomes pending (199 cached)
9
+ max_per_cat=16
10
+ [setup] loaded 48 marker HMMs[setup] loaded 48 marker HMMs
11
+
12
+ [setup] loaded 48 marker HMMs
13
+ [setup] loaded 48 marker HMMs
14
+ [setup] loaded 48 marker HMMs
15
+ [setup] loaded 48 marker HMMs
16
+ [setup] loaded 48 marker HMMs
17
+ [setup] loaded 48 marker HMMs
18
+ [setup] loaded 48 marker HMMs
19
+ [setup] loaded 48 marker HMMs
20
+ [setup] loaded 48 marker HMMs
21
+ [setup] loaded 48 marker HMMs
22
+ [setup] loaded 48 marker HMMs
23
+ [setup] loaded 48 marker HMMs
24
+ [setup] loaded 48 marker HMMs
25
+ [setup] loaded 48 marker HMMs
26
+ 100 ok / 0 fail
27
+ 200 ok / 0 fail
28
+ 300 ok / 3 fail
29
+ 400 ok / 3 fail
30
+ 500 ok / 3 fail
31
+ 600 ok / 3 fail
32
+ 700 ok / 3 fail
33
+ 800 ok / 4 fail
34
+ 900 ok / 4 fail
35
+ 1,000 ok / 4 fail
36
+ 1,100 ok / 4 fail
37
+ 1,200 ok / 6 fail
38
+ 1,300 ok / 6 fail
39
+ 1,400 ok / 6 fail
40
+ 1,500 ok / 220 fail
41
+ 1,600 ok / 220 fail
42
+ 1,700 ok / 220 fail
43
+ 1,800 ok / 220 fail
44
+ 1,900 ok / 220 fail
45
+ 2,000 ok / 221 fail
46
+ 2,100 ok / 221 fail
47
+ 2,200 ok / 221 fail
48
+ 2,300 ok / 252 fail
49
+ 2,400 ok / 253 fail
50
+ 2,500 ok / 253 fail
51
+ 2,600 ok / 253 fail
52
+ 2,700 ok / 253 fail
53
+ 2,800 ok / 301 fail
54
+ 2,900 ok / 308 fail
55
+ 3,000 ok / 308 fail
56
+ 3,100 ok / 308 fail
57
+ 3,200 ok / 350 fail
58
+ 3,300 ok / 360 fail
59
+ 3,400 ok / 360 fail
60
+ 3,500 ok / 472 fail
61
+ 3,600 ok / 472 fail
62
+ 3,700 ok / 472 fail
63
+ 3,800 ok / 472 fail
64
+ 3,900 ok / 472 fail
65
+ 4,000 ok / 472 fail
66
+ 4,100 ok / 472 fail
67
+ 4,200 ok / 472 fail
68
+ 4,300 ok / 473 fail
69
+ 4,400 ok / 473 fail
70
+ 4,500 ok / 473 fail
71
+ 4,600 ok / 473 fail
72
+ 4,700 ok / 473 fail
73
+ 4,800 ok / 473 fail
74
+ 4,900 ok / 473 fail
75
+ 5,000 ok / 476 fail
76
+ 5,100 ok / 476 fail
77
+ 5,200 ok / 479 fail
78
+ 5,300 ok / 479 fail
79
+ 5,400 ok / 479 fail
80
+ 5,500 ok / 480 fail
81
+ 5,600 ok / 480 fail
82
+ 5,700 ok / 480 fail
83
+ 5,800 ok / 480 fail
84
+ 5,900 ok / 480 fail
85
+ 6,000 ok / 480 fail
86
+ 6,100 ok / 480 fail
87
+ 6,200 ok / 534 fail
88
+ 6,300 ok / 534 fail
89
+ 6,400 ok / 537 fail
90
+ 6,500 ok / 537 fail
91
+ 6,600 ok / 539 fail
92
+ 6,700 ok / 539 fail
93
+ 6,800 ok / 541 fail
94
+ 6,900 ok / 554 fail
95
+ 7,000 ok / 559 fail
96
+ 7,100 ok / 561 fail
97
+ 7,200 ok / 593 fail
98
+ 7,300 ok / 615 fail
99
+ 7,400 ok / 623 fail
100
+ 7,500 ok / 633 fail
101
+ 7,600 ok / 633 fail
102
+ 7,700 ok / 633 fail
103
+ 7,800 ok / 633 fail
104
+ 7,900 ok / 633 fail
105
+ 8,000 ok / 634 fail
106
+ 8,100 ok / 636 fail
107
+ 8,200 ok / 636 fail
108
+ 8,300 ok / 637 fail
109
+ 8,400 ok / 637 fail
110
+ 8,500 ok / 638 fail
111
+ 8,600 ok / 638 fail
112
+ 8,700 ok / 638 fail
113
+ 8,800 ok / 638 fail
114
+ 8,900 ok / 638 fail
115
+ 9,000 ok / 640 fail
116
+ 9,100 ok / 645 fail
117
+ 9,200 ok / 645 fail
118
+ 9,300 ok / 645 fail
119
+ 9,400 ok / 647 fail
120
+ 9,500 ok / 651 fail
121
+ 9,600 ok / 652 fail
122
+ 9,700 ok / 653 fail
123
+ 9,800 ok / 654 fail
124
+ 9,900 ok / 657 fail
125
+ 10,000 ok / 663 fail
126
+ 10,100 ok / 665 fail
127
+ 10,200 ok / 674 fail
128
+ 10,300 ok / 680 fail
129
+ 10,400 ok / 680 fail
130
+ 10,500 ok / 688 fail
131
+ 10,600 ok / 690 fail
132
+ 10,700 ok / 692 fail
133
+ 10,800 ok / 695 fail
134
+ 10,900 ok / 701 fail
135
+ 11,000 ok / 702 fail
136
+ 11,100 ok / 712 fail
137
+ 11,200 ok / 726 fail
138
+ 11,300 ok / 737 fail
139
+ 11,400 ok / 771 fail
140
+ 11,500 ok / 782 fail
141
+ 11,600 ok / 782 fail
142
+ 11,700 ok / 783 fail
143
+ 11,800 ok / 784 fail
144
+ 11,900 ok / 784 fail
145
+ 12,000 ok / 789 fail
146
+ 12,100 ok / 789 fail
147
+ 12,200 ok / 794 fail
148
+ 12,300 ok / 794 fail
149
+ 12,400 ok / 794 fail
150
+ 12,500 ok / 794 fail
151
+ 12,600 ok / 794 fail
152
+ 12,700 ok / 794 fail
153
+ 12,800 ok / 794 fail
154
+ 12,900 ok / 794 fail
155
+ 13,000 ok / 794 fail
156
+ 13,100 ok / 794 fail
157
+ 13,200 ok / 815 fail
158
+ 13,300 ok / 835 fail
159
+ 13,400 ok / 836 fail
160
+ 13,500 ok / 840 fail
161
+ 13,600 ok / 938 fail
162
+ 13,700 ok / 1,051 fail
163
+ 13,800 ok / 1,094 fail
164
+ 13,900 ok / 1,099 fail
165
+ 14,000 ok / 1,195 fail
166
+ 14,100 ok / 1,381 fail
167
+ 14,200 ok / 1,415 fail
168
+ 14,300 ok / 1,460 fail
169
+ 14,400 ok / 1,460 fail
170
+ 14,500 ok / 1,460 fail
171
+ 14,600 ok / 1,460 fail
172
+ 14,700 ok / 1,460 fail
173
+ 14,800 ok / 1,460 fail
174
+ 14,900 ok / 1,461 fail
175
+ 15,000 ok / 1,461 fail
176
+ 15,100 ok / 1,484 fail
177
+ 15,200 ok / 1,494 fail
178
+ 15,300 ok / 1,494 fail
179
+ 15,400 ok / 1,494 fail
180
+ 15,500 ok / 1,494 fail
181
+ 15,600 ok / 1,495 fail
182
+ 15,700 ok / 1,499 fail
183
+ 15,800 ok / 1,499 fail
184
+ 15,900 ok / 1,509 fail
185
+ 16,000 ok / 1,509 fail
186
+ 16,100 ok / 1,516 fail
187
+ 16,200 ok / 1,516 fail
188
+ 16,300 ok / 1,516 fail
189
+ 16,400 ok / 1,527 fail
190
+ 16,500 ok / 1,529 fail
191
+ 16,600 ok / 1,530 fail
192
+ 16,700 ok / 1,530 fail
193
+ 16,800 ok / 1,530 fail
194
+ 16,900 ok / 1,530 fail
195
+ 17,000 ok / 1,532 fail
196
+ 17,100 ok / 1,533 fail
197
+ 17,200 ok / 1,539 fail
198
+ 17,300 ok / 1,543 fail
199
+ 17,400 ok / 1,543 fail
200
+ 17,500 ok / 1,543 fail
201
+ WARNING: Logs may not be continuous
202
+ WARNING: Logs may not be continuous
203
+ 17,600 ok / 1,544 fail
204
+ 17,700 ok / 1,544 fail
205
+ 17,800 ok / 1,548 fail
206
+ WARNING: Logs may not be continuous
207
+ WARNING: Logs may not be continuous
208
+ 17,900 ok / 1,548 fail
209
+ 18,000 ok / 1,551 fail
210
+ 18,100 ok / 1,552 fail
211
+ WARNING: Logs may not be continuous
212
+ WARNING: Logs may not be continuous
213
+ 18,200 ok / 1,553 fail
214
+ 18,300 ok / 1,555 fail
215
+ 18,400 ok / 1,557 fail
216
+ WARNING: Logs may not be continuous
217
+ WARNING: Logs may not be continuous
218
+ 18,500 ok / 1,558 fail
219
+ 18,600 ok / 1,560 fail
220
+ 18,700 ok / 1,560 fail
221
+ WARNING: Logs may not be continuous
222
+ WARNING: Logs may not be continuous
223
+ 18,800 ok / 1,560 fail
224
+ 18,900 ok / 1,560 fail
225
+ 19,000 ok / 1,560 fail
226
+ WARNING: Logs may not be continuous
227
+ WARNING: Logs may not be continuous
228
+ 19,100 ok / 1,574 fail
229
+ 19,200 ok / 1,574 fail
230
+ WARNING: Logs may not be continuous
231
+ WARNING: Logs may not be continuous
232
+ 19,300 ok / 1,574 fail
233
+ 19,400 ok / 1,575 fail
234
+ 19,500 ok / 1,575 fail
235
+ 19,600 ok / 1,575 fail
236
+ WARNING: Logs may not be continuous
237
+ WARNING: Logs may not be continuous
238
+ 19,700 ok / 1,575 fail
239
+ 19,800 ok / 1,606 fail
240
+ WARNING: Logs may not be continuous
241
+ WARNING: Logs may not be continuous
242
+ 19,900 ok / 1,610 fail
243
+ 20,000 ok / 1,610 fail
244
+ 20,100 ok / 1,610 fail
245
+ WARNING: Logs may not be continuous
246
+ WARNING: Logs may not be continuous
247
+ 20,200 ok / 1,610 fail
248
+ 20,300 ok / 1,610 fail
249
+ 20,400 ok / 1,610 fail
250
+ WARNING: Logs may not be continuous
251
+ WARNING: Logs may not be continuous
252
+ 20,500 ok / 1,610 fail
253
+ 20,600 ok / 1,610 fail
254
+ WARNING: Logs may not be continuous
255
+ WARNING: Logs may not be continuous
256
+ 20,700 ok / 1,610 fail
257
+ 20,800 ok / 1,610 fail
258
+ 20,900 ok / 1,618 fail
259
+ WARNING: Logs may not be continuous
260
+ WARNING: Logs may not be continuous
261
+ [setup] loaded 48 marker HMMs
262
+ 21,000 ok / 1,618 fail
263
+ 21,100 ok / 1,621 fail
264
+ 21,200 ok / 1,621 fail
265
+ 21,300 ok / 1,621 fail
266
+ 21,400 ok / 1,621 fail
267
+ 21,500 ok / 1,621 fail
268
+ 21,600 ok / 1,622 fail
269
+ 21,700 ok / 1,623 fail
270
+ 21,800 ok / 1,623 fail
271
+ 21,900 ok / 1,623 fail
272
+ 22,000 ok / 1,623 fail
273
+ Runner interrupted due to worker preemption. Your Function will be restarted with the same input. For more details, see https://modal.com/docs/guide/preemption
274
+ [setup] loaded 48 marker HMMs
275
+ 22,100 ok / 1,625 fail
276
+ 22,200 ok / 1,628 fail
277
+ 22,300 ok / 1,629 fail
278
+ 22,400 ok / 1,629 fail
279
+ 22,500 ok / 1,629 fail
280
+ 22,600 ok / 1,631 fail
281
+ 22,700 ok / 1,631 fail
282
+ 22,800 ok / 1,631 fail
283
+ 22,900 ok / 1,631 fail
284
+ 23,000 ok / 1,631 fail
285
+ 23,100 ok / 1,640 fail
286
+ 23,200 ok / 1,640 fail
287
+ 23,300 ok / 1,642 fail
288
+ 23,400 ok / 1,643 fail
289
+ 23,500 ok / 1,643 fail
290
+ 23,600 ok / 1,643 fail
291
+ 23,700 ok / 1,643 fail
292
+ 23,800 ok / 1,643 fail
293
+ 23,900 ok / 1,643 fail
294
+ 24,000 ok / 1,651 fail
295
+ 24,100 ok / 1,651 fail
296
+ 24,200 ok / 1,669 fail
297
+ 24,300 ok / 1,669 fail
298
+ 24,400 ok / 1,669 fail
299
+ 24,500 ok / 1,669 fail
300
+ 24,600 ok / 1,669 fail
301
+ 24,700 ok / 1,670 fail
302
+ 24,800 ok / 1,672 fail
303
+ 24,900 ok / 1,680 fail
304
+ 25,000 ok / 1,681 fail
305
+ 25,100 ok / 1,681 fail
306
+ 25,200 ok / 1,682 fail
307
+ 25,300 ok / 1,682 fail
308
+ 25,400 ok / 1,682 fail
309
+ 25,500 ok / 1,682 fail
310
+ 25,600 ok / 1,682 fail
311
+ 25,700 ok / 1,684 fail
312
+ 25,800 ok / 1,687 fail
313
+ 25,900 ok / 1,688 fail
314
+ 26,000 ok / 1,688 fail
315
+ 26,100 ok / 1,688 fail
316
+ 26,200 ok / 1,688 fail
317
+ 26,300 ok / 1,688 fail
318
+ 26,400 ok / 1,689 fail
319
+ 26,500 ok / 1,689 fail
320
+ 26,600 ok / 1,689 fail
321
+ 26,700 ok / 1,689 fail
322
+ 26,800 ok / 1,689 fail
323
+ 26,900 ok / 1,691 fail
324
+ 27,000 ok / 1,691 fail
325
+ 27,100 ok / 1,691 fail
326
+ 27,200 ok / 1,691 fail
327
+ 27,300 ok / 1,694 fail
328
+ 27,400 ok / 1,694 fail
329
+ 27,500 ok / 1,694 fail
330
+ 27,600 ok / 1,694 fail
331
+ 27,700 ok / 1,694 fail
332
+ 27,800 ok / 1,694 fail
333
+ 27,900 ok / 1,694 fail
334
+ 28,000 ok / 1,694 fail
335
+ 28,100 ok / 1,703 fail
336
+ 28,200 ok / 1,703 fail
337
+ 28,300 ok / 1,708 fail
338
+ 28,400 ok / 1,714 fail
339
+ 28,500 ok / 1,714 fail
340
+ 28,600 ok / 1,716 fail
341
+ 28,700 ok / 1,740 fail
342
+ 28,800 ok / 1,740 fail
343
+ 28,900 ok / 1,740 fail
344
+ 29,000 ok / 1,740 fail
345
+ 29,100 ok / 1,740 fail
346
+ 29,200 ok / 1,740 fail
347
+ 29,300 ok / 1,743 fail
348
+ 29,400 ok / 1,743 fail
349
+ 29,500 ok / 1,743 fail
350
+ 29,600 ok / 1,743 fail
351
+ 29,700 ok / 1,777 fail
352
+ 29,800 ok / 1,845 fail
353
+ 29,900 ok / 1,845 fail
354
+ 30,000 ok / 1,845 fail
355
+ 30,100 ok / 1,852 fail
356
+ [setup] loaded 48 marker HMMs
357
+ 30,200 ok / 1,852 fail
358
+ 30,300 ok / 1,853 fail
359
+ 30,400 ok / 1,907 fail
360
+ 30,500 ok / 1,992 fail
361
+ 30,600 ok / 2,109 fail
+ 30,700 ok / 2,109 fail
+ 30,800 ok / 2,109 fail
+ 30,900 ok / 2,109 fail
+ 31,000 ok / 2,125 fail
+ 31,100 ok / 2,151 fail
+ Runner interrupted due to worker preemption. Your Function will be restarted with the same input. For more details, see https://modal.com/docs/guide/preemption
+ [setup] loaded 48 marker HMMs
+ 31,200 ok / 2,153 fail
+ 31,300 ok / 2,163 fail
+ 31,400 ok / 2,205 fail
+ 31,500 ok / 2,205 fail
+ 31,600 ok / 2,208 fail
+ 31,700 ok / 2,212 fail
+ 31,800 ok / 2,222 fail
+ 31,900 ok / 2,225 fail
+ 32,000 ok / 2,228 fail
+ 32,100 ok / 2,239 fail
+ 32,200 ok / 2,246 fail
+ 32,300 ok / 2,247 fail
+ 32,400 ok / 2,254 fail
+ [setup] loaded 48 marker HMMs
+ 32,500 ok / 2,255 fail
+ 32,600 ok / 2,284 fail
+ 32,700 ok / 2,302 fail
+ 32,800 ok / 2,311 fail
+ 32,900 ok / 2,322 fail
+ 33,000 ok / 2,322 fail
+ 33,100 ok / 2,325 fail
+ 33,200 ok / 2,325 fail
+ 33,300 ok / 2,326 fail
+ 33,400 ok / 2,344 fail
+ 33,500 ok / 2,351 fail
+ 33,600 ok / 2,356 fail
+ 33,700 ok / 2,356 fail
+ 33,800 ok / 2,383 fail
+ 33,900 ok / 2,563 fail
+ 34,000 ok / 2,631 fail
+ 34,100 ok / 2,655 fail
+ 34,200 ok / 3,031 fail
+ 34,300 ok / 3,033 fail
+ 34,400 ok / 3,040 fail
+ 34,500 ok / 3,040 fail
+ 34,600 ok / 3,104 fail
+ 34,700 ok / 3,104 fail
+ 34,800 ok / 3,205 fail
+ 34,900 ok / 3,208 fail
+ 35,000 ok / 3,208 fail
+ 35,100 ok / 3,357 fail
+ 35,200 ok / 3,479 fail
+ 35,300 ok / 3,760 fail
+ 35,400 ok / 3,835 fail
+ 35,500 ok / 3,926 fail
+ 35,600 ok / 3,926 fail
+ 35,700 ok / 3,926 fail
+ 35,800 ok / 3,926 fail
+ 35,900 ok / 3,936 fail
+ 36,000 ok / 3,942 fail
+ 36,100 ok / 3,960 fail
+ [setup] loaded 48 marker HMMs
+ 36,200 ok / 4,223 fail
+ 36,300 ok / 4,223 fail
+ 36,400 ok / 4,223 fail
+ 36,500 ok / 4,226 fail
+ 36,600 ok / 4,226 fail
+ 36,700 ok / 4,268 fail
+ 36,800 ok / 4,269 fail
+ 36,900 ok / 4,347 fail
+ 37,000 ok / 4,442 fail
+ 37,100 ok / 4,442 fail
+ 37,200 ok / 4,485 fail
+ 37,300 ok / 4,485 fail
+ 37,400 ok / 4,499 fail
+ 37,500 ok / 4,538 fail
+ 37,600 ok / 4,563 fail
+ 37,700 ok / 4,980 fail
+ 37,800 ok / 5,300 fail
+ 37,900 ok / 5,300 fail
+ 38,000 ok / 5,304 fail
+ 38,100 ok / 5,304 fail
+ 38,200 ok / 5,316 fail
+ 38,300 ok / 5,316 fail
+ 38,400 ok / 5,472 fail
+ 38,500 ok / 5,473 fail
+ 38,600 ok / 5,474 fail
+ 38,700 ok / 5,540 fail
+ 38,800 ok / 5,540 fail
+ 38,900 ok / 5,540 fail
+ 39,000 ok / 5,585 fail
+ 39,100 ok / 5,585 fail
+ 39,200 ok / 5,585 fail
+ 39,300 ok / 5,585 fail
+ 39,400 ok / 5,585 fail
+ 39,500 ok / 5,585 fail
+ 39,600 ok / 5,585 fail
+ 39,700 ok / 5,585 fail
+ 39,800 ok / 5,585 fail
+ 39,900 ok / 5,586 fail
+ 40,000 ok / 5,586 fail
+ 40,100 ok / 5,586 fail
+ 40,200 ok / 5,586 fail
+
+ Finished. 40,270 succeeded, 5,586 failed.
+ Streamed to data/marker_sequences.jsonl
+ Stopping app - local entrypoint completed.
+ ✓ App completed. View run at
+ https://modal.com/apps/miyu-wannaone-horiuchi/main/ap-GCiYQgOnZTEaVptGQgPIy0
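The progress lines in this log follow an `N ok / M fail` pattern, so the overall failure rate and any bursts of failures can be recovered from the log text alone. The following is a minimal sketch, assuming that line format holds for the whole file; `parse_progress` and `failure_rate` are hypothetical helper names, not functions from the repository's scripts.

```python
import re
from typing import List, Optional, Tuple

# Matches progress lines such as "+ 40,200 ok / 5,586 fail".
PROGRESS_RE = re.compile(r"([\d,]+) ok / ([\d,]+) fail")
# Matches the closing summary line of the run.
FINAL_RE = re.compile(r"Finished\. ([\d,]+) succeeded, ([\d,]+) failed\.")


def _to_int(text: str) -> int:
    """Convert a thousands-separated count such as '40,270' to an int."""
    return int(text.replace(",", ""))


def parse_progress(log_text: str) -> List[Tuple[int, int]]:
    """Return (ok, fail) pairs for every progress line, in log order."""
    return [(_to_int(ok), _to_int(fail)) for ok, fail in PROGRESS_RE.findall(log_text)]


def failure_rate(log_text: str) -> Optional[float]:
    """Failure fraction from the final 'Finished.' line, or None if it is absent."""
    match = FINAL_RE.search(log_text)
    if match is None:
        return None
    ok, fail = _to_int(match.group(1)), _to_int(match.group(2))
    return fail / (ok + fail)


# Sample lines copied from the end of the log above.
sample = """
+ 40,100 ok / 5,586 fail
+ 40,200 ok / 5,586 fail
+ Finished. 40,270 succeeded, 5,586 failed.
"""
pairs = parse_progress(sample)
rate = failure_rate(sample)
print(pairs, rate)
```

Differences between consecutive `(ok, fail)` pairs make failure bursts easy to spot, such as the jump from 2,655 to 3,031 failures between the 34,100 and 34,200 marks.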
artifacts/extract_seqs_smoke.log ADDED
@@ -0,0 +1,79 @@
+ ✓ Initialized. View run at
+ https://modal.com/apps/miyu-wannaone-horiuchi/main/ap-196pygWm6l2FWYPkEkdjUH
+ Building image im-Zq0oFsY4NHhsLRUkQXD217
+
+ => Step 0: FROM base
+
+ => Step 1: RUN python -m pip install 'pyhmmer>=0.12' 'pyrodigal>=3.5' 'requests>=2.32'
+ Looking in indexes: http://pypi-mirror.modal.local:5555/simple
+ Collecting pyhmmer>=0.12
+ Downloading http://pypi-mirror.modal.local:5555/simple/pyhmmer/pyhmmer-0.12.1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (14 kB)
+ Collecting pyrodigal>=3.5
+ Downloading http://pypi-mirror.modal.local:5555/simple/pyrodigal/pyrodigal-3.7.1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (15 kB)
+ Collecting requests>=2.32
+ Downloading http://pypi-mirror.modal.local:5555/simple/requests/requests-2.34.2-py3-none-any.whl.metadata (4.8 kB)
+ Collecting psutil<8.0,>=5.0 (from pyhmmer>=0.12)
+ Downloading http://pypi-mirror.modal.local:5555/simple/psutil/psutil-7.2.2-cp36-abi3-manylinux2010_x86_64.manylinux_2_12_x86_64.manylinux_2_28_x86_64.whl.metadata (22 kB)
+ Collecting archspec~=0.2.0 (from pyrodigal>=3.5)
+ Downloading http://pypi-mirror.modal.local:5555/simple/archspec/archspec-0.2.6-py3-none-any.whl.metadata (4.4 kB)
+ Collecting charset_normalizer<4,>=2 (from requests>=2.32)
+ Downloading http://pypi-mirror.modal.local:5555/simple/charset-normalizer/charset_normalizer-3.4.7-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (40 kB)
+ Collecting idna<4,>=2.5 (from requests>=2.32)
+ Downloading http://pypi-mirror.modal.local:5555/simple/idna/idna-3.15-py3-none-any.whl.metadata (7.7 kB)
+ Collecting urllib3<3,>=1.26 (from requests>=2.32)
+ Downloading http://pypi-mirror.modal.local:5555/simple/urllib3/urllib3-2.7.0-py3-none-any.whl.metadata (6.9 kB)
+ Collecting certifi>=2023.5.7 (from requests>=2.32)
+ Downloading http://pypi-mirror.modal.local:5555/simple/certifi/certifi-2026.4.22-py3-none-any.whl.metadata (2.5 kB)
+ Downloading http://pypi-mirror.modal.local:5555/simple/pyhmmer/pyhmmer-0.12.1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (4.2 MB)
+ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.2/4.2 MB 187.8 MB/s 0:00:00
+ Downloading http://pypi-mirror.modal.local:5555/simple/psutil/psutil-7.2.2-cp36-abi3-manylinux2010_x86_64.manylinux_2_12_x86_64.manylinux_2_28_x86_64.whl (155 kB)
+ Downloading http://pypi-mirror.modal.local:5555/simple/pyrodigal/pyrodigal-3.7.1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (3.0 MB)
+ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.0/3.0 MB 181.7 MB/s 0:00:00
+ Downloading http://pypi-mirror.modal.local:5555/simple/archspec/archspec-0.2.6-py3-none-any.whl (81 kB)
+ Downloading http://pypi-mirror.modal.local:5555/simple/requests/requests-2.34.2-py3-none-any.whl (73 kB)
+ Downloading http://pypi-mirror.modal.local:5555/simple/charset-normalizer/charset_normalizer-3.4.7-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (214 kB)
+ Downloading http://pypi-mirror.modal.local:5555/simple/idna/idna-3.15-py3-none-any.whl (72 kB)
+ Downloading http://pypi-mirror.modal.local:5555/simple/urllib3/urllib3-2.7.0-py3-none-any.whl (131 kB)
+ Downloading http://pypi-mirror.modal.local:5555/simple/certifi/certifi-2026.4.22-py3-none-any.whl (135 kB)
+ Installing collected packages: urllib3, psutil, idna, charset_normalizer, certifi, archspec, requests, pyrodigal, pyhmmer
+
+ Successfully installed archspec-0.2.6 certifi-2026.4.22 charset_normalizer-3.4.7 idna-3.15 psutil-7.2.2 pyhmmer-0.12.1 pyrodigal-3.7.1 requests-2.34.2 urllib3-2.7.0
+
+ [notice] A new release of pip is available: 25.2 -> 26.1.1
+ [notice] To update, run: pip install --upgrade pip
+ Saving image...
+ Image saved, took 801.09ms
+
+ Built image im-Zq0oFsY4NHhsLRUkQXD217 in 4.20s
+
+
+ ✓ Created objects.
+ ├── 🔨 Created mount
+ │ /Users/miyuhoriuchi/microbe-model/scripts/36_extract_marker_sequences.py
+ ├── 🔨 Created mount data/markers/unified/unified_markers.hmm
+ └── 🔨 Created function MarkerSeqExtractor.*.
+ Marker-sequence extract: 200 genomes pending (0 cached)
+ max_per_cat=16
+ [setup] loaded 48 marker HMMs
+ [setup] loaded 48 marker HMMs
+ [setup] loaded 48 marker HMMs
+ [setup] loaded 48 marker HMMs
+ [setup] loaded 48 marker HMMs
+ [setup] loaded 48 marker HMMs
+ [setup] loaded 48 marker HMMs
+ [setup] loaded 48 marker HMMs
+ [setup] loaded 48 marker HMMs
+ [setup] loaded 48 marker HMMs
+ [setup] loaded 48 marker HMMs
+ [setup] loaded 48 marker HMMs[setup] loaded 48 marker HMMs
+
+ [setup] loaded 48 marker HMMs
+ [setup] loaded 48 marker HMMs
+ [setup] loaded 48 marker HMMs
+ 100 ok / 1 fail
+
+ Finished. 199 succeeded, 1 failed.
+ Streamed to data/marker_sequences.jsonl
+ Stopping app - local entrypoint completed.
+ ✓ App completed. View run at
+ https://modal.com/apps/miyu-wannaone-horiuchi/main/ap-196pygWm6l2FWYPkEkdjUH
artifacts/genomespot_5k_benchmark.json ADDED
The diff for this file is too large to render. See raw diff
 
artifacts/genomespot_5k_benchmark.md ADDED
@@ -0,0 +1,28 @@
+ # GenomeSPOT Held-Out Benchmark
+
+ GenomeSPOT was run on rows selected from the same held-out manifest used
+ by the microbe-model media benchmark. This run uses a deterministic 5,000
+ unique-genome subset selected from the family-heldout manifest.
+
+ ## Setup
+
+ - Manifest: `artifacts/external_benchmark_manifest_5k.parquet`
+ - Limit: 5000
+ - Required labels: none
+ - GenomeSPOT source: `data/external_tools/GenomeSPOT-main`
+ - FASTA directory: `data/external_benchmark_fastas`
+
+ ## Results
+
+ - OK: 5000 / 5000
+ - Failed/skipped: 0
+ - Mean runtime per OK genome: 6.37s
+ - Temperature MAE: 4.393 C
+ - pH MAE: 0.608
+ - Salt MAE: 1.981%
+
+ ## Notes
+
+ GenomeSPOT oxygen is a tolerant/not-tolerant label, while microbe-model
+ uses BacDive oxygen categories. The smoke report keeps raw labels rather
+ than forcing an evaluation mapping that may hide label-definition mismatch.
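The MAE figures in this report are mean absolute errors between GenomeSPOT predictions and the manifest labels. The following is a minimal sketch of that metric, assuming rows shaped like the `results` entries in the benchmark JSON and assuming that rows with a missing label or prediction are skipped (the report lists "Required labels: none", so some rows may lack a label). `mean_absolute_error` is a hypothetical helper, and the sample rows are made up for illustration rather than taken from the benchmark.

```python
from typing import Iterable, Mapping, Optional


def mean_absolute_error(
    rows: Iterable[Mapping[str, object]], true_key: str, pred_key: str
) -> Optional[float]:
    """Mean of |true - pred| over rows where both values are present."""
    errors = [
        abs(float(row[true_key]) - float(row[pred_key]))
        for row in rows
        if row.get(true_key) is not None and row.get(pred_key) is not None
    ]
    # With no usable rows there is nothing to average.
    return sum(errors) / len(errors) if errors else None


# Illustrative rows only; the field names mirror the benchmark JSON.
rows = [
    {"true_temperature_c": 30.0, "genomespot_temperature_c": 26.0},
    {"true_temperature_c": 40.0, "genomespot_temperature_c": 37.0},
    {"true_temperature_c": None, "genomespot_temperature_c": 25.0},  # skipped: no label
]
mae = mean_absolute_error(rows, "true_temperature_c", "genomespot_temperature_c")
print(mae)
```

The same helper would apply to the pH and salt fields by swapping the two key names.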
artifacts/genomespot_smoke_benchmark.json ADDED
@@ -0,0 +1,119 @@
+ {
+   "manifest": "artifacts/external_benchmark_manifest.parquet",
+   "genome_spot_dir": "data/external_tools/GenomeSPOT-main",
+   "fasta_dir": "data/external_benchmark_fastas",
+   "limit": 5,
+   "fold": null,
+   "required_labels": [
+     "temperature",
+     "ph",
+     "salt",
+     "oxygen"
+   ],
+   "summary": {
+     "n_requested": 5,
+     "n_ok": 5,
+     "n_failed_or_skipped": 0,
+     "temperature_mae_c": 6.765308446813765,
+     "ph_mae": 0.839113849774097,
+     "salt_mae_pct": 2.186273235501805,
+     "mean_elapsed_s": 7.024987316131591
+   },
+   "results": [
+     {
+       "bacdive_id": 288,
+       "genome_accession": "GCA_000820045",
+       "fold": 0,
+       "status": "ok",
+       "elapsed_s": 8.004921197891235,
+       "true_temperature_c": 30.0,
+       "true_ph": 8.85,
+       "true_salt_pct": 3.0,
+       "true_oxygen": "facultative anaerobe",
+       "genomespot_temperature_c": 25.04088968365003,
+       "genomespot_temperature_error": 5.908610260400278,
+       "genomespot_ph": 7.218559960869918,
+       "genomespot_ph_error": 0.9269488860287124,
+       "genomespot_salt_pct": 1.2846838543664987,
+       "genomespot_salt_error": 1.89111946010033,
+       "genomespot_oxygen": "tolerant",
+       "genomespot_oxygen_probability": 0.9534985556874084
+     },
+     {
+       "bacdive_id": 437,
+       "genome_accession": "GCA_020687825",
+       "fold": 0,
+       "status": "ok",
+       "elapsed_s": 6.64178991317749,
+       "true_temperature_c": 40.0,
+       "true_ph": 7.5,
+       "true_salt_pct": 0.0,
+       "true_oxygen": "obligate aerobe",
+       "genomespot_temperature_c": 30.968961771189043,
+       "genomespot_temperature_error": 5.607908850605381,
+       "genomespot_ph": 7.518364363247092,
+       "genomespot_ph_error": 0.8905373705895864,
+       "genomespot_salt_pct": 3.945249495463366,
+       "genomespot_salt_error": 2.708144543741642,
+       "genomespot_oxygen": "tolerant",
+       "genomespot_oxygen_probability": 0.9872012079302094
+     },
+     {
+       "bacdive_id": 461,
+       "genome_accession": "GCA_000428985",
+       "fold": 0,
+       "status": "ok",
+       "elapsed_s": 6.950576305389404,
+       "true_temperature_c": 30.0,
+       "true_ph": 9.0,
+       "true_salt_pct": 3.25,
+       "true_oxygen": "facultative anaerobe",
+       "genomespot_temperature_c": 35.01874579459905,
+       "genomespot_temperature_error": 6.392270488048381,
+       "genomespot_ph": 7.498319424452189,
+       "genomespot_ph_error": 0.8890270592649339,
+       "genomespot_salt_pct": 6.173325202743865,
+       "genomespot_salt_error": 4.740765490619152,
+       "genomespot_oxygen": "tolerant",
+       "genomespot_oxygen_probability": 0.722387623101007
+     },
+     {
+       "bacdive_id": 462,
+       "genome_accession": "GCA_000378045",
+       "fold": 0,
+       "status": "ok",
+       "elapsed_s": 6.953850269317627,
+       "true_temperature_c": 25.0,
+       "true_ph": 7.0,
+       "true_salt_pct": 3.0,
+       "true_oxygen": "aerobe",
+       "genomespot_temperature_c": 31.044112938407967,
+       "genomespot_temperature_error": 5.6003497982319,
+       "genomespot_ph": 7.605265059073084,
+       "genomespot_ph_error": 0.8967230869811786,
+       "genomespot_salt_pct": 3.388669539545697,
+       "genomespot_salt_error": 2.677469176228756,
+       "genomespot_oxygen": "tolerant",
+       "genomespot_oxygen_probability": 0.7674566512031347
+     },
+     {
+       "bacdive_id": 466,
+       "genome_accession": "GCA_900112305",
+       "fold": 0,
+       "status": "ok",
+       "elapsed_s": 6.573798894882202,
+       "true_temperature_c": 46.0,
+       "true_ph": 7.0,
+       "true_salt_pct": 1.5,
+       "true_oxygen": "facultative anaerobe",
+       "genomespot_temperature_c": 37.22646504409912,
+       "genomespot_temperature_error": 6.99763526913742,
+       "genomespot_ph": 7.438819211872416,
+       "genomespot_ph_error": 0.8942408576036942,
+       "genomespot_salt_pct": 3.4588057941225956,
+       "genomespot_salt_error": 2.689844701946755,
+       "genomespot_oxygen": "tolerant",
+       "genomespot_oxygen_probability": 0.8284586405204845
+     }
+   ]
+ }
artifacts/genomespot_smoke_benchmark.md ADDED
@@ -0,0 +1,28 @@
+ # GenomeSPOT Held-Out Smoke Benchmark
+
+ GenomeSPOT was run on rows selected from the same held-out manifest used
+ by the microbe-model media benchmark. This is a smoke benchmark unless
+ `limit` equals the full manifest size.
+
+ ## Setup
+
+ - Manifest: `artifacts/external_benchmark_manifest.parquet`
+ - Limit: 5
+ - Required labels: temperature, ph, salt, oxygen
+ - GenomeSPOT source: `data/external_tools/GenomeSPOT-main`
+ - FASTA directory: `data/external_benchmark_fastas`
+
+ ## Results
+
+ - OK: 5 / 5
+ - Failed/skipped: 0
+ - Mean runtime per OK genome: 7.02s
+ - Temperature MAE: 6.765 C
+ - pH MAE: 0.839
+ - Salt MAE: 2.186%
+
+ ## Notes
+
+ GenomeSPOT oxygen is a tolerant/not-tolerant label, while microbe-model
+ uses BacDive oxygen categories. The smoke report keeps raw labels rather
+ than forcing an evaluation mapping that may hide label-definition mismatch.
artifacts/lora/fold0_oxygen_diagnostics.json ADDED
@@ -0,0 +1,321 @@
+ {
+   "checkpoint": "artifacts/lora/fold0_best.pt",
+   "n": 2276,
+   "classes": [
+     "aerobe",
+     "anaerobe",
+     "facultative_anaerobe",
+     "microaerobe"
+   ],
+   "accuracy": 0.956503,
+   "macro_f1": 0.946019,
+   "macro_f1_all_classes": 0.47301,
+   "confusion_matrix": [
+     [
+       1590,
+       33,
+       0,
+       0
+     ],
+     [
+       66,
+       587,
+       0,
+       0
+     ],
+     [
+       0,
+       0,
+       0,
+       0
+     ],
+     [
+       0,
+       0,
+       0,
+       0
+     ]
+   ],
+   "per_class": {
+     "aerobe": {
+       "precision": 0.960145,
+       "recall": 0.979667,
+       "f1": 0.969808,
+       "support": 1623,
+       "predicted": 1656
+     },
+     "anaerobe": {
+       "precision": 0.946774,
+       "recall": 0.898928,
+       "f1": 0.922231,
+       "support": 653,
+       "predicted": 620
+     },
+     "facultative_anaerobe": {
+       "precision": 0.0,
+       "recall": 0.0,
+       "f1": 0.0,
+       "support": 0,
+       "predicted": 0
+     },
+     "microaerobe": {
+       "precision": 0.0,
+       "recall": 0.0,
+       "f1": 0.0,
+       "support": 0,
+       "predicted": 0
+     }
+   },
+   "wrong_predictions": [
+     {
+       "bacdive_id": 499,
+       "genome_accession": "GCA_000429505",
+       "group": "Alteromonadaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.999306,
+       "true_probability": 0.000693,
+       "margin": 0.998613
+     },
+     {
+       "bacdive_id": 481,
+       "genome_accession": "GCA_003363485",
+       "group": "Alteromonadaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.999155,
+       "true_probability": 0.000845,
+       "margin": 0.99831
+     },
+     {
+       "bacdive_id": 498,
+       "genome_accession": "GCA_000429485",
+       "group": "Alteromonadaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.997683,
+       "true_probability": 0.002315,
+       "margin": 0.995368
+     },
+     {
+       "bacdive_id": 168525,
+       "genome_accession": "GCA_006386545",
+       "group": "Sphaerotilaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.997474,
+       "true_probability": 0.002524,
+       "margin": 0.99495
+     },
+     {
+       "bacdive_id": 483,
+       "genome_accession": "GCA_000421165",
+       "group": "Alteromonadaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.99587,
+       "true_probability": 0.004127,
+       "margin": 0.991743
+     },
+     {
+       "bacdive_id": 17841,
+       "genome_accession": "GCA_000975055",
+       "group": "Demequinaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.995458,
+       "true_probability": 0.004536,
+       "margin": 0.990923
+     },
+     {
+       "bacdive_id": 17840,
+       "genome_accession": "GCA_000975035",
+       "group": "Demequinaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.987744,
+       "true_probability": 0.012242,
+       "margin": 0.975502
+     },
+     {
+       "bacdive_id": 164735,
+       "genome_accession": "GCF_943590815.1",
+       "group": "Enterobacteriaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.98457,
+       "true_probability": 0.015418,
+       "margin": 0.969152
+     },
+     {
+       "bacdive_id": 148058,
+       "genome_accession": "GCF_000005845.2",
+       "group": "Enterobacteriaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.981892,
+       "true_probability": 0.018096,
+       "margin": 0.963796
+     },
+     {
+       "bacdive_id": 149706,
+       "genome_accession": "GCF_000005845.2",
+       "group": "Enterobacteriaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.981892,
+       "true_probability": 0.018096,
+       "margin": 0.963796
+     },
+     {
+       "bacdive_id": 156346,
+       "genome_accession": "GCF_000005845.2",
+       "group": "Enterobacteriaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.981892,
+       "true_probability": 0.018096,
+       "margin": 0.963796
+     },
+     {
+       "bacdive_id": 133974,
+       "genome_accession": "GCA_001544515",
+       "group": "Burkholderiaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.979916,
+       "true_probability": 0.020068,
+       "margin": 0.959847
+     },
+     {
+       "bacdive_id": 133484,
+       "genome_accession": "GCA_900102145",
+       "group": "Thermaceae",
+       "true": "aerobe",
+       "pred": "anaerobe",
+       "confidence": 0.977195,
+       "true_probability": 0.0228,
+       "margin": 0.954396
+     },
+     {
+       "bacdive_id": 5882,
+       "genome_accession": "GCA_000215915",
+       "group": "Haloarculaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.973115,
+       "true_probability": 0.026812,
+       "margin": 0.946303
+     },
+     {
+       "bacdive_id": 133991,
+       "genome_accession": "GCA_039544205",
+       "group": "Ferrimonadaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.962063,
+       "true_probability": 0.037911,
+       "margin": 0.924153
+     },
+     {
+       "bacdive_id": 132346,
+       "genome_accession": "GCA_002259755",
+       "group": "Bifidobacteriaceae",
+       "true": "aerobe",
+       "pred": "anaerobe",
+       "confidence": 0.960132,
+       "true_probability": 0.039802,
+       "margin": 0.920329
+     },
+     {
+       "bacdive_id": 132623,
+       "genome_accession": "GCF_003967195.1",
+       "group": "Granulosicoccaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.960109,
+       "true_probability": 0.039865,
+       "margin": 0.920244
+     },
+     {
+       "bacdive_id": 24101,
+       "genome_accession": "GCA_000242915",
+       "group": "Helicobacteraceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.959066,
+       "true_probability": 0.040876,
+       "margin": 0.91819
+     },
+     {
+       "bacdive_id": 160296,
+       "genome_accession": "GCA_010667645",
+       "group": "Bifidobacteriaceae",
+       "true": "aerobe",
+       "pred": "anaerobe",
+       "confidence": 0.957143,
+       "true_probability": 0.042813,
+       "margin": 0.91433
+     },
+     {
+       "bacdive_id": 140694,
+       "genome_accession": "GCA_003336745",
+       "group": "Thermaceae",
+       "true": "aerobe",
+       "pred": "anaerobe",
+       "confidence": 0.952379,
+       "true_probability": 0.04761,
+       "margin": 0.904769
+     },
+     {
+       "bacdive_id": 168303,
+       "genome_accession": "GCA_004307015",
+       "group": "Thermaceae",
+       "true": "aerobe",
+       "pred": "anaerobe",
+       "confidence": 0.951955,
+       "true_probability": 0.048036,
+       "margin": 0.903919
+     },
+     {
+       "bacdive_id": 134099,
+       "genome_accession": "GCA_023349185",
+       "group": "Shewanellaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.948975,
+       "true_probability": 0.050994,
+       "margin": 0.89798
+     },
+     {
+       "bacdive_id": 159192,
+       "genome_accession": "GCA_003721225",
+       "group": "Acidithiobacillaceae",
+       "true": "aerobe",
+       "pred": "anaerobe",
+       "confidence": 0.948142,
+       "true_probability": 0.051834,
+       "margin": 0.896308
+     },
+     {
+       "bacdive_id": 154004,
+       "genome_accession": "GCF_982443925.1",
+       "group": "Enterobacteriaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.945632,
+       "true_probability": 0.054345,
+       "margin": 0.891287
+     },
+     {
+       "bacdive_id": 159320,
+       "genome_accession": "GCA_009078285",
+       "group": "Bifidobacteriaceae",
+       "true": "aerobe",
+       "pred": "anaerobe",
+       "confidence": 0.945592,
+       "true_probability": 0.054343,
+       "margin": 0.891249
+     }
+   ]
+ }
artifacts/lora/fold0_oxygen_diagnostics.md ADDED
@@ -0,0 +1,56 @@
+ # LoRA Oxygen Diagnostics
+
+ Checkpoint: `artifacts/lora/fold0_best.pt`
+
+ - Labeled validation rows: `2276`
+ - Accuracy: `0.9565`
+ - Macro F1 (supported classes): `0.9460`
+ - Macro F1 (all configured classes): `0.4730`
+
+ ## Per-Class Metrics
+
+ | Class | Precision | Recall | F1 | Support | Predicted |
+ |---|---:|---:|---:|---:|---:|
+ | aerobe | 0.9601 | 0.9797 | 0.9698 | 1623 | 1656 |
+ | anaerobe | 0.9468 | 0.8989 | 0.9222 | 653 | 620 |
+ | facultative_anaerobe | 0.0000 | 0.0000 | 0.0000 | 0 | 0 |
+ | microaerobe | 0.0000 | 0.0000 | 0.0000 | 0 | 0 |
+
+ ## Confusion Matrix
+
+ | True \ Pred | aerobe | anaerobe | facultative_anaerobe | microaerobe |
+ |---|---:|---:|---:|---:|
+ | aerobe | 1590 | 33 | 0 | 0 |
+ | anaerobe | 66 | 587 | 0 | 0 |
+ | facultative_anaerobe | 0 | 0 | 0 | 0 |
+ | microaerobe | 0 | 0 | 0 | 0 |
+
+ ## High-Confidence Wrong Predictions
+
+ | BacDive ID | Genome | Group | True | Pred | Confidence | True Prob. | Margin |
+ |---:|---|---|---|---|---:|---:|---:|
+ | 499 | GCA_000429505 | Alteromonadaceae | anaerobe | aerobe | 0.9993 | 0.0007 | 0.9986 |
+ | 481 | GCA_003363485 | Alteromonadaceae | anaerobe | aerobe | 0.9992 | 0.0008 | 0.9983 |
+ | 498 | GCA_000429485 | Alteromonadaceae | anaerobe | aerobe | 0.9977 | 0.0023 | 0.9954 |
+ | 168525 | GCA_006386545 | Sphaerotilaceae | anaerobe | aerobe | 0.9975 | 0.0025 | 0.9950 |
+ | 483 | GCA_000421165 | Alteromonadaceae | anaerobe | aerobe | 0.9959 | 0.0041 | 0.9917 |
+ | 17841 | GCA_000975055 | Demequinaceae | anaerobe | aerobe | 0.9955 | 0.0045 | 0.9909 |
+ | 17840 | GCA_000975035 | Demequinaceae | anaerobe | aerobe | 0.9877 | 0.0122 | 0.9755 |
+ | 164735 | GCF_943590815.1 | Enterobacteriaceae | anaerobe | aerobe | 0.9846 | 0.0154 | 0.9692 |
+ | 148058 | GCF_000005845.2 | Enterobacteriaceae | anaerobe | aerobe | 0.9819 | 0.0181 | 0.9638 |
+ | 149706 | GCF_000005845.2 | Enterobacteriaceae | anaerobe | aerobe | 0.9819 | 0.0181 | 0.9638 |
+ | 156346 | GCF_000005845.2 | Enterobacteriaceae | anaerobe | aerobe | 0.9819 | 0.0181 | 0.9638 |
+ | 133974 | GCA_001544515 | Burkholderiaceae | anaerobe | aerobe | 0.9799 | 0.0201 | 0.9598 |
+ | 133484 | GCA_900102145 | Thermaceae | aerobe | anaerobe | 0.9772 | 0.0228 | 0.9544 |
+ | 5882 | GCA_000215915 | Haloarculaceae | anaerobe | aerobe | 0.9731 | 0.0268 | 0.9463 |
+ | 133991 | GCA_039544205 | Ferrimonadaceae | anaerobe | aerobe | 0.9621 | 0.0379 | 0.9242 |
+ | 132346 | GCA_002259755 | Bifidobacteriaceae | aerobe | anaerobe | 0.9601 | 0.0398 | 0.9203 |
+ | 132623 | GCF_003967195.1 | Granulosicoccaceae | anaerobe | aerobe | 0.9601 | 0.0399 | 0.9202 |
+ | 24101 | GCA_000242915 | Helicobacteraceae | anaerobe | aerobe | 0.9591 | 0.0409 | 0.9182 |
+ | 160296 | GCA_010667645 | Bifidobacteriaceae | aerobe | anaerobe | 0.9571 | 0.0428 | 0.9143 |
+ | 140694 | GCA_003336745 | Thermaceae | aerobe | anaerobe | 0.9524 | 0.0476 | 0.9048 |
+ | 168303 | GCA_004307015 | Thermaceae | aerobe | anaerobe | 0.9520 | 0.0480 | 0.9039 |
+ | 134099 | GCA_023349185 | Shewanellaceae | anaerobe | aerobe | 0.9490 | 0.0510 | 0.8980 |
+ | 159192 | GCA_003721225 | Acidithiobacillaceae | aerobe | anaerobe | 0.9481 | 0.0518 | 0.8963 |
+ | 154004 | GCF_982443925.1 | Enterobacteriaceae | anaerobe | aerobe | 0.9456 | 0.0543 | 0.8913 |
+ | 159320 | GCA_009078285 | Bifidobacteriaceae | aerobe | anaerobe | 0.9456 | 0.0543 | 0.8912 |
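In the wrong-prediction table, the margin column appears to equal the predicted-class confidence minus the probability assigned to the true class (for the first row, 0.999306 minus 0.000693 gives 0.998613). The following is a minimal sketch that recomputes the margin and counts errors per taxonomic group, assuming that definition holds. `margin` and `errors_by_group` are hypothetical names, and the three sample rows are copied from the diagnostics JSON.

```python
from collections import Counter
from typing import Dict, List

# Three rows copied from the "wrong_predictions" list in the diagnostics JSON.
wrong_predictions: List[Dict[str, object]] = [
    {"group": "Alteromonadaceae", "confidence": 0.999306, "true_probability": 0.000693},
    {"group": "Alteromonadaceae", "confidence": 0.999155, "true_probability": 0.000845},
    {"group": "Thermaceae", "confidence": 0.977195, "true_probability": 0.0228},
]


def margin(row: Dict[str, object]) -> float:
    """Assumed definition: predicted-class confidence minus true-class probability."""
    return float(row["confidence"]) - float(row["true_probability"])


def errors_by_group(rows: List[Dict[str, object]]) -> Counter:
    """Count wrong predictions per family-level group."""
    return Counter(str(row["group"]) for row in rows)


margins = [margin(row) for row in wrong_predictions]
group_counts = errors_by_group(wrong_predictions)
print(margins)
print(group_counts.most_common())
```

Grouping the errors this way makes it easier to see whether high-confidence mistakes cluster in a few families, as the repeated Alteromonadaceae and Enterobacteriaceae rows in the table suggest.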
artifacts/lora/fold0_oxygen_diagnostics_comparison.md ADDED
@@ -0,0 +1,22 @@
+ # Fold 0 Oxygen Diagnostic Comparison
+
+ Both diagnostics were run on the same fold 0 oxygen-labeled validation rows
+ (`n = 2,276`) using the saved LoRA checkpoints.
+
+ | Checkpoint | Accuracy | Macro F1 (supported classes) | Aerobe F1 | Anaerobe F1 | Aerobe -> Anaerobe | Anaerobe -> Aerobe |
+ |---|---:|---:|---:|---:|---:|---:|
+ | `fold0_best.pt` | 0.9565 | 0.9460 | 0.9698 | 0.9222 | 33 | 66 |
+ | `fold0_best_oxygen.pt` | 0.9446 | 0.9294 | 0.9622 | 0.8966 | 19 | 107 |
+
+ ## Conclusion
+
+ The all-task LoRA checkpoint remains the better oxygen model on fold 0. The
+ oxygen-only checkpoint improves aerobe recall slightly and reduces aerobe
+ misclassified as anaerobe, but it substantially increases anaerobe misclassified
+ as aerobe. That drop in anaerobe recall drives the lower macro F1.
+
+ The fold 0 validation split has no supported `facultative_anaerobe` or
+ `microaerobe` rows, so the primary comparison should use macro F1 over supported
+ classes. Full configured-class macro F1 is also recorded in the JSON files for
+ visibility, but it is not useful for selecting between these two checkpoints on
+ this fold.
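The gap between the two macro F1 variants follows directly from the confusion matrices: classes with zero support contribute an F1 of 0 and roughly halve the all-class average. The following is a minimal sketch that recomputes both variants from the matrices recorded in the two diagnostics JSON files, assuming rows are true classes and columns are predicted classes (as in the "True \ Pred" table). `per_class_f1` and `macro_f1` are hypothetical helper names, not the repository's evaluation code.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[int]]


def per_class_f1(cm: Matrix) -> List[float]:
    """F1 per class; rows are true labels, columns are predicted labels."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        support = sum(cm[k])                         # true rows of class k
        predicted = sum(cm[i][k] for i in range(n))  # rows predicted as class k
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return scores


def macro_f1(cm: Matrix, supported_only: bool) -> float:
    """Unweighted mean F1, optionally restricted to classes with support > 0."""
    scores = per_class_f1(cm)
    if supported_only:
        scores = [s for s, row in zip(scores, cm) if sum(row) > 0]
    return sum(scores) / len(scores) if scores else 0.0


# Confusion matrices copied from the two fold 0 diagnostics JSON files.
cm_all_task = [[1590, 33, 0, 0], [66, 587, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
cm_oxygen_only = [[1604, 19, 0, 0], [107, 546, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]

for name, cm in [("fold0_best.pt", cm_all_task), ("fold0_best_oxygen.pt", cm_oxygen_only)]:
    print(name, macro_f1(cm, supported_only=True), macro_f1(cm, supported_only=False))
```

Because both checkpoints are penalized identically by the two empty classes, the all-class figure preserves their ranking but compresses the difference, which is consistent with the note that it is not useful for choosing between them on this fold.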
artifacts/lora/fold0_oxygen_only_diagnostics.json ADDED
@@ -0,0 +1,321 @@
+ {
+   "checkpoint": "artifacts/lora/fold0_best_oxygen.pt",
+   "n": 2276,
+   "classes": [
+     "aerobe",
+     "anaerobe",
+     "facultative_anaerobe",
+     "microaerobe"
+   ],
+   "accuracy": 0.94464,
+   "macro_f1": 0.92938,
+   "macro_f1_all_classes": 0.46469,
+   "confusion_matrix": [
+     [
+       1604,
+       19,
+       0,
+       0
+     ],
+     [
+       107,
+       546,
+       0,
+       0
+     ],
+     [
+       0,
+       0,
+       0,
+       0
+     ],
+     [
+       0,
+       0,
+       0,
+       0
+     ]
+   ],
+   "per_class": {
+     "aerobe": {
+       "precision": 0.937463,
+       "recall": 0.988293,
+       "f1": 0.962208,
+       "support": 1623,
+       "predicted": 1711
+     },
+     "anaerobe": {
+       "precision": 0.966372,
+       "recall": 0.836141,
+       "f1": 0.896552,
+       "support": 653,
+       "predicted": 565
+     },
+     "facultative_anaerobe": {
+       "precision": 0.0,
+       "recall": 0.0,
+       "f1": 0.0,
+       "support": 0,
+       "predicted": 0
+     },
+     "microaerobe": {
+       "precision": 0.0,
+       "recall": 0.0,
+       "f1": 0.0,
+       "support": 0,
+       "predicted": 0
+     }
+   },
+   "wrong_predictions": [
+     {
+       "bacdive_id": 499,
+       "genome_accession": "GCA_000429505",
+       "group": "Alteromonadaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.999059,
+       "true_probability": 0.000936,
+       "margin": 0.998123
+     },
+     {
+       "bacdive_id": 481,
+       "genome_accession": "GCA_003363485",
+       "group": "Alteromonadaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.998843,
+       "true_probability": 0.001152,
+       "margin": 0.997691
+     },
+     {
+       "bacdive_id": 168525,
+       "genome_accession": "GCA_006386545",
+       "group": "Sphaerotilaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.99857,
+       "true_probability": 0.001423,
+       "margin": 0.997147
+     },
+     {
+       "bacdive_id": 498,
+       "genome_accession": "GCA_000429485",
+       "group": "Alteromonadaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.997633,
+       "true_probability": 0.002353,
+       "margin": 0.99528
+     },
+     {
+       "bacdive_id": 17841,
+       "genome_accession": "GCA_000975055",
+       "group": "Demequinaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.996928,
+       "true_probability": 0.003054,
+       "margin": 0.993875
+     },
+     {
+       "bacdive_id": 483,
+       "genome_accession": "GCA_000421165",
+       "group": "Alteromonadaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.996088,
+       "true_probability": 0.003887,
+       "margin": 0.992201
+     },
+     {
+       "bacdive_id": 133974,
+       "genome_accession": "GCA_001544515",
+       "group": "Burkholderiaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.995023,
+       "true_probability": 0.004932,
+       "margin": 0.990091
+     },
+     {
+       "bacdive_id": 17840,
+       "genome_accession": "GCA_000975035",
+       "group": "Demequinaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.9928,
+       "true_probability": 0.007156,
+       "margin": 0.985644
+     },
+     {
+       "bacdive_id": 164735,
+       "genome_accession": "GCF_943590815.1",
+       "group": "Enterobacteriaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.991238,
+       "true_probability": 0.008708,
+       "margin": 0.982529
+     },
+     {
+       "bacdive_id": 132623,
+       "genome_accession": "GCF_003967195.1",
+       "group": "Granulosicoccaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.990258,
+       "true_probability": 0.00966,
+       "margin": 0.980599
+     },
+     {
+       "bacdive_id": 133976,
+       "genome_accession": "GCA_001544495",
+       "group": "Burkholderiaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.989736,
+       "true_probability": 0.010187,
+       "margin": 0.979549
+     },
+     {
+       "bacdive_id": 148058,
+       "genome_accession": "GCF_000005845.2",
+       "group": "Enterobacteriaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.985801,
+       "true_probability": 0.014125,
+       "margin": 0.971676
+     },
+     {
+       "bacdive_id": 149706,
+       "genome_accession": "GCF_000005845.2",
+       "group": "Enterobacteriaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.985801,
+       "true_probability": 0.014125,
+       "margin": 0.971676
+     },
+     {
+       "bacdive_id": 156346,
+       "genome_accession": "GCF_000005845.2",
+       "group": "Enterobacteriaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.985801,
+       "true_probability": 0.014125,
+       "margin": 0.971676
+     },
+     {
+       "bacdive_id": 133975,
+       "genome_accession": "GCA_001544475",
+       "group": "Burkholderiaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.985722,
+       "true_probability": 0.014188,
+       "margin": 0.971534
+     },
+     {
+       "bacdive_id": 132346,
+       "genome_accession": "GCA_002259755",
+       "group": "Bifidobacteriaceae",
+       "true": "aerobe",
+       "pred": "anaerobe",
+       "confidence": 0.97981,
+       "true_probability": 0.020066,
+       "margin": 0.959744
+     },
+     {
+       "bacdive_id": 160296,
+       "genome_accession": "GCA_010667645",
+       "group": "Bifidobacteriaceae",
+       "true": "aerobe",
+       "pred": "anaerobe",
+       "confidence": 0.973442,
+       "true_probability": 0.026419,
+       "margin": 0.947023
+     },
+     {
+       "bacdive_id": 159320,
+       "genome_accession": "GCA_009078285",
+       "group": "Bifidobacteriaceae",
+       "true": "aerobe",
+       "pred": "anaerobe",
+       "confidence": 0.971766,
+       "true_probability": 0.028081,
+       "margin": 0.943686
+     },
+     {
+       "bacdive_id": 7227,
+       "genome_accession": "GCA_000425185",
+       "group": "Pleomorphomonadaceae",
+       "true": "aerobe",
+       "pred": "anaerobe",
+       "confidence": 0.959909,
+       "true_probability": 0.039907,
+       "margin": 0.920002
+     },
+     {
+       "bacdive_id": 154004,
+       "genome_accession": "GCF_982443925.1",
+       "group": "Enterobacteriaceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.959056,
+       "true_probability": 0.04078,
+       "margin": 0.918276
+     },
+     {
+       "bacdive_id": 24101,
+       "genome_accession": "GCA_000242915",
+       "group": "Helicobacteraceae",
+       "true": "anaerobe",
+       "pred": "aerobe",
+       "confidence": 0.954952,
+       "true_probability": 0.044544,
+       "margin": 0.910408
+ },
280
+ {
281
+ "bacdive_id": 133136,
282
+ "genome_accession": "GCA_030161615",
283
+ "group": "Demequinaceae",
284
+ "true": "anaerobe",
285
+ "pred": "aerobe",
286
+ "confidence": 0.953702,
287
+ "true_probability": 0.045723,
288
+ "margin": 0.90798
289
+ },
290
+ {
291
+ "bacdive_id": 159192,
292
+ "genome_accession": "GCA_003721225",
293
+ "group": "Acidithiobacillaceae",
294
+ "true": "aerobe",
295
+ "pred": "anaerobe",
296
+ "confidence": 0.952867,
297
+ "true_probability": 0.046824,
298
+ "margin": 0.906043
299
+ },
300
+ {
301
+ "bacdive_id": 8608,
302
+ "genome_accession": "GCF_900476065.1",
303
+ "group": "Metamycoplasmataceae",
304
+ "true": "aerobe",
305
+ "pred": "anaerobe",
306
+ "confidence": 0.927788,
307
+ "true_probability": 0.070978,
308
+ "margin": 0.85681
309
+ },
310
+ {
311
+ "bacdive_id": 134099,
312
+ "genome_accession": "GCA_023349185",
313
+ "group": "Shewanellaceae",
314
+ "true": "anaerobe",
315
+ "pred": "aerobe",
316
+ "confidence": 0.924335,
317
+ "true_probability": 0.075315,
318
+ "margin": 0.84902
319
+ }
320
+ ]
321
+ }
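The `margin` field in the records above is numerically consistent with `confidence - true_probability`, that is, the probability of the predicted class minus the probability assigned to the true label (for example, 0.995023 - 0.004932 = 0.990091). A minimal sketch of how such a ranking could be produced from softmax outputs follows. It assumes NumPy arrays, and the function name `high_confidence_errors` and its arguments are hypothetical rather than taken from the repository.

```python
import numpy as np

def high_confidence_errors(probs, true_idx, class_names, top_k=25):
    """Rank misclassified rows by (predicted prob - true-class prob), descending.

    Hypothetical helper: `probs` is an (n, n_classes) array of softmax outputs,
    `true_idx` holds the integer label of each row.
    """
    probs = np.asarray(probs, dtype=float)
    true_idx = np.asarray(true_idx)
    rows = np.arange(len(probs))

    pred_idx = probs.argmax(axis=1)
    confidence = probs[rows, pred_idx]       # probability of the predicted class
    true_prob = probs[rows, true_idx]        # probability given to the true class
    margin = confidence - true_prob          # zero for correct predictions

    wrong = np.flatnonzero(pred_idx != true_idx)
    ranked = wrong[np.argsort(-margin[wrong])][:top_k]
    return [
        {
            "true": class_names[true_idx[i]],
            "pred": class_names[pred_idx[i]],
            "confidence": round(float(confidence[i]), 6),
            "true_probability": round(float(true_prob[i]), 6),
            "margin": round(float(margin[i]), 6),
        }
        for i in ranked
    ]
```

Sorting by margin rather than by confidence alone puts the rows where the model was both confident and far from the label at the top of the list.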
artifacts/lora/fold0_oxygen_only_diagnostics.md ADDED
@@ -0,0 +1,56 @@
+ # LoRA Oxygen Diagnostics
+
+ Checkpoint: `artifacts/lora/fold0_best_oxygen.pt`
+
+ - Labeled validation rows: `2276`
+ - Accuracy: `0.9446`
+ - Macro F1 (supported classes): `0.9294`
+ - Macro F1 (all configured classes): `0.4647`
+
+ ## Per-Class Metrics
+
+ | Class | Precision | Recall | F1 | Support | Predicted |
+ |---|---:|---:|---:|---:|---:|
+ | aerobe | 0.9375 | 0.9883 | 0.9622 | 1623 | 1711 |
+ | anaerobe | 0.9664 | 0.8361 | 0.8966 | 653 | 565 |
+ | facultative_anaerobe | 0.0000 | 0.0000 | 0.0000 | 0 | 0 |
+ | microaerobe | 0.0000 | 0.0000 | 0.0000 | 0 | 0 |
+
+ ## Confusion Matrix
+
+ | True \ Pred | aerobe | anaerobe | facultative_anaerobe | microaerobe |
+ |---|---:|---:|---:|---:|
+ | aerobe | 1604 | 19 | 0 | 0 |
+ | anaerobe | 107 | 546 | 0 | 0 |
+ | facultative_anaerobe | 0 | 0 | 0 | 0 |
+ | microaerobe | 0 | 0 | 0 | 0 |
+
+ ## High-Confidence Wrong Predictions
+
+ | BacDive ID | Genome | Group | True | Pred | Confidence | True Prob. | Margin |
+ |---:|---|---|---|---|---:|---:|---:|
+ | 499 | GCA_000429505 | Alteromonadaceae | anaerobe | aerobe | 0.9991 | 0.0009 | 0.9981 |
+ | 481 | GCA_003363485 | Alteromonadaceae | anaerobe | aerobe | 0.9988 | 0.0012 | 0.9977 |
+ | 168525 | GCA_006386545 | Sphaerotilaceae | anaerobe | aerobe | 0.9986 | 0.0014 | 0.9971 |
+ | 498 | GCA_000429485 | Alteromonadaceae | anaerobe | aerobe | 0.9976 | 0.0024 | 0.9953 |
+ | 17841 | GCA_000975055 | Demequinaceae | anaerobe | aerobe | 0.9969 | 0.0031 | 0.9939 |
+ | 483 | GCA_000421165 | Alteromonadaceae | anaerobe | aerobe | 0.9961 | 0.0039 | 0.9922 |
+ | 133974 | GCA_001544515 | Burkholderiaceae | anaerobe | aerobe | 0.9950 | 0.0049 | 0.9901 |
+ | 17840 | GCA_000975035 | Demequinaceae | anaerobe | aerobe | 0.9928 | 0.0072 | 0.9856 |
+ | 164735 | GCF_943590815.1 | Enterobacteriaceae | anaerobe | aerobe | 0.9912 | 0.0087 | 0.9825 |
+ | 132623 | GCF_003967195.1 | Granulosicoccaceae | anaerobe | aerobe | 0.9903 | 0.0097 | 0.9806 |
+ | 133976 | GCA_001544495 | Burkholderiaceae | anaerobe | aerobe | 0.9897 | 0.0102 | 0.9795 |
+ | 148058 | GCF_000005845.2 | Enterobacteriaceae | anaerobe | aerobe | 0.9858 | 0.0141 | 0.9717 |
+ | 149706 | GCF_000005845.2 | Enterobacteriaceae | anaerobe | aerobe | 0.9858 | 0.0141 | 0.9717 |
+ | 156346 | GCF_000005845.2 | Enterobacteriaceae | anaerobe | aerobe | 0.9858 | 0.0141 | 0.9717 |
+ | 133975 | GCA_001544475 | Burkholderiaceae | anaerobe | aerobe | 0.9857 | 0.0142 | 0.9715 |
+ | 132346 | GCA_002259755 | Bifidobacteriaceae | aerobe | anaerobe | 0.9798 | 0.0201 | 0.9597 |
+ | 160296 | GCA_010667645 | Bifidobacteriaceae | aerobe | anaerobe | 0.9734 | 0.0264 | 0.9470 |
+ | 159320 | GCA_009078285 | Bifidobacteriaceae | aerobe | anaerobe | 0.9718 | 0.0281 | 0.9437 |
+ | 7227 | GCA_000425185 | Pleomorphomonadaceae | aerobe | anaerobe | 0.9599 | 0.0399 | 0.9200 |
+ | 154004 | GCF_982443925.1 | Enterobacteriaceae | anaerobe | aerobe | 0.9591 | 0.0408 | 0.9183 |
+ | 24101 | GCA_000242915 | Helicobacteraceae | anaerobe | aerobe | 0.9550 | 0.0445 | 0.9104 |
+ | 133136 | GCA_030161615 | Demequinaceae | anaerobe | aerobe | 0.9537 | 0.0457 | 0.9080 |
+ | 159192 | GCA_003721225 | Acidithiobacillaceae | aerobe | anaerobe | 0.9529 | 0.0468 | 0.9060 |
+ | 8608 | GCF_900476065.1 | Metamycoplasmataceae | aerobe | anaerobe | 0.9278 | 0.0710 | 0.8568 |
+ | 134099 | GCA_023349185 | Shewanellaceae | anaerobe | aerobe | 0.9243 | 0.0753 | 0.8490 |
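The two macro F1 figures in this report differ by exactly a factor of two because two of the four configured classes have zero support in the validation fold. The sketch below recomputes accuracy and both macro F1 variants directly from the confusion matrix shown in the report. The function names are hypothetical; only the matrix values come from the file.

```python
CLASSES = ["aerobe", "anaerobe", "facultative_anaerobe", "microaerobe"]

# Rows are true labels, columns are predictions (values copied from the report).
CONFUSION = [
    [1604, 19, 0, 0],
    [107, 546, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

def per_class_f1(cm):
    """Return (f1, support) per class; F1 = 2*TP / (support + predicted)."""
    n = len(cm)
    stats = []
    for k in range(n):
        tp = cm[k][k]
        support = sum(cm[k])                          # row sum: true count
        predicted = sum(cm[r][k] for r in range(n))   # column sum: predicted count
        denom = support + predicted
        stats.append((2 * tp / denom if denom else 0.0, support))
    return stats

def summarize(cm):
    """Return (accuracy, macro F1 over supported classes, macro F1 over all classes)."""
    stats = per_class_f1(cm)
    total = sum(support for _, support in stats)
    accuracy = sum(cm[k][k] for k in range(len(cm))) / total
    supported = [f1 for f1, support in stats if support > 0]
    macro_supported = sum(supported) / len(supported)
    macro_all = sum(f1 for f1, _ in stats) / len(stats)
    return accuracy, macro_supported, macro_all

accuracy, macro_supported, macro_all = summarize(CONFUSION)
print(f"accuracy={accuracy:.4f} macro_supported={macro_supported:.4f} macro_all={macro_all:.4f}")
```

Averaging over all configured classes penalizes the model for classes that never appear in this fold, so the supported-class figure is the more informative one here.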
artifacts/lora/fold0_results.json ADDED
@@ -0,0 +1,78 @@
+ {
+   "model_cfg": {
+     "esm_model_name": "facebook/esm2_t12_35M_UR50D",
+     "lora_r": 8,
+     "lora_alpha": 16,
+     "lora_dropout": 0.05,
+     "lora_target": [
+       "query",
+       "value"
+     ],
+     "head_hidden_dim": 128,
+     "head_dropout": 0.1,
+     "max_seq_len": 512,
+     "max_proteins_per_cat": 6,
+     "gradient_checkpointing": true
+   },
+   "train_cfg": {
+     "fold": 0,
+     "epochs": 1,
+     "batch_size": 2,
+     "grad_accum": 8,
+     "lora_lr": 0.0001,
+     "head_lr": 0.001,
+     "weight_decay": 0.01,
+     "warmup_frac": 0.05,
+     "bf16": true,
+     "max_proteins_per_category": 16,
+     "save_dir": "artifacts/lora",
+     "grad_clip": 1.0
+   },
+   "history": [
+     {
+       "epoch": 1,
+       "train_loss": 45.74971861128626,
+       "val": {
+         "temp": {
+           "mae": 3.666003942489624,
+           "n": 8066
+         },
+         "ph": {
+           "mae": 0.5597922205924988,
+           "n": 705
+         },
+         "salt": {
+           "mae": 1.815351128578186,
+           "n": 424
+         },
+         "oxy": {
+           "f1_macro": 0.9448227977645267,
+           "n": 2266
+         }
+       },
+       "elapsed_s": 19656.821365594864
+     }
+   ],
+   "best": {
+     "epoch": 1,
+     "val": {
+       "temp": {
+         "mae": 3.666003942489624,
+         "n": 8066
+       },
+       "ph": {
+         "mae": 0.5597922205924988,
+         "n": 705
+       },
+       "salt": {
+         "mae": 1.815351128578186,
+         "n": 424
+       },
+       "oxy": {
+         "f1_macro": 0.9448227977645267,
+         "n": 2266
+       }
+     },
+     "score": 5.096324493895782
+   }
+ }
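The `score` field above is numerically consistent with the sum of the three regression MAEs minus the oxygen macro F1 (3.6660 + 0.5598 + 1.8154 - 0.9448 = 5.0963), with lower values being better. The same relation holds for the other fold-0 result files in this commit, including the smoke run where `null` metrics appear to be skipped. This is an inference from the logged numbers, not a confirmed description of the training code, and the name `selection_score` is hypothetical.

```python
def selection_score(val):
    """Combine validation metrics into one lower-is-better number.

    Assumed rule, inferred from the result files: add every available MAE
    and subtract the macro F1; metrics reported as None are ignored.
    """
    total = 0.0
    for task in ("temp", "ph", "salt"):
        mae = val[task]["mae"]
        if mae is not None:
            total += mae
    f1 = val["oxy"]["f1_macro"]
    if f1 is not None:
        total -= f1
    return total

all_task_val = {
    "temp": {"mae": 3.666003942489624, "n": 8066},
    "ph": {"mae": 0.5597922205924988, "n": 705},
    "salt": {"mae": 1.815351128578186, "n": 424},
    "oxy": {"f1_macro": 0.9448227977645267, "n": 2266},
}
print(selection_score(all_task_val))
```

Because the MAEs are in raw units, the temperature error dominates this sum. That is why the oxygen-only run, which leaves the regression heads untrained, ends up with a much larger score despite a reasonable F1.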
artifacts/lora/fold0_results_oxygen.json ADDED
@@ -0,0 +1,82 @@
+ {
+   "model_cfg": {
+     "esm_model_name": "facebook/esm2_t12_35M_UR50D",
+     "lora_r": 8,
+     "lora_alpha": 16,
+     "lora_dropout": 0.05,
+     "lora_target": [
+       "query",
+       "value"
+     ],
+     "head_hidden_dim": 128,
+     "head_dropout": 0.1,
+     "max_seq_len": 512,
+     "max_proteins_per_cat": 6,
+     "gradient_checkpointing": true
+   },
+   "train_cfg": {
+     "fold": 0,
+     "epochs": 1,
+     "batch_size": 2,
+     "grad_accum": 8,
+     "lora_lr": 0.0001,
+     "head_lr": 0.001,
+     "weight_decay": 0.01,
+     "warmup_frac": 0.05,
+     "bf16": true,
+     "max_proteins_per_category": 16,
+     "save_dir": "artifacts/lora",
+     "grad_clip": 1.0,
+     "temp_weight": 0.0,
+     "ph_weight": 0.0,
+     "salt_weight": 0.0,
+     "oxy_weight": 1.0
+   },
+   "history": [
+     {
+       "epoch": 1,
+       "train_loss": 0.08001717917339461,
+       "val": {
+         "temp": {
+           "mae": 32.25041580200195,
+           "n": 8062
+         },
+         "ph": {
+           "mae": 7.046206951141357,
+           "n": 763
+         },
+         "salt": {
+           "mae": 2.082266330718994,
+           "n": 446
+         },
+         "oxy": {
+           "f1_macro": 0.916835889123625,
+           "n": 2214
+         }
+       },
+       "elapsed_s": 19710.637019872665
+     }
+   ],
+   "best": {
+     "epoch": 1,
+     "val": {
+       "temp": {
+         "mae": 32.25041580200195,
+         "n": 8062
+       },
+       "ph": {
+         "mae": 7.046206951141357,
+         "n": 763
+       },
+       "salt": {
+         "mae": 2.082266330718994,
+         "n": 446
+       },
+       "oxy": {
+         "f1_macro": 0.916835889123625,
+         "n": 2214
+       }
+     },
+     "score": 40.46205319473868
+   }
+ }
artifacts/lora/fold0_results_smoke.json ADDED
@@ -0,0 +1,78 @@
+ {
+   "model_cfg": {
+     "esm_model_name": "facebook/esm2_t12_35M_UR50D",
+     "lora_r": 8,
+     "lora_alpha": 16,
+     "lora_dropout": 0.05,
+     "lora_target": [
+       "query",
+       "value"
+     ],
+     "head_hidden_dim": 128,
+     "head_dropout": 0.1,
+     "max_seq_len": 512,
+     "max_proteins_per_cat": 6,
+     "gradient_checkpointing": true
+   },
+   "train_cfg": {
+     "fold": 0,
+     "epochs": 1,
+     "batch_size": 2,
+     "grad_accum": 8,
+     "lora_lr": 0.0001,
+     "head_lr": 0.001,
+     "weight_decay": 0.01,
+     "warmup_frac": 0.05,
+     "bf16": true,
+     "max_proteins_per_category": 16,
+     "save_dir": "/artifacts/lora",
+     "grad_clip": 1.0
+   },
+   "history": [
+     {
+       "epoch": 1,
+       "train_loss": 896.9954485212054,
+       "val": {
+         "temp": {
+           "mae": 28.067121505737305,
+           "n": 172
+         },
+         "ph": {
+           "mae": null,
+           "n": 0
+         },
+         "salt": {
+           "mae": null,
+           "n": 0
+         },
+         "oxy": {
+           "f1_macro": null,
+           "n": 0
+         }
+       },
+       "elapsed_s": 60.57433819770813
+     }
+   ],
+   "best": {
+     "epoch": 1,
+     "val": {
+       "temp": {
+         "mae": 28.067121505737305,
+         "n": 172
+       },
+       "ph": {
+         "mae": null,
+         "n": 0
+       },
+       "salt": {
+         "mae": null,
+         "n": 0
+       },
+       "oxy": {
+         "f1_macro": null,
+         "n": 0
+       }
+     },
+     "score": 28.067121505737305
+   }
+ }
artifacts/lora/lambda_fold0_1ep_20260517T033023Z.log ADDED
@@ -0,0 +1,57 @@
1
+ [lambda-lora] starting fold=0 epochs=1 model=facebook/esm2_t12_35M_UR50D
2
+ [lambda-lora] cuda=True device=NVIDIA A100-SXM4-40GB
3
+ [lora] device = cuda
4
+ [lora] loaded 40,469 records with sequences + labels
5
+ [lora] fold 0: 32,375 train / 8,094 val
6
+ Some weights of EsmModel were not initialized from the model checkpoint at facebook/esm2_t12_35M_UR50D and are newly initialized: ['esm.pooler.dense.bias', 'esm.pooler.dense.weight']
7
+ You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
8
+ [lora] trainable params: 2,151,815 / total: 36,144,696 (5.95%)
9
+ /home/ubuntu/microbe-model/.venv/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
10
+ with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
11
+ ep 1 step 50: loss=842.3137 lr_lora=4.95e-05
12
+ ep 1 step 100: loss=482.1193 lr_lora=9.90e-05
13
+ ep 1 step 150: loss=334.7604 lr_lora=9.98e-05
14
+ ep 1 step 200: loss=264.2583 lr_lora=9.93e-05
15
+ ep 1 step 250: loss=217.4865 lr_lora=9.85e-05
16
+ ep 1 step 300: loss=186.5597 lr_lora=9.74e-05
17
+ ep 1 step 350: loss=163.8552 lr_lora=9.59e-05
18
+ ep 1 step 400: loss=147.2159 lr_lora=9.41e-05
19
+ ep 1 step 450: loss=133.7593 lr_lora=9.21e-05
20
+ ep 1 step 500: loss=123.0141 lr_lora=8.97e-05
21
+ ep 1 step 550: loss=113.9034 lr_lora=8.71e-05
22
+ ep 1 step 600: loss=106.4140 lr_lora=8.43e-05
23
+ ep 1 step 650: loss=100.2909 lr_lora=8.12e-05
24
+ ep 1 step 700: loss=94.6046 lr_lora=7.79e-05
25
+ ep 1 step 750: loss=89.8550 lr_lora=7.44e-05
26
+ ep 1 step 800: loss=85.4993 lr_lora=7.08e-05
27
+ ep 1 step 850: loss=81.6606 lr_lora=6.70e-05
28
+ ep 1 step 900: loss=77.9707 lr_lora=6.31e-05
29
+ ep 1 step 950: loss=75.1319 lr_lora=5.91e-05
30
+ ep 1 step 1000: loss=72.3647 lr_lora=5.51e-05
31
+ ep 1 step 1050: loss=69.8078 lr_lora=5.10e-05
32
+ ep 1 step 1100: loss=67.5791 lr_lora=4.69e-05
33
+ ep 1 step 1150: loss=65.4581 lr_lora=4.28e-05
34
+ ep 1 step 1200: loss=63.5737 lr_lora=3.88e-05
35
+ ep 1 step 1250: loss=61.9102 lr_lora=3.49e-05
36
+ ep 1 step 1300: loss=60.3259 lr_lora=3.10e-05
37
+ ep 1 step 1350: loss=58.7075 lr_lora=2.73e-05
38
+ ep 1 step 1400: loss=57.2739 lr_lora=2.38e-05
39
+ ep 1 step 1450: loss=55.9584 lr_lora=2.04e-05
40
+ ep 1 step 1500: loss=54.7973 lr_lora=1.72e-05
41
+ ep 1 step 1550: loss=53.5961 lr_lora=1.42e-05
42
+ ep 1 step 1600: loss=52.5554 lr_lora=1.15e-05
43
+ ep 1 step 1650: loss=51.6486 lr_lora=9.01e-06
44
+ ep 1 step 1700: loss=50.7810 lr_lora=6.81e-06
45
+ ep 1 step 1750: loss=49.9158 lr_lora=4.90e-06
46
+ ep 1 step 1800: loss=49.0940 lr_lora=3.28e-06
47
+ ep 1 step 1850: loss=48.2504 lr_lora=1.99e-06
48
+ ep 1 step 1900: loss=47.4683 lr_lora=1.01e-06
49
+ ep 1 step 1950: loss=46.7232 lr_lora=3.56e-07
50
+ ep 1 step 2000: loss=46.0539 lr_lora=3.53e-08
51
+ [lora] epoch 1 done in 19657s val={'temp': {'mae': 3.666003942489624, 'n': 8066}, 'ph': {'mae': 0.5597922205924988, 'n': 705}, 'salt': {'mae': 1.815351128578186, 'n': 424}, 'oxy': {'f1_macro': 0.9448227977645267, 'n': 2266}}
52
+ [lora] wrote artifacts/lora/fold0_results.json
53
+ Traceback (most recent call last):
54
+ File "/home/ubuntu/microbe-model/scripts/run_lora_lambda.py", line 16, in <module>
55
+ print(f"[lambda-lora] done best={results.get(best)}", flush=True)
56
+ ^^^^
57
+ NameError: name 'best' is not defined
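The traceback above points at `results.get(best)`, where `best` is a bare, undefined name rather than the string key `"best"`. The results file had already been written at that point, so only the final summary print failed. The later Lambda logs in this commit do print a `done best=...` line, which suggests the lookup was corrected. A minimal sketch of the corrected call, assuming `results` is the dict serialized to `fold0_results.json`; the literal below is a trimmed stand-in, not the full structure:

```python
# Trimmed stand-in for the dict returned by the training run (assumption).
results = {"best": {"epoch": 1, "score": 5.096324493895782}}

# Look the entry up by its string key; a bare `best` raises NameError.
summary = f"[lambda-lora] done best={results.get('best')}"
print(summary, flush=True)
```

Using `.get("best")` also keeps the script from failing if the key is absent, since it prints `None` instead of raising.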
artifacts/lora/lambda_fold0_oxygen_1ep_20260517T103524Z.log ADDED
@@ -0,0 +1,60 @@
1
+ [lambda-lora] fold=0 epochs=1 model=facebook/esm2_t12_35M_UR50D preset=oxygen
2
+ [lambda-lora] device=NVIDIA A100-SXM4-40GB
3
+ [lambda-lora] target_weights={'temp': 0.0, 'ph': 0.0, 'salt': 0.0, 'oxy': 1.0}
4
+ [lora] device = cuda
5
+ [lora] loaded 40,469 records with sequences + labels
6
+ [lora] fold 0: 32,375 train / 8,094 val
7
+ 2026-05-17 10:35:49.175226: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
8
+ WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
9
+ E0000 00:00:1779014149.193921 3697 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
10
+ E0000 00:00:1779014149.200030 3697 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
11
+ W0000 00:00:1779014149.218228 3697 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
12
+ W0000 00:00:1779014149.218259 3697 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
13
+ W0000 00:00:1779014149.218266 3697 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
14
+ W0000 00:00:1779014149.218271 3697 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
15
+ Some weights of EsmModel were not initialized from the model checkpoint at facebook/esm2_t12_35M_UR50D and are newly initialized: ['esm.pooler.dense.bias', 'esm.pooler.dense.weight']
16
+ You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
17
+ [lora] trainable params: 2,151,815 / total: 36,144,696 (5.95%)
18
+ ep 1 step 50: loss=0.3476 lr_lora=4.95e-05
19
+ ep 1 step 100: loss=0.2605 lr_lora=9.90e-05
20
+ ep 1 step 150: loss=0.2177 lr_lora=9.98e-05
21
+ ep 1 step 200: loss=0.1936 lr_lora=9.93e-05
22
+ ep 1 step 250: loss=0.1740 lr_lora=9.85e-05
23
+ ep 1 step 300: loss=0.1574 lr_lora=9.74e-05
24
+ ep 1 step 350: loss=0.1420 lr_lora=9.59e-05
25
+ ep 1 step 400: loss=0.1378 lr_lora=9.41e-05
26
+ ep 1 step 450: loss=0.1294 lr_lora=9.21e-05
27
+ ep 1 step 500: loss=0.1307 lr_lora=8.97e-05
28
+ ep 1 step 550: loss=0.1256 lr_lora=8.71e-05
29
+ ep 1 step 600: loss=0.1195 lr_lora=8.43e-05
30
+ ep 1 step 650: loss=0.1169 lr_lora=8.12e-05
31
+ ep 1 step 700: loss=0.1113 lr_lora=7.79e-05
32
+ ep 1 step 750: loss=0.1076 lr_lora=7.44e-05
33
+ ep 1 step 800: loss=0.1036 lr_lora=7.08e-05
34
+ ep 1 step 850: loss=0.1003 lr_lora=6.70e-05
35
+ ep 1 step 900: loss=0.0982 lr_lora=6.31e-05
36
+ ep 1 step 950: loss=0.0990 lr_lora=5.91e-05
37
+ ep 1 step 1000: loss=0.0966 lr_lora=5.51e-05
38
+ ep 1 step 1050: loss=0.0965 lr_lora=5.10e-05
39
+ ep 1 step 1100: loss=0.0955 lr_lora=4.69e-05
40
+ ep 1 step 1150: loss=0.0954 lr_lora=4.28e-05
41
+ ep 1 step 1200: loss=0.0942 lr_lora=3.88e-05
42
+ ep 1 step 1250: loss=0.0939 lr_lora=3.49e-05
43
+ ep 1 step 1300: loss=0.0928 lr_lora=3.10e-05
44
+ ep 1 step 1350: loss=0.0911 lr_lora=2.73e-05
45
+ ep 1 step 1400: loss=0.0895 lr_lora=2.38e-05
46
+ ep 1 step 1450: loss=0.0890 lr_lora=2.04e-05
47
+ ep 1 step 1500: loss=0.0887 lr_lora=1.72e-05
48
+ ep 1 step 1550: loss=0.0869 lr_lora=1.42e-05
49
+ ep 1 step 1600: loss=0.0861 lr_lora=1.15e-05
50
+ ep 1 step 1650: loss=0.0856 lr_lora=9.01e-06
51
+ ep 1 step 1700: loss=0.0848 lr_lora=6.81e-06
52
+ ep 1 step 1750: loss=0.0842 lr_lora=4.90e-06
53
+ ep 1 step 1800: loss=0.0840 lr_lora=3.28e-06
54
+ ep 1 step 1850: loss=0.0832 lr_lora=1.99e-06
55
+ ep 1 step 1900: loss=0.0821 lr_lora=1.01e-06
56
+ ep 1 step 1950: loss=0.0818 lr_lora=3.56e-07
57
+ ep 1 step 2000: loss=0.0805 lr_lora=3.53e-08
58
+ [lora] epoch 1 done in 19711s val={'temp': {'mae': 32.25041580200195, 'n': 8062}, 'ph': {'mae': 7.046206951141357, 'n': 763}, 'salt': {'mae': 2.082266330718994, 'n': 446}, 'oxy': {'f1_macro': 0.916835889123625, 'n': 2214}}
59
+ [lora] wrote artifacts/lora/fold0_results.json
60
+ [lambda-lora] done best={'epoch': 1, 'val': {'temp': {'mae': 32.25041580200195, 'n': 8062}, 'ph': {'mae': 7.046206951141357, 'n': 763}, 'salt': {'mae': 2.082266330718994, 'n': 446}, 'oxy': {'f1_macro': 0.916835889123625, 'n': 2214}}, 'score': 40.46205319473868}
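The `lr_lora` values logged above are consistent with a linear warmup over the first 5% of optimizer steps followed by cosine decay to zero, with one optimizer step per `batch_size * grad_accum = 16` training records (32,375 // 16 = 2023 steps for one epoch). The sketch below reproduces the logged values under that assumption. It is inferred from the logs and `train_cfg`, not copied from the training code, and the function name `lora_lr` is hypothetical.

```python
import math

def lora_lr(step, total_steps, base_lr=1e-4, warmup_frac=0.05):
    """Linear warmup then cosine decay (assumed schedule, inferred from the logs)."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# One epoch on fold 0: 32,375 training records, batch_size=2, grad_accum=8.
total_steps = 32_375 // (2 * 8)

for step in (50, 100, 150, 1000, 2000):
    print(f"step {step}: lr_lora={lora_lr(step, total_steps):.2e}")
```

With `total_steps` set to three times that value, the same function gives about 1.65e-05 at step 50, which matches the 3-epoch Modal log elsewhere in this commit.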
artifacts/lora/lambda_fold0_weighted_anaerobe_20260518T070340Z.log ADDED
@@ -0,0 +1,57 @@
1
+ [lambda-lora] fold=0 epochs=1 model=facebook/esm2_t12_35M_UR50D preset=all
2
+ [lambda-lora] device=NVIDIA A100-SXM4-40GB
3
+ [lambda-lora] target_weights={'temp': 1.0, 'ph': 1.0, 'salt': 1.0, 'oxy': 1.0}
4
+ [lambda-lora] oxy_class_weights=(1.0, 1.5, 1.0, 1.0)
5
+ [lora] device = cuda
6
+ [lora] loaded 40,469 records with sequences + labels
7
+ [lora] fold 0: 32,375 train / 8,094 val
8
+ /home/ubuntu/.cache/uv/archive-v0/hIOP0T7h41YWWs-w/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
9
+ warnings.warn(
10
+ Some weights of EsmModel were not initialized from the model checkpoint at facebook/esm2_t12_35M_UR50D and are newly initialized: ['esm.pooler.dense.bias', 'esm.pooler.dense.weight']
11
+ You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
12
+ [lora] trainable params: 2,151,815 / total: 36,144,696 (5.95%)
13
+ /home/ubuntu/.cache/uv/archive-v0/hIOP0T7h41YWWs-w/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
14
+ with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
15
+ ep 1 step 50: loss=886.9809 lr_lora=4.95e-05
16
+ ep 1 step 100: loss=554.1884 lr_lora=9.90e-05
17
+ ep 1 step 150: loss=380.8741 lr_lora=9.98e-05
18
+ ep 1 step 200: loss=296.4814 lr_lora=9.93e-05
19
+ ep 1 step 250: loss=244.2839 lr_lora=9.85e-05
20
+ ep 1 step 300: loss=207.9945 lr_lora=9.74e-05
21
+ ep 1 step 350: loss=183.1828 lr_lora=9.59e-05
22
+ ep 1 step 400: loss=164.5015 lr_lora=9.41e-05
23
+ ep 1 step 450: loss=149.3741 lr_lora=9.21e-05
24
+ ep 1 step 500: loss=137.3078 lr_lora=8.97e-05
25
+ ep 1 step 550: loss=127.1643 lr_lora=8.71e-05
26
+ ep 1 step 600: loss=118.7969 lr_lora=8.43e-05
27
+ ep 1 step 650: loss=111.4764 lr_lora=8.12e-05
28
+ ep 1 step 700: loss=105.4436 lr_lora=7.79e-05
29
+ ep 1 step 750: loss=100.0075 lr_lora=7.44e-05
30
+ ep 1 step 800: loss=95.2808 lr_lora=7.08e-05
31
+ ep 1 step 850: loss=91.1569 lr_lora=6.70e-05
32
+ ep 1 step 900: loss=87.3118 lr_lora=6.31e-05
33
+ ep 1 step 950: loss=83.6892 lr_lora=5.91e-05
34
+ ep 1 step 1000: loss=80.7049 lr_lora=5.51e-05
35
+ ep 1 step 1050: loss=77.8812 lr_lora=5.10e-05
36
+ ep 1 step 1100: loss=75.2743 lr_lora=4.69e-05
37
+ ep 1 step 1150: loss=73.0166 lr_lora=4.28e-05
38
+ ep 1 step 1200: loss=70.9040 lr_lora=3.88e-05
39
+ ep 1 step 1250: loss=68.9807 lr_lora=3.49e-05
40
+ ep 1 step 1300: loss=67.0975 lr_lora=3.10e-05
41
+ ep 1 step 1350: loss=66.9951 lr_lora=2.73e-05
42
+ ep 1 step 1400: loss=65.1894 lr_lora=2.38e-05
43
+ ep 1 step 1450: loss=63.6373 lr_lora=2.04e-05
44
+ ep 1 step 1500: loss=62.0934 lr_lora=1.72e-05
45
+ ep 1 step 1550: loss=60.6855 lr_lora=1.42e-05
46
+ ep 1 step 1600: loss=59.3990 lr_lora=1.15e-05
47
+ ep 1 step 1650: loss=58.3970 lr_lora=9.01e-06
48
+ ep 1 step 1700: loss=57.2419 lr_lora=6.81e-06
49
+ ep 1 step 1750: loss=56.1781 lr_lora=4.90e-06
50
+ ep 1 step 1800: loss=55.1134 lr_lora=3.28e-06
51
+ ep 1 step 1850: loss=54.1673 lr_lora=1.99e-06
52
+ ep 1 step 1900: loss=53.2661 lr_lora=1.01e-06
53
+ ep 1 step 1950: loss=52.4271 lr_lora=3.56e-07
54
+ ep 1 step 2000: loss=51.6784 lr_lora=3.53e-08
55
+ [lora] epoch 1 done in 19573s val={'temp': {'mae': 3.659961462020874, 'n': 8066}, 'ph': {'mae': 0.564297616481781, 'n': 705}, 'salt': {'mae': 1.7560514211654663, 'n': 424}, 'oxy': {'f1_macro': 0.9447762839519784, 'n': 2266}}
56
+ [lora] wrote artifacts/lora_weighted_anaerobe/fold0_results.json
57
+ [lambda-lora] done best={'epoch': 1, 'val': {'temp': {'mae': 3.659961462020874, 'n': 8066}, 'ph': {'mae': 0.564297616481781, 'n': 705}, 'salt': {'mae': 1.7560514211654663, 'n': 424}, 'oxy': {'f1_macro': 0.9447762839519784, 'n': 2266}}, 'score': 5.035534215716143}
artifacts/lora_fold0_real.log ADDED
@@ -0,0 +1,147 @@
1
+ ✓ Initialized. View run at
2
+ https://modal.com/apps/miyu-wannaone-horiuchi/main/ap-yeeh8rm2U9bwVwD1nagjsS
3
+ ✓ Created objects.
4
+ ├── 🔨 Created mount
5
+ │ /Users/miyuhoriuchi/microbe-model/scripts/modal_train_lora.py
6
+ ├── 🔨 Created mount PythonPackage:microbe_model
7
+ ├── 🔨 Created mount data/marker_sequences.jsonl
8
+ ├── 🔨 Created mount data/bacdive_phenotypes.parquet
9
+ ├── 🔨 Created mount data/strain_catalog.parquet
10
+ └── 🔨 Created function train.
11
+ [modal-lora] fold=0 epochs=3 model=facebook/esm2_t12_35M_UR50D smoke=False
12
+ [lora] device = cuda
13
+ [lora] loaded 40,469 records with sequences + labels
14
+ [lora] fold 0: 32,375 train / 8,094 val
15
+ Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
16
+
17
+ Loading weights: 0%| | 0/198 [00:00<?, ?it/s]
18
+ Loading weights: 100%|██████████| 198/198 [00:00<00:00, 3043.05it/s]
19
+ [transformers] EsmModel LOAD REPORT from: facebook/esm2_t12_35M_UR50D
20
+ Key | Status |
21
+ --------------------------+------------+-
22
+ lm_head.layer_norm.bias | UNEXPECTED |
23
+ lm_head.layer_norm.weight | UNEXPECTED |
24
+ lm_head.dense.bias | UNEXPECTED |
25
+ lm_head.dense.weight | UNEXPECTED |
26
+ lm_head.bias | UNEXPECTED |
27
+ pooler.dense.weight | MISSING |
28
+ pooler.dense.bias | MISSING |
29
+
30
+ Notes:
31
+ - UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
32
+ - MISSING: those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
33
+ [lora] trainable params: 2,151,815 / total: 35,652,216 (6.04%)
34
+ [transformers] Caching is incompatible with gradient checkpointing in EsmLayer. Setting `use_cache=False`.
35
+ ep 1 step 50: loss=1000.2195 lr_lora=1.65e-05
36
+ ep 1 step 100: loss=788.6148 lr_lora=3.30e-05
37
+ ep 1 step 150: loss=560.6263 lr_lora=4.95e-05
38
+ ep 1 step 200: loss=432.6623 lr_lora=6.60e-05
39
+ ep 1 step 250: loss=354.6971 lr_lora=8.25e-05
40
+ ep 1 step 300: loss=300.4281 lr_lora=9.90e-05
41
+ ep 1 step 350: loss=262.0346 lr_lora=1.00e-04
42
+ ep 1 step 400: loss=232.7668 lr_lora=9.99e-05
43
+ ep 1 step 450: loss=210.0698 lr_lora=9.98e-05
44
+ ep 1 step 500: loss=192.0890 lr_lora=9.97e-05
45
+ WARNING: Logs may not be continuous
46
+ WARNING: Logs may not be continuous
47
+ WARNING: Logs may not be continuous
48
+ WARNING: Logs may not be continuous
49
+ WARNING: Logs may not be continuous
50
+ WARNING: Logs may not be continuous
51
+ WARNING: Logs may not be continuous
52
+ WARNING: Logs may not be continuous
53
+ WARNING: Logs may not be continuous
54
+ WARNING: Logs may not be continuous
55
+ WARNING: Logs may not be continuous
56
+ WARNING: Logs may not be continuous
57
+ WARNING: Logs may not be continuous
58
+ WARNING: Logs may not be continuous
59
+ WARNING: Logs may not be continuous
60
+ WARNING: Logs may not be continuous
61
+ WARNING: Logs may not be continuous
62
+ WARNING: Logs may not be continuous
63
+ WARNING: Logs may not be continuous
64
+ WARNING: Logs may not be continuous
65
+ WARNING: Logs may not be continuous
66
+ WARNING: Logs may not be continuous
67
+ WARNING: Logs may not be continuous
68
+ WARNING: Logs may not be continuous
69
+ WARNING: Logs may not be continuous
70
+ WARNING: Logs may not be continuous
71
+ WARNING: Logs may not be continuous
72
+ WARNING: Logs may not be continuous
73
+ WARNING: Logs may not be continuous
74
+ WARNING: Logs may not be continuous
75
+ WARNING: Logs may not be continuous
76
+ WARNING: Logs may not be continuous
77
+ WARNING: Logs may not be continuous
78
+ WARNING: Logs may not be continuous
79
+ WARNING: Logs may not be continuous
80
+ WARNING: Logs may not be continuous
81
+ WARNING: Logs may not be continuous
82
+ WARNING: Logs may not be continuous
83
+ WARNING: Logs may not be continuous
84
+ WARNING: Logs may not be continuous
85
+ WARNING: Logs may not be continuous
86
+ WARNING: Logs may not be continuous
87
+ WARNING: Logs may not be continuous
88
+ WARNING: Logs may not be continuous
89
+ WARNING: Logs may not be continuous
90
+ WARNING: Logs may not be continuous
91
+ WARNING: Logs may not be continuous
92
+ WARNING: Logs may not be continuous
93
+ WARNING: Logs may not be continuous
94
+ WARNING: Logs may not be continuous
95
+ WARNING: Logs may not be continuous
96
+ WARNING: Logs may not be continuous
97
+ WARNING: Logs may not be continuous
98
+ WARNING: Logs may not be continuous
99
+ WARNING: Logs may not be continuous
100
+ WARNING: Logs may not be continuous
101
+ WARNING: Logs may not be continuous
102
+ WARNING: Logs may not be continuous
103
+ WARNING: Logs may not be continuous
104
+ WARNING: Logs may not be continuous
105
+ WARNING: Logs may not be continuous
106
+ WARNING: Logs may not be continuous
107
+ WARNING: Logs may not be continuous
108
+ WARNING: Logs may not be continuous
109
+ WARNING: Logs may not be continuous
110
+ WARNING: Logs may not be continuous
111
+ WARNING: Logs may not be continuous
112
+ WARNING: Logs may not be continuous
113
+ WARNING: Logs may not be continuous
114
+ WARNING: Logs may not be continuous
115
+ WARNING: Logs may not be continuous
116
+ WARNING: Logs may not be continuous
117
+ WARNING: Logs may not be continuous
118
+ WARNING: Logs may not be continuous
119
+ WARNING: Logs may not be continuous
120
+ WARNING: Logs may not be continuous
121
+ WARNING: Logs may not be continuous
122
+ WARNING: Logs may not be continuous
123
+ WARNING: Logs may not be continuous
124
+ WARNING: Logs may not be continuous
125
+ WARNING: Logs may not be continuous
126
+ WARNING: Logs may not be continuous
127
+ WARNING: Logs may not be continuous
128
+ WARNING: Logs may not be continuous
129
+ WARNING: Logs may not be continuous
130
+ WARNING: Logs may not be continuous
131
+ WARNING: Logs may not be continuous
132
+ WARNING: Logs may not be continuous
133
+ WARNING: Logs may not be continuous
134
+ WARNING: Logs may not be continuous
135
+ WARNING: Logs may not be continuous
136
+ WARNING: Logs may not be continuous
137
+ WARNING: Logs may not be continuous
138
+ WARNING: Logs may not be continuous
139
+ WARNING: Logs may not be continuous
140
+ WARNING: Logs may not be continuous
141
+ WARNING: Logs may not be continuous
142
+ WARNING: Logs may not be continuous
143
+ WARNING: Logs may not be continuous
144
+ WARNING: Logs may not be continuous
145
+ WARNING: Logs may not be continuous
146
+ WARNING: Logs may not be continuous
147
+ WARNING: Logs may not be continuous
artifacts/lora_oxygen_vs_all_task.md ADDED
@@ -0,0 +1,20 @@
1
+ # Oxygen-Only LoRA vs All-Task LoRA
2
+
3
+ Both runs use fold 0, ESM-2 t12, LoRA r=8, 1 epoch on Lambda A100 SXM4.
4
+
5
+ | Target | All-task LoRA | Oxygen-only LoRA | Baseline | Readout |
6
+ |---|---:|---:|---:|---|
7
+ | `optimal_temperature_c` MAE | 3.6660 | 32.2504 | 2.6743 | ignored by oxygen-only loss |
8
+ | `optimal_ph` MAE | 0.5598 | 7.0462 | 0.4685 | ignored by oxygen-only loss |
9
+ | `salt_tolerance_pct` MAE | 1.8154 | 2.0823 | 1.9171 | not improved by oxygen-only loss |
10
+ | `oxygen_requirement` macro F1 | 0.9448 | 0.9168 | 0.4020 | oxygen-only worse than all-task |
11
+
12
+ ## Conclusion
13
+
14
+ Oxygen-only training did not beat the all-task LoRA run on the oxygen target: `0.9168` macro F1 vs `0.9448` macro F1. The all-task LoRA checkpoint remains the better fold-0 result. The oxygen-only checkpoint is saved locally for inspection but should stay out of git.
15
+
16
+ ## Artifacts
17
+
18
+ - `artifacts/lora/fold0_results_oxygen.json`
19
+ - `artifacts/lora/lambda_fold0_oxygen_1ep_20260517T103524Z.log`
20
+ - local-only checkpoint: `artifacts/lora/fold0_best_oxygen.pt`
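The "ignored by oxygen-only loss" readout in the table above can be illustrated with a minimal sketch of a per-task loss mask. This is not the repository's implementation: the preset table, the task keys (borrowed from the `val=` dict printed in the smoke log), and `combined_loss` are all hypothetical names chosen for illustration.

```python
# Hypothetical preset table: task name -> weight in the summed training loss.
# An "oxygen" preset zeroes every regression task, so those heads receive no
# gradient signal, which is consistent with the large temperature and pH MAE
# reported for the oxygen-only run.
TARGET_PRESETS = {
    "all": {"temp": 1.0, "ph": 1.0, "salt": 1.0, "oxy": 1.0},
    "oxygen": {"temp": 0.0, "ph": 0.0, "salt": 0.0, "oxy": 1.0},
}


def combined_loss(per_task_loss, preset):
    """Weighted sum of per-task losses; tasks without labels (None) are skipped."""
    weights = TARGET_PRESETS[preset]
    total = 0.0
    for task, loss in per_task_loss.items():
        if loss is None:  # no labeled rows for this task in the batch
            continue
        total += weights[task] * loss
    return total


# Illustrative per-task losses for one batch (made-up numbers).
batch_losses = {"temp": 4.0, "ph": 0.5, "salt": None, "oxy": 0.25}
print(combined_loss(batch_losses, "all"))     # 4.75
print(combined_loss(batch_losses, "oxygen"))  # 0.25
```

Under this kind of masking, metrics for the zero-weight tasks mostly reflect the untrained heads rather than anything learned during the run.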
artifacts/lora_smoke.log ADDED
@@ -0,0 +1,384 @@
1
+ ✓ Initialized. View run at
2
+ https://modal.com/apps/miyu-wannaone-horiuchi/main/ap-9cMuz0t3SvGbABiswTUf1i
3
+ Building image im-0WStzS4903kICkIzhj6o3g
4
+
5
+ => Step 0: FROM base
6
+
7
+ => Step 1: COPY . /
8
+ Saving image...
9
+ Image saved, took 6.04s
10
+
11
+ Built image im-0WStzS4903kICkIzhj6o3g in 64.55s
12
+
13
+
14
+ Building image im-NdLGz3gVSHqOqDW9uEkE6S
15
+
16
+ => Step 0: FROM base
17
+
18
+ => Step 1: COPY . /
19
+ Saving image...
20
+ Image saved, took 1.27s
21
+
22
+ Built image im-NdLGz3gVSHqOqDW9uEkE6S in 2.77s
23
+
24
+
25
+ Building image im-7eAtiwQQ8BXwj3bXS3azON
26
+
27
+ => Step 0: FROM base
28
+
29
+ => Step 1: COPY . /
30
+ Saving image...
31
+ Image saved, took 1.21s
32
+
33
+ Built image im-7eAtiwQQ8BXwj3bXS3azON in 2.92s
34
+
35
+
36
+ ✓ Created objects.
37
+ ├── 🔨 Created mount
38
+ │ /Users/miyuhoriuchi/microbe-model/scripts/modal_train_lora.py
39
+ ├── 🔨 Created mount PythonPackage:microbe_model
40
+ ├── 🔨 Created mount data/marker_sequences.jsonl
41
+ ├── 🔨 Created mount data/bacdive_phenotypes.parquet
42
+ ├── 🔨 Created mount data/strain_catalog.parquet
43
+ └── 🔨 Created function train.
44
+ [modal-lora] fold=0 epochs=1 model=facebook/esm2_t12_35M_UR50D smoke=True
45
+ [smoke] truncated to 200 sequences
46
+ [lora] device = cuda
47
+ [lora] loaded 200 records with sequences + labels
48
+ [lora] fold 0: 28 train / 172 val
49
+ Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
50
+
51
+ Loading weights: 0%| | 0/198 [00:00<?, ?it/s]
52
+ Loading weights: 100%|██████████| 198/198 [00:00<00:00, 5756.73it/s]
53
+ [transformers] EsmModel LOAD REPORT from: facebook/esm2_t12_35M_UR50D
54
+ Key | Status |
55
+ --------------------------+------------+-
56
+ lm_head.dense.weight | UNEXPECTED |
57
+ lm_head.layer_norm.bias | UNEXPECTED |
58
+ lm_head.bias | UNEXPECTED |
59
+ lm_head.layer_norm.weight | UNEXPECTED |
60
+ lm_head.dense.bias | UNEXPECTED |
61
+ pooler.dense.weight | MISSING |
62
+ pooler.dense.bias | MISSING |
63
+
64
+ Notes:
65
+ - UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
66
+ - MISSING: those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
67
+ [lora] trainable params: 2,151,815 / total: 35,652,216 (6.04%)
68
+ Traceback (most recent call last):
69
+ File "/pkg/modal/_runtime/container_io_manager.py", line 947, in handle_input_exception
70
+ yield
71
+ File "/pkg/modal/_container_entrypoint.py", line 172, in run_input_sync
72
+ values = io_context.call_function_sync()
73
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
74
+ File "/pkg/modal/_runtime/container_io_manager.py", line 225, in call_function_sync
75
+ expected_value_or_values = self.finalized_function.callable(*args, **kwargs)
76
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
77
+ File "/root/modal_train_lora.py", line 88, in train
78
+ results = train_lora(
79
+ ^^^^^^^^^^^
80
+ File "/root/microbe_model/train/lora_trainer.py", line 255, in train_lora
81
+ preds = model(batch["genomes"], device=device)
82
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
83
+ File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl
84
+ return self._call_impl(*args, **kwargs)
85
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
86
+ File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
87
+ return forward_call(*args, **kwargs)
88
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
89
+ File "/root/microbe_model/train/lora_model.py", line 126, in forward
90
+ [self.encode_genome(g, device) for g in genomes],
91
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
92
+ File "/root/microbe_model/train/lora_model.py", line 126, in <listcomp>
93
+ [self.encode_genome(g, device) for g in genomes],
94
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
95
+ File "/root/microbe_model/train/lora_model.py", line 111, in encode_genome
96
+ per_protein = self.encode_proteins(proteins, device) # (n, D)
97
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
98
+ File "/root/microbe_model/train/lora_model.py", line 99, in encode_proteins
99
+ outputs = self.esm(**enc)
100
+ ^^^^^^^^^^^^^^^
101
+ File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl
102
+ return self._call_impl(*args, **kwargs)
103
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
104
+ File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
105
+ return forward_call(*args, **kwargs)
106
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
107
+ File "/usr/local/lib/python3.11/site-packages/peft/peft_model.py", line 945, in forward
108
+ return self.get_base_model()(*args, **kwargs)
109
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
110
+ File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl
111
+ return self._call_impl(*args, **kwargs)
112
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
113
+ File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
114
+ return forward_call(*args, **kwargs)
115
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
116
+ File "/usr/local/lib/python3.11/site-packages/transformers/utils/generic.py", line 976, in wrapper
117
+ output = func(self, *args, **kwargs)
118
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^
119
+ File "/usr/local/lib/python3.11/site-packages/transformers/utils/output_capturing.py", line 248, in wrapper
120
+ outputs = func(self, *args, **kwargs)
121
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^
122
+ File "/usr/local/lib/python3.11/site-packages/transformers/models/esm/modeling_esm.py", line 754, in forward
123
+ encoder_outputs = self.encoder(
124
+ ^^^^^^^^^^^^^
125
+ File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl
126
+ return self._call_impl(*args, **kwargs)
127
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
128
+ File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
129
+ return forward_call(*args, **kwargs)
130
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
131
+ File "/usr/local/lib/python3.11/site-packages/transformers/utils/generic.py", line 900, in wrapper
132
+ output = func(self, *args, **kwargs)
133
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^
134
+ File "/usr/local/lib/python3.11/site-packages/transformers/models/esm/modeling_esm.py", line 556, in forward
135
+ hidden_states = layer_module(
136
+ ^^^^^^^^^^^^^
137
+ File "/usr/local/lib/python3.11/site-packages/transformers/modeling_layers.py", line 93, in __call__
138
+ return super().__call__(*args, **kwargs)
139
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
140
+ File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl
141
+ return self._call_impl(*args, **kwargs)
142
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
143
+ File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
144
+ return forward_call(*args, **kwargs)
145
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
146
+ File "/usr/local/lib/python3.11/site-packages/transformers/models/esm/modeling_esm.py", line 527, in forward
147
+ layer_output = self.feed_forward_chunk(attention_output)
148
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
149
+ File "/usr/local/lib/python3.11/site-packages/transformers/models/esm/modeling_esm.py", line 532, in feed_forward_chunk
150
+ intermediate_output = self.intermediate(attention_output_ln)
151
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
152
+ File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl
153
+ return self._call_impl(*args, **kwargs)
154
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
155
+ File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
156
+ return forward_call(*args, **kwargs)
157
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
158
+ File "/usr/local/lib/python3.11/site-packages/transformers/models/esm/modeling_esm.py", line 462, in forward
159
+ hidden_states = gelu(hidden_states)
160
+ ^^^^^^^^^^^^^^^^^^^
161
+ File "/usr/local/lib/python3.11/site-packages/transformers/models/esm/modeling_esm.py", line 86, in gelu
162
+ return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
163
+ ~~^~~~~
164
+ torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 36.00 MiB. GPU 0 has a total capacity of 22.06 GiB of which 3.44 MiB is free. Process 1 has 22.05 GiB memory in use. Of the allocated memory 21.19 GiB is allocated by PyTorch, and 593.32 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf)
165
+ Stopping app - uncaught exception raised locally: OutOfMemoryError('CUDA out of memory. Tried to allocate 36.00 MiB. GPU 0 has a total capacity of 22.06 GiB of which 3.44 MiB is free. Process 1 has 22.05 GiB memory in use. Of the allocated memory 21.19 GiB is allocated by PyTorch, and 593.32 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf)').
166
+ ╭───────────────────── Traceback (most recent call last) ──────────────────────╮
167
+ │ /Users/miyuhoriuchi/microbe-model/scripts/modal_train_lora.py:124 in main │
168
+ │ │
169
+ │ 123 │ │
170
+ │ ❱ 124 │ payload = train.remote( │
171
+ │ 125 │ │ fold=fold, │
172
+ │ │
173
+ │ /Users/miyuhoriuchi/microbe-model/.venv/lib/python3.11/site-packages/modal/_ │
174
+ │ object.py:46 in wrapped │
175
+ │ │
176
+ │ 45 │ │ await self.hydrate() │
177
+ │ ❱ 46 │ │ return await method(self, *args, **kwargs) │
178
+ │ 47 │
179
+ │ │
180
+ │ /Users/miyuhoriuchi/microbe-model/.venv/lib/python3.11/site-packages/modal/_ │
181
+ │ functions.py:1699 in remote │
182
+ │ │
183
+ │ 1698 │ │ │
184
+ │ ❱ 1699 │ │ return await self._call_function(args, kwargs) │
185
+ │ 1700 │
186
+ │ │
187
+ │ /Users/miyuhoriuchi/microbe-model/.venv/lib/python3.11/site-packages/modal/_ │
188
+ │ functions.py:1643 in _call_function │
189
+ │ │
190
+ │ 1642 │ │ │
191
+ │ ❱ 1643 │ │ return await invocation.run_function() │
192
+ │ 1644 │
193
+ │ │
194
+ │ /Users/miyuhoriuchi/microbe-model/.venv/lib/python3.11/site-packages/modal/_ │
195
+ │ functions.py:291 in run_function │
196
+ │ │
197
+ │ 290 │ │ │ item = await self._get_single_output() │
198
+ │ ❱ 291 │ │ │ return await _process_result(item.result, item.data_forma │
199
+ │ 292 │
200
+ │ │
201
+ │ /Users/miyuhoriuchi/microbe-model/.venv/lib/python3.11/site-packages/modal/_ │
202
+ │ utils/function_utils.py:527 in _process_result │
203
+ │ │
204
+ │ 526 │ │ │ │
205
+ │ ❱ 527 │ │ │ raise exc_with_hints(exc) │
206
+ │ 528 │
207
+ │ │
208
+ │ ...Remote call to Modal Function (ta-01KRNVDM93QPX3HT2JT4Q0WHCY)... │
209
+ │ │
210
+ │ /root/modal_train_lora.py:88 in train │
211
+ │ │
212
+ │ ❱ 88 results = train_lora( │
213
+ │ │
214
+ │ │
215
+ │ /root/microbe_model/train/lora_trainer.py:255 in train_lora │
216
+ │ │
217
+ │ ❱ 255 preds = model(batch["genomes"], device=device) │
218
+ │ │
219
+ │ │
220
+ │ /usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py:1778 in │
221
+ │ _wrapped_call_impl │
222
+ │ │
223
+ │ ❱ 1778 return self._call_impl(*args, **kwargs) │
224
+ │ │
225
+ │ │
226
+ │ /usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py:1789 in │
227
+ │ _call_impl │
228
+ │ │
229
+ │ ❱ 1789 return forward_call(*args, **kwargs) │
230
+ │ │
231
+ │ │
232
+ │ /root/microbe_model/train/lora_model.py:126 in forward │
233
+ │ │
234
+ │ ❱ 126 [self.encode_genome(g, device) for g in genomes], │
235
+ │ │
236
+ │ │
237
+ │ /root/microbe_model/train/lora_model.py:126 in <listcomp> │
238
+ │ │
239
+ │ ❱ 126 [self.encode_genome(g, device) for g in genomes], │
240
+ │ │
241
+ │ │
242
+ │ /root/microbe_model/train/lora_model.py:111 in encode_genome │
243
+ │ │
244
+ │ ❱ 111 per_protein = self.encode_proteins(proteins, device) # (n, D) │
245
+ │ │
246
+ │ │
247
+ │ /root/microbe_model/train/lora_model.py:99 in encode_proteins │
248
+ │ │
249
+ │ ❱ 99 outputs = self.esm(**enc) │
250
+ │ │
251
+ │ │
252
+ │ /usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py:1778 in │
253
+ │ _wrapped_call_impl │
254
+ │ │
255
+ │ ❱ 1778 return self._call_impl(*args, **kwargs) │
256
+ │ │
257
+ │ │
258
+ │ /usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py:1789 in │
259
+ │ _call_impl │
260
+ │ │
261
+ │ ❱ 1789 return forward_call(*args, **kwargs) │
262
+ │ │
263
+ │ │
264
+ │ /usr/local/lib/python3.11/site-packages/peft/peft_model.py:945 in forward │
265
+ │ │
266
+ │ ❱ 945 return self.get_base_model()(*args, **kwargs) │
267
+ │ │
268
+ │ │
269
+ │ /usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py:1778 in │
270
+ │ _wrapped_call_impl │
271
+ │ │
272
+ │ ❱ 1778 return self._call_impl(*args, **kwargs) │
273
+ │ │
274
+ │ │
275
+ │ /usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py:1789 in │
276
+ │ _call_impl │
277
+ │ │
278
+ │ ❱ 1789 return forward_call(*args, **kwargs) │
279
+ │ │
280
+ │ │
281
+ │ /usr/local/lib/python3.11/site-packages/transformers/utils/generic.py:976 in │
282
+ │ wrapper │
283
+ │ │
284
+ │ ❱ 976 output = func(self, *args, **kwargs) │
285
+ │ │
286
+ │ │
287
+ │ /usr/local/lib/python3.11/site-packages/transformers/utils/output_capturing. │
288
+ │ py:248 in wrapper │
289
+ │ │
290
+ │ ❱ 248 outputs = func(self, *args, **kwargs) │
291
+ │ │
292
+ │ │
293
+ │ /usr/local/lib/python3.11/site-packages/transformers/models/esm/modeling_esm │
294
+ │ .py:754 in forward │
295
+ │ │
296
+ │ ❱ 754 encoder_outputs = self.encoder( │
297
+ │ │
298
+ │ │
299
+ │ /usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py:1778 in │
300
+ │ _wrapped_call_impl │
301
+ │ │
302
+ │ ❱ 1778 return self._call_impl(*args, **kwargs) │
303
+ │ │
304
+ │ │
305
+ │ /usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py:1789 in │
306
+ │ _call_impl │
307
+ │ │
308
+ │ ❱ 1789 return forward_call(*args, **kwargs) │
309
+ │ │
310
+ │ │
311
+ │ /usr/local/lib/python3.11/site-packages/transformers/utils/generic.py:900 in │
312
+ │ wrapper │
313
+ │ │
314
+ │ ❱ 900 output = func(self, *args, **kwargs) │
315
+ │ │
316
+ │ │
317
+ │ /usr/local/lib/python3.11/site-packages/transformers/models/esm/modeling_esm │
318
+ │ .py:556 in forward │
319
+ │ │
320
+ │ ❱ 556 hidden_states = layer_module( │
321
+ │ │
322
+ │ │
323
+ │ /usr/local/lib/python3.11/site-packages/transformers/modeling_layers.py:93 │
324
+ │ in __call__ │
325
+ │ │
326
+ │ ❱ 93 return super().__call__(*args, **kwargs) │
327
+ │ │
328
+ │ │
329
+ │ /usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py:1778 in │
330
+ │ _wrapped_call_impl │
331
+ │ │
332
+ │ ❱ 1778 return self._call_impl(*args, **kwargs) │
333
+ │ │
334
+ │ │
335
+ │ /usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py:1789 in │
336
+ │ _call_impl │
337
+ │ │
338
+ │ ❱ 1789 return forward_call(*args, **kwargs) │
339
+ │ │
340
+ │ │
341
+ │ /usr/local/lib/python3.11/site-packages/transformers/models/esm/modeling_esm │
342
+ │ .py:527 in forward │
343
+ │ │
344
+ │ ❱ 527 layer_output = self.feed_forward_chunk(attention_output) │
345
+ │ │
346
+ │ │
347
+ │ /usr/local/lib/python3.11/site-packages/transformers/models/esm/modeling_esm │
348
+ │ .py:532 in feed_forward_chunk │
349
+ │ │
350
+ │ ❱ 532 intermediate_output = self.intermediate(attention_output_ln) │
351
+ │ │
352
+ │ │
353
+ │ /usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py:1778 in │
354
+ │ _wrapped_call_impl │
355
+ │ │
356
+ │ ❱ 1778 return self._call_impl(*args, **kwargs) │
357
+ │ │
358
+ │ │
359
+ │ /usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py:1789 in │
360
+ │ _call_impl │
361
+ │ │
362
+ │ ❱ 1789 return forward_call(*args, **kwargs) │
363
+ │ │
364
+ │ │
365
+ │ /usr/local/lib/python3.11/site-packages/transformers/models/esm/modeling_esm │
366
+ │ .py:462 in forward │
367
+ │ │
368
+ │ ❱ 462 hidden_states = gelu(hidden_states) │
369
+ │ │
370
+ │ │
371
+ │ /usr/local/lib/python3.11/site-packages/transformers/models/esm/modeling_esm │
372
+ │ .py:86 in gelu │
373
+ │ │
374
+ │ ❱ 86 return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0))) │
375
+ │ │
376
+ ╰──────────────────────────────────────────────────────────────────────────────╯
377
+ OutOfMemoryError: CUDA out of memory. Tried to allocate 36.00 MiB. GPU 0 has a
378
+ total capacity of 22.06 GiB of which 3.44 MiB is free. Process 1 has 22.05 GiB
379
+ memory in use. Of the allocated memory 21.19 GiB is allocated by PyTorch, and
380
+ 593.32 MiB is reserved by PyTorch but unallocated. If reserved but unallocated
381
+ memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to
382
+ avoid fragmentation. See documentation for Memory Management
383
+ (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-wi
384
+ th-pytorch-cuda-alloc-conf)
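The traceback above runs out of memory inside `encode_proteins` at `self.esm(**enc)` on a GPU with 22.06 GiB. One common way to bound activation memory in that kind of call is to split a genome's proteins into micro-batches under a padded-token budget. The sketch below is a hypothetical helper, not code from `lora_model.py`; `token_budget_chunks`, `max_tokens`, and `max_len` are assumed names and values.

```python
def token_budget_chunks(sequences, max_tokens=8192, max_len=1024):
    """Yield lists of sequences whose padded size stays within max_tokens.

    Padded size is len(chunk) * (longest sequence in the chunk), which
    approximates what a padded transformer batch allocates activations for.
    """
    chunk, longest = [], 0
    for seq in sequences:
        seq = seq[:max_len]  # truncate very long proteins
        new_longest = max(longest, len(seq))
        # Flush the current chunk if adding this sequence would exceed the budget.
        if chunk and (len(chunk) + 1) * new_longest > max_tokens:
            yield chunk
            chunk, new_longest = [], len(seq)
        chunk.append(seq)
        longest = new_longest
    if chunk:
        yield chunk


# Five 100-residue proteins with a 250-token budget -> chunks of 2, 2, 1.
proteins = ["A" * 100] * 5
sizes = [len(c) for c in token_budget_chunks(proteins, max_tokens=250)]
print(sizes)  # [2, 2, 1]
```

Each chunk would then be encoded in its own forward pass and the per-protein embeddings concatenated. This trades throughput for a lower peak memory and is complementary to gradient checkpointing.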
artifacts/lora_smoke2.log ADDED
@@ -0,0 +1,42 @@
1
+ ✓ Initialized. View run at
2
+ https://modal.com/apps/miyu-wannaone-horiuchi/main/ap-14sZzX1W4YsqgIp2y29INw
3
+ ✓ Created objects.
4
+ ├── 🔨 Created mount
5
+ │ /Users/miyuhoriuchi/microbe-model/scripts/modal_train_lora.py
6
+ ├── 🔨 Created mount PythonPackage:microbe_model
7
+ ├── 🔨 Created mount data/marker_sequences.jsonl
8
+ ├── 🔨 Created mount data/bacdive_phenotypes.parquet
9
+ ├── 🔨 Created mount data/strain_catalog.parquet
10
+ └── 🔨 Created function train.
11
+ [modal-lora] fold=0 epochs=1 model=facebook/esm2_t12_35M_UR50D smoke=True
12
+ [smoke] truncated to 200 sequences
13
+ [lora] device = cuda
14
+ [lora] loaded 200 records with sequences + labels
15
+ [lora] fold 0: 28 train / 172 val
16
+ Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
17
+
18
+ Loading weights: 0%| | 0/198 [00:00<?, ?it/s]
19
+ Loading weights: 100%|██████████| 198/198 [00:00<00:00, 5899.50it/s]
20
+ [transformers] EsmModel LOAD REPORT from: facebook/esm2_t12_35M_UR50D
21
+ Key | Status |
22
+ --------------------------+------------+-
23
+ lm_head.bias | UNEXPECTED |
24
+ lm_head.dense.weight | UNEXPECTED |
25
+ lm_head.layer_norm.weight | UNEXPECTED |
26
+ lm_head.dense.bias | UNEXPECTED |
27
+ lm_head.layer_norm.bias | UNEXPECTED |
28
+ pooler.dense.weight | MISSING |
29
+ pooler.dense.bias | MISSING |
30
+
31
+ Notes:
32
+ - UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
33
+ - MISSING: those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
34
+ [lora] trainable params: 2,151,815 / total: 35,652,216 (6.04%)
35
+ [transformers] Caching is incompatible with gradient checkpointing in EsmLayer. Setting `use_cache=False`.
36
+ [lora] epoch 1 done in 61s val={'temp': {'mae': 28.067121505737305, 'n': 172}, 'ph': {'mae': None, 'n': 0}, 'salt': {'mae': None, 'n': 0}, 'oxy': {'f1_macro': None, 'n': 0}}
37
+ [lora] wrote /artifacts/lora/fold0_results.json
38
+ [modal-lora] wrote artifacts/lora/fold0_results_smoke.json
39
+ [modal-lora] wrote artifacts/lora/fold0_best_smoke.pt (8.6 MB)
40
+ Stopping app - local entrypoint completed.
41
+ ✓ App completed. View run at
42
+ https://modal.com/apps/miyu-wannaone-horiuchi/main/ap-14sZzX1W4YsqgIp2y29INw
artifacts/lora_vs_baseline.md ADDED
@@ -0,0 +1,16 @@
+ # LoRA Fold 0 vs Tabular Baseline
+
+ Caveat: LoRA is one group fold; baseline is the current five-fold mean.
+
+ | Target | LoRA | Baseline | Delta | Verdict |
+ |---|---:|---:|---:|---|
+ | `optimal_temperature_c` MAE | 3.6660 | 2.6743 | +0.9917 | worse |
+ | `optimal_ph` MAE | 0.5598 | 0.4685 | +0.0913 | worse |
+ | `salt_tolerance_pct` MAE | 1.8154 | 1.9171 | -0.1017 | better |
+ | `oxygen_requirement` macro F1 | 0.9448 | 0.4020 | +0.5428 | better |
+
+ ## Recommendation
+
+ The first LoRA pass is strongest for oxygen classification. For the next GPU run, use `scripts/lambda_train_lora.py --target-preset oxygen` instead of spending more A100 time optimizing regression losses that underperformed the tabular baseline.
+
+ Keep `artifacts/lora/fold0_best.pt` outside git unless it is published to a model store or release asset; the JSON metrics and log are enough for repo history.
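The Delta and Verdict columns in the table above depend on metric direction: a negative delta is an improvement for MAE, while a positive delta is an improvement for macro F1. The sketch below recomputes both columns from the table values; `verdict` and `LOWER_IS_BETTER` are hypothetical names, not the repository's reporting script.

```python
ROWS = [
    # (target, metric, lora, baseline), copied from the table above
    ("optimal_temperature_c", "mae", 3.6660, 2.6743),
    ("optimal_ph", "mae", 0.5598, 0.4685),
    ("salt_tolerance_pct", "mae", 1.8154, 1.9171),
    ("oxygen_requirement", "macro_f1", 0.9448, 0.4020),
]
LOWER_IS_BETTER = {"mae"}  # error metrics; everything else is a score


def verdict(metric, lora, baseline):
    """Return (delta, label), where delta = lora - baseline rounded to 4 places."""
    delta = round(lora - baseline, 4)
    if delta == 0:
        return delta, "tie"
    improved = delta < 0 if metric in LOWER_IS_BETTER else delta > 0
    return delta, ("better" if improved else "worse")


for target, metric, lora, baseline in ROWS:
    delta, label = verdict(metric, lora, baseline)
    print(f"{target:24s} {metric:9s} {delta:+.4f} {label}")
```

As the caveat in the report notes, the comparison is between a single LoRA fold and a five-fold baseline mean, so the deltas are indicative rather than a like-for-like estimate.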
artifacts/lora_weighted_anaerobe/fold0_oxygen_diagnostics.json ADDED
@@ -0,0 +1,321 @@
1
+ {
2
+ "checkpoint": "artifacts/lora_weighted_anaerobe/fold0_best.pt",
3
+ "n": 2266,
4
+ "classes": [
5
+ "aerobe",
6
+ "anaerobe",
7
+ "facultative_anaerobe",
8
+ "microaerobe"
9
+ ],
10
+ "accuracy": 0.953663,
11
+ "macro_f1": 0.944776,
12
+ "macro_f1_all_classes": 0.472388,
13
+ "confusion_matrix": [
14
+ [
15
+ 1535,
16
+ 38,
17
+ 0,
18
+ 0
19
+ ],
20
+ [
21
+ 67,
22
+ 626,
23
+ 0,
24
+ 0
25
+ ],
26
+ [
27
+ 0,
28
+ 0,
29
+ 0,
30
+ 0
31
+ ],
32
+ [
33
+ 0,
34
+ 0,
35
+ 0,
36
+ 0
37
+ ]
38
+ ],
39
+ "per_class": {
40
+ "aerobe": {
41
+ "precision": 0.958177,
42
+ "recall": 0.975842,
43
+ "f1": 0.966929,
44
+ "support": 1573,
45
+ "predicted": 1602
46
+ },
47
+ "anaerobe": {
48
+ "precision": 0.942771,
49
+ "recall": 0.903319,
50
+ "f1": 0.922623,
51
+ "support": 693,
52
+ "predicted": 664
53
+ },
54
+ "facultative_anaerobe": {
55
+ "precision": 0.0,
56
+ "recall": 0.0,
57
+ "f1": 0.0,
58
+ "support": 0,
59
+ "predicted": 0
60
+ },
61
+ "microaerobe": {
62
+ "precision": 0.0,
63
+ "recall": 0.0,
64
+ "f1": 0.0,
65
+ "support": 0,
66
+ "predicted": 0
67
+ }
68
+ },
69
+ "wrong_predictions": [
70
+ {
71
+ "bacdive_id": 499,
72
+ "genome_accession": "GCA_000429505",
73
+ "group": "Alteromonadaceae",
74
+ "true": "anaerobe",
75
+ "pred": "aerobe",
76
+ "confidence": 0.997351,
77
+ "true_probability": 0.002648,
78
+ "margin": 0.994702
79
+ },
80
+ {
81
+ "bacdive_id": 481,
82
+ "genome_accession": "GCA_003363485",
83
+ "group": "Alteromonadaceae",
84
+ "true": "anaerobe",
85
+ "pred": "aerobe",
86
+ "confidence": 0.996132,
87
+ "true_probability": 0.003865,
88
+ "margin": 0.992268
89
+ },
90
+ {
91
+ "bacdive_id": 168525,
92
+ "genome_accession": "GCA_006386545",
93
+ "group": "Sphaerotilaceae",
94
+ "true": "anaerobe",
95
+ "pred": "aerobe",
96
+ "confidence": 0.995269,
97
+ "true_probability": 0.004727,
98
+ "margin": 0.990543
99
+ },
100
+ {
101
+ "bacdive_id": 6314,
102
+ "genome_accession": "GCA_900475835",
103
+ "group": "Jonesiaceae",
104
+ "true": "anaerobe",
105
+ "pred": "aerobe",
106
+ "confidence": 0.995095,
107
+ "true_probability": 0.004901,
108
+ "margin": 0.990193
109
+ },
110
+ {
111
+ "bacdive_id": 498,
112
+ "genome_accession": "GCA_000429485",
113
+ "group": "Alteromonadaceae",
114
+ "true": "anaerobe",
115
+ "pred": "aerobe",
116
+ "confidence": 0.994422,
117
+ "true_probability": 0.005572,
118
+ "margin": 0.988851
119
+ },
120
+ {
121
+ "bacdive_id": 17841,
122
+ "genome_accession": "GCA_000975055",
123
+ "group": "Demequinaceae",
124
+ "true": "anaerobe",
125
+ "pred": "aerobe",
126
+ "confidence": 0.993803,
127
+ "true_probability": 0.006191,
128
+ "margin": 0.987612
129
+ },
130
+ {
131
+ "bacdive_id": 483,
132
+ "genome_accession": "GCA_000421165",
133
+ "group": "Alteromonadaceae",
134
+ "true": "anaerobe",
135
+ "pred": "aerobe",
136
+ "confidence": 0.990431,
137
+ "true_probability": 0.009555,
138
+ "margin": 0.980876
139
+ },
140
+ {
141
+ "bacdive_id": 23140,
142
+ "genome_accession": "GCA_016925555",
143
+ "group": "Mycoplasmataceae",
144
+ "true": "aerobe",
145
+ "pred": "anaerobe",
146
+ "confidence": 0.981044,
147
+ "true_probability": 0.018747,
148
+ "margin": 0.962297
149
+ },
150
+ {
151
+ "bacdive_id": 17840,
152
+ "genome_accession": "GCA_000975035",
153
+ "group": "Demequinaceae",
154
+ "true": "anaerobe",
155
+ "pred": "aerobe",
156
+ "confidence": 0.979909,
157
+ "true_probability": 0.020063,
158
+ "margin": 0.959846
159
+ },
160
+ {
161
+ "bacdive_id": 148058,
162
+ "genome_accession": "GCF_000005845.2",
163
+ "group": "Enterobacteriaceae",
164
+ "true": "anaerobe",
165
+ "pred": "aerobe",
166
+ "confidence": 0.977617,
167
+ "true_probability": 0.022345,
168
+ "margin": 0.955271
169
+ },
170
+ {
171
+ "bacdive_id": 149706,
172
+ "genome_accession": "GCF_000005845.2",
173
+ "group": "Enterobacteriaceae",
174
+ "true": "anaerobe",
175
+ "pred": "aerobe",
176
+ "confidence": 0.977617,
177
+ "true_probability": 0.022345,
178
+ "margin": 0.955271
179
+ },
180
+ {
181
+ "bacdive_id": 156346,
182
+ "genome_accession": "GCF_000005845.2",
183
+ "group": "Enterobacteriaceae",
184
+ "true": "anaerobe",
185
+ "pred": "aerobe",
186
+ "confidence": 0.977617,
187
+ "true_probability": 0.022345,
188
+ "margin": 0.955271
189
+ },
190
+ {
191
+ "bacdive_id": 140022,
192
+ "genome_accession": "GCF_014068355.1",
193
+ "group": "Mycoplasmataceae",
194
+ "true": "aerobe",
195
+ "pred": "anaerobe",
196
+ "confidence": 0.976127,
197
+ "true_probability": 0.02361,
198
+ "margin": 0.952517
199
+ },
200
+ {
201
+ "bacdive_id": 164735,
202
+ "genome_accession": "GCF_943590815.1",
203
+ "group": "Enterobacteriaceae",
204
+ "true": "anaerobe",
205
+ "pred": "aerobe",
206
+ "confidence": 0.969284,
207
+ "true_probability": 0.030661,
208
+ "margin": 0.938623
209
+ },
210
+ {
211
+ "bacdive_id": 133974,
212
+ "genome_accession": "GCA_001544515",
213
+ "group": "Burkholderiaceae",
214
+ "true": "anaerobe",
215
+ "pred": "aerobe",
216
+ "confidence": 0.965083,
217
+ "true_probability": 0.034798,
218
+ "margin": 0.930284
219
+ },
220
+ {
221
+ "bacdive_id": 133484,
222
+ "genome_accession": "GCA_900102145",
223
+ "group": "Thermaceae",
224
+ "true": "aerobe",
225
+ "pred": "anaerobe",
226
+ "confidence": 0.955376,
227
+ "true_probability": 0.044292,
228
+ "margin": 0.911083
229
+ },
230
+ {
231
+ "bacdive_id": 154004,
232
+ "genome_accession": "GCF_982443925.1",
233
+ "group": "Enterobacteriaceae",
234
+ "true": "anaerobe",
235
+ "pred": "aerobe",
236
+ "confidence": 0.953645,
237
+ "true_probability": 0.046276,
238
+ "margin": 0.907368
239
+ },
240
+ {
241
+ "bacdive_id": 168303,
242
+ "genome_accession": "GCA_004307015",
243
+ "group": "Thermaceae",
244
+ "true": "aerobe",
245
+ "pred": "anaerobe",
246
+ "confidence": 0.949521,
247
+ "true_probability": 0.050177,
248
+ "margin": 0.899345
249
+ },
250
+ {
251
+ "bacdive_id": 133976,
252
+ "genome_accession": "GCA_001544495",
253
+ "group": "Burkholderiaceae",
254
+ "true": "anaerobe",
255
+ "pred": "aerobe",
256
+ "confidence": 0.94727,
257
+ "true_probability": 0.052558,
258
+ "margin": 0.894712
259
+ },
260
+ {
261
+ "bacdive_id": 8608,
262
+ "genome_accession": "GCF_900476065.1",
263
+ "group": "Metamycoplasmataceae",
264
+ "true": "aerobe",
265
+ "pred": "anaerobe",
266
+ "confidence": 0.940865,
267
+ "true_probability": 0.058377,
268
+ "margin": 0.882487
269
+ },
270
+ {
271
+ "bacdive_id": 132346,
272
+ "genome_accession": "GCA_002259755",
273
+ "group": "Bifidobacteriaceae",
274
+ "true": "aerobe",
275
+ "pred": "anaerobe",
276
+ "confidence": 0.937469,
277
+ "true_probability": 0.062389,
278
+ "margin": 0.87508
279
+ },
280
+ {
281
+ "bacdive_id": 133975,
282
+ "genome_accession": "GCA_001544475",
283
+ "group": "Burkholderiaceae",
284
+ "true": "anaerobe",
285
+ "pred": "aerobe",
286
+ "confidence": 0.933766,
287
+ "true_probability": 0.06607,
288
+ "margin": 0.867696
289
+ },
290
+ {
291
+ "bacdive_id": 132158,
292
+ "genome_accession": "GCA_000807275",
293
+ "group": "Orbaceae",
294
+ "true": "anaerobe",
295
+ "pred": "aerobe",
296
+ "confidence": 0.933592,
297
+ "true_probability": 0.06628,
298
+ "margin": 0.867312
299
+ },
300
+ {
301
+ "bacdive_id": 140694,
302
+ "genome_accession": "GCA_003336745",
303
+ "group": "Thermaceae",
304
+ "true": "aerobe",
305
+ "pred": "anaerobe",
306
+ "confidence": 0.9297,
307
+ "true_probability": 0.069824,
308
+ "margin": 0.859876
309
+ },
310
+ {
311
+ "bacdive_id": 133991,
312
+ "genome_accession": "GCA_039544205",
313
+ "group": "Ferrimonadaceae",
314
+ "true": "anaerobe",
315
+ "pred": "aerobe",
316
+ "confidence": 0.925231,
317
+ "true_probability": 0.074638,
318
+ "margin": 0.850593
319
+ }
320
+ ]
321
+ }
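In the `wrong_predictions` entries above, `margin` appears to equal `confidence - true_probability` up to rounding of the stored values. The sketch below checks that relationship on a few rows copied from the JSON and tallies errors by taxonomic group and direction. The helper names are hypothetical, and the margin definition is inferred from the data rather than confirmed by the diagnostics script.

```python
from collections import Counter

# A few rows copied from the JSON above (fields trimmed for brevity).
wrong = [
    {"group": "Alteromonadaceae", "true": "anaerobe", "pred": "aerobe",
     "confidence": 0.997351, "true_probability": 0.002648, "margin": 0.994702},
    {"group": "Alteromonadaceae", "true": "anaerobe", "pred": "aerobe",
     "confidence": 0.996132, "true_probability": 0.003865, "margin": 0.992268},
    {"group": "Mycoplasmataceae", "true": "aerobe", "pred": "anaerobe",
     "confidence": 0.981044, "true_probability": 0.018747, "margin": 0.962297},
    {"group": "Enterobacteriaceae", "true": "anaerobe", "pred": "aerobe",
     "confidence": 0.977617, "true_probability": 0.022345, "margin": 0.955271},
]


def recomputed_margin(row):
    """Predicted-class probability minus true-class probability."""
    return row["confidence"] - row["true_probability"]


def errors_by_group(rows):
    """Count misclassifications per (group, true label, predicted label)."""
    return Counter((r["group"], r["true"], r["pred"]) for r in rows)


for row in wrong:
    # Stored values are rounded to 6 places, so allow a small tolerance.
    assert abs(recomputed_margin(row) - row["margin"]) < 2e-6

print(errors_by_group(wrong).most_common())
```

Grouping by family in this way makes clusters of same-direction errors easier to spot, such as the repeated Alteromonadaceae rows.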
artifacts/lora_weighted_anaerobe/fold0_oxygen_diagnostics.md ADDED
@@ -0,0 +1,56 @@
+ # LoRA Oxygen Diagnostics
+
+ Checkpoint: `artifacts/lora_weighted_anaerobe/fold0_best.pt`
+
+ - Labeled validation rows: `2266`
+ - Accuracy: `0.9537`
+ - Macro F1 (supported classes): `0.9448`
+ - Macro F1 (all configured classes): `0.4724`
+
+ ## Per-Class Metrics
+
+ | Class | Precision | Recall | F1 | Support | Predicted |
+ |---|---:|---:|---:|---:|---:|
+ | aerobe | 0.9582 | 0.9758 | 0.9669 | 1573 | 1602 |
+ | anaerobe | 0.9428 | 0.9033 | 0.9226 | 693 | 664 |
+ | facultative_anaerobe | 0.0000 | 0.0000 | 0.0000 | 0 | 0 |
+ | microaerobe | 0.0000 | 0.0000 | 0.0000 | 0 | 0 |
+
+ ## Confusion Matrix
+
+ | True \ Pred | aerobe | anaerobe | facultative_anaerobe | microaerobe |
+ |---|---:|---:|---:|---:|
+ | aerobe | 1535 | 38 | 0 | 0 |
+ | anaerobe | 67 | 626 | 0 | 0 |
+ | facultative_anaerobe | 0 | 0 | 0 | 0 |
+ | microaerobe | 0 | 0 | 0 | 0 |
+
28
+ ## High-Confidence Wrong Predictions
29
+
30
+ | BacDive ID | Genome | Group | True | Pred | Confidence | True Prob. | Margin |
31
+ |---:|---|---|---|---|---:|---:|---:|
32
+ | 499 | GCA_000429505 | Alteromonadaceae | anaerobe | aerobe | 0.9974 | 0.0026 | 0.9947 |
33
+ | 481 | GCA_003363485 | Alteromonadaceae | anaerobe | aerobe | 0.9961 | 0.0039 | 0.9923 |
34
+ | 168525 | GCA_006386545 | Sphaerotilaceae | anaerobe | aerobe | 0.9953 | 0.0047 | 0.9905 |
35
+ | 6314 | GCA_900475835 | Jonesiaceae | anaerobe | aerobe | 0.9951 | 0.0049 | 0.9902 |
36
+ | 498 | GCA_000429485 | Alteromonadaceae | anaerobe | aerobe | 0.9944 | 0.0056 | 0.9889 |
37
+ | 17841 | GCA_000975055 | Demequinaceae | anaerobe | aerobe | 0.9938 | 0.0062 | 0.9876 |
38
+ | 483 | GCA_000421165 | Alteromonadaceae | anaerobe | aerobe | 0.9904 | 0.0096 | 0.9809 |
39
+ | 23140 | GCA_016925555 | Mycoplasmataceae | aerobe | anaerobe | 0.9810 | 0.0187 | 0.9623 |
40
+ | 17840 | GCA_000975035 | Demequinaceae | anaerobe | aerobe | 0.9799 | 0.0201 | 0.9598 |
41
+ | 148058 | GCF_000005845.2 | Enterobacteriaceae | anaerobe | aerobe | 0.9776 | 0.0223 | 0.9553 |
42
+ | 149706 | GCF_000005845.2 | Enterobacteriaceae | anaerobe | aerobe | 0.9776 | 0.0223 | 0.9553 |
43
+ | 156346 | GCF_000005845.2 | Enterobacteriaceae | anaerobe | aerobe | 0.9776 | 0.0223 | 0.9553 |
44
+ | 140022 | GCF_014068355.1 | Mycoplasmataceae | aerobe | anaerobe | 0.9761 | 0.0236 | 0.9525 |
45
+ | 164735 | GCF_943590815.1 | Enterobacteriaceae | anaerobe | aerobe | 0.9693 | 0.0307 | 0.9386 |
46
+ | 133974 | GCA_001544515 | Burkholderiaceae | anaerobe | aerobe | 0.9651 | 0.0348 | 0.9303 |
47
+ | 133484 | GCA_900102145 | Thermaceae | aerobe | anaerobe | 0.9554 | 0.0443 | 0.9111 |
48
+ | 154004 | GCF_982443925.1 | Enterobacteriaceae | anaerobe | aerobe | 0.9536 | 0.0463 | 0.9074 |
49
+ | 168303 | GCA_004307015 | Thermaceae | aerobe | anaerobe | 0.9495 | 0.0502 | 0.8993 |
50
+ | 133976 | GCA_001544495 | Burkholderiaceae | anaerobe | aerobe | 0.9473 | 0.0526 | 0.8947 |
51
+ | 8608 | GCF_900476065.1 | Metamycoplasmataceae | aerobe | anaerobe | 0.9409 | 0.0584 | 0.8825 |
52
+ | 132346 | GCA_002259755 | Bifidobacteriaceae | aerobe | anaerobe | 0.9375 | 0.0624 | 0.8751 |
53
+ | 133975 | GCA_001544475 | Burkholderiaceae | anaerobe | aerobe | 0.9338 | 0.0661 | 0.8677 |
54
+ | 132158 | GCA_000807275 | Orbaceae | anaerobe | aerobe | 0.9336 | 0.0663 | 0.8673 |
55
+ | 140694 | GCA_003336745 | Thermaceae | aerobe | anaerobe | 0.9297 | 0.0698 | 0.8599 |
56
+ | 133991 | GCA_039544205 | Ferrimonadaceae | anaerobe | aerobe | 0.9252 | 0.0746 | 0.8506 |
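
Reviewer note: the headline numbers in this report can be rederived from the confusion matrix alone. The sketch below is not taken from the repository; the function name `per_class_metrics` and the variable names are illustrative assumptions. It uses the standard definitions of precision, recall, and F1, and it shows why the two macro F1 figures differ: one averages over the two classes that have validation support, the other over all four configured classes.

```python
# Confusion matrix copied from the report: rows are true labels,
# columns are predicted labels.
CLASSES = ["aerobe", "anaerobe", "facultative_anaerobe", "microaerobe"]
CONFUSION = [
    [1535, 38, 0, 0],
    [67, 626, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]


def per_class_metrics(matrix):
    """Return (precision, recall, f1, support) for each class index."""
    n = len(matrix)
    out = []
    for k in range(n):
        tp = matrix[k][k]
        predicted = sum(matrix[i][k] for i in range(n))  # column sum
        support = sum(matrix[k])  # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        out.append((precision, recall, f1, support))
    return out


metrics = per_class_metrics(CONFUSION)
total = sum(sum(row) for row in CONFUSION)
accuracy = sum(CONFUSION[k][k] for k in range(len(CONFUSION))) / total

# Macro F1 over classes with validation support versus all four classes.
supported = [m[2] for m in metrics if m[3] > 0]
macro_f1_supported = sum(supported) / len(supported)
macro_f1_all = sum(m[2] for m in metrics) / len(metrics)

print(round(accuracy, 4), round(macro_f1_supported, 4), round(macro_f1_all, 4))
# prints: 0.9537 0.9448 0.4724
```

Because fold 0 contains no `facultative_anaerobe` or `microaerobe` rows, the all-class macro F1 is exactly half of the supported-class value.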
artifacts/lora_weighted_anaerobe/fold0_results.json ADDED
@@ -0,0 +1,88 @@
+ {
+   "model_cfg": {
+     "esm_model_name": "facebook/esm2_t12_35M_UR50D",
+     "lora_r": 8,
+     "lora_alpha": 16,
+     "lora_dropout": 0.05,
+     "lora_target": [
+       "query",
+       "value"
+     ],
+     "head_hidden_dim": 128,
+     "head_dropout": 0.1,
+     "max_seq_len": 512,
+     "max_proteins_per_cat": 6,
+     "gradient_checkpointing": true
+   },
+   "train_cfg": {
+     "fold": 0,
+     "epochs": 1,
+     "batch_size": 2,
+     "grad_accum": 8,
+     "lora_lr": 0.0001,
+     "head_lr": 0.001,
+     "weight_decay": 0.01,
+     "warmup_frac": 0.05,
+     "bf16": true,
+     "max_proteins_per_category": 16,
+     "save_dir": "artifacts/lora_weighted_anaerobe",
+     "grad_clip": 1.0,
+     "temp_weight": 1.0,
+     "ph_weight": 1.0,
+     "salt_weight": 1.0,
+     "oxy_weight": 1.0,
+     "oxy_class_weights": [
+       1.0,
+       1.5,
+       1.0,
+       1.0
+     ]
+   },
+   "history": [
+     {
+       "epoch": 1,
+       "train_loss": 51.366838841350884,
+       "val": {
+         "temp": {
+           "mae": 3.659961462020874,
+           "n": 8066
+         },
+         "ph": {
+           "mae": 0.564297616481781,
+           "n": 705
+         },
+         "salt": {
+           "mae": 1.7560514211654663,
+           "n": 424
+         },
+         "oxy": {
+           "f1_macro": 0.9447762839519784,
+           "n": 2266
+         }
+       },
+       "elapsed_s": 19573.33940076828
+     }
+   ],
+   "best": {
+     "epoch": 1,
+     "val": {
+       "temp": {
+         "mae": 3.659961462020874,
+         "n": 8066
+       },
+       "ph": {
+         "mae": 0.564297616481781,
+         "n": 705
+       },
+       "salt": {
+         "mae": 1.7560514211654663,
+         "n": 424
+       },
+       "oxy": {
+         "f1_macro": 0.9447762839519784,
+         "n": 2266
+       }
+     },
+     "score": 5.035534215716143
+   }
+ }
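
Reviewer note: this diff does not show how `best.score` is computed. The reported value is numerically consistent with the sum of the three regression MAEs minus the oxygen macro F1 (lower is better), which would also match all four task weights being `1.0`. The sketch below only checks that arithmetic; `composite_score` is a hypothetical name, and the formula is an inference from the numbers, not something confirmed by the training code.

```python
# Validation metrics copied from the "best" entry of the results JSON.
best_val = {
    "temp": {"mae": 3.659961462020874},
    "ph": {"mae": 0.564297616481781},
    "salt": {"mae": 1.7560514211654663},
    "oxy": {"f1_macro": 0.9447762839519784},
}
reported_score = 5.035534215716143


def composite_score(val):
    """Hypothetical reconstruction: sum of MAEs minus oxygen macro F1."""
    mae_sum = sum(val[task]["mae"] for task in ("temp", "ph", "salt"))
    return mae_sum - val["oxy"]["f1_macro"]


score = composite_score(best_val)
# True when the reconstruction agrees with the reported value.
print(abs(score - reported_score) < 1e-9)
```

If this reading is right, the score mixes quantities on different scales (degrees Celsius, pH units, percent salt, and an F1 in [0, 1]), so temperature MAE dominates checkpoint selection.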
artifacts/lora_weighted_anaerobe/fold0_weighted_anaerobe_comparison.md ADDED
@@ -0,0 +1,45 @@
+ # Fold 0 Anaerobe-Weighted LoRA Comparison
+
+ This run tested whether increasing the oxygen cross-entropy weight for
+ `anaerobe` improves the fold 0 oxygen result.
+
+ Experiment:
+
+ - Base setup: all-task LoRA, 1 epoch, fold 0
+ - Oxygen class weights: `aerobe=1.0`, `anaerobe=1.5`,
+   `facultative_anaerobe=1.0`, `microaerobe=1.0`
+ - Output directory: `artifacts/lora_weighted_anaerobe`
+
+ ## Training Validation Metric
+
+ | Checkpoint | Oxygen macro F1 | Oxygen n | Notes |
+ |---|---:|---:|---|
+ | `artifacts/lora/fold0_best.pt` | 0.944823 | 2266 | Original all-task LoRA |
+ | `artifacts/lora_weighted_anaerobe/fold0_best.pt` | 0.944776 | 2266 | Anaerobe-weighted all-task LoRA |
+
+ ## Detailed Diagnostic
+
+ The weighted checkpoint diagnostic reports:
+
+ | Class | Precision | Recall | F1 | Support |
+ |---|---:|---:|---:|---:|
+ | aerobe | 0.958177 | 0.975842 | 0.966929 | 1573 |
+ | anaerobe | 0.942771 | 0.903319 | 0.922623 | 693 |
+
+ Confusion matrix:
+
+ | True \ Pred | aerobe | anaerobe | facultative_anaerobe | microaerobe |
+ |---|---:|---:|---:|---:|
+ | aerobe | 1535 | 38 | 0 | 0 |
+ | anaerobe | 67 | 626 | 0 | 0 |
+ | facultative_anaerobe | 0 | 0 | 0 | 0 |
+ | microaerobe | 0 | 0 | 0 | 0 |
+
+ ## Conclusion
+
+ The `anaerobe=1.5` class-weight experiment does not beat the original all-task
+ LoRA on fold 0. It slightly improves anaerobe recall in the detailed diagnostic,
+ but the overall oxygen macro F1 is fractionally lower than the original all-task
+ checkpoint.
+
+ Keep `artifacts/lora/fold0_best.pt` as the best fold 0 checkpoint for now.
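
Reviewer note: for readers unfamiliar with class-weighted cross-entropy, the following is a minimal pure-Python sketch of what an `anaerobe=1.5` weight does to the loss. The training code is not part of this diff, so this is an assumption about the mechanism, not the project's implementation. It follows the weighted-mean convention that PyTorch documents for `CrossEntropyLoss(weight=..., reduction="mean")`: each example's loss is scaled by the weight of its true class, and the sum is divided by the total weight.

```python
import math

CLASSES = ["aerobe", "anaerobe", "facultative_anaerobe", "microaerobe"]
CLASS_WEIGHTS = [1.0, 1.5, 1.0, 1.0]  # values from the experiment description


def weighted_cross_entropy(logits_batch, targets, weights):
    """Weighted mean of per-example negative log-likelihoods."""
    total_loss = 0.0
    total_weight = 0.0
    for logits, target in zip(logits_batch, targets):
        peak = max(logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        nll = log_norm - logits[target]
        total_loss += weights[target] * nll
        total_weight += weights[target]
    return total_loss / total_weight


# Two examples that both lean towards "aerobe"; the second is truly anaerobe.
logits = [[2.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]]
targets = [0, 1]
weighted = weighted_cross_entropy(logits, targets, CLASS_WEIGHTS)
unweighted = weighted_cross_entropy(logits, targets, [1.0, 1.0, 1.0, 1.0])
print(weighted > unweighted)  # the misclassified anaerobe counts for more
```

A weight of 1.5 only rescales how much anaerobe errors contribute to the gradient, so a shift of a few thousandths in macro F1 after one epoch is a plausible outcome.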
artifacts/materialize_pme.log ADDED
@@ -0,0 +1,13 @@
+ read 5,000 lines, 3,186 unique genomes
+ read 10,000 lines, 6,012 unique genomes
+ read 15,000 lines, 8,193 unique genomes
+ read 20,000 lines, 9,381 unique genomes
+ read 25,000 lines, 11,269 unique genomes
+ read 30,000 lines, 13,720 unique genomes
+ read 35,000 lines, 15,993 unique genomes
+ read 40,000 lines, 18,247 unique genomes
+ read 45,000 lines, 19,703 unique genomes
+ read 50,000 lines, 21,570 unique genomes
+ read 55,000 lines, 22,300 unique genomes
+ Parsed 55,046 lines → 22,300 unique genomes (83.2s)
+ Wrote 22,300 rows × 5131 cols → /Users/miyuhoriuchi/microbe-model/data/per_marker_embeddings.parquet (615.9 MB, 170.5s)
artifacts/media_recommender_drylab_benchmark.json ADDED
@@ -0,0 +1,343 @@
+ {
+   "split_mode": "family",
+   "n_splits": 5,
+   "n_features": 1113,
+   "n_media": 40,
+   "n_estimators": 100,
+   "elapsed_s": 906.581169128418,
+   "model": {
+     "n_eval": 21050.0,
+     "mean_true_media": 1.3330166270783848,
+     "mrr": 0.5880700888181283,
+     "hit_at_1": 0.4498812351543943,
+     "recall_at_1": 0.3553784639746635,
+     "precision_at_1": 0.4498812351543943,
+     "hit_at_3": 0.6598574821852732,
+     "recall_at_3": 0.5818313539192399,
+     "precision_at_3": 0.25814726840855107,
+     "hit_at_5": 0.7752969121140143,
+     "recall_at_5": 0.7217062549485351,
+     "precision_at_5": 0.19352969121140143,
+     "hit_at_10": 0.8652256532066508,
+     "recall_at_10": 0.8449089469517023,
+     "precision_at_10": 0.11457957244655581,
+     "median_roc_auc": 0.9099485141368615,
+     "median_pr_auc": 0.18301671580977655
+   },
+   "global_popularity": {
+     "n_eval": 21050.0,
+     "mean_true_media": 1.3330166270783848,
+     "mrr": 0.24343189838094925,
+     "hit_at_1": 0.08019002375296912,
+     "recall_at_1": 0.05362549485352335,
+     "precision_at_1": 0.08019002375296912,
+     "hit_at_3": 0.25002375296912116,
+     "recall_at_3": 0.19952810768012663,
+     "precision_at_3": 0.08673000791765638,
+     "hit_at_5": 0.3660807600950119,
+     "recall_at_5": 0.3077403008709422,
+     "precision_at_5": 0.0798479809976247,
+     "hit_at_10": 0.7987648456057007,
+     "recall_at_10": 0.7146753760886778,
+     "precision_at_10": 0.09205225653206651
+   },
+   "taxonomy_popularity": {
+     "n_eval": 21050.0,
+     "mean_true_media": 1.3330166270783848,
+     "mrr": 0.2501139457470768,
+     "hit_at_1": 0.08646080760095012,
+     "recall_at_1": 0.05863737133808392,
+     "precision_at_1": 0.08646080760095012,
+     "hit_at_3": 0.25914489311163896,
+     "recall_at_3": 0.20742201108471894,
+     "precision_at_3": 0.09002375296912114,
+     "hit_at_5": 0.37244655581947744,
+     "recall_at_5": 0.31233254156769596,
+     "precision_at_5": 0.08098812351543942,
+     "hit_at_10": 0.8009976247030879,
+     "recall_at_10": 0.7152692003167063,
+     "precision_at_10": 0.09202375296912114
+   },
+   "per_medium_auc": [
+     {
+       "medium_id": "1",
+       "n_pos": 2102,
+       "n_neg": 23626,
+       "pr_auc": 0.27484617227464014,
+       "roc_auc": 0.8452226670886136
+     },
+     {
+       "medium_id": "104",
+       "n_pos": 472,
+       "n_neg": 25256,
+       "pr_auc": 0.18128419818337677,
+       "roc_auc": 0.9484343458577389
+     },
+     {
+       "medium_id": "104b",
+       "n_pos": 105,
+       "n_neg": 25623,
+       "pr_auc": 0.13838455803689761,
+       "roc_auc": 0.9674838268445575
+     },
+     {
+       "medium_id": "104c",
+       "n_pos": 104,
+       "n_neg": 25624,
+       "pr_auc": 0.03838926719297028,
+       "roc_auc": 0.9064871574725618
+     },
+     {
+       "medium_id": "1076b",
+       "n_pos": 116,
+       "n_neg": 25612,
+       "pr_auc": 0.7772565185763604,
+       "roc_auc": 0.9216801997447318
+     },
+     {
+       "medium_id": "11",
+       "n_pos": 495,
+       "n_neg": 25233,
+       "pr_auc": 0.42285303544912106,
+       "roc_auc": 0.9656590075446334
+     },
+     {
+       "medium_id": "110",
+       "n_pos": 257,
+       "n_neg": 25471,
+       "pr_auc": 0.11044863098458187,
+       "roc_auc": 0.953733680800031
+     },
+     {
+       "medium_id": "1203a",
+       "n_pos": 186,
+       "n_neg": 25542,
+       "pr_auc": 0.21768704520819743,
+       "roc_auc": 0.9528128033691925
+     },
+     {
+       "medium_id": "1611",
+       "n_pos": 102,
+       "n_neg": 25626,
+       "pr_auc": 0.09711080696747246,
+       "roc_auc": 0.9282099751630924
+     },
+     {
+       "medium_id": "1a",
+       "n_pos": 179,
+       "n_neg": 25549,
+       "pr_auc": 0.050322245145825104,
+       "roc_auc": 0.9074889723351185
+     },
+     {
+       "medium_id": "215",
+       "n_pos": 358,
+       "n_neg": 25370,
+       "pr_auc": 0.02965177531773624,
+       "roc_auc": 0.6736935808140085
+     },
+     {
+       "medium_id": "220",
+       "n_pos": 1274,
+       "n_neg": 24454,
+       "pr_auc": 0.18474923343617633,
+       "roc_auc": 0.7785027031177238
+     },
+     {
+       "medium_id": "252",
+       "n_pos": 592,
+       "n_neg": 25136,
+       "pr_auc": 0.1250091495334598,
+       "roc_auc": 0.9134151768433774
+     },
+     {
+       "medium_id": "31",
+       "n_pos": 109,
+       "n_neg": 25619,
+       "pr_auc": 0.014316010090645013,
+       "roc_auc": 0.1197419776248348
+     },
+     {
+       "medium_id": "339",
+       "n_pos": 162,
+       "n_neg": 25566,
+       "pr_auc": 0.2531128917725162,
+       "roc_auc": 0.9461261242989581
+     },
+     {
+       "medium_id": "354",
+       "n_pos": 117,
+       "n_neg": 25611,
+       "pr_auc": 0.004547574626865672,
+       "roc_auc": 0.09816094646831441
+     },
+     {
+       "medium_id": "381",
+       "n_pos": 521,
+       "n_neg": 25207,
+       "pr_auc": 0.1945320503586121,
+       "roc_auc": 0.8782904803505287
+     },
+     {
+       "medium_id": "428",
+       "n_pos": 103,
+       "n_neg": 25625,
+       "pr_auc": 0.09496790309585758,
+       "roc_auc": 0.8493817665166943
+     },
+     {
+       "medium_id": "457",
+       "n_pos": 150,
+       "n_neg": 25578,
+       "pr_auc": 0.0412603532866421,
+       "roc_auc": 0.8629237365444262
+     },
+     {
+       "medium_id": "514",
+       "n_pos": 1890,
+       "n_neg": 23838,
+       "pr_auc": 0.6372180504429193,
+       "roc_auc": 0.9468378153062271
+     },
+     {
+       "medium_id": "53",
+       "n_pos": 149,
+       "n_neg": 25579,
+       "pr_auc": 0.2046388525464773,
+       "roc_auc": 0.9388116195358452
+     },
+     {
+       "medium_id": "535",
+       "n_pos": 1573,
+       "n_neg": 24155,
+       "pr_auc": 0.22497502735528332,
+       "roc_auc": 0.7663368847332266
+     },
+     {
+       "medium_id": "545",
+       "n_pos": 219,
+       "n_neg": 25509,
+       "pr_auc": 0.02413857437574695,
+       "roc_auc": 0.7897148307043927
+     },
+     {
+       "medium_id": "553",
+       "n_pos": 374,
+       "n_neg": 25354,
+       "pr_auc": 0.15104363590267195,
+       "roc_auc": 0.9124080559386046
+     },
+     {
+       "medium_id": "554",
+       "n_pos": 365,
+       "n_neg": 25363,
+       "pr_auc": 0.11893469404694698,
+       "roc_auc": 0.9315081455620554
+     },
+     {
+       "medium_id": "58",
+       "n_pos": 216,
+       "n_neg": 25512,
+       "pr_auc": 0.05833314429889305,
+       "roc_auc": 0.927297466406513
+     },
+     {
+       "medium_id": "585",
+       "n_pos": 112,
+       "n_neg": 25616,
+       "pr_auc": 0.278482148554742,
+       "roc_auc": 0.8558551226354063
+     },
+     {
+       "medium_id": "645",
+       "n_pos": 276,
+       "n_neg": 25452,
+       "pr_auc": 0.036581621082776454,
+       "roc_auc": 0.7953502842520276
+     },
+     {
+       "medium_id": "65",
+       "n_pos": 3434,
+       "n_neg": 22294,
+       "pr_auc": 0.9006928683634222,
+       "roc_auc": 0.9755952498821933
+     },
+     {
+       "medium_id": "693",
+       "n_pos": 2503,
+       "n_neg": 23225,
+       "pr_auc": 0.4188285561288795,
+       "roc_auc": 0.8783924651021573
+     },
+     {
+       "medium_id": "78",
+       "n_pos": 257,
+       "n_neg": 25471,
+       "pr_auc": 0.2721170882429022,
+       "roc_auc": 0.9676798073707691
+     },
+     {
+       "medium_id": "81",
+       "n_pos": 115,
+       "n_neg": 25613,
+       "pr_auc": 0.27618279435467347,
+       "roc_auc": 0.9262594232887851
+     },
+     {
+       "medium_id": "83",
+       "n_pos": 150,
+       "n_neg": 25578,
+       "pr_auc": 0.020746118950763597,
+       "roc_auc": 0.8187215575885527
+     },
+     {
+       "medium_id": "830",
+       "n_pos": 1782,
+       "n_neg": 23946,
+       "pr_auc": 0.4005468147385281,
+       "roc_auc": 0.8818929642762434
+     },
+     {
+       "medium_id": "84",
+       "n_pos": 1365,
+       "n_neg": 24363,
+       "pr_auc": 0.5583327533020991,
+       "roc_auc": 0.9681294174090628
+     },
+     {
+       "medium_id": "9",
+       "n_pos": 2465,
+       "n_neg": 23263,
+       "pr_auc": 0.8004815953628244,
+       "roc_auc": 0.9217055368025154
+     },
+     {
+       "medium_id": "92",
+       "n_pos": 2899,
+       "n_neg": 22829,
+       "pr_auc": 0.4467899949837726,
+       "roc_auc": 0.8689824346226291
+     },
+     {
+       "medium_id": "98",
+       "n_pos": 151,
+       "n_neg": 25577,
+       "pr_auc": 0.3434343543833669,
+       "roc_auc": 0.9693860403865538
+     },
+     {
+       "medium_id": "987",
+       "n_pos": 155,
+       "n_neg": 25573,
+       "pr_auc": 0.03815021128397991,
+       "roc_auc": 0.8637134931877497
+     },
+     {
+       "medium_id": "pdf",
+       "n_pos": 106,
+       "n_neg": 25622,
+       "pr_auc": 0.011293218617366114,
+       "roc_auc": 0.6703395003998627
+     }
+   ]
+ }
artifacts/media_recommender_drylab_benchmark.md ADDED
@@ -0,0 +1,51 @@
+ # Media Recommender Dry-Lab Benchmark
+
+ This benchmark hides known BacDive/MediaDive strain-medium links and asks
+ whether the genome-only recommender recovers at least one known medium in
+ the top-k ranked recommendations.
+
+ ## Setup
+
+ - Split mode: `family`
+ - Folds: 5
+ - Evaluation strains: 21050
+ - Media labels: 40
+ - Feature columns: 1113
+ - XGBoost trees per medium per fold: 100
+
+ ## Ranking Metrics
+
+ | Method | MRR | Hit@1 | Hit@3 | Hit@5 | Recall@5 | Precision@5 |
+ |---|---:|---:|---:|---:|---:|---:|
+ | XGBoost medium recommender | 0.588 | 0.450 | 0.660 | 0.775 | 0.722 | 0.194 |
+ | Global popularity baseline | 0.243 | 0.080 | 0.250 | 0.366 | 0.308 | 0.080 |
+ | Taxonomic popularity baseline | 0.250 | 0.086 | 0.259 | 0.372 | 0.312 | 0.081 |
+
+ ## Per-Medium AUC
+
+ - Valid media with both classes: 40
+ - Median ROC-AUC: 0.910
+ - Median PR-AUC: 0.183
+
+ Top media by PR-AUC:
+
+ | Medium | Positives | PR-AUC | ROC-AUC |
+ |---|---:|---:|---:|
+ | 65 GYM STREPTOMYCES MEDIUM | 3434 | 0.901 | 0.976 |
+ | 9 VY/2 AGAR | 2465 | 0.800 | 0.922 |
+ | 1076b SP4-Z MEDIUM | 116 | 0.777 | 0.922 |
+ | 514 BACTO MARINE BROTH (DIFCO 2216) | 1890 | 0.637 | 0.947 |
+ | 84 ROLLED OATS MINERAL MEDIUM | 1365 | 0.558 | 0.968 |
+ | 92 TRYPTICASE SOY YEAST EXTRACT MEDIUM | 2899 | 0.447 | 0.869 |
+ | 11 MRS MEDIUM | 495 | 0.423 | 0.966 |
+ | 693 COLUMBIA BLOOD MEDIUM | 2503 | 0.419 | 0.878 |
+ | 830 R2A MEDIUM | 1782 | 0.401 | 0.882 |
+ | 98 RHIZOBIUM MEDIUM | 151 | 0.343 | 0.969 |
+
+ ## Interpretation
+
+ Hit@k is the easiest practical readout: it measures whether at least one
+ known growth medium appears in the top-k suggestions. PR-AUC is expected
+ to be much lower than ROC-AUC because medium labels are sparse and heavily
+ imbalanced; a high ROC-AUC with modest PR-AUC means the model is useful for
+ ranking candidates, not for guaranteeing growth.
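
Reviewer note: the benchmark script is not included in this diff, so the exact metric definitions are not visible here. The sketch below uses the standard per-strain definitions of Hit@k, Recall@k, Precision@k, and reciprocal rank, which the benchmark would then average over all evaluation strains. The function name `ranking_metrics` and the example ranking are illustrative assumptions. These definitions are at least consistent with the JSON, where `hit_at_1` and `precision_at_1` are identical.

```python
def ranking_metrics(ranked, relevant, k):
    """Score one strain.

    ranked:   medium IDs ordered best-first by the recommender.
    relevant: set of media the strain is known to grow on.
    """
    top_k = ranked[:k]
    hits = sum(1 for medium in top_k if medium in relevant)
    first_rank = next(
        (i + 1 for i, medium in enumerate(ranked) if medium in relevant), None
    )
    return {
        "hit": 1.0 if hits else 0.0,            # at least one known medium in top-k
        "recall": hits / len(relevant),         # share of known media recovered
        "precision": hits / k,                  # share of suggestions that are known
        "rr": 1.0 / first_rank if first_rank else 0.0,  # reciprocal rank for MRR
    }


# Made-up ranking for a strain with two known media.
example = ranking_metrics(["65", "9", "514", "92", "830"], {"514", "92"}, k=3)
print(example)
```

With about 1.33 known media per strain on average, Precision@5 has a low ceiling (roughly 1.33 / 5), which helps explain why it stays near 0.19 even when Hit@5 is 0.775.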
artifacts/retrain_with_pme.log ADDED
@@ -0,0 +1,19 @@
+ Encoded 65 isolation-category features (104558 non-zero entries)
+ Joined MediaDive features (5 cols) — 25,728/46,029 training rows have MediaDive data
+ Joined HMM features (144 cols) — 46,029/46,029 training rows have HMM data
+ Joined KEGG module completeness (570 cols) — 43,457/46,029 training rows have KEGG data
+ Joined per-marker ESM-2 embeddings (5129 cols) — 46,026/46,029 training rows have PME data
+ Joined isolation metadata (46 cols)
+ Training table: 46,029 strains × 6312 features
+ Distinct groups: 634
+ Group sizes (top 10): {'Enterobacteriaceae': 2667, 'Streptomycetaceae': 2212, 'Bacillaceae': 1887, 'Lactobacillaceae': 1749, 'Pseudomonadaceae': 1623, 'Myxococcaceae': 1546, 'Streptococcaceae': 1174, 'Staphylococcaceae': 1071, 'Flavobacteriaceae': 984, 'Corynebacteriaceae': 903}
+
+ Wrote training table to /Users/miyuhoriuchi/microbe-model/data/training_table.parquet
+ Wrote per-strain predictions to /Users/miyuhoriuchi/microbe-model/artifacts/predictions.parquet
+
+ Results summary (16647.1s):
+
+ optimal_temperature_c mae = 2.6743 (n_folds=5)
+ optimal_ph mae = 0.4685 (n_folds=5)
+ oxygen_requirement f1_macro = 0.4020 (n_folds=5)
+ salt_tolerance_pct mae = 1.9171 (n_folds=5)
cerebrium/embed/cerebrium.toml ADDED
@@ -0,0 +1,32 @@
+ [cerebrium.deployment]
+ name = "embed"
+ python_version = "3.11"
+ docker_base_image_url = "nvidia/cuda:12.1.1-runtime-ubuntu22.04"
+ disable_auth = false
+ include = ['./*', 'main.py', 'cerebrium.toml', 'markers.hmm']
+ exclude = ['.*', '__pycache__', '*.pyc']
+
+ [cerebrium.hardware]
+ cpu = 4.0
+ memory = 16.0
+ compute = "ADA_L4"
+ gpu_count = 1
+ region = "us-east-1"
+
+ [cerebrium.scaling]
+ min_replicas = 0
+ max_replicas = 3
+ cooldown = 120
+ replica_concurrency = 1
+ scaling_metric = "concurrency_utilization"
+ scaling_target = 80
+ response_grace_period = 600
+
+ [cerebrium.dependencies.pip]
+ torch = ">=2.2"
+ transformers = ">=4.40"
+ accelerate = ">=0.30"
+ pyhmmer = ">=0.12"
+ pyrodigal = ">=3.5"
+ requests = ">=2.32"
+ numpy = ">=1.26"
cerebrium/embed/main.py ADDED
@@ -0,0 +1,251 @@
+ """Per-marker ESM-2 t30 embedding service — runs on a Cerebrium L4 GPU container.
+
+ embed_genome(bacdive_id, accession) → {"ok": bool, "row": {pme_<cat>_<dim>: float, ...}}
+ or {"ok": False, "reason": ...}.
+
+ The unified-marker HMM library is baked into the image. Each replica loads
+ ESM-2 + HMMs once at startup, then serves multiple genomes from the warm
+ container.
+ """
+ from __future__ import annotations
+
+ import io
+ import os
+ import time
+ import zipfile
+ from typing import Any
+
+ import numpy as np
+ import pyhmmer
+ import pyhmmer.easel
+ import pyhmmer.plan7
+ import pyrodigal
+ import requests
+ import torch
+ from transformers import AutoModel, AutoTokenizer
+
+ DATASETS_URL = "https://api.ncbi.nlm.nih.gov/datasets/v2/genome/accession/{acc}/download"
+ VERSION_FALLBACKS = (".1", ".2", ".3", ".4")
+ EMPTY_ZIP_BYTES = 2_000
+ EVALUE_THRESHOLD = 1e-5
+
+ MARKER_TO_CATEGORY: dict[str, str] = {
+     "Hsp70_DnaK": "temperature", "Hsp90": "temperature", "Cpn60_GroEL": "temperature",
+     "Hsp20": "temperature", "CSD_cold_shock": "temperature", "TGS_thermosome": "temperature",
+     "ATP_synth_alphabeta": "ph", "ATP_synth_alphabeta_C": "ph", "ATP_synth_F0_B": "ph",
+     "NhaA_Na_H_exch": "ph", "NhaB_Na_H_exch": "ph", "Pyridoxal_decarbox": "ph",
+     "MotA_TolQ_ExbB": "ph", "V_ATPase_subH_N": "ph",
+     "COX1_aerobic": "oxygen", "COX2_TM_aerobic": "oxygen", "COX2_periplasm_aero": "oxygen",
+     "Cyt_CBB3_microaero": "oxygen", "Rieske_2Fe2S": "oxygen", "Catalase": "oxygen",
+     "SOD_FeMn": "oxygen", "SOD_CuZn": "oxygen", "FeFe_hyd_anaerobic": "oxygen",
+     "NiFe_hyd_anaerobic": "oxygen", "FAD_binding_FrdA": "oxygen", "Fer4_FeS_4Fe4S": "oxygen",
+     "KdpD_osmosensor": "salt", "TrkH_K_channel": "salt", "BCCT_compatible": "salt",
+     "BPD_transp_1": "salt", "EctC_ectoine_synth": "salt", "Bact_rhodopsin": "salt",
+     "TP_methylase_B12": "vitamin", "Peripla_BP_2": "vitamin", "THF_DHG_CYH_folate": "vitamin",
+     "FolB_folate": "vitamin", "PdxJ_pyridoxine": "vitamin", "DHBP_riboflavin": "vitamin",
+     "NifH_nitrogenase": "nitrogen", "NifDK_nitrogenase": "nitrogen",
+     "NIR_SIR_ferredoxin": "nitrogen",
+     "RuBisCO_large_form1": "carbon", "RuBisCO_small_form1": "carbon",
+     "Alpha_amylase": "carbon", "Cellulase_GH5": "carbon", "CBM_cellulose": "carbon",
+     "Molybdopterin_OR": "special", "UvrD_helicase_C": "special",
+ }
+ CATEGORIES = ["temperature", "ph", "oxygen", "salt", "vitamin", "nitrogen", "carbon", "special"]
+
+ _model_name = os.environ.get("ESM2_MODEL", "facebook/esm2_t30_150M_UR50D")
+ _batch_size = int(os.environ.get("ESM2_BATCH_SIZE", "16"))
+ _device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ _dtype = torch.float16 if _device.type == "cuda" else torch.float32
+
+ print(f"[boot] loading {_model_name} on {_device} ({_dtype})", flush=True)
+ _tokenizer = AutoTokenizer.from_pretrained(_model_name)
+ _model = AutoModel.from_pretrained(_model_name, dtype=_dtype)
+ _model.to(_device)
+ _model.train(False)
+ _embed_dim = _model.config.hidden_size
+
+ _alphabet = pyhmmer.easel.Alphabet.amino()
+ with pyhmmer.plan7.HMMFile("/cortex/app/markers.hmm") as _fh:
+     _hmms = list(_fh)
+ _ncbi_key = os.environ.get("NCBI_API_KEY")
+ print(f"[boot] loaded {len(_hmms)} marker HMMs, embed_dim={_embed_dim}, "
+       f"ncbi_key={'yes' if _ncbi_key else 'no'}", flush=True)
+
+
+ def _has_version(acc: str) -> bool:
+     return "." in acc and acc.rsplit(".", 1)[-1].isdigit()
+
+
+ def _candidates(acc: str) -> list[str]:
+     return [acc] if _has_version(acc) else [acc + v for v in VERSION_FALLBACKS]
+
+
+ def _fetch_fasta(acc: str) -> list[tuple[str, str]] | None:
+     rate = 0.1 if _ncbi_key else 0.34
+     headers = {"Accept": "application/zip"}
+     if _ncbi_key:
+         headers["api-key"] = _ncbi_key
+     params = {"include_annotation_type": "GENOME_FASTA"}
+     for cand in _candidates(acc):
+         zip_bytes: bytes | None = None
+         for attempt in range(3):
+             try:
+                 time.sleep(rate)
+                 resp = requests.get(
+                     DATASETS_URL.format(acc=cand), params=params,
+                     headers=headers, timeout=120,
+                 )
+                 if resp.status_code == 404:
+                     break
+                 if resp.status_code in (429, 502, 503):
+                     time.sleep(2 ** attempt)
+                     continue
+                 resp.raise_for_status()
+             except requests.RequestException:
+                 if attempt == 2:
+                     break
+                 time.sleep(2 ** attempt)
+                 continue
+             if len(resp.content) < EMPTY_ZIP_BYTES:
+                 break
+             zip_bytes = resp.content
+             break
+         if zip_bytes is None:
+             continue
+         try:
+             with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
+                 fna = [n for n in zf.namelist() if n.endswith(".fna")]
+                 if not fna:
+                     continue
+                 with zf.open(fna[0]) as src:
+                     raw = src.read()
+         except zipfile.BadZipFile:
+             continue
+         return _parse_fasta(raw)
+     return None
+
+
+ def _parse_fasta(raw: bytes) -> list[tuple[str, str]]:
+     contigs: list[tuple[str, str]] = []
+     cur: str | None = None
+     chunks: list[str] = []
+     for line in raw.splitlines():
+         if not line:
+             continue
+         if line.startswith(b">"):
+             if cur is not None:
+                 contigs.append((cur, "".join(chunks).upper()))
+             cur = line[1:].decode("ascii", errors="replace").split()[0]
+             chunks = []
+         else:
+             chunks.append(line.decode("ascii", errors="replace"))
+     if cur is not None:
+         contigs.append((cur, "".join(chunks).upper()))
+     return contigs
+
+
+ def _predict_proteins(contigs: list[tuple[str, str]]) -> list[str]:
+     encoded = [(n, s.encode("ascii")) for n, s in contigs]
+     total_nt = sum(len(s) for _, s in encoded)
+     if total_nt >= 20_000:
+         finder = pyrodigal.GeneFinder(meta=False)
+         try:
+             finder.train(b"TTAATTAATTAA".join(s for _, s in encoded))
+         except Exception:
+             finder = pyrodigal.GeneFinder(meta=True)
+     else:
+         finder = pyrodigal.GeneFinder(meta=True)
+     proteins: list[str] = []
+     for _, s in encoded:
+         for gene in finder.find_genes(s):
+             proteins.append(gene.translate().rstrip("*"))
+     return proteins
+
+
+ def _embed_proteins(proteins: list[str]) -> np.ndarray:
+     if not proteins:
+         return np.zeros((0, _embed_dim), dtype=np.float32)
+     out: list = []
+     for i in range(0, len(proteins), _batch_size):
+         batch = proteins[i : i + _batch_size]
+         enc = _tokenizer(batch, return_tensors="pt", padding=True,
+                          truncation=True, max_length=1024)
+         enc = {k: v.to(_device) for k, v in enc.items()}
+         with torch.inference_mode():
+             outs = _model(**enc)
+         last_hidden = outs.last_hidden_state
+         mask = enc["attention_mask"].unsqueeze(-1).to(last_hidden.dtype)
+         pooled = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
+         out.append(pooled.float().cpu().numpy())
+     return np.concatenate(out, axis=0)
+
+
+ def _scan_markers(proteins: list[str]) -> dict[str, list[int]]:
+     seqs = []
+     for i, prot in enumerate(proteins):
+         if not prot:
+             continue
+         ts = pyhmmer.easel.TextSequence(name=f"p{i}".encode(), sequence=prot)
+         seqs.append(ts.digitize(_alphabet))
+     result: dict[str, list[int]] = {name: [] for name in MARKER_TO_CATEGORY}
+     if not seqs:
+         return result
+     for top_hits in pyhmmer.hmmer.hmmsearch(_hmms, seqs, E=EVALUE_THRESHOLD):
+         raw = top_hits.query.name
+         marker = raw.decode() if isinstance(raw, bytes) else raw
+         if marker not in result:
+             continue
+         for hit in top_hits:
+             if hit.evalue > EVALUE_THRESHOLD:
+                 continue
+             name = hit.name.decode() if isinstance(hit.name, bytes) else hit.name
+             if name.startswith("p"):
+                 try:
+                     result[marker].append(int(name[1:]))
+                 except ValueError:
+                     pass
+     return result
+
+
+ def embed_genome(bacdive_id: int, accession: str) -> dict[str, Any]:
+     try:
+         contigs = _fetch_fasta(accession)
+         if not contigs:
+             return {"ok": False, "reason": "fetch_empty", "bacdive_id": bacdive_id, "accession": accession}
+         proteins = _predict_proteins(contigs)
+         if not proteins:
+             return {"ok": False, "reason": "no_proteins", "bacdive_id": bacdive_id, "accession": accession}
+         marker_idx = _scan_markers(proteins)
+         hit_indices = sorted({i for ids in marker_idx.values() for i in ids})
+         row: dict[str, Any] = {
+             "bacdive_id": int(bacdive_id),
+             "genome_accession": accession,
+             "pme_marker_proteins_total": len(hit_indices),
+         }
+         if not hit_indices:
+             for cat in CATEGORIES:
+                 row[f"pme_{cat}_n"] = 0
+                 for d in range(_embed_dim):
+                     row[f"pme_{cat}_{d}"] = 0.0
+             return {"ok": True, "row": row}
+         hit_proteins = [proteins[i] for i in hit_indices]
+         hit_matrix = _embed_proteins(hit_proteins)
+         gi_to_ri = {gi: ri for ri, gi in enumerate(hit_indices)}
+         for cat in CATEGORIES:
+             idxs: set[int] = set()
+             for marker, gis in marker_idx.items():
+                 if MARKER_TO_CATEGORY.get(marker) == cat:
+                     idxs.update(gis)
+             row[f"pme_{cat}_n"] = len(idxs)
+             if idxs:
+                 rows = [gi_to_ri[gi] for gi in idxs if gi in gi_to_ri]
+                 if rows:
+                     cat_mean = hit_matrix[rows].mean(axis=0).astype(np.float32)
+                     for d, v in enumerate(cat_mean):
+                         row[f"pme_{cat}_{d}"] = float(v)
+                     continue
+             for d in range(_embed_dim):
+                 row[f"pme_{cat}_{d}"] = 0.0
+         return {"ok": True, "row": row}
+     except Exception as exc:
+         return {"ok": False, "reason": f"{type(exc).__name__}: {exc}",
+                 "bacdive_id": bacdive_id, "accession": accession}
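
Reviewer note: `_embed_proteins` turns per-token ESM-2 states into one vector per protein by masked mean pooling, so padding tokens do not dilute the average. The sketch below restates that step in plain Python on nested lists, purely as an illustration. It is not the service code, which does the same arithmetic on batched torch tensors, and `masked_mean_pool` is a hypothetical helper name.

```python
def masked_mean_pool(hidden, mask):
    """Average token vectors where mask == 1.

    hidden: list of token vectors, shape [tokens][dim].
    mask:   list of 0/1 flags, one per token (0 marks padding).
    """
    dim = len(hidden[0])
    count = max(sum(mask), 1)  # mirrors clamp(min=1): never divide by zero
    return [
        sum(vec[d] * m for vec, m in zip(hidden, mask)) / count
        for d in range(dim)
    ]


# Two real tokens followed by one padding token that must be ignored.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(masked_mean_pool(hidden, mask))  # [2.0, 3.0]
```

The service then averages these per-protein vectors again within each marker category, which is why every `pme_<cat>_<dim>` column is a mean of means.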
cerebrium/kofam/cerebrium.toml ADDED
@@ -0,0 +1,27 @@
+ [cerebrium.deployment]
+ name = "kofam"
+ python_version = "3.11"
+ docker_base_image_url = "debian:bookworm-slim"
+ disable_auth = false
+ include = ['./*', 'main.py', 'cerebrium.toml', 'kofam_relevant.hmm', 'ko_thresholds.tsv']
+ exclude = ['.*', '__pycache__', '*.pyc']
+
+ [cerebrium.hardware]
+ cpu = 8.0
+ memory = 8.0
+ compute = "CPU"
+ region = "us-east-1"
+
+ [cerebrium.scaling]
+ min_replicas = 2
+ max_replicas = 10
+ cooldown = 300
+ replica_concurrency = 1
+ scaling_metric = "concurrency_utilization"
+ scaling_target = 80
+ response_grace_period = 600
+
+ [cerebrium.dependencies.pip]
+ pyhmmer = ">=0.12"
+ pyrodigal = ">=3.5"
+ requests = ">=2.32"
cerebrium/kofam/main.py ADDED
@@ -0,0 +1,182 @@
+ """KOfam scan service: runs on a Cerebrium CPU container.
+
+ scan_genome(accession) → {"ok": bool, "ko_hits": [...]} or {"ok": False, "reason": ...}.
+
+ The relevant-KO HMM library (~734 MB) and per-KO bitscore thresholds are baked
+ into the image via `include` in cerebrium.toml, so each replica loads them
+ once at startup.
+ """
+ import io
+ import os
+ import time
+ import zipfile
+
+ import pyhmmer
+ import pyhmmer.easel
+ import pyhmmer.plan7
+ import pyrodigal
+ import requests
+
+ DATASETS_URL = "https://api.ncbi.nlm.nih.gov/datasets/v2/genome/accession/{acc}/download"
+ VERSION_FALLBACKS = (".1", ".2", ".3", ".4")
+ EMPTY_ZIP_BYTES = 2_000
+ DEFAULT_EVALUE = 1e-5
+
+ def _find(name: str) -> str:
+     here = os.path.dirname(os.path.abspath(__file__))
+     for candidate in (
+         os.path.join(here, name),
+         os.path.join("/cortex", name),
+         os.path.join("/cortex/app", name),
+         name,
+     ):
+         if os.path.exists(candidate):
+             return candidate
+     raise FileNotFoundError(f"{name} not found (tried cwd, /cortex, /cortex/app)")
+
+
+ HMM_PATH = _find("kofam_relevant.hmm")
+ THRESHOLDS_PATH = _find("ko_thresholds.tsv")
+
+ _alphabet = pyhmmer.easel.Alphabet.amino()
+ with pyhmmer.plan7.HMMFile(HMM_PATH) as _fh:
+     _hmms = list(_fh)
+ _thresholds: dict[str, float] = {}
+ with open(THRESHOLDS_PATH) as _fh:
+     next(_fh)
+     for _line in _fh:
+         _parts = _line.rstrip("\n").split("\t")
+         if len(_parts) < 2:
+             continue
+         try:
+             _thresholds[_parts[0]] = float(_parts[1])
+         except (TypeError, ValueError):
+             _thresholds[_parts[0]] = 0.0
+ _ncbi_key = os.environ.get("NCBI_API_KEY")
+ print(f"[boot] loaded {len(_hmms):,} HMMs, {len(_thresholds):,} thresholds, "
+       f"ncbi_key={'yes' if _ncbi_key else 'no'}", flush=True)
+
+
+ def _has_version(acc: str) -> bool:
+     return "." in acc and acc.rsplit(".", 1)[-1].isdigit()
+
+
+ def _candidates(acc: str) -> list[str]:
+     return [acc] if _has_version(acc) else [acc + v for v in VERSION_FALLBACKS]
+
+
+ def _fetch_fasta(acc: str) -> list[tuple[str, str]] | None:
+     rate = 0.1 if _ncbi_key else 0.34
+     headers = {"Accept": "application/zip"}
+     if _ncbi_key:
+         headers["api-key"] = _ncbi_key
+     params = {"include_annotation_type": "GENOME_FASTA"}
+     for cand in _candidates(acc):
+         zip_bytes: bytes | None = None
+         for attempt in range(3):
+             try:
+                 time.sleep(rate)
+                 resp = requests.get(
+                     DATASETS_URL.format(acc=cand), params=params,
+                     headers=headers, timeout=120,
+                 )
+                 if resp.status_code == 404:
+                     break
+                 if resp.status_code in (429, 502, 503):
+                     time.sleep(2 ** attempt)
+                     continue
+                 resp.raise_for_status()
+             except requests.RequestException:
+                 if attempt == 2:
+                     break
+                 time.sleep(2 ** attempt)
+                 continue
+             if len(resp.content) < EMPTY_ZIP_BYTES:
+                 break
+             zip_bytes = resp.content
+             break
+         if zip_bytes is None:
+             continue
+         try:
+             with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
+                 fna = [n for n in zf.namelist() if n.endswith(".fna")]
+                 if not fna:
+                     continue
+                 with zf.open(fna[0]) as src:
+                     raw = src.read()
+         except zipfile.BadZipFile:
+             continue
+         return _parse_fasta(raw)
+     return None
+
+
+ def _parse_fasta(raw: bytes) -> list[tuple[str, str]]:
+     contigs: list[tuple[str, str]] = []
+     cur: str | None = None
+     chunks: list[str] = []
+     for line in raw.splitlines():
+         if not line:
+             continue
+         if line.startswith(b">"):
+             if cur is not None:
+                 contigs.append((cur, "".join(chunks).upper()))
+             cur = line[1:].decode("ascii", errors="replace").split()[0]
+             chunks = []
+         else:
+             chunks.append(line.decode("ascii", errors="replace"))
+     if cur is not None:
+         contigs.append((cur, "".join(chunks).upper()))
+     return contigs
+
+
+ def _predict_proteins(contigs: list[tuple[str, str]]) -> list[str]:
+     encoded = [(n, s.encode("ascii")) for n, s in contigs]
+     total_nt = sum(len(s) for _, s in encoded)
+     if total_nt >= 20_000:
+         finder = pyrodigal.GeneFinder(meta=False)
+         try:
+             finder.train(b"TTAATTAATTAA".join(s for _, s in encoded))
+         except Exception:
+             finder = pyrodigal.GeneFinder(meta=True)
+     else:
+         finder = pyrodigal.GeneFinder(meta=True)
+     proteins: list[str] = []
+     for _, s in encoded:
+         for gene in finder.find_genes(s):
+             proteins.append(gene.translate().rstrip("*"))
+     return proteins
+
+
+ def _scan(proteins: list[str]) -> set[str]:
+     seqs = []
+     for i, prot in enumerate(proteins):
+         if not prot:
+             continue
+         ts = pyhmmer.easel.TextSequence(name=f"p{i}".encode(), sequence=prot)
+         seqs.append(ts.digitize(_alphabet))
+     found: set[str] = set()
+     if not seqs:
+         return found
+     for top_hits in pyhmmer.hmmer.hmmsearch(_hmms, seqs, E=DEFAULT_EVALUE, cpus=8):
+         raw = top_hits.query.name
+         ko = raw.decode() if isinstance(raw, bytes) else raw
+         thr = _thresholds.get(ko, 0.0)
+         for hit in top_hits:
+             if hit.score >= thr and hit.evalue <= DEFAULT_EVALUE:
+                 found.add(ko)
+                 break
+     return found
+
+
+ def scan_genome(accession: str) -> dict:
+     try:
+         contigs = _fetch_fasta(accession)
+         if not contigs:
+             return {"ok": False, "reason": "fetch_empty", "accession": accession}
+         proteins = _predict_proteins(contigs)
+         if not proteins:
+             return {"ok": False, "reason": "no_proteins", "accession": accession}
+         ko_hits = _scan(proteins)
+         return {"ok": True, "accession": accession, "ko_hits": sorted(ko_hits)}
+     except Exception as exc:
+         return {"ok": False, "reason": f"{type(exc).__name__}: {exc}", "accession": accession}
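Review note: three pieces of `main.py` are pure functions and can be exercised without the HMM library or network access: the accession version fallback, the threshold-table parser, and the score/E-value gate applied to each hit. The sketch below restates them as standalone functions. The names `candidates`, `load_thresholds`, and `passes` are hypothetical, and the two-column sample table at the end is an assumption about the TSV layout, which this diff does not show.

```python
import io

VERSION_FALLBACKS = (".1", ".2", ".3", ".4")


def candidates(acc: str) -> list[str]:
    """Return the accession itself if it carries a numeric version, else the versions to try."""
    has_version = "." in acc and acc.rsplit(".", 1)[-1].isdigit()
    return [acc] if has_version else [acc + v for v in VERSION_FALLBACKS]


def load_thresholds(handle) -> dict[str, float]:
    """Parse a tab-separated KO/threshold table, skipping the header row."""
    thresholds: dict[str, float] = {}
    next(handle, None)  # header
    for line in handle:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue
        try:
            thresholds[parts[0]] = float(parts[1])
        except ValueError:
            # A non-numeric threshold falls back to 0.0, as in the service code.
            thresholds[parts[0]] = 0.0
    return thresholds


def passes(ko: str, score: float, evalue: float, thresholds: dict[str, float],
           max_evalue: float = 1e-5) -> bool:
    """A hit counts only if it clears the per-KO bitscore and the global E-value cutoff."""
    return score >= thresholds.get(ko, 0.0) and evalue <= max_evalue


# Illustrative table only; the column names are assumed.
sample = io.StringIO("knum\tthreshold\nK00001\t100.5\nK00002\t-\n")
sample_thresholds = load_thresholds(sample)
```

With a 0.0 fallback, a KO lacking a usable threshold is accepted on the E-value cutoff alone, which is worth keeping in mind when reading `ko_hits`.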
docs/hybrid_predictor.md ADDED
@@ -0,0 +1,89 @@
+ # Hybrid Predictor
+
+ The hybrid predictor combines the strongest current model for each phenotype:
+
+ - Temperature: saved tabular XGBoost phenotype head
+ - pH: saved tabular XGBoost phenotype head
+ - Salt: saved tabular XGBoost phenotype head
+ - Oxygen: fold-0 all-task LoRA checkpoint
+
+ Script:
+
+ ```bash
+ PYTHONPATH=src uv run --python 3.11 --extra dev --extra embeddings python scripts/39_predict_hybrid.py \
+   --features data/training_table.parquet \
+   --marker-sequences data/marker_sequences.jsonl \
+   --limit 25 \
+   --device mps \
+   --output artifacts/hybrid_predictions.parquet
+ ```
+
+ Use `--device cuda` on a GPU host, `--device mps` on Apple Silicon, or omit the flag
+ to let PyTorch choose CUDA when available and CPU otherwise. CPU LoRA inference is
+ slow because ESM-2 encodes multiple marker proteins per genome.
+
+ For larger uncultured-genome batches, use chunked output so progress is durable:
+
+ ```bash
+ PYTHONPATH=src uv run --python 3.11 --extra dev --extra embeddings python scripts/39_predict_hybrid.py \
+   --features artifacts/uncultured_predictions.parquet \
+   --marker-sequences data/uncultured_marker_sequences.jsonl \
+   --join left \
+   --reuse-existing-tabular \
+   --device mps \
+   --batch-size 2 \
+   --chunk-size 250 \
+   --chunk-output-dir artifacts/hybrid_chunks \
+   --resume-chunks \
+   --progress-every 25 \
+   --output artifacts/hybrid_predictions.parquet
+ ```
+
+ `--resume-chunks` skips existing chunk files and combines all expected chunks into
+ the final output when the run finishes. `--reuse-existing-tabular` keeps previously
+ materialized temperature, pH, salt, and media outputs while replacing oxygen with
+ LoRA where marker sequences are available.
+
+ ## Inputs
+
+ `--features` accepts `.parquet`, `.csv`, `.json`, or `.jsonl` feature rows. Rows must
+ include `genome_accession` by default. The tabular heads load their feature column
+ order from `models/phenotype/feature_cols.json`.
+
+ `--marker-sequences` must be a JSONL file with the same schema as
+ `data/marker_sequences.jsonl`:
+
+ ```json
+ {
+   "bacdive_id": 1,
+   "genome_accession": "GCF_004341595.1",
+   "by_category": {
+     "temperature": ["..."],
+     "ph": ["..."],
+     "oxygen": ["..."],
+     "salt": ["..."],
+     "vitamin": ["..."],
+     "nitrogen": ["..."],
+     "carbon": ["..."],
+     "special": ["..."]
+   }
+ }
+ ```
+
+ For uncultured genomes, first prepare a marker-sequence JSONL for those same
+ `genome_accession` values. The existing `data/marker_sequences.jsonl` is keyed to
+ BacDive training rows, not the uncultured candidate table.
+
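Review note: when preparing a marker-sequence JSONL for new genomes, a quick structural check per line can catch schema drift before a long inference run. The sketch below checks only what the example record above shows. `check_record` is a hypothetical helper, and treating all eight categories as required (and `bacdive_id` as optional) is an assumption, since the doc does not say which fields the loader enforces.

```python
import json

CATEGORIES = ("temperature", "ph", "oxygen", "salt",
              "vitamin", "nitrogen", "carbon", "special")


def check_record(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty if none)."""
    problems = []
    rec = json.loads(line)
    if not isinstance(rec.get("genome_accession"), str):
        problems.append("genome_accession missing or not a string")
    by_cat = rec.get("by_category")
    if not isinstance(by_cat, dict):
        return problems + ["by_category missing or not an object"]
    for cat in CATEGORIES:
        seqs = by_cat.get(cat)
        if not isinstance(seqs, list) or not all(isinstance(s, str) for s in seqs):
            problems.append(f"by_category.{cat} is not a list of strings")
    return problems
```

Running it over every line and printing the line number alongside any problems is usually enough to locate a bad export.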
+ ## Output
+
+ The output includes:
+
+ - `pred_optimal_temperature_c` plus 80% interval columns
+ - `pred_optimal_ph` plus 80% interval columns
+ - `pred_salt_tolerance_pct` plus 80% interval columns
+ - `pred_oxygen_requirement`
+ - `pred_oxygen_requirement_confidence`
+ - one probability column per LoRA oxygen class
+
+ The default output is `artifacts/hybrid_predictions.parquet`; `.csv`, `.json`, and
+ `.jsonl` outputs are also supported.
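Review note: the doc does not state how `pred_oxygen_requirement_confidence` relates to the per-class probability columns. One common convention, assumed in the sketch below and not confirmed by this commit, is that the predicted label is the most probable class and the confidence is that class's probability. `top_class` is a hypothetical helper.

```python
def top_class(probabilities: dict[str, float]) -> tuple[str, float]:
    """Return the most probable class and its probability."""
    if not probabilities:
        raise ValueError("no class probabilities given")
    label = max(probabilities, key=probabilities.get)
    return label, probabilities[label]
```

If the script uses a different rule (for example a calibrated score), the probability columns remain the more reliable quantity to filter on.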
docs/lora_results.md ADDED
@@ -0,0 +1,85 @@
+ # LoRA Fold 0 Results
+
+ This page records the completed fold 0 LoRA experiments and the checkpoint to use.
+
+ ## Recommendation
+
+ Use the original all-task fold 0 checkpoint:
+
+ - Local checkpoint path: `artifacts/lora/fold0_best.pt`
+ - Durable release asset: `fold0_best_all_task.pt`
+ - GitHub release: https://github.com/miyu-horiuchi/microbe-model/releases/tag/lora-fold0-20260518
+
+ The all-task checkpoint is the best current fold 0 LoRA result. Oxygen-only training
+ and the anaerobe-weighted run were useful checks, but neither improved the clean
+ validation comparison enough to replace the original checkpoint.
+
+ ## Experiments
+
+ All runs used fold 0, ESM-2 t12, LoRA `r=8`, one epoch, batch size 2, gradient
+ accumulation 8, and Lambda A100 SXM4 GPU training.
+
+ | Run | Local result file | Oxygen macro F1 | Oxygen n | Use? |
+ |---|---|---:|---:|---|
+ | All-task LoRA | `artifacts/lora/fold0_results.json` | 0.944823 | 2266 | Yes |
+ | Oxygen-only LoRA | `artifacts/lora/fold0_results_oxygen.json` | 0.916836 | 2214 | No |
+ | Anaerobe-weighted all-task LoRA | `artifacts/lora_weighted_anaerobe/fold0_results.json` | 0.944776 | 2266 | No |
+
+ The anaerobe-weighted run used oxygen class weights:
+
+ ```text
+ aerobe=1.0, anaerobe=1.5, facultative_anaerobe=1.0, microaerobe=1.0
+ ```
+
+ It slightly improved anaerobe recall in the detailed diagnostic, but its fold 0
+ training-validation oxygen macro F1 was fractionally lower than the all-task run.
+
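Review note: the trainer's loss code is not part of this diff, so how the class weights enter the objective is an assumption here. The sketch below shows the usual class-weighted cross-entropy, normalised by the sum of the weights of the target classes (the same normalisation PyTorch's `CrossEntropyLoss` applies with `weight=` and mean reduction). `weighted_cross_entropy` is a hypothetical name.

```python
import math

CLASS_WEIGHTS = {"aerobe": 1.0, "anaerobe": 1.5,
                 "facultative_anaerobe": 1.0, "microaerobe": 1.0}


def weighted_cross_entropy(probs, labels, weights=CLASS_WEIGHTS):
    """Weighted mean of -log p(true class), normalised by the sum of weights.

    probs: one dict per example mapping class name to predicted probability.
    labels: the true class name for each example.
    """
    total = 0.0
    weight_sum = 0.0
    for p, y in zip(probs, labels):
        w = weights[y]
        total += -w * math.log(p[y])
        weight_sum += w
    return total / weight_sum
```

With a weight of 1.5, each misclassified anaerobe contributes 1.5 times the gradient signal of an example from any other class, which is consistent with a small recall gain on that class and little change elsewhere.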
+ ## Checkpoint Assets
+
+ The `.pt` files are not committed to git. They are stored as GitHub Release assets:
+
+ | Asset | SHA256 |
+ |---|---|
+ | `fold0_best_all_task.pt` | `8a73ee20252b1aa710b0480abd307ffbc38b788b1a152a7e63298c525a04be23` |
+ | `fold0_best_oxygen_only.pt` | `fd10d4a2a7cba5d564fb9ba1f730cace07a0a2173d3622f1f572cfd29306fc95` |
+ | `fold0_best_weighted_anaerobe.pt` | `c8d34999f570663e020e5644a994f821bf9539a6fcc3e029d5942b8dc7709826` |
+
+ ## Loading The Best Checkpoint
+
+ The checkpoint is a PyTorch dictionary with these keys:
+
+ - `epoch`
+ - `model_cfg`
+ - `train_cfg`
+ - `state_dict`
+
+ Minimal load pattern:
+
+ ```python
+ import torch
+
+ from microbe_model.train.lora_model import LoraModelConfig, PhenoLoRAModel
+
+ checkpoint = torch.load("artifacts/lora/fold0_best.pt", map_location="cpu")
+ model_cfg = LoraModelConfig(**checkpoint["model_cfg"])
+ model = PhenoLoRAModel(model_cfg)
+ model.load_state_dict(checkpoint["state_dict"], strict=False)
+ model.eval()
+ ```
+
+ To regenerate oxygen diagnostics:
+
+ ```bash
+ PYTHONPATH=src uv run --python 3.11 --extra dev python scripts/38_eval_lora_checkpoint.py \
+   --checkpoint artifacts/lora/fold0_best.pt \
+   --output-json artifacts/lora/fold0_oxygen_diagnostics.json \
+   --output-md artifacts/lora/fold0_oxygen_diagnostics.md \
+   --batch-size 2
+ ```
+
+ ## Next GPU Work
+
+ Do not spend more GPU on fold 0 variants unless there is a new hypothesis. The next
+ meaningful validation step is to run the selected all-task LoRA setup across folds
+ 1-4 and report the mean and variance across all five folds. That is a stronger
+ scientific result, but it is also the next major GPU-cost item.
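Review note: once per-fold oxygen macro F1 values exist for folds 0 to 4, the summary proposed above is a few lines of standard-library code. The sketch below assumes the scores are passed in as plain floats; `summarize_folds` is a hypothetical name, and using the sample (n-1) variance rather than the population variance is a choice made here, not something the doc specifies.

```python
import statistics


def summarize_folds(scores):
    """Mean, sample variance, and sample standard deviation of per-fold scores."""
    if len(scores) < 2:
        raise ValueError("need at least two folds to estimate variance")
    return {
        "n_folds": len(scores),
        "mean": statistics.fmean(scores),
        "variance": statistics.variance(scores),  # sample variance (n - 1)
        "stdev": statistics.stdev(scores),
    }
```

With only five folds the variance estimate is itself noisy, so reporting the individual fold scores alongside the summary is worth considering.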
kaggle/README.md ADDED
@@ -0,0 +1,85 @@
+ # Kaggle migration: LoRA fine-tune of ESM-2 on BacDive phenotypes
+
+ This directory packages everything you need to run the LoRA fine-tune on Kaggle's
+ free P100 GPU (30 hr/week quota). Costs $0; one epoch fits in a single 12-h session.
+
+ ## One-time setup
+
+ 1. **Make a Kaggle account** and verify it via phone (required to enable internet and
+    GPU). https://www.kaggle.com/account
+ 2. **Install the Kaggle CLI** locally so you can upload datasets without dragging
+    1.3 GB through the browser:
+
+    ```bash
+    uv pip install kaggle
+    ```
+
+ 3. **Download your Kaggle API token** from https://www.kaggle.com/settings → "Create
+    new API token". Save it to `~/.kaggle/kaggle.json` and `chmod 600 ~/.kaggle/kaggle.json`.
+
+ ## Upload the three datasets
+
+ Run the helper, which packages and uploads all three at once. On later invocations
+ it pushes a "new version" of each dataset instead, so the Kaggle URLs stay stable:
+
+ ```bash
+ export KAGGLE_USERNAME=<your-kaggle-handle>
+ bash kaggle/upload.sh
+ ```
+
+ This creates (or updates) three datasets under your account:
+
+ | Slug | Size | License | Contents |
+ |---|---|---|---|
+ | `bacdive-marker-sequences` | 1.3 GB | CC0-1.0 | `marker_sequences.jsonl` |
+ | `bacdive-tables` | ~50 MB | CC-BY-4.0 | `bacdive_phenotypes.parquet`, `strain_catalog.parquet` |
+ | `microbe-model-code` | ~120 KB | MIT | The `microbe_model/` Python package |
+
+ If you ever need to wipe and re-stage the local copies, run with `FORCE_RECREATE=1 bash kaggle/upload.sh`.
+
+ ## Running on Kaggle
+
+ 1. Open https://www.kaggle.com → "Create" → "New Notebook".
+ 2. Upload `kaggle/lora_train_kaggle.ipynb` (or paste the contents).
+ 3. **Settings (right rail):**
+    - Accelerator → **GPU P100**
+    - Persistence → **Files only** (so checkpoints survive between sessions)
+    - Internet → **on** (needed to fetch ESM-2 weights from HuggingFace)
+ 4. **Add inputs (right rail):**
+    - `<YOUR-KAGGLE-USERNAME>/bacdive-marker-sequences`
+    - `<YOUR-KAGGLE-USERNAME>/bacdive-tables`
+    - `<YOUR-KAGGLE-USERNAME>/microbe-model-code`
+ 5. Adjust the input paths in cell 2 of the notebook to match the names of the
+    datasets you uploaded (Kaggle slugifies them, so the folder name under
+    `/kaggle/input/` will be e.g. `bacdive-marker-sequences/`).
+ 6. **Run all cells.** Training will print loss every 50 steps; expect ~10 h for
+    one epoch of fold 0 on P100. The trainer saves `fold0_best.pt` and
+    `fold0_results.json` to `/kaggle/working/`, both downloadable from the Output
+    tab when the session ends.
+
+ ## Resuming across sessions (only if you want >1 epoch)
+
+ If you need 3 epochs total, your simplest path is three separate Kaggle sessions,
+ each running 1 epoch starting from the previous session's checkpoint:
+
+ 1. After session 1 finishes, **download** `/kaggle/working/fold0_best.pt`.
+ 2. **Create a new Kaggle Dataset** called `lora-fold0-ckpt-epoch1` containing it.
+ 3. In session 2, add this dataset as input and load the checkpoint state-dict
+    into the model before training begins. Add a cell like:
+
+    ```python
+    import torch
+    ckpt = torch.load("/kaggle/input/lora-fold0-ckpt-epoch1/fold0_best.pt")
+    model.load_state_dict(ckpt["state_dict"], strict=False)
+    ```
+ 4. Repeat for session 3.
+
+ For a "publishable, modest-cost LoRA result", running 1 epoch in 1 session is
+ usually enough; LoRA reaches most of its gain in the first pass through the data.
+
+ ## What to do with the result
+
+ When the Kaggle run finishes, download `fold0_results.json` from
+ `/kaggle/working/` and drop it at `artifacts/lora/fold0_results.json` locally.
+ A follow-up script will compare the LoRA per-target metrics to the
+ frozen-PTPE XGBoost baseline in `artifacts/baseline_results.json`.
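Review note: the follow-up comparison script is not included in this commit, and the layout of `fold0_results.json` and `baseline_results.json` is not shown in this hunk. As a rough sketch, assuming both files can be reduced to a `{target: {metric: value}}` mapping (an assumption, as is the helper name `metric_deltas`), the comparison reduces to a keyed difference:

```python
def metric_deltas(lora: dict, baseline: dict) -> dict:
    """For each target and metric present in both inputs, return lora minus baseline."""
    deltas = {}
    for target in sorted(lora.keys() & baseline.keys()):
        shared = lora[target].keys() & baseline[target].keys()
        deltas[target] = {m: lora[target][m] - baseline[target][m] for m in sorted(shared)}
    return deltas
```

Restricting the comparison to shared targets and metrics avoids silently treating a missing baseline entry as zero; whether a positive delta is an improvement depends on the metric (higher is better for F1, lower for error measures).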