Spaces: miyuiu/microbe-model

Miyu Horiuchi committed 13 days ago

Commit a4202fc (1 parent: 572e624)

Update README for hybrid UI

README.md +178 -45

@@ -19,8 +19,42 @@ in pure culture.
 ## Status
-v0 — tabular baseline against the full BacDive corpus. No deep model yet; the point of the
-v0 is to establish the ceiling on tabular performance before investing in transformers.
 ## Approach
@@ -31,19 +65,27 @@ NCBI Datasets v2 (genomes) ─────┘
                                           │
                                           ▼
                                  streaming featurize
-                                 (download → pyrodigal → AA-composition → discard FASTA)
                                           │
                                           ▼
                                  multi-task XGBoost
-                                 (5-fold GroupKFold by taxonomic family)
                                           │
                                           ▼
-                                 eval report (`artifacts/eval_report.md`)
 ```
-The genome→phenotype features used here have well-established correlations with the target
-properties (proteome amino acid composition correlates with optimal growth temperature, GC
-content with thermophily, etc.), so even a tabular model has a real signal to learn from.
 ## Setup
@@ -58,75 +100,166 @@ cp .env.example .env
 ## Running the pipeline
 ```bash
-# 1. Scan BacDive (~10 min for full corpus). Writes data/bacdive_phenotypes.parquet.
-uv run python scripts/01_fetch_bacdive.py --end 200000
-# 2. Streaming fetch + featurize all training-ready strains (~5 hr for ~17K strains
-#    with 7 worker processes). Writes data/features.parquet. Resumable.
-uv run python scripts/02_fetch_and_featurize.py --require-target any
-# 3. Multi-task XGBoost. Writes artifacts/baseline_results.json + predictions.parquet.
-uv run python scripts/03_train_baseline.py
-# 4. Render the eval report. Writes artifacts/eval_report.md.
-uv run python scripts/04_eval.py
-# 5. Top-level overnight summary. Writes OVERNIGHT_SUMMARY.md.
-uv run python scripts/05_overnight_summary.py
 ```
-For overnight runs, `scripts/run_train_and_eval.sh` chains 03 → 04 → 05.
 ## Architecture
 - **`src/microbe_model/data/bacdive.py`** — v2 REST client (public, no auth). Discovers
   strains by batch-scanning the integer ID range; ~150K live records in 2K calls.
 - **`src/microbe_model/pipeline.py`** — streaming fetch + featurize. Each worker process
   downloads a genome FASTA, runs pyrodigal, extracts features, and discards the FASTA —
   no persistent genome storage. Resumable via the JSONL append log.
 - **`src/microbe_model/features/genome.py`** — pyrodigal CDS prediction + amino-acid
-  composition features. Uses single-genome+train mode (~7× faster than meta on assembled
-  genomes) with a meta fallback for very short input.
 - **`src/microbe_model/train/baseline.py`** — multi-task XGBoost with per-fold class
-  re-encoding for classification (handles non-contiguous class subsets).
-- **`src/microbe_model/eval.py`** — markdown report renderer. TL;DR + corpus + per-target
-  metrics + per-family error breakdown + limitations + next steps.
 ## Layout
 ```
 src/microbe_model/
-  config.py          # paths, env vars, prediction targets
-  data/bacdive.py    # BacDive v2 client
-  features/genome.py # pyrodigal + AA-composition feature extraction
-  pipeline.py        # streaming async fetch + featurize
-  train/baseline.py  # multi-task XGBoost + GroupKFold
-  eval.py            # markdown report renderer
-scripts/             # numbered pipeline entry points (01–05)
-tests/               # 15 unit + integration tests
-data/                # (gitignored) parquet tables, JSONL features, BacDive cache
-artifacts/           # eval report, training results, logs
 ```
 ## What this is *not* yet
-- Not a foundation model. No transformer. No genome language model.
-- Not a platform. There is no upload UI or active-learning loop.
 - Not validated against truly out-of-distribution organisms (BacDive is survivorship-biased
   to organisms that have been cultured at least once).
 These are deliberate v0 boundaries — see `OVERNIGHT_SUMMARY.md` after a run for the
 headline result and `artifacts/eval_report.md` for the full eval.
-## v1 backlog
-1. Tetranucleotide and codon-usage features
-2. LPSN/GTDB family proper join (for tighter GroupKFold)
-3. KOMODO media DB integration as a richer label source
-4. Pyrodigal-GV for atypical genetic codes
-5. Genome-LM embeddings (Nucleotide Transformer / Evo-1 / DNABERT-2)
-6. Active learning loop: highlight novel-family strains where the model is uncertain,
-   prioritize for wet-lab cultivation testing.
 ## Environment variables

 ## Status
+v5 — hybrid predictor on top of the v4 multi-source feature stack.
+46,029 BacDive strains over 22,300 unique genomes, each described by **six parallel
+feature paths** that XGBoost then combines (6,312 features total per genome):
+1. **Composition / codon / tetranucleotide statistics** (~355 cols, v0)
+2. **MediaDive recipe metadata** (medium pH, NaCl content of media this strain grows on)
+3. **Curated Pfam HMM markers** (144 cols, 8 categories: T_opt, pH, oxygen, salt,
+   vitamins, nitrogen, carbon, special)
+4. **KEGG module completeness** — fractional 0–1 score for 570 metabolic pathways via
+   completed KOfam scan + KEGG module rules
+5. **Isolation metadata** — country / continent / lat-lon / collection year / inferred
+   host kingdom from raw BacDive JSONs (46 cols + 65 one-hot category encodings)
+6. **Phenotype-targeted ESM-2 embeddings (PTPE)** — for each genome, embed only the
+   proteins matching curated phenotype-relevant HMMs (cytochromes for oxygen, heat-shock
+   for temperature, Na⁺/H⁺ antiporters for pH/salt, etc.) with frozen ESM-2 t30, then
+   mean-pool per category. 8 categories × 641 dims = 5,128 cols.
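The per-category pooling in path 6 can be sketched as below. This is a minimal, dependency-free illustration and not the repository's code: `pool_by_category`, the category names, and the toy 4-dimensional vectors are hypothetical. The layout of each 641-column block (a 640-dim mean embedding plus one hit-count column) is an assumption; the README only states the 8 × 641 = 5,128 total.

```python
from collections import defaultdict

EMBED_DIM = 4  # toy size for the example; ESM-2 t30 (150M) emits 640-dim vectors


def pool_by_category(hits, categories, dim=EMBED_DIM):
    """Mean-pool protein embeddings per phenotype category.

    hits: iterable of (category, embedding) pairs, one per HMM-matched protein.
    Returns a flat list of len(categories) * (dim + 1) floats. For each
    category: the mean embedding followed by the hit count (assumed layout).
    """
    grouped = defaultdict(list)
    for category, vec in hits:
        if len(vec) != dim:
            raise ValueError(f"expected {dim} dims, got {len(vec)}")
        grouped[category].append(vec)

    features = []
    for category in categories:
        vecs = grouped.get(category, [])
        if vecs:
            mean = [sum(col) / len(vecs) for col in zip(*vecs)]
        else:
            mean = [0.0] * dim  # no matching protein: fall back to zeros
        features.extend(mean)
        features.append(float(len(vecs)))
    return features


categories = ["oxygen", "temperature"]  # hypothetical category names
hits = [
    ("oxygen", [1.0, 0.0, 2.0, 4.0]),
    ("oxygen", [3.0, 2.0, 0.0, 0.0]),
]
row = pool_by_category(hits, categories)
print(len(row))  # 2 categories * (4 + 1) columns
```

With 8 categories and 640-dim embeddings the same layout would give 8 × (640 + 1) = 5,128 columns, which matches the count quoted above.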
+Current tabular 5-fold GroupKFold CV (full feature stack):
+| Target | Metric | v3 (pre-PTPE) | v4 (+PTPE) | Δ |
+|---|---|---|---|---|
+| optimal_temperature_c | MAE °C | 2.74 | **2.67** | −2.4% |
+| optimal_ph | MAE | 0.47 | **0.47** | −1.0% |
+| oxygen_requirement | F1-macro | 0.41 | **0.40** | −2.4% (slight regression) |
+| salt_tolerance_pct | MAE % | 1.94 | **1.92** | −1.1% |
+PTPE gives a modest lift on the three regressors (1.0–2.4% lower MAE) and slightly hurts
+oxygen F1. Frozen mean-pooling may not be unlocking the protein language model (PLM)
+signal. A first fold-0 LoRA fine-tune of ESM-2 t12 is now complete and is strongest for
+oxygen classification: the best all-task LoRA checkpoint reaches `0.9448` oxygen macro F1
+on fold 0, versus `0.4020` for the current tabular five-fold mean. Oxygen-only and
+anaerobe-weighted variants did not beat the original all-task checkpoint. See
+[docs/lora_results.md](docs/lora_results.md) for the checkpoint release, metrics, and load instructions.
+For practical prediction, use the hybrid predictor in [docs/hybrid_predictor.md](docs/hybrid_predictor.md):
+tabular XGBoost heads for temperature/pH/salt plus LoRA for oxygen. The deployed UI
+surfaces whether each oxygen value came from `LoRA` or the `tabular` fallback.
 ## Approach
                                           │
                                           ▼
                                  streaming featurize
+                                 (download → pyrodigal → discard FASTA)
                                           │
+            ┌─────────────────────────────┼─────────────────────────────┐
+            ▼              ▼              ▼              ▼              ▼
+   composition /   Pfam HMM scan  KEGG/KOfam scan   ESM-2 mean-pool   isolation
+   codon / tetra   (48 markers)   (570 modules)     (t30 on Modal)    metadata
+            └─────────────────────────────┼─────────────────────────────┘
                                           ▼
                                  multi-task XGBoost
+                                 (5-fold GroupKFold by family)
                                           │
                                           ▼
+                      phenotype heads + medium recommender + LoRA oxygen head
 ```
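The "5-fold GroupKFold by family" step in the diagram keeps every taxonomic family entirely on one side of each split. The sketch below is a dependency-free stand-in in the spirit of scikit-learn's `GroupKFold`, not the project's training code; `group_kfold` and the toy family list are hypothetical.

```python
from collections import Counter


def group_kfold(groups, n_splits=5):
    """Yield (train_idx, test_idx) pairs in which no group spans both sides."""
    sizes = Counter(groups)
    if len(sizes) < n_splits:
        raise ValueError("need at least n_splits distinct groups")

    fold_load = [0] * n_splits
    fold_of = {}
    # Greedy balancing: largest groups first, each into the lightest fold.
    for group, size in sorted(sizes.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        lightest = min(range(n_splits), key=fold_load.__getitem__)
        fold_of[group] = lightest
        fold_load[lightest] += size

    for fold in range(n_splits):
        test = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train, test


families = (
    ["Bacillaceae"] * 3
    + ["Vibrionaceae"] * 2
    + ["Halobacteriaceae"] * 2
    + ["Thermaceae"]
)
folds = list(group_kfold(families, n_splits=2))
for train, test in folds:
    held_out = {families[i] for i in test}
    seen = {families[i] for i in train}
    print(sorted(held_out), "never appear in training:", held_out.isdisjoint(seen))
```

Splitting by family rather than by strain means the reported metrics estimate performance on families the model has not seen, which is a stricter test than a random split.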
+The six feature paths are **independent**: each describes the same genome in a different
+way, and XGBoost decides which to weight per phenotype. The marker-importance diagnostic
+shows oxygen leans hard on Pfam HMMs (COX1, hydrogenases), T_opt leans on composition
+(IVYWREL fraction), salt uses both, and KEGG modules are expected to dominate the
+medium-recommendation side because every "missing biosynthetic pathway" maps directly
+to "this compound goes in the recipe."
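The IVYWREL fraction mentioned above is the share of proteome residues that are I, V, Y, W, R, E, or L. A minimal sketch of that single composition feature follows; `ivywrel_fraction` is a hypothetical helper, not necessarily how `features/genome.py` computes it.

```python
IVYWREL = frozenset("IVYWREL")


def ivywrel_fraction(proteins):
    """Fraction of amino-acid residues that are I, V, Y, W, R, E or L."""
    total = 0
    hits = 0
    for seq in proteins:
        for residue in seq.upper():
            if residue.isalpha():  # skip stop symbols such as '*'
                total += 1
                hits += residue in IVYWREL
    return hits / total if total else 0.0


# Two toy "proteins": 5 of the 8 residues are in the IVYWREL set.
fraction = ivywrel_fraction(["MIVE", "LLKA*"])
print(fraction)
```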
 ## Setup
 ## Running the pipeline
 ```bash
+# === core pipeline (composition + tabular features) ===
+uv run python scripts/01_fetch_bacdive.py --end 200000        # ~10 min
+uv run python scripts/02_fetch_and_featurize.py --require-target any  # ~5 hr, resumable
+uv run python scripts/03_train_baseline.py                    # multi-task XGBoost
+uv run python scripts/04_eval.py                              # eval report
+uv run python scripts/05_overnight_summary.py                 # OVERNIGHT_SUMMARY.md
+# === Pfam HMM markers (curated; one-time scan, ~5 hr) ===
+uv run python scripts/23_verify_markers.py                    # validate Pfam IDs
+uv run python scripts/24_unified_hmm_scan.py --workers 8      # scan 22K genomes
+uv run python scripts/25_evaluate_all_targets.py              # A/B lift report
+uv run python scripts/26_marker_importance.py                 # which markers paid off
+# === KEGG module completeness (~570 modules) ===
+uv run python scripts/27_fetch_kegg_modules.py                # ~1 min, REST API
+uv run python scripts/28_kofam_scan.py --fetch-only           # download KOfam, ~10 min
+uv run python scripts/28_kofam_scan.py --workers 8            # full scan, ~14-18 hr
+uv run python scripts/29_compute_kegg_completeness.py         # ~1 min, materialize parquet
+# === Environment-of-origin enrichment (no compute) ===
+uv run python scripts/30_parse_isolation_metadata.py          # ~30 sec; lat/lon/host/etc
+# === ESM-2 protein embeddings ===
+# Local on MPS (slow):
+uv run --extra embeddings python scripts/11_extract_embeddings.py \
+    --model facebook/esm2_t30_150M_UR50D --sample-n 50
+# Or on Modal A10G GPUs (much faster, requires `modal setup` + ncbi-key secret):
+modal run scripts/modal_embed.py
+uv run python scripts/_materialize_embeddings.py             # JSONL → parquet
+# === Phenotype-targeted ESM-2 embeddings (PTPE) ===
+# Embed only proteins matching curated phenotype HMMs, pool per category.
+uv run modal run scripts/modal_per_marker_embed.py --model facebook/esm2_t30_150M_UR50D
+uv run python scripts/_materialize_per_marker_embeddings.py  # JSONL → parquet
+# === Hybrid predictor ===
+# Tabular XGBoost for temperature/pH/salt, fold-0 LoRA for oxygen.
+PYTHONPATH=src uv run --python 3.11 --extra dev --extra embeddings python scripts/39_predict_hybrid.py \
+    --features data/training_table.parquet \
+    --marker-sequences data/marker_sequences.jsonl \
+    --device mps \
+    --output artifacts/hybrid_predictions.parquet
 ```
+For overnight runs, `scripts/run_train_and_eval.sh` chains the core pipeline. The HMM,
+KEGG, and embedding paths are independent — once their per-genome parquets exist
+(`data/hmm_features.parquet`, `data/kegg_modules.parquet`, `data/embeddings.parquet`),
+`03_train_baseline.py` and `10_train_media_recommender.py` auto-merge them.
 ## Architecture
+### Core
 - **`src/microbe_model/data/bacdive.py`** — v2 REST client (public, no auth). Discovers
   strains by batch-scanning the integer ID range; ~150K live records in 2K calls.
 - **`src/microbe_model/pipeline.py`** — streaming fetch + featurize. Each worker process
   downloads a genome FASTA, runs pyrodigal, extracts features, and discards the FASTA —
   no persistent genome storage. Resumable via the JSONL append log.
 - **`src/microbe_model/features/genome.py`** — pyrodigal CDS prediction + amino-acid
+  composition / codon / tetranucleotide features.
 - **`src/microbe_model/train/baseline.py`** — multi-task XGBoost with per-fold class
+  re-encoding for classification.
+- **`src/microbe_model/eval.py`** — markdown report renderer.
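Per-fold class re-encoding matters because a training fold may be missing some classes, while XGBoost's multi-class objectives expect labels in the contiguous range `0..num_class-1`. The sketch below shows the idea only; `encode_fold_labels` and `decode_predictions` are hypothetical names, not the functions in `train/baseline.py`.

```python
def encode_fold_labels(train_labels):
    """Map the classes present in this fold's training set to 0..k-1.

    Returns the encoded labels and the sorted list of original classes,
    which doubles as the local-index -> original-class lookup.
    """
    classes = sorted(set(train_labels))
    to_local = {cls: idx for idx, cls in enumerate(classes)}
    return [to_local[cls] for cls in train_labels], classes


def decode_predictions(local_preds, classes):
    """Translate fold-local class indices back to the original class ids."""
    return [classes[idx] for idx in local_preds]


# Global class ids are 0..3, but class 1 is absent from this training fold.
train = [0, 2, 3, 3, 0]
encoded, classes = encode_fold_labels(train)
print(encoded, classes)

# A model trained on `encoded` predicts local indices; map them back.
restored = decode_predictions([2, 1, 0], classes)
print(restored)
```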
+### Feature paths
+- **`src/microbe_model/features/markers.py`** — 48 verified Pfam markers across 8 categories
+  (T_opt, pH, oxygen, salt, vitamins, nitrogen, carbon, special). All IDs validated via
+  `scripts/23_verify_markers.py` against InterPro DESC fields.
+- **`src/microbe_model/features/kegg_modules.py`** — KEGG module rule parser (boolean
+  AND / OR / parens grammar) + AST evaluator for fractional & strict completeness scoring.
+- **`src/microbe_model/features/embeddings.py`** — frozen ESM-2 forward pass + mean-pool
+  per protein → per-proteome 320/640-dim vector (model-size dependent).
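The KEGG module scoring described above can be sketched as a small recursive-descent parser plus an evaluator. This is a simplified illustration, not the contents of `kegg_modules.py`. It assumes space or `+` means AND, comma means OR (binding more loosely than AND), and no whitespace around commas or just inside parentheses. It ignores KEGG's `-` optional marker, and it scores AND as the mean of its parts and OR as the best alternative. The KO ids in the example are placeholders.

```python
import re

TOKEN = re.compile(r"K\d{5}|[(),]|[ +]+")


def tokenize(definition):
    """Split a module definition into KO ids, parens, ',' (OR) and '&' (AND)."""
    text, pos, tokens = definition.strip(), 0, []
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        tok = match.group()
        tokens.append("&" if tok[0] in " +" else tok)
        pos = match.end()
    return tokens


def parse(tokens):
    """Build an AST of ('ko', id), ('and', [...]) and ('or', [...]) nodes."""
    pos = 0

    def parse_or():
        nonlocal pos
        items = [parse_and()]
        while pos < len(tokens) and tokens[pos] == ",":
            pos += 1
            items.append(parse_and())
        return items[0] if len(items) == 1 else ("or", items)

    def parse_and():
        nonlocal pos
        items = [parse_atom()]
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            items.append(parse_atom())
        return items[0] if len(items) == 1 else ("and", items)

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of definition")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing ')'")
            pos += 1
            return node
        if tok.startswith("K"):
            return ("ko", tok)
        raise ValueError(f"unexpected token {tok!r}")

    tree = parse_or()
    if pos != len(tokens):
        raise ValueError("trailing tokens in definition")
    return tree


def completeness(node, present):
    """Fractional completeness in [0, 1]; 1.0 means strictly complete."""
    kind, payload = node
    if kind == "ko":
        return 1.0 if payload in present else 0.0
    scores = [completeness(child, present) for child in payload]
    return max(scores) if kind == "or" else sum(scores) / len(scores)


rule = parse(tokenize("(K00001,K00002) K00003 (K00004 K00005,K00006)"))
partial = completeness(rule, {"K00002", "K00004"})
full = completeness(rule, {"K00001", "K00003", "K00006"})
print(partial, full, full == 1.0)
```

Under these assumptions the first genome scores 0.5: step one is satisfied, step two is missing, and step three is half complete. The second genome is strictly complete.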
+### Scanners (numbered scripts)
+- **`24_unified_hmm_scan.py`** — pyhmmer scan over the 48-marker Pfam library, dedup'd
+  by genome accession, streams to `data/hmm_features.parquet`.
+- **`28_kofam_scan.py`** — same architecture but against KOfam (~3K KEGG-relevant HMMs);
+  output is per-genome KO sets.
+- **`29_compute_kegg_completeness.py`** — applies the KEGG module rules to KO hits,
+  yields ~570 fractional-completeness columns per genome.
+- **`30_parse_isolation_metadata.py`** — parses raw BacDive JSONs for lat/lon/country/
+  host species; outputs `data/isolation_metadata.parquet` with one-hot encodings.
+- **`modal_embed.py`** — Modal app for ESM-2 t30 (or t33) extraction on A10G GPUs.
+### UI and API
+- **`api/main.py`** — FastAPI backend for the Hugging Face Space. It serves the React
+  build, recommender models, catalog API, NCBI lookup, and on-demand genome prediction.
+- **`web/`** — React/Vite frontend used by the Docker Space at
+  <https://huggingface.co/spaces/miyuiu/microbe-model>.
+- **Hybrid catalog behavior** — `/api/catalog` always loads
+  `artifacts/uncultured_predictions.parquet`; if `artifacts/hybrid_predictions.parquet`
+  exists, the API overlays matching `pred_*` columns by `genome_accession`.
+  Oxygen rows include `O2_source` so the UI can show `LoRA` vs `tabular`.
+- **Live `/api/predict` behavior** — on-demand predictions currently use the deployed
+  tabular phenotype heads and return per-phenotype `source` metadata. LoRA-backed
+  oxygen is used for precomputed hybrid catalog rows when the hybrid artifact is present.
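The catalog overlay described above amounts to a keyed merge. The sketch below uses plain dicts instead of parquet tables so it stays dependency-free. It is an illustration under assumptions, not the code in `api/main.py`: `overlay_hybrid`, the column name `pred_oxygen_requirement`, and the accession strings are hypothetical.

```python
def overlay_hybrid(base_rows, hybrid_rows, key="genome_accession"):
    """Overlay hybrid `pred_*` values onto base catalog rows by accession.

    Rows without a hybrid match keep their tabular predictions. `O2_source`
    records where the oxygen value came from.
    """
    hybrid_by_key = {row[key]: row for row in hybrid_rows}
    merged = []
    for base in base_rows:
        out = dict(base)  # copy so the base catalog is left untouched
        out["O2_source"] = "tabular"
        hybrid = hybrid_by_key.get(base[key])
        if hybrid is not None:
            for col, value in hybrid.items():
                if col.startswith("pred_") and value is not None:
                    out[col] = value
            if hybrid.get("pred_oxygen_requirement") is not None:
                out["O2_source"] = "LoRA"
        merged.append(out)
    return merged


base = [
    {"genome_accession": "ACC_0001", "pred_oxygen_requirement": "aerobe",
     "pred_optimal_ph": 7.0},
    {"genome_accession": "ACC_0002", "pred_oxygen_requirement": "anaerobe",
     "pred_optimal_ph": 6.5},
]
hybrid = [{"genome_accession": "ACC_0001", "pred_oxygen_requirement": "anaerobe"}]

catalog = overlay_hybrid(base, hybrid)
for row in catalog:
    print(row["genome_accession"], row["pred_oxygen_requirement"], row["O2_source"])
```

Treating the hybrid artifact as an optional overlay means the catalog still loads when only the tabular predictions exist, which is the fallback behavior described above.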
 ## Layout
 ```
 src/microbe_model/
+  config.py            # paths, env vars, prediction targets
+  data/bacdive.py      # BacDive v2 client
+  features/
+    genome.py          # pyrodigal + composition / codon / tetra
+    composition.py     # tetranucleotide + codon-usage helpers
+    markers.py         # 48 verified Pfam markers (8 categories)
+    kegg_modules.py    # KEGG module rule parser + AST evaluator
+    embeddings.py      # ESM-2 mean-pool helpers
+  pipeline.py          # streaming async fetch + featurize
+  train/
+    baseline.py        # multi-task XGBoost + GroupKFold
+    media_recommender.py  # per-medium binary classifiers
+  eval.py              # markdown report renderer
+scripts/               # numbered pipeline entry points (01–39 + modal_*.py)
+api/                   # FastAPI backend for the Docker/Hugging Face Space
+web/                   # React/Vite frontend for the deployed UI
+tests/                 # unit + integration tests
+data/                  # (gitignored) parquet tables, JSONL features, BacDive cache
+artifacts/             # eval report, training results, logs
+models/                # trained phenotype heads + per-medium recommender models (LFS)
 ```
 ## What this is *not* yet
+- Not an end-to-end foundation model. LoRA only fine-tunes an ESM-2 marker-protein
+  encoder for phenotype prediction; the system is still mostly tabular XGBoost plus
+  a targeted oxygen LoRA head.
+- Not a full active-learning platform. The UI can score an accession/name/FASTA, but it
+  does not yet store experiments, close the lab feedback loop, or retrain from new assays.
 - Not validated against truly out-of-distribution organisms (BacDive is survivorship-biased
   to organisms that have been cultured at least once).
 These are deliberate v0 boundaries — see `OVERNIGHT_SUMMARY.md` after a run for the
 headline result and `artifacts/eval_report.md` for the full eval.
+## v1 backlog (partially shipped — see Status)
+✅ Done:
+- Tetranucleotide and codon-usage features (v0.1)
+- MediaDive recipe metadata as a richer label source (v0.2)
+- ESM-2 t6 mean-pooled embeddings (v2)
+- 48 verified Pfam markers across 8 categories (v3)
+- KEGG module completeness pipeline — full KOfam scan complete (v4)
+- Isolation metadata enrichment from raw BacDive JSON (v3)
+- Modal-based GPU embedding extraction (t30 complete, 22,300 genomes)
+- **Phenotype-targeted ESM-2 embeddings (PTPE)** — HMM-gated mean-pool per category (v4)
+- **Fold-0 LoRA fine-tune of ESM-2 t12** — best result is the all-task checkpoint,
+  stored in the `lora-fold0-20260518` GitHub Release
+🔬 Open:
+- **Run LoRA across folds 1-4** if publication-grade validation is needed; fold 0
+  is promising for oxygen, but it is still only one group fold
+- **Attention-pooled** per-category genome encoder instead of mean-pool (most novel
+  methodological direction)
+- LPSN/GTDB family proper join (for tighter GroupKFold)
+- Pyrodigal-GV for atypical genetic codes
+- Co-occurrence / cross-feeding context from public metagenomes (MGnify, EMP, HMP)
+- Active learning loop: highlight novel-family strains where the model is uncertain,
+  prioritize for wet-lab cultivation testing.
 ## Environment variables