Miyu Horiuchi committed on
Commit · a4202fc
1 Parent(s): 572e624
Update README for hybrid UI
README.md
CHANGED
@@ -19,8 +19,42 @@ in pure culture.

## Status

v5: hybrid predictor on top of the v4 multi-source feature stack.
46,029 BacDive strains over 22,300 unique genomes, each described by **six parallel
feature paths** that XGBoost then combines (6,312 features total per genome):

1. **Composition / codon / tetranucleotide statistics** (~355 cols, v0)
2. **MediaDive recipe metadata** (medium pH, NaCl content of media this strain grows on)
3. **Curated Pfam HMM markers** (144 cols, 8 categories: T_opt, pH, oxygen, salt,
   vitamins, nitrogen, carbon, special)
4. **KEGG module completeness**: fractional 0–1 score for 570 metabolic pathways via
   completed KOfam scan + KEGG module rules
5. **Isolation metadata**: country / continent / lat-lon / collection year / inferred
   host kingdom from raw BacDive JSONs (46 cols + 65 one-hot category encodings)
6. **Phenotype-targeted ESM-2 embeddings (PTPE)**: for each genome, embed only the
   proteins matching curated phenotype-relevant HMMs (cytochromes for oxygen, heat-shock
   for temperature, Na⁺/H⁺ antiporters for pH/salt, etc.) with frozen ESM-2 t30, then
   mean-pool per category. 8 marker categories × 641 dims = 5,128 cols.

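The per-category pooling in path 6 can be sketched in a few lines. This is a minimal illustration, not the project's implementation: `mean_pool` and `ptpe_row` are hypothetical names, and treating the 641st column of each block as a count of pooled proteins is an assumption (the README only states 641 dims per category, one more than the 640-dim ESM-2 t30 embedding).

```python
def mean_pool(vectors, dim):
    """Average a list of equal-length embedding vectors; zeros if there are none."""
    if not vectors:
        return [0.0] * dim
    return [sum(vec[j] for vec in vectors) / len(vectors) for j in range(dim)]


def ptpe_row(hits_by_category, categories, dim):
    """Concatenate one (mean-pooled embedding + hit count) block per category."""
    row = []
    for category in categories:
        vectors = hits_by_category.get(category, [])
        row.extend(mean_pool(vectors, dim))
        # ASSUMPTION: the extra column per category is the number of pooled proteins.
        row.append(float(len(vectors)))
    return row


CATEGORIES = ["T_opt", "pH", "oxygen", "salt", "vitamins", "nitrogen", "carbon", "special"]

# Toy 2-dim embeddings standing in for 640-dim ESM-2 protein vectors.
toy = ptpe_row({"oxygen": [[1.0, 3.0], [3.0, 5.0]]}, CATEGORIES, dim=2)

# With the real embedding width, the row length matches the README's column count.
full_width = len(ptpe_row({}, CATEGORIES, dim=640))  # 8 * 641 = 5128
```

Categories with no matching proteins fall back to a zero block here, which keeps the row width fixed for XGBoost; how the project handles empty categories is not stated in the README.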

Current tabular 5-fold GroupKFold CV (full feature stack):

| Target | Metric | v3 (pre-PTPE) | v4 (+PTPE) | Δ |
|---|---|---|---|---|
| optimal_temperature_c | MAE °C | 2.74 | **2.67** | −2.4% |
| optimal_ph | MAE | 0.47 | **0.47** | −1.0% |
| oxygen_requirement | F1-macro | 0.41 | **0.40** | −2.4% (slight regression) |
| salt_tolerance_pct | MAE % | 1.94 | **1.92** | −1.1% |

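For readers less familiar with the two metrics in the table, here is a small plain-Python sketch of MAE, macro F1, and the relative change in the Δ column. The function names and toy labels are illustrative only and are not taken from the project's evaluation code.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def relative_change_pct(old, new):
    """Signed percentage change from old to new."""
    return (new - old) / old * 100


# Toy labels: predicting the majority class everywhere is 80% accurate,
# yet macro F1 is only about 0.44 because the rare class scores zero.
truth = ["aerobe"] * 8 + ["anaerobe"] * 2
always_aerobe = ["aerobe"] * 10
toy_f1 = macro_f1(truth, always_aerobe)

oxygen_delta = round(relative_change_pct(0.41, 0.40), 1)  # -2.4, as in the oxygen row
```

Note the sign convention: for the MAE rows a negative Δ is an improvement, while for the F1 row a negative Δ is a regression.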

PTPE adds a modest lift on the regressors and slightly hurts oxygen F1. Frozen
mean-pooling may not be unlocking the protein language model (PLM) signal. A first
fold-0 LoRA fine-tune of ESM-2 t12 is now complete and is strongest for oxygen
classification: the best all-task LoRA checkpoint reaches `0.9448` oxygen macro F1 on
fold 0, versus `0.4020` for the current tabular five-fold mean. Oxygen-only and
anaerobe-weighted variants did not beat the original all-task checkpoint. See
[docs/lora_results.md](docs/lora_results.md) for the checkpoint release, metrics, and
load instructions.
For practical prediction, use the hybrid predictor in [docs/hybrid_predictor.md](docs/hybrid_predictor.md):
tabular XGBoost heads for temperature/pH/salt plus LoRA for oxygen. The deployed UI
surfaces whether each oxygen value came from `LoRA` or the `tabular` fallback.

## Approach

@@ -31,19 +65,27 @@ NCBI Datasets v2 (genomes)

```
                                         │
                                         ▼
                                streaming featurize
                      (download → pyrodigal → discard FASTA)
                                         │
       ┌─────────────────────────────────┼─────────────────────────────────┐
       ▼                ▼                ▼                ▼                ▼
 composition /    Pfam HMM scan   KEGG/KOfam scan  ESM-2 mean-pool     isolation
 codon / tetra    (48 markers)     (570 modules)   (t30 on Modal)      metadata
       └─────────────────────────────────┼─────────────────────────────────┘
                                         ▼
                                multi-task XGBoost
                           (5-fold GroupKFold by family)
                                         │
                                         ▼
              phenotype heads + medium recommender + LoRA oxygen head
```

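The "GroupKFold by family" step in the diagram means that all strains from one taxonomic family land in the same fold, so a model is never tested on a family it saw during training. The sketch below shows the idea in plain Python; it is similar in spirit to scikit-learn's `GroupKFold`, but `group_kfold` here is a hypothetical helper and the greedy balancing rule is an assumption, not the project's code.

```python
from collections import defaultdict


def group_kfold(groups, n_splits=5):
    """Yield (train_idx, test_idx) pairs in which no group appears on both sides."""
    members = defaultdict(list)
    for idx, group in enumerate(groups):
        members[group].append(idx)
    if len(members) < n_splits:
        raise ValueError("need at least as many groups as folds")

    # Greedy balancing: assign the largest groups first, each to the emptiest fold.
    folds = [[] for _ in range(n_splits)]
    for group in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        emptiest = min(range(n_splits), key=lambda k: len(folds[k]))
        folds[emptiest].extend(members[group])

    for k in range(n_splits):
        test = sorted(folds[k])
        held_out = set(test)
        train = [i for i in range(len(groups)) if i not in held_out]
        yield train, test


# Illustrative family labels, one per strain.
families = (
    ["Bacillaceae"] * 4
    + ["Vibrionaceae"] * 3
    + ["Halomonadaceae"] * 2
    + ["Pseudomonadaceae"] * 2
    + ["Thermaceae"]
)
splits = list(group_kfold(families, n_splits=3))
```

Splitting by family rather than by strain gives a more pessimistic but more honest estimate, since closely related genomes share most of their features.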

The six feature paths are **independent**: each describes the same genome in a different
way, and XGBoost decides which to weight per phenotype. The marker-importance diagnostic
shows oxygen leans hard on Pfam HMMs (COX1, hydrogenases), T_opt leans on composition
(IVYWREL fraction), salt uses both, and KEGG modules are expected to dominate the
medium-recommendation side because every "missing biosynthetic pathway" maps directly
to "this compound goes in the recipe."

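One way to read "which path a phenotype leans on" is to sum per-feature importances by feature path. The sketch below assumes each column name carries a path prefix; the prefixes, feature names, and numbers are all made up for illustration and are not results from this project or the output of `scripts/26_marker_importance.py`.

```python
from collections import defaultdict


def importance_by_path(importances, prefix_to_path):
    """Sum per-feature importances into per-path shares that add up to 1."""
    totals = defaultdict(float)
    for feature, score in importances.items():
        path = next(
            (name for prefix, name in prefix_to_path.items() if feature.startswith(prefix)),
            "other",
        )
        totals[path] += score
    grand_total = sum(totals.values()) or 1.0
    ranked = sorted(totals.items(), key=lambda item: -item[1])
    return {path: value / grand_total for path, value in ranked}


# HYPOTHETICAL column prefixes and importance values, for illustration only.
PREFIXES = {"hmm_": "pfam_hmm", "comp_": "composition", "kegg_": "kegg_modules"}
toy_importances = {
    "hmm_COX1": 0.5,
    "hmm_hydrogenase": 0.25,
    "comp_ivywrel": 0.125,
    "kegg_M00001": 0.125,
}
shares = importance_by_path(toy_importances, PREFIXES)
```

In this toy case the Pfam path carries 75% of the total, which is the kind of pattern the paragraph above describes for oxygen.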

## Setup

@@ -58,75 +100,166 @@ cp .env.example .env

## Running the pipeline

```bash
# === core pipeline (composition + tabular features) ===
uv run python scripts/01_fetch_bacdive.py --end 200000                  # ~10 min
uv run python scripts/02_fetch_and_featurize.py --require-target any    # ~5 hr, resumable
uv run python scripts/03_train_baseline.py                              # multi-task XGBoost
uv run python scripts/04_eval.py                                        # eval report
uv run python scripts/05_overnight_summary.py                           # OVERNIGHT_SUMMARY.md

# === Pfam HMM markers (curated, ~5 hr scan once) ===
uv run python scripts/23_verify_markers.py                              # validate Pfam IDs
uv run python scripts/24_unified_hmm_scan.py --workers 8                # scan 22K genomes
uv run python scripts/25_evaluate_all_targets.py                        # A/B lift report
uv run python scripts/26_marker_importance.py                           # which markers paid off

# === KEGG module completeness (~570 modules) ===
uv run python scripts/27_fetch_kegg_modules.py                          # ~1 min, REST API
uv run python scripts/28_kofam_scan.py --fetch-only                     # download KOfam, ~10 min
uv run python scripts/28_kofam_scan.py --workers 8                      # full scan, ~14-18 hr
uv run python scripts/29_compute_kegg_completeness.py                   # ~1 min, materialize parquet

# === Environment-of-origin enrichment (no compute) ===
uv run python scripts/30_parse_isolation_metadata.py                    # ~30 sec; lat/lon/host/etc.

# === ESM-2 protein embeddings ===
# Local on MPS (slow):
uv run --extra embeddings python scripts/11_extract_embeddings.py \
  --model facebook/esm2_t30_150M_UR50D --sample-n 50

# Or on Modal A10G GPUs (much faster, requires `modal setup` + ncbi-key secret):
modal run scripts/modal_embed.py
uv run python scripts/_materialize_embeddings.py                        # JSONL → parquet

# === Phenotype-targeted ESM-2 embeddings (PTPE) ===
# Embed only proteins matching curated phenotype HMMs, pool per category.
uv run modal run scripts/modal_per_marker_embed.py --model facebook/esm2_t30_150M_UR50D
uv run python scripts/_materialize_per_marker_embeddings.py             # JSONL → parquet

# === Hybrid predictor ===
# Tabular XGBoost for temperature/pH/salt, fold-0 LoRA for oxygen.
PYTHONPATH=src uv run --python 3.11 --extra dev --extra embeddings python scripts/39_predict_hybrid.py \
  --features data/training_table.parquet \
  --marker-sequences data/marker_sequences.jsonl \
  --device mps \
  --output artifacts/hybrid_predictions.parquet
```

For overnight runs, `scripts/run_train_and_eval.sh` chains the core pipeline. The HMM,
KEGG, and embedding paths are independent: once their per-genome parquets exist
(`data/hmm_features.parquet`, `data/kegg_modules.parquet`, `data/embeddings.parquet`),
`03_train_baseline.py` and `10_train_media_recommender.py` auto-merge them.

## Architecture

### Core
- **`src/microbe_model/data/bacdive.py`**: v2 REST client (public, no auth). Discovers
  strains by batch-scanning the integer ID range; ~150K live records in 2K calls.
- **`src/microbe_model/pipeline.py`**: streaming fetch + featurize. Each worker process
  downloads a genome FASTA, runs pyrodigal, extracts features, and discards the FASTA
  (no persistent genome storage). Resumable via the JSONL append log.
- **`src/microbe_model/features/genome.py`**: pyrodigal CDS prediction + amino-acid
  composition / codon / tetranucleotide features.
- **`src/microbe_model/train/baseline.py`**: multi-task XGBoost with per-fold class
  re-encoding for classification.
- **`src/microbe_model/eval.py`**: markdown report renderer.

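The "resumable via the JSONL append log" behaviour of the pipeline can be illustrated with a small standard-library sketch. The function names and the record layout (`genome_accession`, `features`) are assumptions made for this example; the real `pipeline.py` may organise its log differently.

```python
import json
import os


def completed_accessions(log_path):
    """Collect accessions already present in the JSONL log."""
    done = set()
    if not os.path.exists(log_path):
        return done
    with open(log_path, encoding="utf-8") as fh:
        for line in fh:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # blank line or a final line truncated by an interrupted run
            if isinstance(record, dict) and "genome_accession" in record:
                done.add(record["genome_accession"])
    return done


def _ends_with_newline(path):
    """True if the file is missing, empty, or ends with a newline byte."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return True
    with open(path, "rb") as fh:
        fh.seek(-1, os.SEEK_END)
        return fh.read(1) == b"\n"


def run_resumable(accessions, log_path, featurize):
    """Featurize only accessions missing from the log; return how many were processed."""
    done = completed_accessions(log_path)
    needs_newline = not _ends_with_newline(log_path)
    processed = 0
    with open(log_path, "a", encoding="utf-8") as fh:
        if needs_newline:
            fh.write("\n")  # keep a truncated last line from corrupting the next record
        for accession in accessions:
            if accession in done:
                continue
            record = {"genome_accession": accession, "features": featurize(accession)}
            fh.write(json.dumps(record) + "\n")
            fh.flush()  # make each finished genome durable before starting the next
            processed += 1
    return processed
```

Because every finished genome is appended and flushed as one line, an interrupted multi-hour run loses at most the genome in flight, and rerunning the same command skips everything already logged.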

### Feature paths
- **`src/microbe_model/features/markers.py`**: 48 verified Pfam markers across 8 categories
  (T_opt, pH, oxygen, salt, vitamins, nitrogen, carbon, special). All IDs validated via
  `scripts/23_verify_markers.py` against InterPro DESC fields.
- **`src/microbe_model/features/kegg_modules.py`**: KEGG module rule parser (boolean
  AND / OR / parens grammar) + AST evaluator for fractional & strict completeness scoring.
- **`src/microbe_model/features/embeddings.py`**: frozen ESM-2 forward pass + mean-pool
  per protein → per-proteome 320/640-dim vector (model-size dependent).

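To make the "rule parser + AST evaluator" idea concrete, here is a self-contained sketch for a simplified rule grammar. Everything in it is an assumption for illustration: whitespace means AND and binds tighter than a comma, a comma means OR, parentheses group, and the `+` / `-` operators found in real KEGG definitions are rejected rather than handled. The scoring rules (mean over AND children, max over OR children) are one plausible reading of "fractional completeness", not necessarily what `kegg_modules.py` does.

```python
import re

TOKEN = re.compile(r"K\d{5}|[(),]")


def parse_rule(rule):
    """Parse a simplified module rule into a nested ("and" | "or" | "ko", ...) tree."""
    if "".join(TOKEN.sub("", rule).split()):
        raise ValueError(f"unsupported syntax in rule: {rule!r}")
    tokens = TOKEN.findall(rule)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def parse_atom():
        nonlocal pos
        token = peek()
        if token is None or token in (",", ")"):
            raise ValueError(f"expected a KO or '(' at token {pos}")
        pos += 1
        if token == "(":
            node = parse_or()
            if peek() != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1
            return node
        return ("ko", token)

    def parse_and():
        children = [parse_atom()]
        while peek() not in (None, ",", ")"):
            children.append(parse_atom())
        return children[0] if len(children) == 1 else ("and", children)

    def parse_or():
        nonlocal pos
        children = [parse_and()]
        while peek() == ",":
            pos += 1
            children.append(parse_and())
        return children[0] if len(children) == 1 else ("or", children)

    tree = parse_or()
    if pos != len(tokens):
        raise ValueError("unbalanced parentheses")
    return tree


def fractional(node, present):
    """Score in [0, 1]: mean over AND children, best alternative over OR children."""
    kind, payload = node
    if kind == "ko":
        return 1.0 if payload in present else 0.0
    scores = [fractional(child, present) for child in payload]
    return sum(scores) / len(scores) if kind == "and" else max(scores)


def strict(node, present):
    """True only if every required step has at least one complete alternative."""
    kind, payload = node
    if kind == "ko":
        return payload in present
    results = [strict(child, present) for child in payload]
    return all(results) if kind == "and" else any(results)


# Illustrative three-step rule; the KO identifiers are used only as examples.
tree = parse_rule("(K00844,K12407) K01810 ((K00134,K00150) K00927,K11389)")
partial = fractional(tree, {"K00844", "K00134"})  # steps score 1.0, 0.0, 0.5
```

With only two KOs present, the fractional score is 0.5 while the strict score is `False`, which is why the two are reported separately: the fractional view still carries signal for incomplete or fragmented genomes.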

### Scanners (numbered scripts)
- **`24_unified_hmm_scan.py`**: pyhmmer scan over the 48-marker Pfam library, deduplicated
  by genome accession, streams to `data/hmm_features.parquet`.
- **`28_kofam_scan.py`**: same architecture but against KOfam (~3K KEGG-relevant HMMs);
  output is per-genome KO sets.
- **`29_compute_kegg_completeness.py`**: applies the KEGG module rules to KO hits,
  yields ~570 fractional-completeness columns per genome.
- **`30_parse_isolation_metadata.py`**: parses raw BacDive JSONs for
  lat/lon/country/host species; outputs `data/isolation_metadata.parquet` with one-hot
  encodings.
- **`modal_embed.py`**: Modal app for ESM-2 t30 (or t33) extraction on A10G GPUs.

### UI and API
- **`api/main.py`**: FastAPI backend for the Hugging Face Space. It serves the React
  build, recommender models, catalog API, NCBI lookup, and on-demand genome prediction.
- **`web/`**: React/Vite frontend used by the Docker Space at
  <https://huggingface.co/spaces/miyuiu/microbe-model>.
- **Hybrid catalog behavior**: `/api/catalog` always loads
  `artifacts/uncultured_predictions.parquet`; if `artifacts/hybrid_predictions.parquet`
  exists, the API overlays matching `pred_*` columns by `genome_accession`.
  Oxygen rows include `O2_source` so the UI can show `LoRA` vs `tabular`.
- **Live `/api/predict` behavior**: on-demand predictions currently use the deployed
  tabular phenotype heads and return per-phenotype `source` metadata. LoRA-backed
  oxygen is used for precomputed hybrid catalog rows when the hybrid artifact is present.

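The catalog overlay described above amounts to a keyed update of `pred_*` columns. The sketch below uses plain dict rows instead of parquet tables so that it stays dependency-free; `overlay_hybrid`, the example column `pred_oxygen_requirement`, and the choice to default `O2_source` to `tabular` for rows without a hybrid match are assumptions, not the behaviour of `api/main.py`.

```python
def overlay_hybrid(base_rows, hybrid_rows=None, key="genome_accession"):
    """Return copies of base rows with pred_* columns overlaid from matching hybrid rows."""
    hybrid_by_key = {row[key]: row for row in (hybrid_rows or [])}
    merged = []
    for row in base_rows:
        out = dict(row)  # copy, so the base catalog is never mutated
        match = hybrid_by_key.get(row[key])
        if match is not None:
            for column, value in match.items():
                if (column.startswith("pred_") or column == "O2_source") and value is not None:
                    out[column] = value
        # ASSUMPTION: rows without hybrid output are labelled as tabular predictions.
        out.setdefault("O2_source", "tabular")
        merged.append(out)
    return merged


# Hypothetical rows; accessions and column names are placeholders.
base = [
    {"genome_accession": "GCF_A", "pred_oxygen_requirement": "aerobe"},
    {"genome_accession": "GCF_B", "pred_oxygen_requirement": "aerobe"},
]
hybrid = [
    {"genome_accession": "GCF_B", "pred_oxygen_requirement": "anaerobe", "O2_source": "LoRA"},
]
catalog = overlay_hybrid(base, hybrid)
```

Passing `hybrid_rows=None` mirrors the case where the hybrid artifact is absent: the catalog is returned unchanged apart from the source label.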

## Layout

```
src/microbe_model/
  config.py                 # paths, env vars, prediction targets
  data/bacdive.py           # BacDive v2 client
  features/
    genome.py               # pyrodigal + composition / codon / tetra
    composition.py          # tetranucleotide + codon-usage helpers
    markers.py              # 48 verified Pfam markers (8 categories)
    kegg_modules.py         # KEGG module rule parser + AST evaluator
    embeddings.py           # ESM-2 mean-pool helpers
  pipeline.py               # streaming async fetch + featurize
  train/
    baseline.py             # multi-task XGBoost + GroupKFold
    media_recommender.py    # per-medium binary classifiers
  eval.py                   # markdown report renderer
scripts/                    # numbered pipeline entry points (01–39 + modal_*.py)
api/                        # FastAPI backend for the Docker/Hugging Face Space
web/                        # React/Vite frontend for the deployed UI
tests/                      # unit + integration tests
data/                       # (gitignored) parquet tables, JSONL features, BacDive cache
artifacts/                  # eval report, training results, logs
models/                     # trained phenotype heads + per-medium recommender models (LFS)
```

## What this is *not* yet

- Not an end-to-end foundation model. LoRA only fine-tunes an ESM-2 marker-protein
  encoder for phenotype prediction; the system is still mostly tabular XGBoost plus
  a targeted oxygen LoRA head.
- Not a full active-learning platform. The UI can score an accession/name/FASTA, but it
  does not yet store experiments, close the lab feedback loop, or retrain from new assays.
- Not validated against truly out-of-distribution organisms (BacDive is survivorship-biased
  to organisms that have been cultured at least once).

These are deliberate v0 boundaries. See `OVERNIGHT_SUMMARY.md` after a run for the
headline result and `artifacts/eval_report.md` for the full eval.

## v1 backlog (partially shipped; see Status)

✅ Done:
- Tetranucleotide and codon-usage features (v0.1)
- MediaDive recipe metadata as a richer label source (v0.2)
- ESM-2 t6 mean-pooled embeddings (v2)
- 48 verified Pfam markers across 8 categories (v3)
- KEGG module completeness pipeline: full KOfam scan complete (v4)
- Isolation metadata enrichment from raw BacDive JSON (v3)
- Modal-based GPU embedding extraction (t30 complete, 22,300 genomes)
- **Phenotype-targeted ESM-2 embeddings (PTPE)**: HMM-gated mean-pool per category (v4)
- **Fold-0 LoRA fine-tune of ESM-2 t12**: best result is the all-task checkpoint,
  stored in the `lora-fold0-20260518` GitHub Release

🔬 Open:
- **Run LoRA across folds 1-4** if publication-grade validation is needed; fold 0
  is promising for oxygen, but it is still only one group fold
- **Attention-pooled** per-category genome encoder instead of mean-pool (most novel
  methodological direction)
- LPSN/GTDB family proper join (for tighter GroupKFold)
- Pyrodigal-GV for atypical genetic codes
- Co-occurrence / cross-feeding context from public metagenomes (MGnify, EMP, HMP)
- Active learning loop: highlight novel-family strains where the model is uncertain,
  prioritize for wet-lab cultivation testing.

## Environment variables
