Miyu Horiuchi committed on
Commit
a4202fc
·
1 Parent(s): 572e624

Update README for hybrid UI

  1. README.md +178 -45
README.md CHANGED
@@ -19,8 +19,42 @@ in pure culture.
19
 
20
  ## Status
21
 
22
- v0: tabular baseline against the full BacDive corpus. No deep model yet; the point of the
23
- v0 is to establish the ceiling on tabular performance before investing in transformers.
24
 
25
  ## Approach
26
 
@@ -31,19 +65,27 @@ NCBI Datasets v2 (genomes) ─────┘
31
  │
32
  ▼
33
  streaming featurize
34
- (download → pyrodigal → AA-composition → discard FASTA)
35
  │
36
  ▼
37
  multi-task XGBoost
38
- (5-fold GroupKFold by taxonomic family)
39
  │
40
  ▼
41
- eval report (`artifacts/eval_report.md`)
42
  ```
43
 
44
- The genome→phenotype features used here have well-established correlations with the target
45
- properties (proteome amino acid composition correlates with optimal growth temperature, GC
46
- content with thermophily, etc.), so even a tabular model has a real signal to learn from.
47
 
48
  ## Setup
49
 
@@ -58,75 +100,166 @@ cp .env.example .env
58
  ## Running the pipeline
59
 
60
  ```bash
61
- # 1. Scan BacDive (~10 min for full corpus). Writes data/bacdive_phenotypes.parquet.
62
- uv run python scripts/01_fetch_bacdive.py --end 200000
63
 
64
- # 2. Streaming fetch + featurize all training-ready strains (~5 hr for ~17K strains
65
- # with 7 worker processes). Writes data/features.parquet. Resumable.
66
- uv run python scripts/02_fetch_and_featurize.py --require-target any
67
 
68
- # 3. Multi-task XGBoost. Writes artifacts/baseline_results.json + predictions.parquet.
69
- uv run python scripts/03_train_baseline.py
70
 
71
- # 4. Render the eval report. Writes artifacts/eval_report.md.
72
- uv run python scripts/04_eval.py
73
 
74
- # 5. Top-level overnight summary. Writes OVERNIGHT_SUMMARY.md.
75
- uv run python scripts/05_overnight_summary.py
76
  ```
77
 
78
- For overnight runs, `scripts/run_train_and_eval.sh` chains 03 → 04 → 05.
79
 
80
  ## Architecture
81
 
 
82
  - **`src/microbe_model/data/bacdive.py`**: v2 REST client (public, no auth). Discovers
83
  strains by batch-scanning the integer ID range; ~150K live records in 2K calls.
84
  - **`src/microbe_model/pipeline.py`**: streaming fetch + featurize. Each worker process
85
  downloads a genome FASTA, runs pyrodigal, extracts features, and discards the FASTA;
86
  no persistent genome storage. Resumable via the JSONL append log.
87
  - **`src/microbe_model/features/genome.py`**: pyrodigal CDS prediction + amino-acid
88
- composition features. Uses single-genome+train mode (~7× faster than meta on assembled
89
- genomes) with a meta fallback for very short input.
90
  - **`src/microbe_model/train/baseline.py`**: multi-task XGBoost with per-fold class
91
- re-encoding for classification (handles non-contiguous class subsets).
92
- - **`src/microbe_model/eval.py`**: markdown report renderer. TL;DR + corpus + per-target
93
- metrics + per-family error breakdown + limitations + next steps.
94
 
95
  ## Layout
96
 
97
  ```
98
  src/microbe_model/
99
- config.py # paths, env vars, prediction targets
100
- data/bacdive.py # BacDive v2 client
101
- features/genome.py # pyrodigal + AA-composition feature extraction
102
- pipeline.py # streaming async fetch + featurize
103
- train/baseline.py # multi-task XGBoost + GroupKFold
104
- eval.py # markdown report renderer
105
- scripts/ # numbered pipeline entry points (01–05)
106
- tests/ # 15 unit + integration tests
107
- data/ # (gitignored) parquet tables, JSONL features, BacDive cache
108
- artifacts/ # eval report, training results, logs
109
  ```
110
 
111
  ## What this is *not* yet
112
 
113
- - Not a foundation model. No transformer. No genome language model.
114
- - Not a platform. There is no upload UI or active-learning loop.
115
  - Not validated against truly out-of-distribution organisms (BacDive is survivorship-biased
116
  to organisms that have been cultured at least once).
117
 
118
  These are deliberate v0 boundaries; see `OVERNIGHT_SUMMARY.md` after a run for the
119
  headline result and `artifacts/eval_report.md` for the full eval.
120
 
121
- ## v1 backlog
122
 
123
- 1. Tetranucleotide and codon-usage features
124
- 2. LPSN/GTDB family proper join (for tighter GroupKFold)
125
- 3. KOMODO media DB integration as a richer label source
126
- 4. Pyrodigal-GV for atypical genetic codes
127
- 5. Genome-LM embeddings (Nucleotide Transformer / Evo-1 / DNABERT-2)
128
- 6. Active learning loop: highlight novel-family strains where the model is uncertain,
129
- prioritize for wet-lab cultivation testing.
130
 
131
  ## Environment variables
132
 
 
19
 
20
  ## Status
21
 
22
+ v5: hybrid predictor on top of the v4 multi-source feature stack.
23
+ 46,029 BacDive strains over 22,300 unique genomes, each described by **six parallel
24
+ feature paths** that XGBoost then combines (6,312 features total per genome):
25
+
26
+ 1. **Composition / codon / tetranucleotide statistics** (~355 cols, v0)
27
+ 2. **MediaDive recipe metadata** (medium pH, NaCl content of media this strain grows on)
28
+ 3. **Curated Pfam HMM markers** (144 cols, 8 categories: T_opt, pH, oxygen, salt,
29
+ vitamins, nitrogen, carbon, special)
30
+ 4. **KEGG module completeness**: fractional 0–1 score for 570 metabolic pathways via
31
+ completed KOfam scan + KEGG module rules
32
+ 5. **Isolation metadata**: country / continent / lat-lon / collection year / inferred
33
+ host kingdom from raw BacDive JSONs (46 cols + 65 one-hot category encodings)
34
+ 6. **Phenotype-targeted ESM-2 embeddings (PTPE)**: for each genome, embed only the
35
+ proteins matching curated phenotype-relevant HMMs (cytochromes for oxygen, heat-shock
36
+ for temperature, Na⁺/H⁺ antiporters for pH/salt, etc.) with frozen ESM-2 t30, then
37
+ mean-pool per category. 8 markers × 641 dims = 5,128 cols.
38
+
39
+ Current tabular 5-fold GroupKFold CV (full feature stack):
40
+
41
+ | Target | Metric | v3 (pre-PTPE) | v4 (+PTPE) | Δ |
42
+ |---|---|---|---|---|
43
+ | optimal_temperature_c | MAE °C | 2.74 | **2.67** | −2.4% |
44
+ | optimal_ph | MAE | 0.47 | **0.47** | −1.0% |
45
+ | oxygen_requirement | F1-macro | 0.41 | **0.40** | −2.4% (slight regression) |
46
+ | salt_tolerance_pct | MAE % | 1.94 | **1.92** | −1.1% |
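
The grouped cross-validation behind these numbers can be illustrated with a small standard-library sketch. This is not the repository's code: the README only names GroupKFold, and the helper `group_kfold_indices` and the toy family labels below are hypothetical. The idea it shows is that all strains from one taxonomic family land in the same fold, so no family is ever split between train and test.

```python
from collections import defaultdict


def group_kfold_indices(groups, n_splits=5):
    """Yield (train_idx, test_idx) so that no group spans both sides.

    Hypothetical helper: groups are assigned greedily, largest first,
    to whichever fold currently holds the fewest samples.
    """
    members = defaultdict(list)
    for idx, group in enumerate(groups):
        members[group].append(idx)
    if len(members) < n_splits:
        raise ValueError("need at least as many groups as folds")

    folds = [[] for _ in range(n_splits)]
    # Largest groups first keeps fold sizes roughly balanced.
    for group in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        smallest = min(range(n_splits), key=lambda f: len(folds[f]))
        folds[smallest].extend(members[group])

    for f in range(n_splits):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(n_splits) if g != f for i in folds[g])
        yield train_idx, test_idx


# Toy example: one taxonomic family label per strain (made-up data).
families = ["Bacillaceae", "Bacillaceae", "Vibrionaceae", "Halomonadaceae",
            "Vibrionaceae", "Enterobacteriaceae", "Thermaceae", "Thermaceae"]
splits = list(group_kfold_indices(families, n_splits=5))
```

Because whole families are held out, scores from this kind of split tend to be lower than a random split would give, which is the intended, more conservative reading of the table.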
47
+
48
+ PTPE adds modest, mixed lift on the regressors and slightly hurts oxygen F1. Frozen
49
+ mean-pool may not be unlocking the PLM signal. A first fold-0 LoRA fine-tune of
50
+ ESM-2 t12 is now complete and is strongest for oxygen classification: the best
51
+ all-task LoRA checkpoint reaches `0.9448` oxygen macro F1 on fold 0, versus `0.4020`
52
+ for the current tabular five-fold mean. Oxygen-only and anaerobe-weighted variants
53
+ did not beat the original all-task checkpoint. See [docs/lora_results.md](docs/lora_results.md) for the
54
+ checkpoint release, metrics, and load instructions.
55
+ For practical prediction, use the hybrid predictor in [docs/hybrid_predictor.md](docs/hybrid_predictor.md):
56
+ tabular XGBoost heads for temperature/pH/salt plus LoRA for oxygen. The deployed UI
57
+ surfaces whether each oxygen value came from `LoRA` or the `tabular` fallback.
58
 
59
  ## Approach
60
 
 
65
  │
66
  ▼
67
  streaming featurize
68
+ (download → pyrodigal → discard FASTA)
69
  │
70
+ ┌─────────────────────────────┼─────────────────────────────┐
71
+ ▼ ▼ ▼ ▼ ▼
72
+ composition / Pfam HMM scan KEGG/KOfam scan ESM-2 mean-pool isolation
73
+ codon / tetra (48 markers) (570 modules) (t30 on Modal) metadata
74
+ └─────────────────────────────┼─────────────────────────────┘
75
  ▼
76
  multi-task XGBoost
77
+ (5-fold GroupKFold by family)
78
  │
79
  ▼
80
+ phenotype heads + medium recommender + LoRA oxygen head
81
  ```
82
 
83
+ The six feature paths are **independent**: each describes the same genome a different
84
+ way, and XGBoost decides which to weight per phenotype. The marker-importance diagnostic
85
+ shows oxygen leans hard on Pfam HMMs (COX1, hydrogenases), T_opt leans on composition
86
+ (IVYWREL fraction), salt uses both, and KEGG modules are expected to dominate the
87
+ medium-recommendation side because every "missing biosynthetic pathway" maps directly
88
+ to "this compound goes in the recipe."
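
As a concrete illustration of the composition path, the IVYWREL fraction named above is the share of proteome residues that are isoleucine, valine, tyrosine, tryptophan, arginine, glutamate, or leucine. The sketch below is a minimal version under stated assumptions: the function name `ivywrel_fraction` and the toy sequences are hypothetical, and the repository's `features/genome.py` may compute this differently (for example per protein rather than pooled).

```python
IVYWREL = frozenset("IVYWREL")


def ivywrel_fraction(proteins):
    """Fraction of residues across all proteins that are I, V, Y, W, R, E or L."""
    total = 0
    hits = 0
    for seq in proteins:
        seq = seq.upper().rstrip("*")  # drop a trailing stop symbol if present
        total += len(seq)
        hits += sum(1 for aa in seq if aa in IVYWREL)
    return hits / total if total else 0.0


# Two made-up protein sequences standing in for predicted CDS translations.
toy_proteome = ["MIVYWREL", "MKKAG*"]
fraction = ivywrel_fraction(toy_proteome)
```

In this toy input, 7 of the 13 residues fall in the IVYWREL set, so the feature is a single number between 0 and 1 per genome.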
89
 
90
  ## Setup
91
 
 
100
  ## Running the pipeline
101
 
102
  ```bash
103
+ # === core pipeline (composition + tabular features) ===
104
+ uv run python scripts/01_fetch_bacdive.py --end 200000 # ~10 min
105
+ uv run python scripts/02_fetch_and_featurize.py --require-target any # ~5 hr, resumable
106
+ uv run python scripts/03_train_baseline.py # multi-task XGBoost
107
+ uv run python scripts/04_eval.py # eval report
108
+ uv run python scripts/05_overnight_summary.py # OVERNIGHT_SUMMARY.md
109
+
110
+ # === Pfam HMM markers (curated, ~5 hr scan once) ===
111
+ uv run python scripts/23_verify_markers.py # validate Pfam IDs
112
+ uv run python scripts/24_unified_hmm_scan.py --workers 8 # scan 22K genomes
113
+ uv run python scripts/25_evaluate_all_targets.py # A/B lift report
114
+ uv run python scripts/26_marker_importance.py # which markers paid off
115
+
116
+ # === KEGG module completeness (~570 modules) ===
117
+ uv run python scripts/27_fetch_kegg_modules.py # ~1 min, REST API
118
+ uv run python scripts/28_kofam_scan.py --fetch-only # download KOfam, ~10 min
119
+ uv run python scripts/28_kofam_scan.py --workers 8 # full scan, ~14-18 hr
120
+ uv run python scripts/29_compute_kegg_completeness.py # ~1 min, materialize parquet
121
 
122
+ # === Environment-of-origin enrichment (no compute) ===
123
+ uv run python scripts/30_parse_isolation_metadata.py # ~30 sec; lat/lon/host/etc
 
124
 
125
+ # === ESM-2 protein embeddings ===
126
+ # Local on MPS (slow):
127
+ uv run --extra embeddings python scripts/11_extract_embeddings.py \
128
+ --model facebook/esm2_t30_150M_UR50D --sample-n 50
129
 
130
+ # Or on Modal A10G GPUs (much faster, requires `modal setup` + ncbi-key secret):
131
+ modal run scripts/modal_embed.py
132
+ uv run python scripts/_materialize_embeddings.py # JSONL → parquet
133
 
134
+ # === Phenotype-targeted ESM-2 embeddings (PTPE) ===
135
+ # Embed only proteins matching curated phenotype HMMs, pool per category.
136
+ uv run modal run scripts/modal_per_marker_embed.py --model facebook/esm2_t30_150M_UR50D
137
+ uv run python scripts/_materialize_per_marker_embeddings.py # JSONL → parquet
138
+
139
+ # === Hybrid predictor ===
140
+ # Tabular XGBoost for temperature/pH/salt, fold-0 LoRA for oxygen.
141
+ PYTHONPATH=src uv run --python 3.11 --extra dev --extra embeddings python scripts/39_predict_hybrid.py \
142
+ --features data/training_table.parquet \
143
+ --marker-sequences data/marker_sequences.jsonl \
144
+ --device mps \
145
+ --output artifacts/hybrid_predictions.parquet
146
  ```
147
 
148
+ For overnight runs, `scripts/run_train_and_eval.sh` chains the core pipeline. The HMM,
149
+ KEGG, and embedding paths are independent: once their per-genome parquets exist
150
+ (`data/hmm_features.parquet`, `data/kegg_modules.parquet`, `data/embeddings.parquet`),
151
+ `03_train_baseline.py` and `10_train_media_recommender.py` auto-merge them.
152
 
153
  ## Architecture
154
 
155
+ ### Core
156
  - **`src/microbe_model/data/bacdive.py`**: v2 REST client (public, no auth). Discovers
157
  strains by batch-scanning the integer ID range; ~150K live records in 2K calls.
158
  - **`src/microbe_model/pipeline.py`**: streaming fetch + featurize. Each worker process
159
  downloads a genome FASTA, runs pyrodigal, extracts features, and discards the FASTA;
160
  no persistent genome storage. Resumable via the JSONL append log.
161
  - **`src/microbe_model/features/genome.py`**: pyrodigal CDS prediction + amino-acid
162
+ composition / codon / tetranucleotide features.
163
  - **`src/microbe_model/train/baseline.py`**: multi-task XGBoost with per-fold class
164
+ re-encoding for classification.
165
+ - **`src/microbe_model/eval.py`**: markdown report renderer.
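
The "resumable via the JSONL append log" behavior described for `pipeline.py` can be sketched as follows. This is an illustration, not the repository's implementation: the function names, the record layout, and the use of `genome_accession` as the log key are assumptions, and `toy_featurize` stands in for the real download, pyrodigal, and feature-extraction step.

```python
import json
import os
import tempfile


def load_done(log_path):
    """Return the accessions already present in the JSONL log."""
    done = set()
    if not os.path.exists(log_path):
        return done
    with open(log_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["genome_accession"])
            except (json.JSONDecodeError, KeyError):
                # Skip unreadable lines (e.g. a write cut short by a crash);
                # those genomes are simply recomputed.
                continue
    return done


def featurize_all(accessions, log_path, featurize):
    """Append one JSON record per genome, skipping ones already logged."""
    done = load_done(log_path)
    processed = []
    with open(log_path, "a", encoding="utf-8") as fh:
        for acc in accessions:
            if acc in done:
                continue
            record = {"genome_accession": acc, **featurize(acc)}
            fh.write(json.dumps(record) + "\n")
            fh.flush()  # make each finished genome durable before moving on
            processed.append(acc)
    return processed


def toy_featurize(acc):
    # Placeholder for download -> pyrodigal -> feature extraction.
    return {"n_chars": len(acc)}


log_path = os.path.join(tempfile.mkdtemp(), "features.jsonl")
first = featurize_all(["GCF_1", "GCF_2"], log_path, toy_featurize)
second = featurize_all(["GCF_1", "GCF_2", "GCF_3"], log_path, toy_featurize)
```

The second call only processes `GCF_3`, which is the property that lets a multi-hour run be interrupted and restarted without redoing finished genomes.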
166
+
167
+ ### Feature paths
168
+ - **`src/microbe_model/features/markers.py`**: 48 verified Pfam markers across 8 categories
169
+ (T_opt, pH, oxygen, salt, vitamins, nitrogen, carbon, special). All IDs validated via
170
+ `scripts/23_verify_markers.py` against InterPro DESC fields.
171
+ - **`src/microbe_model/features/kegg_modules.py`**: KEGG module rule parser (boolean
172
+ AND / OR / parens grammar) + AST evaluator for fractional & strict completeness scoring.
173
+ - **`src/microbe_model/features/embeddings.py`**: frozen ESM-2 forward pass + mean-pool
174
+ per protein → per-proteome 320/640-dim vector (model-size dependent).
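
To make the module-rule idea concrete, here is a minimal sketch of a parser and evaluator for a simplified rule grammar. It is an assumption-laden illustration rather than the code in `kegg_modules.py`: in this sketch a comma means OR (lowest precedence), adjacent terms mean AND, parentheses group, and any other characters are ignored. The scoring convention (AND = mean of its steps, OR = best alternative) is likewise an assumption about what "fractional completeness" could mean.

```python
import re

# KO identifiers, parentheses and commas; everything else is ignored here.
TOKEN_RE = re.compile(r"K\d{5}|[(),]")


def parse_rule(rule):
    """Parse a rule into a nested AST: ("ko", id), ("and", [...]) or ("or", [...])."""
    tokens = TOKEN_RE.findall(rule)
    pos = 0

    def parse_or():
        nonlocal pos
        branches = [parse_and()]
        while pos < len(tokens) and tokens[pos] == ",":
            pos += 1
            branches.append(parse_and())
        return branches[0] if len(branches) == 1 else ("or", branches)

    def parse_and():
        nonlocal pos
        steps = []
        while pos < len(tokens) and tokens[pos] not in (",", ")"):
            if tokens[pos] == "(":
                pos += 1
                steps.append(parse_or())
                if pos >= len(tokens) or tokens[pos] != ")":
                    raise ValueError("unbalanced parentheses")
                pos += 1
            else:
                steps.append(("ko", tokens[pos]))
                pos += 1
        if not steps:
            raise ValueError("empty step in rule")
        return steps[0] if len(steps) == 1 else ("and", steps)

    ast = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return ast


def completeness(ast, kos):
    """Fractional score in [0, 1]: AND averages its steps, OR takes the best branch."""
    kind = ast[0]
    if kind == "ko":
        return 1.0 if ast[1] in kos else 0.0
    scores = [completeness(child, kos) for child in ast[1]]
    return max(scores) if kind == "or" else sum(scores) / len(scores)


# Illustrative three-step rule and a genome that hits only some of the KOs.
rule = "(K00844,K00845) K01810 ((K00134,K00150) K00927,K11389)"
tree = parse_rule(rule)
partial = completeness(tree, {"K00845", "K00134"})
full = completeness(tree, {"K00845", "K01810", "K11389"})
strict_partial = partial == 1.0  # "strict" completeness as an all-or-nothing flag
```

Under these conventions the partial genome completes step one, misses step two, and gets half credit for step three, while the second KO set satisfies every step through the bypass alternative.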
175
+
176
+ ### Scanners (numbered scripts)
177
+ - **`24_unified_hmm_scan.py`**: pyhmmer scan over the 48-marker Pfam library, dedup'd
178
+ by genome accession, streams to `data/hmm_features.parquet`.
179
+ - **`28_kofam_scan.py`**: same architecture but against KOfam (~3K KEGG-relevant HMMs);
180
+ output is per-genome KO sets.
181
+ - **`29_compute_kegg_completeness.py`**: applies the KEGG module rules to KO hits,
182
+ yields ~570 fractional-completeness columns per genome.
183
+ - **`30_parse_isolation_metadata.py`**: parses raw BacDive JSONs for lat/lon/country/
184
+ host species; outputs `data/isolation_metadata.parquet` with one-hot encodings.
185
+ - **`modal_embed.py`**: Modal app for ESM-2 t30 (or t33) extraction on A10G GPUs.
186
+
187
+ ### UI and API
188
+ - **`api/main.py`**: FastAPI backend for the Hugging Face Space. It serves the React
189
+ build, recommender models, catalog API, NCBI lookup, and on-demand genome prediction.
190
+ - **`web/`**: React/Vite frontend used by the Docker Space at
191
+ <https://huggingface.co/spaces/miyuiu/microbe-model>.
192
+ - **Hybrid catalog behavior**: `/api/catalog` always loads
193
+ `artifacts/uncultured_predictions.parquet`; if `artifacts/hybrid_predictions.parquet`
194
+ exists, the API overlays matching `pred_*` columns by `genome_accession`.
195
+ Oxygen rows include `O2_source` so the UI can show `LoRA` vs `tabular`.
196
+ - **Live `/api/predict` behavior**: on-demand predictions currently use the deployed
197
+ tabular phenotype heads and return per-phenotype `source` metadata. LoRA-backed
198
+ oxygen is used for precomputed hybrid catalog rows when the hybrid artifact is present.
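
The hybrid catalog overlay described above can be sketched with plain dictionaries. This is a hypothetical illustration of the described behavior, not the code in `api/main.py`: the function `overlay_hybrid`, the column name `pred_oxygen_requirement`, and the rule used to set `O2_source` are assumptions; only `genome_accession`, the `pred_*` prefix, and the `LoRA` / `tabular` labels come from the README.

```python
def overlay_hybrid(base_rows, hybrid_rows, key="genome_accession"):
    """Overlay pred_* values from hybrid rows onto base rows that share `key`."""
    hybrid_by_key = {row[key]: row for row in hybrid_rows}
    merged = []
    for row in base_rows:
        out = dict(row)  # copy so the base catalog is left untouched
        out["O2_source"] = "tabular"
        match = hybrid_by_key.get(row[key])
        if match is not None:
            for col, value in match.items():
                # Only overwrite prediction columns present in both tables.
                if col.startswith("pred_") and col in out and value is not None:
                    out[col] = value
            # Assumed rule: a hybrid oxygen value means the LoRA head produced it.
            if match.get("pred_oxygen_requirement") is not None:
                out["O2_source"] = "LoRA"
        merged.append(out)
    return merged


# Made-up rows standing in for the two parquet artifacts.
base = [
    {"genome_accession": "GCA_000001", "pred_optimal_ph": 7.1,
     "pred_oxygen_requirement": "aerobe"},
    {"genome_accession": "GCA_000002", "pred_optimal_ph": 6.4,
     "pred_oxygen_requirement": "aerobe"},
]
hybrid = [
    {"genome_accession": "GCA_000002", "pred_oxygen_requirement": "anaerobe"},
]
catalog = overlay_hybrid(base, hybrid)
```

Rows without a hybrid match keep their tabular values and are labeled `tabular`, which mirrors the fallback the UI is said to surface.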
199
 
200
  ## Layout
201
 
202
  ```
203
  src/microbe_model/
204
+ config.py # paths, env vars, prediction targets
205
+ data/bacdive.py # BacDive v2 client
206
+ features/
207
+ genome.py # pyrodigal + composition / codon / tetra
208
+ composition.py # tetranucleotide + codon-usage helpers
209
+ markers.py # 48 verified Pfam markers (8 categories)
210
+ kegg_modules.py # KEGG module rule parser + AST evaluator
211
+ embeddings.py # ESM-2 mean-pool helpers
212
+ pipeline.py # streaming async fetch + featurize
213
+ train/
214
+ baseline.py # multi-task XGBoost + GroupKFold
215
+ media_recommender.py # per-medium binary classifiers
216
+ eval.py # markdown report renderer
217
+ scripts/ # numbered pipeline entry points (01–39 + modal_*.py)
218
+ api/ # FastAPI backend for the Docker/Hugging Face Space
219
+ web/ # React/Vite frontend for the deployed UI
220
+ tests/ # unit + integration tests
221
+ data/ # (gitignored) parquet tables, JSONL features, BacDive cache
222
+ artifacts/ # eval report, training results, logs
223
+ models/ # trained phenotype heads + per-medium recommender models (LFS)
224
  ```
225
 
226
  ## What this is *not* yet
227
 
228
+ - Not an end-to-end foundation model. LoRA only fine-tunes an ESM-2 marker-protein
229
+ encoder for phenotype prediction; the system is still mostly tabular XGBoost plus
230
+ a targeted oxygen LoRA head.
231
+ - Not a full active-learning platform. The UI can score an accession/name/FASTA, but it
232
+ does not yet store experiments, close the lab feedback loop, or retrain from new assays.
233
  - Not validated against truly out-of-distribution organisms (BacDive is survivorship-biased
234
  to organisms that have been cultured at least once).
235
 
236
  These are deliberate v0 boundaries; see `OVERNIGHT_SUMMARY.md` after a run for the
237
  headline result and `artifacts/eval_report.md` for the full eval.
238
 
239
+ ## v1 backlog (partially shipped; see Status)
240
+
241
+ ✅ Done:
242
+ - Tetranucleotide and codon-usage features (v0.1)
243
+ - MediaDive recipe metadata as a richer label source (v0.2)
244
+ - ESM-2 t6 mean-pooled embeddings (v2)
245
+ - 48 verified Pfam markers across 8 categories (v3)
246
+ - KEGG module completeness pipeline: full KOfam scan complete (v4)
247
+ - Isolation metadata enrichment from raw BacDive JSON (v3)
248
+ - Modal-based GPU embedding extraction (t30 complete, 22,300 genomes)
249
+ - **Phenotype-targeted ESM-2 embeddings (PTPE)**: HMM-gated mean-pool per category (v4)
250
+ - **Fold-0 LoRA fine-tune of ESM-2 t12**: best result is the all-task checkpoint,
251
+ stored in the `lora-fold0-20260518` GitHub Release
252
 
253
+ 🔬 Open:
254
+ - **Run LoRA across folds 1-4** if publication-grade validation is needed; fold 0
255
+ is promising for oxygen, but it is still only one group fold
256
+ - **Attention-pooled** per-category genome encoder instead of mean-pool (most novel
257
+ methodological direction)
258
+ - LPSN/GTDB family proper join (for tighter GroupKFold)
259
+ - Pyrodigal-GV for atypical genetic codes
260
+ - Co-occurrence / cross-feeding context from public metagenomes (MGnify, EMP, HMP)
261
+ - Active learning loop: highlight novel-family strains where the model is uncertain,
262
+ prioritize for wet-lab cultivation testing.
263
 
264
  ## Environment variables
265