Commit d278e18 by Miyu Horiuchi
Parent(s): 4b79970
Update README to reflect current state
The README was stale: it referenced a 5-script pipeline that is now 4 scripts
(plus the summary), recommended a 5K smoke test that is no longer the right
starting point now that the v2 BacDive API is public, and described an
architecture that no longer matches (the separate fetch and featurize steps
are now a single streaming step).
New README:
- 5-step pipeline (01 scan → 02 fetch+featurize → 03 train → 04 eval → 05 summary)
- Streaming architecture with no persistent genome storage
- Updated layout to match current directory structure
- v1 backlog with the same 6 items in the project_goal memory file
README.md
CHANGED
|
@@ -1,95 +1,125 @@
|
|
| 1 |
# microbe-model
|
| 2 |
|
| 3 |
-
Predict cultivation conditions (optimal temperature, pH, oxygen requirement, salt tolerance)
|
| 4 |
-
microbial isolates from genome sequence alone. The long-term aim is to lower the cost of
|
| 5 |
-
"microbial dark matter", the >99% of microbial diversity that has not yet been grown
|
| 6 |
|
| 7 |
## Status
|
| 8 |
|
| 9 |
-
v0:
|
| 10 |
|
| 11 |
## Approach
|
| 12 |
|
| 13 |
```
|
| 14 |
-
BacDive (phenotype labels) ──┐
|
| 15 |
-
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
eval report (MAE, F1, importances)
|
| 29 |
```
|
| 30 |
|
| 31 |
The genome-to-phenotype features used here have well-established correlations with the target
|
| 32 |
-
properties (
|
| 33 |
-
so even a tabular model has a real signal to learn from.
|
| 34 |
-
a ceiling before investing in transformer-based approaches.
|
| 35 |
|
| 36 |
## Setup
|
| 37 |
|
| 38 |
```bash
|
| 39 |
# Requires Python 3.11 and uv (https://docs.astral.sh/uv/)
|
| 40 |
uv sync --all-extras
|
| 41 |
```
|
| 42 |
|
| 43 |
## Running the pipeline
|
| 44 |
|
| 45 |
```bash
|
| 46 |
-
# 1.
|
| 47 |
-
|
| 48 |
-
uv run python scripts/01_fetch_bacdive.py --end 5000 # smoke test, ~5 min
|
| 49 |
-
# uv run python scripts/01_fetch_bacdive.py --end 200000 # full BacDive, ~30 min
|
| 50 |
|
| 51 |
-
# 2.
|
| 52 |
-
|
| 53 |
|
| 54 |
-
# 3.
|
| 55 |
-
uv run python scripts/
|
| 56 |
|
| 57 |
-
# 4.
|
| 58 |
-
uv run python scripts/
|
| 59 |
|
| 60 |
-
# 5.
|
| 61 |
-
uv run python scripts/
|
| 62 |
```
|
| 63 |
|
| 64 |
## Layout
|
| 65 |
|
| 66 |
```
|
| 67 |
src/microbe_model/
|
| 68 |
-
config.py # paths,
|
| 69 |
-
data/
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
data/ # (gitignored) cached API responses, genomes, parquet tables
|
| 79 |
```
|
| 80 |
|
| 81 |
## What this is *not* yet
|
| 82 |
|
| 83 |
- Not a foundation model. No transformer. No genome language model.
|
| 84 |
- Not a platform. There is no upload UI or active-learning loop.
|
| 85 |
-
- Not validated against
|
| 86 |
|
| 87 |
-
|
| 88 |
|
| 89 |
## Environment variables
|
| 90 |
|
| 91 |
-
|
| 92 |
|
| 93 |
-
- `NCBI_API_KEY`: optional, raises NCBI rate limit from 3 req/s to 10 req/s.
|
| 94 |
|
| 95 |
-
(BacDive's v2 API
|
| 1 |
# microbe-model
|
| 2 |
|
| 3 |
+
Predict cultivation conditions (optimal temperature, pH, oxygen requirement, salt tolerance)
|
| 4 |
+
for microbial isolates from genome sequence alone. The long-term aim is to lower the cost of
|
| 5 |
+
culturing "microbial dark matter", the >99% of microbial diversity that has not yet been grown
|
| 6 |
+
in pure culture.
|
| 7 |
|
| 8 |
## Status
|
| 9 |
|
| 10 |
+
v0: tabular baseline against the full BacDive corpus. No deep model yet; the point of
|
| 11 |
+
v0 is to establish the ceiling on tabular performance before investing in transformers.
|
| 12 |
|
| 13 |
## Approach
|
| 14 |
|
| 15 |
```
|
| 16 |
+
BacDive v2 (phenotype labels) ──┐
|
| 17 |
+
                                ├──> joined table (strain, genome_accession, phenotypes)
|
| 18 |
+
NCBI Datasets v2 (genomes) ─────┘
|
| 19 |
+
                                │
|
| 20 |
+
                                ▼
|
| 21 |
+
                       streaming featurize
|
| 22 |
+
     (download → pyrodigal → AA-composition → discard FASTA)
|
| 23 |
+
                                │
|
| 24 |
+
                                ▼
|
| 25 |
+
                       multi-task XGBoost
|
| 26 |
+
             (5-fold GroupKFold by taxonomic family)
|
| 27 |
+
                                │
|
| 28 |
+
                                ▼
|
| 29 |
+
             eval report (`artifacts/eval_report.md`)
|
| 30 |
```
|
| 31 |
|
| 32 |
The genome-to-phenotype features used here have well-established correlations with the target
|
| 33 |
+
properties (proteome amino acid composition correlates with optimal growth temperature, GC
|
| 34 |
+
content with thermophily, etc.), so even a tabular model has a real signal to learn from.
|
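As a rough illustration of the two feature families named above, the sketch below computes proteome amino-acid composition and GC content with the standard library only. The function names `aa_composition` and `gc_content` and the toy sequences are hypothetical; this is not the code in `src/microbe_model/features/genome.py`, which runs pyrodigal first to obtain the protein sequences.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_composition(proteins: list[str]) -> dict[str, float]:
    """Fraction of each standard amino acid across all given protein sequences."""
    counts: Counter[str] = Counter()
    for seq in proteins:
        # Ignore stop symbols and ambiguous residues such as '*' or 'X'.
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}


def gc_content(dna: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    dna = dna.upper()
    gc = dna.count("G") + dna.count("C")
    acgt = gc + dna.count("A") + dna.count("T")
    return gc / acgt if acgt else 0.0


# Toy inputs (assumed for illustration): two short "proteins" and one short contig.
features = aa_composition(["MKVLA", "MAAK"])
features["gc_content"] = gc_content("ATGCGCNNAT")
```

Each genome then becomes one fixed-length row (20 composition values plus GC content), which is the shape a tabular model expects.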
| 35 |
|
| 36 |
## Setup
|
| 37 |
|
| 38 |
```bash
|
| 39 |
# Requires Python 3.11 and uv (https://docs.astral.sh/uv/)
|
| 40 |
uv sync --all-extras
|
| 41 |
+
|
| 42 |
+
# (optional) NCBI API key raises rate limits; copy .env.example to .env and fill in.
|
| 43 |
+
cp .env.example .env
|
| 44 |
```
|
| 45 |
|
| 46 |
## Running the pipeline
|
| 47 |
|
| 48 |
```bash
|
| 49 |
+
# 1. Scan BacDive (~10 min for full corpus). Writes data/bacdive_phenotypes.parquet.
|
| 50 |
+
uv run python scripts/01_fetch_bacdive.py --end 200000
|
| 51 |
|
| 52 |
+
# 2. Streaming fetch + featurize all training-ready strains (~5 hr for ~17K strains
|
| 53 |
+
# with 7 worker processes). Writes data/features.parquet. Resumable.
|
| 54 |
+
uv run python scripts/02_fetch_and_featurize.py --require-target any
|
| 55 |
|
| 56 |
+
# 3. Multi-task XGBoost. Writes artifacts/baseline_results.json + predictions.parquet.
|
| 57 |
+
uv run python scripts/03_train_baseline.py
|
| 58 |
|
| 59 |
+
# 4. Render the eval report. Writes artifacts/eval_report.md.
|
| 60 |
+
uv run python scripts/04_eval.py
|
| 61 |
|
| 62 |
+
# 5. Top-level overnight summary. Writes OVERNIGHT_SUMMARY.md.
|
| 63 |
+
uv run python scripts/05_overnight_summary.py
|
| 64 |
```
|
| 65 |
|
| 66 |
+
For overnight runs, `scripts/run_train_and_eval.sh` chains 03 → 04 → 05.
|
| 67 |
+
|
| 68 |
+
## Architecture
|
| 69 |
+
|
| 70 |
+
- **`src/microbe_model/data/bacdive.py`**: v2 REST client (public, no auth). Discovers
|
| 71 |
+
strains by batch-scanning the integer ID range; ~150K live records in 2K calls.
|
| 72 |
+
- **`src/microbe_model/pipeline.py`**: streaming fetch + featurize. Each worker process
|
| 73 |
+
downloads a genome FASTA, runs pyrodigal, extracts features, and discards the FASTA, so
|
| 74 |
+
no genome is stored persistently. Resumable via the JSONL append log.
|
| 75 |
+
- **`src/microbe_model/features/genome.py`**: pyrodigal CDS prediction + amino-acid
|
| 76 |
+
composition features. Uses single-genome+train mode (~7× faster than meta on assembled
|
| 77 |
+
genomes) with a meta fallback for very short input.
|
| 78 |
+
- **`src/microbe_model/train/baseline.py`**: multi-task XGBoost with per-fold class
|
| 79 |
+
re-encoding for classification (handles non-contiguous class subsets).
|
| 80 |
+
- **`src/microbe_model/eval.py`**: markdown report renderer. TL;DR + corpus + per-target
|
| 81 |
+
metrics + per-family error breakdown + limitations + next steps.
|
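The per-fold class re-encoding mentioned for `train/baseline.py` can be sketched as follows. When a training fold happens to miss some classes, the surviving label ids are no longer contiguous (for example 0, 3, 5), while XGBoost's classifiers generally expect labels 0..k-1. The helper `reencode_labels` and the example label ids are assumptions for illustration, not the repository's code.

```python
def reencode_labels(
    train_labels: list[int],
) -> tuple[list[int], dict[int, int], dict[int, int]]:
    """Map the classes present in this fold onto contiguous ids 0..k-1.

    Returns the re-encoded labels plus both lookup tables, so predictions
    can be mapped back to the original (global) class ids afterwards.
    """
    classes = sorted(set(train_labels))
    to_local = {c: i for i, c in enumerate(classes)}
    to_global = {i: c for c, i in to_local.items()}
    return [to_local[c] for c in train_labels], to_local, to_global


# A fold whose training split only contains global classes 0, 3 and 5.
train = [0, 3, 3, 5, 0]
local, to_local, to_global = reencode_labels(train)

# A classifier trained on `local` predicts local ids; translate them back.
predicted_local = [2, 0, 1]
predicted_global = [to_global[i] for i in predicted_local]
```

Validation rows whose class never appeared in the training split have no local id; how those are scored is a design choice this sketch leaves open.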
| 82 |
+
|
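The resumability described for `pipeline.py` rests on an append-only JSONL log. A minimal sketch of that pattern, assuming one JSON object per genome with a hypothetical `accession` key and a stand-in `featurize` callable (the real record schema is not shown in this README):

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def load_done(log_path: Path) -> set[str]:
    """Collect accessions that already have a complete record in the log."""
    done: set[str] = set()
    if not log_path.exists():
        return done
    with log_path.open() as fh:
        for line in fh:
            try:
                done.add(json.loads(line)["accession"])
            except (json.JSONDecodeError, KeyError):
                continue  # skip blank or partially written lines
    return done


def run(accessions: Iterable[str], log_path: Path,
        featurize: Callable[[str], dict]) -> int:
    """Featurize only accessions missing from the log; return how many ran."""
    done = load_done(log_path)
    processed = 0
    for acc in accessions:
        if acc in done:
            continue
        record = {"accession": acc, **featurize(acc)}
        with log_path.open("a") as fh:  # append, never rewrite earlier records
            fh.write(json.dumps(record) + "\n")
        processed += 1
    return processed


# Demo with a fake featurizer: the second call only processes the new accession.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "features.jsonl"
    fake = lambda acc: {"n_chars": len(acc)}
    first = run(["GCF_1", "GCF_2"], log, fake)
    second = run(["GCF_1", "GCF_2", "GCF_3"], log, fake)
```

Because each record is written only after featurization finishes, an interrupted run loses at most the genomes that were in flight.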
| 83 |
## Layout
|
| 84 |
|
| 85 |
```
|
| 86 |
src/microbe_model/
|
| 87 |
+
config.py # paths, env vars, prediction targets
|
| 88 |
+
data/bacdive.py # BacDive v2 client
|
| 89 |
+
features/genome.py # pyrodigal + AA-composition feature extraction
|
| 90 |
+
pipeline.py # streaming async fetch + featurize
|
| 91 |
+
train/baseline.py # multi-task XGBoost + GroupKFold
|
| 92 |
+
eval.py # markdown report renderer
|
| 93 |
+
scripts/ # numbered pipeline entry points (01-05)
|
| 94 |
+
tests/ # 15 unit + integration tests
|
| 95 |
+
data/ # (gitignored) parquet tables, JSONL features, BacDive cache
|
| 96 |
+
artifacts/ # eval report, training results, logs
|
| 97 |
```
|
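To show what grouping cross-validation folds by taxonomic family buys, here is a standard-library sketch that assigns whole families to folds so that no family is split between training and validation. The project itself names scikit-learn's GroupKFold; the greedy `group_folds` helper and the family list below are illustrative assumptions, not that library's algorithm.

```python
from collections import defaultdict


def group_folds(groups: list[str], n_splits: int) -> list[list[int]]:
    """Assign every sample index to a fold, keeping each group in one fold.

    Greedy balancing: the largest groups are placed first, each into the
    fold that currently holds the fewest samples.
    """
    members: dict[str, list[int]] = defaultdict(list)
    for idx, group in enumerate(groups):
        members[group].append(idx)
    if len(members) < n_splits:
        raise ValueError("need at least as many groups as folds")
    folds: list[list[int]] = [[] for _ in range(n_splits)]
    for group in sorted(members, key=lambda g: (-len(members[g]), g)):
        smallest = min(range(n_splits), key=lambda i: len(folds[i]))
        folds[smallest].extend(members[group])
    return folds


families = [
    "Bacillaceae", "Bacillaceae", "Vibrionaceae",
    "Enterobacteriaceae", "Enterobacteriaceae", "Enterobacteriaceae",
    "Thermaceae",
]
folds = group_folds(families, n_splits=2)
```

Holding out entire families makes the validation score closer to performance on unseen lineages than a random split would be, since close relatives share both genome features and phenotypes.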
| 98 |
|
| 99 |
## What this is *not* yet
|
| 100 |
|
| 101 |
- Not a foundation model. No transformer. No genome language model.
|
| 102 |
- Not a platform. There is no upload UI or active-learning loop.
|
| 103 |
+
- Not validated against truly out-of-distribution organisms (BacDive is survivorship-biased
|
| 104 |
+
to organisms that have been cultured at least once).
|
| 105 |
+
|
| 106 |
+
These are deliberate v0 boundaries; see `OVERNIGHT_SUMMARY.md` after a run for the
|
| 107 |
+
headline result and `artifacts/eval_report.md` for the full eval.
|
| 108 |
+
|
| 109 |
+
## v1 backlog
|
| 110 |
|
| 111 |
+
1. Tetranucleotide and codon-usage features
|
| 112 |
+
2. LPSN/GTDB family proper join (for tighter GroupKFold)
|
| 113 |
+
3. KOMODO media DB integration as a richer label source
|
| 114 |
+
4. Pyrodigal-GV for atypical genetic codes
|
| 115 |
+
5. Genome-LM embeddings (Nucleotide Transformer / Evo-1 / DNABERT-2)
|
| 116 |
+
6. Active learning loop: highlight novel-family strains where the model is uncertain,
|
| 117 |
+
prioritize for wet-lab cultivation testing.
|
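Backlog item 1 could start from something like the following sketch of tetranucleotide frequencies. It counts overlapping 4-mers on the given strand only and skips windows containing ambiguous bases; whether the eventual feature should also merge reverse complements is left open. The name `tetranucleotide_freqs` and the toy sequence are assumptions for illustration.

```python
from collections import Counter
from itertools import product

# All 256 possible 4-mers over the unambiguous DNA alphabet.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_freqs(dna: str) -> dict[str, float]:
    """Relative frequency of each 4-mer among the valid overlapping windows."""
    dna = dna.upper()
    counts = Counter(dna[i:i + 4] for i in range(len(dna) - 3))
    # Windows containing N or other ambiguity codes are not in KMERS and drop out.
    valid = {kmer: counts.get(kmer, 0) for kmer in KMERS}
    total = sum(valid.values())
    return {kmer: (n / total if total else 0.0) for kmer, n in valid.items()}


freqs = tetranucleotide_freqs("ACGTACGTNACGT")
```

This yields a fixed 256-column vector per genome, so it can be appended to the existing amino-acid composition features without changing the tabular setup.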
| 118 |
|
| 119 |
## Environment variables
|
| 120 |
|
| 121 |
+
`.env` (gitignored) holds:
|
| 122 |
|
| 123 |
+
- `NCBI_API_KEY`: optional, raises the NCBI Datasets rate limit from 3 req/s to 10 req/s.
|
| 124 |
|
| 125 |
+
(BacDive's v2 API is public; no token needed.)
|
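One way a client might turn the optional key into a request budget is sketched below. The 3 and 10 req/s figures come from the README text above; the helper names `ncbi_rate_limit` and `min_interval_seconds` are hypothetical, and loading `.env` into the process environment is assumed to happen elsewhere.

```python
import os
from typing import Mapping, Optional


def ncbi_rate_limit(env: Optional[Mapping[str, str]] = None) -> int:
    """Requests per second to allow: 10 with an API key set, 3 without."""
    env = os.environ if env is None else env
    return 10 if env.get("NCBI_API_KEY") else 3


def min_interval_seconds(env: Optional[Mapping[str, str]] = None) -> float:
    """Minimum spacing between consecutive requests under that limit."""
    return 1.0 / ncbi_rate_limit(env)


# Passing an explicit mapping keeps the example independent of the real environment.
without_key = ncbi_rate_limit({})
with_key = ncbi_rate_limit({"NCBI_API_KEY": "example-key"})
```

An empty string counts as "no key" here, which matches the common case of an unfilled `.env` template.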