Miyu Horiuchi committed
Commit d278e18 · 1 Parent(s): 4b79970

Update README to reflect current state

The README was stale: it referenced a 5-script pipeline that's now 4 scripts
(plus the summary), recommended a 5K smoke test that's not the right starting
point now that the v2 BacDive API is public, and described an architecture
that no longer matches (separate fetch + featurize is now streaming).

New README:
- 5-step pipeline (01 scan → 02 fetch+featurize → 03 train → 04 eval → 05 summary)
- Streaming architecture with no persistent genome storage
- Updated layout to match current directory structure
- v1 backlog with the same 6 items as in the project_goal memory file
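
The streaming, resumable design this commit describes can be illustrated with a short sketch. It is a minimal stand-in, not the repository's `pipeline.py`: `fetch` and `predict_proteins` are hypothetical callables (the real pipeline downloads from NCBI and runs pyrodigal), the function names and the `accession` log key are assumptions, and worker processes are omitted for brevity.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(proteins):
    """Fraction of each of the 20 standard amino acids across all proteins."""
    counts = Counter()
    for seq in proteins:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}


def load_done(log_path):
    """Accessions already present in the JSONL log (empty set if no log yet)."""
    if not log_path.exists():
        return set()
    with log_path.open() as fh:
        return {json.loads(line)["accession"] for line in fh if line.strip()}


def run(accessions, fetch, predict_proteins, log_path):
    """Featurize each accession once, skipping those already logged."""
    done = load_done(log_path)
    processed = []
    with log_path.open("a") as log:
        for acc in accessions:
            if acc in done:
                continue
            # The FASTA exists only inside this temporary directory,
            # so nothing genome-sized persists after featurization.
            with tempfile.TemporaryDirectory() as tmp:
                fasta = Path(tmp) / f"{acc}.fna"
                fasta.write_text(fetch(acc))
                features = aa_composition(predict_proteins(fasta))
            log.write(json.dumps({"accession": acc, **features}) + "\n")
            log.flush()  # completed rows survive an interrupted run
            processed.append(acc)
    return processed
```

Calling `run` a second time with the same log path skips every accession already written, which is what makes an append-only log a convenient resume mechanism.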

Files changed (1)
  1. README.md +80 -50
README.md CHANGED
@@ -1,95 +1,125 @@
  # microbe-model

- Predict cultivation conditions (optimal temperature, pH, oxygen requirement, salt tolerance) for
- microbial isolates from genome sequence alone. The long-term aim is to lower the cost of culturing
- "microbial dark matter": the >99% of microbial diversity that has not yet been grown in pure culture.

  ## Status

- v0: scaffolding the data pipeline + a non-deep-learning baseline. No trained model yet.

  ## Approach

  ```
- BacDive (phenotype labels) ──┐
-                              ├──> joined table (strain, genome_accession, phenotypes)
- GTDB / NCBI (genomes) ───────┘
-                                         │
-                                         ▼
-                                feature extraction
-                         (genome statistics, codon usage,
-                          proteome-level amino acid stats)
-                                         │
-                                         ▼
-                            XGBoost multi-task baseline
-                             (group K-fold by family)
-                                         │
-                                         ▼
-                        eval report (MAE, F1, importances)
  ```

  The genome→phenotype features used here have well-established correlations with the target
- properties (e.g. proteome amino acid composition correlates with optimal growth temperature),
- so even a tabular model has a real signal to learn from. The point of the v0 is to establish
- a ceiling before investing in transformer-based approaches.

  ## Setup

  ```bash
  # Requires Python 3.11 and uv (https://docs.astral.sh/uv/)
  uv sync --all-extras
  ```

  ## Running the pipeline

  ```bash
- # 1. Pull strain metadata + phenotype labels from BacDive
- #    (BacDive v2 API is public as of Feb 2026; no registration needed)
- uv run python scripts/01_fetch_bacdive.py --end 5000 # smoke test, ~5 min
- # uv run python scripts/01_fetch_bacdive.py --end 200000 # full BacDive, ~30 min

- # 2. Download genomes for strains that have an accession
- uv run python scripts/02_fetch_genomes.py

- # 3. Extract genome-level features (CDS prediction + amino acid stats)
- uv run python scripts/03_extract_features.py

- # 4. Train multi-task XGBoost baseline
- uv run python scripts/04_train_baseline.py

- # 5. Render eval report
- uv run python scripts/05_eval.py
  ```

  ## Layout

  ```
  src/microbe_model/
-   config.py          # paths, constants
-   data/
-     bacdive.py       # BacDive REST API client
-     ncbi.py          # NCBI genome fetcher (Datasets API)
-   features/
-     genome.py        # gene prediction + tabular feature extraction
-   train/
-     baseline.py      # multi-task XGBoost + group K-fold eval
- scripts/             # runnable entry points (numbered by pipeline order)
- tests/               # smoke tests on small fixtures
- data/                # (gitignored) cached API responses, genomes, parquet tables
  ```

  ## What this is *not* yet

  - Not a foundation model. No transformer. No genome language model.
  - Not a platform. There is no upload UI or active-learning loop.
- - Not validated against held-out organisms. The eval scaffolding exists; the data does not.

- These are deliberate v0 boundaries. See the project notes for the longer-term plan.

  ## Environment variables

- Copy `.env.example` to `.env` and fill in:

- - `NCBI_API_KEY`: optional, raises NCBI rate limit from 3 req/s to 10 req/s.

- (BacDive's v2 API was opened to the public in February 2026; no registration or token needed.)
 
  # microbe-model

+ Predict cultivation conditions (optimal temperature, pH, oxygen requirement, salt tolerance)
+ for microbial isolates from genome sequence alone. The long-term aim is to lower the cost of
+ culturing "microbial dark matter": the >99% of microbial diversity that has not yet been grown
+ in pure culture.

  ## Status

+ v0: tabular baseline against the full BacDive corpus. No deep model yet; the point of the
+ v0 is to establish the ceiling on tabular performance before investing in transformers.

  ## Approach

  ```
+ BacDive v2 (phenotype labels) ──┐
+                                 ├──> joined table (strain, genome_accession, phenotypes)
+ NCBI Datasets v2 (genomes) ─────┘
+                                            │
+                                            ▼
+                                   streaming featurize
+                 (download → pyrodigal → AA-composition → discard FASTA)
+                                            │
+                                            ▼
+                                   multi-task XGBoost
+                         (5-fold GroupKFold by taxonomic family)
+                                            │
+                                            ▼
+                        eval report (`artifacts/eval_report.md`)
  ```

  The genome→phenotype features used here have well-established correlations with the target
+ properties (proteome amino acid composition correlates with optimal growth temperature, GC
+ content with thermophily, etc.), so even a tabular model has a real signal to learn from.

  ## Setup

  ```bash
  # Requires Python 3.11 and uv (https://docs.astral.sh/uv/)
  uv sync --all-extras
+
+ # (optional) NCBI API key raises rate limits; copy .env.example to .env and fill in.
+ cp .env.example .env
  ```

  ## Running the pipeline

  ```bash
+ # 1. Scan BacDive (~10 min for full corpus). Writes data/bacdive_phenotypes.parquet.
+ uv run python scripts/01_fetch_bacdive.py --end 200000

+ # 2. Streaming fetch + featurize all training-ready strains (~5 hr for ~17K strains
+ #    with 7 worker processes). Writes data/features.parquet. Resumable.
+ uv run python scripts/02_fetch_and_featurize.py --require-target any

+ # 3. Multi-task XGBoost. Writes artifacts/baseline_results.json + predictions.parquet.
+ uv run python scripts/03_train_baseline.py

+ # 4. Render the eval report. Writes artifacts/eval_report.md.
+ uv run python scripts/04_eval.py

+ # 5. Top-level overnight summary. Writes OVERNIGHT_SUMMARY.md.
+ uv run python scripts/05_overnight_summary.py
  ```

+ For overnight runs, `scripts/run_train_and_eval.sh` chains 03 → 04 → 05.
+
+ ## Architecture
+
+ - **`src/microbe_model/data/bacdive.py`**: v2 REST client (public, no auth). Discovers
+   strains by batch-scanning the integer ID range; ~150K live records in 2K calls.
+ - **`src/microbe_model/pipeline.py`**: streaming fetch + featurize. Each worker process
+   downloads a genome FASTA, runs pyrodigal, extracts features, and discards the FASTA,
+   so there is no persistent genome storage. Resumable via the JSONL append log.
+ - **`src/microbe_model/features/genome.py`**: pyrodigal CDS prediction + amino-acid
+   composition features. Uses single-genome+train mode (~7× faster than meta on assembled
+   genomes) with a meta fallback for very short input.
+ - **`src/microbe_model/train/baseline.py`**: multi-task XGBoost with per-fold class
+   re-encoding for classification (handles non-contiguous class subsets).
+ - **`src/microbe_model/eval.py`**: markdown report renderer. TL;DR + corpus + per-target
+   metrics + per-family error breakdown + limitations + next steps.
+
  ## Layout

  ```
  src/microbe_model/
+   config.py            # paths, env vars, prediction targets
+   data/bacdive.py      # BacDive v2 client
+   features/genome.py   # pyrodigal + AA-composition feature extraction
+   pipeline.py          # streaming async fetch + featurize
+   train/baseline.py    # multi-task XGBoost + GroupKFold
+   eval.py              # markdown report renderer
+ scripts/               # numbered pipeline entry points (01–05)
+ tests/                 # 15 unit + integration tests
+ data/                  # (gitignored) parquet tables, JSONL features, BacDive cache
+ artifacts/             # eval report, training results, logs
  ```

  ## What this is *not* yet

  - Not a foundation model. No transformer. No genome language model.
  - Not a platform. There is no upload UI or active-learning loop.
+ - Not validated against truly out-of-distribution organisms (BacDive is survivorship-biased
+   to organisms that have been cultured at least once).
+
+ These are deliberate v0 boundaries; see `OVERNIGHT_SUMMARY.md` after a run for the
+ headline result and `artifacts/eval_report.md` for the full eval.
+
+ ## v1 backlog

+ 1. Tetranucleotide and codon-usage features
+ 2. LPSN/GTDB family proper join (for tighter GroupKFold)
+ 3. KOMODO media DB integration as a richer label source
+ 4. Pyrodigal-GV for atypical genetic codes
+ 5. Genome-LM embeddings (Nucleotide Transformer / Evo-1 / DNABERT-2)
+ 6. Active learning loop: highlight novel-family strains where the model is uncertain,
+    prioritize for wet-lab cultivation testing.

  ## Environment variables

+ `.env` (gitignored) holds:

+ - `NCBI_API_KEY`: optional, raises NCBI Datasets rate limit from 3 req/s to 10 req/s.

+ (BacDive's v2 API is public; no token needed.)
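
As a rough illustration of the evaluation scheme the new README describes (folds grouped by taxonomic family, with per-fold class re-encoding), here is a standard-library sketch. It is not the code in `train/baseline.py`: the greedy splitter is a simplified stand-in for a GroupKFold-style split, and `group_kfold` and `reencode` are hypothetical names. The motivation for re-encoding is stated here as an assumption: holding out whole families can remove every example of a class from a training fold, while a multi-class XGBoost model expects training labels numbered 0..k-1.

```python
from collections import defaultdict


def group_kfold(groups, n_splits):
    """Yield (train_idx, test_idx) pairs in which no group spans both sides."""
    members = defaultdict(list)
    for i, g in enumerate(groups):
        members[g].append(i)
    # Greedy balancing: place the largest groups first, each into the
    # fold that currently holds the fewest samples.
    folds = [[] for _ in range(n_splits)]
    for g in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        min(folds, key=len).extend(members[g])
    for k in range(n_splits):
        test = sorted(folds[k])
        train = sorted(i for j, f in enumerate(folds) if j != k for i in f)
        yield train, test


def reencode(train_labels):
    """Map the labels present in one training fold to 0..k-1.

    Returns (to_local, classes): to_local maps an original label to its
    fold-local index, and classes[local_index] recovers the original label.
    """
    classes = sorted(set(train_labels))
    to_local = {c: i for i, c in enumerate(classes)}
    return to_local, classes
```

In use, each fold would fit on `[to_local[y] for y in train_labels]` and translate predictions back through `classes`, so reported metrics stay in the original label space.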