# microbe-model: v0 baseline eval report

_Generated: 2026-05-05T10:42:09+00:00_

## TL;DR

- **`optimal_temperature_c`**: MAE = **2.86** (vs always-predict-mean 4.98, **+43%**)
- **`optimal_ph`**: MAE = **0.48** (vs always-predict-mean 0.55, **+12%**)
- **`oxygen_requirement`**: macro-F1 = **0.357** (vs always-predict-majority 0.059, **+507%**)
- **`salt_tolerance_pct`**: MAE = **2.11** (vs always-predict-mean 2.51, **+16%**)

Trained on **46,029** strains with **423** genome-derived features. Cross-validation: 5-fold GroupKFold by taxonomic family.
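
The point of grouping folds by family is that no family contributes strains to both the training and the test side of a split. The report does not show its splitting code; "GroupKFold" presumably refers to scikit-learn's `GroupKFold`. The stdlib-only sketch below illustrates the same idea under that assumption, with a simple greedy balancing rule and toy family labels that are not taken from the corpus.

```python
from collections import defaultdict

def group_kfold(groups, n_splits=5):
    """Yield (train_idx, test_idx) pairs in which no group spans both sides."""
    members = defaultdict(list)
    for i, g in enumerate(groups):
        members[g].append(i)
    # Greedy balancing: largest groups first, each into the currently smallest fold.
    folds = [[] for _ in range(n_splits)]
    for g in sorted(members, key=lambda g: (-len(members[g]), g)):
        smallest = min(range(n_splits), key=lambda k: len(folds[k]))
        folds[smallest].extend(members[g])
    for k in range(n_splits):
        test = sorted(folds[k])
        held_out = set(test)
        train = [i for i in range(len(groups)) if i not in held_out]
        yield train, test

# Toy labels (illustrative only): 10 strains from 4 families.
families = (["Bacillaceae"] * 4 + ["Enterobacteriaceae"] * 3
            + ["Myxococcaceae"] * 2 + ["Nocardiaceae"])
splits = list(group_kfold(families, n_splits=2))
for train, test in splits:
    train_fams = {families[i] for i in train}
    test_fams = {families[i] for i in test}
    print(sorted(test_fams), "held out; overlap:", train_fams & test_fams)
```

Every test fold is scored on families the model never saw during training, which is why the fold-level metrics further down can differ noticeably from one fold to the next.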

## Corpus

- Total strains in feature table: **46,029**
- Labeled-strain counts by target:
  - `optimal_temperature_c`: 45,621
  - `optimal_ph`: 5,103
  - `oxygen_requirement`: 21,639
  - `salt_tolerance_pct`: 3,844

## Target distributions

- `optimal_temperature_c`: n=45,621, mean=32.24, std=7.13, p10=27.50, median=30.00, p90=37.00
- `optimal_ph`: n=5,103, mean=7.19, std=0.82, p10=6.50, median=7.00, p90=8.00
- `salt_tolerance_pct`: n=3,844, mean=3.15, std=4.28, p10=0.00, median=2.50, p90=7.00
- `oxygen_requirement`:
  - `aerobe`: 7,803
  - `anaerobe`: 4,193
  - `microaerophile`: 3,804
  - `facultative anaerobe`: 3,389
  - `obligate aerobe`: 2,213
  - `obligate anaerobe`: 136
  - `facultative aerobe`: 87
  - `aerotolerant`: 12
  - `microaerotolerant`: 2
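
The class counts above are enough to reproduce the always-predict-majority macro-F1 quoted for `oxygen_requirement`. A minimal sketch, assuming the baseline is scored on the pooled labels (the report may instead average per fold, which would give a slightly different value):

```python
# Class counts copied from the `oxygen_requirement` distribution above.
counts = {
    "aerobe": 7803, "anaerobe": 4193, "microaerophile": 3804,
    "facultative anaerobe": 3389, "obligate aerobe": 2213,
    "obligate anaerobe": 136, "facultative aerobe": 87,
    "aerotolerant": 12, "microaerotolerant": 2,
}

def majority_macro_f1(class_counts):
    """Macro-F1 of a classifier that always predicts the most frequent class."""
    total = sum(class_counts.values())
    majority = max(class_counts.values())
    precision = majority / total  # every sample is labelled as the majority class
    recall = 1.0                  # so every true majority sample is recovered
    f1_majority = 2 * precision * recall / (precision + recall)
    # All other classes are never predicted, so their F1 is 0.
    return f1_majority / len(class_counts)

baseline = majority_macro_f1(counts)
print(round(baseline, 3))
```

Because macro-F1 averages over all nine classes equally, the four classes with fewer than 150 examples pull both the baseline and the model score down sharply.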

## Per-target results (5-fold GroupKFold by family)

Metrics: regression = MAE (lower is better), classification = macro-F1 (higher is better).
Each is shown alongside the dumb-baseline (always-predict-mean / always-predict-majority).

| Target | Task | n labeled | Model metric | Baseline | Improvement |
|---|---|---|---|---|---|
| `optimal_temperature_c` | regression | 45,621 | MAE=2.857 | MAE=4.981 | +42.6% |
| `optimal_ph` | regression | 5,103 | MAE=0.482 | MAE=0.546 | +11.6% |
| `oxygen_requirement` | classification | 21,639 | F1=0.357 | F1=0.059 | +507.0% |
| `salt_tolerance_pct` | regression | 3,844 | MAE=2.112 | MAE=2.515 | +16.0% |
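
The Improvement column appears to be the relative change versus the baseline, with the sign flipped for MAE so that positive always means better. The sketch below recomputes it from the rounded values in the table, on that assumption; small differences from the printed percentages (for example for `optimal_ph` and `oxygen_requirement`) would then be rounding in the displayed metrics.

```python
def relative_improvement(model, baseline, higher_is_better):
    """Percent improvement of `model` over `baseline`; positive means better."""
    if higher_is_better:
        return 100.0 * (model - baseline) / baseline
    return 100.0 * (baseline - model) / baseline

# (model metric, baseline metric, higher_is_better), copied from the table above.
rows = {
    "optimal_temperature_c": (2.857, 4.981, False),
    "optimal_ph": (0.482, 0.546, False),
    "oxygen_requirement": (0.357, 0.059, True),
    "salt_tolerance_pct": (2.112, 2.515, False),
}
improvements = {t: relative_improvement(*v) for t, v in rows.items()}
for target, pct in improvements.items():
    print(f"{target}: {pct:+.1f}%")
```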

### `optimal_temperature_c`: fold-by-fold

| Fold | Metric | Train | Test |
|---|---|---|---|
| 1 | mae = 2.953 | n=36,496 | n=9,125 |
| 2 | mae = 2.626 | n=36,497 | n=9,124 |
| 3 | mae = 3.060 | n=36,497 | n=9,124 |
| 4 | mae = 3.265 | n=36,497 | n=9,124 |
| 5 | mae = 2.381 | n=36,497 | n=9,124 |

**Top 10 features for `optimal_temperature_c`:**

- `ivywrel_frac`: 0.1235
- `iso_cat2_thermophilic_gt45_c`: 0.0288
- `iso_cat2_patient`: 0.0251
- `iso_cat2_human`: 0.0234
- `n_predicted_cds`: 0.0216
- `iso_cat1_infection`: 0.0204
- `aa_frac_C`: 0.0143
- `genome_size_nt`: 0.0123
- `tetra_CTAA`: 0.0118
- `aa_frac_D`: 0.0109
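
The top-ranked feature, `ivywrel_frac`, is by its name the fraction of residues drawn from the amino-acid set I, V, Y, W, R, E, L. How the pipeline computes it is not shown here; the sketch below assumes it is pooled over all predicted proteins of a genome, and the toy sequences are made up for illustration.

```python
IVYWREL = frozenset("IVYWREL")

def ivywrel_frac(proteins):
    """Fraction of residues in {I, V, Y, W, R, E, L}, pooled over all proteins."""
    total = 0
    hits = 0
    for seq in proteins:
        for aa in seq.upper():
            if aa.isalpha():  # skip stop markers ('*') and gap characters
                total += 1
                hits += aa in IVYWREL
    return hits / total if total else 0.0

# Toy proteome: 10 residues in total, 7 of them in the IVYWREL set.
toy = ["MKVLEY*", "MIRW"]
print(ivywrel_frac(toy))
```

Note that this version counts ambiguous residues such as `X` in the denominator; whether the real feature does the same is an open assumption.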

### `optimal_ph`: fold-by-fold

| Fold | Metric | Train | Test |
|---|---|---|---|
| 1 | mae = 0.440 | n=4,082 | n=1,021 |
| 2 | mae = 0.568 | n=4,082 | n=1,021 |
| 3 | mae = 0.494 | n=4,082 | n=1,021 |
| 4 | mae = 0.466 | n=4,083 | n=1,020 |
| 5 | mae = 0.444 | n=4,083 | n=1,020 |

**Top 10 features for `optimal_ph`:**

- `md_ph_median`: 0.0518
- `iso_cat2_acidic`: 0.0307
- `iso_cat2_alkaline`: 0.0287
- `neg_charged_frac`: 0.0146
- `aa_frac_H`: 0.0081
- `aa_frac_E`: 0.0077
- `tetra_CTCT`: 0.0071
- `iso_cat2_plant`: 0.0068
- `tetra_AGAC`: 0.0067
- `tetra_CACT`: 0.0065

### `oxygen_requirement`: fold-by-fold

| Fold | Metric | Train | Test |
|---|---|---|---|
| 1 | f1_macro = 0.353 | n=17,311 | n=4,328 |
| 2 | f1_macro = 0.375 | n=17,311 | n=4,326 |
| 3 | f1_macro = 0.357 | n=17,311 | n=4,328 |
| 4 | f1_macro = 0.274 | n=17,311 | n=4,328 |
| 5 | f1_macro = 0.429 | n=17,312 | n=4,327 |

**Top 10 features for `oxygen_requirement`:**

- `codon_ATA`: 0.0395
- `iso_cat1_host`: 0.0269
- `n_predicted_cds`: 0.0266
- `aa_frac_C`: 0.0195
- `iso_cat1_environmental`: 0.0162
- `codon_CGT`: 0.0144
- `iso_cat1_engineered`: 0.0139
- `iso_cat2_human`: 0.0124
- `genome_size_nt`: 0.0103
- `codon_TAA`: 0.0083

### `salt_tolerance_pct`: fold-by-fold

| Fold | Metric | Train | Test |
|---|---|---|---|
| 1 | mae = 1.926 | n=3,075 | n=769 |
| 2 | mae = 1.893 | n=3,075 | n=769 |
| 3 | mae = 2.746 | n=3,075 | n=769 |
| 4 | mae = 1.870 | n=3,075 | n=769 |
| 5 | mae = 2.128 | n=3,076 | n=768 |

**Top 10 features for `salt_tolerance_pct`:**

- `neg_charged_frac`: 0.0702
- `tetra_ATCC`: 0.0428
- `aa_frac_C`: 0.0298
- `iso_cat2_saline`: 0.0286
- `md_nacl_pct_median`: 0.0256
- `tetra_ACAT`: 0.0255
- `md_nacl_pct_max`: 0.0128
- `aa_frac_T`: 0.0120
- `codon_CCG`: 0.0093
- `tetra_TGAT`: 0.0089

## Feature ↔ target correlations (Spearman, top 10)

Sanity-checks the biology: features known to track each target should appear here at high |ρ|. E.g. `ivywrel_frac` should correlate with `optimal_temperature_c` (Zeldovich 2007 thermophile signature).
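
For reference, Spearman's ρ is the Pearson correlation of the rank-transformed values, so it captures any monotone relationship rather than only linear ones. The stdlib sketch below shows the calculation with average ranks for ties; the feature and temperature values are invented for illustration, and a library routine such as `scipy.stats.spearmanr` would also supply the p-value shown in the tables.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average rank of the tie block i..j
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Toy data (not from the corpus): a feature fraction vs. optimal temperature.
feature = [0.38, 0.40, 0.41, 0.44, 0.47]
temperature_c = [25, 30, 28, 37, 55]
rho = spearman_rho(feature, temperature_c)
print(round(rho, 3))
```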

### `optimal_temperature_c`

| Feature | Spearman ρ | p-value |
|---|---|---|
| `genome_size_nt` | -0.493 | 0.0e+00 |
| `n_predicted_cds` | -0.482 | 0.0e+00 |
| `aa_frac_P` | -0.391 | 0.0e+00 |
| `aa_frac_Y` | +0.390 | 0.0e+00 |
| `tetra_TCTT` | +0.383 | 0.0e+00 |
| `tetra_TATC` | +0.381 | 0.0e+00 |
| `tetra_GATA` | +0.381 | 0.0e+00 |
| `tetra_AAGA` | +0.381 | 0.0e+00 |
| `tetra_CATA` | +0.380 | 0.0e+00 |
| `tetra_TATG` | +0.379 | 0.0e+00 |

### `optimal_ph`

| Feature | Spearman ρ | p-value |
|---|---|---|
| `md_ph_median` | +0.429 | 4.0e-131 |
| `neg_charged_frac` | +0.304 | 1.6e-109 |
| `mean_isoelectric_point` | -0.278 | 1.8e-91 |
| `aa_frac_E` | +0.256 | 4.5e-77 |
| `md_nacl_pct_max` | +0.218 | 1.9e-33 |
| `md_nacl_pct_median` | +0.212 | 9.9e-32 |
| `iso_cat2_alkaline` | +0.165 | 2.5e-32 |
| `ivywrel_frac` | +0.159 | 2.4e-30 |
| `codon_AAG` | -0.154 | 1.7e-28 |
| `codon_CGA` | +0.153 | 5.8e-28 |

### `salt_tolerance_pct`

| Feature | Spearman ρ | p-value |
|---|---|---|
| `neg_charged_frac` | +0.442 | 1.5e-183 |
| `mean_isoelectric_point` | -0.344 | 1.7e-107 |
| `aa_frac_E` | +0.310 | 3.1e-86 |
| `tetra_GACT` | +0.302 | 4.3e-82 |
| `tetra_AGTC` | +0.302 | 1.0e-81 |
| `md_nacl_pct_max` | +0.298 | 2.9e-52 |
| `md_nacl_pct_median` | +0.290 | 1.6e-49 |
| `tetra_ACTC` | +0.282 | 2.2e-71 |
| `tetra_GAGT` | +0.273 | 1.9e-66 |
| `iso_cat2_saline` | +0.263 | 9.4e-62 |

## Per-family error breakdown (regression targets)

Top 15 most-represented families, MAE per family. Highlights where the model is doing well vs. struggling.
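
A breakdown like this can be produced by grouping absolute errors by family and keeping the families with the most strains. The sketch below is one way to do it, assuming out-of-fold predictions are available as plain lists; the function name and the toy values are illustrative and not taken from the pipeline.

```python
from collections import defaultdict

def per_family_mae(families, y_true, y_pred, top_n=15):
    """Return (family, n, MAE) rows for the top_n most-represented families."""
    errors = defaultdict(list)
    for fam, t, p in zip(families, y_true, y_pred):
        errors[fam].append(abs(t - p))
    rows = [(fam, len(e), sum(e) / len(e)) for fam, e in errors.items()]
    # Most strains first; ties broken alphabetically for a stable order.
    rows.sort(key=lambda r: (-r[1], r[0]))
    return rows[:top_n]

# Toy example: six strains, three families, temperatures in degrees Celsius.
families = ["Bacillaceae"] * 3 + ["Nocardiaceae"] * 2 + ["Thermaceae"]
y_true = [30, 37, 28, 30, 25, 60]
y_pred = [31, 35, 28, 34, 25, 50]
table = per_family_mae(families, y_true, y_pred, top_n=2)
for fam, n, mae in table:
    print(f"| {fam} | {n} | {mae:.3f} |")
```

Ranking by strain count rather than by error means a small family with a large MAE, like the single toy `Thermaceae` strain here, never shows up in the table.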

### `optimal_temperature_c`

| Family | n | MAE |
|---|---|---|
| Enterobacteriaceae | 2662 | 3.792 |
| Streptomycetaceae | 2212 | 1.783 |
| Bacillaceae | 1886 | 3.174 |
| Lactobacillaceae | 1732 | 3.709 |
| Pseudomonadaceae | 1621 | 2.488 |
| Myxococcaceae | 1546 | 0.238 |
| Streptococcaceae | 1170 | 2.537 |
| Staphylococcaceae | 1068 | 3.374 |
| Flavobacteriaceae | 981 | 4.116 |
| Corynebacteriaceae | 900 | 2.146 |
| Moraxellaceae | 890 | 3.388 |
| Paenibacillaceae | 760 | 3.081 |
| Microbacteriaceae | 734 | 2.459 |
| Micrococcaceae | 719 | 2.811 |
| Nocardiaceae | 715 | 2.276 |

### `optimal_ph`

| Family | n | MAE |
|---|---|---|
| Flavobacteriaceae | 355 | 0.405 |
| Bacillaceae | 298 | 0.606 |
| Roseobacteraceae | 204 | 0.375 |
| Paenibacillaceae | 139 | 0.469 |
| Microbacteriaceae | 120 | 0.446 |
| Sphingobacteriaceae | 114 | 0.336 |
| Sphingomonadaceae | 102 | 0.319 |
| Streptomycetaceae | 98 | 0.513 |
| Pseudonocardiaceae | 93 | 0.479 |
| Halomonadaceae | 82 | 0.584 |
| Micrococcaceae | 82 | 0.613 |
| Nocardioidaceae | 80 | 0.502 |
| Paracoccaceae | 76 | 0.574 |
| Alteromonadaceae | 71 | 0.355 |
| Erythrobacteraceae | 68 | 0.446 |

### `salt_tolerance_pct`

| Family | n | MAE |
|---|---|---|
| Flavobacteriaceae | 267 | 1.713 |
| Streptomycetaceae | 264 | 1.987 |
| Bacillaceae | 201 | 3.315 |
| Roseobacteraceae | 127 | 1.395 |
| Pseudonocardiaceae | 123 | 2.280 |
| Paenibacillaceae | 93 | 1.651 |
| Enterococcaceae | 93 | 2.935 |
| Microbacteriaceae | 91 | 2.789 |
| Micromonosporaceae | 90 | 1.609 |
| Sphingomonadaceae | 81 | 1.028 |
| Micrococcaceae | 71 | 2.613 |
| Streptosporangiaceae | 68 | 1.480 |
| Lactobacillaceae | 66 | 2.559 |
| Sphingobacteriaceae | 55 | 1.218 |
| Halomonadaceae | 52 | 2.815 |

## Known limitations

- **Survivorship bias.** BacDive only contains organisms that have been cultured successfully at least once. The model cannot generalize to truly uncultured strains without explicit out-of-distribution evaluation.
- **Optimum derivation is heuristic.** Most BacDive temperature entries are tagged as `growth` (positive growth at this temperature), not `optimum`. We approximate the optimum as the median of positive-growth temperatures when no explicit optimum is recorded; this can be off by 5°C or more for some strains.
- **Family grouping is naive.** The current `family` column is derived from the genus (first word of binomial name). A proper LPSN/GTDB family assignment would give tighter taxonomic grouping.
- **Feature set is shallow.** No HMM/KEGG annotations, no codon usage indices, no tRNA counts. These are interpretable next steps before moving to genome LMs.
- **Pyrodigal accuracy.** Gene prediction quality drops on highly-fragmented assemblies and atypical genetic codes. Not currently flagged in the feature set.
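
The optimum-derivation heuristic described above can be sketched as follows. The record layout `(kind, temperature_c, grew)` is hypothetical, not BacDive's actual schema, and taking the median when several explicit optima exist is also an assumption.

```python
from statistics import median

def derive_optimum(records):
    """Pick an optimum temperature for one strain.

    `records` is a list of (kind, temperature_c, grew) tuples, a hypothetical
    layout used only for this sketch. An explicit optimum wins; otherwise the
    median of positive-growth temperatures is used.
    """
    explicit = [t for kind, t, _ in records if kind == "optimum"]
    if explicit:
        return median(explicit)
    positive = [t for kind, t, grew in records if kind == "growth" and grew]
    return median(positive) if positive else None

# Only growth observations: the fallback is the median of the positive ones.
growth_only = [("growth", 10, True), ("growth", 30, True),
               ("growth", 37, True), ("growth", 40, True),
               ("growth", 45, False)]
print(derive_optimum(growth_only))

# An explicit optimum overrides the growth-range median.
with_optimum = growth_only + [("optimum", 37, True)]
print(derive_optimum(with_optimum))
```

The two toy calls show how the fallback can miss: the same strain gets 33.5 from the growth range alone but 37 once an explicit optimum is recorded.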

## Next steps

1. **Add tetranucleotide / codon-usage features.** ~50 extra columns, well-known signal for thermophily.
2. **Replace naive family lookup with LPSN/GTDB join.** Reduces leakage in CV.
3. **Integrate KOMODO media DB** as a richer label source than BacDive alone.
4. **Move to genome embeddings** (Nucleotide Transformer / Evo-1 / DNABERT-2) once the tabular ceiling is established.
5. **Active learning loop**: select novel-family strains where the model is uncertain, prioritize these for wet-lab cultivation testing.