microbe-model: v0 baseline eval report
Generated: 2026-05-05T10:42:09+00:00
TL;DR
optimal_temperature_c: MAE = 2.86 (vs always-predict-mean 4.98, +43%)
optimal_ph: MAE = 0.48 (vs always-predict-mean 0.55, +12%)
oxygen_requirement: macro-F1 = 0.357 (vs always-predict-majority 0.059, +507%)
salt_tolerance_pct: MAE = 2.11 (vs always-predict-mean 2.51, +16%)
Trained on 46,029 strains with 423 genome-derived features. Cross-validation: 5-fold GroupKFold by taxonomic family.
Corpus
- Total strains in feature table: 46,029
- Labeled-strain counts by target:
optimal_temperature_c: 45,621
optimal_ph: 5,103
oxygen_requirement: 21,639
salt_tolerance_pct: 3,844
Target distributions
optimal_temperature_c: n=45,621, mean=32.24, std=7.13, p10=27.50, median=30.00, p90=37.00
optimal_ph: n=5,103, mean=7.19, std=0.82, p10=6.50, median=7.00, p90=8.00
salt_tolerance_pct: n=3,844, mean=3.15, std=4.28, p10=0.00, median=2.50, p90=7.00
oxygen_requirement:
aerobe: 7,803
anaerobe: 4,193
microaerophile: 3,804
facultative anaerobe: 3,389
obligate aerobe: 2,213
obligate anaerobe: 136
facultative aerobe: 87
aerotolerant: 12
microaerotolerant: 2
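The class counts above are enough to see where the always-predict-majority baseline for oxygen_requirement (macro-F1 = 0.059) comes from. The following is a minimal sketch, assuming the baseline is scored on the pooled labels rather than per fold (the report does not say which): the majority class gets a non-zero F1, every other class gets 0, and macro-F1 averages over all nine classes.

```python
# Class counts copied from the oxygen_requirement distribution above.
counts = {
    "aerobe": 7803,
    "anaerobe": 4193,
    "microaerophile": 3804,
    "facultative anaerobe": 3389,
    "obligate aerobe": 2213,
    "obligate anaerobe": 136,
    "facultative aerobe": 87,
    "aerotolerant": 12,
    "microaerotolerant": 2,
}


def majority_macro_f1(class_counts):
    """Macro-F1 of a classifier that always predicts the most frequent class."""
    total = sum(class_counts.values())
    majority = max(class_counts, key=class_counts.get)
    precision = class_counts[majority] / total  # every sample is predicted as majority
    recall = 1.0                                # every majority sample is recovered
    f1_majority = 2 * precision * recall / (precision + recall)
    # All other classes are never predicted, so their F1 is 0.
    return f1_majority / len(class_counts)


total = sum(counts.values())
baseline = majority_macro_f1(counts)
print(total, round(baseline, 3))  # 21639 0.059
```

Because seven of the nine classes contribute zero regardless of how common the majority class is, macro-F1 baselines on heavily imbalanced label sets are very low, which is why the relative improvement for this target looks so large.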
Per-target results (5-fold GroupKFold by family)
Metrics: regression = MAE (lower is better), classification = macro-F1 (higher is better).
Each is shown alongside a trivial baseline (always-predict-mean for regression, always-predict-majority for classification); Improvement is the relative change versus that baseline.
| Target | Task | n labeled | Model metric | Baseline | Improvement |
|---|---|---|---|---|---|
| optimal_temperature_c | regression | 45,621 | MAE=2.857 | MAE=4.981 | +42.6% |
| optimal_ph | regression | 5,103 | MAE=0.482 | MAE=0.546 | +11.6% |
| oxygen_requirement | classification | 21,639 | F1=0.357 | F1=0.059 | +507.0% |
| salt_tolerance_pct | regression | 3,844 | MAE=2.112 | MAE=2.515 | +16.0% |
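The results above come from 5-fold GroupKFold by family, meaning that all strains of a family land in the same test fold. The sketch below illustrates the idea in plain Python with a greedy size-balancing rule (largest groups first, each into the currently smallest fold), which is similar in spirit to scikit-learn's `GroupKFold`. It is not the report's actual code, and the family labels are toy stand-ins for the real family column.

```python
from collections import Counter


def group_k_fold(groups, n_splits=5):
    """Yield (train_idx, test_idx) so that no group spans two folds.

    Greedy balancing: largest groups first, each into the currently smallest fold.
    """
    sizes = Counter(groups)
    if len(sizes) < n_splits:
        raise ValueError("need at least as many groups as folds")
    fold_sizes = [0] * n_splits
    fold_of_group = {}
    # Sort by size (descending), then by name for a deterministic result.
    for group, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        lightest = fold_sizes.index(min(fold_sizes))
        fold_of_group[group] = lightest
        fold_sizes[lightest] += size
    for fold in range(n_splits):
        test = [i for i, g in enumerate(groups) if fold_of_group[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of_group[g] != fold]
        yield train, test


# Toy data: hypothetical family labels, one per strain.
families = (["Bacillaceae"] * 4 + ["Enterobacteriaceae"] * 3 +
            ["Halomonadaceae"] * 2 + ["Myxococcaceae"] * 2 +
            ["Nocardiaceae"] + ["Moraxellaceae"])
splits = list(group_k_fold(families, n_splits=3))
for train_idx, test_idx in splits:
    train_fams = {families[i] for i in train_idx}
    test_fams = {families[i] for i in test_idx}
    assert not (train_fams & test_fams)  # no family appears on both sides
    print(sorted(test_fams), len(train_idx), len(test_idx))
```

With many small groups, this kind of greedy assignment tends to produce folds of nearly equal size, which is consistent with the train/test counts in the fold-by-fold tables.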
optimal_temperature_c: fold-by-fold
| Fold | Metric | Train | Test |
|---|---|---|---|
| 1 | mae = 2.953 | n=36,496 | n=9,125 |
| 2 | mae = 2.626 | n=36,497 | n=9,124 |
| 3 | mae = 3.060 | n=36,497 | n=9,124 |
| 4 | mae = 3.265 | n=36,497 | n=9,124 |
| 5 | mae = 2.381 | n=36,497 | n=9,124 |
Top 10 features for optimal_temperature_c:
ivywrel_frac: 0.1235
iso_cat2_thermophilic_gt45_c: 0.0288
iso_cat2_patient: 0.0251
iso_cat2_human: 0.0234
n_predicted_cds: 0.0216
iso_cat1_infection: 0.0204
aa_frac_C: 0.0143
genome_size_nt: 0.0123
tetra_CTAA: 0.0118
aa_frac_D: 0.0109
optimal_ph: fold-by-fold
| Fold | Metric | Train | Test |
|---|---|---|---|
| 1 | mae = 0.440 | n=4,082 | n=1,021 |
| 2 | mae = 0.568 | n=4,082 | n=1,021 |
| 3 | mae = 0.494 | n=4,082 | n=1,021 |
| 4 | mae = 0.466 | n=4,083 | n=1,020 |
| 5 | mae = 0.444 | n=4,083 | n=1,020 |
Top 10 features for optimal_ph:
md_ph_median: 0.0518
iso_cat2_acidic: 0.0307
iso_cat2_alkaline: 0.0287
neg_charged_frac: 0.0146
aa_frac_H: 0.0081
aa_frac_E: 0.0077
tetra_CTCT: 0.0071
iso_cat2_plant: 0.0068
tetra_AGAC: 0.0067
tetra_CACT: 0.0065
oxygen_requirement: fold-by-fold
| Fold | Metric | Train | Test |
|---|---|---|---|
| 1 | f1_macro = 0.353 | n=17,311 | n=4,328 |
| 2 | f1_macro = 0.375 | n=17,311 | n=4,326 |
| 3 | f1_macro = 0.357 | n=17,311 | n=4,328 |
| 4 | f1_macro = 0.274 | n=17,311 | n=4,328 |
| 5 | f1_macro = 0.429 | n=17,312 | n=4,327 |
Top 10 features for oxygen_requirement:
codon_ATA: 0.0395
iso_cat1_host: 0.0269
n_predicted_cds: 0.0266
aa_frac_C: 0.0195
iso_cat1_environmental: 0.0162
codon_CGT: 0.0144
iso_cat1_engineered: 0.0139
iso_cat2_human: 0.0124
genome_size_nt: 0.0103
codon_TAA: 0.0083
salt_tolerance_pct: fold-by-fold
| Fold | Metric | Train | Test |
|---|---|---|---|
| 1 | mae = 1.926 | n=3,075 | n=769 |
| 2 | mae = 1.893 | n=3,075 | n=769 |
| 3 | mae = 2.746 | n=3,075 | n=769 |
| 4 | mae = 1.870 | n=3,075 | n=769 |
| 5 | mae = 2.128 | n=3,076 | n=768 |
Top 10 features for salt_tolerance_pct:
neg_charged_frac: 0.0702
tetra_ATCC: 0.0428
aa_frac_C: 0.0298
iso_cat2_saline: 0.0286
md_nacl_pct_median: 0.0256
tetra_ACAT: 0.0255
md_nacl_pct_max: 0.0128
aa_frac_T: 0.0120
codon_CCG: 0.0093
tetra_TGAT: 0.0089
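The headline metrics appear to be the unweighted mean of the five fold scores, and the Improvement column appears to be the relative change versus the baseline, signed so that positive means better. The sketch below recomputes both from the rounded values printed in this report under those two assumptions. Because the inputs are already rounded to three decimals, the last digit can differ slightly from the reported figures.

```python
from statistics import mean, stdev

# Per-fold scores copied from the fold-by-fold tables above.
folds = {
    "optimal_temperature_c": [2.953, 2.626, 3.060, 3.265, 2.381],
    "optimal_ph": [0.440, 0.568, 0.494, 0.466, 0.444],
    "oxygen_requirement": [0.353, 0.375, 0.357, 0.274, 0.429],
    "salt_tolerance_pct": [1.926, 1.893, 2.746, 1.870, 2.128],
}
# Baseline scores copied from the per-target results table.
baselines = {
    "optimal_temperature_c": 4.981,
    "optimal_ph": 0.546,
    "oxygen_requirement": 0.059,
    "salt_tolerance_pct": 2.515,
}
higher_is_better = {"oxygen_requirement"}  # macro-F1; the rest are MAE


def relative_improvement(model, baseline, higher_better):
    """Relative change versus the baseline, signed so that positive means better."""
    delta = model - baseline if higher_better else baseline - model
    return delta / baseline


summary = {}
for target, scores in folds.items():
    avg = mean(scores)
    spread = stdev(scores)
    gain = relative_improvement(avg, baselines[target], target in higher_is_better)
    summary[target] = (avg, spread, gain)
    print(f"{target}: mean={avg:.3f} sd={spread:.3f} improvement={gain:+.1%}")
```

Reporting the fold-to-fold standard deviation next to the mean makes it easier to judge whether a gain such as the one for optimal_ph is larger than the variation between folds.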
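Feature-target correlations (Spearman, top 10)
Sanity-checks the biology: features known to track each target should appear here at high |ρ|. E.g. ivywrel_frac should correlate with optimal_temperature_c (Zeldovich 2007 thermophile signature). Note that ivywrel_frac does not reach the top 10 by |ρ| for optimal_temperature_c below, although it is the top-ranked model feature for that target.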
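For readers unfamiliar with the ivywrel_frac feature, the sketch below shows one way such a value could be computed: the fraction of residues that are I, V, Y, W, R, E or L, pooled over all protein sequences of a genome. The pooling and the handling of stop symbols are assumptions. This report does not state how the feature table actually computes it, and the sequences are toy strings rather than real proteins.

```python
IVYWREL = frozenset("IVYWREL")


def ivywrel_fraction(proteins):
    """Fraction of residues that are I, V, Y, W, R, E or L across all sequences."""
    total = 0
    hits = 0
    for seq in proteins:
        for aa in seq.upper():
            if aa.isalpha():  # skip stop symbols such as '*'
                total += 1
                hits += aa in IVYWREL
    return hits / total if total else 0.0


proteome = ["MKVLYRE*", "MAIWG"]  # toy sequences, not real proteins
frac = ivywrel_fraction(proteome)
print(f"{frac:.4f}")
```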
optimal_temperature_c
| Feature | Spearman ρ | p-value |
|---|---|---|
| genome_size_nt | -0.493 | 0.0e+00 |
| n_predicted_cds | -0.482 | 0.0e+00 |
| aa_frac_P | -0.391 | 0.0e+00 |
| aa_frac_Y | +0.390 | 0.0e+00 |
| tetra_TCTT | +0.383 | 0.0e+00 |
| tetra_TATC | +0.381 | 0.0e+00 |
| tetra_GATA | +0.381 | 0.0e+00 |
| tetra_AAGA | +0.381 | 0.0e+00 |
| tetra_CATA | +0.380 | 0.0e+00 |
| tetra_TATG | +0.379 | 0.0e+00 |
optimal_ph
| Feature | Spearman ρ | p-value |
|---|---|---|
| md_ph_median | +0.429 | 4.0e-131 |
| neg_charged_frac | +0.304 | 1.6e-109 |
| mean_isoelectric_point | -0.278 | 1.8e-91 |
| aa_frac_E | +0.256 | 4.5e-77 |
| md_nacl_pct_max | +0.218 | 1.9e-33 |
| md_nacl_pct_median | +0.212 | 9.9e-32 |
| iso_cat2_alkaline | +0.165 | 2.5e-32 |
| ivywrel_frac | +0.159 | 2.4e-30 |
| codon_AAG | -0.154 | 1.7e-28 |
| codon_CGA | +0.153 | 5.8e-28 |
salt_tolerance_pct
| Feature | Spearman ρ | p-value |
|---|---|---|
| neg_charged_frac | +0.442 | 1.5e-183 |
| mean_isoelectric_point | -0.344 | 1.7e-107 |
| aa_frac_E | +0.310 | 3.1e-86 |
| tetra_GACT | +0.302 | 4.3e-82 |
| tetra_AGTC | +0.302 | 1.0e-81 |
| md_nacl_pct_max | +0.298 | 2.9e-52 |
| md_nacl_pct_median | +0.290 | 1.6e-49 |
| tetra_ACTC | +0.282 | 2.2e-71 |
| tetra_GAGT | +0.273 | 1.9e-66 |
| iso_cat2_saline | +0.263 | 9.4e-62 |
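The Spearman ρ values above measure monotonic association: both variables are converted to ranks (ties share the average rank) and the Pearson correlation of the ranks is taken. The following is a minimal stdlib sketch of that definition. The report presumably used a library routine, which is an assumption, and the toy values are hypothetical.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values (hypothetical): larger genomes paired with lower optimal temperatures.
genome_size_nt = [1.8e6, 2.4e6, 4.1e6, 5.0e6, 6.3e6, 9.0e6]
optimal_temperature_c = [80.0, 65.0, 37.0, 37.0, 30.0, 28.0]
rho = spearman_rho(genome_size_nt, optimal_temperature_c)
print(f"rho = {rho:+.3f}")
```

Because only ranks matter, the statistic is insensitive to the skew visible in the target distributions (for example the long right tail of salt_tolerance_pct).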
Per-family error breakdown (regression targets)
Top 15 most-represented families, MAE per family. Highlights where the model is doing well vs. struggling.
optimal_temperature_c
| Family | n | MAE |
|---|---|---|
| Enterobacteriaceae | 2662 | 3.792 |
| Streptomycetaceae | 2212 | 1.783 |
| Bacillaceae | 1886 | 3.174 |
| Lactobacillaceae | 1732 | 3.709 |
| Pseudomonadaceae | 1621 | 2.488 |
| Myxococcaceae | 1546 | 0.238 |
| Streptococcaceae | 1170 | 2.537 |
| Staphylococcaceae | 1068 | 3.374 |
| Flavobacteriaceae | 981 | 4.116 |
| Corynebacteriaceae | 900 | 2.146 |
| Moraxellaceae | 890 | 3.388 |
| Paenibacillaceae | 760 | 3.081 |
| Microbacteriaceae | 734 | 2.459 |
| Micrococcaceae | 719 | 2.811 |
| Nocardiaceae | 715 | 2.276 |
optimal_ph
| Family | n | MAE |
|---|---|---|
| Flavobacteriaceae | 355 | 0.405 |
| Bacillaceae | 298 | 0.606 |
| Roseobacteraceae | 204 | 0.375 |
| Paenibacillaceae | 139 | 0.469 |
| Microbacteriaceae | 120 | 0.446 |
| Sphingobacteriaceae | 114 | 0.336 |
| Sphingomonadaceae | 102 | 0.319 |
| Streptomycetaceae | 98 | 0.513 |
| Pseudonocardiaceae | 93 | 0.479 |
| Halomonadaceae | 82 | 0.584 |
| Micrococcaceae | 82 | 0.613 |
| Nocardioidaceae | 80 | 0.502 |
| Paracoccaceae | 76 | 0.574 |
| Alteromonadaceae | 71 | 0.355 |
| Erythrobacteraceae | 68 | 0.446 |
salt_tolerance_pct
| Family | n | MAE |
|---|---|---|
| Flavobacteriaceae | 267 | 1.713 |
| Streptomycetaceae | 264 | 1.987 |
| Bacillaceae | 201 | 3.315 |
| Roseobacteraceae | 127 | 1.395 |
| Pseudonocardiaceae | 123 | 2.280 |
| Paenibacillaceae | 93 | 1.651 |
| Enterococcaceae | 93 | 2.935 |
| Microbacteriaceae | 91 | 2.789 |
| Micromonosporaceae | 90 | 1.609 |
| Sphingomonadaceae | 81 | 1.028 |
| Micrococcaceae | 71 | 2.613 |
| Streptosporangiaceae | 68 | 1.480 |
| Lactobacillaceae | 66 | 2.559 |
| Sphingobacteriaceae | 55 | 1.218 |
| Halomonadaceae | 52 | 2.815 |
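A per-family breakdown like the ones above can be produced by grouping out-of-fold absolute errors by family and keeping the most-represented families. The sketch below shows that grouping in plain Python. The record layout `(family, y_true, y_pred)` and the sample values are hypothetical, not taken from the actual evaluation.

```python
from collections import defaultdict


def per_family_mae(records, top_n=15):
    """records: iterable of (family, y_true, y_pred). Returns [(family, n, mae)]."""
    errors = defaultdict(list)
    for family, y_true, y_pred in records:
        errors[family].append(abs(y_true - y_pred))
    rows = [(fam, len(errs), sum(errs) / len(errs)) for fam, errs in errors.items()]
    # Most-represented families first; ties broken alphabetically.
    rows.sort(key=lambda r: (-r[1], r[0]))
    return rows[:top_n]


# Hypothetical out-of-fold predictions for optimal_temperature_c.
records = [
    ("Bacillaceae", 37.0, 34.0),
    ("Bacillaceae", 55.0, 49.0),
    ("Bacillaceae", 30.0, 30.0),
    ("Myxococcaceae", 30.0, 30.5),
    ("Myxococcaceae", 30.0, 30.0),
    ("Halomonadaceae", 28.0, 31.0),
]
table = per_family_mae(records)
for family, n, mae in table:
    print(f"| {family} | {n} | {mae:.3f} |")
```

A very low per-family MAE is not necessarily a sign of skill: if nearly all strains in a family share the same label, any prediction near that value will score well.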
Known limitations
- Survivorship bias. BacDive only contains organisms that have been cultured successfully at least once. Generalization to truly uncultured strains cannot be assumed without explicit out-of-distribution evaluation.
- Optimum derivation is heuristic. Most BacDive temperature entries are tagged as "growth" (positive growth at this temperature), not "optimum". We approximate the optimum as the median of positive-growth temperatures when no explicit optimum is recorded; this can be off by 5 °C or more for some strains.
- Family grouping is naive. The current family column is derived from the genus (first word of binomial name). A proper LPSN/GTDB family assignment would give tighter taxonomic grouping.
- Feature set is shallow. No HMM/KEGG annotations, no codon usage indices, no tRNA counts. These are interpretable next steps before moving to genome LMs.
- Pyrodigal accuracy. Gene prediction quality drops on highly-fragmented assemblies and atypical genetic codes. Not currently flagged in the feature set.
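The optimum-derivation heuristic described in the list above can be sketched as follows. The entry layout (keys `temp_c`, `kind`, `growth`) is hypothetical and does not reflect the actual BacDive schema. Taking the median when several explicit optimum entries exist is also an assumption, since the report only specifies the fallback.

```python
from statistics import median


def derive_optimum(entries):
    """entries: list of dicts with hypothetical keys 'temp_c', 'kind', 'growth'.

    Returns (value, source), where source is 'optimum', 'growth_median' or None.
    """
    explicit = [e["temp_c"] for e in entries if e["kind"] == "optimum"]
    if explicit:
        return median(explicit), "optimum"
    # Fallback: median of temperatures with recorded positive growth.
    positive = [e["temp_c"] for e in entries
                if e["kind"] == "growth" and e["growth"]]
    if positive:
        return median(positive), "growth_median"
    return None, None


# Toy strain with growth observations only (values are illustrative).
strain_entries = [
    {"temp_c": 20.0, "kind": "growth", "growth": True},
    {"temp_c": 30.0, "kind": "growth", "growth": True},
    {"temp_c": 37.0, "kind": "growth", "growth": True},
    {"temp_c": 45.0, "kind": "growth", "growth": True},
    {"temp_c": 50.0, "kind": "growth", "growth": False},
]
value, source = derive_optimum(strain_entries)
print(value, source)  # 33.5 growth_median
```

Returning the source alongside the value would let an evaluation be rerun on explicitly recorded optima only, which is one way to gauge how much label noise the fallback introduces.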
Next steps
- Extend tetranucleotide / codon-usage features (tetra_* and codon_* columns already appear among the top features above). ~50 extra columns, well-known signal for thermophily.
- Replace naive family lookup with LPSN/GTDB join. Reduces leakage in CV.
- Integrate KOMODO media DB as a richer label source than BacDive alone.
- Move to genome embeddings (Nucleotide Transformer / Evo-1 / DNABERT-2) once the tabular ceiling is established.
- Active learning loop: select novel-family strains where the model is uncertain, prioritize these for wet-lab cultivation testing.
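The active learning step in the list above could start from a selection rule as simple as the one sketched here: keep candidates whose family is absent from the training data, then rank them by model uncertainty. The field names (`strain`, `family`, `uncertainty`) and the strain identifiers are hypothetical. How uncertainty is estimated (for example, disagreement between fold models) is left open and is not something this report specifies.

```python
def select_for_cultivation(candidates, training_families, budget=3):
    """Pick the most uncertain candidates from families unseen in training.

    candidates: list of dicts with hypothetical keys
    'strain', 'family' and 'uncertainty' (higher means less certain).
    """
    novel = [c for c in candidates if c["family"] not in training_families]
    novel.sort(key=lambda c: c["uncertainty"], reverse=True)
    return [c["strain"] for c in novel[:budget]]


training_families = {"Bacillaceae", "Enterobacteriaceae"}
candidates = [
    {"strain": "S1", "family": "Bacillaceae", "uncertainty": 9.0},
    {"strain": "S2", "family": "Family_X", "uncertainty": 2.5},
    {"strain": "S3", "family": "Family_Y", "uncertainty": 6.1},
    {"strain": "S4", "family": "Family_Z", "uncertainty": 4.0},
]
picked = select_for_cultivation(candidates, training_families, budget=2)
print(picked)  # ['S3', 'S4']
```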