Miyu Horiuchi
Add MediaDive-derived features (medium pH, NaCl, n_media): all 4 targets improve
5df9ef8
# microbe-model: v0 baseline eval report
_Generated: 2026-05-05T10:42:09+00:00_
## TL;DR
- **`optimal_temperature_c`**: MAE = **2.86** (vs always-predict-mean 4.98, **+43%**)
- **`optimal_ph`**: MAE = **0.48** (vs always-predict-mean 0.55, **+12%**)
- **`oxygen_requirement`**: macro-F1 = **0.357** (vs always-predict-majority 0.059, **+507%**)
- **`salt_tolerance_pct`**: MAE = **2.11** (vs always-predict-mean 2.51, **+16%**)
Trained on **46,029** strains with **423** genome-derived features. Cross-validation: 5-fold GroupKFold by taxonomic family.
## Corpus
- Total strains in feature table: **46,029**
- Labeled-strain counts by target:
  - `optimal_temperature_c`: 45,621
  - `optimal_ph`: 5,103
  - `oxygen_requirement`: 21,639
  - `salt_tolerance_pct`: 3,844
## Target distributions
- `optimal_temperature_c`: n=45,621, mean=32.24, std=7.13, p10=27.50, median=30.00, p90=37.00
- `optimal_ph`: n=5,103, mean=7.19, std=0.82, p10=6.50, median=7.00, p90=8.00
- `salt_tolerance_pct`: n=3,844, mean=3.15, std=4.28, p10=0.00, median=2.50, p90=7.00
- `oxygen_requirement`:
  - `aerobe`: 7,803
  - `anaerobe`: 4,193
  - `microaerophile`: 3,804
  - `facultative anaerobe`: 3,389
  - `obligate aerobe`: 2,213
  - `obligate anaerobe`: 136
  - `facultative aerobe`: 87
  - `aerotolerant`: 12
  - `microaerotolerant`: 2
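
The class counts above are enough to reproduce an always-predict-majority macro-F1 by hand. The sketch below assumes the baseline is scored on the pooled labels rather than per fold, and that classes which are never predicted get F1 = 0; under those assumptions the result is 0.059, in line with the baseline reported in the TL;DR. The function name is illustrative and is not taken from the project code.

```python
# Class counts copied from the list above.
CLASS_COUNTS = {
    "aerobe": 7803,
    "anaerobe": 4193,
    "microaerophile": 3804,
    "facultative anaerobe": 3389,
    "obligate aerobe": 2213,
    "obligate anaerobe": 136,
    "facultative aerobe": 87,
    "aerotolerant": 12,
    "microaerotolerant": 2,
}


def majority_macro_f1(counts):
    """Macro-F1 of a classifier that always predicts the most common class."""
    total = sum(counts.values())
    majority = max(counts, key=counts.get)
    f1_scores = []
    for label, n in counts.items():
        if label == majority:
            precision = n / total  # every sample is predicted as this class
            recall = 1.0           # so every true member is recovered
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)  # never predicted: F1 taken as 0 (assumption)
    return sum(f1_scores) / len(f1_scores)


baseline_f1 = majority_macro_f1(CLASS_COUNTS)
print(f"{baseline_f1:.3f}")
```

Because seven of the nine classes contribute zero and the two rarest have only 12 and 2 examples, macro-F1 is dominated by the tail of this distribution.
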
## Per-target results (5-fold GroupKFold by family)
Metrics: regression = MAE (lower is better); classification = macro-F1 (higher is better).
Each is shown alongside a trivial baseline (always-predict-mean for regression, always-predict-majority for classification).
| Target | Task | n labeled | Model metric | Baseline | Improvement |
|---|---|---|---|---|---|
| `optimal_temperature_c` | regression | 45,621 | MAE=2.857 | MAE=4.981 | +42.6% |
| `optimal_ph` | regression | 5,103 | MAE=0.482 | MAE=0.546 | +11.6% |
| `oxygen_requirement` | classification | 21,639 | F1=0.357 | F1=0.059 | +507.0% |
| `salt_tolerance_pct` | regression | 3,844 | MAE=2.112 | MAE=2.515 | +16.0% |
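
The Improvement column appears to be the relative change versus the baseline, with the sign flipped for MAE so that positive always means better. A minimal sketch, using the rounded values from the table (the function name and the `higher_is_better` flag are illustrative assumptions):

```python
def improvement_pct(model, baseline, higher_is_better):
    """Relative improvement of the model over the baseline, in percent."""
    if higher_is_better:
        return (model - baseline) / baseline * 100.0
    return (baseline - model) / baseline * 100.0


# (model metric, baseline metric, higher_is_better), copied from the table.
ROWS = {
    "optimal_temperature_c": (2.857, 4.981, False),
    "optimal_ph": (0.482, 0.546, False),
    "oxygen_requirement": (0.357, 0.059, True),
    "salt_tolerance_pct": (2.112, 2.515, False),
}

improvements = {name: round(improvement_pct(*vals), 1) for name, vals in ROWS.items()}
for name, pct in improvements.items():
    print(f"{name}: +{pct}%")
```

From these rounded inputs the temperature and salt rows reproduce exactly (42.6 and 16.0). The pH and oxygen rows come out at 11.7 and 505.1 rather than 11.6 and 507.0, which is consistent with the report computing the percentage from unrounded metrics.
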
### `optimal_temperature_c`: fold-by-fold
| Fold | Metric | Train | Test |
|---|---|---|---|
| 1 | mae = 2.953 | n=36,496 | n=9,125 |
| 2 | mae = 2.626 | n=36,497 | n=9,124 |
| 3 | mae = 3.060 | n=36,497 | n=9,124 |
| 4 | mae = 3.265 | n=36,497 | n=9,124 |
| 5 | mae = 2.381 | n=36,497 | n=9,124 |
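
The point of grouping folds by family is that no family contributes strains to both the train and the test side of a split. The sketch below illustrates that property with a small standard-library splitter. It is a simplified stand-in, not the project's code; in practice a library splitter such as scikit-learn's `GroupKFold` would normally be used. The family labels in the toy example are made up for illustration.

```python
from collections import Counter


def grouped_kfold(groups, n_splits=5):
    """Yield (train_idx, test_idx) so that no group spans both sides.

    Groups are assigned largest-first to whichever fold currently holds the
    fewest samples, which keeps fold sizes roughly balanced.
    """
    sizes = Counter(groups)
    if len(sizes) < n_splits:
        raise ValueError("need at least as many groups as folds")
    fold_of_group = {}
    fold_load = [0] * n_splits
    for group, size in sorted(sizes.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        lightest = fold_load.index(min(fold_load))
        fold_of_group[group] = lightest
        fold_load[lightest] += size
    for fold in range(n_splits):
        test_idx = [i for i, g in enumerate(groups) if fold_of_group[g] == fold]
        train_idx = [i for i, g in enumerate(groups) if fold_of_group[g] != fold]
        yield train_idx, test_idx


# Toy example: ten made-up strains in four families, split into two folds.
families = (["Bacillaceae"] * 4 + ["Vibrionaceae"] * 3
            + ["Nocardiaceae"] * 2 + ["Halomonadaceae"])
splits = list(grouped_kfold(families, n_splits=2))
for train_idx, test_idx in splits:
    print(sorted({families[i] for i in test_idx}), len(test_idx))
```

With many small groups this greedy assignment tends to produce folds of nearly equal size, which matches the near-identical train and test counts in the table above.
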
**Top 10 features for `optimal_temperature_c`:**
- `ivywrel_frac`: 0.1235
- `iso_cat2_thermophilic_gt45_c`: 0.0288
- `iso_cat2_patient`: 0.0251
- `iso_cat2_human`: 0.0234
- `n_predicted_cds`: 0.0216
- `iso_cat1_infection`: 0.0204
- `aa_frac_C`: 0.0143
- `genome_size_nt`: 0.0123
- `tetra_CTAA`: 0.0118
- `aa_frac_D`: 0.0109
### `optimal_ph`: fold-by-fold
| Fold | Metric | Train | Test |
|---|---|---|---|
| 1 | mae = 0.440 | n=4,082 | n=1,021 |
| 2 | mae = 0.568 | n=4,082 | n=1,021 |
| 3 | mae = 0.494 | n=4,082 | n=1,021 |
| 4 | mae = 0.466 | n=4,083 | n=1,020 |
| 5 | mae = 0.444 | n=4,083 | n=1,020 |
**Top 10 features for `optimal_ph`:**
- `md_ph_median`: 0.0518
- `iso_cat2_acidic`: 0.0307
- `iso_cat2_alkaline`: 0.0287
- `neg_charged_frac`: 0.0146
- `aa_frac_H`: 0.0081
- `aa_frac_E`: 0.0077
- `tetra_CTCT`: 0.0071
- `iso_cat2_plant`: 0.0068
- `tetra_AGAC`: 0.0067
- `tetra_CACT`: 0.0065
### `oxygen_requirement`: fold-by-fold
| Fold | Metric | Train | Test |
|---|---|---|---|
| 1 | f1_macro = 0.353 | n=17,311 | n=4,328 |
| 2 | f1_macro = 0.375 | n=17,311 | n=4,328 |
| 3 | f1_macro = 0.357 | n=17,311 | n=4,328 |
| 4 | f1_macro = 0.274 | n=17,311 | n=4,328 |
| 5 | f1_macro = 0.429 | n=17,312 | n=4,327 |
**Top 10 features for `oxygen_requirement`:**
- `codon_ATA`: 0.0395
- `iso_cat1_host`: 0.0269
- `n_predicted_cds`: 0.0266
- `aa_frac_C`: 0.0195
- `iso_cat1_environmental`: 0.0162
- `codon_CGT`: 0.0144
- `iso_cat1_engineered`: 0.0139
- `iso_cat2_human`: 0.0124
- `genome_size_nt`: 0.0103
- `codon_TAA`: 0.0083
### `salt_tolerance_pct`: fold-by-fold
| Fold | Metric | Train | Test |
|---|---|---|---|
| 1 | mae = 1.926 | n=3,075 | n=769 |
| 2 | mae = 1.893 | n=3,075 | n=769 |
| 3 | mae = 2.746 | n=3,075 | n=769 |
| 4 | mae = 1.870 | n=3,075 | n=769 |
| 5 | mae = 2.128 | n=3,076 | n=768 |
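
As a consistency check on the fold tables, the unweighted mean of the five fold metrics can be compared with the headline value for each target. The sketch below assumes the headline is a plain mean over folds, which the report does not state explicitly; all numbers are copied from the tables in this section.

```python
from statistics import mean

# Per-fold metrics copied from the fold-by-fold tables.
FOLD_METRICS = {
    "optimal_temperature_c": [2.953, 2.626, 3.060, 3.265, 2.381],
    "optimal_ph": [0.440, 0.568, 0.494, 0.466, 0.444],
    "oxygen_requirement": [0.353, 0.375, 0.357, 0.274, 0.429],
    "salt_tolerance_pct": [1.926, 1.893, 2.746, 1.870, 2.128],
}

# Headline metrics copied from the per-target results table.
HEADLINE = {
    "optimal_temperature_c": 2.857,
    "optimal_ph": 0.482,
    "oxygen_requirement": 0.357,
    "salt_tolerance_pct": 2.112,
}

fold_means = {target: mean(values) for target, values in FOLD_METRICS.items()}
gaps = {target: abs(fold_means[target] - HEADLINE[target]) for target in HEADLINE}
max_gap = max(gaps.values())
for target in HEADLINE:
    print(f"{target}: fold mean {fold_means[target]:.4f}, headline {HEADLINE[target]:.3f}")
```

All four fold means agree with the headline to within 0.001, so any remaining gap is compatible with rounding of the per-fold values.
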
**Top 10 features for `salt_tolerance_pct`:**
- `neg_charged_frac`: 0.0702
- `tetra_ATCC`: 0.0428
- `aa_frac_C`: 0.0298
- `iso_cat2_saline`: 0.0286
- `md_nacl_pct_median`: 0.0256
- `tetra_ACAT`: 0.0255
- `md_nacl_pct_max`: 0.0128
- `aa_frac_T`: 0.0120
- `codon_CCG`: 0.0093
- `tetra_TGAT`: 0.0089
## Feature ↔ target correlations (Spearman, top 10)
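This section sanity-checks the biology: features known to track each target should appear here with high |ρ|. For example, `ivywrel_frac` should correlate with `optimal_temperature_c` (the thermophile signature of Zeldovich et al., 2007). A p-value printed as `0.0e+00` is below floating-point precision, not exactly zero.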
### `optimal_temperature_c`
| Feature | Spearman ρ | p-value |
|---|---|---|
| `genome_size_nt` | -0.493 | 0.0e+00 |
| `n_predicted_cds` | -0.482 | 0.0e+00 |
| `aa_frac_P` | -0.391 | 0.0e+00 |
| `aa_frac_Y` | +0.390 | 0.0e+00 |
| `tetra_TCTT` | +0.383 | 0.0e+00 |
| `tetra_TATC` | +0.381 | 0.0e+00 |
| `tetra_GATA` | +0.381 | 0.0e+00 |
| `tetra_AAGA` | +0.381 | 0.0e+00 |
| `tetra_CATA` | +0.380 | 0.0e+00 |
| `tetra_TATG` | +0.379 | 0.0e+00 |
### `optimal_ph`
| Feature | Spearman ρ | p-value |
|---|---|---|
| `md_ph_median` | +0.429 | 4.0e-131 |
| `neg_charged_frac` | +0.304 | 1.6e-109 |
| `mean_isoelectric_point` | -0.278 | 1.8e-91 |
| `aa_frac_E` | +0.256 | 4.5e-77 |
| `md_nacl_pct_max` | +0.218 | 1.9e-33 |
| `md_nacl_pct_median` | +0.212 | 9.9e-32 |
| `iso_cat2_alkaline` | +0.165 | 2.5e-32 |
| `ivywrel_frac` | +0.159 | 2.4e-30 |
| `codon_AAG` | -0.154 | 1.7e-28 |
| `codon_CGA` | +0.153 | 5.8e-28 |
### `salt_tolerance_pct`
| Feature | Spearman ρ | p-value |
|---|---|---|
| `neg_charged_frac` | +0.442 | 1.5e-183 |
| `mean_isoelectric_point` | -0.344 | 1.7e-107 |
| `aa_frac_E` | +0.310 | 3.1e-86 |
| `tetra_GACT` | +0.302 | 4.3e-82 |
| `tetra_AGTC` | +0.302 | 1.0e-81 |
| `md_nacl_pct_max` | +0.298 | 2.9e-52 |
| `md_nacl_pct_median` | +0.290 | 1.6e-49 |
| `tetra_ACTC` | +0.282 | 2.2e-71 |
| `tetra_GAGT` | +0.273 | 1.9e-66 |
| `iso_cat2_saline` | +0.263 | 9.4e-62 |
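
For reference, Spearman's ρ is the Pearson correlation of the rank-transformed values, with tied values sharing their average rank. The sketch below is a standard-library illustration of that definition and is not the code that produced the tables above (a library routine such as `scipy.stats.spearmanr` would be the usual choice and also returns the p-value). The toy numbers are made up.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up feature values and temperatures, with one tie in the target.
feature = [0.10, 0.12, 0.15, 0.20, 0.22]
target = [25, 30, 30, 37, 55]
rho = spearman_rho(feature, target)
print(f"{rho:+.3f}")
```

Because only ranks matter, ρ is insensitive to the skew in targets such as `salt_tolerance_pct`, where the mean and median differ noticeably.
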
## Per-family error breakdown (regression targets)
MAE for the 15 most-represented families per target, highlighting where the model does well and where it struggles.
### `optimal_temperature_c`
| Family | n | MAE |
|---|---|---|
| Enterobacteriaceae | 2662 | 3.792 |
| Streptomycetaceae | 2212 | 1.783 |
| Bacillaceae | 1886 | 3.174 |
| Lactobacillaceae | 1732 | 3.709 |
| Pseudomonadaceae | 1621 | 2.488 |
| Myxococcaceae | 1546 | 0.238 |
| Streptococcaceae | 1170 | 2.537 |
| Staphylococcaceae | 1068 | 3.374 |
| Flavobacteriaceae | 981 | 4.116 |
| Corynebacteriaceae | 900 | 2.146 |
| Moraxellaceae | 890 | 3.388 |
| Paenibacillaceae | 760 | 3.081 |
| Microbacteriaceae | 734 | 2.459 |
| Micrococcaceae | 719 | 2.811 |
| Nocardiaceae | 715 | 2.276 |
### `optimal_ph`
| Family | n | MAE |
|---|---|---|
| Flavobacteriaceae | 355 | 0.405 |
| Bacillaceae | 298 | 0.606 |
| Roseobacteraceae | 204 | 0.375 |
| Paenibacillaceae | 139 | 0.469 |
| Microbacteriaceae | 120 | 0.446 |
| Sphingobacteriaceae | 114 | 0.336 |
| Sphingomonadaceae | 102 | 0.319 |
| Streptomycetaceae | 98 | 0.513 |
| Pseudonocardiaceae | 93 | 0.479 |
| Halomonadaceae | 82 | 0.584 |
| Micrococcaceae | 82 | 0.613 |
| Nocardioidaceae | 80 | 0.502 |
| Paracoccaceae | 76 | 0.574 |
| Alteromonadaceae | 71 | 0.355 |
| Erythrobacteraceae | 68 | 0.446 |
### `salt_tolerance_pct`
| Family | n | MAE |
|---|---|---|
| Flavobacteriaceae | 267 | 1.713 |
| Streptomycetaceae | 264 | 1.987 |
| Bacillaceae | 201 | 3.315 |
| Roseobacteraceae | 127 | 1.395 |
| Pseudonocardiaceae | 123 | 2.280 |
| Paenibacillaceae | 93 | 1.651 |
| Enterococcaceae | 93 | 2.935 |
| Microbacteriaceae | 91 | 2.789 |
| Micromonosporaceae | 90 | 1.609 |
| Sphingomonadaceae | 81 | 1.028 |
| Micrococcaceae | 71 | 2.613 |
| Streptosporangiaceae | 68 | 1.480 |
| Lactobacillaceae | 66 | 2.559 |
| Sphingobacteriaceae | 55 | 1.218 |
| Halomonadaceae | 52 | 2.815 |
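
A breakdown like the tables above can be produced by grouping absolute errors by family and keeping the most-represented families. The sketch below assumes out-of-fold predictions are available as parallel lists; the function name, argument names and toy data are illustrative and not taken from the project code.

```python
from collections import defaultdict


def per_family_mae(families, y_true, y_pred, top_n=15):
    """Return (family, n, MAE) rows for the top_n most-represented families."""
    abs_errors = defaultdict(list)
    for fam, truth, pred in zip(families, y_true, y_pred):
        abs_errors[fam].append(abs(truth - pred))
    rows = [(fam, len(errs), sum(errs) / len(errs)) for fam, errs in abs_errors.items()]
    rows.sort(key=lambda r: (-r[1], r[0]))  # most-represented first, ties by name
    return rows[:top_n]


# Made-up example: six strains in three placeholder families.
families = ["A", "A", "A", "B", "B", "C"]
y_true = [30, 30, 37, 25, 28, 60]
y_pred = [31, 29, 35, 25, 30, 50]
table = per_family_mae(families, y_true, y_pred, top_n=2)
for fam, n, mae in table:
    print(f"| {fam} | {n} | {mae:.3f} |")
```

Sorting by count rather than by MAE keeps the table focused on families where the estimate rests on enough strains to be meaningful.
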
## Known limitations
- **Survivorship bias.** BacDive only contains organisms that have been cultured successfully at least once. The model cannot generalize to truly uncultured strains without explicit out-of-distribution evaluation.
- **Optimum derivation is heuristic.** Most BacDive temperature entries are tagged as `growth` (positive growth at this temperature), not `optimum`. We approximate the optimum as the median of positive-growth temperatures when no explicit optimum is recorded; this can be off by 5°C or more for some strains.
- **Family grouping is naive.** The current `family` column is derived from the genus (first word of binomial name). A proper LPSN/GTDB family assignment would give tighter taxonomic grouping.
- **Feature set is shallow.** No HMM/KEGG annotations, no codon usage indices, no tRNA counts. These are interpretable next steps before moving to genome LMs.
- **Pyrodigal accuracy.** Gene prediction quality drops on highly-fragmented assemblies and atypical genetic codes. Not currently flagged in the feature set.
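
The optimum-derivation heuristic described in the list above can be sketched as follows. The record shape, a list of `(kind, temperature_c)` pairs, is a hypothetical simplification of the BacDive entries, and taking the median when several explicit optima exist is an assumption, since the report only specifies the fallback.

```python
from statistics import median


def derive_optimum_temperature(records):
    """Pick an optimum temperature from (kind, temperature_c) records.

    kind is "optimum" or "growth" (positive growth at that temperature).
    """
    explicit = [temp for kind, temp in records if kind == "optimum"]
    if explicit:
        # Assumption: several explicit optima are combined by their median.
        return median(explicit)
    positive_growth = [temp for kind, temp in records if kind == "growth"]
    if positive_growth:
        # Fallback from the report: median of positive-growth temperatures.
        return median(positive_growth)
    return None  # no usable temperature information


# Made-up strain with growth-only records.
growth_only = [("growth", 10.0), ("growth", 20.0), ("growth", 30.0), ("growth", 45.0)]
fallback = derive_optimum_temperature(growth_only)
with_explicit = derive_optimum_temperature(growth_only + [("optimum", 37.0)])
print(fallback, with_explicit)
```

In this made-up case the fallback returns 25.0 while the explicit record says 37.0, which illustrates how a wide tested range can pull the heuristic well away from the true optimum.
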
## Next steps
1. **Extend tetranucleotide / codon-usage features.** `tetra_*` and `codon_*` columns already appear among the top features above; ~50 extra columns would build on this well-known signal for thermophily.
2. **Replace naive family lookup with LPSN/GTDB join.** Reduces leakage in CV.
3. **Integrate KOMODO media DB** as a richer label source than BacDive alone.
4. **Move to genome embeddings** (Nucleotide Transformer / Evo-1 / DNABERT-2) once the tabular ceiling is established.
5. **Active learning loop**: select novel-family strains where the model is uncertain, prioritize these for wet-lab cultivation testing.
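
One way the active learning step could be prototyped is sketched below. It assumes that uncertainty is measured as the spread of predictions across an ensemble, for example per-tree or per-fold models, and that a family counts as novel when it is absent from the training set. Both are assumptions; the report does not specify either, and every name in the sketch is hypothetical.

```python
from statistics import pstdev


def select_for_wet_lab(candidates, training_families, budget=3):
    """Rank novel-family candidates by prediction spread, highest first.

    candidates: list of dicts with keys "strain", "family", "ensemble_preds".
    """
    novel = [c for c in candidates if c["family"] not in training_families]
    scored = [(pstdev(c["ensemble_preds"]), c["strain"]) for c in novel]
    scored.sort(key=lambda item: (-item[0], item[1]))  # most uncertain first
    return [strain for _, strain in scored[:budget]]


# Made-up candidates; strain and family names are placeholders.
training_families = {"Bacillaceae", "Streptomycetaceae"}
candidates = [
    {"strain": "s1", "family": "Bacillaceae", "ensemble_preds": [20.0, 60.0, 40.0]},
    {"strain": "s2", "family": "FamilyX", "ensemble_preds": [30.0, 31.0, 29.0]},
    {"strain": "s3", "family": "FamilyY", "ensemble_preds": [25.0, 45.0, 65.0]},
    {"strain": "s4", "family": "FamilyZ", "ensemble_preds": [37.0, 37.0, 37.0]},
]
picked = select_for_wet_lab(candidates, training_families, budget=2)
print(picked)
```

Here `s1` is skipped despite its large spread because its family is already represented, and `s3` ranks ahead of `s2` because its ensemble disagrees more.
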