Miyu Horiuchi
Add MediaDive-derived features (medium pH, NaCl, n_media): all 4 targets improve
5df9ef8

# microbe-model: v0 baseline eval report

_Generated: 2026-05-05T10:42:09+00:00_

## TL;DR

- **`optimal_temperature_c`**: MAE = **2.86** (vs always-predict-mean 4.98, **+43%**)
- **`optimal_ph`**: MAE = **0.48** (vs always-predict-mean 0.55, **+12%**)
- **`oxygen_requirement`**: macro-F1 = **0.357** (vs always-predict-majority 0.059, **+507%**)
- **`salt_tolerance_pct`**: MAE = **2.11** (vs always-predict-mean 2.51, **+16%**)

Trained on **46,029** strains with **423** genome-derived features. Cross-validation: 5-fold GroupKFold by taxonomic family.
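To make the grouped cross-validation concrete, here is a minimal pure-Python sketch of family-grouped fold assignment. It is an illustration under assumptions, not the report's pipeline: the function name `group_kfold` and the toy `families` list are hypothetical, and the greedy "largest group into the lightest fold" rule is one simple way to balance folds while keeping every family entirely on one side of each split.

```python
from collections import Counter

def group_kfold(groups, n_splits=5):
    """Return (train_idx, test_idx) pairs in which no group spans both sides."""
    sizes = Counter(groups)
    fold_load = [0] * n_splits
    fold_of_group = {}
    # Place the largest groups first, each into the currently lightest fold.
    for group, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        lightest = min(range(n_splits), key=fold_load.__getitem__)
        fold_of_group[group] = lightest
        fold_load[lightest] += size
    splits = []
    for fold in range(n_splits):
        test = [i for i, g in enumerate(groups) if fold_of_group[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of_group[g] != fold]
        splits.append((train, test))
    return splits

# Hypothetical toy data: one family label per strain.
families = (["Bacillaceae"] * 4 + ["Enterobacteriaceae"] * 3
            + ["Vibrionaceae"] * 2 + ["Myxococcaceae"])
splits = group_kfold(families, n_splits=3)
for train, test in splits:
    train_families = {families[i] for i in train}
    test_families = {families[i] for i in test}
    print(sorted(test_families), "held out;", len(train), "training strains")
```

Because whole families are held out, the test metric reflects performance on families the model has not seen, which is usually harsher than a random split.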
## Corpus

- Total strains in feature table: **46,029**
- Labeled-strain counts by target:
  - `optimal_temperature_c`: 45,621
  - `optimal_ph`: 5,103
  - `oxygen_requirement`: 21,639
  - `salt_tolerance_pct`: 3,844
## Target distributions

- `optimal_temperature_c`: n=45,621, mean=32.24, std=7.13, p10=27.50, median=30.00, p90=37.00
- `optimal_ph`: n=5,103, mean=7.19, std=0.82, p10=6.50, median=7.00, p90=8.00
- `salt_tolerance_pct`: n=3,844, mean=3.15, std=4.28, p10=0.00, median=2.50, p90=7.00
- `oxygen_requirement`:
  - `aerobe`: 7,803
  - `anaerobe`: 4,193
  - `microaerophile`: 3,804
  - `facultative anaerobe`: 3,389
  - `obligate aerobe`: 2,213
  - `obligate anaerobe`: 136
  - `facultative aerobe`: 87
  - `aerotolerant`: 12
  - `microaerotolerant`: 2
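The class counts above are enough to sanity-check the always-predict-majority baseline quoted for `oxygen_requirement`. The sketch below assumes the usual convention that a class which is never predicted gets an F1 of 0, and it pools all labeled strains rather than averaging per fold, so it only approximates how the report computed its figure.

```python
# Class counts copied from the oxygen_requirement distribution above.
counts = {
    "aerobe": 7803, "anaerobe": 4193, "microaerophile": 3804,
    "facultative anaerobe": 3389, "obligate aerobe": 2213,
    "obligate anaerobe": 136, "facultative aerobe": 87,
    "aerotolerant": 12, "microaerotolerant": 2,
}

def majority_macro_f1(class_counts):
    """Macro-F1 of a classifier that always predicts the most frequent class."""
    total = sum(class_counts.values())
    majority = max(class_counts, key=class_counts.get)
    f1_scores = []
    for label, n in class_counts.items():
        if label == majority:
            precision = n / total  # every sample is predicted as this class
            recall = 1.0           # so every true member is recovered
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)  # never predicted: F1 taken as 0 (assumption)
    return sum(f1_scores) / len(f1_scores)

baseline_f1 = majority_macro_f1(counts)
print(f"{baseline_f1:.3f}")  # 0.059
```

The result is so low because macro-F1 averages over all nine classes, eight of which score zero under this baseline; that also explains why the relative improvement for this target looks much larger than for the regression targets.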
## Per-target results (5-fold GroupKFold by family)

Metrics: regression = MAE (lower is better), classification = macro-F1 (higher is better).
Each metric is shown alongside a trivial baseline (always-predict-mean for regression, always-predict-majority for classification).

| Target | Task | n labeled | Model metric | Baseline | Improvement |
|---|---|---|---|---|---|
| `optimal_temperature_c` | regression | 45,621 | MAE=2.857 | MAE=4.981 | +42.6% |
| `optimal_ph` | regression | 5,103 | MAE=0.482 | MAE=0.546 | +11.6% |
| `oxygen_requirement` | classification | 21,639 | F1=0.357 | F1=0.059 | +507.0% |
| `salt_tolerance_pct` | regression | 3,844 | MAE=2.112 | MAE=2.515 | +16.0% |
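The Improvement column appears to be a relative change against the baseline, with the sign flipped for error metrics so that a positive number always means "better". A minimal sketch under that assumption (the function name `relative_improvement` is illustrative, not taken from the project):

```python
def relative_improvement(model, baseline, lower_is_better):
    """Percent improvement of the model over the baseline metric."""
    if lower_is_better:  # error metrics such as MAE
        return 100.0 * (baseline - model) / baseline
    return 100.0 * (model - baseline) / baseline  # scores such as macro-F1

temp_gain = relative_improvement(2.857, 4.981, lower_is_better=True)
salt_gain = relative_improvement(2.112, 2.515, lower_is_better=True)
print(f"{temp_gain:.1f}% {salt_gain:.1f}%")  # 42.6% 16.0%
```

Recomputing from the rounded table entries can differ from the reported value in the last digit (for example, `optimal_ph` gives 11.7 rather than 11.6), presumably because the report computed the percentage from unrounded metrics.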
### `optimal_temperature_c`: fold-by-fold

| Fold | Metric | Train | Test |
|---|---|---|---|
| 1 | mae = 2.953 | n=36,496 | n=9,125 |
| 2 | mae = 2.626 | n=36,497 | n=9,124 |
| 3 | mae = 3.060 | n=36,497 | n=9,124 |
| 4 | mae = 3.265 | n=36,497 | n=9,124 |
| 5 | mae = 2.381 | n=36,497 | n=9,124 |
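The headline MAE for this target is consistent with an unweighted mean of the five fold scores. The snippet below checks that and also reports the fold-to-fold spread, which the report does not state; treating the summary as a plain fold average is an inference from the numbers, not something the report confirms.

```python
from statistics import mean, stdev

# Fold MAEs copied from the table above.
fold_mae = [2.953, 2.626, 3.060, 3.265, 2.381]

cv_mean = mean(fold_mae)   # unweighted average across folds
cv_std = stdev(fold_mae)   # sample standard deviation across folds
print(f"MAE = {cv_mean:.3f} +/- {cv_std:.3f}")
```

A spread of this size relative to the mean is worth keeping in view when comparing against later model versions, since a small change in the headline number may sit inside fold-to-fold noise.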
**Top 10 features for `optimal_temperature_c`:**

- `ivywrel_frac`: 0.1235
- `iso_cat2_thermophilic_gt45_c`: 0.0288
- `iso_cat2_patient`: 0.0251
- `iso_cat2_human`: 0.0234
- `n_predicted_cds`: 0.0216
- `iso_cat1_infection`: 0.0204
- `aa_frac_C`: 0.0143
- `genome_size_nt`: 0.0123
- `tetra_CTAA`: 0.0118
- `aa_frac_D`: 0.0109
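For readers unfamiliar with the top feature, the sketch below shows one plausible way to compute it. It assumes `ivywrel_frac` is the fraction of proteome residues that are I, V, Y, W, R, E or L (the thermophily signature of Zeldovich et al. 2007); the function name and the toy sequences are hypothetical, and the project's exact definition may differ.

```python
IVYWREL = frozenset("IVYWREL")

def ivywrel_fraction(proteins):
    """Fraction of residues across all proteins that are I, V, Y, W, R, E or L."""
    total = 0
    hits = 0
    for seq in proteins:
        seq = seq.upper().rstrip("*")  # drop a trailing stop marker if present
        total += len(seq)
        hits += sum(1 for aa in seq if aa in IVYWREL)
    return hits / total if total else 0.0

# Hypothetical toy proteome: 7 of the 12 residues are in the IVYWREL set.
toy_proteome = ["MKVLEI", "MAYWRG*"]
frac = ivywrel_fraction(toy_proteome)
print(f"{frac:.4f}")  # 0.5833
```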
### `optimal_ph`: fold-by-fold

| Fold | Metric | Train | Test |
|---|---|---|---|
| 1 | mae = 0.440 | n=4,082 | n=1,021 |
| 2 | mae = 0.568 | n=4,082 | n=1,021 |
| 3 | mae = 0.494 | n=4,082 | n=1,021 |
| 4 | mae = 0.466 | n=4,083 | n=1,020 |
| 5 | mae = 0.444 | n=4,083 | n=1,020 |

**Top 10 features for `optimal_ph`:**

- `md_ph_median`: 0.0518
- `iso_cat2_acidic`: 0.0307
- `iso_cat2_alkaline`: 0.0287
- `neg_charged_frac`: 0.0146
- `aa_frac_H`: 0.0081
- `aa_frac_E`: 0.0077
- `tetra_CTCT`: 0.0071
- `iso_cat2_plant`: 0.0068
- `tetra_AGAC`: 0.0067
- `tetra_CACT`: 0.0065
### `oxygen_requirement`: fold-by-fold

| Fold | Metric | Train | Test |
|---|---|---|---|
| 1 | f1_macro = 0.353 | n=17,311 | n=4,328 |
| 2 | f1_macro = 0.375 | n=17,311 | n=4,326 |
| 3 | f1_macro = 0.357 | n=17,311 | n=4,328 |
| 4 | f1_macro = 0.274 | n=17,311 | n=4,328 |
| 5 | f1_macro = 0.429 | n=17,312 | n=4,327 |

**Top 10 features for `oxygen_requirement`:**

- `codon_ATA`: 0.0395
- `iso_cat1_host`: 0.0269
- `n_predicted_cds`: 0.0266
- `aa_frac_C`: 0.0195
- `iso_cat1_environmental`: 0.0162
- `codon_CGT`: 0.0144
- `iso_cat1_engineered`: 0.0139
- `iso_cat2_human`: 0.0124
- `genome_size_nt`: 0.0103
- `codon_TAA`: 0.0083
### `salt_tolerance_pct`: fold-by-fold

| Fold | Metric | Train | Test |
|---|---|---|---|
| 1 | mae = 1.926 | n=3,075 | n=769 |
| 2 | mae = 1.893 | n=3,075 | n=769 |
| 3 | mae = 2.746 | n=3,075 | n=769 |
| 4 | mae = 1.870 | n=3,075 | n=769 |
| 5 | mae = 2.128 | n=3,076 | n=768 |

**Top 10 features for `salt_tolerance_pct`:**

- `neg_charged_frac`: 0.0702
- `tetra_ATCC`: 0.0428
- `aa_frac_C`: 0.0298
- `iso_cat2_saline`: 0.0286
- `md_nacl_pct_median`: 0.0256
- `tetra_ACAT`: 0.0255
- `md_nacl_pct_max`: 0.0128
- `aa_frac_T`: 0.0120
- `codon_CCG`: 0.0093
- `tetra_TGAT`: 0.0089
## Feature-target correlations (Spearman, top 10)

Sanity-checks the biology: features known to track each target should appear here at high |ρ|. For example, `ivywrel_frac` should correlate with `optimal_temperature_c` (the thermophile signature of Zeldovich 2007).
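As a reminder of what the ρ values in this section measure, here is a dependency-free sketch of Spearman's rank correlation: rank both variables (ties share the mean rank), then take the Pearson correlation of the ranks. The toy numbers are hypothetical and p-values are not computed; the report's own figures presumably come from a statistics library.

```python
def average_ranks(values):
    """1-based ranks; tied values receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical toy data: larger genomes paired with lower optimal temperatures.
genome_size_mb = [2.1, 3.4, 4.8, 6.0, 9.5]
optimal_temp_c = [65.0, 37.0, 37.0, 30.0, 28.0]
rho = spearman(genome_size_mb, optimal_temp_c)
print(f"rho = {rho:+.3f}")
```

Because only ranks matter, ρ captures any monotonic relationship, which suits skewed targets such as `salt_tolerance_pct`.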
### `optimal_temperature_c`

| Feature | Spearman ρ | p-value |
|---|---|---|
| `genome_size_nt` | -0.493 | 0.0e+00 |
| `n_predicted_cds` | -0.482 | 0.0e+00 |
| `aa_frac_P` | -0.391 | 0.0e+00 |
| `aa_frac_Y` | +0.390 | 0.0e+00 |
| `tetra_TCTT` | +0.383 | 0.0e+00 |
| `tetra_TATC` | +0.381 | 0.0e+00 |
| `tetra_GATA` | +0.381 | 0.0e+00 |
| `tetra_AAGA` | +0.381 | 0.0e+00 |
| `tetra_CATA` | +0.380 | 0.0e+00 |
| `tetra_TATG` | +0.379 | 0.0e+00 |
### `optimal_ph`

| Feature | Spearman ρ | p-value |
|---|---|---|
| `md_ph_median` | +0.429 | 4.0e-131 |
| `neg_charged_frac` | +0.304 | 1.6e-109 |
| `mean_isoelectric_point` | -0.278 | 1.8e-91 |
| `aa_frac_E` | +0.256 | 4.5e-77 |
| `md_nacl_pct_max` | +0.218 | 1.9e-33 |
| `md_nacl_pct_median` | +0.212 | 9.9e-32 |
| `iso_cat2_alkaline` | +0.165 | 2.5e-32 |
| `ivywrel_frac` | +0.159 | 2.4e-30 |
| `codon_AAG` | -0.154 | 1.7e-28 |
| `codon_CGA` | +0.153 | 5.8e-28 |
### `salt_tolerance_pct`

| Feature | Spearman ρ | p-value |
|---|---|---|
| `neg_charged_frac` | +0.442 | 1.5e-183 |
| `mean_isoelectric_point` | -0.344 | 1.7e-107 |
| `aa_frac_E` | +0.310 | 3.1e-86 |
| `tetra_GACT` | +0.302 | 4.3e-82 |
| `tetra_AGTC` | +0.302 | 1.0e-81 |
| `md_nacl_pct_max` | +0.298 | 2.9e-52 |
| `md_nacl_pct_median` | +0.290 | 1.6e-49 |
| `tetra_ACTC` | +0.282 | 2.2e-71 |
| `tetra_GAGT` | +0.273 | 1.9e-66 |
| `iso_cat2_saline` | +0.263 | 9.4e-62 |
## Per-family error breakdown (regression targets)

MAE per family for the 15 most-represented families. Highlights where the model does well and where it struggles.
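A breakdown of this kind can be produced by grouping absolute errors by family and keeping the largest groups. The sketch below is one way to do it in plain Python; the function name, argument layout, and toy values are assumptions for illustration, not the report's code.

```python
from collections import defaultdict

def per_family_mae(families, y_true, y_pred, top_n=15):
    """Return (family, n, MAE) rows for the top_n most-represented families."""
    errors = defaultdict(list)
    for family, truth, pred in zip(families, y_true, y_pred):
        errors[family].append(abs(truth - pred))
    rows = [(family, len(errs), sum(errs) / len(errs))
            for family, errs in errors.items()]
    rows.sort(key=lambda row: (-row[1], row[0]))  # largest families first
    return rows[:top_n]

# Hypothetical toy predictions (degrees Celsius).
families = ["Bacillaceae", "Bacillaceae", "Bacillaceae",
            "Myxococcaceae", "Myxococcaceae"]
y_true = [37.0, 30.0, 55.0, 30.0, 30.0]
y_pred = [35.0, 31.0, 49.0, 30.0, 30.5]

table = per_family_mae(families, y_true, y_pred)
for family, n, mae in table:
    print(f"| {family} | {n} | {mae:.3f} |")
```

A very low per-family MAE is not automatically good news: if a family's labels barely vary, predicting a near-constant value is enough to score well.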
### `optimal_temperature_c`

| Family | n | MAE |
|---|---|---|
| Enterobacteriaceae | 2,662 | 3.792 |
| Streptomycetaceae | 2,212 | 1.783 |
| Bacillaceae | 1,886 | 3.174 |
| Lactobacillaceae | 1,732 | 3.709 |
| Pseudomonadaceae | 1,621 | 2.488 |
| Myxococcaceae | 1,546 | 0.238 |
| Streptococcaceae | 1,170 | 2.537 |
| Staphylococcaceae | 1,068 | 3.374 |
| Flavobacteriaceae | 981 | 4.116 |
| Corynebacteriaceae | 900 | 2.146 |
| Moraxellaceae | 890 | 3.388 |
| Paenibacillaceae | 760 | 3.081 |
| Microbacteriaceae | 734 | 2.459 |
| Micrococcaceae | 719 | 2.811 |
| Nocardiaceae | 715 | 2.276 |
### `optimal_ph`

| Family | n | MAE |
|---|---|---|
| Flavobacteriaceae | 355 | 0.405 |
| Bacillaceae | 298 | 0.606 |
| Roseobacteraceae | 204 | 0.375 |
| Paenibacillaceae | 139 | 0.469 |
| Microbacteriaceae | 120 | 0.446 |
| Sphingobacteriaceae | 114 | 0.336 |
| Sphingomonadaceae | 102 | 0.319 |
| Streptomycetaceae | 98 | 0.513 |
| Pseudonocardiaceae | 93 | 0.479 |
| Halomonadaceae | 82 | 0.584 |
| Micrococcaceae | 82 | 0.613 |
| Nocardioidaceae | 80 | 0.502 |
| Paracoccaceae | 76 | 0.574 |
| Alteromonadaceae | 71 | 0.355 |
| Erythrobacteraceae | 68 | 0.446 |
### `salt_tolerance_pct`

| Family | n | MAE |
|---|---|---|
| Flavobacteriaceae | 267 | 1.713 |
| Streptomycetaceae | 264 | 1.987 |
| Bacillaceae | 201 | 3.315 |
| Roseobacteraceae | 127 | 1.395 |
| Pseudonocardiaceae | 123 | 2.280 |
| Paenibacillaceae | 93 | 1.651 |
| Enterococcaceae | 93 | 2.935 |
| Microbacteriaceae | 91 | 2.789 |
| Micromonosporaceae | 90 | 1.609 |
| Sphingomonadaceae | 81 | 1.028 |
| Micrococcaceae | 71 | 2.613 |
| Streptosporangiaceae | 68 | 1.480 |
| Lactobacillaceae | 66 | 2.559 |
| Sphingobacteriaceae | 55 | 1.218 |
| Halomonadaceae | 52 | 2.815 |
## Known limitations

- **Survivorship bias.** BacDive only contains organisms that have been cultured successfully at least once. Generalization to truly uncultured strains cannot be assumed without explicit out-of-distribution evaluation.
- **Optimum derivation is heuristic.** Most BacDive temperature entries are tagged as `growth` (positive growth at this temperature), not `optimum`. We approximate the optimum as the median of positive-growth temperatures when no explicit optimum is recorded; this can be off by 5°C or more for some strains.
- **Family grouping is naive.** The current `family` column is derived from the genus (first word of binomial name). A proper LPSN/GTDB family assignment would give tighter taxonomic grouping.
- **Feature set is shallow.** No HMM/KEGG annotations, no codon usage indices, no tRNA counts. These are interpretable next steps before moving to genome LMs.
- **Pyrodigal accuracy.** Gene prediction quality drops on highly fragmented assemblies and atypical genetic codes. Not currently flagged in the feature set.
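The optimum-derivation heuristic described in the list above can be sketched as follows. The record layout (keys `temp_c`, `kind`, `growth`) is hypothetical and does not reflect BacDive's actual schema; taking the median when several explicit optima exist is also an assumption.

```python
from statistics import median

def derive_optimum(records):
    """Prefer an explicit optimum; otherwise use the median positive-growth temperature."""
    explicit = [r["temp_c"] for r in records if r["kind"] == "optimum"]
    if explicit:
        return median(explicit)  # assumption: median if several optima are listed
    positive = [r["temp_c"] for r in records
                if r["kind"] == "growth" and r["growth"]]
    return median(positive) if positive else None

# Hypothetical strain with growth tests only and no recorded optimum.
strain_records = [
    {"temp_c": 20.0, "kind": "growth", "growth": True},
    {"temp_c": 30.0, "kind": "growth", "growth": True},
    {"temp_c": 37.0, "kind": "growth", "growth": True},
    {"temp_c": 45.0, "kind": "growth", "growth": False},
]
estimate = derive_optimum(strain_records)
print(estimate)  # 30.0
```

The toy strain shows how the error arises: if its true optimum were 37°C, the derived label would be 7°C too low simply because the tested temperatures were not centered on the optimum.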
## Next steps

1. **Add tetranucleotide / codon-usage features.** ~50 extra columns, well-known signal for thermophily.
2. **Replace naive family lookup with LPSN/GTDB join.** Reduces leakage in CV.
3. **Integrate KOMODO media DB** as a richer label source than BacDive alone.
4. **Move to genome embeddings** (Nucleotide Transformer / Evo-1 / DNABERT-2) once the tabular ceiling is established.
5. **Active learning loop**: select novel-family strains where the model is uncertain, prioritize these for wet-lab cultivation testing.
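For step 1, tetranucleotide columns in the style of the `tetra_*` names seen in the feature lists could be computed roughly as below. This is a sketch under assumptions: it emits all 256 4-mers as separate columns, counts a single strand without merging reverse complements, and skips windows containing ambiguous bases; the project's actual feature definition and column selection may differ.

```python
from collections import Counter
from itertools import product

def tetranucleotide_freqs(sequence):
    """Relative frequency of every A/C/G/T 4-mer, using overlapping windows."""
    seq = sequence.upper()
    valid = set("ACGT")
    counts = Counter()
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if set(kmer) <= valid:  # skip windows with N or other ambiguity codes
            counts[kmer] += 1
    total = sum(counts.values())
    features = {}
    for letters in product("ACGT", repeat=4):
        kmer = "".join(letters)
        features["tetra_" + kmer] = counts[kmer] / total if total else 0.0
    return features

# Hypothetical toy contig: 6 valid windows, 3 of which are ACGT.
features = tetranucleotide_freqs("ACGTACGTNACGT")
print(len(features), features["tetra_ACGT"])  # 256 0.5
```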