Miyu Horiuchi

Add MediaDive-derived features (medium pH, NaCl, n_media) — all 4 targets improve

5df9ef8 27 days ago


microbe-model — v0 baseline eval report

Generated: 2026-05-05T10:42:09+00:00

TL;DR

- optimal_temperature_c: MAE = 2.86 (vs always-predict-mean 4.98, +43%)
- optimal_ph: MAE = 0.48 (vs always-predict-mean 0.55, +12%)
- oxygen_requirement: macro-F1 = 0.357 (vs always-predict-majority 0.059, +507%)
- salt_tolerance_pct: MAE = 2.11 (vs always-predict-mean 2.51, +16%)

Trained on 46,029 strains with 423 features: mostly genome-derived, plus isolation-category (`iso_cat*`) and MediaDive-derived (`md_*`) columns. Cross-validation: 5-fold GroupKFold by taxonomic family.
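The family-grouped cross-validation can be sketched as follows. This is a minimal illustration with scikit-learn's `GroupKFold`, not the report's actual training script: the toy arrays, the family labels (`fam_a` ... `fam_f`), and the use of 3 splits instead of 5 are assumptions made to keep the example small.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data (hypothetical): 12 strains from 6 families, two strains each.
X = np.arange(24, dtype=float).reshape(12, 2)
y = np.linspace(20.0, 42.0, 12)
families = np.array([
    "fam_a", "fam_a", "fam_b", "fam_b", "fam_c", "fam_c",
    "fam_d", "fam_d", "fam_e", "fam_e", "fam_f", "fam_f",
])

cv = GroupKFold(n_splits=3)  # the report uses 5 folds on the full corpus
leaks = 0
for fold, (train_idx, test_idx) in enumerate(cv.split(X, y, groups=families), start=1):
    # A family must never appear on both sides of a split.
    shared = set(families[train_idx]) & set(families[test_idx])
    leaks += len(shared)
    print(f"fold {fold}: train n={len(train_idx)}, test n={len(test_idx)}, "
          f"shared families={len(shared)}")
```

Because whole families are held out, each test fold measures generalization to families the model has not seen, which is usually a harder test than a random split.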

Corpus

Total strains in feature table: 46,029
Labeled-strain counts by target:
- optimal_temperature_c: 45,621
- optimal_ph: 5,103
- oxygen_requirement: 21,639
- salt_tolerance_pct: 3,844

Target distributions

optimal_temperature_c: n=45,621, mean=32.24, std=7.13, p10=27.50, median=30.00, p90=37.00
optimal_ph: n=5,103, mean=7.19, std=0.82, p10=6.50, median=7.00, p90=8.00
salt_tolerance_pct: n=3,844, mean=3.15, std=4.28, p10=0.00, median=2.50, p90=7.00
oxygen_requirement:
- aerobe: 7,803
- anaerobe: 4,193
- microaerophile: 3,804
- facultative anaerobe: 3,389
- obligate aerobe: 2,213
- obligate anaerobe: 136
- facultative aerobe: 87
- aerotolerant: 12
- microaerotolerant: 2
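The class counts above largely explain why the always-predict-majority baseline scores so low on macro-F1. The sketch below works through the arithmetic on the pooled counts. It assumes all nine classes enter the macro average and ignores fold structure, so it only approximates the per-fold baseline reported elsewhere in this document.

```python
# Class counts copied from the oxygen_requirement distribution.
counts = {
    "aerobe": 7803,
    "anaerobe": 4193,
    "microaerophile": 3804,
    "facultative anaerobe": 3389,
    "obligate aerobe": 2213,
    "obligate anaerobe": 136,
    "facultative aerobe": 87,
    "aerotolerant": 12,
    "microaerotolerant": 2,
}

total = sum(counts.values())
majority = max(counts, key=counts.get)

# Predicting the majority class for every strain: recall is 1.0 for that
# class, and precision equals its share of the labeled strains.
precision = counts[majority] / total
recall = 1.0
f1_majority = 2 * precision * recall / (precision + recall)

# Every other class is never predicted, so its F1 is 0.
macro_f1 = f1_majority / len(counts)
print(majority, total, round(macro_f1, 3))  # aerobe 21639 0.059
```

The long tail of rare classes pulls the macro average down for any model, which is worth keeping in mind when reading the +507% improvement.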

Per-target results (5-fold GroupKFold by family)

Metrics: regression = MAE (lower is better), classification = macro-F1 (higher is better). Each is shown alongside a naive baseline (always-predict-mean for regression, always-predict-majority for classification).

| Target | Task | n labeled | Model metric | Baseline | Improvement |
|---|---|---|---|---|---|
| `optimal_temperature_c` | regression | 45,621 | MAE=2.857 | MAE=4.981 | +42.6% |
| `optimal_ph` | regression | 5,103 | MAE=0.482 | MAE=0.546 | +11.6% |
| `oxygen_requirement` | classification | 21,639 | F1=0.357 | F1=0.059 | +507.0% |
| `salt_tolerance_pct` | regression | 3,844 | MAE=2.112 | MAE=2.515 | +16.0% |
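The Improvement column appears to be the relative change against the baseline. The report does not state the formula, so the two helper functions below are an assumption. Recomputing from the rounded table values reproduces the temperature and salt figures and lands within rounding distance of the other two, which were presumably computed from unrounded metrics.

```python
def mae_improvement_pct(model_mae, baseline_mae):
    """Relative MAE reduction versus the baseline, in percent (assumed formula)."""
    return 100.0 * (baseline_mae - model_mae) / baseline_mae


def f1_improvement_pct(model_f1, baseline_f1):
    """Relative macro-F1 gain versus the baseline, in percent (assumed formula)."""
    return 100.0 * (model_f1 - baseline_f1) / baseline_f1


# Rounded values copied from the results table.
improvements = {
    "optimal_temperature_c": mae_improvement_pct(2.857, 4.981),
    "optimal_ph": mae_improvement_pct(0.482, 0.546),
    "oxygen_requirement": f1_improvement_pct(0.357, 0.059),
    "salt_tolerance_pct": mae_improvement_pct(2.112, 2.515),
}

for target, pct in improvements.items():
    print(f"{target}: {pct:+.1f}%")
```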

`optimal_temperature_c` — fold-by-fold

| Fold | MAE | Train n | Test n |
|---|---|---|---|
| 1 | 2.953 | 36,496 | 9,125 |
| 2 | 2.626 | 36,497 | 9,124 |
| 3 | 3.060 | 36,497 | 9,124 |
| 4 | 3.265 | 36,497 | 9,124 |
| 5 | 2.381 | 36,497 | 9,124 |

Top 10 features for optimal_temperature_c:

ivywrel_frac — 0.1235
iso_cat2_thermophilic_gt45_c — 0.0288
iso_cat2_patient — 0.0251
iso_cat2_human — 0.0234
n_predicted_cds — 0.0216
iso_cat1_infection — 0.0204
aa_frac_C — 0.0143
genome_size_nt — 0.0123
tetra_CTAA — 0.0118
aa_frac_D — 0.0109
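The top feature, `ivywrel_frac`, is named after the IVYWREL amino-acid set (Ile, Val, Tyr, Trp, Arg, Glu, Leu) from Zeldovich 2007. A minimal sketch of how such a fraction could be computed is shown below. Pooling residues across all predicted proteins is an assumption, and the function name and toy sequences are hypothetical rather than taken from the project's feature code.

```python
IVYWREL = frozenset("IVYWREL")


def ivywrel_frac(proteins):
    """Fraction of residues in the IVYWREL set, pooled over all proteins.

    `proteins` is an iterable of amino-acid strings (one per predicted CDS).
    Pooling over the whole proteome is an assumption of this sketch.
    """
    proteins = list(proteins)
    total = sum(len(p) for p in proteins)
    if total == 0:
        return 0.0
    hits = sum(1 for p in proteins for aa in p.upper() if aa in IVYWREL)
    return hits / total


# Toy proteome: 7 of the 10 residues are in the IVYWREL set.
toy = ["MIVKEL", "WRAY"]
print(ivywrel_frac(toy))  # 0.7
```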

`optimal_ph` — fold-by-fold

| Fold | MAE | Train n | Test n |
|---|---|---|---|
| 1 | 0.440 | 4,082 | 1,021 |
| 2 | 0.568 | 4,082 | 1,021 |
| 3 | 0.494 | 4,082 | 1,021 |
| 4 | 0.466 | 4,083 | 1,020 |
| 5 | 0.444 | 4,083 | 1,020 |

Top 10 features for optimal_ph:

md_ph_median — 0.0518
iso_cat2_acidic — 0.0307
iso_cat2_alkaline — 0.0287
neg_charged_frac — 0.0146
aa_frac_H — 0.0081
aa_frac_E — 0.0077
tetra_CTCT — 0.0071
iso_cat2_plant — 0.0068
tetra_AGAC — 0.0067
tetra_CACT — 0.0065

`oxygen_requirement` — fold-by-fold

| Fold | Macro-F1 | Train n | Test n |
|---|---|---|---|
| 1 | 0.353 | 17,311 | 4,328 |
| 2 | 0.375 | 17,311 | 4,326 |
| 3 | 0.357 | 17,311 | 4,328 |
| 4 | 0.274 | 17,311 | 4,328 |
| 5 | 0.429 | 17,312 | 4,327 |

Top 10 features for oxygen_requirement:

codon_ATA — 0.0395
iso_cat1_host — 0.0269
n_predicted_cds — 0.0266
aa_frac_C — 0.0195
iso_cat1_environmental — 0.0162
codon_CGT — 0.0144
iso_cat1_engineered — 0.0139
iso_cat2_human — 0.0124
genome_size_nt — 0.0103
codon_TAA — 0.0083

`salt_tolerance_pct` — fold-by-fold

| Fold | MAE | Train n | Test n |
|---|---|---|---|
| 1 | 1.926 | 3,075 | 769 |
| 2 | 1.893 | 3,075 | 769 |
| 3 | 2.746 | 3,075 | 769 |
| 4 | 1.870 | 3,075 | 769 |
| 5 | 2.128 | 3,076 | 768 |

Top 10 features for salt_tolerance_pct:

neg_charged_frac — 0.0702
tetra_ATCC — 0.0428
aa_frac_C — 0.0298
iso_cat2_saline — 0.0286
md_nacl_pct_median — 0.0256
tetra_ACAT — 0.0255
md_nacl_pct_max — 0.0128
aa_frac_T — 0.0120
codon_CCG — 0.0093
tetra_TGAT — 0.0089

Feature ↔ target correlations (Spearman, top 10)

This section sanity-checks the biology: features known to track each target should appear here with high |ρ|. For example, ivywrel_frac should correlate with optimal_temperature_c (the thermophile signature of Zeldovich 2007).
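A rank correlation of this kind can be computed with `scipy.stats.spearmanr`. The sketch below uses made-up genome sizes and optimum temperatures purely for illustration, and it is not the report's analysis code.

```python
from scipy.stats import spearmanr

# Hypothetical values: larger genomes paired with lower optimum temperatures.
genome_size_nt = [1.8e6, 2.4e6, 3.1e6, 4.6e6, 6.2e6, 8.9e6]
optimal_temperature_c = [75.0, 60.0, 37.0, 37.5, 30.0, 28.0]

# spearmanr ranks both variables, then computes Pearson's r on the ranks.
rho, p_value = spearmanr(genome_size_nt, optimal_temperature_c)
print(f"rho={rho:+.3f}, p={p_value:.3g}")
```

Spearman's ρ only measures monotonic association, so a high |ρ| here does not imply that a feature is useful to the model once other features are present.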

`optimal_temperature_c`

| Feature | Spearman ρ | p-value |
|---|---|---|
| `genome_size_nt` | -0.493 | 0.0e+00 |
| `n_predicted_cds` | -0.482 | 0.0e+00 |
| `aa_frac_P` | -0.391 | 0.0e+00 |
| `aa_frac_Y` | +0.390 | 0.0e+00 |
| `tetra_TCTT` | +0.383 | 0.0e+00 |
| `tetra_TATC` | +0.381 | 0.0e+00 |
| `tetra_GATA` | +0.381 | 0.0e+00 |
| `tetra_AAGA` | +0.381 | 0.0e+00 |
| `tetra_CATA` | +0.380 | 0.0e+00 |
| `tetra_TATG` | +0.379 | 0.0e+00 |

`optimal_ph`

| Feature | Spearman ρ | p-value |
|---|---|---|
| `md_ph_median` | +0.429 | 4.0e-131 |
| `neg_charged_frac` | +0.304 | 1.6e-109 |
| `mean_isoelectric_point` | -0.278 | 1.8e-91 |
| `aa_frac_E` | +0.256 | 4.5e-77 |
| `md_nacl_pct_max` | +0.218 | 1.9e-33 |
| `md_nacl_pct_median` | +0.212 | 9.9e-32 |
| `iso_cat2_alkaline` | +0.165 | 2.5e-32 |
| `ivywrel_frac` | +0.159 | 2.4e-30 |
| `codon_AAG` | -0.154 | 1.7e-28 |
| `codon_CGA` | +0.153 | 5.8e-28 |

`salt_tolerance_pct`

| Feature | Spearman ρ | p-value |
|---|---|---|
| `neg_charged_frac` | +0.442 | 1.5e-183 |
| `mean_isoelectric_point` | -0.344 | 1.7e-107 |
| `aa_frac_E` | +0.310 | 3.1e-86 |
| `tetra_GACT` | +0.302 | 4.3e-82 |
| `tetra_AGTC` | +0.302 | 1.0e-81 |
| `md_nacl_pct_max` | +0.298 | 2.9e-52 |
| `md_nacl_pct_median` | +0.290 | 1.6e-49 |
| `tetra_ACTC` | +0.282 | 2.2e-71 |
| `tetra_GAGT` | +0.273 | 1.9e-66 |
| `iso_cat2_saline` | +0.263 | 9.4e-62 |

Per-family error breakdown (regression targets)

MAE for each of the 15 most-represented families, highlighting where the model does well and where it struggles.
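A per-family breakdown like this can be derived from out-of-fold predictions with a pandas group-by. The sketch below assumes a hypothetical table with `family`, `y_true`, and `y_pred` columns, and the values are invented for illustration.

```python
import pandas as pd

# Hypothetical out-of-fold predictions for one regression target.
preds = pd.DataFrame({
    "family": ["Bacillaceae", "Bacillaceae", "Bacillaceae",
               "Myxococcaceae", "Myxococcaceae"],
    "y_true": [37.0, 30.0, 55.0, 30.0, 30.0],
    "y_pred": [35.0, 33.0, 45.0, 30.5, 29.5],
})

preds["abs_err"] = (preds["y_true"] - preds["y_pred"]).abs()

# Count strains and average the absolute error within each family,
# then keep the 15 families with the most labeled strains.
per_family = (
    preds.groupby("family")["abs_err"]
    .agg(n="count", mae="mean")
    .sort_values("n", ascending=False)
    .head(15)
)
print(per_family)
```

A very low family MAE can simply reflect a family whose labels barely vary, so it helps to read these numbers next to the within-family spread of the target.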

`optimal_temperature_c`

| Family | n | MAE (°C) |
|---|---|---|
| Enterobacteriaceae | 2662 | 3.792 |
| Streptomycetaceae | 2212 | 1.783 |
| Bacillaceae | 1886 | 3.174 |
| Lactobacillaceae | 1732 | 3.709 |
| Pseudomonadaceae | 1621 | 2.488 |
| Myxococcaceae | 1546 | 0.238 |
| Streptococcaceae | 1170 | 2.537 |
| Staphylococcaceae | 1068 | 3.374 |
| Flavobacteriaceae | 981 | 4.116 |
| Corynebacteriaceae | 900 | 2.146 |
| Moraxellaceae | 890 | 3.388 |
| Paenibacillaceae | 760 | 3.081 |
| Microbacteriaceae | 734 | 2.459 |
| Micrococcaceae | 719 | 2.811 |
| Nocardiaceae | 715 | 2.276 |

`optimal_ph`

| Family | n | MAE (pH units) |
|---|---|---|
| Flavobacteriaceae | 355 | 0.405 |
| Bacillaceae | 298 | 0.606 |
| Roseobacteraceae | 204 | 0.375 |
| Paenibacillaceae | 139 | 0.469 |
| Microbacteriaceae | 120 | 0.446 |
| Sphingobacteriaceae | 114 | 0.336 |
| Sphingomonadaceae | 102 | 0.319 |
| Streptomycetaceae | 98 | 0.513 |
| Pseudonocardiaceae | 93 | 0.479 |
| Halomonadaceae | 82 | 0.584 |
| Micrococcaceae | 82 | 0.613 |
| Nocardioidaceae | 80 | 0.502 |
| Paracoccaceae | 76 | 0.574 |
| Alteromonadaceae | 71 | 0.355 |
| Erythrobacteraceae | 68 | 0.446 |

`salt_tolerance_pct`

| Family | n | MAE (percentage points) |
|---|---|---|
| Flavobacteriaceae | 267 | 1.713 |
| Streptomycetaceae | 264 | 1.987 |
| Bacillaceae | 201 | 3.315 |
| Roseobacteraceae | 127 | 1.395 |
| Pseudonocardiaceae | 123 | 2.280 |
| Paenibacillaceae | 93 | 1.651 |
| Enterococcaceae | 93 | 2.935 |
| Microbacteriaceae | 91 | 2.789 |
| Micromonosporaceae | 90 | 1.609 |
| Sphingomonadaceae | 81 | 1.028 |
| Micrococcaceae | 71 | 2.613 |
| Streptosporangiaceae | 68 | 1.480 |
| Lactobacillaceae | 66 | 2.559 |
| Sphingobacteriaceae | 55 | 1.218 |
| Halomonadaceae | 52 | 2.815 |

Known limitations

- Survivorship bias. BacDive only contains organisms that have been cultured successfully at least once. The model cannot be assumed to generalize to truly uncultured strains without explicit out-of-distribution evaluation.
- Optimum derivation is heuristic. Most BacDive temperature entries are tagged as `growth` (positive growth at this temperature), not `optimum`. When no explicit optimum is recorded, we approximate it as the median of the positive-growth temperatures; this can be off by 5°C or more for some strains.
- Family grouping is naive. The current family column is derived from the genus (first word of the binomial name). A proper LPSN/GTDB family assignment would give tighter taxonomic grouping.
- Feature set is shallow. No HMM/KEGG annotations, no codon usage indices, no tRNA counts. These are interpretable next steps before moving to genome LMs.
- Pyrodigal accuracy. Gene prediction quality drops on highly fragmented assemblies and atypical genetic codes. These cases are not currently flagged in the feature set.
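Two of the heuristics named in these limitations are simple enough to sketch: the median-of-growth-temperatures fallback for the optimum, and taking the genus as the first word of the binomial name. The function names and sample values below are hypothetical, and the real pipeline may handle edge cases differently.

```python
from statistics import median


def approx_optimum(explicit_optimum, growth_temps):
    """Return the recorded optimum, else the median positive-growth temperature."""
    if explicit_optimum is not None:
        return explicit_optimum
    if not growth_temps:
        return None  # no usable label for this strain
    return median(growth_temps)


def genus_from_binomial(name):
    """First whitespace-separated word of a binomial name (naive grouping key)."""
    parts = name.split()
    return parts[0] if parts else ""


# A strain tested at five temperatures with growth at all of them: the
# fallback yields 30, even if the true optimum sits nearer the upper end.
print(approx_optimum(None, [20, 25, 30, 37, 45]))  # 30
print(approx_optimum(37.0, [20, 25, 30, 37, 45]))  # 37.0
print(genus_from_binomial("Bacillus subtilis"))    # Bacillus
```

The first call shows how the fallback depends on which temperatures happened to be tested, which is one way the 5°C error mentioned above can arise.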

Next steps

- Add tetranucleotide / codon-usage features. ~50 extra columns, well-known signal for thermophily.
- Replace naive family lookup with LPSN/GTDB join. Reduces leakage in CV.
- Integrate KOMODO media DB as a richer label source than BacDive alone.
- Move to genome embeddings (Nucleotide Transformer / Evo-1 / DNABERT-2) once the tabular ceiling is established.
- Active learning loop: select novel-family strains where the model is uncertain and prioritize them for wet-lab cultivation testing.
