Final overnight commit: trained baseline + eval report + summary
Headline results (5-fold GroupKFold by family, 17,065 strains):
| Target | Task | n labeled | Model | Baseline | Improvement |
|-----------------------|---------------|-----------|-----------|-----------|-------------|
| optimal_temperature_c | regression | 17,025 | MAE=3.17 | MAE=5.53 | **+43%** |
| optimal_ph | regression | 4,654 | MAE=0.54 | MAE=0.55 | +1% |
| oxygen_requirement | classification | 10,434 | F1=0.28 | F1=0.07 | **+294%** |
| salt_tolerance_pct | regression | 4,800 | MAE=2.52 | MAE=2.72 | +7% |
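The Improvement column can be reproduced from the two metric columns. The sketch below assumes improvement is the relative change against the baseline (sign flipped for MAE, where lower is better) and that the macro-F1 baseline is an always-predict-majority classifier scored over all nine oxygen classes listed in the eval report. The function names are illustrative and are not taken from the repository.

```python
def relative_improvement(model: float, baseline: float, lower_is_better: bool) -> float:
    """Relative gain over the baseline as a fraction (0.43 means +43%)."""
    if lower_is_better:
        return (baseline - model) / baseline
    return (model - baseline) / baseline


def majority_macro_f1(class_counts: dict) -> float:
    """Macro-F1 of a classifier that always predicts the most common class."""
    total = sum(class_counts.values())
    majority = max(class_counts.values())
    precision = majority / total  # every prediction is the majority label
    recall = 1.0                  # every true majority sample is recovered
    f1_majority = 2 * precision * recall / (precision + recall)
    # No other class is ever predicted, so each of those contributes F1 = 0.
    return f1_majority / len(class_counts)


# Oxygen-requirement class counts as listed in the eval report.
OXYGEN_COUNTS = {
    "aerobe": 4978,
    "anaerobe": 2120,
    "facultative anaerobe": 1227,
    "obligate aerobe": 1029,
    "microaerophile": 889,
    "obligate anaerobe": 105,
    "facultative aerobe": 83,
    "microaerotolerant": 2,
    "aerotolerant": 1,
}

t_opt_gain = relative_improvement(3.173, 5.528, lower_is_better=True)
f1_baseline = majority_macro_f1(OXYGEN_COUNTS)
oxygen_gain = relative_improvement(0.28303, f1_baseline, lower_is_better=False)

print(f"T_opt improvement:  {t_opt_gain:+.1%}")
print(f"majority macro-F1:  {f1_baseline:.3f}")
print(f"oxygen improvement: {oxygen_gain:+.1%}")
```

Under these assumptions the large relative gain on oxygen_requirement comes mostly from a very low baseline: only one of the nine classes ever receives a non-zero F1.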
Biology validates: ivywrel_frac (Zeldovich 2007 thermophile signature) is the
dominant feature for T_opt prediction (importance 0.50). Strong correlations
also surface in the Spearman analysis: ivywrel_frac +0.32 with T_opt,
neg_charged_frac +0.31 with optimal_pH, etc.
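For context, ivywrel_frac is named for the seven amino acids I, V, Y, W, R, E and L, whose combined proteome fraction Zeldovich et al. reported as tracking optimal growth temperature. A minimal sketch of how such a fraction could be computed from predicted protein sequences follows. The function name and the choice to skip non-standard residues are assumptions, not the repository's implementation.

```python
# The seven residues in the IVYWREL set and the 20 standard amino acids.
IVYWREL = frozenset("IVYWREL")
STANDARD_AA = frozenset("ACDEFGHIKLMNPQRSTVWY")


def ivywrel_fraction(proteins):
    """Fraction of standard residues across all proteins that fall in IVYWREL.

    Ambiguous or non-standard symbols (X, *, B, Z, ...) are ignored.
    Returns 0.0 when no standard residue is present.
    """
    hits = 0
    total = 0
    for seq in proteins:
        for residue in seq.upper():
            if residue in STANDARD_AA:
                total += 1
                if residue in IVYWREL:
                    hits += 1
    return hits / total if total else 0.0


# Toy proteome: two short, made-up peptide strings.
print(ivywrel_fraction(["MKVLEIRW", "MAAGTS*"]))
```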
Per-family breakdown reveals where the model struggles:
- Easy: Streptomycetaceae (T_opt MAE 1.45), Sphingomonadaceae (1.85)
- Hard: Clostridiaceae (T_opt MAE 4.56), Bacillaceae (4.09)
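A per-family breakdown of this kind can be derived from per-strain predictions by grouping absolute errors by family. The sketch below uses plain Python and made-up records; the `(family, y_true, y_pred)` layout is an assumption about the predictions table, not its actual schema.

```python
from collections import defaultdict


def mae_by_family(records):
    """Map each family to (n_strains, mean absolute error)."""
    abs_errors = defaultdict(list)
    for family, y_true, y_pred in records:
        abs_errors[family].append(abs(y_true - y_pred))
    return {
        family: (len(errs), sum(errs) / len(errs))
        for family, errs in abs_errors.items()
    }


# Hypothetical predictions, for illustration only.
toy_records = [
    ("Streptomycetaceae", 28.0, 29.0),
    ("Streptomycetaceae", 30.0, 28.0),
    ("Clostridiaceae", 37.0, 42.0),
]

# Print families from lowest to highest MAE.
for family, (n, mae) in sorted(mae_by_family(toy_records).items(), key=lambda kv: kv[1][1]):
    print(f"{family}: n={n}, MAE={mae:.2f}")
```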
Pipeline timing:
- BacDive scan: 10 min for 100,866 strains
- Streaming featurize: 270 min (~4.5 hr) for 17,065 genomes (99.83% success)
- Train + eval + summary: ~50 sec end-to-end
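The featurize figures can be cross-checked against the counts in OVERNIGHT_SUMMARY.md (17,065 successes out of 17,094 processed). The short calculation below assumes the 270 minutes cover every processed genome, including the failures.

```python
processed = 17_094        # genomes attempted, per OVERNIGHT_SUMMARY.md
successful = 17_065       # genomes featurized successfully
featurize_minutes = 270   # wall-clock time quoted for the streaming featurize phase

success_rate = successful / processed
seconds_per_genome = featurize_minutes * 60 / processed

print(f"success rate: {success_rate:.2%}")
print(f"throughput:   {seconds_per_genome:.2f} s per genome")
```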
Now committing the human-readable artifacts:
- artifacts/eval_report.md (the main result)
- artifacts/baseline_results.json (machine-readable per-fold scores)
- artifacts/run.log + per-phase logs (audit trail)
- OVERNIGHT_SUMMARY.md (top-level entry point, already tracked)
Updated .gitignore to allow these specific files inside artifacts/ while still
excluding the gitignored parquet files.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- .gitignore +10 -0
- OVERNIGHT_SUMMARY.md +13 -12
- artifacts/baseline_results.json +311 -0
- artifacts/eval.log +1 -0
- artifacts/eval_report.md +270 -0
- artifacts/run.log +7 -0
- artifacts/summary.log +1 -0
- artifacts/train.log +13 -0
@@ -24,6 +24,16 @@ venv/
 *.gbff
 *.gbff.gz
 
+# But keep small, human-readable artifacts in version control
+!artifacts/
+artifacts/*
+!artifacts/eval_report.md
+!artifacts/baseline_results.json
+!artifacts/run.log
+!artifacts/train.log
+!artifacts/eval.log
+!artifacts/summary.log
+
 # Notebooks
 .ipynb_checkpoints/
 notebooks/scratch/
@@ -1,6 +1,6 @@
 # Overnight run - summary
 
-_Written 2026-04-
+_Written 2026-04-27T02:16+00:00_
 
 ## Pipeline status
 
@@ -8,28 +8,29 @@ _Written 2026-04-27T01:45+00:00_
 - 19,637 have genome accessions
 - 50,384 have optimal_temperature_c labels
 - **17,054** strains are training-ready (genome + T_opt)
--
--  - Processed:
--  - Successful:
--  - Failed:
--
--
+- Featurize: complete
+  - Processed: 17,094 / 17,094
+  - Successful: 17,065 (99.8%)
+  - Failed: 29 (mostly suppressed/withdrawn NCBI assemblies)
+- Training: see `artifacts/train.log` for stdout
+- Eval report: **`artifacts/eval_report.md`**
 
 ## What to read first
 
-1.
+1. Open **`artifacts/eval_report.md`**: headline metrics + per-target detail.
 2. Check `git log --oneline` to see the commit timeline.
 
 ## Files of interest
 
--
--
+- `artifacts/eval_report.md` - headline result + metrics 0.0 MB
+- `artifacts/baseline_results.json` - machine-readable per-fold scores 0.0 MB
 - `data/bacdive_phenotypes.parquet` - phenotype labels (gitignored) 1.7 MB
--
--
+- `data/features.parquet` - extracted genome features (gitignored) 5.4 MB
+- `data/training_table.parquet` - merged + group-keyed table used for training (gitignored) 5.8 MB
 
 ## Commits since yesterday
 
+- 72e12e7 Make OVERNIGHT_SUMMARY.md write atomic (avoid race with regen loop)
 - 2ea77d1 Add v1 composition features (tetranucleotides + codon usage)
 - 316196d Fix predictions parquet type mix + plumb feature_cols through eval
 - 7db9544 Add tests for explore module (correlations + class means)
@@ -0,0 +1,311 @@
+{
+  "optimal_temperature_c": {
+    "task": "regression",
+    "mean_metric": 3.1725860578562193,
+    "folds": [
+      {
+        "target": "optimal_temperature_c",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 2.8455384323075306,
+        "n_train": 13620,
+        "n_test": 3405
+      },
+      {
+        "target": "optimal_temperature_c",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 3.4572334195443712,
+        "n_train": 13620,
+        "n_test": 3405
+      },
+      {
+        "target": "optimal_temperature_c",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 3.4555605416991115,
+        "n_train": 13620,
+        "n_test": 3405
+      },
+      {
+        "target": "optimal_temperature_c",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 2.7375921792185776,
+        "n_train": 13620,
+        "n_test": 3405
+      },
+      {
+        "target": "optimal_temperature_c",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 3.3670057165115064,
+        "n_train": 13620,
+        "n_test": 3405
+      }
+    ],
+    "top_features": {
+      "ivywrel_frac": 0.4960219383239746,
+      "n_predicted_cds": 0.05391158536076546,
+      "pos_charged_frac": 0.039320334792137146,
+      "aa_frac_P": 0.02822224497795105,
+      "aa_frac_C": 0.026947063207626343,
+      "aa_frac_Y": 0.0244345061480999,
+      "aa_frac_S": 0.022366111725568773,
+      "aa_frac_E": 0.021237003803253173,
+      "mean_isoelectric_point": 0.02086108475923538,
+      "genome_size_nt": 0.019465378299355508,
+      "aa_frac_L": 0.019293070212006568,
+      "aa_frac_V": 0.016112500242888926,
+      "neg_charged_frac": 0.01534007117152214,
+      "aa_frac_D": 0.015074614621698856,
+      "aa_frac_R": 0.01491919681429863,
+      "aa_frac_Q": 0.014176080748438836,
+      "aa_frac_I": 0.013733518309891224,
+      "mean_hydrophobicity": 0.013560011424124241,
+      "aa_frac_T": 0.012679264973849058,
+      "aa_frac_W": 0.011554591916501522
+    }
+  },
+  "optimal_ph": {
+    "task": "regression",
+    "mean_metric": 0.5397326897468703,
+    "folds": [
+      {
+        "target": "optimal_ph",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 0.4874841267004945,
+        "n_train": 3723,
+        "n_test": 931
+      },
+      {
+        "target": "optimal_ph",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 0.5713520769145897,
+        "n_train": 3723,
+        "n_test": 931
+      },
+      {
+        "target": "optimal_ph",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 0.5535044743039043,
+        "n_train": 3723,
+        "n_test": 931
+      },
+      {
+        "target": "optimal_ph",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 0.5538595587200816,
+        "n_train": 3723,
+        "n_test": 931
+      },
+      {
+        "target": "optimal_ph",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 0.5324632120952811,
+        "n_train": 3724,
+        "n_test": 930
+      }
+    ],
+    "top_features": {
+      "neg_charged_frac": 0.11188828796148301,
+      "aa_frac_H": 0.06443588957190513,
+      "ivywrel_frac": 0.04711368381977081,
+      "aa_frac_Q": 0.0425002209842205,
+      "aa_frac_E": 0.04013040736317634,
+      "n_predicted_cds": 0.0346000149846077,
+      "mean_hydrophobicity": 0.033536428958177565,
+      "aa_frac_L": 0.03228112906217575,
+      "aa_frac_C": 0.03177782744169235,
+      "aa_frac_V": 0.031084126606583596,
+      "mean_isoelectric_point": 0.03050720915198326,
+      "aa_frac_S": 0.02965335138142109,
+      "aa_frac_Y": 0.02864424102008343,
+      "aa_frac_W": 0.028111560642719267,
+      "aa_frac_P": 0.027575015649199487,
+      "aa_frac_I": 0.02686116211116314,
+      "pos_charged_frac": 0.02524181194603443,
+      "aa_frac_R": 0.024946783483028413,
+      "aromatic_frac": 0.024847164377570154,
+      "aa_frac_D": 0.024307455867528915
+    }
+  },
+  "oxygen_requirement": {
+    "task": "classification",
+    "mean_metric": 0.28303272250198364,
+    "folds": [
+      {
+        "target": "oxygen_requirement",
+        "task": "classification",
+        "metric_name": "f1_macro",
+        "value": 0.2740537993495899,
+        "n_train": 8347,
+        "n_test": 2086
+      },
+      {
+        "target": "oxygen_requirement",
+        "task": "classification",
+        "metric_name": "f1_macro",
+        "value": 0.27987706315870853,
+        "n_train": 8347,
+        "n_test": 2087
+      },
+      {
+        "target": "oxygen_requirement",
+        "task": "classification",
+        "metric_name": "f1_macro",
+        "value": 0.2906654309479178,
+        "n_train": 8347,
+        "n_test": 2087
+      },
+      {
+        "target": "oxygen_requirement",
+        "task": "classification",
+        "metric_name": "f1_macro",
+        "value": 0.2668782460426221,
+        "n_train": 8347,
+        "n_test": 2087
+      },
+      {
+        "target": "oxygen_requirement",
+        "task": "classification",
+        "metric_name": "f1_macro",
+        "value": 0.3036890730110796,
+        "n_train": 8348,
+        "n_test": 2086
+      }
+    ],
+    "top_features": {
+      "aa_frac_C": 0.10454434156417847,
+      "genome_size_nt": 0.0889880508184433,
+      "n_predicted_cds": 0.061901997029781344,
+      "aa_frac_Q": 0.054254394769668576,
+      "aa_frac_K": 0.037319378927350044,
+      "aa_frac_W": 0.03615027591586113,
+      "aa_frac_M": 0.03541129007935524,
+      "aa_frac_H": 0.03520122580230236,
+      "aa_frac_G": 0.03354446776211262,
+      "ivywrel_frac": 0.0306744959205389,
+      "aa_frac_Y": 0.03018171414732933,
+      "aa_frac_L": 0.029987764731049537,
+      "pos_charged_frac": 0.029102231562137603,
+      "aa_frac_V": 0.02888880521059036,
+      "aa_frac_N": 0.02404877059161663,
+      "aa_frac_F": 0.023649439588189124,
+      "aa_frac_E": 0.023098185285925867,
+      "aa_frac_P": 0.022718899697065354,
+      "aa_frac_R": 0.021204233914613724,
+      "aa_frac_I": 0.02029264494776726
+    }
+  },
+  "salt_tolerance_pct": {
+    "task": "regression",
+    "mean_metric": 2.522809721716851,
+    "folds": [
+      {
+        "target": "salt_tolerance_pct",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 2.182882065629829,
+        "n_train": 3840,
+        "n_test": 960
+      },
+      {
+        "target": "salt_tolerance_pct",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 2.5901312468783932,
+        "n_train": 3840,
+        "n_test": 960
+      },
+      {
+        "target": "salt_tolerance_pct",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 3.1095002870118864,
+        "n_train": 3840,
+        "n_test": 960
+      },
+      {
+        "target": "salt_tolerance_pct",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 2.4400568475718374,
+        "n_train": 3840,
+        "n_test": 960
+      },
+      {
+        "target": "salt_tolerance_pct",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 2.291478161492307,
+        "n_train": 3840,
+        "n_test": 960
+      }
+    ],
+    "top_features": {
+      "aa_frac_C": 0.15850864201784134,
+      "neg_charged_frac": 0.14477943778038024,
+      "aa_frac_T": 0.03598657064139843,
+      "mean_isoelectric_point": 0.030603666976094245,
+      "aa_frac_H": 0.030093548446893693,
+      "aa_frac_D": 0.02875022515654564,
+      "aa_frac_L": 0.02756672874093056,
+      "aa_frac_K": 0.026532869413495065,
+      "aa_frac_W": 0.02639944478869438,
+      "ivywrel_frac": 0.026095620542764663,
+      "aa_frac_I": 0.025597410276532173,
+      "aa_frac_S": 0.0251168642193079,
+      "aa_frac_M": 0.02414705790579319,
+      "aa_frac_Y": 0.02411585859954357,
+      "aa_frac_V": 0.022448396496474742,
+      "mean_hydrophobicity": 0.022069810889661312,
+      "mean_cds_aa_length": 0.02146391086280346,
+      "aromatic_frac": 0.02145920880138874,
+      "aa_frac_P": 0.02134738564491272,
+      "n_predicted_cds": 0.020885448157787322
+    }
+  },
+  "__meta__": {
+    "feature_cols": [
+      "genome_size_nt",
+      "n_contigs",
+      "gc_content",
+      "n_predicted_cds",
+      "coding_density",
+      "mean_cds_aa_length",
+      "median_cds_aa_length",
+      "aromatic_frac",
+      "pos_charged_frac",
+      "neg_charged_frac",
+      "ivywrel_frac",
+      "mean_hydrophobicity",
+      "mean_isoelectric_point",
+      "aa_frac_A",
+      "aa_frac_C",
+      "aa_frac_D",
+      "aa_frac_E",
+      "aa_frac_F",
+      "aa_frac_G",
+      "aa_frac_H",
+      "aa_frac_I",
+      "aa_frac_K",
+      "aa_frac_L",
+      "aa_frac_M",
+      "aa_frac_N",
+      "aa_frac_P",
+      "aa_frac_Q",
+      "aa_frac_R",
+      "aa_frac_S",
+      "aa_frac_T",
+      "aa_frac_V",
+      "aa_frac_W",
+      "aa_frac_Y"
+    ]
+  }
+}
@@ -0,0 +1 @@
+Wrote /Users/miyuhoriuchi/microbe-model/artifacts/eval_report.md
@@ -0,0 +1,270 @@
+# microbe-model - v0 baseline eval report
+
+_Generated: 2026-04-27T02:16:35+00:00_
+
+## TL;DR
+
+- **`optimal_temperature_c`**: MAE = **3.17** (vs always-predict-mean 5.53, **+43%**)
+- **`optimal_ph`**: MAE = **0.54** (vs always-predict-mean 0.55, **+1%**)
+- **`oxygen_requirement`**: macro-F1 = **0.283** (vs always-predict-majority 0.072, **+294%**)
+- **`salt_tolerance_pct`**: MAE = **2.52** (vs always-predict-mean 2.72, **+7%**)
+
+Trained on **17,065** strains with **33** genome-derived features. Cross-validation: 5-fold GroupKFold by taxonomic family.
+
+## Corpus
+
+- Total strains in feature table: **17,065**
+- Labeled-strain counts by target:
+  - `optimal_temperature_c`: 17,025
+  - `optimal_ph`: 4,654
+  - `oxygen_requirement`: 10,434
+  - `salt_tolerance_pct`: 4,800
+
+## Target distributions
+
+- `optimal_temperature_c`: n=17,025, mean=31.96, std=8.57, p10=25.00, median=30.00, p90=37.00
+- `optimal_ph`: n=4,654, mean=7.19, std=0.83, p10=6.50, median=7.00, p90=8.00
+- `salt_tolerance_pct`: n=4,800, mean=3.55, std=4.11, p10=0.00, median=2.50, p90=8.00
+- `oxygen_requirement`:
+  - `aerobe`: 4,978
+  - `anaerobe`: 2,120
+  - `facultative anaerobe`: 1,227
+  - `obligate aerobe`: 1,029
+  - `microaerophile`: 889
+  - `obligate anaerobe`: 105
+  - `facultative aerobe`: 83
+  - `microaerotolerant`: 2
+  - `aerotolerant`: 1
+
+## Per-target results (5-fold GroupKFold by family)
+
+Metrics: regression = MAE (lower is better), classification = macro-F1 (higher is better).
+Each is shown alongside the dumb-baseline (always-predict-mean / always-predict-majority).
+
+| Target | Task | n labeled | Model metric | Baseline | Improvement |
+|---|---|---|---|---|---|
+| `optimal_temperature_c` | regression | 17,025 | MAE=3.173 | MAE=5.528 | +42.6% |
+| `optimal_ph` | regression | 4,654 | MAE=0.540 | MAE=0.546 | +1.2% |
+| `oxygen_requirement` | classification | 10,434 | F1=0.283 | F1=0.072 | +294.3% |
+| `salt_tolerance_pct` | regression | 4,800 | MAE=2.523 | MAE=2.720 | +7.3% |
+
+### `optimal_temperature_c` - fold-by-fold
+
+| Fold | Metric | Train | Test |
+|---|---|---|---|
+| 1 | mae = 2.846 | n=13,620 | n=3,405 |
+| 2 | mae = 3.457 | n=13,620 | n=3,405 |
+| 3 | mae = 3.456 | n=13,620 | n=3,405 |
+| 4 | mae = 2.738 | n=13,620 | n=3,405 |
+| 5 | mae = 3.367 | n=13,620 | n=3,405 |
+
+**Top 10 features for `optimal_temperature_c`:**
+
+- `ivywrel_frac`: 0.4960
+- `n_predicted_cds`: 0.0539
+- `pos_charged_frac`: 0.0393
+- `aa_frac_P`: 0.0282
+- `aa_frac_C`: 0.0269
+- `aa_frac_Y`: 0.0244
+- `aa_frac_S`: 0.0224
+- `aa_frac_E`: 0.0212
+- `mean_isoelectric_point`: 0.0209
+- `genome_size_nt`: 0.0195
+
+### `optimal_ph` - fold-by-fold
+
+| Fold | Metric | Train | Test |
+|---|---|---|---|
+| 1 | mae = 0.487 | n=3,723 | n=931 |
+| 2 | mae = 0.571 | n=3,723 | n=931 |
+| 3 | mae = 0.554 | n=3,723 | n=931 |
+| 4 | mae = 0.554 | n=3,723 | n=931 |
+| 5 | mae = 0.532 | n=3,724 | n=930 |
+
+**Top 10 features for `optimal_ph`:**
+
+- `neg_charged_frac`: 0.1119
+- `aa_frac_H`: 0.0644
+- `ivywrel_frac`: 0.0471
+- `aa_frac_Q`: 0.0425
+- `aa_frac_E`: 0.0401
+- `n_predicted_cds`: 0.0346
+- `mean_hydrophobicity`: 0.0335
+- `aa_frac_L`: 0.0323
+- `aa_frac_C`: 0.0318
+- `aa_frac_V`: 0.0311
+
+### `oxygen_requirement` - fold-by-fold
+
+| Fold | Metric | Train | Test |
+|---|---|---|---|
+| 1 | f1_macro = 0.274 | n=8,347 | n=2,086 |
+| 2 | f1_macro = 0.280 | n=8,347 | n=2,087 |
+| 3 | f1_macro = 0.291 | n=8,347 | n=2,087 |
+| 4 | f1_macro = 0.267 | n=8,347 | n=2,087 |
+| 5 | f1_macro = 0.304 | n=8,348 | n=2,086 |
+
+**Top 10 features for `oxygen_requirement`:**
+
+- `aa_frac_C`: 0.1045
+- `genome_size_nt`: 0.0890
+- `n_predicted_cds`: 0.0619
+- `aa_frac_Q`: 0.0543
+- `aa_frac_K`: 0.0373
+- `aa_frac_W`: 0.0362
+- `aa_frac_M`: 0.0354
+- `aa_frac_H`: 0.0352
+- `aa_frac_G`: 0.0335
+- `ivywrel_frac`: 0.0307
+
+### `salt_tolerance_pct` - fold-by-fold
+
+| Fold | Metric | Train | Test |
+|---|---|---|---|
+| 1 | mae = 2.183 | n=3,840 | n=960 |
+| 2 | mae = 2.590 | n=3,840 | n=960 |
+| 3 | mae = 3.110 | n=3,840 | n=960 |
+| 4 | mae = 2.440 | n=3,840 | n=960 |
+| 5 | mae = 2.291 | n=3,840 | n=960 |
+
+**Top 10 features for `salt_tolerance_pct`:**
+
+- `aa_frac_C`: 0.1585
+- `neg_charged_frac`: 0.1448
+- `aa_frac_T`: 0.0360
+- `mean_isoelectric_point`: 0.0306
+- `aa_frac_H`: 0.0301
+- `aa_frac_D`: 0.0288
+- `aa_frac_L`: 0.0276
+- `aa_frac_K`: 0.0265
+- `aa_frac_W`: 0.0264
+- `ivywrel_frac`: 0.0261
+
+## Feature-target correlations (Spearman, top 10)
+
+Sanity-checks the biology: features known to track each target should appear here at high |ρ|. E.g. `ivywrel_frac` should correlate with `optimal_temperature_c` (Zeldovich 2007 thermophile signature).
+
+### `optimal_temperature_c`
+
+| Feature | Spearman ρ | p-value |
+|---|---|---|
+| `genome_size_nt` | -0.405 | 0.0e+00 |
+| `aa_frac_E` | +0.388 | 0.0e+00 |
+| `n_predicted_cds` | -0.386 | 0.0e+00 |
+| `ivywrel_frac` | +0.320 | 0.0e+00 |
+| `aa_frac_Y` | +0.318 | 0.0e+00 |
+| `aa_frac_W` | -0.309 | 0.0e+00 |
+| `pos_charged_frac` | +0.299 | 0.0e+00 |
+| `aa_frac_A` | -0.295 | 0.0e+00 |
+| `neg_charged_frac` | +0.293 | 0.0e+00 |
+| `aa_frac_P` | -0.277 | 1.0e-297 |
+
+### `optimal_ph`
+
+| Feature | Spearman ρ | p-value |
+|---|---|---|
+| `neg_charged_frac` | +0.308 | 4.2e-103 |
+| `mean_isoelectric_point` | -0.276 | 3.8e-82 |
+| `aa_frac_E` | +0.260 | 6.2e-73 |
+| `ivywrel_frac` | +0.166 | 4.0e-30 |
+| `aa_frac_D` | +0.111 | 2.6e-14 |
+| `mean_hydrophobicity` | -0.109 | 7.3e-14 |
+| `aa_frac_C` | -0.095 | 7.6e-11 |
+| `pos_charged_frac` | -0.086 | 3.5e-09 |
+| `aa_frac_P` | -0.081 | 3.3e-08 |
+| `aa_frac_A` | -0.078 | 9.6e-08 |
+
+### `salt_tolerance_pct`
+
+| Feature | Spearman ρ | p-value |
+|---|---|---|
+| `neg_charged_frac` | +0.227 | 3.2e-57 |
+| `mean_isoelectric_point` | -0.204 | 3.5e-46 |
+| `ivywrel_frac` | +0.195 | 1.7e-42 |
+| `aa_frac_C` | -0.186 | 1.1e-38 |
+| `mean_cds_aa_length` | -0.161 | 3.8e-29 |
+| `aa_frac_D` | +0.159 | 1.4e-28 |
+| `aa_frac_E` | +0.143 | 2.4e-23 |
+| `aa_frac_V` | +0.112 | 8.3e-15 |
+| `aa_frac_T` | +0.104 | 4.4e-13 |
+| `coding_density` | -0.090 | 4.1e-10 |
+
+## Per-family error breakdown (regression targets)
+
+Top 15 most-represented families, MAE per family. Highlights where the model is doing well vs. struggling.
+
+### `optimal_temperature_c`
+
+| Family | n | MAE |
+|---|---|---|
+| Streptomycetaceae | 798 | 1.451 |
+| Bacillaceae | 643 | 4.086 |
+| Flavobacteriaceae | 631 | 4.195 |
+| Lactobacillaceae | 471 | 3.161 |
+| Enterobacteriaceae | 439 | 3.967 |
+| Microbacteriaceae | 396 | 2.457 |
+| Pseudomonadaceae | 388 | 2.523 |
+| Roseobacteraceae | 341 | 2.992 |
+| Paenibacillaceae | 319 | 3.474 |
+| Pseudonocardiaceae | 306 | 2.184 |
+| Moraxellaceae | 269 | 2.723 |
+| Sphingomonadaceae | 256 | 1.845 |
+| Streptococcaceae | 251 | 3.063 |
+| Clostridiaceae | 247 | 4.560 |
+| Vibrionaceae | 239 | 3.290 |
+
+### `optimal_ph`
+
+| Family | n | MAE |
+|---|---|---|
+| Flavobacteriaceae | 323 | 0.424 |
+| Bacillaceae | 273 | 0.657 |
+| Roseobacteraceae | 192 | 0.363 |
+| Paenibacillaceae | 126 | 0.484 |
+| Microbacteriaceae | 112 | 0.510 |
+| Sphingobacteriaceae | 100 | 0.376 |
+| Sphingomonadaceae | 96 | 0.365 |
+| Streptomycetaceae | 92 | 0.742 |
+| Pseudonocardiaceae | 85 | 0.547 |
+| Halomonadaceae | 81 | 0.748 |
+| Nocardioidaceae | 74 | 0.516 |
+| Paracoccaceae | 71 | 0.563 |
+| Micrococcaceae | 71 | 0.626 |
+| Erythrobacteraceae | 68 | 0.442 |
+| Alteromonadaceae | 68 | 0.375 |
+
+### `salt_tolerance_pct`
+
+| Family | n | MAE |
+|---|---|---|
+| Flavobacteriaceae | 285 | 1.491 |
+| Streptomycetaceae | 283 | 2.070 |
+| Bacillaceae | 246 | 3.241 |
+| Microbacteriaceae | 140 | 2.808 |
+| Pseudonocardiaceae | 134 | 2.501 |
+| Roseobacteraceae | 134 | 1.533 |
+| Paenibacillaceae | 125 | 2.387 |
+| Pseudomonadaceae | 110 | 3.870 |
+| Vibrionaceae | 100 | 2.653 |
+| Sphingomonadaceae | 92 | 2.029 |
+| Micromonosporaceae | 90 | 1.724 |
+| Micrococcaceae | 85 | 3.045 |
+| Nocardiaceae | 84 | 2.613 |
+| Streptococcaceae | 82 | 1.390 |
+| Lactobacillaceae | 78 | 2.506 |
+
+## Known limitations
+
+- **Survivorship bias.** BacDive only contains organisms that have been cultured successfully at least once. The model cannot generalize to truly uncultured strains without explicit out-of-distribution evaluation.
+- **Optimum derivation is heuristic.** Most BacDive temperature entries are tagged as `growth` (positive growth at this temperature), not `optimum`. We approximate the optimum as the median of positive-growth temperatures when no explicit optimum is recorded; this can be off by 5°C or more for some strains.
+- **Family grouping is naive.** The current `family` column is derived from the genus (first word of binomial name). A proper LPSN/GTDB family assignment would give tighter taxonomic grouping.
+- **Feature set is shallow.** No HMM/KEGG annotations, no codon usage indices, no tRNA counts. These are interpretable next steps before moving to genome LMs.
+- **Pyrodigal accuracy.** Gene prediction quality drops on highly-fragmented assemblies and atypical genetic codes. Not currently flagged in the feature set.
+
+## Next steps
+
+1. **Add tetranucleotide / codon-usage features.** ~50 extra columns, well-known signal for thermophily.
+2. **Replace naive family lookup with LPSN/GTDB join.** Reduces leakage in CV.
+3. **Integrate KOMODO media DB** as a richer label source than BacDive alone.
+4. **Move to genome embeddings** (Nucleotide Transformer / Evo-1 / DNABERT-2) once the tabular ceiling is established.
+5. **Active learning loop**: select novel-family strains where the model is uncertain, prioritize these for wet-lab cultivation testing.
@@ -0,0 +1,7 @@
+[2026-04-27T02:15:42Z] >>> train
+[2026-04-27T02:16:34Z] <<< train OK
+[2026-04-27T02:16:34Z] >>> eval
+[2026-04-27T02:16:36Z] <<< eval OK
+[2026-04-27T02:16:36Z] >>> summary
+[2026-04-27T02:16:37Z] <<< summary OK
+[2026-04-27T02:16:37Z] all phases attempted
@@ -0,0 +1 @@
+Wrote /Users/miyuhoriuchi/microbe-model/OVERNIGHT_SUMMARY.md
@@ -0,0 +1,13 @@
+Training table: 17,065 strains × 33 features
+Distinct groups: 605
+Group sizes (top 10): {'Streptomycetaceae': 798, 'Bacillaceae': 643, 'Flavobacteriaceae': 631, 'Lactobacillaceae': 472, 'Enterobacteriaceae': 439, 'Microbacteriaceae': 397, 'Pseudomonadaceae': 388, 'Roseobacteraceae': 341, 'Paenibacillaceae': 319, 'Pseudonocardiaceae': 307}
+
+Wrote training table to /Users/miyuhoriuchi/microbe-model/data/training_table.parquet
+Wrote per-strain predictions to /Users/miyuhoriuchi/microbe-model/artifacts/predictions.parquet
+
+Results summary (49.6s):
+
+optimal_temperature_c mae = 3.1726 (n_folds=5)
+optimal_ph mae = 0.5397 (n_folds=5)
+oxygen_requirement f1_macro = 0.2830 (n_folds=5)
+salt_tolerance_pct mae = 2.5228 (n_folds=5)