Miyu Horiuchi Claude Opus 4.7 (1M context) committed on
Commit 2ae21d7 · 1 Parent(s): de9e822

Final overnight commit: trained baseline + eval report + summary


Headline results (5-fold GroupKFold by family, 17,065 strains):

| Target | Task | n labeled | Model | Baseline | Improvement |
|-----------------------|---------------|-----------|-----------|-----------|-------------|
| optimal_temperature_c | regression | 17,025 | MAE=3.17 | MAE=5.53 | **+43%** |
| optimal_ph | regression | 4,654 | MAE=0.54 | MAE=0.55 | +1% |
| oxygen_requirement | classification | 10,434 | F1=0.28 | F1=0.07 | **+294%** |
| salt_tolerance_pct | regression | 4,800 | MAE=2.52 | MAE=2.72 | +7% |
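
The Improvement column can be reproduced from the Model and Baseline columns. The sketch below assumes the usual convention: relative error reduction for MAE (lower is better) and relative score gain for F1 (higher is better). The function name `improvement_pct` is hypothetical, and the inputs are the rounded figures from `artifacts/eval_report.md`, so the results can differ from the reported percentages by about a point.

```python
def improvement_pct(metric: str, model: float, baseline: float) -> float:
    """Relative improvement over the baseline, in percent.

    Assumed convention (not confirmed by the commit):
    - "mae": lower is better, so report the relative error reduction.
    - "f1":  higher is better, so report the relative score increase.
    """
    if metric == "mae":
        return 100.0 * (baseline - model) / baseline
    if metric == "f1":
        return 100.0 * (model - baseline) / baseline
    raise ValueError(f"unknown metric: {metric}")


# Rounded values copied from the eval report table.
rows = [
    ("optimal_temperature_c", "mae", 3.173, 5.528),
    ("optimal_ph", "mae", 0.540, 0.546),
    ("oxygen_requirement", "f1", 0.283, 0.072),
    ("salt_tolerance_pct", "mae", 2.523, 2.720),
]

for target, metric, model, baseline in rows:
    print(f"{target:24s} {improvement_pct(metric, model, baseline):+.1f}%")
```

The large F1 percentage mostly reflects how low the majority-class baseline is; the absolute macro-F1 of 0.28 is the number to watch.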

Biology sanity check: ivywrel_frac (the IVYWREL thermophile signature from
Zeldovich 2007) is the dominant feature for T_opt prediction (importance 0.50).
The Spearman analysis also surfaces moderate, highly significant correlations:
ivywrel_frac +0.32 with T_opt, neg_charged_frac +0.31 with optimal_ph, etc.
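
For readers unfamiliar with the statistic, here is a minimal standard-library sketch of Spearman's rank correlation: rank both variables (ties share the average rank), then take the Pearson correlation of the ranks. The helper names and the toy values are made up for illustration; the repository's own analysis code is not shown in this commit.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data, invented for this example (not taken from the training table).
ivywrel_frac = [0.36, 0.38, 0.40, 0.43, 0.45]
t_opt = [25.0, 30.0, 28.0, 55.0, 70.0]
print(spearman(ivywrel_frac, t_opt))
```

Because it works on ranks, the statistic captures any monotonic trend and is not thrown off by the handful of extreme thermophiles in the tail of the T_opt distribution.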

Per-family breakdown reveals where the model struggles:
- Easy: Streptomycetaceae (T_opt MAE 1.45), Sphingomonadaceae (1.85)
- Hard: Clostridiaceae (T_opt MAE 4.56), Bacillaceae (4.09)
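
A per-family breakdown like this is a group-by over absolute errors. The sketch below shows the idea with the standard library only; `mae_by_group` and the toy rows are hypothetical, and the real pipeline presumably reads `artifacts/predictions.parquet` instead.

```python
from collections import defaultdict


def mae_by_group(rows):
    """rows: iterable of (group, y_true, y_pred). Returns {group: (n, mae)}."""
    abs_errors = defaultdict(list)
    for group, y_true, y_pred in rows:
        abs_errors[group].append(abs(y_true - y_pred))
    return {g: (len(errs), sum(errs) / len(errs)) for g, errs in abs_errors.items()}


# Invented rows: (family, true T_opt, predicted T_opt).
toy_rows = [
    ("Streptomycetaceae", 28.0, 29.0),
    ("Streptomycetaceae", 30.0, 28.0),
    ("Clostridiaceae", 37.0, 32.0),
    ("Clostridiaceae", 60.0, 56.0),
]

per_family = mae_by_group(toy_rows)
# Sort from easiest to hardest family.
for family, (n, mae) in sorted(per_family.items(), key=lambda kv: kv[1][1]):
    print(f"{family:20s} n={n} MAE={mae:.2f}")
```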

Pipeline timing:
- BacDive scan: 10 min for 100,866 strains
- Streaming featurize: 270 min (~4.5 hr) for 17,094 genomes (17,065 succeeded, 99.83%)
- Train + eval + summary: ~50 sec end-to-end
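
As a quick check on the featurize numbers, the arithmetic below uses the processed and successful counts from `OVERNIGHT_SUMMARY.md` (17,094 and 17,065) and the 270 min wall-clock figure above. The per-genome time is a wall-clock average only; whether featurization ran in parallel is not stated here.

```python
processed = 17_094          # genomes attempted (from OVERNIGHT_SUMMARY.md)
successful = 17_065         # genomes featurized without error
featurize_minutes = 270     # wall-clock time reported in this commit message

success_rate = successful / processed
genomes_per_minute = processed / featurize_minutes
seconds_per_genome = featurize_minutes * 60 / processed

print(f"success rate:   {100 * success_rate:.2f}%")
print(f"throughput:     {genomes_per_minute:.1f} genomes/min")
print(f"wall-clock avg: {seconds_per_genome:.2f} s/genome")
```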

Now committing the human-readable artifacts:
- artifacts/eval_report.md (the main result)
- artifacts/baseline_results.json (machine-readable per-fold scores)
- artifacts/run.log + per-phase logs (audit trail)
- OVERNIGHT_SUMMARY.md (top-level entry point; already tracked)

Updated .gitignore to allow these specific files inside artifacts/ while still
excluding everything else in that directory (including the parquet outputs).
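
The .gitignore change relies on git's last-match-wins rule: `artifacts/*` ignores everything directly inside the directory, and later `!` patterns re-include individual files. The leading `!artifacts/` presumably undoes an earlier directory-level ignore outside this hunk, since git cannot re-include a file whose parent directory is excluded. The sketch below emulates only the last-match-wins part with `fnmatch`. It is a deliberate simplification, not git's matcher: it ignores directory-only rules, anchoring, and `**`, and unlike git its `*` also matches `/`.

```python
from fnmatch import fnmatchcase

# Rules copied from the hunk added in this commit.
RULES = [
    "!artifacts/",
    "artifacts/*",
    "!artifacts/eval_report.md",
    "!artifacts/baseline_results.json",
    "!artifacts/run.log",
    "!artifacts/train.log",
    "!artifacts/eval.log",
    "!artifacts/summary.log",
]


def is_ignored(path, rules=RULES):
    """Simplified emulation: the last rule that matches the path decides."""
    ignored = False
    for rule in rules:
        negated = rule.startswith("!")
        pattern = rule[1:] if negated else rule
        if fnmatchcase(path, pattern):
            ignored = not negated
    return ignored


for candidate in ("artifacts/eval_report.md", "artifacts/predictions.parquet"):
    print(candidate, "ignored" if is_ignored(candidate) else "tracked")
```

For the real answer on a given path, `git check-ignore -v <path>` reports which rule applies.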

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

.gitignore CHANGED
@@ -24,6 +24,16 @@ venv/
  *.gbff
  *.gbff.gz

+ # But keep small, human-readable artifacts in version control
+ !artifacts/
+ artifacts/*
+ !artifacts/eval_report.md
+ !artifacts/baseline_results.json
+ !artifacts/run.log
+ !artifacts/train.log
+ !artifacts/eval.log
+ !artifacts/summary.log
+
  # Notebooks
  .ipynb_checkpoints/
  notebooks/scratch/
OVERNIGHT_SUMMARY.md CHANGED
@@ -1,6 +1,6 @@
  # Overnight run - summary

- _Written 2026-04-27T01:45+00:00_
+ _Written 2026-04-27T02:16+00:00_

  ## Pipeline status

@@ -8,28 +8,29 @@ _Written 2026-04-27T01:45+00:00_
  - 19,637 have genome accessions
  - 50,384 have optimal_temperature_c labels
  - **17,054** strains are training-ready (genome + T_opt)
- - 🟡 Featurize: in progress (84%)
-   - Processed: 14,428 / 17,094
-   - Successful: 14,402 (99.8%)
-   - Failed: 26 (mostly suppressed/withdrawn NCBI assemblies)
- - ⏭ Training: not yet run (waits for featurize completion)
- - ⏭ Eval report: not yet generated
+ - ✅ Featurize: complete
+   - Processed: 17,094 / 17,094
+   - Successful: 17,065 (99.8%)
+   - Failed: 29 (mostly suppressed/withdrawn NCBI assemblies)
+ - ✅ Training: see `artifacts/train.log` for stdout
+ - ✅ Eval report: **`artifacts/eval_report.md`**

  ## What to read first

- 1. Wait for `artifacts/eval_report.md` to be generated, then open it.
+ 1. Open **`artifacts/eval_report.md`** - headline metrics + per-target detail.
  2. Check `git log --oneline` to see the commit timeline.

  ## Files of interest

- - - `artifacts/eval_report.md` - headline result + metrics
- - - `artifacts/baseline_results.json` - machine-readable per-fold scores
+ - ✅ `artifacts/eval_report.md` - headline result + metrics 0.0 MB
+ - ✅ `artifacts/baseline_results.json` - machine-readable per-fold scores 0.0 MB
  - ✅ `data/bacdive_phenotypes.parquet` - phenotype labels (gitignored) 1.7 MB
- - - `data/features.parquet` - extracted genome features (gitignored)
- - - `data/training_table.parquet` - merged + group-keyed table used for training (gitignored)
+ - ✅ `data/features.parquet` - extracted genome features (gitignored) 5.4 MB
+ - ✅ `data/training_table.parquet` - merged + group-keyed table used for training (gitignored) 5.8 MB

  ## Commits since yesterday

+ - 72e12e7 Make OVERNIGHT_SUMMARY.md write atomic (avoid race with regen loop)
  - 2ea77d1 Add v1 composition features (tetranucleotides + codon usage)
  - 316196d Fix predictions parquet type mix + plumb feature_cols through eval
  - 7db9544 Add tests for explore module (correlations + class means)
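
Commit 72e12e7 in the list above mentions making the summary write atomic to avoid a race with the regeneration loop. That commit's code is not shown here, so the following is only a sketch of the common pattern: write to a temporary file in the same directory, flush it to disk, then swap it into place with `os.replace`, so a concurrent reader sees either the old file or the new one and never a half-written one. The function name `write_atomic` is hypothetical.

```python
import os
import tempfile


def write_atomic(path, text):
    """Write text to path so readers never observe a partially written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".md")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # atomic swap on POSIX and Windows
    except BaseException:
        # Leave no stray temp file behind if anything failed before the swap.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "OVERNIGHT_SUMMARY.md")
        write_atomic(target, "# Overnight run - summary\n")
        print(open(target, encoding="utf-8").read())
```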
artifacts/baseline_results.json ADDED
@@ -0,0 +1,311 @@
+ {
+   "optimal_temperature_c": {
+     "task": "regression",
+     "mean_metric": 3.1725860578562193,
+     "folds": [
+       {
+         "target": "optimal_temperature_c",
+         "task": "regression",
+         "metric_name": "mae",
+         "value": 2.8455384323075306,
+         "n_train": 13620,
+         "n_test": 3405
+       },
+       {
+         "target": "optimal_temperature_c",
+         "task": "regression",
+         "metric_name": "mae",
+         "value": 3.4572334195443712,
+         "n_train": 13620,
+         "n_test": 3405
+       },
+       {
+         "target": "optimal_temperature_c",
+         "task": "regression",
+         "metric_name": "mae",
+         "value": 3.4555605416991115,
+         "n_train": 13620,
+         "n_test": 3405
+       },
+       {
+         "target": "optimal_temperature_c",
+         "task": "regression",
+         "metric_name": "mae",
+         "value": 2.7375921792185776,
+         "n_train": 13620,
+         "n_test": 3405
+       },
+       {
+         "target": "optimal_temperature_c",
+         "task": "regression",
+         "metric_name": "mae",
+         "value": 3.3670057165115064,
+         "n_train": 13620,
+         "n_test": 3405
+       }
+     ],
+     "top_features": {
+       "ivywrel_frac": 0.4960219383239746,
+       "n_predicted_cds": 0.05391158536076546,
+       "pos_charged_frac": 0.039320334792137146,
+       "aa_frac_P": 0.02822224497795105,
+       "aa_frac_C": 0.026947063207626343,
+       "aa_frac_Y": 0.0244345061480999,
+       "aa_frac_S": 0.022366111725568773,
+       "aa_frac_E": 0.021237003803253173,
+       "mean_isoelectric_point": 0.02086108475923538,
+       "genome_size_nt": 0.019465378299355508,
+       "aa_frac_L": 0.019293070212006568,
+       "aa_frac_V": 0.016112500242888926,
+       "neg_charged_frac": 0.01534007117152214,
+       "aa_frac_D": 0.015074614621698856,
+       "aa_frac_R": 0.01491919681429863,
+       "aa_frac_Q": 0.014176080748438836,
+       "aa_frac_I": 0.013733518309891224,
+       "mean_hydrophobicity": 0.013560011424124241,
+       "aa_frac_T": 0.012679264973849058,
+       "aa_frac_W": 0.011554591916501522
+     }
+   },
+   "optimal_ph": {
+     "task": "regression",
+     "mean_metric": 0.5397326897468703,
+     "folds": [
+       {
+         "target": "optimal_ph",
+         "task": "regression",
+         "metric_name": "mae",
+         "value": 0.4874841267004945,
+         "n_train": 3723,
+         "n_test": 931
+       },
+       {
+         "target": "optimal_ph",
+         "task": "regression",
+         "metric_name": "mae",
+         "value": 0.5713520769145897,
+         "n_train": 3723,
+         "n_test": 931
+       },
+       {
+         "target": "optimal_ph",
+         "task": "regression",
+         "metric_name": "mae",
+         "value": 0.5535044743039043,
+         "n_train": 3723,
+         "n_test": 931
+       },
+       {
+         "target": "optimal_ph",
+         "task": "regression",
+         "metric_name": "mae",
+         "value": 0.5538595587200816,
+         "n_train": 3723,
+         "n_test": 931
+       },
+       {
+         "target": "optimal_ph",
+         "task": "regression",
+         "metric_name": "mae",
+         "value": 0.5324632120952811,
+         "n_train": 3724,
+         "n_test": 930
+       }
+     ],
+     "top_features": {
+       "neg_charged_frac": 0.11188828796148301,
+       "aa_frac_H": 0.06443588957190513,
+       "ivywrel_frac": 0.04711368381977081,
+       "aa_frac_Q": 0.0425002209842205,
+       "aa_frac_E": 0.04013040736317634,
+       "n_predicted_cds": 0.0346000149846077,
+       "mean_hydrophobicity": 0.033536428958177565,
+       "aa_frac_L": 0.03228112906217575,
+       "aa_frac_C": 0.03177782744169235,
+       "aa_frac_V": 0.031084126606583596,
+       "mean_isoelectric_point": 0.03050720915198326,
+       "aa_frac_S": 0.02965335138142109,
+       "aa_frac_Y": 0.02864424102008343,
+       "aa_frac_W": 0.028111560642719267,
+       "aa_frac_P": 0.027575015649199487,
+       "aa_frac_I": 0.02686116211116314,
+       "pos_charged_frac": 0.02524181194603443,
+       "aa_frac_R": 0.024946783483028413,
+       "aromatic_frac": 0.024847164377570154,
+       "aa_frac_D": 0.024307455867528915
+     }
+   },
+   "oxygen_requirement": {
+     "task": "classification",
+     "mean_metric": 0.28303272250198364,
+     "folds": [
+       {
+         "target": "oxygen_requirement",
+         "task": "classification",
+         "metric_name": "f1_macro",
+         "value": 0.2740537993495899,
+         "n_train": 8347,
+         "n_test": 2086
+       },
+       {
+         "target": "oxygen_requirement",
+         "task": "classification",
+         "metric_name": "f1_macro",
+         "value": 0.27987706315870853,
+         "n_train": 8347,
+         "n_test": 2087
+       },
+       {
+         "target": "oxygen_requirement",
+         "task": "classification",
+         "metric_name": "f1_macro",
+         "value": 0.2906654309479178,
+         "n_train": 8347,
+         "n_test": 2087
+       },
+       {
+         "target": "oxygen_requirement",
+         "task": "classification",
+         "metric_name": "f1_macro",
+         "value": 0.2668782460426221,
+         "n_train": 8347,
+         "n_test": 2087
+       },
+       {
+         "target": "oxygen_requirement",
+         "task": "classification",
+         "metric_name": "f1_macro",
+         "value": 0.3036890730110796,
+         "n_train": 8348,
+         "n_test": 2086
+       }
+     ],
+     "top_features": {
+       "aa_frac_C": 0.10454434156417847,
+       "genome_size_nt": 0.0889880508184433,
+       "n_predicted_cds": 0.061901997029781344,
+       "aa_frac_Q": 0.054254394769668576,
+       "aa_frac_K": 0.037319378927350044,
+       "aa_frac_W": 0.03615027591586113,
+       "aa_frac_M": 0.03541129007935524,
+       "aa_frac_H": 0.03520122580230236,
+       "aa_frac_G": 0.03354446776211262,
+       "ivywrel_frac": 0.0306744959205389,
+       "aa_frac_Y": 0.03018171414732933,
+       "aa_frac_L": 0.029987764731049537,
+       "pos_charged_frac": 0.029102231562137603,
+       "aa_frac_V": 0.02888880521059036,
+       "aa_frac_N": 0.02404877059161663,
+       "aa_frac_F": 0.023649439588189124,
+       "aa_frac_E": 0.023098185285925867,
+       "aa_frac_P": 0.022718899697065354,
+       "aa_frac_R": 0.021204233914613724,
+       "aa_frac_I": 0.02029264494776726
+     }
+   },
+   "salt_tolerance_pct": {
+     "task": "regression",
+     "mean_metric": 2.522809721716851,
+     "folds": [
+       {
+         "target": "salt_tolerance_pct",
+         "task": "regression",
+         "metric_name": "mae",
+         "value": 2.182882065629829,
+         "n_train": 3840,
+         "n_test": 960
+       },
+       {
+         "target": "salt_tolerance_pct",
+         "task": "regression",
+         "metric_name": "mae",
+         "value": 2.5901312468783932,
+         "n_train": 3840,
+         "n_test": 960
+       },
+       {
+         "target": "salt_tolerance_pct",
+         "task": "regression",
+         "metric_name": "mae",
+         "value": 3.1095002870118864,
+         "n_train": 3840,
+         "n_test": 960
+       },
+       {
+         "target": "salt_tolerance_pct",
+         "task": "regression",
+         "metric_name": "mae",
+         "value": 2.4400568475718374,
+         "n_train": 3840,
+         "n_test": 960
+       },
+       {
+         "target": "salt_tolerance_pct",
+         "task": "regression",
+         "metric_name": "mae",
+         "value": 2.291478161492307,
+         "n_train": 3840,
+         "n_test": 960
+       }
+     ],
+     "top_features": {
+       "aa_frac_C": 0.15850864201784134,
+       "neg_charged_frac": 0.14477943778038024,
+       "aa_frac_T": 0.03598657064139843,
+       "mean_isoelectric_point": 0.030603666976094245,
+       "aa_frac_H": 0.030093548446893693,
+       "aa_frac_D": 0.02875022515654564,
+       "aa_frac_L": 0.02756672874093056,
+       "aa_frac_K": 0.026532869413495065,
+       "aa_frac_W": 0.02639944478869438,
+       "ivywrel_frac": 0.026095620542764663,
+       "aa_frac_I": 0.025597410276532173,
+       "aa_frac_S": 0.0251168642193079,
+       "aa_frac_M": 0.02414705790579319,
+       "aa_frac_Y": 0.02411585859954357,
+       "aa_frac_V": 0.022448396496474742,
+       "mean_hydrophobicity": 0.022069810889661312,
+       "mean_cds_aa_length": 0.02146391086280346,
+       "aromatic_frac": 0.02145920880138874,
+       "aa_frac_P": 0.02134738564491272,
+       "n_predicted_cds": 0.020885448157787322
+     }
+   },
+   "__meta__": {
+     "feature_cols": [
+       "genome_size_nt",
+       "n_contigs",
+       "gc_content",
+       "n_predicted_cds",
+       "coding_density",
+       "mean_cds_aa_length",
+       "median_cds_aa_length",
+       "aromatic_frac",
+       "pos_charged_frac",
+       "neg_charged_frac",
+       "ivywrel_frac",
+       "mean_hydrophobicity",
+       "mean_isoelectric_point",
+       "aa_frac_A",
+       "aa_frac_C",
+       "aa_frac_D",
+       "aa_frac_E",
+       "aa_frac_F",
+       "aa_frac_G",
+       "aa_frac_H",
+       "aa_frac_I",
+       "aa_frac_K",
+       "aa_frac_L",
+       "aa_frac_M",
+       "aa_frac_N",
+       "aa_frac_P",
+       "aa_frac_Q",
+       "aa_frac_R",
+       "aa_frac_S",
+       "aa_frac_T",
+       "aa_frac_V",
+       "aa_frac_W",
+       "aa_frac_Y"
+     ]
+   }
+ }
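
Each target's `mean_metric` in `baseline_results.json` appears to be the unweighted mean of its per-fold `value` entries. A minimal sketch of that consistency check follows, using an inline excerpt with the `optimal_temperature_c` fold values copied from the file above (other keys trimmed). The helper name `fold_mean` is hypothetical; a real check would load the file with `json.load` instead.

```python
import json
from statistics import fmean

# Excerpt of artifacts/baseline_results.json (one target, fold values only).
EXCERPT = """
{
  "optimal_temperature_c": {
    "task": "regression",
    "mean_metric": 3.1725860578562193,
    "folds": [
      {"metric_name": "mae", "value": 2.8455384323075306},
      {"metric_name": "mae", "value": 3.4572334195443712},
      {"metric_name": "mae", "value": 3.4555605416991115},
      {"metric_name": "mae", "value": 2.7375921792185776},
      {"metric_name": "mae", "value": 3.3670057165115064}
    ]
  }
}
"""


def fold_mean(entry):
    """Unweighted mean of the per-fold metric values for one target."""
    return fmean([fold["value"] for fold in entry["folds"]])


results = json.loads(EXCERPT)
for target, entry in results.items():
    if target == "__meta__":  # the real file also carries a metadata block
        continue
    gap = abs(fold_mean(entry) - entry["mean_metric"])
    print(f"{target}: fold mean {fold_mean(entry):.4f}, gap {gap:.2e}")
```

The fold-to-fold spread (2.74 to 3.46 for T_opt) is worth reading alongside the mean, since each fold holds out different families.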
artifacts/eval.log ADDED
@@ -0,0 +1 @@
+ Wrote /Users/miyuhoriuchi/microbe-model/artifacts/eval_report.md
artifacts/eval_report.md ADDED
@@ -0,0 +1,270 @@
+ # microbe-model - v0 baseline eval report
+
+ _Generated: 2026-04-27T02:16:35+00:00_
+
+ ## TL;DR
+
+ - **`optimal_temperature_c`**: MAE = **3.17** (vs always-predict-mean 5.53, **+43%**)
+ - **`optimal_ph`**: MAE = **0.54** (vs always-predict-mean 0.55, **+1%**)
+ - **`oxygen_requirement`**: macro-F1 = **0.283** (vs always-predict-majority 0.072, **+294%**)
+ - **`salt_tolerance_pct`**: MAE = **2.52** (vs always-predict-mean 2.72, **+7%**)
+
+ Trained on **17,065** strains with **33** genome-derived features. Cross-validation: 5-fold GroupKFold by taxonomic family.
+
+ ## Corpus
+
+ - Total strains in feature table: **17,065**
+ - Labeled-strain counts by target:
+   - `optimal_temperature_c`: 17,025
+   - `optimal_ph`: 4,654
+   - `oxygen_requirement`: 10,434
+   - `salt_tolerance_pct`: 4,800
+
+ ## Target distributions
+
+ - `optimal_temperature_c`: n=17,025, mean=31.96, std=8.57, p10=25.00, median=30.00, p90=37.00
+ - `optimal_ph`: n=4,654, mean=7.19, std=0.83, p10=6.50, median=7.00, p90=8.00
+ - `salt_tolerance_pct`: n=4,800, mean=3.55, std=4.11, p10=0.00, median=2.50, p90=8.00
+ - `oxygen_requirement`:
+   - `aerobe`: 4,978
+   - `anaerobe`: 2,120
+   - `facultative anaerobe`: 1,227
+   - `obligate aerobe`: 1,029
+   - `microaerophile`: 889
+   - `obligate anaerobe`: 105
+   - `facultative aerobe`: 83
+   - `microaerotolerant`: 2
+   - `aerotolerant`: 1
+
+ ## Per-target results (5-fold GroupKFold by family)
+
+ Metrics: regression = MAE (lower is better), classification = macro-F1 (higher is better).
+ Each is shown alongside the dumb-baseline (always-predict-mean / always-predict-majority).
+
+ | Target | Task | n labeled | Model metric | Baseline | Improvement |
+ |---|---|---|---|---|---|
+ | `optimal_temperature_c` | regression | 17,025 | MAE=3.173 | MAE=5.528 | +42.6% |
+ | `optimal_ph` | regression | 4,654 | MAE=0.540 | MAE=0.546 | +1.2% |
+ | `oxygen_requirement` | classification | 10,434 | F1=0.283 | F1=0.072 | +294.3% |
+ | `salt_tolerance_pct` | regression | 4,800 | MAE=2.523 | MAE=2.720 | +7.3% |
+
+ ### `optimal_temperature_c` - fold-by-fold
+
+ | Fold | Metric | Train | Test |
+ |---|---|---|---|
+ | 1 | mae = 2.846 | n=13,620 | n=3,405 |
+ | 2 | mae = 3.457 | n=13,620 | n=3,405 |
+ | 3 | mae = 3.456 | n=13,620 | n=3,405 |
+ | 4 | mae = 2.738 | n=13,620 | n=3,405 |
+ | 5 | mae = 3.367 | n=13,620 | n=3,405 |
+
+ **Top 10 features for `optimal_temperature_c`:**
+
+ - `ivywrel_frac` - 0.4960
+ - `n_predicted_cds` - 0.0539
+ - `pos_charged_frac` - 0.0393
+ - `aa_frac_P` - 0.0282
+ - `aa_frac_C` - 0.0269
+ - `aa_frac_Y` - 0.0244
+ - `aa_frac_S` - 0.0224
+ - `aa_frac_E` - 0.0212
+ - `mean_isoelectric_point` - 0.0209
+ - `genome_size_nt` - 0.0195
+
+ ### `optimal_ph` - fold-by-fold
+
+ | Fold | Metric | Train | Test |
+ |---|---|---|---|
+ | 1 | mae = 0.487 | n=3,723 | n=931 |
+ | 2 | mae = 0.571 | n=3,723 | n=931 |
+ | 3 | mae = 0.554 | n=3,723 | n=931 |
+ | 4 | mae = 0.554 | n=3,723 | n=931 |
+ | 5 | mae = 0.532 | n=3,724 | n=930 |
+
+ **Top 10 features for `optimal_ph`:**
+
+ - `neg_charged_frac` - 0.1119
+ - `aa_frac_H` - 0.0644
+ - `ivywrel_frac` - 0.0471
+ - `aa_frac_Q` - 0.0425
+ - `aa_frac_E` - 0.0401
+ - `n_predicted_cds` - 0.0346
+ - `mean_hydrophobicity` - 0.0335
+ - `aa_frac_L` - 0.0323
+ - `aa_frac_C` - 0.0318
+ - `aa_frac_V` - 0.0311
+
+ ### `oxygen_requirement` - fold-by-fold
+
+ | Fold | Metric | Train | Test |
+ |---|---|---|---|
+ | 1 | f1_macro = 0.274 | n=8,347 | n=2,086 |
+ | 2 | f1_macro = 0.280 | n=8,347 | n=2,087 |
+ | 3 | f1_macro = 0.291 | n=8,347 | n=2,087 |
+ | 4 | f1_macro = 0.267 | n=8,347 | n=2,087 |
+ | 5 | f1_macro = 0.304 | n=8,348 | n=2,086 |
+
+ **Top 10 features for `oxygen_requirement`:**
+
+ - `aa_frac_C` - 0.1045
+ - `genome_size_nt` - 0.0890
+ - `n_predicted_cds` - 0.0619
+ - `aa_frac_Q` - 0.0543
+ - `aa_frac_K` - 0.0373
+ - `aa_frac_W` - 0.0362
+ - `aa_frac_M` - 0.0354
+ - `aa_frac_H` - 0.0352
+ - `aa_frac_G` - 0.0335
+ - `ivywrel_frac` - 0.0307
+
+ ### `salt_tolerance_pct` - fold-by-fold
+
+ | Fold | Metric | Train | Test |
+ |---|---|---|---|
+ | 1 | mae = 2.183 | n=3,840 | n=960 |
+ | 2 | mae = 2.590 | n=3,840 | n=960 |
+ | 3 | mae = 3.110 | n=3,840 | n=960 |
+ | 4 | mae = 2.440 | n=3,840 | n=960 |
+ | 5 | mae = 2.291 | n=3,840 | n=960 |
+
+ **Top 10 features for `salt_tolerance_pct`:**
+
+ - `aa_frac_C` - 0.1585
+ - `neg_charged_frac` - 0.1448
+ - `aa_frac_T` - 0.0360
+ - `mean_isoelectric_point` - 0.0306
+ - `aa_frac_H` - 0.0301
+ - `aa_frac_D` - 0.0288
+ - `aa_frac_L` - 0.0276
+ - `aa_frac_K` - 0.0265
+ - `aa_frac_W` - 0.0264
+ - `ivywrel_frac` - 0.0261
+
+ ## Feature ↔ target correlations (Spearman, top 10)
+
+ Sanity-checks the biology: features known to track each target should appear here at high |ρ|. E.g. `ivywrel_frac` should correlate with `optimal_temperature_c` (Zeldovich 2007 thermophile signature).
+
+ ### `optimal_temperature_c`
+
+ | Feature | Spearman ρ | p-value |
+ |---|---|---|
+ | `genome_size_nt` | -0.405 | 0.0e+00 |
+ | `aa_frac_E` | +0.388 | 0.0e+00 |
+ | `n_predicted_cds` | -0.386 | 0.0e+00 |
+ | `ivywrel_frac` | +0.320 | 0.0e+00 |
+ | `aa_frac_Y` | +0.318 | 0.0e+00 |
+ | `aa_frac_W` | -0.309 | 0.0e+00 |
+ | `pos_charged_frac` | +0.299 | 0.0e+00 |
+ | `aa_frac_A` | -0.295 | 0.0e+00 |
+ | `neg_charged_frac` | +0.293 | 0.0e+00 |
+ | `aa_frac_P` | -0.277 | 1.0e-297 |
+
+ ### `optimal_ph`
+
+ | Feature | Spearman ρ | p-value |
+ |---|---|---|
+ | `neg_charged_frac` | +0.308 | 4.2e-103 |
+ | `mean_isoelectric_point` | -0.276 | 3.8e-82 |
+ | `aa_frac_E` | +0.260 | 6.2e-73 |
+ | `ivywrel_frac` | +0.166 | 4.0e-30 |
+ | `aa_frac_D` | +0.111 | 2.6e-14 |
+ | `mean_hydrophobicity` | -0.109 | 7.3e-14 |
+ | `aa_frac_C` | -0.095 | 7.6e-11 |
+ | `pos_charged_frac` | -0.086 | 3.5e-09 |
+ | `aa_frac_P` | -0.081 | 3.3e-08 |
+ | `aa_frac_A` | -0.078 | 9.6e-08 |
+
+ ### `salt_tolerance_pct`
+
+ | Feature | Spearman ρ | p-value |
+ |---|---|---|
+ | `neg_charged_frac` | +0.227 | 3.2e-57 |
+ | `mean_isoelectric_point` | -0.204 | 3.5e-46 |
+ | `ivywrel_frac` | +0.195 | 1.7e-42 |
+ | `aa_frac_C` | -0.186 | 1.1e-38 |
+ | `mean_cds_aa_length` | -0.161 | 3.8e-29 |
+ | `aa_frac_D` | +0.159 | 1.4e-28 |
+ | `aa_frac_E` | +0.143 | 2.4e-23 |
+ | `aa_frac_V` | +0.112 | 8.3e-15 |
+ | `aa_frac_T` | +0.104 | 4.4e-13 |
+ | `coding_density` | -0.090 | 4.1e-10 |
+
+ ## Per-family error breakdown (regression targets)
+
+ Top 15 most-represented families, MAE per family. Highlights where the model is doing well vs. struggling.
+
+ ### `optimal_temperature_c`
+
+ | Family | n | MAE |
+ |---|---|---|
+ | Streptomycetaceae | 798 | 1.451 |
+ | Bacillaceae | 643 | 4.086 |
+ | Flavobacteriaceae | 631 | 4.195 |
+ | Lactobacillaceae | 471 | 3.161 |
+ | Enterobacteriaceae | 439 | 3.967 |
+ | Microbacteriaceae | 396 | 2.457 |
+ | Pseudomonadaceae | 388 | 2.523 |
+ | Roseobacteraceae | 341 | 2.992 |
+ | Paenibacillaceae | 319 | 3.474 |
+ | Pseudonocardiaceae | 306 | 2.184 |
+ | Moraxellaceae | 269 | 2.723 |
+ | Sphingomonadaceae | 256 | 1.845 |
+ | Streptococcaceae | 251 | 3.063 |
+ | Clostridiaceae | 247 | 4.560 |
+ | Vibrionaceae | 239 | 3.290 |
+
+ ### `optimal_ph`
+
+ | Family | n | MAE |
+ |---|---|---|
+ | Flavobacteriaceae | 323 | 0.424 |
+ | Bacillaceae | 273 | 0.657 |
+ | Roseobacteraceae | 192 | 0.363 |
+ | Paenibacillaceae | 126 | 0.484 |
+ | Microbacteriaceae | 112 | 0.510 |
+ | Sphingobacteriaceae | 100 | 0.376 |
+ | Sphingomonadaceae | 96 | 0.365 |
+ | Streptomycetaceae | 92 | 0.742 |
+ | Pseudonocardiaceae | 85 | 0.547 |
+ | Halomonadaceae | 81 | 0.748 |
+ | Nocardioidaceae | 74 | 0.516 |
+ | Paracoccaceae | 71 | 0.563 |
+ | Micrococcaceae | 71 | 0.626 |
+ | Erythrobacteraceae | 68 | 0.442 |
+ | Alteromonadaceae | 68 | 0.375 |
+
+ ### `salt_tolerance_pct`
+
+ | Family | n | MAE |
+ |---|---|---|
+ | Flavobacteriaceae | 285 | 1.491 |
+ | Streptomycetaceae | 283 | 2.070 |
+ | Bacillaceae | 246 | 3.241 |
+ | Microbacteriaceae | 140 | 2.808 |
+ | Pseudonocardiaceae | 134 | 2.501 |
+ | Roseobacteraceae | 134 | 1.533 |
+ | Paenibacillaceae | 125 | 2.387 |
+ | Pseudomonadaceae | 110 | 3.870 |
+ | Vibrionaceae | 100 | 2.653 |
+ | Sphingomonadaceae | 92 | 2.029 |
+ | Micromonosporaceae | 90 | 1.724 |
+ | Micrococcaceae | 85 | 3.045 |
+ | Nocardiaceae | 84 | 2.613 |
+ | Streptococcaceae | 82 | 1.390 |
+ | Lactobacillaceae | 78 | 2.506 |
+
+ ## Known limitations
+
+ - **Survivorship bias.** BacDive only contains organisms that have been cultured successfully at least once. The model cannot generalize to truly uncultured strains without explicit out-of-distribution evaluation.
+ - **Optimum derivation is heuristic.** Most BacDive temperature entries are tagged as `growth` (positive growth at this temperature), not `optimum`. We approximate the optimum as the median of positive-growth temperatures when no explicit optimum is recorded; this can be off by 5°C or more for some strains.
+ - **Family grouping is naive.** The current `family` column is derived from the genus (first word of binomial name). A proper LPSN/GTDB family assignment would give tighter taxonomic grouping.
+ - **Feature set is shallow.** No HMM/KEGG annotations, no codon usage indices, no tRNA counts. These are interpretable next steps before moving to genome LMs.
+ - **Pyrodigal accuracy.** Gene prediction quality drops on highly-fragmented assemblies and atypical genetic codes. Not currently flagged in the feature set.
+
+ ## Next steps
+
+ 1. **Add tetranucleotide / codon-usage features.** ~50 extra columns, well-known signal for thermophily.
+ 2. **Replace naive family lookup with LPSN/GTDB join.** Reduces leakage in CV.
+ 3. **Integrate KOMODO media DB** as a richer label source than BacDive alone.
+ 4. **Move to genome embeddings** (Nucleotide Transformer / Evo-1 / DNABERT-2) once the tabular ceiling is established.
+ 5. **Active learning loop**: select novel-family strains where the model is uncertain, prioritize these for wet-lab cultivation testing.
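
The always-predict-majority baseline for `oxygen_requirement` (macro-F1 0.072) can be approximated from the class counts in the report's "Target distributions" section. The sketch below makes two assumptions not confirmed by the report: it pools all labeled strains instead of scoring per fold, and it counts a never-predicted class as F1 = 0. The function name `majority_macro_f1` is hypothetical.

```python
# Class counts copied from the eval report (they sum to 10,434 labeled strains).
CLASS_COUNTS = {
    "aerobe": 4978,
    "anaerobe": 2120,
    "facultative anaerobe": 1227,
    "obligate aerobe": 1029,
    "microaerophile": 889,
    "obligate anaerobe": 105,
    "facultative aerobe": 83,
    "microaerotolerant": 2,
    "aerotolerant": 1,
}


def majority_macro_f1(class_counts):
    """Macro-F1 of a classifier that always predicts the most frequent class."""
    total = sum(class_counts.values())
    majority = max(class_counts, key=class_counts.get)
    # Every sample is predicted as `majority`: recall is 1, precision is its share.
    precision = class_counts[majority] / total
    recall = 1.0
    f1_majority = 2 * precision * recall / (precision + recall)
    # All other classes are never predicted, so their F1 is taken as 0.
    return f1_majority / len(class_counts)


print(f"majority-class macro-F1 ~ {majority_macro_f1(CLASS_COUNTS):.3f}")
```

With nine classes and two of them holding three strains in total, macro-F1 is dominated by the rare labels, which helps explain why both the baseline and the model score low on this target.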
artifacts/run.log ADDED
@@ -0,0 +1,7 @@
+ [2026-04-27T02:15:42Z] >>> train
+ [2026-04-27T02:16:34Z] <<< train OK
+ [2026-04-27T02:16:34Z] >>> eval
+ [2026-04-27T02:16:36Z] <<< eval OK
+ [2026-04-27T02:16:36Z] >>> summary
+ [2026-04-27T02:16:37Z] <<< summary OK
+ [2026-04-27T02:16:37Z] all phases attempted
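
The `>>>` / `<<<` markers in `run.log` make per-phase durations easy to recover. A small parsing sketch follows, assuming the line format shown above stays stable; the function name `phase_durations` is hypothetical.

```python
from datetime import datetime

# Contents of artifacts/run.log as added in this commit.
RUN_LOG = """\
[2026-04-27T02:15:42Z] >>> train
[2026-04-27T02:16:34Z] <<< train OK
[2026-04-27T02:16:34Z] >>> eval
[2026-04-27T02:16:36Z] <<< eval OK
[2026-04-27T02:16:36Z] >>> summary
[2026-04-27T02:16:37Z] <<< summary OK
[2026-04-27T02:16:37Z] all phases attempted
"""


def phase_durations(log_text):
    """Map each phase name to the seconds between its >>> and <<< lines."""
    starts, durations = {}, {}
    for line in log_text.splitlines():
        stamp, _, message = line.partition("] ")
        parts = message.split()
        if len(parts) < 2:
            continue
        when = datetime.strptime(stamp.lstrip("["), "%Y-%m-%dT%H:%M:%SZ")
        marker, phase = parts[0], parts[1]
        if marker == ">>>":
            starts[phase] = when
        elif marker == "<<<" and phase in starts:
            durations[phase] = (when - starts[phase]).total_seconds()
    return durations


for phase, seconds in phase_durations(RUN_LOG).items():
    print(f"{phase:8s} {seconds:.0f} s")
```

The three phases add up to 55 s of wall-clock time, in line with the "~50 sec end-to-end" figure in the commit message and the 49.6 s that `train.log` reports for training itself.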
artifacts/summary.log ADDED
@@ -0,0 +1 @@
+ Wrote /Users/miyuhoriuchi/microbe-model/OVERNIGHT_SUMMARY.md
artifacts/train.log ADDED
@@ -0,0 +1,13 @@
+ Training table: 17,065 strains × 33 features
+ Distinct groups: 605
+ Group sizes (top 10): {'Streptomycetaceae': 798, 'Bacillaceae': 643, 'Flavobacteriaceae': 631, 'Lactobacillaceae': 472, 'Enterobacteriaceae': 439, 'Microbacteriaceae': 397, 'Pseudomonadaceae': 388, 'Roseobacteraceae': 341, 'Paenibacillaceae': 319, 'Pseudonocardiaceae': 307}
+
+ Wrote training table to /Users/miyuhoriuchi/microbe-model/data/training_table.parquet
+ Wrote per-strain predictions to /Users/miyuhoriuchi/microbe-model/artifacts/predictions.parquet
+
+ Results summary (49.6s):
+
+ optimal_temperature_c mae = 3.1726 (n_folds=5)
+ optimal_ph mae = 0.5397 (n_folds=5)
+ oxygen_requirement f1_macro = 0.2830 (n_folds=5)
+ salt_tolerance_pct mae = 2.5228 (n_folds=5)
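
The 605 distinct groups reported in `train.log` are what the 5-fold GroupKFold splits on: every strain from a given family lands in the same fold, so no family appears in both train and test. The training code is not part of this commit, so the sketch below is only a standard-library stand-in for the idea (largest groups first, each assigned to the currently smallest fold). It is not scikit-learn's implementation, and `group_folds` and the toy labels are hypothetical.

```python
from collections import Counter


def group_folds(groups, n_splits=5):
    """Return one fold index per sample; all samples of a group share a fold.

    Greedy balancing: visit groups from largest to smallest and put each one
    into whichever fold currently holds the fewest samples.
    """
    sizes = Counter(groups)
    fold_sizes = [0] * n_splits
    fold_of_group = {}
    for group, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        target = fold_sizes.index(min(fold_sizes))
        fold_of_group[group] = target
        fold_sizes[target] += size
    return [fold_of_group[g] for g in groups]


# Toy family labels, invented for this example.
families = ["A"] * 4 + ["B"] * 3 + ["C"] * 2 + ["D"] * 2 + ["E"]
folds = group_folds(families, n_splits=2)

for fold in sorted(set(folds)):
    members = sorted({f for f, k in zip(families, folds) if k == fold})
    print(f"fold {fold}: families {members}, n={folds.count(fold)}")
```

Because the limitation noted in the eval report is that `family` is currently derived from the genus name, any grouping scheme like this is only as leak-free as that column.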