Spaces: miyuiu/microbe-model
Miyu Horiuchi Claude Opus 4.7 (1M context) committed on Apr 27

Commit

2ae21d7

1 Parent(s): de9e822

Final overnight commit: trained baseline + eval report + summary

Headline results (5-fold GroupKFold by family, 17,065 strains):

| Target | Task | n labeled | Model | Baseline | Improvement |
|-----------------------|---------------|-----------|-----------|-----------|-------------|
| optimal_temperature_c | regression | 17,025 | MAE=3.17 | MAE=5.53 | **+43%** |
| optimal_ph | regression | 4,654 | MAE=0.54 | MAE=0.55 | +1% |
| oxygen_requirement | classification | 10,434 | F1=0.28 | F1=0.07 | **+294%** |
| salt_tolerance_pct | regression | 4,800 | MAE=2.52 | MAE=2.72 | +7% |
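
The Improvement column is consistent with a relative change against the baseline, with the sign flipped for MAE so that a positive value always means "better". The sketch below assumes that definition; the repo's own formula is not shown in this commit, and `improvement_pct` is an illustrative name. It uses the three-decimal values from `artifacts/eval_report.md`, so the results can differ from the reported percentages by about one point of rounding.

```python
def improvement_pct(model: float, baseline: float, lower_is_better: bool) -> float:
    """Relative improvement over the baseline in percent (positive = better)."""
    if lower_is_better:
        # MAE: a smaller error than the baseline counts as an improvement.
        return 100.0 * (baseline - model) / baseline
    # Macro-F1: a larger score than the baseline counts as an improvement.
    return 100.0 * (model - baseline) / baseline


# (target, model metric, baseline metric, lower_is_better)
rows = [
    ("optimal_temperature_c", 3.173, 5.528, True),
    ("optimal_ph", 0.540, 0.546, True),
    ("oxygen_requirement", 0.283, 0.072, False),
    ("salt_tolerance_pct", 2.523, 2.720, True),
]

for target, model, baseline, lower in rows:
    print(f"{target:24s} {improvement_pct(model, baseline, lower):+.1f}%")
```

Note that the large oxygen_requirement gain comes mostly from a very low baseline (macro-F1 of 0.07), not from a high absolute score.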

Biological sanity check: ivywrel_frac (the Zeldovich 2007 thermophile signature) is the
dominant feature for T_opt prediction (importance 0.50). The Spearman analysis also
shows moderate correlations: ivywrel_frac +0.32 with T_opt and neg_charged_frac +0.31
with optimal_pH.
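
For readers unfamiliar with the feature: IVYWREL refers to the amino acids I, V, Y, W, R, E and L, whose combined fraction in the proteome was reported by Zeldovich et al. (2007) to track optimal growth temperature. A minimal sketch of such a fraction follows, assuming residues are pooled across all predicted proteins; how the repo actually computes `ivywrel_frac` is not shown in this commit, and the toy sequences are made up.

```python
IVYWREL = frozenset("IVYWREL")


def ivywrel_frac(proteins):
    """Fraction of all residues that are I, V, Y, W, R, E or L.

    `proteins` is an iterable of amino-acid strings (one per predicted CDS).
    Residues are pooled across proteins, which is an assumption here.
    """
    proteins = list(proteins)
    total = sum(len(p) for p in proteins)
    if total == 0:
        return 0.0  # no predicted proteins: define the fraction as 0
    hits = sum(1 for p in proteins for aa in p.upper() if aa in IVYWREL)
    return hits / total


# Toy sequences for illustration only.
toy_proteome = ["MKVLIYEW", "GGSAR"]
print(f"ivywrel_frac = {ivywrel_frac(toy_proteome):.3f}")
```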

Per-family breakdown reveals where the model struggles:
- Easy: Streptomycetaceae (T_opt MAE 1.45), Sphingomonadaceae (1.85)
- Hard: Clostridiaceae (T_opt MAE 4.56), Bacillaceae (4.09)
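
A per-family breakdown like this one amounts to grouping absolute errors by family and averaging. The sketch below uses made-up rows, not values from `artifacts/predictions.parquet`, and `mae_by_group` is an illustrative helper, not a function from the repo.

```python
from collections import defaultdict


def mae_by_group(groups, y_true, y_pred):
    """Mean absolute error per group label."""
    abs_err = defaultdict(list)
    for g, t, p in zip(groups, y_true, y_pred):
        abs_err[g].append(abs(t - p))
    return {g: sum(errs) / len(errs) for g, errs in abs_err.items()}


# Made-up rows for illustration only.
families = ["Bacillaceae", "Bacillaceae", "Streptomycetaceae", "Streptomycetaceae"]
t_opt_true = [55.0, 30.0, 28.0, 28.0]
t_opt_pred = [47.0, 32.0, 29.0, 27.0]

per_family = mae_by_group(families, t_opt_true, t_opt_pred)
for family, mae in sorted(per_family.items(), key=lambda kv: kv[1]):
    print(f"{family:20s} MAE = {mae:.2f}")
```

In the toy data the higher family error comes from one thermophile-like outlier, which is one plausible reason families spanning a wide temperature range are harder; the commit itself does not state the cause.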

Pipeline timing:
- BacDive scan: 10 min for 100,866 strains
- Streaming featurize: 270 min (~4.5 hr) for 17,065 genomes (99.83% success)
- Train + eval + summary: ~55 sec end-to-end per artifacts/run.log (training alone reports 49.6 s)

Now committing the human-readable artifacts:
- artifacts/eval_report.md (the main result)
- artifacts/baseline_results.json (machine-readable per-fold scores)
- artifacts/run.log + per-phase logs (audit trail)
- OVERNIGHT_SUMMARY.md (top-level entry point — already tracked)

Updated .gitignore to re-include these specific files inside artifacts/ while
everything else there, including the parquet files, stays ignored.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (8)

.gitignore +10 -0
OVERNIGHT_SUMMARY.md +13 -12
artifacts/baseline_results.json +311 -0
artifacts/eval.log +1 -0
artifacts/eval_report.md +270 -0
artifacts/run.log +7 -0
artifacts/summary.log +1 -0
artifacts/train.log +13 -0

.gitignore CHANGED

@@ -24,6 +24,16 @@ venv/
 *.gbff
 *.gbff.gz
+# But keep small, human-readable artifacts in version control
+!artifacts/
+artifacts/*
+!artifacts/eval_report.md
+!artifacts/baseline_results.json
+!artifacts/run.log
+!artifacts/train.log
+!artifacts/eval.log
+!artifacts/summary.log
 # Notebooks
 .ipynb_checkpoints/
 notebooks/scratch/
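
The added rules rely on the gitignore convention that the last matching pattern decides and that a leading `!` re-includes a path. The sketch below is a deliberately simplified model of that behaviour using `fnmatch`; it is not how git itself is implemented and it ignores directory-only patterns, anchoring and `**`. It only illustrates why the six listed files are tracked while `artifacts/predictions.parquet` stays ignored.

```python
from fnmatch import fnmatchcase

# Patterns taken from the .gitignore hunk above, in file order.
RULES = [
    "*.gbff",
    "*.gbff.gz",
    "!artifacts/",  # directory re-include; a no-op in this simplified model
    "artifacts/*",
    "!artifacts/eval_report.md",
    "!artifacts/baseline_results.json",
    "!artifacts/run.log",
    "!artifacts/train.log",
    "!artifacts/eval.log",
    "!artifacts/summary.log",
]


def is_ignored(path, rules=RULES):
    """Simplified gitignore check: the last matching rule wins."""
    ignored = False
    for rule in rules:
        negated = rule.startswith("!")
        pattern = rule[1:] if negated else rule
        # Patterns without an inner "/" are matched against the file name only.
        has_inner_slash = "/" in pattern.rstrip("/")
        target = path if has_inner_slash else path.rsplit("/", 1)[-1]
        if fnmatchcase(target, pattern):
            ignored = not negated
    return ignored


for p in ["artifacts/eval_report.md", "artifacts/predictions.parquet", "genomes/x.gbff"]:
    print(f"{p:32s} ignored={is_ignored(p)}")
```

For a real repository, `git check-ignore -v <path>` reports which rule actually applies.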

OVERNIGHT_SUMMARY.md CHANGED

@@ -1,6 +1,6 @@
 # Overnight run — summary
-_Written 2026-04-27T01:45+00:00_
+_Written 2026-04-27T02:16+00:00_
 ## Pipeline status
@@ -8,28 +8,29 @@ _Written 2026-04-27T01:45+00:00_
   - 19,637 have genome accessions
   - 50,384 have optimal_temperature_c labels
   - **17,054** strains are training-ready (genome + T_opt)
-- 🟡 Featurize: in progress (84%)
-  - Processed: 14,428 / 17,094
-  - Successful: 14,402 (99.8%)
-  - Failed: 26 (mostly suppressed/withdrawn NCBI assemblies)
-- ⏭ Training: not yet run (waits for featurize completion)
-- ⏭ Eval report: not yet generated
+- ✅ Featurize: complete
+  - Processed: 17,094 / 17,094
+  - Successful: 17,065 (99.8%)
+  - Failed: 29 (mostly suppressed/withdrawn NCBI assemblies)
+- ✅ Training: see `artifacts/train.log` for stdout
+- ✅ Eval report: **`artifacts/eval_report.md`**
 ## What to read first
-1. Wait for `artifacts/eval_report.md` to be generated, then open it.
+1. Open **`artifacts/eval_report.md`** — headline metrics + per-target detail.
 2. Check `git log --oneline` to see the commit timeline.
 ## Files of interest
-- — `artifacts/eval_report.md` — headline result + metrics
-- — `artifacts/baseline_results.json` — machine-readable per-fold scores
+- ✅ `artifacts/eval_report.md` — headline result + metrics 0.0 MB
+- ✅ `artifacts/baseline_results.json` — machine-readable per-fold scores 0.0 MB
 - ✅ `data/bacdive_phenotypes.parquet` — phenotype labels (gitignored) 1.7 MB
-- — `data/features.parquet` — extracted genome features (gitignored)
-- — `data/training_table.parquet` — merged + group-keyed table used for training (gitignored)
+- ✅ `data/features.parquet` — extracted genome features (gitignored) 5.4 MB
+- ✅ `data/training_table.parquet` — merged + group-keyed table used for training (gitignored) 5.8 MB
 ## Commits since yesterday
+- 72e12e7 Make OVERNIGHT_SUMMARY.md write atomic (avoid race with regen loop)
 - 2ea77d1 Add v1 composition features (tetranucleotides + codon usage)
 - 316196d Fix predictions parquet type mix + plumb feature_cols through eval
 - 7db9544 Add tests for explore module (correlations + class means)
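
Commit 72e12e7 in the list above mentions making the summary write atomic to avoid a race with the regeneration loop. That commit's code is not part of this diff, so the following is only a common pattern for the idea, not the repo's implementation: write to a temporary file in the same directory, then swap it into place with `os.replace`, so a concurrent reader sees either the old or the new file but never a half-written one. `atomic_write_text` is a hypothetical helper name.

```python
import os
import tempfile


def atomic_write_text(path, text):
    """Write `text` to `path` so readers never observe a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(text)
            fh.flush()
            os.fsync(fh.fileno())  # push the bytes to disk before the swap
        os.replace(tmp_path, path)  # atomically replaces any existing file
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A caller would use it as `atomic_write_text("OVERNIGHT_SUMMARY.md", rendered_markdown)`, where `rendered_markdown` stands for whatever string the summary step produces.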

artifacts/baseline_results.json ADDED

	@@ -0,0 +1,311 @@

+{
+  "optimal_temperature_c": {
+    "task": "regression",
+    "mean_metric": 3.1725860578562193,
+    "folds": [
+      {
+        "target": "optimal_temperature_c",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 2.8455384323075306,
+        "n_train": 13620,
+        "n_test": 3405
+      },
+      {
+        "target": "optimal_temperature_c",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 3.4572334195443712,
+        "n_train": 13620,
+        "n_test": 3405
+      },
+      {
+        "target": "optimal_temperature_c",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 3.4555605416991115,
+        "n_train": 13620,
+        "n_test": 3405
+      },
+      {
+        "target": "optimal_temperature_c",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 2.7375921792185776,
+        "n_train": 13620,
+        "n_test": 3405
+      },
+      {
+        "target": "optimal_temperature_c",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 3.3670057165115064,
+        "n_train": 13620,
+        "n_test": 3405
+      }
+    ],
+    "top_features": {
+      "ivywrel_frac": 0.4960219383239746,
+      "n_predicted_cds": 0.05391158536076546,
+      "pos_charged_frac": 0.039320334792137146,
+      "aa_frac_P": 0.02822224497795105,
+      "aa_frac_C": 0.026947063207626343,
+      "aa_frac_Y": 0.0244345061480999,
+      "aa_frac_S": 0.022366111725568773,
+      "aa_frac_E": 0.021237003803253173,
+      "mean_isoelectric_point": 0.02086108475923538,
+      "genome_size_nt": 0.019465378299355508,
+      "aa_frac_L": 0.019293070212006568,
+      "aa_frac_V": 0.016112500242888926,
+      "neg_charged_frac": 0.01534007117152214,
+      "aa_frac_D": 0.015074614621698856,
+      "aa_frac_R": 0.01491919681429863,
+      "aa_frac_Q": 0.014176080748438836,
+      "aa_frac_I": 0.013733518309891224,
+      "mean_hydrophobicity": 0.013560011424124241,
+      "aa_frac_T": 0.012679264973849058,
+      "aa_frac_W": 0.011554591916501522
+    }
+  },
+  "optimal_ph": {
+    "task": "regression",
+    "mean_metric": 0.5397326897468703,
+    "folds": [
+      {
+        "target": "optimal_ph",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 0.4874841267004945,
+        "n_train": 3723,
+        "n_test": 931
+      },
+      {
+        "target": "optimal_ph",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 0.5713520769145897,
+        "n_train": 3723,
+        "n_test": 931
+      },
+      {
+        "target": "optimal_ph",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 0.5535044743039043,
+        "n_train": 3723,
+        "n_test": 931
+      },
+      {
+        "target": "optimal_ph",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 0.5538595587200816,
+        "n_train": 3723,
+        "n_test": 931
+      },
+      {
+        "target": "optimal_ph",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 0.5324632120952811,
+        "n_train": 3724,
+        "n_test": 930
+      }
+    ],
+    "top_features": {
+      "neg_charged_frac": 0.11188828796148301,
+      "aa_frac_H": 0.06443588957190513,
+      "ivywrel_frac": 0.04711368381977081,
+      "aa_frac_Q": 0.0425002209842205,
+      "aa_frac_E": 0.04013040736317634,
+      "n_predicted_cds": 0.0346000149846077,
+      "mean_hydrophobicity": 0.033536428958177565,
+      "aa_frac_L": 0.03228112906217575,
+      "aa_frac_C": 0.03177782744169235,
+      "aa_frac_V": 0.031084126606583596,
+      "mean_isoelectric_point": 0.03050720915198326,
+      "aa_frac_S": 0.02965335138142109,
+      "aa_frac_Y": 0.02864424102008343,
+      "aa_frac_W": 0.028111560642719267,
+      "aa_frac_P": 0.027575015649199487,
+      "aa_frac_I": 0.02686116211116314,
+      "pos_charged_frac": 0.02524181194603443,
+      "aa_frac_R": 0.024946783483028413,
+      "aromatic_frac": 0.024847164377570154,
+      "aa_frac_D": 0.024307455867528915
+    }
+  },
+  "oxygen_requirement": {
+    "task": "classification",
+    "mean_metric": 0.28303272250198364,
+    "folds": [
+      {
+        "target": "oxygen_requirement",
+        "task": "classification",
+        "metric_name": "f1_macro",
+        "value": 0.2740537993495899,
+        "n_train": 8347,
+        "n_test": 2086
+      },
+      {
+        "target": "oxygen_requirement",
+        "task": "classification",
+        "metric_name": "f1_macro",
+        "value": 0.27987706315870853,
+        "n_train": 8347,
+        "n_test": 2087
+      },
+      {
+        "target": "oxygen_requirement",
+        "task": "classification",
+        "metric_name": "f1_macro",
+        "value": 0.2906654309479178,
+        "n_train": 8347,
+        "n_test": 2087
+      },
+      {
+        "target": "oxygen_requirement",
+        "task": "classification",
+        "metric_name": "f1_macro",
+        "value": 0.2668782460426221,
+        "n_train": 8347,
+        "n_test": 2087
+      },
+      {
+        "target": "oxygen_requirement",
+        "task": "classification",
+        "metric_name": "f1_macro",
+        "value": 0.3036890730110796,
+        "n_train": 8348,
+        "n_test": 2086
+      }
+    ],
+    "top_features": {
+      "aa_frac_C": 0.10454434156417847,
+      "genome_size_nt": 0.0889880508184433,
+      "n_predicted_cds": 0.061901997029781344,
+      "aa_frac_Q": 0.054254394769668576,
+      "aa_frac_K": 0.037319378927350044,
+      "aa_frac_W": 0.03615027591586113,
+      "aa_frac_M": 0.03541129007935524,
+      "aa_frac_H": 0.03520122580230236,
+      "aa_frac_G": 0.03354446776211262,
+      "ivywrel_frac": 0.0306744959205389,
+      "aa_frac_Y": 0.03018171414732933,
+      "aa_frac_L": 0.029987764731049537,
+      "pos_charged_frac": 0.029102231562137603,
+      "aa_frac_V": 0.02888880521059036,
+      "aa_frac_N": 0.02404877059161663,
+      "aa_frac_F": 0.023649439588189124,
+      "aa_frac_E": 0.023098185285925867,
+      "aa_frac_P": 0.022718899697065354,
+      "aa_frac_R": 0.021204233914613724,
+      "aa_frac_I": 0.02029264494776726
+    }
+  },
+  "salt_tolerance_pct": {
+    "task": "regression",
+    "mean_metric": 2.522809721716851,
+    "folds": [
+      {
+        "target": "salt_tolerance_pct",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 2.182882065629829,
+        "n_train": 3840,
+        "n_test": 960
+      },
+      {
+        "target": "salt_tolerance_pct",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 2.5901312468783932,
+        "n_train": 3840,
+        "n_test": 960
+      },
+      {
+        "target": "salt_tolerance_pct",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 3.1095002870118864,
+        "n_train": 3840,
+        "n_test": 960
+      },
+      {
+        "target": "salt_tolerance_pct",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 2.4400568475718374,
+        "n_train": 3840,
+        "n_test": 960
+      },
+      {
+        "target": "salt_tolerance_pct",
+        "task": "regression",
+        "metric_name": "mae",
+        "value": 2.291478161492307,
+        "n_train": 3840,
+        "n_test": 960
+      }
+    ],
+    "top_features": {
+      "aa_frac_C": 0.15850864201784134,
+      "neg_charged_frac": 0.14477943778038024,
+      "aa_frac_T": 0.03598657064139843,
+      "mean_isoelectric_point": 0.030603666976094245,
+      "aa_frac_H": 0.030093548446893693,
+      "aa_frac_D": 0.02875022515654564,
+      "aa_frac_L": 0.02756672874093056,
+      "aa_frac_K": 0.026532869413495065,
+      "aa_frac_W": 0.02639944478869438,
+      "ivywrel_frac": 0.026095620542764663,
+      "aa_frac_I": 0.025597410276532173,
+      "aa_frac_S": 0.0251168642193079,
+      "aa_frac_M": 0.02414705790579319,
+      "aa_frac_Y": 0.02411585859954357,
+      "aa_frac_V": 0.022448396496474742,
+      "mean_hydrophobicity": 0.022069810889661312,
+      "mean_cds_aa_length": 0.02146391086280346,
+      "aromatic_frac": 0.02145920880138874,
+      "aa_frac_P": 0.02134738564491272,
+      "n_predicted_cds": 0.020885448157787322
+    }
+  },
+  "__meta__": {
+    "feature_cols": [
+      "genome_size_nt",
+      "n_contigs",
+      "gc_content",
+      "n_predicted_cds",
+      "coding_density",
+      "mean_cds_aa_length",
+      "median_cds_aa_length",
+      "aromatic_frac",
+      "pos_charged_frac",
+      "neg_charged_frac",
+      "ivywrel_frac",
+      "mean_hydrophobicity",
+      "mean_isoelectric_point",
+      "aa_frac_A",
+      "aa_frac_C",
+      "aa_frac_D",
+      "aa_frac_E",
+      "aa_frac_F",
+      "aa_frac_G",
+      "aa_frac_H",
+      "aa_frac_I",
+      "aa_frac_K",
+      "aa_frac_L",
+      "aa_frac_M",
+      "aa_frac_N",
+      "aa_frac_P",
+      "aa_frac_Q",
+      "aa_frac_R",
+      "aa_frac_S",
+      "aa_frac_T",
+      "aa_frac_V",
+      "aa_frac_W",
+      "aa_frac_Y"
+    ]
+  }
+}
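
As a quick consistency check on this file, each target's `mean_metric` should equal the mean of its five fold values. The sketch below verifies that for an excerpt copied from the JSON above (only `optimal_temperature_c`, with `__meta__` truncated to three columns); `check_means` is an illustrative helper, and in practice one would load the real file with `json.load` instead of the inline string.

```python
import json
from statistics import fmean

# Excerpt of artifacts/baseline_results.json; values copied from the diff above.
RESULTS_JSON = """
{
  "optimal_temperature_c": {
    "task": "regression",
    "mean_metric": 3.1725860578562193,
    "folds": [
      {"metric_name": "mae", "value": 2.8455384323075306},
      {"metric_name": "mae", "value": 3.4572334195443712},
      {"metric_name": "mae", "value": 3.4555605416991115},
      {"metric_name": "mae", "value": 2.7375921792185776},
      {"metric_name": "mae", "value": 3.3670057165115064}
    ]
  },
  "__meta__": {"feature_cols": ["genome_size_nt", "n_contigs", "gc_content"]}
}
"""


def check_means(results, tol=1e-9):
    """Recompute each target's mean over folds and compare with mean_metric."""
    recomputed = {}
    for target, entry in results.items():
        if target.startswith("__"):  # skip bookkeeping keys such as "__meta__"
            continue
        mean = fmean(fold["value"] for fold in entry["folds"])
        if abs(mean - entry["mean_metric"]) > tol:
            raise ValueError(f"{target}: folds average {mean}, file says {entry['mean_metric']}")
        recomputed[target] = mean
    return recomputed


recomputed = check_means(json.loads(RESULTS_JSON))
print(recomputed)
```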

artifacts/eval.log ADDED

	@@ -0,0 +1 @@


+Wrote /Users/miyuhoriuchi/microbe-model/artifacts/eval_report.md

artifacts/eval_report.md ADDED

	@@ -0,0 +1,270 @@

+# microbe-model — v0 baseline eval report
+_Generated: 2026-04-27T02:16:35+00:00_
+## TL;DR
+- **`optimal_temperature_c`**: MAE = **3.17** (vs always-predict-mean 5.53, **+43%**)
+- **`optimal_ph`**: MAE = **0.54** (vs always-predict-mean 0.55, **+1%**)
+- **`oxygen_requirement`**: macro-F1 = **0.283** (vs always-predict-majority 0.072, **+294%**)
+- **`salt_tolerance_pct`**: MAE = **2.52** (vs always-predict-mean 2.72, **+7%**)
+Trained on **17,065** strains with **33** genome-derived features. Cross-validation: 5-fold GroupKFold by taxonomic family.
+## Corpus
+- Total strains in feature table: **17,065**
+- Labeled-strain counts by target:
+  - `optimal_temperature_c`: 17,025
+  - `optimal_ph`: 4,654
+  - `oxygen_requirement`: 10,434
+  - `salt_tolerance_pct`: 4,800
+## Target distributions
+- `optimal_temperature_c`: n=17,025, mean=31.96, std=8.57, p10=25.00, median=30.00, p90=37.00
+- `optimal_ph`: n=4,654, mean=7.19, std=0.83, p10=6.50, median=7.00, p90=8.00
+- `salt_tolerance_pct`: n=4,800, mean=3.55, std=4.11, p10=0.00, median=2.50, p90=8.00
+- `oxygen_requirement`:
+  - `aerobe`: 4,978
+  - `anaerobe`: 2,120
+  - `facultative anaerobe`: 1,227
+  - `obligate aerobe`: 1,029
+  - `microaerophile`: 889
+  - `obligate anaerobe`: 105
+  - `facultative aerobe`: 83
+  - `microaerotolerant`: 2
+  - `aerotolerant`: 1
+## Per-target results (5-fold GroupKFold by family)
+Metrics: regression = MAE (lower is better), classification = macro-F1 (higher is better).
+Each is shown alongside the dumb-baseline (always-predict-mean / always-predict-majority).
+| Target | Task | n labeled | Model metric | Baseline | Improvement |
+|---|---|---|---|---|---|
+| `optimal_temperature_c` | regression | 17,025 | MAE=3.173 | MAE=5.528 | +42.6% |
+| `optimal_ph` | regression | 4,654 | MAE=0.540 | MAE=0.546 | +1.2% |
+| `oxygen_requirement` | classification | 10,434 | F1=0.283 | F1=0.072 | +294.3% |
+| `salt_tolerance_pct` | regression | 4,800 | MAE=2.523 | MAE=2.720 | +7.3% |
+### `optimal_temperature_c` — fold-by-fold
+| Fold | Metric | Train | Test |
+|---|---|---|---|
+| 1 | mae = 2.846 | n=13,620 | n=3,405 |
+| 2 | mae = 3.457 | n=13,620 | n=3,405 |
+| 3 | mae = 3.456 | n=13,620 | n=3,405 |
+| 4 | mae = 2.738 | n=13,620 | n=3,405 |
+| 5 | mae = 3.367 | n=13,620 | n=3,405 |
+**Top 10 features for `optimal_temperature_c`:**
+- `ivywrel_frac` — 0.4960
+- `n_predicted_cds` — 0.0539
+- `pos_charged_frac` — 0.0393
+- `aa_frac_P` — 0.0282
+- `aa_frac_C` — 0.0269
+- `aa_frac_Y` — 0.0244
+- `aa_frac_S` — 0.0224
+- `aa_frac_E` — 0.0212
+- `mean_isoelectric_point` — 0.0209
+- `genome_size_nt` — 0.0195
+### `optimal_ph` — fold-by-fold
+| Fold | Metric | Train | Test |
+|---|---|---|---|
+| 1 | mae = 0.487 | n=3,723 | n=931 |
+| 2 | mae = 0.571 | n=3,723 | n=931 |
+| 3 | mae = 0.554 | n=3,723 | n=931 |
+| 4 | mae = 0.554 | n=3,723 | n=931 |
+| 5 | mae = 0.532 | n=3,724 | n=930 |
+**Top 10 features for `optimal_ph`:**
+- `neg_charged_frac` — 0.1119
+- `aa_frac_H` — 0.0644
+- `ivywrel_frac` — 0.0471
+- `aa_frac_Q` — 0.0425
+- `aa_frac_E` — 0.0401
+- `n_predicted_cds` — 0.0346
+- `mean_hydrophobicity` — 0.0335
+- `aa_frac_L` — 0.0323
+- `aa_frac_C` — 0.0318
+- `aa_frac_V` — 0.0311
+### `oxygen_requirement` — fold-by-fold
+| Fold | Metric | Train | Test |
+|---|---|---|---|
+| 1 | f1_macro = 0.274 | n=8,347 | n=2,086 |
+| 2 | f1_macro = 0.280 | n=8,347 | n=2,087 |
+| 3 | f1_macro = 0.291 | n=8,347 | n=2,087 |
+| 4 | f1_macro = 0.267 | n=8,347 | n=2,087 |
+| 5 | f1_macro = 0.304 | n=8,348 | n=2,086 |
+**Top 10 features for `oxygen_requirement`:**
+- `aa_frac_C` — 0.1045
+- `genome_size_nt` — 0.0890
+- `n_predicted_cds` — 0.0619
+- `aa_frac_Q` — 0.0543
+- `aa_frac_K` — 0.0373
+- `aa_frac_W` — 0.0362
+- `aa_frac_M` — 0.0354
+- `aa_frac_H` — 0.0352
+- `aa_frac_G` — 0.0335
+- `ivywrel_frac` — 0.0307
+### `salt_tolerance_pct` — fold-by-fold
+| Fold | Metric | Train | Test |
+|---|---|---|---|
+| 1 | mae = 2.183 | n=3,840 | n=960 |
+| 2 | mae = 2.590 | n=3,840 | n=960 |
+| 3 | mae = 3.110 | n=3,840 | n=960 |
+| 4 | mae = 2.440 | n=3,840 | n=960 |
+| 5 | mae = 2.291 | n=3,840 | n=960 |
+**Top 10 features for `salt_tolerance_pct`:**
+- `aa_frac_C` — 0.1585
+- `neg_charged_frac` — 0.1448
+- `aa_frac_T` — 0.0360
+- `mean_isoelectric_point` — 0.0306
+- `aa_frac_H` — 0.0301
+- `aa_frac_D` — 0.0288
+- `aa_frac_L` — 0.0276
+- `aa_frac_K` — 0.0265
+- `aa_frac_W` — 0.0264
+- `ivywrel_frac` — 0.0261
+## Feature ↔ target correlations (Spearman, top 10)
+Sanity-checks the biology — features known to track each target should appear here at high |ρ|. E.g. `ivywrel_frac` should correlate with `optimal_temperature_c` (Zeldovich 2007 thermophile signature).
+### `optimal_temperature_c`
+| Feature | Spearman ρ | p-value |
+|---|---|---|
+| `genome_size_nt` | -0.405 | 0.0e+00 |
+| `aa_frac_E` | +0.388 | 0.0e+00 |
+| `n_predicted_cds` | -0.386 | 0.0e+00 |
+| `ivywrel_frac` | +0.320 | 0.0e+00 |
+| `aa_frac_Y` | +0.318 | 0.0e+00 |
+| `aa_frac_W` | -0.309 | 0.0e+00 |
+| `pos_charged_frac` | +0.299 | 0.0e+00 |
+| `aa_frac_A` | -0.295 | 0.0e+00 |
+| `neg_charged_frac` | +0.293 | 0.0e+00 |
+| `aa_frac_P` | -0.277 | 1.0e-297 |
+### `optimal_ph`
+| Feature | Spearman ρ | p-value |
+|---|---|---|
+| `neg_charged_frac` | +0.308 | 4.2e-103 |
+| `mean_isoelectric_point` | -0.276 | 3.8e-82 |
+| `aa_frac_E` | +0.260 | 6.2e-73 |
+| `ivywrel_frac` | +0.166 | 4.0e-30 |
+| `aa_frac_D` | +0.111 | 2.6e-14 |
+| `mean_hydrophobicity` | -0.109 | 7.3e-14 |
+| `aa_frac_C` | -0.095 | 7.6e-11 |
+| `pos_charged_frac` | -0.086 | 3.5e-09 |
+| `aa_frac_P` | -0.081 | 3.3e-08 |
+| `aa_frac_A` | -0.078 | 9.6e-08 |
+### `salt_tolerance_pct`
+| Feature | Spearman ρ | p-value |
+|---|---|---|
+| `neg_charged_frac` | +0.227 | 3.2e-57 |
+| `mean_isoelectric_point` | -0.204 | 3.5e-46 |
+| `ivywrel_frac` | +0.195 | 1.7e-42 |
+| `aa_frac_C` | -0.186 | 1.1e-38 |
+| `mean_cds_aa_length` | -0.161 | 3.8e-29 |
+| `aa_frac_D` | +0.159 | 1.4e-28 |
+| `aa_frac_E` | +0.143 | 2.4e-23 |
+| `aa_frac_V` | +0.112 | 8.3e-15 |
+| `aa_frac_T` | +0.104 | 4.4e-13 |
+| `coding_density` | -0.090 | 4.1e-10 |
+## Per-family error breakdown (regression targets)
+Top 15 most-represented families, MAE per family. Highlights where the model is doing well vs. struggling.
+### `optimal_temperature_c`
+| Family | n | MAE |
+|---|---|---|
+| Streptomycetaceae | 798 | 1.451 |
+| Bacillaceae | 643 | 4.086 |
+| Flavobacteriaceae | 631 | 4.195 |
+| Lactobacillaceae | 471 | 3.161 |
+| Enterobacteriaceae | 439 | 3.967 |
+| Microbacteriaceae | 396 | 2.457 |
+| Pseudomonadaceae | 388 | 2.523 |
+| Roseobacteraceae | 341 | 2.992 |
+| Paenibacillaceae | 319 | 3.474 |
+| Pseudonocardiaceae | 306 | 2.184 |
+| Moraxellaceae | 269 | 2.723 |
+| Sphingomonadaceae | 256 | 1.845 |
+| Streptococcaceae | 251 | 3.063 |
+| Clostridiaceae | 247 | 4.560 |
+| Vibrionaceae | 239 | 3.290 |
+### `optimal_ph`
+| Family | n | MAE |
+|---|---|---|
+| Flavobacteriaceae | 323 | 0.424 |
+| Bacillaceae | 273 | 0.657 |
+| Roseobacteraceae | 192 | 0.363 |
+| Paenibacillaceae | 126 | 0.484 |
+| Microbacteriaceae | 112 | 0.510 |
+| Sphingobacteriaceae | 100 | 0.376 |
+| Sphingomonadaceae | 96 | 0.365 |
+| Streptomycetaceae | 92 | 0.742 |
+| Pseudonocardiaceae | 85 | 0.547 |
+| Halomonadaceae | 81 | 0.748 |
+| Nocardioidaceae | 74 | 0.516 |
+| Paracoccaceae | 71 | 0.563 |
+| Micrococcaceae | 71 | 0.626 |
+| Erythrobacteraceae | 68 | 0.442 |
+| Alteromonadaceae | 68 | 0.375 |
+### `salt_tolerance_pct`
+| Family | n | MAE |
+|---|---|---|
+| Flavobacteriaceae | 285 | 1.491 |
+| Streptomycetaceae | 283 | 2.070 |
+| Bacillaceae | 246 | 3.241 |
+| Microbacteriaceae | 140 | 2.808 |
+| Pseudonocardiaceae | 134 | 2.501 |
+| Roseobacteraceae | 134 | 1.533 |
+| Paenibacillaceae | 125 | 2.387 |
+| Pseudomonadaceae | 110 | 3.870 |
+| Vibrionaceae | 100 | 2.653 |
+| Sphingomonadaceae | 92 | 2.029 |
+| Micromonosporaceae | 90 | 1.724 |
+| Micrococcaceae | 85 | 3.045 |
+| Nocardiaceae | 84 | 2.613 |
+| Streptococcaceae | 82 | 1.390 |
+| Lactobacillaceae | 78 | 2.506 |
+## Known limitations
+- **Survivorship bias.** BacDive only contains organisms that have been cultured successfully at least once. The model cannot generalize to truly uncultured strains without explicit out-of-distribution evaluation.
+- **Optimum derivation is heuristic.** Most BacDive temperature entries are tagged as `growth` (positive growth at this temperature), not `optimum`. We approximate the optimum as the median of positive-growth temperatures when no explicit optimum is recorded — this can be off by 5°C or more for some strains.
+- **Family grouping is naive.** The current `family` column is derived from the genus (first word of binomial name). A proper LPSN/GTDB family assignment would give tighter taxonomic grouping.
+- **Feature set is shallow.** No HMM/KEGG annotations, no codon usage indices, no tRNA counts. These are interpretable next steps before moving to genome LMs.
+- **Pyrodigal accuracy.** Gene prediction quality drops on highly-fragmented assemblies and atypical genetic codes. Not currently flagged in the feature set.
+## Next steps
+1. **Add tetranucleotide / codon-usage features.** ~50 extra columns, well-known signal for thermophily.
+2. **Replace naive family lookup with LPSN/GTDB join.** Reduces leakage in CV.
+3. **Integrate KOMODO media DB** as a richer label source than BacDive alone.
+4. **Move to genome embeddings** (Nucleotide Transformer / Evo-1 / DNABERT-2) once the tabular ceiling is established.
+5. **Active learning loop**: select novel-family strains where the model is uncertain, prioritize these for wet-lab cultivation testing.
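
The always-predict-majority baseline of 0.072 macro-F1 in the report can be roughly reproduced from the listed class counts: only the majority class (`aerobe`) gets a non-zero F1, and macro averaging divides that by the number of classes. The sketch below does this on the pooled counts. It is an approximation, since the report scores each fold separately and a fold may not contain every rare class.

```python
# Class counts copied from the "Target distributions" section of the report.
oxygen_counts = {
    "aerobe": 4978,
    "anaerobe": 2120,
    "facultative anaerobe": 1227,
    "obligate aerobe": 1029,
    "microaerophile": 889,
    "obligate anaerobe": 105,
    "facultative aerobe": 83,
    "microaerotolerant": 2,
    "aerotolerant": 1,
}


def majority_macro_f1(class_counts):
    """Macro-F1 of a classifier that always predicts the most common class."""
    total = sum(class_counts.values())
    majority = max(class_counts.values())
    precision = majority / total  # every prediction is the majority class
    recall = 1.0                  # so every majority example is recovered
    f1_majority = 2 * precision * recall / (precision + recall)
    # All other classes are never predicted, so their F1 is 0.
    return f1_majority / len(class_counts)


print(f"n = {sum(oxygen_counts.values())}, macro-F1 = {majority_macro_f1(oxygen_counts):.3f}")
```

This also shows why macro-F1 is a demanding metric here: with nine classes, two of which have only one or two examples, even a useful model is pulled down by the classes it can barely learn.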

artifacts/run.log ADDED

	@@ -0,0 +1,7 @@

+[2026-04-27T02:15:42Z] >>> train
+[2026-04-27T02:16:34Z] <<< train OK
+[2026-04-27T02:16:34Z] >>> eval
+[2026-04-27T02:16:36Z] <<< eval OK
+[2026-04-27T02:16:36Z] >>> summary
+[2026-04-27T02:16:37Z] <<< summary OK
+[2026-04-27T02:16:37Z] all phases attempted
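
Phase durations can be read off this log by pairing each `>>>` start marker with the matching `<<<` end marker. A minimal sketch, assuming the line format shown above stays stable; `phase_durations` is an illustrative helper, not part of the repo.

```python
import re
from datetime import datetime

# Contents of artifacts/run.log as added in this commit.
RUN_LOG = """\
[2026-04-27T02:15:42Z] >>> train
[2026-04-27T02:16:34Z] <<< train OK
[2026-04-27T02:16:34Z] >>> eval
[2026-04-27T02:16:36Z] <<< eval OK
[2026-04-27T02:16:36Z] >>> summary
[2026-04-27T02:16:37Z] <<< summary OK
[2026-04-27T02:16:37Z] all phases attempted
"""

LINE = re.compile(r"\[(?P<ts>[^\]]+)\] (?P<marker>[<>]{3}) (?P<phase>\w+)")


def phase_durations(log_text):
    """Return {phase: seconds} for every phase with both a start and an end line."""
    started, durations = {}, {}
    for line in log_text.splitlines():
        m = LINE.match(line)
        if m is None:
            continue  # e.g. the trailing "all phases attempted" line
        ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%SZ")
        if m["marker"] == ">>>":
            started[m["phase"]] = ts
        elif m["phase"] in started:
            durations[m["phase"]] = (ts - started[m["phase"]]).total_seconds()
    return durations


durations = phase_durations(RUN_LOG)
print(durations, "total:", sum(durations.values()))
```

With one-second timestamp resolution the train phase comes out at 52 s, slightly above the 49.6 s the training script reports for itself, which presumably excludes process start-up.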

artifacts/summary.log ADDED

	@@ -0,0 +1 @@


+Wrote /Users/miyuhoriuchi/microbe-model/OVERNIGHT_SUMMARY.md

artifacts/train.log ADDED

	@@ -0,0 +1,13 @@

+Training table: 17,065 strains × 33 features
+Distinct groups: 605
+Group sizes (top 10): {'Streptomycetaceae': 798, 'Bacillaceae': 643, 'Flavobacteriaceae': 631, 'Lactobacillaceae': 472, 'Enterobacteriaceae': 439, 'Microbacteriaceae': 397, 'Pseudomonadaceae': 388, 'Roseobacteraceae': 341, 'Paenibacillaceae': 319, 'Pseudonocardiaceae': 307}
+Wrote training table to /Users/miyuhoriuchi/microbe-model/data/training_table.parquet
+Wrote per-strain predictions to /Users/miyuhoriuchi/microbe-model/artifacts/predictions.parquet
+Results summary (49.6s):
+  optimal_temperature_c     mae        = 3.1726  (n_folds=5)
+  optimal_ph                mae        = 0.5397  (n_folds=5)
+  oxygen_requirement        f1_macro   = 0.2830  (n_folds=5)
+  salt_tolerance_pct        mae        = 2.5228  (n_folds=5)
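
The log reports 605 distinct groups split into 5 folds. The point of grouping is that every family lands entirely in one fold, so no family appears in both train and test. The sketch below shows one way to do that with greedy size balancing (largest group first, into the currently lightest fold), similar in spirit to scikit-learn's `GroupKFold`. Whether the repo calls scikit-learn directly is not shown in this commit, and the toy group sizes are made up.

```python
from collections import Counter


def group_kfold(groups, n_splits=5):
    """Yield (train_idx, test_idx) pairs where no group spans both sides."""
    sizes = Counter(groups)
    fold_load = [0] * n_splits
    fold_of_group = {}
    # Largest groups first, each assigned to the currently lightest fold.
    for group, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        lightest = min(range(n_splits), key=fold_load.__getitem__)
        fold_of_group[group] = lightest
        fold_load[lightest] += size
    for k in range(n_splits):
        test = [i for i, g in enumerate(groups) if fold_of_group[g] == k]
        train = [i for i, g in enumerate(groups) if fold_of_group[g] != k]
        yield train, test


# Toy family labels for illustration only (sizes are made up).
toy_groups = (
    ["Streptomycetaceae"] * 4
    + ["Bacillaceae"] * 3
    + ["Flavobacteriaceae"] * 3
    + ["Vibrionaceae"] * 2
)

for fold, (train, test) in enumerate(group_kfold(toy_groups, n_splits=2), start=1):
    print(f"fold {fold}: test families = {sorted({toy_groups[i] for i in test})}")
```

Because whole families are held out, the reported scores estimate performance on families the model has not seen, which is a stricter test than a random split.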