miyuiu/microbe-model

Miyu Horiuchi committed on Apr 27

Commit 6d2a502 (1 parent: bbbea9d)

Add v1 composition features (tetranucleotides + codon usage)

Two new feature groups, ready to plug into a v1 featurize run:
- 256 tetranucleotide frequencies (skips kmers with N)
- 64 codon-usage frequencies (skips codons with N)

Both expressed as relative frequencies (sum to 1 within each group), so they
are scale-invariant across genome sizes.
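
The scale-invariance claim can be checked with a small standalone sketch. It restates the counting scheme instead of importing the repository module; the name `relative_tetra_freqs` and the toy contig are illustrative assumptions, not part of this commit.

```python
from collections import Counter
from itertools import product

# All 256 ACGT 4-mers, in a fixed order, plus a set for fast membership checks.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]
KMER_SET = set(KMERS)


def relative_tetra_freqs(seqs):
    """Illustrative restatement: overlapping 4-mer counts divided by their total."""
    counts = Counter()
    for seq in seqs:
        s = seq.upper()
        for i in range(len(s) - 3):
            kmer = s[i:i + 4]
            if kmer in KMER_SET:  # skips windows containing N or other symbols
                counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {k: 0.0 for k in KMERS}
    return {k: counts[k] / total for k in KMERS}


contig = "ACGTTGCAAGGCTTAACCGG"  # toy 20-nt sequence, 17 overlapping 4-mers
small = relative_tetra_freqs([contig])
# Ten copies passed as separate contigs: 10x the counts, same proportions.
large = relative_tetra_freqs([contig] * 10)
max_diff = max(abs(small[k] - large[k]) for k in KMERS)
```

The copies are passed as separate contigs on purpose: concatenating them into one string would create extra 4-mers at the junctions and shift the frequencies slightly.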

These supplement the v0 amino-acid-composition features (33 dims); adding them
increases the feature count roughly 10× (33 → 353 dims). Tetranucleotide
frequencies are known to track phylum-level taxonomy and thermophily; codon
usage informs translation efficiency and growth-rate phenotype.

Not yet wired into the streaming pipeline — that will happen in v1 once we
have the v0 baseline numbers to compare against.
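
Since the new groups are not wired in yet, the following is only a sketch of how the three groups might be merged into a single feature row. `combine_feature_groups` and the `aa_` key names are hypothetical; only the `tetra_` and `codon_` prefixes and the group sizes (33, 256, 64) come from this commit, and all values are dummies.

```python
from itertools import product


def combine_feature_groups(*groups):
    """Merge per-group feature dicts into one flat row, refusing key collisions."""
    row = {}
    for group in groups:
        overlap = row.keys() & group.keys()
        if overlap:
            raise ValueError(f"duplicate feature names: {sorted(overlap)[:5]}")
        row.update(group)
    return row


# Placeholder inputs with the dimensions described in the commit message.
v0 = {f"aa_{i}": 0.0 for i in range(33)}  # hypothetical names for the 33 v0 dims
tetra = {f"tetra_{''.join(p)}": 1 / 256 for p in product("ACGT", repeat=4)}
codon = {f"codon_{''.join(p)}": 1 / 64 for p in product("ACGT", repeat=3)}

row = combine_feature_groups(v0, tetra, codon)
n_features = len(row)  # 33 + 256 + 64
```

The distinct prefixes keep the groups from colliding when flattened into one row, so the collision check should only fire if a group is passed twice or two groups share a naming scheme.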

5 new tests, all passing. Total: 26/26.

Files changed (3)

OVERNIGHT_SUMMARY.md +11 -5
src/microbe_model/features/composition.py +68 -0
tests/test_composition.py +44 -0

OVERNIGHT_SUMMARY.md CHANGED

@@ -1,6 +1,6 @@
 # Overnight run — summary
-_Written 2026-04-26T23:03+00:00_
+_Written 2026-04-27T01:44+00:00_
 ## Pipeline status
@@ -8,10 +8,10 @@ _Written 2026-04-26T23:03+00:00_
   - 19,637 have genome accessions
   - 50,384 have optimal_temperature_c labels
   - **17,054** strains are training-ready (genome + T_opt)
-- 🟡 Featurize: in progress (32%)
-  - Processed: 5,489 / 17,094
-  - Successful: 5,473 (99.7%)
-  - Failed: 16 (mostly suppressed/withdrawn NCBI assemblies)
+- 🟡 Featurize: in progress (84%)
+  - Processed: 14,309 / 17,094
+  - Successful: 14,283 (99.8%)
+  - Failed: 26 (mostly suppressed/withdrawn NCBI assemblies)
 - ⏭ Training: not yet run (waits for featurize completion)
 - ⏭ Eval report: not yet generated
@@ -30,6 +30,12 @@ _Written 2026-04-26T23:03+00:00_
 ## Commits since yesterday
+- 316196d Fix predictions parquet type mix + plumb feature_cols through eval
+- 7db9544 Add tests for explore module (correlations + class means)
+- a22773f Harden post-featurize chain: each phase runs even if previous fails
+- eb37476 Add feature↔target correlation analysis to eval report
+- a7d692a Update README to reflect current state
+- 401687e Eval report enhancements: TL;DR + per-strain predictions + per-family error
 - 82997f4 Fix classification fold bug + add end-to-end integration tests
 - 8d52535 Add eval report generator + training table persistence + group-col override
 - 33535e5 Streaming fetch+featurize pipeline + 6× pyrodigal speedup + GCA version resolution

src/microbe_model/features/composition.py ADDED

@@ -0,0 +1,68 @@
+"""Compositional features: k-mer frequencies and codon usage.
+
+These supplement the v0 amino-acid-composition features in `genome.py`. They are
+computed on the same predicted-CDS set, so adding them to a v1 featurize run is
+~free in network/CPU terms.
+
+Two feature groups:
+  - tetranucleotide frequencies (256 dims) — well-known signal for thermophily,
+    halophily, and phylum-level taxonomy
+  - codon usage frequencies (64 dims) — informs translation efficiency, GC bias,
+    and growth rate phenotype
+
+We use them as relative frequencies (sum to 1 across each group) rather than
+counts, so they're scale-invariant across genome sizes.
+"""
+from __future__ import annotations
+
+from collections import Counter
+from collections.abc import Iterable
+
+NUCLEOTIDES = "ACGT"
+TETRA_KMERS = [a + b + c + d for a in NUCLEOTIDES for b in NUCLEOTIDES
+               for c in NUCLEOTIDES for d in NUCLEOTIDES]
+CODONS = [a + b + c for a in NUCLEOTIDES for b in NUCLEOTIDES for c in NUCLEOTIDES]
+
+
+def tetranucleotide_freqs(contigs: Iterable[tuple[str, str]]) -> dict[str, float]:
+    """Relative frequency of each of the 256 ACGT tetranucleotides.
+
+    Skips any 4-mer containing a non-ACGT character (e.g. N).
+    """
+    counts: Counter[str] = Counter()
+    total = 0
+    for _, seq in contigs:
+        s = seq.upper()
+        for i in range(len(s) - 3):
+            kmer = s[i : i + 4]
+            if kmer in TETRA_KMERS_SET:  # fast in-set check
+                counts[kmer] += 1
+                total += 1
+    if total == 0:
+        return {f"tetra_{k}": 0.0 for k in TETRA_KMERS}
+    return {f"tetra_{k}": counts.get(k, 0) / total for k in TETRA_KMERS}
+
+
+def codon_freqs(cds_nucleotides: Iterable[str]) -> dict[str, float]:
+    """Relative frequency of each of the 64 codons across all predicted CDS.
+
+    Argument: an iterable of nucleotide CDS strings (multiples of 3, ATG-start).
+    Skips codons containing non-ACGT (e.g. N).
+    """
+    counts: Counter[str] = Counter()
+    total = 0
+    for cds in cds_nucleotides:
+        s = cds.upper()
+        for i in range(0, len(s) - 2, 3):
+            codon = s[i : i + 3]
+            if codon in CODONS_SET:
+                counts[codon] += 1
+                total += 1
+    if total == 0:
+        return {f"codon_{k}": 0.0 for k in CODONS}
+    return {f"codon_{k}": counts.get(k, 0) / total for k in CODONS}
+
+
+# Lookup sets for fast membership checks
+TETRA_KMERS_SET = set(TETRA_KMERS)
+CODONS_SET = set(CODONS)
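
The two functions in this file walk the sequence differently: tetranucleotides use overlapping windows with a stride of 1, while codons are read in frame with a stride of 3. A minimal sketch of the two index patterns follows; the helper names `overlapping_kmers` and `in_frame_codons` and the toy CDS are illustrative, not part of the commit.

```python
def overlapping_kmers(seq, k=4):
    """Every window of length k, advancing one base at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def in_frame_codons(seq):
    """Non-overlapping triplets from position 0; a trailing partial codon is dropped."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


cds = "ATGAAATAA"  # toy 9-nt CDS: start codon, one lysine codon, stop codon

kmers = overlapping_kmers(cds)          # 9 - 4 + 1 = 6 windows
codons = in_frame_codons(cds)           # ['ATG', 'AAA', 'TAA']
truncated = in_frame_codons(cds[:-1])   # 8 nt: ['ATG', 'AAA'], the trailing 'TA' is ignored
```

For a sequence of length n this gives n - 3 tetranucleotide windows (397 for the 400-nt test contig) and floor(n / 3) codons, so a CDS whose length is not a multiple of 3 loses only its final one or two bases.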

tests/test_composition.py ADDED

@@ -0,0 +1,44 @@
+"""Tests for tetranucleotide + codon-frequency features."""
+from __future__ import annotations
+
+from microbe_model.features.composition import codon_freqs, tetranucleotide_freqs
+
+
+def test_tetranucleotide_freqs_sum_to_one() -> None:
+    contigs = [("c1", "ACGT" * 100)]  # 400 nt → 397 4-mers
+    out = tetranucleotide_freqs(contigs)
+    assert len(out) == 256
+    total = sum(out.values())
+    assert abs(total - 1.0) < 1e-6
+
+
+def test_tetranucleotide_freqs_handles_n() -> None:
+    contigs = [("c1", "ACGNACGTACGT")]
+    out = tetranucleotide_freqs(contigs)
+    # All 4-mers containing N should be skipped; valid ones (ACGT, CGTA, GTAC, TACG) counted
+    nonzero = {k: v for k, v in out.items() if v > 0}
+    assert all(("N" not in k.removeprefix("tetra_")) for k in nonzero)
+    assert nonzero  # we should have some non-N kmers
+
+
+def test_tetranucleotide_freqs_empty() -> None:
+    out = tetranucleotide_freqs([])
+    assert len(out) == 256
+    assert all(v == 0.0 for v in out.values())
+
+
+def test_codon_freqs_sum_to_one() -> None:
+    cds_list = ["ATG" * 30 + "TAA"]  # 30 ATG codons, 1 stop
+    out = codon_freqs(cds_list)
+    assert len(out) == 64
+    total = sum(out.values())
+    assert abs(total - 1.0) < 1e-6
+    # ATG should be 30/31 of the codons
+    assert abs(out["codon_ATG"] - 30 / 31) < 1e-6
+
+
+def test_codon_freqs_skips_non_acgt() -> None:
+    cds_list = ["ATGNNNATG"]  # codon NNN should be skipped
+    out = codon_freqs(cds_list)
+    assert out["codon_ATG"] == 1.0  # only the two ATG codons counted, both same
+    assert sum(out.values()) == 1.0