Spaces: miyuiu/microbe-model

Miyu Horiuchi Claude Opus 4.7 (1M context) committed on Apr 26

Commit

52cf5ab

0 Parent(s):

Scaffold v0: BacDive + NCBI ingestion, genome feature extractor, XGBoost baseline

Sets up the data + training pipeline for predicting cultivation conditions
(optimal T, pH, oxygen requirement, salt tolerance) from genome sequence:

- src/microbe_model/data/bacdive.py — BacDive REST client + phenotype extraction
- src/microbe_model/data/ncbi.py — NCBI Datasets v2 genome fetcher
- src/microbe_model/features/genome.py — pyrodigal CDS prediction + amino-acid
composition features (IVYWREL, hydrophobicity, isoelectric point — all
biologically motivated for the targets we predict)
- src/microbe_model/train/baseline.py — multi-task XGBoost with group K-fold
by family to prevent leakage from closely related strains
- scripts/01..04 — runnable pipeline entry points
- tests/test_features.py — smoke test on synthetic FASTA, passes

No trained model yet. Real BacDive ingestion needs BACDIVE_USER/PASSWORD env vars.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (22)

.env.example +7 -0
.gitattributes +2 -0
.gitignore +39 -0
.python-version +1 -0
README.md +93 -0
pyproject.toml +42 -0
scripts/01_fetch_bacdive.py +46 -0
scripts/02_fetch_genomes.py +40 -0
scripts/03_extract_features.py +46 -0
scripts/04_train_baseline.py +45 -0
src/microbe_model/__init__.py +1 -0
src/microbe_model/config.py +31 -0
src/microbe_model/data/__init__.py +0 -0
src/microbe_model/data/bacdive.py +182 -0
src/microbe_model/data/ncbi.py +61 -0
src/microbe_model/features/__init__.py +0 -0
src/microbe_model/features/genome.py +143 -0
src/microbe_model/train/__init__.py +0 -0
src/microbe_model/train/baseline.py +144 -0
tests/__init__.py +0 -0
tests/test_features.py +43 -0
uv.lock +0 -0

.env.example ADDED

	@@ -0,0 +1,7 @@

+# BacDive API credentials — register at https://bacdive.dsmz.de/
+BACDIVE_USER=
+BACDIVE_PASSWORD=
+# NCBI API key — optional, raises rate limit from 3 req/s to 10 req/s
+# Get one at https://www.ncbi.nlm.nih.gov/account/settings/
+NCBI_API_KEY=

.gitattributes ADDED

	@@ -0,0 +1,2 @@


+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.ubj filter=lfs diff=lfs merge=lfs -text

.gitignore ADDED

	@@ -0,0 +1,39 @@

+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.egg-info/
+.pytest_cache/
+.ruff_cache/
+.mypy_cache/
+# Virtual env
+.venv/
+venv/
+.env
+.env.local
+# Data / artifacts (large files — kept out of git)
+/data/
+/artifacts/
+/models/
+*.parquet
+*.fna
+*.fna.gz
+*.faa
+*.gbff
+*.gbff.gz
+# Notebooks
+.ipynb_checkpoints/
+notebooks/scratch/
+# Editor
+.vscode/
+.idea/
+*.swp
+.DS_Store
+# Agent / tool state
+.claude/
+.letta/

.python-version ADDED

	@@ -0,0 +1 @@


+3.11

README.md ADDED

	@@ -0,0 +1,93 @@

+# microbe-model
+Predict cultivation conditions (optimal temperature, pH, oxygen requirement, salt tolerance) for
+microbial isolates from genome sequence alone. The long-term aim is to lower the cost of culturing
+"microbial dark matter" — the >99% of microbial diversity that has not yet been grown in pure culture.
+## Status
+v0 — scaffolding the data pipeline + a non-deep-learning baseline. No trained model yet.
+## Approach
+```
+BacDive (phenotype labels) ──┐
+                             ├──> joined table (strain, genome_accession, phenotypes)
+GTDB / NCBI (genomes) ───────┘
+                                       │
+                                       ▼
+                              feature extraction
+                              (genome statistics, codon usage,
+                               proteome-level amino acid stats)
+                                       │
+                                       ▼
+                              XGBoost multi-task baseline
+                              (group K-fold by family)
+                                       │
+                                       ▼
+                              eval report (MAE, F1, importances)
+```
+The genome→phenotype features used here have well-established correlations with the target
+properties (e.g. proteome amino acid composition correlates with optimal growth temperature),
+so even a tabular model has a real signal to learn from. The point of v0 is to establish
+a baseline (a floor on performance) before investing in transformer-based approaches.
+## Setup
+```bash
+# Requires Python 3.11 and uv (https://docs.astral.sh/uv/)
+uv sync --all-extras
+```
+## Running the pipeline
+```bash
+# 1. Pull strain metadata + phenotype labels from BacDive
+#    (requires BACDIVE_USER and BACDIVE_PASSWORD env vars — register at bacdive.dsmz.de)
+uv run python scripts/01_fetch_bacdive.py --limit 1000
+# 2. Download genomes for strains that have an accession
+uv run python scripts/02_fetch_genomes.py
+# 3. Extract genome-level features (CDS prediction + amino acid stats)
+uv run python scripts/03_extract_features.py
+# 4. Train multi-task XGBoost baseline
+uv run python scripts/04_train_baseline.py
+# 5. Render eval report (planned: scripts/05_eval.py is not included in this commit)
+# uv run python scripts/05_eval.py
+```
+## Layout
+```
+src/microbe_model/
+  config.py          # paths, constants
+  data/
+    bacdive.py       # BacDive REST API client
+    ncbi.py          # NCBI genome fetcher (Datasets API)
+  features/
+    genome.py        # gene prediction + tabular feature extraction
+  train/
+    baseline.py      # multi-task XGBoost + group K-fold eval
+scripts/             # runnable entry points (numbered by pipeline order)
+tests/               # smoke tests on small fixtures
+data/                # (gitignored) cached API responses, genomes, parquet tables
+```
+## What this is *not* yet
+- Not a foundation model. No transformer. No genome language model.
+- Not a platform. There is no upload UI or active-learning loop.
+- Not validated against held-out organisms. The eval scaffolding exists; the data does not.
+These are deliberate v0 boundaries. See the project notes for the longer-term plan.
+## Environment variables
+Copy `.env.example` to `.env` and fill in:
+- `BACDIVE_USER`, `BACDIVE_PASSWORD` — required for BacDive API access (free registration).
+- `NCBI_API_KEY` — optional, raises NCBI rate limit from 3 req/s to 10 req/s.
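Because `.env.example` ships with empty values, a copied but unfilled `.env` sets the variables to empty strings rather than leaving them unset. The sketch below shows one way to check for that before starting a long fetch. The helper name `missing_required` is hypothetical and not part of this commit; the variable names are the ones listed above.

```python
import os
from collections.abc import Mapping

# Names taken from .env.example; only the BacDive pair is required.
REQUIRED_VARS = ("BACDIVE_USER", "BACDIVE_PASSWORD")


def missing_required(env: Mapping[str, str]) -> list[str]:
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_required(os.environ)
    if missing:
        raise SystemExit(f"Missing in environment/.env: {', '.join(missing)}")
    print("BacDive credentials found.")
```

In this commit the same condition is enforced later, when `BacDiveClient()` is constructed; a check like this only surfaces the problem earlier.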

pyproject.toml ADDED

	@@ -0,0 +1,42 @@

+[project]
+name = "microbe-model"
+version = "0.0.1"
+description = "Predict cultivation conditions for uncultured microbes from genome sequence."
+readme = "README.md"
+requires-python = ">=3.11"
+dependencies = [
+    "biopython>=1.83",
+    "pyrodigal>=3.5",
+    "numpy>=1.26",
+    "pandas>=2.2",
+    "pyarrow>=15",
+    "scikit-learn>=1.4",
+    "xgboost>=2.0",
+    "requests>=2.32",
+    "tqdm>=4.66",
+    "python-dotenv>=1.0",
+]
+[project.optional-dependencies]
+dev = [
+    "pytest>=8.0",
+    "ruff>=0.4",
+]
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+[tool.hatch.build.targets.wheel]
+packages = ["src/microbe_model"]
+[tool.ruff]
+line-length = 100
+target-version = "py311"
+[tool.ruff.lint]
+select = ["E", "F", "W", "I", "UP", "B", "SIM"]
+ignore = ["E501"]
+[tool.pytest.ini_options]
+testpaths = ["tests"]

scripts/01_fetch_bacdive.py ADDED

	@@ -0,0 +1,46 @@

+"""Pull strain metadata + phenotype labels from BacDive.
+Writes one JSON per strain to data/bacdive/, plus a consolidated parquet table at
+data/bacdive_phenotypes.parquet.
+Usage:
+    uv run python scripts/01_fetch_bacdive.py --limit 1000
+"""
+from __future__ import annotations
+import argparse
+import pandas as pd
+from tqdm import tqdm
+from microbe_model import config
+from microbe_model.data.bacdive import (
+    BacDiveClient,
+    extract_phenotypes,
+    fetch_with_cache,
+)
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--limit", type=int, default=1000, help="Max strains to fetch (0 = all)."
+    )
+    args = parser.parse_args()
+    client = BacDiveClient()
+    rows = []
+    # argparse cannot produce None for an int option, so treat 0 as "no limit".
+    limit = args.limit or None
+    for bacdive_id in tqdm(client.iter_strain_ids(limit=limit), desc="BacDive", unit="strain"):
+        record = fetch_with_cache(client, bacdive_id)
+        rows.append(extract_phenotypes(record))
+    df = pd.DataFrame(rows)
+    out = config.DATA / "bacdive_phenotypes.parquet"
+    df.to_parquet(out, index=False)
+    print(f"\nWrote {len(df)} rows to {out}")
+    print("Coverage of prediction targets:")
+    for col in ("optimal_temperature_c", "optimal_ph", "oxygen_requirement", "salt_tolerance_pct"):
+        print(f"  {col}: {df[col].notna().sum()} / {len(df)}")
+    print(f"  genome_accession: {df['genome_accession'].notna().sum()} / {len(df)}")
+if __name__ == "__main__":
+    main()

scripts/02_fetch_genomes.py ADDED

	@@ -0,0 +1,40 @@

+"""Download genome FASTAs for every BacDive strain that has an accession.
+Skips strains already cached on disk. Run after 01_fetch_bacdive.py.
+"""
+from __future__ import annotations
+import pandas as pd
+from tqdm import tqdm
+from microbe_model import config
+from microbe_model.data.ncbi import GenomeNotFound, fetch_genome
+def main() -> None:
+    table = config.DATA / "bacdive_phenotypes.parquet"
+    if not table.exists():
+        raise SystemExit(f"Missing {table}. Run scripts/01_fetch_bacdive.py first.")
+    df = pd.read_parquet(table)
+    accessions = df["genome_accession"].dropna().unique().tolist()
+    print(f"{len(accessions)} unique genome accessions to fetch.")
+    failed: list[tuple[str, str]] = []
+    for acc in tqdm(accessions, desc="NCBI", unit="genome"):
+        try:
+            fetch_genome(acc)
+        except GenomeNotFound:
+            failed.append((acc, "not_found"))
+        except Exception as exc:  # noqa: BLE001 — log and continue, don't kill the batch
+            failed.append((acc, type(exc).__name__))
+    print(f"\nDownloaded: {len(accessions) - len(failed)} / {len(accessions)}")
+    if failed:
+        log = config.DATA / "genome_fetch_failures.tsv"
+        pd.DataFrame(failed, columns=["accession", "error"]).to_csv(log, sep="\t", index=False)
+        print(f"Failures logged to {log}")
+if __name__ == "__main__":
+    main()

scripts/03_extract_features.py ADDED

	@@ -0,0 +1,46 @@

+"""Extract tabular genome features for every cached genome.
+Reads the BacDive phenotype table + the cached FASTAs in data/genomes/, runs Pyrodigal +
+amino-acid-composition feature extraction, and writes data/features.parquet (one row per strain).
+"""
+from __future__ import annotations
+import pandas as pd
+from tqdm import tqdm
+from microbe_model import config
+from microbe_model.data.ncbi import genome_path
+from microbe_model.features.genome import extract_features
+def main() -> None:
+    pheno_path = config.DATA / "bacdive_phenotypes.parquet"
+    if not pheno_path.exists():
+        raise SystemExit(f"Missing {pheno_path}. Run 01 then 02 first.")
+    pheno = pd.read_parquet(pheno_path)
+    rows = []
+    for _, row in tqdm(pheno.iterrows(), total=len(pheno), desc="features"):
+        accession = row["genome_accession"]
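+        # pd.isna covers both None and NaN; NaN is truthy, so `not accession` alone would miss it.
+        if pd.isna(accession) or not accession: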
+            continue
+        path = genome_path(accession)
+        if not path.exists():
+            continue
+        try:
+            feats = extract_features(path)
+        except Exception as exc:  # noqa: BLE001 — bad FASTA shouldn't kill the run
+            print(f"  skip {accession}: {type(exc).__name__}: {exc}")
+            continue
+        feats["bacdive_id"] = row["bacdive_id"]
+        feats["genome_accession"] = accession
+        rows.append(feats)
+    feats_df = pd.DataFrame(rows)
+    out = config.DATA / "features.parquet"
+    feats_df.to_parquet(out, index=False)
+    print(f"\nWrote {len(feats_df)} rows to {out}")
+if __name__ == "__main__":
+    main()

scripts/04_train_baseline.py ADDED

	@@ -0,0 +1,45 @@

+"""Train the multi-task XGBoost baseline.
+Joins phenotypes + features, derives a `family` column from `species` for group K-fold,
+and writes per-target metrics to artifacts/baseline_results.json.
+"""
+from __future__ import annotations
+import pandas as pd
+from microbe_model import config
+from microbe_model.train.baseline import save_results, train_all
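+def derive_family(species: str | None) -> str:
+    """Crude family proxy: first word of the binomial (the genus). Replace with GTDB lookup later."""
+    # pd.isna also catches NaN, which is truthy and would otherwise become the group "nan".
+    if pd.isna(species) or not str(species).strip():
+        return "__unknown__"
+    return str(species).split()[0]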
+def main() -> None:
+    pheno = pd.read_parquet(config.DATA / "bacdive_phenotypes.parquet")
+    feats = pd.read_parquet(config.DATA / "features.parquet")
+    df = pheno.merge(feats, on=["bacdive_id", "genome_accession"], how="inner")
+    df["family"] = df["species"].apply(derive_family)
+    feature_cols = [c for c in feats.columns if c not in {"bacdive_id", "genome_accession"}]
+    print(f"Training on {len(df)} strains × {len(feature_cols)} features.")
+    print(f"Group counts (top 10): {df['family'].value_counts().head(10).to_dict()}")
+    results = train_all(df, feature_cols)
+    out = config.ARTIFACTS / "baseline_results.json"
+    save_results(results, out)
+    print(f"\nWrote results to {out}\n")
+    for target, r in results.items():
+        if r.folds:
+            metric = r.folds[0].metric_name
+            print(f"  {target:25s} {metric:10s} = {r.mean():.4f}  (n_folds={len(r.folds)})")
+        else:
+            print(f"  {target:25s} skipped (insufficient data)")
+if __name__ == "__main__":
+    main()
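Reviewer note: the grouping key built by `derive_family` is the first word of the species binomial, which is the genus. The sketch below is a standalone illustration, not the committed function (the name `derive_group` is hypothetical). It shows that two genera from the same family (Enterobacteriaceae) land in different groups, so the split is stricter than random K-fold but looser than a true family-level split.

```python
import math


def derive_group(species: object) -> str:
    """Return the first word of a binomial name, or a sentinel for missing values."""
    if species is None or (isinstance(species, float) and math.isnan(species)):
        return "__unknown__"
    parts = str(species).split()
    return parts[0] if parts else "__unknown__"


# Both species belong to the family Enterobacteriaceae, yet they get different groups.
for name in ("Escherichia coli", "Salmonella enterica", None, float("nan")):
    print(repr(name), "->", derive_group(name))
```

A family-level lookup (for example from the GTDB taxonomy mentioned in the docstring) would put both species in one group.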

src/microbe_model/__init__.py ADDED

	@@ -0,0 +1 @@


+__version__ = "0.0.1"

src/microbe_model/config.py ADDED

	@@ -0,0 +1,31 @@

+"""Project paths and shared constants."""
+from __future__ import annotations
+import os
+from pathlib import Path
+from dotenv import load_dotenv
+load_dotenv()
+ROOT = Path(__file__).resolve().parents[2]
+DATA = ROOT / "data"
+ARTIFACTS = ROOT / "artifacts"
+BACDIVE_DIR = DATA / "bacdive"
+GENOME_DIR = DATA / "genomes"
+FEATURE_DIR = DATA / "features"
+for _d in (DATA, ARTIFACTS, BACDIVE_DIR, GENOME_DIR, FEATURE_DIR):
+    _d.mkdir(parents=True, exist_ok=True)
+BACDIVE_USER = os.environ.get("BACDIVE_USER")
+BACDIVE_PASSWORD = os.environ.get("BACDIVE_PASSWORD")
+NCBI_API_KEY = os.environ.get("NCBI_API_KEY")
+PHENOTYPE_TARGETS = {
+    "optimal_temperature_c": "regression",
+    "optimal_ph": "regression",
+    "oxygen_requirement": "classification",
+    "salt_tolerance_pct": "regression",
+}

src/microbe_model/data/__init__.py ADDED

File without changes

src/microbe_model/data/bacdive.py ADDED

	@@ -0,0 +1,182 @@

+"""BacDive REST API client.
+BacDive (https://bacdive.dsmz.de/) is the largest curated database of bacterial phenotypes.
+Free registration is required; credentials are read from BACDIVE_USER / BACDIVE_PASSWORD.
+This client does the minimum needed for v0:
+  - log in and obtain an OAuth token
+  - paginate through the strain catalog
+  - fetch full records by BacDive ID
+  - extract the phenotype targets we predict (T_opt, pH_opt, oxygen, salt)
+"""
+from __future__ import annotations
+import json
+import time
+from collections.abc import Iterator
+from pathlib import Path
+from typing import Any
+import requests
+from microbe_model import config
+BASE_URL = "https://api.bacdive.dsmz.de"
+TOKEN_URL = "https://sso.dsmz.de/auth/realms/dsmz/protocol/openid-connect/token"
+class BacDiveAuthError(RuntimeError):
+    pass
+class BacDiveClient:
+    def __init__(self, user: str | None = None, password: str | None = None) -> None:
+        self.user = user or config.BACDIVE_USER
+        self.password = password or config.BACDIVE_PASSWORD
+        if not self.user or not self.password:
+            raise BacDiveAuthError(
+                "Set BACDIVE_USER and BACDIVE_PASSWORD in .env (register at bacdive.dsmz.de)."
+            )
+        self._token: str | None = None
+        self._token_expires_at: float = 0.0
+        self._session = requests.Session()
+    def _refresh_token(self) -> None:
+        resp = self._session.post(
+            TOKEN_URL,
+            data={
+                "grant_type": "password",
+                "client_id": "api.bacdive.public",
+                "username": self.user,
+                "password": self.password,
+            },
+            timeout=30,
+        )
+        if resp.status_code != 200:
+            raise BacDiveAuthError(f"BacDive auth failed: {resp.status_code} {resp.text}")
+        body = resp.json()
+        self._token = body["access_token"]
+        self._token_expires_at = time.time() + body.get("expires_in", 300) - 30
+    def _headers(self) -> dict[str, str]:
+        if self._token is None or time.time() >= self._token_expires_at:
+            self._refresh_token()
+        return {"Authorization": f"Bearer {self._token}", "Accept": "application/json"}
+    def _get(self, path: str, params: dict | None = None) -> dict[str, Any]:
+        url = f"{BASE_URL}{path}"
+        for attempt in range(3):
+            resp = self._session.get(url, headers=self._headers(), params=params, timeout=60)
+            if resp.status_code == 429:
+                time.sleep(2 ** attempt)
+                continue
+            resp.raise_for_status()
+            return resp.json()
+        resp.raise_for_status()
+        return {}
+    def iter_strain_ids(self, limit: int | None = None) -> Iterator[int]:
+        """Page through the BacDive catalog and yield strain IDs."""
+        page_url: str | None = "/fetch/"
+        seen = 0
+        while page_url:
+            body = self._get(page_url)
+            for record in body.get("results", []):
+                yield int(record["id"])
+                seen += 1
+                if limit is not None and seen >= limit:
+                    return
+            next_url = body.get("next")
+            if not next_url:
+                return
+            page_url = next_url.replace(BASE_URL, "")
+    def fetch_record(self, bacdive_id: int) -> dict[str, Any]:
+        body = self._get(f"/fetch/{bacdive_id}")
+        results = body.get("results") or {}
+        if isinstance(results, list):
+            return results[0] if results else {}
+        if isinstance(results, dict) and str(bacdive_id) in results:
+            return results[str(bacdive_id)]
+        return results
+def extract_phenotypes(record: dict[str, Any]) -> dict[str, Any]:
+    """Pull the v0 prediction targets out of a BacDive record.
+    BacDive's record schema is deeply nested and field names vary across record versions.
+    We tolerate missing fields — anything we can't find becomes None and is dropped at training time.
+    """
+    out: dict[str, Any] = {
+        "bacdive_id": record.get("General", {}).get("BacDive-ID"),
+        "species": record.get("Name and taxonomic classification", {}).get("species"),
+        "ncbi_taxon_id": record.get("General", {}).get("NCBI tax id"),
+        "optimal_temperature_c": None,
+        "optimal_ph": None,
+        "oxygen_requirement": None,
+        "salt_tolerance_pct": None,
+        "genome_accession": None,
+    }
+    culture = record.get("Culture and growth conditions", {})
+    temps = _as_list(culture.get("culture temp"))
+    for t in temps:
+        if isinstance(t, dict) and t.get("type", "").lower() in {"optimum", "optimal"}:
+            out["optimal_temperature_c"] = _to_float(t.get("temperature"))
+            break
+    phs = _as_list(culture.get("culture pH"))
+    for p in phs:
+        if isinstance(p, dict) and p.get("type", "").lower() in {"optimum", "optimal"}:
+            out["optimal_ph"] = _to_float(p.get("pH"))
+            break
+    physio = record.get("Physiology and metabolism", {})
+    oxygen = _as_list(physio.get("oxygen tolerance"))
+    if oxygen and isinstance(oxygen[0], dict):
+        out["oxygen_requirement"] = oxygen[0].get("oxygen tolerance")
+    salt = _as_list(physio.get("halophily"))
+    for s in salt:
+        if isinstance(s, dict) and "concentration" in s:
+            out["salt_tolerance_pct"] = _to_float(s.get("concentration"))
+            break
+    seq = record.get("Sequence information", {})
+    genomes = _as_list(seq.get("genome sequence"))
+    for g in genomes:
+        if isinstance(g, dict) and g.get("accession"):
+            out["genome_accession"] = g["accession"]
+            break
+    return out
+def _as_list(x: Any) -> list:
+    if x is None:
+        return []
+    if isinstance(x, list):
+        return x
+    return [x]
+def _to_float(x: Any) -> float | None:
+    if x is None:
+        return None
+    try:
+        return float(str(x).split()[0])
+    except (ValueError, IndexError):  # IndexError: empty or whitespace-only string
+        return None
+def cache_path(bacdive_id: int) -> Path:
+    return config.BACDIVE_DIR / f"{bacdive_id}.json"
+def fetch_with_cache(client: BacDiveClient, bacdive_id: int) -> dict[str, Any]:
+    path = cache_path(bacdive_id)
+    if path.exists():
+        return json.loads(path.read_text())
+    record = client.fetch_record(bacdive_id)
+    path.write_text(json.dumps(record))
+    return record

src/microbe_model/data/ncbi.py ADDED

	@@ -0,0 +1,61 @@

+"""NCBI genome fetcher.
+Uses the NCBI Datasets v2 REST API to download a single nucleotide FASTA per accession.
+This API doesn't require auth, but providing NCBI_API_KEY raises the rate limit from
+3 req/s to 10 req/s.
+"""
+from __future__ import annotations
+import gzip
+import io
+import time
+import zipfile
+from pathlib import Path
+import requests
+from microbe_model import config
+DATASETS_BASE = "https://api.ncbi.nlm.nih.gov/datasets/v2"
+RATE_LIMIT_S = 0.1 if config.NCBI_API_KEY else 0.34
+class GenomeNotFound(RuntimeError):
+    pass
+def genome_path(accession: str) -> Path:
+    return config.GENOME_DIR / f"{accession}.fna.gz"
+def fetch_genome(accession: str, *, force: bool = False) -> Path:
+    """Download a genome FASTA for the given assembly accession (e.g. GCF_000005845.2).
+    The Datasets API returns a zip; we extract the FASTA, gzip it, and write to disk.
+    Idempotent — returns immediately if the file is already cached.
+    """
+    out = genome_path(accession)
+    if out.exists() and not force:
+        return out
+    url = f"{DATASETS_BASE}/genome/accession/{accession}/download"
+    params = {"include_annotation_type": "GENOME_FASTA"}
+    headers = {"Accept": "application/zip"}
+    if config.NCBI_API_KEY:
+        headers["api-key"] = config.NCBI_API_KEY
+    time.sleep(RATE_LIMIT_S)
+    resp = requests.get(url, params=params, headers=headers, timeout=120, stream=True)
+    if resp.status_code == 404:
+        raise GenomeNotFound(accession)
+    resp.raise_for_status()
+    buf = io.BytesIO(resp.content)
+    with zipfile.ZipFile(buf) as zf:
+        fasta_names = [n for n in zf.namelist() if n.endswith(".fna")]
+        if not fasta_names:
+            raise GenomeNotFound(f"{accession} (no .fna in archive)")
+        with zf.open(fasta_names[0]) as src, gzip.open(out, "wb") as dst:
+            for chunk in iter(lambda: src.read(1 << 16), b""):
+                dst.write(chunk)
+    return out
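Reviewer note: `fetch_genome` treats an existing file as a finished download, but it writes the gzip directly to the final path, so an interrupted run could leave a truncated file that is later reused as cached. One possible remedy, sketched below under stated assumptions, is to write to a temporary file in the same directory and rename it into place. The function name `extract_fasta_atomically` and the archive member path in the demo are hypothetical; the network call is left out.

```python
import gzip
import io
import os
import tempfile
import zipfile
from pathlib import Path


def extract_fasta_atomically(zip_bytes: bytes, out: Path) -> Path:
    """Gzip the first .fna member of a zip archive into `out` via a temp file."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        names = [n for n in zf.namelist() if n.endswith(".fna")]
        if not names:
            raise FileNotFoundError("no .fna member in archive")
        fd, tmp_name = tempfile.mkstemp(dir=out.parent, suffix=".part")
        os.close(fd)
        tmp = Path(tmp_name)
        try:
            with zf.open(names[0]) as src, gzip.open(tmp, "wb") as dst:
                for chunk in iter(lambda: src.read(1 << 16), b""):
                    dst.write(chunk)
            os.replace(tmp, out)  # rename only after the write completed
        except BaseException:
            tmp.unlink(missing_ok=True)
            raise
    return out


def make_demo_zip() -> bytes:
    """Build a tiny in-memory archive with an invented member path."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("ncbi_dataset/data/DEMO/demo_genomic.fna", ">contig_1\nACGT\n")
    return buf.getvalue()
```

With this shape, the `out.exists()` cache check only ever sees complete files, because the final name appears in a single rename step.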

src/microbe_model/features/__init__.py ADDED

File without changes

src/microbe_model/features/genome.py ADDED

	@@ -0,0 +1,143 @@

+"""Tabular feature extraction from a microbial genome FASTA.
+These features are deliberately simple and biologically motivated:
+  - genome size, GC content, coding density
+  - predicted gene count and mean CDS length
+  - proteome-level amino acid composition
+  - aromatic, charged, and IVYWREL fractions (correlate with growth temperature)
+  - mean isoelectric point and hydrophobicity
+The amino-acid-composition signals have well-established correlations with optimal growth
+temperature and pH (Zeldovich 2007; Tekaia 2002), so they give XGBoost real signal to learn from
+without any deep model.
+"""
+from __future__ import annotations
+import gzip
+from collections import Counter
+from collections.abc import Iterable
+from pathlib import Path
+import numpy as np
+import pyrodigal
+from Bio import SeqIO
+AA_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
+AA_AROMATIC = set("FWY")
+AA_CHARGED_POS = set("KRH")
+AA_CHARGED_NEG = set("DE")
+AA_IVYWREL = set("IVYWREL")  # thermophile signature (Zeldovich 2007)
+# Kyte-Doolittle hydrophobicity
+HYDROPHOBICITY = {
+    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8, "G": -0.4, "H": -3.2,
+    "I": 4.5, "K": -3.9, "L": 3.8, "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5,
+    "R": -4.5, "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
+}
+# pKa values for isoelectric point estimation (Lehninger)
+PKA_NTERM = 9.69
+PKA_CTERM = 2.34
+PKA_SIDE = {"D": 3.65, "E": 4.25, "C": 8.33, "Y": 10.07, "H": 6.00, "K": 10.53, "R": 12.48}
+def read_fasta_records(path: Path) -> Iterable[tuple[str, str]]:
+    opener = gzip.open if str(path).endswith(".gz") else open
+    with opener(path, "rt") as handle:
+        for record in SeqIO.parse(handle, "fasta"):
+            yield record.id, str(record.seq).upper()
+def predict_proteins(contigs: Iterable[tuple[str, str]]) -> tuple[list[str], int]:
+    """Run Pyrodigal in meta mode and return predicted protein sequences + total nucleotides scanned."""
+    finder = pyrodigal.GeneFinder(meta=True)
+    proteins: list[str] = []
+    total_nt = 0
+    for _name, seq in contigs:
+        total_nt += len(seq)
+        # Pyrodigal accepts bytes; uppercase string works too in recent versions
+        genes = finder.find_genes(seq.encode("ascii"))
+        for gene in genes:
+            proteins.append(gene.translate().rstrip("*"))
+    return proteins, total_nt
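+def aa_composition(proteins: list[str]) -> dict[str, float]:
+    counts: Counter[str] = Counter()
+    for p in proteins:
+        counts.update(p)
+    # Normalise by standard residues only, so the fractions still sum to 1 if the
+    # translation contains non-standard symbols (e.g. "X" from ambiguous codons).
+    total = sum(counts[a] for a in AA_ALPHABET)
+    if total == 0:
+        return {f"aa_frac_{a}": 0.0 for a in AA_ALPHABET}
+    return {f"aa_frac_{a}": counts[a] / total for a in AA_ALPHABET}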
+def _isoelectric_point(seq: str) -> float:
+    """Bisection over pH to find the point where net charge is zero."""
+    if not seq:
+        return 7.0
+    counts = Counter(seq)
+    lo, hi = 0.0, 14.0
+    for _ in range(50):
+        ph = (lo + hi) / 2
+        net = (
+            1 / (1 + 10 ** (ph - PKA_NTERM))
+            - 1 / (1 + 10 ** (PKA_CTERM - ph))
+            + counts.get("K", 0) / (1 + 10 ** (ph - PKA_SIDE["K"]))
+            + counts.get("R", 0) / (1 + 10 ** (ph - PKA_SIDE["R"]))
+            + counts.get("H", 0) / (1 + 10 ** (ph - PKA_SIDE["H"]))
+            - counts.get("D", 0) / (1 + 10 ** (PKA_SIDE["D"] - ph))
+            - counts.get("E", 0) / (1 + 10 ** (PKA_SIDE["E"] - ph))
+            - counts.get("C", 0) / (1 + 10 ** (PKA_SIDE["C"] - ph))
+            - counts.get("Y", 0) / (1 + 10 ** (PKA_SIDE["Y"] - ph))
+        )
+        if net > 0:
+            lo = ph
+        else:
+            hi = ph
+    return (lo + hi) / 2
+def extract_features(fasta_path: Path) -> dict[str, float]:
+    contigs = list(read_fasta_records(fasta_path))
+    nt_total = sum(len(s) for _, s in contigs)
+    gc = sum(s.count("G") + s.count("C") for _, s in contigs)
+    gc_frac = gc / nt_total if nt_total else 0.0
+    proteins, _ = predict_proteins(contigs)
+    aa_total = sum(len(p) for p in proteins)
+    coding_density = (3 * aa_total) / nt_total if nt_total else 0.0
+    composition = aa_composition(proteins)
+    aromatic = sum(composition[f"aa_frac_{a}"] for a in AA_AROMATIC)
+    pos_charged = sum(composition[f"aa_frac_{a}"] for a in AA_CHARGED_POS)
+    neg_charged = sum(composition[f"aa_frac_{a}"] for a in AA_CHARGED_NEG)
+    ivywrel = sum(composition[f"aa_frac_{a}"] for a in AA_IVYWREL)
+    hydrophobicity = (
+        sum(composition[f"aa_frac_{a}"] * HYDROPHOBICITY[a] for a in AA_ALPHABET)
+        if proteins else 0.0
+    )
+    pi_values = [_isoelectric_point(p) for p in proteins[:1000]]  # cap at 1k proteins for speed
+    mean_pi = float(np.mean(pi_values)) if pi_values else 7.0
+    cds_lengths = [len(p) for p in proteins]
+    return {
+        "genome_size_nt": float(nt_total),
+        "n_contigs": float(len(contigs)),
+        "gc_content": gc_frac,
+        "n_predicted_cds": float(len(proteins)),
+        "coding_density": coding_density,
+        "mean_cds_aa_length": float(np.mean(cds_lengths)) if cds_lengths else 0.0,
+        "median_cds_aa_length": float(np.median(cds_lengths)) if cds_lengths else 0.0,
+        "aromatic_frac": aromatic,
+        "pos_charged_frac": pos_charged,
+        "neg_charged_frac": neg_charged,
+        "ivywrel_frac": ivywrel,
+        "mean_hydrophobicity": hydrophobicity,
+        "mean_isoelectric_point": mean_pi,
+        **composition,
+    }
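Reviewer note: to make the proteome-level features concrete, the sketch below computes the IVYWREL fraction and the mean Kyte-Doolittle hydrophobicity for two toy peptides. It is a standalone illustration using the same residue set and scale as `genome.py`; the function names and the toy sequences are made up for the example.

```python
from collections import Counter

IVYWREL = set("IVYWREL")
# Kyte-Doolittle scale, copied from the HYDROPHOBICITY table in genome.py.
KYTE_DOOLITTLE = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8, "G": -0.4, "H": -3.2,
    "I": 4.5, "K": -3.9, "L": 3.8, "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5,
    "R": -4.5, "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def _standard_counts(proteins: list[str]) -> tuple[Counter, int]:
    """Count residues across all proteins; the total covers standard residues only."""
    counts = Counter("".join(proteins))
    return counts, sum(counts[a] for a in KYTE_DOOLITTLE)


def ivywrel_fraction(proteins: list[str]) -> float:
    counts, total = _standard_counts(proteins)
    return sum(counts[a] for a in IVYWREL) / total if total else 0.0


def mean_hydrophobicity(proteins: list[str]) -> float:
    counts, total = _standard_counts(proteins)
    return sum(counts[a] * KYTE_DOOLITTLE[a] for a in KYTE_DOOLITTLE) / total if total else 0.0


toy = ["MKVLA", "IVYWREL"]  # 12 residues, 9 of them in the IVYWREL set
print(ivywrel_fraction(toy))  # 0.75
print(mean_hydrophobicity(toy))
```

Both quantities are pooled over all residues in the proteome, which matches how `extract_features` derives them from the composition dictionary.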

src/microbe_model/train/__init__.py ADDED

File without changes

src/microbe_model/train/baseline.py ADDED

	@@ -0,0 +1,144 @@

+"""Multi-task XGBoost baseline.
+One model per phenotype target, evaluated with group K-fold by taxonomic family to prevent
+leakage from closely related strains. This is the v0 "what's the floor on tabular performance"
+sanity check before we invest in transformers.
+"""
+from __future__ import annotations
+import json
+from dataclasses import dataclass, field
+from pathlib import Path
+import numpy as np
+import pandas as pd
+import xgboost as xgb
+from sklearn.metrics import f1_score, mean_absolute_error
+from sklearn.model_selection import GroupKFold
+from sklearn.preprocessing import LabelEncoder
+from microbe_model import config
+@dataclass
+class FoldResult:
+    target: str
+    task: str
+    metric_name: str
+    value: float
+    n_train: int
+    n_test: int
+@dataclass
+class TargetResult:
+    target: str
+    task: str
+    folds: list[FoldResult] = field(default_factory=list)
+    importances: dict[str, float] = field(default_factory=dict)
+    def mean(self) -> float:
+        return float(np.mean([f.value for f in self.folds])) if self.folds else float("nan")
+def _select_xy(df: pd.DataFrame, target: str, feature_cols: list[str]) -> tuple[pd.DataFrame, pd.Series]:
+    mask = df[target].notna()
+    return df.loc[mask, feature_cols], df.loc[mask, target]
+def train_target(
+    df: pd.DataFrame,
+    target: str,
+    task: str,
+    feature_cols: list[str],
+    group_col: str = "family",
+    n_splits: int = 5,
+) -> TargetResult:
+    X, y = _select_xy(df, target, feature_cols)
+    groups = df.loc[X.index, group_col].fillna("__unknown__")
+    if len(X) < n_splits * 2:
+        return TargetResult(target=target, task=task)
+    if task == "classification":
+        encoder = LabelEncoder()
+        y_enc = encoder.fit_transform(y.astype(str))
+    else:
+        y_enc = y.to_numpy(dtype=float)
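+    n_unique_groups = groups.nunique()
+    if n_unique_groups < 2:
+        # GroupKFold needs at least as many groups as folds; a single group cannot be split.
+        return TargetResult(target=target, task=task)
+    splits = min(n_splits, n_unique_groups)
+    kfold = GroupKFold(n_splits=splits)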
+    result = TargetResult(target=target, task=task)
+    importance_acc = np.zeros(len(feature_cols), dtype=float)
+    fold_count = 0
+    for tr_idx, te_idx in kfold.split(X, y_enc, groups):
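+        if task == "classification":
+            # Re-index labels per fold: a group split can leave a class out of the training
+            # set, and XGBClassifier expects training labels to be exactly 0..n_classes-1.
+            fold_classes, y_tr = np.unique(y_enc[tr_idx], return_inverse=True)
+            if len(fold_classes) < 2:
+                continue
+            model = xgb.XGBClassifier(
+                n_estimators=300,
+                max_depth=5,
+                learning_rate=0.05,
+                tree_method="hist",
+                n_jobs=-1,
+                eval_metric="mlogloss",
+            )
+            model.fit(X.iloc[tr_idx], y_tr)
+            # Map fold-local predictions back to the global label encoding before scoring.
+            preds = fold_classes[model.predict(X.iloc[te_idx]).astype(int)]
+            score = f1_score(y_enc[te_idx], preds, average="macro")
+            metric = "f1_macro"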
+        else:
+            model = xgb.XGBRegressor(
+                n_estimators=500,
+                max_depth=5,
+                learning_rate=0.05,
+                tree_method="hist",
+                n_jobs=-1,
+            )
+            model.fit(X.iloc[tr_idx], y_enc[tr_idx])
+            preds = model.predict(X.iloc[te_idx])
+            score = mean_absolute_error(y_enc[te_idx], preds)
+            metric = "mae"
+        result.folds.append(FoldResult(
+            target=target,
+            task=task,
+            metric_name=metric,
+            value=float(score),
+            n_train=int(len(tr_idx)),
+            n_test=int(len(te_idx)),
+        ))
+        importance_acc += model.feature_importances_
+        fold_count += 1
+    if fold_count:
+        importance_acc /= fold_count
+        result.importances = dict(zip(feature_cols, importance_acc.tolist(), strict=True))
+    return result
+def train_all(df: pd.DataFrame, feature_cols: list[str]) -> dict[str, TargetResult]:
+    results: dict[str, TargetResult] = {}
+    for target, task in config.PHENOTYPE_TARGETS.items():
+        if target not in df.columns:
+            continue
+        results[target] = train_target(df, target, task, feature_cols)
+    return results
+def save_results(results: dict[str, TargetResult], path: Path) -> None:
+    payload = {
+        target: {
+            "task": r.task,
+            "mean_metric": r.mean(),
+            "folds": [f.__dict__ for f in r.folds],
+            "top_features": dict(
+                sorted(r.importances.items(), key=lambda kv: kv[1], reverse=True)[:20]
+            ),
+        }
+        for target, r in results.items()
+    }
+    path.write_text(json.dumps(payload, indent=2))
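Reviewer note: the property that group K-fold provides is that every sample of a group falls into the same fold, so no group is seen in both training and test data. The sketch below is a simplified, stdlib-only stand-in that illustrates the idea; it is not scikit-learn's `GroupKFold` implementation, and the function name `group_folds` is hypothetical.

```python
from collections import Counter


def group_folds(groups: list[str], n_splits: int) -> list[list[int]]:
    """Assign whole groups to folds, largest group first, always to the lightest fold."""
    sizes = Counter(groups)
    if len(sizes) < n_splits:
        raise ValueError("need at least as many distinct groups as folds")
    fold_of: dict[str, int] = {}
    load = [0] * n_splits
    for group, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        lightest = load.index(min(load))
        fold_of[group] = lightest
        load[lightest] += size
    folds: list[list[int]] = [[] for _ in range(n_splits)]
    for index, group in enumerate(groups):
        folds[fold_of[group]].append(index)
    return folds


groups = ["Bacillus"] * 3 + ["Escherichia"] * 2 + ["Vibrio"]
for test_idx in group_folds(groups, n_splits=2):
    test_groups = {groups[i] for i in test_idx}
    train_groups = set(groups) - test_groups
    print(sorted(test_groups), "held out; train on", sorted(train_groups))
```

The `ValueError` branch corresponds to the case where a target has labels from fewer groups than requested folds, which the training code needs to handle before building the splitter.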

tests/__init__.py ADDED

File without changes

tests/test_features.py ADDED

	@@ -0,0 +1,43 @@

+"""Smoke test the feature extractor on a tiny synthetic genome."""
+from __future__ import annotations
+import gzip
+from pathlib import Path
+from microbe_model.features.genome import extract_features
+def _write_fake_genome(path: Path) -> None:
+    """Write a tiny FASTA with two contigs of synthetic GC-balanced sequence."""
+    contigs = [
+        (">contig_1\n" + ("ATGCGTACGTAGCTAGCTAGCATGCGTACG" * 200) + "\n"),
+        (">contig_2\n" + ("CGTACGATCGATCGTACGTAGCTACGATGC" * 200) + "\n"),
+    ]
+    with gzip.open(path, "wt") as fh:
+        fh.write("".join(contigs))
+def test_extract_features_runs(tmp_path: Path) -> None:
+    fasta = tmp_path / "fake.fna.gz"
+    _write_fake_genome(fasta)
+    feats = extract_features(fasta)
+    assert feats["genome_size_nt"] > 0
+    assert 0 <= feats["gc_content"] <= 1
+    assert feats["n_contigs"] == 2
+    assert feats["n_predicted_cds"] >= 0  # synthetic seq may have no real ORFs
+    # Amino acid fractions should sum to ~1 if any proteins were found, else 0.
+    aa_total = sum(v for k, v in feats.items() if k.startswith("aa_frac_"))
+    assert aa_total == 0.0 or abs(aa_total - 1.0) < 1e-6
+def test_isoelectric_point_in_range() -> None:
+    from microbe_model.features.genome import _isoelectric_point
+    assert 0 <= _isoelectric_point("AAAAA") <= 14
+    assert 0 <= _isoelectric_point("DDDDD") <= 14
+    assert 0 <= _isoelectric_point("KKKKK") <= 14
+    # Acidic protein should have lower pI than basic
+    assert _isoelectric_point("DDDDD") < _isoelectric_point("KKKKK")
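Reviewer note: the range and ordering checks above would still pass if the pI calculation were numerically off. A stricter check is possible for a peptide with no ionizable side chains: its net charge is zero exactly where the two terminal terms cancel, i.e. at pI = (pKa_Nterm + pKa_Cterm) / 2 = (9.69 + 2.34) / 2 = 6.015 with the constants used in `genome.py`. The sketch below re-implements the charge model and the bisection as a standalone illustration; the function names are hypothetical.

```python
PKA_NTERM = 9.69
PKA_CTERM = 2.34
PKA_POS = {"K": 10.53, "R": 12.48, "H": 6.00}            # protonated form carries +1
PKA_NEG = {"D": 3.65, "E": 4.25, "C": 8.33, "Y": 10.07}  # deprotonated form carries -1


def net_charge(seq: str, ph: float) -> float:
    """Henderson-Hasselbalch net charge of a peptide at the given pH."""
    pos = 1 / (1 + 10 ** (ph - PKA_NTERM))
    pos += sum(seq.count(a) / (1 + 10 ** (ph - pka)) for a, pka in PKA_POS.items())
    neg = 1 / (1 + 10 ** (PKA_CTERM - ph))
    neg += sum(seq.count(a) / (1 + 10 ** (pka - ph)) for a, pka in PKA_NEG.items())
    return pos - neg


def isoelectric_point(seq: str, iterations: int = 50) -> float:
    """Bisection on pH in [0, 14]; net charge decreases monotonically with pH."""
    lo, hi = 0.0, 14.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


expected = (PKA_NTERM + PKA_CTERM) / 2
assert abs(isoelectric_point("AAAAA") - expected) < 1e-6
assert net_charge("KKKKK", 7.0) > 0 > net_charge("DDDDD", 7.0)
```

An equivalent assertion against `_isoelectric_point("AAAAA")` could be added to `test_isoelectric_point_in_range` if a tighter regression check is wanted.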

uv.lock ADDED

The diff for this file is too large to render. See raw diff