whats2000 committed on
Commit dddcc0f · 1 Parent(s): cca1e2a

Initial commit: distributed EDA pipeline, max non-zero reporting, and notebook

.gitignore ADDED
@@ -0,0 +1,8 @@
+ .venv/
+ __pycache__/
+ *.pyc
+ .ipynb_checkpoints/
+ output/eda/per_dataset/
+ output/eda/*.csv
+ output/eda/*.json
+ output/eda/*.txt
README.md ADDED
@@ -0,0 +1,125 @@
+ # Distributed EDA for Cell x Gene
+
+ This folder now includes a memory-safe EDA pipeline for large `.h5ad` files.
+
+ All commands below assume your current directory is:
+
+ ```bash
+ cd /project/GOV108018/whats2000_work/cell_x_gene_visualization
+ ```
+
+ ## 1) Check resources first
+
+ ```bash
+ uv run python scripts/resource_probe.py
+ ```
+
+ ## 2) Run EDA (single node, both species by default)
+
+ ```bash
+ uv run python scripts/distributed_eda.py \
+     --input-dir /project/GOV108018/cell_x_gene/homo_sapiens/h5ad \
+     --input-dir /project/GOV108018/cell_x_gene/mus_musculus/h5ad \
+     --output-dir output/eda \
+     --workers 32 \
+     --chunk-size 8192
+ ```
+
+ Default output is clean `tqdm` progress only. If you want per-file logs, add `--log-each-dataset`:
+
+ ```bash
+ uv run python scripts/distributed_eda.py \
+     --input-dir /project/GOV108018/cell_x_gene/homo_sapiens/h5ad \
+     --input-dir /project/GOV108018/cell_x_gene/mus_musculus/h5ad \
+     --output-dir output/eda \
+     --workers 32 \
+     --chunk-size 8192 \
+     --log-each-dataset
+ ```
+
+ If memory pressure appears, fall back to fewer workers and smaller chunks:
+
+ ```bash
+ uv run python scripts/distributed_eda.py \
+     --input-dir /project/GOV108018/cell_x_gene/homo_sapiens/h5ad \
+     --input-dir /project/GOV108018/cell_x_gene/mus_musculus/h5ad \
+     --output-dir output/eda \
+     --workers 24 \
+     --chunk-size 4096
+ ```
+
+ Default input directories in the script are absolute:
+ - `/project/GOV108018/cell_x_gene/homo_sapiens/h5ad`
+ - `/project/GOV108018/cell_x_gene/mus_musculus/h5ad`
+
+ ## 3) Run EDA as distributed shards (multiple jobs)
+
+ Example for 4 shards:
+
+ ```bash
+ # job 0
+ uv run python scripts/distributed_eda.py --input-dir /project/GOV108018/cell_x_gene/homo_sapiens/h5ad --input-dir /project/GOV108018/cell_x_gene/mus_musculus/h5ad --num-shards 4 --shard-index 0
+ # job 1
+ uv run python scripts/distributed_eda.py --input-dir /project/GOV108018/cell_x_gene/homo_sapiens/h5ad --input-dir /project/GOV108018/cell_x_gene/mus_musculus/h5ad --num-shards 4 --shard-index 1
+ # job 2
+ uv run python scripts/distributed_eda.py --input-dir /project/GOV108018/cell_x_gene/homo_sapiens/h5ad --input-dir /project/GOV108018/cell_x_gene/mus_musculus/h5ad --num-shards 4 --shard-index 2
+ # job 3
+ uv run python scripts/distributed_eda.py --input-dir /project/GOV108018/cell_x_gene/homo_sapiens/h5ad --input-dir /project/GOV108018/cell_x_gene/mus_musculus/h5ad --num-shards 4 --shard-index 3
+ ```
+
+ Then merge:
+
+ ```bash
+ uv run python scripts/merge_eda_shards.py
+ ```
+
+ ## 4) Report only the global max non-zero gene count
+
+ After the merge, the script automatically writes:
+
+ - `output/eda/max_nonzero_gene_count_all_cells.csv`
+ - `output/eda/max_nonzero_gene_count_all_cells.json`
+
+ These files contain the single dataset row with the highest `cell_nnz_max` (the maximum number of non-zero genes in any one cell).
+
+ ## 5) Visualization notebook
+
+ Open and run:
+
+ - `notebooks/max_nonzero_gene_report.ipynb`
+
+ Or launch Jupyter with `uv run`:
+
+ ```bash
+ uv run jupyter lab
+ ```
+
+ The notebook:
+ - shows the global max row,
+ - plots the top datasets by `cell_nnz_max`,
+ - plots the distribution of `cell_nnz_max`.
+
+ ## Outputs
+
+ - Per-shard summary CSV:
+   - `output/eda/eda_summary_shard_XXX_of_YYY.csv`
+ - Per-shard failure log:
+   - `output/eda/eda_failures_shard_XXX_of_YYY.json`
+ - Per-dataset JSON details:
+   - `output/eda/per_dataset/*.json`
+ - Merged summary:
+   - `output/eda/eda_summary_all_shards.csv`
+ - Global max-only report:
+   - `output/eda/max_nonzero_gene_count_all_cells.csv`
+   - `output/eda/max_nonzero_gene_count_all_cells.json`
+
+ ## Notes on large data safety
+
+ - Uses `anndata.read_h5ad(..., backed="r")` so matrices are never fully loaded into memory.
+ - Scans the expression matrix in chunks with `chunked_X`.
+ - Uses process-level parallelism with a configurable worker count.
+ - Includes a shard mode for cross-job distribution on HPC queues.
+ - Shows a simple dataset-level `tqdm` progress bar during processing.
+ - Per-dataset JSON now includes explicit schema blocks:
+   - `obs_schema` with all obs column names and dtypes
+   - `var_schema` with all var column names and dtypes
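The round-robin rule behind `--num-shards`/`--shard-index` in section 3 can be sketched as follows (a minimal standalone illustration of the `i % num_shards == shard_index` assignment over a sorted file list; the file names here are made up):

```python
# Round-robin shard assignment over a sorted file list.
files = sorted(f"dataset_{i:03d}.h5ad" for i in range(10))
num_shards = 4

shards = {
    shard_index: [f for i, f in enumerate(files) if i % num_shards == shard_index]
    for shard_index in range(num_shards)
}

# Every file lands in exactly one shard, so independent jobs never overlap.
assigned = sorted(f for shard in shards.values() for f in shard)
```

Because the assignment depends only on the sorted index, each job can compute its own shard without any coordination.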
launch_jupyter.sh ADDED
@@ -0,0 +1,10 @@
+ #!/bin/bash
+ # Launch Jupyter Lab for EDA
+
+ cd /project/GOV108018/whats2000_work/cell_x_gene_visualization
+
+ echo "Starting Jupyter Lab..."
+ echo "Access at: http://localhost:8888"
+ echo ""
+
+ uv run jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root
notebooks/max_nonzero_gene_report.ipynb ADDED
@@ -0,0 +1,111 @@
+ {
+  "cells": [
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "# Max Non-zero Gene Count Report\n",
+     "\n",
+     "This notebook reports only the global maximum non-zero gene count across all cells from all processed datasets."
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "from pathlib import Path\n",
+     "import pandas as pd\n",
+     "import plotly.express as px\n",
+     "\n",
+     "summary_path = Path('../output/eda/eda_summary_all_shards.csv')\n",
+     "if not summary_path.exists():\n",
+     "    raise FileNotFoundError(f'Missing merged summary: {summary_path}')\n",
+     "\n",
+     "df = pd.read_csv(summary_path)\n",
+     "if 'cell_nnz_max' not in df.columns:\n",
+     "    raise KeyError(\"Column 'cell_nnz_max' not found. Run distributed_eda.py + merge_eda_shards.py first.\")\n",
+     "\n",
+     "df['cell_nnz_max'] = pd.to_numeric(df['cell_nnz_max'], errors='coerce')\n",
+     "df = df.dropna(subset=['cell_nnz_max']).copy()\n",
+     "df.shape"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "max_row = df.loc[df['cell_nnz_max'].idxmax()].copy()\n",
+     "report_cols = [c for c in ['dataset_file', 'dataset_path', 'cell_nnz_max', 'n_obs', 'n_vars', 'file_size_gib'] if c in df.columns]\n",
+     "max_report = max_row[report_cols].to_frame().T\n",
+     "max_report"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "top_n = 20\n",
+     "plot_df = df.nlargest(top_n, 'cell_nnz_max')[['dataset_file', 'cell_nnz_max']].copy()\n",
+     "plot_df = plot_df.sort_values('cell_nnz_max', ascending=True)\n",
+     "\n",
+     "fig = px.bar(\n",
+     "    plot_df,\n",
+     "    x='cell_nnz_max',\n",
+     "    y='dataset_file',\n",
+     "    orientation='h',\n",
+     "    title=f'Top {top_n} Datasets by Max Non-zero Gene Count per Cell',\n",
+     "    labels={'cell_nnz_max': 'Max non-zero genes in one cell', 'dataset_file': 'Dataset'}\n",
+     ")\n",
+     "fig.update_layout(height=700)\n",
+     "fig.show()"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "fig2 = px.histogram(\n",
+     "    df,\n",
+     "    x='cell_nnz_max',\n",
+     "    nbins=50,\n",
+     "    title='Distribution of Dataset-level Max Non-zero Gene Count per Cell',\n",
+     "    labels={'cell_nnz_max': 'Max non-zero genes in one cell'}\n",
+     ")\n",
+     "fig2.show()"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "out_dir = Path('../output/eda')\n",
+     "out_dir.mkdir(parents=True, exist_ok=True)\n",
+     "max_report.to_csv(out_dir / 'max_nonzero_gene_count_all_cells_from_notebook.csv', index=False)\n",
+     "max_report"
+    ]
+   }
+  ],
+  "metadata": {
+   "kernelspec": {
+    "display_name": "Python 3",
+    "language": "python",
+    "name": "python3"
+   },
+   "language_info": {
+    "name": "python",
+    "version": "3.10"
+   }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 5
+ }
pyproject.toml ADDED
@@ -0,0 +1,20 @@
+ [project]
+ name = "cell-x-gene-visualization"
+ version = "0.1.0"
+ description = "Visualization for Cell x Gene dataset"
+ requires-python = ">=3.10"
+ dependencies = [
+     "scanpy>=1.10.0",
+     "anndata>=0.10.0",
+     "pandas>=2.0.0",
+     "matplotlib>=3.8.0",
+     "seaborn>=0.13.0",
+     "numpy>=1.24.0",
+     "jupyter>=1.0.0",
+     "ipykernel>=6.25.0",
+     "plotly>=6.5.2",
+     "kaleido>=1.2.0",
+     "numba>=0.58.0",
+     "tqdm>=4.66.0",
+     "joblib>=1.3.0",
+ ]
scripts/distributed_eda.py ADDED
@@ -0,0 +1,377 @@
+ #!/usr/bin/env python3
+ """Distributed and memory-safe EDA for large Cell x Gene .h5ad datasets."""
+
+ from __future__ import annotations
+
+ import argparse
+ import concurrent.futures
+ import hashlib
+ import json
+ import math
+ import os
+ import time
+ from dataclasses import dataclass
+ from pathlib import Path
+ from typing import Iterable
+
+ import anndata as ad
+ import numpy as np
+ import pandas as pd
+ from scipy import sparse
+ from tqdm import tqdm
+
+
+ @dataclass
+ class RunningStats:
+     count: int = 0
+     sum_value: float = 0.0
+     sum_sq: float = 0.0
+     min_value: float = math.inf
+     max_value: float = -math.inf
+
+     def update(self, values: np.ndarray) -> None:
+         arr = np.asarray(values, dtype=np.float64)
+         if arr.size == 0:
+             return
+         self.count += int(arr.size)
+         self.sum_value += float(arr.sum())
+         self.sum_sq += float(np.square(arr).sum())
+         self.min_value = min(self.min_value, float(arr.min()))
+         self.max_value = max(self.max_value, float(arr.max()))
+
+     def finalize(self) -> dict[str, float | int | None]:
+         if self.count == 0:
+             return {"count": 0, "mean": None, "std": None, "min": None, "max": None}
+         mean = self.sum_value / self.count
+         var = max(0.0, self.sum_sq / self.count - mean * mean)
+         return {
+             "count": self.count,
+             "mean": mean,
+             "std": math.sqrt(var),
+             "min": self.min_value,
+             "max": self.max_value,
+         }
+
+
+ class ReservoirSampler:
+     def __init__(self, k: int, seed: int = 42) -> None:
+         self.k = k
+         self.values = np.empty((k,), dtype=np.float64)
+         self.filled = 0
+         self.seen = 0
+         self.rng = np.random.default_rng(seed)
+
+     def update(self, arr: np.ndarray) -> None:
+         vals = np.asarray(arr, dtype=np.float64).ravel()
+         for value in vals:
+             self.seen += 1
+             if self.filled < self.k:
+                 self.values[self.filled] = value
+                 self.filled += 1
+             else:
+                 j = int(self.rng.integers(0, self.seen))
+                 if j < self.k:
+                     self.values[j] = value
+
+     def quantiles(self, q: Iterable[float]) -> dict[str, float | None]:
+         if self.filled == 0:
+             return {f"q{int(x * 100)}": None for x in q}
+         sample = self.values[: self.filled]
+         out = np.quantile(sample, list(q))
+         return {f"q{int(k * 100)}": float(v) for k, v in zip(q, out)}
+
+
+ def safe_name(path: Path) -> str:
+     digest = hashlib.md5(str(path).encode("utf-8"), usedforsecurity=False).hexdigest()[:10]
+     stem = path.stem.replace(" ", "_")
+     if len(stem) > 80:
+         stem = stem[:80]
+     return f"{stem}_{digest}"
+
+
+ def auto_workers(mem_per_worker_gib: float) -> int:
+     cpu = os.cpu_count() or 1
+     mem_available_gib = 0.0
+     meminfo = Path("/proc/meminfo")
+     if meminfo.exists():
+         for line in meminfo.read_text().splitlines():
+             if line.startswith("MemAvailable:"):
+                 kb = int(line.split()[1])
+                 mem_available_gib = kb / (1024 * 1024)
+                 break
+     # Fast profile for HPC nodes: higher core utilization.
+     by_cpu = max(1, int(cpu * 0.75))
+     by_mem = max(1, int(mem_available_gib // max(1.0, mem_per_worker_gib)))
+     return max(1, min(by_cpu, by_mem))
+
+
+ def summarize_metadata(df: pd.DataFrame, max_cols: int, max_categories: int) -> dict[str, dict]:
+     if df.empty:
+         return {}
+
+     preferred = ["cell_type", "assay", "tissue", "disease", "sex", "donor_id"]
+     selected: list[str] = [c for c in preferred if c in df.columns]
+     for col in df.columns:
+         if col not in selected:
+             selected.append(col)
+         if len(selected) >= max_cols:
+             break
+
+     out: dict[str, dict] = {}
+     n_rows = max(1, len(df))
+     for col in selected:
+         s = df[col]
+         summary = {
+             "dtype": str(s.dtype),
+             "missing_fraction": float(s.isna().sum()) / n_rows,
+         }
+         if isinstance(s.dtype, pd.CategoricalDtype):
+             summary["n_unique"] = int(len(s.cat.categories))
+             vc = s.value_counts(dropna=False).head(max_categories)
+             summary["top_values"] = {str(k): int(v) for k, v in vc.items()}
+         elif pd.api.types.is_string_dtype(s.dtype) or s.dtype == object:
+             sample = s.dropna().astype(str).head(200_000)
+             summary["n_unique_sample"] = int(sample.nunique())
+             vc = sample.value_counts(dropna=False).head(max_categories)
+             summary["top_values_sample"] = {str(k): int(v) for k, v in vc.items()}
+         out[col] = summary
+     return out
+
+
+ def extract_schema(df: pd.DataFrame) -> dict[str, object]:
+     return {
+         "n_columns": int(len(df.columns)),
+         "columns": [str(c) for c in df.columns],
+         "dtypes": {str(c): str(df[c].dtype) for c in df.columns},
+     }
+
+
+ def process_dataset(path: Path, chunk_size: int, max_meta_cols: int, max_categories: int) -> dict:
+     t0 = time.time()
+     row: dict[str, object] = {
+         "dataset_path": str(path),
+         "dataset_file": path.name,
+         "file_size_gib": round(path.stat().st_size / (1024**3), 4),
+     }
+
+     adata = ad.read_h5ad(path, backed="r")
+     try:
+         n_obs = int(adata.n_obs)
+         n_vars = int(adata.n_vars)
+         total_entries = n_obs * n_vars
+
+         row.update(
+             {
+                 "n_obs": n_obs,
+                 "n_vars": n_vars,
+                 "obs_columns": int(len(adata.obs.columns)),
+                 "var_columns": int(len(adata.var.columns)),
+                 "layers_count": int(len(adata.layers.keys())),
+                 "obsm_count": int(len(adata.obsm.keys())),
+                 "varm_count": int(len(adata.varm.keys())),
+             }
+         )
+         row["obs_schema"] = extract_schema(adata.obs)
+         row["var_schema"] = extract_schema(adata.var)
+
+         nnz_total = 0
+         x_sum = 0.0
+         x_sum_sq = 0.0
+         cell_sum_stats = RunningStats()
+         cell_nnz_stats = RunningStats()
+         cell_sum_sample = ReservoirSampler(k=200_000, seed=17)
+         cell_nnz_sample = ReservoirSampler(k=200_000, seed=23)
+
+         for chunk, start, end in adata.chunked_X(chunk_size):
+             if sparse.issparse(chunk):
+                 nnz = int(chunk.nnz)
+                 csr = chunk if sparse.isspmatrix_csr(chunk) else chunk.tocsr(copy=False)
+                 data = csr.data.astype(np.float64, copy=False)
+                 part_sum = float(data.sum())
+                 part_sum_sq = float(np.square(data).sum())
+                 row_sums = np.asarray(csr.sum(axis=1)).ravel()
+                 row_nnz = np.diff(csr.indptr)
+             else:
+                 arr = np.asarray(chunk)
+                 arr64 = arr.astype(np.float64, copy=False)
+                 nnz = int(np.count_nonzero(arr64))
+                 part_sum = float(arr64.sum())
+                 part_sum_sq = float(np.square(arr64).sum())
+                 row_sums = np.sum(arr64, axis=1)
+                 row_nnz = np.count_nonzero(arr64, axis=1)
+
+             nnz_total += nnz
+             x_sum += part_sum
+             x_sum_sq += part_sum_sq
+
+             cell_sum_stats.update(row_sums)
+             cell_nnz_stats.update(row_nnz)
+             cell_sum_sample.update(row_sums)
+             cell_nnz_sample.update(row_nnz)
+
+         row["nnz"] = int(nnz_total)
+         row["sparsity"] = float(1.0 - (nnz_total / total_entries)) if total_entries else None
+         row["x_mean"] = float(x_sum / total_entries) if total_entries else None
+         if total_entries:
+             var = max(0.0, x_sum_sq / total_entries - (x_sum / total_entries) ** 2)
+             row["x_std"] = float(math.sqrt(var))
+         else:
+             row["x_std"] = None
+
+         cell_sum_quantiles = cell_sum_sample.quantiles([0.05, 0.5, 0.95])
+         cell_nnz_quantiles = cell_nnz_sample.quantiles([0.05, 0.5, 0.95])
+         for key, value in cell_sum_stats.finalize().items():
+             row[f"cell_sum_{key}"] = value
+         for key, value in cell_nnz_stats.finalize().items():
+             row[f"cell_nnz_{key}"] = value
+         for key, value in cell_sum_quantiles.items():
+             row[f"cell_sum_{key}_approx"] = value
+         for key, value in cell_nnz_quantiles.items():
+             row[f"cell_nnz_{key}_approx"] = value
+
+         row["metadata_obs_summary"] = summarize_metadata(
+             adata.obs, max_cols=max_meta_cols, max_categories=max_categories
+         )
+         row["metadata_var_summary"] = summarize_metadata(
+             adata.var, max_cols=max_meta_cols, max_categories=max_categories
+         )
+
+         row["status"] = "ok"
+     finally:
+         adata.file.close()
+
+     row["elapsed_sec"] = round(time.time() - t0, 2)
+     return row
+
+
+ def discover_h5ad(input_dirs: list[Path]) -> list[Path]:
+     files: list[Path] = []
+     for root in input_dirs:
+         if root.exists():
+             files.extend(sorted(root.rglob("*.h5ad")))
+     files = sorted(set(files))
+     return files
+
+
+ def main() -> None:
+     parser = argparse.ArgumentParser(description=__doc__)
+     parser.add_argument(
+         "--input-dir",
+         action="append",
+         default=[],
+         help="Input folder(s) containing .h5ad files. Can be repeated.",
+     )
+     parser.add_argument(
+         "--output-dir",
+         type=Path,
+         default=Path("whats2000_work/cell_x_gene_visualization/output/eda"),
+     )
+     parser.add_argument("--workers", type=int, default=0, help="0 means auto.")
+     parser.add_argument("--chunk-size", type=int, default=4096)
+     parser.add_argument("--mem-per-worker-gib", type=float, default=8.0)
+     parser.add_argument("--num-shards", type=int, default=1)
+     parser.add_argument("--shard-index", type=int, default=0)
+     parser.add_argument("--max-meta-cols", type=int, default=20)
+     parser.add_argument("--max-categories", type=int, default=8)
+     parser.add_argument(
+         "--log-each-dataset",
+         action="store_true",
+         help="Print per-dataset success logs. Default is off for clean tqdm output.",
+     )
+     args = parser.parse_args()
+
+     if not args.input_dir:
+         args.input_dir = [
+             "/project/GOV108018/cell_x_gene/homo_sapiens/h5ad",
+             "/project/GOV108018/cell_x_gene/mus_musculus/h5ad",
+         ]
+
+     roots = [Path(p) for p in args.input_dir]
+     all_files = discover_h5ad(roots)
+     if not all_files:
+         raise SystemExit("No .h5ad files found in input directories.")
+
+     if args.num_shards < 1:
+         raise SystemExit("--num-shards must be >= 1")
+     if args.shard_index < 0 or args.shard_index >= args.num_shards:
+         raise SystemExit("--shard-index must satisfy 0 <= shard-index < num-shards")
+
+     shard_files = [p for i, p in enumerate(all_files) if i % args.num_shards == args.shard_index]
+     if not shard_files:
+         raise SystemExit("No files assigned to this shard.")
+
+     workers = args.workers if args.workers > 0 else auto_workers(args.mem_per_worker_gib)
+     workers = min(workers, len(shard_files))
+
+     args.output_dir.mkdir(parents=True, exist_ok=True)
+     per_dataset_dir = args.output_dir / "per_dataset"
+     per_dataset_dir.mkdir(parents=True, exist_ok=True)
+
+     manifest_path = args.output_dir / f"manifest_shard_{args.shard_index:03d}_of_{args.num_shards:03d}.txt"
+     manifest_path.write_text("\n".join(str(x) for x in shard_files) + "\n")
+
+     summary_rows: list[dict] = []
+     failures: list[dict] = []
+
+     print(
+         json.dumps(
+             {
+                 "total_files": len(all_files),
+                 "files_in_shard": len(shard_files),
+                 "workers": workers,
+                 "chunk_size": args.chunk_size,
+                 "num_shards": args.num_shards,
+                 "shard_index": args.shard_index,
+             }
+         )
+     )
+
+     with concurrent.futures.ProcessPoolExecutor(max_workers=workers) as ex:
+         futures = {
+             ex.submit(
+                 process_dataset,
+                 path,
+                 args.chunk_size,
+                 args.max_meta_cols,
+                 args.max_categories,
+             ): path
+             for path in shard_files
+         }
+         with tqdm(total=len(futures), desc="Datasets", unit="dataset") as pbar:
+             for fut in concurrent.futures.as_completed(futures):
+                 path = futures[fut]
+                 try:
+                     row = fut.result()
+                     summary_rows.append(row)
+                     payload_name = safe_name(path) + ".json"
+                     (per_dataset_dir / payload_name).write_text(json.dumps(row, indent=2))
+                     if args.log_each_dataset:
+                         tqdm.write(f"[ok] {path.name} ({row.get('elapsed_sec', 'na')}s)")
+                 except Exception as exc:  # noqa: BLE001
+                     msg = {"dataset_path": str(path), "error": repr(exc), "status": "failed"}
+                     failures.append(msg)
+                     tqdm.write(f"[failed] {path.name}: {exc}")
+                 finally:
+                     pbar.update(1)
+
+     summary_df = pd.DataFrame(summary_rows)
+     summary_csv = args.output_dir / f"eda_summary_shard_{args.shard_index:03d}_of_{args.num_shards:03d}.csv"
+     summary_df.to_csv(summary_csv, index=False)
+
+     failures_path = args.output_dir / f"eda_failures_shard_{args.shard_index:03d}_of_{args.num_shards:03d}.json"
+     failures_path.write_text(json.dumps(failures, indent=2))
+
+     print(
+         json.dumps(
+             {
+                 "summary_csv": str(summary_csv),
+                 "failures_json": str(failures_path),
+                 "ok_count": len(summary_rows),
+                 "failed_count": len(failures),
+             }
+         )
+     )
+
+
+ if __name__ == "__main__":
+     main()
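The one-pass statistics in `RunningStats` reduce each chunk to a count, a running sum, and a running sum of squares; `finalize()` then recovers the population mean and standard deviation from those three numbers. A tiny standalone check of the same formulas (plain Python, not an import of the script):

```python
import math

# Same accumulators as RunningStats: count, running sum, running sum of squares.
values = [1.0, 2.0, 3.0, 4.0]
count = len(values)
sum_value = sum(values)
sum_sq = sum(v * v for v in values)

mean = sum_value / count                      # 10 / 4 = 2.5
var = max(0.0, sum_sq / count - mean * mean)  # 30/4 - 2.5^2 = 1.25 (population variance)
std = math.sqrt(var)
```

Because the accumulators are additive, chunks can be folded in any order, which is what makes the chunked scan over `chunked_X` safe.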
scripts/merge_eda_shards.py ADDED
@@ -0,0 +1,72 @@
+ #!/usr/bin/env python3
+ """Merge shard-level EDA outputs into one master summary."""
+
+ from __future__ import annotations
+
+ import argparse
+ import json
+ from pathlib import Path
+
+ import pandas as pd
+
+
+ def main() -> None:
+     parser = argparse.ArgumentParser(description=__doc__)
+     parser.add_argument(
+         "--output-dir",
+         type=Path,
+         default=Path("whats2000_work/cell_x_gene_visualization/output/eda"),
+     )
+     args = parser.parse_args()
+
+     csv_files = sorted(args.output_dir.glob("eda_summary_shard_*_of_*.csv"))
+     fail_files = sorted(args.output_dir.glob("eda_failures_shard_*_of_*.json"))
+
+     if not csv_files:
+         raise SystemExit(f"No shard summary files found in {args.output_dir}")
+
+     merged = pd.concat([pd.read_csv(p) for p in csv_files], ignore_index=True)
+     merged = merged.sort_values(by=["dataset_path"]).drop_duplicates(subset=["dataset_path"], keep="first")
+
+     merged_csv = args.output_dir / "eda_summary_all_shards.csv"
+     merged.to_csv(merged_csv, index=False)
+
+     failures = []
+     for p in fail_files:
+         failures.extend(json.loads(p.read_text()))
+
+     dedup_failures = {}
+     for item in failures:
+         dedup_failures[item["dataset_path"]] = item
+     merged_failures = list(dedup_failures.values())
+
+     merged_failures_path = args.output_dir / "eda_failures_all_shards.json"
+     merged_failures_path.write_text(json.dumps(merged_failures, indent=2))
+
+     max_report_csv = None
+     max_report_json = None
+     if "cell_nnz_max" in merged.columns and not merged["cell_nnz_max"].dropna().empty:
+         max_idx = merged["cell_nnz_max"].astype(float).idxmax()
+         max_row = merged.loc[max_idx].to_dict()
+         max_report_csv = args.output_dir / "max_nonzero_gene_count_all_cells.csv"
+         pd.DataFrame([max_row]).to_csv(max_report_csv, index=False)
+         max_report_json = args.output_dir / "max_nonzero_gene_count_all_cells.json"
+         max_report_json.write_text(json.dumps(max_row, indent=2, default=str))
+
+     print(
+         json.dumps(
+             {
+                 "merged_csv": str(merged_csv),
+                 "merged_failures": str(merged_failures_path),
+                 "max_nonzero_report_csv": str(max_report_csv) if max_report_csv else None,
+                 "max_nonzero_report_json": str(max_report_json) if max_report_json else None,
+                 "rows": int(len(merged)),
+                 "failures": int(len(merged_failures)),
+             },
+             indent=2,
+         )
+     )
+
+
+ if __name__ == "__main__":
+     main()
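The merge script's dedup step (`sort_values` followed by `drop_duplicates(keep="first")`) makes the surviving row deterministic when two shards happen to report the same dataset. A toy illustration with made-up paths and values:

```python
import pandas as pd

# Two shards that both processed dataset "b" (e.g. after a rerun).
shard0 = pd.DataFrame({"dataset_path": ["a", "b"], "cell_nnz_max": [10, 20]})
shard1 = pd.DataFrame({"dataset_path": ["b", "c"], "cell_nnz_max": [20, 30]})

merged = pd.concat([shard0, shard1], ignore_index=True)
# Stable sort + keep="first" -> one row per dataset_path, deterministically.
merged = merged.sort_values(by=["dataset_path"]).drop_duplicates(
    subset=["dataset_path"], keep="first"
)
paths = merged["dataset_path"].tolist()
```

Since `sort_values` is stable for equal keys, ties are broken by concat order, i.e. by shard file order.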
scripts/resource_probe.py ADDED
@@ -0,0 +1,91 @@
+ #!/usr/bin/env python3
+ """Probe local HPC resources and suggest safe EDA concurrency settings."""
+
+ from __future__ import annotations
+
+ import argparse
+ import json
+ import os
+ import platform
+ import shutil
+ from pathlib import Path
+
+
+ def _mem_available_gib() -> float:
+     meminfo = Path("/proc/meminfo")
+     if not meminfo.exists():
+         return 0.0
+     for line in meminfo.read_text().splitlines():
+         if line.startswith("MemAvailable:"):
+             kb = int(line.split()[1])
+             return kb / (1024 * 1024)
+     return 0.0
+
+
+ def _mem_total_gib() -> float:
+     meminfo = Path("/proc/meminfo")
+     if not meminfo.exists():
+         return 0.0
+     for line in meminfo.read_text().splitlines():
+         if line.startswith("MemTotal:"):
+             kb = int(line.split()[1])
+             return kb / (1024 * 1024)
+     return 0.0
+
+
+ def _recommend_workers(cpu_count: int, mem_available_gib: float, mem_per_worker_gib: float) -> int:
+     # Fast profile for HPC: use more cores while still leaving headroom.
+     by_cpu = max(1, int(cpu_count * 0.75))
+     by_mem = max(1, int(mem_available_gib // max(1.0, mem_per_worker_gib)))
+     return max(1, min(by_cpu, by_mem))
+
+
+ def main() -> None:
+     parser = argparse.ArgumentParser(description=__doc__)
+     parser.add_argument(
+         "--workdir",
+         type=Path,
+         default=Path("/project/GOV108018"),
+         help="Path to check disk usage for.",
+     )
+     parser.add_argument(
+         "--mem-per-worker-gib",
+         type=float,
+         default=8.0,
+         help="Memory budget per EDA worker to compute a safe recommendation.",
+     )
+     args = parser.parse_args()
+
+     cpu_count = os.cpu_count() or 1
+     mem_total_gib = _mem_total_gib()
+     mem_available_gib = _mem_available_gib()
+     disk_total, disk_used, disk_free = shutil.disk_usage(args.workdir)
+
+     recommended_workers = _recommend_workers(
+         cpu_count=cpu_count,
+         mem_available_gib=mem_available_gib,
+         mem_per_worker_gib=args.mem_per_worker_gib,
+     )
+     recommended_shards = max(1, min(8, cpu_count // max(1, recommended_workers)))
+
+     report = {
+         "hostname": platform.node(),
+         "platform": platform.platform(),
+         "cpu_count": cpu_count,
+         "memory_total_gib": round(mem_total_gib, 2),
+         "memory_available_gib": round(mem_available_gib, 2),
+         "disk_total_gib": round(disk_total / (1024**3), 2),
+         "disk_used_gib": round(disk_used / (1024**3), 2),
+         "disk_free_gib": round(disk_free / (1024**3), 2),
+         "assumptions": {"mem_per_worker_gib": args.mem_per_worker_gib},
+         "recommendation": {
+             "workers_per_node": recommended_workers,
+             "num_shards_suggestion": recommended_shards,
+             "chunk_size_suggestion": 4096,
+         },
+     }
+     print(json.dumps(report, indent=2))
+
+
+ if __name__ == "__main__":
+     main()
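The recommendation rule in `_recommend_workers` is the minimum of a 75% CPU cap and a memory budget, and it can be traced by hand for a typical node (the node sizes below are made up for illustration):

```python
def recommend_workers(cpu_count: int, mem_available_gib: float, mem_per_worker_gib: float) -> int:
    # Mirror of _recommend_workers: cap by 75% of cores and by the memory budget.
    by_cpu = max(1, int(cpu_count * 0.75))
    by_mem = max(1, int(mem_available_gib // max(1.0, mem_per_worker_gib)))
    return max(1, min(by_cpu, by_mem))

# Hypothetical 64-core node with 256 GiB available and an 8 GiB/worker budget:
# by_cpu = 48, by_mem = 32 -> memory is the binding constraint.
workers = recommend_workers(64, 256.0, 8.0)
```

On memory-rich nodes the CPU cap binds instead, which is why the probe reports both the assumption (`mem_per_worker_gib`) and the resulting recommendation.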
uv.lock ADDED
The diff for this file is too large to render.