whats2000 committed on
Commit 856e1ba · 1 Parent(s): 790f422

feat(pipeline): add YAML config, metadata-aware scheduling, and dataset slicing


- Add YAML-based configuration system with resource-aware settings
- Implement metadata pre-scan for intelligent dataset categorization
- Add automatic dataset slicing for large files (>75B entries)
- Enable parallel processing with priority ordering across all dataset sizes
- Create pipeline launcher scripts for single-command execution
- Update README with comprehensive usage guide and configuration examples
- Remove obsolete code and consolidate to single distributed_eda.py

README.md CHANGED
@@ -1,6 +1,6 @@
  # Distributed EDA for Cell x Gene
 
- This folder now includes a memory-safe EDA pipeline for large `.h5ad` files.
 
  All commands below assume your current directory is:
 
@@ -8,96 +8,174 @@ All commands below assume your current directory is:
  cd /project/GOV108018/whats2000_work/cell_x_gene_visualization
  ```
 
- ## 1) Check resources first
 
  ```bash
- uv run python scripts/resource_probe.py
  ```
 
- ## 2) Run EDA (single node, both species by default)
 
  ```bash
- uv run python scripts/distributed_eda.py \
-     --input-dir /project/GOV108018/cell_x_gene/homo_sapiens/h5ad \
-     --input-dir /project/GOV108018/cell_x_gene/mus_musculus/h5ad \
-     --output-dir output/eda \
-     --workers 32 \
-     --chunk-size 8192
  ```
 
- Default output is clean `tqdm` progress only. If you want per-file logs:
 
  ```bash
- uv run python scripts/distributed_eda.py \
-     --input-dir /project/GOV108018/cell_x_gene/homo_sapiens/h5ad \
-     --input-dir /project/GOV108018/cell_x_gene/mus_musculus/h5ad \
-     --output-dir output/eda \
-     --workers 32 \
-     --chunk-size 8192 \
-     --log-each-dataset
  ```
 
- If memory pressure appears, fallback to:
 
  ```bash
- uv run python scripts/distributed_eda.py \
-     --input-dir /project/GOV108018/cell_x_gene/homo_sapiens/h5ad \
-     --input-dir /project/GOV108018/cell_x_gene/mus_musculus/h5ad \
-     --output-dir output/eda \
-     --workers 24 \
-     --chunk-size 4096
  ```
 
- Default input directories in the script are absolute:
- - `/project/GOV108018/cell_x_gene/homo_sapiens/h5ad`
- - `/project/GOV108018/cell_x_gene/mus_musculus/h5ad`
 
- ## 3) Run EDA as distributed shards (multiple jobs)
 
- Example for 4 shards:
 
  ```bash
- # job 0
- uv run python scripts/distributed_eda.py --input-dir /project/GOV108018/cell_x_gene/homo_sapiens/h5ad --input-dir /project/GOV108018/cell_x_gene/mus_musculus/h5ad --num-shards 4 --shard-index 0
- # job 1
- uv run python scripts/distributed_eda.py --input-dir /project/GOV108018/cell_x_gene/homo_sapiens/h5ad --input-dir /project/GOV108018/cell_x_gene/mus_musculus/h5ad --num-shards 4 --shard-index 1
- # job 2
- uv run python scripts/distributed_eda.py --input-dir /project/GOV108018/cell_x_gene/homo_sapiens/h5ad --input-dir /project/GOV108018/cell_x_gene/mus_musculus/h5ad --num-shards 4 --shard-index 2
- # job 3
- uv run python scripts/distributed_eda.py --input-dir /project/GOV108018/cell_x_gene/homo_sapiens/h5ad --input-dir /project/GOV108018/cell_x_gene/mus_musculus/h5ad --num-shards 4 --shard-index 3
  ```
 
- Then merge:
 
  ```bash
- uv run python scripts/merge_eda_shards.py
  ```
 
- ## 4) Report only global max non-zero gene count
 
- After merge, the script automatically writes:
 
- - `output/eda/max_nonzero_gene_count_all_cells.csv`
- - `output/eda/max_nonzero_gene_count_all_cells.json`
 
- These files contain the single dataset row with the highest `cell_nnz_max` (max non-zero genes in any one cell).
 
- ## 5) Visualization notebook
 
- Open and run:
 
- - `notebooks/max_nonzero_gene_report.ipynb`
 
- Or launch Jupyter with `uv run`:
 
- ```bash
- uv run jupyter lab
  ```
 
- The notebook:
- - shows the global max row,
- - plots top datasets by `cell_nnz_max`,
- - plots the distribution of `cell_nnz_max`.
 
  ## Outputs
 
@@ -107,19 +185,68 @@ The notebook:
  - `output/eda/eda_failures_shard_XXX_of_YYY.json`
  - Per dataset JSON details:
    - `output/eda/per_dataset/*.json`
- - Merged summary:
    - `output/eda/eda_summary_all_shards.csv`
- - Global max-only report:
    - `output/eda/max_nonzero_gene_count_all_cells.csv`
    - `output/eda/max_nonzero_gene_count_all_cells.json`
 
- ## Notes on large data safety
 
- - Uses `anndata.read_h5ad(..., backed="r")` so matrices are not fully loaded.
- - Scans expression matrix in chunks with `chunked_X`.
- - Uses process-level parallelism with configurable worker count.
- - Includes shard mode for cross-job distribution on HPC queues.
- - Shows a simple dataset-level `tqdm` progress bar during processing.
- - Per-dataset JSON now includes explicit schema blocks:
-   - `obs_schema` with all obs column names and dtypes
-   - `var_schema` with all var column names and dtypes
  # Distributed EDA for Cell x Gene
 
+ This folder includes a metadata-aware EDA pipeline for large `.h5ad` files with YAML-based configuration.
 
  All commands below assume your current directory is:
 
  cd /project/GOV108018/whats2000_work/cell_x_gene_visualization
  ```
 
+ ## Quick Start
+
+ ### 1) Configure pipeline
+
+ Use the optimized config (auto-generated for your system: 394 GB RAM, 56 cores):
 
  ```bash
+ cat configs/eda_optimized.yaml
  ```
 
+ Or create your own based on the template:
 
  ```bash
+ cp configs/eda_config_template.yaml configs/my_config.yaml
+ # Edit my_config.yaml with your paths and resource limits
  ```
 
+ ### 2) Build metadata cache
+
+ Pre-scan all datasets to determine sizes and enable intelligent scheduling:
 
  ```bash
+ uv run python scripts/build_metadata_cache.py --config configs/eda_optimized.yaml
  ```
 
+ This creates `output/cache/enhanced_metadata.parquet` with:
+ - Dataset dimensions (n_obs × n_vars)
+ - File sizes
+ - Size categories (small/medium/large/xlarge)
+ - Estimated memory requirements
+
+ The cache is incremental: only new or changed files are rescanned. Use `--force-rescan` to rebuild it from scratch.
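As a rough illustration (not the pipeline's actual code, which lives in `scripts/build_metadata_cache.py`), the incremental behavior amounts to comparing the paths already in the cache against the files on disk:

```python
from pathlib import Path


def files_needing_scan(
    all_files: list[Path], cached_paths: set[str], force_rescan: bool = False
) -> list[Path]:
    """Return files absent from the metadata cache (or everything when forcing)."""
    if force_rescan:
        return list(all_files)
    return [f for f in all_files if str(f) not in cached_paths]


files = [Path("a.h5ad"), Path("b.h5ad"), Path("c.h5ad")]
cached = {"a.h5ad"}  # already scanned in a previous run
print(files_needing_scan(files, cached))  # only b.h5ad and c.h5ad are rescanned
```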
+
+ ### 3) Run EDA pipeline
+
+ Single command to run everything:
 
  ```bash
+ uv run python scripts/run_eda_pipeline.py --config configs/eda_optimized.yaml
  ```
 
+ Or run steps individually:
 
+ ```bash
+ # Step 1: Build metadata (if not done)
+ uv run python scripts/run_eda_pipeline.py --config configs/eda_optimized.yaml --step metadata
+
+ # Step 2: Run EDA
+ uv run python scripts/run_eda_pipeline.py --config configs/eda_optimized.yaml --step eda
+
+ # Step 3: Merge shards (if using sharding)
+ uv run python scripts/run_eda_pipeline.py --config configs/eda_optimized.yaml --step merge
+ ```
+
+ ### 4) Direct script usage
 
+ For more control:
 
  ```bash
+ # Build metadata cache
+ uv run python scripts/build_metadata_cache.py --config configs/eda_optimized.yaml
+
+ # Run EDA with all workers
+ uv run python scripts/distributed_eda.py --config configs/eda_optimized.yaml
+
+ # Override worker count
+ uv run python scripts/distributed_eda.py --config configs/eda_optimized.yaml --force-workers 32
  ```
 
+ ## Distributed Processing (SLURM)
+
+ For multi-node HPC clusters, use array jobs:
 
  ```bash
+ # Submit 4 parallel jobs
+ sbatch --array=0-3 scripts/run_eda_slurm.sh configs/eda_optimized.yaml 4
+
+ # After all jobs complete, merge results
+ uv run python scripts/merge_eda_shards.py --output-dir output/eda
+ ```
+
+ Or configure sharding in YAML:
+
+ ```yaml
+ sharding:
+   enabled: true
+   num_shards: 4
+   shard_index: 0             # Override with --shard-index on the command line
+   strategy: "size_balanced"  # Distribute by size for load balancing
+ ```
+
+ Then run each shard:
 
+ ```bash
+ uv run python scripts/distributed_eda.py --config configs/eda_optimized.yaml --num-shards 4 --shard-index 0
+ uv run python scripts/distributed_eda.py --config configs/eda_optimized.yaml --num-shards 4 --shard-index 1
+ # ... etc
+ ```
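One plausible reading of the `size_balanced` strategy is a greedy longest-processing-time assignment: take datasets largest first and place each one on the currently lightest shard. This is an editor's sketch, not the pipeline's verified implementation:

```python
def size_balanced_shards(sizes: dict[str, int], num_shards: int) -> list[list[str]]:
    """Greedy LPT assignment: largest dataset first, onto the lightest shard so far."""
    shards: list[list[str]] = [[] for _ in range(num_shards)]
    loads = [0] * num_shards
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        i = loads.index(min(loads))  # index of the lightest shard
        shards[i].append(name)
        loads[i] += size
    return shards


# Hypothetical dataset sizes (total_entries, arbitrary units)
datasets = {"a": 90, "b": 60, "c": 50, "d": 40, "e": 10}
print(size_balanced_shards(datasets, 2))  # [['a', 'd'], ['b', 'c', 'e']]
```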
 
+ ## Configuration Guide
+
+ ### Resource Management
+
+ The pipeline respects your resource limits and adapts the processing strategy to dataset size:
+
+ ```yaml
+ resources:
+   max_memory_gib: 240      # Total memory available
+   max_workers: 42          # Maximum parallel workers
+   mem_per_worker_gib: 5.5  # Memory per worker
+   chunk_size: 12288        # Matrix chunk size
+
+ dataset_thresholds:
+   small: 2_000_000_000          # < 2B entries
+   medium: 15_000_000_000        # < 15B entries
+   large: 75_000_000_000         # < 75B entries
+   max_entries: 200_000_000_000  # Reject larger datasets
+
+ strategy:
+   small:
+     workers_fraction: 1.0       # Use all workers
+     chunk_size_multiplier: 1.0
+     priority: 1                 # Process first
+
+   large:
+     workers_fraction: 0.4       # Fewer workers
+     chunk_size_multiplier: 0.6
+     priority: 3
+     require_slicing: true       # Slice into chunks
+ ```
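For illustration, here is how thresholds and per-category fractions could translate into effective settings (the function names are hypothetical, not the pipeline's actual API):

```python
def categorize(total_entries: int, thresholds: dict[str, int]) -> str:
    """Map n_obs * n_vars to a size category using the configured thresholds."""
    if total_entries < thresholds["small"]:
        return "small"
    if total_entries < thresholds["medium"]:
        return "medium"
    if total_entries < thresholds["large"]:
        return "large"
    return "xlarge"


def effective_settings(
    category: str, strategy: dict, max_workers: int, chunk_size: int
) -> tuple[int, int]:
    """Scale the worker pool and chunk size by the category's fraction/multiplier."""
    s = strategy[category]
    workers = max(1, int(max_workers * s["workers_fraction"]))
    chunk = max(1, int(chunk_size * s["chunk_size_multiplier"]))
    return workers, chunk


thresholds = {"small": 2_000_000_000, "medium": 15_000_000_000, "large": 75_000_000_000}
strategy = {
    "small": {"workers_fraction": 1.0, "chunk_size_multiplier": 1.0},
    "large": {"workers_fraction": 0.4, "chunk_size_multiplier": 0.6},
}
cat = categorize(1_000_000 * 30_000, thresholds)  # 30B entries -> "large"
print(cat, effective_settings(cat, strategy, max_workers=42, chunk_size=12288))
```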
 
+ ### Dataset Slicing
 
+ Large datasets are automatically sliced to respect memory limits:
 
+ ```yaml
+ slicing:
+   enabled: true
+   obs_slice_size: 75000      # Process 75k cells at a time
+   overlap: 0
+   merge_strategy: "combine"  # Combine slice statistics
+ ```
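With `overlap: 0`, slicing reduces to splitting the cell axis into contiguous row ranges, which can be sketched as:

```python
def obs_slices(n_obs: int, slice_size: int) -> list[tuple[int, int]]:
    """Split [0, n_obs) into contiguous row ranges of at most slice_size cells."""
    return [(start, min(start + slice_size, n_obs)) for start in range(0, n_obs, slice_size)]


# 200k cells at 75k per slice -> two full slices plus a 50k remainder
print(obs_slices(200_000, 75_000))
# [(0, 75000), (75000, 150000), (150000, 200000)]
```

Each range can then be read independently in backed mode, so no slice ever exceeds the per-worker memory budget.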
 
+ ### Metadata Integration
 
+ Point to the CELLxGENE metadata CSVs for enhanced context:
 
+ ```yaml
+ paths:
+   metadata_csvs:
+     - /project/GOV108018/cell_x_gene/metadata/dataset_metadata_homo_sapiens.csv
+     - /project/GOV108018/cell_x_gene/metadata/dataset_metadata_mus_musculus.csv
+   enhanced_metadata_cache: output/cache/enhanced_metadata.parquet
  ```
 
+ The pipeline merges this with quick-scanned dimensions for intelligent scheduling.
+
+ ## Processing Strategy
+
+ The pipeline uses **parallel processing with priority ordering**:
+
+ 1. **Pre-scan phase**: a quick metadata scan (no matrix loading) categorizes datasets by size
+ 2. **Parallel execution**: all datasets are processed in parallel using the full worker pool
+ 3. **Smart ordering**: small datasets (priority 1) start first for quick wins
+ 4. **Automatic slicing**: large datasets are split into memory-safe chunks
+ 5. **Resource-aware**: strategies adapt chunk sizes to the dataset category
+
+ This approach keeps all available cores busy throughout the pipeline.
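The smart-ordering step above amounts to a simple sort key: category priority first, then size. A minimal sketch with hypothetical dataset records (not the pipeline's actual data structures):

```python
# Hypothetical records as a pre-scan might produce them
datasets = [
    {"id": "x", "size_category": "large", "total_entries": 80},
    {"id": "y", "size_category": "small", "total_entries": 5},
    {"id": "z", "size_category": "medium", "total_entries": 20},
]
priority = {"small": 1, "medium": 2, "large": 3}

# Submit small datasets first; break ties by ascending size
ordered = sorted(datasets, key=lambda d: (priority[d["size_category"]], d["total_entries"]))
print([d["id"] for d in ordered])  # ['y', 'z', 'x']
```

Submitting futures in this order means the quick wins finish early while the long-running large datasets occupy the pool's tail.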
 
  ## Outputs
 
  - `output/eda/eda_failures_shard_XXX_of_YYY.json`
  - Per dataset JSON details:
    - `output/eda/per_dataset/*.json`
+ - Merged summary (after sharding):
    - `output/eda/eda_summary_all_shards.csv`
+ - Global max report:
    - `output/eda/max_nonzero_gene_count_all_cells.csv`
    - `output/eda/max_nonzero_gene_count_all_cells.json`
+ - Metadata cache:
+   - `output/cache/enhanced_metadata.parquet`
+
+ ## Output Schema
+
+ Each dataset result includes:
+
+ - **Dimensions**: n_obs, n_vars, total_entries
+ - **Sparsity**: nnz, sparsity
+ - **Cell statistics**: cell_sum_*, cell_nnz_* (mean/std/min/max/quantiles)
+ - **Matrix statistics**: x_mean, x_std
+ - **Metadata summaries**: obs/var column types and top values
+ - **Schema**: complete column names and dtypes
+ - **Processing info**: size_category, processing_mode (full/sliced), elapsed_sec
+
+ ## Visualization
+
+ Open the notebook:
+
+ ```bash
+ uv run jupyter lab notebooks/max_nonzero_gene_report.ipynb
+ ```
+
+ The notebook provides:
+ - Global max non-zero gene count
+ - Distribution of cell-level statistics
+ - Dataset size analysis
+ - Processing time comparisons
 
+ ## Troubleshooting
 
+ ### Metadata cache not found
+
+ ```bash
+ # Build it first
+ uv run python scripts/build_metadata_cache.py --config configs/eda_optimized.yaml
+ ```
+
+ ### Memory errors
+
+ Reduce workers and chunk size in the config:
+
+ ```yaml
+ resources:
+   max_workers: 24
+   chunk_size: 4096
+ slicing:
+   obs_slice_size: 50000
+ ```
+
+ ### Dataset too large
+
+ Adjust thresholds or enable more aggressive slicing:
+
+ ```yaml
+ dataset_thresholds:
+   max_entries: 50_000_000_000  # Lower limit
+ slicing:
+   obs_slice_size: 30000        # Smaller slices
+ ```
configs/eda_config_template.yaml ADDED
@@ -0,0 +1,87 @@
+ # EDA Pipeline Configuration Template
+ # This file defines resource limits, dataset filtering, and processing strategies
+
+ # Resource Limits
+ resources:
+   max_memory_gib: 256      # Total memory available
+   max_workers: 32          # Maximum concurrent workers
+   mem_per_worker_gib: 8.0  # Memory per worker process
+   chunk_size: 8192         # Chunk size for reading the X matrix
+
+ # Input/Output Paths
+ paths:
+   input_dirs:
+     - /project/GOV108018/cell_x_gene/homo_sapiens/h5ad
+     - /project/GOV108018/cell_x_gene/mus_musculus/h5ad
+   output_dir: output/eda
+   cache_dir: output/cache  # Stores the metadata cache
+
+   # Dataset metadata for intelligent scheduling
+   # These CSVs contain dataset_h5ad_path and dataset_total_cell_count
+   metadata_csvs:
+     - /project/GOV108018/cell_x_gene/metadata/dataset_metadata_homo_sapiens.csv
+     - /project/GOV108018/cell_x_gene/metadata/dataset_metadata_mus_musculus.csv
+
+   # Enhanced metadata cache (n_obs, n_vars, file_size) built by the pre-scan
+   enhanced_metadata_cache: output/cache/enhanced_metadata.parquet
+
+ # Dataset Size Thresholds (based on n_obs * n_vars)
+ # Categorize datasets to apply different processing strategies
+ dataset_thresholds:
+   small: 1_000_000_000          # < 1B entries: process normally
+   medium: 10_000_000_000        # < 10B entries: reduce workers
+   large: 50_000_000_000         # < 50B entries: slice into chunks
+   max_entries: 100_000_000_000  # > 100B entries: skip or special handling
+
+ # Slicing Strategy for Large Datasets
+ slicing:
+   enabled: true
+   obs_slice_size: 50000      # Process 50k cells at a time for large datasets
+   overlap: 0                 # No overlap between slices
+   merge_strategy: "combine"  # How to combine stats from slices
+
+ # Processing Strategy by Dataset Size
+ strategy:
+   small:
+     workers_fraction: 1.0  # Use full worker pool
+     chunk_size_multiplier: 1.0
+     priority: 1            # Process first (fastest)
+
+   medium:
+     workers_fraction: 0.5  # Reduce workers to save memory
+     chunk_size_multiplier: 0.5
+     priority: 2
+
+   large:
+     workers_fraction: 0.25  # Minimal workers, use slicing
+     chunk_size_multiplier: 0.25
+     priority: 3
+     require_slicing: true
+
+ # Sharding Configuration (for distributed jobs)
+ sharding:
+   enabled: false
+   num_shards: 1
+   shard_index: 0
+   strategy: "round_robin"  # or "size_balanced"
+
+ # Metadata Extraction Settings
+ metadata:
+   max_meta_cols: 20
+   max_categories: 8
+   extract_schemas: true
+
+ # Behavior Flags
+ behavior:
+   log_each_dataset: false  # Clean tqdm output
+   skip_failures: true      # Continue on errors
+   save_per_dataset_json: true
+   pre_scan_enabled: true   # Scan metadata before processing
+   cache_metadata: true     # Cache dataset dimensions
+
+ # Output Options
+ output:
+   summary_csv: true
+   failures_json: true
+   global_max_report: true  # Report with max non-zero gene count
+   per_dataset_details: true
configs/eda_optimized.yaml ADDED
@@ -0,0 +1,78 @@
+ # Optimized EDA Configuration for Current System
+ # System specs: 394 GB RAM, 56 cores, ~250 GB available
+ # Balanced for maximum speed while respecting resource limits
+
+ resources:
+   max_memory_gib: 240      # Leave ~10 GB buffer for the system
+   max_workers: 42          # 75% of cores for stability
+   mem_per_worker_gib: 5.5  # ~231 GB total worker memory
+   chunk_size: 12288        # Good balance for large matrices
+
+ paths:
+   input_dirs:
+     - /project/GOV108018/cell_x_gene/homo_sapiens/h5ad
+     - /project/GOV108018/cell_x_gene/mus_musculus/h5ad
+   output_dir: output/eda
+   cache_dir: output/cache
+
+   # Dataset metadata for intelligent scheduling
+   # These CSVs contain dataset_h5ad_path and dataset_total_cell_count
+   metadata_csvs:
+     - /project/GOV108018/cell_x_gene/metadata/dataset_metadata_homo_sapiens.csv
+     - /project/GOV108018/cell_x_gene/metadata/dataset_metadata_mus_musculus.csv
+
+   # Enhanced metadata cache (n_obs, n_vars, file_size) built by the pre-scan
+   enhanced_metadata_cache: output/cache/enhanced_metadata.parquet
+
+ dataset_thresholds:
+   small: 2_000_000_000          # < 2B entries: full speed
+   medium: 15_000_000_000        # < 15B entries: moderate
+   large: 75_000_000_000         # < 75B entries: slicing required
+   max_entries: 200_000_000_000  # Max 200B entries
+
+ slicing:
+   enabled: true
+   obs_slice_size: 75000  # 75k cells per slice for large datasets
+   overlap: 0
+   merge_strategy: "combine"
+
+ strategy:
+   small:
+     workers_fraction: 1.0  # Use all 42 workers
+     chunk_size_multiplier: 1.0
+     priority: 1
+
+   medium:
+     workers_fraction: 0.7  # ~30 workers
+     chunk_size_multiplier: 0.85
+     priority: 2
+
+   large:
+     workers_fraction: 0.4  # ~17 workers with slicing
+     chunk_size_multiplier: 0.6
+     priority: 3
+     require_slicing: true
+
+ sharding:
+   enabled: false
+   num_shards: 1
+   shard_index: 0
+   strategy: "size_balanced"
+
+ metadata:
+   max_meta_cols: 20
+   max_categories: 8
+   extract_schemas: true
+
+ behavior:
+   log_each_dataset: false  # Clean tqdm output
+   skip_failures: true
+   save_per_dataset_json: true
+   pre_scan_enabled: true   # Pre-scan to categorize by size
+   cache_metadata: true
+
+ output:
+   summary_csv: true
+   failures_json: true
+   global_max_report: true
+   per_dataset_details: true
scripts/build_metadata_cache.py ADDED
@@ -0,0 +1,237 @@
+ #!/usr/bin/env python3
+ """Pre-scan datasets to build enhanced metadata for intelligent job scheduling."""
+
+ from __future__ import annotations
+
+ import argparse
+ import json
+ import time
+ from pathlib import Path
+ from typing import Any
+
+ import anndata as ad
+ import pandas as pd
+ from tqdm import tqdm
+
+
+ def quick_scan_dataset(h5ad_path: Path) -> dict[str, Any]:
+     """Quickly extract dimensions and size without loading data matrices."""
+     try:
+         t0 = time.time()
+         file_size_bytes = h5ad_path.stat().st_size
+
+         # Open in backed mode: only reads metadata, not matrices
+         adata = ad.read_h5ad(h5ad_path, backed="r")
+         try:
+             n_obs = int(adata.n_obs)
+             n_vars = int(adata.n_vars)
+             total_entries = n_obs * n_vars
+
+             result = {
+                 "dataset_path": str(h5ad_path),
+                 "dataset_file": h5ad_path.name,
+                 "dataset_id": h5ad_path.stem,
+                 "n_obs": n_obs,
+                 "n_vars": n_vars,
+                 "total_entries": total_entries,
+                 "file_size_bytes": file_size_bytes,
+                 "file_size_gib": round(file_size_bytes / (1024**3), 4),
+                 "obs_columns": len(adata.obs.columns),
+                 "var_columns": len(adata.var.columns),
+                 "scan_time_sec": round(time.time() - t0, 3),
+                 "status": "ok",
+             }
+         finally:
+             adata.file.close()
+
+         return result
+     except Exception as e:
+         return {
+             "dataset_path": str(h5ad_path),
+             "dataset_file": h5ad_path.name,
+             "dataset_id": h5ad_path.stem,
+             "error": str(e),
+             "status": "failed",
+         }
+
+
+ def load_cellxgene_metadata(csv_paths: list[Path]) -> pd.DataFrame:
+     """Load and combine CELLxGENE metadata CSVs."""
+     dfs = []
+     for csv_path in csv_paths:
+         if csv_path.exists():
+             df = pd.read_csv(csv_path)
+             dfs.append(df)
+
+     if not dfs:
+         return pd.DataFrame()
+
+     combined = pd.concat(dfs, ignore_index=True)
+     return combined
+
+
+ def build_enhanced_metadata(
+     input_dirs: list[Path],
+     cellxgene_metadata_csvs: list[Path],
+     output_path: Path,
+     force_rescan: bool = False,
+ ) -> pd.DataFrame:
+     """Build enhanced metadata by combining CELLxGENE metadata with quick scans."""
+
+     # Discover all h5ad files
+     all_files = []
+     for root in input_dirs:
+         if root.exists():
+             all_files.extend(root.rglob("*.h5ad"))
+     all_files = sorted(set(all_files))
+
+     if not all_files:
+         raise ValueError("No .h5ad files found in input directories")
+
+     # Load existing enhanced metadata if available
+     existing_metadata = pd.DataFrame()
+     if output_path.exists() and not force_rescan:
+         try:
+             existing_metadata = pd.read_parquet(output_path)
+             print(f"Loaded existing metadata: {len(existing_metadata)} records")
+         except Exception as e:
+             print(f"Could not load existing metadata: {e}")
+
+     # Load CELLxGENE metadata
+     cellxgene_meta = load_cellxgene_metadata(cellxgene_metadata_csvs)
+     print(f"Loaded CELLxGENE metadata: {len(cellxgene_meta)} records")
+
+     # Determine which files need scanning
+     scanned_paths = set(existing_metadata["dataset_path"].values) if not existing_metadata.empty else set()
+     files_to_scan = [f for f in all_files if str(f) not in scanned_paths or force_rescan]
+
+     if not files_to_scan:
+         print("All files already scanned. Use --force-rescan to rescan.")
+         return existing_metadata
+
+     print(f"Scanning {len(files_to_scan)} new/changed datasets...")
+
+     # Quick scan new files
+     scan_results = []
+     for h5ad_path in tqdm(files_to_scan, desc="Quick scan", unit="file"):
+         scan_results.append(quick_scan_dataset(h5ad_path))
+
+     new_scans_df = pd.DataFrame(scan_results)
+
+     # Combine with existing metadata
+     if not existing_metadata.empty:
+         # Remove re-scanned paths from the existing records
+         existing_metadata = existing_metadata[~existing_metadata["dataset_path"].isin(new_scans_df["dataset_path"])]
+         enhanced_df = pd.concat([existing_metadata, new_scans_df], ignore_index=True)
+     else:
+         enhanced_df = new_scans_df
+
+     # Merge with CELLxGENE metadata if available
+     if not cellxgene_meta.empty and "dataset_h5ad_path" in cellxgene_meta.columns:
+         enhanced_df["dataset_h5ad_filename"] = enhanced_df["dataset_file"]
+         cellxgene_meta_subset = cellxgene_meta[["dataset_h5ad_path", "dataset_total_cell_count", "organism", "collection_name", "dataset_title"]].copy()
+         cellxgene_meta_subset = cellxgene_meta_subset.rename(columns={"dataset_h5ad_path": "dataset_h5ad_filename"})
+         enhanced_df = enhanced_df.merge(cellxgene_meta_subset, on="dataset_h5ad_filename", how="left", suffixes=("", "_cellxgene"))
+
+     # Categorize by size
+     def categorize_size(row):
+         if row.get("status") != "ok":
+             return "failed"
+         entries = row.get("total_entries", 0)
+         if entries < 2_000_000_000:
+             return "small"
+         elif entries < 15_000_000_000:
+             return "medium"
+         elif entries < 75_000_000_000:
+             return "large"
+         else:
+             return "xlarge"
+
+     enhanced_df["size_category"] = enhanced_df.apply(categorize_size, axis=1)
+
+     # Add estimated memory requirement (rough)
+     enhanced_df["estimated_mem_gib"] = (enhanced_df["total_entries"] * 4 / (1024**3)).fillna(0).round(2)
+
+     # Save
+     output_path.parent.mkdir(parents=True, exist_ok=True)
+     enhanced_df.to_parquet(output_path, index=False)
+     print(f"Saved enhanced metadata: {output_path}")
+
+     # Print summary
+     print("\nDataset size distribution:")
+     print(enhanced_df["size_category"].value_counts().sort_index())
+
+     return enhanced_df
+
+
+ def main():
+     parser = argparse.ArgumentParser(description=__doc__)
+     parser.add_argument(
+         "--config",
+         type=Path,
+         help="YAML config file with metadata paths",
+     )
+     parser.add_argument(
+         "--input-dir",
+         action="append",
+         default=[],
+         help="Input directory with .h5ad files (can repeat)",
+     )
+     parser.add_argument(
+         "--metadata-csv",
+         action="append",
+         default=[],
+         help="CELLxGENE metadata CSV (can repeat)",
+     )
+     parser.add_argument(
+         "--output",
+         type=Path,
+         default=Path("output/cache/enhanced_metadata.parquet"),
+         help="Output parquet file",
+     )
+     parser.add_argument(
+         "--force-rescan",
+         action="store_true",
+         help="Force rescan of all datasets",
+     )
+     args = parser.parse_args()
+
+     # Load from config if provided
+     if args.config:
+         import yaml
+
+         with open(args.config) as f:
+             config = yaml.safe_load(f)
+
+         input_dirs = [Path(p) for p in config["paths"]["input_dirs"]]
+         metadata_csvs = [Path(p) for p in config["paths"].get("metadata_csvs", [])]
+         output_path = Path(config["paths"].get("enhanced_metadata_cache", args.output))
+     else:
+         if not args.input_dir:
+             args.input_dir = [
+                 "/project/GOV108018/cell_x_gene/homo_sapiens/h5ad",
+                 "/project/GOV108018/cell_x_gene/mus_musculus/h5ad",
+             ]
+         if not args.metadata_csv:
+             args.metadata_csv = [
+                 "/project/GOV108018/cell_x_gene/metadata/dataset_metadata_homo_sapiens.csv",
+                 "/project/GOV108018/cell_x_gene/metadata/dataset_metadata_mus_musculus.csv",
+             ]
+
+         input_dirs = [Path(p) for p in args.input_dir]
+         metadata_csvs = [Path(p) for p in args.metadata_csv]
+         output_path = args.output
+
+     enhanced_df = build_enhanced_metadata(
+         input_dirs=input_dirs,
+         cellxgene_metadata_csvs=metadata_csvs,
+         output_path=output_path,
+         force_rescan=args.force_rescan,
+     )
+
+     print(f"\nTotal datasets: {len(enhanced_df)}")
+     print(f"Successfully scanned: {(enhanced_df['status'] == 'ok').sum()}")
+     print(f"Failed: {(enhanced_df['status'] == 'failed').sum()}")
+
+
+ if __name__ == "__main__":
+     main()
scripts/distributed_eda.py CHANGED
@@ -1,5 +1,5 @@
  #!/usr/bin/env python3
- """Distributed and memory-safe EDA for large Cell x Gene .h5ad datasets."""
 
  from __future__ import annotations
 
@@ -12,11 +12,12 @@ import os
  import time
  from dataclasses import dataclass
  from pathlib import Path
- from typing import Iterable
 
  import anndata as ad
  import numpy as np
  import pandas as pd
  from concurrent.futures.process import BrokenProcessPool
  from scipy import sparse
  from tqdm import tqdm
@@ -90,22 +91,6 @@ def safe_name(path: Path) -> str:
      return f"{stem}_{digest}"
 
 
- def auto_workers(mem_per_worker_gib: float) -> int:
-     cpu = os.cpu_count() or 1
-     mem_available_gib = 0.0
-     meminfo = Path("/proc/meminfo")
-     if meminfo.exists():
-         for line in meminfo.read_text().splitlines():
-             if line.startswith("MemAvailable:"):
-                 kb = int(line.split()[1])
-                 mem_available_gib = kb / (1024 * 1024)
-                 break
-     # Fast profile for HPC nodes: higher core utilization.
-     by_cpu = max(1, int(cpu * 0.75))
-     by_mem = max(1, int(mem_available_gib // max(1.0, mem_per_worker_gib)))
-     return max(1, min(by_cpu, by_mem))
-
-
  def summarize_metadata(df: pd.DataFrame, max_cols: int, max_categories: int) -> dict[str, dict]:
      if df.empty:
          return {}
@@ -147,33 +132,45 @@ def extract_schema(df: pd.DataFrame) -> dict[str, object]:
      }
 
 
- def process_dataset(path: Path, chunk_size: int, max_meta_cols: int, max_categories: int) -> dict:
      t0 = time.time()
      row: dict[str, object] = {
          "dataset_path": str(path),
          "dataset_file": path.name,
-         "file_size_gib": round(path.stat().st_size / (1024**3), 4),
      }
 
      adata = ad.read_h5ad(path, backed="r")
      try:
-         n_obs = int(adata.n_obs)
          n_vars = int(adata.n_vars)
          total_entries = n_obs * n_vars
 
-         row.update(
-             {
-                 "n_obs": n_obs,
-                 "n_vars": n_vars,
-                 "obs_columns": int(len(adata.obs.columns)),
-                 "var_columns": int(len(adata.var.columns)),
-                 "layers_count": int(len(adata.layers.keys())),
-                 "obsm_count": int(len(adata.obsm.keys())),
-                 "varm_count": int(len(adata.varm.keys())),
-             }
-         )
-         row["obs_schema"] = extract_schema(adata.obs)
-         row["var_schema"] = extract_schema(adata.var)
 
          nnz_total = 0
         x_sum = 0.0
@@ -183,7 +180,11 @@ def process_dataset(path: Path, chunk_size: int, max_meta_cols: int, max_categor
      cell_sum_sample = ReservoirSampler(k=200_000, seed=17)
      cell_nnz_sample = ReservoirSampler(k=200_000, seed=23)
 
-     for chunk, start, end in adata.chunked_X(chunk_size):
          if sparse.issparse(chunk):
              nnz = int(chunk.nnz)
              csr = chunk if sparse.isspmatrix_csr(chunk) else chunk.tocsr(copy=False)
@@ -230,12 +231,16 @@ def process_dataset(path: Path, chunk_size: int, max_meta_cols: int, max_categor
      for key, value in cell_nnz_quantiles.items():
          row[f"cell_nnz_{key}_approx"] = value
 
-     row["metadata_obs_summary"] = summarize_metadata(
-         adata.obs, max_cols=max_meta_cols, max_categories=max_categories
-     )
-     row["metadata_var_summary"] = summarize_metadata(
-         adata.var, max_cols=max_meta_cols, max_categories=max_categories
-     )
 
      row["status"] = "ok"
  finally:
@@ -245,223 +250,286 @@ def process_dataset(path: Path, chunk_size: int, max_meta_cols: int, max_categor
      return row
 
 
- def discover_h5ad(input_dirs: list[Path]) -> list[Path]:
-     files: list[Path] = []
-     for root in input_dirs:
-         if root.exists():
-             files.extend(sorted(root.rglob("*.h5ad")))
-     files = sorted(set(files))
-     return files
-
 
- def run_parallel_batch(
-     paths: list[Path],
-     workers: int,
-     chunk_size: int,
-     max_meta_cols: int,
-     max_categories: int,
-     per_dataset_dir: Path,
-     summary_rows: list[dict],
-     failures: list[dict],
-     pbar: tqdm,
-     log_each_dataset: bool,
- ) -> list[Path]:
-     remaining: list[Path] = []
-     finished_paths: set[Path] = set()
-
-     with concurrent.futures.ProcessPoolExecutor(max_workers=workers) as ex:
-         futures = {
-             ex.submit(
-                 process_dataset,
-                 path,
-                 chunk_size,
-                 max_meta_cols,
-                 max_categories,
-             ): path
-             for path in paths
-         }
 
-         for fut in concurrent.futures.as_completed(futures):
-             path = futures[fut]
-             try:
-                 row = fut.result()
-                 summary_rows.append(row)
-                 payload_name = safe_name(path) + ".json"
-                 (per_dataset_dir / payload_name).write_text(json.dumps(row, indent=2))
-                 if log_each_dataset:
-                     tqdm.write(f"[ok] {path.name} ({row.get('elapsed_sec', 'na')}s)")
-                 finished_paths.add(path)
-                 pbar.update(1)
-             except BrokenProcessPool as exc:
-                 msg = (
-                     f"worker pool crashed while handling {path.name}; "
-                     "switching remaining datasets to isolated retries"
-                 )
-                 tqdm.write(f"[pool-broken] {msg}: {exc}")
-                 # Re-run this path and all unfinished paths in isolated mode.
-                 remaining = [path] + [p for p in paths if p not in finished_paths and p != path]
-                 break
-             except Exception as exc:  # noqa: BLE001
-                 msg = {"dataset_path": str(path), "error": repr(exc), "status": "failed"}
-                 failures.append(msg)
-                 finished_paths.add(path)
-                 tqdm.write(f"[failed] {path.name}: {exc}")
-                 pbar.update(1)
-     return remaining
-
-
- def run_isolated_retries(
-     paths: list[Path],
-     chunk_size: int,
-     max_meta_cols: int,
-     max_categories: int,
      per_dataset_dir: Path,
-     summary_rows: list[dict],
-     failures: list[dict],
-     pbar: tqdm,
-     log_each_dataset: bool,
- ) -> None:
-     # One fresh process per dataset avoids "one crash poisons all remaining futures".
-     for path in paths:
-         try:
-             with concurrent.futures.ProcessPoolExecutor(max_workers=1) as ex:
-                 fut = ex.submit(
-                     process_dataset,
-                     path,
-                     chunk_size,
-                     max_meta_cols,
-                     max_categories,
                 )
-                 row = fut.result()
336
- summary_rows.append(row)
337
- payload_name = safe_name(path) + ".json"
338
- (per_dataset_dir / payload_name).write_text(json.dumps(row, indent=2))
339
- if log_each_dataset:
340
- tqdm.write(f"[ok] {path.name} ({row.get('elapsed_sec', 'na')}s)")
341
- except Exception as exc: # noqa: BLE001
342
- msg = {"dataset_path": str(path), "error": repr(exc), "status": "failed"}
343
- failures.append(msg)
344
- tqdm.write(f"[failed] {path.name}: {exc}")
345
- finally:
346
- pbar.update(1)
 
 
 
 
 
 
 
 
347
 
348
 
349
  def main() -> None:
350
  parser = argparse.ArgumentParser(description=__doc__)
351
  parser.add_argument(
352
- "--input-dir",
353
- action="append",
354
- default=[],
355
- help="Input folder(s) containing .h5ad files. Can be repeated.",
356
  )
357
  parser.add_argument(
358
- "--output-dir",
359
- type=Path,
360
- default=Path("whats2000_work/cell_x_gene_visualization/output/eda"),
361
  )
362
- parser.add_argument("--workers", type=int, default=0, help="0 means auto.")
363
- parser.add_argument("--chunk-size", type=int, default=4096)
364
- parser.add_argument("--mem-per-worker-gib", type=float, default=8.0)
365
- parser.add_argument("--num-shards", type=int, default=1)
366
- parser.add_argument("--shard-index", type=int, default=0)
367
- parser.add_argument("--max-meta-cols", type=int, default=20)
368
- parser.add_argument("--max-categories", type=int, default=8)
369
  parser.add_argument(
370
- "--log-each-dataset",
371
- action="store_true",
372
- help="Print per-dataset success logs. Default is off for clean tqdm output.",
 
 
 
 
 
373
  )
374
  args = parser.parse_args()
375
-
376
- if not args.input_dir:
377
- args.input_dir = [
378
- "/project/GOV108018/cell_x_gene/homo_sapiens/h5ad",
379
- "/project/GOV108018/cell_x_gene/mus_musculus/h5ad",
380
- ]
381
-
382
- roots = [Path(p) for p in args.input_dir]
383
- all_files = discover_h5ad(roots)
384
- if not all_files:
385
- raise SystemExit("No .h5ad files found in input directories.")
386
-
387
- if args.num_shards < 1:
388
- raise SystemExit("--num-shards must be >= 1")
389
- if args.shard_index < 0 or args.shard_index >= args.num_shards:
390
- raise SystemExit("--shard-index must satisfy 0 <= shard-index < num-shards")
391
-
392
- shard_files = [p for i, p in enumerate(all_files) if i % args.num_shards == args.shard_index]
393
- if not shard_files:
394
- raise SystemExit("No files assigned to this shard.")
395
-
396
- workers = args.workers if args.workers > 0 else auto_workers(args.mem_per_worker_gib)
397
- workers = min(workers, len(shard_files))
398
-
399
- args.output_dir.mkdir(parents=True, exist_ok=True)
400
- per_dataset_dir = args.output_dir / "per_dataset"
 
 
 
 
 
 
 
 
 
 
401
  per_dataset_dir.mkdir(parents=True, exist_ok=True)
402
-
403
- manifest_path = args.output_dir / f"manifest_shard_{args.shard_index:03d}_of_{args.num_shards:03d}.txt"
404
- manifest_path.write_text("\n".join(str(x) for x in shard_files) + "\n")
405
-
 
 
 
 
 
 
 
 
 
406
  summary_rows: list[dict] = []
407
  failures: list[dict] = []
408
-
409
- print(
410
- json.dumps(
411
- {
412
- "total_files": len(all_files),
413
- "files_in_shard": len(shard_files),
414
- "workers": workers,
415
- "chunk_size": args.chunk_size,
416
- "num_shards": args.num_shards,
417
- "shard_index": args.shard_index,
 
 
 
 
 
 
 
 
 
 
418
  }
419
- )
420
- )
421
-
422
- with tqdm(total=len(shard_files), desc="Datasets", unit="dataset") as pbar:
423
- remaining_paths = run_parallel_batch(
424
- paths=shard_files,
425
- workers=workers,
426
- chunk_size=args.chunk_size,
427
- max_meta_cols=args.max_meta_cols,
428
- max_categories=args.max_categories,
429
- per_dataset_dir=per_dataset_dir,
430
- summary_rows=summary_rows,
431
- failures=failures,
432
- pbar=pbar,
433
- log_each_dataset=args.log_each_dataset,
434
- )
435
- if remaining_paths:
436
- run_isolated_retries(
437
- paths=remaining_paths,
438
- chunk_size=args.chunk_size,
439
- max_meta_cols=args.max_meta_cols,
440
- max_categories=args.max_categories,
441
- per_dataset_dir=per_dataset_dir,
442
- summary_rows=summary_rows,
443
- failures=failures,
444
- pbar=pbar,
445
- log_each_dataset=args.log_each_dataset,
446
- )
447
-
448
  summary_df = pd.DataFrame(summary_rows)
449
- summary_csv = args.output_dir / f"eda_summary_shard_{args.shard_index:03d}_of_{args.num_shards:03d}.csv"
450
  summary_df.to_csv(summary_csv, index=False)
451
-
452
- failures_path = args.output_dir / f"eda_failures_shard_{args.shard_index:03d}_of_{args.num_shards:03d}.json"
453
  failures_path.write_text(json.dumps(failures, indent=2))
454
-
455
- print(
456
- json.dumps(
457
- {
458
- "summary_csv": str(summary_csv),
459
- "failures_json": str(failures_path),
460
- "ok_count": len(summary_rows),
461
- "failed_count": len(failures),
462
- }
463
- )
464
- )
465
 
466
 
467
  if __name__ == "__main__":
 
1
#!/usr/bin/env python3
+"""Metadata-aware distributed EDA with YAML configuration and intelligent scheduling."""

from __future__ import annotations

import time
from dataclasses import dataclass
from pathlib import Path
+from typing import Any, Iterable

import anndata as ad
import numpy as np
import pandas as pd
+import yaml
from concurrent.futures.process import BrokenProcessPool
from scipy import sparse
from tqdm import tqdm

    return f"{stem}_{digest}"


def summarize_metadata(df: pd.DataFrame, max_cols: int, max_categories: int) -> dict[str, dict]:
    if df.empty:
        return {}

    }


+def process_dataset_slice(
+    path: Path,
+    obs_start: int,
+    obs_end: int,
+    chunk_size: int,
+    max_meta_cols: int,
+    max_categories: int,
+) -> dict:
+    """Process a slice of a dataset (obs_start:obs_end rows)."""
    t0 = time.time()
    row: dict[str, object] = {
        "dataset_path": str(path),
        "dataset_file": path.name,
+        "obs_slice_start": obs_start,
+        "obs_slice_end": obs_end,
    }

    adata = ad.read_h5ad(path, backed="r")
    try:
+        n_obs_full = int(adata.n_obs)
        n_vars = int(adata.n_vars)
+
+        # Adjust slice bounds
+        obs_end = min(obs_end, n_obs_full)
+        n_obs = obs_end - obs_start
+
+        if n_obs <= 0:
+            row["status"] = "empty_slice"
+            return row
+
        total_entries = n_obs * n_vars

+        row.update({
+            "n_obs": n_obs,
+            "n_obs_full": n_obs_full,
+            "n_vars": n_vars,
+            "obs_columns": int(len(adata.obs.columns)),
+            "var_columns": int(len(adata.var.columns)),
+        })

        nnz_total = 0
        x_sum = 0.0

        cell_sum_sample = ReservoirSampler(k=200_000, seed=17)
        cell_nnz_sample = ReservoirSampler(k=200_000, seed=23)

+        # Process slice in chunks
+        for start_chunk in range(obs_start, obs_end, chunk_size):
+            end_chunk = min(start_chunk + chunk_size, obs_end)
+            chunk = adata.X[start_chunk:end_chunk, :]
+
            if sparse.issparse(chunk):
                nnz = int(chunk.nnz)
                csr = chunk if sparse.isspmatrix_csr(chunk) else chunk.tocsr(copy=False)

        for key, value in cell_nnz_quantiles.items():
            row[f"cell_nnz_{key}_approx"] = value

+        # Only extract metadata for first slice
+        if obs_start == 0:
+            row["metadata_obs_summary"] = summarize_metadata(
+                adata.obs, max_cols=max_meta_cols, max_categories=max_categories
+            )
+            row["metadata_var_summary"] = summarize_metadata(
+                adata.var, max_cols=max_meta_cols, max_categories=max_categories
+            )
+            row["obs_schema"] = extract_schema(adata.obs)
+            row["var_schema"] = extract_schema(adata.var)

        row["status"] = "ok"
    finally:

    return row


+def process_dataset_full(path: Path, chunk_size: int, max_meta_cols: int, max_categories: int) -> dict:
+    """Process entire dataset (wrapper for backwards compatibility)."""
+    adata = ad.read_h5ad(path, backed="r")
+    n_obs = int(adata.n_obs)
+    adata.file.close()
+
+    return process_dataset_slice(path, 0, n_obs, chunk_size, max_meta_cols, max_categories)


+def merge_slice_results(slice_results: list[dict]) -> dict:
+    """Merge statistics from multiple slices of the same dataset."""
+    if not slice_results:
+        return {}
+
+    if len(slice_results) == 1:
+        result = slice_results[0].copy()
+        result.pop("obs_slice_start", None)
+        result.pop("obs_slice_end", None)
+        return result
+
+    # Merge strategy: combine running stats
+    merged = slice_results[0].copy()
+    merged["n_obs"] = merged["n_obs_full"]
+    merged.pop("obs_slice_start", None)
+    merged.pop("obs_slice_end", None)
+
+    # Sum/max/min across slices
+    merged["nnz"] = sum(r["nnz"] for r in slice_results)
+    merged["cell_nnz_max"] = max(r.get("cell_nnz_max", 0) for r in slice_results)
+    merged["cell_nnz_min"] = min(r.get("cell_nnz_min", float('inf')) for r in slice_results)
+    merged["cell_sum_max"] = max(r.get("cell_sum_max", 0) for r in slice_results)
+    merged["cell_sum_min"] = min(r.get("cell_sum_min", float('inf')) for r in slice_results)
+
+    # Weighted average for means
+    total_cells = sum(r["n_obs"] for r in slice_results)
+    if total_cells > 0:
+        merged["cell_nnz_mean"] = sum(r["n_obs"] * r.get("cell_nnz_mean", 0) for r in slice_results) / total_cells
+        merged["cell_sum_mean"] = sum(r["n_obs"] * r.get("cell_sum_mean", 0) for r in slice_results) / total_cells
+
+    merged["elapsed_sec"] = sum(r.get("elapsed_sec", 0) for r in slice_results)
+    merged["num_slices_processed"] = len(slice_results)
+
+    return merged
+
+
+def load_config(config_path: Path) -> dict:
+    """Load YAML configuration."""
+    with open(config_path) as f:
+        return yaml.safe_load(f)
+
+
+def load_enhanced_metadata(cache_path: Path) -> pd.DataFrame:
+    """Load enhanced metadata cache."""
+    if not cache_path.exists():
+        raise FileNotFoundError(
+            f"Enhanced metadata cache not found: {cache_path}\n"
+            "Run: uv run python scripts/build_metadata_cache.py --config <config.yaml>"
+        )
+    return pd.read_parquet(cache_path)
+
+
+def schedule_datasets(
+    metadata_df: pd.DataFrame,
+    config: dict,
+    num_shards: int,
+    shard_index: int,
+) -> list[tuple[Path, str, dict]]:
+    """
+    Schedule datasets based on size category and resource constraints.
+    Returns: list of (path, size_category, strategy) tuples
+    """
+    # Filter to this shard
+    if num_shards > 1:
+        if config["sharding"].get("strategy") == "size_balanced":
+            # Sort by size, distribute round-robin
+            metadata_df = metadata_df.sort_values("total_entries", ascending=False).reset_index(drop=True)
+        shard_df = metadata_df[metadata_df.index % num_shards == shard_index].copy()
+    else:
+        shard_df = metadata_df.copy()
+
+    # Filter successful scans only
+    shard_df = shard_df[shard_df["status"] == "ok"].copy()
+
+    # Filter by max entries threshold
+    max_entries = config["dataset_thresholds"]["max_entries"]
+    shard_df = shard_df[shard_df["total_entries"] <= max_entries].copy()
+
+    # Sort by priority (small first for fast initial progress)
+    priority_map = {"small": 1, "medium": 2, "large": 3, "xlarge": 4}
+    shard_df["priority"] = shard_df["size_category"].map(priority_map).fillna(99)
+    shard_df = shard_df.sort_values("priority").reset_index(drop=True)
+
+    # Build schedule
+    schedule = []
+    for _, row in shard_df.iterrows():
+        path = Path(row["dataset_path"])
+        size_cat = row["size_category"]
+        strategy = config["strategy"].get(size_cat, config["strategy"]["small"])
+        schedule.append((path, size_cat, strategy))
+
+    return schedule
+
+
+def run_with_strategy(
+    path: Path,
+    size_category: str,
+    strategy: dict,
+    config: dict,
    per_dataset_dir: Path,
+) -> dict:
+    """Run EDA on a single dataset with specified strategy."""
+    chunk_size = int(config["resources"]["chunk_size"] * strategy["chunk_size_multiplier"])
+    max_meta_cols = config["metadata"]["max_meta_cols"]
+    max_categories = config["metadata"]["max_categories"]
+
+    try:
+        # Check if slicing is required
+        if strategy.get("require_slicing") and config["slicing"]["enabled"]:
+            # Process in slices
+            adata = ad.read_h5ad(path, backed="r")
+            n_obs = int(adata.n_obs)
+            adata.file.close()
+
+            obs_slice_size = config["slicing"]["obs_slice_size"]
+            slice_results = []
+
+            for start in range(0, n_obs, obs_slice_size):
+                end = min(start + obs_slice_size, n_obs)
+                slice_result = process_dataset_slice(
+                    path, start, end, chunk_size, max_meta_cols, max_categories
                )
+                slice_results.append(slice_result)
+
+            # Merge slices
+            row = merge_slice_results(slice_results)
+            row["processing_mode"] = "sliced"
+        else:
+            # Process whole dataset
+            row = process_dataset_full(path, chunk_size, max_meta_cols, max_categories)
+            row["processing_mode"] = "full"
+
+        row["size_category"] = size_category
+        row["file_size_gib"] = round(path.stat().st_size / (1024**3), 4)
+
+        payload_name = safe_name(path) + ".json"
+        (per_dataset_dir / payload_name).write_text(json.dumps(row, indent=2))
+
+        return row
+
+    except Exception as exc:
+        raise RuntimeError(f"Failed to process {path}: {exc}") from exc


def main() -> None:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
+        "--config",
+        type=Path,
+        required=True,
+        help="YAML configuration file",
    )
    parser.add_argument(
+        "--num-shards",
+        type=int,
+        help="Override num_shards from config",
    )
    parser.add_argument(
+        "--shard-index",
+        type=int,
+        help="Override shard_index from config",
+    )
+    parser.add_argument(
+        "--force-workers",
+        type=int,
+        help="Override worker count",
    )
    args = parser.parse_args()
+
+    # Load config
+    config = load_config(args.config)
+
+    # Override sharding if specified
+    if args.num_shards is not None:
+        config["sharding"]["num_shards"] = args.num_shards
+        config["sharding"]["enabled"] = args.num_shards > 1
+    if args.shard_index is not None:
+        config["sharding"]["shard_index"] = args.shard_index
+
+    num_shards = config["sharding"]["num_shards"]
+    shard_index = config["sharding"]["shard_index"]
+
+    # Load enhanced metadata
+    cache_path = Path(config["paths"]["enhanced_metadata_cache"])
+    if not cache_path.is_absolute():
+        cache_path = Path(args.config).parent.parent / cache_path
+
+    print(f"Loading metadata from: {cache_path}")
+    metadata_df = load_enhanced_metadata(cache_path)
+
+    # Schedule datasets
+    schedule = schedule_datasets(metadata_df, config, num_shards, shard_index)
+
+    if not schedule:
+        print("No datasets scheduled for this shard.")
+        return
+
+    # Setup output
+    output_dir = Path(config["paths"]["output_dir"])
+    if not output_dir.is_absolute():
+        output_dir = Path(args.config).parent.parent / output_dir
+    output_dir.mkdir(parents=True, exist_ok=True)
+
+    per_dataset_dir = output_dir / "per_dataset"
    per_dataset_dir.mkdir(parents=True, exist_ok=True)
+
+    # Print schedule summary
+    schedule_summary = {}
+    for _, size_cat, _ in schedule:
+        schedule_summary[size_cat] = schedule_summary.get(size_cat, 0) + 1
+
+    print(json.dumps({
+        "total_datasets": len(schedule),
+        "by_size": schedule_summary,
+        "shard_index": shard_index,
+        "num_shards": num_shards,
+    }, indent=2))
+
    summary_rows: list[dict] = []
    failures: list[dict] = []
+
+    # Process all datasets in parallel with full worker pool.
+    # Priority ordering ensures small datasets finish first while large ones process in parallel.
+    max_workers = args.force_workers or config["resources"]["max_workers"]
+
+    print(f"\nProcessing {len(schedule)} datasets with up to {max_workers} workers...")
+    print("Strategy: Processing all sizes in parallel with priority ordering\n")
+
+    with tqdm(total=len(schedule), desc="All datasets", unit="dataset") as pbar:
+        with concurrent.futures.ProcessPoolExecutor(max_workers=max_workers) as ex:
+            futures = {
+                ex.submit(
+                    run_with_strategy,
+                    path,
+                    size_cat,
+                    strategy,
+                    config,
+                    per_dataset_dir,
+                ): (path, size_cat)
+                for path, size_cat, strategy in schedule
            }
+
+            for fut in concurrent.futures.as_completed(futures):
+                path, size_cat = futures[fut]
+                try:
+                    row = fut.result()
+                    summary_rows.append(row)
+                    if config["behavior"]["log_each_dataset"]:
+                        elapsed = row.get("elapsed_sec", "?")
+                        tqdm.write(f"[ok] {path.name} ({size_cat}, {elapsed}s)")
+                except Exception as exc:
+                    msg = {"dataset_path": str(path), "error": repr(exc), "status": "failed", "size_category": size_cat}
+                    failures.append(msg)
+                    tqdm.write(f"[failed] {path.name} ({size_cat}): {exc}")
+                finally:
+                    pbar.update(1)
+
+    # Save results
    summary_df = pd.DataFrame(summary_rows)
+    summary_csv = output_dir / f"eda_summary_shard_{shard_index:03d}_of_{num_shards:03d}.csv"
    summary_df.to_csv(summary_csv, index=False)
+
+    failures_path = output_dir / f"eda_failures_shard_{shard_index:03d}_of_{num_shards:03d}.json"
    failures_path.write_text(json.dumps(failures, indent=2))
+
+    print(json.dumps({
+        "summary_csv": str(summary_csv),
+        "failures_json": str(failures_path),
+        "ok_count": len(summary_rows),
+        "failed_count": len(failures),
+    }, indent=2))


if __name__ == "__main__":
scripts/merge_eda_shards.py CHANGED
File without changes
scripts/resource_probe.py CHANGED
File without changes
scripts/run_eda_pipeline.py ADDED
@@ -0,0 +1,112 @@
+#!/usr/bin/env python3
+"""Launcher script for YAML-configured EDA pipeline."""
+
+import argparse
+import subprocess
+import sys
+from pathlib import Path
+
+import yaml
+
+
+def run_command(cmd: list[str], description: str) -> None:
+    """Run a command and handle errors."""
+    print(f"\n{'='*80}")
+    print(f"{description}")
+    print(f"{'='*80}")
+    print(f"Command: {' '.join(cmd)}\n")
+
+    result = subprocess.run(cmd)
+    if result.returncode != 0:
+        print(f"\n[ERROR] {description} failed with exit code {result.returncode}")
+        sys.exit(result.returncode)
+
+
+def main():
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument(
+        "--config",
+        type=Path,
+        required=True,
+        help="YAML configuration file",
+    )
+    parser.add_argument(
+        "--step",
+        choices=["metadata", "eda", "merge", "all"],
+        default="all",
+        help="Which step to run",
+    )
+    parser.add_argument(
+        "--num-shards",
+        type=int,
+        help="Number of shards for distributed processing",
+    )
+    parser.add_argument(
+        "--shard-index",
+        type=int,
+        help="Shard index to process (0-based)",
+    )
+    parser.add_argument(
+        "--force-rescan",
+        action="store_true",
+        help="Force metadata rescan",
+    )
+    args = parser.parse_args()
+
+    if not args.config.exists():
+        print(f"[ERROR] Config file not found: {args.config}")
+        sys.exit(1)
+
+    # Load config to check paths
+    with open(args.config) as f:
+        config = yaml.safe_load(f)
+
+    # Step 1: Build metadata cache
+    if args.step in ["metadata", "all"]:
+        cmd = [
+            "uv", "run", "python",
+            "scripts/build_metadata_cache.py",
+            "--config", str(args.config),
+        ]
+        if args.force_rescan:
+            cmd.append("--force-rescan")
+
+        run_command(cmd, "Step 1: Building metadata cache")
+
+    # Step 2: Run EDA
+    if args.step in ["eda", "all"]:
+        cmd = [
+            "uv", "run", "python",
+            "scripts/distributed_eda.py",
+            "--config", str(args.config),
+        ]
+        if args.num_shards is not None:
+            cmd.extend(["--num-shards", str(args.num_shards)])
+        if args.shard_index is not None:
+            cmd.extend(["--shard-index", str(args.shard_index)])
+
+        run_command(cmd, "Step 2: Running EDA")
+
+    # Step 3: Merge shards (if sharding was used)
+    if args.step == "merge":
+        cmd = [
+            "uv", "run", "python",
+            "scripts/merge_eda_shards.py",
+            "--output-dir", config["paths"]["output_dir"],
+        ]
+        run_command(cmd, "Step 3: Merging shard results")
+    elif args.step == "all":
+        if args.num_shards and args.num_shards > 1:
+            # Sharded runs produce one output set per shard; merge them in a separate pass.
+            print("\n[INFO] Sharding enabled but not merging (run --step merge after all shards finish)")
+        else:
+            print("\n[INFO] Single shard run, no merge needed")
+
+    print(f"\n{'='*80}")
+    print("Pipeline completed successfully!")
+    print(f"{'='*80}")
+    print(f"\nResults written to: {config['paths']['output_dir']}")
+
+
+if __name__ == "__main__":
+    main()
scripts/run_eda_slurm.sh ADDED
@@ -0,0 +1,60 @@
+#!/bin/bash
+#SBATCH --job-name=eda_pipeline
+#SBATCH --output=logs/eda_%A_%a.out
+#SBATCH --error=logs/eda_%A_%a.err
+#SBATCH --time=24:00:00
+#SBATCH --mem=256G
+#SBATCH --cpus-per-task=42
+#SBATCH --array=0-3
+
+# SLURM batch script for distributed EDA with YAML config
+# Usage: sbatch --array=0-N scripts/run_eda_slurm.sh configs/eda_optimized.yaml
+# where N is num_shards - 1
+
+CONFIG_FILE=${1:-configs/eda_optimized.yaml}
+NUM_SHARDS=${2:-4}
+SHARD_INDEX=${SLURM_ARRAY_TASK_ID}
+
+echo "========================================="
+echo "EDA Pipeline - Shard ${SHARD_INDEX}/${NUM_SHARDS}"
+echo "Config: ${CONFIG_FILE}"
+echo "========================================="
+
+cd /project/GOV108018/whats2000_work/cell_x_gene_visualization
+
+# Build metadata cache (only first job)
+if [ ${SHARD_INDEX} -eq 0 ]; then
+    echo "Building metadata cache..."
+    uv run python scripts/build_metadata_cache.py --config "${CONFIG_FILE}"
+
+    # Wait a bit for cache to be written
+    sleep 30
+else
+    # Wait for first job to build cache
+    echo "Waiting for metadata cache..."
+    CACHE_PATH=$(python -c "import yaml; c=yaml.safe_load(open('${CONFIG_FILE}')); print(c['paths']['enhanced_metadata_cache'])")
+
+    # Wait up to 10 minutes for cache
+    for i in {1..60}; do
+        if [ -f "${CACHE_PATH}" ]; then
+            echo "Cache found!"
+            break
+        fi
+        echo "Waiting for cache... ($i/60)"
+        sleep 10
+    done
+
+    if [ ! -f "${CACHE_PATH}" ]; then
+        echo "ERROR: Metadata cache not found after waiting"
+        exit 1
+    fi
+fi
+
+# Run EDA for this shard
+echo "Running EDA for shard ${SHARD_INDEX}..."
+uv run python scripts/distributed_eda.py \
+    --config "${CONFIG_FILE}" \
+    --num-shards "${NUM_SHARDS}" \
+    --shard-index "${SHARD_INDEX}"
+
+echo "Shard ${SHARD_INDEX} completed!"