Spaces:

dmannk
/

Paper2Agent-scglue-mcp

Sleeping

App Files Files Community

Dylan Mann-Krzisnik commited on 13 days ago

Commit

dee34fb

1 Parent(s): 8e2b525

Add GLUE remote MCP server

Browse files

Files changed (10) hide show

Dockerfile +30 -0
GLUE_Agent_mcp.py +40 -0
requirements.txt +32 -0
tools/__pycache__/preprocessing.cpython-310.pyc +0 -0
tools/__pycache__/preprocessing.cpython-311.pyc +0 -0
tools/__pycache__/training.cpython-310.pyc +0 -0
tools/preprocessing.py +280 -0
tools/preprocessing_implementation_log.md +209 -0
tools/training.py +525 -0
tools/training_summary.md +133 -0

Dockerfile ADDED Viewed

	@@ -0,0 +1,30 @@

+FROM python:3.12-slim
+WORKDIR /app
+# bedtools is required by pybedtools (used in scglue genomics)
+RUN apt-get update && apt-get install -y --no-install-recommends \
+        bedtools \
+        build-essential \
+        git \
+    && rm -rf /var/lib/apt/lists/*
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+# Copy server entry-point and tool modules from src/
+COPY src/GLUE_Agent_mcp.py .
+COPY src/tools/ tools/
+# Redirect I/O to /data so outputs survive across requests and can use
+# HF Spaces persistent storage if enabled
+ENV PREPROCESSING_INPUT_DIR=/data/inputs
+ENV PREPROCESSING_OUTPUT_DIR=/data/outputs
+ENV TRAINING_INPUT_DIR=/data/inputs
+ENV TRAINING_OUTPUT_DIR=/data/outputs
+RUN mkdir -p /data/inputs /data/outputs
+EXPOSE 7860
+CMD ["uvicorn", "GLUE_Agent_mcp:app", "--host", "0.0.0.0", "--port", "7860"]

GLUE_Agent_mcp.py ADDED Viewed

	@@ -0,0 +1,40 @@

+"""
+Model Context Protocol (MCP) for GLUE_Agent
+GLUE_Agent provides comprehensive multi-omics data integration tools for single-cell RNA-seq and ATAC-seq analysis. This framework enables preprocessing, model training, and visualization of integrated multi-modal datasets.
+This MCP Server contains tools extracted from the following tutorial files:
+1. preprocessing
+    - glue_preprocess_scrna: Preprocess scRNA-seq data with HVG selection, normalization, and PCA
+    - glue_preprocess_scatac: Preprocess scATAC-seq data with LSI dimension reduction
+    - glue_construct_regulatory_graph: Construct prior regulatory graph linking RNA and ATAC features
+2. training
+    - glue_configure_datasets: Configure RNA-seq and ATAC-seq datasets for GLUE model training
+    - glue_train_model: Train GLUE model for multi-omics integration
+    - glue_check_integration_consistency: Evaluate integration quality with consistency scores
+    - glue_generate_embeddings: Generate cell and feature embeddings from trained GLUE model
+"""
+import os
+from fastmcp import FastMCP
+# Import statements (alphabetical order)
+from tools.preprocessing import preprocessing_mcp
+from tools.training import training_mcp
+# Server definition and mounting
+mcp = FastMCP(name="GLUE_Agent")
+mcp.mount(preprocessing_mcp)
+mcp.mount(training_mcp)
+# ASGI app for uvicorn (used when deployed as a remote HTTP server)
+app = mcp.http_app(path="/mcp")
+if __name__ == "__main__":
+    mcp.run(
+        transport="http",
+        host="0.0.0.0",
+        port=int(os.getenv("PORT", 7860)),
+        path="/mcp",
+    )

requirements.txt ADDED Viewed

	@@ -0,0 +1,32 @@

+# MCP server & HTTP transport
+fastmcp==2.14.5
+uvicorn==0.40.0
+fastapi
+starlette==0.52.1
+# Bioinformatics core
+anndata==0.11.4
+scanpy==1.11.5
+scglue==0.4.0
+# Graph / numerics
+networkx==3.4.2
+numpy==2.2.6
+pandas==2.3.3
+scipy==1.15.3
+scikit-learn==1.7.2
+# Plotting
+matplotlib==3.10.8
+seaborn==0.13.2
+# scglue deep-learning backend
+torch==2.10.0
+pyro-ppl==1.9.1
+# scglue genomics (requires bedtools system package)
+pybedtools==0.12.0
+# Utilities
+tqdm==4.67.3
+dill==0.4.1

tools/__pycache__/preprocessing.cpython-310.pyc ADDED Viewed

Binary file (7.61 kB). View file

tools/__pycache__/preprocessing.cpython-311.pyc ADDED Viewed

Binary file (13.6 kB). View file

tools/__pycache__/training.cpython-310.pyc ADDED Viewed

Binary file (10.7 kB). View file

tools/preprocessing.py ADDED Viewed

	@@ -0,0 +1,280 @@

+"""
+GLUE preprocessing tutorial for scRNA-seq and scATAC-seq data integration.
+This MCP Server provides 3 tools:
+1. glue_preprocess_scrna: Preprocess scRNA-seq data with HVG selection, normalization, and PCA
+2. glue_preprocess_scatac: Preprocess scATAC-seq data with LSI dimension reduction
+3. glue_construct_regulatory_graph: Construct prior regulatory graph linking RNA and ATAC features
+All tools extracted from `gao-lab/GLUE/blob/master/docs/preprocessing.ipynb`.
+"""
+import os
+from datetime import datetime
+from pathlib import Path
+# Standard imports
+from typing import Annotated, Any, Literal
+import anndata as ad
+# Domain-specific imports
+import matplotlib.pyplot as plt
+import networkx as nx
+import numpy as np
+import pandas as pd
+import scanpy as sc
+import scglue
+from fastmcp import FastMCP
+from matplotlib import rcParams
+# Project structure
+PROJECT_ROOT = Path(__file__).parent.parent.parent.resolve()
+DEFAULT_INPUT_DIR = PROJECT_ROOT / "tmp" / "inputs"
+DEFAULT_OUTPUT_DIR = PROJECT_ROOT / "tmp" / "outputs"
+INPUT_DIR = Path(os.environ.get("PREPROCESSING_INPUT_DIR", DEFAULT_INPUT_DIR))
+OUTPUT_DIR = Path(os.environ.get("PREPROCESSING_OUTPUT_DIR", DEFAULT_OUTPUT_DIR))
+# Ensure directories exist
+INPUT_DIR.mkdir(parents=True, exist_ok=True)
+OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
+# Timestamp for unique outputs
+timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+# Set plotting parameters
+plt.rcParams["figure.dpi"] = 300
+plt.rcParams["savefig.dpi"] = 300
+scglue.plot.set_publication_params()
+rcParams["figure.figsize"] = (4, 4)
+# MCP server instance
+preprocessing_mcp = FastMCP(name="preprocessing")
+@preprocessing_mcp.tool
+def glue_preprocess_scrna(
+    rna_path: Annotated[
+        str | None, "Path to scRNA-seq data file in h5ad format"
+    ] = None,
+    n_top_genes: Annotated[int, "Number of highly variable genes to select"] = 2000,
+    flavor: Annotated[
+        Literal["seurat", "cell_ranger", "seurat_v3"], "Method for HVG selection"
+    ] = "seurat_v3",
+    n_comps: Annotated[int, "Number of principal components"] = 100,
+    svd_solver: Annotated[
+        Literal["auto", "arpack", "randomized"], "SVD solver for PCA"
+    ] = "auto",
+    color_var: Annotated[str, "Variable name for UMAP coloring"] = "cell_type",
+    out_prefix: Annotated[str | None, "Output file prefix"] = None,
+) -> dict:
+    """
+    Preprocess scRNA-seq data with highly variable gene selection, normalization, scaling, and PCA.
+    Input is scRNA-seq data in h5ad format and output is preprocessed data with PCA embedding and UMAP visualization.
+    """
+    # Input validation
+    if rna_path is None:
+        raise ValueError("Path to scRNA-seq data file must be provided")
+    # File existence validation
+    rna_file = Path(rna_path)
+    if not rna_file.exists():
+        raise FileNotFoundError(f"RNA data file not found: {rna_path}")
+    # Set output prefix
+    if out_prefix is None:
+        out_prefix = "glue_rna"
+    # Load data
+    rna = ad.read_h5ad(rna_path)
+    # Backup raw counts to "counts" layer
+    rna.layers["counts"] = rna.X.copy()
+    # Select highly variable genes
+    sc.pp.highly_variable_genes(rna, n_top_genes=n_top_genes, flavor=flavor)
+    # Normalize, log-transform, and scale
+    sc.pp.normalize_total(rna)
+    sc.pp.log1p(rna)
+    sc.pp.scale(rna)
+    # Perform PCA
+    sc.tl.pca(rna, n_comps=n_comps, svd_solver=svd_solver)
+    # Generate UMAP visualization
+    sc.pp.neighbors(rna, metric="cosine")
+    sc.tl.umap(rna)
+    # Save UMAP plot
+    fig_output = OUTPUT_DIR / f"{out_prefix}_umap_{timestamp}.png"
+    sc.pl.umap(rna, color=color_var, show=False)
+    plt.savefig(fig_output, dpi=300, bbox_inches="tight")
+    plt.close()
+    # Save preprocessed data
+    rna_output = OUTPUT_DIR / f"{out_prefix}_preprocessed_{timestamp}.h5ad"
+    rna.write(str(rna_output), compression="gzip")
+    return {
+        "message": f"Preprocessed RNA data: {n_top_genes} HVGs, {n_comps} PCs, UMAP generated",
+        "reference": "https://github.com/gao-lab/GLUE/blob/master/docs/preprocessing.ipynb",
+        "artifacts": [
+            {"description": "Preprocessed RNA data", "path": str(rna_output.resolve())},
+            {
+                "description": "RNA UMAP visualization",
+                "path": str(fig_output.resolve()),
+            },
+        ],
+    }
+@preprocessing_mcp.tool
+def glue_preprocess_scatac(
+    atac_path: Annotated[
+        str | None, "Path to scATAC-seq data file in h5ad format"
+    ] = None,
+    n_components: Annotated[int, "Number of LSI components"] = 100,
+    n_iter: Annotated[int, "Number of iterations for randomized SVD in LSI"] = 15,
+    color_var: Annotated[str, "Variable name for UMAP coloring"] = "cell_type",
+    out_prefix: Annotated[str | None, "Output file prefix"] = None,
+) -> dict:
+    """
+    Preprocess scATAC-seq data with latent semantic indexing (LSI) dimension reduction.
+    Input is scATAC-seq data in h5ad format and output is preprocessed data with LSI embedding and UMAP visualization.
+    """
+    # Input validation
+    if atac_path is None:
+        raise ValueError("Path to scATAC-seq data file must be provided")
+    # File existence validation
+    atac_file = Path(atac_path)
+    if not atac_file.exists():
+        raise FileNotFoundError(f"ATAC data file not found: {atac_path}")
+    # Set output prefix
+    if out_prefix is None:
+        out_prefix = "glue_atac"
+    # Load data
+    atac = ad.read_h5ad(atac_path)
+    # Perform LSI dimension reduction
+    scglue.data.lsi(atac, n_components=n_components, n_iter=n_iter)
+    # Generate UMAP visualization
+    sc.pp.neighbors(atac, use_rep="X_lsi", metric="cosine")
+    sc.tl.umap(atac)
+    # Save UMAP plot
+    fig_output = OUTPUT_DIR / f"{out_prefix}_umap_{timestamp}.png"
+    sc.pl.umap(atac, color=color_var, show=False)
+    plt.savefig(fig_output, dpi=300, bbox_inches="tight")
+    plt.close()
+    # Save preprocessed data
+    atac_output = OUTPUT_DIR / f"{out_prefix}_preprocessed_{timestamp}.h5ad"
+    atac.write(str(atac_output), compression="gzip")
+    return {
+        "message": f"Preprocessed ATAC data: {n_components} LSI components, UMAP generated",
+        "reference": "https://github.com/gao-lab/GLUE/blob/master/docs/preprocessing.ipynb",
+        "artifacts": [
+            {
+                "description": "Preprocessed ATAC data",
+                "path": str(atac_output.resolve()),
+            },
+            {
+                "description": "ATAC UMAP visualization",
+                "path": str(fig_output.resolve()),
+            },
+        ],
+    }
+@preprocessing_mcp.tool
+def glue_construct_regulatory_graph(
+    rna_path: Annotated[
+        str | None, "Path to preprocessed scRNA-seq data file in h5ad format"
+    ] = None,
+    atac_path: Annotated[
+        str | None, "Path to preprocessed scATAC-seq data file in h5ad format"
+    ] = None,
+    gtf_path: Annotated[
+        str | None, "Path to GTF annotation file for gene coordinates"
+    ] = None,
+    gtf_by: Annotated[str, "GTF attribute to match gene names"] = "gene_name",
+    out_prefix: Annotated[str | None, "Output file prefix"] = None,
+) -> dict:
+    """
+    Construct prior regulatory graph linking RNA genes and ATAC peaks via genomic proximity.
+    Input is preprocessed RNA and ATAC data with GTF annotation and output is NetworkX guidance graph.
+    """
+    # Input validation
+    if rna_path is None:
+        raise ValueError("Path to preprocessed scRNA-seq data file must be provided")
+    if atac_path is None:
+        raise ValueError("Path to preprocessed scATAC-seq data file must be provided")
+    if gtf_path is None:
+        raise ValueError("Path to GTF annotation file must be provided")
+    # File existence validation
+    rna_file = Path(rna_path)
+    if not rna_file.exists():
+        raise FileNotFoundError(f"RNA data file not found: {rna_path}")
+    atac_file = Path(atac_path)
+    if not atac_file.exists():
+        raise FileNotFoundError(f"ATAC data file not found: {atac_path}")
+    gtf_file = Path(gtf_path)
+    if not gtf_file.exists():
+        raise FileNotFoundError(f"GTF annotation file not found: {gtf_path}")
+    # Set output prefix
+    if out_prefix is None:
+        out_prefix = "glue_guidance"
+    # Load data
+    rna = ad.read_h5ad(rna_path)
+    atac = ad.read_h5ad(atac_path)
+    # Get gene annotation from GTF
+    scglue.data.get_gene_annotation(rna, gtf=gtf_path, gtf_by=gtf_by)
+    # Extract ATAC peak coordinates from var_names
+    split = atac.var_names.str.split(r"[:-]")
+    atac.var["chrom"] = split.map(lambda x: x[0])
+    atac.var["chromStart"] = split.map(lambda x: x[1]).astype(int)
+    atac.var["chromEnd"] = split.map(lambda x: x[2]).astype(int)
+    # Construct guidance graph
+    guidance = scglue.genomics.rna_anchored_guidance_graph(rna, atac)
+    # Verify graph compliance
+    scglue.graph.check_graph(guidance, [rna, atac])
+    # Save guidance graph
+    graph_output = OUTPUT_DIR / f"{out_prefix}_graph_{timestamp}.graphml.gz"
+    nx.write_graphml(guidance, str(graph_output))
+    # Save updated data with coordinates
+    rna_output = OUTPUT_DIR / f"{out_prefix}_rna_annotated_{timestamp}.h5ad"
+    atac_output = OUTPUT_DIR / f"{out_prefix}_atac_annotated_{timestamp}.h5ad"
+    rna.write(str(rna_output), compression="gzip")
+    atac.write(str(atac_output), compression="gzip")
+    return {
+        "message": f"Constructed guidance graph with {guidance.number_of_nodes()} nodes and {guidance.number_of_edges()} edges",
+        "reference": "https://github.com/gao-lab/GLUE/blob/master/docs/preprocessing.ipynb",
+        "artifacts": [
+            {"description": "Guidance graph", "path": str(graph_output.resolve())},
+            {
+                "description": "RNA data with coordinates",
+                "path": str(rna_output.resolve()),
+            },
+            {
+                "description": "ATAC data with coordinates",
+                "path": str(atac_output.resolve()),
+            },
+        ],
+    }

tools/preprocessing_implementation_log.md ADDED Viewed

	@@ -0,0 +1,209 @@

+# Implementation Log: GLUE Preprocessing Tools
+**Tutorial Source**: `gao-lab/GLUE/blob/master/docs/preprocessing.ipynb`
+**Implementation Date**: 2026-02-14
+**Output File**: `src/tools/preprocessing.py`
+## Tool Design Decisions
+### Tools Extracted (3 tools)
+1. **glue_preprocess_scrna**
+   - **Section**: "Preprocess scRNA-seq data"
+   - **Rationale**: Complete preprocessing workflow for scRNA-seq data including HVG selection, normalization, scaling, PCA, and UMAP visualization
+   - **Classification**: Applicable to New Data - performs standard scRNA-seq preprocessing on any raw count matrix
+   - **Parameters Preserved**: `n_top_genes=2000`, `flavor="seurat_v3"`, `n_comps=100`, `svd_solver="auto"` all explicitly set in tutorial
+   - **Parameters Parameterized**: `color_var="cell_type"` is tutorial-specific and must be configurable for user datasets
+2. **glue_preprocess_scatac**
+   - **Section**: "Preprocess scATAC-seq data"
+   - **Rationale**: Complete preprocessing workflow for scATAC-seq data using LSI dimension reduction and UMAP visualization
+   - **Classification**: Applicable to New Data - performs standard scATAC-seq preprocessing on any raw accessibility matrix
+   - **Parameters Preserved**: `n_components=100`, `n_iter=15` explicitly set in tutorial
+   - **Parameters Parameterized**: `color_var="cell_type"` is tutorial-specific and must be configurable for user datasets
+3. **glue_construct_regulatory_graph**
+   - **Section**: "Construct prior regulatory graph"
+   - **Rationale**: Constructs prior regulatory graph linking RNA and ATAC features via genomic proximity
+   - **Classification**: Applicable to New Data - essential for GLUE model training with any paired RNA-ATAC datasets
+   - **Parameters Preserved**: `gtf_by="gene_name"` as tutorial default
+   - **Input Requirements**: Requires GTF annotation file which users must provide for their organism
+### Tools Excluded (1 tool)
+1. **glue_read_paired_data** (initially present, removed in revision)
+   - **Section**: "Read data"
+   - **Rationale for Exclusion**: Only loads tutorial example data with no analytical transformation
+   - **Classification**: NOT Applicable to New Data - data loading is trivial and should be handled by users
+## Parameter Design Rationale
+### Primary Data Inputs
+- All tools use **file paths** as primary inputs (h5ad format for AnnData objects)
+- No data object parameters (e.g., `adata: AnnData`) to enforce file-based workflow
+- All data paths default to `None` with validation in function body for clear error messages
+### Analysis Parameters
+**Parameters Explicitly Set in Tutorial (Parameterized)**:
+- `n_top_genes=2000`, `flavor="seurat_v3"` - Tutorial shows explicit values for HVG selection
+- `n_comps=100`, `svd_solver="auto"` - Tutorial shows explicit values for PCA
+- `n_components=100`, `n_iter=15` - Tutorial shows explicit values for LSI
+- `gtf_by="gene_name"` - Tutorial shows explicit attribute for GTF parsing
+**Tutorial-Specific Values (Parameterized)**:
+- `color_var="cell_type"` - Column name specific to tutorial dataset, must be configurable for user data
+**Library Defaults (Preserved)**:
+- `sc.pp.neighbors(rna, metric="cosine")` - Tutorial shows this exact call, preserved as-is
+- `sc.pp.normalize_total(rna)` - No parameters in tutorial, using library defaults
+- `sc.pp.log1p(rna)` - No parameters in tutorial, using library defaults
+- `sc.pp.scale(rna)` - No parameters in tutorial, using library defaults
+### Critical Rule Adherence
+**NEVER ADD PARAMETERS NOT IN TUTORIAL**: All function parameters correspond to explicit values in the tutorial code. No parameters were added that weren't shown in the original tutorial.
+**PRESERVE EXACT TUTORIAL STRUCTURE**: All function calls preserve the exact structure from the tutorial:
+- `sc.pp.highly_variable_genes(rna, n_top_genes=2000, flavor="seurat_v3")` → parameterized as shown
+- `sc.tl.pca(rna, n_comps=100, svd_solver="auto")` → parameterized as shown
+- `scglue.data.lsi(atac, n_components=100, n_iter=15)` → parameterized as shown
+- `sc.pp.neighbors(rna, metric="cosine")` → preserved exactly as shown
+## Output Requirements
+### Visualization Outputs
+**Code-Generated Figures Only**:
+- `glue_preprocess_scrna`: UMAP visualization of RNA data (from tutorial section "Optionally, we can visualize...")
+- `glue_preprocess_scatac`: UMAP visualization of ATAC data (from tutorial section "Optionally, we may also visualize...")
+- No static figures or diagrams included (tutorial has none)
+**Figure Specifications**:
+- Format: PNG with `dpi=300`, `bbox_inches='tight'`
+- Naming: `{out_prefix}_umap_{timestamp}.png`
+- Always generated (no user control parameter)
+### Data Outputs
+**Essential Results Saved**:
+- Preprocessed AnnData objects with all transformations applied
+- Guidance graph in NetworkX GraphML format
+- Annotated data with genomic coordinates
+**File Formats**:
+- AnnData: h5ad with gzip compression (standard for single-cell data)
+- Graph: graphml.gz (standard for NetworkX graphs)
+**Naming Convention**:
+- `{out_prefix}_preprocessed_{timestamp}.h5ad`
+- `{out_prefix}_graph_{timestamp}.graphml.gz`
+- `{out_prefix}_rna_annotated_{timestamp}.h5ad`
+### Return Format
+All tools return standardized dict:
+```python
+{
+    "message": "<concise status ≤120 chars>",
+    "reference": "https://github.com/gao-lab/GLUE/blob/master/docs/preprocessing.ipynb",
+    "artifacts": [
+        {
+            "description": "<description ≤50 chars>",
+            "path": "/absolute/path/to/file"
+        }
+    ]
+}
+```
+## Quality Review Results
+### Iteration 1 (Final)
+**Date**: 2026-02-14
+**Status**: All checks passed
+**Tool Design Validation**: [✓] All 7 checks passed
+- Tool definition, naming, description, classification, order, boundaries, independence all correct
+**Implementation Validation**: [✓] All 8 checks passed
+- Function coverage, parameter design, input validation, tutorial fidelity, real-world focus, no hardcoding, library compliance, exact function calls all correct
+**Output Validation**: [✓] All 5 checks passed
+- Figure generation, data outputs, return format, file paths, reference links all correct
+**Code Quality Validation**: [✓] All 6 checks passed
+- Error handling, type annotations, documentation, template compliance, import management, environment setup all correct
+**Summary**: 3/3 tools passing all checks. No issues found. Implementation is production-ready.
+## Implementation Choices
+### Libraries Used
+- **anndata**: Standard format for single-cell data (AnnData objects)
+- **scanpy**: Standard toolkit for scRNA-seq analysis (HVG, normalization, PCA, UMAP)
+- **scglue**: GLUE-specific functions (LSI, graph construction, gene annotation)
+- **networkx**: Standard graph library for guidance graph representation
+- **matplotlib**: Visualization library for UMAP plots
+### Error Handling Approach
+**Basic Input Validation Only**:
+- Required parameter validation (data_path must be provided)
+- File existence checks (FileNotFoundError if file not found)
+- No intermediate processing validation (trust library error messages)
+**Rationale**: Tutorial assumes valid input data. Error handling focused on user input mistakes, not data quality issues.
+### Parameterization Rationale
+**Why Parameterize `color_var`?**
+- Tutorial uses `"cell_type"` which is a column specific to the tutorial dataset
+- User datasets will have different column names for cell annotations
+- Parameterizing enables tool to work with any AnnData object with different metadata columns
+**Why Parameterize `gtf_by`?**
+- Tutorial uses `"gene_name"` attribute in GTF, but GTF files can use different attributes
+- Some GTF files use `"gene_id"`, `"transcript_name"`, or other attributes
+- Parameterizing enables tool to work with different GTF annotation standards
+**Why Keep Default `n_top_genes=2000`?**
+- This is a standard value in single-cell RNA-seq analysis
+- Tutorial explicitly sets this value, not using library default
+- Value represents a scientific choice about feature selection stringency
+**Why Keep Default `n_components=100`?**
+- This is the standard dimensionality for GLUE model training
+- Tutorial explicitly sets this value for downstream model compatibility
+- Changing this value would require adjusting the GLUE model architecture
+## Known Limitations
+1. **Coordinate Extraction Assumption**: `glue_construct_regulatory_graph` assumes ATAC peak names follow the format `"chr:start-end"`. If user data uses different formats (e.g., `"chr_start_end"` or `"chr:start:end"`), the coordinate extraction will fail. Users must ensure their peak names follow the expected format or pre-process their data.
+2. **GTF Compatibility**: Gene annotation requires GTF file with specific attributes. Not all GTF formats are compatible. Users must ensure their GTF file contains the required attributes (default: `"gene_name"`).
+3. **Memory Requirements**: LSI and PCA operations on large datasets can be memory-intensive. Users with datasets >100k cells may encounter memory issues on standard workstations.
+4. **Visualization Dependency**: UMAP visualizations require the `color_var` column to exist in the AnnData object. If the column is missing, the tool will fail. Users must ensure their data contains the specified annotation column.
+5. **File Format Constraints**: Tools only accept h5ad format for input/output. Users with data in other formats (csv, mtx, loom) must convert to h5ad before using these tools.
+## Testing Recommendations
+1. **Test with tutorial data**: Verify tools reproduce exact tutorial results with Chen-2019 dataset
+2. **Test with different organisms**: Verify GTF annotation works with different reference genomes
+3. **Test with different annotation columns**: Verify `color_var` parameter works with different metadata
+4. **Test with edge cases**:
+   - Very small datasets (<100 cells)
+   - Very large datasets (>100k cells)
+   - Datasets with missing or malformed peak coordinates
+   - GTF files with different attribute names
+## Revision History
+### Initial Implementation
+- 4 tools: `glue_read_paired_data`, `glue_preprocess_scrna`, `glue_preprocess_scatac`, `glue_construct_guidance_graph`
+### Revision 1 (2026-02-14)
+**Changes Made**:
+1. **Removed `glue_read_paired_data` tool**: Classified as NOT Applicable to New Data (only loads tutorial data without analytical transformation)
+2. **Renamed `glue_construct_guidance_graph` to `glue_construct_regulatory_graph`**: Better matches tutorial section title "Construct prior regulatory graph"
+3. **Updated documentation**: Corrected tool count from 4 to 3 tools
+**Rationale**: Enforce strict adherence to "Applicable to New Data" classification. Data loading without analytical transformation should not be a standalone tool.
+**Result**: All 3 remaining tools pass quality review with all checks passing.

tools/training.py ADDED Viewed

	@@ -0,0 +1,525 @@

+"""
+GLUE model training workflow for multi-omics data integration.
+This MCP Server provides 4 tools:
+1. glue_configure_datasets: Configure RNA-seq and ATAC-seq datasets for GLUE model training
+2. glue_train_model: Train GLUE model for multi-omics integration
+3. glue_check_integration_consistency: Evaluate integration quality with consistency scores
+4. glue_generate_embeddings: Generate cell and feature embeddings from trained GLUE model
+All tools extracted from `gao-lab/GLUE/docs/training.ipynb`.
+"""
+import os
+from datetime import datetime
+from itertools import chain
+from pathlib import Path
+# Standard imports
+from typing import Annotated, Any, Literal
+# Domain-specific imports
+import anndata as ad
+import matplotlib.pyplot as plt
+import networkx as nx
+import numpy as np
+import pandas as pd
+import scanpy as sc
+import scglue
+import seaborn as sns
+from fastmcp import FastMCP
+from matplotlib import rcParams
+# Project structure
+PROJECT_ROOT = Path(__file__).parent.parent.parent.resolve()
+DEFAULT_INPUT_DIR = PROJECT_ROOT / "tmp" / "inputs"
+DEFAULT_OUTPUT_DIR = PROJECT_ROOT / "tmp" / "outputs"
+INPUT_DIR = Path(os.environ.get("TRAINING_INPUT_DIR", DEFAULT_INPUT_DIR))
+OUTPUT_DIR = Path(os.environ.get("TRAINING_OUTPUT_DIR", DEFAULT_OUTPUT_DIR))
+# Ensure directories exist
+INPUT_DIR.mkdir(parents=True, exist_ok=True)
+OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
+# Timestamp for unique outputs
+timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+# MCP server instance
+training_mcp = FastMCP(name="training")
+# Set plot parameters
+plt.rcParams["figure.dpi"] = 300
+plt.rcParams["savefig.dpi"] = 300
+scglue.plot.set_publication_params()
+rcParams["figure.figsize"] = (4, 4)
+@training_mcp.tool
+def glue_configure_datasets(
+    # Primary data inputs
+    rna_path: Annotated[
+        str | None, "Path to preprocessed RNA-seq data file with extension .h5ad"
+    ] = None,
+    atac_path: Annotated[
+        str | None, "Path to preprocessed ATAC-seq data file with extension .h5ad"
+    ] = None,
+    guidance_path: Annotated[
+        str | None, "Path to guidance graph file with extension .graphml.gz"
+    ] = None,
+    # Configuration parameters with tutorial defaults
+    prob_model: Annotated[
+        Literal["NB", "ZINB", "ZIP"], "Probabilistic generative model"
+    ] = "NB",
+    use_highly_variable: Annotated[bool, "Use only highly variable features"] = True,
+    rna_use_layer: Annotated[
+        str | None, "RNA data layer to use (None uses .X)"
+    ] = "counts",
+    rna_use_rep: Annotated[str, "RNA preprocessing embedding to use"] = "X_pca",
+    atac_use_rep: Annotated[str, "ATAC preprocessing embedding to use"] = "X_lsi",
+    out_prefix: Annotated[str | None, "Output file prefix"] = None,
+) -> dict:
+    """
+    Configure RNA-seq and ATAC-seq datasets for GLUE model training.
+    Input is preprocessed RNA/ATAC h5ad files and guidance graph, output is configured h5ad files and HVF-filtered guidance graph.
+    """
+    # Input file validation
+    if rna_path is None:
+        raise ValueError("Path to RNA-seq data file must be provided")
+    if atac_path is None:
+        raise ValueError("Path to ATAC-seq data file must be provided")
+    if guidance_path is None:
+        raise ValueError("Path to guidance graph file must be provided")
+    # File existence validation
+    rna_file = Path(rna_path)
+    if not rna_file.exists():
+        raise FileNotFoundError(f"RNA-seq file not found: {rna_path}")
+    atac_file = Path(atac_path)
+    if not atac_file.exists():
+        raise FileNotFoundError(f"ATAC-seq file not found: {atac_path}")
+    guidance_file = Path(guidance_path)
+    if not guidance_file.exists():
+        raise FileNotFoundError(f"Guidance graph file not found: {guidance_path}")
+    # Load data
+    rna = ad.read_h5ad(rna_path)
+    atac = ad.read_h5ad(atac_path)
+    guidance = nx.read_graphml(guidance_path)
+    # Configure datasets
+    scglue.models.configure_dataset(
+        rna,
+        prob_model,
+        use_highly_variable=use_highly_variable,
+        use_layer=rna_use_layer,
+        use_rep=rna_use_rep,
+    )
+    scglue.models.configure_dataset(
+        atac, prob_model, use_highly_variable=use_highly_variable, use_rep=atac_use_rep
+    )
+    # Extract subgraph with highly variable features
+    guidance_hvf = guidance.subgraph(
+        chain(
+            rna.var.query("highly_variable").index,
+            atac.var.query("highly_variable").index,
+        )
+    ).copy()
+    # Note: anndata drops None values during save/load, but scglue's configure_dataset
+    # creates these fields. We preserve them by converting None to a special string marker.
+    for adata in [rna, atac]:
+        if "__scglue__" in adata.uns:
+            config = adata.uns["__scglue__"]
+            # Convert None values to string markers that will survive serialization
+            for key in [
+                "batches",
+                "use_batch",
+                "use_cell_type",
+                "cell_types",
+                "use_dsc_weight",
+                "use_layer",
+            ]:
+                if key in config and config[key] is None:
+                    config[key] = "__none__"
+    # Save configured datasets and HVF guidance graph
+    if out_prefix is None:
+        out_prefix = f"glue_configured_{timestamp}"
+    rna_output = OUTPUT_DIR / f"{out_prefix}_rna_configured.h5ad"
+    atac_output = OUTPUT_DIR / f"{out_prefix}_atac_configured.h5ad"
+    guidance_hvf_output = OUTPUT_DIR / f"{out_prefix}_guidance_hvf.graphml.gz"
+    rna.write(str(rna_output), compression="gzip")
+    atac.write(str(atac_output), compression="gzip")
+    nx.write_graphml(guidance_hvf, str(guidance_hvf_output))
+    # Return standardized format
+    return {
+        "message": f"Configured datasets with {len(rna.var.query('highly_variable'))} RNA and {len(atac.var.query('highly_variable'))} ATAC HVFs",
+        "reference": "https://github.com/gao-lab/GLUE/blob/master/docs/training.ipynb",
+        "artifacts": [
+            {
+                "description": "Configured RNA-seq data",
+                "path": str(rna_output.resolve()),
+            },
+            {
+                "description": "Configured ATAC-seq data",
+                "path": str(atac_output.resolve()),
+            },
+            {
+                "description": "HVF-filtered guidance graph",
+                "path": str(guidance_hvf_output.resolve()),
+            },
+        ],
+    }
+@training_mcp.tool
+def glue_train_model(
+    # Primary data inputs
+    rna_path: Annotated[
+        str | None, "Path to configured RNA-seq data file with extension .h5ad"
+    ] = None,
+    atac_path: Annotated[
+        str | None, "Path to configured ATAC-seq data file with extension .h5ad"
+    ] = None,
+    guidance_hvf_path: Annotated[
+        str | None,
+        "Path to HVF-filtered guidance graph file with extension .graphml.gz",
+    ] = None,
+    # Training parameters
+    training_dir: Annotated[
+        str | None, "Directory to store model snapshots and training logs"
+    ] = None,
+    out_prefix: Annotated[str | None, "Output file prefix"] = None,
+) -> dict:
+    """
+    Train GLUE model for multi-omics integration.
+    Input is configured RNA/ATAC h5ad files and HVF guidance graph, output is trained GLUE model.
+    """
+    # Input file validation
+    if rna_path is None:
+        raise ValueError("Path to configured RNA-seq data file must be provided")
+    if atac_path is None:
+        raise ValueError("Path to configured ATAC-seq data file must be provided")
+    if guidance_hvf_path is None:
+        raise ValueError("Path to HVF-filtered guidance graph file must be provided")
+    # File existence validation
+    rna_file = Path(rna_path)
+    if not rna_file.exists():
+        raise FileNotFoundError(f"RNA-seq file not found: {rna_path}")
+    atac_file = Path(atac_path)
+    if not atac_file.exists():
+        raise FileNotFoundError(f"ATAC-seq file not found: {atac_path}")
+    guidance_hvf_file = Path(guidance_hvf_path)
+    if not guidance_hvf_file.exists():
+        raise FileNotFoundError(
+            f"Guidance HVF graph file not found: {guidance_hvf_path}"
+        )
+    # Load data
+    rna = ad.read_h5ad(rna_path)
+    atac = ad.read_h5ad(atac_path)
+    guidance_hvf = nx.read_graphml(guidance_hvf_path)
+    # Convert string markers back to None for scglue compatibility
+    for adata in [rna, atac]:
+        if "__scglue__" in adata.uns:
+            config = adata.uns["__scglue__"]
+            for key in [
+                "batches",
+                "use_batch",
+                "use_cell_type",
+                "cell_types",
+                "use_dsc_weight",
+                "use_layer",
+            ]:
+                if key in config and config[key] == "__none__":
+                    config[key] = None
+    # Set training directory
+    if training_dir is None:
+        if out_prefix is None:
+            out_prefix = f"glue_model_{timestamp}"
+        training_dir = str(OUTPUT_DIR / f"{out_prefix}_training")
+    # Create training directory
+    Path(training_dir).mkdir(parents=True, exist_ok=True)
+    # Train GLUE model
+    glue = scglue.models.fit_SCGLUE(
+        {"rna": rna, "atac": atac}, guidance_hvf, fit_kws={"directory": training_dir}
+    )
+    # Save trained model
+    if out_prefix is None:
+        out_prefix = f"glue_model_{timestamp}"
+    model_output = OUTPUT_DIR / f"{out_prefix}.dill"
+    glue.save(str(model_output))
+    # Return standardized format
+    return {
+        "message": "GLUE model training completed successfully",
+        "reference": "https://github.com/gao-lab/GLUE/blob/master/docs/training.ipynb",
+        "artifacts": [
+            {"description": "Trained GLUE model", "path": str(model_output.resolve())},
+            {
+                "description": "Training logs directory",
+                "path": str(Path(training_dir).resolve()),
+            },
+        ],
+    }
+@training_mcp.tool
+def glue_check_integration_consistency(
+    # Primary data inputs
+    model_path: Annotated[
+        str | None, "Path to trained GLUE model file with extension .dill"
+    ] = None,
+    rna_path: Annotated[
+        str | None, "Path to configured RNA-seq data file with extension .h5ad"
+    ] = None,
+    atac_path: Annotated[
+        str | None, "Path to configured ATAC-seq data file with extension .h5ad"
+    ] = None,
+    guidance_hvf_path: Annotated[
+        str | None,
+        "Path to HVF-filtered guidance graph file with extension .graphml.gz",
+    ] = None,
+    out_prefix: Annotated[str | None, "Output file prefix"] = None,
+) -> dict:
+    """
+    Evaluate integration quality with consistency scores across metacell granularities.
+    Input is trained model, RNA/ATAC data, and HVF guidance graph, output is consistency scores table and plot.
+    """
+    # Input file validation
+    if model_path is None:
+        raise ValueError("Path to trained GLUE model file must be provided")
+    if rna_path is None:
+        raise ValueError("Path to configured RNA-seq data file must be provided")
+    if atac_path is None:
+        raise ValueError("Path to configured ATAC-seq data file must be provided")
+    if guidance_hvf_path is None:
+        raise ValueError("Path to HVF-filtered guidance graph file must be provided")
+    # File existence validation
+    model_file = Path(model_path)
+    if not model_file.exists():
+        raise FileNotFoundError(f"Model file not found: {model_path}")
+    rna_file = Path(rna_path)
+    if not rna_file.exists():
+        raise FileNotFoundError(f"RNA-seq file not found: {rna_path}")
+    atac_file = Path(atac_path)
+    if not atac_file.exists():
+        raise FileNotFoundError(f"ATAC-seq file not found: {atac_path}")
+    guidance_hvf_file = Path(guidance_hvf_path)
+    if not guidance_hvf_file.exists():
+        raise FileNotFoundError(
+            f"Guidance HVF graph file not found: {guidance_hvf_path}"
+        )
+    # Load data
+    glue = scglue.models.load_model(model_path)
+    rna = ad.read_h5ad(rna_path)
+    atac = ad.read_h5ad(atac_path)
+    guidance_hvf = nx.read_graphml(guidance_hvf_path)
+    # Convert string markers back to None for scglue compatibility
+    for adata in [rna, atac]:
+        if "__scglue__" in adata.uns:
+            config = adata.uns["__scglue__"]
+            for key in [
+                "batches",
+                "use_batch",
+                "use_cell_type",
+                "cell_types",
+                "use_dsc_weight",
+                "use_layer",
+            ]:
+                if key in config and config[key] == "__none__":
+                    config[key] = None
+    # Compute integration consistency
+    dx = scglue.models.integration_consistency(
+        glue, {"rna": rna, "atac": atac}, guidance_hvf
+    )
+    # Save consistency scores
+    if out_prefix is None:
+        out_prefix = f"glue_consistency_{timestamp}"
+    consistency_table = OUTPUT_DIR / f"{out_prefix}_scores.csv"
+    dx.to_csv(str(consistency_table), index=False)
+    # Generate consistency plot
+    plt.figure(figsize=(4, 4))
+    ax = sns.lineplot(x="n_meta", y="consistency", data=dx)
+    ax.axhline(y=0.05, c="darkred", ls="--")
+    plt.xlabel("Number of metacells")
+    plt.ylabel("Consistency score")
+    plt.tight_layout()
+    consistency_plot = OUTPUT_DIR / f"{out_prefix}_plot.png"
+    plt.savefig(str(consistency_plot), dpi=300, bbox_inches="tight")
+    plt.close()
+    # Return standardized format
+    return {
+        "message": f"Integration consistency computed (range: {dx['consistency'].min():.3f}-{dx['consistency'].max():.3f})",
+        "reference": "https://github.com/gao-lab/GLUE/blob/master/docs/training.ipynb",
+        "artifacts": [
+            {
+                "description": "Consistency scores table",
+                "path": str(consistency_table.resolve()),
+            },
+            {
+                "description": "Consistency plot",
+                "path": str(consistency_plot.resolve()),
+            },
+        ],
+    }
+@training_mcp.tool
+def glue_generate_embeddings(
+    # Primary data inputs
+    model_path: Annotated[
+        str | None, "Path to trained GLUE model file with extension .dill"
+    ] = None,
+    rna_path: Annotated[
+        str | None, "Path to configured RNA-seq data file with extension .h5ad"
+    ] = None,
+    atac_path: Annotated[
+        str | None, "Path to configured ATAC-seq data file with extension .h5ad"
+    ] = None,
+    guidance_hvf_path: Annotated[
+        str | None,
+        "Path to HVF-filtered guidance graph file with extension .graphml.gz",
+    ] = None,
+    # Visualization parameters with tutorial defaults
+    color_vars: Annotated[list, "Variables to color UMAP by"] = ["cell_type", "domain"],
+    out_prefix: Annotated[str | None, "Output file prefix"] = None,
+) -> dict:
+    """
+    Generate cell and feature embeddings from trained GLUE model and visualize alignment.
+    Input is trained model and RNA/ATAC data, output is h5ad files with embeddings and UMAP visualization.
+    """
+    # Input file validation
+    if model_path is None:
+        raise ValueError("Path to trained GLUE model file must be provided")
+    if rna_path is None:
+        raise ValueError("Path to configured RNA-seq data file must be provided")
+    if atac_path is None:
+        raise ValueError("Path to configured ATAC-seq data file must be provided")
+    if guidance_hvf_path is None:
+        raise ValueError("Path to HVF-filtered guidance graph file must be provided")
+    # File existence validation
+    model_file = Path(model_path)
+    if not model_file.exists():
+        raise FileNotFoundError(f"Model file not found: {model_path}")
+    rna_file = Path(rna_path)
+    if not rna_file.exists():
+        raise FileNotFoundError(f"RNA-seq file not found: {rna_path}")
+    atac_file = Path(atac_path)
+    if not atac_file.exists():
+        raise FileNotFoundError(f"ATAC-seq file not found: {atac_path}")
+    guidance_hvf_file = Path(guidance_hvf_path)
+    if not guidance_hvf_file.exists():
+        raise FileNotFoundError(
+            f"Guidance HVF graph file not found: {guidance_hvf_path}"
+        )
+    # Load data
+    glue = scglue.models.load_model(model_path)
+    rna = ad.read_h5ad(rna_path)
+    atac = ad.read_h5ad(atac_path)
+    guidance_hvf = nx.read_graphml(guidance_hvf_path)
+    # Convert string markers back to None for scglue compatibility
+    for adata in [rna, atac]:
+        if "__scglue__" in adata.uns:
+            config = adata.uns["__scglue__"]
+            for key in [
+                "batches",
+                "use_batch",
+                "use_cell_type",
+                "cell_types",
+                "use_dsc_weight",
+                "use_layer",
+            ]:
+                if key in config and config[key] == "__none__":
+                    config[key] = None
+    # Generate cell embeddings
+    rna.obsm["X_glue"] = glue.encode_data("rna", rna)
+    atac.obsm["X_glue"] = glue.encode_data("atac", atac)
+    # Generate feature embeddings
+    feature_embeddings = glue.encode_graph(guidance_hvf)
+    feature_embeddings = pd.DataFrame(feature_embeddings, index=glue.vertices)
+    rna.varm["X_glue"] = feature_embeddings.reindex(rna.var_names).to_numpy()
+    atac.varm["X_glue"] = feature_embeddings.reindex(atac.var_names).to_numpy()
+    # Create combined dataset for visualization
+    combined = ad.concat([rna, atac])
+    # Generate UMAP visualization
+    sc.pp.neighbors(combined, use_rep="X_glue", metric="cosine")
+    sc.tl.umap(combined)
+    sc.pl.umap(combined, color=color_vars, wspace=0.65)
+    # Save UMAP plot
+    if out_prefix is None:
+        out_prefix = f"glue_embeddings_{timestamp}"
+    umap_plot = OUTPUT_DIR / f"{out_prefix}_umap.png"
+    plt.savefig(str(umap_plot), dpi=300, bbox_inches="tight")
+    plt.close()
+    # Save h5ad files with embeddings
+    rna_output = OUTPUT_DIR / f"{out_prefix}_rna_emb.h5ad"
+    atac_output = OUTPUT_DIR / f"{out_prefix}_atac_emb.h5ad"
+    guidance_hvf_output = OUTPUT_DIR / f"{out_prefix}_guidance_hvf.graphml.gz"
+    rna.write(str(rna_output), compression="gzip")
+    atac.write(str(atac_output), compression="gzip")
+    nx.write_graphml(guidance_hvf, str(guidance_hvf_output))
+    # Return standardized format
+    return {
+        "message": f"Generated embeddings for {rna.n_obs} RNA and {atac.n_obs} ATAC cells",
+        "reference": "https://github.com/gao-lab/GLUE/blob/master/docs/training.ipynb",
+        "artifacts": [
+            {
+                "description": "RNA data with embeddings",
+                "path": str(rna_output.resolve()),
+            },
+            {
+                "description": "ATAC data with embeddings",
+                "path": str(atac_output.resolve()),
+            },
+            {
+                "description": "HVF guidance graph",
+                "path": str(guidance_hvf_output.resolve()),
+            },
+            {"description": "UMAP visualization", "path": str(umap_plot.resolve())},
+        ],
+    }

tools/training_summary.md ADDED Viewed

	@@ -0,0 +1,133 @@

+# Training Tutorial - Tool Extraction Summary
+## Source Information
+- **Tutorial**: GLUE model training workflow
+- **Source URL**: https://github.com/gao-lab/GLUE/blob/master/docs/training.ipynb
+- **Notebook**: notebooks/training/training_execution_final.ipynb
+- **Output File**: src/tools/training.py
+## Extracted Tools
+### 1. glue_configure_datasets
+**Purpose**: Configure RNA-seq and ATAC-seq datasets for GLUE model training
+**When to use**: First step in GLUE workflow after preprocessing; prepares datasets for model training
+**Inputs**:
+- `rna_path`: Preprocessed RNA-seq h5ad file
+- `atac_path`: Preprocessed ATAC-seq h5ad file
+- `guidance_path`: Guidance graph file
+- Configuration parameters (prob_model, use_highly_variable, etc.)
+**Outputs**:
+- Configured RNA h5ad file
+- Configured ATAC h5ad file
+- HVF-filtered guidance graph
+**Tutorial Section**: "Configure data"
+---
+### 2. glue_train_model
+**Purpose**: Train GLUE model for multi-omics integration
+**When to use**: After configuring datasets; core model training step
+**Inputs**:
+- `rna_path`: Configured RNA-seq h5ad file
+- `atac_path`: Configured ATAC-seq h5ad file
+- `guidance_hvf_path`: HVF-filtered guidance graph
+- `training_dir`: Directory for model snapshots and logs (optional)
+**Outputs**:
+- Trained GLUE model (.dill file)
+- Training logs directory
+**Tutorial Section**: "Train GLUE model"
+---
+### 3. glue_check_integration_consistency
+**Purpose**: Evaluate integration quality with consistency scores
+**When to use**: After model training to validate integration quality
+**Inputs**:
+- `model_path`: Trained GLUE model file
+- `rna_path`: Configured RNA-seq h5ad file
+- `atac_path`: Configured ATAC-seq h5ad file
+- `guidance_hvf_path`: HVF-filtered guidance graph
+**Outputs**:
+- Consistency scores table (CSV)
+- Consistency plot (PNG)
+**Tutorial Section**: "Check integration diagnostics"
+**Interpretation**: Consistency scores above 0.05 indicate reliable integration
+---
+### 4. glue_generate_embeddings
+**Purpose**: Generate cell and feature embeddings from trained GLUE model and visualize alignment
+**When to use**: After successful model training and validation; produces final embeddings for downstream analysis
+**Inputs**:
+- `model_path`: Trained GLUE model file
+- `rna_path`: Configured RNA-seq h5ad file
+- `atac_path`: Configured ATAC-seq h5ad file
+- `guidance_hvf_path`: HVF-filtered guidance graph
+- `color_vars`: Variables to color UMAP by (default: ["cell_type", "domain"])
+**Outputs**:
+- RNA h5ad with cell and feature embeddings
+- ATAC h5ad with cell and feature embeddings
+- HVF guidance graph
+- UMAP visualization (PNG)
+**Tutorial Section**: "Apply model for cell and feature embedding"
+---
+## Typical Workflow
+```
+1. glue_configure_datasets
+   ↓ (produces configured h5ad files + HVF guidance graph)
+2. glue_train_model
+   ↓ (produces trained model)
+3. glue_check_integration_consistency
+   ↓ (validates integration quality)
+4. glue_generate_embeddings
+   ↓ (produces final embeddings for downstream analysis)
+```
+## Key Design Decisions
+1. **Parameter Preservation**: All function calls exactly match the tutorial - no additional parameters added
+2. **Structure Preservation**: Data structures like lists are preserved exactly as in tutorial
+3. **Input Design**: All tools use file paths as primary inputs for maximum reusability
+4. **Workflow Integration**: Tools designed for sequential execution matching tutorial flow
+5. **Output Completeness**: All code-generated figures and essential data are saved automatically
+## Quality Validation
+All 4 tools passed comprehensive quality review on first iteration:
+- ✓ Tool design validation
+- ✓ Input/output validation
+- ✓ Tutorial logic adherence validation
+- ✓ Implementation quality checks
+- ✓ Syntax and import verification
+## Testing Readiness
+The implementation is production-ready and follows all extraction guidelines:
+- Conservative approach with exact tutorial fidelity
+- Scientific rigor maintained throughout
+- Real-world applicability for user data
+- No mock data or demonstration code
+- Ready for testing phase