Dylan Mann-Krzisnik

Implementation Log: GLUE Preprocessing Tools

Tutorial Source: gao-lab/GLUE/blob/master/docs/preprocessing.ipynb
Implementation Date: 2026-02-14
Output File: src/tools/preprocessing.py

Tool Design Decisions

Tools Extracted (3 tools)

  1. glue_preprocess_scrna

    • Section: "Preprocess scRNA-seq data"
    • Rationale: Complete preprocessing workflow for scRNA-seq data including HVG selection, normalization, scaling, PCA, and UMAP visualization
    • Classification: Applicable to New Data - performs standard scRNA-seq preprocessing on any raw count matrix
    • Parameters Preserved: n_top_genes=2000, flavor="seurat_v3", n_comps=100, svd_solver="auto" (all explicitly set in the tutorial and kept as defaults)
    • Parameters Parameterized: color_var="cell_type" is tutorial-specific and must be configurable for user datasets
  2. glue_preprocess_scatac

    • Section: "Preprocess scATAC-seq data"
    • Rationale: Complete preprocessing workflow for scATAC-seq data using LSI dimension reduction and UMAP visualization
    • Classification: Applicable to New Data - performs standard scATAC-seq preprocessing on any raw accessibility matrix
    • Parameters Preserved: n_components=100, n_iter=15 (both explicitly set in the tutorial and kept as defaults)
    • Parameters Parameterized: color_var="cell_type" is tutorial-specific and must be configurable for user datasets
  3. glue_construct_regulatory_graph

    • Section: "Construct prior regulatory graph"
    • Rationale: Constructs prior regulatory graph linking RNA and ATAC features via genomic proximity
    • Classification: Applicable to New Data - essential for GLUE model training with any paired RNA-ATAC datasets
    • Parameters Preserved: gtf_by="gene_name" as tutorial default
    • Input Requirements: Requires GTF annotation file which users must provide for their organism

Tools Excluded (1 tool)

  1. glue_read_paired_data (initially present, removed in revision)
    • Section: "Read data"
    • Rationale for Exclusion: Only loads tutorial example data with no analytical transformation
    • Classification: NOT Applicable to New Data - data loading is trivial and should be handled by users

Parameter Design Rationale

Primary Data Inputs

  • All tools use file paths as primary inputs (h5ad format for AnnData objects)
  • No data object parameters (e.g., adata: AnnData) to enforce file-based workflow
  • All data paths default to None with validation in function body for clear error messages
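The path handling described above could look roughly like the following. This is a minimal sketch, not the code in src/tools/preprocessing.py; the helper name `resolve_input_path` and the exact error messages are assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional


def resolve_input_path(data_path: Optional[str] = None, name: str = "data_path") -> Path:
    """Validate a required file-path argument that defaults to None.

    Hypothetical helper: the name and messages are illustrative only.
    """
    if data_path is None:
        # The parameter defaults to None, so a missing value is reported explicitly.
        raise ValueError(f"{name} must be provided (path to an .h5ad file)")
    path = Path(data_path).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"{name} does not exist: {path}")
    return path
```

Checking for None inside the function body, rather than making the argument positional, lets the tool report which input is missing in its own words.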

Analysis Parameters

Parameters Explicitly Set in Tutorial (Parameterized):

  • n_top_genes=2000, flavor="seurat_v3" - Tutorial shows explicit values for HVG selection
  • n_comps=100, svd_solver="auto" - Tutorial shows explicit values for PCA
  • n_components=100, n_iter=15 - Tutorial shows explicit values for LSI
  • gtf_by="gene_name" - Tutorial shows explicit attribute for GTF parsing

Tutorial-Specific Values (Parameterized):

  • color_var="cell_type" - Column name specific to tutorial dataset, must be configurable for user data

Calls Preserved As-Is (Not Parameterized):

  • sc.pp.neighbors(rna, metric="cosine") - Tutorial sets metric="cosine" explicitly; the exact call is preserved as-is
  • sc.pp.normalize_total(rna) - No parameters in tutorial, using library defaults
  • sc.pp.log1p(rna) - No parameters in tutorial, using library defaults
  • sc.pp.scale(rna) - No parameters in tutorial, using library defaults

Critical Rule Adherence

NEVER ADD PARAMETERS NOT IN TUTORIAL: Every function parameter corresponds to a value set explicitly in the tutorial code. None were added beyond what the original tutorial shows.

PRESERVE EXACT TUTORIAL STRUCTURE: All function calls preserve the exact structure from the tutorial:

  • sc.pp.highly_variable_genes(rna, n_top_genes=2000, flavor="seurat_v3") → parameterized as shown
  • sc.tl.pca(rna, n_comps=100, svd_solver="auto") → parameterized as shown
  • scglue.data.lsi(atac, n_components=100, n_iter=15) → parameterized as shown
  • sc.pp.neighbors(rna, metric="cosine") → preserved exactly as shown
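A minimal sketch of how the RNA calls listed above might be wrapped with the tutorial values as defaults. The function name `preprocess_scrna_sketch` is hypothetical, and it takes an in-memory AnnData object for brevity, whereas the actual tools take file paths. The final `sc.tl.umap` call is an assumption based on the UMAP step this log describes. scanpy is imported inside the function so the signature can be inspected without it installed.

```python
def preprocess_scrna_sketch(
    rna,
    n_top_genes: int = 2000,
    flavor: str = "seurat_v3",
    n_comps: int = 100,
    svd_solver: str = "auto",
):
    """Run the scRNA-seq steps in tutorial order (illustrative sketch only)."""
    import scanpy as sc  # lazy import: heavy dependency, needed only when called

    # HVG selection uses the parameterized tutorial values.
    sc.pp.highly_variable_genes(rna, n_top_genes=n_top_genes, flavor=flavor)
    # These three calls take no arguments in the tutorial, so library defaults apply.
    sc.pp.normalize_total(rna)
    sc.pp.log1p(rna)
    sc.pp.scale(rna)
    # PCA uses the parameterized tutorial values.
    sc.tl.pca(rna, n_comps=n_comps, svd_solver=svd_solver)
    # Preserved exactly as in the tutorial; metric is not exposed as a parameter.
    sc.pp.neighbors(rna, metric="cosine")
    sc.tl.umap(rna)
    return rna
```

Only values the tutorial sets explicitly appear in the signature; the neighbor metric stays hard-coded because the call is preserved as-is.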

Output Requirements

Visualization Outputs

Code-Generated Figures Only:

  • glue_preprocess_scrna: UMAP visualization of RNA data (from tutorial section "Optionally, we can visualize...")
  • glue_preprocess_scatac: UMAP visualization of ATAC data (from tutorial section "Optionally, we may also visualize...")
  • No static figures or diagrams included (tutorial has none)

Figure Specifications:

  • Format: PNG with dpi=300, bbox_inches='tight'
  • Naming: {out_prefix}_umap_{timestamp}.png
  • Always generated (no user control parameter)

Data Outputs

Essential Results Saved:

  • Preprocessed AnnData objects with all transformations applied
  • Guidance graph in NetworkX GraphML format
  • Annotated data with genomic coordinates

File Formats:

  • AnnData: h5ad with gzip compression (standard for single-cell data)
  • Graph: graphml.gz (standard for NetworkX graphs)

Naming Convention:

  • {out_prefix}_preprocessed_{timestamp}.h5ad
  • {out_prefix}_graph_{timestamp}.graphml.gz
  • {out_prefix}_rna_annotated_{timestamp}.h5ad
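The naming patterns in this log can be generated from a single lookup table, as in the sketch below. The helper name `output_name`, the `kind` keys, and the timestamp layout (`%Y%m%d_%H%M%S`) are assumptions; the log does not state which timestamp format the tools use.

```python
from datetime import datetime
from typing import Optional

# Patterns copied from the naming conventions in this log.
_PATTERNS = {
    "umap": "{out_prefix}_umap_{timestamp}.png",
    "preprocessed": "{out_prefix}_preprocessed_{timestamp}.h5ad",
    "graph": "{out_prefix}_graph_{timestamp}.graphml.gz",
    "rna_annotated": "{out_prefix}_rna_annotated_{timestamp}.h5ad",
}


def output_name(kind: str, out_prefix: str, now: Optional[datetime] = None) -> str:
    """Build an output file name for one artifact kind (hypothetical helper)."""
    # Assumed timestamp layout; passing `now` makes the result reproducible.
    timestamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return _PATTERNS[kind].format(out_prefix=out_prefix, timestamp=timestamp)
```

For example, `output_name("graph", "run1", datetime(2026, 2, 14, 9, 30, 0))` yields `run1_graph_20260214_093000.graphml.gz` under the assumed timestamp layout.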

Return Format

All tools return standardized dict:

{
    "message": "<concise status ≤120 chars>",
    "reference": "https://github.com/gao-lab/GLUE/blob/master/docs/preprocessing.ipynb",
    "artifacts": [
        {
            "description": "<description ≤50 chars>",
            "path": "/absolute/path/to/file"
        }
    ]
}
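
One way to assemble this dict while enforcing the length limits and absolute paths is sketched below. The helper `build_result` is hypothetical and not part of the implementation; only the keys, the character limits, and the reference URL come from the format above.

```python
from pathlib import Path
from typing import Dict, List, Tuple

REFERENCE_URL = "https://github.com/gao-lab/GLUE/blob/master/docs/preprocessing.ipynb"


def build_result(message: str, artifacts: List[Tuple[str, str]]) -> Dict[str, object]:
    """Assemble the standardized return dict (illustrative helper).

    `artifacts` is a list of (description, path) pairs.
    """
    if len(message) > 120:
        raise ValueError("message must be at most 120 characters")
    items = []
    for description, path in artifacts:
        if len(description) > 50:
            raise ValueError("artifact description must be at most 50 characters")
        # Paths are reported as absolute, matching the format above.
        items.append({"description": description, "path": str(Path(path).resolve())})
    return {"message": message, "reference": REFERENCE_URL, "artifacts": items}
```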

Quality Review Results

Iteration 1 (Final)

Date: 2026-02-14
Status: All checks passed

Tool Design Validation: [✓] All 7 checks passed

  • Tool definition, naming, description, classification, order, boundaries, independence all correct

Implementation Validation: [✓] All 8 checks passed

  • Function coverage, parameter design, input validation, tutorial fidelity, real-world focus, no hardcoding, library compliance, exact function calls all correct

Output Validation: [✓] All 5 checks passed

  • Figure generation, data outputs, return format, file paths, reference links all correct

Code Quality Validation: [✓] All 6 checks passed

  • Error handling, type annotations, documentation, template compliance, import management, environment setup all correct

Summary: 3/3 tools passing all checks. No issues found. Implementation is production-ready.

Implementation Choices

Libraries Used

  • anndata: Standard format for single-cell data (AnnData objects)
  • scanpy: Standard toolkit for scRNA-seq analysis (HVG, normalization, PCA, UMAP)
  • scglue: GLUE-specific functions (LSI, graph construction, gene annotation)
  • networkx: Standard graph library for guidance graph representation
  • matplotlib: Visualization library for UMAP plots

Error Handling Approach

Basic Input Validation Only:

  • Required parameter validation (data_path must be provided)
  • File existence checks (FileNotFoundError if file not found)
  • No intermediate processing validation (trust library error messages)

Rationale: Tutorial assumes valid input data. Error handling focused on user input mistakes, not data quality issues.

Parameterization Rationale

Why Parameterize color_var?

  • Tutorial uses "cell_type" which is a column specific to the tutorial dataset
  • User datasets will have different column names for cell annotations
  • Parameterizing it lets the tool work with any AnnData object, whatever its annotation column is named

Why Parameterize gtf_by?

  • Tutorial uses "gene_name" attribute in GTF, but GTF files can use different attributes
  • Some GTF files use "gene_id", "transcript_name", or other attributes
  • Parameterizing it lets the tool work with different GTF annotation standards

Why Keep Default n_top_genes=2000?

  • This is a standard value in single-cell RNA-seq analysis
  • The tutorial sets this value explicitly rather than relying on the library default
  • Value represents a scientific choice about feature selection stringency

Why Keep Default n_components=100?

  • This is the standard dimensionality for GLUE model training
  • Tutorial explicitly sets this value for downstream model compatibility
  • Changing this value would require adjusting the GLUE model architecture

Known Limitations

  1. Coordinate Extraction Assumption: glue_construct_regulatory_graph assumes ATAC peak names follow the format "chr:start-end". If user data uses different formats (e.g., "chr_start_end" or "chr:start:end"), the coordinate extraction will fail. Users must ensure their peak names follow the expected format or pre-process their data.

  2. GTF Compatibility: Gene annotation requires a GTF file that contains the attribute named by gtf_by (default: "gene_name"). Not all GTF files are compatible, so users must confirm that their file contains the required attribute.

  3. Memory Requirements: LSI and PCA operations on large datasets can be memory-intensive. Users with datasets >100k cells may encounter memory issues on standard workstations.

  4. Visualization Dependency: UMAP visualizations require the color_var column to exist in the AnnData object. If the column is missing, the tool will fail. Users must ensure their data contains the specified annotation column.

  5. File Format Constraints: Tools only accept h5ad format for input/output. Users with data in other formats (csv, mtx, loom) must convert to h5ad before using these tools.
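
For the first limitation above, users could pre-check their peak names before running glue_construct_regulatory_graph. The sketch below is a hypothetical stand-alone validator, not the tool's own parsing code; it accepts only the "chr:start-end" form and rejects the alternative formats mentioned.

```python
import re
from typing import Tuple

# Matches exactly "<chrom>:<start>-<end>" with integer coordinates.
_PEAK_RE = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_peak_name(name: str) -> Tuple[str, int, int]:
    """Split a peak name into (chrom, start, end) or raise ValueError."""
    match = _PEAK_RE.match(name)
    if match is None:
        raise ValueError(f"peak name {name!r} is not in 'chr:start-end' format")
    return match["chrom"], int(match["start"]), int(match["end"])
```

Running this over every ATAC feature name before graph construction surfaces the first malformed name with a clear message instead of a failure deep inside coordinate extraction.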

Testing Recommendations

  1. Test with tutorial data: Verify tools reproduce exact tutorial results with Chen-2019 dataset
  2. Test with different organisms: Verify GTF annotation works with different reference genomes
  3. Test with different annotation columns: Verify color_var parameter works with different metadata
  4. Test with edge cases:
    • Very small datasets (<100 cells)
    • Very large datasets (>100k cells)
    • Datasets with missing or malformed peak coordinates
    • GTF files with different attribute names

Revision History

Initial Implementation

  • 4 tools: glue_read_paired_data, glue_preprocess_scrna, glue_preprocess_scatac, glue_construct_guidance_graph

Revision 1 (2026-02-14)

Changes Made:

  1. Removed glue_read_paired_data tool: Classified as NOT Applicable to New Data (only loads tutorial data without analytical transformation)
  2. Renamed glue_construct_guidance_graph to glue_construct_regulatory_graph: Better matches tutorial section title "Construct prior regulatory graph"
  3. Updated documentation: Corrected tool count from 4 to 3 tools

Rationale: Enforce strict adherence to "Applicable to New Data" classification. Data loading without analytical transformation should not be a standalone tool.

Result: All 3 remaining tools pass quality review with all checks passing.