Dylan Mann-Krzisnik
# Implementation Log: GLUE Preprocessing Tools
**Tutorial Source**: `gao-lab/GLUE/blob/master/docs/preprocessing.ipynb`
**Implementation Date**: 2026-02-14
**Output File**: `src/tools/preprocessing.py`
## Tool Design Decisions
### Tools Extracted (3 tools)
1. **glue_preprocess_scrna**
- **Section**: "Preprocess scRNA-seq data"
- **Rationale**: Complete preprocessing workflow for scRNA-seq data including HVG selection, normalization, scaling, PCA, and UMAP visualization
- **Classification**: Applicable to New Data - performs standard scRNA-seq preprocessing on any raw count matrix
- **Parameters Preserved**: `n_top_genes=2000`, `flavor="seurat_v3"`, `n_comps=100`, `svd_solver="auto"` all explicitly set in tutorial
- **Parameters Parameterized**: `color_var="cell_type"` is tutorial-specific and must be configurable for user datasets
2. **glue_preprocess_scatac**
- **Section**: "Preprocess scATAC-seq data"
- **Rationale**: Complete preprocessing workflow for scATAC-seq data using LSI dimension reduction and UMAP visualization
- **Classification**: Applicable to New Data - performs standard scATAC-seq preprocessing on any raw accessibility matrix
- **Parameters Preserved**: `n_components=100`, `n_iter=15` explicitly set in tutorial
- **Parameters Parameterized**: `color_var="cell_type"` is tutorial-specific and must be configurable for user datasets
3. **glue_construct_regulatory_graph**
- **Section**: "Construct prior regulatory graph"
- **Rationale**: Constructs prior regulatory graph linking RNA and ATAC features via genomic proximity
- **Classification**: Applicable to New Data - essential for GLUE model training with any paired RNA-ATAC datasets
- **Parameters Preserved**: `gtf_by="gene_name"` as tutorial default
- **Input Requirements**: Requires GTF annotation file which users must provide for their organism
### Tools Excluded (1 tool)
1. **glue_read_paired_data** (initially present, removed in revision)
- **Section**: "Read data"
- **Rationale for Exclusion**: Only loads tutorial example data with no analytical transformation
- **Classification**: NOT Applicable to New Data - data loading is trivial and should be handled by users
## Parameter Design Rationale
### Primary Data Inputs
- All tools use **file paths** as primary inputs (h5ad format for AnnData objects)
- No data-object parameters (e.g., `adata: AnnData`), which enforces a file-based workflow
- All data paths default to `None` and are validated in the function body so that missing inputs produce clear error messages
### Analysis Parameters
**Parameters Explicitly Set in Tutorial (Parameterized, Tutorial Values as Defaults)**:
- `n_top_genes=2000`, `flavor="seurat_v3"` - Tutorial shows explicit values for HVG selection
- `n_comps=100`, `svd_solver="auto"` - Tutorial shows explicit values for PCA
- `n_components=100`, `n_iter=15` - Tutorial shows explicit values for LSI
- `gtf_by="gene_name"` - Tutorial shows explicit attribute for GTF parsing
**Tutorial-Specific Values (Parameterized)**:
- `color_var="cell_type"` - Column name specific to tutorial dataset, must be configurable for user data
**Tutorial Calls and Library Defaults (Preserved, Not Parameterized)**:
- `sc.pp.neighbors(rna, metric="cosine")` - Tutorial shows this exact call; `metric="cosine"` is kept fixed, preserved as-is
- `sc.pp.normalize_total(rna)` - No parameters in tutorial, using library defaults
- `sc.pp.log1p(rna)` - No parameters in tutorial, using library defaults
- `sc.pp.scale(rna)` - No parameters in tutorial, using library defaults
### Critical Rule Adherence
**NEVER ADD PARAMETERS NOT IN TUTORIAL**: All analysis parameters correspond to explicit values in the tutorial code; none were added beyond those the tutorial shows.
**PRESERVE EXACT TUTORIAL STRUCTURE**: All function calls preserve the exact structure from the tutorial:
- `sc.pp.highly_variable_genes(rna, n_top_genes=2000, flavor="seurat_v3")` → parameterized as shown
- `sc.tl.pca(rna, n_comps=100, svd_solver="auto")` → parameterized as shown
- `scglue.data.lsi(atac, n_components=100, n_iter=15)` → parameterized as shown
- `sc.pp.neighbors(rna, metric="cosine")` → preserved exactly as shown
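The scRNA-seq calls listed above could be wrapped roughly as follows. This is a minimal sketch, not the code in `src/tools/preprocessing.py`: the function name `preprocess_scrna_sketch` is hypothetical, it takes an in-memory AnnData rather than a file path to stay short, and the step order is assumed from the rationale for `glue_preprocess_scrna` (HVG selection, normalization, scaling, PCA, UMAP).

```python
def preprocess_scrna_sketch(
    adata,
    n_top_genes=2000,
    flavor="seurat_v3",
    n_comps=100,
    svd_solver="auto",
):
    """Apply the tutorial's scRNA-seq steps in place and return the AnnData."""
    # Imported lazily so this module can be loaded without scanpy installed.
    import scanpy as sc

    # Exposed as parameters, with the tutorial values as defaults.
    sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes, flavor=flavor)

    # Called without arguments in the tutorial, so library defaults apply.
    sc.pp.normalize_total(adata)
    sc.pp.log1p(adata)
    sc.pp.scale(adata)

    sc.tl.pca(adata, n_comps=n_comps, svd_solver=svd_solver)

    # metric="cosine" is kept fixed, exactly as the tutorial writes it.
    sc.pp.neighbors(adata, metric="cosine")
    sc.tl.umap(adata)
    return adata
```

Only the four keyword arguments are exposed; everything else stays hardcoded so that the call structure matches the tutorial.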
## Output Requirements
### Visualization Outputs
**Code-Generated Figures Only**:
- `glue_preprocess_scrna`: UMAP visualization of RNA data (from tutorial section "Optionally, we can visualize...")
- `glue_preprocess_scatac`: UMAP visualization of ATAC data (from tutorial section "Optionally, we may also visualize...")
- No static figures or diagrams included (tutorial has none)
**Figure Specifications**:
- Format: PNG with `dpi=300`, `bbox_inches='tight'`
- Naming: `{out_prefix}_umap_{timestamp}.png`
- Always generated (no user control parameter)
### Data Outputs
**Essential Results Saved**:
- Preprocessed AnnData objects with all transformations applied
- Guidance graph in NetworkX GraphML format
- Annotated data with genomic coordinates
**File Formats**:
- AnnData: h5ad with gzip compression (standard for single-cell data)
- Graph: graphml.gz (standard for NetworkX graphs)
**Naming Convention**:
- `{out_prefix}_preprocessed_{timestamp}.h5ad`
- `{out_prefix}_graph_{timestamp}.graphml.gz`
- `{out_prefix}_rna_annotated_{timestamp}.h5ad`
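The naming convention above can be sketched with a small helper. The function name `build_output_path`, its `kind`/`ext`/`out_dir` arguments, the `"pbmc"` prefix, and the timestamp layout `%Y%m%d_%H%M%S` are all assumptions made for illustration; the log does not state which timestamp format the tools use.

```python
from datetime import datetime
from pathlib import Path


def build_output_path(out_prefix, kind, ext, timestamp=None, out_dir="."):
    """Return an absolute path named {out_prefix}_{kind}_{timestamp}.{ext}."""
    if timestamp is None:
        # Assumed timestamp layout; the log does not specify one.
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return (Path(out_dir) / f"{out_prefix}_{kind}_{timestamp}.{ext}").resolve()


# A fixed timestamp keeps the example reproducible.
path = build_output_path("pbmc", "graph", "graphml.gz", timestamp="20260214_120000")
print(path.name)  # pbmc_graph_20260214_120000.graphml.gz
```

Resolving to an absolute path here matches the return format, which reports artifact paths as absolute.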
### Return Format
All tools return a standardized dict:
```python
{
    "message": "<concise status ≤120 chars>",
    "reference": "https://github.com/gao-lab/GLUE/blob/master/docs/preprocessing.ipynb",
    "artifacts": [
        {
            "description": "<description ≤50 chars>",
            "path": "/absolute/path/to/file"
        }
    ]
}
```
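One way to assemble this dict while enforcing the two length limits is sketched below. The helper name `build_result` and the choice to raise an error (rather than truncate) on overlong strings are assumptions, not behavior documented for the tools.

```python
from pathlib import Path

REFERENCE_URL = "https://github.com/gao-lab/GLUE/blob/master/docs/preprocessing.ipynb"


def build_result(message, artifacts):
    """Assemble the standardized return dict from (description, path) pairs."""
    if len(message) > 120:
        raise ValueError("message must be at most 120 characters")
    entries = []
    for description, path in artifacts:
        if len(description) > 50:
            raise ValueError(f"description too long: {description!r}")
        # Artifact paths are reported as absolute paths.
        entries.append({"description": description, "path": str(Path(path).resolve())})
    return {"message": message, "reference": REFERENCE_URL, "artifacts": entries}


# Hypothetical file name, used only to show the shape of the result.
result = build_result(
    "scRNA-seq preprocessing completed",
    [("Preprocessed RNA AnnData", "rna_preprocessed.h5ad")],
)
```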
## Quality Review Results
### Iteration 1 (Final)
**Date**: 2026-02-14
**Status**: All checks passed
**Tool Design Validation**: [✓] All 7 checks passed
- Tool definition, naming, description, classification, order, boundaries, independence all correct
**Implementation Validation**: [✓] All 8 checks passed
- Function coverage, parameter design, input validation, tutorial fidelity, real-world focus, no hardcoding, library compliance, exact function calls all correct
**Output Validation**: [✓] All 5 checks passed
- Figure generation, data outputs, return format, file paths, reference links all correct
**Code Quality Validation**: [✓] All 6 checks passed
- Error handling, type annotations, documentation, template compliance, import management, environment setup all correct
**Summary**: 3/3 tools passing all checks. No issues found. Implementation is production-ready.
## Implementation Choices
### Libraries Used
- **anndata**: Standard format for single-cell data (AnnData objects)
- **scanpy**: Standard toolkit for scRNA-seq analysis (HVG, normalization, PCA, UMAP)
- **scglue**: GLUE-specific functions (LSI, graph construction, gene annotation)
- **networkx**: Standard graph library for guidance graph representation
- **matplotlib**: Visualization library for UMAP plots
### Error Handling Approach
**Basic Input Validation Only**:
- Required parameter validation (data_path must be provided)
- File existence checks (FileNotFoundError if file not found)
- No intermediate processing validation (trust library error messages)
**Rationale**: Tutorial assumes valid input data. Error handling focused on user input mistakes, not data quality issues.
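The two checks described above might look like the following. This is a sketch under assumptions: the helper name `validate_input_path` and the exact error messages are illustrative, and only the `None` check and the `FileNotFoundError` come from this log.

```python
from pathlib import Path


def validate_input_path(data_path, name="data_path"):
    """Return the input as a Path, or raise a clear error for bad user input."""
    # Required-parameter check: paths default to None in the tool signatures.
    if data_path is None:
        raise ValueError(f"{name} must be provided")
    path = Path(data_path)
    # File-existence check; anything beyond this is left to the libraries.
    if not path.exists():
        raise FileNotFoundError(f"{name} not found: {path}")
    return path
```

Anything that goes wrong after these checks, such as a malformed h5ad file, surfaces as the underlying library's own error.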
### Parameterization Rationale
**Why Parameterize `color_var`?**
- Tutorial uses `"cell_type"`, a column specific to the tutorial dataset
- User datasets will have different column names for cell annotations
- Parameterizing lets the tool work with any AnnData object, whatever its annotation column is called
**Why Parameterize `gtf_by`?**
- Tutorial uses the `"gene_name"` GTF attribute, but GTF files can use different attributes
- Some GTF files use `"gene_id"`, `"transcript_name"`, or other attributes
- Parameterizing lets the tool work with different GTF annotation standards
**Why Keep Default `n_top_genes=2000`?**
- This is a standard value in single-cell RNA-seq analysis
- Tutorial sets this value explicitly rather than relying on the library default
- Value represents a scientific choice about feature selection stringency
**Why Keep Default `n_components=100`?**
- This is the standard dimensionality for GLUE model training
- Tutorial explicitly sets this value for downstream model compatibility
- Changing this value would require adjusting the GLUE model architecture
## Known Limitations
1. **Coordinate Extraction Assumption**: `glue_construct_regulatory_graph` assumes ATAC peak names follow the format `"chr:start-end"`. If user data uses different formats (e.g., `"chr_start_end"` or `"chr:start:end"`), the coordinate extraction will fail. Users must ensure their peak names follow the expected format or pre-process their data.
2. **GTF Compatibility**: Gene annotation requires GTF file with specific attributes. Not all GTF formats are compatible. Users must ensure their GTF file contains the required attributes (default: `"gene_name"`).
3. **Memory Requirements**: LSI and PCA operations on large datasets can be memory-intensive. Users with datasets >100k cells may encounter memory issues on standard workstations.
4. **Visualization Dependency**: UMAP visualizations require the `color_var` column to exist in the AnnData object. If the column is missing, the tool will fail. Users must ensure their data contains the specified annotation column.
5. **File Format Constraints**: Tools only accept h5ad format for input/output. Users with data in other formats (csv, mtx, loom) must convert to h5ad before using these tools.
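For the first limitation, peak names can be checked against the expected `"chr:start-end"` layout before running the tool. This is a hedged sketch: the helper names `parse_peak_name` and `find_malformed_peaks` are hypothetical, and the regular expression is one possible reading of the format, not the parsing code used by `glue_construct_regulatory_graph`.

```python
import re

# Expected peak name layout: "chr:start-end", e.g. "chr1:3005833-3005982".
_PEAK_PATTERN = re.compile(r"([^:]+):(\d+)-(\d+)")


def parse_peak_name(name):
    """Split a peak name into (chrom, start, end) or raise ValueError."""
    match = _PEAK_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"peak name {name!r} is not in 'chr:start-end' format")
    chrom, start, end = match.groups()
    return chrom, int(start), int(end)


def find_malformed_peaks(names):
    """Return the peak names that do not follow the expected format."""
    return [n for n in names if _PEAK_PATTERN.fullmatch(n) is None]


print(parse_peak_name("chr1:3005833-3005982"))  # ('chr1', 3005833, 3005982)
print(find_malformed_peaks(["chr1:100-200", "chr1_100_200", "chr1:100:200"]))
# ['chr1_100_200', 'chr1:100:200']
```

Running `find_malformed_peaks` on the ATAC feature names gives a list of names to fix during pre-processing.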
## Testing Recommendations
1. **Test with tutorial data**: Verify tools reproduce exact tutorial results with Chen-2019 dataset
2. **Test with different organisms**: Verify GTF annotation works with different reference genomes
3. **Test with different annotation columns**: Verify `color_var` parameter works with different metadata
4. **Test with edge cases**:
- Very small datasets (<100 cells)
- Very large datasets (>100k cells)
- Datasets with missing or malformed peak coordinates
- GTF files with different attribute names
## Revision History
### Initial Implementation
- 4 tools: `glue_read_paired_data`, `glue_preprocess_scrna`, `glue_preprocess_scatac`, `glue_construct_guidance_graph`
### Revision 1 (2026-02-14)
**Changes Made**:
1. **Removed `glue_read_paired_data` tool**: Classified as NOT Applicable to New Data (only loads tutorial data without analytical transformation)
2. **Renamed `glue_construct_guidance_graph` to `glue_construct_regulatory_graph`**: Better matches tutorial section title "Construct prior regulatory graph"
3. **Updated documentation**: Corrected tool count from 4 to 3 tools
**Rationale**: Enforce strict adherence to "Applicable to New Data" classification. Data loading without analytical transformation should not be a standalone tool.
**Result**: All 3 remaining tools pass every quality-review check.