Paper2Agent-scglue-mcp / src /tools /preprocessing_implementation_log.md

Dylan Mann-Krzisnik

Fix repo layout / Dockerfile paths

5c47821 13 days ago


	# Implementation Log: GLUE Preprocessing Tools

	Tutorial Source: `gao-lab/GLUE/blob/master/docs/preprocessing.ipynb`
	Implementation Date: 2026-02-14
	Output File: `src/tools/preprocessing.py`

	## Tool Design Decisions

	### Tools Extracted (3 tools)

	1. glue_preprocess_scrna
	- Section: "Preprocess scRNA-seq data"
	- Rationale: Complete preprocessing workflow for scRNA-seq data including HVG selection, normalization, scaling, PCA, and UMAP visualization
	- Classification: Applicable to New Data - performs standard scRNA-seq preprocessing on any raw count matrix
	- Parameters Preserved: `n_top_genes=2000`, `flavor="seurat_v3"`, `n_comps=100`, `svd_solver="auto"` all explicitly set in tutorial
	- Parameters Parameterized: `color_var="cell_type"` is tutorial-specific and must be configurable for user datasets

	2. glue_preprocess_scatac
	- Section: "Preprocess scATAC-seq data"
	- Rationale: Complete preprocessing workflow for scATAC-seq data using LSI dimension reduction and UMAP visualization
	- Classification: Applicable to New Data - performs standard scATAC-seq preprocessing on any raw accessibility matrix
	- Parameters Preserved: `n_components=100`, `n_iter=15` explicitly set in tutorial
	- Parameters Parameterized: `color_var="cell_type"` is tutorial-specific and must be configurable for user datasets

	3. glue_construct_regulatory_graph
	- Section: "Construct prior regulatory graph"
	- Rationale: Constructs prior regulatory graph linking RNA and ATAC features via genomic proximity
	- Classification: Applicable to New Data - essential for GLUE model training with any paired RNA-ATAC datasets
	- Parameters Preserved: `gtf_by="gene_name"` as tutorial default
	- Input Requirements: Requires GTF annotation file which users must provide for their organism

	### Tools Excluded (1 tool)

	1. glue_read_paired_data (initially present, removed in revision)
	- Section: "Read data"
	- Rationale for Exclusion: Only loads tutorial example data with no analytical transformation
	- Classification: NOT Applicable to New Data - data loading is trivial and should be handled by users

	## Parameter Design Rationale

	### Primary Data Inputs
	- All tools use file paths as primary inputs (h5ad format for AnnData objects)
	- No data object parameters (e.g., `adata: AnnData`) to enforce file-based workflow
	- All data paths default to `None` with validation in function body for clear error messages

	### Analysis Parameters
	Parameters Explicitly Set in Tutorial (exposed as parameters, with the tutorial values kept as defaults):
	- `n_top_genes=2000`, `flavor="seurat_v3"` - Tutorial shows explicit values for HVG selection
	- `n_comps=100`, `svd_solver="auto"` - Tutorial shows explicit values for PCA
	- `n_components=100`, `n_iter=15` - Tutorial shows explicit values for LSI
	- `gtf_by="gene_name"` - Tutorial shows explicit attribute for GTF parsing

	Tutorial-Specific Values (Parameterized):
	- `color_var="cell_type"` - Column name specific to tutorial dataset, must be configurable for user data

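	Calls Preserved As-Is (not parameterized):
	- `sc.pp.neighbors(rna, metric="cosine")` - Tutorial shows this exact call, preserved as-is; `metric="cosine"` is set explicitly by the tutorial and is not the scanpy default (`"euclidean"`)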
	- `sc.pp.normalize_total(rna)` - No parameters in tutorial, using library defaults
	- `sc.pp.log1p(rna)` - No parameters in tutorial, using library defaults
	- `sc.pp.scale(rna)` - No parameters in tutorial, using library defaults

	### Critical Rule Adherence
	NEVER ADD PARAMETERS NOT IN TUTORIAL: Every analysis parameter corresponds to a value set explicitly in the tutorial code; none were added beyond those.

	PRESERVE EXACT TUTORIAL STRUCTURE: All function calls preserve the exact structure from the tutorial:
	- `sc.pp.highly_variable_genes(rna, n_top_genes=2000, flavor="seurat_v3")` → parameterized as shown
	- `sc.tl.pca(rna, n_comps=100, svd_solver="auto")` → parameterized as shown
	- `scglue.data.lsi(atac, n_components=100, n_iter=15)` → parameterized as shown
	- `sc.pp.neighbors(rna, metric="cosine")` → preserved exactly as shown
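
A minimal sketch of how the calls above might be wrapped with the tutorial values as defaults. The function name `preprocess_scrna_sketch` and the in-memory `rna` argument are illustrative only (the actual tools take file paths, as described in this log), and `sc.tl.umap` is assumed for the UMAP step. The scanpy import is deferred so the signature can be inspected without scanpy installed.

```python
def preprocess_scrna_sketch(
    rna,
    n_top_genes=2000,
    flavor="seurat_v3",
    n_comps=100,
    svd_solver="auto",
):
    """Illustrative only: apply the scRNA-seq steps listed above to an AnnData, in place."""
    import scanpy as sc  # deferred import; scanpy is required only when the function is called

    # Exposed as parameters, tutorial values as defaults
    sc.pp.highly_variable_genes(rna, n_top_genes=n_top_genes, flavor=flavor)

    # No parameters in the tutorial, so library defaults apply
    sc.pp.normalize_total(rna)
    sc.pp.log1p(rna)
    sc.pp.scale(rna)

    # Exposed as parameters, tutorial values as defaults
    sc.tl.pca(rna, n_comps=n_comps, svd_solver=svd_solver)

    # Preserved exactly as in the tutorial (metric is not exposed)
    sc.pp.neighbors(rna, metric="cosine")
    sc.tl.umap(rna)  # assumed UMAP step; not quoted in this log
    return rna
```

Calling the function with no keyword arguments would reproduce the tutorial settings; overriding, say, `n_top_genes` changes only the HVG step.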

	## Output Requirements

	### Visualization Outputs
	Code-Generated Figures Only:
	- `glue_preprocess_scrna`: UMAP visualization of RNA data (from tutorial section "Optionally, we can visualize...")
	- `glue_preprocess_scatac`: UMAP visualization of ATAC data (from tutorial section "Optionally, we may also visualize...")
	- No static figures or diagrams included (tutorial has none)

	Figure Specifications:
	- Format: PNG with `dpi=300`, `bbox_inches='tight'`
	- Naming: `{out_prefix}_umap_{timestamp}.png`
	- Always generated (no user control parameter)

	### Data Outputs
	Essential Results Saved:
	- Preprocessed AnnData objects with all transformations applied
	- Guidance graph in NetworkX GraphML format
	- Annotated data with genomic coordinates

	File Formats:
	- AnnData: h5ad with gzip compression (standard for single-cell data)
	- Graph: graphml.gz (standard for NetworkX graphs)

	Naming Convention:
	- `{out_prefix}_preprocessed_{timestamp}.h5ad`
	- `{out_prefix}_graph_{timestamp}.graphml.gz`
	- `{out_prefix}_rna_annotated_{timestamp}.h5ad`
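
The naming convention above could be produced by a small helper along these lines. This is a sketch: `build_output_path` is a hypothetical name, and the timestamp format is an assumption, since this log does not state the one actually used.

```python
from datetime import datetime
from pathlib import Path


def build_output_path(out_prefix, kind, ext, timestamp=None):
    """Return `{out_prefix}_{kind}_{timestamp}.{ext}` as a Path."""
    if timestamp is None:
        # Assumed format; the real implementation may use a different one.
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return Path(f"{out_prefix}_{kind}_{timestamp}.{ext}")
```

For example, `build_output_path("out/rna", "preprocessed", "h5ad")` would yield a path of the form `out/rna_preprocessed_<timestamp>.h5ad`; passing `timestamp` explicitly keeps all files from one run under the same stamp.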

	### Return Format
	All tools return standardized dict:
	```python
	{
	"message": "<concise status ≤120 chars>",
	"reference": "https://github.com/gao-lab/GLUE/blob/master/docs/preprocessing.ipynb",
	"artifacts": [
	{
	"description": "<description ≤50 chars>",
	"path": "/absolute/path/to/file"
	}
	]
	}
	```
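
One way such a dict might be assembled is sketched below. `make_result` is a hypothetical helper, and raising `ValueError` on over-long strings is an assumption; the log states the length limits but not how they are enforced.

```python
import os

REFERENCE = "https://github.com/gao-lab/GLUE/blob/master/docs/preprocessing.ipynb"


def make_result(message, artifacts):
    """Build the standardized return dict from (description, path) pairs."""
    if len(message) > 120:
        raise ValueError("message must be at most 120 characters")
    entries = []
    for description, path in artifacts:
        if len(description) > 50:
            raise ValueError("artifact description must be at most 50 characters")
        # Artifact paths are reported as absolute paths
        entries.append({"description": description, "path": os.path.abspath(path)})
    return {"message": message, "reference": REFERENCE, "artifacts": entries}
```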

	## Quality Review Results

	### Iteration 1 (Final)
	Date: 2026-02-14
	Status: All checks passed

	Tool Design Validation: [✓] All 7 checks passed
	- Tool definition, naming, description, classification, order, boundaries, independence all correct

	Implementation Validation: [✓] All 8 checks passed
	- Function coverage, parameter design, input validation, tutorial fidelity, real-world focus, no hardcoding, library compliance, exact function calls all correct

	Output Validation: [✓] All 5 checks passed
	- Figure generation, data outputs, return format, file paths, reference links all correct

	Code Quality Validation: [✓] All 6 checks passed
	- Error handling, type annotations, documentation, template compliance, import management, environment setup all correct

	Summary: 3/3 tools passing all checks. No issues found. Implementation is production-ready.

	## Implementation Choices

	### Libraries Used
	- anndata: Standard format for single-cell data (AnnData objects)
	- scanpy: Standard toolkit for scRNA-seq analysis (HVG, normalization, PCA, UMAP)
	- scglue: GLUE-specific functions (LSI, graph construction, gene annotation)
	- networkx: Standard graph library for guidance graph representation
	- matplotlib: Visualization library for UMAP plots

	### Error Handling Approach
	Basic Input Validation Only:
	- Required parameter validation (data_path must be provided)
	- File existence checks (FileNotFoundError if file not found)
	- No intermediate processing validation (trust library error messages)

	Rationale: Tutorial assumes valid input data. Error handling focused on user input mistakes, not data quality issues.
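
The two checks described above could look roughly like this. `validate_input_path` is a hypothetical helper name, and using `ValueError` for a missing argument is an assumption; only `FileNotFoundError` is named in this log.

```python
from pathlib import Path


def validate_input_path(data_path, name="data_path"):
    """Check that a required path argument was given and points to an existing file."""
    if data_path is None:
        # Paths default to None, so a missing argument is reported explicitly
        raise ValueError(f"{name} must be provided")
    path = Path(data_path)
    if not path.is_file():
        raise FileNotFoundError(f"{name} not found: {path}")
    return path
```

Anything past this point (malformed matrices, missing columns) is left to the underlying libraries to report.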

	### Parameterization Rationale

	Why Parameterize `color_var`?
	- Tutorial uses `"cell_type"` which is a column specific to the tutorial dataset
	- User datasets will have different column names for cell annotations
	- Parameterizing lets the tool work with any AnnData object, whatever its annotation column is named

	Why Parameterize `gtf_by`?
	- Tutorial uses `"gene_name"` attribute in GTF, but GTF files can use different attributes
	- Some GTF files use `"gene_id"`, `"transcript_name"`, or other attributes
	- Parameterizing enables tool to work with different GTF annotation standards
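
Before choosing a `gtf_by` value, it can help to confirm that the attribute actually appears in the annotation file. The sketch below is not part of the tools: `gtf_has_attribute` is a hypothetical helper that assumes the standard 9-column, tab-separated GTF layout, and it reports whether any record of the chosen feature type carries the key.

```python
import re


def gtf_has_attribute(lines, key, feature="gene"):
    """Return True if any record of the given feature type has `key` in column 9."""
    # Attributes look like: gene_id "X"; gene_name "Y";
    pattern = re.compile(rf'(?:^|;)\s*{re.escape(key)} "')
    for line in lines:
        if line.startswith("#"):
            continue  # skip header and comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9 and fields[2] == feature and pattern.search(fields[8]):
            return True
    return False
```

With an uncompressed file this could be called as `gtf_has_attribute(open(gtf_path), "gene_name")`; for a gzipped GTF, `gzip.open(gtf_path, "rt")` would be used instead.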

	Why Keep Default `n_top_genes=2000`?
	- This is a standard value in single-cell RNA-seq analysis
	- Tutorial explicitly sets this value, not using library default
	- Value represents a scientific choice about feature selection stringency

	Why Keep Default `n_components=100`?
	- This is the standard dimensionality for GLUE model training
	- Tutorial explicitly sets this value for downstream model compatibility
	- Changing this value would require adjusting the GLUE model architecture

	## Known Limitations

	1. Coordinate Extraction Assumption: `glue_construct_regulatory_graph` assumes ATAC peak names follow the format `"chr:start-end"`. If user data uses different formats (e.g., `"chr_start_end"` or `"chr:start:end"`), the coordinate extraction will fail. Users must ensure their peak names follow the expected format or pre-process their data.

	2. GTF Compatibility: Gene annotation requires GTF file with specific attributes. Not all GTF formats are compatible. Users must ensure their GTF file contains the required attributes (default: `"gene_name"`).

	3. Memory Requirements: LSI and PCA operations on large datasets can be memory-intensive. Users with datasets >100k cells may encounter memory issues on standard workstations.

	4. Visualization Dependency: UMAP visualizations require the `color_var` column to exist in the AnnData object. If the column is missing, the tool will fail. Users must ensure their data contains the specified annotation column.

	5. File Format Constraints: Tools only accept h5ad format for input/output. Users with data in other formats (csv, mtx, loom) must convert to h5ad before using these tools.
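
For the first limitation, peak names can be checked against the expected `"chr:start-end"` format before running the tool. This is a sketch of such a check, not the extraction code used by `glue_construct_regulatory_graph`; `parse_peak_name` and `PEAK_PATTERN` are hypothetical names.

```python
import re

# Expected format: chromosome, colon, start, hyphen, end (e.g. "chr1:100-200")
PEAK_PATTERN = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_peak_name(name):
    """Split a 'chr:start-end' peak name into (chrom, start, end)."""
    match = PEAK_PATTERN.match(name)
    if match is None:
        raise ValueError(f"peak name {name!r} is not in 'chr:start-end' format")
    return match["chrom"], int(match["start"]), int(match["end"])
```

Running this over all peak names up front rejects variants such as `"chr_start_end"` or `"chr:start:end"` with a clear message, so they can be renamed before graph construction.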

	## Testing Recommendations

	1. Test with tutorial data: Verify tools reproduce exact tutorial results with Chen-2019 dataset
	2. Test with different organisms: Verify GTF annotation works with different reference genomes
	3. Test with different annotation columns: Verify `color_var` parameter works with different metadata
	4. Test with edge cases:
	- Very small datasets (<100 cells)
	- Very large datasets (>100k cells)
	- Datasets with missing or malformed peak coordinates
	- GTF files with different attribute names

	## Revision History

	### Initial Implementation
	- 4 tools: `glue_read_paired_data`, `glue_preprocess_scrna`, `glue_preprocess_scatac`, `glue_construct_guidance_graph`

	### Revision 1 (2026-02-14)
	Changes Made:
	1. Removed `glue_read_paired_data` tool: Classified as NOT Applicable to New Data (only loads tutorial data without analytical transformation)
	2. Renamed `glue_construct_guidance_graph` to `glue_construct_regulatory_graph`: Better matches tutorial section title "Construct prior regulatory graph"
	3. Updated documentation: Corrected tool count from 4 to 3 tools

	Rationale: Enforce strict adherence to "Applicable to New Data" classification. Data loading without analytical transformation should not be a standalone tool.

	Result: All 3 remaining tools pass quality review with all checks passing.