Spaces:
Sleeping
Implementation Log: GLUE Preprocessing Tools
Tutorial Source: gao-lab/GLUE/blob/master/docs/preprocessing.ipynb
Implementation Date: 2026-02-14
Output File: src/tools/preprocessing.py
Tool Design Decisions
Tools Extracted (3 tools)
glue_preprocess_scrna
- Section: "Preprocess scRNA-seq data"
- Rationale: Complete preprocessing workflow for scRNA-seq data including HVG selection, normalization, scaling, PCA, and UMAP visualization
- Classification: Applicable to New Data - performs standard scRNA-seq preprocessing on any raw count matrix
- Parameters Preserved:
n_top_genes=2000,flavor="seurat_v3",n_comps=100,svd_solver="auto"all explicitly set in tutorial - Parameters Parameterized:
color_var="cell_type"is tutorial-specific and must be configurable for user datasets
glue_preprocess_scatac
- Section: "Preprocess scATAC-seq data"
- Rationale: Complete preprocessing workflow for scATAC-seq data using LSI dimension reduction and UMAP visualization
- Classification: Applicable to New Data - performs standard scATAC-seq preprocessing on any raw accessibility matrix
- Parameters Preserved:
n_components=100,n_iter=15explicitly set in tutorial - Parameters Parameterized:
color_var="cell_type"is tutorial-specific and must be configurable for user datasets
glue_construct_regulatory_graph
- Section: "Construct prior regulatory graph"
- Rationale: Constructs prior regulatory graph linking RNA and ATAC features via genomic proximity
- Classification: Applicable to New Data - essential for GLUE model training with any paired RNA-ATAC datasets
- Parameters Preserved:
gtf_by="gene_name"as tutorial default - Input Requirements: Requires GTF annotation file which users must provide for their organism
Tools Excluded (1 tool)
- glue_read_paired_data (initially present, removed in revision)
- Section: "Read data"
- Rationale for Exclusion: Only loads tutorial example data with no analytical transformation
- Classification: NOT Applicable to New Data - data loading is trivial and should be handled by users
Parameter Design Rationale
Primary Data Inputs
- All tools use file paths as primary inputs (h5ad format for AnnData objects)
- No data object parameters (e.g.,
adata: AnnData) to enforce file-based workflow - All data paths default to
Nonewith validation in function body for clear error messages
Analysis Parameters
Parameters Explicitly Set in Tutorial (Parameterized):
n_top_genes=2000,flavor="seurat_v3"- Tutorial shows explicit values for HVG selectionn_comps=100,svd_solver="auto"- Tutorial shows explicit values for PCAn_components=100,n_iter=15- Tutorial shows explicit values for LSIgtf_by="gene_name"- Tutorial shows explicit attribute for GTF parsing
Tutorial-Specific Values (Parameterized):
color_var="cell_type"- Column name specific to tutorial dataset, must be configurable for user data
Library Defaults (Preserved):
sc.pp.neighbors(rna, metric="cosine")- Tutorial shows this exact call, preserved as-issc.pp.normalize_total(rna)- No parameters in tutorial, using library defaultssc.pp.log1p(rna)- No parameters in tutorial, using library defaultssc.pp.scale(rna)- No parameters in tutorial, using library defaults
Critical Rule Adherence
NEVER ADD PARAMETERS NOT IN TUTORIAL: All function parameters correspond to explicit values in the tutorial code. No parameters were added that weren't shown in the original tutorial.
PRESERVE EXACT TUTORIAL STRUCTURE: All function calls preserve the exact structure from the tutorial:
sc.pp.highly_variable_genes(rna, n_top_genes=2000, flavor="seurat_v3")β parameterized as shownsc.tl.pca(rna, n_comps=100, svd_solver="auto")β parameterized as shownscglue.data.lsi(atac, n_components=100, n_iter=15)β parameterized as shownsc.pp.neighbors(rna, metric="cosine")β preserved exactly as shown
Output Requirements
Visualization Outputs
Code-Generated Figures Only:
glue_preprocess_scrna: UMAP visualization of RNA data (from tutorial section "Optionally, we can visualize...")glue_preprocess_scatac: UMAP visualization of ATAC data (from tutorial section "Optionally, we may also visualize...")- No static figures or diagrams included (tutorial has none)
Figure Specifications:
- Format: PNG with
dpi=300,bbox_inches='tight' - Naming:
{out_prefix}_umap_{timestamp}.png - Always generated (no user control parameter)
Data Outputs
Essential Results Saved:
- Preprocessed AnnData objects with all transformations applied
- Guidance graph in NetworkX GraphML format
- Annotated data with genomic coordinates
File Formats:
- AnnData: h5ad with gzip compression (standard for single-cell data)
- Graph: graphml.gz (standard for NetworkX graphs)
Naming Convention:
{out_prefix}_preprocessed_{timestamp}.h5ad{out_prefix}_graph_{timestamp}.graphml.gz{out_prefix}_rna_annotated_{timestamp}.h5ad
Return Format
All tools return standardized dict:
{
"message": "<concise status β€120 chars>",
"reference": "https://github.com/gao-lab/GLUE/blob/master/docs/preprocessing.ipynb",
"artifacts": [
{
"description": "<description β€50 chars>",
"path": "/absolute/path/to/file"
}
]
}
Quality Review Results
Iteration 1 (Final)
Date: 2026-02-14 Status: All checks passed
Tool Design Validation: [β] All 7 checks passed
- Tool definition, naming, description, classification, order, boundaries, independence all correct
Implementation Validation: [β] All 8 checks passed
- Function coverage, parameter design, input validation, tutorial fidelity, real-world focus, no hardcoding, library compliance, exact function calls all correct
Output Validation: [β] All 5 checks passed
- Figure generation, data outputs, return format, file paths, reference links all correct
Code Quality Validation: [β] All 6 checks passed
- Error handling, type annotations, documentation, template compliance, import management, environment setup all correct
Summary: 3/3 tools passing all checks. No issues found. Implementation is production-ready.
Implementation Choices
Libraries Used
- anndata: Standard format for single-cell data (AnnData objects)
- scanpy: Standard toolkit for scRNA-seq analysis (HVG, normalization, PCA, UMAP)
- scglue: GLUE-specific functions (LSI, graph construction, gene annotation)
- networkx: Standard graph library for guidance graph representation
- matplotlib: Visualization library for UMAP plots
Error Handling Approach
Basic Input Validation Only:
- Required parameter validation (data_path must be provided)
- File existence checks (FileNotFoundError if file not found)
- No intermediate processing validation (trust library error messages)
Rationale: Tutorial assumes valid input data. Error handling focused on user input mistakes, not data quality issues.
Parameterization Rationale
Why Parameterize color_var?
- Tutorial uses
"cell_type"which is a column specific to the tutorial dataset - User datasets will have different column names for cell annotations
- Parameterizing enables tool to work with any AnnData object with different metadata columns
Why Parameterize gtf_by?
- Tutorial uses
"gene_name"attribute in GTF, but GTF files can use different attributes - Some GTF files use
"gene_id","transcript_name", or other attributes - Parameterizing enables tool to work with different GTF annotation standards
Why Keep Default n_top_genes=2000?
- This is a standard value in single-cell RNA-seq analysis
- Tutorial explicitly sets this value, not using library default
- Value represents a scientific choice about feature selection stringency
Why Keep Default n_components=100?
- This is the standard dimensionality for GLUE model training
- Tutorial explicitly sets this value for downstream model compatibility
- Changing this value would require adjusting the GLUE model architecture
Known Limitations
Coordinate Extraction Assumption:
glue_construct_regulatory_graphassumes ATAC peak names follow the format"chr:start-end". If user data uses different formats (e.g.,"chr_start_end"or"chr:start:end"), the coordinate extraction will fail. Users must ensure their peak names follow the expected format or pre-process their data.GTF Compatibility: Gene annotation requires GTF file with specific attributes. Not all GTF formats are compatible. Users must ensure their GTF file contains the required attributes (default:
"gene_name").Memory Requirements: LSI and PCA operations on large datasets can be memory-intensive. Users with datasets >100k cells may encounter memory issues on standard workstations.
Visualization Dependency: UMAP visualizations require the
color_varcolumn to exist in the AnnData object. If the column is missing, the tool will fail. Users must ensure their data contains the specified annotation column.File Format Constraints: Tools only accept h5ad format for input/output. Users with data in other formats (csv, mtx, loom) must convert to h5ad before using these tools.
Testing Recommendations
- Test with tutorial data: Verify tools reproduce exact tutorial results with Chen-2019 dataset
- Test with different organisms: Verify GTF annotation works with different reference genomes
- Test with different annotation columns: Verify
color_varparameter works with different metadata - Test with edge cases:
- Very small datasets (<100 cells)
- Very large datasets (>100k cells)
- Datasets with missing or malformed peak coordinates
- GTF files with different attribute names
Revision History
Initial Implementation
- 4 tools:
glue_read_paired_data,glue_preprocess_scrna,glue_preprocess_scatac,glue_construct_guidance_graph
Revision 1 (2026-02-14)
Changes Made:
- Removed
glue_read_paired_datatool: Classified as NOT Applicable to New Data (only loads tutorial data without analytical transformation) - Renamed
glue_construct_guidance_graphtoglue_construct_regulatory_graph: Better matches tutorial section title "Construct prior regulatory graph" - Updated documentation: Corrected tool count from 4 to 3 tools
Rationale: Enforce strict adherence to "Applicable to New Data" classification. Data loading without analytical transformation should not be a standalone tool.
Result: All 3 remaining tools pass quality review with all checks passing.