| # Implementation Log: GLUE Preprocessing Tools | |
| **Tutorial Source**: `gao-lab/GLUE/blob/master/docs/preprocessing.ipynb` | |
| **Implementation Date**: 2026-02-14 | |
| **Output File**: `src/tools/preprocessing.py` | |
| ## Tool Design Decisions | |
| ### Tools Extracted (3 tools) | |
| 1. **glue_preprocess_scrna** | |
| - **Section**: "Preprocess scRNA-seq data" | |
| - **Rationale**: Complete preprocessing workflow for scRNA-seq data including HVG selection, normalization, scaling, PCA, and UMAP visualization | |
| - **Classification**: Applicable to New Data - performs standard scRNA-seq preprocessing on any raw count matrix | |
| - **Parameters Preserved**: `n_top_genes=2000`, `flavor="seurat_v3"`, `n_comps=100`, `svd_solver="auto"` all explicitly set in tutorial | |
| - **Parameters Parameterized**: `color_var="cell_type"` is tutorial-specific and must be configurable for user datasets | |
| 2. **glue_preprocess_scatac** | |
| - **Section**: "Preprocess scATAC-seq data" | |
| - **Rationale**: Complete preprocessing workflow for scATAC-seq data using LSI dimension reduction and UMAP visualization | |
| - **Classification**: Applicable to New Data - performs standard scATAC-seq preprocessing on any raw accessibility matrix | |
| - **Parameters Preserved**: `n_components=100`, `n_iter=15` explicitly set in tutorial | |
| - **Parameters Parameterized**: `color_var="cell_type"` is tutorial-specific and must be configurable for user datasets | |
| 3. **glue_construct_regulatory_graph** | |
| - **Section**: "Construct prior regulatory graph" | |
| - **Rationale**: Constructs prior regulatory graph linking RNA and ATAC features via genomic proximity | |
| - **Classification**: Applicable to New Data - essential for GLUE model training with any paired RNA-ATAC datasets | |
| - **Parameters Preserved**: `gtf_by="gene_name"` as tutorial default | |
| - **Input Requirements**: Requires GTF annotation file which users must provide for their organism | |
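
The graph-construction tool described above can be sketched roughly as follows. This is a minimal illustration, not the contents of `src/tools/preprocessing.py`: the function name, its argument names, and the omission of output writing are assumptions. The `scglue` calls follow the GLUE preprocessing tutorial, and peak names are assumed to look like `chr:start-end`.

```python
def construct_regulatory_graph_sketch(rna_path, atac_path, gtf_path, *, gtf_by="gene_name"):
    """Hypothetical sketch of the graph-construction step (names are illustrative)."""
    # Heavy dependencies are imported lazily so this module can be imported
    # (and its signature inspected) without anndata/scglue installed.
    import anndata as ad
    import scglue

    rna = ad.read_h5ad(rna_path)
    atac = ad.read_h5ad(atac_path)

    # Annotate genes with genomic coordinates taken from the GTF file;
    # `gtf_by` names the GTF attribute matched against rna.var_names.
    scglue.data.get_gene_annotation(rna, gtf=gtf_path, gtf_by=gtf_by)

    # Derive peak coordinates from peak names (assumed format: "chr:start-end").
    split = atac.var_names.str.split(r"[:-]")
    atac.var["chrom"] = split.map(lambda x: x[0])
    atac.var["chromStart"] = split.map(lambda x: x[1]).astype(int)
    atac.var["chromEnd"] = split.map(lambda x: x[2]).astype(int)

    # Link genes and peaks by genomic proximity, then sanity-check the graph.
    guidance = scglue.genomics.rna_anchored_guidance_graph(rna, atac)
    scglue.graph.check_graph(guidance, [rna, atac])
    return rna, atac, guidance
```

The sketch returns in-memory objects for brevity; the actual tool is described as writing the annotated data and the graph to disk.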
| ### Tools Excluded (1 tool) | |
| 1. **glue_read_paired_data** (initially present, removed in revision) | |
| - **Section**: "Read data" | |
| - **Rationale for Exclusion**: Only loads tutorial example data with no analytical transformation | |
| - **Classification**: NOT Applicable to New Data - data loading is trivial and should be handled by users | |
| ## Parameter Design Rationale | |
| ### Primary Data Inputs | |
| - All tools use **file paths** as primary inputs (h5ad format for AnnData objects) | |
| - No data object parameters (e.g., `adata: AnnData`) to enforce file-based workflow | |
| - All data paths default to `None` with validation in function body for clear error messages | |
| ### Analysis Parameters | |
| **Parameters Explicitly Set in Tutorial (Exposed with Tutorial Values as Defaults)**: | |
| - `n_top_genes=2000`, `flavor="seurat_v3"` - Tutorial shows explicit values for HVG selection | |
| - `n_comps=100`, `svd_solver="auto"` - Tutorial shows explicit values for PCA | |
| - `n_components=100`, `n_iter=15` - Tutorial shows explicit values for LSI | |
| - `gtf_by="gene_name"` - Tutorial shows explicit attribute for GTF parsing | |
| **Tutorial-Specific Values (Parameterized)**: | |
| - `color_var="cell_type"` - Column name specific to tutorial dataset, must be configurable for user data | |
| **Calls Kept As-Is (Not Parameterized)**: | |
| - `sc.pp.neighbors(rna, metric="cosine")` - Tutorial sets `metric="cosine"` explicitly (this is not the library default); the call is preserved exactly as shown | |
| - `sc.pp.normalize_total(rna)` - No parameters in tutorial, using library defaults | |
| - `sc.pp.log1p(rna)` - No parameters in tutorial, using library defaults | |
| - `sc.pp.scale(rna)` - No parameters in tutorial, using library defaults | |
| ### Critical Rule Adherence | |
| **NEVER ADD PARAMETERS NOT IN TUTORIAL**: File-path inputs aside, all function parameters correspond to explicit values in the tutorial code; none were added beyond those shown in the original tutorial. | |
| **PRESERVE EXACT TUTORIAL STRUCTURE**: All function calls preserve the exact structure from the tutorial: | |
| - `sc.pp.highly_variable_genes(rna, n_top_genes=2000, flavor="seurat_v3")` → parameterized as shown | |
| - `sc.tl.pca(rna, n_comps=100, svd_solver="auto")` → parameterized as shown | |
| - `scglue.data.lsi(atac, n_components=100, n_iter=15)` → parameterized as shown | |
| - `sc.pp.neighbors(rna, metric="cosine")` → preserved exactly as shown | |
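
Put together, the calls listed above suggest preprocessing functions along these lines. This is a hedged sketch: the function names are hypothetical, the functions take in-memory AnnData objects for brevity (the actual tools are described as file-path based), and the `use_rep="X_lsi"` argument in the ATAC branch is an assumption based on LSI output being stored in `obsm["X_lsi"]`.

```python
def preprocess_scrna_sketch(rna, *, n_top_genes=2000, flavor="seurat_v3",
                            n_comps=100, svd_solver="auto"):
    """Hypothetical sketch of the scRNA-seq steps; `rna` holds raw counts."""
    import scanpy as sc  # lazy import keeps this module importable without scanpy

    # HVG selection runs first because the "seurat_v3" flavor expects raw counts.
    sc.pp.highly_variable_genes(rna, n_top_genes=n_top_genes, flavor=flavor)
    sc.pp.normalize_total(rna)   # library defaults, as in the tutorial
    sc.pp.log1p(rna)             # library defaults
    sc.pp.scale(rna)             # library defaults
    sc.tl.pca(rna, n_comps=n_comps, svd_solver=svd_solver)
    sc.pp.neighbors(rna, metric="cosine")  # kept exactly as in the tutorial
    sc.tl.umap(rna)
    return rna


def preprocess_scatac_sketch(atac, *, n_components=100, n_iter=15):
    """Hypothetical sketch of the scATAC-seq steps."""
    import scanpy as sc
    import scglue

    # LSI dimension reduction; the result is stored in atac.obsm["X_lsi"].
    scglue.data.lsi(atac, n_components=n_components, n_iter=n_iter)
    # Assumption: neighbors are computed on the LSI representation.
    sc.pp.neighbors(atac, use_rep="X_lsi", metric="cosine")
    sc.tl.umap(atac)
    return atac
```

Keyword-only defaults make the tutorial values visible in the signature while still letting callers override them.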
| ## Output Requirements | |
| ### Visualization Outputs | |
| **Code-Generated Figures Only**: | |
| - `glue_preprocess_scrna`: UMAP visualization of RNA data (from tutorial section "Optionally, we can visualize...") | |
| - `glue_preprocess_scatac`: UMAP visualization of ATAC data (from tutorial section "Optionally, we may also visualize...") | |
| - No static figures or diagrams included (tutorial has none) | |
| **Figure Specifications**: | |
| - Format: PNG with `dpi=300`, `bbox_inches='tight'` | |
| - Naming: `{out_prefix}_umap_{timestamp}.png` | |
| - Always generated (no user control parameter) | |
| ### Data Outputs | |
| **Essential Results Saved**: | |
| - Preprocessed AnnData objects with all transformations applied | |
| - Guidance graph in GraphML format (written with NetworkX) | |
| - RNA AnnData annotated with genomic coordinates | |
| **File Formats**: | |
| - AnnData: h5ad with gzip compression (standard for single-cell data) | |
| - Graph: graphml.gz (gzip-compressed GraphML, readable by NetworkX) | |
| **Naming Convention**: | |
| - `{out_prefix}_preprocessed_{timestamp}.h5ad` | |
| - `{out_prefix}_graph_{timestamp}.graphml.gz` | |
| - `{out_prefix}_rna_annotated_{timestamp}.h5ad` | |
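
One way to produce names following this convention is sketched below. The helper name, the `output_dir` argument, and the `%Y%m%d_%H%M%S` timestamp format are assumptions; the log does not state which timestamp format the tools use.

```python
from datetime import datetime
from pathlib import Path


def build_output_paths(out_prefix, output_dir=".", now=None):
    """Return absolute output paths that share a single timestamp.

    Hypothetical helper: the timestamp format is an assumption.
    """
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    base = Path(output_dir).resolve()
    return {
        "preprocessed": base / f"{out_prefix}_preprocessed_{stamp}.h5ad",
        "graph": base / f"{out_prefix}_graph_{stamp}.graphml.gz",
        "rna_annotated": base / f"{out_prefix}_rna_annotated_{stamp}.h5ad",
        "umap": base / f"{out_prefix}_umap_{stamp}.png",
    }


if __name__ == "__main__":
    for label, path in build_output_paths("demo").items():
        print(f"{label}: {path}")
```

Computing the timestamp once per call keeps all files from one run grouped under the same suffix.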
| ### Return Format | |
All tools return a standardized dict:
| ```python | |
| { | |
| "message": "<concise status ≤120 chars>", | |
| "reference": "https://github.com/gao-lab/GLUE/blob/master/docs/preprocessing.ipynb", | |
| "artifacts": [ | |
| { | |
| "description": "<description ≤50 chars>", | |
| "path": "/absolute/path/to/file" | |
| } | |
| ] | |
| } | |
| ``` | |
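
A small builder can enforce the length limits and the absolute-path requirement implied by this format. This is a sketch only: the helper name is hypothetical, and raising `ValueError` on over-long strings (rather than truncating them) is an assumption.

```python
from pathlib import Path

REFERENCE_URL = "https://github.com/gao-lab/GLUE/blob/master/docs/preprocessing.ipynb"
MAX_MESSAGE_CHARS = 120
MAX_DESCRIPTION_CHARS = 50


def build_result(message, artifacts):
    """Assemble the standardized return dict.

    `artifacts` is an iterable of (description, path) pairs.
    Hypothetical helper; rejecting over-long text is an assumed policy.
    """
    if len(message) > MAX_MESSAGE_CHARS:
        raise ValueError(f"message exceeds {MAX_MESSAGE_CHARS} characters")

    entries = []
    for description, path in artifacts:
        if len(description) > MAX_DESCRIPTION_CHARS:
            raise ValueError(f"description exceeds {MAX_DESCRIPTION_CHARS} characters")
        # Paths are made absolute, matching the "/absolute/path/to/file" placeholder.
        entries.append({"description": description, "path": str(Path(path).resolve())})

    return {"message": message, "reference": REFERENCE_URL, "artifacts": entries}
```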
| ## Quality Review Results | |
| ### Iteration 1 (Final) | |
| **Date**: 2026-02-14 | |
| **Status**: All checks passed | |
| **Tool Design Validation**: [✓] All 7 checks passed | |
| - Tool definition, naming, description, classification, order, boundaries, independence all correct | |
| **Implementation Validation**: [✓] All 8 checks passed | |
| - Function coverage, parameter design, input validation, tutorial fidelity, real-world focus, no hardcoding, library compliance, exact function calls all correct | |
| **Output Validation**: [✓] All 5 checks passed | |
| - Figure generation, data outputs, return format, file paths, reference links all correct | |
| **Code Quality Validation**: [✓] All 6 checks passed | |
| - Error handling, type annotations, documentation, template compliance, import management, environment setup all correct | |
| **Summary**: 3/3 tools passing all checks. No issues found. Implementation is production-ready. | |
| ## Implementation Choices | |
| ### Libraries Used | |
| - **anndata**: Standard format for single-cell data (AnnData objects) | |
| - **scanpy**: Standard toolkit for scRNA-seq analysis (HVG, normalization, PCA, UMAP) | |
| - **scglue**: GLUE-specific functions (LSI, graph construction, gene annotation) | |
| - **networkx**: Standard graph library for guidance graph representation | |
| - **matplotlib**: Visualization library for UMAP plots | |
| ### Error Handling Approach | |
| **Basic Input Validation Only**: | |
| - Required parameter validation (data_path must be provided) | |
| - File existence checks (FileNotFoundError if file not found) | |
| - No intermediate processing validation (trust library error messages) | |
| **Rationale**: Tutorial assumes valid input data. Error handling focused on user input mistakes, not data quality issues. | |
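
The two checks described here could look like the following minimal sketch. The helper name is hypothetical, and using `ValueError` for a missing path is an assumption; the log only names `FileNotFoundError` for the file-existence check.

```python
from pathlib import Path


def validate_input_path(data_path, param_name="data_path"):
    """Validate a required file-path argument and return it as a Path.

    Hypothetical helper; the exception type for a missing argument is assumed.
    """
    if data_path is None:
        raise ValueError(f"`{param_name}` must be provided (path to an .h5ad file)")

    path = Path(data_path)
    if not path.is_file():
        raise FileNotFoundError(f"Input file not found: {path}")
    return path
```

Anything beyond these checks, such as a malformed matrix inside the file, is left to the underlying libraries to report.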
| ### Parameterization Rationale | |
| **Why Parameterize `color_var`?** | |
| - Tutorial uses `"cell_type"` which is a column specific to the tutorial dataset | |
| - User datasets will have different column names for cell annotations | |
| - Parameterizing enables tool to work with any AnnData object with different metadata columns | |
| **Why Parameterize `gtf_by`?** | |
| - Tutorial uses the `"gene_name"` GTF attribute, which is matched against the RNA feature names, but other datasets may be indexed by a different attribute | |
| - Some GTF files use `"gene_id"`, `"transcript_name"`, or other attributes | |
| - Parameterizing enables tool to work with different GTF annotation standards | |
| **Why Keep Default `n_top_genes=2000`?** | |
| - This is a standard value in single-cell RNA-seq analysis | |
| - Tutorial explicitly sets this value, not using library default | |
| - Value represents a scientific choice about feature selection stringency | |
| **Why Keep Default `n_components=100`?** | |
| - This is the standard dimensionality for GLUE model training | |
| - Tutorial explicitly sets this value for downstream model compatibility | |
| - Changing this value would require adjusting the GLUE model architecture | |
| ## Known Limitations | |
| 1. **Coordinate Extraction Assumption**: `glue_construct_regulatory_graph` assumes ATAC peak names follow the format `"chr:start-end"`. If user data uses different formats (e.g., `"chr_start_end"` or `"chr:start:end"`), the coordinate extraction will fail. Users must ensure their peak names follow the expected format or pre-process their data. | |
| 2. **GTF Compatibility**: Gene annotation requires GTF file with specific attributes. Not all GTF formats are compatible. Users must ensure their GTF file contains the required attributes (default: `"gene_name"`). | |
| 3. **Memory Requirements**: LSI and PCA operations on large datasets can be memory-intensive. Users with datasets >100k cells may encounter memory issues on standard workstations. | |
| 4. **Visualization Dependency**: UMAP visualizations require the `color_var` column to exist in the AnnData object. If the column is missing, the tool will fail. Users must ensure their data contains the specified annotation column. | |
| 5. **File Format Constraints**: Tools only accept h5ad format for input/output. Users with data in other formats (csv, mtx, loom) must convert to h5ad before using these tools. | |
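
For the first limitation, users could check peak names before running the graph tool. The sketch below is a hypothetical stand-alone validator, not part of the described implementation; it accepts only the `chr:start-end` form and fails with an explicit message otherwise.

```python
import re

# Chromosome (anything without a colon), then integer start and end.
_PEAK_PATTERN = re.compile(r"(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)")


def parse_peak_name(name):
    """Split a peak name of the form "chr:start-end" into (chrom, start, end)."""
    match = _PEAK_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(
            f"Peak name {name!r} does not follow the expected 'chr:start-end' format"
        )
    return match["chrom"], int(match["start"]), int(match["end"])


def find_malformed_peaks(names):
    """Return the peak names that cannot be parsed, preserving input order."""
    malformed = []
    for name in names:
        try:
            parse_peak_name(name)
        except ValueError:
            malformed.append(name)
    return malformed
```

Running `find_malformed_peaks` over the ATAC feature names gives a list of names to rename before preprocessing; an empty list means every name is in the expected format.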
| ## Testing Recommendations | |
| 1. **Test with tutorial data**: Verify that the tools reproduce the tutorial results exactly on the Chen-2019 dataset | |
| 2. **Test with different organisms**: Verify GTF annotation works with different reference genomes | |
| 3. **Test with different annotation columns**: Verify `color_var` parameter works with different metadata | |
| 4. **Test with edge cases**: | |
| - Very small datasets (<100 cells) | |
| - Very large datasets (>100k cells) | |
| - Datasets with missing or malformed peak coordinates | |
| - GTF files with different attribute names | |
| ## Revision History | |
| ### Initial Implementation | |
| - 4 tools: `glue_read_paired_data`, `glue_preprocess_scrna`, `glue_preprocess_scatac`, `glue_construct_guidance_graph` | |
| ### Revision 1 (2026-02-14) | |
| **Changes Made**: | |
| 1. **Removed `glue_read_paired_data` tool**: Classified as NOT Applicable to New Data (only loads tutorial data without analytical transformation) | |
| 2. **Renamed `glue_construct_guidance_graph` to `glue_construct_regulatory_graph`**: Better matches tutorial section title "Construct prior regulatory graph" | |
| 3. **Updated documentation**: Corrected tool count from 4 to 3 tools | |
| **Rationale**: Enforce strict adherence to "Applicable to New Data" classification. Data loading without analytical transformation should not be a standalone tool. | |
| **Result**: All 3 remaining tools pass quality review with all checks passing. | |