# Implementation Log: GLUE Preprocessing Tools
**Tutorial Source**: `gao-lab/GLUE/blob/master/docs/preprocessing.ipynb`
**Implementation Date**: 2026-02-14
**Output File**: `src/tools/preprocessing.py`
## Tool Design Decisions
### Tools Extracted (3 tools)
1. **glue_preprocess_scrna**
- **Section**: "Preprocess scRNA-seq data"
- **Rationale**: Complete preprocessing workflow for scRNA-seq data including HVG selection, normalization, scaling, PCA, and UMAP visualization
- **Classification**: Applicable to New Data - performs standard scRNA-seq preprocessing on any raw count matrix
- **Parameters Preserved**: `n_top_genes=2000`, `flavor="seurat_v3"`, `n_comps=100`, `svd_solver="auto"` all explicitly set in tutorial
- **Parameters Parameterized**: `color_var="cell_type"` is tutorial-specific and must be configurable for user datasets
2. **glue_preprocess_scatac**
- **Section**: "Preprocess scATAC-seq data"
- **Rationale**: Complete preprocessing workflow for scATAC-seq data using LSI dimension reduction and UMAP visualization
- **Classification**: Applicable to New Data - performs standard scATAC-seq preprocessing on any raw accessibility matrix
- **Parameters Preserved**: `n_components=100`, `n_iter=15` explicitly set in tutorial
- **Parameters Parameterized**: `color_var="cell_type"` is tutorial-specific and must be configurable for user datasets
3. **glue_construct_regulatory_graph**
- **Section**: "Construct prior regulatory graph"
- **Rationale**: Constructs prior regulatory graph linking RNA and ATAC features via genomic proximity
- **Classification**: Applicable to New Data - essential for GLUE model training with any paired RNA-ATAC datasets
- **Parameters Preserved**: `gtf_by="gene_name"` as tutorial default
- **Input Requirements**: Requires a GTF annotation file, which users must provide for their organism
### Tools Excluded (1 tool)
1. **glue_read_paired_data** (initially present, removed in revision)
- **Section**: "Read data"
- **Rationale for Exclusion**: Only loads tutorial example data with no analytical transformation
- **Classification**: NOT Applicable to New Data - data loading is trivial and should be handled by users
## Parameter Design Rationale
### Primary Data Inputs
- All tools use **file paths** as primary inputs (h5ad format for AnnData objects)
- No data object parameters (e.g., `adata: AnnData`), to enforce a file-based workflow
- All data paths default to `None`, with validation in the function body for clear error messages
### Analysis Parameters
**Parameters Explicitly Set in Tutorial (Parameterized, Tutorial Values Kept as Defaults)**:
- `n_top_genes=2000`, `flavor="seurat_v3"` - Tutorial shows explicit values for HVG selection
- `n_comps=100`, `svd_solver="auto"` - Tutorial shows explicit values for PCA
- `n_components=100`, `n_iter=15` - Tutorial shows explicit values for LSI
- `gtf_by="gene_name"` - Tutorial shows explicit attribute for GTF parsing
**Tutorial-Specific Values (Parameterized)**:
- `color_var="cell_type"` - Column name specific to tutorial dataset, must be configurable for user data
**Calls Preserved As Written (Not Exposed as Parameters)**:
- `sc.pp.neighbors(rna, metric="cosine")` - Tutorial shows this exact call; `metric="cosine"` is set by the tutorial (it is not the scanpy default) and is preserved as-is
- `sc.pp.normalize_total(rna)` - No parameters in tutorial, using library defaults
- `sc.pp.log1p(rna)` - No parameters in tutorial, using library defaults
- `sc.pp.scale(rna)` - No parameters in tutorial, using library defaults
### Critical Rule Adherence
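**NEVER ADD PARAMETERS NOT IN TUTORIAL**: All analysis parameters correspond to explicit values in the tutorial code. No analysis parameters were added that weren't shown in the original tutorial; the only additions are the I/O arguments (input file paths and `out_prefix`) required by the file-based workflow.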
**PRESERVE EXACT TUTORIAL STRUCTURE**: All function calls preserve the exact structure from the tutorial:
- `sc.pp.highly_variable_genes(rna, n_top_genes=2000, flavor="seurat_v3")` → parameterized as shown
- `sc.tl.pca(rna, n_comps=100, svd_solver="auto")` → parameterized as shown
- `scglue.data.lsi(atac, n_components=100, n_iter=15)` → parameterized as shown
- `sc.pp.neighbors(rna, metric="cosine")` → preserved exactly as shown
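
The call sequence above can be sketched as a single function. This is a minimal illustration, not the code in `src/tools/preprocessing.py`: the function name `preprocess_scrna` is hypothetical, and it takes an in-memory AnnData object for brevity, whereas the actual tools take file paths. The `sc.tl.umap` step is assumed from the UMAP output described elsewhere in this log.

```python
def preprocess_scrna(
    rna,
    n_top_genes: int = 2000,
    flavor: str = "seurat_v3",
    n_comps: int = 100,
    svd_solver: str = "auto",
):
    """Apply the tutorial's scRNA-seq steps to an AnnData object of raw counts.

    Hypothetical helper: defaults mirror the values the tutorial sets explicitly.
    """
    # Imported lazily so this sketch can be loaded without scanpy installed.
    import scanpy as sc

    # HVG selection runs first, while rna.X still holds raw counts.
    sc.pp.highly_variable_genes(rna, n_top_genes=n_top_genes, flavor=flavor)
    # These three calls use library defaults, as in the tutorial.
    sc.pp.normalize_total(rna)
    sc.pp.log1p(rna)
    sc.pp.scale(rna)
    sc.tl.pca(rna, n_comps=n_comps, svd_solver=svd_solver)
    # metric="cosine" is kept fixed rather than exposed as a parameter.
    sc.pp.neighbors(rna, metric="cosine")
    sc.tl.umap(rna)  # assumed step, needed for the UMAP figure
    return rna
```

Keeping `metric="cosine"` inside the function body, rather than in the signature, reflects the distinction between parameterized values and calls preserved as written.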
## Output Requirements
### Visualization Outputs
**Code-Generated Figures Only**:
- `glue_preprocess_scrna`: UMAP visualization of RNA data (from tutorial section "Optionally, we can visualize...")
- `glue_preprocess_scatac`: UMAP visualization of ATAC data (from tutorial section "Optionally, we may also visualize...")
- No static figures or diagrams included (tutorial has none)
**Figure Specifications**:
- Format: PNG with `dpi=300`, `bbox_inches='tight'`
- Naming: `{out_prefix}_umap_{timestamp}.png`
- Always generated (no user control parameter)
### Data Outputs
**Essential Results Saved**:
- Preprocessed AnnData objects with all transformations applied
- Guidance graph in NetworkX GraphML format
- Annotated data with genomic coordinates
**File Formats**:
- AnnData: h5ad with gzip compression (standard for single-cell data)
- Graph: graphml.gz (standard for NetworkX graphs)
**Naming Convention**:
- `{out_prefix}_preprocessed_{timestamp}.h5ad`
- `{out_prefix}_graph_{timestamp}.graphml.gz`
- `{out_prefix}_rna_annotated_{timestamp}.h5ad`
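
One way to produce names that follow this convention is sketched below. The helper `build_output_path` and the `%Y%m%d_%H%M%S` timestamp format are assumptions for illustration; this log does not specify the timestamp format the tools actually use.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def build_output_path(
    out_prefix: str, kind: str, ext: str, now: Optional[datetime] = None
) -> Path:
    """Return '{out_prefix}_{kind}_{timestamp}.{ext}' as an absolute path."""
    # Timestamp format is an assumption, not taken from the implementation.
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(f"{out_prefix}_{kind}_{stamp}.{ext}").absolute()


# Example: the three data outputs listed above.
preprocessed = build_output_path("results/run1", "preprocessed", "h5ad")
graph = build_output_path("results/run1", "graph", "graphml.gz")
annotated = build_output_path("results/run1", "rna_annotated", "h5ad")
```

Passing `now` explicitly makes the names reproducible in tests, and returning an absolute path matches the `path` field of the return format.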
### Return Format
All tools return a standardized dict:
```python
{
    "message": "<concise status ≤120 chars>",
    "reference": "https://github.com/gao-lab/GLUE/blob/master/docs/preprocessing.ipynb",
    "artifacts": [
        {
            "description": "<description ≤50 chars>",
            "path": "/absolute/path/to/file"
        }
    ]
}
```
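
A small builder can enforce the length limits and absolute paths in this format. The sketch below is illustrative only: `build_result` is a hypothetical helper, and raising `ValueError` on over-long text is an assumed policy (the actual tools might truncate instead).

```python
from pathlib import Path
from typing import Dict, List, Tuple

REFERENCE_URL = "https://github.com/gao-lab/GLUE/blob/master/docs/preprocessing.ipynb"


def build_result(message: str, artifacts: List[Tuple[str, str]]) -> Dict[str, object]:
    """Assemble the standardized return dict from (description, path) pairs."""
    if len(message) > 120:
        raise ValueError("message must be at most 120 characters")
    items = []
    for description, path in artifacts:
        if len(description) > 50:
            raise ValueError("artifact description must be at most 50 characters")
        # Store absolute paths, as the return format requires.
        items.append({"description": description, "path": str(Path(path).absolute())})
    return {"message": message, "reference": REFERENCE_URL, "artifacts": items}
```

For example, `build_result("scRNA-seq preprocessing complete", [("UMAP plot", "run1_umap.png")])` yields a dict with one artifact whose `path` is absolute.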
## Quality Review Results
### Iteration 1 (Final)
**Date**: 2026-02-14
**Status**: All checks passed
**Tool Design Validation**: [✓] All 7 checks passed
- Tool definition, naming, description, classification, order, boundaries, independence all correct
**Implementation Validation**: [✓] All 8 checks passed
- Function coverage, parameter design, input validation, tutorial fidelity, real-world focus, no hardcoding, library compliance, exact function calls all correct
**Output Validation**: [✓] All 5 checks passed
- Figure generation, data outputs, return format, file paths, reference links all correct
**Code Quality Validation**: [✓] All 6 checks passed
- Error handling, type annotations, documentation, template compliance, import management, environment setup all correct
**Summary**: 3/3 tools passing all checks. No issues found. Implementation is production-ready.
## Implementation Choices
### Libraries Used
- **anndata**: Standard format for single-cell data (AnnData objects)
- **scanpy**: Standard toolkit for scRNA-seq analysis (HVG, normalization, PCA, UMAP)
- **scglue**: GLUE-specific functions (LSI, graph construction, gene annotation)
- **networkx**: Standard graph library for guidance graph representation
- **matplotlib**: Visualization library for UMAP plots
### Error Handling Approach
**Basic Input Validation Only**:
- Required parameter validation (data_path must be provided)
- File existence checks (FileNotFoundError if file not found)
- No intermediate processing validation (trust library error messages)
**Rationale**: Tutorial assumes valid input data. Error handling focused on user input mistakes, not data quality issues.
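
The two validation steps described above might look like the following sketch. The helper name `validate_input_path` is hypothetical, and using `ValueError` for a missing argument is an assumption; only the `FileNotFoundError` for a missing file is stated in this log.

```python
from pathlib import Path
from typing import Optional


def validate_input_path(data_path: Optional[str], name: str = "data_path") -> Path:
    """Check that a required path argument was given and points to a file."""
    if data_path is None:
        # Paths default to None, so a missing argument is caught here.
        raise ValueError(f"`{name}` must be provided (path to an .h5ad file)")
    path = Path(data_path)
    if not path.is_file():
        raise FileNotFoundError(f"`{name}` does not exist: {path}")
    return path
```

Anything beyond these checks, such as a malformed count matrix, is left to surface through the underlying library's own error messages.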
### Parameterization Rationale
**Why Parameterize `color_var`?**
- Tutorial uses `"cell_type"` which is a column specific to the tutorial dataset
- User datasets will have different column names for cell annotations
- Parameterizing enables the tool to work with any AnnData object, whatever its metadata columns are named
**Why Parameterize `gtf_by`?**
- Tutorial uses `"gene_name"` attribute in GTF, but GTF files can use different attributes
- Some GTF files use `"gene_id"`, `"transcript_name"`, or other attributes
- Parameterizing enables the tool to work with different GTF annotation standards
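
To check ahead of time which attribute a given GTF file provides, the ninth column of a record can be inspected directly. The sketch below uses only the standard library; `parse_gtf_attributes` and `has_gtf_attribute` are hypothetical helpers, not part of scglue or of the implemented tools, and the parser only handles quoted `key "value";` pairs.

```python
import re
from typing import Dict

# Matches quoted pairs such as: gene_name "DDX11L1";
# Unquoted values (e.g. `level 2;`) are deliberately ignored in this sketch.
_ATTR = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_gtf_attributes(column: str) -> Dict[str, str]:
    """Parse the attribute column of a GTF record into a dict."""
    return dict(_ATTR.findall(column))


def has_gtf_attribute(gtf_line: str, gtf_by: str = "gene_name") -> bool:
    """Return True if a tab-separated GTF record carries the `gtf_by` attribute."""
    fields = gtf_line.rstrip("\n").split("\t")
    return len(fields) == 9 and gtf_by in parse_gtf_attributes(fields[8])
```

If the first gene records of a file fail this check for `"gene_name"` but pass for `"gene_id"`, that suggests passing `gtf_by="gene_id"` instead.
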
**Why Keep Default `n_top_genes=2000`?**
- This is a standard value in single-cell RNA-seq analysis
- The tutorial sets this value explicitly rather than relying on the library default
- Value represents a scientific choice about feature selection stringency
**Why Keep Default `n_components=100`?**
- This is the standard dimensionality for GLUE model training
- Tutorial explicitly sets this value for downstream model compatibility
- Changing this value would require adjusting the GLUE model architecture
## Known Limitations
1. **Coordinate Extraction Assumption**: `glue_construct_regulatory_graph` assumes ATAC peak names follow the format `"chr:start-end"`. If user data uses different formats (e.g., `"chr_start_end"` or `"chr:start:end"`), the coordinate extraction will fail. Users must ensure their peak names follow the expected format or pre-process their data.
2. **GTF Compatibility**: Gene annotation requires a GTF file with specific attributes. Not all GTF formats are compatible. Users must ensure their GTF file contains the required attribute (default: `"gene_name"`).
3. **Memory Requirements**: LSI and PCA operations on large datasets can be memory-intensive. Users with datasets >100k cells may encounter memory issues on standard workstations.
4. **Visualization Dependency**: UMAP visualizations require the `color_var` column to exist in the AnnData object. If the column is missing, the tool will fail. Users must ensure their data contains the specified annotation column.
5. **File Format Constraints**: Tools only accept h5ad format for input/output. Users with data in other formats (csv, mtx, loom) must convert to h5ad before using these tools.
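
For the first limitation, peak names can be checked against the expected `"chr:start-end"` format before running the graph tool. This is a minimal sketch with a hypothetical helper, `parse_peak_name`; it is stricter than a plain split on `:` and `-`, and it is not the extraction code used in `glue_construct_regulatory_graph`.

```python
import re
from typing import Tuple

# chromosome, then ':', then integer start, '-', integer end
_PEAK = re.compile(r"([^:]+):(\d+)-(\d+)")


def parse_peak_name(name: str) -> Tuple[str, int, int]:
    """Split a peak name like 'chr1:3005833-3005982' into (chrom, start, end)."""
    match = _PEAK.fullmatch(name)
    if match is None:
        raise ValueError(
            f"peak name {name!r} does not follow the 'chr:start-end' format"
        )
    chrom, start, end = match.groups()
    return chrom, int(start), int(end)
```

Names such as `"chr1_100_200"` or `"chr1:100:200"` raise `ValueError` here, so they can be renamed up front instead of failing partway through graph construction.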
## Testing Recommendations
1. **Test with tutorial data**: Verify that the tools reproduce the tutorial results exactly with the Chen-2019 dataset
2. **Test with different organisms**: Verify GTF annotation works with different reference genomes
3. **Test with different annotation columns**: Verify `color_var` parameter works with different metadata
4. **Test with edge cases**:
- Very small datasets (<100 cells)
- Very large datasets (>100k cells)
- Datasets with missing or malformed peak coordinates
- GTF files with different attribute names
## Revision History
### Initial Implementation
- 4 tools: `glue_read_paired_data`, `glue_preprocess_scrna`, `glue_preprocess_scatac`, `glue_construct_guidance_graph`
### Revision 1 (2026-02-14)
**Changes Made**:
1. **Removed `glue_read_paired_data` tool**: Classified as NOT Applicable to New Data (only loads tutorial data without analytical transformation)
2. **Renamed `glue_construct_guidance_graph` to `glue_construct_regulatory_graph`**: Better matches tutorial section title "Construct prior regulatory graph"
3. **Updated documentation**: Corrected tool count from 4 to 3 tools
**Rationale**: Enforce strict adherence to "Applicable to New Data" classification. Data loading without analytical transformation should not be a standalone tool.
**Result**: All 3 remaining tools pass quality review with all checks passing.