# Implementation Log: GLUE Preprocessing Tools

**Tutorial Source**: `gao-lab/GLUE/blob/master/docs/preprocessing.ipynb`
**Implementation Date**: 2026-02-14
**Output File**: `src/tools/preprocessing.py`

## Tool Design Decisions

### Tools Extracted (3 tools)

1. **glue_preprocess_scrna**
   - **Section**: "Preprocess scRNA-seq data"
   - **Rationale**: Complete preprocessing workflow for scRNA-seq data including HVG selection, normalization, scaling, PCA, and UMAP visualization
   - **Classification**: Applicable to New Data - performs standard scRNA-seq preprocessing on any raw count matrix
   - **Parameters Preserved**: `n_top_genes=2000`, `flavor="seurat_v3"`, `n_comps=100`, `svd_solver="auto"` all explicitly set in tutorial
   - **Parameters Parameterized**: `color_var="cell_type"` is tutorial-specific and must be configurable for user datasets

2. **glue_preprocess_scatac**
   - **Section**: "Preprocess scATAC-seq data"
   - **Rationale**: Complete preprocessing workflow for scATAC-seq data using LSI dimension reduction and UMAP visualization
   - **Classification**: Applicable to New Data - performs standard scATAC-seq preprocessing on any raw accessibility matrix
   - **Parameters Preserved**: `n_components=100`, `n_iter=15` explicitly set in tutorial
   - **Parameters Parameterized**: `color_var="cell_type"` is tutorial-specific and must be configurable for user datasets

3. **glue_construct_regulatory_graph**
   - **Section**: "Construct prior regulatory graph"
   - **Rationale**: Constructs prior regulatory graph linking RNA and ATAC features via genomic proximity
   - **Classification**: Applicable to New Data - essential for GLUE model training with any paired RNA-ATAC datasets
   - **Parameters Preserved**: `gtf_by="gene_name"` as tutorial default
   - **Input Requirements**: Requires GTF annotation file which users must provide for their organism

### Tools Excluded (1 tool)

1. **glue_read_paired_data** (initially present, removed in revision)
   - **Section**: "Read data"
   - **Rationale for Exclusion**: Only loads tutorial example data with no analytical transformation
   - **Classification**: NOT Applicable to New Data - data loading is trivial and should be handled by users

## Parameter Design Rationale

### Primary Data Inputs
- All tools use **file paths** as primary inputs (h5ad format for AnnData objects)
- No tool accepts an in-memory data object (e.g., `adata: AnnData`); this enforces a file-based workflow
- All data paths default to `None` with validation in function body for clear error messages

### Analysis Parameters
**Parameters Explicitly Set in Tutorial (Exposed as Parameters, Tutorial Values as Defaults)**:
- `n_top_genes=2000`, `flavor="seurat_v3"` - Tutorial shows explicit values for HVG selection
- `n_comps=100`, `svd_solver="auto"` - Tutorial shows explicit values for PCA
- `n_components=100`, `n_iter=15` - Tutorial shows explicit values for LSI
- `gtf_by="gene_name"` - Tutorial shows explicit attribute for GTF parsing

**Tutorial-Specific Values (Parameterized)**:
- `color_var="cell_type"` - Column name specific to tutorial dataset, must be configurable for user data

**Library Defaults (Preserved)**:
- `sc.pp.neighbors(rna, metric="cosine")` - Tutorial passes `metric="cosine"` explicitly; the call is preserved as-is and every other argument uses the library default
- `sc.pp.normalize_total(rna)` - No parameters in tutorial, using library defaults
- `sc.pp.log1p(rna)` - No parameters in tutorial, using library defaults
- `sc.pp.scale(rna)` - No parameters in tutorial, using library defaults

### Critical Rule Adherence
**NEVER ADD PARAMETERS NOT IN TUTORIAL**: All function parameters correspond to explicit values in the tutorial code. No parameters were added that weren't shown in the original tutorial.

**PRESERVE EXACT TUTORIAL STRUCTURE**: All function calls preserve the exact structure from the tutorial:
- `sc.pp.highly_variable_genes(rna, n_top_genes=2000, flavor="seurat_v3")` → parameterized as shown
- `sc.tl.pca(rna, n_comps=100, svd_solver="auto")` → parameterized as shown
- `scglue.data.lsi(atac, n_components=100, n_iter=15)` → parameterized as shown
- `sc.pp.neighbors(rna, metric="cosine")` → preserved exactly as shown
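
The scRNA-seq calls listed above can be strung together roughly as follows. This is a minimal sketch, not the contents of `src/tools/preprocessing.py`: the function name `preprocess_scrna_sketch` is hypothetical, the step order is assumed from the workflow description (HVG selection, normalization, scaling, PCA, UMAP), and the final `sc.tl.umap` call is assumed to be how the UMAP embedding is computed. With `n_comps=100`, the input needs more than 100 cells and genes.

```python
from typing import Any


def preprocess_scrna_sketch(
    data_path: str,
    n_top_genes: int = 2000,
    flavor: str = "seurat_v3",
    n_comps: int = 100,
    svd_solver: str = "auto",
) -> Any:
    """Run the scRNA-seq steps named in this log and return the AnnData object.

    Hypothetical helper for illustration; defaults mirror the tutorial values.
    """
    # Imported lazily so this sketch can be loaded without the heavy dependencies.
    import anndata as ad
    import scanpy as sc

    rna = ad.read_h5ad(data_path)

    # HVG selection comes first: flavor="seurat_v3" expects raw counts.
    sc.pp.highly_variable_genes(rna, n_top_genes=n_top_genes, flavor=flavor)

    # Library defaults, as in the tutorial calls quoted above.
    sc.pp.normalize_total(rna)
    sc.pp.log1p(rna)
    sc.pp.scale(rna)

    sc.tl.pca(rna, n_comps=n_comps, svd_solver=svd_solver)

    # metric="cosine" is fixed, not exposed as a parameter.
    sc.pp.neighbors(rna, metric="cosine")
    sc.tl.umap(rna)  # assumed embedding step behind the UMAP figure
    return rna
```

Only the four values the tutorial sets explicitly appear in the signature; the `neighbors` metric stays hardcoded, matching the "preserved exactly as shown" rule.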

## Output Requirements

### Visualization Outputs
**Code-Generated Figures Only**:
- `glue_preprocess_scrna`: UMAP visualization of RNA data (from tutorial section "Optionally, we can visualize...")
- `glue_preprocess_scatac`: UMAP visualization of ATAC data (from tutorial section "Optionally, we may also visualize...")
- No static figures or diagrams included (tutorial has none)

**Figure Specifications**:
- Format: PNG with `dpi=300`, `bbox_inches='tight'`
- Naming: `{out_prefix}_umap_{timestamp}.png`
- Always generated (no user control parameter)

### Data Outputs
**Essential Results Saved**:
- Preprocessed AnnData objects with all transformations applied
- Guidance graph in NetworkX GraphML format
- Annotated data with genomic coordinates

**File Formats**:
- AnnData: h5ad with gzip compression (standard for single-cell data)
- Graph: graphml.gz (standard for NetworkX graphs)

**Naming Convention**:
- `{out_prefix}_preprocessed_{timestamp}.h5ad`
- `{out_prefix}_graph_{timestamp}.graphml.gz`
- `{out_prefix}_rna_annotated_{timestamp}.h5ad`
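
A naming convention like this one could be centralized in a small helper. The sketch below is illustrative only: `build_output_path` is a hypothetical name, and the `%Y%m%d_%H%M%S` timestamp layout is an assumption, since this log does not state the timestamp format actually used.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def build_output_path(
    out_prefix: str, kind: str, ext: str, timestamp: Optional[str] = None
) -> Path:
    """Return an absolute path shaped like {out_prefix}_{kind}_{timestamp}.{ext}."""
    if timestamp is None:
        # Assumed timestamp layout; the log does not specify one.
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return Path(f"{out_prefix}_{kind}_{timestamp}.{ext}").resolve()


# One timestamp shared across the outputs of a single run keeps them grouped.
stamp = "20260214_120000"
h5ad_path = build_output_path("run1", "preprocessed", "h5ad", stamp)
graph_path = build_output_path("run1", "graph", "graphml.gz", stamp)
```

The same helper would cover the figure name by passing `kind="umap"` and `ext="png"`.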

### Return Format
All tools return standardized dict:
```python
{
    "message": "<concise status ≤120 chars>",
    "reference": "https://github.com/gao-lab/GLUE/blob/master/docs/preprocessing.ipynb",
    "artifacts": [
        {
            "description": "<description ≤50 chars>",
            "path": "/absolute/path/to/file"
        }
    ]
}
```
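
One way to keep every tool's return value in this shape is to build it through a single function that also enforces the length limits. The following is a sketch under that assumption; `build_result` and `REFERENCE_URL` are hypothetical names, not identifiers from the implementation.

```python
from pathlib import Path
from typing import Dict, List, Tuple, Union

REFERENCE_URL = "https://github.com/gao-lab/GLUE/blob/master/docs/preprocessing.ipynb"


def build_result(
    message: str, artifacts: List[Tuple[str, Union[str, Path]]]
) -> Dict[str, object]:
    """Assemble the standardized return dict from (description, path) pairs."""
    if len(message) > 120:
        raise ValueError("message must be at most 120 characters")

    entries = []
    for description, path in artifacts:
        if len(description) > 50:
            raise ValueError(f"description exceeds 50 characters: {description!r}")
        # Paths are always reported as absolute strings.
        entries.append({"description": description, "path": str(Path(path).resolve())})

    return {"message": message, "reference": REFERENCE_URL, "artifacts": entries}


result = build_result(
    "scRNA-seq preprocessing completed",
    [("Preprocessed RNA data", "run1_preprocessed.h5ad")],
)
```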

## Quality Review Results

### Iteration 1 (Final)
**Date**: 2026-02-14
**Status**: All checks passed

**Tool Design Validation**: [✓] All 7 checks passed
- Tool definition, naming, description, classification, order, boundaries, independence all correct

**Implementation Validation**: [✓] All 8 checks passed
- Function coverage, parameter design, input validation, tutorial fidelity, real-world focus, no hardcoding, library compliance, exact function calls all correct

**Output Validation**: [✓] All 5 checks passed
- Figure generation, data outputs, return format, file paths, reference links all correct

**Code Quality Validation**: [✓] All 6 checks passed
- Error handling, type annotations, documentation, template compliance, import management, environment setup all correct

**Summary**: 3/3 tools passing all checks. No issues found. Implementation is production-ready.

## Implementation Choices

### Libraries Used
- **anndata**: Standard format for single-cell data (AnnData objects)
- **scanpy**: Standard toolkit for scRNA-seq analysis (HVG, normalization, PCA, UMAP)
- **scglue**: GLUE-specific functions (LSI, graph construction, gene annotation)
- **networkx**: Standard graph library for guidance graph representation
- **matplotlib**: Visualization library for UMAP plots

### Error Handling Approach
**Basic Input Validation Only**:
- Required parameter validation (data_path must be provided)
- File existence checks (FileNotFoundError if file not found)
- No intermediate processing validation (trust library error messages)

**Rationale**: The tutorial assumes valid input data, so error handling focuses on user input mistakes, not data quality issues.
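
The two checks described here (required parameter, then file existence) might look like the sketch below. The helper name `resolve_input_path` and the error message wording are assumptions for illustration; only the `None` default and the `FileNotFoundError` come from this log.

```python
from pathlib import Path
from typing import Optional


def resolve_input_path(data_path: Optional[str], param_name: str = "data_path") -> Path:
    """Validate a user-supplied file path and return it as an absolute Path."""
    if data_path is None:
        # Path parameters default to None, so a missing value is reported explicitly.
        raise ValueError(f"'{param_name}' must be provided (path to an .h5ad file)")

    path = Path(data_path).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"'{param_name}' does not exist: {path}")
    return path
```

Anything beyond these checks, such as a malformed h5ad file, is left to surface as the underlying library's own error.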

### Parameterization Rationale

**Why Parameterize `color_var`?**
- Tutorial uses `"cell_type"` which is a column specific to the tutorial dataset
- User datasets will have different column names for cell annotations
- Parameterizing lets the tool work with any AnnData object, whatever its cell annotation column is named

**Why Parameterize `gtf_by`?**
- Tutorial uses `"gene_name"` attribute in GTF, but GTF files can use different attributes
- Some GTF files use `"gene_id"`, `"transcript_name"`, or other attributes
- Parameterizing enables tool to work with different GTF annotation standards

**Why Keep Default `n_top_genes=2000`?**
- This is a standard value in single-cell RNA-seq analysis
- Tutorial explicitly sets this value, not using library default
- Value represents a scientific choice about feature selection stringency

**Why Keep Default `n_components=100`?**
- This is the standard dimensionality for GLUE model training
- Tutorial explicitly sets this value for downstream model compatibility
- Changing this value would require adjusting the GLUE model architecture

## Known Limitations

1. **Coordinate Extraction Assumption**: `glue_construct_regulatory_graph` assumes ATAC peak names follow the format `"chr:start-end"`. If user data uses different formats (e.g., `"chr_start_end"` or `"chr:start:end"`), the coordinate extraction will fail. Users must ensure their peak names follow the expected format or pre-process their data.

2. **GTF Compatibility**: Gene annotation requires a GTF file that carries the attribute named by `gtf_by` (default: `"gene_name"`). Not all GTF files include it, so users must confirm the attribute is present or pass a different `gtf_by` value.

3. **Memory Requirements**: LSI and PCA operations on large datasets can be memory-intensive. Users with datasets >100k cells may encounter memory issues on standard workstations.

4. **Visualization Dependency**: UMAP visualizations require the `color_var` column to exist in the AnnData object. If the column is missing, the tool will fail. Users must ensure their data contains the specified annotation column.

5. **File Format Constraints**: Tools only accept h5ad format for input/output. Users with data in other formats (csv, mtx, loom) must convert to h5ad before using these tools.
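
To make limitation 1 concrete, the sketch below shows one way a `"chr:start-end"` peak name could be parsed and how the other formats would be rejected. It is an illustration only: `parse_peak_name` is a hypothetical helper, and the implementation's actual extraction code may differ.

```python
import re
from typing import Tuple

# Matches names such as "chr1:3005833-3005982"; nothing else is accepted.
_PEAK_PATTERN = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_peak_name(name: str) -> Tuple[str, int, int]:
    """Split a peak name of the form "chr:start-end" into (chrom, start, end)."""
    match = _PEAK_PATTERN.match(name)
    if match is None:
        raise ValueError(f"peak name {name!r} does not follow 'chr:start-end'")
    return match["chrom"], int(match["start"]), int(match["end"])


chrom, start, end = parse_peak_name("chr1:3005833-3005982")
# chrom == "chr1", start == 3005833, end == 3005982
```

Names such as `"chr_start_end"` or `"chr:start:end"` raise `ValueError` here, so a pre-check over all peak names before graph construction would surface format problems early.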

## Testing Recommendations

1. **Test with tutorial data**: Verify tools reproduce exact tutorial results with Chen-2019 dataset
2. **Test with different organisms**: Verify GTF annotation works with different reference genomes
3. **Test with different annotation columns**: Verify `color_var` parameter works with different metadata
4. **Test with edge cases**:
   - Very small datasets (<100 cells)
   - Very large datasets (>100k cells)
   - Datasets with missing or malformed peak coordinates
   - GTF files with different attribute names

## Revision History

### Initial Implementation
- 4 tools: `glue_read_paired_data`, `glue_preprocess_scrna`, `glue_preprocess_scatac`, `glue_construct_guidance_graph`

### Revision 1 (2026-02-14)
**Changes Made**:
1. **Removed `glue_read_paired_data` tool**: Classified as NOT Applicable to New Data (only loads tutorial data without analytical transformation)
2. **Renamed `glue_construct_guidance_graph` to `glue_construct_regulatory_graph`**: Better matches tutorial section title "Construct prior regulatory graph"
3. **Updated documentation**: Corrected tool count from 4 to 3 tools

**Rationale**: Enforce strict adherence to "Applicable to New Data" classification. Data loading without analytical transformation should not be a standalone tool.

**Result**: All 3 remaining tools pass every quality review check.