copilot-swe-agent[bot] raylim committed on
Commit
71ae2f0
·
1 Parent(s): e6c73c0

Add comprehensive documentation improvements

- Fix installation instructions in README (correct repo URL)
- Fix command name inconsistency (mosaic_app -> mosaic)
- Add detailed examples section to README
- Add CSV file format documentation
- Add cancer subtypes reference
- Add troubleshooting section
- Add advanced usage examples
- Create CONTRIBUTING.md with development guidelines
- Add comprehensive docstrings to all modules
- Add module-level docstrings to core modules

Co-authored-by: raylim <3074310+raylim@users.noreply.github.com>

CONTRIBUTING.md ADDED
@@ -0,0 +1,267 @@
1
+ # Contributing to Mosaic
2
+
3
+ Thank you for your interest in contributing to Mosaic! This document provides guidelines and instructions for contributing to the project.
4
+
5
+ ## Table of Contents
6
+
7
+ - [Getting Started](#getting-started)
8
+ - [Development Setup](#development-setup)
9
+ - [Code Style](#code-style)
10
+ - [Testing](#testing)
11
+ - [Submitting Changes](#submitting-changes)
12
+ - [Reporting Issues](#reporting-issues)
13
+
14
+ ## Getting Started
15
+
16
+ 1. Fork the repository on GitHub
17
+ 2. Clone your fork locally
18
+ 3. Set up the development environment
19
+ 4. Create a new branch for your changes
20
+ 5. Make your changes
21
+ 6. Test your changes
22
+ 7. Submit a pull request
23
+
24
+ ## Development Setup
25
+
26
+ ### Prerequisites
27
+
28
+ - Python 3.10 or higher
29
+ - [uv](https://docs.astral.sh/uv/) package manager
30
+ - NVIDIA GPU with CUDA support (for model inference)
31
+
32
+ ### Installation
33
+
34
+ 1. Clone the repository:
35
+
36
+ ```bash
37
+ git clone https://github.com/pathology-data-mining/mosaic.git
38
+ cd mosaic
39
+ ```
40
+
41
+ 2. Install dependencies including development tools:
42
+
43
+ ```bash
44
+ uv sync
45
+ ```
46
+
47
+ This will install all dependencies, including development tools like pytest, pylint, and black.
48
+
49
+ ### Running Tests
50
+
51
+ Run all tests:
52
+
53
+ ```bash
54
+ pytest tests/
55
+ ```
56
+
57
+ Run tests with coverage report:
58
+
59
+ ```bash
60
+ pytest tests/ --cov=src/mosaic --cov-report=term-missing
61
+ ```
62
+
63
+ Run a specific test file:
64
+
65
+ ```bash
66
+ pytest tests/inference/test_data.py -v
67
+ ```
68
+
69
+ ### Code Quality
70
+
71
+ #### Linting
72
+
73
+ We use pylint for code linting. Run it with:
74
+
75
+ ```bash
76
+ pylint src/mosaic
77
+ ```
78
+
79
+ #### Code Formatting
80
+
81
+ We use black for code formatting. Format your code with:
82
+
83
+ ```bash
84
+ black src/mosaic tests/
85
+ ```
86
+
87
+ ## Code Style
88
+
89
+ ### Python Style Guide
90
+
91
+ - Follow [PEP 8](https://pep8.org/) style guidelines
92
+ - Use meaningful variable and function names
93
+ - Add docstrings to all public functions, classes, and modules
94
+ - Keep functions focused and concise
95
+ - Use type hints where appropriate
96
+
97
+ ### Docstring Format
98
+
99
+ Use Google-style docstrings:
100
+
101
+ ```python
102
+ def function_name(param1: str, param2: int) -> bool:
103
+ """Brief description of the function.
104
+
105
+ More detailed description if needed.
106
+
107
+ Args:
108
+ param1: Description of param1
109
+ param2: Description of param2
110
+
111
+ Returns:
112
+ Description of return value
113
+
114
+ Raises:
115
+ ValueError: Description of when this error is raised
116
+ """
117
+ pass
118
+ ```
119
+
120
+ ### Commit Messages
121
+
122
+ - Use clear and descriptive commit messages
123
+ - Start with a verb in the imperative mood (e.g., "Add", "Fix", "Update")
124
+ - Keep the first line under 72 characters
125
+ - Provide additional context in the commit body if needed
126
+
127
+ Example:
128
+
129
+ ```
130
+ Add docstrings to inference module functions
131
+
132
+ - Added comprehensive docstrings to all public functions
133
+ - Included type hints for better code clarity
134
+ - Updated existing docstrings to follow Google style
135
+ ```
136
+
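The commit message rules above can be checked mechanically. The following is a minimal sketch, not part of the Mosaic tooling: `check_commit_message` is a hypothetical helper, and the blank-second-line check is an assumption based on common Git practice rather than on this guide.

```python
def check_commit_message(message, limit=72):
    """Return a list of problems with a commit message (empty if none)."""
    lines = message.splitlines()
    if not lines or not lines[0].strip():
        return ["empty subject line"]
    problems = []
    # The guide asks for a first line *under* 72 characters.
    if len(lines[0]) >= limit:
        problems.append(
            f"subject is {len(lines[0])} characters, limit is under {limit}"
        )
    # Assumption: a blank line separates the subject from the body.
    if len(lines) > 1 and lines[1].strip():
        problems.append("second line should be blank")
    return problems
```

A check like this could run from a local Git hook, although the imperative-mood rule still needs a human reviewer.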
137
+ ## Testing
138
+
139
+ ### Writing Tests
140
+
141
+ - Write tests for all new features and bug fixes
142
+ - Place tests in the appropriate directory under `tests/`
143
+ - Use pytest fixtures for common setup code
144
+ - Mock external dependencies (e.g., model loading, network requests)
145
+ - Ensure tests can run without GPU access or large model downloads
146
+
147
+ ### Test Structure
148
+
149
+ ```python
150
+ import pytest
151
+ from mosaic.module import function_to_test
152
+
153
+ def test_function_basic_case():
154
+ """Test basic functionality of the function."""
155
+ result = function_to_test(input_data)
156
+ assert result == expected_output
157
+
158
+ def test_function_edge_case():
159
+ """Test edge cases."""
160
+ with pytest.raises(ValueError):
161
+ function_to_test(invalid_input)
162
+ ```
163
+
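The advice to mock external dependencies could look like the sketch below. `load_model` and `predict` are hypothetical stand-ins defined here for illustration; they are not names from the Mosaic codebase.

```python
from unittest import mock


def load_model(path):
    """Hypothetical loader; stands in for any expensive model download."""
    raise RuntimeError("would need a GPU or a large download")


def predict(features, loader=load_model):
    """Hypothetical function under test: loads a model and applies it."""
    model = loader("model.pkl")
    return [model(x) for x in features]


def test_predict_without_real_model():
    """Replace the loader with a mock so no model file is needed."""
    fake_model = mock.Mock(side_effect=lambda x: x * 2)
    fake_loader = mock.Mock(return_value=fake_model)

    assert predict([1, 2, 3], loader=fake_loader) == [2, 4, 6]
    fake_loader.assert_called_once_with("model.pkl")
```

Here the loader is passed in as an argument so the test can swap it out. When the code under test imports its loader directly, `mock.patch` from the same module can replace it by name instead.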
164
+ ## Submitting Changes
165
+
166
+ ### Pull Request Process
167
+
168
+ 1. **Create a feature branch**:
169
+ ```bash
170
+ git checkout -b feature/your-feature-name
171
+ ```
172
+
173
+ 2. **Make your changes**:
174
+ - Write clear, focused commits
175
+ - Add tests for new functionality
176
+ - Update documentation as needed
177
+
178
+ 3. **Ensure code quality**:
179
+ ```bash
180
+ black src/mosaic tests/
181
+ pylint src/mosaic
182
+ pytest tests/
183
+ ```
184
+
185
+ 4. **Push to your fork**:
186
+ ```bash
187
+ git push origin feature/your-feature-name
188
+ ```
189
+
190
+ 5. **Create a Pull Request**:
191
+ - Go to the GitHub repository
192
+ - Click "New Pull Request"
193
+ - Select your branch
194
+ - Provide a clear description of your changes
195
+ - Reference any related issues
196
+
197
+ ### Pull Request Guidelines
198
+
199
+ - Keep pull requests focused on a single feature or fix
200
+ - Update documentation for any changed functionality
201
+ - Add or update tests as appropriate
202
+ - Ensure all tests pass before submitting
203
+ - Respond to review feedback promptly
204
+
205
+ ## Reporting Issues
206
+
207
+ ### Bug Reports
208
+
209
+ When reporting a bug, please include:
210
+
211
+ - A clear and descriptive title
212
+ - Steps to reproduce the issue
213
+ - Expected behavior
214
+ - Actual behavior
215
+ - System information (OS, Python version, GPU model)
216
+ - Relevant log output or error messages
217
+ - Minimal code example to reproduce the issue
218
+
219
+ ### Feature Requests
220
+
221
+ When suggesting a feature, please include:
222
+
223
+ - A clear description of the feature
224
+ - The use case and benefits
225
+ - Any alternative solutions you've considered
226
+ - Examples of how the feature would be used
227
+
228
+ ### Issue Templates
229
+
230
+ Please use the appropriate issue template when creating a new issue.
231
+
232
+ ## Development Guidelines
233
+
234
+ ### Module Organization
235
+
236
+ - Keep modules focused on a single responsibility
237
+ - Place UI-related code in `src/mosaic/ui/`
238
+ - Place inference code in `src/mosaic/inference/`
239
+ - Place analysis logic in `src/mosaic/analysis.py`
240
+ - Avoid circular dependencies
241
+
242
+ ### Adding New Features
243
+
244
+ When adding new features:
245
+
246
+ 1. Discuss the feature in an issue first
247
+ 2. Follow the existing code structure
248
+ 3. Add comprehensive tests
249
+ 4. Update relevant documentation
250
+ 5. Consider backward compatibility
251
+
252
+ ### Dependencies
253
+
254
+ - Avoid adding new dependencies unless necessary
255
+ - Discuss new dependencies in an issue or pull request
256
+ - Ensure dependencies are compatible with the project's license
257
+ - Pin dependency versions in `pyproject.toml`
258
+
259
+ ## Questions?
260
+
261
+ If you have questions about contributing, please:
262
+
263
+ - Check existing issues and pull requests
264
+ - Open a new issue with your question
265
+ - Join our community discussions (if available)
266
+
267
+ Thank you for contributing to Mosaic!
README.md CHANGED
@@ -4,8 +4,22 @@ Mosaic is a deep learning model designed for predicting cancer subtypes and biom
4
 
5
  ## Table of Contents
6
 
 
 
7
  - [Installation](#installation)
8
  - [Usage](#usage)
9
 
10
  ### System requirements
11
 
@@ -25,7 +39,15 @@ Supported systems:
25
  ## Installation
26
 
27
  ```bash
28
- uv pip install git+ssh://git@github.com/pathology-data-mining/paladin_webapp.git@dev
29
  ```
30
 
31
  ## Usage
@@ -49,23 +71,23 @@ export HF_HOME="PATH-TO-HUGGINGFACE-HOME"
49
  Run the web application with:
50
 
51
  ```bash
52
- mosaic_app
53
  ```
54
 
55
  It will start a web server on port 7860 by default. You can access the web interface by navigating to `http://localhost:7860` in your web browser.
56
 
57
  ### Command Line Interface
58
 
59
- To process a WSI, use the following command:
60
 
61
  ```bash
62
- mosaic_app --slide-path /path/to/your/wsi.svs --output-dir /path/to/output/directory
63
  ```
64
 
65
  To process a batch of WSIs, use:
66
 
67
  ```bash
68
- mosaic_app --slide-csv /path/to/your/wsi_list.csv --output-dir /path/to/output/directory
69
  ```
70
 
71
  The CSV file should contain at least the columns `Slide` and `Site Type`.
@@ -80,7 +102,7 @@ Optionally, it can also contain `Cancer Subtype`, `Segmentation Config`, and `IH
80
  See additional options with the help command. This command may take a few seconds to run:
81
 
82
  ```bash
83
- mosaic_app --help
84
  ```
85
 
86
  If you set a port to run in server mode, you can check whether it is available with `ss -tuln | grep :PORT`, where PORT is the port number you want to check. No output indicates the port may be available. If it is, set the environment variable with `export GRADIO_SERVER_PORT="PORT"`.
@@ -88,4 +110,189 @@ If setting port to run in server mode, you may check for available ports using `
88
  ### Notes
89
 
90
  - The first time you run the application, it will download the necessary models from HuggingFace. This may take some time depending on your internet connection.
91
- - The models are downloaded to a directory relative to where you run the application. (A subdirectory named `data`).
4
 
5
  ## Table of Contents
6
 
7
+ - [System Requirements](#system-requirements)
8
+ - [Pre-requisites](#pre-requisites)
9
  - [Installation](#installation)
10
  - [Usage](#usage)
11
+ - [Initial Setup](#initial-setup)
12
+ - [Web Application](#web-application)
13
+ - [Command Line Interface](#command-line-interface)
14
+ - [Notes](#notes)
15
+ - [Output Files](#output-files)
16
+ - [Examples](#examples)
17
+ - [Advanced Usage](#advanced-usage)
18
+ - [CSV File Format](#csv-file-format)
19
+ - [Cancer Subtypes](#cancer-subtypes)
20
+ - [Troubleshooting](#troubleshooting)
21
+ - [Contributing](#contributing)
22
+ - [License](#license)
23
 
24
  ### System requirements
25
 
 
39
  ## Installation
40
 
41
  ```bash
42
+ git clone https://github.com/pathology-data-mining/mosaic.git
43
+ cd mosaic
44
+ uv sync
45
+ ```
46
+
47
+ Alternatively, install directly from the repository:
48
+
49
+ ```bash
50
+ uv pip install git+https://github.com/pathology-data-mining/mosaic.git
51
  ```
52
 
53
  ## Usage
 
71
  Run the web application with:
72
 
73
  ```bash
74
+ mosaic
75
  ```
76
 
77
  It will start a web server on port 7860 by default. You can access the web interface by navigating to `http://localhost:7860` in your web browser.
78
 
79
  ### Command Line Interface
80
 
81
+ To process a single WSI, use the following command:
82
 
83
  ```bash
84
+ mosaic --slide-path /path/to/your/wsi.svs --output-dir /path/to/output/directory
85
  ```
86
 
87
  To process a batch of WSIs, use:
88
 
89
  ```bash
90
+ mosaic --slide-csv /path/to/your/wsi_list.csv --output-dir /path/to/output/directory
91
  ```
92
 
93
  The CSV file should contain at least the columns `Slide` and `Site Type`.
 
102
  See additional options with the help command. This command may take a few seconds to run:
103
 
104
  ```bash
105
+ mosaic --help
106
  ```
107
 
108
  If you set a port to run in server mode, you can check whether it is available with `ss -tuln | grep :PORT`, where PORT is the port number you want to check. No output indicates the port may be available. If it is, set the environment variable with `export GRADIO_SERVER_PORT="PORT"`.
 
110
  ### Notes
111
 
112
  - The first time you run the application, it will download the necessary models from HuggingFace. This may take some time depending on your internet connection.
113
+ - The models are downloaded to a directory named `data` relative to where you run the application.
114
+
115
+ ## Output Files
116
+
117
+ ### Single Slide Processing
118
+
119
+ When processing a single slide, the following files are generated in the output directory:
120
+
121
+ - `{slide_name}_mask.png`: Visualization of the tissue segmentation
122
+ - `{slide_name}_aeon_results.csv`: Cancer subtype predictions with confidence scores (if cancer subtype was set to "Unknown")
123
+ - `{slide_name}_paladin_results.csv`: Biomarker predictions for the slide
124
+
125
+ ### Batch Processing
126
+
127
+ When processing multiple slides, in addition to individual slide outputs, combined results are generated:
128
+
129
+ - `combined_aeon_results.csv`: Cancer subtype predictions for all slides in a single file
130
+ - `combined_paladin_results.csv`: Biomarker predictions for all slides in a single file
131
+
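The naming scheme above can be turned into a quick check that a run produced what you expect. This is a sketch only: `expected_outputs` is a hypothetical helper, and it assumes `{slide_name}` means the slide file name without its extension, which the README does not state explicitly.

```python
from pathlib import Path


def expected_outputs(slide_path, output_dir, batch=False):
    """List the output files described above for one slide.

    Assumes {slide_name} is the slide file name without its extension.
    """
    name = Path(slide_path).stem
    out = Path(output_dir)
    files = [
        out / f"{name}_mask.png",
        out / f"{name}_aeon_results.csv",  # only when subtype is "Unknown"
        out / f"{name}_paladin_results.csv",
    ]
    if batch:
        # Combined files are written once per batch run.
        files += [
            out / "combined_aeon_results.csv",
            out / "combined_paladin_results.csv",
        ]
    return files
```

After a run, `[f for f in expected_outputs(slide, out_dir) if not f.exists()]` would list anything missing, bearing in mind that the Aeon file is only written when the subtype was `Unknown`.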
132
+ ## Examples
133
+
134
+ ### Example 1: Process a single slide with unknown cancer type
135
+
136
+ ```bash
137
+ mosaic --slide-path /data/slides/sample.svs \
138
+ --output-dir /data/results \
139
+ --site-type Primary \
140
+ --cancer-subtype Unknown \
141
+ --segmentation-config Resection
142
+ ```
143
+
144
+ ### Example 2: Process a single breast cancer slide with known IHC subtype
145
+
146
+ ```bash
147
+ mosaic --slide-path /data/slides/breast_sample.svs \
148
+ --output-dir /data/results \
149
+ --site-type Primary \
150
+ --cancer-subtype BRCA \
151
+ --ihc-subtype "HR+/HER2-" \
152
+ --segmentation-config Biopsy
153
+ ```
154
+
155
+ ### Example 3: Process multiple slides from CSV
156
+
157
+ Create a CSV file `slides.csv` with the following format:
158
+
159
+ ```csv
160
+ Slide,Site Type,Cancer Subtype,Segmentation Config,IHC Subtype
161
+ /data/slides/sample1.svs,Primary,Unknown,Resection,
162
+ /data/slides/sample2.svs,Metastatic,LUAD,Biopsy,
163
+ /data/slides/sample3.svs,Primary,BRCA,TCGA,HR+/HER2-
164
+ ```
165
+
166
+ Then run:
167
+
168
+ ```bash
169
+ mosaic --slide-csv slides.csv --output-dir /data/results
170
+ ```
171
+
172
+ ## Advanced Usage
173
+
174
+ ### Adjusting Performance
175
+
176
+ You can control the number of workers for feature extraction to balance between speed and memory usage:
177
+
178
+ ```bash
179
+ mosaic --slide-path /path/to/slide.svs \
180
+ --output-dir /path/to/output \
181
+ --num-workers 8
182
+ ```
183
+
184
+ ### Running in Server Mode
185
+
186
+ To run Mosaic as a web server accessible from other machines:
187
+
188
+ ```bash
189
+ export GRADIO_SERVER_PORT=7860
190
+ mosaic --server-name 0.0.0.0 --server-port 7860
191
+ ```
192
+
193
+ Check for available ports using:
194
+ ```bash
195
+ ss -tuln | grep :7860
196
+ ```
197
+
198
+ To share the application publicly (use with caution):
199
+
200
+ ```bash
201
+ mosaic --share
202
+ ```
203
+
204
+ ### Debug Mode
205
+
206
+ Enable debug logging for troubleshooting:
207
+
208
+ ```bash
209
+ mosaic --debug
210
+ ```
211
+
212
+ This will create a `debug.log` file with detailed information about the processing steps.
213
+
214
+ ## CSV File Format
215
+
216
+ When processing multiple slides using the `--slide-csv` option, the CSV file must contain the following columns:
217
+
218
+ ### Required Columns
219
+
220
+ - **Slide**: Full path to the WSI file (e.g., `/path/to/slide.svs`)
221
+ - **Site Type**: Either `Primary` or `Metastatic`
222
+
223
+ ### Optional Columns
224
+
225
+ - **Cancer Subtype**: OncoTree code for the cancer subtype (e.g., `LUAD`, `BRCA`, `COAD`). Use `Unknown` to let Aeon infer the cancer type.
226
+ - **Segmentation Config**: One of `Biopsy`, `Resection`, or `TCGA`. Defaults to `Biopsy` if not specified.
227
+ - **IHC Subtype**: For breast cancer (BRCA) only. One of:
228
+ - `HR+/HER2+`
229
+ - `HR+/HER2-`
230
+ - `HR-/HER2+`
231
+ - `HR-/HER2-`
232
+
233
+ ### CSV Example
234
+
235
+ ```csv
236
+ Slide,Site Type,Cancer Subtype,Segmentation Config,IHC Subtype
237
+ /data/slides/lung1.svs,Primary,LUAD,Resection,
238
+ /data/slides/breast1.svs,Primary,BRCA,Biopsy,HR+/HER2-
239
+ /data/slides/unknown1.svs,Metastatic,Unknown,TCGA,
240
+ ```
241
+
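Before starting a long batch run, it may help to validate the CSV against the rules listed above. The sketch below uses only the standard library; `validate_slide_csv` is a hypothetical helper, not part of Mosaic, and it does not check cancer subtype codes because the full list of supported codes is not given here.

```python
import csv
import io

REQUIRED = ("Slide", "Site Type")
SITE_TYPES = {"Primary", "Metastatic"}
SEG_CONFIGS = {"Biopsy", "Resection", "TCGA"}
IHC_SUBTYPES = {"HR+/HER2+", "HR+/HER2-", "HR-/HER2+", "HR-/HER2-"}


def validate_slide_csv(text):
    """Return a list of problems found in the CSV text (empty if none)."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    problems = [f"missing column: {c}" for c in REQUIRED if c not in header]
    if problems:
        return problems
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if row["Site Type"] not in SITE_TYPES:
            problems.append(f"line {line_no}: bad Site Type {row['Site Type']!r}")
        # Optional columns: an absent or empty value is accepted.
        seg = row.get("Segmentation Config") or ""
        if seg and seg not in SEG_CONFIGS:
            problems.append(f"line {line_no}: bad Segmentation Config {seg!r}")
        ihc = row.get("IHC Subtype") or ""
        if ihc and ihc not in IHC_SUBTYPES:
            problems.append(f"line {line_no}: bad IHC Subtype {ihc!r}")
    return problems
```

Calling `validate_slide_csv(Path("slides.csv").read_text())` returns an empty list when the file follows the format above.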
242
+ ## Cancer Subtypes
243
+
244
+ Mosaic uses OncoTree codes to identify cancer subtypes. Common examples include:
245
+
246
+ - **LUAD**: Lung Adenocarcinoma
247
+ - **LUSC**: Lung Squamous Cell Carcinoma
248
+ - **BRCA**: Breast Invasive Carcinoma
249
+ - **COAD**: Colon Adenocarcinoma
250
+ - **READ**: Rectal Adenocarcinoma
251
+ - **PRAD**: Prostate Adenocarcinoma
252
+ - **SKCM**: Skin Cutaneous Melanoma
253
+
254
+ For the complete list of OncoTree codes, see the [OncoTree website](http://oncotree.mskcc.org/).
255
+
256
+ When the cancer subtype is set to `Unknown`, Mosaic will use the Aeon model to predict the most likely cancer subtype based on the H&E image features.
257
+
258
+ ## Troubleshooting
259
+
260
+ ### HuggingFace Authentication Errors
261
+
262
+ If you encounter authentication errors when downloading models:
263
+
264
+ 1. Ensure you have access to the PDM-Group on HuggingFace
265
+ 2. Create a HuggingFace access token with appropriate permissions
266
+ 3. Set the `HF_TOKEN` environment variable correctly
267
+
268
+ ### Out of Memory Errors
269
+
270
+ If you encounter GPU out-of-memory errors:
271
+
272
+ 1. Reduce the number of workers: `--num-workers 2`
273
+ 2. Process slides sequentially instead of in batch
274
+ 3. Consider using a GPU with more memory
275
+
276
+ ### Tissue Segmentation Issues
277
+
278
+ If tissue is not being detected correctly:
279
+
280
+ 1. Try a different segmentation configuration (`Biopsy`, `Resection`, or `TCGA`)
281
+ 2. Check that the slide file is not corrupted
282
+ 3. Verify the slide format is supported (e.g., `.svs`, `.tif`)
283
+
284
+ ### Port Already in Use
285
+
286
+ If the default port 7860 is already in use:
287
+
288
+ 1. Check whether the port is in use: `ss -tuln | grep :7860`
289
+ 2. Use a different port: `export GRADIO_SERVER_PORT=7861`
290
+ 3. Or specify the port directly: `mosaic --server-port 7861`
291
+
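Where `ss` is not available, the same check can be approximated from Python by trying to bind the port. This is a minimal sketch; `is_port_free` is a hypothetical helper and the default host is an assumption.

```python
import socket


def is_port_free(port, host="127.0.0.1"):
    """Return True if nothing is bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use".
            return False
    return True
```

The result is only a snapshot: another process can take the port between this check and the moment the server starts.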
292
+ ## Contributing
293
+
294
+ We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on how to contribute to this project.
295
+
296
+ ## License
297
+
298
+ This project is licensed under the terms specified in the LICENSE file.
src/mosaic/analysis.py CHANGED
@@ -1,3 +1,9 @@
1
  import pickle
2
  import torch
3
  import pandas as pd
@@ -22,6 +28,37 @@ def analyze_slide(
22
  num_workers=4,
23
  progress=gr.Progress(track_tqdm=True),
24
  ):
25
  if slide_path is None:
26
  raise gr.Error("Please upload a slide.")
27
  # Step 1: Segment tissue
 
1
+ """Core slide analysis module for Mosaic.
2
+
3
+ This module provides the main slide analysis pipeline that integrates tissue segmentation,
4
+ feature extraction, and model inference for cancer subtype and biomarker prediction.
5
+ """
6
+
7
  import pickle
8
  import torch
9
  import pandas as pd
 
28
  num_workers=4,
29
  progress=gr.Progress(track_tqdm=True),
30
  ):
31
+ """Analyze a whole slide image for cancer subtype and biomarker prediction.
32
+
33
+ This function performs a complete analysis pipeline including:
34
+ 1. Tissue segmentation
35
+ 2. CTransPath feature extraction
36
+ 3. Feature filtering with marker classifier
37
+ 4. Optimus feature extraction on filtered tiles
38
+ 5. Aeon inference for cancer subtype (if not provided)
39
+ 6. Paladin inference for biomarker prediction
40
+
41
+ Args:
42
+ slide_path: Path to the whole slide image file
43
+ seg_config: Segmentation configuration, one of "Biopsy", "Resection", or "TCGA"
44
+ site_type: Site type, either "Primary" or "Metastatic"
45
+ cancer_subtype: Cancer subtype (OncoTree code or "Unknown" for inference)
46
+ cancer_subtype_name_map: Dictionary mapping cancer subtype names to codes
47
+ ihc_subtype: IHC subtype for breast cancer (optional)
48
+ num_workers: Number of worker processes for feature extraction
49
+ progress: Gradio progress tracker for UI updates
50
+
51
+ Returns:
52
+ tuple: (slide_mask, aeon_results, paladin_results)
53
+ - slide_mask: PIL Image of tissue segmentation visualization
54
+ - aeon_results: DataFrame with cancer subtype predictions and confidence scores
55
+ - paladin_results: DataFrame with biomarker predictions
56
+
57
+ Raises:
58
+ gr.Error: If no slide is provided
59
+ gr.Warning: If no tissue is detected in the slide
60
+ ValueError: If an unknown segmentation configuration is provided
61
+ """
62
  if slide_path is None:
63
  raise gr.Error("Please upload a slide.")
64
  # Step 1: Segment tissue
src/mosaic/gradio_app.py CHANGED
@@ -1,3 +1,12 @@
1
  from argparse import ArgumentParser
2
  import pandas as pd
3
  from pathlib import Path
@@ -17,6 +26,17 @@ from mosaic.analysis import analyze_slide
17
 
18
 
19
  def download_and_process_models():
20
  snapshot_download(repo_id="PDM-Group/paladin-aeon-models", local_dir="data")
21
 
22
  model_map = pd.read_csv(
@@ -41,6 +61,16 @@ def download_and_process_models():
41
 
42
 
43
  def main():
44
  parser = ArgumentParser()
45
  parser.add_argument("--debug", action="store_true", help="Enable debug logging")
46
  parser.add_argument(
 
1
+ """Mosaic command-line interface and entry point.
2
+
3
+ This module provides the main CLI for the Mosaic application, handling:
4
+ - Model downloading and initialization
5
+ - Single slide processing
6
+ - Batch slide processing from CSV
7
+ - Launching the Gradio web interface
8
+ """
9
+
10
  from argparse import ArgumentParser
11
  import pandas as pd
12
  from pathlib import Path
 
26
 
27
 
28
  def download_and_process_models():
29
+ """Download models from HuggingFace and initialize cancer subtype mappings.
30
+
31
+ Downloads the Paladin and Aeon models from the PDM-Group HuggingFace repository
32
+ and creates mappings between cancer subtype names and OncoTree codes.
33
+
34
+ Returns:
35
+ tuple: (cancer_subtype_name_map, reversed_cancer_subtype_name_map, cancer_subtypes)
36
+ - cancer_subtype_name_map: Dict mapping display names to OncoTree codes
37
+ - reversed_cancer_subtype_name_map: Dict mapping OncoTree codes to display names
38
+ - cancer_subtypes: List of all supported cancer subtype codes
39
+ """
40
  snapshot_download(repo_id="PDM-Group/paladin-aeon-models", local_dir="data")
41
 
42
  model_map = pd.read_csv(
 
61
 
62
 
63
  def main():
64
+ """Main entry point for the Mosaic application.
65
+
66
+ Parses command-line arguments and routes to the appropriate mode:
67
+ - Single slide processing (--slide-path)
68
+ - Batch processing (--slide-csv)
69
+ - Web interface (default, no slide arguments)
70
+
71
+ Command-line arguments control analysis parameters like site type,
72
+ cancer subtype, segmentation configuration, and output directory.
73
+ """
74
  parser = ArgumentParser()
75
  parser.add_argument("--debug", action="store_true", help="Enable debug logging")
76
  parser.add_argument(
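The routing described in the `main()` docstring can be sketched as below. Only the `--slide-path` and `--slide-csv` flags come from the source; `pick_mode` is a hypothetical function, and giving `--slide-path` precedence when both flags are present is an assumption, since the diff does not show how `main()` handles that case.

```python
from argparse import ArgumentParser


def pick_mode(argv):
    """Return which mode the given arguments select (illustrative only)."""
    parser = ArgumentParser()
    parser.add_argument("--slide-path")
    parser.add_argument("--slide-csv")
    args = parser.parse_args(argv)
    if args.slide_path is not None:
        return "single"  # one slide from the command line
    if args.slide_csv is not None:
        return "batch"  # many slides listed in a CSV file
    return "web"  # no slide arguments: launch the web interface
```

For example, `pick_mode(["--slide-csv", "slides.csv"])` selects batch mode, and `pick_mode([])` falls through to the web interface.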
src/mosaic/inference/aeon.py CHANGED
@@ -1,3 +1,9 @@
1
  import pickle # nosec
2
  import sys
3
  from argparse import ArgumentParser
@@ -16,6 +22,7 @@ from mosaic.inference.data import (
16
 
17
  from loguru import logger
18
 
 
19
  cancer_types_to_drop = [
20
  "UDMN",
21
  "ADNOS",
@@ -48,6 +55,21 @@ NUM_WORKERS = 8
48
  def run(
49
  features, model_path, metastatic=False, batch_size=8, num_workers=8, use_cpu=False
50
  ):
51
  device = torch.device(
52
  "cuda" if not use_cpu and torch.cuda.is_available() else "cpu"
53
  )
 
1
+ """Aeon model inference module for cancer subtype prediction.
2
+
3
+ This module provides functionality to run the Aeon deep learning model
4
+ for predicting cancer subtypes from H&E whole slide image features.
5
+ """
6
+
7
  import pickle # nosec
8
  import sys
9
  from argparse import ArgumentParser
 
22
 
23
  from loguru import logger
24
 
25
+ # Cancer types excluded from prediction (too broad or ambiguous)
26
  cancer_types_to_drop = [
27
  "UDMN",
28
  "ADNOS",
 
55
  def run(
56
  features, model_path, metastatic=False, batch_size=8, num_workers=8, use_cpu=False
57
  ):
58
+ """Run Aeon model inference for cancer subtype prediction.
59
+
60
+ Args:
61
+ features: NumPy array of tile features extracted from the WSI
62
+ model_path: Path to the pickled Aeon model file
63
+ metastatic: Whether the slide is from a metastatic site
64
+ batch_size: Batch size for inference
65
+ num_workers: Number of workers for data loading
66
+ use_cpu: Force CPU usage instead of GPU
67
+
68
+ Returns:
69
+ tuple: (results_df, part_embedding)
70
+ - results_df: DataFrame with cancer subtypes and confidence scores
71
+ - part_embedding: Torch tensor of the learned part representation
72
+ """
73
  device = torch.device(
74
  "cuda" if not use_cpu and torch.cuda.is_available() else "cpu"
75
  )
src/mosaic/inference/data.py CHANGED
@@ -1,3 +1,11 @@
1
  from enum import Enum
2
  from typing import List
3
 
 
1
+ """Data structures and utilities for inference modules.
2
+
3
+ This module provides:
4
+ - Cancer type to integer mappings for model inputs/outputs
5
+ - SiteType enum for primary vs metastatic classification
6
+ - TileFeatureTensorDataset for feeding features to PyTorch models
7
+ """
8
+
9
  from enum import Enum
10
  from typing import List
11
 
src/mosaic/inference/paladin.py CHANGED
@@ -1,3 +1,10 @@
1
  import csv
2
  import pickle # nosec
3
  import sys
@@ -27,11 +34,16 @@ class UsageError(Exception):
27
 
28
 
29
  def load_model_map(model_map_path: str) -> dict[Any, Any]:
30
- """Load the table mapping cancer_subtypes and targets to the paladin
31
- model (a pickle file) that predicts that target for that cancer subtype.
32
 
33
  A dict is returned, mapping each cancer_subtype to a table mapping a
34
  target to the pathname for the model that predicts it.
35
  """
36
  models = defaultdict(dict)
37
  with Path(model_map_path).open() as fp:
@@ -45,10 +57,13 @@ def load_model_map(model_map_path: str) -> dict[Any, Any]:
45
 
46
 
47
  def load_aeon_scores(df: pd.DataFrame) -> dict[str, float]:
48
- """Load the output table from a single-slide Aeon run, listing Oncotree
49
- cancer subtypes and their confidence values.
50
-
51
- A dict is returned, mapping each cancersubtype to its confidence score.
52
  """
53
  score = {}
54
  for _, row in df.iterrows():
@@ -59,7 +74,15 @@ def load_aeon_scores(df: pd.DataFrame) -> dict[str, float]:
59
 
60
 
61
  def select_cancer_subtypes(aeon_scores: dict[str, float], k=1) -> list[str]:
62
- """Return the three top-scoring cancer_subtypes, based on the given Aeon scores."""
63
  sorted_cancer_subtypes = list(
64
  sorted([(v, k) for k, v in aeon_scores.items()], reverse=True)
65
  )
@@ -67,7 +90,15 @@ def select_cancer_subtypes(aeon_scores: dict[str, float], k=1) -> list[str]:
67
 
68
 
69
  def select_models(cancer_subtypes: list[str], model_map: dict[Any, Any]) -> list[Any]:
70
- """ """
71
  models = []
72
  for cancer_subtype, target, model in model_map.items():
73
  if cancer_subtype in cancer_subtypes:
@@ -76,8 +107,17 @@ def select_models(cancer_subtypes: list[str], model_map: dict[Any, Any]) -> list
76
 
77
 
78
  def run_model(device, dataset, model_path: str, num_workers, batch_size) -> float:
79
- """Run inference for the given embeddings and model.
80
- The point estimate is returned.
81
  """
82
 
83
  logger.debug(f"[loading model {model_path}]")
@@ -108,6 +148,17 @@ def run_model(device, dataset, model_path: str, num_workers, batch_size) -> floa
108
 
109
 
110
  def logits_to_point_estimates(logits):
111
  # logits is a tensor of shape (batch_size, 2 * (n_clf_tasks + n_reg_tasks))
112
  # need to convert it to a tensor of shape (batch_size, n_clf_tasks + n_reg_tasks)
113
  return logits[:, ::2] / (logits[:, ::2] + logits[:, 1::2])
@@ -124,13 +175,28 @@ def run(
124
  num_workers: int = NUM_WORKERS,
125
  use_cpu: bool = False,
126
  ):
127
- """Run Paladin inference on a single slide, using the given embeddings
128
- and either a single model or a table mapping cancer_subtypes and targets to models.
129
- If cancer_subtype_codes is given, it is a list of OncoTree codes for the slide.
130
- If aeon_predictions_path is given, it is the pathname to a CSV file
131
- with the output of an Aeon run on the slide.
132
- If both are given, an error is raised.
133
- The output is written to the given output_path (a CSV file).
134
  """
135
 
136
  if aeon_results is not None:
 
1
+ """Paladin model inference module for biomarker prediction.
2
+
3
+ This module provides functionality to run the Paladin deep learning models
4
+ for predicting various biomarkers from H&E whole slide image features, based
5
+ on the predicted or known cancer subtype.
6
+ """
7
+
8
  import csv
9
  import pickle # nosec
10
  import sys
 
34
 
35
 
36
  def load_model_map(model_map_path: str) -> dict[Any, Any]:
37
+ """Load the table mapping cancer subtypes and targets to Paladin models.
 
38
 
39
  A dict is returned, mapping each cancer_subtype to a table mapping a
40
  target to the pathname for the model that predicts it.
41
+
42
+ Args:
43
+ model_map_path: Path to the CSV file containing the model map
44
+
45
+ Returns:
46
+ Dictionary mapping cancer subtypes to their target-specific models
47
  """
48
  models = defaultdict(dict)
49
  with Path(model_map_path).open() as fp:
 
57
 
58
 
59
  def load_aeon_scores(df: pd.DataFrame) -> dict[str, float]:
60
+ """Load Aeon output table with cancer subtypes and confidence values.
61
+
62
+ Args:
63
+ df: DataFrame with columns 'Cancer Subtype' and 'Confidence'
64
+
65
+ Returns:
66
+ Dictionary mapping cancer subtypes to their confidence scores
67
  """
68
  score = {}
69
  for _, row in df.iterrows():
 
74
 
75
 
76
  def select_cancer_subtypes(aeon_scores: dict[str, float], k=1) -> list[str]:
77
+ """Select the top k cancer subtypes based on Aeon confidence scores.
78
+
79
+ Args:
80
+ aeon_scores: Dictionary mapping cancer subtypes to confidence scores
81
+ k: Number of top subtypes to select (default: 1)
82
+
83
+ Returns:
84
+ List of cancer subtype codes sorted by confidence (highest first)
85
+ """
86
  sorted_cancer_subtypes = list(
87
  sorted([(v, k) for k, v in aeon_scores.items()], reverse=True)
88
  )
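Top-k selection as described in the docstring can be illustrated with a standalone sketch. `top_k_subtypes` is a hypothetical name and the scores used with it below are invented; this version sorts by score only, so ties may be ordered differently from the function in the diff, which sorts `(score, code)` pairs.

```python
def top_k_subtypes(scores, k=1):
    """Return the k subtype codes with the highest confidence, best first."""
    # Sort the dictionary keys by their score, highest first.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]
```

With `{"LUAD": 0.7, "LUSC": 0.2, "BRCA": 0.1}`, the default `k=1` selects only `LUAD`, and `k=2` adds `LUSC`.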
 
90
 
91
 
92
  def select_models(cancer_subtypes: list[str], model_map: dict[Any, Any]) -> list[Any]:
93
+ """Select Paladin models for the given cancer subtypes.
94
+
95
+ Args:
96
+ cancer_subtypes: List of cancer subtype codes
97
+ model_map: Dictionary mapping cancer subtypes to their models
98
+
99
+ Returns:
100
+ List of tuples (cancer_subtype, target, model_path)
101
+ """
102
  models = []
103
  for cancer_subtype, target, model in model_map.items():
104
  if cancer_subtype in cancer_subtypes:
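Assuming the model map has the nested shape described in the `load_model_map` docstring (subtype to target to model path), producing the `(cancer_subtype, target, model_path)` tuples could look like the sketch below. Since `dict.items()` yields two-element `(key, value)` pairs, the sketch unpacks the outer and inner dictionaries in two steps. The name `select_models_sketch` and the example data are hypothetical.

```python
def select_models_sketch(
    cancer_subtypes: list[str], model_map: dict[str, dict[str, str]]
) -> list[tuple[str, str, str]]:
    """Collect (cancer_subtype, target, model_path) for the requested subtypes."""
    models = []
    for cancer_subtype, targets in model_map.items():
        if cancer_subtype in cancer_subtypes:
            for target, model_path in targets.items():
                models.append((cancer_subtype, target, model_path))
    return models


# Made-up model map for illustration.
example_map = {
    "LUAD": {"EGFR": "models/luad_egfr.pkl", "KRAS": "models/luad_kras.pkl"},
    "BRCA": {"HER2": "models/brca_her2.pkl"},
}
selected = select_models_sketch(["LUAD"], example_map)
```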
 
107
 
108
 
109
  def run_model(device, dataset, model_path: str, num_workers, batch_size) -> float:
110
+ """Run inference for the given dataset and Paladin model.
111
+
112
+ Args:
113
+ device: Torch device (CPU or CUDA)
114
+ dataset: TileFeatureTensorDataset containing the features
115
+ model_path: Path to the pickled Paladin model
116
+ num_workers: Number of workers for data loading
117
+ batch_size: Batch size for inference
118
+
119
+ Returns:
120
+ Point estimate (predicted value) from the model
121
  """
122
 
123
  logger.debug(f"[loading model {model_path}]")
 
148
 
149
 
150
  def logits_to_point_estimates(logits):
151
+ """Convert model logits to point estimates for beta-binomial distribution.
152
+
153
+ The logits tensor interleaves alpha (even columns) and beta (odd columns) parameters.
154
+ This function computes alpha/(alpha+beta), the mean of the underlying beta distribution.
155
+
156
+ Args:
157
+ logits: Tensor of shape (batch_size, 2 * n_tasks) with alpha/beta parameters
158
+
159
+ Returns:
160
+ Tensor of shape (batch_size, n_tasks) with point estimates
161
+ """
162
  # logits is a tensor of shape (batch_size, 2 * (n_clf_tasks + n_reg_tasks))
163
  # need to convert it to a tensor of shape (batch_size, n_clf_tasks + n_reg_tasks)
164
  return logits[:, ::2] / (logits[:, ::2] + logits[:, 1::2])
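To make the interleaved layout concrete, here is a minimal pure-Python sketch of the same arithmetic on nested lists instead of tensors. The function name `point_estimates` and the input values are assumptions for illustration; the source operates on a tensor using the slicing shown above.

```python
def point_estimates(rows: list[list[float]]) -> list[list[float]]:
    """Compute alpha / (alpha + beta) per task for each row.

    Each row is laid out as [alpha_0, beta_0, alpha_1, beta_1, ...].
    """
    estimates = []
    for row in rows:
        alphas = row[::2]   # even positions hold alpha
        betas = row[1::2]   # odd positions hold beta
        estimates.append([a / (a + b) for a, b in zip(alphas, betas)])
    return estimates


# Two rows, two tasks each (made-up parameter values).
batch = [[2.0, 2.0, 1.0, 3.0], [3.0, 1.0, 4.0, 4.0]]
result = point_estimates(batch)
```

The output has half as many columns as the input, matching the shapes stated in the docstring.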
 
175
  num_workers: int = NUM_WORKERS,
176
  use_cpu: bool = False,
177
  ):
178
+ """Run Paladin inference for biomarker prediction on a single slide.
179
+
180
+ Uses either Aeon predictions or user-provided cancer subtype codes to select
181
+ the appropriate Paladin models for biomarker prediction.
182
+
183
+ Args:
184
+ features: NumPy array of tile features extracted from the WSI
185
+ aeon_results: DataFrame with Aeon predictions (Cancer Subtype, Confidence)
186
+ cancer_subtype_codes: List of OncoTree codes if cancer subtype is known
187
+ model_map_path: Path to CSV file mapping subtypes/targets to model paths
188
+ model_path: Path to a single Paladin model (alternative to model_map_path)
189
+ metastatic: Whether the slide is from a metastatic site
190
+ batch_size: Batch size for inference
191
+ num_workers: Number of workers for data loading
192
+ use_cpu: Force CPU usage instead of GPU
193
+
194
+ Returns:
195
+ DataFrame with columns: Cancer Subtype, Target, Score
196
+
197
+ Note:
198
+ Either aeon_results or cancer_subtype_codes must be provided, but not both.
199
+ Either model_map_path or model_path must be provided, but not both.
200
  """
201
 
202
  if aeon_results is not None:
src/mosaic/ui/app.py CHANGED
@@ -1,3 +1,13 @@
1
  import gradio as gr
2
  import pandas as pd
3
  from pathlib import Path
 
1
+ """Gradio web interface for Mosaic.
2
+
3
+ This module provides the web-based user interface for analyzing whole slide images.
4
+ It includes functionality for:
5
+ - Multi-slide upload and analysis
6
+ - Settings configuration (site type, cancer subtype, IHC subtype, segmentation)
7
+ - Results visualization and export
8
+ - CSV-based batch processing
9
+ """
10
+
11
  import gradio as gr
12
  import pandas as pd
13
  from pathlib import Path
src/mosaic/ui/utils.py CHANGED
@@ -1,3 +1,12 @@
1
  import tempfile
2
  from pathlib import Path
3
  import pandas as pd
@@ -21,6 +30,17 @@ oncotree_code_map = {}
21
 
22
 
23
  def get_oncotree_code_name(code):
24
  global oncotree_code_map
25
  if code in oncotree_code_map.keys():
26
  return oncotree_code_map[code]
@@ -38,7 +58,15 @@ def get_oncotree_code_name(code):
38
 
39
 
40
  def create_user_directory(state, request: gr.Request):
41
- """Create a unique directory for each user session."""
42
  session_hash = request.session_hash
43
  if session_hash is None:
44
  return None, None
@@ -49,7 +77,20 @@ def create_user_directory(state, request: gr.Request):
49
 
50
 
51
  def load_settings(slide_csv_path):
52
- """Load settings from CSV file and validate columns."""
53
  settings_df = pd.read_csv(slide_csv_path, na_filter=False)
54
  if "Segmentation Config" not in settings_df.columns:
55
  settings_df["Segmentation Config"] = "Biopsy"
@@ -64,7 +105,24 @@ def load_settings(slide_csv_path):
64
 
65
 
66
  def validate_settings(settings_df, cancer_subtype_name_map, cancer_subtypes, reversed_cancer_subtype_name_map):
67
- """Validate settings DataFrame and provide warnings for invalid entries."""
68
  settings_df.columns = SETTINGS_COLUMNS
69
  warnings = []
70
  for idx, row in settings_df.iterrows():
@@ -110,6 +168,17 @@ def validate_settings(settings_df, cancer_subtype_name_map, cancer_subtypes, rev
110
 
111
 
112
  def export_to_csv(df):
113
  if df is None or df.empty:
114
  raise gr.Error("No data to export.")
115
  csv_path = "paladin_results.csv"
 
1
+ """UI utility functions for the Mosaic Gradio interface.
2
+
3
+ This module provides helper functions for:
4
+ - OncoTree code lookup and caching
5
+ - User session directory management
6
+ - Settings CSV loading and validation
7
+ - Data export functionality
8
+ """
9
+
10
  import tempfile
11
  from pathlib import Path
12
  import pandas as pd
 
30
 
31
 
32
  def get_oncotree_code_name(code):
33
+ """Retrieve the human-readable name for an OncoTree code.
34
+
35
+ Queries the OncoTree API to get the cancer subtype name corresponding
36
+ to the given code. Results are cached to avoid repeated API calls.
37
+
38
+ Args:
39
+ code: OncoTree code (e.g., "LUAD", "BRCA")
40
+
41
+ Returns:
42
+ Human-readable cancer subtype name, or "Unknown" if not found
43
+ """
44
  global oncotree_code_map
45
  if code in oncotree_code_map.keys():
46
  return oncotree_code_map[code]
 
58
 
59
 
60
  def create_user_directory(state, request: gr.Request):
61
+ """Create a unique directory for each user session.
62
+
63
+ Args:
64
+ state: Gradio state object (unused)
65
+ request: Gradio request object containing session hash
66
+
67
+ Returns:
68
+ Tuple that includes the path to the user's session directory; (None, None) if no session hash is available
69
+ """
70
  session_hash = request.session_hash
71
  if session_hash is None:
72
  return None, None
 
77
 
78
 
79
  def load_settings(slide_csv_path):
80
+ """Load slide analysis settings from CSV file.
81
+
82
+ Loads the CSV and ensures all required columns are present, adding defaults
83
+ for optional columns if they are missing.
84
+
85
+ Args:
86
+ slide_csv_path: Path to the CSV file containing slide settings
87
+
88
+ Returns:
89
+ DataFrame with columns: Slide, Site Type, Cancer Subtype, IHC Subtype, Segmentation Config
90
+
91
+ Raises:
92
+ ValueError: If required columns are missing from the CSV
93
+ """
94
  settings_df = pd.read_csv(slide_csv_path, na_filter=False)
95
  if "Segmentation Config" not in settings_df.columns:
96
  settings_df["Segmentation Config"] = "Biopsy"
 
105
 
106
 
107
  def validate_settings(settings_df, cancer_subtype_name_map, cancer_subtypes, reversed_cancer_subtype_name_map):
108
+ """Validate and normalize slide analysis settings.
109
+
110
+ Checks each row for valid values and normalizes cancer subtype names.
111
+ Generates warnings for invalid entries and replaces them with defaults.
112
+
113
+ Args:
114
+ settings_df: DataFrame with slide settings to validate
115
+ cancer_subtype_name_map: Dict mapping subtype display names to codes
116
+ cancer_subtypes: List of valid cancer subtype codes
117
+ reversed_cancer_subtype_name_map: Dict mapping codes to display names
118
+
119
+ Returns:
120
+ Validated DataFrame with normalized values
121
+
122
+ Note:
123
+ Invalid entries are replaced with defaults and warnings are displayed
124
+ to the user via Gradio warnings.
125
+ """
126
  settings_df.columns = SETTINGS_COLUMNS
127
  warnings = []
128
  for idx, row in settings_df.iterrows():
 
168
 
169
 
170
  def export_to_csv(df):
171
+ """Export a DataFrame to CSV file for download.
172
+
173
+ Args:
174
+ df: DataFrame to export
175
+
176
+ Returns:
177
+ Path to the exported CSV file
178
+
179
+ Raises:
180
+ gr.Error: If the DataFrame is None or empty
181
+ """
182
  if df is None or df.empty:
183
  raise gr.Error("No data to export.")
184
  csv_path = "paladin_results.csv"