---
title: Mosaic
emoji: 🧬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.0
python_version: 3.11
app_file: app.py
pinned: false
license: apache-2.0
---

# Mosaic: H&E Whole Slide Image Cancer Subtype and Biomarker Inference

Mosaic is a deep learning model designed for predicting cancer subtypes and biomarkers from Hematoxylin and Eosin (H&E) stained whole slide images (WSIs). This repository provides the code, pre-trained models, and instructions to use Mosaic for your own datasets.

## Table of Contents

- [System Requirements](#system-requirements)
- [Pre-requisites](#pre-requisites)
- [Installation](#installation)
- [Deploying to Hugging Face Spaces](#deploying-to-hugging-face-spaces)
- [Usage](#usage)
  - [Initial Setup](#initial-setup)
  - [Web Application](#web-application)
  - [Command Line Interface](#command-line-interface)
  - [Notes](#notes)
- [Output Files](#output-files)
- [Examples](#examples)
- [Advanced Usage](#advanced-usage)
- [CSV File Format](#csv-file-format)
- [Cancer Subtypes](#cancer-subtypes)
- [Troubleshooting](#troubleshooting)
- [Contributing](#contributing)
- [Architecture](#architecture)
- [License](#license)

## System Requirements

Supported systems:

- Linux (x86) with GPU (NVIDIA CUDA)

## Pre-requisites

- [Python 3.11](https://www.python.org/)
- [uv](https://docs.astral.sh/uv/)

    ```bash
    curl -LsSf https://astral.sh/uv/install.sh | sh
    ```

## Installation

Ensure that you have SSH credentials set up to access the private paladin repository. (Create a key with `ssh-keygen` and add it to your GitHub profile under Settings -> SSH and GPG keys.)

```bash
git clone https://github.com/pathology-data-mining/mosaic.git
cd mosaic
uv sync
```

Note that when installing via `uv sync`, the virtual environment will be created in the `./.venv` directory. To activate it, run:

```bash
source .venv/bin/activate
```

Alternatively, create a virtual environment named `mosaic-venv` (in a subdirectory), activate it, and install the app directly from the repository:

```bash
uv venv mosaic-venv --python 3.11
source mosaic-venv/bin/activate
uv pip install git+ssh://git@github.com/pathology-data-mining/paladin_webapp.git@dev
```

## Deploying to Hugging Face Spaces

This repository is configured for deployment on Hugging Face Spaces with Zero GPU support.

### Prerequisites

1. You need to be added to the [PDM Group](https://huggingface.co/PDM-Group) on Hugging Face to access the models
2. Create a Hugging Face access token with read permissions for the PDM-Group space

### Deployment Steps

1. Create a new Space on Hugging Face
2. Select "Gradio" as the SDK
3. Choose "Zero GPU" as the hardware option (if available)
4. Clone this repository to your Space or push the code
5. In your Space settings, add a secret named `HF_TOKEN` with your Hugging Face access token
6. The app will automatically start and download the necessary models on first run

### Zero GPU Configuration

The app uses the `@spaces.GPU` decorator to allocate GPU resources only when needed for inference. This allows efficient use of Zero GPU resources on Hugging Face Spaces. The GPU is automatically allocated when:

- Processing tissue segmentation
- Extracting features with CTransPath and Optimus models
- Running Aeon and Paladin model inference
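
As a rough illustration of the decorator pattern described above, the sketch below wraps a GPU-bound function with `spaces.GPU`. The function `run_inference` is a hypothetical placeholder, not a function from this repository, and the no-op fallback for machines without the `spaces` package is an assumption added so the snippet also runs locally.

```python
# Fall back to a no-op decorator when the `spaces` package is not installed,
# so the same module can also be imported on an ordinary workstation.
try:
    import spaces

    gpu = spaces.GPU
except ImportError:
    def gpu(func):
        return func


@gpu
def run_inference(tile_count: int) -> str:
    # Hypothetical stand-in for a GPU-bound step such as feature extraction.
    return f"processed {tile_count} tiles"


print(run_inference(4))
```

Only the decorated function requests a GPU; the rest of the app can keep running on CPU.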

## Usage

### Initial Setup

<b>NOTE</b>: To run this app, you must be a member of the [PDM Group](https://huggingface.co/PDM-Group) and set the following environment variable. To obtain a token, click the user icon at the top right of the Hugging Face website and select "Access Tokens". When creating the token, select all read options for your private space and the PDM-Group space.

```bash
export HF_TOKEN="TOKEN-FROM-HUGGINGFACE"
```

Additionally, set `HF_HOME` to the directory where models and other data from Hugging Face may be downloaded.

```bash
export HF_HOME="PATH-TO-HUGGINGFACE-HOME"
```
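
Before launching, it can help to confirm that both variables are visible to the process. The following is a minimal sketch, not part of Mosaic itself; the helper name `missing_hf_settings` is made up for this example.

```python
import os


def missing_hf_settings(env=None):
    """Return the names of the expected variables that are unset or empty."""
    env = os.environ if env is None else env
    expected = ("HF_TOKEN", "HF_HOME")
    return [name for name in expected if not env.get(name)]


if __name__ == "__main__":
    missing = missing_hf_settings()
    if missing:
        print("Set these variables before starting mosaic:", ", ".join(missing))
    else:
        print("HF_TOKEN and HF_HOME are set.")
```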

### Web Application

Run the web application with:

```bash
mosaic
```

It will start a web server on port 7860 by default. You can access the web interface by navigating to `http://localhost:7860` in your web browser.

### Command Line Interface

To process a single WSI, use the following command:

```bash
mosaic --slide-path /path/to/your/wsi.svs --output-dir /path/to/output/directory
```

To process a batch of WSIs, use:

```bash
mosaic --slide-csv /path/to/your/wsi_list.csv --output-dir /path/to/output/directory
```

#### Complete CLI Options Reference

##### Processing Options

- `--slide-path PATH`: Path to a single slide for processing (mutually exclusive with `--slide-csv`)
- `--slide-csv PATH`: CSV file with slide settings for batch processing (see [CSV File Format](#csv-file-format))
- `--output-dir PATH`: Directory to save output results (required for CLI processing)

##### Single Slide Parameters

These options apply when using `--slide-path` for single slide processing:

- `--site-type {Primary,Metastatic}`: Site type of the slide (default: `Primary`)
- `--cancer-subtype CODE`: Cancer subtype OncoTree code (default: `Unknown` to infer with Aeon model)
- `--segmentation-config {Biopsy,Resection,TCGA}`: Tissue segmentation configuration (default: `Biopsy`)
- `--ihc-subtype SUBTYPE`: IHC subtype for breast cancer (BRCA) only. Options:
  - `HR+/HER2+`
  - `HR+/HER2-`
  - `HR-/HER2+`
  - `HR-/HER2-`
- `--sex {Male,Female,Unknown}`: Patient sex for improved Aeon inference (default: `Unknown`)
- `--tissue-site SITE`: Primary tissue site for improved Aeon inference (default: `Unknown`)
  - Examples: `Lung`, `Breast`, `Colon`, `Liver`, `Brain`, `Lymph Node`, `Bone`
  - See `data/tissue_site_original_to_idx.csv` for complete list

##### Performance & Processing

- `--num-workers N`: Number of workers for feature extraction (default: 4)
  - Increase for faster processing (e.g., 8-16) if you have sufficient CPU/memory
  - Decrease (e.g., to 2) if you run into memory issues

##### Model Management

- `--skip-model-download`: Skip downloading models from HuggingFace (assumes models are already cached)
- `--download-models-only`: Download models from HuggingFace and exit without running analysis

##### Web Server Options

- `--server-name ADDRESS`: Server address for Gradio web interface (default: `0.0.0.0`)
- `--server-port PORT`: Server port for Gradio web interface (default: uses `GRADIO_SERVER_PORT` env var or 7860)
- `--share`: Create a public shareable link for the Gradio interface (use with caution)

##### Debugging

- `--debug`: Enable debug logging (creates `debug.log` file with detailed information)

##### Getting Help

See all available options with:

```bash
mosaic --help
```

When choosing a port for server mode, you can check whether it is already in use with `ss -tuln | grep :PORT`, where PORT is the port number you want to check. No output indicates that the port is probably available. If it is, set the environment variable with `export GRADIO_SERVER_PORT="PORT"`.
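
The same port check can be done from Python by trying to bind the port. This is a small sketch using only the standard library; the helper names `port_is_free` and `first_free_port` are illustrative and not part of Mosaic.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def first_free_port(start: int = 7860, attempts: int = 20) -> int:
    """Scan upward from `start` and return the first port that can be bound."""
    for port in range(start, start + attempts):
        if port_is_free(port):
            return port
    raise RuntimeError(f"no free port in {start}-{start + attempts - 1}")
```

The result is only a snapshot: another process can still claim the port between the check and the moment the server starts.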

### Notes

- The first time you run the application, it will download the necessary models from HuggingFace. This may take some time depending on your internet connection.
- The models are downloaded to a directory named `data` relative to where you run the application.

## Output Files

### Single Slide Processing

When processing a single slide, the following files are generated in the output directory:

- `{slide_name}_mask.png`: Visualization of the tissue segmentation
- `{slide_name}_aeon_results.csv`: Cancer subtype predictions with confidence scores (if cancer subtype was set to "Unknown")
- `{slide_name}_paladin_results.csv`: Biomarker predictions for the slide

### Batch Processing

When processing multiple slides, in addition to individual slide outputs, combined results are generated:

- `combined_aeon_results.csv`: Cancer subtype predictions for all slides in a single file
- `combined_paladin_results.csv`: Biomarker predictions for all slides in a single file
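
To check a finished run, the file names above can be reconstructed from the slide path. The sketch below assumes that `{slide_name}` is the slide file name without its extension; that mapping is an assumption, and both helper names are hypothetical.

```python
from pathlib import Path


def expected_outputs(slide_path: str, output_dir: str) -> dict:
    """Map each per-slide output to the path where it is expected."""
    # Assumption: {slide_name} is the file name without its extension.
    slide_name = Path(slide_path).stem
    out = Path(output_dir)
    return {
        "mask": out / f"{slide_name}_mask.png",
        "aeon": out / f"{slide_name}_aeon_results.csv",
        "paladin": out / f"{slide_name}_paladin_results.csv",
    }


def missing_outputs(slide_path: str, output_dir: str) -> list:
    """Return the keys of expected outputs that do not exist on disk."""
    paths = expected_outputs(slide_path, output_dir)
    return [key for key, path in paths.items() if not path.exists()]
```

A missing `aeon` entry is not necessarily an error, since that file is only written when the cancer subtype was set to `Unknown`.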

## Examples

### Example 1: Process a single slide with unknown cancer type

```bash
mosaic --slide-path /data/slides/sample.svs \
       --output-dir /data/results \
       --site-type Primary \
       --cancer-subtype Unknown \
       --segmentation-config Resection \
       --sex Female \
       --tissue-site Lung
```

### Example 2: Process a single breast cancer slide with known IHC subtype

```bash
mosaic --slide-path /data/slides/breast_sample.svs \
       --output-dir /data/results \
       --site-type Primary \
       --cancer-subtype BRCA \
       --ihc-subtype "HR+/HER2-" \
       --segmentation-config Biopsy \
       --sex Female \
       --tissue-site Breast
```
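
When slides are processed from a script, the same command line can be assembled programmatically. This sketch only builds the argument list from the flags documented above; `build_mosaic_command` is a hypothetical helper, not an API provided by Mosaic.

```python
import shlex


def build_mosaic_command(slide_path, output_dir, **options):
    """Build an argument list for the `mosaic` CLI from keyword options.

    Keyword names map to flags by replacing underscores with hyphens,
    e.g. cancer_subtype="BRCA" becomes --cancer-subtype BRCA.
    """
    command = ["mosaic", "--slide-path", str(slide_path), "--output-dir", str(output_dir)]
    for name, value in options.items():
        if value is None:
            continue  # leave the flag out so the CLI default applies
        command += [f"--{name.replace('_', '-')}", str(value)]
    return command


command = build_mosaic_command(
    "/data/slides/breast_sample.svs",
    "/data/results",
    site_type="Primary",
    cancer_subtype="BRCA",
    ihc_subtype="HR+/HER2-",
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)`, which avoids shell quoting problems with values such as `HR+/HER2-`.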

### Example 3: Process multiple slides from CSV

Create a CSV file `slides.csv` with the following format:

```csv
Slide,Site Type,Cancer Subtype,Segmentation Config,IHC Subtype,Sex,Tissue Site
/data/slides/sample1.svs,Primary,Unknown,Resection,,Female,Lung
/data/slides/sample2.svs,Metastatic,LUAD,Biopsy,,,Liver
/data/slides/sample3.svs,Primary,BRCA,TCGA,HR+/HER2-,Female,Breast
```

Then run:

```bash
mosaic --slide-csv slides.csv --output-dir /data/results
```

## Advanced Usage

### Model Management

#### Download Models Before Processing

To download models from HuggingFace without running any analysis:

```bash
mosaic --download-models-only
```

Or using the Makefile:

```bash
make download-models
```

#### Skip Model Download

If models are already cached and you want to skip the download check:

```bash
mosaic --skip-model-download --slide-path /path/to/slide.svs --output-dir /path/to/output
```

This is useful for offline processing or when you know models are already cached.

### Adjusting Performance

You can control the number of workers for feature extraction to balance between speed and memory usage:

```bash
mosaic --slide-path /path/to/slide.svs \
       --output-dir /path/to/output \
       --num-workers 8
```

### Running in Server Mode

To run Mosaic as a web server accessible from other machines:

```bash
export GRADIO_SERVER_PORT=7860
mosaic --server-name 0.0.0.0 --server-port 7860
```

Check whether a port is already in use with:

```bash
ss -tuln | grep :7860
```

To share the application publicly (use with caution):

```bash
mosaic --share
```

### Debug Mode

Enable debug logging for troubleshooting:

```bash
mosaic --debug
```

This will create a `debug.log` file with detailed information about the processing steps.

## CSV File Format

When processing multiple slides using the `--slide-csv` option, the CSV file must contain the following columns:

### Required Columns

- **Slide**: Full path to the WSI file (e.g., `/path/to/slide.svs`)
- **Site Type**: Either `Primary` or `Metastatic`

### Optional Columns

- **Cancer Subtype**: OncoTree code for the cancer subtype (e.g., `LUAD`, `BRCA`, `COAD`). Use `Unknown` to let Aeon infer the cancer type.
- **Segmentation Config**: One of `Biopsy`, `Resection`, or `TCGA`. Defaults to `Biopsy` if not specified.
- **IHC Subtype**: For breast cancer (BRCA) only. One of:
  - `HR+/HER2+`
  - `HR+/HER2-`
  - `HR-/HER2+`
  - `HR-/HER2-`
- **Sex**: Patient sex for improved Aeon cancer subtype inference. One of `Male`, `Female`, or `Unknown`.
- **Tissue Site**: Primary tissue site for improved Aeon cancer subtype inference. Examples include:
  - `Lung`
  - `Breast`
  - `Colon`
  - `Liver`
  - `Brain`
  - `Lymph Node`
  - `Bone`
  - See `data/tissue_site_original_to_idx.csv` for complete list of supported tissue sites.

### CSV Example

```csv
Slide,Site Type,Cancer Subtype,Segmentation Config,IHC Subtype,Sex,Tissue Site
/data/slides/lung1.svs,Primary,LUAD,Resection,,Male,Lung
/data/slides/breast1.svs,Primary,BRCA,Biopsy,HR+/HER2-,Female,Breast
/data/slides/unknown1.svs,Metastatic,Unknown,TCGA,,,Liver
```
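
A batch file can be sanity-checked against the column rules above before starting a long run. The following is a minimal sketch using the standard `csv` module; the function name and the choice to accept empty optional cells (as in the example rows) are assumptions, and Mosaic's own parsing may differ.

```python
import csv
import io

REQUIRED_COLUMNS = ("Slide", "Site Type")
ALLOWED_VALUES = {
    "Site Type": {"Primary", "Metastatic"},
    "Segmentation Config": {"Biopsy", "Resection", "TCGA"},
    "IHC Subtype": {"HR+/HER2+", "HR+/HER2-", "HR-/HER2+", "HR-/HER2-"},
    "Sex": {"Male", "Female", "Unknown"},
}


def validate_slide_csv(text):
    """Return a list of human-readable problems found in the CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    problems = [f"missing required column: {c}" for c in REQUIRED_COLUMNS if c not in header]
    if problems:
        return problems
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if not (row["Slide"] or "").strip():
            problems.append(f"line {line_no}: Slide is empty")
        for column, allowed in ALLOWED_VALUES.items():
            value = (row.get(column) or "").strip()
            # Empty optional cells are accepted; Site Type must always be set.
            if not value and column != "Site Type":
                continue
            if value not in allowed:
                problems.append(f"line {line_no}: invalid {column}: {value!r}")
    return problems
```

An empty list means the file passed these checks; it does not verify that the slide paths exist or that the OncoTree codes are supported.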

## Cancer Subtypes

Mosaic uses OncoTree codes to identify cancer subtypes. Common examples include:

- **LUAD**: Lung Adenocarcinoma
- **LUSC**: Lung Squamous Cell Carcinoma
- **BRCA**: Breast Invasive Carcinoma
- **COAD**: Colon Adenocarcinoma
- **READ**: Rectal Adenocarcinoma
- **PRAD**: Prostate Adenocarcinoma
- **SKCM**: Skin Cutaneous Melanoma

For a complete list of supported cancer subtypes, see the [OncoTree website](http://oncotree.mskcc.org/).

When the cancer subtype is set to `Unknown`, Mosaic will use the Aeon model to predict the most likely cancer subtype based on the H&E image features.

## Troubleshooting

### HuggingFace Authentication Errors

If you encounter authentication errors when downloading models:

1. Ensure you have access to the PDM-Group on HuggingFace
2. Create a HuggingFace access token with appropriate permissions
3. Set the `HF_TOKEN` environment variable correctly

### Out of Memory Errors

If you encounter GPU out-of-memory errors:

1. Reduce the number of workers: `--num-workers 2`
2. Process slides sequentially instead of in batch
3. Consider using a GPU with more memory

### Tissue Segmentation Issues

If tissue is not being detected correctly:

1. Try a different segmentation configuration (`Biopsy`, `Resection`, or `TCGA`)
2. Check that the slide file is not corrupted
3. Verify the slide format is supported (e.g., `.svs`, `.tif`)

### Port Already in Use

If the default port 7860 is already in use:

1. Check whether something is already listening on the port: `ss -tuln | grep :7860`
2. Use a different port: `export GRADIO_SERVER_PORT=7861`
3. Or specify the port directly: `mosaic --server-port 7861`

## Contributing

We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on how to contribute to this project.

## Architecture

For detailed information about the code structure and module organization, see [ARCHITECTURE.md](ARCHITECTURE.md).

## License

This project is licensed under the terms specified in the LICENSE file.