title: Mosaic
emoji: 🧬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.0
python_version: 3.11
app_file: app.py
pinned: false
license: apache-2.0

Mosaic: H&E Whole Slide Image Cancer Subtype and Biomarker Inference

Mosaic is a deep learning model designed for predicting cancer subtypes and biomarkers from Hematoxylin and Eosin (H&E) stained whole slide images (WSIs). This repository provides the code, pre-trained models, and instructions to use Mosaic for your own datasets.

System requirements

Supported systems:

  • Linux (x86) with GPU (NVIDIA CUDA)

Pre-requisites

  • python3.11

  • uv

    curl -LsSf https://astral.sh/uv/install.sh | sh
    

Installation

Ensure that you have SSH credentials set up to access the private paladin repository. (Create a key with ssh-keygen and add it to your GitHub profile under Settings -> SSH and GPG keys.)

git clone https://github.com/pathology-data-mining/mosaic.git
cd mosaic
uv sync

Note that when installing via uv sync, the virtual environment will be created in the ./.venv directory. To activate it, run:

source .venv/bin/activate

Alternatively, create a virtual environment mosaic-venv (in a subdirectory), activate it, and install the app directly from the repository:

uv venv mosaic-venv --python 3.11
source mosaic-venv/bin/activate
uv pip install git+ssh://git@github.com/pathology-data-mining/paladin_webapp.git@dev

Deploying to Hugging Face Spaces

This repository is configured for deployment on Hugging Face Spaces with Zero GPU support.

Prerequisites

  1. You need to be added to the PDM Group on Hugging Face to access the models
  2. Create a Hugging Face access token with read permissions for the PDM-Group space

Deployment Steps

  1. Create a new Space on Hugging Face
  2. Select "Gradio" as the SDK
  3. Choose "Zero GPU" as the hardware option (if available)
  4. Clone this repository to your Space or push the code
  5. In your Space settings, add a secret named HF_TOKEN with your Hugging Face access token
  6. The app will automatically start and download the necessary models on first run

Zero GPU Configuration

The app uses the @spaces.GPU decorator to allocate GPU resources only when needed for inference. This allows efficient use of Zero GPU resources on Hugging Face Spaces. The GPU is automatically allocated when:

  • Processing tissue segmentation
  • Extracting features with CTransPath and Optimus models
  • Running Aeon and Paladin model inference
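
The on-demand allocation described above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: run_inference and its body are hypothetical, and the fallback decorator is an assumption added only so that the snippet also runs where the spaces package is not installed.

```python
# Sketch of on-demand GPU allocation with the `spaces` package.
try:
    import spaces  # available on Hugging Face Spaces
    gpu = spaces.GPU
except ImportError:
    # Fallback for local runs without `spaces`: a no-op decorator.
    def gpu(func):
        return func


@gpu
def run_inference(features):
    """Hypothetical GPU-bound step; a GPU is requested only while this runs."""
    # Placeholder computation standing in for real model inference.
    return [2 * x for x in features]


if __name__ == "__main__":
    print(run_inference([1, 2, 3]))
```

Keeping the decorated function small means the GPU is held only for the steps that need it.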

Usage

Initial Setup

NOTE: To run this app, you must be a member of the PDM Group on Hugging Face and set the HF_TOKEN environment variable. To obtain a token, click the user icon at the top right of the Hugging Face website and select "Access Tokens". When creating the token, select all read options for your private space and the PDM-Group space.

export HF_TOKEN="TOKEN-FROM-HUGGINGFACE"

Additionally, set HF_HOME, the Hugging Face home directory where models and other data from Hugging Face may be downloaded.

export HF_HOME="PATH-TO-HUGGINGFACE-HOME"
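
Before launching the app, it can help to confirm that both variables are visible to the process. The helper below is a small sketch of such a check; the function name missing_env_vars is hypothetical and not part of Mosaic.

```python
import os

# Variables named in the setup instructions above.
CHECKED_VARS = ("HF_TOKEN", "HF_HOME")


def missing_env_vars(env=None):
    """Return the names of checked variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in CHECKED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Please set: " + ", ".join(missing))
    else:
        print("HF_TOKEN and HF_HOME are set.")
```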

Web Application

Run the web application with:

mosaic

It will start a web server on port 7860 by default. You can access the web interface by navigating to http://localhost:7860 in your web browser.

Command Line Interface

To process a single WSI, use the following command:

mosaic --slide-path /path/to/your/wsi.svs --output-dir /path/to/output/directory

To process a batch of WSIs, use:

mosaic --slide-csv /path/to/your/wsi_list.csv --output-dir /path/to/output/directory

Complete CLI Options Reference

Processing Options

  • --slide-path PATH: Path to a single slide for processing (mutually exclusive with --slide-csv)
  • --slide-csv PATH: CSV file with slide settings for batch processing (see CSV File Format)
  • --output-dir PATH: Directory to save output results (required for CLI processing)

Single Slide Parameters

These options apply when using --slide-path for single slide processing:

  • --site-type {Primary,Metastatic}: Site type of the slide (default: Primary)
  • --cancer-subtype CODE: Cancer subtype OncoTree code (default: Unknown to infer with Aeon model)
  • --segmentation-config {Biopsy,Resection,TCGA}: Tissue segmentation configuration (default: Biopsy)
  • --ihc-subtype SUBTYPE: IHC subtype for breast cancer (BRCA) only. Options:
    • HR+/HER2+
    • HR+/HER2-
    • HR-/HER2+
    • HR-/HER2-
  • --sex {Male,Female,Unknown}: Patient sex for improved Aeon inference (default: Unknown)
  • --tissue-site SITE: Primary tissue site for improved Aeon inference (default: Unknown)
    • Examples: Lung, Breast, Colon, Liver, Brain, Lymph Node, Bone
    • See data/tissue_site_original_to_idx.csv for complete list

Performance & Processing

  • --num-workers N: Number of workers for feature extraction (default: 4)
    • Increase for faster processing (e.g., 8-16) if you have sufficient CPU/memory
    • Decrease (e.g., 2-4) if encountering memory issues

Model Management

  • --skip-model-download: Skip downloading models from HuggingFace (assumes models are already cached)
  • --download-models-only: Download models from HuggingFace and exit without running analysis

Web Server Options

  • --server-name ADDRESS: Server address for Gradio web interface (default: 0.0.0.0)
  • --server-port PORT: Server port for Gradio web interface (default: uses GRADIO_SERVER_PORT env var or 7860)
  • --share: Create a public shareable link for the Gradio interface (use with caution)

Debugging

  • --debug: Enable debug logging (creates debug.log file with detailed information)

Getting Help

See all available options with:

mosaic --help

When choosing a port for server mode, you can check whether it is already in use with ss -tuln | grep :PORT, where PORT is the port number you want to check. No output indicates that the port is probably available. If it is, set the environment variable: export GRADIO_SERVER_PORT="PORT"

Notes

  • The first time you run the application, it will download the necessary models from HuggingFace. This may take some time depending on your internet connection.
  • The models are downloaded to a directory named data relative to where you run the application.

Output Files

Single Slide Processing

When processing a single slide, the following files are generated in the output directory:

  • {slide_name}_mask.png: Visualization of the tissue segmentation
  • {slide_name}_aeon_results.csv: Cancer subtype predictions with confidence scores (if cancer subtype was set to "Unknown")
  • {slide_name}_paladin_results.csv: Biomarker predictions for the slide
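
As a rough sketch, the per-slide file names above can be derived and checked from Python. The helper names are hypothetical, and the code assumes that {slide_name} is the slide file name without its extension, which this README does not state explicitly.

```python
from pathlib import Path


def expected_outputs(slide_path, output_dir, subtype_unknown=True):
    """List the per-slide output paths described above.

    Assumption: {slide_name} is the slide file name without its extension.
    """
    name = Path(slide_path).stem
    out = Path(output_dir)
    files = [out / f"{name}_mask.png"]
    if subtype_unknown:
        # Aeon results are only written when the subtype was "Unknown".
        files.append(out / f"{name}_aeon_results.csv")
    files.append(out / f"{name}_paladin_results.csv")
    return files


def missing_outputs(slide_path, output_dir, subtype_unknown=True):
    """Return the expected output files that do not exist yet."""
    candidates = expected_outputs(slide_path, output_dir, subtype_unknown)
    return [path for path in candidates if not path.exists()]
```

Calling missing_outputs after a run gives a quick way to spot slides whose processing did not finish.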

Batch Processing

When processing multiple slides, in addition to individual slide outputs, combined results are generated:

  • combined_aeon_results.csv: Cancer subtype predictions for all slides in a single file
  • combined_paladin_results.csv: Biomarker predictions for all slides in a single file

Examples

Example 1: Process a single slide with unknown cancer type

mosaic --slide-path /data/slides/sample.svs \
       --output-dir /data/results \
       --site-type Primary \
       --cancer-subtype Unknown \
       --segmentation-config Resection \
       --sex Female \
       --tissue-site Lung

Example 2: Process a single breast cancer slide with known IHC subtype

mosaic --slide-path /data/slides/breast_sample.svs \
       --output-dir /data/results \
       --site-type Primary \
       --cancer-subtype BRCA \
       --ihc-subtype "HR+/HER2-" \
       --segmentation-config Biopsy \
       --sex Female \
       --tissue-site Breast
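
When Mosaic is driven from a script, the same command line can be assembled programmatically. The sketch below uses only the flags documented in this README; build_command is a hypothetical helper, not part of the Mosaic package.

```python
import shlex


def build_command(slide_path, output_dir, **options):
    """Build an argument list for the mosaic CLI.

    Keyword names map to flags by replacing "_" with "-",
    e.g. site_type="Primary" becomes --site-type Primary.
    """
    command = ["mosaic", "--slide-path", str(slide_path),
               "--output-dir", str(output_dir)]
    for name, value in options.items():
        if value is None:
            continue  # leave the CLI default in place
        command += ["--" + name.replace("_", "-"), str(value)]
    return command


if __name__ == "__main__":
    argv = build_command(
        "/data/slides/breast_sample.svs", "/data/results",
        site_type="Primary", cancer_subtype="BRCA", ihc_subtype="HR+/HER2-",
    )
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(argv))
    # To actually run it: subprocess.run(argv, check=True)
```

Passing the arguments as a list avoids shell quoting problems with values such as HR+/HER2-.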

Example 3: Process multiple slides from CSV

Create a CSV file slides.csv with the following format:

Slide,Site Type,Cancer Subtype,Segmentation Config,IHC Subtype,Sex,Tissue Site
/data/slides/sample1.svs,Primary,Unknown,Resection,,Female,Lung
/data/slides/sample2.svs,Metastatic,LUAD,Biopsy,,,Liver
/data/slides/sample3.svs,Primary,BRCA,TCGA,HR+/HER2-,Female,Breast

Then run:

mosaic --slide-csv slides.csv --output-dir /data/results
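
For larger batches, the CSV can be generated rather than written by hand. The following is a minimal sketch using Python's csv module with the column names shown above; write_slide_csv is a hypothetical helper name.

```python
import csv

# Column names as shown in the CSV format above.
COLUMNS = ["Slide", "Site Type", "Cancer Subtype", "Segmentation Config",
           "IHC Subtype", "Sex", "Tissue Site"]


def write_slide_csv(path, rows):
    """Write slide settings to a CSV; missing optional fields stay empty."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, restval="")
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    write_slide_csv("slides.csv", [
        {"Slide": "/data/slides/sample1.svs", "Site Type": "Primary",
         "Cancer Subtype": "Unknown", "Segmentation Config": "Resection",
         "Sex": "Female", "Tissue Site": "Lung"},
    ])
```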

Advanced Usage

Model Management

Download Models Before Processing

To download models from HuggingFace without running any analysis:

mosaic --download-models-only

Or using the Makefile:

make download-models

Skip Model Download

If models are already cached and you want to skip the download check:

mosaic --skip-model-download --slide-path /path/to/slide.svs --output-dir /path/to/output

This is useful for offline processing or when you know models are already cached.

Adjusting Performance

You can control the number of workers for feature extraction to balance between speed and memory usage:

mosaic --slide-path /path/to/slide.svs \
       --output-dir /path/to/output \
       --num-workers 8

Running in Server Mode

To run Mosaic as a web server accessible from other machines:

export GRADIO_SERVER_PORT=7860
mosaic --server-name 0.0.0.0 --server-port 7860

Check whether the port is already in use (no output means it is probably free):

ss -tuln | grep :7860

To share the application publicly (use with caution):

mosaic --share

Debug Mode

Enable debug logging for troubleshooting:

mosaic --debug

This will create a debug.log file with detailed information about the processing steps.

CSV File Format

When processing multiple slides with the --slide-csv option, the CSV file may contain the columns below; only those listed under Required Columns are mandatory.

Required Columns

  • Slide: Full path to the WSI file (e.g., /path/to/slide.svs)
  • Site Type: Either Primary or Metastatic

Optional Columns

  • Cancer Subtype: OncoTree code for the cancer subtype (e.g., LUAD, BRCA, COAD). Use Unknown to let Aeon infer the cancer type.
  • Segmentation Config: One of Biopsy, Resection, or TCGA. Defaults to Biopsy if not specified.
  • IHC Subtype: For breast cancer (BRCA) only. One of:
    • HR+/HER2+
    • HR+/HER2-
    • HR-/HER2+
    • HR-/HER2-
  • Sex: Patient sex for improved Aeon cancer subtype inference. One of Male, Female, or Unknown.
  • Tissue Site: Primary tissue site for improved Aeon cancer subtype inference. Examples include:
    • Lung
    • Breast
    • Colon
    • Liver
    • Brain
    • Lymph Node
    • Bone
    • See data/tissue_site_original_to_idx.csv for complete list of supported tissue sites.

CSV Example

Slide,Site Type,Cancer Subtype,Segmentation Config,IHC Subtype,Sex,Tissue Site
/data/slides/lung1.svs,Primary,LUAD,Resection,,Male,Lung
/data/slides/breast1.svs,Primary,BRCA,Biopsy,HR+/HER2-,Female,Breast
/data/slides/unknown1.svs,Metastatic,Unknown,TCGA,,,Liver
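
A CSV can be checked against the rules above before a long batch run. The validator below is a sketch under stated assumptions: the function names are hypothetical, Tissue Site and Cancer Subtype values are not checked because their full lists live outside this README, and rejecting an IHC Subtype on any non-BRCA row is a stricter reading of "BRCA only" than Mosaic itself may apply.

```python
import csv

SITE_TYPES = {"Primary", "Metastatic"}
SEGMENTATION_CONFIGS = {"Biopsy", "Resection", "TCGA"}
IHC_SUBTYPES = {"HR+/HER2+", "HR+/HER2-", "HR-/HER2+", "HR-/HER2-"}
SEXES = {"Male", "Female", "Unknown"}


def validate_row(row):
    """Return a list of problems found in one CSV row (a dict)."""
    problems = []
    if not row.get("Slide"):
        problems.append("Slide is required")
    if row.get("Site Type") not in SITE_TYPES:
        problems.append("Site Type must be Primary or Metastatic")
    # Optional columns are only checked when a value is present.
    optional = {
        "Segmentation Config": SEGMENTATION_CONFIGS,
        "IHC Subtype": IHC_SUBTYPES,
        "Sex": SEXES,
    }
    for column, allowed in optional.items():
        value = row.get(column)
        if value and value not in allowed:
            problems.append(f"{column} has unsupported value {value!r}")
    # Assumption: treat an IHC subtype on a non-BRCA row as an error.
    if row.get("IHC Subtype") and row.get("Cancer Subtype") != "BRCA":
        problems.append("IHC Subtype applies to BRCA only")
    return problems


def validate_csv(path):
    """Map 1-based data row numbers to their problems; empty dict if clean."""
    report = {}
    with open(path, newline="") as handle:
        for number, row in enumerate(csv.DictReader(handle), start=1):
            problems = validate_row(row)
            if problems:
                report[number] = problems
    return report
```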

Cancer Subtypes

Mosaic uses OncoTree codes to identify cancer subtypes. Common examples include:

  • LUAD: Lung Adenocarcinoma
  • LUSC: Lung Squamous Cell Carcinoma
  • BRCA: Breast Invasive Carcinoma
  • COAD: Colon Adenocarcinoma
  • READ: Rectal Adenocarcinoma
  • PRAD: Prostate Adenocarcinoma
  • SKCM: Skin Cutaneous Melanoma

For a complete list of supported cancer subtypes, see the OncoTree website.

When the cancer subtype is set to Unknown, Mosaic will use the Aeon model to predict the most likely cancer subtype based on the H&E image features.

Troubleshooting

HuggingFace Authentication Errors

If you encounter authentication errors when downloading models:

  1. Ensure you have access to the PDM-Group on HuggingFace
  2. Create a HuggingFace access token with appropriate permissions
  3. Set the HF_TOKEN environment variable correctly

Out of Memory Errors

If you encounter GPU out-of-memory errors:

  1. Reduce the number of workers: --num-workers 2
  2. Process slides sequentially instead of in batch
  3. Consider using a GPU with more memory

Tissue Segmentation Issues

If tissue is not being detected correctly:

  1. Try a different segmentation configuration (Biopsy, Resection, or TCGA)
  2. Check that the slide file is not corrupted
  3. Verify the slide format is supported (e.g., .svs, .tif)

Port Already in Use

If the default port 7860 is already in use:

  1. Check whether something is already listening on the port: ss -tuln | grep :7860
  2. Use a different port: export GRADIO_SERVER_PORT=7861
  3. Or specify the port directly: mosaic --server-port 7861
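
Where ss is not available, a similar check can be done from Python by trying to bind the port. This is a small sketch with hypothetical helper names; a successful bind only shows the port was free at that moment.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def first_free_port(start=7860, attempts=20):
    """Return the first free port in [start, start + attempts), else None."""
    for port in range(start, start + attempts):
        if port_is_free(port):
            return port
    return None


if __name__ == "__main__":
    print(first_free_port())
```

The returned value can then be passed to --server-port or exported as GRADIO_SERVER_PORT.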

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines on how to contribute to this project.

Architecture

For detailed information about the code structure and module organization, see ARCHITECTURE.md.

License

This project is licensed under the terms specified in the LICENSE file.