---
license: cc-by-nc-4.0
tags:
- computational-pathology
- survival-analysis
- whole-slide-imaging
- gene-expression
- oncology
- histopathology
extra_gated_prompt: |
  These weights are released under CC BY-NC 4.0: strictly non-commercial,
  research and educational use only. By requesting access you agree to:
  1. Use the weights only for non-commercial research.
  2. Cite the SPARC paper in any derived publication.
  3. Not redistribute the weights to third parties.
extra_gated_fields:
  Name: text
  Affiliation: text
  Email: text
  Intended use: text
  I agree to the non-commercial license: checkbox
---
# SPARC
**Gene-program-aware survival modelling from H&E whole-slide images.**
This repository hosts the trained model weights for the SPARC paper (Ayed,
Cohn, et al.). Code, configs, training scripts, and figure-regeneration
notebooks live at **[github.com/aziz-ayed/SPARC](https://github.com/aziz-ayed/SPARC)**.
SPARC is a two-stage pipeline:
1. **SPARC-Map** predicts 40 hallmark gene-expression-program (GEP) scores
per H&E patch, recovering a *spatial* molecular map of each slide.
2. **SPARC-Risk** fuses those per-patch GEP scores with the same H&E
features through a signature-query attention head and a cancer-aware
gate, producing a single per-patient risk score.
These weights cover the SPARC-Risk model and the image-only baseline used
for ablations.
<p align="center">
<img src="https://raw.githubusercontent.com/aziz-ayed/SPARC/main/docs/figs/figure1.png" alt="SPARC pipeline" width="720">
</p>
## What you get
| Folder | Model | Description |
|---|---|---|
| `sparc_risk/` | **SPARC-Risk (canonical)** | Signature-query fusion + H&E. The model reported throughout the paper. |
| `image_only/` | **Image-only baseline** | Same backbone, GEP pathway disabled. Use for direct ablation against SPARC-Risk. |
Each folder contains 5 checkpoints — `fold_0_best.pt` through
`fold_4_best.pt` — corresponding to the 5-fold cross-validation splits
described in the paper and in
[`data/mmp_hybrid_splits_v2_20cancer.csv`](https://github.com/aziz-ayed/SPARC/blob/main/data/mmp_hybrid_splits_v2_20cancer.csv).
Every `.pt` carries both `model_state_dict` and the original training
`config`, so the model can be rebuilt in a few lines:
```python
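import torch
from sparc.models.factory import build_model

# weights_only=False lets torch.load unpickle arbitrary Python objects,
# so only load checkpoints from a source you trust.
ckpt = torch.load("sparc_risk/fold_0_best.pt", map_location="cpu", weights_only=False)
model = build_model(ckpt["config"])              # rebuild the architecture from the saved config
model.load_state_dict(ckpt["model_state_dict"])  # restore the trained weights
model.eval()                                     # switch to inference mode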
```
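Since the loading snippet above indexes the checkpoint by `model_state_dict` and `config`, a small pre-flight check can give a clearer error than a bare `KeyError`. The following is a minimal sketch; the helper name `missing_checkpoint_keys` is hypothetical and not part of the SPARC package, and it only inspects the two top-level keys named above.

```python
from __future__ import annotations

# The two top-level keys this README says every checkpoint carries.
REQUIRED_KEYS = ("model_state_dict", "config")


def missing_checkpoint_keys(ckpt: dict) -> list[str]:
    """Return the required top-level keys that are absent from a loaded checkpoint."""
    return [key for key in REQUIRED_KEYS if key not in ckpt]


# Stand-in dict for illustration; in practice `ckpt` would come from torch.load(...).
example = {"model_state_dict": {}, "config": {}}
print(missing_checkpoint_keys(example))  # empty list when both keys are present
```

Calling this right after `torch.load` and before `build_model` keeps the failure close to its cause if a file was truncated or saved in a different format.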
## Quick start
```bash
# 1. Install the SPARC package
git clone https://github.com/aziz-ayed/SPARC.git && cd SPARC
conda env create -f environment.yml
conda activate sparc
# 2. Accept the license on https://huggingface.co/azizayed/SPARC, then:
pip install -U "huggingface_hub[cli]"
hf auth login
hf download azizayed/SPARC --local-dir checkpoints
# 3. Inference on an external cohort (e.g. NLST lung)
python -m inference.run \
    --cohort nlst \
    --checkpoint_dir checkpoints/sparc_risk \
    --gpus 0,1,2,3
```
The download produces:
```
checkpoints/
├── sparc_risk/ fold_{0..4}_best.pt
└── image_only/ fold_{0..4}_best.pt
```
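To confirm that the download finished, the layout above can be checked with the standard library alone. This is a sketch under the assumption that the files sit exactly where the tree shows them; the function names are illustrative and not part of the SPARC codebase.

```python
from pathlib import Path

MODEL_DIRS = ("sparc_risk", "image_only")  # the two folders listed in this README
N_FOLDS = 5                                # fold_0_best.pt through fold_4_best.pt


def expected_checkpoints(root):
    """List the ten checkpoint paths expected under `root`."""
    root = Path(root)
    return [root / model / f"fold_{k}_best.pt"
            for model in MODEL_DIRS
            for k in range(N_FOLDS)]


def missing_checkpoints(root):
    """Return the expected checkpoint paths that do not exist as files."""
    return [path for path in expected_checkpoints(root) if not path.is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints("checkpoints")
    if missing:
        print("missing:", ", ".join(str(p) for p in missing))
    else:
        print("all checkpoints present")
```

An incomplete list usually means the gated-access request has not been approved yet or the download was interrupted.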
## Architecture (SPARC-Risk)
| Component | Setting |
|---|---|
| Image backbone | H-optimus-1 (1536-dim) |
| Patch size / magnification | 224 px @ 20× |
| Max patches per slide | 4096 |
| Fusion | Signature-query cross-attention (64-NN, 4 heads) |
| Aggregator | Gated attention MIL |
| Head | Discrete-time NLL survival, 4 bins |
| Cancer conditioning | Per-cancer learned gate |
| Hidden dim | 256 |
| Trainable params | ≈ 2.6 M |
| Optimiser / schedule | Adam, lr 1 × 10⁻⁴, cosine T_max 20 |
| Random seed | 1337 |
Full config + reproduction recipe: [`configs/sparc_risk.yaml`](https://github.com/aziz-ayed/SPARC/blob/main/configs/sparc_risk.yaml).
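The table lists a discrete-time NLL survival head with 4 bins but does not spell out how head outputs become a risk score. The sketch below shows one common formulation of that technique, not necessarily the exact one used in SPARC: per-bin logits are mapped to hazards with a sigmoid, the survival curve is the running product of `1 - hazard`, the risk score is the negative sum of the survival curve, and the loss is the negative log-likelihood of the observed bin. All function names are hypothetical, and plain Python is used instead of PyTorch to keep the arithmetic visible.

```python
import math


def hazards_from_logits(logits):
    """Sigmoid per bin: h_k = P(event in bin k | survived to bin k)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def survival_curve(hazards):
    """S_k = prod_{j <= k} (1 - h_j): probability of surviving through bin k."""
    curve, s = [], 1.0
    for h in hazards:
        s *= 1.0 - h
        curve.append(s)
    return curve


def risk_score(hazards):
    """Higher hazards give a lower survival curve and thus a higher (less negative) risk."""
    return -sum(survival_curve(hazards))


def discrete_nll(hazards, bin_index, event, eps=1e-7):
    """Negative log-likelihood for one patient.

    bin_index: bin containing the event or the censoring time.
    event:     True if the event was observed, False if censored.
    """
    curve = survival_curve(hazards)
    survived_before = 1.0 if bin_index == 0 else curve[bin_index - 1]
    if event:
        # Event in bin k: survived all earlier bins, then failed in bin k.
        likelihood = survived_before * hazards[bin_index]
    else:
        # Censored in bin k: known only to have survived through bin k.
        likelihood = curve[bin_index]
    return -math.log(max(likelihood, eps))


hazards = hazards_from_logits([0.0, 0.0, 0.0, 0.0])  # four bins, hazard 0.5 each
print(survival_curve(hazards))  # [0.5, 0.25, 0.125, 0.0625]
print(risk_score(hazards))      # -0.9375
```

With four bins the risk score lies between -4 (zero hazard everywhere) and 0 (certain event in the first bin), which makes scores comparable across patients regardless of follow-up time.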
## Training data
5-fold patient-level cross-validation over **20 TCGA cancer types**
(BLCA, BRCA, CESC, COAD, ESCA, GBM, HNSC, KIRC, KIRP, LGG, LIHC, LUAD,
LUSC, PAAD, READ, SARC, SKCM, STAD, UCEC, plus a held-out evaluation
split). Splits derive from the MMP hybrid scheme of Mahmood et al. and
are released alongside the code at
[`data/mmp_hybrid_splits_v2_20cancer.csv`](https://github.com/aziz-ayed/SPARC/blob/main/data/mmp_hybrid_splits_v2_20cancer.csv).
External validation cohorts (not used for training) — NLST lung, SurGen
CRC, Yale breast, ovarian — are described in the paper.
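Patient-level cross-validation means that all slides from one patient fall in the same fold, so no patient contributes to both training and validation. A minimal sketch of that check follows. The column names `patient_id`, `slide_id`, and `fold` are hypothetical and may not match the released splits file, so they would need to be adapted to its actual header.

```python
from collections import defaultdict


def patients_in_multiple_folds(rows):
    """Return patient IDs assigned to more than one fold.

    `rows` is an iterable of dicts with the (hypothetical) keys
    'patient_id' and 'fold', e.g. as produced by csv.DictReader.
    """
    folds_by_patient = defaultdict(set)
    for row in rows:
        folds_by_patient[row["patient_id"]].add(row["fold"])
    return sorted(p for p, folds in folds_by_patient.items() if len(folds) > 1)


# Toy rows for illustration only; not taken from the real splits file.
rows = [
    {"patient_id": "P1", "slide_id": "P1-a", "fold": 0},
    {"patient_id": "P1", "slide_id": "P1-b", "fold": 0},
    {"patient_id": "P2", "slide_id": "P2-a", "fold": 1},
]
print(patients_in_multiple_folds(rows))  # [] means the split is patient-level
```

The same check is worth running on any new cohort that is split for fine-tuning or benchmarking against these weights.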
## Intended use
These weights are intended for **non-commercial biomedical research and
education only**. Acceptable uses include:
- Reproducing the SPARC paper's results.
- Benchmarking against SPARC-Risk in computational-pathology research.
- Methodological extensions (new fusion designs, additional cohorts,
ablation studies).
## Citation
The SPARC paper is currently under review. Once a preprint or accepted
version is available, a BibTeX entry will be added here. In the meantime,
if you use these weights, please link back to
[github.com/aziz-ayed/SPARC](https://github.com/aziz-ayed/SPARC) and
contact the corresponding author at <azizayed@mit.edu>.
## License
These weights are released under the
[Creative Commons Attribution-NonCommercial 4.0 International License
(CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/). For
commercial licensing, please contact the authors via the issues page of
the [SPARC GitHub repository](https://github.com/aziz-ayed/SPARC).