TESSERA
Tumour Embeddings via Self-Supervised Encoding and Reconstruction of Alterations.
A foundation model for the cancer genome, jointly pretrained on somatic single-nucleotide variants (SNVs) and copy-number alterations (CNAs) from the TCGA Pan-Cancer Atlas through masked-token reconstruction within each modality and a cross-modal InfoNCE contrastive objective.
A single learned representation, produced once and reused without retraining, supports variant pathogenicity prediction, pan-cancer tumour-type classification, unsupervised molecular subtyping, prognostic stratification, and counterfactual treatment-effect estimation.
- Source code: github.com/JW-Sidhom-Lab/tessera
- Hosted inference API: see the project README's Quick start section
- Paper: citation pending publication
Model variants
This repository contains two pretrained variants. Pick the one that matches your input data.
| Variant | Inputs | Use when |
|---|---|---|
| joint_snv_cna_noloh | SNV CSV (chrom, pos, ref, alt, vaf) + CNA CSV (chrom, start, end, segment_mean) | Default. CNA inputs without allele-specific / loss-of-heterozygosity calls (most panel-sequencing cohorts). All published Figure 4-6 results use this variant. |
| joint_snv_cna | Same plus an LOH column (0/1) | When allele-specific CNA calls are available and the LoH signal should inform the representation. |
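The choice between the two variants can be made from the columns of the CNA table. The following is a minimal sketch, assuming the LoH calls live in a column named LOH; pick_variant is a hypothetical helper for illustration and is not part of the tessera package.

```python
def pick_variant(cna_columns):
    """Return the pretrained variant name that matches the CNA columns.

    Hypothetical helper for illustration only, not a tessera function.
    """
    # Use the LoH-aware variant only when an LOH column is actually present.
    if "LOH" in set(cna_columns):
        return "joint_snv_cna"
    return "joint_snv_cna_noloh"


base_columns = ["Tumor_Sample_Barcode", "Chromosome", "Start", "End", "Segment_Mean"]
print(pick_variant(base_columns))
print(pick_variant(base_columns + ["LOH"]))
```

With a pandas dataframe, the call would be pick_variant(cna_df.columns).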
Each subdirectory contains:
- best_model.keras, final_model.keras: full inference graphs
- features_model_mut.keras, features_model_cna.keras: per-modality feature extractors used by get_variant_features / get_cna_features
- attn_*_model.keras: attention sub-modules
- model_config.json: architecture configuration
- training_log.csv: training history (for reproducibility)
Quick start
pip install tessera-foundation
The shortest path from raw dataframes to feature tensors is the
featurize one-liner. It downloads the requested variant from this
repo on first call (cached afterwards), lifts coordinates to GRCh37 if
the source assembly is anything else, builds the inference dataset, and
returns both per-modality embeddings:
import tessera
result = tessera.featurize(
snv_df=snv_df,
cna_df=cna_df,
variant="joint_snv_cna_noloh", # or "joint_snv_cna"
from_assembly="GRCh38", # "GRCh37" / "hg19" is a no-op
quantile_normalize_to_tcga=False, # see CNA-distribution table below
)
result.snv_features # (n_variants, 1169)
result.cna_features # (n_segments, 688)
result.snv_table # post-liftover SNV table, row-aligned with snv_features
result.cna_table # likewise for CNA
result.liftover_stats # {"snv": {"n_in", "n_out", "n_dropped"}, "cna": ...}
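Because liftover_stats reports an n_dropped count, it is worth checking how many rows were lost before using the features. The sketch below assumes liftover_stats behaves like a plain dict with the keys listed above; dropped_fraction is a hypothetical helper and the counts are made up.

```python
def dropped_fraction(stats):
    """Fraction of input rows lost during liftover, per modality."""
    fractions = {}
    for modality, counts in stats.items():
        n_in = counts["n_in"]
        # Guard against an empty input table.
        fractions[modality] = counts["n_dropped"] / n_in if n_in else 0.0
    return fractions


# Made-up counts with the same keys as result.liftover_stats.
example_stats = {
    "snv": {"n_in": 1000, "n_out": 990, "n_dropped": 10},
    "cna": {"n_in": 200, "n_out": 200, "n_dropped": 0},
}
print(dropped_fraction(example_stats))
```

A large dropped fraction usually suggests that from_assembly does not match the coordinates actually present in the input.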
Expected dataframe columns:
- SNV: Tumor_Sample_Barcode, Chromosome (no chr prefix), Start_Position, Reference_Allele, Tumor_Seq_Allele2, plus either vaf or both t_alt_count + t_ref_count.
- CNA: Tumor_Sample_Barcode, Chromosome, Start, End, Segment_Mean (log2 ratio). Optional LOH column is consumed by the with-LoH variant.
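The column requirements above can be checked before calling the model. This is a sketch only: the helper names are hypothetical, and the VAF formula is the standard alt / (alt + ref) definition, which is assumed (not confirmed) to match what the package derives from the two count columns.

```python
SNV_REQUIRED = {"Tumor_Sample_Barcode", "Chromosome", "Start_Position",
                "Reference_Allele", "Tumor_Seq_Allele2"}
CNA_REQUIRED = {"Tumor_Sample_Barcode", "Chromosome", "Start", "End", "Segment_Mean"}


def missing_snv_columns(columns):
    """List SNV columns that are absent; empty list means the table is usable."""
    cols = set(columns)
    missing = sorted(SNV_REQUIRED - cols)
    # Either a precomputed vaf column or both read-count columns must exist.
    has_vaf = "vaf" in cols
    has_counts = {"t_alt_count", "t_ref_count"} <= cols
    if not (has_vaf or has_counts):
        missing.append("vaf (or t_alt_count + t_ref_count)")
    return missing


def missing_cna_columns(columns):
    """List CNA columns that are absent (LOH is optional and not checked)."""
    return sorted(CNA_REQUIRED - set(columns))


def vaf_from_counts(t_alt_count, t_ref_count):
    """Variant allele fraction = alt reads / (alt reads + ref reads)."""
    depth = t_alt_count + t_ref_count
    if depth <= 0:
        raise ValueError("total read depth must be positive")
    return t_alt_count / depth


print(missing_snv_columns(["Tumor_Sample_Barcode", "Chromosome", "Start_Position"]))
print(vaf_from_counts(12, 36))
```

With pandas dataframes, the checks would be called as missing_snv_columns(snv_df.columns) and missing_cna_columns(cna_df.columns).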
CNA distribution: when to set quantile_normalize_to_tcga=True
TESSERA was pretrained on TCGA Pan-Cancer whole-exome ABSOLUTE Segment_Means. Inputs whose log2-ratio distribution differs from TCGA's should be quantile-normalized against the TCGA reference before inference.
| Input type | quantile_normalize_to_tcga |
|---|---|
| TCGA-like whole-exome ABSOLUTE segments | False (default) |
| Panel sequencing (MSK-IMPACT, MSK-CHORD, GENIE) | True |
| Cell-line data (DepMap, CCLE) | True |
When True, TESSERA rank-maps each Segment_Mean onto the bundled
TCGA reference (tessera/data/cna_sorted.npy, 7 MB, ships with the
package). The helper tessera.data.preprocessing.quantile_normalize_to_tcga
is also exposed directly if you prefer to pre-normalize.
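To illustrate what rank-mapping onto a sorted reference means, here is a minimal pure-Python sketch. It is not the package's implementation (use tessera.data.preprocessing.quantile_normalize_to_tcga for real inputs); quantile_map is a hypothetical name, the mid-rank handling of ties is an assumption, and the tiny reference list stands in for the bundled TCGA array.

```python
from bisect import bisect_left, bisect_right


def quantile_map(values, reference_sorted):
    """Replace each value by the reference value at the same empirical quantile."""
    if not reference_sorted:
        raise ValueError("reference must not be empty")
    ordered = sorted(values)
    n, m = len(ordered), len(reference_sorted)
    mapped = []
    for v in values:
        # Mid-rank of v within the input distribution (ties share one rank).
        lo, hi = bisect_left(ordered, v), bisect_right(ordered, v)
        quantile = ((lo + hi) / 2) / n
        # Look up the reference value sitting at that quantile.
        index = min(int(quantile * m), m - 1)
        mapped.append(reference_sorted[index])
    return mapped


toy_reference = [-2.0, -1.0, 0.0, 1.0]   # stand-in for the sorted TCGA Segment_Means
panel_values = [0.5, -1.0, 0.0, 2.0]     # made-up input Segment_Means
print(quantile_map(panel_values, toy_reference))
```

The mapping preserves the ordering of the input segments while forcing their marginal distribution to follow the reference, which is why it helps when the input platform reports log2 ratios on a different scale.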
For finer control, the lower-level building blocks remain available
(tessera.load_pretrained, tessera.lift_snv / tessera.lift_cna,
model.create_sample_dataset, model.get_variant_features /
model.get_cna_features).
Pretraining data
- Cohort: TCGA Pan-Cancer Atlas (~10,000 tumour samples across 31 solid tumour types)
- SNV source: TCGA MC3 MAF (multi-centre mutation-calling consensus); variants filtered to depth > 0
- CNA source: TCGA ABSOLUTE allele-specific copy-number calls; log2-ratio Segment_Means with tumour-purity adjustment
- Genome assembly: GRCh37 (hg19). GRCh38 inputs must be lifted over before inference.
- Train / validation split: 75% / 25% at the patient level, stratified by tumour type, with the same patient-level assignments across SNV and CNA modalities
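A patient-level, tumour-type-stratified 75% / 25% split of the kind described in the list above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the function name, the seed, and the rounding rule are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_patient_split(patient_to_type, train_fraction=0.75, seed=0):
    """Split patients into train/validation sets within each tumour type."""
    by_type = defaultdict(list)
    for patient, tumour_type in sorted(patient_to_type.items()):
        by_type[tumour_type].append(patient)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val = set(), set()
    for tumour_type in sorted(by_type):
        patients = by_type[tumour_type]
        rng.shuffle(patients)
        n_train = round(train_fraction * len(patients))
        train.update(patients[:n_train])
        val.update(patients[n_train:])
    return train, val


# Made-up cohort: 8 patients of one type and 4 of another.
patients = {f"P{i:02d}": ("BRCA" if i < 8 else "LUAD") for i in range(12)}
train, val = stratified_patient_split(patients)
print(len(train), len(val))
```

Filtering both the SNV and the CNA table with the same train and val sets keeps the patient assignments identical across modalities.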
Architecture
- SNV encoder: 3 local-attention blocks (sequence context 25 bp each side of the variant) followed by 3 global-attention blocks (variant-variant interactions within a sample). Per-token output dimension 1169.
- CNA encoder: parallel-stream encoder separating positional features from segment-mean and LoH values to prevent leakage during reconstruction. Two inter-segment self-attention blocks. Per-token output dimension 688.
- Cross-modal alignment: InfoNCE contrastive loss on per-modality 256-D projection heads. Each head is a 2-layer MLP with one GELU-activated hidden layer, applied per token and followed by masked global average pooling (project-then-mean). No cross-modal attention between encoders during the forward pass, so per-modality outputs do not depend on the other modality's inputs.
- Pretraining objectives: masked-token reconstruction (alt allele for SNV; segment mean for CNA) plus the InfoNCE alignment loss.
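The pooling and alignment steps named above can be illustrated with a small pure-Python sketch of masked mean pooling and a generic InfoNCE loss. It is not the TESSERA training code: the temperature value, the one-directional form (SNV to CNA), and the function names are assumptions made for illustration.

```python
import math


def masked_mean(token_vectors, mask):
    """Average only the vectors whose mask entry is 1 (real tokens, not padding)."""
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    if not kept:
        raise ValueError("mask removes every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(snv_embeddings, cna_embeddings, temperature=0.1):
    """Each SNV embedding must pick out the CNA embedding of the same sample."""
    a = [l2_normalize(v) for v in snv_embeddings]
    b = [l2_normalize(v) for v in cna_embeddings]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity to every CNA embedding in the batch, scaled by temperature.
        logits = [sum(x * y for x, y in zip(anchor, other)) / temperature for other in b]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching sample (index i) as the positive.
        total += log_denominator - logits[i]
    return total / len(a)


pooled = masked_mean([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], [1, 1, 0])
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(pooled, aligned < shuffled)
```

The loss is small when each sample's SNV and CNA embeddings point the same way and large when they are paired with the wrong sample, which is the signal that pulls the two modalities into a shared space.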
Full details are in the accompanying paper.
Intended use
This is a foundation model for cancer-genomics research. Its representations support downstream tasks such as variant pathogenicity prediction, tumour-type classification, prognostic stratification, and treatment-effect estimation when combined with appropriate task-specific analyses.
Not intended for clinical decision-making. The model has not been validated for any clinical use. Predictions, including any downstream pathogenicity, prognostic, or treatment-selection scores derived from these representations, should not be used to direct patient care.
Limitations
- Genome assembly: trained on GRCh37/hg19. GRCh38 inputs must be lifted over before inference. tessera.featurize (with from_assembly="GRCh38") and the hosted inference API in the project repository both do this automatically.
- Variant types: SNVs only. Indels and multi-base substitutions are not modelled.
- Cross-platform shift: the model was trained on TCGA whole-exome data. Panel-sequencing inputs (e.g., MSK-IMPACT) require quantile normalisation of CNA Segment_Means against the TCGA training distribution to reconcile coverage differences.
- Cell-line vs primary-tumour distribution: cell-line inputs (e.g., DepMap) are aneuploid relative to primary tumours; expect partially out-of-distribution behaviour without normalisation.
- Treatment-effect heads not included: the linear coefficients producing the published predictive-biomarker scores in metastatic CRC and PDAC are not in this repository. Those remain available on request under a Data Use Agreement.
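Since only SNVs are modelled, it can help to screen a mutation table for indels and multi-base substitutions before inference. The source does not say whether the package filters these itself, so the sketch below is a precaution, and is_snv is a hypothetical helper.

```python
BASES = {"A", "C", "G", "T"}


def is_snv(ref, alt):
    """True only for a single-base substitution between two distinct bases."""
    ref, alt = ref.upper(), alt.upper()
    return ref in BASES and alt in BASES and ref != alt


# Made-up (Reference_Allele, Tumor_Seq_Allele2) pairs.
calls = [("A", "G"), ("AT", "A"), ("-", "T"), ("AC", "GT"), ("c", "t")]
print([pair for pair in calls if is_snv(*pair)])
```

With a pandas dataframe, the same predicate can be applied row-wise to Reference_Allele and Tumor_Seq_Allele2 to keep only the rows the model can represent.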
Citation
Citation pending publication.
A BibTeX entry will be added on acceptance.
License
This model is distributed under Creative Commons Attribution-NonCommercial 4.0 International (CC-BY-NC-4.0). Use is permitted for academic research, education, and personal experimentation; commercial use is not permitted without a separate license.
Patents covering clinical applications of TESSERA are assigned to NewYork-Presbyterian. Commercial licensing inquiries should be directed to NYP's technology transfer office.
The accompanying source code is distributed under the PolyForm Noncommercial License 1.0.0.