TESSERA

Tumour Embeddings via Self-Supervised Encoding and Reconstruction of Alterations.

A foundation model for the cancer genome, jointly pretrained on somatic single-nucleotide variants (SNVs) and copy-number alterations (CNAs) from the TCGA Pan-Cancer Atlas through masked-token reconstruction within each modality and a cross-modal InfoNCE contrastive objective.

A single learned representation, produced once and reused without retraining, supports variant pathogenicity prediction, pan-cancer tumour-type classification, unsupervised molecular subtyping, prognostic stratification, and counterfactual treatment-effect estimation.

Source code: github.com/JW-Sidhom-Lab/tessera
Hosted inference API: see the project README's Quick start section
Paper: citation pending publication

Model variants

This repository contains two pretrained variants. Pick the one that matches your input data.

| Variant | Inputs | Use when |
| --- | --- | --- |
| `joint_snv_cna_noloh` | SNV CSV (chrom, pos, ref, alt, vaf) + CNA CSV (chrom, start, end, segment_mean) | Default. For CNA inputs without allele-specific / loss-of-heterozygosity (LoH) calls (most panel-sequencing cohorts). All published results in Figures 4-6 use this variant. |
| `joint_snv_cna` | Same, plus an `LOH` column (0/1) | When allele-specific CNA calls are available and the LoH signal should inform the representation. |
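
The choice between the two variants can be made programmatically by checking whether the CNA dataframe carries usable LoH calls. The sketch below is illustrative only: `choose_variant` is a hypothetical helper, not part of the tessera package, and treating a partially missing `LOH` column as "no LoH" is an assumption. Only the two variant names and the `LOH` column name come from this card.

```python
import pandas as pd

def choose_variant(cna_df: pd.DataFrame) -> str:
    """Hypothetical helper (not part of tessera): pick a variant name from the CNA columns."""
    # Assumption: a variant that consumes LoH needs a value on every segment.
    if "LOH" in cna_df.columns and cna_df["LOH"].notna().all():
        return "joint_snv_cna"
    return "joint_snv_cna_noloh"

# Toy segments, values are made up for illustration.
cna_without_loh = pd.DataFrame({"Segment_Mean": [0.12, -0.40]})
cna_with_loh = cna_without_loh.assign(LOH=[0, 1])
print(choose_variant(cna_without_loh), choose_variant(cna_with_loh))
```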

Each subdirectory contains:

best_model.keras, final_model.keras — full inference graphs
features_model_mut.keras, features_model_cna.keras — per-modality feature extractors used by get_variant_features / get_cna_features
attn_*_model.keras — attention sub-modules
model_config.json — architecture configuration
training_log.csv — training history (for reproducibility)

Quick start

pip install tessera-foundation

The shortest path from raw dataframes to feature tensors is a single call to tessera.featurize. On first call it downloads the requested variant from this repo (cached afterwards); it then lifts coordinates over to GRCh37 when from_assembly names any other assembly, builds the inference dataset, and returns the embeddings for both modalities:

import tessera

result = tessera.featurize(
    snv_df=snv_df,
    cna_df=cna_df,
    variant="joint_snv_cna_noloh",        # or "joint_snv_cna"
    from_assembly="GRCh38",               # "GRCh37" / "hg19" is a no-op
    quantile_normalize_to_tcga=False,     # see CNA-distribution table below
)

result.snv_features      # (n_variants, 1169)
result.cna_features      # (n_segments, 688)
result.snv_table         # post-liftover SNV table, row-aligned with snv_features
result.cna_table         # likewise for CNA
result.liftover_stats    # {"snv": {"n_in", "n_out", "n_dropped"}, "cna": ...}
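
Because the feature arrays are row-aligned with their tables, per-variant or per-segment features can be collapsed to one vector per sample. The sketch below uses plain mean pooling on synthetic stand-ins for result.snv_features and result.snv_table. Mean pooling is an illustrative choice, not necessarily the aggregation used in the paper, and the sketch assumes the returned table keeps the Tumor_Sample_Barcode column. `mean_pool_by_sample` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def mean_pool_by_sample(features: np.ndarray, table: pd.DataFrame) -> pd.DataFrame:
    """Average row-aligned per-token features within each Tumor_Sample_Barcode."""
    if len(features) != len(table):
        raise ValueError("features and table must be row-aligned")
    # Index the feature rows by sample barcode, then average within each barcode.
    frame = pd.DataFrame(features, index=table["Tumor_Sample_Barcode"].to_numpy())
    return frame.groupby(level=0, sort=True).mean()

# Synthetic stand-ins for result.snv_features / result.snv_table.
rng = np.random.default_rng(0)
snv_features = rng.normal(size=(5, 1169))
snv_table = pd.DataFrame({"Tumor_Sample_Barcode": ["S1", "S1", "S2", "S2", "S2"]})

sample_embeddings = mean_pool_by_sample(snv_features, snv_table)
print(sample_embeddings.shape)  # one row per sample, 1169 columns
```

The same helper applies to the CNA features with their 688-dimensional rows.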

Expected dataframe columns:

SNV: Tumor_Sample_Barcode, Chromosome (no chr prefix), Start_Position, Reference_Allele, Tumor_Seq_Allele2, plus either vaf or both t_alt_count + t_ref_count.
CNA: Tumor_Sample_Barcode, Chromosome, Start, End, Segment_Mean (log2 ratio). An optional LOH column (0/1) is consumed by the joint_snv_cna variant.
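
A quick pre-flight check against the column lists above can catch schema problems before inference. The sketch below is an assumption-laden illustration: `missing_snv_columns`, `missing_cna_columns`, and `strip_chr_prefix` are hypothetical helpers, not tessera functions, and the toy rows are made up.

```python
import pandas as pd

SNV_REQUIRED = ["Tumor_Sample_Barcode", "Chromosome", "Start_Position",
                "Reference_Allele", "Tumor_Seq_Allele2"]
CNA_REQUIRED = ["Tumor_Sample_Barcode", "Chromosome", "Start", "End", "Segment_Mean"]

def missing_snv_columns(df: pd.DataFrame) -> list:
    """List required SNV columns that are absent (hypothetical helper)."""
    missing = [c for c in SNV_REQUIRED if c not in df.columns]
    has_vaf = "vaf" in df.columns
    has_counts = {"t_alt_count", "t_ref_count"} <= set(df.columns)
    if not (has_vaf or has_counts):
        missing.append("vaf or (t_alt_count + t_ref_count)")
    return missing

def missing_cna_columns(df: pd.DataFrame) -> list:
    """List required CNA columns that are absent (hypothetical helper)."""
    return [c for c in CNA_REQUIRED if c not in df.columns]

def strip_chr_prefix(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with any leading 'chr' removed from Chromosome."""
    out = df.copy()
    out["Chromosome"] = out["Chromosome"].astype(str).str.replace(r"^chr", "", regex=True)
    return out

# Toy inputs with made-up coordinates and counts.
snv_df = strip_chr_prefix(pd.DataFrame({
    "Tumor_Sample_Barcode": ["S1", "S1", "S2"],
    "Chromosome": ["chr7", "17", "chrX"],
    "Start_Position": [1000, 2000, 3000],
    "Reference_Allele": ["A", "C", "G"],
    "Tumor_Seq_Allele2": ["T", "T", "A"],
    "t_alt_count": [12, 30, 8],
    "t_ref_count": [48, 70, 32],
}))
cna_df = pd.DataFrame({
    "Tumor_Sample_Barcode": ["S1", "S2"],
    "Chromosome": ["7", "X"],
    "Start": [1, 1],
    "End": [500000, 800000],
    "Segment_Mean": [0.35, -0.60],
})
print(missing_snv_columns(snv_df), missing_cna_columns(cna_df))
```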

CNA distribution: when to set `quantile_normalize_to_tcga=True`

TESSERA was pretrained on TCGA Pan-Cancer whole-exome ABSOLUTE Segment_Means. Inputs whose log2-ratio distribution differs from TCGA's should be quantile-normalized against the TCGA reference before inference.

| Input type | `quantile_normalize_to_tcga` |
| --- | --- |
| TCGA-like whole-exome ABSOLUTE segments | `False` (default) |
| Panel sequencing (MSK-IMPACT, MSK-CHORD, GENIE) | `True` |
| Cell-line data (DepMap, CCLE) | `True` |

When True, TESSERA rank-maps each Segment_Mean onto the bundled TCGA reference (tessera/data/cna_sorted.npy, 7 MB, ships with the package). The helper tessera.data.preprocessing.quantile_normalize_to_tcga is also exposed directly if you prefer to pre-normalize.
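
To make the idea of rank-mapping concrete, here is a generic quantile-mapping sketch in NumPy. It is not the package's implementation: `quantile_map` is a hypothetical function, the reference array is a synthetic stand-in for the bundled TCGA values, and the tie handling and mid-point quantile convention are assumptions.

```python
import numpy as np

def quantile_map(values, sorted_reference):
    """Map each value to the reference value at the same empirical quantile."""
    values = np.asarray(values, dtype=float)
    reference = np.asarray(sorted_reference, dtype=float)
    # Rank of every input value within the input itself (0 = smallest).
    # Ties receive distinct consecutive ranks in their original order.
    order = np.argsort(values, kind="stable")
    ranks = np.empty(values.size, dtype=float)
    ranks[order] = np.arange(values.size)
    # Mid-point quantiles keep both ends strictly inside (0, 1).
    input_q = (ranks + 0.5) / values.size
    reference_q = (np.arange(reference.size) + 0.5) / reference.size
    # Read the reference distribution at the input quantiles.
    return np.interp(input_q, reference_q, reference)

rng = np.random.default_rng(0)
tcga_like_reference = np.sort(rng.normal(0.0, 0.4, size=5000))  # synthetic stand-in
panel_segment_means = rng.normal(0.3, 1.0, size=200)            # synthetic stand-in
normalised = quantile_map(panel_segment_means, tcga_like_reference)
```

The mapping preserves the ordering of the input segments while forcing their marginal distribution onto the reference, which is why it helps when the input platform has a different dynamic range.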

For finer control, the lower-level building blocks remain available (tessera.load_pretrained, tessera.lift_snv / tessera.lift_cna, model.create_sample_dataset, model.get_variant_features / model.get_cna_features).

Pretraining data

Cohort: TCGA Pan-Cancer Atlas (~10,000 tumour samples across 31 solid tumour types)
SNV source: TCGA MC3 MAF (multi-centre mutation-calling consensus); variants filtered to depth > 0
CNA source: TCGA ABSOLUTE allele-specific copy-number calls; log2-ratio Segment_Means with tumour-purity adjustment
Genome assembly: GRCh37 (hg19). GRCh38 inputs must be lifted over before inference; tessera.featurize does this when from_assembly="GRCh38" is passed.
Train / validation split: 75% / 25% at the patient level, stratified by tumour type, with the same patient-level assignments across SNV and CNA modalities
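
A patient-level, tumour-type-stratified split of this kind can be sketched as follows. This is a minimal illustration, not the authors' splitting code: `split_patients` and `label_split` are hypothetical helpers, the column names patient_id and tumour_type are assumptions, and the cohort is synthetic.

```python
import numpy as np
import pandas as pd

def split_patients(patients: pd.DataFrame, train_frac: float = 0.75, seed: int = 0) -> set:
    """Return the training patient IDs, sampled separately within each tumour type."""
    rng = np.random.default_rng(seed)
    train_ids = set()
    for _, group in patients.groupby("tumour_type"):
        ids = rng.permutation(group["patient_id"].drop_duplicates().to_numpy())
        n_train = int(round(train_frac * len(ids)))
        train_ids.update(ids[:n_train].tolist())
    return train_ids

def label_split(df: pd.DataFrame, train_ids: set) -> np.ndarray:
    """Label rows of any per-patient table using one shared patient assignment."""
    return np.where(df["patient_id"].isin(train_ids), "train", "validation")

# Synthetic cohort: 40 patients of type A and 20 of type B.
patients = pd.DataFrame({
    "patient_id": [f"P{i:03d}" for i in range(60)],
    "tumour_type": ["A"] * 40 + ["B"] * 20,
})
train_ids = split_patients(patients)

# Applying the same ID set to both modality tables keeps the assignments identical.
snv_rows = pd.DataFrame({"patient_id": ["P000", "P000", "P059"]})
cna_rows = pd.DataFrame({"patient_id": ["P000", "P059"]})
snv_split = label_split(snv_rows, train_ids)
cna_split = label_split(cna_rows, train_ids)
```

Splitting on patient IDs rather than on rows ensures that no patient contributes variants or segments to both partitions.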

Architecture

SNV encoder: 3 local-attention blocks (sequence context 25 bp each side of the variant) followed by 3 global-attention blocks (variant-variant interactions within a sample). Per-token output dimension 1169.
CNA encoder: parallel-stream encoder separating positional features from segment-mean and LoH values to prevent leakage during reconstruction. Two inter-segment self-attention blocks. Per-token output dimension 688.
Cross-modal alignment: InfoNCE contrastive loss on per-modality 256-D projection heads. Each head is a 2-layer MLP with one GELU-activated hidden layer, applied per token and followed by masked global average pooling (project-then-mean). There is no cross-modal attention between encoders during the forward pass, so each modality's outputs do not depend on the other modality's inputs.
Pretraining objectives: masked-token reconstruction (alt allele for SNV; segment mean for CNA) plus the InfoNCE alignment loss.
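
The alignment objective can be illustrated with a small NumPy sketch of masked mean pooling followed by InfoNCE between per-sample SNV and CNA projections. This is a generic formulation, not the released training code: the symmetric form of the loss, the cosine similarity, and the temperature of 0.1 are all assumptions, and the inputs are random stand-ins for projected tokens.

```python
import numpy as np

def masked_mean(tokens: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average tokens of shape (batch, n_tokens, dim) over real (mask == 1) positions."""
    weights = mask[..., None].astype(tokens.dtype)
    return (tokens * weights).sum(axis=1) / np.maximum(weights.sum(axis=1), 1.0)

def info_nce(z_a: np.ndarray, z_b: np.ndarray, temperature: float = 0.1) -> float:
    """Symmetric InfoNCE: row i of z_a and row i of z_b form the positive pair."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature

    def cross_entropy_on_diagonal(scores: np.ndarray) -> float:
        scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return float(-np.mean(np.diag(log_probs)))

    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(logits.T))

rng = np.random.default_rng(0)
# Stand-ins for already-projected 256-D tokens: 8 samples, up to 6 tokens each.
snv_tokens = rng.normal(size=(8, 6, 256))
snv_mask = np.ones((8, 6))
snv_mask[:, 4:] = 0                      # last two positions are padding
z_snv = masked_mean(snv_tokens, snv_mask)
z_cna_aligned = z_snv + 0.01 * rng.normal(size=z_snv.shape)
z_cna_random = rng.normal(size=z_snv.shape)

loss_aligned = info_nce(z_snv, z_cna_aligned)
loss_random = info_nce(z_snv, z_cna_random)
```

Well-aligned pairs give a loss near zero, while unrelated pairs give a loss near the log of the batch size, which is the signal the contrastive term uses to pull the two modalities of one tumour together.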

Full details are in the accompanying paper.

Intended use

This is a foundation model for cancer-genomics research. Its representations support downstream tasks such as variant pathogenicity prediction, tumour-type classification, prognostic stratification, and treatment-effect estimation, when combined with appropriate task-specific analyses.

Not intended for clinical decision-making. The model has not been validated for any clinical use. Predictions, including any downstream pathogenicity, prognostic, or treatment-selection scores derived from these representations, should not be used to direct patient care.

Limitations

Genome assembly: trained on GRCh37/hg19. GRCh38 inputs must be lifted over before inference. tessera.featurize (via from_assembly) and the hosted inference API in the project repository both do this automatically.
Variant types: SNVs only. Indels and multi-base substitutions are not modelled.
Cross-platform shift: the model was trained on TCGA whole-exome data. Panel-sequencing inputs (e.g., MSK-IMPACT) require quantile normalisation of CNA Segment_Means against the TCGA training distribution to reconcile coverage differences.
Cell-line vs primary-tumour distribution: cell-line inputs (e.g., DepMap) tend to be more aneuploid than primary tumours; expect partially out-of-distribution behaviour without normalisation.
Treatment-effect heads not included: the linear coefficients producing the published predictive-biomarker scores in metastatic CRC and PDAC are not in this repository. Those remain available on request under a Data Use Agreement.
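
Given the variant-type limitation above, a MAF-style table that mixes SNVs with indels or multi-base substitutions can be filtered upstream. The sketch below is one way to do this with pandas. `keep_snvs` is a hypothetical helper, and whether tessera.featurize already performs an equivalent filter is not stated in this card.

```python
import pandas as pd

BASES = {"A", "C", "G", "T"}

def keep_snvs(maf: pd.DataFrame) -> pd.DataFrame:
    """Keep rows where both alleles are single, different bases in A/C/G/T."""
    ref = maf["Reference_Allele"].astype(str).str.upper()
    alt = maf["Tumor_Seq_Allele2"].astype(str).str.upper()
    is_snv = ref.isin(BASES) & alt.isin(BASES) & (ref != alt)
    return maf.loc[is_snv].reset_index(drop=True)

# Toy rows: one SNV, one insertion, one multi-base substitution, one non-variant.
maf = pd.DataFrame({
    "Reference_Allele": ["A", "-", "AT", "C"],
    "Tumor_Seq_Allele2": ["T", "A", "GC", "C"],
})
snv_only = keep_snvs(maf)
print(len(snv_only))
```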

Citation

citation pending publication

A BibTeX entry will be added on acceptance.

License

This model is distributed under Creative Commons Attribution-NonCommercial 4.0 International (CC-BY-NC-4.0). Use is permitted for academic research, education, and personal experimentation; commercial use is not permitted without a separate license.

Patents covering clinical applications of TESSERA are assigned to NewYork-Presbyterian. Commercial licensing inquiries should be directed to NYP's technology transfer office.

The accompanying source code is distributed under the PolyForm Noncommercial License 1.0.0.
