---
license: mit
---

# DecoderTCR v0.1
DecoderTCR is a protein language model for T-cell receptor (TCR) and peptide-MHC (pMHC) complexes, based on the ESM2 model family.

For model code and additional information on installation and usage, please see [the associated GitHub repository](https://github.com/czbiohub-chi/DecoderTCR).

## Model Architecture

DecoderTCR is built on a Transformer-based protein language model (ESM2 family).

### Core Architecture

The model follows the **ESM2** architecture, a deep Transformer encoder designed for protein sequences.

#### Embedding Layer
- Token embedding dimension: *d* (e.g., 1280)
- Learned positional embeddings
- Vocabulary includes:
  - 20 standard amino acids
  - Special tokens (mask, padding, BOS/EOS, unknown)

#### Transformer Stack
- Number of layers: *L* (e.g., 33)
- Hidden dimension: *d*
- Number of attention heads: *h* (e.g., 20)
- Multi-head self-attention:
  - Full-sequence, bidirectional attention
- Feed-forward network:
  - Intermediate dimension ≈ 4× *d*
  - Activation function: GELU
- Layer normalization: Pre-LayerNorm
- Residual connections around attention and feed-forward blocks
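As a rough illustration of the block structure described above, here is a minimal pre-LayerNorm encoder block in PyTorch. This is a simplified sketch, not the actual ESM2 or DecoderTCR implementation (it omits positional embeddings, dropout, and other ESM2-specific details); the small dimensions in the usage example are for a quick shape check only.

```python
import torch
import torch.nn as nn

class PreLNEncoderBlock(nn.Module):
    """One pre-LN Transformer encoder block: LayerNorm before each sub-layer,
    multi-head self-attention, a 4x-wide GELU feed-forward, and residuals."""

    def __init__(self, d: int = 1280, h: int = 20):
        super().__init__()
        self.norm1 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, h, batch_first=True)
        self.norm2 = nn.LayerNorm(d)
        self.ffn = nn.Sequential(
            nn.Linear(d, 4 * d),   # intermediate dim ~= 4x hidden dim
            nn.GELU(),
            nn.Linear(4 * d, d),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-LN: normalize, apply the sub-layer, then add the residual.
        a = self.norm1(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        x = x + self.ffn(self.norm2(x))
        return x

block = PreLNEncoderBlock(d=64, h=4)   # tiny dims for a quick check
out = block(torch.randn(2, 10, 64))    # (batch, seq_len, hidden)
print(out.shape)                       # torch.Size([2, 10, 64])
```

ESM2-650M corresponds to the defaults above (`d=1280`, `h=20`) stacked 33 times.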

### Continual Training Setup

The model is initialized from a pretrained **ESM2 checkpoint** and further trained via continual pretraining with MLM objectives.


### Model Scale (Example Configurations)

| Model Variant | Parameters | Layers | Hidden Dim | Attention Heads |
| --- | --- | --- | --- | --- |
| ESM2-650M | ~650M | 33 | 1280 | 20 |
| ESM2-3B | ~3B | 36 | 2560 | 40 |
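As a sanity check on the table, the usual Transformer rule of thumb of roughly 12 · layers · d² parameters (attention plus feed-forward weights, ignoring embeddings and biases) reproduces both counts:

```python
# Rough parameter estimates for the configurations in the table above.
CONFIGS = {
    "ESM2-650M": {"layers": 33, "d": 1280, "heads": 20},
    "ESM2-3B":   {"layers": 36, "d": 2560, "heads": 40},
}

def approx_params(layers: int, d: int) -> float:
    # ~12 * L * d^2: 4*d^2 for attention projections + 8*d^2 for the
    # 4x-wide feed-forward, per layer.
    return 12 * layers * d ** 2

for name, cfg in CONFIGS.items():
    print(f"{name}: ~{approx_params(cfg['layers'], cfg['d']) / 1e9:.2f}B")
# ESM2-650M: ~0.65B
# ESM2-3B: ~2.83B
```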

### Model Card Authors

Ben Lai

### Primary Contact Email

Ben Lai (ben.lai@czbiohub.org)

To submit feature requests or report issues with the model, please open an issue on [the GitHub repository](https://github.com/czbiohub-chi/DecoderTCR).

### System Requirements

- Compute Requirements: GPU

## Intended Use

### Primary Use Cases

The DecoderTCR models are designed for the following primary use cases:

1. **TCR-pMHC Binding Prediction**: Predict the interaction between T-cell receptors (TCRs) and peptide-MHC complexes
2. **Interaction Scoring**: Calculate interface energy scores for TCR-pMHC interactions
3. **Sequence Analysis**: Analyze TCR sequences and their interactions with specific peptides
4. **Immunology Research**: Support research in adaptive immunity, T-cell recognition, and antigen presentation

The models are particularly useful for:
- Identifying potential TCR-peptide binding pairs
- Screening TCR sequences for specific antigen recognition
- Understanding the molecular basis of T-cell recognition
- Supporting vaccine design and immunotherapy development
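The model card does not specify DecoderTCR's exact scoring formula. As one hedged sketch of the interaction-scoring use case, a score can be framed as a masked-language-model pseudo-log-likelihood: sum the log-probabilities the model assigns to the true residues at masked interface positions, so that a pairing the model "explains" better scores higher. All probabilities below are dummy values for illustration.

```python
import math

def pseudo_loglik(probs_at_true_residue: list[float]) -> float:
    """Sum of log P(true residue | context) over masked interface positions."""
    return sum(math.log(p) for p in probs_at_true_residue)

# Dummy per-position probabilities for two candidate TCR-peptide pairings.
pair_a = [0.31, 0.22, 0.40, 0.18]   # model reconstructs these residues well
pair_b = [0.05, 0.08, 0.03, 0.11]   # ...and these poorly

score_a, score_b = pseudo_loglik(pair_a), pseudo_loglik(pair_b)
print(score_a > score_b)  # the better-explained pairing scores higher: True
```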

### Out-of-Scope or Unauthorized Use Cases

Do not use the model for the following purposes:
- Use that violates applicable laws, regulations (including trade compliance laws), or third party rights such as privacy or intellectual property rights
- Any use that is prohibited by the [MIT license](https://github.com/czbiohub-chi/DecoderTCR/blob/main/LICENSE) and [Acceptable Use Policy](https://virtualcellmodels.cziscience.com/acceptable-use-policy).
- Clinical diagnosis or treatment decisions without proper validation
- Direct use in patient care without appropriate clinical validation and regulatory approval
- Use for purposes that could cause harm to individuals or groups

The models are research tools and should not be used as the sole basis for clinical or diagnostic decisions.

## Training Data

The models are trained on multiple large-scale protein sequence databases. The training data consists of:

- **TCR sequences**: Observed T cell receptor Space (OTS) for paired $\alpha/\beta$ TCR sequences.
- **Peptide-MHC sequences**: MHC Motif Atlas for peptide-MHC ligandomes and high-confidence synthetic interactions via MixMHCpred predictions.
- **Paired TCR-pMHC Interactions**: VDJdb for paired TCR-pMHC interaction data.


## Continual Pre-training Strategy

This model is trained using a **continual pre-training curriculum** that adapts a pretrained ESM2 backbone to new protein domains while preserving previously learned representations.

### Overview

Continual pre-training proceeds in **multiple stages**, each leveraging different data regimes and masking strategies:

- Stage 1 emphasizes **abundant marginal sequence data**, encouraging robust component-level representations.
- Stage 2 incorporates **scarcer, structured, or interaction-rich data**, refining conditional dependencies without overwriting earlier knowledge.

The architecture, tokenizer, and objective remain unchanged throughout training; only the data distribution and masking strategy evolve.

### Stage 1: Component-Level Adaptation

In the first stage, the model is further pretrained on large collections of unpaired or weakly structured protein sequences relevant to the target domain.

- **Objective:** Masked Language Modeling (MLM)
- **Masking:** Component- or region-aware masking schedules that upweight functionally relevant positions
- **Purpose:**  
  - Adapt the pretrained ESM2 representations to the target protein subspace  
  - Learn domain-specific sequence statistics while retaining general protein knowledge

This stage acts as a regularizer, anchoring learning in large-scale marginal data before introducing more complex dependencies.
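A minimal sketch of such a region-aware masking schedule, in which positions inside a functionally relevant region (e.g., a CDR loop) are masked more often than the rest of the sequence. The region bounds, the 15% base rate, and the 3x upweighting are illustrative assumptions, not DecoderTCR's actual schedule.

```python
import random

def region_aware_mask(seq_len: int, region: range,
                      base_rate: float = 0.15, upweight: float = 3.0,
                      rng=None) -> list[bool]:
    """Return a boolean mask; True marks positions selected for [MASK].
    Positions inside `region` are masked at base_rate * upweight."""
    rng = rng or random.Random(0)
    return [rng.random() < (base_rate * upweight if i in region else base_rate)
            for i in range(seq_len)]

mask = region_aware_mask(seq_len=40, region=range(10, 20))
print(sum(mask[10:20]), "masked inside the region,",
      sum(mask[:10]) + sum(mask[20:]), "outside")
```

On average the upweighted region is masked three times as often, concentrating the MLM signal on the positions the schedule deems functionally relevant.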

### Stage 2: Conditional / Interaction-Aware Refinement

In the second stage, the model is continually trained on **structured or paired sequences** that encode higher-order dependencies (e.g., interactions between protein regions or components).

- **Objective:** Masked Language Modeling (MLM)
- **Masking:** Joint masking across interacting regions to encourage cross-context conditioning
- **Purpose:**  
  - Refine conditional relationships learned from limited paired data  
  - Align representations across components without degrading Stage 1 task performance
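A minimal sketch of joint masking over a paired example: every interacting segment is masked within the same training example, so reconstructing a masked residue must draw on context from the other segments. The segment names, lengths, and rate are illustrative assumptions only.

```python
import random

def joint_mask(segments: dict, rate: float = 0.15, rng=None) -> dict:
    """Mask all interacting segments of one paired example with a shared rate,
    so cross-segment context is needed to reconstruct the masked residues."""
    rng = rng or random.Random(0)
    return {name: [rng.random() < rate for _ in range(length)]
            for name, length in segments.items()}

masks = joint_mask({"cdr3_alpha": 14, "cdr3_beta": 15, "peptide": 9},
                   rng=random.Random(0))
print({k: sum(v) for k, v in masks.items()})  # masked positions per segment
```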

## Biases, Risks, and Limitations

### Potential Biases

- The model may reflect biases present in the training data, including:
  - Overrepresentation of certain HLA alleles or peptide types
  - Limited diversity in TCR sequences from specific populations
  - Bias toward well-studied antigen systems
- Certain TCR clonotypes or peptide types may be underrepresented in training data

### Risks

Areas of risk may include but are not limited to:
- **Inaccurate predictions**: The model may produce incorrect binding predictions, especially for novel sequences or rare HLA-peptide combinations
- **Overconfidence**: The model may assign high confidence to predictions that are actually uncertain
- **Biological misinterpretation**: Users may misinterpret model outputs as definitive biological facts rather than predictions
- **Clinical misuse**: Use in clinical settings without proper validation could lead to incorrect treatment decisions

### Limitations

- **Sequence length**: The model has limitations on maximum sequence length (typically ~1024 tokens)
- **Novel sequences**: Performance may degrade on sequences very different from training data
- **HLA diversity**: Limited training data for rare HLA alleles may affect prediction accuracy
- **Context dependency**: The model may not capture all biological context (e.g., post-translational modifications, cellular environment)
- **Computational requirements**: GPU is recommended for optimal performance

### Caveats and Recommendations

- **Review and validate outputs**: Always review and validate model predictions, especially for critical applications
- **Experimental validation**: Model predictions should be validated experimentally before use in research or clinical contexts
- **Uncertainty awareness**: Be aware that predictions are probabilistic and may have uncertainty
- **Domain expertise**: Use the model in conjunction with domain expertise in immunology and T-cell biology
- **Version tracking**: Keep track of which model version and checkpoint you are using

We are committed to advancing the responsible development and use of artificial intelligence. Please follow our [Acceptable Use Policy](/acceptable-use-policy) when engaging with our services.

Should you have any security or privacy issues or questions related to the services, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com respectively.

## Acknowledgements

This model builds upon:
- **ESM2** by Meta AI (Facebook Research) for the base protein language model
- The broader computational biology and immunology research communities

Special thanks to the developers and contributors of the ESM models and the open-source tools that made this work possible.