---
license: mit
---
# DecoderTCR v0.1
DecoderTCR is a protein language model for T-cell receptor (TCR) and peptide-MHC (pMHC) complexes. The model is based on the ESM2 model family.
For model code and additional information on installation and usage, please see [the associated GitHub repository](https://github.com/czbiohub-chi/DecoderTCR).
## Model Architecture
DecoderTCR is built on a Transformer-based protein language model (ESM2 family).
### Core Architecture
The model follows the **ESM2** architecture, a deep Transformer encoder designed for protein sequences.
#### Embedding Layer
- Token embedding dimension: *d* (e.g., 1280)
- Rotary positional embeddings (RoPE), as used in ESM2
- Vocabulary includes:
  - 20 standard amino acids
  - Special tokens (mask, padding, BOS/EOS, unknown)
#### Transformer Stack
- Number of layers: *L* (e.g., 33)
- Hidden dimension: *d*
- Number of attention heads: *h* (e.g., 20)
- Multi-head self-attention:
  - Full-sequence, bidirectional attention
- Feed-forward network:
  - Intermediate dimension ≈ 4× *d*
  - Activation function: GELU
- Layer normalization: Pre-LayerNorm
- Residual connections around attention and feed-forward blocks
### Continual Training Setup
The model is initialized from a pretrained **ESM2 checkpoint** and further trained via continual pretraining with MLM objectives.
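As a rough illustration, the sketch below continues MLM training from a public ESM2 checkpoint using the Hugging Face `transformers` Trainer. The checkpoint name, toy sequences, and hyperparameters are illustrative assumptions; the actual training code lives in the GitHub repository.

```python
# Continual MLM pretraining sketch (illustrative; not the DecoderTCR training code).
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    EsmForMaskedLM,
    Trainer,
    TrainingArguments,
)

# Initialize from a pretrained ESM2 checkpoint (650M variant shown as an example).
model_name = "facebook/esm2_t33_650M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = EsmForMaskedLM.from_pretrained(model_name)

# Toy in-memory corpus standing in for the domain-specific sequence data.
sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "GILGFVFTL"]
dataset = Dataset.from_dict({"text": sequences}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=["text"],
)

# Standard 15% BERT-style masking; the staged curriculum described below
# would swap in region-aware masking instead.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="decodertcr-continual", num_train_epochs=1),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```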
### Model Scale (Example Configurations)
| Model Variant | Parameters | Layers | Hidden Dim | Attention Heads |
| --- | --- | --- | --- | --- |
| ESM2-650M | ~650M | 33 | 1280 | 20 |
| ESM2-3B | ~3B | 36 | 2560 | 40 |
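The 650M row can be checked against the public ESM2 release with `transformers` (the checkpoint below is Meta AI's release, not the DecoderTCR weights):

```python
# Inspect the ESM2-650M configuration to confirm the dimensions above.
from transformers import EsmConfig

config = EsmConfig.from_pretrained("facebook/esm2_t33_650M_UR50D")
print(config.num_hidden_layers)    # 33    (L)
print(config.hidden_size)          # 1280  (d)
print(config.num_attention_heads)  # 20    (h)
print(config.intermediate_size)    # 5120  (4 * d feed-forward dimension)
```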
### Model Card Authors
Ben Lai
### Primary Contact Email
Ben Lai (ben.lai@czbiohub.org)
To submit feature requests or report issues with the model, please open an issue on [the GitHub repository](https://github.com/czbiohub-chi/DecoderTCR).
### System Requirements
- Compute Requirements: GPU
## Intended Use
### Primary Use Cases
The DecoderTCR models are designed for the following primary use cases:
1. **TCR-pMHC Binding Prediction**: Predict the interaction between T-cell receptors (TCRs) and peptide-MHC complexes
2. **Interaction Scoring**: Calculate interface energy scores for TCR-pMHC interactions
3. **Sequence Analysis**: Analyze TCR sequences and their interactions with specific peptides
4. **Immunology Research**: Support research in adaptive immunity, T-cell recognition, and antigen presentation
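As a generic illustration of use cases 1 and 2 above, the sketch below scores a concatenated TCR-peptide pair by masked-marginal pseudo-log-likelihood with a public ESM2 checkpoint. The concatenation scheme, sequences, and checkpoint are assumptions for illustration only; DecoderTCR's actual interface-energy scoring utilities are documented in the GitHub repository.

```python
# Generic masked-marginal scoring sketch (not DecoderTCR's scoring method).
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

model_name = "facebook/esm2_t33_650M_UR50D"  # placeholder, not DecoderTCR weights
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = EsmForMaskedLM.from_pretrained(model_name).eval()

def pseudo_log_likelihood(sequence: str) -> float:
    """Average log-probability of each residue when it is masked out."""
    ids = tokenizer(sequence, return_tensors="pt")["input_ids"]
    total, n = 0.0, ids.shape[1] - 2           # exclude BOS/EOS tokens
    for pos in range(1, ids.shape[1] - 1):
        masked = ids.clone()
        masked[0, pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked).logits
        total += torch.log_softmax(logits[0, pos], dim=-1)[ids[0, pos]].item()
    return total / n

# Hypothetical pair: a TCR beta-chain CDR3 and a peptide, naively concatenated.
print(pseudo_log_likelihood("CASSIRSSYEQYF" + "GILGFVFTL"))
```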
The models are particularly useful for:
- Identifying potential TCR-peptide binding pairs
- Screening TCR sequences for specific antigen recognition
- Understanding the molecular basis of T-cell recognition
- Supporting vaccine design and immunotherapy development
### Out-of-Scope or Unauthorized Use Cases
Do not use the model for the following purposes:
- Use that violates applicable laws, regulations (including trade compliance laws), or third party rights such as privacy or intellectual property rights
- Any use that is prohibited by the [MIT license](https://github.com/czbiohub-chi/DecoderTCR/blob/main/LICENSE) and [Acceptable Use Policy](https://virtualcellmodels.cziscience.com/acceptable-use-policy).
- Clinical diagnosis or treatment decisions without proper validation
- Direct use in patient care without appropriate clinical validation and regulatory approval
- Use for purposes that could cause harm to individuals or groups
The models are research tools and should not be used as the sole basis for clinical or diagnostic decisions.
## Training Data
The models are trained on a combination of large-scale protein sequence databases. The training data consists of:
- **TCR sequences**: Observed T-cell receptor Space (OTS) for paired $\alpha/\beta$ TCR sequences.
- **Peptide-MHC sequences**: MHC Motif Atlas for peptide-MHC ligandomes and high-confidence synthetic interactions via MixMHCpred predictions.
- **Paired TCR-pMHC Interactions**: VDJdb for paired TCR-pMHC interaction data.
## Continual Pre-training Strategy
This model is trained using a **continual pre-training curriculum** that adapts a pretrained ESM2 backbone to new protein domains while preserving previously learned representations.
### Overview
Continual pre-training proceeds in **multiple stages**, each leveraging different data regimes and masking strategies:
- Stage 1 emphasizes **abundant marginal sequence data**, encouraging robust component-level representations.
- Stage 2 incorporates **scarcer, structured, or interaction-rich data**, refining conditional dependencies without overwriting earlier knowledge.
The architecture, tokenizer, and objective remain unchanged throughout training; only the data distribution and masking strategy evolve.
### Stage 1: Component-Level Adaptation
In the first stage, the model is further pretrained on large collections of unpaired or weakly structured protein sequences relevant to the target domain.
- **Objective:** Masked Language Modeling (MLM)
- **Masking:** Component- or region-aware masking schedules that upweight functionally relevant positions
- **Purpose:**
  - Adapt the pretrained ESM2 representations to the target protein subspace
  - Learn domain-specific sequence statistics while retaining general protein knowledge
This stage acts as a regularizer, anchoring learning in large-scale marginal data before introducing more complex dependencies.
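A minimal sketch of such a region-aware masking schedule, assuming hypothetical region annotations (e.g., CDR loop spans) and illustrative masking rates:

```python
# Region-aware masking sketch: annotated spans are masked at a higher rate.
import torch

def region_aware_mask(num_tokens: int, regions: list[tuple[int, int]],
                      base_rate: float = 0.10, region_rate: float = 0.30) -> torch.Tensor:
    """Sample a boolean mask with per-position, region-dependent rates."""
    rates = torch.full((num_tokens,), base_rate)
    for start, end in regions:
        rates[start:end] = region_rate     # upweight functionally relevant spans
    return torch.bernoulli(rates).bool()

# Example: a 120-residue TCR chain with hypothetical CDR1/2/3 spans.
mask = region_aware_mask(120, regions=[(26, 32), (49, 55), (95, 107)])
```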
### Stage 2: Conditional / Interaction-Aware Refinement
In the second stage, the model is continually trained on **structured or paired sequences** that encode higher-order dependencies (e.g., interactions between protein regions or components).
- **Objective:** Masked Language Modeling (MLM)
- **Masking:** Joint masking across interacting regions to encourage cross-context conditioning
- **Purpose:**
  - Refine conditional relationships learned from limited paired data
  - Align representations across components without degrading Stage 1 task performance
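To make the joint masking described above concrete, here is a minimal sketch that masks positions inside both interacting regions of a paired example at once, so reconstructing one side requires attending to the other. The spans and rate are illustrative assumptions:

```python
# Joint interface-masking sketch for paired TCR-pMHC examples.
import torch

def joint_interface_mask(num_tokens: int, tcr_span: tuple[int, int],
                         pep_span: tuple[int, int], rate: float = 0.25) -> torch.Tensor:
    """Mask positions inside both interacting regions of the same example."""
    mask = torch.zeros(num_tokens, dtype=torch.bool)
    for start, end in (tcr_span, pep_span):
        mask[start:end] = torch.rand(end - start) < rate  # sample within each region
    return mask

# Example: hypothetical CDR3 and peptide spans in a concatenated input.
mask = joint_interface_mask(200, tcr_span=(95, 107), pep_span=(150, 159))
```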
## Biases, Risks, and Limitations
### Potential Biases
- The model may reflect biases present in the training data, including:
  - Overrepresentation of certain HLA alleles or peptide types
  - Limited diversity in TCR sequences from specific populations
  - Bias toward well-studied antigen systems
- Certain TCR clonotypes or peptide types may be underrepresented in training data
### Risks
Areas of risk may include but are not limited to:
- **Inaccurate predictions**: The model may produce incorrect binding predictions, especially for novel sequences or rare HLA-peptide combinations
- **Overconfidence**: The model may assign high confidence to predictions that are actually uncertain
- **Biological misinterpretation**: Users may misinterpret model outputs as definitive biological facts rather than predictions
- **Clinical misuse**: Use in clinical settings without proper validation could lead to incorrect treatment decisions
### Limitations
- **Sequence length**: The model has a maximum input sequence length (typically ~1024 tokens)
- **Novel sequences**: Performance may degrade on sequences very different from training data
- **HLA diversity**: Limited training data for rare HLA alleles may affect prediction accuracy
- **Context dependency**: The model may not capture all biological context (e.g., post-translational modifications, cellular environment)
- **Computational requirements**: GPU is recommended for optimal performance
### Caveats and Recommendations
- **Review and validate outputs**: Always review and validate model predictions, especially for critical applications
- **Experimental validation**: Model predictions should be validated experimentally before use in research or clinical contexts
- **Uncertainty awareness**: Be aware that predictions are probabilistic and may have uncertainty
- **Domain expertise**: Use the model in conjunction with domain expertise in immunology and T-cell biology
- **Version tracking**: Keep track of which model version and checkpoint you are using
We are committed to advancing the responsible development and use of artificial intelligence. Please follow our [Acceptable Use Policy](https://virtualcellmodels.cziscience.com/acceptable-use-policy) when engaging with our services.
Should you have any security or privacy issues or questions related to the services, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com respectively.
## Acknowledgements
This model builds upon:
- **ESM2** by Meta AI (Facebook Research) for the base protein language model
- The broader computational biology and immunology research communities
Special thanks to the developers and contributors of the ESM models and the open-source tools that made this work possible.