|
|
--- |
|
|
license: mit |
|
|
--- |
|
|
|
|
|
# DecoderTCR v0.1 |
|
|
DecoderTCR is a protein language model for T-cell receptor (TCR) and peptide-MHC complexes, based on the ESM2 model family.
|
|
|
|
|
For model code and additional information on installation and usage, please see [the associated GitHub repository](https://github.com/czbiohub-chi/DecoderTCR).
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
DecoderTCR is built on a Transformer-based protein language model (ESM2 family). |
|
|
|
|
|
### Core Architecture |
|
|
|
|
|
The model follows the **ESM2** architecture, a deep Transformer encoder designed for protein sequences. |
|
|
|
|
|
#### Embedding Layer |
|
|
- Token embedding dimension: *d* (e.g., 1280) |
|
|
- Learned positional embeddings |
|
|
- Vocabulary includes: |
|
|
- 20 standard amino acids |
|
|
- Special tokens (mask, padding, BOS/EOS, unknown) |
|
|
|
|
|
#### Transformer Stack |
|
|
- Number of layers: *L* (e.g., 33) |
|
|
- Hidden dimension: *d* |
|
|
- Number of attention heads: *h* (e.g., 20) |
|
|
- Multi-head self-attention: |
|
|
- Full-sequence, bidirectional attention |
|
|
- Feed-forward network: |
|
|
- Intermediate dimension ≈ 4× *d* |
|
|
- Activation function: GELU |
|
|
- Layer normalization: Pre-LayerNorm |
|
|
- Residual connections around attention and feed-forward blocks |
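
The configuration above can be checked directly against a public ESM2 checkpoint. The following is a minimal sketch, assuming the Hugging Face `transformers` library and the public `facebook/esm2_t33_650M_UR50D` checkpoint as a stand-in for the DecoderTCR backbone; it loads the model and prints the quantities *L*, *d*, and *h* referenced above.

```python
# Minimal sketch: inspect an ESM2-650M backbone's configuration with Hugging Face
# transformers. The checkpoint name is the public ESM2 release, used here only as a
# stand-in for the DecoderTCR backbone.
from transformers import AutoTokenizer, EsmModel

checkpoint = "facebook/esm2_t33_650M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = EsmModel.from_pretrained(checkpoint)

cfg = model.config
print("layers (L):", cfg.num_hidden_layers)              # 33
print("hidden dim (d):", cfg.hidden_size)                # 1280
print("attention heads (h):", cfg.num_attention_heads)   # 20
print("FFN dim (~4d):", cfg.intermediate_size)           # 5120
print("vocab size:", tokenizer.vocab_size)               # amino acids + special tokens
```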
|
|
|
|
|
### Continual Training Setup |
|
|
|
|
|
The model is initialized from a pretrained **ESM2 checkpoint** and further trained via continual pretraining with MLM objectives. |
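
As a hedged illustration of this setup (not the actual DecoderTCR training script), the sketch below initializes from a public ESM2 checkpoint and continues masked-language-model training on a plain-text file of domain sequences; the file name, batch size, and learning rate are placeholder assumptions.

```python
# Hedged sketch of continual MLM pretraining from an ESM2 checkpoint.
# "tcr_pmhc_sequences.txt" (one sequence per line) and all hyperparameters
# are illustrative placeholders, not the values used for DecoderTCR.
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          EsmForMaskedLM, Trainer, TrainingArguments)

checkpoint = "facebook/esm2_t33_650M_UR50D"           # pretrained ESM2 backbone
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = EsmForMaskedLM.from_pretrained(checkpoint)    # initialize from ESM2 weights

raw = load_dataset("text", data_files={"train": "tcr_pmhc_sequences.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
args = TrainingArguments(output_dir="decodertcr-continual",
                         per_device_train_batch_size=8,
                         learning_rate=1e-5,
                         num_train_epochs=1)
Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```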
|
|
|
|
|
|
|
|
### Model Scale (Example Configurations) |
|
|
|
|
|
| Model Variant | Parameters | Layers | Hidden Dim | Attention Heads | |
|
|
| --- | --- | --- | --- | --- | |
|
|
| ESM2-650M | ~650M | 33 | 1280 | 20 | |
|
|
| ESM2-3B | ~3B | 36 | 2560 | 40 | |
|
|
|
|
|
### Model Card Authors |
|
|
|
|
|
Ben Lai |
|
|
|
|
|
### Primary Contact Email |
|
|
|
|
|
Ben Lai (ben.lai@czbiohub.org)
|
|
To submit feature requests or report issues with the model, please open an issue on [the GitHub repository](https://github.com/czbiohub-chi/DecoderTCR). |
|
|
|
|
|
### System Requirements |
|
|
|
|
|
- Compute requirements: GPU (recommended for training and inference)
|
|
|
|
|
## Intended Use |
|
|
|
|
|
### Primary Use Cases |
|
|
|
|
|
The DecoderTCR models are designed for the following primary use cases: |
|
|
|
|
|
1. **TCR-pMHC Binding Prediction**: Predict the interaction between T-cell receptors (TCRs) and peptide-MHC complexes |
|
|
2. **Interaction Scoring**: Calculate interface energy scores for TCR-pMHC interactions |
|
|
3. **Sequence Analysis**: Analyze TCR sequences and their interactions with specific peptides |
|
|
4. **Immunology Research**: Support research in adaptive immunity, T-cell recognition, and antigen presentation |
|
|
|
|
|
The models are particularly useful for: |
|
|
- Identifying potential TCR-peptide binding pairs |
|
|
- Screening TCR sequences for specific antigen recognition |
|
|
- Understanding the molecular basis of T-cell recognition |
|
|
- Supporting vaccine design and immunotherapy development |
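
As a hedged starting point for the sequence-analysis use case above, the sketch below extracts mean-pooled embeddings for an example CDR3β and peptide from an ESM2-style checkpoint. The checkpoint name and sequences are illustrative placeholders; the actual DecoderTCR inputs and interaction-scoring interface are described in the GitHub repository.

```python
# Illustrative only: mean-pooled sequence embeddings from an ESM2-style checkpoint.
# The checkpoint and sequences are placeholders; DecoderTCR's own scoring interface
# is documented in the associated GitHub repository.
import torch
from transformers import AutoTokenizer, EsmModel

checkpoint = "facebook/esm2_t33_650M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = EsmModel.from_pretrained(checkpoint).eval()

sequences = {
    "tcr_beta_cdr3": "CASSIRSSYEQYF",   # example CDR3 beta sequence
    "peptide": "GILGFVFTL",             # example influenza M1 peptide
}

with torch.no_grad():
    for name, seq in sequences.items():
        batch = tokenizer(seq, return_tensors="pt")
        hidden = model(**batch).last_hidden_state   # shape: (1, len + 2, d)
        pooled = hidden[0, 1:-1].mean(dim=0)        # drop BOS/EOS, mean-pool residues
        print(name, tuple(pooled.shape))            # (1280,)
```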
|
|
|
|
|
### Out-of-Scope or Unauthorized Use Cases |
|
|
|
|
|
Do not use the model for the following purposes: |
|
|
- Use that violates applicable laws, regulations (including trade compliance laws), or third party rights such as privacy or intellectual property rights |
|
|
- Any use that is prohibited by the [MIT license](https://github.com/czbiohub-chi/DecoderTCR/blob/main/LICENSE) and [Acceptable Use Policy](https://virtualcellmodels.cziscience.com/acceptable-use-policy). |
|
|
- Clinical diagnosis or treatment decisions without proper validation |
|
|
- Direct use in patient care without appropriate clinical validation and regulatory approval |
|
|
- Use for purposes that could cause harm to individuals or groups |
|
|
|
|
|
The models are research tools and should not be used as the sole basis for clinical or diagnostic decisions. |
|
|
|
|
|
## Training Data |
|
|
|
|
|
The models are trained on multiple large-scale protein sequence databases. The training data consists of:
|
|
|
|
|
- **TCR sequences**: Observed T-cell receptor Space (OTS) for paired α/β TCR sequences.
|
|
- **Peptide-MHC sequences**: MHC Motif Atlas for peptide-MHC ligandomes, plus high-confidence synthetic interactions predicted with MixMHCpred.
|
|
- **Paired TCR-pMHC Interactions**: VDJdb for paired TCR-pMHC interaction data. |
|
|
|
|
|
|
|
|
## Continual Pre-training Strategy |
|
|
|
|
|
This model is trained using a **continual pre-training curriculum** that adapts a pretrained ESM2 backbone to new protein domains while preserving previously learned representations. |
|
|
|
|
|
### Overview |
|
|
|
|
|
Continual pre-training proceeds in **multiple stages**, each leveraging different data regimes and masking strategies: |
|
|
|
|
|
- Stage 1 emphasizes **abundant marginal sequence data**, encouraging robust component-level representations.
|
|
- Stage 2 incorporates **scarcer, structured, or interaction-rich data**, refining conditional dependencies without overwriting earlier knowledge.
|
|
|
|
|
The architecture, tokenizer, and objective remain unchanged throughout training; only the data distribution and masking strategy evolve. |
|
|
|
|
|
### Stage 1: Component-Level Adaptation |
|
|
|
|
|
In the first stage, the model is further pretrained on large collections of unpaired or weakly structured protein sequences relevant to the target domain. |
|
|
|
|
|
- **Objective:** Masked Language Modeling (MLM) |
|
|
- **Masking:** Component- or region-aware masking schedules that upweight functionally relevant positions |
|
|
- **Purpose:** |
|
|
- Adapt the pretrained ESM2 representations to the target protein subspace |
|
|
- Learn domain-specific sequence statistics while retaining general protein knowledge |
|
|
|
|
|
This stage acts as a regularizer, anchoring learning in large-scale marginal data before introducing more complex dependencies. |
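
A minimal sketch of what a component-aware masking schedule could look like is shown below; the region annotations and masking probabilities are assumptions for illustration, not the schedule used to train DecoderTCR.

```python
# Hedged sketch of region-aware MLM masking: positions flagged as functionally
# relevant (e.g., CDR loops) are masked with a higher probability than the rest.
# The probabilities and region annotation are illustrative assumptions.
import torch

def region_aware_mask(input_ids, region_mask, mask_token_id,
                      p_region=0.25, p_background=0.10):
    """input_ids: (seq_len,) token ids; region_mask: (seq_len,) bool, True inside regions."""
    probs = torch.where(region_mask,
                        torch.full(input_ids.shape, p_region),
                        torch.full(input_ids.shape, p_background))
    selected = torch.bernoulli(probs).bool()

    labels = input_ids.clone()
    labels[~selected] = -100                 # ignore unmasked positions in the MLM loss

    corrupted = input_ids.clone()
    corrupted[selected] = mask_token_id      # replace selected positions with <mask>
    return corrupted, labels
```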
|
|
|
|
|
### Stage 2: Conditional / Interaction-Aware Refinement |
|
|
|
|
|
In subsequent stages, the model is continually trained on **structured or paired sequences** that encode higher-order dependencies (e.g., interactions between protein regions or components). |
|
|
|
|
|
- **Objective:** Masked Language Modeling (MLM) |
|
|
- **Masking:** Joint masking across interacting regions to encourage cross-context conditioning |
|
|
- **Purpose:** |
|
|
- Refine conditional relationships learned from limited paired data |
|
|
- Align representations across components without degrading Stage 1 task performance |
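
One plausible reading of this joint-masking strategy is sketched below: for a paired example, positions are masked on one component while the partner component is left fully visible, so the model must predict masked residues conditioned on the cross-context. The segment convention and masking rate are illustrative assumptions.

```python
# Hedged sketch of cross-context masking for paired inputs. segment_ids marks which
# component each token belongs to (0 = TCR, 1 = pMHC, an illustrative convention);
# only one side is corrupted, so predictions must condition on the other.
import torch

def cross_context_mask(input_ids, segment_ids, mask_token_id, p_mask=0.15):
    """input_ids, segment_ids: (seq_len,) tensors for one paired example."""
    target = int(torch.randint(0, 2, (1,)))                  # choose which side to mask
    candidates = segment_ids == target
    selected = candidates & (torch.rand(input_ids.shape) < p_mask)

    labels = input_ids.clone()
    labels[~selected] = -100                  # loss only on masked positions

    corrupted = input_ids.clone()
    corrupted[selected] = mask_token_id       # partner component remains fully visible
    return corrupted, labels
```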
|
|
|
|
|
## Biases, Risks, and Limitations |
|
|
|
|
|
### Potential Biases |
|
|
|
|
|
- The model may reflect biases present in the training data, including: |
|
|
- Overrepresentation of certain HLA alleles or peptide types |
|
|
- Limited diversity in TCR sequences from specific populations |
|
|
- Bias toward well-studied antigen systems |
|
|
- Certain TCR clonotypes or peptide types may be underrepresented in training data |
|
|
|
|
|
### Risks |
|
|
|
|
|
Areas of risk may include but are not limited to: |
|
|
- **Inaccurate predictions**: The model may produce incorrect binding predictions, especially for novel sequences or rare HLA-peptide combinations |
|
|
- **Overconfidence**: The model may assign high confidence to predictions that are actually uncertain |
|
|
- **Biological misinterpretation**: Users may misinterpret model outputs as definitive biological facts rather than predictions |
|
|
- **Clinical misuse**: Use in clinical settings without proper validation could lead to incorrect treatment decisions |
|
|
|
|
|
### Limitations |
|
|
|
|
|
- **Sequence length**: The model has limitations on maximum sequence length (typically ~1024 tokens) |
|
|
- **Novel sequences**: Performance may degrade on sequences very different from training data |
|
|
- **HLA diversity**: Limited training data for rare HLA alleles may affect prediction accuracy |
|
|
- **Context dependency**: The model may not capture all biological context (e.g., post-translational modifications, cellular environment) |
|
|
- **Computational requirements**: GPU is recommended for optimal performance |
|
|
|
|
|
### Caveats and Recommendations |
|
|
|
|
|
- **Review and validate outputs**: Always review and validate model predictions, especially for critical applications |
|
|
- **Experimental validation**: Model predictions should be validated experimentally before use in research or clinical contexts |
|
|
- **Uncertainty awareness**: Be aware that predictions are probabilistic and carry inherent uncertainty
|
|
- **Domain expertise**: Use the model in conjunction with domain expertise in immunology and T-cell biology |
|
|
- **Version tracking**: Keep track of which model version and checkpoint you are using |
|
|
|
|
|
We are committed to advancing the responsible development and use of artificial intelligence. Please follow our [Acceptable Use Policy](https://virtualcellmodels.cziscience.com/acceptable-use-policy) when engaging with our services.
|
|
|
|
|
Should you have any security or privacy issues or questions related to the services, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com respectively. |
|
|
|
|
|
## Acknowledgements |
|
|
|
|
|
This model builds upon: |
|
|
- **ESM2** by Meta AI (Facebook Research) for the base protein language model |
|
|
- The broader computational biology and immunology research communities |
|
|
|
|
|
Special thanks to the developers and contributors of the ESM models and the open-source tools that made this work possible. |
|
|
|