---
license: mit
---
# DecoderTCR v0.1
DecoderTCR is a protein language model for T-cell receptor (TCR) and peptide-MHC complexes, based on the ESM2 model family.
For model code and additional information on installation and usage, please see [the associated GitHub repository](https://github.com/czbiohub-chi/DecoderTCR).
## Model Architecture
DecoderTCR is built on a Transformer-based protein language model (ESM2 family).
### Core Architecture
The model follows the **ESM2** architecture, a deep Transformer encoder designed for protein sequences.
#### Embedding Layer
- Token embedding dimension: *d* (e.g., 1280)
- Learned positional embeddings
- Vocabulary includes:
- 20 standard amino acids
- Special tokens (mask, padding, BOS/EOS, unknown)
#### Transformer Stack
- Number of layers: *L* (e.g., 33)
- Hidden dimension: *d*
- Number of attention heads: *h* (e.g., 20)
- Multi-head self-attention:
- Full-sequence, bidirectional attention
- Feed-forward network:
- Intermediate dimension ≈ 4× *d*
- Activation function: GELU
- Layer normalization: Pre-LayerNorm
- Residual connections around attention and feed-forward blocks
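The block structure above (pre-LayerNorm, bidirectional multi-head attention, a GELU feed-forward network with a 4× intermediate dimension, and residual connections around both sub-layers) can be sketched as follows. This is a minimal illustration in NumPy, not the actual DecoderTCR/ESM2 implementation: learned LayerNorm scale/shift, attention biases, and dropout are omitted, and toy dimensions are used in place of the real configuration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension (learned scale/shift omitted for brevity).
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def gelu(x):
    # Tanh approximation of GELU, common in Transformer implementations.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encoder_block(x, params):
    # Pre-LayerNorm: normalize *before* each sub-layer, add the residual after.
    d = x.shape[-1]
    n_heads = params["n_heads"]
    hd = d // n_heads
    # --- multi-head self-attention (full-sequence, bidirectional: no causal mask) ---
    h = layer_norm(x)
    q, k, v = h @ params["wq"], h @ params["wk"], h @ params["wv"]
    def split(t):  # (seq, d) -> (heads, seq, head_dim)
        return t.reshape(t.shape[0], n_heads, hd).transpose(1, 0, 2)
    qh, kh, vh = split(q), split(k), split(v)
    att = softmax(qh @ kh.transpose(0, 2, 1) / np.sqrt(hd))
    out = (att @ vh).transpose(1, 0, 2).reshape(-1, d) @ params["wo"]
    x = x + out                                    # residual around attention
    # --- feed-forward network with 4x intermediate dimension ---
    h = layer_norm(x)
    x = x + gelu(h @ params["w1"]) @ params["w2"]  # residual around FFN
    return x

# Toy dimensions (the 650M config would use d=1280, 20 heads, 33 layers).
rng = np.random.default_rng(0)
d, seq, heads = 64, 10, 4
params = {
    "n_heads": heads,
    "wq": rng.normal(0, 0.02, (d, d)), "wk": rng.normal(0, 0.02, (d, d)),
    "wv": rng.normal(0, 0.02, (d, d)), "wo": rng.normal(0, 0.02, (d, d)),
    "w1": rng.normal(0, 0.02, (d, 4 * d)), "w2": rng.normal(0, 0.02, (4 * d, d)),
}
x = rng.normal(size=(seq, d))
y = encoder_block(x, params)
assert y.shape == (seq, d)  # the block preserves the (sequence, hidden) shape
```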
### Continual Training Setup
The model is initialized from a pretrained **ESM2 checkpoint** and further trained via continual pretraining with MLM objectives.
### Model Scale (Example Configurations)
| Model Variant | Parameters | Layers | Hidden Dim | Attention Heads |
| --- | --- | --- | --- | --- |
| ESM2-650M | ~650M | 33 | 1280 | 20 |
| ESM2-3B | ~3B | 36 | 2560 | 40 |
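The parameter counts in the table follow from the architecture: each Transformer layer carries roughly 4·*d*² attention weights (Q, K, V, output) plus 8·*d*² feed-forward weights (4× intermediate dimension), i.e. about 12·*d*² per layer. A quick back-of-the-envelope check (ignoring embeddings, biases, and LayerNorm parameters):

```python
def approx_transformer_params(layers, d):
    """Rough count: ~4*d^2 (attention Q,K,V,O) + ~8*d^2 (FFN with 4x
    intermediate dim) = ~12*d^2 weights per layer; embeddings ignored."""
    return 12 * d * d * layers

print(f"ESM2-650M: ~{approx_transformer_params(33, 1280) / 1e6:.0f}M")  # ~649M
print(f"ESM2-3B:   ~{approx_transformer_params(36, 2560) / 1e9:.1f}B")  # ~2.8B
```

Both estimates land close to the nominal ~650M and ~3B figures in the table; the remainder is embeddings, biases, and the language-model head.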
### Model Card Authors
Ben Lai
### Primary Contact Email
Ben Lai (ben.lai@czbiohub.org)
To submit feature requests or report issues with the model, please open an issue on [the GitHub repository](https://github.com/czbiohub-chi/DecoderTCR).
### System Requirements
- Compute Requirements: GPU
## Intended Use
### Primary Use Cases
The DecoderTCR models are designed for the following primary use cases:
1. **TCR-pMHC Binding Prediction**: Predict the interaction between T-cell receptors (TCRs) and peptide-MHC complexes
2. **Interaction Scoring**: Calculate interface energy scores for TCR-pMHC interactions
3. **Sequence Analysis**: Analyze TCR sequences and their interactions with specific peptides
4. **Immunology Research**: Support research in adaptive immunity, T-cell recognition, and antigen presentation
The models are particularly useful for:
- Identifying potential TCR-peptide binding pairs
- Screening TCR sequences for specific antigen recognition
- Understanding the molecular basis of T-cell recognition
- Supporting vaccine design and immunotherapy development
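One common way to turn a masked language model into an interaction score is a pseudo-log-likelihood: mask each position in turn, read the model's log-probability of the true residue, and average, optionally over interface positions only. The sketch below illustrates just the scoring arithmetic with made-up numbers; it is not DecoderTCR's actual scoring function, and the per-position log-probabilities here are hypothetical placeholders for real MLM-head outputs.

```python
def pseudo_log_likelihood(log_probs, positions=None):
    """Average per-residue log-probability of the observed amino acids,
    optionally restricted to a subset of positions (e.g. the CDR3/peptide
    interface). Values closer to 0 suggest a more model-consistent pair."""
    if positions is None:
        positions = range(len(log_probs))
    vals = [log_probs[i] for i in positions]
    return sum(vals) / len(vals)

# Hypothetical per-position log-probs for a concatenated TCR+peptide input,
# as would come from masking each position in turn and reading the MLM head.
lp = [-0.3, -0.5, -2.1, -0.2, -1.8, -0.4]
full = pseudo_log_likelihood(lp)                         # whole sequence
interface = pseudo_log_likelihood(lp, positions=[2, 4])  # interface residues only
print(full, interface)
```

Scores computed this way are relative, not calibrated probabilities of binding, so they are best used to rank candidate TCR-peptide pairs rather than as absolute thresholds.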
### Out-of-Scope or Unauthorized Use Cases
Do not use the model for the following purposes:
- Use that violates applicable laws, regulations (including trade compliance laws), or third party rights such as privacy or intellectual property rights
- Any use that is prohibited by the [MIT license](https://github.com/czbiohub-chi/DecoderTCR/blob/main/LICENSE) and [Acceptable Use Policy](https://virtualcellmodels.cziscience.com/acceptable-use-policy).
- Clinical diagnosis or treatment decisions without proper validation
- Direct use in patient care without appropriate clinical validation and regulatory approval
- Use for purposes that could cause harm to individuals or groups
The models are research tools and should not be used as the sole basis for clinical or diagnostic decisions.
## Training Data
The models are trained with multi-component large-scale protein sequence databases. The training data consists of:
- **TCR sequences**: Observed TCR Space (OTS) for paired αβ TCR sequences.
- **Peptide-MHC sequences**: MHC Motif Atlas for peptide-MHC ligandomes, plus high-confidence synthetic interactions via MixMHCpred predictions.
- **Paired TCR-pMHC Interactions**: VDJdb for paired TCR-pMHC interaction data.
## Continual Pre-training Strategy
This model is trained using a **continual pre-training curriculum** that adapts a pretrained ESM2 backbone to new protein domains while preserving previously learned representations.
### Overview
Continual pre-training proceeds in **multiple stages**, each leveraging different data regimes and masking strategies:
- Stage 1 emphasizes **abundant marginal sequence data**, encouraging robust component-level representations.
- Stage 2 incorporates **scarcer, structured, or interaction-rich data**, refining conditional dependencies without overwriting earlier knowledge.
The architecture, tokenizer, and objective remain unchanged throughout training; only the data distribution and masking strategy evolve.
### Stage 1: Component-Level Adaptation
In the first stage, the model is further pretrained on large collections of unpaired or weakly structured protein sequences relevant to the target domain.
- **Objective:** Masked Language Modeling (MLM)
- **Masking:** Component- or region-aware masking schedules that upweight functionally relevant positions
- **Purpose:**
- Adapt the pretrained ESM2 representations to the target protein subspace
- Learn domain-specific sequence statistics while retaining general protein knowledge
This stage acts as a regularizer, anchoring learning in large-scale marginal data before introducing more complex dependencies.
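A region-aware MLM masking schedule of the kind described above can be sketched as follows. This follows the standard BERT/ESM-style 80/10/10 corruption recipe; the weighting scheme and rates here are illustrative assumptions, not DecoderTCR's actual schedule.

```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

def apply_mlm_masking(tokens, mask_rate=0.15, weights=None, seed=0):
    """MLM corruption sketch: select ~mask_rate of positions (optionally
    upweighting functionally relevant ones, e.g. CDR loops), then replace
    80% with <mask>, 10% with a random residue, and leave 10% unchanged.
    Selected positions are the prediction targets."""
    rng = random.Random(seed)
    n = len(tokens)
    if weights is None:
        weights = [1.0] * n
    total = sum(weights)
    targets, corrupted = [], list(tokens)
    for i, w in enumerate(weights):
        # Per-position masking probability, scaled so the expected number
        # of masked positions is mask_rate * n.
        p = mask_rate * n * w / total
        if rng.random() < min(p, 1.0):
            targets.append(i)
            r = rng.random()
            if r < 0.8:
                corrupted[i] = "<mask>"
            elif r < 0.9:
                corrupted[i] = rng.choice(AMINO_ACIDS)
            # else: keep the original token, but still predict it
    return corrupted, targets

seq = list("CASSLGQAYEQYF")  # an example CDR3 loop, for illustration
corrupted, targets = apply_mlm_masking(seq, seed=1)
assert len(corrupted) == len(seq)
```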
### Stage 2: Conditional / Interaction-Aware Refinement
In subsequent stages, the model is continually trained on **structured or paired sequences** that encode higher-order dependencies (e.g., interactions between protein regions or components).
- **Objective:** Masked Language Modeling (MLM)
- **Masking:** Joint masking across interacting regions to encourage cross-context conditioning
- **Purpose:**
- Refine conditional relationships learned from limited paired data
- Align representations across components without degrading Stage 1 task performance
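Joint masking across interacting regions can be sketched as below: mask positions are drawn from both regions of the same example, so reconstructing a residue on one side must condition on the (partially visible) context of the other. The region boundaries, rate, and example sequences are illustrative assumptions, not DecoderTCR's actual configuration.

```python
import random

def joint_mask(tokens, region_a, region_b, rate=0.15, seed=0):
    """Mask ~rate of positions in *both* interacting regions of one example.
    region_a / region_b are (start, end) half-open spans, e.g. a TCR CDR3
    loop and the peptide it is paired with."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), []
    for start, end in (region_a, region_b):
        span = range(start, end)
        k = max(1, round(rate * len(span)))  # at least one target per region
        for i in rng.sample(list(span), k):
            corrupted[i] = "<mask>"
            targets.append(i)
    return corrupted, sorted(targets)

# Hypothetical concatenated input: CDR3 (positions 0-12) + peptide (13-21).
seq = list("CASSLGQAYEQYF" + "GILGFVFTL")
corrupted, targets = joint_mask(seq, region_a=(0, 13), region_b=(13, 22), seed=0)
# Both regions contribute masked positions, enforcing cross-context conditioning.
```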
## Biases, Risks, and Limitations
### Potential Biases
- The model may reflect biases present in the training data, including:
- Overrepresentation of certain HLA alleles or peptide types
- Limited diversity in TCR sequences from specific populations
- Bias toward well-studied antigen systems
- Certain TCR clonotypes or peptide types may be underrepresented in training data
### Risks
Areas of risk may include but are not limited to:
- **Inaccurate predictions**: The model may produce incorrect binding predictions, especially for novel sequences or rare HLA-peptide combinations
- **Overconfidence**: The model may assign high confidence to predictions that are actually uncertain
- **Biological misinterpretation**: Users may misinterpret model outputs as definitive biological facts rather than predictions
- **Clinical misuse**: Use in clinical settings without proper validation could lead to incorrect treatment decisions
### Limitations
- **Sequence length**: The model has limitations on maximum sequence length (typically ~1024 tokens)
- **Novel sequences**: Performance may degrade on sequences very different from training data
- **HLA diversity**: Limited training data for rare HLA alleles may affect prediction accuracy
- **Context dependency**: The model may not capture all biological context (e.g., post-translational modifications, cellular environment)
- **Computational requirements**: GPU is recommended for optimal performance
### Caveats and Recommendations
- **Review and validate outputs**: Always review and validate model predictions, especially for critical applications
- **Experimental validation**: Model predictions should be validated experimentally before use in research or clinical contexts
- **Uncertainty awareness**: Be aware that predictions are probabilistic and may have uncertainty
- **Domain expertise**: Use the model in conjunction with domain expertise in immunology and T-cell biology
- **Version tracking**: Keep track of which model version and checkpoint you are using
We are committed to advancing the responsible development and use of artificial intelligence. Please follow our [Acceptable Use Policy](https://virtualcellmodels.cziscience.com/acceptable-use-policy) when engaging with our services.
Should you have any security or privacy issues or questions related to the services, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com respectively.
## Acknowledgements
This model builds upon:
- **ESM2** by Meta AI (Facebook Research) for the base protein language model
- The broader computational biology and immunology research communities
Special thanks to the developers and contributors of the ESM models and the open-source tools that made this work possible.