---
license: mit
---

# DecoderTCR v0.1
DecoderTCR is a protein language model for T-cell receptor (TCR) and peptide-MHC (pMHC) complexes, based on the ESM2 model family.

For model code and additional information on installation and usage, please see [the associated GitHub repository](https://github.com/czbiohub-chi/DecoderTCR).

## Model Architecture

DecoderTCR is built on a Transformer-based protein language model (ESM2 family).

### Core Architecture

The model follows the **ESM2** architecture, a deep Transformer encoder designed for protein sequences.

#### Embedding Layer
- Token embedding dimension: *d* (e.g., 1280)
- Learned positional embeddings
- Vocabulary includes:
  - 20 standard amino acids
  - Special tokens (mask, padding, BOS/EOS, unknown)

#### Transformer Stack
- Number of layers: *L* (e.g., 33)
- Hidden dimension: *d*
- Number of attention heads: *h* (e.g., 20)
- Multi-head self-attention:
  - Full-sequence, bidirectional attention
- Feed-forward network:
  - Intermediate dimension ≈ 4× *d*
  - Activation function: GELU
- Layer normalization: Pre-LayerNorm
- Residual connections around attention and feed-forward blocks
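As a rough illustration of the block structure described above, here is a minimal pre-LayerNorm encoder block in PyTorch. This is a simplified sketch, not the actual ESM2 or DecoderTCR implementation (it omits positional embeddings, dropout, and other ESM2-specific details); the small dimensions in the usage example are for a quick shape check only.

```python
import torch
import torch.nn as nn

class PreLNEncoderBlock(nn.Module):
    """One pre-LN Transformer encoder block: LayerNorm before each sub-layer,
    multi-head self-attention, a 4x-wide GELU feed-forward, and residuals."""

    def __init__(self, d: int = 1280, h: int = 20):
        super().__init__()
        self.norm1 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, h, batch_first=True)
        self.norm2 = nn.LayerNorm(d)
        self.ffn = nn.Sequential(
            nn.Linear(d, 4 * d),   # intermediate dim ~= 4x hidden dim
            nn.GELU(),
            nn.Linear(4 * d, d),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-LN: normalize, apply the sub-layer, then add the residual.
        a = self.norm1(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        x = x + self.ffn(self.norm2(x))
        return x

block = PreLNEncoderBlock(d=64, h=4)   # tiny dims for a quick check
out = block(torch.randn(2, 10, 64))    # (batch, seq_len, hidden)
print(out.shape)                       # torch.Size([2, 10, 64])
```

ESM2-650M corresponds to the defaults above (`d=1280`, `h=20`) stacked 33 times.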

### Continual Training Setup

The model is initialized from a pretrained **ESM2 checkpoint** and further trained via continual pretraining with MLM objectives.


### Model Scale (Example Configurations)

| Model Variant | Parameters | Layers | Hidden Dim | Attention Heads |
| --- | --- | --- | --- | --- |
| ESM2-650M | ~650M | 33 | 1280 | 20 |
| ESM2-3B | ~3B | 36 | 2560 | 40 |
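As a sanity check on the table, the usual Transformer rule of thumb of roughly 12 · layers · d² parameters (attention plus feed-forward weights, ignoring embeddings and biases) reproduces both counts:

```python
# Rough parameter estimates for the configurations in the table above.
CONFIGS = {
    "ESM2-650M": {"layers": 33, "d": 1280, "heads": 20},
    "ESM2-3B":   {"layers": 36, "d": 2560, "heads": 40},
}

def approx_params(layers: int, d: int) -> float:
    # ~12 * L * d^2: 4*d^2 for attention projections + 8*d^2 for the
    # 4x-wide feed-forward, per layer.
    return 12 * layers * d ** 2

for name, cfg in CONFIGS.items():
    print(f"{name}: ~{approx_params(cfg['layers'], cfg['d']) / 1e9:.2f}B")
# ESM2-650M: ~0.65B
# ESM2-3B: ~2.83B
```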

### Model Card Authors

Ben Lai

### Primary Contact Email

Ben Lai (ben.lai@czbiohub.org)

To submit feature requests or report issues with the model, please open an issue on [the GitHub repository](https://github.com/czbiohub-chi/DecoderTCR).

### System Requirements

- Compute Requirements: GPU

## Intended Use

### Primary Use Cases

The DecoderTCR models are designed for the following primary use cases:

1. **TCR-pMHC Binding Prediction**: Predict the interaction between T-cell receptors (TCRs) and peptide-MHC complexes
2. **Interaction Scoring**: Calculate interface energy scores for TCR-pMHC interactions
3. **Sequence Analysis**: Analyze TCR sequences and their interactions with specific peptides
4. **Immunology Research**: Support research in adaptive immunity, T-cell recognition, and antigen presentation

The models are particularly useful for:
- Identifying potential TCR-peptide binding pairs
- Screening TCR sequences for specific antigen recognition
- Understanding the molecular basis of T-cell recognition
- Supporting vaccine design and immunotherapy development
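The model card does not specify DecoderTCR's exact scoring formula. As one hedged sketch of the interaction-scoring use case, a score can be framed as a masked-language-model pseudo-log-likelihood: sum the log-probabilities the model assigns to the true residues at masked interface positions, so that a pairing the model "explains" better scores higher. All probabilities below are dummy values for illustration.

```python
import math

def pseudo_loglik(probs_at_true_residue: list[float]) -> float:
    """Sum of log P(true residue | context) over masked interface positions."""
    return sum(math.log(p) for p in probs_at_true_residue)

# Dummy per-position probabilities for two candidate TCR-peptide pairings.
pair_a = [0.31, 0.22, 0.40, 0.18]   # model reconstructs these residues well
pair_b = [0.05, 0.08, 0.03, 0.11]   # ...and these poorly

score_a, score_b = pseudo_loglik(pair_a), pseudo_loglik(pair_b)
print(score_a > score_b)  # the better-explained pairing scores higher: True
```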

### Out-of-Scope or Unauthorized Use Cases

Do not use the model for the following purposes:
- Use that violates applicable laws, regulations (including trade compliance laws), or third party rights such as privacy or intellectual property rights
- Any use that is prohibited by the [MIT license](https://github.com/czbiohub-chi/DecoderTCR/blob/main/LICENSE) and [Acceptable Use Policy](https://virtualcellmodels.cziscience.com/acceptable-use-policy).
- Clinical diagnosis or treatment decisions without proper validation
- Direct use in patient care without appropriate clinical validation and regulatory approval
- Use for purposes that could cause harm to individuals or groups

The models are research tools and should not be used as the sole basis for clinical or diagnostic decisions.

## Training Data

The models are trained on multiple large-scale protein sequence databases. The training data consists of:

- **TCR sequences**: Observed T cell receptor Space (OTS) for paired $\alpha/\beta$ TCR sequences.
- **Peptide-MHC sequences**: MHC Motif Atlas for peptide-MHC ligandomes and high-confidence synthetic interactions via MixMHCpred predictions.
- **Paired TCR-pMHC Interactions**: VDJdb for paired TCR-pMHC interaction data.


## Continual Pre-training Strategy

This model is trained using a **continual pre-training curriculum** that adapts a pretrained ESM2 backbone to new protein domains while preserving previously learned representations.

### Overview

Continual pre-training proceeds in **multiple stages**, each leveraging different data regimes and masking strategies:

- Stage 1 emphasizes **abundant marginal sequence data**, encouraging robust component-level representations.
- Stage 2 incorporates **scarcer, structured, or interaction-rich data**, refining conditional dependencies without overwriting earlier knowledge.

The architecture, tokenizer, and objective remain unchanged throughout training; only the data distribution and masking strategy evolve.

### Stage 1: Component-Level Adaptation

In the first stage, the model is further pretrained on large collections of unpaired or weakly structured protein sequences relevant to the target domain.

- **Objective:** Masked Language Modeling (MLM)
- **Masking:** Component- or region-aware masking schedules that upweight functionally relevant positions
- **Purpose:**  
  - Adapt the pretrained ESM2 representations to the target protein subspace  
  - Learn domain-specific sequence statistics while retaining general protein knowledge

This stage acts as a regularizer, anchoring learning in large-scale marginal data before introducing more complex dependencies.
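A minimal sketch of such a region-aware masking schedule, in which positions inside a functionally relevant region (e.g., a CDR loop) are masked more often than the rest of the sequence. The region bounds, the 15% base rate, and the 3x upweighting are illustrative assumptions, not DecoderTCR's actual schedule.

```python
import random

def region_aware_mask(seq_len: int, region: range,
                      base_rate: float = 0.15, upweight: float = 3.0,
                      rng=None) -> list[bool]:
    """Return a boolean mask; True marks positions selected for [MASK].
    Positions inside `region` are masked at base_rate * upweight."""
    rng = rng or random.Random(0)
    return [rng.random() < (base_rate * upweight if i in region else base_rate)
            for i in range(seq_len)]

mask = region_aware_mask(seq_len=40, region=range(10, 20))
print(sum(mask[10:20]), "masked inside the region,",
      sum(mask[:10]) + sum(mask[20:]), "outside")
```

On average the upweighted region is masked three times as often, concentrating the MLM signal on the positions the schedule deems functionally relevant.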

### Stage 2: Conditional / Interaction-Aware Refinement

In the second stage, the model is continually trained on **structured or paired sequences** that encode higher-order dependencies (e.g., interactions between protein regions or components).

- **Objective:** Masked Language Modeling (MLM)
- **Masking:** Joint masking across interacting regions to encourage cross-context conditioning
- **Purpose:**  
  - Refine conditional relationships learned from limited paired data  
  - Align representations across components without degrading Stage 1 task performance
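A minimal sketch of joint masking over a paired example: every interacting segment is masked within the same training example, so reconstructing a masked residue must draw on context from the other segments. The segment names, lengths, and rate are illustrative assumptions only.

```python
import random

def joint_mask(segments: dict, rate: float = 0.15, rng=None) -> dict:
    """Mask all interacting segments of one paired example with a shared rate,
    so cross-segment context is needed to reconstruct the masked residues."""
    rng = rng or random.Random(0)
    return {name: [rng.random() < rate for _ in range(length)]
            for name, length in segments.items()}

masks = joint_mask({"cdr3_alpha": 14, "cdr3_beta": 15, "peptide": 9},
                   rng=random.Random(0))
print({k: sum(v) for k, v in masks.items()})  # masked positions per segment
```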

## Biases, Risks, and Limitations

### Potential Biases

- The model may reflect biases present in the training data, including:
  - Overrepresentation of certain HLA alleles or peptide types
  - Limited diversity in TCR sequences from specific populations
  - Bias toward well-studied antigen systems
- Certain TCR clonotypes or peptide types may be underrepresented in training data

### Risks

Areas of risk may include but are not limited to:
- **Inaccurate predictions**: The model may produce incorrect binding predictions, especially for novel sequences or rare HLA-peptide combinations
- **Overconfidence**: The model may assign high confidence to predictions that are actually uncertain
- **Biological misinterpretation**: Users may misinterpret model outputs as definitive biological facts rather than predictions
- **Clinical misuse**: Use in clinical settings without proper validation could lead to incorrect treatment decisions

### Limitations

- **Sequence length**: The model has limitations on maximum sequence length (typically ~1024 tokens)
- **Novel sequences**: Performance may degrade on sequences very different from training data
- **HLA diversity**: Limited training data for rare HLA alleles may affect prediction accuracy
- **Context dependency**: The model may not capture all biological context (e.g., post-translational modifications, cellular environment)
- **Computational requirements**: GPU is recommended for optimal performance

### Caveats and Recommendations

- **Review and validate outputs**: Always review and validate model predictions, especially for critical applications
- **Experimental validation**: Model predictions should be validated experimentally before use in research or clinical contexts
- **Uncertainty awareness**: Be aware that predictions are probabilistic and may have uncertainty
- **Domain expertise**: Use the model in conjunction with domain expertise in immunology and T-cell biology
- **Version tracking**: Keep track of which model version and checkpoint you are using

We are committed to advancing the responsible development and use of artificial intelligence. Please follow our [Acceptable Use Policy](/acceptable-use-policy) when engaging with our services.

Should you have any security or privacy issues or questions related to the services, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com respectively.

## Acknowledgements

This model builds upon:
- **ESM2** by Meta AI (Facebook Research) for the base protein language model
- The broader computational biology and immunology research communities

Special thanks to the developers and contributors of the ESM models and the open-source tools that made this work possible.