---
license: mit
---

## Model Architecture

DecoderTCR is built on a Transformer-based protein language model (ESM2 family).

### Core Architecture

The model follows the **ESM2** architecture, a deep Transformer encoder designed for protein sequences.

#### Embedding Layer
- Token embedding dimension: *d* (e.g., 1280)
- Rotary positional embeddings (RoPE), as in ESM2
- Vocabulary includes:
  - 20 standard amino acids
  - Special tokens (mask, padding, BOS/EOS, unknown)

#### Transformer Stack
- Number of layers: *L* (e.g., 33)
- Hidden dimension: *d*
- Number of attention heads: *h* (e.g., 20)
- Multi-head self-attention:
  - Full-sequence, bidirectional attention
- Feed-forward network:
  - Intermediate dimension ≈ 4× *d*
- Activation function: GELU
- Layer normalization: Pre-LayerNorm
- Residual connections around attention and feed-forward blocks

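As a concrete reference for the configuration above, the sketch below loads a public ESM2 checkpoint with the Hugging Face `transformers` library and prints the corresponding *L*, *d*, and *h* values. The checkpoint name is an illustrative assumption (a standard ESM2 release, not the DecoderTCR weights themselves).

```python
# Minimal sketch (assumed checkpoint): inspect an ESM2 backbone of the kind
# described above using the Hugging Face `transformers` library.
from transformers import AutoTokenizer, AutoModelForMaskedLM

checkpoint = "facebook/esm2_t33_650M_UR50D"  # illustrative ESM2-650M release
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

cfg = model.config
print(cfg.num_hidden_layers)    # L = 33
print(cfg.hidden_size)          # d = 1280
print(cfg.num_attention_heads)  # h = 20
print(cfg.intermediate_size)    # ~4 x d
print(len(tokenizer))           # 20 amino acids + special/rare tokens
```
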
### Continual Training Setup

The model is initialized from a pretrained **ESM2 checkpoint** and further trained via continual pretraining with masked language modeling (MLM) objectives.

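A minimal sketch of what such a continual MLM pretraining loop could look like with `transformers` and `datasets` is shown below. The sequences, masking rate, and hyperparameters are illustrative assumptions, not the settings used to train DecoderTCR.

```python
# Hedged sketch: continual MLM pretraining starting from a pretrained ESM2
# checkpoint. Data, masking rate, and hyperparameters are placeholders.
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

checkpoint = "facebook/esm2_t33_650M_UR50D"  # assumed initialization
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Toy domain sequences standing in for TCR / peptide-MHC training data.
sequences = ["CASSLGQAYEQYF", "GILGFVFTL", "CASSIRSSYEQYF"]
dataset = Dataset.from_dict({"sequence": sequences}).map(
    lambda ex: tokenizer(ex["sequence"], truncation=True, max_length=1024),
    remove_columns=["sequence"],
)

collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="decodertcr_continual_mlm",
                           per_device_train_batch_size=8,
                           num_train_epochs=1,
                           learning_rate=1e-5),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```
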
### Model Scale (Example Configurations)

| Model Variant | Parameters | Layers | Hidden Dim | Attention Heads |
| --- | --- | --- | --- | --- |
| ESM2-650M | ~650M | 33 | 1280 | 20 |
| ESM2-3B | ~3B | 36 | 2560 | 40 |

### Model Card Authors

Ben Lai

### Primary Contact Email

Ben Lai (ben.lai@czbiohub.org)

To submit feature requests or report issues with the model, please open an issue on [the GitHub repository](https://github.com/czbiohub-chi/DecoderTCR).

### System Requirements

- Compute Requirements: GPU

## Intended Use

### Primary Use Cases

The DecoderTCR models are designed for the following primary use cases:

1. **TCR-pMHC Binding Prediction**: Predict the interaction between T-cell receptors (TCRs) and peptide-MHC complexes
2. **Interaction Scoring**: Calculate interface energy scores for TCR-pMHC interactions
3. **Sequence Analysis**: Analyze TCR sequences and their interactions with specific peptides
4. **Immunology Research**: Support research in adaptive immunity, T-cell recognition, and antigen presentation

The models are particularly useful for:
- Identifying potential TCR-peptide binding pairs
- Screening TCR sequences for specific antigen recognition
- Understanding the molecular basis of T-cell recognition
- Supporting vaccine design and immunotherapy development

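DecoderTCR's exact scoring interface is not specified in this card; purely as an illustration of the scoring idea, the sketch below computes a masked-LM pseudo-log-likelihood for a naively concatenated TCR CDR3 + peptide pair using a generic ESM2 checkpoint. The checkpoint, the concatenation scheme, and the score definition are assumptions, not the model's actual interface-energy calculation.

```python
# Illustrative only: pseudo-log-likelihood scoring with a masked LM.
# The checkpoint and naive concatenation are assumptions; DecoderTCR's
# interface-energy scoring may be defined differently.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

checkpoint = "facebook/esm2_t33_650M_UR50D"  # assumed backbone
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint).eval()

def pseudo_log_likelihood(sequence: str) -> float:
    """Mask each residue in turn and sum the log-probability of the true token."""
    ids = tokenizer(sequence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    with torch.no_grad():
        for i in range(1, len(ids) - 1):  # skip BOS/EOS special tokens
            masked = ids.clone()
            masked[i] = tokenizer.mask_token_id
            logits = model(input_ids=masked.unsqueeze(0)).logits[0, i]
            total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

tcr_cdr3 = "CASSLGQAYEQYF"  # example beta-chain CDR3
peptide = "GILGFVFTL"       # example viral peptide
print(pseudo_log_likelihood(tcr_cdr3 + peptide))
```

Higher scores in this sketch only mean the concatenated sequence is more plausible under the language model; they are not calibrated binding affinities.
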
### Out-of-Scope or Unauthorized Use Cases

Do not use the model for the following purposes:
- Use that violates applicable laws, regulations (including trade compliance laws), or third-party rights such as privacy or intellectual property rights
- Any use that is prohibited by the [MIT license](https://github.com/czbiohub-chi/DecoderTCR/blob/main/LICENSE) and [Acceptable Use Policy](https://virtualcellmodels.cziscience.com/acceptable-use-policy)
- Clinical diagnosis or treatment decisions without proper validation
- Direct use in patient care without appropriate clinical validation and regulatory approval
- Use for purposes that could cause harm to individuals or groups

The models are research tools and should not be used as the sole basis for clinical or diagnostic decisions.

## Training Data

The models are trained on multiple large-scale protein sequence databases. The training data consists of:

- **TCR sequences**: Observed T-cell Space (OTS) for paired $\alpha/\beta$ TCR sequences.
- **Peptide-MHC sequences**: MHC Motif Atlas for peptide-MHC ligandomes, plus high-confidence synthetic interactions via MixMHCpred predictions.
- **Paired TCR-pMHC Interactions**: VDJdb for paired TCR-pMHC interaction data.

## Continual Pre-training Strategy

This model is trained using a **continual pre-training curriculum** that adapts a pretrained ESM2 backbone to new protein domains while preserving previously learned representations.

### Overview

Continual pre-training proceeds in **multiple stages**, each leveraging different data regimes and masking strategies:

- Stage 1 emphasizes **abundant marginal sequence data**, encouraging robust component-level representations.
- Stage 2 incorporates **scarcer, structured, or interaction-rich data**, refining conditional dependencies without overwriting earlier knowledge.

The architecture, tokenizer, and objective remain unchanged throughout training; only the data distribution and masking strategy evolve.

### Stage 1: Component-Level Adaptation

In the first stage, the model is further pretrained on large collections of unpaired or weakly structured protein sequences relevant to the target domain.

- **Objective:** Masked Language Modeling (MLM)
- **Masking:** Component- or region-aware masking schedules that upweight functionally relevant positions
- **Purpose:**
  - Adapt the pretrained ESM2 representations to the target protein subspace
  - Learn domain-specific sequence statistics while retaining general protein knowledge

This stage acts as a regularizer, anchoring learning in large-scale marginal data before introducing more complex dependencies.

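A hedged sketch of what a region-aware masking schedule might look like is shown below; the region boundaries, masking rates, and function name are illustrative assumptions rather than DecoderTCR's published masking configuration.

```python
# Sketch (assumed rates/indices): region-aware MLM masking in which positions
# inside an annotated region (e.g., a CDR loop) are masked more often than
# the rest of the sequence, upweighting functionally relevant positions.
import torch

def region_aware_mask(input_ids: torch.Tensor,
                      region: tuple[int, int],
                      mask_token_id: int,
                      p_inside: float = 0.30,
                      p_outside: float = 0.10):
    """Return (masked_ids, labels); labels are -100 at positions kept visible."""
    probs = torch.full(input_ids.shape, p_outside)
    probs[region[0]:region[1]] = p_inside  # upweight the functional region
    mask = torch.bernoulli(probs).bool()
    labels = torch.where(mask, input_ids, torch.full_like(input_ids, -100))
    masked_ids = torch.where(mask, torch.full_like(input_ids, mask_token_id), input_ids)
    return masked_ids, labels
```
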
### Stage 2: Conditional / Interaction-Aware Refinement

In subsequent stages, the model is continually trained on **structured or paired sequences** that encode higher-order dependencies (e.g., interactions between protein regions or components).

- **Objective:** Masked Language Modeling (MLM)
- **Masking:** Joint masking across interacting regions to encourage cross-context conditioning
- **Purpose:**
  - Refine conditional relationships learned from limited paired data
  - Align representations across components without degrading Stage 1 task performance

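As with Stage 1, the sketch below is only a hedged illustration of joint masking across interacting regions: spans from both partners in a concatenated TCR + peptide input are hidden in the same example, so reconstructing either side must condition on the other. The span indices and masking rate are assumptions.

```python
# Sketch (assumed spans/rate): joint masking for interaction-aware refinement.
# Both interacting regions of a concatenated input are masked together,
# encouraging cross-context conditioning between components.
import torch

def joint_interaction_mask(input_ids: torch.Tensor,
                           spans: list[tuple[int, int]],
                           mask_token_id: int,
                           p_span: float = 0.50):
    """Mask positions inside every interacting span with probability p_span."""
    probs = torch.zeros(input_ids.shape)
    for start, end in spans:
        probs[start:end] = p_span
    mask = torch.bernoulli(probs).bool()
    labels = torch.where(mask, input_ids, torch.full_like(input_ids, -100))
    masked_ids = torch.where(mask, torch.full_like(input_ids, mask_token_id), input_ids)
    return masked_ids, labels
```
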
## Biases, Risks, and Limitations

### Potential Biases

- The model may reflect biases present in the training data, including:
  - Overrepresentation of certain HLA alleles or peptide types
  - Limited diversity in TCR sequences from specific populations
  - Bias toward well-studied antigen systems
- Certain TCR clonotypes or peptide types may be underrepresented in training data

### Risks

Areas of risk may include but are not limited to:
- **Inaccurate predictions**: The model may produce incorrect binding predictions, especially for novel sequences or rare HLA-peptide combinations
- **Overconfidence**: The model may assign high confidence to predictions that are actually uncertain
- **Biological misinterpretation**: Users may misinterpret model outputs as definitive biological facts rather than predictions
- **Clinical misuse**: Use in clinical settings without proper validation could lead to incorrect treatment decisions

### Limitations

- **Sequence length**: The model has a maximum input sequence length (typically ~1024 tokens)
- **Novel sequences**: Performance may degrade on sequences very different from the training data
- **HLA diversity**: Limited training data for rare HLA alleles may affect prediction accuracy
- **Context dependency**: The model may not capture all biological context (e.g., post-translational modifications, cellular environment)
- **Computational requirements**: A GPU is recommended for optimal performance

### Caveats and Recommendations

- **Review and validate outputs**: Always review and validate model predictions, especially for critical applications
- **Experimental validation**: Model predictions should be validated experimentally before use in research or clinical contexts
- **Uncertainty awareness**: Be aware that predictions are probabilistic and carry uncertainty
- **Domain expertise**: Use the model in conjunction with domain expertise in immunology and T-cell biology
- **Version tracking**: Keep track of which model version and checkpoint you are using

We are committed to advancing the responsible development and use of artificial intelligence. Please follow our [Acceptable Use Policy](https://virtualcellmodels.cziscience.com/acceptable-use-policy) when engaging with our services.

Should you have any security or privacy issues or questions related to the services, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com, respectively.

## Acknowledgements

This model builds upon:
- **ESM2** by Meta AI (Facebook Research) for the base protein language model
- The broader computational biology and immunology research communities

Special thanks to the developers and contributors of the ESM models and the open-source tools that made this work possible.