bl-2633 committed on
Commit ff853ab · verified · 1 Parent(s): 4b39f2a

Keep original model card; update model list (add ESM-C V0.3 line) + license note

Files changed (1)
  1. README.md +172 -41
README.md CHANGED
@@ -1,57 +1,188 @@
1
  ---
2
  license: mit
3
- library_name: pytorch
4
- pipeline_tag: other
5
- tags:
6
- - protein
7
- - tcr
8
- - tcr-pmhc
9
- - immunology
10
- - esm-c
11
- - esm2
12
- - masked-language-model
13
  ---
14
 
15
- # DecoderTCR V0.3
 
16
 
17
- TCR–pMHC binding scoring by masked-language-model pseudo-log-likelihood (PLL), on ESM-2 and
18
- ESM-C backbones. Given a TCR (V/J genes + CDR3) paired with an HLA allele and a peptide, the
19
- models score how well the TCR is predicted to bind the peptide–MHC.
20
 
21
- This repository hosts **model weights only**. The code, loader, and prediction CLIs live in the
22
- DecoderTCR package; weights are fetched into the expected paths by its `download_weights.py`
23
- script — you do not load these checkpoints with `transformers`.
24
 
25
- ## Models
26
 
27
- | File | Registry name | Backbone | Params | Version | Notes |
28
- |------|---------------|----------|--------|---------|-------|
29
- | `DecoderTCR-C-V0.3/600M.ckpt` | `DecoderTCR-C_600M` | ESM-C | 600M | V0.3 | **default**, runs on ≤24 GB GPUs |
30
- | `DecoderTCR-C-V0.3/300M.ckpt` | `DecoderTCR-C_300M` | ESM-C | 300M | V0.3 | lightest ESM-C |
31
- | `DecoderTCR-C-V0.3/6B.ckpt` | `DecoderTCR-C_6B` | ESM-C | 6B | V0.3 | larger variant (80 GB GPU) |
32
- | `DecoderTCR-ESM2-V0.1/650M_DecoderTCR.ckpt` | `DecoderTCR_650M` | ESM-2 | 650M | V0.1 | paper reproduction |
33
- | `DecoderTCR-ESM2-V0.1/3B_DecoderTCR.ckpt` | `DecoderTCR_3B` | ESM-2 | 3B | V0.1 | paper reproduction |
34
 
35
- The V0.3 ESM-C line (`DecoderTCR-C`) is the current default; the V0.1 ESM-2 line is kept for
36
- paper reproduction and backward compatibility.
37
 
38
- ## Usage
 
39
 
40
- Get the DecoderTCR code, then fetch weights (they download into these same nested paths):
41
 
42
- ```bash
43
- uv sync
44
- uv run python scripts/download_weights.py # all models
45
- uv run python scripts/download_weights.py -m DecoderTCR-C_600M # just the default
46
- ```
47
 
48
- ```python
49
- from DecoderTCR.utils.model_zoo import load
50
- model, n_layers = load() # default = DecoderTCR-C_600M
51
- ```
52
 
53
  ## License
54
 
55
- MIT (see [`LICENSE`](LICENSE)). The bundled backbones are also MIT: ESM-2 (Meta) and the
56
- Chan Zuckerberg Biohub ESM-C release (https://github.com/Biohub/esm). The released checkpoints
57
- contain the full fine-tuned weights and are distributed under the MIT license.
 
1
  ---
2
  license: mit
3
  ---
4
 
5
+ # DecoderTCR
6
+ DecoderTCR is a protein language model for T-cell receptor (TCR) and peptide-MHC complexes. The models are based on the ESM-2 and ESM-C model families.
7
 
8
+ For the model code and additional information on installation and usage, please see [the associated GitHub repository](https://github.com/czbiohub-chi/DecoderTCR).
 
 
9
 
10
+ ## Model Architecture
 
 
11
 
12
+ DecoderTCR is built on a Transformer-based protein language model (ESM-2 and ESM-C families).
13
 
14
+ ### Core Architecture
 
 
 
 
 
 
15
 
16
+ The ESM-2 models follow the **ESM-2** architecture, a deep Transformer encoder designed for protein sequences. The example values below correspond to the 650M configuration.
 
17
 
18
+ #### Embedding Layer
19
+ - Token embedding dimension: *d* (e.g., 1280)
20
+ - Rotary positional embeddings (RoPE)
21
+ - Vocabulary includes:
22
+ - 20 standard amino acids
23
+ - Special tokens (mask, padding, BOS/EOS, unknown)
24
 
25
+ #### Transformer Stack
26
+ - Number of layers: *L* (e.g., 33)
27
+ - Hidden dimension: *d*
28
+ - Number of attention heads: *h* (e.g., 20)
29
+ - Multi-head self-attention:
30
+ - Full-sequence, bidirectional attention
31
+ - Feed-forward network:
32
+ - Intermediate dimension ≈ 4× *d*
33
+ - Activation function: GELU
34
+ - Layer normalization: Pre-LayerNorm
35
+ - Residual connections around attention and feed-forward blocks
36
 
37
+ ### Continual Training Setup
 
 
 
 
38
 
39
+ The model is initialized from a pretrained **ESM2 checkpoint** and further trained via continual pretraining with MLM objectives.
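
For reference, the standard masked-language-modeling objective can be written as below. This is the textbook form, given here as an assumption about what "MLM objectives" refers to; the symbols ($x$ for a sequence, $M$ for the set of masked positions, $\theta$ for the model parameters) are introduced only for this illustration.

$$
\mathcal{L}_{\mathrm{MLM}}(\theta) = -\,\mathbb{E}_{x}\,\mathbb{E}_{M}\left[\sum_{i \in M} \log p_\theta\!\left(x_i \mid x_{\setminus M}\right)\right]
$$

Here $x_{\setminus M}$ denotes the sequence with the positions in $M$ replaced by the mask token, so the model is trained to recover each masked residue from its unmasked context.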
40
+
41
+
42
+ ### Released Models
43
+
44
+ This repository hosts two model lines. The **V0.3 ESM-C** line (`DecoderTCR-C`) is the current default; the **V0.1 ESM-2** line is retained for paper reproduction.
45
+
46
+ | Model | File | Backbone | Parameters | Layers | Hidden Dim | Attention Heads |
47
+ | --- | --- | --- | --- | --- | --- | --- |
48
+ | DecoderTCR-C 300M | `DecoderTCR-C-V0.3/300M.ckpt` | ESM-C | ~300M | 30 | 960 | 15 |
49
+ | DecoderTCR-C 600M (default) | `DecoderTCR-C-V0.3/600M.ckpt` | ESM-C | ~600M | 36 | 1152 | 18 |
50
+ | DecoderTCR-C 6B | `DecoderTCR-C-V0.3/6B.ckpt` | ESM-C | ~6B | 80 | 2560 | 40 |
51
+ | DecoderTCR 650M | `DecoderTCR-ESM2-V0.1/650M_DecoderTCR.ckpt` | ESM-2 | ~650M | 33 | 1280 | 20 |
52
+ | DecoderTCR 3B | `DecoderTCR-ESM2-V0.1/3B_DecoderTCR.ckpt` | ESM-2 | ~3B | 36 | 2560 | 40 |
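
As a sanity check on the ESM-2 rows of the table, the sketch below estimates the weight count of a standard Transformer encoder from its layer count and hidden dimension. It assumes a 4× feed-forward expansion and ignores embeddings, biases, and layer norms, so it is only a rough estimate; the function name `approx_encoder_params` is hypothetical and not part of the DecoderTCR code. The ESM-C rows are left out because their feed-forward layout may differ from this assumption.

```python
def approx_encoder_params(num_layers: int, hidden_dim: int, ffn_multiplier: int = 4) -> int:
    """Rough weight count for a standard Transformer encoder stack.

    Counts only the large matrices in each layer:
      - attention: Q, K, V, and output projections -> 4 * d^2
      - feed-forward: two projections of shape d x (m * d) -> 2 * m * d^2
    Embeddings, biases, and layer norms are ignored (assumption).
    """
    attention = 4 * hidden_dim * hidden_dim
    feed_forward = 2 * ffn_multiplier * hidden_dim * hidden_dim
    return num_layers * (attention + feed_forward)


# Layer counts and hidden dimensions taken from the ESM-2 rows of the table.
for name, layers, dim in [("DecoderTCR 650M", 33, 1280), ("DecoderTCR 3B", 36, 2560)]:
    estimate = approx_encoder_params(layers, dim)
    print(f"{name}: ~{estimate / 1e6:.0f}M weights in the Transformer stack")
```

The estimates land near the nominal sizes in the table (about 649M and 2.8B), which suggests the listed layer counts and hidden dimensions are consistent with the stated parameter counts.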
53
+
54
+ ### Model Card Authors
55
+
56
+ Ben Lai
57
+
58
+ ### Primary Contact Email
59
+
60
+ Ben Lai ben.lai@czbiohub.org
61
+ To submit feature requests or report issues with the model, please open an issue on [the GitHub repository](https://github.com/czbiohub-chi/DecoderTCR).
62
+
63
+ ### System Requirements
64
+
65
+ - Compute Requirements: GPU
66
+
67
+ ## Intended Use
68
+
69
+ ### Primary Use Cases
70
+
71
+ The DecoderTCR models are designed for the following primary use cases:
72
+
73
+ 1. **TCR-pMHC Binding Prediction**: Predict the interaction between T-cell receptors (TCRs) and peptide-MHC complexes
74
+ 2. **Interaction Scoring**: Calculate interface energy scores for TCR-pMHC interactions
75
+ 3. **Sequence Analysis**: Analyze TCR sequences and their interactions with specific peptides
76
+ 4. **Immunology Research**: Support research in adaptive immunity, T-cell recognition, and antigen presentation
77
+
78
+ The models are particularly useful for:
79
+ - Identifying potential TCR-peptide binding pairs
80
+ - Screening TCR sequences for specific antigen recognition
81
+ - Understanding the molecular basis of T-cell recognition
82
+ - Supporting vaccine design and immunotherapy development
83
+
84
+ ### Out-of-Scope or Unauthorized Use Cases
85
+
86
+ Do not use the model for the following purposes:
87
+ - Use that violates applicable laws, regulations (including trade compliance laws), or third party rights such as privacy or intellectual property rights
88
+ - Any use that is prohibited by the [MIT license](https://github.com/czbiohub-chi/DecoderTCR/blob/main/LICENSE) and [Acceptable Use Policy](https://virtualcellmodels.cziscience.com/acceptable-use-policy).
89
+ - Clinical diagnosis or treatment decisions without proper validation
90
+ - Direct use in patient care without appropriate clinical validation and regulatory approval
91
+ - Use for purposes that could cause harm to individuals or groups
92
+
93
+ The models are research tools and should not be used as the sole basis for clinical or diagnostic decisions.
94
+
95
+ ## Training Data
96
+
97
+ The models are trained on large-scale protein sequence databases covering three components:
98
+
99
+ - **TCR sequences**: Observed TCR Space (OTS) for paired $\alpha/\beta$ TCR sequences.
100
+ - **Peptide-MHC sequences**: MHC Motif Atlas for peptide-MHC ligandomes, plus high-confidence synthetic interactions from MixMHCpred predictions.
101
+ - **Paired TCR-pMHC interactions**: VDJdb for paired TCR-pMHC interaction data.
102
+
103
+
104
+ ## Continual Pre-training Strategy
105
+
106
+ This model is trained using a **continual pre-training curriculum** that adapts a pretrained ESM2 backbone to new protein domains while preserving previously learned representations.
107
+
108
+ ### Overview
109
+
110
+ Continual pre-training proceeds in **multiple stages**, each leveraging different data regimes and masking strategies:
111
+
112
+ - Stage 1 emphasizes **abundant marginal sequence data**, encouraging robust component-level representations.
113
+ - Stage 2 incorporates **scarcer, structured, or interaction-rich data**, refining conditional dependencies without overwriting earlier knowledge.
114
+
115
+ The architecture, tokenizer, and objective remain unchanged throughout training; only the data distribution and masking strategy evolve.
116
+
117
+ ### Stage 1: Component-Level Adaptation
118
+
119
+ In the first stage, the model is further pretrained on large collections of unpaired or weakly structured protein sequences relevant to the target domain.
120
+
121
+ - **Objective:** Masked Language Modeling (MLM)
122
+ - **Masking:** Component- or region-aware masking schedules that upweight functionally relevant positions
123
+ - **Purpose:**
124
+ - Adapt the pretrained ESM2 representations to the target protein subspace
125
+ - Learn domain-specific sequence statistics while retaining general protein knowledge
126
+
127
+ This stage acts as a regularizer, anchoring learning in large-scale marginal data before introducing more complex dependencies.
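
The region-aware masking described for Stage 1 could be sketched roughly as follows. This is a minimal illustration, not the training code: the function `sample_mask_positions`, the region labels (`"framework"`, `"cdr3"`), the weights, and the 15% mask fraction are all assumptions chosen for the example. Positions are drawn without replacement, with each position's chance of being masked proportional to the weight of its region.

```python
import random


def sample_mask_positions(regions, weights, mask_fraction=0.15, seed=None):
    """Pick positions to mask, upweighting selected regions.

    regions: one region label per residue (hypothetical labels).
    weights: dict mapping a label to a relative weight; missing labels get 1.0.
    Returns a sorted list of distinct masked positions.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(mask_fraction * len(regions)))
    position_weights = [weights.get(label, 1.0) for label in regions]

    candidates = list(range(len(regions)))
    chosen = []
    # Weighted sampling without replacement: draw one position at a time
    # and remove it from the candidate pool.
    while len(chosen) < n_mask and candidates:
        pick = rng.choices(
            candidates, weights=[position_weights[i] for i in candidates], k=1
        )[0]
        chosen.append(pick)
        candidates.remove(pick)
    return sorted(chosen)


# Toy sequence: 10 framework residues followed by 10 CDR3 residues.
regions = ["framework"] * 10 + ["cdr3"] * 10
masked = sample_mask_positions(regions, {"cdr3": 5.0}, seed=0)
print(masked)
```

With a weight of 5.0 on the hypothetical `cdr3` region, masked positions fall in that region more often than in the framework, which is the general effect an upweighting schedule is meant to have.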
128
+
129
+ ### Stage 2: Conditional / Interaction-Aware Refinement
130
+
131
+ In the second stage, the model is further trained on **structured or paired sequences** that encode higher-order dependencies (e.g., interactions between protein regions or components).
132
+
133
+ - **Objective:** Masked Language Modeling (MLM)
134
+ - **Masking:** Joint masking across interacting regions to encourage cross-context conditioning
135
+ - **Purpose:**
136
+ - Refine conditional relationships learned from limited paired data
137
+ - Align representations across components without degrading Stage 1 task performance
138
+
139
+ ## Biases, Risks, and Limitations
140
+
141
+ ### Potential Biases
142
+
143
+ - The model may reflect biases present in the training data, including:
144
+ - Overrepresentation of certain HLA alleles or peptide types
145
+ - Limited diversity in TCR sequences from specific populations
146
+ - Bias toward well-studied antigen systems
147
+ - Certain TCR clonotypes or peptide types may be underrepresented in training data
148
+
149
+ ### Risks
150
+
151
+ Areas of risk may include but are not limited to:
152
+ - **Inaccurate predictions**: The model may produce incorrect binding predictions, especially for novel sequences or rare HLA-peptide combinations
153
+ - **Overconfidence**: The model may assign high confidence to predictions that are actually uncertain
154
+ - **Biological misinterpretation**: Users may misinterpret model outputs as definitive biological facts rather than predictions
155
+ - **Clinical misuse**: Use in clinical settings without proper validation could lead to incorrect treatment decisions
156
+
157
+ ### Limitations
158
+
159
+ - **Sequence length**: Inputs are limited to a maximum sequence length (typically ~1024 tokens)
160
+ - **Novel sequences**: Performance may degrade on sequences very different from training data
161
+ - **HLA diversity**: Limited training data for rare HLA alleles may affect prediction accuracy
162
+ - **Context dependency**: The model may not capture all biological context (e.g., post-translational modifications, cellular environment)
163
+ - **Computational requirements**: GPU is recommended for optimal performance
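
Given the sequence-length limitation above, it can help to check input sizes before scoring. The sketch below is a simple pre-flight check under stated assumptions: one token per residue, two special tokens (BOS/EOS) per input, and a 1024-token limit. The helper `fits_context` and the constant `MAX_TOKENS` are hypothetical; the actual limit and the way TCR, peptide, and MHC sequences are combined depend on the checkpoint and the DecoderTCR code.

```python
MAX_TOKENS = 1024  # approximate limit from the model card; may vary by checkpoint


def fits_context(sequences, max_tokens=MAX_TOKENS, special_tokens=2):
    """Return (fits, total_tokens) for a set of component sequences.

    Assumes one token per residue plus a fixed number of special tokens.
    """
    total = sum(len(seq) for seq in sequences) + special_tokens
    return total <= max_tokens, total


# Example components: a CDR3 and a peptide (illustrative inputs only).
ok, total = fits_context(["CASSIRSSYEQYF", "GILGFVFTL"])
print(ok, total)
```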
164
+
165
+ ### Caveats and Recommendations
166
+
167
+ - **Review and validate outputs**: Always review and validate model predictions, especially for critical applications
168
+ - **Experimental validation**: Model predictions should be validated experimentally before use in research or clinical contexts
169
+ - **Uncertainty awareness**: Be aware that predictions are probabilistic and may have uncertainty
170
+ - **Domain expertise**: Use the model in conjunction with domain expertise in immunology and T-cell biology
171
+ - **Version tracking**: Keep track of which model version and checkpoint you are using
172
+
173
+ We are committed to advancing the responsible development and use of artificial intelligence. Please follow our [Acceptable Use Policy](https://virtualcellmodels.cziscience.com/acceptable-use-policy) when engaging with our services.
174
+
175
+ Should you have any security or privacy issues or questions related to the services, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com respectively.
176
+
177
+ ## Acknowledgements
178
+
179
+ This model builds upon:
180
+ - **ESM-2** by Meta AI (Facebook Research) for the ESM-2 base protein language model
181
+ - **ESM-C** released by Chan Zuckerberg Biohub (https://github.com/Biohub/esm) for the ESM-C base protein language model
182
+ - The broader computational biology and immunology research communities
183
+
184
+ Special thanks to the developers and contributors of the ESM models and the open-source tools that made this work possible.
185
 
186
  ## License
187
 
188
+ DecoderTCR code and all released weights are distributed under the [MIT license](https://github.com/czbiohub-chi/DecoderTCR/blob/main/LICENSE). The base backbones are likewise MIT: ESM-2 (Meta AI) and ESM-C (Chan Zuckerberg Biohub, https://github.com/Biohub/esm).