|
|
--- |
|
|
license: mit |
|
|
--- |
|
|
|
|
|
# DecoderTCR v0.1 |
|
|
DecoderTCR is a protein language model for T-cell receptor (TCR) and peptide-MHC complexes, based on the ESM2 model family.
|
|
|
|
|
For model code and additional information on installation and usage, please see [the associated GitHub repository](https://github.com/czbiohub-chi/DecoderTCR).
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
DecoderTCR is built on a Transformer-based protein language model (ESM2 family). |
|
|
|
|
|
### Core Architecture |
|
|
|
|
|
The model follows the **ESM2** architecture, a deep Transformer encoder designed for protein sequences. |
|
|
|
|
|
#### Embedding Layer |
|
|
- Token embedding dimension: *d* (e.g., 1280) |
|
|
- Learned positional embeddings |
|
|
- Vocabulary includes: |
|
|
- 20 standard amino acids |
|
|
- Special tokens (mask, padding, BOS/EOS, unknown) |
|
|
|
|
|
#### Transformer Stack |
|
|
- Number of layers: *L* (e.g., 33) |
|
|
- Hidden dimension: *d* |
|
|
- Number of attention heads: *h* (e.g., 20) |
|
|
- Multi-head self-attention: |
|
|
- Full-sequence, bidirectional attention |
|
|
- Feed-forward network: |
|
|
- Intermediate dimension ≈ 4× *d* |
|
|
- Activation function: GELU |
|
|
- Layer normalization: Pre-LayerNorm |
|
|
- Residual connections around attention and feed-forward blocks |
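
The configuration above can be checked directly against a public ESM2 checkpoint. The following is a minimal sketch, assuming the Hugging Face `transformers` library and the public `facebook/esm2_t33_650M_UR50D` checkpoint as a stand-in for the DecoderTCR backbone; it loads the model and prints the quantities *L*, *d*, and *h* referenced above.

```python
# Minimal sketch: inspect an ESM2-650M backbone's configuration with Hugging Face
# transformers. The checkpoint name is the public ESM2 release, used here only as a
# stand-in for the DecoderTCR backbone.
from transformers import AutoTokenizer, EsmModel

checkpoint = "facebook/esm2_t33_650M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = EsmModel.from_pretrained(checkpoint)

cfg = model.config
print("layers (L):", cfg.num_hidden_layers)              # 33
print("hidden dim (d):", cfg.hidden_size)                # 1280
print("attention heads (h):", cfg.num_attention_heads)   # 20
print("FFN dim (~4d):", cfg.intermediate_size)           # 5120
print("vocab size:", tokenizer.vocab_size)               # amino acids + special tokens
```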
|
|
|
|
|
### Continual Training Setup |
|
|
|
|
|
The model is initialized from a pretrained **ESM2 checkpoint** and further trained via continual pretraining with MLM objectives. |
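
As a hedged illustration of this setup (not the actual DecoderTCR training script), the sketch below initializes from a public ESM2 checkpoint and continues masked-language-model training on a plain-text file of domain sequences; the file name, batch size, and learning rate are placeholder assumptions.

```python
# Hedged sketch of continual MLM pretraining from an ESM2 checkpoint.
# "tcr_pmhc_sequences.txt" (one sequence per line) and all hyperparameters
# are illustrative placeholders, not the values used for DecoderTCR.
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          EsmForMaskedLM, Trainer, TrainingArguments)

checkpoint = "facebook/esm2_t33_650M_UR50D"           # pretrained ESM2 backbone
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = EsmForMaskedLM.from_pretrained(checkpoint)    # initialize from ESM2 weights

raw = load_dataset("text", data_files={"train": "tcr_pmhc_sequences.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
args = TrainingArguments(output_dir="decodertcr-continual",
                         per_device_train_batch_size=8,
                         learning_rate=1e-5,
                         num_train_epochs=1)
Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```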
|
|
|
|
|
|
|
|
### Model Scale (Example Configurations) |
|
|
|
|
|
| Model Variant | Parameters | Layers | Hidden Dim | Attention Heads | |
|
|
| --- | --- | --- | --- | --- | |
|
|
| ESM2-650M | ~650M | 33 | 1280 | 20 | |
|
|
| ESM2-3B | ~3B | 36 | 2560 | 40 | |
|
|
|
|
|
### Model Card Authors |
|
|
|
|
|
Ben Lai |
|
|
|
|
|
### Primary Contact Email |
|
|
|
|
|
Ben Lai (ben.lai@czbiohub.org)
|
|
To submit feature requests or report issues with the model, please open an issue on [the GitHub repository](https://github.com/czbiohub-chi/DecoderTCR). |
|
|
|
|
|
### System Requirements |
|
|
|
|
|
- Compute requirements: GPU (recommended for training and inference)
|
|
|
|
|
## Intended Use |
|
|
|
|
|
### Primary Use Cases |
|
|
|
|
|
The DecoderTCR models are designed for the following primary use cases: |
|
|
|
|
|
1. **TCR-pMHC Binding Prediction**: Predict the interaction between T-cell receptors (TCRs) and peptide-MHC complexes |
|
|
2. **Interaction Scoring**: Calculate interface energy scores for TCR-pMHC interactions |
|
|
3. **Sequence Analysis**: Analyze TCR sequences and their interactions with specific peptides |
|
|
4. **Immunology Research**: Support research in adaptive immunity, T-cell recognition, and antigen presentation |
|
|
|
|
|
The models are particularly useful for: |
|
|
- Identifying potential TCR-peptide binding pairs |
|
|
- Screening TCR sequences for specific antigen recognition |
|
|
- Understanding the molecular basis of T-cell recognition |
|
|
- Supporting vaccine design and immunotherapy development |
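
As a hedged starting point for the sequence-analysis use case above, the sketch below extracts mean-pooled embeddings for an example CDR3β and peptide from an ESM2-style checkpoint. The checkpoint name and sequences are illustrative placeholders; the actual DecoderTCR inputs and interaction-scoring interface are described in the GitHub repository.

```python
# Illustrative only: mean-pooled sequence embeddings from an ESM2-style checkpoint.
# The checkpoint and sequences are placeholders; DecoderTCR's own scoring interface
# is documented in the associated GitHub repository.
import torch
from transformers import AutoTokenizer, EsmModel

checkpoint = "facebook/esm2_t33_650M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = EsmModel.from_pretrained(checkpoint).eval()

sequences = {
    "tcr_beta_cdr3": "CASSIRSSYEQYF",   # example CDR3 beta sequence
    "peptide": "GILGFVFTL",             # example influenza M1 peptide
}

with torch.no_grad():
    for name, seq in sequences.items():
        batch = tokenizer(seq, return_tensors="pt")
        hidden = model(**batch).last_hidden_state   # shape: (1, len + 2, d)
        pooled = hidden[0, 1:-1].mean(dim=0)        # drop BOS/EOS, mean-pool residues
        print(name, tuple(pooled.shape))            # (1280,)
```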
|
|
|
|
|
### Out-of-Scope or Unauthorized Use Cases |
|
|
|
|
|
Do not use the model for the following purposes: |
|
|
- Use that violates applicable laws, regulations (including trade compliance laws), or third party rights such as privacy or intellectual property rights |
|
|
- Any use that is prohibited by the [MIT license](https://github.com/czbiohub-chi/DecoderTCR/blob/main/LICENSE) and [Acceptable Use Policy](https://virtualcellmodels.cziscience.com/acceptable-use-policy). |
|
|
- Clinical diagnosis or treatment decisions without proper validation |
|
|
- Direct use in patient care without appropriate clinical validation and regulatory approval |
|
|
- Use for purposes that could cause harm to individuals or groups |
|
|
|
|
|
The models are research tools and should not be used as the sole basis for clinical or diagnostic decisions. |
|
|
|
|
|
## Training Data |
|
|
|
|
|
The models are trained on multiple large-scale protein sequence databases. The training data consists of:
|
|
|
|
|
- **TCR sequences**: Observed T-cell receptor Space (OTS) for paired α/β TCR sequences.
|
|
- **Peptide-MHC sequences**: MHC Motif Atlas for peptide-MHC ligandomes, plus high-confidence synthetic interactions predicted with MixMHCpred.
|
|
- **Paired TCR-pMHC Interactions**: VDJdb for paired TCR-pMHC interaction data. |
|
|
|
|
|
|
|
|
## Continual Pre-training Strategy |
|
|
|
|
|
This model is trained using a **continual pre-training curriculum** that adapts a pretrained ESM2 backbone to new protein domains while preserving previously learned representations. |
|
|
|
|
|
### Overview |
|
|
|
|
|
Continual pre-training proceeds in **multiple stages**, each leveraging different data regimes and masking strategies: |
|
|
|
|
|
- Stage 1 emphasizes **abundant marginal sequence data**, encouraging robust component-level representations.
|
|
- Stage 2 incorporates **scarcer, structured, or interaction-rich data**, refining conditional dependencies without overwriting earlier knowledge.
|
|
|
|
|
The architecture, tokenizer, and objective remain unchanged throughout training; only the data distribution and masking strategy evolve. |
|
|
|
|
|
### Stage 1: Component-Level Adaptation |
|
|
|
|
|
In the first stage, the model is further pretrained on large collections of unpaired or weakly structured protein sequences relevant to the target domain. |
|
|
|
|
|
- **Objective:** Masked Language Modeling (MLM) |
|
|
- **Masking:** Component- or region-aware masking schedules that upweight functionally relevant positions |
|
|
- **Purpose:** |
|
|
- Adapt the pretrained ESM2 representations to the target protein subspace |
|
|
- Learn domain-specific sequence statistics while retaining general protein knowledge |
|
|
|
|
|
This stage acts as a regularizer, anchoring learning in large-scale marginal data before introducing more complex dependencies. |
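
A minimal sketch of what a component-aware masking schedule could look like is shown below; the region annotations and masking probabilities are assumptions for illustration, not the schedule used to train DecoderTCR.

```python
# Hedged sketch of region-aware MLM masking: positions flagged as functionally
# relevant (e.g., CDR loops) are masked with a higher probability than the rest.
# The probabilities and region annotation are illustrative assumptions.
import torch

def region_aware_mask(input_ids, region_mask, mask_token_id,
                      p_region=0.25, p_background=0.10):
    """input_ids: (seq_len,) token ids; region_mask: (seq_len,) bool, True inside regions."""
    probs = torch.where(region_mask,
                        torch.full(input_ids.shape, p_region),
                        torch.full(input_ids.shape, p_background))
    selected = torch.bernoulli(probs).bool()

    labels = input_ids.clone()
    labels[~selected] = -100                 # ignore unmasked positions in the MLM loss

    corrupted = input_ids.clone()
    corrupted[selected] = mask_token_id      # replace selected positions with <mask>
    return corrupted, labels
```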
|
|
|
|
|
### Stage 2: Conditional / Interaction-Aware Refinement |
|
|
|
|
|
In subsequent stages, the model is continually trained on **structured or paired sequences** that encode higher-order dependencies (e.g., interactions between protein regions or components). |
|
|
|
|
|
- **Objective:** Masked Language Modeling (MLM) |
|
|
- **Masking:** Joint masking across interacting regions to encourage cross-context conditioning |
|
|
- **Purpose:** |
|
|
- Refine conditional relationships learned from limited paired data |
|
|
- Align representations across components without degrading Stage 1 task performance |
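
One plausible reading of this joint-masking strategy is sketched below: for a paired example, positions are masked on one component while the partner component is left fully visible, so the model must predict masked residues conditioned on the cross-context. The segment convention and masking rate are illustrative assumptions.

```python
# Hedged sketch of cross-context masking for paired inputs. segment_ids marks which
# component each token belongs to (0 = TCR, 1 = pMHC, an illustrative convention);
# only one side is corrupted, so predictions must condition on the other.
import torch

def cross_context_mask(input_ids, segment_ids, mask_token_id, p_mask=0.15):
    """input_ids, segment_ids: (seq_len,) tensors for one paired example."""
    target = int(torch.randint(0, 2, (1,)))                  # choose which side to mask
    candidates = segment_ids == target
    selected = candidates & (torch.rand(input_ids.shape) < p_mask)

    labels = input_ids.clone()
    labels[~selected] = -100                  # loss only on masked positions

    corrupted = input_ids.clone()
    corrupted[selected] = mask_token_id       # partner component remains fully visible
    return corrupted, labels
```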
|
|
|
|
|
## Biases, Risks, and Limitations |
|
|
|
|
|
### Potential Biases |
|
|
|
|
|
- The model may reflect biases present in the training data, including: |
|
|
- Overrepresentation of certain HLA alleles or peptide types |
|
|
- Limited diversity in TCR sequences from specific populations |
|
|
- Bias toward well-studied antigen systems |
|
|
- Certain TCR clonotypes or peptide types may be underrepresented in training data |
|
|
|
|
|
### Risks |
|
|
|
|
|
Areas of risk may include but are not limited to: |
|
|
- **Inaccurate predictions**: The model may produce incorrect binding predictions, especially for novel sequences or rare HLA-peptide combinations |
|
|
- **Overconfidence**: The model may assign high confidence to predictions that are actually uncertain |
|
|
- **Biological misinterpretation**: Users may misinterpret model outputs as definitive biological facts rather than predictions |
|
|
- **Clinical misuse**: Use in clinical settings without proper validation could lead to incorrect treatment decisions |
|
|
|
|
|
### Limitations |
|
|
|
|
|
- **Sequence length**: The model has limitations on maximum sequence length (typically ~1024 tokens) |
|
|
- **Novel sequences**: Performance may degrade on sequences very different from training data |
|
|
- **HLA diversity**: Limited training data for rare HLA alleles may affect prediction accuracy |
|
|
- **Context dependency**: The model may not capture all biological context (e.g., post-translational modifications, cellular environment) |
|
|
- **Computational requirements**: GPU is recommended for optimal performance |
|
|
|
|
|
### Caveats and Recommendations |
|
|
|
|
|
- **Review and validate outputs**: Always review and validate model predictions, especially for critical applications |
|
|
- **Experimental validation**: Model predictions should be validated experimentally before use in research or clinical contexts |
|
|
- **Uncertainty awareness**: Be aware that predictions are probabilistic and carry inherent uncertainty
|
|
- **Domain expertise**: Use the model in conjunction with domain expertise in immunology and T-cell biology |
|
|
- **Version tracking**: Keep track of which model version and checkpoint you are using |
|
|
|
|
|
We are committed to advancing the responsible development and use of artificial intelligence. Please follow our [Acceptable Use Policy](https://virtualcellmodels.cziscience.com/acceptable-use-policy) when engaging with our services.
|
|
|
|
|
Should you have any security or privacy issues or questions related to the services, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com respectively. |
|
|
|
|
|
## Acknowledgements |
|
|
|
|
|
This model builds upon: |
|
|
- **ESM2** by Meta AI (Facebook Research) for the base protein language model |
|
|
- The broader computational biology and immunology research communities |
|
|
|
|
|
Special thanks to the developers and contributors of the ESM models and the open-source tools that made this work possible. |
|
|
|