| --- |
| license: mit |
| base_model: |
| - facebook/esm2_t36_3B_UR50D |
| tags: |
| - protein |
| - antibody |
| - biology |
| - CDR |
| - esm2 |
| --- |
| # Antibody ESM2 Paired Model |
|
|
| ## Model Description |
|
|
| This model is a fine-tuned version of ESM2-3B for paired antibody sequences (heavy and light chains). |
|
|
| **Key Features:** |
| - Trained on paired antibody sequences |
| - Trained with 15% WC masking, followed by fine-tuning with 50% CDR masking |
| - Input format: Heavy-Light chains separated by "-" |
| - Output: 2560-dimensional embeddings |
| - Optimized for antibody CDR region understanding |
|
|
| ### Preprocessing |
|
|
| Sequences were: |
| 1. Combined as: HEAVY-LIGHT (with "-" separator) |
| 2. Tokenized with ESM2 tokenizer |
| 3. CDR regions annotated for masking |
|
|
| ## Usage |
|
|
| ### Loading the Model |
|
|
| ```python |
| from transformers import EsmModel, AutoTokenizer |
| import torch |
| |
| # Load model and tokenizer |
| model = EsmModel.from_pretrained("NOC-Lab/AbCDR-ESM2") |
| tokenizer = AutoTokenizer.from_pretrained("NOC-Lab/AbCDR-ESM2") |
| model.eval() |
| ``` |
|
|
| ### Extract Embeddings |
|
|
| ```python |
| # Prepare paired sequence |
| SEP_TOKEN = "-" |
| heavy_chain = ( |
| "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVAVISYDGSNKYYADSVKGRF" |
| "TISADTSKNTAYLQMNSLRAEDTAVYYCAREGYYGSSYWYFDYWGQGTLVTVSS" |
| ) |
| light_chain = ( |
| "DIQMTQSPSSLSASVGDRVTITCRASQSISSYLNWYQQKPGKAPKLLIYAASSLQSGVPSRFSGSGS" |
| "GTDFTLTISSLQPEDFATYYCQQSYSTPLTFGGGTKVEIK" |
| ) |
| paired_sequence = f"{heavy_chain}{SEP_TOKEN}{light_chain}" |
| |
| # Tokenize |
| inputs = tokenizer(paired_sequence, return_tensors="pt", add_special_tokens=True) |
| |
| # Extract embeddings |
| with torch.no_grad(): |
| outputs = model(**inputs) |
| embeddings = outputs.last_hidden_state |
| |
| # Mean pooling over all tokens (special tokens included), weighted by the attention mask |
| mask = inputs["attention_mask"].unsqueeze(-1) |
| pooled = (embeddings * mask).sum(1) / mask.sum(1) |
| |
| print(f"Embedding shape: {pooled.shape}") # torch.Size([1, 2560]) |
| ``` |
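If per-chain embeddings are needed, the token positions of each chain can be recovered from the chain lengths. The sketch below assumes the tokenizer emits one token per residue, one token for the `-` separator, and a single leading special token; `chain_token_slices` is a hypothetical helper and not part of the released code.

```python
def chain_token_slices(heavy: str, light: str, n_prefix_tokens: int = 1):
    """Return (heavy_slice, light_slice) over the token axis.

    Assumes one token per residue, one separator token between the chains,
    and `n_prefix_tokens` special tokens before the first residue.
    """
    heavy_start = n_prefix_tokens
    heavy_end = heavy_start + len(heavy)
    light_start = heavy_end + 1  # skip the "-" separator token
    light_end = light_start + len(light)
    return slice(heavy_start, heavy_end), slice(light_start, light_end)


# Toy example with short fragments
heavy_slice, light_slice = chain_token_slices("EVQLVESGGG", "DIQMTQ")
print(heavy_slice, light_slice)
```

With the tensors from the block above, `embeddings[0, heavy_slice].mean(0)` would then give a heavy-chain-only vector. It is worth checking that `inputs["input_ids"].shape[1]` equals `len(paired_sequence) + 2` before relying on these offsets.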
|
|
| ## Input Format |
|
|
| **Required Format:** `HEAVY_CHAIN-LIGHT_CHAIN` |
|
|
| - Heavy and light chains must be separated by a hyphen (`-`) |
| - Use standard single-letter amino acid codes |
| - Do not include spaces in the sequence |
| - Replace uncommon residues with `X` |
|
|
| **Example:** |
| ```python |
| sequence = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMS...-DIQMTQSPSSLSASVGDRVTITCRASQSISS..." |
| ``` |
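The formatting rules above can be applied programmatically before tokenization. This is a minimal sketch, not the authors' preprocessing code: `clean_chain` and `format_paired_sequence` are hypothetical helpers, and treating every character outside the 20 standard amino acids as "uncommon" (and uppercasing the input) are assumptions.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard residues (assumed definition of "common")
SEP_TOKEN = "-"


def clean_chain(chain: str) -> str:
    """Remove whitespace, uppercase, and map non-standard residues to X."""
    compact = "".join(chain.split()).upper()
    return "".join(aa if aa in STANDARD_AA else "X" for aa in compact)


def format_paired_sequence(heavy: str, light: str) -> str:
    """Build the HEAVY-LIGHT input string expected by the model."""
    heavy_clean, light_clean = clean_chain(heavy), clean_chain(light)
    if not heavy_clean or not light_clean:
        raise ValueError("Both heavy and light chains are required.")
    return f"{heavy_clean}{SEP_TOKEN}{light_clean}"


print(format_paired_sequence("EVQ LVE", "diqmtqU"))  # EVQLVE-DIQMTQX
```

The returned string can be passed directly to the tokenizer in place of a hand-built `paired_sequence`.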
|
|
| ## Output |
|
|
| - **Embedding dimension:** 2560 |
| - **Sequence length:** Variable (up to ~1024 tokens including special tokens) |
| - **Format:** PyTorch tensor |
|
|
| ## Citation |
|
|
| If you use this model, please cite: |
|
|
| ```bibtex |
| @article{talaei2025preferential, |
| title={Preferential CDR masking in paired antibody language models improves binding affinity prediction}, |
| author={Talaei, Mahtab and Walker, Kenji C. and Hao, Boran and Jolley, Eliot and Jin, Yeping and Kozakov, Dima and Misasi, John and Vajda, Sandor and Paschalidis, Ioannis Ch. and Joseph-McCarthy, Diane}, |
| journal={bioRxiv}, |
| year={2025}, |
| doi={10.1101/2025.10.31.685149} |
| } |
| ``` |
|
|
| ## Contact |
|
|
| - **Maintainer:** Network Optimization & Control (NOC) Lab |
| - **Email:** mtalaei@bu.edu |
| - **GitHub:** [https://github.com/Mah-Tala/AbCDR-ESM](https://github.com/Mah-Tala/AbCDR-ESM) |
| - **Paper:** [bioRxiv preprint](https://www.biorxiv.org/content/10.1101/2025.10.31.685149) |
|
|
| ## License |
|
|
| This model is released under the MIT License. |
|
|
| ## Acknowledgments |
|
|
| - Base model: ESM2 by Meta AI |
| - Data: Observed Antibody Space (OAS) database |
|
|
| --- |
|
|
| **Note:** For private repositories, you'll need to authenticate: |
| ```bash |
| # Option 1: CLI login |
| huggingface-cli login |
| |
| # Option 2: Environment variable |
| export HF_TOKEN="your_token_here" |
| ``` |