---
license: mit
base_model:
- facebook/esm2_t36_3B_UR50D
tags:
- protein
- antibody
- biology
- CDR
- esm2
---

# Antibody ESM2 Paired Model

## Model Description

This model is a fine-tuned version of ESM2-3B for paired antibody sequences (heavy and light chains).

**Key Features:**
- Trained on paired antibody sequences
- 15% WC followed by 50% CDR fine-tuning
- Input format: Heavy-Light chains separated by "-"
- Output: 2560-dimensional embeddings
- Optimized for antibody CDR region understanding

### Preprocessing

Sequences were:
1. Combined as: HEAVY-LIGHT (with "-" separator)
2. Tokenized with the ESM2 tokenizer
3. Annotated with CDR regions for masking

## Usage

### Loading the Model

```python
from transformers import EsmModel, AutoTokenizer
import torch

# Load model and tokenizer
model = EsmModel.from_pretrained("NOC-Lab/AbCDR-ESM2")
tokenizer = AutoTokenizer.from_pretrained("NOC-Lab/AbCDR-ESM2")
model.eval()
```

### Extract Embeddings

```python
# Uses `model`, `tokenizer`, and `torch` from the loading step above.

# Prepare paired sequence
SEP_TOKEN = "-"
heavy_chain = (
    "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVAVISYDGSNKYYADSVKGRF"
    "TISADTSKNTAYLQMNSLRAEDTAVYYCAREGYYGSSYWYFDYWGQGTLVTVSS"
)
light_chain = (
    "DIQMTQSPSSLSASVGDRVTITCRASQSISSYLNWYQQKPGKAPKLLIYAASSLQSGVPSRFSGSGS"
    "GTDFTLTISSLQPEDFATYYCQQSYSTPLTFGGGTKVEIK"
)
paired_sequence = f"{heavy_chain}{SEP_TOKEN}{light_chain}"

# Tokenize
inputs = tokenizer(paired_sequence, return_tensors="pt", add_special_tokens=True)

# Extract per-token embeddings
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state

# Mean pooling over tokens, weighted by the attention mask
mask = inputs["attention_mask"].unsqueeze(-1)
pooled = (embeddings * mask).sum(1) / mask.sum(1)

print(f"Embedding shape: {pooled.shape}")  # (1, 2560)
```

## Input Format

**Required Format:** `HEAVY_CHAIN-LIGHT_CHAIN`

- Heavy and light chains must be separated by a hyphen (`-`)
- Use standard single-letter amino acid codes
- No spaces in the sequence
- Uncommon residues should be replaced with X

**Example:**

```python
sequence = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMS...-DIQMTQSPSSLSASVGDRVTITCRASQSISS..."
```

## Output

- **Embedding dimension:** 2560
- **Sequence length:** Variable (up to ~1024 tokens including special tokens)
- **Format:** PyTorch tensor

## Citation

If you use this model, please cite:

```bibtex
@article{talaei2025preferential,
  title={Preferential CDR masking in paired antibody language models improves binding affinity prediction},
  author={Talaei, Mahtab and Walker, Kenji C. and Hao, Boran and Jolley, Eliot and Jin, Yeping and Kozakov, Dima and Misasi, John and Vajda, Sandor and Paschalidis, Ioannis Ch. and Joseph-McCarthy, Diane},
  journal={bioRxiv},
  year={2025},
  doi={10.1101/2025.10.31.685149}
}
```

## Contact

- **Maintainer:** Network Optimization & Control (NOC) Lab
- **Email:** mtalaei@bu.edu
- **GitHub:** [https://github.com/Mah-Tala/AbCDR-ESM](https://github.com/Mah-Tala/AbCDR-ESM)
- **Paper:** [bioRxiv preprint](https://www.biorxiv.org/content/10.1101/2025.10.31.685149)

## License

This model is released under the MIT License.

## Acknowledgments

- Base model: ESM2 by Meta AI
- Data: OAS database

---

**Note:** For private repositories, you'll need to authenticate:

```bash
# Option 1: CLI login
huggingface-cli login

# Option 2: Environment variable
export HF_TOKEN="your_token_here"
```