---
license: mit
base_model:
- biohub/esmc-600m-2024-12
tags:
- protein
- antibody
- esmc
- biology
- CDR
---

# AbCDR-ESMC: Antibody ESMC Paired Model

## Model Description

This model is a fine-tuned version of ESMC-600M (ESM Cambrian) for paired antibody sequences (heavy and light chains).

**Key Features:**
- Trained on paired antibody sequences
- 50% CDR fine-tuning
- Input format: Heavy-Light chains separated by "-"
- Output: 1152-dimensional embeddings
- Optimized for antibody CDR region understanding

### Preprocessing

Sequences were preprocessed as follows:
1. Heavy and light chains were combined as HEAVY-LIGHT (with "-" separator)
2. Uncommon amino acids were replaced with X
3. Sequences were tokenized with the ESMC tokenizer
4. CDR regions were annotated for masking

## Installation & Requirements

```bash
pip install torch
pip install safetensors
pip install huggingface_hub
pip install esm==3.1.4
```

## Usage

### Loading the Model

```python
import os

import torch
from huggingface_hub import hf_hub_download
from esm.tokenization import get_esmc_model_tokenizers
from esm.models.esmc import ESMC
from safetensors import safe_open

# Configuration
REPO_ID = "NOC-Lab/AbCDR-ESMC"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load tokenizer and base model
tokenizer = get_esmc_model_tokenizers()
model = ESMC.from_pretrained("esmc_600m").to(device)

# Download fine-tuned weights
local_ckpt_path = hf_hub_download(
    repo_id=REPO_ID,
    filename="model.safetensors",
    token=os.getenv("HF_TOKEN", None),  # For private repos
)

# Load the fine-tuned state dict
original_state_dict = {}
with safe_open(local_ckpt_path, framework="pt") as sf:
    for key in sf.keys():
        original_state_dict[key] = sf.get_tensor(key)

# Remove "esmC_model." prefix
renamed_state_dict = {}
for key, value in original_state_dict.items():
    new_key = key.replace("esmC_model.", "") if key.startswith("esmC_model.") else key
    renamed_state_dict[new_key] = value

# Load weights (strict=False ignores missing or unexpected keys)
model.load_state_dict(renamed_state_dict, strict=False)
model.eval()
```

### Extract Embeddings - Method 1 (High-Level API)

```python
from esm.sdk.api import ESMProtein, LogitsConfig

SEP_TOKEN = "-"

# Example sequences
heavy_chain = (
    "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVAVISYDGSNKYYADSVKGRF"
    "TISADTSKNTAYLQMNSLRAEDTAVYYCAREGYYGSSYWYFDYWGQGTLVTVSS"
)
light_chain = (
    "DIQMTQSPSSLSASVGDRVTITCRASQSISSYLNWYQQKPGKAPKLLIYAASSLQSGVPSRFSGSGS"
    "GTDFTLTISSLQPEDFATYYCQQSYSTPLTFGGGTKVEIK"
)

# Combine with separator
paired_sequence = f"{heavy_chain}{SEP_TOKEN}{light_chain}"

# Create protein object and encode
protein = ESMProtein(sequence=paired_sequence)
protein_tensor = model.encode(protein)

# Get embeddings
logits_output = model.logits(
    protein_tensor,
    LogitsConfig(sequence=True, return_embeddings=True),
)

embeddings = logits_output.embeddings  # Shape: (1, seq_len, 1152)
logits = logits_output.logits.sequence  # Shape: (1, seq_len, 64)

print(f"Embeddings shape: {embeddings.shape}")  # (1, L, 1152)
print(f"Embeddings dtype: {embeddings.dtype}")  # float32
```

### Extract Embeddings - Method 2 (Low-Level Direct)

```python
# Tokenize sequence
seq_encoded = tokenizer(paired_sequence, return_tensors="pt")
seq_input_ids = seq_encoded["input_ids"].to(device)

# Forward pass
with torch.no_grad():
    outputs = model(sequence_tokens=seq_input_ids)

embeddings_direct = outputs.embeddings  # Shape: (1, seq_len, 1152)
logits_direct = outputs.sequence_logits  # Shape: (1, seq_len, 64)

print(f"Embeddings shape: {embeddings_direct.shape}")  # (1, L, 1152)
print(f"Embeddings dtype: {embeddings_direct.dtype}")  # bfloat16
```

### Mean Pooling for Fixed-Size Representation

```python
# Mean pooling over sequence length
# (averages every token position, including the special start/end tokens)
sequence_representation = embeddings_direct.mean(dim=1)  # (1, 1152)
print(f"Pooled embedding shape: {sequence_representation.shape}")

# Get interface embedding (at separator position).
# The tokenizer prepends a start token, so "-" sits at index len(heavy_chain) + 1.
separator_pos = len(heavy_chain) + 1
interface_embedding = embeddings_direct[0, separator_pos, :]  # (1152,)
```

### Batch Processing

```python
# Multiple sequences
sequences = [
    f"{heavy_chain}{SEP_TOKEN}{light_chain}",
    f"{heavy_chain[:100]}{SEP_TOKEN}{light_chain[:100]}",
]

# Tokenize with padding
batch_encoded = tokenizer(sequences, return_tensors="pt", padding=True)
batch_input_ids = batch_encoded["input_ids"].to(device)

# Forward pass
with torch.no_grad():
    batch_outputs = model(sequence_tokens=batch_input_ids)

# Shorter sequences are padded, so padded positions are present in the output
batch_embeddings = batch_outputs.embeddings  # (batch_size, max_seq_len, 1152)
print(f"Batch embeddings shape: {batch_embeddings.shape}")
```

## Input Format

**Required Format:** `HEAVY_CHAIN-LIGHT_CHAIN`

- Heavy and light chains must be separated by a hyphen (`-`)
- Use standard single-letter amino acid codes
- No spaces in the sequence

**Example:**

```python
sequence = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMS...-DIQMTQSPSSLSASVGDRVTITCRASQSISS..."
```

## Output

### Embeddings
- **Dimension:** 1152 (ESMC hidden size)
- **Sequence length:** Variable (up to model's max length)
- **Format:** PyTorch tensor
- **Dtype:**
  - High-level API: float32
  - Low-level API: bfloat16

### Logits
- **Dimension:** 64 (ESMC vocabulary size)
- **Format:** PyTorch tensor
- **Dtype:** bfloat16

## Citation

If you use this model, please cite:

```bibtex
@article{talaei2025preferential,
  title={Preferential CDR masking in paired antibody language models improves binding affinity prediction},
  author={Talaei, Mahtab and Walker, Kenji C. and Hao, Boran and Jolley, Eliot and Jin, Yeping and Kozakov, Dima and Misasi, John and Vajda, Sandor and Paschalidis, Ioannis Ch. and Joseph-McCarthy, Diane},
  journal={bioRxiv},
  year={2025},
  doi={10.1101/2025.10.31.685149}
}

@article{hayes2025simulating,
  title={Simulating 500 million years of evolution with a language model},
  author={Hayes, Thomas and Rao, Roshan and Akin, Halil and Sofroniew, Nicholas J. and Oktay, Deniz and Lin, Zeming and Verkuil, Robert and Tran, Vincent Q. and Deaton, Jonathan and Wiggert, Marius and Badkundri, Rohil and Shafkat, Irhum and Gong, Jun and Derry, Alexander and Molina, Raul S. and Thomas, Neil and Khan, Yousuf A. and Mishra, Chetan and Kim, Carolyn and Bartie, Liam J. and Nemeth, Matthew and Hsu, Patrick D. and Sercu, Tom and Candido, Salvatore and Rives, Alexander},
  journal={Science},
  volume={387},
  number={6736},
  pages={850--858},
  year={2025},
  doi={10.1126/science.ads0018}
}

@misc{esm2024cambrian,
  author={{ESM Team}},
  title={ESM Cambrian: Revealing the mysteries of proteins with unsupervised learning},
  year={2024},
  publisher={EvolutionaryScale},
  url={https://evolutionaryscale.ai/blog/esm-cambrian}
}
```

## Contact

- **Maintainer:** Network Optimization & Control (NOC) Lab
- **Email:** mtalaei@bu.edu
- **GitHub:** [https://github.com/Mah-Tala/AbCDR-ESM](https://github.com/Mah-Tala/AbCDR-ESM)
- **Paper:** [bioRxiv preprint](https://www.biorxiv.org/content/10.1101/2025.10.31.685149)

## License

This model is released under the MIT License.

## Acknowledgments

- Base model: ESMC (ESM Cambrian) by EvolutionaryScale
- Data: OAS database

---

**Note:** For private repositories, you'll need to authenticate:

```bash
# Option 1: CLI login
huggingface-cli login

# Option 2: Environment variable
export HF_TOKEN="your_token_here"
```
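
**Note on token positions:** The preprocessing steps and the interface-embedding lookup described above both depend on how residues map to token indices. The sketch below is a minimal, standard-library illustration and not the authors' pipeline. It assumes that "uncommon" residues means anything outside the 20 canonical amino acids, and that the tokenizer prepends exactly one special start token, so residue `i` of the heavy chain lands at token index `i + 1`. The helper names `clean_chain`, `pair_chains`, and `token_spans` are hypothetical.

```python
from typing import Dict

# Assumption: the 20 canonical residues; everything else is mapped to X.
CANONICAL_AA = frozenset("ACDEFGHIKLMNPQRSTVWY")
SEP_TOKEN = "-"


def clean_chain(chain: str) -> str:
    """Drop whitespace, upper-case, and replace non-canonical residues with X."""
    residues = "".join(chain.split()).upper()
    return "".join(aa if aa in CANONICAL_AA else "X" for aa in residues)


def pair_chains(heavy: str, light: str) -> str:
    """Build the HEAVY-LIGHT input string described in the Input Format section."""
    return f"{clean_chain(heavy)}{SEP_TOKEN}{clean_chain(light)}"


def token_spans(heavy: str, light: str, n_prefix_tokens: int = 1) -> Dict[str, range]:
    """Map each part of the paired input to token indices.

    n_prefix_tokens is the number of special tokens assumed to precede
    the first residue (1 for a single start token).
    """
    heavy_len = len(clean_chain(heavy))
    light_len = len(clean_chain(light))
    heavy_start = n_prefix_tokens
    sep_index = heavy_start + heavy_len
    light_start = sep_index + 1
    return {
        "heavy": range(heavy_start, sep_index),
        "separator": range(sep_index, sep_index + 1),
        "light": range(light_start, light_start + light_len),
    }


# Toy chains: lower-case input, a non-canonical residue (U), and a stray space.
paired = pair_chains("evqlu", "DIQ M")
spans = token_spans("evqlu", "DIQ M")
print(paired)                    # EVQLX-DIQM
print(spans["separator"].start)  # 6
```

Under these assumptions, `spans["separator"].start` is the index to use for the interface embedding, and slicing the embeddings from `spans["heavy"].start` to `spans["light"].stop` would restrict pooling to residue and separator positions rather than special or padding tokens.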