AbCDR-ESMC: Antibody ESMC Paired Model
Model Description
This model is a fine-tuned version of ESMC-600M (ESM Cambrian) for paired antibody sequences (heavy and light chains).
Key Features:
- Trained on paired antibody sequences
- 50% CDR fine-tuning
- Input format: Heavy-Light chains separated by "-"
- Output: 1152-dimensional embeddings
- Optimized for antibody CDR region understanding
Preprocessing
Sequences were:
- Combined as: HEAVY-LIGHT (with "-" separator)
- Uncommon amino acids replaced with X
- Tokenized with ESMC tokenizer
- CDR regions annotated for masking
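The combination and residue-replacement steps above can be sketched as follows (the `preprocess_pair` helper and the exact set of standard residues are illustrative, not the released training code):

```python
SEP_TOKEN = "-"
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard single-letter codes

def preprocess_pair(heavy: str, light: str) -> str:
    """Combine heavy and light chains as HEAVY-LIGHT, replacing uncommon residues with X."""
    def clean(seq: str) -> str:
        seq = seq.strip().upper().replace(" ", "")
        return "".join(aa if aa in STANDARD_AA else "X" for aa in seq)
    return f"{clean(heavy)}{SEP_TOKEN}{clean(light)}"

print(preprocess_pair("EVQLV ESBGG", "diqmtq"))  # -> EVQLVESXGG-DIQMTQ
```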
Installation & Requirements
```shell
pip install torch
pip install safetensors
pip install huggingface_hub
pip install esm==3.1.4
```
Usage
Loading the Model
```python
import os

import torch
from huggingface_hub import hf_hub_download
from esm.tokenization import get_esmc_model_tokenizers
from esm.models.esmc import ESMC
from safetensors import safe_open

# Configuration
REPO_ID = "MahTala/AbCDR-ESMC"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load tokenizer and base model
tokenizer = get_esmc_model_tokenizers()
model = ESMC.from_pretrained("esmc_600m").to(device)

# Download fine-tuned weights
local_ckpt_path = hf_hub_download(
    repo_id=REPO_ID,
    filename="model.safetensors",
    token=os.getenv("HF_TOKEN", None),  # Only needed for private repos
)

# Load the state dict, stripping the "esmC_model." prefix so keys match ESMC
renamed_state_dict = {}
with safe_open(local_ckpt_path, framework="pt") as sf:
    for key in sf.keys():
        new_key = key.removeprefix("esmC_model.")
        renamed_state_dict[new_key] = sf.get_tensor(key)

# Load weights
model.load_state_dict(renamed_state_dict, strict=False)
model.eval()
```
Extract Embeddings - Method 1 (High-Level API)
```python
from esm.sdk.api import ESMProtein, LogitsConfig

SEP_TOKEN = "-"

# Example sequences
heavy_chain = (
    "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVAVISYDGSNKYYADSVKGRF"
    "TISADTSKNTAYLQMNSLRAEDTAVYYCAREGYYGSSYWYFDYWGQGTLVTVSS"
)
light_chain = (
    "DIQMTQSPSSLSASVGDRVTITCRASQSISSYLNWYQQKPGKAPKLLIYAASSLQSGVPSRFSGSGS"
    "GTDFTLTISSLQPEDFATYYCQQSYSTPLTFGGGTKVEIK"
)

# Combine with separator
paired_sequence = f"{heavy_chain}{SEP_TOKEN}{light_chain}"

# Create protein object and encode
protein = ESMProtein(sequence=paired_sequence)
protein_tensor = model.encode(protein)

# Get embeddings and per-token logits
logits_output = model.logits(
    protein_tensor,
    LogitsConfig(sequence=True, return_embeddings=True),
)
embeddings = logits_output.embeddings   # Shape: (1, seq_len, 1152)
logits = logits_output.logits.sequence  # Shape: (1, seq_len, 64)

print(f"Embeddings shape: {embeddings.shape}")  # (1, L, 1152)
print(f"Embeddings dtype: {embeddings.dtype}")  # float32
```
Extract Embeddings - Method 2 (Low-Level Direct)
```python
# Tokenize sequence
seq_encoded = tokenizer(paired_sequence, return_tensors="pt")
seq_input_ids = seq_encoded["input_ids"].to(device)

# Forward pass
with torch.no_grad():
    outputs = model(sequence_tokens=seq_input_ids)

embeddings_direct = outputs.embeddings   # Shape: (1, seq_len, 1152)
logits_direct = outputs.sequence_logits  # Shape: (1, seq_len, 64)

print(f"Embeddings shape: {embeddings_direct.shape}")  # (1, L, 1152)
print(f"Embeddings dtype: {embeddings_direct.dtype}")  # bfloat16
```
Mean Pooling for Fixed-Size Representation
```python
# Mean pooling over sequence length
sequence_representation = embeddings_direct.mean(dim=1)  # (1, 1152)
print(f"Pooled embedding shape: {sequence_representation.shape}")

# Get interface embedding (at the separator position;
# +1 accounts for the BOS token the tokenizer prepends)
separator_pos = len(heavy_chain) + 1
interface_embedding = embeddings_direct[0, separator_pos, :]  # (1152,)
```
Batch Processing
```python
# Multiple sequences
sequences = [
    f"{heavy_chain}{SEP_TOKEN}{light_chain}",
    f"{heavy_chain[:100]}{SEP_TOKEN}{light_chain[:100]}",
]

# Tokenize with padding
batch_encoded = tokenizer(sequences, return_tensors="pt", padding=True)
batch_input_ids = batch_encoded["input_ids"].to(device)

# Forward pass
with torch.no_grad():
    batch_outputs = model(sequence_tokens=batch_input_ids)

batch_embeddings = batch_outputs.embeddings  # (batch_size, max_seq_len, 1152)
print(f"Batch embeddings shape: {batch_embeddings.shape}")
```
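With padded batches, a plain `.mean(dim=1)` averages over pad positions as well. A masked mean using the tokenizer's attention mask (assuming the tokenizer returns one under `"attention_mask"`, as HF-style tokenizers do) avoids this; the sketch below uses toy tensors so it runs standalone:

```python
import torch

def masked_mean(embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool over sequence length, ignoring padded positions.

    embeddings: (batch, seq_len, hidden); attention_mask: (batch, seq_len), 1 = real token.
    """
    mask = attention_mask.unsqueeze(-1).to(embeddings.dtype)  # (batch, seq_len, 1)
    summed = (embeddings * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1)                     # avoid division by zero
    return summed / counts

# Toy example: batch of 2, seq_len 4, hidden 3; second sequence has 2 pad positions
emb = torch.ones(2, 4, 3)
emb[1, 2:] = 100.0  # values at pad positions must not leak into the mean
mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])
pooled = masked_mean(emb, mask)
print(pooled.shape)  # torch.Size([2, 3])
print(pooled[1])     # tensor([1., 1., 1.]) - pads excluded
```

In the batch example above, the mask would come from `batch_encoded["attention_mask"]`.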
Input Format
Required Format: HEAVY_CHAIN-LIGHT_CHAIN
- Heavy and light chains must be separated by a hyphen (-)
- Use standard single-letter amino acid codes
- No spaces in the sequence
Example:
```python
sequence = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMS...-DIQMTQSPSSLSASVGDRVTITCRASQSISS..."
```
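A small validation helper (illustrative, not part of the released code) can catch malformed inputs before tokenization:

```python
VALID_CHARS = set("ACDEFGHIKLMNPQRSTVWYX")  # standard residues plus X

def validate_paired_sequence(seq: str) -> bool:
    """Check HEAVY-LIGHT format: exactly one hyphen, non-empty chains, valid residues."""
    if seq.count("-") != 1:
        return False
    heavy, light = seq.split("-")
    if not heavy or not light:
        return False
    return all(c in VALID_CHARS for c in heavy + light)

print(validate_paired_sequence("EVQLVESGG-DIQMTQ"))   # True
print(validate_paired_sequence("EVQLV ESGG-DIQMTQ"))  # False (contains a space)
print(validate_paired_sequence("EVQLVESGG"))          # False (no separator)
```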
Output
Embeddings
- Dimension: 1152 (ESMC hidden size)
- Sequence length: Variable (up to model's max length)
- Format: PyTorch tensor
- Dtype:
- High-level API: float32
- Low-level API: bfloat16
Logits
- Dimension: 64 (ESMC vocabulary size)
- Format: PyTorch tensor
- Dtype: bfloat16
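Since NumPy cannot represent bfloat16, outputs from the low-level API are typically cast to float32 before export; a minimal sketch (using a random tensor in place of real embeddings):

```python
import torch

# bfloat16 embeddings (as returned by the low-level API) must be cast before NumPy export
emb = torch.randn(1, 10, 1152).to(torch.bfloat16)
emb_np = emb.to(torch.float32).cpu().numpy()
print(emb_np.shape, emb_np.dtype)  # (1, 10, 1152) float32
```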
Citation
@article{Talaei2025.10.31.685149,
author = {Talaei, Mahtab and Walker, Kenji C. and Hao, Boran and Jolley, Eliot and Jin, Yeping and Kozakov, Dima and Misasi, John and Vajda, Sandor and Paschalidis, Ioannis Ch. and Joseph-McCarthy, Diane},
title = {CDR-aware masked language models for paired antibodies enable state-of-the-art binding prediction},
year = {2025},
doi = {10.1101/2025.10.31.685149},
eprint = {https://www.biorxiv.org/content/early/2025/10/31/2025.10.31.685149.full.pdf},
journal = {bioRxiv}
}
@article{hayes2024simulating,
title={Simulating 500 million years of evolution with a language model},
author={Hayes, Thomas and Rao, Roshan and Akin, Halil and Sofroniew, Nicholas J and Oktay, Deniz and Lin, Zeming and Verkuil, Robert and Tran, Vincent Q and Deaton, Jonathan and Wiggert, Marius and others},
journal={bioRxiv},
year={2024}
}
Model Card Authors
Mahtab Talaei
Contact
- Maintainer: Network Optimization & Control (NOC) Lab
- Email: mtalaei@bu.edu
- GitHub: https://github.com/Mah-Tala/AbCDR-ESM
- Paper: bioRxiv preprint
License
This model is released under the MIT License.
Acknowledgments
- Base model: ESMC (ESM Cambrian) by EvolutionaryScale
- Data: OAS database
Note: For private repositories, you'll need to authenticate:
```shell
# Option 1: CLI login
huggingface-cli login

# Option 2: Environment variable
export HF_TOKEN="your_token_here"
```