
AbCDR-ESMC: Antibody ESMC Paired Model

Model Description

This model is a fine-tuned version of ESMC-600M (ESM Cambrian) for paired antibody sequences (heavy and light chains).

Key Features:

  • Trained on paired antibody sequences
  • CDR-focused masking (50%) during fine-tuning
  • Input format: Heavy-Light chains separated by "-"
  • Output: 1152-dimensional embeddings
  • Optimized for antibody CDR region understanding

Preprocessing

Sequences were:

  1. Combined as: HEAVY-LIGHT (with "-" separator)
  2. Uncommon amino acids replaced with X
  3. Tokenized with ESMC tokenizer
  4. CDR regions annotated for masking
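Steps 1 and 2 above can be sketched as follows. This is a minimal illustration; `STANDARD_AA` and `preprocess_pair` are hypothetical helper names, not part of the released code:

```python
# The 20 standard single-letter amino acid codes
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

def preprocess_pair(heavy: str, light: str) -> str:
    """Combine heavy and light chains with '-' and replace uncommon residues with X."""
    paired = f"{heavy}-{light}"
    return "".join(
        c if c in STANDARD_AA or c == "-" else "X"
        for c in paired.upper()
    )

print(preprocess_pair("EVQLB", "DIQM"))  # -> EVQLX-DIQM
```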

Installation & Requirements

pip install torch
pip install safetensors
pip install huggingface_hub
pip install esm==3.1.4

Usage

Loading the Model

import os
import torch
from huggingface_hub import hf_hub_download
from esm.tokenization import get_esmc_model_tokenizers
from esm.models.esmc import ESMC
from safetensors import safe_open

# Configuration
REPO_ID = "MahTala/AbCDR-ESMC"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load tokenizer and base model
tokenizer = get_esmc_model_tokenizers()
model = ESMC.from_pretrained("esmc_600m").to(device)

# Download fine-tuned weights
local_ckpt_path = hf_hub_download(
    repo_id=REPO_ID,
    filename="model.safetensors",
    token=os.getenv("HF_TOKEN", None)  # For private repos
)

# Load and rename state dict
original_state_dict = {}
with safe_open(local_ckpt_path, framework="pt") as sf:
    for key in sf.keys():
        original_state_dict[key] = sf.get_tensor(key)

# Remove "esmC_model." prefix
renamed_state_dict = {}
for key, value in original_state_dict.items():
    new_key = key.replace("esmC_model.", "") if key.startswith("esmC_model.") else key
    renamed_state_dict[new_key] = value

# Load weights
model.load_state_dict(renamed_state_dict, strict=False)
model.eval()
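Because `strict=False` silently ignores mismatched keys, it can be worth inspecting the value returned by `load_state_dict` to confirm the fine-tuned weights actually matched. A toy illustration with a small module (not the actual model):

```python
import torch.nn as nn

# load_state_dict(strict=False) returns a named tuple listing which keys
# were missing from the supplied state dict and which were unexpected.
model_toy = nn.Linear(4, 4)
state = {
    "weight": model_toy.weight.detach().clone(),  # matches
    "extra.key": model_toy.bias.detach().clone(), # does not exist in the module
}
result = model_toy.load_state_dict(state, strict=False)
print(result.missing_keys)     # ['bias']
print(result.unexpected_keys)  # ['extra.key']
```

A long `missing_keys` list after loading the checkpoint would indicate the prefix renaming did not line up with the model's parameter names.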

Extract Embeddings - Method 1 (High-Level API)

from esm.sdk.api import ESMProtein, LogitsConfig

SEP_TOKEN = "-"

# Example sequences
heavy_chain = (
    "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVAVISYDGSNKYYADSVKGRF"
    "TISADTSKNTAYLQMNSLRAEDTAVYYCAREGYYGSSYWYFDYWGQGTLVTVSS"
)
light_chain = (
    "DIQMTQSPSSLSASVGDRVTITCRASQSISSYLNWYQQKPGKAPKLLIYAASSLQSGVPSRFSGSGS"
    "GTDFTLTISSLQPEDFATYYCQQSYSTPLTFGGGTKVEIK"
)

# Combine with separator
paired_sequence = f"{heavy_chain}{SEP_TOKEN}{light_chain}"

# Create protein object and encode
protein = ESMProtein(sequence=paired_sequence)
protein_tensor = model.encode(protein)

# Get embeddings
logits_output = model.logits(
    protein_tensor,
    LogitsConfig(sequence=True, return_embeddings=True)
)

embeddings = logits_output.embeddings  # Shape: (1, seq_len, 1152)
logits = logits_output.logits.sequence  # Shape: (1, seq_len, 64)

print(f"Embeddings shape: {embeddings.shape}")  # (1, L, 1152)
print(f"Embeddings dtype: {embeddings.dtype}")  # float32

Extract Embeddings - Method 2 (Low-Level Direct)

# Tokenize sequence
seq_encoded = tokenizer(paired_sequence, return_tensors="pt")
seq_input_ids = seq_encoded["input_ids"].to(device)

# Forward pass
with torch.no_grad():
    outputs = model(sequence_tokens=seq_input_ids)

embeddings_direct = outputs.embeddings  # Shape: (1, seq_len, 1152)
logits_direct = outputs.sequence_logits  # Shape: (1, seq_len, 64)

print(f"Embeddings shape: {embeddings_direct.shape}")  # (1, L, 1152)
print(f"Embeddings dtype: {embeddings_direct.dtype}")  # bfloat16

Mean Pooling for Fixed-Size Representation

# Mean pooling over sequence length
sequence_representation = embeddings_direct.mean(dim=1)  # (1, 1152)
print(f"Pooled embedding shape: {sequence_representation.shape}")

# Get interface embedding (at separator position)
# Note: the tokenizer prepends a BOS/CLS token, so offset the index by 1
separator_pos = len(heavy_chain) + 1
interface_embedding = embeddings_direct[0, separator_pos, :]  # (1152,)

Batch Processing

# Multiple sequences
sequences = [
    f"{heavy_chain}{SEP_TOKEN}{light_chain}",
    f"{heavy_chain[:100]}{SEP_TOKEN}{light_chain[:100]}",
]

# Tokenize with padding
batch_encoded = tokenizer(sequences, return_tensors="pt", padding=True)
batch_input_ids = batch_encoded["input_ids"].to(device)

# Forward pass
with torch.no_grad():
    batch_outputs = model(sequence_tokens=batch_input_ids)

batch_embeddings = batch_outputs.embeddings  # (batch_size, max_seq_len, 1152)
print(f"Batch embeddings shape: {batch_embeddings.shape}")
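Note that a plain `.mean(dim=1)` over a padded batch also averages over padding positions. A padding-aware variant can use the `attention_mask` that HF-style tokenizers typically return alongside `input_ids` (a sketch; shown here with dummy tensors and a reduced embedding dimension):

```python
import torch

def masked_mean_pool(embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool over real tokens only, excluding padding positions."""
    mask = attention_mask.unsqueeze(-1).to(embeddings.dtype)  # (B, L, 1)
    summed = (embeddings * mask).sum(dim=1)                   # (B, D)
    counts = mask.sum(dim=1).clamp(min=1)                     # (B, 1)
    return summed / counts

# Toy check with dummy embeddings (D=8 instead of 1152)
emb = torch.randn(2, 5, 8)
am = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
pooled = masked_mean_pool(emb, am)
print(pooled.shape)  # torch.Size([2, 8])
```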

Input Format

Required Format: HEAVY_CHAIN-LIGHT_CHAIN

  • Heavy and light chains must be separated by a single hyphen (-)
  • Use standard single-letter amino acid codes (replace uncommon residues with X)
  • No spaces or other whitespace in the sequence

Example:

sequence = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMS...-DIQMTQSPSSLSASVGDRVTITCRASQSISS..."
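The format rules above can be checked before encoding. A minimal validation sketch (`is_valid_paired_sequence` is a hypothetical helper, not part of the released code):

```python
import re

# Exactly one hyphen, standard codes plus X on both sides, no whitespace
PAIRED_RE = re.compile(r"^[ACDEFGHIKLMNPQRSTVWYX]+-[ACDEFGHIKLMNPQRSTVWYX]+$")

def is_valid_paired_sequence(seq: str) -> bool:
    """Return True if seq follows the HEAVY-LIGHT input format."""
    return PAIRED_RE.fullmatch(seq) is not None

print(is_valid_paired_sequence("EVQLV-DIQMT"))   # True
print(is_valid_paired_sequence("EVQLV DIQMT"))   # False (space instead of hyphen)
print(is_valid_paired_sequence("EVQLV-DI-QMT"))  # False (two hyphens)
```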

Output

Embeddings

  • Dimension: 1152 (ESMC hidden size)
  • Sequence length: Variable (up to model's max length)
  • Format: PyTorch tensor
  • Dtype:
    • High-level API: float32
    • Low-level API: bfloat16

Logits

  • Dimension: 64 (ESMC vocabulary size)
  • Format: PyTorch tensor
  • Dtype: bfloat16
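Since the low-level path yields bfloat16 tensors, convert to float32 before calling `.numpy()`; NumPy has no bfloat16 dtype and the direct call raises a TypeError. A quick sketch with a stand-in tensor:

```python
import torch

# Stand-in for the model's bfloat16 embedding output
emb_bf16 = torch.randn(1, 10, 1152).to(torch.bfloat16)

# Upcast to float32 (and move to CPU) before converting to NumPy
emb_np = emb_bf16.to(torch.float32).cpu().numpy()
print(emb_np.dtype)  # float32
```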

Citation

@article{Talaei2025.10.31.685149,
    author = {Talaei, Mahtab and Walker, Kenji C. and Hao, Boran and Jolley, Eliot and Jin, Yeping and Kozakov, Dima and Misasi, John and Vajda, Sandor and Paschalidis, Ioannis Ch. and Joseph-McCarthy, Diane},
    title = {CDR-aware masked language models for paired antibodies enable state-of-the-art binding prediction},
    year = {2025},
    doi = {10.1101/2025.10.31.685149},
    eprint = {https://www.biorxiv.org/content/early/2025/10/31/2025.10.31.685149.full.pdf},
    journal = {bioRxiv}
}

@article{hayes2024simulating,
    title={Simulating 500 million years of evolution with a language model},
    author={Hayes, Thomas and Rao, Roshan and Akin, Halil and Sofroniew, Nicholas J and Oktay, Deniz and Lin, Zeming and Verkuil, Robert and Tran, Vincent Q and Deaton, Jonathan and Wiggert, Marius and others},
    journal={bioRxiv},
    year={2024}
}

Model Card Authors

Mahtab Talaei

Contact

License

This model is released under the MIT License.

Acknowledgments

  • Base model: ESMC (ESM Cambrian) by EvolutionaryScale
  • Data: OAS database

Note: For private repositories, you'll need to authenticate:

# Option 1: CLI login
huggingface-cli login

# Option 2: Environment variable
export HF_TOKEN="your_token_here"