---
license: mit
base_model:
  - esmc_600m
tags:
  - protein
  - antibody
  - esmc
  - biology
  - CDR
---

# AbCDR-ESMC: Antibody ESMC Paired Model

## Model Description

This model is a fine-tuned version of ESMC-600M (ESM Cambrian) for paired antibody sequences (heavy and light chains).

**Key Features:**

- Trained on paired antibody sequences
- Fine-tuned with 50% of masking focused on CDR regions
- Input format: heavy and light chains separated by `-`
- Output: 1152-dimensional embeddings
- Optimized for antibody CDR region understanding

## Preprocessing

Sequences were prepared as follows (sketched in code below):

1. Combined as `HEAVY-LIGHT` (with `-` separator)
2. Uncommon amino acids replaced with `X`
3. Tokenized with the ESMC tokenizer
4. CDR regions annotated for masking
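
A minimal sketch of steps 1–2, assuming the standard 20-letter amino acid alphabet (the `preprocess_pair` helper is illustrative, not part of this repo):

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

def preprocess_pair(heavy: str, light: str) -> str:
    """Combine heavy/light chains and replace uncommon residues with X."""
    paired = f"{heavy}-{light}"
    # Keep the "-" separator; map anything outside the standard alphabet
    # (e.g. B, J, O, U, Z) to X.
    return "".join(c if c in STANDARD_AA or c == "-" else "X" for c in paired)
```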

## Installation & Requirements

```bash
pip install torch
pip install safetensors
pip install huggingface_hub
pip install esm==3.1.4
```

## Usage

### Loading the Model

```python
import os
import torch
from huggingface_hub import hf_hub_download
from esm.tokenization import get_esmc_model_tokenizers
from esm.models.esmc import ESMC
from safetensors import safe_open

# Configuration
REPO_ID = "MahTala/AbCDR-ESMC"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load tokenizer and base model
tokenizer = get_esmc_model_tokenizers()
model = ESMC.from_pretrained("esmc_600m").to(device)

# Download fine-tuned weights
local_ckpt_path = hf_hub_download(
    repo_id=REPO_ID,
    filename="model.safetensors",
    token=os.getenv("HF_TOKEN", None)  # For private repos
)

# Load and rename state dict
original_state_dict = {}
with safe_open(local_ckpt_path, framework="pt") as sf:
    for key in sf.keys():
        original_state_dict[key] = sf.get_tensor(key)

# Remove "esmC_model." prefix
renamed_state_dict = {}
for key, value in original_state_dict.items():
    new_key = key.replace("esmC_model.", "") if key.startswith("esmC_model.") else key
    renamed_state_dict[new_key] = value

# Load weights
model.load_state_dict(renamed_state_dict, strict=False)
model.eval()
```
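
With `strict=False`, key mismatches load silently; an optional sanity check, using the named tuple `load_state_dict` returns:

```python
# load_state_dict returns (missing_keys, unexpected_keys); both should be small.
result = model.load_state_dict(renamed_state_dict, strict=False)
print(f"Missing keys: {len(result.missing_keys)}")
print(f"Unexpected keys: {len(result.unexpected_keys)}")
```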

### Extract Embeddings - Method 1 (High-Level API)

```python
from esm.sdk.api import ESMProtein, LogitsConfig

SEP_TOKEN = "-"

# Example sequences
heavy_chain = (
    "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVAVISYDGSNKYYADSVKGRF"
    "TISADTSKNTAYLQMNSLRAEDTAVYYCAREGYYGSSYWYFDYWGQGTLVTVSS"
)
light_chain = (
    "DIQMTQSPSSLSASVGDRVTITCRASQSISSYLNWYQQKPGKAPKLLIYAASSLQSGVPSRFSGSGS"
    "GTDFTLTISSLQPEDFATYYCQQSYSTPLTFGGGTKVEIK"
)

# Combine with separator
paired_sequence = f"{heavy_chain}{SEP_TOKEN}{light_chain}"

# Create protein object and encode
protein = ESMProtein(sequence=paired_sequence)
protein_tensor = model.encode(protein)

# Get embeddings
logits_output = model.logits(
    protein_tensor,
    LogitsConfig(sequence=True, return_embeddings=True)
)

embeddings = logits_output.embeddings  # Shape: (1, seq_len, 1152)
logits = logits_output.logits.sequence  # Shape: (1, seq_len, 64)

print(f"Embeddings shape: {embeddings.shape}")  # (1, L, 1152)
print(f"Embeddings dtype: {embeddings.dtype}")  # float32

### Extract Embeddings - Method 2 (Low-Level Direct)

```python
# Tokenize sequence
seq_encoded = tokenizer(paired_sequence, return_tensors="pt")
seq_input_ids = seq_encoded["input_ids"].to(device)

# Forward pass
with torch.no_grad():
    outputs = model(sequence_tokens=seq_input_ids)

embeddings_direct = outputs.embeddings  # Shape: (1, seq_len, 1152)
logits_direct = outputs.sequence_logits  # Shape: (1, seq_len, 64)

print(f"Embeddings shape: {embeddings_direct.shape}")  # (1, L, 1152)
print(f"Embeddings dtype: {embeddings_direct.dtype}")  # bfloat16

### Mean Pooling for Fixed-Size Representation

```python
# Mean pooling over sequence length
sequence_representation = embeddings_direct.mean(dim=1)  # (1, 1152)
print(f"Pooled embedding shape: {sequence_representation.shape}")

# Get interface embedding (at the separator position).
# The tokenizer prepends a BOS token, so the separator sits at index
# len(heavy_chain) + 1 in token space, not len(heavy_chain).
separator_pos = len(heavy_chain) + 1
interface_embedding = embeddings_direct[0, separator_pos, :]  # (1152,)
```
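
Per-chain embeddings can be sliced around the separator; a sketch assuming BOS at index 0 and EOS at the end, as produced by the ESMC tokenizer:

```python
# Skip BOS (index 0); stop before the separator for the heavy chain.
heavy_embeddings = embeddings_direct[0, 1:separator_pos, :]       # (len(heavy), 1152)
# Start after the separator; drop the trailing EOS for the light chain.
light_embeddings = embeddings_direct[0, separator_pos + 1:-1, :]  # (len(light), 1152)
```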

### Batch Processing

```python
# Multiple sequences
sequences = [
    f"{heavy_chain}{SEP_TOKEN}{light_chain}",
    f"{heavy_chain[:100]}{SEP_TOKEN}{light_chain[:100]}",
]

# Tokenize with padding
batch_encoded = tokenizer(sequences, return_tensors="pt", padding=True)
batch_input_ids = batch_encoded["input_ids"].to(device)

# Forward pass
with torch.no_grad():
    batch_outputs = model(sequence_tokens=batch_input_ids)

batch_embeddings = batch_outputs.embeddings  # (batch_size, max_seq_len, 1152)
print(f"Batch embeddings shape: {batch_embeddings.shape}")

## Input Format

**Required Format:** `HEAVY_CHAIN-LIGHT_CHAIN`

- Heavy and light chains must be separated by a hyphen (`-`)
- Use standard single-letter amino acid codes
- No spaces in the sequence

Example:

```python
sequence = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMS...-DIQMTQSPSSLSASVGDRVTITCRASQSISS..."
```
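
A hypothetical validator for this format (the `is_valid_pair` helper and its regex are illustrative assumptions; `X` is allowed because preprocessing introduces it):

```python
import re

# Exactly one hyphen between two runs of standard residues (plus X).
PAIR_RE = re.compile(r"^[ACDEFGHIKLMNPQRSTVWYX]+-[ACDEFGHIKLMNPQRSTVWYX]+$")

def is_valid_pair(seq: str) -> bool:
    return bool(PAIR_RE.fullmatch(seq))
```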

## Output

### Embeddings

- Dimension: 1152 (ESMC hidden size)
- Sequence length: variable (up to the model's max length)
- Format: PyTorch tensor
- Dtype: `float32` from the high-level API, `bfloat16` from the low-level API (see the cast sketch below)
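
If downstream code (e.g. NumPy or scikit-learn) expects `float32`, cast the low-level embeddings first:

```python
# bfloat16 -> float32 before moving to NumPy, which has no bfloat16 dtype.
embeddings_f32 = embeddings_direct.float()
vector = embeddings_f32[0].mean(dim=0).cpu().numpy()  # (1152,) fixed-size vector
```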

### Logits

- Dimension: 64 (ESMC vocabulary size)
- Format: PyTorch tensor
- Dtype: `bfloat16`

## Citation

```bibtex
@article{Talaei2025.10.31.685149,
    author = {Talaei, Mahtab and Walker, Kenji C. and Hao, Boran and Jolley, Eliot and Jin, Yeping and Kozakov, Dima and Misasi, John and Vajda, Sandor and Paschalidis, Ioannis Ch. and Joseph-McCarthy, Diane},
    title = {CDR-aware masked language models for paired antibodies enable state-of-the-art binding prediction},
    year = {2025},
    doi = {10.1101/2025.10.31.685149},
    eprint = {https://www.biorxiv.org/content/early/2025/10/31/2025.10.31.685149.full.pdf},
    journal = {bioRxiv}
}
```

```bibtex
@article{hayes2024simulating,
    title={Simulating 500 million years of evolution with a language model},
    author={Hayes, Thomas and Rao, Roshan and Akin, Halil and Sofroniew, Nicholas J and Oktay, Deniz and Lin, Zeming and Verkuil, Robert and Tran, Vincent Q and Deaton, Jonathan and Wiggert, Marius and others},
    journal={bioRxiv},
    year={2024}
}
```

## Model Card Authors

Mahtab Talaei

## Contact

## License

This model is released under the MIT License.

## Acknowledgments

- **Base model:** ESMC (ESM Cambrian) by EvolutionaryScale
- **Data:** OAS (Observed Antibody Space) database

**Note:** For private repositories, you'll need to authenticate:

```bash
# Option 1: CLI login
huggingface-cli login

# Option 2: Environment variable
export HF_TOKEN="your_token_here"
```
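
Or log in from Python via `huggingface_hub`'s standard `login` helper:

```python
from huggingface_hub import login

login()  # prompts for a token; or pass token="hf_..."
```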