
Antibody ESM2 Paired Model

Model Description

This model is a fine-tuned version of ESM2-3B for paired antibody sequences (heavy and light chains).

Key Features:

  • Trained on paired antibody sequences
  • Fine-tuned with 15% WC masking followed by 50% CDR masking
  • Input format: Heavy-Light chains separated by "-"
  • Output: 2560-dimensional embeddings
  • Optimized for antibody CDR region understanding

Preprocessing

Sequences were:

  1. Combined as: HEAVY-LIGHT (with "-" separator)
  2. Tokenized with ESM2 tokenizer
  3. CDR regions annotated for masking
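Step 3 can be sketched as follows, assuming CDR positions are already known (e.g. from a numbering/annotation tool); the spans below are hypothetical and for illustration only:

```python
import torch

def build_cdr_mask(seq_len: int, cdr_spans: list[tuple[int, int]]) -> torch.Tensor:
    # Boolean mask over residue positions; True = position eligible for CDR masking.
    mask = torch.zeros(seq_len, dtype=torch.bool)
    for start, end in cdr_spans:  # inclusive start, exclusive end
        mask[start:end] = True
    return mask

# Hypothetical CDR-H1/H2/H3 spans on a 120-residue heavy chain
cdr_mask = build_cdr_mask(120, [(26, 33), (51, 58), (97, 110)])
print(cdr_mask.sum().item())  # → 27 maskable positions
```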

Usage

Loading the Model

from transformers import EsmModel, AutoTokenizer
import torch

# Load model and tokenizer
model = EsmModel.from_pretrained("MahTala/AbCDR-ESM2")
tokenizer = AutoTokenizer.from_pretrained("MahTala/AbCDR-ESM2")
model.eval()

Extract Embeddings

# Prepare paired sequence
SEP_TOKEN = "-" 
heavy_chain = (
    "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVAVISYDGSNKYYADSVKGRF"
    "TISADTSKNTAYLQMNSLRAEDTAVYYCAREGYYGSSYWYFDYWGQGTLVTVSS"
)
light_chain = (
    "DIQMTQSPSSLSASVGDRVTITCRASQSISSYLNWYQQKPGKAPKLLIYAASSLQSGVPSRFSGSGS"
    "GTDFTLTISSLQPEDFATYYCQQSYSTPLTFGGGTKVEIK"
)
paired_sequence = f"{heavy_chain}{SEP_TOKEN}{light_chain}"

# Tokenize
inputs = tokenizer(paired_sequence, return_tensors="pt", add_special_tokens=True)

# Extract embeddings
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state
    
# Mean pooling
mask = inputs["attention_mask"].unsqueeze(-1)
pooled = (embeddings * mask).sum(1) / mask.sum(1)

print(f"Embedding shape: {pooled.shape}")  # (1, 2560)
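The mean-pooling step above can be factored into a reusable, padding-aware helper. A sketch with dummy tensors standing in for the model output (the helper name is illustrative, not part of the model's API):

```python
import torch

def mean_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # hidden: (batch, seq_len, dim); attention_mask: (batch, seq_len)
    # Padded positions are zeroed out before averaging.
    mask = attention_mask.unsqueeze(-1).float()
    return (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

hidden = torch.randn(2, 10, 2560)           # stand-in for last_hidden_state
attn = torch.ones(2, 10, dtype=torch.long)  # stand-in for attention_mask
attn[1, 7:] = 0                             # second sequence is padded
print(mean_pool(hidden, attn).shape)        # torch.Size([2, 2560])
```

The `clamp` guards against division by zero for all-padding rows when batching.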

Input Format

Required Format: HEAVY_CHAIN-LIGHT_CHAIN

  • Heavy and light chains must be separated by a hyphen (-)
  • Use standard single-letter amino acid codes
  • No spaces in sequence
  • Uncommon residues should be replaced with X

Example:

sequence = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMS...-DIQMTQSPSSLSASVGDRVTITCRASQSISS..."
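A small helper enforcing the format rules above (whitespace stripping and replacing uncommon residues with X); the function name is illustrative, not part of the model's API:

```python
VALID_AA = set("ACDEFGHIKLMNPQRSTVWYX")

def prepare_paired(heavy: str, light: str) -> str:
    # Strip whitespace, uppercase, and replace uncommon residues with X.
    def clean(seq: str) -> str:
        seq = "".join(seq.split()).upper()
        return "".join(c if c in VALID_AA else "X" for c in seq)
    return f"{clean(heavy)}-{clean(light)}"

print(prepare_paired("evqlv esg", "diqmtqZ"))  # → EVQLVESG-DIQMTQX
```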

Output

  • Embedding dimension: 2560
  • Sequence length: Variable (up to ~1024 tokens including special tokens)
  • Format: PyTorch tensor
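Pairs longer than the context window must be truncated or rejected before tokenization. A rough length check, assuming the ~1024-token limit and that the ESM2 tokenizer maps each residue (and the "-" separator) to roughly one token plus two special tokens:

```python
MAX_TOKENS = 1024
SPECIAL_TOKENS = 2  # <cls> and <eos> added by the ESM2 tokenizer

def fits_context(paired_sequence: str) -> bool:
    # One token per character is an approximation; verify with the
    # actual tokenizer for borderline cases.
    return len(paired_sequence) + SPECIAL_TOKENS <= MAX_TOKENS

print(fits_context("A" * 500 + "-" + "A" * 400))  # → True  (903 tokens)
print(fits_context("A" * 600 + "-" + "A" * 500))  # → False (1103 tokens)
```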

Model Card Authors

Mahtab Talaei

Contact

License

This model is released under the MIT License.

Acknowledgments

  • Base model: ESM2 by Meta AI
  • Data: OAS database

Note: For private repositories, you'll need to authenticate:

# Option 1: CLI login
huggingface-cli login

# Option 2: Environment variable
export HF_TOKEN="your_token_here"