---
license: mit
base_model:
  - facebook/esm2_t36_3B_UR50D
tags:
  - protein
  - antibody
  - esm2
  - biology
  - CDR
---

# Antibody ESM2 Paired Model

## Model Description

This model is a fine-tuned version of ESM2-3B (`facebook/esm2_t36_3B_UR50D`) for paired antibody sequences (heavy and light chains).

**Key Features:**

- Trained on paired antibody sequences
- Two-stage fine-tuning: 15% whole-chain (WC) masking followed by 50% CDR-focused masking (sketched below)
- Input format: heavy and light chains separated by `-`
- Output: 2560-dimensional embeddings
- Optimized for antibody CDR region understanding
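The training code is not released with this card, but the second-stage objective is straightforward to picture. Below is a minimal, illustrative sketch of 50% CDR-focused masking, assuming CDR positions have already been annotated as a boolean tensor; the function name and tensor shapes are our choices, not the released pipeline.

```python
import torch

def mask_cdr_tokens(input_ids, cdr_mask, mask_token_id, p=0.5):
    """Illustrative second-stage masking: mask 50% of CDR positions.

    input_ids: LongTensor of token ids, shape (batch, seq_len)
    cdr_mask:  BoolTensor of the same shape, True inside CDRs
    """
    labels = input_ids.clone()
    # Sample which CDR positions to mask (Bernoulli with probability p)
    to_mask = torch.bernoulli(torch.full(input_ids.shape, p)).bool() & cdr_mask
    masked_ids = input_ids.clone()
    masked_ids[to_mask] = mask_token_id
    labels[~to_mask] = -100  # compute MLM loss only on masked positions
    return masked_ids, labels
```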

## Preprocessing

Sequences were:

1. Combined as `HEAVY-LIGHT` (with the `-` separator)
2. Tokenized with the ESM2 tokenizer (a sketch of steps 1–2 follows this list)
3. Annotated with CDR regions for masking
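For reference, steps 1 and 2 look like the following at inference time. The `-` separator is a single token in the ESM2 vocabulary; the helper name is ours, and CDR annotation (step 3) is a training-time step omitted here.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MahTala/AbCDR-ESM2")

def preprocess_pair(heavy: str, light: str):
    """Combine a heavy/light pair with the '-' separator and tokenize."""
    paired = f"{heavy}-{light}"
    return tokenizer(paired, return_tensors="pt", add_special_tokens=True)
```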

## Usage

### Loading the Model

```python
from transformers import EsmModel, AutoTokenizer
import torch

# Load model and tokenizer
model = EsmModel.from_pretrained("MahTala/AbCDR-ESM2")
tokenizer = AutoTokenizer.from_pretrained("MahTala/AbCDR-ESM2")
model.eval()
```

### Extract Embeddings

```python
# Prepare paired sequence
SEP_TOKEN = "-"
heavy_chain = (
    "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVAVISYDGSNKYYADSVKGRF"
    "TISADTSKNTAYLQMNSLRAEDTAVYYCAREGYYGSSYWYFDYWGQGTLVTVSS"
)
light_chain = (
    "DIQMTQSPSSLSASVGDRVTITCRASQSISSYLNWYQQKPGKAPKLLIYAASSLQSGVPSRFSGSGS"
    "GTDFTLTISSLQPEDFATYYCQQSYSTPLTFGGGTKVEIK"
)
paired_sequence = f"{heavy_chain}{SEP_TOKEN}{light_chain}"

# Tokenize
inputs = tokenizer(paired_sequence, return_tensors="pt", add_special_tokens=True)

# Extract per-token embeddings
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state

# Mean pooling over non-padding tokens
mask = inputs["attention_mask"].unsqueeze(-1)
pooled = (embeddings * mask).sum(1) / mask.sum(1)

print(f"Embedding shape: {pooled.shape}")  # (1, 2560)
```

## Input Format

**Required Format:** `HEAVY_CHAIN-LIGHT_CHAIN`

- Heavy and light chains must be separated by a hyphen (`-`)
- Use standard single-letter amino acid codes
- No spaces in the sequence
- Replace uncommon residues with `X` (a small cleanup helper is sketched after the example below)

Example:

```python
sequence = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMS...-DIQMTQSPSSLSASVGDRVTITCRASQSISS..."
```

## Output

- Embedding dimension: 2560
- Sequence length: variable, up to ~1024 tokens including special tokens (see the truncation note below)
- Format: PyTorch tensor
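Inputs longer than the context window should be truncated explicitly. This is the standard `transformers` tokenizer pattern rather than anything specific to this model:

```python
inputs = tokenizer(
    paired_sequence,
    return_tensors="pt",
    add_special_tokens=True,
    truncation=True,
    max_length=1024,  # context size, including special tokens
)
```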

## Model Card Authors

Mahtab Talaei

## Contact

## License

This model is released under the MIT License.

## Acknowledgments

- Base model: [ESM2](https://huggingface.co/facebook/esm2_t36_3B_UR50D) by Meta AI
- Data: the Observed Antibody Space (OAS) database

**Note:** For private repositories, you'll need to authenticate:

```bash
# Option 1: CLI login
huggingface-cli login

# Option 2: Environment variable
export HF_TOKEN="your_token_here"
```
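You can also pass the token directly in Python; the `token` argument is standard `transformers` behavior, and the value below is a placeholder:

```python
from transformers import EsmModel

# Option 3: pass the token when loading
model = EsmModel.from_pretrained("MahTala/AbCDR-ESM2", token="your_token_here")
```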