---
license: mit
base_model:
  - facebook/esm2_t36_3B_UR50D
tags:
  - protein
  - antibody
  - esm2
  - biology
  - CDR
---

# Antibody ESM2 Paired Model

## Model Description

This model is a fine-tuned version of ESM2-3B (`facebook/esm2_t36_3B_UR50D`) for paired antibody sequences (heavy and light chains).

**Key Features:**

- Trained on paired antibody sequences
- Two-stage fine-tuning: 15% whole-chain (WC) masking followed by 50% CDR-focused masking (sketched below)
- Input format: heavy and light chains separated by `-`
- Output: 2560-dimensional embeddings
- Optimized for antibody CDR region understanding
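The training code is not released with this card, but the second-stage objective is straightforward to picture. Below is a minimal, illustrative sketch of 50% CDR-focused masking, assuming CDR positions have already been annotated as a boolean tensor; the function name and tensor shapes are our choices, not the released pipeline.

```python
import torch

def mask_cdr_tokens(input_ids, cdr_mask, mask_token_id, p=0.5):
    """Illustrative second-stage masking: mask 50% of CDR positions.

    input_ids: LongTensor of token ids, shape (batch, seq_len)
    cdr_mask:  BoolTensor of the same shape, True inside CDRs
    """
    labels = input_ids.clone()
    # Sample which CDR positions to mask (Bernoulli with probability p)
    to_mask = torch.bernoulli(torch.full(input_ids.shape, p)).bool() & cdr_mask
    masked_ids = input_ids.clone()
    masked_ids[to_mask] = mask_token_id
    labels[~to_mask] = -100  # compute MLM loss only on masked positions
    return masked_ids, labels
```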

## Preprocessing

Sequences were:

1. Combined as `HEAVY-LIGHT` (with the `-` separator)
2. Tokenized with the ESM2 tokenizer (a sketch of steps 1–2 follows this list)
3. Annotated with CDR regions for masking
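For reference, steps 1 and 2 look like the following at inference time. The `-` separator is a single token in the ESM2 vocabulary; the helper name is ours, and CDR annotation (step 3) is a training-time step omitted here.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MahTala/AbCDR-ESM2")

def preprocess_pair(heavy: str, light: str):
    """Combine a heavy/light pair with the '-' separator and tokenize."""
    paired = f"{heavy}-{light}"
    return tokenizer(paired, return_tensors="pt", add_special_tokens=True)
```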

## Usage

### Loading the Model

```python
from transformers import EsmModel, AutoTokenizer
import torch

# Load model and tokenizer
model = EsmModel.from_pretrained("MahTala/AbCDR-ESM2")
tokenizer = AutoTokenizer.from_pretrained("MahTala/AbCDR-ESM2")
model.eval()
```

### Extract Embeddings

```python
# Prepare paired sequence
SEP_TOKEN = "-"
heavy_chain = (
    "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVAVISYDGSNKYYADSVKGRF"
    "TISADTSKNTAYLQMNSLRAEDTAVYYCAREGYYGSSYWYFDYWGQGTLVTVSS"
)
light_chain = (
    "DIQMTQSPSSLSASVGDRVTITCRASQSISSYLNWYQQKPGKAPKLLIYAASSLQSGVPSRFSGSGS"
    "GTDFTLTISSLQPEDFATYYCQQSYSTPLTFGGGTKVEIK"
)
paired_sequence = f"{heavy_chain}{SEP_TOKEN}{light_chain}"

# Tokenize
inputs = tokenizer(paired_sequence, return_tensors="pt", add_special_tokens=True)

# Extract per-token embeddings
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state

# Mean pooling over non-padding tokens
mask = inputs["attention_mask"].unsqueeze(-1)
pooled = (embeddings * mask).sum(1) / mask.sum(1)

print(f"Embedding shape: {pooled.shape}")  # (1, 2560)
```

## Input Format

**Required Format:** `HEAVY_CHAIN-LIGHT_CHAIN`

- Heavy and light chains must be separated by a hyphen (`-`)
- Use standard single-letter amino acid codes
- No spaces in the sequence
- Replace uncommon residues with `X` (a small cleanup helper is sketched after the example below)

Example:

```python
sequence = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMS...-DIQMTQSPSSLSASVGDRVTITCRASQSISS..."
```

## Output

- Embedding dimension: 2560
- Sequence length: variable, up to ~1024 tokens including special tokens (see the truncation note below)
- Format: PyTorch tensor
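Inputs longer than the context window should be truncated explicitly. This is the standard `transformers` tokenizer pattern rather than anything specific to this model:

```python
inputs = tokenizer(
    paired_sequence,
    return_tensors="pt",
    add_special_tokens=True,
    truncation=True,
    max_length=1024,  # context size, including special tokens
)
```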

## Model Card Authors

Mahtab Talaei

## Contact

## License

This model is released under the MIT License.

## Acknowledgments

- Base model: [ESM2](https://huggingface.co/facebook/esm2_t36_3B_UR50D) by Meta AI
- Data: the Observed Antibody Space (OAS) database

**Note:** For private repositories, you'll need to authenticate:

```bash
# Option 1: CLI login
huggingface-cli login

# Option 2: Environment variable
export HF_TOKEN="your_token_here"
```
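You can also pass the token directly in Python; the `token` argument is standard `transformers` behavior, and the value below is a placeholder:

```python
from transformers import EsmModel

# Option 3: pass the token when loading
model = EsmModel.from_pretrained("MahTala/AbCDR-ESM2", token="your_token_here")
```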