---
license: mit
base_model:
- facebook/esm2_t36_3B_UR50D
tags:
- protein
- antibody
- biology
- CDR
- esm2
---
# Antibody ESM2 Paired Model

## Model Description

This model is ESM2-3B (`facebook/esm2_t36_3B_UR50D`) fine-tuned on paired antibody sequences (heavy and light chains).

**Key Features:**
- Trained on paired antibody sequences
- Fine-tuned in two stages: 15% WC masking, followed by 50% CDR masking
- Input format: heavy and light chains joined by a `-` separator
- Output: 2560-dimensional embeddings
- Optimized for antibody CDR region understanding

### Preprocessing

Sequences were:
1. Combined as: HEAVY-LIGHT (with "-" separator)
2. Tokenized with ESM2 tokenizer
3. CDR regions annotated for masking
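The pairing and CDR annotation steps can be sketched roughly as follows. This is an illustration only, not the training pipeline: the helper names `pair_chains` and `cdr_flags` are hypothetical, and the CDR spans are assumed to be 0-based, half-open `(start, end)` index pairs relative to each chain, produced by whatever annotation tool you use.

```python
def pair_chains(heavy: str, light: str, sep: str = "-") -> str:
    """Join heavy and light chains as HEAVY-LIGHT."""
    return f"{heavy}{sep}{light}"


def cdr_flags(heavy, light, heavy_cdrs, light_cdrs):
    """Flag CDR residues in the paired string.

    Spans are hypothetical 0-based, half-open (start, end) pairs given
    relative to each chain; the separator is never flagged.
    """
    flags = [False] * (len(heavy) + 1 + len(light))
    light_offset = len(heavy) + 1  # skip the heavy chain and the separator
    for start, end in heavy_cdrs:
        for i in range(start, end):
            flags[i] = True
    for start, end in light_cdrs:
        for i in range(light_offset + start, light_offset + end):
            flags[i] = True
    return flags


# Toy strings, not real antibody sequences
heavy, light = "AAACCCAAA", "GGGTTT"
paired = pair_chains(heavy, light)
flags = cdr_flags(heavy, light, heavy_cdrs=[(3, 6)], light_cdrs=[(3, 6)])
flagged = "".join(aa for aa, is_cdr in zip(paired, flags) if is_cdr)
print(paired)   # AAACCCAAA-GGGTTT
print(flagged)  # CCCTTT
```

Flags like these index residues in the paired string; after tokenization they would need to be shifted by one to account for the leading special token.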

## Usage

### Loading the Model

```python
from transformers import EsmModel, AutoTokenizer
import torch

# Load model and tokenizer
model = EsmModel.from_pretrained("NOC-Lab/AbCDR-ESM2")
tokenizer = AutoTokenizer.from_pretrained("NOC-Lab/AbCDR-ESM2")
model.eval()
```

### Extract Embeddings

```python
# Prepare paired sequence
SEP_TOKEN = "-"
heavy_chain = (
    "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVAVISYDGSNKYYADSVKGRF"
    "TISADTSKNTAYLQMNSLRAEDTAVYYCAREGYYGSSYWYFDYWGQGTLVTVSS"
)
light_chain = (
    "DIQMTQSPSSLSASVGDRVTITCRASQSISSYLNWYQQKPGKAPKLLIYAASSLQSGVPSRFSGSGS"
    "GTDFTLTISSLQPEDFATYYCQQSYSTPLTFGGGTKVEIK"
)
paired_sequence = f"{heavy_chain}{SEP_TOKEN}{light_chain}"

# Tokenize
inputs = tokenizer(paired_sequence, return_tensors="pt", add_special_tokens=True)

# Extract embeddings
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state  # (batch, tokens, 2560)

# Mean pooling over all non-padding tokens (special tokens included)
mask = inputs["attention_mask"].unsqueeze(-1)
pooled = (embeddings * mask).sum(1) / mask.sum(1)

print(f"Embedding shape: {pooled.shape}")  # (1, 2560)
```
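If you need per-chain representations rather than one pooled vector, the token positions of each chain can be computed from the chain lengths. The sketch below assumes the tokenizer emits one token per residue and wraps the input as `<cls> HEAVY - LIGHT <eos>`, which matches the usual ESM2 tokenizer behaviour but is worth checking on your own inputs. The helper `chain_token_slices` is a hypothetical name, not part of the released code.

```python
def chain_token_slices(heavy_len: int, light_len: int):
    """Return (heavy_slice, light_slice) over the token axis.

    Assumed layout: index 0 is <cls>, then the heavy chain, one
    separator token, the light chain, and finally <eos>.
    """
    heavy_slice = slice(1, 1 + heavy_len)
    light_start = heavy_len + 2  # skip <cls>, the heavy chain, and "-"
    light_slice = slice(light_start, light_start + light_len)
    return heavy_slice, light_slice


# Demonstration on a toy token list
tokens = ["<cls>"] + list("EVQ") + ["-"] + list("DI") + ["<eos>"]
heavy_slice, light_slice = chain_token_slices(3, 2)
print(tokens[heavy_slice])  # ['E', 'V', 'Q']
print(tokens[light_slice])  # ['D', 'I']

# With the tensors from the snippet above, the same slices would apply as:
# h, l = chain_token_slices(len(heavy_chain), len(light_chain))
# heavy_emb = embeddings[0, h]  # (len(heavy_chain), 2560)
# light_emb = embeddings[0, l]  # (len(light_chain), 2560)
```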

## Input Format

**Required Format:** `HEAVY_CHAIN-LIGHT_CHAIN`

- Heavy and light chains must be separated by a hyphen (`-`)
- Use standard single-letter amino acid codes
- The sequence must not contain spaces
- Uncommon residues should be replaced with `X`

**Example:**
```python
sequence = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMS...-DIQMTQSPSSLSASVGDRVTITCRASQSISS..."
```
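A small helper can enforce these rules before tokenization. This is a minimal sketch under stated assumptions: `clean_chain` and `format_pair` are hypothetical names, input is uppercased, and every character outside the 20 standard amino acids (including gap characters inside a chain) is mapped to `X`. Adjust this if your data needs different handling.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def clean_chain(seq: str) -> str:
    """Strip whitespace, uppercase, and map non-standard residues to X."""
    seq = "".join(seq.split()).upper()
    return "".join(aa if aa in STANDARD_AA else "X" for aa in seq)


def format_pair(heavy: str, light: str) -> str:
    """Build the HEAVY-LIGHT input string from two raw chain strings."""
    heavy, light = clean_chain(heavy), clean_chain(light)
    if not heavy or not light:
        raise ValueError("Both heavy and light chains are required")
    return f"{heavy}-{light}"


print(format_pair("evq lB", "DIU"))  # EVQLX-DIX
```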

## Output

- **Embedding dimension:** 2560
- **Sequence length:** Variable (up to ~1024 tokens including special tokens)
- **Format:** PyTorch tensor

## Citation

If you use this model, please cite:

```bibtex
@article{talaei2025preferential,
  title={Preferential CDR masking in paired antibody language models improves binding affinity prediction},
  author={Talaei, Mahtab and Walker, Kenji C. and Hao, Boran and Jolley, Eliot and Jin, Yeping and Kozakov, Dima and Misasi, John and Vajda, Sandor and Paschalidis, Ioannis Ch. and Joseph-McCarthy, Diane},
  journal={bioRxiv},
  year={2025},
  doi={10.1101/2025.10.31.685149}
}
```

## Contact

- **Maintainer:** Network Optimization & Control (NOC) Lab
- **Email:** mtalaei@bu.edu
- **GitHub:** [https://github.com/Mah-Tala/AbCDR-ESM](https://github.com/Mah-Tala/AbCDR-ESM)
- **Paper:** [bioRxiv preprint](https://www.biorxiv.org/content/10.1101/2025.10.31.685149)

## License

This model is released under the MIT License.

## Acknowledgments

- Base model: ESM2 by Meta AI
- Data: OAS database

---

**Note:** For private repositories, you'll need to authenticate:
```bash
# Option 1: CLI login
huggingface-cli login

# Option 2: Environment variable
export HF_TOKEN="your_token_here"
```