---
license: mit
base_model:
- biohub/esmc-600m-2024-12
tags:
- protein
- antibody
- esmc
- biology
- CDR
---

# AbCDR-ESMC: Antibody ESMC Paired Model

## Model Description

This model is a fine-tuned version of ESMC-600M (ESM Cambrian) for paired antibody sequences (heavy and light chains).

**Key Features:**
- Trained on paired antibody sequences
- 50% CDR fine-tuning
- Input format: Heavy-Light chains separated by "-"
- Output: 1152-dimensional embeddings
- Optimized for antibody CDR region understanding

### Preprocessing

Sequences were preprocessed as follows:
1. Heavy and light chains were combined as `HEAVY-LIGHT` (with a "-" separator)
2. Uncommon amino acids were replaced with X
3. Sequences were tokenized with the ESMC tokenizer
4. CDR regions were annotated for masking
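
The first two steps can be sketched in plain Python. This is a minimal sketch, not the authors' pipeline: the helper names `clean_chain` and `pair_chains` are hypothetical, "uncommon" is assumed to mean any character outside the 20 standard amino acids, and the upper-casing step is an added convenience.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # assumption: the 20 canonical residues
SEP_TOKEN = "-"

def clean_chain(seq: str) -> str:
    """Upper-case a chain and replace any non-standard residue with X."""
    return "".join(aa if aa in STANDARD_AA else "X" for aa in seq.strip().upper())

def pair_chains(heavy: str, light: str) -> str:
    """Join cleaned heavy and light chains as HEAVY-LIGHT."""
    return f"{clean_chain(heavy)}{SEP_TOKEN}{clean_chain(light)}"

print(pair_chains("EVQLUES", "diqmtq"))  # EVQLXES-DIQMTQ
```

Applying the same cleaning at inference time keeps inputs close to what the model saw during fine-tuning.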

## Installation & Requirements
```bash
pip install torch
pip install safetensors
pip install huggingface_hub
pip install esm==3.1.4
```

## Usage

### Loading the Model
```python
import os
import torch
from huggingface_hub import hf_hub_download
from esm.tokenization import get_esmc_model_tokenizers
from esm.models.esmc import ESMC
from safetensors import safe_open

# Configuration
REPO_ID = "NOC-Lab/AbCDR-ESMC"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load tokenizer and base model
tokenizer = get_esmc_model_tokenizers()
model = ESMC.from_pretrained("esmc_600m").to(device)

# Download fine-tuned weights
local_ckpt_path = hf_hub_download(
    repo_id=REPO_ID,
    filename="model.safetensors",
    token=os.getenv("HF_TOKEN", None)  # For private repos
)

# Load and rename state dict
original_state_dict = {}
with safe_open(local_ckpt_path, framework="pt") as sf:
    for key in sf.keys():
        original_state_dict[key] = sf.get_tensor(key)

# Remove "esmC_model." prefix
renamed_state_dict = {}
for key, value in original_state_dict.items():
    new_key = key.replace("esmC_model.", "") if key.startswith("esmC_model.") else key
    renamed_state_dict[new_key] = value

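# Load weights; strict=False tolerates key mismatches, so report any that occur
load_result = model.load_state_dict(renamed_state_dict, strict=False)
print(f"Missing keys: {load_result.missing_keys}")
print(f"Unexpected keys: {load_result.unexpected_keys}")
model.eval()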
```

### Extract Embeddings - Method 1 (High-Level API)
```python
from esm.sdk.api import ESMProtein, LogitsConfig

SEP_TOKEN = "-"

# Example sequences
heavy_chain = (
    "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVAVISYDGSNKYYADSVKGRF"
    "TISADTSKNTAYLQMNSLRAEDTAVYYCAREGYYGSSYWYFDYWGQGTLVTVSS"
)
light_chain = (
    "DIQMTQSPSSLSASVGDRVTITCRASQSISSYLNWYQQKPGKAPKLLIYAASSLQSGVPSRFSGSGS"
    "GTDFTLTISSLQPEDFATYYCQQSYSTPLTFGGGTKVEIK"
)

# Combine with separator
paired_sequence = f"{heavy_chain}{SEP_TOKEN}{light_chain}"

# Create protein object and encode
protein = ESMProtein(sequence=paired_sequence)
protein_tensor = model.encode(protein)

# Get embeddings
logits_output = model.logits(
    protein_tensor,
    LogitsConfig(sequence=True, return_embeddings=True)
)

embeddings = logits_output.embeddings  # Shape: (1, seq_len, 1152)
logits = logits_output.logits.sequence  # Shape: (1, seq_len, 64)

print(f"Embeddings shape: {embeddings.shape}")  # (1, L, 1152)
print(f"Embeddings dtype: {embeddings.dtype}")  # float32
```

### Extract Embeddings - Method 2 (Low-Level Direct)
```python
# Tokenize sequence
seq_encoded = tokenizer(paired_sequence, return_tensors="pt")
seq_input_ids = seq_encoded["input_ids"].to(device)

# Forward pass
with torch.no_grad():
    outputs = model(sequence_tokens=seq_input_ids)

embeddings_direct = outputs.embeddings  # Shape: (1, seq_len, 1152)
logits_direct = outputs.sequence_logits  # Shape: (1, seq_len, 64)

print(f"Embeddings shape: {embeddings_direct.shape}")  # (1, L, 1152)
print(f"Embeddings dtype: {embeddings_direct.dtype}")  # bfloat16 on GPU; may be float32 on CPU
```

### Mean Pooling for Fixed-Size Representation
```python
# Mean pooling over sequence length
sequence_representation = embeddings_direct.mean(dim=1)  # (1, 1152)
print(f"Pooled embedding shape: {sequence_representation.shape}")

# Get interface embedding (at separator position)
# The tokenizer prepends a <cls> token, so token indices are shifted by one
separator_pos = len(heavy_chain) + 1
interface_embedding = embeddings_direct[0, separator_pos, :]  # (1152,)
```
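
Because the tokenizer adds special tokens around the sequence, residue positions and token positions differ by an offset. The sketch below computes token-index ranges for each chain so they can be pooled separately. `chain_token_spans` is a hypothetical helper, and the single leading special token (`n_prefix=1`) is an assumption to check against your tokenizer's output.

```python
def chain_token_spans(heavy: str, light: str, n_prefix: int = 1):
    """Return (heavy_slice, separator_index, light_slice) in token coordinates.

    Assumes one token per residue, one token for the "-" separator, and
    n_prefix special tokens (e.g. <cls>) before the first residue.
    """
    heavy_start = n_prefix
    sep_index = heavy_start + len(heavy)
    light_start = sep_index + 1
    return (
        slice(heavy_start, sep_index),
        sep_index,
        slice(light_start, light_start + len(light)),
    )

heavy_span, sep_index, light_span = chain_token_spans("EVQ", "DI")
print(heavy_span, sep_index, light_span)  # slice(1, 4, None) 4 slice(5, 7, None)
```

With the tensors from the earlier examples, `embeddings_direct[0, heavy_span].mean(dim=0)` would then give a heavy-chain-only vector, and likewise for `light_span`.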

### Batch Processing
```python
# Multiple sequences
sequences = [
    f"{heavy_chain}{SEP_TOKEN}{light_chain}",
    f"{heavy_chain[:100]}{SEP_TOKEN}{light_chain[:100]}",
]

# Tokenize with padding
batch_encoded = tokenizer(sequences, return_tensors="pt", padding=True)
batch_input_ids = batch_encoded["input_ids"].to(device)

# Forward pass
with torch.no_grad():
    batch_outputs = model(sequence_tokens=batch_input_ids)

batch_embeddings = batch_outputs.embeddings  # (batch_size, max_seq_len, 1152)
print(f"Batch embeddings shape: {batch_embeddings.shape}")
```
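
When sequences in a batch have different lengths, a plain `mean(dim=1)` also averages over padding positions. A minimal sketch of padding-aware mean pooling follows. `masked_mean_pool` is a hypothetical helper, not part of the `esm` package, and it is shown on toy tensors so it runs without the model.

```python
import torch

def masked_mean_pool(embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average over real tokens only; attention_mask is 1 for tokens, 0 for padding."""
    mask = attention_mask.unsqueeze(-1).to(torch.float32)    # (B, L, 1)
    summed = (embeddings.to(torch.float32) * mask).sum(dim=1)  # (B, D)
    counts = mask.sum(dim=1).clamp(min=1)                    # (B, 1), avoids divide-by-zero
    return summed / counts

# Toy check: the second row has one padded position
emb = torch.tensor([[[1.0, 1.0], [3.0, 3.0]], [[2.0, 2.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1], [1, 0]])
pooled = masked_mean_pool(emb, mask)
print(pooled)  # row 0 averages both tokens, row 1 ignores the padded one
```

With the batch above, the mask could come from `batch_encoded["attention_mask"]` if the tokenizer returns one, or from `batch_input_ids != tokenizer.pad_token_id`. Both are assumptions to verify for your installed `esm` version.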

## Input Format

**Required Format:** `HEAVY_CHAIN-LIGHT_CHAIN`

- Heavy and light chains must be separated by a hyphen (`-`)
- Use standard single-letter amino acid codes
- No spaces in the sequence

**Example:**
```python
sequence = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMS...-DIQMTQSPSSLSASVGDRVTITCRASQSISS..."
```
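
These format rules can be checked before running the model. The sketch below is one possible validator, with `split_paired` as a hypothetical helper name. It only checks that chains consist of letters and does not verify that each letter is a valid amino acid code.

```python
def split_paired(sequence: str) -> tuple[str, str]:
    """Validate a HEAVY-LIGHT string and return (heavy, light)."""
    if any(ch.isspace() for ch in sequence):
        raise ValueError("sequence must not contain whitespace")
    if sequence.count("-") != 1:
        raise ValueError("expected exactly one '-' separator")
    heavy, light = sequence.split("-")
    if not heavy or not light:
        raise ValueError("both chains must be non-empty")
    if not (heavy.isalpha() and light.isalpha()):
        raise ValueError("chains must contain letters only")
    return heavy, light

print(split_paired("EVQLVESGG-DIQMTQSPS"))  # ('EVQLVESGG', 'DIQMTQSPS')
```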

## Output

### Embeddings
- **Dimension:** 1152 (ESMC hidden size)
- **Sequence length:** Variable (up to the model's maximum length)
- **Format:** PyTorch tensor
- **Dtype:** 
  - High-level API: float32
  - Low-level API: bfloat16

### Logits
- **Dimension:** 64 (ESMC vocabulary size)
- **Format:** PyTorch tensor
- **Dtype:** bfloat16
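
NumPy has no bfloat16 dtype, so bfloat16 outputs need to be upcast before they are handed to NumPy-based tools. A minimal sketch, using a zero tensor as a stand-in for real model output:

```python
import torch

# Stand-in for a bfloat16 embedding tensor of shape (1, seq_len, 1152)
emb_bf16 = torch.zeros(1, 4, 1152, dtype=torch.bfloat16)

# Upcast to float32, move to CPU, then convert to a NumPy array
emb_np = emb_bf16.detach().float().cpu().numpy()
print(emb_np.dtype, emb_np.shape)  # float32 (1, 4, 1152)
```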


## Citation

If you use this model, please cite:

```bibtex
@article{talaei2025preferential,
  title={Preferential CDR masking in paired antibody language models improves binding affinity prediction},
  author={Talaei, Mahtab and Walker, Kenji C. and Hao, Boran and Jolley, Eliot and Jin, Yeping and Kozakov, Dima and Misasi, John and Vajda, Sandor and Paschalidis, Ioannis Ch. and Joseph-McCarthy, Diane},
  journal={bioRxiv},
  year={2025},
  doi={10.1101/2025.10.31.685149}
}

@article{hayes2025simulating,
  title={Simulating 500 million years of evolution with a language model},
  author={Hayes, Thomas and Rao, Roshan and Akin, Halil and Sofroniew, Nicholas J. and Oktay, Deniz and Lin, Zeming and Verkuil, Robert and Tran, Vincent Q. and Deaton, Jonathan and Wiggert, Marius and Badkundri, Rohil and Shafkat, Irhum and Gong, Jun and Derry, Alexander and Molina, Raul S. and Thomas, Neil and Khan, Yousuf A. and Mishra, Chetan and Kim, Carolyn and Bartie, Liam J. and Nemeth, Matthew and Hsu, Patrick D. and Sercu, Tom and Candido, Salvatore and Rives, Alexander},
  journal={Science},
  volume={387},
  number={6736},
  pages={850--858},
  year={2025},
  doi={10.1126/science.ads0018}
}

@misc{esm2024cambrian,
  author={{ESM Team}},
  title={ESM Cambrian: Revealing the mysteries of proteins with unsupervised learning},
  year={2024},
  publisher={EvolutionaryScale},
  url={https://evolutionaryscale.ai/blog/esm-cambrian}
}
```

## Contact

- **Maintainer:** Network Optimization & Control (NOC) Lab
- **Email:** mtalaei@bu.edu
- **GitHub:** [https://github.com/Mah-Tala/AbCDR-ESM](https://github.com/Mah-Tala/AbCDR-ESM)
- **Paper:** [bioRxiv preprint](https://www.biorxiv.org/content/10.1101/2025.10.31.685149)

## License

This model is released under the MIT License.

## Acknowledgments

- Base model: ESMC (ESM Cambrian) by EvolutionaryScale
- Data: Observed Antibody Space (OAS) database

---

**Note:** For private repositories, you'll need to authenticate:
```bash
# Option 1: CLI login
huggingface-cli login

# Option 2: Environment variable
export HF_TOKEN="your_token_here"
```