---
license: mit
base_model:
- facebook/esm2_t36_3B_UR50D
tags:
- protein
- antibody
- esmc
- biology
- CDR
---
# Antibody ESM2 Paired Model
## Model Description
This model is a fine-tuned version of ESM2-3B for paired antibody sequences (heavy and light chains).
**Key Features:**
- Trained on paired antibody sequences
- Two-stage fine-tuning: 15% WC masking, followed by 50% CDR masking
- Input format: Heavy-Light chains separated by "-"
- Output: 2560-dimensional embeddings
- Optimized for antibody CDR region understanding
### Preprocessing
Sequences were:
1. Combined as: HEAVY-LIGHT (with "-" separator)
2. Tokenized with ESM2 tokenizer
3. CDR regions annotated for masking
## Usage
### Loading the Model
```python
from transformers import EsmModel, AutoTokenizer
import torch
# Load model and tokenizer
model = EsmModel.from_pretrained("MahTala/AbCDR-ESM2")
tokenizer = AutoTokenizer.from_pretrained("MahTala/AbCDR-ESM2")
model.eval()
```
### Extract Embeddings
```python
# Prepare paired sequence
SEP_TOKEN = "-"
heavy_chain = (
"EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVAVISYDGSNKYYADSVKGRF"
"TISADTSKNTAYLQMNSLRAEDTAVYYCAREGYYGSSYWYFDYWGQGTLVTVSS"
)
light_chain = (
"DIQMTQSPSSLSASVGDRVTITCRASQSISSYLNWYQQKPGKAPKLLIYAASSLQSGVPSRFSGSGS"
"GTDFTLTISSLQPEDFATYYCQQSYSTPLTFGGGTKVEIK"
)
paired_sequence = f"{heavy_chain}{SEP_TOKEN}{light_chain}"
# Tokenize (the tokenizer adds the special tokens)
inputs = tokenizer(paired_sequence, return_tensors="pt", add_special_tokens=True)
# Extract per-token embeddings
with torch.no_grad():
    outputs = model(**inputs)
embeddings = outputs.last_hidden_state  # (1, seq_len, 2560)
# Mean pooling over tokens, weighted by the attention mask
mask = inputs["attention_mask"].unsqueeze(-1)
pooled = (embeddings * mask).sum(1) / mask.sum(1)
print(f"Embedding shape: {pooled.shape}")  # torch.Size([1, 2560])
```
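The pooled vector above averages over both chains, the separator, and the special tokens. If per-chain embeddings are needed, the token positions of each chain can be derived from the chain lengths. The sketch below assumes the tokenizer emits one token per residue, prepends `<cls>`, and appends `<eos>`; `chain_token_slices` is a hypothetical helper for illustration, not part of this repository.

```python
def chain_token_slices(heavy_chain: str, light_chain: str) -> tuple[slice, slice]:
    """Return token-index slices for the heavy and light chains.

    Assumed token layout (one token per residue):
    <cls> H1 ... Hn - L1 ... Lm <eos>
    """
    heavy_start = 1  # index 0 is the assumed <cls> token
    heavy_end = heavy_start + len(heavy_chain)
    light_start = heavy_end + 1  # skip the "-" separator token
    light_end = light_start + len(light_chain)
    return slice(heavy_start, heavy_end), slice(light_start, light_end)


heavy_slice, light_slice = chain_token_slices("EVQLV", "DIQ")
print(heavy_slice, light_slice)  # slice(1, 6, None) slice(7, 10, None)
```

With the tensors from the example above, `embeddings[0, heavy_slice].mean(0)` would give a heavy-chain-only embedding. Checking that `inputs["input_ids"].shape[1]` equals `len(heavy_chain) + len(light_chain) + 3` is a cheap way to confirm the layout assumption holds.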
## Input Format
**Required Format:** `HEAVY_CHAIN-LIGHT_CHAIN`
- Heavy and light chains must be separated by hyphen (`-`)
- Use standard single-letter amino acid codes
- No spaces in sequence
- Replace uncommon (non-standard) residues with `X`
**Example:**
```python
sequence = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMS...-DIQMTQSPSSLSASVGDRVTITCRASQSISS..."
```
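The formatting rules above can be sketched as a small helper. This is a minimal illustration, not the authors' preprocessing code: `clean_chain` and `format_paired_sequence` are hypothetical names, and treating every character outside the 20 standard amino acids as "uncommon" (as well as uppercasing the input) is an assumption.

```python
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def clean_chain(chain: str) -> str:
    """Uppercase, drop whitespace, and map non-standard residues to X."""
    compact = "".join(chain.split()).upper()
    return "".join(aa if aa in STANDARD_RESIDUES else "X" for aa in compact)


def format_paired_sequence(heavy_chain: str, light_chain: str, sep: str = "-") -> str:
    """Build the HEAVY-LIGHT input string expected by the model."""
    heavy, light = clean_chain(heavy_chain), clean_chain(light_chain)
    if not heavy or not light:
        raise ValueError("Both heavy and light chains are required.")
    return f"{heavy}{sep}{light}"


print(format_paired_sequence("evqlv esg", "DIQMTQB"))  # EVQLVESG-DIQMTQX
```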
## Output
- **Embedding dimension:** 2560
- **Sequence length:** Variable (up to ~1024 tokens including special tokens)
- **Format:** PyTorch tensor
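
To check a pair against the length limit before running the model, the token count can be estimated from the chain lengths. The sketch below assumes one token per residue plus the separator and two special tokens (`<cls>` and `<eos>`); the helper names are hypothetical, and 1024 is the approximate limit quoted above.

```python
MAX_TOKENS = 1024  # approximate limit, including special tokens


def paired_token_count(heavy_chain: str, light_chain: str) -> int:
    """Estimate the token count for a HEAVY-LIGHT input."""
    # residues + one "-" separator + assumed <cls> and <eos> tokens
    return len(heavy_chain) + 1 + len(light_chain) + 2


def fits_model(heavy_chain: str, light_chain: str, max_tokens: int = MAX_TOKENS) -> bool:
    """Return True if the paired input stays within the token limit."""
    return paired_token_count(heavy_chain, light_chain) <= max_tokens


print(paired_token_count("EVQLV", "DIQ"))  # 11
```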
## Model Card Authors
Mahtab Talaei
## Contact
- **Maintainer:** Network Optimization & Control (NOC) Lab
- **Email:** mtalaei@bu.edu
- **GitHub:** [https://github.com/Mah-Tala/AbCDR-ESM](https://github.com/Mah-Tala/AbCDR-ESM)
- **Paper:** [bioRxiv preprint](https://www.biorxiv.org/content/10.1101/2025.10.31.685149v1)
## License
This model is released under the MIT License.
## Acknowledgments
- Base model: ESM2 by Meta AI
- Data: OAS database
---
**Note:** For private repositories, you'll need to authenticate:
```bash
# Option 1: CLI login
huggingface-cli login
# Option 2: Environment variable
export HF_TOKEN="your_token_here"
```