---
license: mit
base_model:
- facebook/esm2_t36_3B_UR50D
tags:
- protein
- antibody
- esmc
- biology
- CDR
---

# Antibody ESM2 Paired Model

## Model Description

This model is a fine-tuned version of ESM2-3B (`facebook/esm2_t36_3B_UR50D`) for paired antibody sequences, where each input contains both the heavy and the light chain.

**Key Features:**
- Trained on paired antibody sequences
- 15% WC followed by 50% CDR fine-tuning
- Input format: heavy and light chains joined by a "-" separator
- Output: 2560-dimensional embeddings
- Optimized for understanding antibody CDR regions

### Preprocessing

Sequences were:
1. Combined as HEAVY-LIGHT (with a "-" separator)
2. Tokenized with the ESM2 tokenizer
3. Annotated with CDR regions for masking
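
The first and third steps can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `build_paired_with_cdr_mask`, the 0-based end-exclusive span convention, and the toy sequences are all hypothetical, and the model card does not say which tool produced the CDR annotations.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # 0-based, end-exclusive residue range within one chain


def build_paired_with_cdr_mask(
    heavy: str,
    light: str,
    heavy_cdrs: List[Span],
    light_cdrs: List[Span],
    sep: str = "-",
) -> Tuple[str, List[bool]]:
    """Join the chains and flag which characters of the result lie in a CDR."""
    paired = f"{heavy}{sep}{light}"
    mask = [False] * len(paired)
    for start, end in heavy_cdrs:
        for i in range(start, end):
            mask[i] = True
    # Light-chain positions shift past the heavy chain and the separator.
    offset = len(heavy) + len(sep)
    for start, end in light_cdrs:
        for i in range(offset + start, offset + end):
            mask[i] = True
    return paired, mask


# Toy sequences and spans for illustration only (not real CDR annotations).
paired, mask = build_paired_with_cdr_mask("AAACCCAAA", "GGGTTGG", [(3, 6)], [(3, 5)])
cdr_residues = "".join(ch for ch, flag in zip(paired, mask) if flag)
print(paired)        # AAACCCAAA-GGGTTGG
print(cdr_residues)  # CCCTT
```

The mask is indexed by character of the paired string. Aligning it with token positions would additionally require accounting for any special tokens the tokenizer adds.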

## Usage

### Loading the Model

```python
from transformers import EsmModel, AutoTokenizer
import torch

# Load model and tokenizer
model = EsmModel.from_pretrained("MahTala/AbCDR-ESM2")
tokenizer = AutoTokenizer.from_pretrained("MahTala/AbCDR-ESM2")
model.eval()
```

### Extract Embeddings

```python
# Assumes `model`, `tokenizer`, and `torch` from the loading step above

# Prepare paired sequence
SEP_TOKEN = "-"
heavy_chain = (
    "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVAVISYDGSNKYYADSVKGRF"
    "TISADTSKNTAYLQMNSLRAEDTAVYYCAREGYYGSSYWYFDYWGQGTLVTVSS"
)
light_chain = (
    "DIQMTQSPSSLSASVGDRVTITCRASQSISSYLNWYQQKPGKAPKLLIYAASSLQSGVPSRFSGSGS"
    "GTDFTLTISSLQPEDFATYYCQQSYSTPLTFGGGTKVEIK"
)
paired_sequence = f"{heavy_chain}{SEP_TOKEN}{light_chain}"

# Tokenize
inputs = tokenizer(paired_sequence, return_tensors="pt", add_special_tokens=True)

# Extract per-token embeddings
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state

# Mean pooling over all attended tokens (special tokens included)
mask = inputs["attention_mask"].unsqueeze(-1)
pooled = (embeddings * mask).sum(1) / mask.sum(1)

print(f"Embedding shape: {pooled.shape}")  # (1, 2560)
```
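
If separate heavy-chain and light-chain embeddings are wanted instead of one pooled vector, the token positions of each chain can be computed from the chain lengths. The helper below is a hypothetical sketch, not part of the released code. It assumes the tokenizer emits one token per residue, one token for the "-" separator, a leading CLS token, and a trailing EOS token; check this against `inputs["input_ids"]` before relying on it.

```python
from typing import Tuple


def chain_token_slices(heavy_len: int, light_len: int) -> Tuple[slice, slice]:
    """Token-axis slices for the heavy and light chains.

    Assumed layout: [CLS] heavy residues, "-", light residues [EOS],
    with exactly one token per residue.
    """
    heavy = slice(1, 1 + heavy_len)   # skip the leading CLS token
    light_start = 1 + heavy_len + 1   # skip the "-" separator token
    light = slice(light_start, light_start + light_len)
    return heavy, light


# Toy token list standing in for a tokenized paired sequence.
tokens = ["<cls>"] + list("HHHHH") + ["-"] + list("LLL") + ["<eos>"]
heavy_slice, light_slice = chain_token_slices(heavy_len=5, light_len=3)
print(tokens[heavy_slice])  # ['H', 'H', 'H', 'H', 'H']
print(tokens[light_slice])  # ['L', 'L', 'L']
```

With the tensors from the example above, `chain_token_slices(len(heavy_chain), len(light_chain))` would give slices that can be applied along the token axis, for example `embeddings[:, heavy_slice].mean(dim=1)`.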

## Input Format

**Required Format:** `HEAVY_CHAIN-LIGHT_CHAIN`

- Heavy and light chains must be separated by a hyphen (`-`)
- Use standard single-letter amino acid codes
- No spaces within the sequence
- Uncommon residues should be replaced with X

**Example:**
```python
sequence = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMS...-DIQMTQSPSSLSASVGDRVTITCRASQSISS..."
```
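
The formatting rules above can be applied programmatically before tokenization. The following is a minimal sketch; `clean_chain` and `format_paired_input` are hypothetical helper names, and treating every character outside the 20 standard amino acids as "uncommon" is an assumption, since the model card does not define the term.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues (assumed cutoff)


def clean_chain(seq: str) -> str:
    """Uppercase, drop whitespace, and map non-standard residues to X."""
    seq = "".join(seq.split()).upper()
    return "".join(ch if ch in STANDARD_AA else "X" for ch in seq)


def format_paired_input(heavy: str, light: str, sep: str = "-") -> str:
    """Return the HEAVY-LIGHT string described in this section."""
    if not heavy.strip() or not light.strip():
        raise ValueError("both heavy and light chains are required")
    return f"{clean_chain(heavy)}{sep}{clean_chain(light)}"


# Toy fragments for illustration: U and Z are not standard residues.
print(format_paired_input("evq lve\nSU", "DIQMTZ"))  # EVQLVESX-DIQMTX
```

Because each chain is cleaned before joining, a stray hyphen inside a chain is mapped to X, so the result contains exactly one separator.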

## Output

- **Embedding dimension:** 2560
- **Sequence length:** Variable (up to ~1024 tokens including special tokens)
- **Format:** PyTorch tensor
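
A quick pre-check against the length limit can avoid passing over-long inputs to the model. This sketch assumes one token per residue, one separator token, and two special tokens (CLS and EOS). The value 1024 is taken from the approximate figure above and is not an exact guarantee, and the function names are hypothetical.

```python
MAX_TOKENS = 1024  # approximate limit quoted above; treat as an assumption


def paired_token_count(heavy: str, light: str) -> int:
    """Residues plus one separator plus two special tokens (CLS and EOS)."""
    return len(heavy) + len(light) + 1 + 2


def fits_context(heavy: str, light: str, max_tokens: int = MAX_TOKENS) -> bool:
    """True if the paired input is expected to fit within max_tokens."""
    return paired_token_count(heavy, light) <= max_tokens


# Toy chains of typical variable-domain length.
print(paired_token_count("A" * 120, "C" * 110))  # 233
print(fits_context("A" * 120, "C" * 110))        # True
```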

## Model Card Authors

Mahtab Talaei

## Contact

- **Maintainer:** Network Optimization & Control (NOC) Lab
- **Email:** mtalaei@bu.edu
- **GitHub:** [https://github.com/Mah-Tala/AbCDR-ESM](https://github.com/Mah-Tala/AbCDR-ESM)
- **Paper:** [bioRxiv preprint](https://www.biorxiv.org/content/10.1101/2025.10.31.685149v1)

## License

This model is released under the MIT License.

## Acknowledgments

- Base model: ESM2 by Meta AI
- Data: OAS database

---

**Note:** For private repositories, you'll need to authenticate first:
```bash
# Option 1: CLI login
huggingface-cli login

# Option 2: Environment variable
export HF_TOKEN="your_token_here"
```