	---
	license: mit
	base_model:
	- facebook/esm2_t36_3B_UR50D
	tags:
	- protein
	- antibody
	- biology
	- CDR
	- esm2
	---
	# Antibody ESM2 Paired Model

	## Model Description

	This model is a fine-tuned version of ESM2-3B for paired antibody sequences (heavy and light chains).

	Key Features:
	- Trained on paired antibody sequences
	- Two-stage fine-tuning: 15% WC masking followed by 50% CDR masking
	- Input format: Heavy-Light chains separated by "-"
	- Output: 2560-dimensional embeddings
	- Optimized for antibody CDR region understanding

	### Preprocessing

	Sequences were:
	1. Combined as: HEAVY-LIGHT (with "-" separator)
	2. Tokenized with ESM2 tokenizer
	3. CDR regions annotated for masking
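
The pairing and CDR-annotation steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function name `pair_with_cdr_flags` and the `(start, end)` span format are hypothetical, and the toy chains and spans are made up rather than real CDR annotations. The flags are in residue coordinates of the paired string; after tokenization with special tokens they would presumably shift by one position because of the leading `<cls>` token.

```python
def pair_with_cdr_flags(heavy, light, heavy_cdrs, light_cdrs, sep="-"):
    """Join chains as HEAVY-LIGHT and flag residues that fall inside a CDR.

    heavy_cdrs / light_cdrs are hypothetical lists of (start, end) spans,
    0-based and end-exclusive, given relative to each chain.
    """
    paired = f"{heavy}{sep}{light}"
    flags = [False] * len(paired)
    # Light-chain residues start after the heavy chain and the separator.
    offset = len(heavy) + len(sep)
    for start, end in heavy_cdrs:
        for i in range(start, end):
            flags[i] = True
    for start, end in light_cdrs:
        for i in range(start, end):
            flags[offset + i] = True
    return paired, flags


# Toy example with made-up chains and spans (not real CDR annotations).
paired, flags = pair_with_cdr_flags("EVQLAAAWGQ", "DIQMCCCFGG", [(4, 7)], [(4, 7)])
cdr_residues = "".join(aa for aa, is_cdr in zip(paired, flags) if is_cdr)
print(paired)
print(cdr_residues)
```

In practice the spans would come from an antibody numbering tool; which tool and numbering scheme were used is not stated here.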

	## Usage

	### Loading the Model

	```python
	from transformers import EsmModel, AutoTokenizer
	import torch

	# Load model and tokenizer
	model = EsmModel.from_pretrained("NOC-Lab/AbCDR-ESM2")
	tokenizer = AutoTokenizer.from_pretrained("NOC-Lab/AbCDR-ESM2")
	model.eval()
	```

	### Extract Embeddings

	```python
	# Prepare paired sequence
	SEP_TOKEN = "-"
	heavy_chain = (
	    "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVAVISYDGSNKYYADSVKGRF"
	    "TISADTSKNTAYLQMNSLRAEDTAVYYCAREGYYGSSYWYFDYWGQGTLVTVSS"
	)
	light_chain = (
	    "DIQMTQSPSSLSASVGDRVTITCRASQSISSYLNWYQQKPGKAPKLLIYAASSLQSGVPSRFSGSGS"
	    "GTDFTLTISSLQPEDFATYYCQQSYSTPLTFGGGTKVEIK"
	)
	paired_sequence = f"{heavy_chain}{SEP_TOKEN}{light_chain}"

	# Tokenize
	inputs = tokenizer(paired_sequence, return_tensors="pt", add_special_tokens=True)

	# Extract embeddings
	with torch.no_grad():
	    outputs = model(**inputs)
	    embeddings = outputs.last_hidden_state

	# Mean pooling
	mask = inputs["attention_mask"].unsqueeze(-1)
	pooled = (embeddings * mask).sum(1) / mask.sum(1)

	print(f"Embedding shape: {pooled.shape}")  # torch.Size([1, 2560])
	```
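
The mean pooling above averages over every token, including the special tokens and the separator. If per-chain embeddings are wanted instead, the token positions of each chain can be computed from the chain lengths. The sketch below assumes the token layout is `<cls>`, heavy residues, `-`, light residues, `<eos>`, with one token per residue; the helper name `chain_token_slices` is hypothetical, and a plain list stands in for the token axis so the example runs without the model.

```python
def chain_token_slices(heavy, light, sep="-"):
    """Return (heavy_slice, light_slice) over the token axis.

    Assumed layout: <cls> + heavy residues + separator + light residues + <eos>,
    with exactly one token per residue.
    """
    heavy_start = 1  # position 0 is assumed to be <cls>
    heavy_end = heavy_start + len(heavy)
    light_start = heavy_end + len(sep)  # skip the separator token
    light_end = light_start + len(light)
    return slice(heavy_start, heavy_end), slice(light_start, light_end)


# A plain list stands in for the token axis of the model output.
tokens = ["<cls>"] + list("EVQ") + ["-"] + list("DI") + ["<eos>"]
heavy_slice, light_slice = chain_token_slices("EVQ", "DI")
heavy_tokens = tokens[heavy_slice]
light_tokens = tokens[light_slice]
print(heavy_tokens, light_tokens)
```

With the tensors from the block above, `embeddings[0, heavy_slice].mean(0)` would then give a heavy-chain vector, provided `inputs["input_ids"].shape[1]` equals `len(heavy_chain) + len(light_chain) + 3`, which is worth checking before relying on the assumed layout.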

	## Input Format

	Required Format: `HEAVY_CHAIN-LIGHT_CHAIN`

	- Heavy and light chains must be separated by a hyphen (`-`)
	- Use standard single-letter amino acid codes
	- No spaces anywhere in the sequence
	- Uncommon (non-standard) residues should be replaced with `X`

	Example:
	```python
	sequence = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMS...-DIQMTQSPSSLSASVGDRVTITCRASQSISS..."
	```
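
The format rules above could be enforced with a small helper before tokenization. This is a sketch under stated assumptions rather than part of the released code: `format_paired_input` is a hypothetical name, "uncommon" is interpreted as any letter outside the 20 standard amino acids, and upper-casing the input is an added convenience.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def format_paired_input(heavy, light, sep="-"):
    """Build a HEAVY-LIGHT string that follows the input rules above."""

    def clean(chain):
        # Drop all whitespace and normalise case (assumption: input may be lower-case).
        chain = "".join(chain.split()).upper()
        if not chain:
            raise ValueError("chain is empty")
        if not chain.isalpha():
            # A stray '-' or digit inside a chain would corrupt the paired format.
            raise ValueError(f"chain contains non-letter characters: {chain!r}")
        # Replace anything outside the standard alphabet with X.
        return "".join(aa if aa in STANDARD_AA else "X" for aa in chain)

    return f"{clean(heavy)}{sep}{clean(light)}"


example = format_paired_input("evq lb", "DIQU")
print(example)
```

Raising on non-letter characters, rather than silently dropping them, keeps a misplaced hyphen from being mistaken for the chain separator.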

	## Output

	- Embedding dimension: 2560
	- Sequence length: Variable (up to ~1024 tokens including special tokens)
	- Format: PyTorch tensor
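
Because the sequence length is bounded, it can help to estimate the token count before running the model. The sketch below assumes one token per residue, one separator token, and two special tokens (`<cls>` and `<eos>`); the limit of 1024 is taken from the approximate figure above, and both helper names are hypothetical.

```python
MAX_TOKENS = 1024  # approximate limit quoted above, treated here as an assumption


def paired_token_count(heavy, light, sep="-", n_special=2):
    """Estimate tokens for HEAVY-LIGHT input, assuming one token per residue."""
    return len(heavy) + len(sep) + len(light) + n_special


def fits_model(heavy, light, max_tokens=MAX_TOKENS):
    """Return True if the paired input is expected to fit within the limit."""
    return paired_token_count(heavy, light) <= max_tokens


n_tokens = paired_token_count("A" * 120, "C" * 107)
print(n_tokens, fits_model("A" * 120, "C" * 107))
```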

	## Citation

	If you use this model, please cite:

	```bibtex
	@article{talaei2025preferential,
	title={Preferential CDR masking in paired antibody language models improves binding affinity prediction},
	author={Talaei, Mahtab and Walker, Kenji C. and Hao, Boran and Jolley, Eliot and Jin, Yeping and Kozakov, Dima and Misasi, John and Vajda, Sandor and Paschalidis, Ioannis Ch. and Joseph-McCarthy, Diane},
	journal={bioRxiv},
	year={2025},
	doi={10.1101/2025.10.31.685149}
	}
	```

	## Contact

	- Maintainer: Network Optimization & Control (NOC) Lab
	- Email: mtalaei@bu.edu
	- GitHub: [https://github.com/Mah-Tala/AbCDR-ESM](https://github.com/Mah-Tala/AbCDR-ESM)
	- Paper: [bioRxiv preprint](https://www.biorxiv.org/content/10.1101/2025.10.31.685149)

	## License

	This model is released under the MIT License.

	## Acknowledgments

	- Base model: ESM2 by Meta AI
	- Data: OAS database

	---

	Note: For private repositories, you'll need to authenticate:
	```bash
	# Option 1: CLI login
	huggingface-cli login

	# Option 2: Environment variable
	export HF_TOKEN="your_token_here"
	```