---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- biology
- protein
- antibody
- nanobody
- bert
- masked-lm
pipeline_tag: fill-mask
---
|
|
|
|
|
# NanoBodyBERT |
|
|
|
|
|
NanoBodyBERT is a BERT-based model specifically pre-trained on nanobody sequences for antibody design and analysis tasks. |
|
|
|
|
|
## Model Description |
|
|
|
|
|
This model is trained using Masked Language Modeling (MLM) on nanobody sequences, with special focus on CDR (Complementarity-Determining Region) masking strategies. |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
This model is designed for:

- Nanobody sequence analysis
- CDR region reconstruction
- Sequence embedding generation
- Antibody design applications
|
|
|
|
|
## How to Use |
|
|
|
|
|
### Installation |
|
|
|
|
|
First, install the required dependencies: |
|
|
|
|
|
```bash
pip install transformers torch
```
|
|
|
|
|
### Loading the Model |
|
|
|
|
|
```python
import torch
from transformers import BertForMaskedLM

# Load the custom amino-acid tokenizer (AATokenizer).
# The tokenizer.py file from this repository must be on your Python path.
from tokenizer import AATokenizer

# Load model and tokenizer
model = BertForMaskedLM.from_pretrained("LLMasterLL/nanobodybert")
tokenizer = AATokenizer.from_pretrained("LLMasterLL/nanobodybert")

# Move to GPU if available and switch to inference mode
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()
```
|
|
|
|
|
### Inference Example |
|
|
|
|
|
```python
# Example nanobody sequence
sequence = "QVQLVESGGGLVQPGGSLRLSCAASGFTFDDYSIAWFRQAPGKEREGVAAISWGGGSTYYADSVKGRFTISRDNAKNTLYLQMNSLRAEDTAVYYCAKDYWGQGTQVTVSS"

# Encode the sequence and move it to the model's device
input_ids = tokenizer.encode(sequence, add_special_tokens=True)
input_ids = torch.tensor([input_ids], dtype=torch.long).to(device)

# Extract per-residue embeddings from the BERT encoder
with torch.no_grad():
    outputs = model.bert(input_ids=input_ids, return_dict=True)
    embeddings = outputs.last_hidden_state
    cls_embedding = embeddings[0, 0, :]  # [CLS] token embedding

print(f"CLS embedding shape: {cls_embedding.shape}")
```
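Beyond the `[CLS]` vector, a common alternative is mean pooling over the residue positions, excluding the `[CLS]`/`[SEP]` rows. A framework-agnostic sketch of that reduction, using plain Python lists in place of the tensor returned above (the `mean_pool` helper is illustrative, not part of this repository):

```python
def mean_pool(embeddings: list[list[float]]) -> list[float]:
    """Average per-residue embeddings, skipping the first ([CLS]) and last ([SEP]) rows."""
    residues = embeddings[1:-1]
    n = len(residues)
    dim = len(residues[0])
    return [sum(row[d] for row in residues) / n for d in range(dim)]

# Toy example: 2-dimensional embeddings for [CLS], 3 residues, [SEP]
toy = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
print(mean_pool(toy))  # → [3.0, 4.0]
```

With the real model, the same reduction is `embeddings[0, 1:-1, :].mean(dim=0)` on the tensor from the snippet above.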
|
|
|
|
|
### Masked Prediction Example |
|
|
|
|
|
```python
# Create a masked sequence (here four CDR3 residues are masked)
masked_sequence = "QVQLVESGGGLVQPGGSLRLSCAASGFTFDDYSIAWFRQAPGKEREGVAAISWGGGSTYYADSVKGRFTISRDNAKNTLYLQMNSLRAEDTAVYYCAK[MASK][MASK][MASK][MASK]QGTQVTVSS"

# Replace the [MASK] placeholders with the tokenizer's mask token, then encode
tokens = masked_sequence.replace("[MASK]", tokenizer.mask_token)
input_ids = tokenizer.encode(tokens, add_special_tokens=True)
input_ids = torch.tensor([input_ids], dtype=torch.long).to(device)

# Predict token IDs. Note that argmax is taken at every position;
# in practice you would read out only the masked positions.
with torch.no_grad():
    outputs = model(input_ids=input_ids)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Decode back to an amino-acid sequence
predicted_sequence = tokenizer.decode(predictions[0].cpu().tolist(), skip_special_tokens=True)
print(f"Predicted: {predicted_sequence}")
```
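Instead of editing the sequence string by hand, you can mask an arbitrary residue span programmatically. The helper below is a minimal sketch (the `mask_span` name and the literal `[MASK]` placeholder are illustrative, not part of this repository):

```python
def mask_span(sequence: str, start: int, end: int, mask: str = "[MASK]") -> str:
    """Replace residues in [start, end) of an amino-acid sequence with mask placeholders."""
    if not (0 <= start <= end <= len(sequence)):
        raise ValueError("span out of range")
    return sequence[:start] + mask * (end - start) + sequence[end:]

# Mask four residues of a toy sequence
print(mask_span("QVQLVESG", 2, 6))  # → QV[MASK][MASK][MASK][MASK]SG
```

The returned string can be passed directly to the masked-prediction snippet above in place of `masked_sequence`.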
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
- **Architecture:** BERT (Bidirectional Encoder Representations from Transformers)
- **Vocabulary:** 26 tokens (20 amino acids + special tokens)
- **Max sequence length:** 256
- **Special tokens:**
  - `[PAD]`: Padding token
  - `[CLS]`: Classification token (sequence start)
  - `[SEP]`: Separator token (sequence end)
  - `[MASK]`: Mask token for MLM
  - `[UNK]`: Unknown token
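A character-level amino-acid tokenizer along these lines can be sketched as follows. This is an illustrative reimplementation, not the actual `AATokenizer` shipped with this repository: the token IDs and ordering are assumptions, and the published vocabulary has 26 tokens, one more than the 25 in this sketch.

```python
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # 20 canonical residues

# One ID per token: special tokens first, then amino acids
VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + AMINO_ACIDS)}

def encode(sequence: str, add_special_tokens: bool = True) -> list[int]:
    """Map each residue to its ID, falling back to [UNK] for unknown characters."""
    ids = [VOCAB.get(aa, VOCAB["[UNK]"]) for aa in sequence]
    if add_special_tokens:
        ids = [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]
    return ids

print(encode("QVQL"))  # IDs for [CLS] Q V Q L [SEP]
```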
|
|
|
|
|
## Training Data |
|
|
|
|
|
The model was pre-trained on a curated dataset of nanobody sequences, with masking preferentially applied to CDR regions during MLM.
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model in your research, please cite: |
|
|
|
|
|
```bibtex
@misc{nanobodybert,
  title={NanoBodyBERT: BERT-based Pre-trained Model for Nanobody Sequences},
  author={Ling Luo},
  year={2025},
  howpublished={\url{https://huggingface.co/LLMasterLL/nanobodybert}}
}
```
|
|
|
|
|
## License |
|
|
|
|
|
Apache 2.0 |
|
|
|
|
|
## Contact |
|
|
|
|
|
For questions and feedback, please open an issue on the repository. |
|
|
|
|
|
|