---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- biology
- protein
- antibody
- nanobody
- bert
- masked-lm
pipeline_tag: fill-mask
---
# NanoBodyBERT
NanoBodyBERT is a BERT-based model specifically pre-trained on nanobody sequences for antibody design and analysis tasks.
## Model Description
This model is trained using Masked Language Modeling (MLM) on nanobody sequences, with a special focus on CDR (Complementarity-Determining Region) masking strategies.
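To make the idea concrete, CDR-focused masking replaces residues inside annotated CDR spans with a mask token before MLM training, leaving framework regions intact. The sketch below illustrates the concept only; the span annotations, masking probability, and mask symbol are assumptions, not this model's actual training recipe:

```python
import random

def mask_cdr(sequence, cdr_spans, mask_char="#", p=0.5, rng=None):
    """Mask residues inside half-open [start, end) CDR spans with probability p."""
    rng = rng or random.Random(0)
    chars = list(sequence)
    for start, end in cdr_spans:
        for i in range(start, end):
            if rng.random() < p:
                chars[i] = mask_char  # framework residues outside spans stay intact
    return "".join(chars)

# Toy example with a hypothetical CDR span at positions 3-6:
print(mask_cdr("QVQLVESGGG", [(3, 7)], p=1.0))  # "QVQ####GGG"
```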
## Intended Use
This model is designed for:
- Nanobody sequence analysis
- CDR region reconstruction
- Sequence embedding generation
- Antibody design applications
## How to Use
### Installation
First, install the required dependencies:
```bash
pip install transformers torch
```
### Loading the Model
```python
import torch
from transformers import BertForMaskedLM

# AATokenizer is a custom tokenizer: the tokenizer.py file from this
# repository must be on your Python path.
from tokenizer import AATokenizer

# Load model and tokenizer
model = BertForMaskedLM.from_pretrained("LLMasterLL/nanobodybert")
tokenizer = AATokenizer.from_pretrained("LLMasterLL/nanobodybert")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()
```
### Inference Example
```python
# Example nanobody sequence
sequence = "QVQLVESGGGLVQPGGSLRLSCAASGFTFDDYSIAWFRQAPGKEREGVAAISWGGGSTYYADSVKGRFTISRDNAKNTLYLQMNSLRAEDTAVYYCAKDYWGQGTQVTVSS"

# Encode the sequence and move it to the model's device
input_ids = tokenizer.encode(sequence, add_special_tokens=True)
input_ids = torch.tensor([input_ids], dtype=torch.long).to(device)

# Get per-residue embeddings from the encoder
with torch.no_grad():
    outputs = model.bert(input_ids=input_ids, return_dict=True)

embeddings = outputs.last_hidden_state
cls_embedding = embeddings[0, 0, :]  # [CLS] token embedding
print(f"CLS embedding shape: {cls_embedding.shape}")
```
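Besides the `[CLS]` vector, a common alternative sequence-level embedding is mean pooling over residue positions, ignoring padding. A minimal sketch follows, demonstrated on random tensors so it runs without the model; whether `[CLS]` or mean pooling works better for this model is untested:

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over non-padded positions."""
    mask = attention_mask.unsqueeze(-1).float()      # (B, T, 1)
    summed = (last_hidden_state * mask).sum(dim=1)   # (B, H)
    counts = mask.sum(dim=1).clamp(min=1)            # (B, 1)
    return summed / counts

# Toy check with random tensors (no model download required):
hidden = torch.randn(1, 5, 8)
attn = torch.tensor([[1, 1, 1, 1, 0]])  # last position is padding
pooled = mean_pool(hidden, attn)
print(pooled.shape)  # torch.Size([1, 8])
```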
### Masked Prediction Example
```python
# Create a masked sequence (here masking part of the CDR3 region)
masked_sequence = "QVQLVESGGGLVQPGGSLRLSCAASGFTFDDYSIAWFRQAPGKEREGVAAISWGGGSTYYADSVKGRFTISRDNAKNTLYLQMNSLRAEDTAVYYCAK[MASK][MASK][MASK][MASK]QGTQVTVSS"

# Replace the [MASK] placeholder with the tokenizer's mask token and encode
tokens = masked_sequence.replace("[MASK]", tokenizer.mask_token)
input_ids = tokenizer.encode(tokens, add_special_tokens=True)
input_ids = torch.tensor([input_ids], dtype=torch.long).to(device)

# Predict the most likely token id at every position
with torch.no_grad():
    outputs = model(input_ids=input_ids)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Decode the predicted ids back to an amino-acid sequence
predicted_sequence = tokenizer.decode(predictions[0].cpu().tolist(), skip_special_tokens=True)
print(f"Predicted: {predicted_sequence}")
```
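Note that the argmax above decodes every position, including residues that were never masked; often you only want the model's predictions at the masked positions. A minimal post-processing sketch, shown with plain Python lists so it runs without the model (the token ids here are illustrative, not the tokenizer's real mapping):

```python
def extract_masked_predictions(input_ids, predicted_ids, mask_token_id):
    """Return (position, predicted_id) pairs for masked positions only."""
    return [(i, pred)
            for i, (inp, pred) in enumerate(zip(input_ids, predicted_ids))
            if inp == mask_token_id]

# Toy example: positions 2 and 3 were masked (mask_token_id = 4 here).
input_ids = [2, 7, 4, 4, 9, 3]        # e.g. [CLS] A [MASK] [MASK] C [SEP]
predicted_ids = [2, 7, 11, 12, 9, 3]  # model's argmax at every position
print(extract_masked_predictions(input_ids, predicted_ids, mask_token_id=4))
# [(2, 11), (3, 12)]
```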
## Model Architecture
- **Architecture:** BERT (Bidirectional Encoder Representations from Transformers)
- **Vocabulary:** 26 tokens (20 amino acids + special tokens)
- **Max sequence length:** 256
- **Special tokens:**
- `[PAD]`: Padding token
- `[CLS]`: Classification token (sequence start)
- `[SEP]`: Separator token (sequence end)
- `[MASK]`: Mask token for MLM
- `[UNK]`: Unknown token
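For orientation, a minimal amino-acid tokenizer with this kind of vocabulary layout might look like the sketch below. The exact token-to-id mapping of `AATokenizer` is an assumption here; consult the repository's `tokenizer.py` for the real implementation:

```python
# Hypothetical vocabulary layout: special tokens first, then the 20 residues.
SPECIALS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids
VOCAB = {tok: i for i, tok in enumerate(SPECIALS + AMINO_ACIDS)}
INV_VOCAB = {i: tok for tok, i in VOCAB.items()}

def encode(seq):
    """Map a residue string to ids, wrapped in [CLS] ... [SEP]."""
    ids = [VOCAB.get(aa, VOCAB["[UNK]"]) for aa in seq]
    return [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]

def decode(ids):
    """Map ids back to residues, dropping special tokens."""
    return "".join(INV_VOCAB[i] for i in ids if INV_VOCAB[i] not in SPECIALS)

print(decode(encode("QVQLVE")))  # "QVQLVE"
```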
## Training Data
The model was pre-trained on a curated dataset of nanobody sequences with strategic CDR masking.
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{nanobodybert,
  title={NanoBodyBERT: BERT-based Pre-trained Model for Nanobody Sequences},
  author={Ling Luo},
  year={2025},
  howpublished={\url{https://huggingface.co/LLMasterLL/nanobodybert}},
}
```
## License
Apache 2.0
## Contact
For questions and feedback, please open an issue on the repository.