---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- biology
- protein
- antibody
- nanobody
- bert
- masked-lm
pipeline_tag: fill-mask
---
# NanoBodyBERT
NanoBodyBERT is a BERT-based model specifically pre-trained on nanobody sequences for antibody design and analysis tasks.
## Model Description
This model is trained using Masked Language Modeling (MLM) on nanobody sequences, with a special focus on CDR (Complementarity-Determining Region) masking strategies.
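To make the idea concrete, CDR-focused masking replaces residues inside annotated CDR spans with a mask token before MLM training, leaving framework regions intact. The sketch below illustrates the concept only; the span annotations, masking probability, and mask symbol are assumptions, not this model's actual training recipe:

```python
import random

def mask_cdr(sequence, cdr_spans, mask_char="#", p=0.5, rng=None):
    """Mask residues inside half-open [start, end) CDR spans with probability p."""
    rng = rng or random.Random(0)
    chars = list(sequence)
    for start, end in cdr_spans:
        for i in range(start, end):
            if rng.random() < p:
                chars[i] = mask_char  # framework residues outside spans stay intact
    return "".join(chars)

# Toy example with a hypothetical CDR span at positions 3-6:
print(mask_cdr("QVQLVESGGG", [(3, 7)], p=1.0))  # "QVQ####GGG"
```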
## Intended Use
This model is designed for:
- Nanobody sequence analysis
- CDR region reconstruction
- Sequence embedding generation
- Antibody design applications
## How to Use
### Installation
First, install the required dependencies:
```bash
pip install transformers torch
```
### Loading the Model
```python
import torch
from transformers import BertForMaskedLM

# AATokenizer is a custom tokenizer: the tokenizer.py file from this
# repository must be on your Python path.
from tokenizer import AATokenizer

# Load model and tokenizer
model = BertForMaskedLM.from_pretrained("LLMasterLL/nanobodybert")
tokenizer = AATokenizer.from_pretrained("LLMasterLL/nanobodybert")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()
```
### Inference Example
```python
# Example nanobody sequence
sequence = "QVQLVESGGGLVQPGGSLRLSCAASGFTFDDYSIAWFRQAPGKEREGVAAISWGGGSTYYADSVKGRFTISRDNAKNTLYLQMNSLRAEDTAVYYCAKDYWGQGTQVTVSS"

# Encode the sequence and move it to the model's device
input_ids = tokenizer.encode(sequence, add_special_tokens=True)
input_ids = torch.tensor([input_ids], dtype=torch.long).to(device)

# Get per-residue embeddings from the encoder
with torch.no_grad():
    outputs = model.bert(input_ids=input_ids, return_dict=True)

embeddings = outputs.last_hidden_state
cls_embedding = embeddings[0, 0, :]  # [CLS] token embedding
print(f"CLS embedding shape: {cls_embedding.shape}")
```
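Besides the `[CLS]` vector, a common alternative sequence-level embedding is mean pooling over residue positions, ignoring padding. A minimal sketch follows, demonstrated on random tensors so it runs without the model; whether `[CLS]` or mean pooling works better for this model is untested:

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over non-padded positions."""
    mask = attention_mask.unsqueeze(-1).float()      # (B, T, 1)
    summed = (last_hidden_state * mask).sum(dim=1)   # (B, H)
    counts = mask.sum(dim=1).clamp(min=1)            # (B, 1)
    return summed / counts

# Toy check with random tensors (no model download required):
hidden = torch.randn(1, 5, 8)
attn = torch.tensor([[1, 1, 1, 1, 0]])  # last position is padding
pooled = mean_pool(hidden, attn)
print(pooled.shape)  # torch.Size([1, 8])
```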
### Masked Prediction Example
```python
# Create a masked sequence (here masking part of the CDR3 region)
masked_sequence = "QVQLVESGGGLVQPGGSLRLSCAASGFTFDDYSIAWFRQAPGKEREGVAAISWGGGSTYYADSVKGRFTISRDNAKNTLYLQMNSLRAEDTAVYYCAK[MASK][MASK][MASK][MASK]QGTQVTVSS"

# Replace the [MASK] placeholder with the tokenizer's mask token and encode
tokens = masked_sequence.replace("[MASK]", tokenizer.mask_token)
input_ids = tokenizer.encode(tokens, add_special_tokens=True)
input_ids = torch.tensor([input_ids], dtype=torch.long).to(device)

# Predict the most likely token id at every position
with torch.no_grad():
    outputs = model(input_ids=input_ids)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Decode the predicted ids back to an amino-acid sequence
predicted_sequence = tokenizer.decode(predictions[0].cpu().tolist(), skip_special_tokens=True)
print(f"Predicted: {predicted_sequence}")
```
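Note that the argmax above decodes every position, including residues that were never masked; often you only want the model's predictions at the masked positions. A minimal post-processing sketch, shown with plain Python lists so it runs without the model (the token ids here are illustrative, not the tokenizer's real mapping):

```python
def extract_masked_predictions(input_ids, predicted_ids, mask_token_id):
    """Return (position, predicted_id) pairs for masked positions only."""
    return [(i, pred)
            for i, (inp, pred) in enumerate(zip(input_ids, predicted_ids))
            if inp == mask_token_id]

# Toy example: positions 2 and 3 were masked (mask_token_id = 4 here).
input_ids = [2, 7, 4, 4, 9, 3]        # e.g. [CLS] A [MASK] [MASK] C [SEP]
predicted_ids = [2, 7, 11, 12, 9, 3]  # model's argmax at every position
print(extract_masked_predictions(input_ids, predicted_ids, mask_token_id=4))
# [(2, 11), (3, 12)]
```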
## Model Architecture
- **Architecture:** BERT (Bidirectional Encoder Representations from Transformers)
- **Vocabulary:** 26 tokens (20 amino acids + special tokens)
- **Max sequence length:** 256
- **Special tokens:**
- `[PAD]`: Padding token
- `[CLS]`: Classification token (sequence start)
- `[SEP]`: Separator token (sequence end)
- `[MASK]`: Mask token for MLM
- `[UNK]`: Unknown token
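For orientation, a minimal amino-acid tokenizer with this kind of vocabulary layout might look like the sketch below. The exact token-to-id mapping of `AATokenizer` is an assumption here; consult the repository's `tokenizer.py` for the real implementation:

```python
# Hypothetical vocabulary layout: special tokens first, then the 20 residues.
SPECIALS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids
VOCAB = {tok: i for i, tok in enumerate(SPECIALS + AMINO_ACIDS)}
INV_VOCAB = {i: tok for tok, i in VOCAB.items()}

def encode(seq):
    """Map a residue string to ids, wrapped in [CLS] ... [SEP]."""
    ids = [VOCAB.get(aa, VOCAB["[UNK]"]) for aa in seq]
    return [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]

def decode(ids):
    """Map ids back to residues, dropping special tokens."""
    return "".join(INV_VOCAB[i] for i in ids if INV_VOCAB[i] not in SPECIALS)

print(decode(encode("QVQLVE")))  # "QVQLVE"
```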
## Training Data
The model was pre-trained on a curated dataset of nanobody sequences with strategic CDR masking.
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{nanobodybert,
  title={NanoBodyBERT: BERT-based Pre-trained Model for Nanobody Sequences},
  author={Ling Luo},
  year={2025},
  howpublished={\url{https://huggingface.co/LLMasterLL/nanobodybert}},
}
```
## License
Apache 2.0
## Contact
For questions and feedback, please open an issue on the repository.