SaNano: Structure-Aware VHH Language Model
SaNano is a protein language model fine-tuned on VHH (nanobody®) sequences for improved representation learning of single-domain antibodies. Built on top of SaProt-650M, this model incorporates both sequence and predicted structure information to generate high-quality embeddings for VHH antibodies.
Model Description
- Base Model: SaProt_650M_PDB
- Model Size: 650M parameters
- Hidden Dimensionality: 1280
- Fine-tuning Method: LoRA (Low-Rank Adaptation)
  - Rank: 32
  - Alpha: 64
  - Dropout: 0.1
  - Target modules: query, key, value, dense
- Training Objective: Masked Language Modeling (MLM)
  - Masking probability: 15%
  - Masking method: Uniform MLM
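To make the LoRA settings above concrete, the following minimal sketch works through the arithmetic they imply for a single projection. It assumes the standard LoRA formulation (low-rank update scaled by alpha / rank) and a 1280 x 1280 weight matrix, which matches the hidden dimensionality but is only one of the adapted module shapes. The function names are illustrative and are not part of the SaNano code.

```python
HIDDEN_DIM = 1280  # hidden dimensionality from the model description
RANK = 32          # LoRA rank
ALPHA = 64         # LoRA alpha


def lora_scaling(alpha: int, rank: int) -> float:
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return alpha / rank


def lora_added_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of one adapter: A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


scaling = lora_scaling(ALPHA, RANK)
added = lora_added_params(HIDDEN_DIM, HIDDEN_DIM, RANK)
full = HIDDEN_DIM * HIDDEN_DIM

print("LoRA scaling factor:", scaling)
print("Adapter parameters per 1280 x 1280 matrix:", added)
print("Fraction of the full matrix:", added / full)
```

Under these assumptions, each adapted square projection trains about 5% as many parameters as the frozen matrix it modifies.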
Training Data
The model was fine-tuned on a curated dataset consisting of 75% VHH sequences and 25% protein sequences, all with predicted structures. During training, the model sees 50% of sequences with their structural tokens and 50% of sequences with the structural tokens masked.
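The 50/50 split described above can be sketched as follows. This assumes the SaProt-style structure-aware format, in which each residue is written as an uppercase amino-acid letter followed by a lowercase structure token, and in which "#" stands for a masked structure token. The helper names and the example string are hypothetical; the actual training pipeline is not shown in this document.

```python
import random


def mask_structure_tokens(sa_sequence: str) -> str:
    """Replace every structure token (every second character) with '#'."""
    if len(sa_sequence) % 2 != 0:
        raise ValueError("expected alternating residue/structure tokens")
    residues = sa_sequence[0::2]  # amino-acid letters sit at even positions
    return "".join(residue + "#" for residue in residues)


def maybe_mask_structure(sa_sequence: str, rng: random.Random,
                         p_mask: float = 0.5) -> str:
    """With probability p_mask, hide all structure tokens of the sequence."""
    if rng.random() < p_mask:
        return mask_structure_tokens(sa_sequence)
    return sa_sequence


rng = random.Random(0)
example = "EdVvQpLv"  # hypothetical structure-aware string for residues E, V, Q, L
print(mask_structure_tokens(example))  # E#V#Q#L#
print(maybe_mask_structure(example, rng))
```

Masking is applied per sequence in this sketch, so a given training example either keeps all of its structure tokens or loses all of them.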
Training Configuration
- Steps: 5000
- Batch Size: 16 per device (with gradient accumulation steps of 4)
- Learning Rate: 2e-5
- Scheduler: Cosine with 1000 warmup steps
- Weight Decay: 0.01
- Precision: FP16
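As an illustration of the schedule listed above, the sketch below computes the learning rate at a given step. It assumes linear warmup from zero followed by a single half-cosine decay to zero, which is how a typical "cosine with warmup" scheduler behaves, and it assumes the 5000 steps are optimizer steps. Neither detail is stated here, so treat the curve as an approximation. The last line shows the effective per-device batch size implied by gradient accumulation.

```python
import math

TOTAL_STEPS = 5000
WARMUP_STEPS = 1000
PEAK_LR = 2e-5


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 500, 1000, 3000, 5000):
    print(step, learning_rate(step))

# 16 sequences per device, accumulated over 4 steps before each update
effective_batch_per_device = 16 * 4
print("Effective batch size per device:", effective_batch_per_device)
```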
Usage
Basic Usage
from transformers import EsmForMaskedLM, EsmTokenizer
import torch
def add_structure_masking(sequence):
    # Append the structure-mask token "#" after every residue
    return "#".join(sequence) + "#"
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = EsmForMaskedLM.from_pretrained("novonordisk-red/SaNano").to(device)
tokenizer = EsmTokenizer.from_pretrained("novonordisk-red/SaNano")
sequences = [
    "EVQLVESGGGLVQAGGSLRLSCAASGFTFPTYAMAWFRQAPGKGREFV",
    "QVQLQESGGGLVQAGGSLRLSCAASGRTFSSYAMGWFRQAPGKEREFVAAI",
    "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKG",
    "DVQLVESGGGLVQAGGSLRLSCAASGRTFSTYAMGWFRQAPGKGREFVAGISWS",
]
# Convert to the structure-aware format with every structure token masked ("#").
# This is necessary when structural tokens have not been computed.
structure_aware_sequences = [add_structure_masking(seq) for seq in sequences]
# ["E#V#Q#L#V#E#S#G#G#G#L#V#Q#A#G#G#S#L#R#L#S#C#A#A#S#G#F#T#F#P#T#Y#A#M#A#W#F#R#Q#A#P#G#...", ...]
# Move inputs to the device and request hidden states from the forward pass
inputs = tokenizer(structure_aware_sequences, return_tensors="pt", padding=True)
inputs = {key: value.to(device) for key, value in inputs.items()}
model.eval()  # disable dropout for inference
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
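One common way to turn the hidden states returned above into a single embedding per sequence is attention-masked mean pooling over the last layer. The sketch below is one such choice and is not necessarily how SaNano embeddings were produced or evaluated. `mean_pool` is a hypothetical helper, and for simplicity it averages over all non-padding positions, including the special tokens added by the tokenizer. Random tensors stand in for real model outputs.

```python
import torch


def mean_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average hidden states over non-padding positions.

    hidden:         (batch, seq_len, dim)
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)                   # zero out padding, then sum
    counts = mask.sum(dim=1).clamp(min=1.0)               # avoid division by zero
    return summed / counts


# Toy check with random tensors in place of model outputs
hidden = torch.randn(2, 5, 1280)
attention_mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
embeddings = mean_pool(hidden, attention_mask)
print(embeddings.shape)  # torch.Size([2, 1280])
```

With the objects from the usage example, the call would be `mean_pool(outputs.hidden_states[-1], inputs["attention_mask"])`, giving one 1280-dimensional vector per input sequence.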
Contact
Developer: Hugo Frelin (ahyf@novonordisk.com)
Team: InSilico Antibody Discovery