SaNano: Structure-Aware VHH Language Model

SaNano is a protein language model fine-tuned on VHH (Nanobody®) sequences for improved representation learning of single-domain antibodies. Built on top of SaProt-650M, it incorporates both sequence and predicted-structure information to generate high-quality embeddings for VHH antibodies.

Model Description

  • Base Model: SaProt_650M_PDB
  • Model Size: 650M parameters
  • Hidden Dimensionality: 1280
  • Fine-tuning Method: LoRA (Low-Rank Adaptation)
    • Rank: 32
    • Alpha: 64
    • Dropout: 0.1
    • Target modules: query, key, value, dense
  • Training Objective: Masked Language Modeling (MLM)
    • Masking probability: 15%
    • Masking method: Uniform MLM
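The uniform MLM objective above can be sketched in plain Python: each token is independently replaced by a mask symbol with 15% probability, and the model learns to recover the original token at the masked positions. This is a minimal illustration of the masking scheme, not SaNano's actual tokenizer or vocabulary:

```python
import random

def uniform_mlm_mask(tokens, mask_prob=0.15, mask_token="<mask>", rng=None):
    """Mask each token independently with probability mask_prob (uniform MLM).

    Returns the masked token list and a parallel label list: the original
    token at masked positions, None elsewhere (ignored in the loss).
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels
```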

Training Data

The model was fine-tuned on a curated dataset of 75% VHH sequences and 25% general protein sequences, all with predicted structures. During training, 50% of sequences retain their structural tokens and the other 50% have their structural tokens masked.
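SaProt-style inputs pair each amino acid with a lowercase Foldseek 3Di structure token, and "#" stands in for a masked or unknown structure token. A minimal sketch of that pairing, assuming SaProt's interleaving convention (the 3Di string in the example is illustrative, not a real structure prediction):

```python
def to_structure_aware(seq, struct_3di=None):
    """Interleave residues with Foldseek 3Di structure tokens.

    If no structure string is given, every structure token is masked
    with "#", the symbol SaProt uses for unknown structure.
    """
    if struct_3di is None:
        struct_3di = "#" * len(seq)
    if len(seq) != len(struct_3di):
        raise ValueError("sequence and structure strings must align")
    return "".join(aa + st.lower() for aa, st in zip(seq, struct_3di))
```

With `struct_3di=None` this reproduces the all-masked format produced by the `add_structure_masking` helper in the Usage section.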

Training Configuration

  • Steps: 5000
  • Batch Size: 16 per device (with gradient accumulation steps of 4)
  • Learning Rate: 2e-5
  • Scheduler: Cosine with 1000 warmup steps
  • Weight Decay: 0.01
  • Precision: FP16

Usage

Basic Usage

from transformers import EsmForMaskedLM, EsmTokenizer
import torch

def add_structure_masking(sequence):
    return "#".join(sequence) + "#"

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = EsmForMaskedLM.from_pretrained("novonordisk-red/SaNano").to(device)
tokenizer = EsmTokenizer.from_pretrained("novonordisk-red/SaNano")

sequences = [
    "EVQLVESGGGLVQAGGSLRLSCAASGFTFPTYAMAWFRQAPGKGREFV",
    "QVQLQESGGGLVQAGGSLRLSCAASGRTFSSYAMGWFRQAPGKEREFVAAI",
    "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKG",
    "DVQLVESGGGLVQAGGSLRLSCAASGRTFSTYAMGWFRQAPGKGREFVAGISWS"
]

# Convert to the structure-aware format with all structure tokens masked.
# This step is necessary when structural tokens have not been computed.
structure_aware_sequences = [add_structure_masking(seq) for seq in sequences]
# ["E#V#Q#L#V#E#S#G#G#G#L#V#Q#A#G#G#S#L#R#L#S#C#A#A#S#G#F#T#F#P#T#Y#A#M#A#W#F#R#Q#A#P#G#...", ...]

# Move input to device and ensure hidden states are returned
inputs = tokenizer(structure_aware_sequences, return_tensors="pt", padding=True)
inputs = {key: value.to(device) for key, value in inputs.items()}
inputs['output_hidden_states'] = True

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
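The snippet above stops at the forward pass; per-sequence embeddings are then commonly derived by mean-pooling the final hidden state (`outputs.hidden_states[-1]`, dimensionality 1280) over non-padding positions, using the attention mask to exclude padding. The mean-pooling choice is a common convention, not something the model card specifies. A dependency-free sketch of the pooling logic on nested lists:

```python
def mean_pool(hidden_states, attention_mask):
    """Average token embeddings over non-padding positions.

    hidden_states: per sequence, a list of per-token vectors.
    attention_mask: per sequence, a list of 0/1 flags (1 = real token).
    Returns one pooled vector per sequence.
    """
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        kept = [vec for vec, m in zip(states, mask) if m]
        dim = len(kept[0])
        pooled.append([sum(vec[d] for vec in kept) / len(kept) for d in range(dim)])
    return pooled
```

With torch tensors the same operation is a masked sum divided by the mask's token count along the sequence axis.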

Contact

Developer: Hugo Frelin (ahyf@novonordisk.com)
Team: InSilico Antibody Discovery
