SaNano: Structure-Aware VHH Language Model

SaNano is a protein language model fine-tuned on VHH (Nanobody®) sequences for improved representation learning of single-domain antibodies. Built on top of SaProt-650M, it incorporates both sequence and predicted-structure information to generate high-quality embeddings for VHH antibodies.

Model Description

  • Base Model: SaProt_650M_PDB
  • Model Size: 650M parameters
  • Hidden Dimensionality: 1280
  • Fine-tuning Method: LoRA (Low-Rank Adaptation)
    • Rank: 32
    • Alpha: 64
    • Dropout: 0.1
    • Target modules: query, key, value, dense
  • Training Objective: Masked Language Modeling (MLM)
    • Masking probability: 15%
    • Masking method: Uniform MLM
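The uniform MLM objective above can be sketched in plain Python: each token is independently replaced by a mask symbol with 15% probability, and the model learns to recover the original token at the masked positions. This is a minimal illustration of the masking scheme, not SaNano's actual tokenizer or vocabulary:

```python
import random

def uniform_mlm_mask(tokens, mask_prob=0.15, mask_token="<mask>", rng=None):
    """Mask each token independently with probability mask_prob (uniform MLM).

    Returns the masked token list and a parallel label list: the original
    token at masked positions, None elsewhere (ignored in the loss).
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels
```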

Training Data

The model was fine-tuned on a curated dataset of 75% VHH sequences and 25% general protein sequences, all with predicted structures. During training, 50% of sequences retain their structural tokens and the other 50% have their structural tokens masked.
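SaProt-style inputs pair each amino acid with a lowercase Foldseek 3Di structure token, and "#" stands in for a masked or unknown structure token. A minimal sketch of that pairing, assuming SaProt's interleaving convention (the 3Di string in the example is illustrative, not a real structure prediction):

```python
def to_structure_aware(seq, struct_3di=None):
    """Interleave residues with Foldseek 3Di structure tokens.

    If no structure string is given, every structure token is masked
    with "#", the symbol SaProt uses for unknown structure.
    """
    if struct_3di is None:
        struct_3di = "#" * len(seq)
    if len(seq) != len(struct_3di):
        raise ValueError("sequence and structure strings must align")
    return "".join(aa + st.lower() for aa, st in zip(seq, struct_3di))
```

With `struct_3di=None` this reproduces the all-masked format produced by the `add_structure_masking` helper in the Usage section.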

Training Configuration

  • Steps: 5000
  • Batch Size: 16 per device (with gradient accumulation steps of 4)
  • Learning Rate: 2e-5
  • Scheduler: Cosine with 1000 warmup steps
  • Weight Decay: 0.01
  • Precision: FP16

Usage

Basic Usage

from transformers import EsmForMaskedLM, EsmTokenizer
import torch

def add_structure_masking(sequence):
    return "#".join(sequence) + "#"

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = EsmForMaskedLM.from_pretrained("novonordisk-red/SaNano").to(device)
tokenizer = EsmTokenizer.from_pretrained("novonordisk-red/SaNano")

sequences = [
    "EVQLVESGGGLVQAGGSLRLSCAASGFTFPTYAMAWFRQAPGKGREFV",
    "QVQLQESGGGLVQAGGSLRLSCAASGRTFSSYAMGWFRQAPGKEREFVAAI",
    "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKG",
    "DVQLVESGGGLVQAGGSLRLSCAASGRTFSTYAMGWFRQAPGKGREFVAGISWS"
]

# Convert to the structure-aware format with all structure tokens masked.
# This step is necessary when structural tokens have not been computed.
structure_aware_sequences = [add_structure_masking(seq) for seq in sequences]
# ["E#V#Q#L#V#E#S#G#G#G#L#V#Q#A#G#G#S#L#R#L#S#C#A#A#S#G#F#T#F#P#T#Y#A#M#A#W#F#R#Q#A#P#G#...", ...]

# Move input to device and ensure hidden states are returned
inputs = tokenizer(structure_aware_sequences, return_tensors="pt", padding=True)
inputs = {key: value.to(device) for key, value in inputs.items()}
inputs['output_hidden_states'] = True

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
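The snippet above stops at the forward pass; per-sequence embeddings are then commonly derived by mean-pooling the final hidden state (`outputs.hidden_states[-1]`, dimensionality 1280) over non-padding positions, using the attention mask to exclude padding. The mean-pooling choice is a common convention, not something the model card specifies. A dependency-free sketch of the pooling logic on nested lists:

```python
def mean_pool(hidden_states, attention_mask):
    """Average token embeddings over non-padding positions.

    hidden_states: per sequence, a list of per-token vectors.
    attention_mask: per sequence, a list of 0/1 flags (1 = real token).
    Returns one pooled vector per sequence.
    """
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        kept = [vec for vec, m in zip(states, mask) if m]
        dim = len(kept[0])
        pooled.append([sum(vec[d] for vec in kept) / len(kept) for d in range(dim)])
    return pooled
```

With torch tensors the same operation is a masked sum divided by the mask's token count along the sequence axis.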

Contact

Developer: Hugo Frelin (ahyf@novonordisk.com)
Team: InSilico Antibody Discovery
