SynthGuard Protein V4: ESM-2 + k-mer Biosecurity Screening Model

Part of SynthGuard, a dual-track biosecurity screening system built for AIxBio Hackathon 2026 (Track 1: DNA Screening & Synthesis Controls).

Benchmark date: April 26, 2026. All numbers from real NCBI BLAST+ 2.12.0+ on a Google Colab A100.


What it does

Screens protein (amino acid) sequences for biosecurity hazards using a hybrid feature approach: amino acid composition + dipeptide frequencies + ESM-2 structural embeddings. Detects ProteinMPNN-redesigned toxin analogs that share <30% sequence identity with known hazardous proteins and are therefore invisible to BLAST-based protein screening.

Key finding: real NCBI blastp (50% amino acid identity) achieves 0% recall on the protein sequences in our corpus, which by design all fall below the identity threshold. SynthGuard Protein V4 achieves 86.0% recall at AUROC 0.944 on the same sequences.


Architecture

Not a fine-tuned ESM-2 classifier. V4 uses ESM-2 as a frozen feature extractor:

  1. facebook/esm2_t12_35M_UR50D (35M parameters): mean-pool of the last hidden states (excluding CLS/EOS tokens) → 480-dim embedding
  2. 426 k-mer features: AA composition (20) + dipeptide frequencies (400) + physicochemical descriptors (6)
  3. 906-dim concatenation → LightGBM classifier (sigmoid-calibrated)
# ESM-2 embedding extraction (tokenizer and model loaded as in the Usage section below)
inputs = tokenizer(aa_seq, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state[0, 1:-1, :]  # exclude CLS/EOS
embedding = hidden.mean(0).cpu().numpy()  # 480-dim

LightGBM (V4) configuration:

LGBMClassifier(n_estimators=600, max_depth=8, num_leaves=63,
               learning_rate=0.05, subsample=0.8, colsample_bytree=0.8,
               scale_pos_weight=0.9398, random_state=42)

Calibrated with CalibratedClassifierCV(method='sigmoid', cv='prefit').
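
A minimal end-to-end sketch of fitting and calibrating this configuration (the arrays below are random placeholders standing in for the real 906-dim features and hazard labels, which are not part of this card):

import numpy as np
from lightgbm import LGBMClassifier
from sklearn.calibration import CalibratedClassifierCV

# Placeholder data: X is the 906-dim feature matrix (426 k-mer + 480 ESM-2),
# y the hazard labels. Real training data comes from the dataset listed below.
rng = np.random.default_rng(42)
X_train, y_train = rng.normal(size=(1000, 906)), rng.integers(0, 2, 1000)
X_cal, y_cal = rng.normal(size=(200, 906)), rng.integers(0, 2, 200)

base = LGBMClassifier(n_estimators=600, max_depth=8, num_leaves=63,
                      learning_rate=0.05, subsample=0.8, colsample_bytree=0.8,
                      scale_pos_weight=0.9398, random_state=42)
base.fit(X_train, y_train)

# cv='prefit': the base model is trained first, then sigmoid (Platt)
# calibration is fit on a separate held-out split
protein_v4 = CalibratedClassifierCV(base, method="sigmoid", cv="prefit")
protein_v4.fit(X_cal, y_cal)

With cv='prefit', the calibration step never sees the data that grew the trees, so the sigmoid mapping corrects over- or under-confidence rather than memorizing it.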


Performance (verified April 26, 2026)

vs. real NCBI blastp 2.12.0+ (50% amino acid identity threshold), 1,286 translatable sequences:

Model                                  Recall   FPR     AUROC
blastp (50% identity)                  0.0%     0.0%    0.500
SynthGuard V4 (this model)             86.0%    13.2%   0.944
SynthGuard V2 (k-mer only, 426-feat)   84.4%    n/a     0.937
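
For reference, the blastp baseline corresponds to a screen of roughly this shape (a sketch: the file names, database name, and exact flags are assumptions, not the benchmark's actual invocation):

import subprocess

# Query sequences vs. a database of known hazardous proteins; a sequence
# counts as detected only if some hit reaches >= 50% amino acid identity.
# "queries.faa" and "hazard_db" are hypothetical placeholders.
result = subprocess.run(
    ["blastp", "-query", "queries.faa", "-db", "hazard_db",
     "-outfmt", "6 qseqid sseqid pident", "-evalue", "1e-5"],
    capture_output=True, text=True, check=True)

detected = set()
for line in result.stdout.splitlines():
    qseqid, _sseqid, pident = line.split("\t")[:3]
    if float(pident) >= 50.0:
        detected.add(qseqid)

Because the corpus consists of redesigns below 30% identity by construction, no hit clears the 50% bar, which is why the blastp row shows 0% recall.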

ProteinMPNN ablation: 255 structural redesigns across 5 toxin families (standalone benchmark):

Toxin (PDB)            blastp recall   V2 (k-mer)   V3 (+ESM-2)   V4 (+1BC7 data)
Abrin (1ABR)           6%              48%          100%          100%
Ricin (1IFS)           6%              55%          100%          100%
Anthrax LF (1J7N)      6%              51%          100%          100%
BoNT (3BTA) ††         6%              53%          96%           100%
Diphtheria (1BC7) ††   ~0%             0%           0%            100%
All 255 variants       ~6%             52.9%        79.2%         100% (AUROC 1.000)

†† = BoNT and Diphtheria were never included in any training version. Their improvement from V2 → V3 → V4 demonstrates that ESM-2 embeddings transfer structural topology across unseen toxin fold classes.

The Diphtheria gap was closed by targeted data augmentation:

  • V1–V3: 0% recall (ADP-ribosyltransferase fold absent from training)
  • V4: 100% recall after adding 50 ProteinMPNN variants of 1BC7 to training β€” no architectural change

Why ESM-2 + k-mer outperforms k-mer alone

k-mer features (AA composition, dipeptides) capture sequence composition but not structural information. ESM-2 embeddings encode the structural topology of the protein: two proteins with <10% sequence identity but the same fold will have similar ESM-2 representations (see the sketch after the list below). This explains the BoNT trajectory:

  • V2 (k-mer only): 53% recall on BoNT variants β€” partial compositional overlap
  • V3 (+ ESM-2): 96% recall β€” structural embedding recognizes the clostridial neurotoxin fold topology even with no BoNT training data

Usage

import pickle
from collections import Counter
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from huggingface_hub import hf_hub_download

# Download V4 model
v4_path = hf_hub_download("Seyomi/synthguard-kmer", "protein_kmer_v4_esm2.pkl")
with open(v4_path, "rb") as f:
    protein_v4 = pickle.load(f)

# Load ESM-2 feature extractor (frozen)
ESM2_ID = "facebook/esm2_t12_35M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(ESM2_ID)
esm2 = AutoModel.from_pretrained(ESM2_ID).eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
esm2 = esm2.to(device)

_AA20 = list("ACDEFGHIKLMNPQRSTVWY")
_DIPEP = [a+b for a in _AA20 for b in _AA20]
_AA20_SET = frozenset(_AA20)

def esm2_embed(aa, max_len=512):
    aa = aa[:max_len]
    if not aa: return np.zeros(480)
    inputs = tokenizer(aa, return_tensors="pt", truncation=True,
                       max_length=max_len, padding=False)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        h = esm2(**inputs).last_hidden_state[0, 1:-1, :]
    return (h.mean(0) if h.shape[0] > 0 else h.new_zeros(480)).cpu().numpy()

def kmer_features(aa):
    aa = "".join(c for c in aa.upper() if c in set(_AA20))
    if not aa: return [0.0] * 426
    n = len(aa)
    comp = [aa.count(a)/n for a in _AA20]
    dipep = [aa[i:i+2] for i in range(n-1)]
    dp_cnt = Counter(dipep)  # dipeptide counts in a single pass
    dp_tot = max(len(dipep), 1)
    dipep_f = [dp_cnt.get(d, 0)/dp_tot for d in _DIPEP]
    mw = sum({'A':89,'C':121,'D':133,'E':147,'F':165,'G':75,'H':155,'I':131,
              'K':146,'L':131,'M':149,'N':132,'P':115,'Q':146,'R':174,'S':105,
              'T':119,'V':117,'W':204,'Y':181}.get(c,110) for c in aa)/n
    charge = sum({'R':1,'K':1,'D':-1,'E':-1}.get(c,0) for c in aa)/n
    hydro  = sum({'A':1.8,'V':4.2,'L':3.8,'I':4.5,'P':-1.6,'F':2.8,'W':-0.9,'M':1.9,
                  'G':-0.4,'S':-0.8,'T':-0.7,'C':2.5,'Y':-1.3,'H':-3.2,'D':-3.5,
                  'E':-3.5,'N':-3.5,'Q':-3.5,'K':-3.9,'R':-4.5}.get(c,0) for c in aa)/n
    # 6 physicochemical descriptors (two slots are unused placeholders)
    return comp + dipep_f + [hydro, charge, mw/200, 0.0, 0.0, aa.count('C')/n]

def screen_protein_v4(aa, threshold_review=0.3, threshold_escalate=0.6):
    feats = np.array([kmer_features(aa) + esm2_embed(aa).tolist()])  # 906-dim
    prob = float(protein_v4.predict_proba(feats)[0, 1])
    if prob >= threshold_escalate: return "ESCALATE", prob
    if prob >= threshold_review:   return "REVIEW", prob
    return "SAFE", prob

decision, score = screen_protein_v4("MKAIFVLKGFFGAFLGFLLLPFLMAK")
print(f"{decision} (score: {score:.3f})")

Deployment note

The live API (https://seyomi-synthguard-api.hf.space/protein/screen) serves V4 when a GPU is available, falling back to V2 (k-mer only, 426 features) on CPU-only instances. Check the protein_model field of /health to confirm which version is active.
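
A minimal client sketch (the request and response field names are assumptions based on this card, not a documented API contract):

import requests

API = "https://seyomi-synthguard-api.hf.space"

# Confirm which model version is live (V4 on GPU, V2 fallback on CPU)
health = requests.get(f"{API}/health", timeout=30).json()
print("active protein model:", health.get("protein_model"))

# Screen one sequence; the JSON payload shape is an assumed example
resp = requests.post(f"{API}/protein/screen",
                     json={"sequence": "MKAIFVLKGFFGAFLGFLLLPFLMAK"},
                     timeout=60)
print(resp.json())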


Training Data

Dataset: Seyomi/synthscreen-dataset

The protein model is trained on sequences that yield ≥30 amino acids after best-frame translation. V4 is additionally trained on 50 ProteinMPNN redesigns of Diphtheria toxin (1BC7, ADP-ribosyltransferase fold); this is the only addition from V3 to V4.
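
The exact best-frame rule is not specified here; a minimal sketch, assuming it means the longest stop-free stretch over all six reading frames (uses Biopython):

from Bio.Seq import Seq

def best_frame_translation(dna: str) -> str:
    """Longest stop-free amino acid stretch across all six reading frames."""
    seq = Seq(dna.upper().replace("U", "T"))
    best = ""
    for strand in (seq, seq.reverse_complement()):
        for offset in range(3):
            frame = strand[offset:]
            frame = frame[: len(frame) - len(frame) % 3]  # whole codons only
            for stretch in str(frame.translate()).split("*"):
                if len(stretch) > len(best):
                    best = stretch
    return best

# Dataset filter: keep only sequences whose best frame yields >= 30 residues
assert len(best_frame_translation("ATG" + "GCTCGTAAA" * 12)) >= 30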


Limitations

  • V4 requires GPU for ESM-2 inference (~15ms/seq GPU vs ~200ms/seq CPU). V2 (426-feat k-mer) is the operationally viable CPU deployment.
  • Chimeric constructs not evaluated β€” hazardous domain fused to benign scaffold may evade both blastp and SynthGuard
  • RFdiffusion de-novo backbones not evaluated β€” novel folds with no known structural homolog are outside training distribution
  • ProteinMPNN ablation uses a separate dataset β€” 255 variants across 5 toxin families, not in the HF test split
  • Computational only β€” no wet-lab functional validation

Citation

SynthGuard: Closing the AI Biodesign Gap in DNA Synthesis Screening
AIxBio Hackathon 2026, Track 1 | Ashok Kumar
https://github.com/Ashok-kumar290/synthscreen
Benchmark: NCBI BLAST+ 2.12.0+, Google Colab A100, April 26, 2026