ProteinMPNN-noMHC

Vanilla ProteinMPNN (Dauparas et al., 2022) retrained on the standard PDB 2021-08-02 training set with every MHC-containing structure removed. The aim is a leakage-controlled, out-of-distribution model for studying peptide/epitope design on peptide–MHC (pMHC) and TCR–pMHC complexes: during training, this model never saw an MHC molecule (classical class I/II or MHC-like) or any peptide presented in an MHC groove.

  • Architecture: vanilla ProteinMPNN — hidden_dim=128, 3 encoder + 3 decoder layers, k_neighbors=48, backbone_noise=0.20. ~1.66 M parameters.
  • Released checkpoint: proteinmpnn_nomhc.pt (epoch 106 of a planned 200-epoch run).
  • Loads directly with the official protein_mpnn_run.py (--path_to_model_weights <dir> --model_name proteinmpnn_nomhc).
  • Base data: https://files.ipd.uw.edu/pub/training_sets/pdb_2021aug02.tar.gz.
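
A minimal sketch of how the checkpoint might be passed to the official script, built in Python so each argument is explicit. The flags are those of the official protein_mpnn_run.py; the function name build_mpnn_command and the paths weights/, complex.pdb and out/ are placeholders for illustration and are not part of this release.

```python
import shlex

def build_mpnn_command(weights_dir, pdb_path, out_folder, design_chains,
                       num_seqs=8, sampling_temp=0.1, seed=37):
    """Return an argv list that runs protein_mpnn_run.py with the no-MHC weights.

    weights_dir is assumed to contain proteinmpnn_nomhc.pt; the model name is
    given without the .pt extension, as in the bullet above.
    """
    return [
        "python", "protein_mpnn_run.py",
        "--path_to_model_weights", str(weights_dir),
        "--model_name", "proteinmpnn_nomhc",
        "--pdb_path", str(pdb_path),
        # Chains to design; all other chains in the file act as fixed context.
        "--pdb_path_chains", " ".join(design_chains),
        "--out_folder", str(out_folder),
        "--num_seq_per_target", str(num_seqs),
        "--sampling_temp", str(sampling_temp),
        "--seed", str(seed),
    ]

# Placeholder paths; design only chain C.
cmd = build_mpnn_command("weights/", "complex.pdb", "out/", ["C"])
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) from the ProteinMPNN repository root.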

Intended use

Drop-in replacement for vanilla ProteinMPNN when you need an inverse-folding model that is blind to MHC. Useful for: testing whether epitope/groove-anchor preferences are learned from general protein structure vs. memorized from pMHC structures; OOD design benchmarks on immunological complexes; leakage controls for TCR/pMHC modeling pipelines (e.g. the TCera project).

It is not a better general-purpose ProteinMPNN — it is a controlled ablation.

Data partition

Base training set (pdb_2021aug02, curated by Ivan Anishchenko): clustered at 30% sequence identity.

                                 Chains     Clusters
Full list.csv                    555,720    27,789
Removed (all MHC)                6,076      27 (fully dropped)
Kept (training set used here)    549,644    27,762

.pt files in the extracted set: 869,548.

How "MHC structures" were defined (1,822 PDB IDs)

The exclusion set is the union of three complementary sources, to avoid annotation gaps:

  1. RCSB Pfam annotations — every entry with an MHC-fold polymer entity: PF00129 (MHC-I / MHC-like, incl. CD1, MR1), PF00969 (MHC-II β), PF00993 (MHC-II α) → 1,352 entries.
  2. STCRDab TCR–MHC complexes → 456 entries (adds 160 not Pfam-annotated).
  3. Sequence-signature scan over all 555,720 chains (MHC-I SRPG.GEP[RK].I, β2-microglobulin YSRHP[AP]ENGK, MHC-II HV[IL]IQAEFY / FNGTERVR) → caught 310 more structures that RCSB had not Pfam-annotated (e.g. 1cd1).
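
The sequence-signature scan in step 3 can be sketched as a regular-expression search. The four motifs are copied from the list above; the function names, the dictionary labels, and the toy chain IDs and sequence fragments are illustrative assumptions, not the code used to build the released exclusion list.

```python
import re

# Signature motifs quoted from step 3, compiled as regular expressions.
MHC_SIGNATURES = {
    "MHC-I": re.compile(r"SRPG.GEP[RK].I"),
    "beta2-microglobulin": re.compile(r"YSRHP[AP]ENGK"),
    "MHC-II (HV[IL]IQAEFY)": re.compile(r"HV[IL]IQAEFY"),
    "MHC-II (FNGTERVR)": re.compile(r"FNGTERVR"),
}

def mhc_signature_hits(sequence):
    """Return the names of all signatures found anywhere in the sequence."""
    return [name for name, pattern in MHC_SIGNATURES.items()
            if pattern.search(sequence)]

def flag_pdb_ids(chains):
    """chains: iterable of (chain_id, sequence) with chain_id like 'pdbid_A'.

    Returns the lower-cased PDB IDs that have at least one matching chain,
    so a single hit flags the whole entry.
    """
    flagged = set()
    for chain_id, sequence in chains:
        if mhc_signature_hits(sequence):
            flagged.add(chain_id.split("_")[0].lower())
    return flagged

# Toy input: short illustrative fragments with made-up chain IDs.
toy_chains = [
    ("toy1_A", "SVSRPGRGEPRFIAVGY"),   # contains the MHC-I motif
    ("toy1_B", "IQVYSRHPAENGKSNF"),    # contains the beta2-microglobulin motif
    ("toy2_A", "KVFGRCELAAAMKRHG"),    # no motif
]
print(flag_pdb_ids(toy_chains))
```

Flagging at the PDB-ID level rather than the chain level matches the PDB-level removal described below.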

Full list: mhc_excluded_pdb_ids.txt. Of 1,822, 1,531 were present in this training set.

Removal is PDB-level (and why)

A chain is removed whenever its PDB entry contains an MHC, so the entire structure is dropped (MHC + β2m + peptide + any crystallization Fab/TCR). This is required because the ProteinMPNN loader rebuilds the full biological assembly from any one chain: removing only the MHC chain would let the β2m or peptide chain re-load the complex (and the presented peptide) into training.

⚠️ We deliberately do not use cluster-level removal. MHC structures are routinely crystallized with Fabs/TCRs whose Ig V-domains fall into large mixed antibody+TCR clusters; expanding to those clusters would delete thousands of legitimate non-MHC antibody/TCR chains. PDB-level removal keeps antibodies/TCRs from non-MHC structures intact (e.g. the largest Ig-Vκ cluster retained 6,914 of 7,507 chains; only the 593 chains from anti-MHC Fabs inside MHC complexes were removed).
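
The PDB-level rule in this section can be sketched as a filter over list.csv rows. This is not the released filter_list.py. It assumes rows carry a CHAINID column of the form pdbid_chain and a CLUSTER column, and the IDs and cluster numbers in the toy table are made up.

```python
import csv
import io

def filter_list_csv(rows, excluded_pdb_ids):
    """Keep only rows whose PDB entry is not in the exclusion set.

    rows: dicts with at least a 'CHAINID' key of the form '<pdbid>_<chain>'.
    Every chain of an excluded entry is dropped, whatever its cluster.
    """
    excluded = {pdb_id.lower() for pdb_id in excluded_pdb_ids}
    return [row for row in rows
            if row["CHAINID"].split("_")[0].lower() not in excluded]

# Toy list.csv: entry aaaa is an MHC complex crystallized with a Fab heavy
# chain (cluster 500); entry bbbb is an unrelated antibody in the same cluster.
toy_csv = """CHAINID,CLUSTER
aaaa_A,101
aaaa_B,102
aaaa_C,103
aaaa_H,500
bbbb_H,500
bbbb_L,501
"""
rows = list(csv.DictReader(io.StringIO(toy_csv)))
kept = filter_list_csv(rows, {"AAAA"})
print([row["CHAINID"] for row in kept])  # ['bbbb_H', 'bbbb_L']
```

Cluster 500 survives through bbbb_H, whereas a cluster-level rule would have deleted it along with the unrelated antibody.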

Verification

All 9 PDB structures in the RCSB that present the NY-ESO-1 epitope SLLMWITQC (incl. the pMHC-only 1s9w, which had leaked the epitope under an earlier "remove TCR–MHC only" partition) are confirmed absent from the final list.csv.
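
A check of this kind can be sketched as follows. The function name and the toy chain IDs are hypothetical. Only six of the nine SLLMWITQC structures are named in this card (1s9w and the five 1G4 complexes used in the evaluation), so the example lists only those.

```python
def leaked_pdb_ids(chain_ids, must_be_absent):
    """Return the PDB IDs from must_be_absent that still own a chain in chain_ids.

    chain_ids: iterable of CHAINID strings like '<pdbid>_<chain>'.
    Comparison is case-insensitive; an empty result means no leak.
    """
    present = {chain_id.split("_")[0].lower() for chain_id in chain_ids}
    return sorted(pdb_id.lower() for pdb_id in must_be_absent
                  if pdb_id.lower() in present)

# IDs named in this card as presenting SLLMWITQC (the other three are not listed here).
epitope_ids = ["1s9w", "2bnr", "2f53", "2f54", "2p5e", "2p5w"]

# Toy CHAINID column standing in for the filtered list.csv.
toy_chain_ids = ["xxxx_A", "xxxx_B", "yyyy_H"]
print(leaked_pdb_ids(toy_chain_ids, epitope_ids))  # []
```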

Training

  • Framework: PyTorch 2.1 (CUDA 11.8); 1× NVIDIA A40 (Savio savio3_gpu).
  • Command (official training.py): --num_epochs 200 --save_model_every_n_epochs 10 --reload_data_every_n_epochs 2 --num_neighbors 48 --backbone_noise 0.2 (defaults otherwise: batch_size=10000 tokens, mixed precision, Noam schedule).
  • Convergence: validation sequence recovery plateaus at ~0.52 (≈ vanilla ProteinMPNN), consistent with the ~1% MHC removal (6,076 of 555,720 chains) not perturbing general learning.
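
For reference, the Noam schedule named above has the standard form lr = factor · d_model^-0.5 · min(step^-0.5, step · warmup^-1.5). The sketch below uses d_model=128 to match hidden_dim; factor=2 and warmup=4000 are assumptions about the official training defaults and should be checked against the training utilities before being relied on.

```python
def noam_lr(step, d_model=128, factor=2.0, warmup=4000):
    """Learning rate at optimizer step `step` (>= 1) under the Noam schedule.

    Linear warmup for `warmup` steps, then inverse-square-root decay.
    factor and warmup are assumed values, not read from this release.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

for step in (1, 1000, 4000, 16000):
    print(step, f"{noam_lr(step):.6f}")
```

The rate peaks at step == warmup and halves each time the step count quadruples after that.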

Training curves (through epoch 106)

[Figure: two panels, loss (perplexity) and accuracy (sequence recovery).]

(Validation perplexity sits below train because training applies backbone noise + dropout; both curves are smooth and plateaued, with no sign of overfitting.)
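
As a reminder of how the two plotted quantities relate to model outputs, here is a small sketch. It assumes the usual convention that perplexity is the exponential of the mean per-residue negative log-likelihood and that recovery is the fraction of positions where the predicted residue equals the native one; the function name and the toy inputs are illustrative.

```python
import math

def perplexity_and_recovery(log_probs, native, predicted):
    """log_probs: natural-log probability given to the native residue at each
    scored position; native / predicted: residue strings of the same length."""
    if not native or not (len(log_probs) == len(native) == len(predicted)):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_nll = -sum(log_probs) / len(log_probs)
    recovery = sum(a == b for a, b in zip(native, predicted)) / len(native)
    return math.exp(mean_nll), recovery

# A model that is uniform over the 20 amino acids has perplexity 20.
ppl, rec = perplexity_and_recovery([math.log(1 / 20)] * 9,
                                   "SLLMWITQC", "SLLMAITQV")
print(round(ppl, 3), round(rec, 3))
```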

Evaluation — epitope design on held-out TCR–pMHC (leakage probe)

Design only the 9-mer peptide (chain C), MHC + TCR backbones fixed, on five held-out 1G4 TCR–pMHC complexes (2BNR, 2F53, 2F54, 2P5E, 2P5W; native epitope SLLMWITQC). 1,000 designs each (5,000 total), sampling_temp=0.1, seed 37.

Result: 0 / 5,000 exact epitope recoveries, for both this model and the leaky control. The discriminating signal is the buried P5 Trp anchor (per-position native recovery, pooled over the 5 structures):

Pos          Native       Leaky control (saw 1s9w, e41)    MHC-blind (e94 snapshot ≈ released e106)
5            W            0.999                            ~0.50
2            L            0.590                            1.000
7            T            0.594                            0.800
1/3/4/8/9    S/L/M/Q/C    ≤0.23                            ≤0.09
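
Per-position numbers like those in the table can be computed from the designed peptides with a few lines of Python. This is a sketch, not the evaluation script: the function names are hypothetical and the four toy designs are made up. It also does not parse the FASTA output of protein_mpnn_run.py.

```python
NATIVE = "SLLMWITQC"  # native NY-ESO-1 epitope from the evaluation setup

def per_position_recovery(designs, native=NATIVE):
    """Fraction of designs that match the native residue at each 1-based position."""
    if not designs:
        raise ValueError("no designs given")
    if any(len(design) != len(native) for design in designs):
        raise ValueError("all designs must have the native length")
    n = len(designs)
    return {pos + 1: sum(design[pos] == aa for design in designs) / n
            for pos, aa in enumerate(native)}

def exact_recovery(designs, native=NATIVE):
    """Number of designs identical to the native peptide."""
    return sum(design == native for design in designs)

# Made-up 9-mers for illustration only.
toy_designs = ["ALLPWGTEL", "KLYPFGTEL", "ALFPWATNL", "KLLDWSTHV"]
recovery = per_position_recovery(toy_designs)
print(recovery[2], recovery[5], exact_recovery(toy_designs))  # 1.0 0.75 0
```

Pooling over several structures amounts to concatenating their design lists before calling the function.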

Interpretation. Conditioned on the MHC-groove backbone, the MHC-blind model still designs groove-compatible peptides with the correct HLA-A2 anchor grammar (P2→L, C-terminus→L), i.e. general protein-packing rules generalize to the groove. But it never recovers the true TCR-selected epitope, and its recovery at the most MHC-specific buried position (P5 Trp) is well below the leaky control's 0.999. At strictly matched training length (epoch ~44) the MHC-blind W5 recovery was 0.30, rising to 0.50 by epoch 94 as general structure learning improves (the released epoch-106 checkpoint has plateaued at the same level), still far short of the leaky model's ~1.00. That near-perfect leaky-control W5 was largely memorization of the 1s9w pMHC, now removed.

Files

  • proteinmpnn_nomhc.pt — model weights (epoch 106).
  • mhc_excluded_pdb_ids.txt — the 1,822 excluded MHC PDB IDs.
  • filter_list.py — PDB-level list.csv filter.
  • get_mhc_chains.py — RCSB polymer-entity Pfam → MHC chain resolver.

Citation

Built on ProteinMPNN:

@article{dauparas2022robust,
  title={Robust deep learning--based protein sequence design using ProteinMPNN},
  author={Dauparas, J. and Anishchenko, I. and Bennett, N. and others},
  journal={Science}, volume={378}, number={6615}, pages={49--56}, year={2022}
}