ProteinMPNN-noMHC
A vanilla ProteinMPNN (Dauparas et al., 2022) retrained on the standard PDB 2021-08-02 training set with every MHC-containing structure removed. The purpose is a leakage-controlled, genuinely out-of-distribution model for studying peptide/epitope design on peptide–MHC (pMHC) and TCR–pMHC complexes: during training, this model never saw an MHC molecule (classical class I/II or MHC-like) or any peptide presented in an MHC groove.
- Architecture: vanilla ProteinMPNN (`hidden_dim=128`, 3 encoder + 3 decoder layers, `k_neighbors=48`, `backbone_noise=0.20`); ~1.66 M parameters.
- Released checkpoint: `proteinmpnn_nomhc.pt` (epoch 106; full training target 200 epochs).
- Loads directly with the official `protein_mpnn_run.py` (`--path_to_model_weights <dir> --model_name proteinmpnn_nomhc`).
- Base data: https://files.ipd.uw.edu/pub/training_sets/pdb_2021aug02.tar.gz.
Intended use
Drop-in replacement for vanilla ProteinMPNN when you need an inverse-folding model that is blind to MHC. Useful for: testing whether epitope/groove-anchor preferences are learned from general protein structure vs. memorized from pMHC structures; OOD design benchmarks on immunological complexes; leakage controls for TCR/pMHC modeling pipelines (e.g. the TCera project).
It is not a better general-purpose ProteinMPNN — it is a controlled ablation.
Data partition
Base training set (pdb_2021aug02, curated by Ivan Anishchenko): clustered at 30% sequence identity.
| | chains | clusters |
|---|---|---|
| Full `list.csv` | 555,720 | 27,789 |
| Removed (all MHC) | 6,076 | 27 fully dropped |
| Kept (training set used here) | 549,644 | 27,762 |
.pt files in the extracted set: 869,548.
How "MHC structures" were defined (1,822 PDB IDs)
The exclusion set is the union of three complementary sources, to avoid annotation gaps:
- RCSB Pfam annotations: every entry with an MHC-fold polymer entity, i.e. `PF00129` (MHC-I / MHC-like, incl. CD1, MR1), `PF00969` (MHC-II β), `PF00993` (MHC-II α) → 1,352 entries.
- STCRDab TCR–MHC complexes → 456 entries (adds 160 not Pfam-annotated).
- Sequence-signature scan over all 555,720 chains (MHC-I `SRPG.GEP[RK].I`, β2-microglobulin `YSRHP[AP]ENGK`, MHC-II `HV[IL]IQAEFY` / `FNGTERVR`) → caught 310 more structures that RCSB had not Pfam-annotated (e.g. `1cd1`).
Full list: `mhc_excluded_pdb_ids.txt`. Of the 1,822 excluded IDs, 1,531 were present in this training set.
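The sequence-signature scan can be illustrated with a minimal sketch. The patterns are the ones quoted in the list above; reading the MHC-II entry as two alternative patterns is our interpretation. The function names and the `<pdbid>_<chain>` chain-ID format are assumptions for illustration and are not taken from the released scripts.

```python
import re

# Signature patterns quoted from the list above; the family labels are ours.
MHC_SIGNATURES = {
    "MHC-I": [re.compile(r"SRPG.GEP[RK].I")],
    "beta2-microglobulin": [re.compile(r"YSRHP[AP]ENGK")],
    # Assumption: the two MHC-II motifs are alternatives (either one is a hit).
    "MHC-II": [re.compile(r"HV[IL]IQAEFY"), re.compile(r"FNGTERVR")],
}


def mhc_families(sequence):
    """Return the sorted family labels whose signature occurs in the sequence."""
    seq = sequence.upper()
    return sorted(
        family
        for family, patterns in MHC_SIGNATURES.items()
        if any(p.search(seq) for p in patterns)
    )


def flag_pdb_ids(chains):
    """Map (chain_id, sequence) pairs to the set of PDB IDs with any hit.

    Assumes chain IDs look like '<pdbid>_<chain>' (hypothetical format).
    """
    return {
        chain_id.split("_")[0].lower()
        for chain_id, seq in chains
        if mhc_families(seq)
    }
```

A single hit on any chain flags the whole PDB entry, which matches the PDB-level exclusion used for this model.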
Removal is PDB-level (and why)
Each chain is removed if its PDB entry contains MHC, dropping the entire structure (MHC + β2m + peptide + any crystallization Fab/TCR). This is required because the ProteinMPNN loader rebuilds the full biological assembly from any one chain — removing only the MHC chain would let the β2m or peptide chain re-load the complex (and the presented peptide) into training.
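As a rough sketch of what PDB-level filtering means in practice, the function below drops every row of a chain list whose PDB entry is in the exclusion set. It is not the released `filter_list.py`; the column name `CHAINID`, the `<pdbid>_<chain>` ID format, and the toy rows in the usage note are assumptions made for illustration.

```python
import csv
import io


def filter_list_csv(csv_text, excluded_pdb_ids, id_column="CHAINID"):
    """Drop every row whose PDB entry is in the exclusion set.

    Assumes the ID column holds values like '0aaa_A' (PDB ID, underscore, chain).
    Returns (kept_csv_text, n_removed).
    """
    excluded = {pid.strip().lower() for pid in excluded_pdb_ids}
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    removed = 0
    for row in reader:
        pdb_id = row[id_column].split("_")[0].lower()
        if pdb_id in excluded:
            removed += 1  # whole entry goes: MHC, beta2m, peptide, Fab/TCR alike
            continue
        writer.writerow(row)
    return out.getvalue(), removed
```

For example, with the made-up input `"CHAINID,CLUSTER\n0aaa_A,1\n0aaa_B,2\n0aaa_C,3\n0bbb_A,4\n"` and exclusion list `["0AAA"]`, all three chains of `0aaa` are removed, so no remaining chain can pull that assembly back into training.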
⚠️ We deliberately do not use cluster-level removal. MHC structures are routinely crystallized with Fabs/TCRs whose Ig V-domains fall into large mixed antibody+TCR clusters; expanding to those clusters would delete thousands of legitimate non-MHC antibody/TCR chains. PDB-level removal keeps antibodies/TCRs from non-MHC structures intact (e.g. the largest Ig-Vκ cluster retained 6,914 of 7,507 chains; only the ~593 anti-MHC Fabs inside MHC complexes were removed).
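The difference between the two removal strategies can be seen on a toy example. Everything below is made up for illustration (chain IDs, cluster numbers, function names); it only shows why cluster-level removal causes collateral loss when an anti-MHC Fab shares an Ig cluster with unrelated antibodies.

```python
def pdb_level_removal(rows, mhc_pdb_ids):
    """Chains removed when whole MHC-containing PDB entries are dropped."""
    return {cid for cid, _ in rows if cid.split("_")[0] in mhc_pdb_ids}


def cluster_level_removal(rows, mhc_pdb_ids):
    """Chains removed when every cluster touched by an MHC entry is dropped."""
    hit_clusters = {cl for cid, cl in rows if cid.split("_")[0] in mhc_pdb_ids}
    return {cid for cid, cl in rows if cl in hit_clusters}


# Toy data: (chain_id, cluster). Entry 0aaa is a made-up MHC complex whose Fab
# chain H shares Ig cluster 7 with Fabs from two unrelated, non-MHC entries.
rows = [
    ("0aaa_A", 1),  # MHC heavy chain
    ("0aaa_H", 7),  # anti-MHC Fab chain
    ("0bbb_H", 7),  # unrelated antibody
    ("0ccc_H", 7),  # unrelated antibody
    ("0ddd_A", 9),  # unrelated enzyme
]
mhc = {"0aaa"}
by_pdb = pdb_level_removal(rows, mhc)
by_cluster = cluster_level_removal(rows, mhc)
collateral = by_cluster - by_pdb  # non-MHC chains lost only under cluster-level removal
```

Here `collateral` contains the two unrelated antibody chains, the toy analogue of the non-MHC antibody/TCR chains that PDB-level removal keeps.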
Verification
All 9 PDB structures in the RCSB that present the NY-ESO-1 epitope SLLMWITQC
(incl. the pMHC-only 1s9w, which had leaked the epitope under an earlier "remove TCR–MHC only"
partition) are confirmed absent from the final list.csv.
Training
- Framework: PyTorch 2.1 (CUDA 11.8); 1× NVIDIA A40 (Savio `savio3_gpu`).
- Command (official `training.py`): `--num_epochs 200 --save_model_every_n_epochs 10 --reload_data_every_n_epochs 2 --num_neighbors 48 --backbone_noise 0.2` (defaults otherwise: `batch_size=10000` tokens, mixed precision, Noam schedule).
- Convergence: validation sequence-recovery plateaus at ~0.52 (≈ vanilla ProteinMPNN), confirming the ~1% MHC removal did not perturb general learning.
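For readers unfamiliar with the Noam schedule mentioned above, here is a minimal sketch of the standard formula (linear warmup followed by inverse-square-root decay). `d_model=128` follows the hidden dimension stated earlier; `factor=2.0` and `warmup=4000` are assumed values that should be checked against the official `training.py` before relying on them.

```python
def noam_lr(step, d_model=128, factor=2.0, warmup=4000):
    """Noam learning rate: linear warmup, then inverse-square-root decay.

    factor and warmup are assumptions, not values confirmed by this document.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The rate rises linearly until `step == warmup`, peaks there, and decays as `step ** -0.5` afterwards.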
Training curves (through epoch 106)
(Validation perplexity sits below train because training applies backbone noise + dropout; both curves are smooth and plateaued, with no sign of overfitting.)
Evaluation — epitope design on held-out TCR–pMHC (leakage probe)
Design only the 9-mer peptide (chain C), MHC + TCR backbones fixed, on five held-out 1G4
TCR–pMHC complexes (2BNR, 2F53, 2F54, 2P5E, 2P5W; native epitope SLLMWITQC). 1,000 designs each
(5,000 total), sampling_temp=0.1, seed 37.
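The setup above can be sketched as a command builder for the official `protein_mpnn_run.py`. The flags used are ones we understand the official script to accept, with `--pdb_path_chains` naming the chains to design and all other chains left fixed; the file paths and the helper name `build_design_command` are hypothetical, and nothing is executed here.

```python
import shlex


def build_design_command(pdb_path, weights_dir, out_folder,
                         model_name="proteinmpnn_nomhc", design_chains="C",
                         num_designs=1000, sampling_temp=0.1, seed=37):
    """Assemble an argv list for protein_mpnn_run.py; nothing is run."""
    return [
        "python", "protein_mpnn_run.py",
        "--path_to_model_weights", str(weights_dir),
        "--model_name", model_name,
        "--pdb_path", str(pdb_path),
        "--pdb_path_chains", design_chains,  # chains to design; others stay fixed
        "--num_seq_per_target", str(num_designs),
        "--sampling_temp", str(sampling_temp),
        "--seed", str(seed),
        "--out_folder", str(out_folder),
    ]


# Hypothetical paths, for illustration only.
cmd = build_design_command("2bnr.pdb", "weights/", "out/2bnr")
print(shlex.join(cmd))
```

Running the same builder over the five structures and pooling the outputs gives the 5,000 designs evaluated here.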
Result: 0 / 5,000 exact epitope recovery for both this model and a leaky control. The discriminating signal is the buried P5 Trp anchor; the table gives per-position recovery of the native residue, pooled over the 5 structures:
| Pos | Native | Leaky control (saw 1s9w, e41) | MHC-blind (e94 snapshot ≈ released e106) |
|---|---|---|---|
| 5 | W | 0.999 | ~0.50 |
| 2 | L | 0.590 | 1.000 |
| 7 | T | 0.594 | 0.800 |
| 1/3/4/8/9 | S/L/M/Q/C | ≤0.23 | ≤0.09 |
Interpretation. Conditioned on the MHC-groove backbone, the MHC-blind model still designs
groove-compatible peptides with the correct HLA-A2 anchor grammar (P2→L, C-terminus→L), i.e. general
protein-packing rules generalize to the groove. But it never recovers the true TCR-selected epitope,
and its recovery at the most MHC-specific buried position (P5 Trp) is well below the leaky control's
0.999. At strictly matched training length (epoch ~44) the MHC-blind W5 recovery was 0.30, rising
to **0.50** by epoch 94 as general structure learning improves (the released epoch-106 checkpoint has
plateaued at the same level), still far short of the leaky model's ~1.00. That near-perfect
leaky-control W5 recovery was largely memorization of the 1s9w pMHC, now removed.
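The per-position recovery values and the exact-recovery count reported above can be computed from a pooled list of designed peptides along these lines. This is a minimal sketch with function names of our own choosing, not the evaluation script used for the table.

```python
from collections import Counter


def position_recovery(designs, native):
    """Fraction of designs that reproduce the native residue at each position."""
    if not designs:
        raise ValueError("need at least one design")
    if any(len(d) != len(native) for d in designs):
        raise ValueError("all designs must match the native length")
    n = len(designs)
    return [sum(d[i] == aa for d in designs) / n for i, aa in enumerate(native)]


def exact_recovery(designs, native):
    """Number of designs identical to the native sequence."""
    return Counter(designs)[native]
```

With `native = "SLLMWITQC"`, index 4 of the returned list corresponds to the P5 Trp anchor discussed above, and pooling simply means concatenating the design lists from the five structures before calling `position_recovery`.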
Files
- `proteinmpnn_nomhc.pt`: model weights (epoch 106).
- `mhc_excluded_pdb_ids.txt`: the 1,822 excluded MHC PDB IDs.
- `filter_list.py`: PDB-level `list.csv` filter.
- `get_mhc_chains.py`: RCSB polymer-entity Pfam → MHC chain resolver.
Citation
Built on ProteinMPNN:
@article{dauparas2022robust,
title={Robust deep learning--based protein sequence design using ProteinMPNN},
author={Dauparas, J. and Anishchenko, I. and Bennett, N. and others},
journal={Science}, volume={378}, number={6615}, pages={49--56}, year={2022}
}

