metadata
license: mit
language:
  - en
tags:
  - biology
  - protein
  - peptide
  - cell-penetrating-peptide
  - esm-c
  - protein-language-model
pipeline_tag: text-classification
library_name: pytorch

CPPro-600M

Open-weight cell-penetrating-peptide (CPP) classifier. Given a peptide of 5-50 amino acids (canonical residues only), it returns P(CPP). The model is a masked sequence-CNN head on frozen ESM-C 600M embeddings, hardened with hard-negative mining and provided as a 5-seed ensemble. It runs locally and needs no API key.

Performance (v6 benchmark, computed on n=570)

| Metric | Value |
| --- | --- |
| v6 held-out test MCC | 0.814 |
| AUROC | 0.974 |
| Recall @0.5 (TPR) | 90.9% |
| False-positive rate @0.5 | 9.5% |
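For readers who want to recompute these metrics on their own labelled set, here is a minimal sketch of how MCC, TPR and FPR follow from thresholded scores. The function names are hypothetical and not part of the CPP-Pro package, the toy labels and scores are illustrative only, and treating a score of exactly 0.5 as positive is an assumption.

```python
import math

def confusion_counts(y_true, probs, threshold=0.5):
    """Count TP, FP, TN, FN. Assumption: p >= threshold counts as a positive call."""
    tp = fp = tn = fn = 0
    for label, p in zip(y_true, probs):
        predicted_positive = p >= threshold
        if predicted_positive and label:
            tp += 1
        elif predicted_positive and not label:
            fp += 1
        elif not predicted_positive and label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def screening_metrics(tp, fp, tn, fn):
    """Return MCC, recall (TPR) and false-positive rate from confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"mcc": mcc, "tpr": tpr, "fpr": fpr}

# Toy data, not drawn from the v6 benchmark.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
print(screening_metrics(*confusion_counts(toy_labels, toy_scores)))
```

MCC uses all four cells of the confusion matrix, so it stays informative when positives and negatives are imbalanced, which is a common situation in peptide screening.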

The v6 test set is exact-disjoint from the published training data of pLM4CPPs and GraphCPP. On this same disjoint test, the prior state of the art, pLM4CPPs, scores 0.652; the hosted CPPro-6B reaches ~0.87. Hard-negative mining lowers the false-positive rate relative to the base head, which matters because screening is sensitive to false positives.

Why a fresh benchmark? Prior CPP predictors report MCC 0.80-0.96, but they train and test on the same CPPsite-2.0 lineage with identity-only deduplication, so their test sets are in-distribution. On a strictly disjoint test those numbers do not hold: on GraphCPP's own diverse benchmark the score drops to 0.58.

Usage

pip install "git+https://github.com/VicCar/CPP-Pro"
cppro-score --seq RRRRRRRRR
cppro-score --csv designs.csv --out scored.csv

A high per-seed standard deviation flags an out-of-distribution or unstable call. The model works best on realistic peptide sequences; synthetic homopolymers are out of distribution.
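As an illustration of the per-seed spread check, the sketch below summarizes five per-seed probabilities for one peptide. It assumes you already have those probabilities in hand. The function name `summarize_ensemble`, the averaging rule, the use of the population standard deviation, and the `std_flag` cut-off of 0.15 are all assumptions made for this example, not values taken from the package.

```python
from statistics import mean, pstdev

def summarize_ensemble(seed_probs, std_flag=0.15):
    """Summarize per-seed P(CPP) values for a single peptide.

    Assumptions: the ensemble score is the plain mean of the seeds, and
    std_flag is a hypothetical cut-off for calling a prediction unstable.
    """
    if not seed_probs:
        raise ValueError("need at least one per-seed probability")
    spread = pstdev(seed_probs)  # population standard deviation across seeds
    return {
        "p_cpp": mean(seed_probs),
        "std": spread,
        "unstable": spread > std_flag,
    }

# Seeds that agree give a low spread; seeds that disagree get flagged.
print(summarize_ensemble([0.91, 0.88, 0.93, 0.90, 0.89]))
print(summarize_ensemble([0.10, 0.90, 0.20, 0.80, 0.50]))
```

A reasonable cut-off depends on the score distribution of your own designs, so treat the default here as a placeholder to tune.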

Files

seed0..4.pt (SeqCNN state dicts, in_dim=1152) + metrics.json.

Limitations

Canonical 20 amino acids only; no non-canonical residues or macrocyclisation. Lengths 5-50 aa.
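A small pre-filter can catch inputs outside these limits before scoring. The sketch below is one way to do it. `check_peptide` is a hypothetical helper, not part of the CPP-Pro package, and normalizing input to upper case is an assumption about how sequences are written.

```python
CANONICAL = frozenset("ACDEFGHIKLMNPQRSTVWY")  # the 20 canonical amino acids
MIN_LEN, MAX_LEN = 5, 50                       # supported length range in residues

def check_peptide(seq):
    """Return a list of problems; an empty list means the sequence is in scope."""
    seq = seq.strip().upper()  # assumption: one-letter codes, case-insensitive
    problems = []
    if not MIN_LEN <= len(seq) <= MAX_LEN:
        problems.append(f"length {len(seq)} outside {MIN_LEN}-{MAX_LEN}")
    unknown = sorted(set(seq) - CANONICAL)
    if unknown:
        problems.append("non-canonical residues: " + "".join(unknown))
    return problems

for candidate in ["RRRRRRRRR", "RRR", "RRRRXRRRR"]:
    print(candidate, check_peptide(candidate))
```

Returning a list of problems rather than raising keeps it easy to filter a whole CSV of designs and report why each row was dropped.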

License

MIT (head weights). ESM-C 600M backbone is EvolutionaryScale's, downloaded separately. Code: https://github.com/VicCar/CPP-Pro · Citation: CPPro (Carreira et al., in prep).