---
license: mit
language: [en]
tags: [biology, protein, peptide, cell-penetrating-peptide, esm-c, protein-language-model]
pipeline_tag: text-classification
library_name: pytorch
---

# CPPro-600M

**Open-weight cell-penetrating-peptide (CPP) classifier.** Given a peptide (5-50 aa, canonical residues), it returns `P(CPP)`. The model is a masked sequence-CNN head on **frozen ESM-C 600M** embeddings, **hardened with hard-negative mining** and shipped as a 5-seed ensemble. Runs locally, no API key.

## Performance (v6 benchmark, computed on n=570)

| Metric | Value |
|---|---|
| v6 held-out test **MCC** | **0.814** |
| AUROC | 0.974 |
| Recall @0.5 (TPR) | 90.9% |
| False-positive rate @0.5 | 9.5% |

The v6 test is **exact-disjoint** from the published training data of pLM4CPPs and GraphCPP. On this same disjoint test, prior SOTA **pLM4CPPs scores 0.652**; the hosted **CPPro-6B reaches ~0.87**. Hard-negative mining lowers the false-positive rate relative to the base head, which matters because screening is FP-sensitive.

> **Why a fresh benchmark?** Prior CPP predictors report MCC 0.80-0.96 but train *and* test on the
> same CPPsite-2.0 lineage with identity-only dedup (in-distribution). On a strictly disjoint test
> those numbers don't hold (GraphCPP's own diverse benchmark drops to 0.58).

## Usage

```bash
pip install "git+https://github.com/VicCar/CPP-Pro"
cppro-score --seq RRRRRRRRR
cppro-score --csv designs.csv --out scored.csv
```

A high per-seed std flags an out-of-distribution or unstable call. The model works best on realistic peptide sequences; synthetic homopolymers are out of distribution.

## Files

`seed0..4.pt` (SeqCNN state dicts, in_dim=1152) + `metrics.json`.

## Limitations

Canonical 20 amino acids only; no non-canonical residues or macrocyclisation. Lengths 5-50 aa.

## License

MIT (head weights). The ESM-C 600M backbone is EvolutionaryScale's and is downloaded separately.

Code: https://github.com/VicCar/CPP-Pro · Citation: CPPro (Carreira et al., in prep).
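
## Sketch: pre-filtering inputs and reading the per-seed spread

The input limits stated above (5-50 aa, canonical 20 residues) and the note that a high per-seed std flags an unstable call can be illustrated with a small standalone sketch. It does not use or reproduce the CPP-Pro package: `is_scorable`, `aggregate`, and the `max_std` cut-off of 0.15 are hypothetical names and values chosen for illustration, and averaging the five seed probabilities is an assumption about how the ensemble is combined. Only the 0.5 decision threshold comes from the metrics table.

```python
import statistics

# The 20 canonical amino acids in one-letter code.
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")


def is_scorable(seq, min_len=5, max_len=50):
    """Return True if seq is within the stated limits: 5-50 aa, canonical residues only."""
    seq = seq.strip()
    return min_len <= len(seq) <= max_len and set(seq) <= CANONICAL


def aggregate(seed_probs, threshold=0.5, max_std=0.15):
    """Combine per-seed P(CPP) values into one call.

    Assumption: the ensemble score is the mean over seeds.
    max_std is a hypothetical cut-off, not a value from the model card.
    """
    mean = statistics.fmean(seed_probs)
    std = statistics.pstdev(seed_probs)  # population std over the seeds
    return {
        "p_cpp": mean,
        "std": std,
        "is_cpp": mean >= threshold,
        "unstable": std > max_std,
    }


if __name__ == "__main__":
    print(is_scorable("RRRRRRRRR"))   # within length limits, canonical residues
    print(is_scorable("RRRRXRRRR"))   # 'X' is not a canonical residue
    print(aggregate([0.91, 0.88, 0.93, 0.90, 0.89]))
```

Filtering before scoring keeps out-of-scope peptides (too short, too long, or containing non-canonical residues) from producing a number that looks meaningful; the `unstable` flag is only a convenience for sorting calls that deserve a second look.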