---
license: mit
language: [en]
tags: [biology, protein, peptide, cell-penetrating-peptide, esm-c, protein-language-model]
pipeline_tag: text-classification
library_name: pytorch
---

# CPPro-600M

**Open-weight cell-penetrating-peptide (CPP) classifier.** Given a peptide of 5-50 canonical
amino acids, it returns `P(CPP)`. The model is a masked sequence-CNN head on **frozen ESM-C 600M**
embeddings, **hardened with hard-negative mining** and provided as a 5-seed ensemble. It runs
locally and needs no API key.

## Performance (v6 benchmark, computed on n=570)

| Metric | Value |
|---|---|
| v6 held-out test **MCC** | **0.814** |
| AUROC | 0.974 |
| Recall at 0.5 threshold (TPR) | 90.9% |
| False-positive rate at 0.5 threshold | 9.5% |
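
The thresholded metrics above are standard confusion-matrix quantities. As a minimal sketch, the following shows how MCC, recall and false-positive rate follow from labels and predicted probabilities at a 0.5 threshold. The function name, the `>=` convention at the threshold, and the toy labels and scores are illustrative assumptions, not the v6 evaluation code or data.

```python
import math

def binary_metrics(labels, probs, threshold=0.5):
    """Return MCC, recall (TPR) and false-positive rate at a probability threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(labels, probs):
        pred = p >= threshold  # assumption: scores at the threshold count as positive
        if pred and y == 1:
            tp += 1
        elif pred and y == 0:
            fp += 1
        elif not pred and y == 0:
            tn += 1
        else:
            fn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"mcc": mcc, "tpr": tpr, "fpr": fpr}

# Toy data, invented for illustration only.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_probs = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
toy_metrics = binary_metrics(toy_labels, toy_probs)
```

AUROC is threshold-free and is not covered by this sketch.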

The v6 test is **exact-disjoint** from the published training data of pLM4CPPs and GraphCPP.
On this same disjoint test, the prior state of the art, **pLM4CPPs, scores 0.652**, while the hosted
**CPPro-6B reaches ~0.87**. Hard-negative mining lowers the false-positive rate relative to the
base head, which matters because screening is sensitive to false positives.
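
Exact-disjoint is read here as: no test sequence appears verbatim in the other models' published training sets. A minimal sketch of such a filter follows, assuming sequences are compared as upper-cased strings with surrounding whitespace stripped. The function name and the example sequences are hypothetical, and the actual v6 construction may differ.

```python
def exact_disjoint(test_seqs, train_seqs):
    """Keep test sequences that do not appear verbatim in the training set."""
    def norm(seq):
        # Assumption: case and surrounding whitespace are not meaningful.
        return seq.strip().upper()

    seen = {norm(s) for s in train_seqs}
    return [norm(s) for s in test_seqs if norm(s) not in seen]

# Example sequences, chosen for illustration only.
train = ["RRRRRRRRR", "GRKKRRQRRRPPQ"]
test = ["grkkrrqrrrppq", "KLALKLALKALKAALKLA"]
held_out = exact_disjoint(test, train)
```

An exact-match filter like this removes only identical sequences; near-duplicates would need a separate similarity-based step.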

> **Why a fresh benchmark?** Prior CPP predictors report MCC 0.80-0.96, but they train *and* test on
> the same CPPsite-2.0 lineage with identity-only deduplication, so the test stays in-distribution.
> On a strictly disjoint test those numbers do not hold: on GraphCPP's own diverse benchmark the
> score drops to 0.58.

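## Usage
```bash
# Install the package from GitHub
pip install "git+https://github.com/VicCar/CPP-Pro"
# Score a single sequence
cppro-score --seq RRRRRRRRR
# Score the sequences in a CSV file and write the results to another CSV
cppro-score --csv designs.csv --out scored.csv
```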
A high standard deviation across the seeds flags an out-of-distribution or unstable call. The
model works best on realistic peptide sequences; synthetic homopolymers are out of distribution.
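
The per-seed spread can be turned into a simple flag. The sketch below assumes each of the five seed models emits a probability and that the ensemble score is their mean, which this card does not specify. The function name, the use of the population standard deviation, and the 0.15 cut-off are illustrative assumptions, not values shipped with CPPro.

```python
from statistics import mean, pstdev

def summarize_seeds(seed_probs, std_cutoff=0.15):
    """Combine per-seed probabilities into a mean score and an instability flag."""
    p_cpp = mean(seed_probs)   # assumed aggregation: plain average over seeds
    spread = pstdev(seed_probs)  # population standard deviation across seeds
    return {"p_cpp": p_cpp, "std": spread, "unstable": spread > std_cutoff}

# Hypothetical per-seed outputs for two peptides.
stable = summarize_seeds([0.91, 0.93, 0.90, 0.92, 0.94])
shaky = summarize_seeds([0.15, 0.85, 0.40, 0.95, 0.30])
```

Here `stable` has a small spread and is not flagged, while `shaky` has seeds that disagree strongly and is flagged for manual review.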

## Files
`seed0..4.pt` (SeqCNN state dicts for the five seeds, `in_dim=1152`) and `metrics.json`.
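
Before scoring, it can help to confirm that a local copy of the weights is complete. The sketch below assumes `seed0..4.pt` expands to `seed0.pt` through `seed4.pt` in a single directory; the helper names are hypothetical and are not part of the `cppro-score` package.

```python
from pathlib import Path

def expected_files(n_seeds=5):
    """List the file names this card describes (assumed expansion of seed0..4.pt)."""
    return [f"seed{i}.pt" for i in range(n_seeds)] + ["metrics.json"]

def missing_files(model_dir):
    """Return the expected files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in expected_files() if not (root / name).is_file()]
```

An empty list from `missing_files` means all five state dicts and `metrics.json` are present. Each state dict can then be read with `torch.load(path, map_location="cpu")` and loaded into the head class defined in the code repository.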

## Limitations
Supports only the 20 canonical amino acids at lengths of 5-50 aa; non-canonical residues and
macrocyclisation are not supported.
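
These limits are easy to check before scoring a batch of designs. The sketch below is a pre-filter under the stated constraints; the function name is hypothetical, and upper-casing the input is an assumption about how sequences are written, not documented behaviour of `cppro-score`.

```python
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")  # one-letter codes of the 20 canonical amino acids

def is_supported(seq, min_len=5, max_len=50):
    """Return True if seq is 5-50 residues long and uses only canonical amino acids."""
    seq = seq.strip().upper()  # assumption: case and surrounding whitespace are ignored
    return min_len <= len(seq) <= max_len and set(seq) <= CANONICAL

# Hypothetical batch of designs.
designs = ["RRRRRRRRR", "RRR", "ACDXK", "A" * 51]
supported = [s for s in designs if is_supported(s)]
```

Only the first design passes: the second is too short, the third contains `X`, and the fourth is too long.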

## License
MIT (head weights). The ESM-C 600M backbone belongs to EvolutionaryScale and is downloaded separately.
Code: https://github.com/VicCar/CPP-Pro · Citation: CPPro (Carreira et al., in prep).