---
license: mit
language: [en]
tags: [biology, protein, peptide, cell-penetrating-peptide, esm-c, protein-language-model]
pipeline_tag: text-classification
library_name: pytorch
---

# CPPro-600M

**Open-weight cell-penetrating-peptide (CPP) classifier.** Given a peptide of 5-50 canonical
amino acids, it returns `P(CPP)`. The model is a masked sequence-CNN head on **frozen ESM-C 600M**
embeddings, **hardened with hard-negative mining** and run as a 5-seed ensemble. It runs locally
and needs no API key.

## Performance (v6 benchmark, computed on n=570)
| Metric | Value |
|---|---|
| v6 held-out test **MCC** | **0.814** |
| AUROC | 0.974 |
| recall @0.5 (TPR) | 90.9% |
| false-positive rate @0.5 | 9.5% |

The v6 test set is **exact-disjoint** from the published training data of pLM4CPPs and GraphCPP.
On this same disjoint test set, prior SOTA **pLM4CPPs scores MCC 0.652**; the hosted **CPPro-6B reaches ~0.87**.
Hard-negative mining lowers the false-positive rate relative to the base head, which matters
because screening is sensitive to false positives.
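
For readers who want to reproduce the table metrics on their own predictions, the following is a minimal sketch of how MCC, recall (TPR) and false-positive rate follow from binary confusion counts at a fixed threshold. The function name `confusion_metrics` and the example counts are illustrative assumptions, not part of the CPP-Pro package and not the v6 confusion matrix.

```python
import math


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Return MCC, TPR and FPR from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the matrix is empty; report 0.0.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # recall on true CPPs
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # non-CPPs called CPP
    return {"mcc": mcc, "tpr": tpr, "fpr": fpr}


# Illustrative counts only, NOT the v6 benchmark confusion matrix.
m = confusion_metrics(tp=90, fp=10, tn=90, fn=10)
print(m)
```

Because MCC uses all four cells of the confusion matrix, it penalises false positives even when recall is high, which is why it is a reasonable headline metric for a screening setting.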

> **Why a fresh benchmark?** Prior CPP predictors report MCC 0.80-0.96 but train *and* test on the
> same CPPsite-2.0 lineage with identity-only dedup (in-distribution). On a strictly disjoint test
> those numbers don't hold (GraphCPP drops to 0.58 on its own diverse benchmark).

## Usage
```bash
pip install "git+https://github.com/VicCar/CPP-Pro"
cppro-score --seq RRRRRRRRR
cppro-score --csv designs.csv --out scored.csv
```
A high standard deviation across seeds flags an out-of-distribution or unstable call. The model
works best on realistic peptide sequences; synthetic homopolymers are out of distribution.
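
As a rough illustration of how a per-seed spread can be turned into a flag, the sketch below averages five per-seed probabilities and marks the call as unstable when the spread is large. It assumes the ensemble score is the plain mean of per-seed `P(CPP)` values and uses the population standard deviation; the card does not state either detail. The cutoff `STD_FLAG` and the function `aggregate_seeds` are hypothetical and not part of `cppro-score`.

```python
from statistics import mean, pstdev

STD_FLAG = 0.15  # hypothetical cutoff; tune on your own data


def aggregate_seeds(seed_probs):
    """Combine per-seed P(CPP) values into a mean, a spread, and a flag."""
    if not seed_probs:
        raise ValueError("need at least one per-seed probability")
    mu = mean(seed_probs)
    sd = pstdev(seed_probs)  # population std across seeds (an assumption)
    return {"p_cpp": mu, "std": sd, "unstable": sd > STD_FLAG}


# Made-up per-seed outputs for two peptides.
stable = aggregate_seeds([0.91, 0.93, 0.90, 0.94, 0.92])
shaky = aggregate_seeds([0.15, 0.85, 0.40, 0.95, 0.30])
print(stable["unstable"], shaky["unstable"])
```

In a screening workflow, flagged peptides are probably better routed to manual review than ranked by their mean score alone.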

## Files
`seed0.pt` through `seed4.pt` (SeqCNN state dicts, `in_dim=1152`) and `metrics.json`.

## Limitations
Canonical 20 amino acids only; no non-canonical residues or macrocyclisation. Lengths 5-50 aa.
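
A pre-filter along these lines can catch out-of-scope inputs before scoring. This is a minimal sketch: `check_peptide` is a hypothetical helper, not a function of the CPP-Pro package, and how `cppro-score` itself treats such inputs is not documented here.

```python
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 canonical one-letter codes
MIN_LEN, MAX_LEN = 5, 50


def check_peptide(seq: str) -> list[str]:
    """Return a list of problems; an empty list means the sequence is in scope."""
    seq = seq.strip()
    problems = []
    if not MIN_LEN <= len(seq) <= MAX_LEN:
        problems.append(f"length {len(seq)} outside {MIN_LEN}-{MAX_LEN} aa")
    # Lowercase letters are rejected rather than silently uppercased, since
    # some peptide notations use lowercase for D-residues.
    bad = sorted(set(seq) - CANONICAL)
    if bad:
        problems.append("non-canonical residues: " + "".join(bad))
    return problems


print(check_peptide("RRRRRRRRR"))
print(check_peptide("GRKK"))
print(check_peptide("GRKKRXQ"))
```

Returning a list of problems instead of raising makes it easy to annotate every row of a CSV and drop or inspect the rejected sequences afterwards.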

## License
MIT (head weights). ESM-C 600M backbone is EvolutionaryScale's, downloaded separately.
Code: https://github.com/VicCar/CPP-Pro · Citation: CPPro (Carreira et al., in prep).