---
title: Conformal Protein Retrieval
emoji: 🧬
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false
license: mit
short_description: Functional protein mining with statistical guarantees
---
# Conformal Protein Retrieval
Code and notebooks from [Functional protein mining with conformal guarantees](https://www.nature.com/articles/s41467-024-55676-y) (Nature Communications, 2025). This package provides statistically rigorous methods for protein database search with false discovery rate (FDR) and false negative rate (FNR) control.
**[→ GETTING STARTED](GETTING_STARTED.md)** - Quick setup guide (10 minutes)
## Quick Setup
```bash
# 1. Clone and install
git clone https://github.com/ronboger/conformal-protein-retrieval.git
cd conformal-protein-retrieval
pip install -e .
# 2. Download data from Zenodo (about 4 GB total)
# https://zenodo.org/records/14272215
#    → lookup_embeddings.npy (1.1 GB) → data/
#    → lookup_embeddings_meta_data.tsv (535 MB) → data/
#    → pfam_new_proteins.npy (2.4 GB) → data/
# 3. Verify setup
cpr verify --check syn30
# Expected: 59/149 = 39.6% hits at FDR α=0.1
```
See **[GETTING_STARTED.md](GETTING_STARTED.md)** for detailed instructions.
## Repository Structure
```
conformal-protein-retrieval/
├── protein_conformal/      # Core library (FDR/FNR control, Venn-Abers)
├── notebooks/              # Analysis notebooks organized by experiment
│   ├── pfam/               # Pfam domain annotation (Figure 2)
│   ├── scope/              # SCOPe structural classification
│   ├── ec/                 # EC number classification
│   └── clean_selection/    # CLEAN enzyme experiments (Tables 1-2)
├── scripts/                # CLI scripts and SLURM jobs
├── data/                   # Data files (see GETTING_STARTED.md)
├── results/                # Pre-computed thresholds and outputs
└── docs/                   # Additional documentation
```
## Quick Start
The `cpr` CLI provides five main commands for functional protein mining:
### 1. Embed protein sequences
```bash
# Embed with Protein-Vec (for general protein search)
cpr embed --input sequences.fasta --output embeddings.npy --model protein-vec
# Embed with CLEAN (for enzyme classification)
cpr embed --input sequences.fasta --output embeddings.npy --model clean
```
### 2. Search for similar proteins with conformal guarantees
The `cpr search` command accepts **both FASTA files and pre-computed embeddings**:
```bash
# From FASTA file (auto-embeds with Protein-Vec)
cpr search --input sequences.fasta --output results.csv --fdr 0.1
# From pre-computed embeddings
cpr search --input embeddings.npy --output results.csv --fdr 0.1
# With FNR control instead of FDR
cpr search --input sequences.fasta --output results.csv --fnr 0.1
# With explicit threshold
cpr search --input sequences.fasta --output results.csv --threshold 0.99998
# Exploratory mode (no filtering, return all k neighbors)
cpr search --input sequences.fasta --output results.csv --no-filter
```
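For intuition, threshold filtering keeps only those nearest neighbors whose cosine similarity to the query is at least λ. The following is a minimal pure-Python sketch of that idea on toy 3-dimensional vectors. It is not the code `cpr search` runs: the function names `cosine` and `search`, the toy vectors, and the threshold values are all hypothetical and chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def search(queries, database, lam, k=2):
    """Return (query_idx, match_idx, similarity) for the top-k neighbors
    of each query whose similarity is at least the threshold lam."""
    hits = []
    for qi, q in enumerate(queries):
        scored = sorted(
            ((cosine(q, d), di) for di, d in enumerate(database)),
            reverse=True,
        )[:k]
        hits.extend((qi, di, s) for s, di in scored if s >= lam)
    return hits


# Toy data (hypothetical): real embeddings have many more dimensions.
database = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
queries = [[1.0, 0.0, 0.0]]

loose_hits = search(queries, database, lam=0.99)
strict_hits = search(queries, database, lam=0.999)
print(loose_hits)
print(strict_hits)
```

Raising the threshold can only shrink the set of returned matches, which is why a stricter FDR level corresponds to a higher λ.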
### 3. Convert similarity scores to calibrated probabilities
```bash
# Add Venn-Abers calibrated probabilities to search results
cpr prob \
--input results.csv \
--calibration data/pfam_new_proteins.npy \
--output results_with_probs.csv \
--n-calib 1000
```
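To illustrate what a Venn-Abers probability interval is, here is a minimal pure-Python sketch of inductive Venn-Abers prediction: an isotonic (non-decreasing) fit is computed twice, once with the query hypothetically labeled 0 and once labeled 1, giving the pair `(p0, p1)`. This is an assumption-laden illustration, not the implementation behind `cpr prob`. The helpers `isotonic_fit` and `venn_abers` are hypothetical names, the calibration data is a toy example, and the sketch assumes all similarity scores are distinct.

```python
def isotonic_fit(xs, ys):
    """Pool-adjacent-violators: least-squares non-decreasing fit of ys
    ordered by xs. Returns fitted values in the original input order.
    Assumes the xs are distinct (ties are not pooled)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    blocks = []  # each block is [sum_of_labels, count]
    for i in order:
        blocks.append([ys[i], 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = [0.0] * len(xs)
    pos = 0
    for s, n in blocks:
        for _ in range(n):
            fitted[order[pos]] = s / n
            pos += 1
    return fitted


def venn_abers(cal_sims, cal_labels, query_sim):
    """Return (p0, p1): fitted value at the query when it is
    hypothetically labeled 0 and 1, respectively."""
    result = []
    for hypothetical_label in (0, 1):
        fitted = isotonic_fit(cal_sims + [query_sim],
                              cal_labels + [hypothetical_label])
        result.append(fitted[-1])  # the query was appended last
    return tuple(result)


# Toy calibration set (hypothetical values).
cal_sims = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
cal_labels = [0, 0, 1, 0, 1, 1, 0, 1]

p0, p1 = venn_abers(cal_sims, cal_labels, query_sim=0.5)
print(p0, p1, abs(p1 - p0))
```

A narrow interval means the calibration data pins down the probability well near the query's score; a wide one signals that few calibration points inform that region.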
### 4. Calibrate FDR/FNR thresholds for a new embedding model
```bash
# Compute thresholds from your own calibration data
cpr calibrate \
--calibration my_calibration_data.npy \
--output thresholds.csv \
--alpha 0.1 \
--n-trials 100 \
--n-calib 1000
```
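As a rough picture of what threshold calibration does, the sketch below scans candidate thresholds on labeled calibration pairs and returns the smallest one whose empirical FDR is at most α. It is a simplified illustration only: it omits the finite-sample correction and repeated trials that a conformal guarantee requires, and it is not the algorithm behind `cpr calibrate`. The names `empirical_fdr` and `pick_threshold` and the toy scores are hypothetical.

```python
def empirical_fdr(sims, labels, lam):
    """Fraction of retrieved pairs (similarity >= lam) that are false matches."""
    retrieved = [y for s, y in zip(sims, labels) if s >= lam]
    if not retrieved:
        return 0.0
    return sum(1 for y in retrieved if y == 0) / len(retrieved)


def pick_threshold(sims, labels, alpha):
    """Smallest observed similarity whose empirical FDR is <= alpha,
    or None if no candidate qualifies."""
    for lam in sorted(set(sims)):
        if empirical_fdr(sims, labels, lam) <= alpha:
            return lam
    return None


# Toy calibration pairs (hypothetical): label 1 = true match, 0 = false match.
sims = [0.99, 0.98, 0.97, 0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.50]
labels = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0]

lam_hat = pick_threshold(sims, labels, alpha=0.2)
print(lam_hat, empirical_fdr(sims, labels, lam_hat))
```

Choosing the smallest qualifying threshold retrieves as many matches as possible while keeping the empirical error rate under the target.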
### 5. Verify paper results
```bash
# Reproduce key results from the paper
cpr verify --check syn30 # JCVI Syn3.0 annotation (39.6% at FDR α=0.1)
cpr verify --check fdr # FDR threshold calibration
cpr verify --check dali # DALI prefiltering (82.8% TPR, 31.5% DB reduction)
cpr verify --check clean # CLEAN enzyme classification
```
## Data Files
### Required Data ([Zenodo #14272215](https://zenodo.org/records/14272215))
```bash
cd data/
wget "https://zenodo.org/records/14272215/files/lookup_embeddings.npy?download=1" -O lookup_embeddings.npy
wget "https://zenodo.org/records/14272215/files/lookup_embeddings_meta_data.tsv?download=1" -O lookup_embeddings_meta_data.tsv
wget "https://zenodo.org/records/14272215/files/pfam_new_proteins.npy?download=1" -O pfam_new_proteins.npy
```
### Model Weights ([Zenodo #18478696](https://zenodo.org/records/18478696)) - for embedding new sequences
```bash
wget "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1" -O protein_vec_models.gz
tar -xzf protein_vec_models.gz
```
## Protein-Vec vs CLEAN Models
### Protein-Vec (general protein search)
- Trained on UniProt with multi-task objectives (Pfam, EC, GO, transmembrane, etc.)
- Best for: broad functional annotation, domain identification, general homology search
- Output: 128-dimensional embeddings
- FDR threshold at α=0.1: λ ≈ 0.9999802
### CLEAN (enzyme classification)
- Trained specifically for EC number classification
- Best for: enzyme function prediction, detailed catalytic annotation
- Output: 128-dimensional embeddings
- Requires ESM embeddings as input (computed automatically)
- See `ec/` directory for CLEAN-specific notebooks
## Creating Custom Calibration Datasets
To calibrate FDR/FNR thresholds for your own protein search tasks:
1. Create a calibration dataset with ground-truth labels (see `data/create_pfam_data.ipynb`)
2. Embed sequences using your chosen model (`cpr embed`)
3. Compute similarity scores and labels (save as .npy with shape `(n_samples, 3)`: `[sim, label_exact, label_partial]`)
4. Run calibration: `cpr calibrate --calibration my_data.npy --output thresholds.csv --alpha 0.1`
**Important:** Ensure your calibration dataset does not overlap with the training data of your embedding model, to avoid data leakage.
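Step 3 above can be sketched with NumPy as follows. This is a minimal example assuming the `(n_samples, 3)` layout described in that step. The similarity values and labels are made up, the variable names are hypothetical, and an in-memory buffer stands in for a file such as `my_data.npy`.

```python
import io

import numpy as np

# Hypothetical toy values; in practice these come from your own
# embeddings and ground-truth annotations.
rng = np.random.default_rng(0)
n_samples = 5
sims = rng.uniform(0.999, 1.0, size=n_samples)          # similarity scores
label_exact = np.array([1, 0, 1, 0, 0], dtype=float)    # 1 = exact match
label_partial = np.array([1, 1, 1, 0, 0], dtype=float)  # 1 = partial match

# One row per (query, match) pair: [sim, label_exact, label_partial].
calib = np.column_stack([sims, label_exact, label_partial])

# Save and reload in .npy format. A BytesIO buffer is used so the sketch
# leaves no files behind; np.save also accepts a path such as "my_data.npy".
buffer = io.BytesIO()
np.save(buffer, calib)
buffer.seek(0)
loaded = np.load(buffer)
print(loaded.shape)
```

In this toy data every exact match is also marked as a partial match, which is an assumption about how the two label columns relate.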
## Complete Workflow Example
Here's a full example searching viral domains against the Pfam database with FDR control:
```bash
# Option A: One-step search from FASTA (embeds automatically)
cpr search --input viral_domains.fasta --output viral_hits.csv --fdr 0.1
# Option B: Two-step with explicit embedding
cpr embed --input viral_domains.fasta --output viral_embeddings.npy
cpr search --input viral_embeddings.npy --output viral_hits.csv --fdr 0.1
```
The output CSV will contain:
- `query_idx`: Query sequence index
- `match_idx`: Database match index
- `similarity`: Cosine similarity score
- `match_*`: Metadata columns from database (UniProt ID, Pfam domains, etc.)
- `probability`: Calibrated probability of functional match
- `uncertainty`: Width of the Venn-Abers probability interval (|p1 - p0|)
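One way to post-process such a file is sketched below using only the Python standard library. The column names follow the list above, but the rows are invented sample data and the cutoffs (probability of at least 0.9, uncertainty of at most 0.05) are arbitrary assumptions, not recommendations from the paper.

```python
import csv
import io

# Invented sample rows using the documented column names; a real run
# would open the output file (for example viral_hits.csv) instead.
SAMPLE = """query_idx,match_idx,similarity,probability,uncertainty
0,12,0.99999,0.97,0.01
0,48,0.99998,0.81,0.12
1,7,0.99999,0.93,0.03
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Keep hits that are both probable and tightly calibrated.
# The two cutoffs are hypothetical and should be tuned per task.
confident = [
    r for r in rows
    if float(r["probability"]) >= 0.9 and float(r["uncertainty"]) <= 0.05
]

for r in confident:
    print(r["query_idx"], r["match_idx"], r["probability"])
```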
## Advanced Usage
### Using Legacy Scripts
For advanced use cases, the original Python scripts are still available in `scripts/`:
```bash
# Legacy search script with more options
python scripts/search.py \
--fdr \
--fdr_lambda 0.99998 \
--output results.csv \
--query_embedding query.npy \
--query_fasta query.fasta \
--lookup_embedding data/lookup_embeddings.npy \
--lookup_fasta data/lookup_embeddings_meta_data.tsv \
--k 1000
# Precompute similarity-to-probability lookup table
python scripts/precompute_SVA_probs.py \
--cal_data data/pfam_new_proteins.npy \
--output data/pfam_sims_to_probs.csv \
--partial \
--n_bins 1000 \
--n_calib 1000
# Apply precomputed probabilities (faster than on-the-fly computation)
python scripts/get_probs.py \
--precomputed \
--precomputed_path data/pfam_sims_to_probs.csv \
--input results.csv \
--output results_with_probs.csv \
--partial
```
## Key Paper Results
This repository reproduces the following results from the paper:
| Claim | Paper | CLI Command | Status |
|-------|-------|-------------|--------|
| JCVI Syn3.0 annotation (Fig 2A) | 39.6% (59/149) at FDR α=0.1 | `cpr verify --check syn30` | ✓ Exact |
| FDR threshold | λ = 0.9999802250 at α=0.1 | `cpr verify --check fdr` | ✓ (~0.002% diff) |
| DALI prefiltering TPR (Table 4-6) | 82.8% | `cpr verify --check dali` | ✓ (~1% diff) |
| DALI database reduction | 31.5% | `cpr verify --check dali` | ✓ Exact |
| CLEAN enzyme loss (Table 1-2) | ≤ α=1.0 | `cpr verify --check clean` | ✓ (0.97) |
## Repository Structure
- `protein_conformal/` - Core utilities for conformal prediction and search
- `scripts/` - Verification scripts and legacy search tools
- `scope/` - SCOPe structural classification experiments
- `pfam/` - Pfam domain annotation notebooks
- `ec/` - EC number classification with CLEAN model
- `data/` - Data processing notebooks and scripts
- `clean_selection/` - CLEAN enzyme selection pipeline
- `tests/` - Test suite (run with `pytest tests/ -v`)
## Contributing & Feature Requests
If you'd like expanded support for specific models or search tasks, please open an issue describing:
1. The embedding model you'd like to use
2. The search/annotation task you're working on
3. Any specific conformal guarantees you need (FDR, FNR, coverage, etc.)
We welcome contributions and look forward to hearing from you!
## Citation
If you use this code or method in your work, please cite:
```bibtex
@article{boger2025functional,
title={Functional protein mining with conformal guarantees},
author={Boger, Ron S and Chithrananda, Seyone and Angelopoulos, Anastasios N and Yoon, Peter H and Jordan, Michael I and Doudna, Jennifer A},
journal={Nature Communications},
volume={16},
number={1},
pages={85},
year={2025},
publisher={Nature Publishing Group},
doi={10.1038/s41467-024-55676-y}
}
```
## License
See LICENSE file for details.