--- title: Conformal Protein Retrieval emoji: 🧬 colorFrom: purple colorTo: blue sdk: gradio sdk_version: 5.9.1 app_file: app.py pinned: false license: mit short_description: Functional protein mining with statistical guarantees --- # Conformal Protein Retrieval Code and notebooks from [Functional protein mining with conformal guarantees](https://www.nature.com/articles/s41467-024-55676-y) (Nature Communications, 2025). This package provides statistically rigorous methods for protein database search with false discovery rate (FDR) and false negative rate (FNR) control. **[→ GETTING STARTED](GETTING_STARTED.md)** - Quick setup guide (10 minutes) ## Quick Setup ```bash # 1. Clone and install git clone https://github.com/ronboger/conformal-protein-retrieval.git cd conformal-protein-retrieval pip install -e . # 2. Download data from Zenodo (4GB total) # https://zenodo.org/records/14272215 # → lookup_embeddings.npy (1.1 GB) → data/ # → lookup_embeddings_meta_data.tsv (535 MB) → data/ # → pfam_new_proteins.npy (2.4 GB) → data/ # 3. Verify setup cpr verify --check syn30 # Expected: 59/149 = 39.6% hits at FDR α=0.1 ``` See **[GETTING_STARTED.md](GETTING_STARTED.md)** for detailed instructions. ## Repository Structure ``` conformal-protein-retrieval/ ├── protein_conformal/ # Core library (FDR/FNR control, Venn-Abers) ├── notebooks/ # Analysis notebooks organized by experiment │ ├── pfam/ # Pfam domain annotation (Figure 2) │ ├── scope/ # SCOPe structural classification │ ├── ec/ # EC number classification │ └── clean_selection/ # CLEAN enzyme experiments (Tables 1-2) ├── scripts/ # CLI scripts and SLURM jobs ├── data/ # Data files (see GETTING_STARTED.md) ├── results/ # Pre-computed thresholds and outputs └── docs/ # Additional documentation ``` ## Quick Start The `cpr` CLI provides five main commands for functional protein mining: ### 1. Embed protein sequences ```bash # Embed with Protein-Vec (for general protein search) cpr embed --input sequences.fasta --output embeddings.npy --model protein-vec # Embed with CLEAN (for enzyme classification) cpr embed --input sequences.fasta --output embeddings.npy --model clean ``` ### 2. Search for similar proteins with conformal guarantees The `cpr search` command accepts **both FASTA files and pre-computed embeddings**: ```bash # From FASTA file (auto-embeds with Protein-Vec) cpr search --input sequences.fasta --output results.csv --fdr 0.1 # From pre-computed embeddings cpr search --input embeddings.npy --output results.csv --fdr 0.1 # With FNR control instead of FDR cpr search --input sequences.fasta --output results.csv --fnr 0.1 # With explicit threshold cpr search --input sequences.fasta --output results.csv --threshold 0.99998 # Exploratory mode (no filtering, return all k neighbors) cpr search --input sequences.fasta --output results.csv --no-filter ``` ### 3. Convert similarity scores to calibrated probabilities ```bash # Add Venn-Abers calibrated probabilities to search results cpr prob \ --input results.csv \ --calibration data/pfam_new_proteins.npy \ --output results_with_probs.csv \ --n-calib 1000 ``` ### 4. Calibrate FDR/FNR thresholds for a new embedding model ```bash # Compute thresholds from your own calibration data cpr calibrate \ --calibration my_calibration_data.npy \ --output thresholds.csv \ --alpha 0.1 \ --n-trials 100 \ --n-calib 1000 ``` ### 5. Verify paper results ```bash # Reproduce key results from the paper cpr verify --check syn30 # JCVI Syn3.0 annotation (39.6% at FDR α=0.1) cpr verify --check fdr # FDR threshold calibration cpr verify --check dali # DALI prefiltering (82.8% TPR, 31.5% DB reduction) cpr verify --check clean # CLEAN enzyme classification ``` ## Data Files ### Required Data ([Zenodo #14272215](https://zenodo.org/records/14272215)) ```bash cd data/ wget "https://zenodo.org/records/14272215/files/lookup_embeddings.npy?download=1" -O lookup_embeddings.npy wget "https://zenodo.org/records/14272215/files/lookup_embeddings_meta_data.tsv?download=1" -O lookup_embeddings_meta_data.tsv wget "https://zenodo.org/records/14272215/files/pfam_new_proteins.npy?download=1" -O pfam_new_proteins.npy ``` ### Model Weights ([Zenodo #18478696](https://zenodo.org/records/18478696)) - for embedding new sequences ```bash wget "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1" -O protein_vec_models.gz tar -xzf protein_vec_models.gz ``` ## Protein-Vec vs CLEAN Models ### Protein-Vec (general protein search) - Trained on UniProt with multi-task objectives (Pfam, EC, GO, transmembrane, etc.) - Best for: broad functional annotation, domain identification, general homology search - Output: 128-dimensional embeddings - FDR threshold at α=0.1: λ ≈ 0.9999802 ### CLEAN (enzyme classification) - Trained specifically for EC number classification - Best for: enzyme function prediction, detailed catalytic annotation - Output: 128-dimensional embeddings - Requires ESM embeddings as input (computed automatically) - See `ec/` directory for CLEAN-specific notebooks ## Creating Custom Calibration Datasets To calibrate FDR/FNR thresholds for your own protein search tasks: 1. Create a calibration dataset with ground-truth labels (see `data/create_pfam_data.ipynb`) 2. Embed sequences using your chosen model (`cpr embed`) 3. Compute similarity scores and labels (save as .npy with shape `(n_samples, 3)`: `[sim, label_exact, label_partial]`) 4. Run calibration: `cpr calibrate --calibration my_data.npy --output thresholds.csv --alpha 0.1` **Important:** Ensure your calibration dataset is outside the training data of your embedding model to avoid data leakage. ## Complete Workflow Example Here's a full example searching viral domains against the Pfam database with FDR control: ```bash # Option A: One-step search from FASTA (embeds automatically) cpr search --input viral_domains.fasta --output viral_hits.csv --fdr 0.1 # Option B: Two-step with explicit embedding cpr embed --input viral_domains.fasta --output viral_embeddings.npy cpr search --input viral_embeddings.npy --output viral_hits.csv --fdr 0.1 ``` The output CSV will contain: - `query_idx`: Query sequence index - `match_idx`: Database match index - `similarity`: Cosine similarity score - `match_*`: Metadata columns from database (UniProt ID, Pfam domains, etc.) - `probability`: Calibrated probability of functional match - `uncertainty`: Venn-Abers uncertainty interval (|p1 - p0|) ## Advanced Usage ### Using Legacy Scripts For advanced use cases, the original Python scripts are still available in `scripts/`: ```bash # Legacy search script with more options python scripts/search.py \ --fdr \ --fdr_lambda 0.99998 \ --output results.csv \ --query_embedding query.npy \ --query_fasta query.fasta \ --lookup_embedding data/lookup_embeddings.npy \ --lookup_fasta data/lookup_embeddings_meta_data.tsv \ --k 1000 # Precompute similarity-to-probability lookup table python scripts/precompute_SVA_probs.py \ --cal_data data/pfam_new_proteins.npy \ --output data/pfam_sims_to_probs.csv \ --partial \ --n_bins 1000 \ --n_calib 1000 # Apply precomputed probabilities (faster than on-the-fly computation) python scripts/get_probs.py \ --precomputed \ --precomputed_path data/pfam_sims_to_probs.csv \ --input results.csv \ --output results_with_probs.csv \ --partial ``` ## Key Paper Results This repository reproduces the following results from the paper: | Claim | Paper | CLI Command | Status | |-------|-------|-------------|--------| | JCVI Syn3.0 annotation (Fig 2A) | 39.6% (59/149) at FDR α=0.1 | `cpr verify --check syn30` | ✓ Exact | | FDR threshold | λ = 0.9999802250 at α=0.1 | `cpr verify --check fdr` | ✓ (~0.002% diff) | | DALI prefiltering TPR (Table 4-6) | 82.8% | `cpr verify --check dali` | ✓ (~1% diff) | | DALI database reduction | 31.5% | `cpr verify --check dali` | ✓ Exact | | CLEAN enzyme loss (Table 1-2) | ≤ α=1.0 | `cpr verify --check clean` | ✓ (0.97) | ## Repository Structure - `protein_conformal/` - Core utilities for conformal prediction and search - `scripts/` - Verification scripts and legacy search tools - `scope/` - SCOPe structural classification experiments - `pfam/` - Pfam domain annotation notebooks - `ec/` - EC number classification with CLEAN model - `data/` - Data processing notebooks and scripts - `clean_selection/` - CLEAN enzyme selection pipeline - `tests/` - Test suite (run with `pytest tests/ -v`) ## Contributing & Feature Requests If you'd like expanded support for specific models or search tasks, please open an issue describing: 1. The embedding model you'd like to use 2. The search/annotation task you're working on 3. Any specific conformal guarantees you need (FDR, FNR, coverage, etc.) We welcome contributions and look forward to hearing from you! ## Citation If you use this code or method in your work, please cite: ```bibtex @article{boger2025functional, title={Functional protein mining with conformal guarantees}, author={Boger, Ron S and Chithrananda, Seyone and Angelopoulos, Anastasios N and Yoon, Peter H and Jordan, Michael I and Doudna, Jennifer A}, journal={Nature Communications}, volume={16}, number={1}, pages={85}, year={2025}, publisher={Nature Publishing Group}, doi={10.1038/s41467-024-55676-y} } ``` ## License See LICENSE file for details.