| title: Conformal Protein Retrieval | |
| emoji: 🧬 | |
| colorFrom: purple | |
| colorTo: blue | |
| sdk: gradio | |
| sdk_version: 5.9.1 | |
| app_file: app.py | |
| pinned: false | |
| license: mit | |
| short_description: Functional protein mining with statistical guarantees | |
| # Conformal Protein Retrieval | |
| Code and notebooks from [Functional protein mining with conformal guarantees](https://www.nature.com/articles/s41467-024-55676-y) (Nature Communications, 2025). This package provides statistically rigorous methods for protein database search with false discovery rate (FDR) and false negative rate (FNR) control. | |
| **[GETTING STARTED](GETTING_STARTED.md)** - Quick setup guide (10 minutes) | |
| ## Quick Setup | |
| ```bash | |
| # 1. Clone and install | |
| git clone https://github.com/ronboger/conformal-protein-retrieval.git | |
| cd conformal-protein-retrieval | |
| pip install -e . | |
| # 2. Download data from Zenodo (4 GB total) | |
| # https://zenodo.org/records/14272215 | |
| # - lookup_embeddings.npy (1.1 GB) -> data/ | |
| # - lookup_embeddings_meta_data.tsv (535 MB) -> data/ | |
| # - pfam_new_proteins.npy (2.4 GB) -> data/ | |
| # 3. Verify setup | |
| cpr verify --check syn30 | |
| # Expected: 59/149 = 39.6% hits at FDR α=0.1 | |
| ``` | |
| See **[GETTING_STARTED.md](GETTING_STARTED.md)** for detailed instructions. | |
| ## Repository Structure | |
| ``` | |
| conformal-protein-retrieval/ | |
| ├── protein_conformal/ # Core library (FDR/FNR control, Venn-Abers) | |
| ├── notebooks/ # Analysis notebooks organized by experiment | |
| │   ├── pfam/ # Pfam domain annotation (Figure 2) | |
| │   ├── scope/ # SCOPe structural classification | |
| │   ├── ec/ # EC number classification | |
| │   └── clean_selection/ # CLEAN enzyme experiments (Tables 1-2) | |
| ├── scripts/ # CLI scripts and SLURM jobs | |
| ├── data/ # Data files (see GETTING_STARTED.md) | |
| ├── results/ # Pre-computed thresholds and outputs | |
| └── docs/ # Additional documentation | |
| ``` | |
| ## Quick Start | |
| The `cpr` CLI provides five main commands for functional protein mining: | |
| ### 1. Embed protein sequences | |
| ```bash | |
| # Embed with Protein-Vec (for general protein search) | |
| cpr embed --input sequences.fasta --output embeddings.npy --model protein-vec | |
| # Embed with CLEAN (for enzyme classification) | |
| cpr embed --input sequences.fasta --output embeddings.npy --model clean | |
| ``` | |
| ### 2. Search for similar proteins with conformal guarantees | |
| The `cpr search` command accepts **both FASTA files and pre-computed embeddings**: | |
| ```bash | |
| # From FASTA file (auto-embeds with Protein-Vec) | |
| cpr search --input sequences.fasta --output results.csv --fdr 0.1 | |
| # From pre-computed embeddings | |
| cpr search --input embeddings.npy --output results.csv --fdr 0.1 | |
| # With FNR control instead of FDR | |
| cpr search --input sequences.fasta --output results.csv --fnr 0.1 | |
| # With explicit threshold | |
| cpr search --input sequences.fasta --output results.csv --threshold 0.99998 | |
| # Exploratory mode (no filtering, return all k neighbors) | |
| cpr search --input sequences.fasta --output results.csv --no-filter | |
| ``` | |
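Conceptually, filtering with an explicit threshold keeps only the neighbors whose cosine similarity meets a cutoff. The sketch below illustrates that idea with NumPy on toy vectors. It is not the package's implementation: the helper name `threshold_search`, its parameters, and the toy data are assumptions made for illustration.

```python
import numpy as np

def threshold_search(queries, lookup, threshold, k=5):
    """Return (query_idx, match_idx, similarity) for neighbors at or above threshold.

    Hypothetical helper for illustration; not part of the cpr package.
    """
    # L2-normalize rows so the dot product equals cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    db = lookup / np.linalg.norm(lookup, axis=1, keepdims=True)
    sims = q @ db.T  # shape: (n_queries, n_lookup)

    hits = []
    for qi, row in enumerate(sims):
        # Indices of the k most similar lookup entries, best first.
        top = np.argsort(row)[::-1][:k]
        for mi in top:
            if row[mi] >= threshold:
                hits.append((qi, int(mi), float(row[mi])))
    return hits

# Toy data: 3 lookup vectors and 2 queries in 4 dimensions.
lookup = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0, 0.0]])
queries = np.array([[2.0, 0.0, 0.0, 0.0],   # same direction as lookup[0]
                    [0.0, 0.0, 1.0, 0.0]])  # orthogonal to every lookup vector

hits = threshold_search(queries, lookup, threshold=0.99)
print(hits)
```

Only the first query has a neighbor above the cutoff here; the second query returns nothing, which is the expected behavior when no database entry is similar enough.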
| ### 3. Convert similarity scores to calibrated probabilities | |
| ```bash | |
| # Add Venn-Abers calibrated probabilities to search results | |
| cpr prob \ | |
| --input results.csv \ | |
| --calibration data/pfam_new_proteins.npy \ | |
| --output results_with_probs.csv \ | |
| --n-calib 1000 | |
| ``` | |
| ### 4. Calibrate FDR/FNR thresholds for a new embedding model | |
| ```bash | |
| # Compute thresholds from your own calibration data | |
| cpr calibrate \ | |
| --calibration my_calibration_data.npy \ | |
| --output thresholds.csv \ | |
| --alpha 0.1 \ | |
| --n-trials 100 \ | |
| --n-calib 1000 | |
| ``` | |
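As rough intuition for what threshold calibration does, the sketch below scans candidate thresholds on labeled calibration scores, from highest to lowest, and returns the lowest one for which the empirical FDR stays at or below α at that threshold and at every higher one. This is a simplified illustration, not the procedure `cpr calibrate` runs: it omits any finite-sample conformal correction and the averaging over random trials, and `empirical_fdr_threshold` is a hypothetical name.

```python
import numpy as np

def empirical_fdr_threshold(sims, labels, alpha):
    """Lowest threshold whose empirical FDR is <= alpha, scanning from the top.

    Hypothetical helper; a simplified stand-in, not the package's algorithm.
    Returns None if even the highest score violates the bound.
    """
    sims = np.asarray(sims, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best = None
    # Candidate thresholds are the observed scores, scanned from high to low.
    for lam in np.sort(np.unique(sims))[::-1]:
        retrieved = sims >= lam
        # FDR = false retrievals / all retrievals at this threshold.
        fdr = np.mean(~labels[retrieved])
        if fdr > alpha:
            break  # stop at the first violation
        best = float(lam)
    return best

# Toy calibration set (made-up values): similarity scores and match labels.
cal_sims = [0.99, 0.95, 0.90, 0.80, 0.70, 0.60]
cal_labels = [1, 1, 1, 0, 1, 0]

lam_loose = empirical_fdr_threshold(cal_sims, cal_labels, alpha=0.25)
lam_strict = empirical_fdr_threshold(cal_sims, cal_labels, alpha=0.1)
print(lam_loose, lam_strict)
```

A smaller α yields a higher (stricter) threshold, which mirrors the trade-off between the number of hits returned and the share of false discoveries tolerated.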
| ### 5. Verify paper results | |
| ```bash | |
| # Reproduce key results from the paper | |
| cpr verify --check syn30 # JCVI Syn3.0 annotation (39.6% at FDR α=0.1) | |
| cpr verify --check fdr # FDR threshold calibration | |
| cpr verify --check dali # DALI prefiltering (82.8% TPR, 31.5% DB reduction) | |
| cpr verify --check clean # CLEAN enzyme classification | |
| ``` | |
| ## Data Files | |
| ### Required Data ([Zenodo #14272215](https://zenodo.org/records/14272215)) | |
| ```bash | |
| cd data/ | |
| wget "https://zenodo.org/records/14272215/files/lookup_embeddings.npy?download=1" -O lookup_embeddings.npy | |
| wget "https://zenodo.org/records/14272215/files/lookup_embeddings_meta_data.tsv?download=1" -O lookup_embeddings_meta_data.tsv | |
| wget "https://zenodo.org/records/14272215/files/pfam_new_proteins.npy?download=1" -O pfam_new_proteins.npy | |
| ``` | |
| ### Model Weights ([Zenodo #18478696](https://zenodo.org/records/18478696)) - for embedding new sequences | |
| ```bash | |
| wget "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1" -O protein_vec_models.gz | |
| tar -xzf protein_vec_models.gz | |
| ``` | |
| ## Protein-Vec vs CLEAN Models | |
| ### Protein-Vec (general protein search) | |
| - Trained on UniProt with multi-task objectives (Pfam, EC, GO, transmembrane, etc.) | |
| - Best for: broad functional annotation, domain identification, general homology search | |
| - Output: 128-dimensional embeddings | |
| - FDR threshold at α=0.1: λ ≈ 0.9999802 | |
| ### CLEAN (enzyme classification) | |
| - Trained specifically for EC number classification | |
| - Best for: enzyme function prediction, detailed catalytic annotation | |
| - Output: 128-dimensional embeddings | |
| - Requires ESM embeddings as input (computed automatically) | |
| - See `ec/` directory for CLEAN-specific notebooks | |
| ## Creating Custom Calibration Datasets | |
| To calibrate FDR/FNR thresholds for your own protein search tasks: | |
| 1. Create a calibration dataset with ground-truth labels (see `data/create_pfam_data.ipynb`) | |
| 2. Embed sequences using your chosen model (`cpr embed`) | |
| 3. Compute similarity scores and labels (save as .npy with shape `(n_samples, 3)`: `[sim, label_exact, label_partial]`) | |
| 4. Run calibration: `cpr calibrate --calibration my_data.npy --output thresholds.csv --alpha 0.1` | |
| **Important:** Ensure your calibration dataset is outside the training data of your embedding model to avoid data leakage. | |
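The array layout described in step 3 can be assembled with NumPy as sketched below. The values are made up, the temporary directory stands in for wherever you keep your data, and the reading of the two label columns given in the comments is an assumption; check it against `data/create_pfam_data.ipynb` before relying on it.

```python
import os
import tempfile

import numpy as np

# Toy similarity scores and ground-truth labels (made-up values for illustration).
sims = np.array([0.99999, 0.99991, 0.99950, 0.99800])
label_exact = np.array([1, 1, 0, 0])    # assumed: 1 if the hit matches the label exactly
label_partial = np.array([1, 1, 1, 0])  # assumed: 1 if the hit matches at least partially

# Stack the columns into shape (n_samples, 3): [sim, label_exact, label_partial].
calib = np.column_stack([sims, label_exact, label_partial])

# Save and reload to confirm the file round-trips with the expected shape.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my_calibration_data.npy")
    np.save(path, calib)
    loaded = np.load(path)

print(loaded.shape, loaded.dtype)
```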
| ## Complete Workflow Example | |
| Here's a full example searching viral domains against the Pfam database with FDR control: | |
| ```bash | |
| # Option A: One-step search from FASTA (embeds automatically) | |
| cpr search --input viral_domains.fasta --output viral_hits.csv --fdr 0.1 | |
| # Option B: Two-step with explicit embedding | |
| cpr embed --input viral_domains.fasta --output viral_embeddings.npy | |
| cpr search --input viral_embeddings.npy --output viral_hits.csv --fdr 0.1 | |
| ``` | |
| The output CSV will contain: | |
| - `query_idx`: Query sequence index | |
| - `match_idx`: Database match index | |
| - `similarity`: Cosine similarity score | |
| - `match_*`: Metadata columns from database (UniProt ID, Pfam domains, etc.) | |
| - `probability`: Calibrated probability of functional match | |
| - `uncertainty`: Width of the Venn-Abers probability interval (|p1 - p0|) | |
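Once a results file contains these columns, it can be post-filtered with the standard library. The sketch below reads an in-memory stand-in for such a file and keeps rows with a high calibrated probability and a narrow interval. The rows are made up, and the helper name `confident_hits` and the cutoffs `0.9` and `0.05` are arbitrary illustrative choices, not recommendations from the paper.

```python
import csv
import io

# Stand-in for the contents of a results file; the rows are made up.
csv_text = """query_idx,match_idx,similarity,probability,uncertainty
0,17,0.99999,0.97,0.01
0,42,0.99985,0.61,0.12
1,5,0.99998,0.93,0.02
"""

def confident_hits(handle, min_prob=0.9, max_uncertainty=0.05):
    """Yield rows whose calibrated probability is high and whose interval is narrow."""
    for row in csv.DictReader(handle):
        if (float(row["probability"]) >= min_prob
                and float(row["uncertainty"]) <= max_uncertainty):
            yield row

hits = list(confident_hits(io.StringIO(csv_text)))
for row in hits:
    print(row["query_idx"], row["match_idx"], row["probability"])
```

To run this on a real file, replace `io.StringIO(csv_text)` with `open("results_with_probs.csv", newline="")`.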
| ## Advanced Usage | |
| ### Using Legacy Scripts | |
| For advanced use cases, the original Python scripts are still available in `scripts/`: | |
| ```bash | |
| # Legacy search script with more options | |
| python scripts/search.py \ | |
| --fdr \ | |
| --fdr_lambda 0.99998 \ | |
| --output results.csv \ | |
| --query_embedding query.npy \ | |
| --query_fasta query.fasta \ | |
| --lookup_embedding data/lookup_embeddings.npy \ | |
| --lookup_fasta data/lookup_embeddings_meta_data.tsv \ | |
| --k 1000 | |
| # Precompute similarity-to-probability lookup table | |
| python scripts/precompute_SVA_probs.py \ | |
| --cal_data data/pfam_new_proteins.npy \ | |
| --output data/pfam_sims_to_probs.csv \ | |
| --partial \ | |
| --n_bins 1000 \ | |
| --n_calib 1000 | |
| # Apply precomputed probabilities (faster than on-the-fly computation) | |
| python scripts/get_probs.py \ | |
| --precomputed \ | |
| --precomputed_path data/pfam_sims_to_probs.csv \ | |
| --input results.csv \ | |
| --output results_with_probs.csv \ | |
| --partial | |
| ``` | |
| ## Key Paper Results | |
| This repository reproduces the following results from the paper: | |
| | Claim | Paper | CLI Command | Status | | |
| |-------|-------|-------------|--------| | |
| | JCVI Syn3.0 annotation (Fig 2A) | 39.6% (59/149) at FDR α=0.1 | `cpr verify --check syn30` | ✓ Exact | | |
| | FDR threshold | λ = 0.9999802250 at α=0.1 | `cpr verify --check fdr` | ✓ (~0.002% diff) | | |
| | DALI prefiltering TPR (Table 4-6) | 82.8% | `cpr verify --check dali` | ✓ (~1% diff) | | |
| | DALI database reduction | 31.5% | `cpr verify --check dali` | ✓ Exact | | |
| | CLEAN enzyme loss (Table 1-2) | ≤ α=1.0 | `cpr verify --check clean` | ✓ (0.97) | | |
| ## Repository Structure | |
| - `protein_conformal/` - Core utilities for conformal prediction and search | |
| - `scripts/` - Verification scripts and legacy search tools | |
| - `scope/` - SCOPe structural classification experiments | |
| - `pfam/` - Pfam domain annotation notebooks | |
| - `ec/` - EC number classification with CLEAN model | |
| - `data/` - Data processing notebooks and scripts | |
| - `clean_selection/` - CLEAN enzyme selection pipeline | |
| - `tests/` - Test suite (run with `pytest tests/ -v`) | |
| ## Contributing & Feature Requests | |
| If you'd like expanded support for specific models or search tasks, please open an issue describing: | |
| 1. The embedding model you'd like to use | |
| 2. The search/annotation task you're working on | |
| 3. Any specific conformal guarantees you need (FDR, FNR, coverage, etc.) | |
| We welcome contributions and look forward to hearing from you! | |
| ## Citation | |
| If you use this code or method in your work, please cite: | |
| ```bibtex | |
| @article{boger2025functional, | |
| title={Functional protein mining with conformal guarantees}, | |
| author={Boger, Ron S and Chithrananda, Seyone and Angelopoulos, Anastasios N and Yoon, Peter H and Jordan, Michael I and Doudna, Jennifer A}, | |
| journal={Nature Communications}, | |
| volume={16}, | |
| number={1}, | |
| pages={85}, | |
| year={2025}, | |
| publisher={Nature Publishing Group}, | |
| doi={10.1038/s41467-024-55676-y} | |
| } | |
| ``` | |
| ## License | |
| Released under the MIT license. See the LICENSE file for details. | |