---
title: Conformal Protein Retrieval
emoji: 🧬
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false
license: mit
short_description: Functional protein mining with statistical guarantees
---
# Conformal Protein Retrieval
Code and notebooks from [Functional protein mining with conformal guarantees](https://www.nature.com/articles/s41467-024-55676-y) (Nature Communications, 2025). This package provides statistically rigorous methods for protein database search with false discovery rate (FDR) and false negative rate (FNR) control.
**[→ GETTING STARTED](GETTING_STARTED.md)** - Quick setup guide (10 minutes)
## Quick Setup
```bash
# 1. Clone and install
git clone https://github.com/ronboger/conformal-protein-retrieval.git
cd conformal-protein-retrieval
pip install -e .
# 2. Download data from Zenodo (4 GB total)
# https://zenodo.org/records/14272215
#   → lookup_embeddings.npy (1.1 GB) → data/
#   → lookup_embeddings_meta_data.tsv (535 MB) → data/
#   → pfam_new_proteins.npy (2.4 GB) → data/
# 3. Verify setup
cpr verify --check syn30
# Expected: 59/149 = 39.6% hits at FDR α=0.1
```
See **[GETTING_STARTED.md](GETTING_STARTED.md)** for detailed instructions.
## Repository Structure
```
conformal-protein-retrieval/
├── protein_conformal/      # Core library (FDR/FNR control, Venn-Abers)
├── notebooks/              # Analysis notebooks organized by experiment
│   ├── pfam/               # Pfam domain annotation (Figure 2)
│   ├── scope/              # SCOPe structural classification
│   ├── ec/                 # EC number classification
│   └── clean_selection/    # CLEAN enzyme experiments (Tables 1-2)
├── scripts/                # CLI scripts and SLURM jobs
├── data/                   # Data files (see GETTING_STARTED.md)
├── results/                # Pre-computed thresholds and outputs
└── docs/                   # Additional documentation
```
## Quick Start
The `cpr` CLI provides five main commands for functional protein mining:
### 1. Embed protein sequences
```bash
# Embed with Protein-Vec (for general protein search)
cpr embed --input sequences.fasta --output embeddings.npy --model protein-vec
# Embed with CLEAN (for enzyme classification)
cpr embed --input sequences.fasta --output embeddings.npy --model clean
```
### 2. Search for similar proteins with conformal guarantees
The `cpr search` command accepts **both FASTA files and pre-computed embeddings**:
```bash
# From FASTA file (auto-embeds with Protein-Vec)
cpr search --input sequences.fasta --output results.csv --fdr 0.1
# From pre-computed embeddings
cpr search --input embeddings.npy --output results.csv --fdr 0.1
# With FNR control instead of FDR
cpr search --input sequences.fasta --output results.csv --fnr 0.1
# With explicit threshold
cpr search --input sequences.fasta --output results.csv --threshold 0.99998
# Exploratory mode (no filtering, return all k neighbors)
cpr search --input sequences.fasta --output results.csv --no-filter
```
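Conceptually, filtering with a threshold keeps only the query/database pairs whose similarity is at least λ. The sketch below illustrates that idea in plain Python on toy vectors. The helper names `cosine_similarity` and `filter_hits` are hypothetical and are not part of the `cpr` API, and the 3-dimensional vectors and the 0.99 cutoff are made-up stand-ins for real embeddings and a calibrated threshold.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def filter_hits(queries, lookup, threshold):
    """Return (query_idx, match_idx, similarity) for pairs scoring >= threshold."""
    hits = []
    for query_idx, query in enumerate(queries):
        for match_idx, match in enumerate(lookup):
            sim = cosine_similarity(query, match)
            if sim >= threshold:
                hits.append((query_idx, match_idx, sim))
    return hits


# Toy embeddings (hypothetical values, not real protein embeddings)
queries = [[1.0, 0.0, 0.0]]
lookup = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]

hits = filter_hits(queries, lookup, threshold=0.99)
for query_idx, match_idx, sim in hits:
    print(query_idx, match_idx, round(sim, 4))
```

A real search would use a nearest-neighbor index rather than a double loop; the brute-force version is only meant to make the filtering rule explicit.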
### 3. Convert similarity scores to calibrated probabilities
```bash
# Add Venn-Abers calibrated probabilities to search results
cpr prob \
--input results.csv \
--calibration data/pfam_new_proteins.npy \
--output results_with_probs.csv \
--n-calib 1000
```
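For intuition, an inductive Venn-Abers predictor fits isotonic regression to the calibration set twice, once with the test score labeled 0 and once labeled 1, which yields a probability interval `[p0, p1]`. The following is a from-scratch sketch of that idea, not the package's implementation; `isotonic_fit`, `fitted_value`, and `venn_abers_interval` are hypothetical names, and the calibration scores are toy values.

```python
def isotonic_fit(scores, labels):
    """Pool-adjacent-violators fit of labels as a non-decreasing function of score.

    Returns blocks of [label_sum, count, highest_score_in_block].
    """
    # Pool identical scores first so tied scores share one fitted value.
    grouped = {}
    for s, y in zip(scores, labels):
        total, count = grouped.get(s, (0.0, 0))
        grouped[s] = (total + y, count + 1)

    blocks = []
    for s in sorted(grouped):
        total, count = grouped[s]
        blocks.append([total, count, s])
        # Merge backwards while the previous block's mean exceeds this one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    return blocks


def fitted_value(blocks, score):
    """Mean label of the block that contains `score`."""
    for total, count, top in blocks:
        if score <= top:
            return total / count
    return blocks[-1][0] / blocks[-1][1]


def venn_abers_interval(cal_scores, cal_labels, test_score):
    """Return (p0, p1): fitted probabilities with the test point labeled 0 and 1."""
    scores = list(cal_scores) + [test_score]
    p0 = fitted_value(isotonic_fit(scores, list(cal_labels) + [0]), test_score)
    p1 = fitted_value(isotonic_fit(scores, list(cal_labels) + [1]), test_score)
    return p0, p1


# Toy calibration data (hypothetical similarity scores and match labels)
cal_scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
cal_labels = [0, 0, 1, 0, 1, 1]

p0, p1 = venn_abers_interval(cal_scores, cal_labels, test_score=0.75)
print("interval:", (p0, p1), "width:", abs(p1 - p0))
```

With only six calibration points the interval is wide; it narrows as the calibration set grows, which is why the width is a useful uncertainty signal.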
### 4. Calibrate FDR/FNR thresholds for a new embedding model
```bash
# Compute thresholds from your own calibration data
cpr calibrate \
--calibration my_calibration_data.npy \
--output thresholds.csv \
--alpha 0.1 \
--n-trials 100 \
--n-calib 1000
```
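As a rough illustration of what threshold calibration computes, the sketch below picks thresholds from labeled calibration similarities using plain empirical error rates. It is a simplified stand-in: it omits the finite-sample corrections and repeated trials that give the method its statistical guarantees, and the function names and toy data are hypothetical rather than taken from the package.

```python
def empirical_fdr(sims, labels, lam):
    """Fraction of retrieved pairs (sim >= lam) that are not true matches."""
    retrieved = [y for s, y in zip(sims, labels) if s >= lam]
    if not retrieved:
        return 0.0
    return sum(1 for y in retrieved if y == 0) / len(retrieved)


def empirical_fnr(sims, labels, lam):
    """Fraction of true matches that fall below the threshold."""
    positives = [s for s, y in zip(sims, labels) if y == 1]
    if not positives:
        return 0.0
    return sum(1 for s in positives if s < lam) / len(positives)


def fdr_threshold(sims, labels, alpha):
    """Lower the threshold from the strictest value until FDR first exceeds alpha."""
    chosen = None
    for lam in sorted(set(sims), reverse=True):
        if empirical_fdr(sims, labels, lam) > alpha:
            break
        chosen = lam
    return chosen


def fnr_threshold(sims, labels, alpha):
    """Raise the threshold from the loosest value until FNR first exceeds alpha."""
    chosen = None
    for lam in sorted(set(sims)):
        if empirical_fnr(sims, labels, lam) > alpha:
            break
        chosen = lam
    return chosen


# Toy calibration pairs (hypothetical): similarity and 0/1 match label
sims = [0.99, 0.98, 0.97, 0.96, 0.95, 0.90, 0.85, 0.80, 0.75, 0.70]
labels = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]

lam_fdr = fdr_threshold(sims, labels, alpha=0.2)
lam_fnr = fnr_threshold(sims, labels, alpha=0.2)
print("FDR threshold:", lam_fdr, "FNR threshold:", lam_fnr)
```

The FDR scan stops at the first violation because empirical FDR is not monotone in the threshold, whereas empirical FNR can only grow as the threshold rises.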
### 5. Verify paper results
```bash
# Reproduce key results from the paper
cpr verify --check syn30 # JCVI Syn3.0 annotation (39.6% at FDR α=0.1)
cpr verify --check fdr # FDR threshold calibration
cpr verify --check dali # DALI prefiltering (82.8% TPR, 31.5% DB reduction)
cpr verify --check clean # CLEAN enzyme classification
```
## Data Files
### Required Data ([Zenodo #14272215](https://zenodo.org/records/14272215))
```bash
cd data/
wget "https://zenodo.org/records/14272215/files/lookup_embeddings.npy?download=1" -O lookup_embeddings.npy
wget "https://zenodo.org/records/14272215/files/lookup_embeddings_meta_data.tsv?download=1" -O lookup_embeddings_meta_data.tsv
wget "https://zenodo.org/records/14272215/files/pfam_new_proteins.npy?download=1" -O pfam_new_proteins.npy
```
### Model Weights ([Zenodo #18478696](https://zenodo.org/records/18478696)) - for embedding new sequences
```bash
wget "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1" -O protein_vec_models.gz
tar -xzf protein_vec_models.gz
```
## Protein-Vec vs CLEAN Models
### Protein-Vec (general protein search)
- Trained on UniProt with multi-task objectives (Pfam, EC, GO, transmembrane, etc.)
- Best for: broad functional annotation, domain identification, general homology search
- Output: 128-dimensional embeddings
- FDR threshold at α=0.1: λ ≈ 0.9999802
### CLEAN (enzyme classification)
- Trained specifically for EC number classification
- Best for: enzyme function prediction, detailed catalytic annotation
- Output: 128-dimensional embeddings
- Requires ESM embeddings as input (computed automatically)
- See `ec/` directory for CLEAN-specific notebooks
## Creating Custom Calibration Datasets
To calibrate FDR/FNR thresholds for your own protein search tasks:
1. Create a calibration dataset with ground-truth labels (see `data/create_pfam_data.ipynb`)
2. Embed sequences using your chosen model (`cpr embed`)
3. Compute similarity scores and labels (save as .npy with shape `(n_samples, 3)`: `[sim, label_exact, label_partial]`)
4. Run calibration: `cpr calibrate --calibration my_data.npy --output thresholds.csv --alpha 0.1`
**Important:** Ensure your calibration dataset is outside the training data of your embedding model to avoid data leakage.
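The calibration file layout from step 3 can be sketched with NumPy as follows. This is a minimal example under assumptions: the helper `build_calibration_array`, the float64 dtype, and the toy values are illustrative, and the exact format that `cpr calibrate` expects should be checked against `data/create_pfam_data.ipynb`.

```python
import os
import tempfile

import numpy as np


def build_calibration_array(sims, label_exact, label_partial):
    """Stack per-pair values into an (n_samples, 3) array: [sim, exact, partial]."""
    data = np.column_stack([sims, label_exact, label_partial]).astype(np.float64)
    if data.ndim != 2 or data.shape[1] != 3:
        raise ValueError(f"expected shape (n_samples, 3), got {data.shape}")
    if not np.isin(data[:, 1:], [0, 1]).all():
        raise ValueError("labels must be 0 or 1")
    return data


# Toy values (hypothetical): one row per query/database pair
sims = [0.99991, 0.99950, 0.99999]
label_exact = [1, 0, 1]
label_partial = [1, 1, 1]

calib = build_calibration_array(sims, label_exact, label_partial)

# Written to a temporary directory here; use your own path in practice.
out_path = os.path.join(tempfile.mkdtemp(), "my_calibration_data.npy")
np.save(out_path, calib)

reloaded = np.load(out_path)
print(reloaded.shape)
```

Validating the shape and label values before saving catches layout mistakes early, before a long calibration run.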
## Complete Workflow Example
Here's a full example searching viral domains against the Pfam database with FDR control:
```bash
# Option A: One-step search from FASTA (embeds automatically)
cpr search --input viral_domains.fasta --output viral_hits.csv --fdr 0.1
# Option B: Two-step with explicit embedding
cpr embed --input viral_domains.fasta --output viral_embeddings.npy
cpr search --input viral_embeddings.npy --output viral_hits.csv --fdr 0.1
```
The output CSV will contain:
- `query_idx`: Query sequence index
- `match_idx`: Database match index
- `similarity`: Cosine similarity score
- `match_*`: Metadata columns from database (UniProt ID, Pfam domains, etc.)
- `probability`: Calibrated probability of functional match
- `uncertainty`: Width of the Venn-Abers probability interval, |p1 - p0| (smaller means a more certain estimate)
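Downstream, a results file with these columns can be post-filtered with the standard library. The sketch below assumes the CSV already contains the `probability` and `uncertainty` columns listed above; the function name `load_confident_hits` and the cutoffs 0.9 and 0.05 are arbitrary illustrations, not recommended settings.

```python
import csv
import io


def load_confident_hits(csv_file, min_probability=0.9, max_uncertainty=0.05):
    """Keep rows with high calibrated probability and a narrow uncertainty interval."""
    kept = []
    for row in csv.DictReader(csv_file):
        probability = float(row["probability"])
        uncertainty = float(row["uncertainty"])
        if probability >= min_probability and uncertainty <= max_uncertainty:
            kept.append(row)
    return kept


# In-memory stand-in for a results file (hypothetical rows)
example_csv = io.StringIO(
    "query_idx,match_idx,similarity,probability,uncertainty\n"
    "0,12,0.99999,0.97,0.01\n"
    "0,40,0.99990,0.62,0.20\n"
    "1,7,0.99998,0.93,0.03\n"
)

confident = load_confident_hits(example_csv)
print([(row["query_idx"], row["match_idx"]) for row in confident])
```

For a file on disk, pass an open handle instead, for example `open("viral_hits.csv", newline="")`.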
## Advanced Usage
### Using Legacy Scripts
For advanced use cases, the original Python scripts are still available in `scripts/`:
```bash
# Legacy search script with more options
python scripts/search.py \
--fdr \
--fdr_lambda 0.99998 \
--output results.csv \
--query_embedding query.npy \
--query_fasta query.fasta \
--lookup_embedding data/lookup_embeddings.npy \
--lookup_fasta data/lookup_embeddings_meta_data.tsv \
--k 1000
# Precompute similarity-to-probability lookup table
python scripts/precompute_SVA_probs.py \
--cal_data data/pfam_new_proteins.npy \
--output data/pfam_sims_to_probs.csv \
--partial \
--n_bins 1000 \
--n_calib 1000
# Apply precomputed probabilities (faster than on-the-fly computation)
python scripts/get_probs.py \
--precomputed \
--precomputed_path data/pfam_sims_to_probs.csv \
--input results.csv \
--output results_with_probs.csv \
--partial
```
## Key Paper Results
This repository reproduces the following results from the paper:
| Claim | Paper | CLI Command | Status |
|-------|-------|-------------|--------|
| JCVI Syn3.0 annotation (Fig 2A) | 39.6% (59/149) at FDR α=0.1 | `cpr verify --check syn30` | ✓ Exact |
| FDR threshold | λ = 0.9999802250 at α=0.1 | `cpr verify --check fdr` | ✓ (~0.002% diff) |
| DALI prefiltering TPR (Table 4-6) | 82.8% | `cpr verify --check dali` | ✓ (~1% diff) |
| DALI database reduction | 31.5% | `cpr verify --check dali` | ✓ Exact |
| CLEAN enzyme loss (Table 1-2) | ≤ α=1.0 | `cpr verify --check clean` | ✓ (0.97) |
## Directory Overview
- `protein_conformal/` - Core utilities for conformal prediction and search
- `scripts/` - Verification scripts and legacy search tools
- `scope/` - SCOPe structural classification experiments
- `pfam/` - Pfam domain annotation notebooks
- `ec/` - EC number classification with CLEAN model
- `data/` - Data processing notebooks and scripts
- `clean_selection/` - CLEAN enzyme selection pipeline
- `tests/` - Test suite (run with `pytest tests/ -v`)
## Contributing & Feature Requests
If you'd like expanded support for specific models or search tasks, please open an issue describing:
1. The embedding model you'd like to use
2. The search/annotation task you're working on
3. Any specific conformal guarantees you need (FDR, FNR, coverage, etc.)
We welcome contributions and look forward to hearing from you!
## Citation
If you use this code or method in your work, please cite:
```bibtex
@article{boger2025functional,
title={Functional protein mining with conformal guarantees},
author={Boger, Ron S and Chithrananda, Seyone and Angelopoulos, Anastasios N and Yoon, Peter H and Jordan, Michael I and Doudna, Jennifer A},
journal={Nature Communications},
volume={16},
number={1},
pages={85},
year={2025},
publisher={Nature Publishing Group},
doi={10.1038/s41467-024-55676-y}
}
```
## License
See LICENSE file for details.