# Getting Started with CPR
This guide will get you from zero to running protein searches with conformal guarantees.
## Statistical Guarantees
CPR provides rigorous statistical guarantees based on conformal prediction:
| Guarantee | Meaning | How to Use |
|-----------|---------|------------|
| **Expected Marginal FDR ≤ α** | On average, at most a fraction α of your hits are false positives | Use `--fdr 0.1` for 10% expected FDR |
| **FNR Control** | Controls the expected fraction of true matches you miss | Use `--fnr 0.1` to miss ≤10% of true hits |
| **Calibrated Probabilities** | Venn-Abers calibration provides valid probability estimates | Output includes `probability` column |
**Key insight**: Unlike raw p-values or arbitrary similarity cutoffs, CPR's FDR guarantee is *marginal*: it holds in expectation across all queries rather than for each query individually. See the [paper](https://doi.org/10.1038/s41467-024-55676-y) for theoretical details.
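To make the two error rates concrete, here is a minimal sketch that computes the empirical FDR and FNR of a set of hits at a given similarity threshold. The function name, the toy scores, and the labels are all made up for illustration; this is not CPR's calibration code. The guarantee concerns the expected value of quantities like these over queries.

```python
import numpy as np


def empirical_fdr_fnr(similarity, is_true_match, threshold):
    """Return (FDR, FNR) when hits are the pairs with similarity >= threshold."""
    similarity = np.asarray(similarity, dtype=float)
    is_true_match = np.asarray(is_true_match, dtype=bool)
    hits = similarity >= threshold

    n_hits = hits.sum()
    n_true = is_true_match.sum()
    false_pos = (hits & ~is_true_match).sum()   # reported but wrong
    false_neg = (~hits & is_true_match).sum()   # correct but missed

    # Convention for this sketch: an empty denominator gives a rate of 0.
    fdr = false_pos / n_hits if n_hits else 0.0
    fnr = false_neg / n_true if n_true else 0.0
    return float(fdr), float(fnr)


# Toy data (made up): five candidate pairs with known labels.
scores = [0.99999, 0.99998, 0.99997, 0.99990, 0.99980]
labels = [True, True, False, True, False]
fdr, fnr = empirical_fdr_fnr(scores, labels, threshold=0.99995)
print(f"FDR={fdr:.3f}, FNR={fnr:.3f}")
```

Raising the threshold generally lowers the FDR and raises the FNR, which is the trade-off the `--fdr` and `--fnr` options let you choose between.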
---
## Quick Start
```bash
# 1. Clone and install
git clone https://github.com/ronboger/conformal-protein-retrieval.git
cd conformal-protein-retrieval
pip install -e .
# 2. Download required data (see wget commands below)
# 3. Search with your sequences (FASTA or embeddings)
cpr search --input your_sequences.fasta --output results.csv --fdr 0.1
```
---
## What You Need
### Already Included (GitHub clone)
| File | Size | Description |
|------|------|-------------|
| `data/gene_unknown/unknown_aa_seqs.fasta` | 56 KB | JCVI Syn3.0 test sequences (149 proteins) |
| `data/gene_unknown/unknown_aa_seqs.npy` | 299 KB | Pre-computed embeddings for test sequences |
| `results/fdr_thresholds.csv` | ~2 KB | FDR thresholds at standard alpha levels |
| `protein_conformal/*.py` | ~100 KB | All the code |
### Download from Zenodo (Required)
**Zenodo URL**: https://zenodo.org/records/14272215
```bash
# Download all required files with wget
cd data/
# Database embeddings (1.1 GB) - 540K UniProt protein embeddings
wget "https://zenodo.org/records/14272215/files/lookup_embeddings.npy?download=1" -O lookup_embeddings.npy
# Database metadata (535 MB) - protein names, Pfam domains, etc.
wget "https://zenodo.org/records/14272215/files/lookup_embeddings_meta_data.tsv?download=1" -O lookup_embeddings_meta_data.tsv
# Calibration data (2.4 GB) - Pfam data for FDR/probability computation
wget "https://zenodo.org/records/14272215/files/pfam_new_proteins.npy?download=1" -O pfam_new_proteins.npy
# Verify downloads
ls -lh lookup_embeddings.npy lookup_embeddings_meta_data.tsv pfam_new_proteins.npy
# Expected: 1.1G, 535M, 2.4G
```
Or with curl:
```bash
cd data/
curl -L -o lookup_embeddings.npy "https://zenodo.org/records/14272215/files/lookup_embeddings.npy?download=1"
curl -L -o lookup_embeddings_meta_data.tsv "https://zenodo.org/records/14272215/files/lookup_embeddings_meta_data.tsv?download=1"
curl -L -o pfam_new_proteins.npy "https://zenodo.org/records/14272215/files/pfam_new_proteins.npy?download=1"
```
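After downloading, a quick sanity check is to inspect an array's shape and dtype without loading it fully into memory. The sketch below assumes `lookup_embeddings.npy` is a plain numeric NumPy array; that is an assumption, and `pfam_new_proteins.npy` may be stored differently and not be memory-mappable. `describe_npy` is a hypothetical helper, not part of CPR.

```python
import tempfile
from pathlib import Path

import numpy as np


def describe_npy(path):
    """Return (shape, dtype) of a plain .npy array without reading it all into RAM."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"{path} not found; download it from Zenodo first")
    # mmap_mode="r" maps the file read-only instead of loading it into memory.
    arr = np.load(path, mmap_mode="r")
    return arr.shape, arr.dtype


# Demo on a tiny stand-in file; point the function at
# data/lookup_embeddings.npy once the real download has finished.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_embeddings.npy"
    np.save(demo_path, np.zeros((4, 8), dtype=np.float32))
    shape, dtype = describe_npy(demo_path)
    print(shape, dtype)
```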
### Protein-Vec Model Weights (Required for embedding new sequences)
If you want to embed new FASTA sequences (not just use pre-computed embeddings), download the model weights:
**Zenodo URL**: https://zenodo.org/records/18478696
```bash
# Download and extract Protein-Vec model weights (2.9 GB compressed)
wget "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1" -O protein_vec_models.gz
# Extract to protein_vec_models/ directory
tar -xzf protein_vec_models.gz
# Verify extraction
ls protein_vec_models/
# Expected: protein_vec.ckpt, protein_vec_params.json, aspect_vec_*.ckpt, etc.
```
Or with curl:
```bash
curl -L -o protein_vec_models.gz "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1"
tar -xzf protein_vec_models.gz
```
### Other Optional Downloads
| File | Size | When you need it |
|------|------|------------------|
| `afdb_embeddings_protein_vec.npy` | 4.7 GB | Searching AlphaFold Database |
| CLEAN model weights | ~1 GB | Enzyme classification with CLEAN |
---
## CLI Commands
### `cpr search` - Search with Conformal Guarantees
The main command for protein search. Accepts both FASTA files and pre-computed embeddings:
```bash
# From FASTA (embeds automatically using Protein-Vec)
cpr search --input proteins.fasta --output results.csv --fdr 0.1
# From pre-computed embeddings
cpr search --input embeddings.npy --output results.csv --fdr 0.1
```
When given a FASTA file, `cpr search` will:
1. Embed your sequences using Protein-Vec (or CLEAN with `--model clean`)
2. Search the UniProt database (540K proteins)
3. Filter to confident hits at your specified FDR
4. Add calibrated probability estimates
5. Include Pfam/functional annotations
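Steps 2 and 3 can be sketched as a nearest-neighbor search over cosine similarity followed by a threshold filter. This is a brute-force NumPy illustration on random data, standing in for the FAISS index that CPR depends on; the function name, `k`, and the toy threshold are assumptions for the example.

```python
import numpy as np


def search(queries, database, threshold, k=5):
    """Return (query_idx, db_idx, similarity) for top-k neighbors above threshold."""
    # Normalize rows so that the dot product equals cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = database / np.linalg.norm(database, axis=1, keepdims=True)
    sims = q @ d.T

    k = min(k, d.shape[0])
    top = np.argsort(-sims, axis=1)[:, :k]  # indices of the k most similar rows

    hits = []
    for qi, neighbors in enumerate(top):
        for di in neighbors:
            s = float(sims[qi, di])
            if s >= threshold:  # keep only confident hits
                hits.append((qi, int(di), s))
    return hits


rng = np.random.default_rng(0)
database = rng.normal(size=(100, 16))
# Two queries that are near-copies of database rows 3 and 42.
queries = database[[3, 42]] + 1e-4 * rng.normal(size=(2, 16))
hits = search(queries, database, threshold=0.999)
print(hits)
```

In the real pipeline the threshold would come from the calibrated FDR or FNR tables rather than being hand-picked.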
**More examples:**
```bash
# With FNR control instead (control false negatives)
cpr search --input proteins.fasta --output results.csv --fnr 0.1
# With a specific threshold you've computed
cpr search --input proteins.fasta --output results.csv --threshold 0.999980
# Use CLEAN model for enzyme classification
cpr search --input enzymes.fasta --output results.csv --model clean --fdr 0.1
# Exploratory: get all neighbors without filtering
cpr search --input proteins.fasta --output results.csv --no-filter
```
**Threshold options** (mutually exclusive):
- `--fdr ALPHA`: Look up threshold for target FDR level (e.g., `--fdr 0.1` for 10% FDR)
- `--fnr ALPHA`: Look up threshold for target FNR level
- `--threshold VALUE`: Use a specific similarity threshold you provide
- `--no-filter`: Return all k nearest neighbors without filtering
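For readers wondering how mutually exclusive options like these are typically enforced, here is a sketch using Python's standard `argparse`. It is not CPR's actual parser: the program name and `describe_mode` helper are hypothetical, and what CPR does when none of the options is given is not covered here.

```python
import argparse


def build_parser():
    """Build a toy parser whose threshold options cannot be combined."""
    parser = argparse.ArgumentParser(prog="cpr-search-sketch")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--fdr", type=float, metavar="ALPHA")
    group.add_argument("--fnr", type=float, metavar="ALPHA")
    group.add_argument("--threshold", type=float, metavar="VALUE")
    group.add_argument("--no-filter", action="store_true")
    return parser


def describe_mode(args):
    """Summarize which filtering mode the parsed arguments select."""
    if args.no_filter:
        return "no filtering"
    if args.threshold is not None:
        return f"fixed threshold {args.threshold}"
    if args.fdr is not None:
        return f"FDR control at alpha={args.fdr}"
    if args.fnr is not None:
        return f"FNR control at alpha={args.fnr}"
    return "no option given"


args = build_parser().parse_args(["--fdr", "0.1"])
print(describe_mode(args))
```

Passing two options from the group (for example `--fdr 0.1 --fnr 0.1`) makes `argparse` exit with a usage error.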
### `cpr embed` - Generate Embeddings
Convert FASTA sequences to embeddings:
```bash
# Using Protein-Vec (default, general-purpose)
cpr embed --input proteins.fasta --output embeddings.npy --model protein-vec
# Using CLEAN (enzyme-specific)
cpr embed --input enzymes.fasta --output embeddings.npy --model clean
```
### `cpr verify` - Verify Paper Results
```bash
cpr verify --check syn30 # Verify JCVI Syn3.0 result (39.6% annotation)
cpr verify --check all # Run all verification checks
```
### Test with Included Data
The repo includes JCVI Syn3.0 sequences for testing:
```bash
# Test search with included FASTA (requires Zenodo data downloaded)
cpr search --input data/gene_unknown/unknown_aa_seqs.fasta --output test_results.csv --fdr 0.1
# Or use pre-computed embeddings (faster, no model weights needed)
cpr search --input data/gene_unknown/unknown_aa_seqs.npy \
--database data/lookup_embeddings.npy \
--output test_results.csv --fdr 0.1
# Expected: ~59 hits (39.6% of 149 sequences)
```
---
## FDR/FNR Threshold Reference
These thresholds control the trade-off between hits and false positives.
### FDR Thresholds (False Discovery Rate)
Controls the expected fraction of hits that are false positives.
| α Level | Threshold (λ) | Std Dev | Use Case |
|---------|---------------|---------|----------|
| **0.1** | **0.9999801** | ±1.7e-06 | **Paper default** |
**Note**: The α=0.1 threshold (0.9999801) agrees with the value reported in the paper (0.9999802) to within one standard deviation. Additional alpha levels can be computed with `scripts/compute_fdr_table.py`.
### FNR Thresholds (False Negative Rate) - Exact Match
Controls the expected fraction of true matches you miss. "Exact match" requires all Pfam domains to match.
| α Level | Threshold (λ) | Std Dev | Use Case |
|---------|---------------|---------|----------|
| 0.001 | 0.9997904 | ±2.3e-05 | Ultra-stringent |
| 0.005 | 0.9998338 | ±8.2e-06 | Very stringent |
| 0.01 | 0.9998495 | ±5.5e-06 | Stringent |
| 0.02 | 0.9998679 | ±5.1e-06 | Moderate |
| 0.05 | 0.9998899 | ±3.3e-06 | Balanced |
| **0.1** | **0.9999076** | ±2.2e-06 | **Recommended** |
| 0.15 | 0.9999174 | ±1.4e-06 | Relaxed |
| 0.2 | 0.9999245 | ±1.3e-06 | Discovery-focused |
### FNR Thresholds - Partial Match
"Partial match" requires at least one Pfam domain to match (more permissive).
| α Level | Threshold (λ) | Std Dev | Use Case |
|---------|---------------|---------|----------|
| 0.001 | 0.9997646 | ±1.5e-06 | Ultra-stringent |
| 0.005 | 0.9997821 | ±2.8e-06 | Very stringent |
| 0.01 | 0.9997946 | ±3.1e-06 | Stringent |
| 0.02 | 0.9998108 | ±3.5e-06 | Moderate |
| 0.05 | 0.9998389 | ±3.0e-06 | Balanced |
| **0.1** | **0.9998626** | ±2.8e-06 | **Recommended** |
| 0.15 | 0.9998779 | ±2.2e-06 | Relaxed |
| 0.2 | 0.9998903 | ±2.1e-06 | Discovery-focused |
The full computed tables, including min/max values, are in `results/fdr_thresholds.csv`, `results/fnr_thresholds.csv`, and `results/fnr_thresholds_partial.csv`.
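The difference between the exact and partial match criteria can be sketched with set operations. This assumes "all Pfam domains match" means the two domain sets are equal, which is one reasonable reading; the Pfam accessions below are only illustrative inputs, and how CPR parses them from the metadata file is not shown.

```python
def exact_match(query_pfams, hit_pfams):
    """Exact match (assumed): both proteins carry the same set of Pfam domains."""
    return set(query_pfams) == set(hit_pfams)


def partial_match(query_pfams, hit_pfams):
    """Partial match: the proteins share at least one Pfam domain."""
    return bool(set(query_pfams) & set(hit_pfams))


query = ["PF00069", "PF07714"]
hit_same = ["PF07714", "PF00069"]   # same domains, different order
hit_overlap = ["PF00069"]           # shares one domain only
hit_other = ["PF00001"]             # no shared domain

for name, hit in [("same", hit_same), ("overlap", hit_overlap), ("other", hit_other)]:
    print(name, exact_match(query, hit), partial_match(query, hit))
```

Every exact match with at least one domain is also a partial match, which is why the partial-match criterion is described as more permissive.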
---
## CLEAN Enzyme Classification
For enzyme-specific searches with EC number predictions:
### Setup
```bash
# 1. Clone CLEAN repository with pretrained weights
git clone https://github.com/tttianhao/CLEAN.git CLEAN_repo
# 2. Install CLEAN and dependencies
cd CLEAN_repo
pip install -e .
# Quote the requirement so the shell does not treat ">" as a redirect
pip install "fair-esm>=2.0.0"
cd ..
# 3. Verify weights are present
ls CLEAN_repo/app/data/pretrained/
# Expected: 100.pt (123 MB), 70.pt (40 MB), split100.pth, split70.pth
```
**Note**: CLEAN uses ESM-1b embeddings internally (computed automatically). The model produces 128-dimensional embeddings (vs 1024 for Protein-Vec).
### Usage with CPR
```bash
# Generate CLEAN embeddings (128-dim) - requires GPU
cpr embed --input enzymes.fasta --output clean_embeddings.npy --model clean
# Search with CLEAN model
cpr search --input enzymes.fasta --output enzyme_results.csv --model clean --fdr 0.1
```
### Verify CLEAN Results (Paper Tables 1-2)
```bash
python scripts/verify_clean.py
# Expected output:
# Mean test loss: 0.97 ± 0.XX
# ✓ VERIFICATION PASSED - Risk controlled at α=1.0
```
---
## DALI Structural Prefiltering
For structural homology search (DALI + AFDB), we use z-score thresholds:
| Metric | Value | Description |
|--------|-------|-------------|
| **elbow_z** | **~5.1** | Z-score threshold for prefiltering |
| TPR | 81.8% | True Positive Rate at elbow threshold |
| FNR | 18.2% | False Negative Rate (miss rate) |
| DB Reduction | 31.5% | Fraction of database filtered out |
Pre-computed results are in `results/dali_thresholds.csv` (73 trials from the paper's experiments).
**Usage**: When running DALI, filter candidates with z-score ≥ 5.1 to achieve ~82% TPR while reducing database size by ~31%.
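The prefiltering step amounts to a simple cutoff on z-scores. The sketch below applies the 5.1 cutoff to made-up scores and labels and reports the resulting TPR and database reduction; the function name and the data are assumptions, so the printed numbers will not match the table above.

```python
import numpy as np


def prefilter(z_scores, is_homolog, z_min=5.1):
    """Keep candidates with z >= z_min; return (mask, TPR, database reduction)."""
    z = np.asarray(z_scores, dtype=float)
    y = np.asarray(is_homolog, dtype=bool)
    keep = z >= z_min

    n_true = y.sum()
    # Fraction of true homologs that survive the filter.
    tpr = (keep & y).sum() / n_true if n_true else 0.0
    # Fraction of the candidate set that was filtered out.
    reduction = 1.0 - keep.mean()
    return keep, float(tpr), float(reduction)


# Toy data (made up): eight candidates, four of them true homologs.
z_scores = [12.0, 7.3, 5.1, 4.9, 3.0, 2.2, 8.8, 1.5]
is_homolog = [True, True, True, True, False, False, False, False]
keep, tpr, reduction = prefilter(z_scores, is_homolog)
print(f"kept={int(keep.sum())}, TPR={tpr:.2f}, reduction={reduction:.2f}")
```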
---
## Legacy Scripts
These scripts from the original paper analysis can be used for advanced workflows:
### FDR/FNR Threshold Computation
```bash
# Compute FDR thresholds at custom alpha levels
python scripts/compute_fdr_table.py \
--calibration data/pfam_new_proteins.npy \
--output results/my_fdr_thresholds.csv \
--n-trials 100 \
--alpha-levels 0.01,0.05,0.1,0.2
# Compute FNR thresholds
python scripts/compute_fnr_table.py \
--calibration data/pfam_new_proteins.npy \
--output results/my_fnr_thresholds.csv \
--n-trials 100
# Use partial matches (at least one Pfam domain matches)
python scripts/compute_fdr_table.py --partial ...
```
### Verification Scripts
```bash
# Verify JCVI Syn3.0 annotation (Paper Figure 2A)
python scripts/verify_syn30.py
# Verify DALI prefiltering (Paper Tables 4-6)
python scripts/verify_dali.py
# Verify CLEAN enzyme classification (Paper Tables 1-2)
python scripts/verify_clean.py
# Verify FDR algorithm correctness
python scripts/verify_fdr_algorithm.py
```
### Probability Computation
```bash
# Precompute SVA probabilities for a database
python scripts/precompute_SVA_probs.py \
--calibration data/pfam_new_proteins.npy \
--output data/sva_probabilities.csv
# Get probabilities for search results
python scripts/get_probs.py \
--input results.csv \
--calibration data/pfam_new_proteins.npy \
--output results_with_probs.csv
```
### Original Paper Scripts (in `scripts/pfam/`)
```bash
# Original FDR threshold generation (paper methodology)
python scripts/pfam/generate_fdr.py
# Original FNR threshold generation
python scripts/pfam/generate_fnr.py
# SVA reliability analysis
python scripts/pfam/sva_results.py
```
---
## Docker / Container Usage
Run CPR without installing dependencies locally:
### Docker
```bash
# Build the image
docker build -t cpr:latest .
# Run with your data mounted
docker run -it --rm \
-v "$(pwd)/data:/workspace/data" \
-v "$(pwd)/protein_vec_models:/workspace/protein_vec_models" \
-v "$(pwd)/results:/workspace/results" \
cpr:latest bash
# Inside container: run searches
cpr search --input data/your_sequences.fasta --output results/hits.csv --fdr 0.1
# Or launch the Gradio web interface
docker run -p 7860:7860 \
-v "$(pwd)/data:/workspace/data" \
cpr:latest
# Then open http://localhost:7860
```
### Docker Compose
```bash
# Start the Gradio web interface
docker-compose up
# Access at http://localhost:7860
```
### Apptainer (HPC clusters)
```bash
# Build the container
apptainer build cpr.sif apptainer.def
# Run a search
apptainer exec --nv cpr.sif cpr search \
--input data/sequences.fasta \
--output results/hits.csv \
--fdr 0.1
# Interactive shell
apptainer shell --nv cpr.sif
```
**Note**: Use `--nv` flag for GPU support on NVIDIA systems.
---
## Troubleshooting
### "FileNotFoundError: data/lookup_embeddings.npy"
→ Download from Zenodo (see wget commands above)
### "ModuleNotFoundError: No module named 'faiss'"
→ Install FAISS: `pip install faiss-cpu` (or `conda install -c pytorch faiss-gpu` for GPU)
### "Got 58 hits, expected 59"
→ This is expected! See `docs/REPRODUCIBILITY.md` - varies by ±1 due to threshold boundary effects.
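One plausible way such a boundary effect arises: a hit whose similarity lies within about one standard deviation of the threshold can fall on either side of it. The sketch below uses the α=0.1 FDR threshold and standard deviation from the table in this guide with made-up similarity scores; it illustrates the idea and is not a diagnosis of the actual Syn3.0 run.

```python
def count_hits(similarities, threshold):
    """Count scores that pass the threshold."""
    return sum(s >= threshold for s in similarities)


LAMBDA = 0.9999801  # FDR threshold at alpha = 0.1 (from this guide)
SIGMA = 1.7e-06     # its reported standard deviation

# Made-up similarity scores; two of them sit very close to the threshold.
sims = [0.9999900, 0.9999810, 0.9999795, 0.9999700]

counts = {
    "lambda - sigma": count_hits(sims, LAMBDA - SIGMA),
    "lambda": count_hits(sims, LAMBDA),
    "lambda + sigma": count_hits(sims, LAMBDA + SIGMA),
}
print(counts)
```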
### "CUDA out of memory"
→ Use CPU: `--cpu` flag or reduce batch size
### "ModuleNotFoundError: No module named 'esm'"
→ For CLEAN embeddings: `pip install fair-esm`
---
## Output Columns
Search results include:
| Column | Description |
|--------|-------------|
| `query_name` | Your sequence ID from FASTA |
| `similarity` | Cosine similarity score |
| `probability` | Calibrated probability of functional match |
| `uncertainty` | Venn-Abers uncertainty interval |
| `match_name` | Matched protein name |
| `match_pfam` | Pfam domain annotations |
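A results file with these columns can be post-processed with the standard library. The sketch below filters rows by calibrated probability; the sample rows are invented, and the exact CSV layout (in particular how `uncertainty` is encoded) is an assumption based only on the column names above.

```python
import csv
import io

# Invented sample in the assumed layout; real files come from `cpr search`.
SAMPLE = """query_name,similarity,probability,uncertainty,match_name,match_pfam
seq1,0.99999,0.97,0.02,Example protein A,PF00001
seq2,0.99985,0.41,0.20,Example protein B,PF00002
"""


def confident_hits(handle, min_probability=0.9):
    """Return the rows whose calibrated probability is at least min_probability."""
    reader = csv.DictReader(handle)
    return [row for row in reader if float(row["probability"]) >= min_probability]


rows = confident_hits(io.StringIO(SAMPLE))
for row in rows:
    print(row["query_name"], row["match_name"], row["probability"])
```

To run it on real output, replace `io.StringIO(SAMPLE)` with `open("results.csv", newline="")`.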
---
## What's Next?
- **Read the paper**: [Nature Communications (2025) 16:85](https://doi.org/10.1038/s41467-024-55676-y)
- **Explore notebooks**: `notebooks/pfam/genes_unknown.ipynb` shows the full Syn3.0 analysis
- **Run verification**: `cpr verify --check all` tests all paper claims
- **Get help**: Open an issue at https://github.com/ronboger/conformal-protein-retrieval/issues
---
## Files Checklist
| Source | Files | Size | Status |
|--------|-------|------|--------|
| **GitHub** | Code, test data, thresholds | ~1 MB | ✓ Included |
| **Zenodo** | lookup_embeddings.npy | 1.1 GB | ☐ Download |
| **Zenodo** | lookup_embeddings_meta_data.tsv | 535 MB | ☐ Download |
| **Zenodo** | pfam_new_proteins.npy | 2.4 GB | ☐ Download |
| **Optional** | protein_vec_models/ | 3 GB | ☐ For new embeddings |
| **Optional** | afdb_embeddings_protein_vec.npy | 4.7 GB | ☐ For AFDB search |