cpr / GETTING_STARTED.md
ronboger's picture
docs: add CLEAN setup details and Docker/Apptainer usage
ab34d07
# Getting Started with CPR
This guide will get you from zero to running protein searches with conformal guarantees.
## Statistical Guarantees
CPR provides rigorous statistical guarantees based on conformal prediction:
| Guarantee | Meaning | How to Use |
|-----------|---------|------------|
| **Expected Marginal FDR ≤ α** | On average, at most α fraction of your hits are false positives | Use `--fdr 0.1` for 10% expected FDR |
| **FNR Control** | Controls the expected fraction of true matches you miss | Use `--fnr 0.1` to miss ≤10% of true hits |
| **Calibrated Probabilities** | Venn-Abers calibration provides valid probability estimates | Output includes `probability` column |
**Key insight**: Unlike p-values or arbitrary thresholds, our FDR guarantees are *marginal* guarantees that hold across all queries in expectation. See the [paper](https://doi.org/10.1038/s41467-024-55676-y) for theoretical details.
---
## Quick Start
```bash
# 1. Clone and install
git clone https://github.com/ronboger/conformal-protein-retrieval.git
cd conformal-protein-retrieval
pip install -e .
# 2. Download required data (see wget commands below)
# 3. Search with your sequences (FASTA or embeddings)
cpr search --input your_sequences.fasta --output results.csv --fdr 0.1
```
---
## What You Need
### Already Included (GitHub clone)
| File | Size | Description |
|------|------|-------------|
| `data/gene_unknown/unknown_aa_seqs.fasta` | 56 KB | JCVI Syn3.0 test sequences (149 proteins) |
| `data/gene_unknown/unknown_aa_seqs.npy` | 299 KB | Pre-computed embeddings for test sequences |
| `results/fdr_thresholds.csv` | ~2 KB | FDR thresholds at standard alpha levels |
| `protein_conformal/*.py` | ~100 KB | All the code |
### Download from Zenodo (Required)
**Zenodo URL**: https://zenodo.org/records/14272215
```bash
# Download all required files with wget
cd data/
# Database embeddings (1.1 GB) - 540K UniProt protein embeddings
wget "https://zenodo.org/records/14272215/files/lookup_embeddings.npy?download=1" -O lookup_embeddings.npy
# Database metadata (535 MB) - protein names, Pfam domains, etc.
wget "https://zenodo.org/records/14272215/files/lookup_embeddings_meta_data.tsv?download=1" -O lookup_embeddings_meta_data.tsv
# Calibration data (2.4 GB) - Pfam data for FDR/probability computation
wget "https://zenodo.org/records/14272215/files/pfam_new_proteins.npy?download=1" -O pfam_new_proteins.npy
# Verify downloads
ls -lh lookup_embeddings.npy lookup_embeddings_meta_data.tsv pfam_new_proteins.npy
# Expected: 1.1G, 535M, 2.4G
```
Or with curl:
```bash
cd data/
curl -L -o lookup_embeddings.npy "https://zenodo.org/records/14272215/files/lookup_embeddings.npy?download=1"
curl -L -o lookup_embeddings_meta_data.tsv "https://zenodo.org/records/14272215/files/lookup_embeddings_meta_data.tsv?download=1"
curl -L -o pfam_new_proteins.npy "https://zenodo.org/records/14272215/files/pfam_new_proteins.npy?download=1"
```
### Protein-Vec Model Weights (Required for embedding new sequences)
If you want to embed new FASTA sequences (not just use pre-computed embeddings), download the model weights:
**Zenodo URL**: https://zenodo.org/records/18478696
```bash
# Download and extract Protein-Vec model weights (2.9 GB compressed)
wget "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1" -O protein_vec_models.gz
# Extract to protein_vec_models/ directory
tar -xzf protein_vec_models.gz
# Verify extraction
ls protein_vec_models/
# Expected: protein_vec.ckpt, protein_vec_params.json, aspect_vec_*.ckpt, etc.
```
Or with curl:
```bash
curl -L -o protein_vec_models.gz "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1"
tar -xzf protein_vec_models.gz
```
### Other Optional Downloads
| File | Size | When you need it |
|------|------|------------------|
| `afdb_embeddings_protein_vec.npy` | 4.7 GB | Searching AlphaFold Database |
| CLEAN model weights | ~1 GB | Enzyme classification with CLEAN |
---
## CLI Commands
### `cpr search` - Search with Conformal Guarantees
The main command for protein search. Accepts both FASTA files and pre-computed embeddings:
```bash
# From FASTA (embeds automatically using Protein-Vec)
cpr search --input proteins.fasta --output results.csv --fdr 0.1
# From pre-computed embeddings
cpr search --input embeddings.npy --output results.csv --fdr 0.1
```
When given a FASTA file, `cpr search` will:
1. Embed your sequences using Protein-Vec (or CLEAN with `--model clean`)
2. Search the UniProt database (540K proteins)
3. Filter to confident hits at your specified FDR
4. Add calibrated probability estimates
5. Include Pfam/functional annotations
**More examples:**
```bash
# With FNR control instead (control false negatives)
cpr search --input proteins.fasta --output results.csv --fnr 0.1
# With a specific threshold you've computed
cpr search --input proteins.fasta --output results.csv --threshold 0.999980
# Use CLEAN model for enzyme classification
cpr search --input enzymes.fasta --output results.csv --model clean --fdr 0.1
# Exploratory: get all neighbors without filtering
cpr search --input proteins.fasta --output results.csv --no-filter
```
**Threshold options** (mutually exclusive):
- `--fdr ALPHA`: Look up threshold for target FDR level (e.g., `--fdr 0.1` for 10% FDR)
- `--fnr ALPHA`: Look up threshold for target FNR level
- `--threshold VALUE`: Use a specific similarity threshold you provide
- `--no-filter`: Return all k nearest neighbors without filtering
### `cpr embed` - Generate Embeddings
Convert FASTA sequences to embeddings:
```bash
# Using Protein-Vec (default, general-purpose)
cpr embed --input proteins.fasta --output embeddings.npy --model protein-vec
# Using CLEAN (enzyme-specific)
cpr embed --input enzymes.fasta --output embeddings.npy --model clean
```
### `cpr verify` - Verify Paper Results
```bash
cpr verify --check syn30 # Verify JCVI Syn3.0 result (39.6% annotation)
cpr verify --check all # Run all verification checks
```
### Test with Included Data
The repo includes JCVI Syn3.0 sequences for testing:
```bash
# Test search with included FASTA (requires Zenodo data downloaded)
cpr search --input data/gene_unknown/unknown_aa_seqs.fasta --output test_results.csv --fdr 0.1
# Or use pre-computed embeddings (faster, no model weights needed)
cpr search --input data/gene_unknown/unknown_aa_seqs.npy \
--database data/lookup_embeddings.npy \
--output test_results.csv --fdr 0.1
# Expected: ~59 hits (39.6% of 149 sequences)
```
---
## FDR/FNR Threshold Reference
These thresholds control the trade-off between hits and false positives.
### FDR Thresholds (False Discovery Rate)
Controls the expected fraction of hits that are false positives.
| α Level | Threshold (λ) | Std Dev | Use Case |
|---------|---------------|---------|----------|
| **0.1** | **0.9999801** | ±1.7e-06 | **Paper default** |
**Note**: FDR threshold at α=0.1 is verified against the paper (0.9999802). Additional alpha levels can be computed with `scripts/compute_fdr_table.py`.
### FNR Thresholds (False Negative Rate) - Exact Match
Controls the expected fraction of true matches you miss. "Exact match" requires all Pfam domains to match.
| α Level | Threshold (λ) | Std Dev | Use Case |
|---------|---------------|---------|----------|
| 0.001 | 0.9997904 | ±2.3e-05 | Ultra-stringent |
| 0.005 | 0.9998338 | ±8.2e-06 | Very stringent |
| 0.01 | 0.9998495 | ±5.5e-06 | Stringent |
| 0.02 | 0.9998679 | ±5.1e-06 | Moderate |
| 0.05 | 0.9998899 | ±3.3e-06 | Balanced |
| **0.1** | **0.9999076** | ±2.2e-06 | **Recommended** |
| 0.15 | 0.9999174 | ±1.4e-06 | Relaxed |
| 0.2 | 0.9999245 | ±1.3e-06 | Discovery-focused |
### FNR Thresholds - Partial Match
"Partial match" requires at least one Pfam domain to match (more permissive).
| α Level | Threshold (λ) | Std Dev | Use Case |
|---------|---------------|---------|----------|
| 0.001 | 0.9997646 | ±1.5e-06 | Ultra-stringent |
| 0.005 | 0.9997821 | ±2.8e-06 | Very stringent |
| 0.01 | 0.9997946 | ±3.1e-06 | Stringent |
| 0.02 | 0.9998108 | ±3.5e-06 | Moderate |
| 0.05 | 0.9998389 | ±3.0e-06 | Balanced |
| **0.1** | **0.9998626** | ±2.8e-06 | **Recommended** |
| 0.15 | 0.9998779 | ±2.2e-06 | Relaxed |
| 0.2 | 0.9998903 | ±2.1e-06 | Discovery-focused |
Full computed tables with min/max values in `results/fdr_thresholds.csv`, `results/fnr_thresholds.csv`, and `results/fnr_thresholds_partial.csv`.
---
## CLEAN Enzyme Classification
For enzyme-specific searches with EC number predictions:
### Setup
```bash
# 1. Clone CLEAN repository with pretrained weights
git clone https://github.com/tttianhao/CLEAN.git CLEAN_repo
# 2. Install CLEAN and dependencies
cd CLEAN_repo
pip install -e .
pip install fair-esm>=2.0.0
cd ..
# 3. Verify weights are present
ls CLEAN_repo/app/data/pretrained/
# Expected: 100.pt (123 MB), 70.pt (40 MB), split100.pth, split70.pth
```
**Note**: CLEAN uses ESM-1b embeddings internally (computed automatically). The model produces 128-dimensional embeddings (vs 1024 for Protein-Vec).
### Usage with CPR
```bash
# Generate CLEAN embeddings (128-dim) - requires GPU
cpr embed --input enzymes.fasta --output clean_embeddings.npy --model clean
# Search with CLEAN model
cpr search --input enzymes.fasta --output enzyme_results.csv --model clean --fdr 0.1
```
### Verify CLEAN Results (Paper Tables 1-2)
```bash
python scripts/verify_clean.py
# Expected output:
# Mean test loss: 0.97 ± 0.XX
# ✓ VERIFICATION PASSED - Risk controlled at α=1.0
```
---
## DALI Structural Prefiltering
For structural homology search (DALI + AFDB), we use z-score thresholds:
| Metric | Value | Description |
|--------|-------|-------------|
| **elbow_z** | **~5.1** | Z-score threshold for prefiltering |
| TPR | 81.8% | True Positive Rate at elbow threshold |
| FNR | 18.2% | False Negative Rate (miss rate) |
| DB Reduction | 31.5% | Fraction of database filtered out |
Pre-computed results in `results/dali_thresholds.csv` (73 trials from paper experiments).
**Usage**: When running DALI, filter candidates with z-score ≥ 5.1 to achieve ~82% TPR while reducing database size by ~31%.
---
## Legacy Scripts
These scripts from the original paper analysis can be used for advanced workflows:
### FDR/FNR Threshold Computation
```bash
# Compute FDR thresholds at custom alpha levels
python scripts/compute_fdr_table.py \
--calibration data/pfam_new_proteins.npy \
--output results/my_fdr_thresholds.csv \
--n-trials 100 \
--alpha-levels 0.01,0.05,0.1,0.2
# Compute FNR thresholds
python scripts/compute_fnr_table.py \
--calibration data/pfam_new_proteins.npy \
--output results/my_fnr_thresholds.csv \
--n-trials 100
# Use partial matches (at least one Pfam domain matches)
python scripts/compute_fdr_table.py --partial ...
```
### Verification Scripts
```bash
# Verify JCVI Syn3.0 annotation (Paper Figure 2A)
python scripts/verify_syn30.py
# Verify DALI prefiltering (Paper Tables 4-6)
python scripts/verify_dali.py
# Verify CLEAN enzyme classification (Paper Tables 1-2)
python scripts/verify_clean.py
# Verify FDR algorithm correctness
python scripts/verify_fdr_algorithm.py
```
### Probability Computation
```bash
# Precompute SVA probabilities for a database
python scripts/precompute_SVA_probs.py \
--calibration data/pfam_new_proteins.npy \
--output data/sva_probabilities.csv
# Get probabilities for search results
python scripts/get_probs.py \
--input results.csv \
--calibration data/pfam_new_proteins.npy \
--output results_with_probs.csv
```
### Original Paper Scripts (in `scripts/pfam/`)
```bash
# Original FDR threshold generation (paper methodology)
python scripts/pfam/generate_fdr.py
# Original FNR threshold generation
python scripts/pfam/generate_fnr.py
# SVA reliability analysis
python scripts/pfam/sva_results.py
```
---
## Docker / Container Usage
Run CPR without installing dependencies locally:
### Docker
```bash
# Build the image
docker build -t cpr:latest .
# Run with your data mounted
docker run -it --rm \
-v $(pwd)/data:/workspace/data \
-v $(pwd)/protein_vec_models:/workspace/protein_vec_models \
-v $(pwd)/results:/workspace/results \
cpr:latest bash
# Inside container: run searches
cpr search --input data/your_sequences.fasta --output results/hits.csv --fdr 0.1
# Or launch the Gradio web interface
docker run -p 7860:7860 \
-v $(pwd)/data:/workspace/data \
cpr:latest
# Then open http://localhost:7860
```
### Docker Compose
```bash
# Start the Gradio web interface
docker-compose up
# Access at http://localhost:7860
```
### Apptainer (HPC clusters)
```bash
# Build the container
apptainer build cpr.sif apptainer.def
# Run a search
apptainer exec --nv cpr.sif cpr search \
--input data/sequences.fasta \
--output results/hits.csv \
--fdr 0.1
# Interactive shell
apptainer shell --nv cpr.sif
```
**Note**: Use `--nv` flag for GPU support on NVIDIA systems.
---
## Troubleshooting
### "FileNotFoundError: data/lookup_embeddings.npy"
→ Download from Zenodo (see wget commands above)
### "ModuleNotFoundError: No module named 'faiss'"
→ Install FAISS: `pip install faiss-cpu` (or `conda install faiss-gpu` for GPU)
### "Got 58 hits, expected 59"
→ This is expected! See `docs/REPRODUCIBILITY.md` - varies by ±1 due to threshold boundary effects.
### "CUDA out of memory"
→ Use CPU: `--cpu` flag or reduce batch size
### "ModuleNotFoundError: No module named 'fair_esm'"
→ For CLEAN embeddings: `pip install fair-esm`
---
## Output Columns
Search results include:
| Column | Description |
|--------|-------------|
| `query_name` | Your sequence ID from FASTA |
| `similarity` | Cosine similarity score |
| `probability` | Calibrated probability of functional match |
| `uncertainty` | Venn-Abers uncertainty interval |
| `match_name` | Matched protein name |
| `match_pfam` | Pfam domain annotations |
---
## What's Next?
- **Read the paper**: [Nature Communications (2025) 16:85](https://doi.org/10.1038/s41467-024-55676-y)
- **Explore notebooks**: `notebooks/pfam/genes_unknown.ipynb` shows the full Syn3.0 analysis
- **Run verification**: `cpr verify --check all` tests all paper claims
- **Get help**: Open an issue at https://github.com/ronboger/conformal-protein-retrieval/issues
---
## Files Checklist
| Source | Files | Size | Status |
|--------|-------|------|--------|
| **GitHub** | Code, test data, thresholds | ~1 MB | ✓ Included |
| **Zenodo** | lookup_embeddings.npy | 1.1 GB | ☐ Download |
| **Zenodo** | lookup_embeddings_meta_data.tsv | 535 MB | ☐ Download |
| **Zenodo** | pfam_new_proteins.npy | 2.4 GB | ☐ Download |
| **Optional** | protein_vec_models/ | 3 GB | ☐ For new embeddings |
| **Optional** | afdb_embeddings_protein_vec.npy | 4.7 GB | ☐ For AFDB search |