	---
	title: Conformal Protein Retrieval
	emoji: 🧬
	colorFrom: purple
	colorTo: blue
	sdk: gradio
	sdk_version: 5.9.1
	app_file: app.py
	pinned: false
	license: mit
	short_description: Functional protein mining with statistical guarantees
	---

	# Conformal Protein Retrieval

	Code and notebooks from [Functional protein mining with conformal guarantees](https://www.nature.com/articles/s41467-024-55676-y) (Nature Communications, 2025). This package provides statistically rigorous methods for protein database search with false discovery rate (FDR) and false negative rate (FNR) control.

	[→ GETTING STARTED](GETTING_STARTED.md) - Quick setup guide (10 minutes)

	## Quick Setup

	```bash
	# 1. Clone and install
	git clone https://github.com/ronboger/conformal-protein-retrieval.git
	cd conformal-protein-retrieval
	pip install -e .

	# 2. Download data from Zenodo (about 4 GB total)
	# https://zenodo.org/records/14272215
	# → lookup_embeddings.npy (1.1 GB) → data/
	# → lookup_embeddings_meta_data.tsv (535 MB) → data/
	# → pfam_new_proteins.npy (2.4 GB) → data/

	# 3. Verify setup
	cpr verify --check syn30
	# Expected: 59/149 = 39.6% hits at FDR α=0.1
	```

	See [GETTING_STARTED.md](GETTING_STARTED.md) for detailed instructions.
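Before running `cpr verify`, it can be useful to confirm that the three Zenodo files landed in `data/`. The following is a minimal sketch using only the Python standard library. The helper name `missing_data_files` is hypothetical and is not part of the `cpr` package; the file names are the ones listed above.

```python
from pathlib import Path

# File names as listed in the Zenodo record above.
REQUIRED_FILES = (
    "lookup_embeddings.npy",
    "lookup_embeddings_meta_data.tsv",
    "pfam_new_proteins.npy",
)


def missing_data_files(data_dir="data"):
    """Return the required file names that are absent or empty in data_dir.

    Hypothetical helper for illustration; not part of the cpr CLI.
    """
    root = Path(data_dir)
    missing = []
    for name in REQUIRED_FILES:
        path = root / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


absent = missing_data_files("data")
print("All data files present." if not absent else "Missing: " + ", ".join(absent))
```

An empty return value means all three files exist and are non-empty; it does not check file integrity or size against the Zenodo record.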

	## Repository Structure

	```
	conformal-protein-retrieval/
	├── protein_conformal/ # Core library (FDR/FNR control, Venn-Abers)
	├── notebooks/ # Analysis notebooks organized by experiment
	│ ├── pfam/ # Pfam domain annotation (Figure 2)
	│ ├── scope/ # SCOPe structural classification
	│ ├── ec/ # EC number classification
	│ └── clean_selection/ # CLEAN enzyme experiments (Tables 1-2)
	├── scripts/ # CLI scripts and SLURM jobs
	├── data/ # Data files (see GETTING_STARTED.md)
	├── results/ # Pre-computed thresholds and outputs
	└── docs/ # Additional documentation
	```

	## Quick Start

	The `cpr` CLI provides five main commands for functional protein mining:

	### 1. Embed protein sequences

	```bash
	# Embed with Protein-Vec (for general protein search)
	cpr embed --input sequences.fasta --output embeddings.npy --model protein-vec

	# Embed with CLEAN (for enzyme classification)
	cpr embed --input sequences.fasta --output embeddings.npy --model clean
	```
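Before embedding, it may help to check that the FASTA file parses into the number of records you expect. The sketch below is a small standard-library FASTA reader; `read_fasta` is a hypothetical helper and does not reflect how `cpr embed` parses its input.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    Hypothetical helper for illustration; multi-line sequences are joined.
    """
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Toy input; with a real file, pass an open file handle instead of a list.
example = [">seq1 demo protein", "MKTAYIAK", "QRQISFVK", ">seq2", "MSLLTEVE"]
records = list(read_fasta(example))
print(len(records), "records")
```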

	### 2. Search for similar proteins with conformal guarantees

	The `cpr search` command accepts either a FASTA file or pre-computed embeddings as input:

	```bash
	# From FASTA file (auto-embeds with Protein-Vec)
	cpr search --input sequences.fasta --output results.csv --fdr 0.1

	# From pre-computed embeddings
	cpr search --input embeddings.npy --output results.csv --fdr 0.1

	# With FNR control instead of FDR
	cpr search --input sequences.fasta --output results.csv --fnr 0.1

	# With explicit threshold
	cpr search --input sequences.fasta --output results.csv --threshold 0.99998

	# Exploratory mode (no filtering, return all k neighbors)
	cpr search --input sequences.fasta --output results.csv --no-filter
	```
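Conceptually, the filtering step keeps only those database neighbors whose similarity to the query reaches the chosen threshold. The sketch below illustrates that idea with plain Python lists and cosine similarity. It is an assumption-laden illustration, not the package's search code: the function names are hypothetical, the vectors are toy data, and all vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def filter_hits(query, database, threshold, k=None):
    """Return (index, similarity) pairs with similarity >= threshold, best first.

    If k is given, only the k most similar rows are considered before filtering.
    """
    scored = [(i, cosine_similarity(query, row)) for i, row in enumerate(database)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    if k is not None:
        scored = scored[:k]
    return [(i, s) for i, s in scored if s >= threshold]


# Toy 3-dimensional "embeddings"; real embeddings are much higher-dimensional.
query = [1.0, 0.0, 0.0]
database = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]

print(filter_hits(query, database, threshold=0.99))
print(filter_hits(query, database, threshold=0.999))
```

Raising the threshold shrinks the returned set, which is why thresholds as tight as 0.99998 return only very close neighbors.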

	### 3. Convert similarity scores to calibrated probabilities

	```bash
	# Add Venn-Abers calibrated probabilities to search results
	cpr prob \
	--input results.csv \
	--calibration data/pfam_new_proteins.npy \
	--output results_with_probs.csv \
	--n-calib 1000
	```
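For readers unfamiliar with Venn-Abers calibration, the sketch below shows the general inductive Venn-Abers idea in pure Python: fit an isotonic (non-decreasing) regression to the calibration scores twice, once with the test score added under label 0 and once under label 1, and read off the two fitted values `p0` and `p1`. This is a textbook-style illustration under stated assumptions, not the implementation used by `cpr prob`; the function names and toy data are hypothetical.

```python
def isotonic_fit(points):
    """Pool-adjacent-violators fit for (score, label) pairs.

    Returns (sorted unique scores, non-decreasing fitted values).
    """
    totals = {}
    for x, y in points:
        s, n = totals.get(x, (0.0, 0))
        totals[x] = (s + y, n + 1)  # aggregate tied scores
    xs = sorted(totals)
    blocks = []  # each block: [label sum, weight, number of unique scores]
    for x in xs:
        s, n = totals[x]
        blocks.append([s, n, 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s_last, n_last, k_last = blocks.pop()
            blocks[-1][0] += s_last
            blocks[-1][1] += n_last
            blocks[-1][2] += k_last
    fitted = []
    for s, n, k in blocks:
        fitted.extend([s / n] * k)
    return xs, fitted


def venn_abers_interval(calibration, score):
    """Return (p0, p1) for a test score given (score, label) calibration pairs."""
    probs = []
    for assumed_label in (0, 1):
        xs, fitted = isotonic_fit(list(calibration) + [(score, assumed_label)])
        probs.append(fitted[xs.index(score)])
    return probs[0], probs[1]


# Toy calibration set: (similarity, 1 if true match else 0).
calibration = [(0.1, 0), (0.2, 0), (0.3, 1), (0.4, 0), (0.8, 1), (0.9, 1)]
p0, p1 = venn_abers_interval(calibration, 0.85)
print(p0, p1, abs(p1 - p0))
```

A narrow gap between `p0` and `p1` indicates that the calibration data pins down the probability well at that score; a wide gap signals sparse or conflicting calibration data nearby.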

	### 4. Calibrate FDR/FNR thresholds for a new embedding model

	```bash
	# Compute thresholds from your own calibration data
	cpr calibrate \
	--calibration my_calibration_data.npy \
	--output thresholds.csv \
	--alpha 0.1 \
	--n-trials 100 \
	--n-calib 1000
	```
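To build intuition for what a calibrated FDR threshold means, the sketch below computes the empirical false discovery rate on labeled calibration pairs and returns the lowest similarity threshold at which that rate stays at or below α. This is only the empirical quantity: the procedure in the paper adds finite-sample statistical corrections and repeated trials (compare `--n-trials` and `--n-calib`), which are not reproduced here. The function name and toy data are hypothetical.

```python
def empirical_fdr_threshold(sims, labels, alpha):
    """Lowest threshold whose empirical FDR is <= alpha, or None if none qualifies.

    sims: similarity scores; labels: 1 for a true match, 0 otherwise.
    Everything with similarity >= threshold counts as a discovery.
    """
    pairs = sorted(zip(sims, labels), reverse=True)  # highest similarity first
    best = None
    false_hits = 0
    for rank, (sim, label) in enumerate(pairs, start=1):
        false_hits += 1 - label
        # Evaluate only once all pairs tied at this similarity are included.
        last_of_tie = rank == len(pairs) or pairs[rank][0] != sim
        if last_of_tie and false_hits / rank <= alpha:
            best = sim
    return best


# Toy calibration data.
sims = [0.99, 0.98, 0.97, 0.96, 0.95, 0.90]
labels = [1, 1, 1, 0, 1, 0]

print(empirical_fdr_threshold(sims, labels, alpha=0.1))
print(empirical_fdr_threshold(sims, labels, alpha=0.25))
```

A looser α admits a lower threshold and therefore more hits, at the cost of a larger share of false discoveries.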

	### 5. Verify paper results

	```bash
	# Reproduce key results from the paper
	cpr verify --check syn30 # JCVI Syn3.0 annotation (39.6% at FDR α=0.1)
	cpr verify --check fdr # FDR threshold calibration
	cpr verify --check dali # DALI prefiltering (82.8% TPR, 31.5% DB reduction)
	cpr verify --check clean # CLEAN enzyme classification
	```

	## Data Files

	### Required Data ([Zenodo #14272215](https://zenodo.org/records/14272215))

	```bash
	cd data/
	wget "https://zenodo.org/records/14272215/files/lookup_embeddings.npy?download=1" -O lookup_embeddings.npy
	wget "https://zenodo.org/records/14272215/files/lookup_embeddings_meta_data.tsv?download=1" -O lookup_embeddings_meta_data.tsv
	wget "https://zenodo.org/records/14272215/files/pfam_new_proteins.npy?download=1" -O pfam_new_proteins.npy
	```

	### Model Weights ([Zenodo #18478696](https://zenodo.org/records/18478696)) - for embedding new sequences

	```bash
	wget "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1" -O protein_vec_models.gz
	tar -xzf protein_vec_models.gz
	```

	## Protein-Vec vs CLEAN Models

	### Protein-Vec (general protein search)
	- Trained on UniProt with multi-task objectives (Pfam, EC, GO, transmembrane, etc.)
	- Best for: broad functional annotation, domain identification, general homology search
	- Output: 128-dimensional embeddings
	- FDR threshold at α=0.1: λ ≈ 0.9999802

	### CLEAN (enzyme classification)
	- Trained specifically for EC number classification
	- Best for: enzyme function prediction, detailed catalytic annotation
	- Output: 128-dimensional embeddings
	- Requires ESM embeddings as input (computed automatically)
	- See the `notebooks/ec/` directory for CLEAN-specific notebooks

	## Creating Custom Calibration Datasets

	To calibrate FDR/FNR thresholds for your own protein search tasks:

	1. Create a calibration dataset with ground-truth labels (see `data/create_pfam_data.ipynb`)
	2. Embed sequences using your chosen model (`cpr embed`)
	3. Compute similarity scores and labels (save as .npy with shape `(n_samples, 3)`: `[sim, label_exact, label_partial]`)
	4. Run calibration: `cpr calibrate --calibration my_data.npy --output thresholds.csv --alpha 0.1`

	Important: Make sure the calibration dataset contains no sequences from your embedding model's training data. Such data leakage would undermine the statistical guarantees.
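Step 3 above can be sketched as follows. The example builds `[sim, label_exact, label_partial]` rows from query/match label sets, assuming that "exact" means the two label sets are identical and "partial" means they share at least one label. That interpretation, the helper name `calibration_row`, and the Pfam accessions and similarity values are illustrative assumptions.

```python
def calibration_row(similarity, query_labels, match_labels):
    """Build one [sim, label_exact, label_partial] row.

    Assumption: exact = identical non-empty label sets,
    partial = at least one shared label.
    """
    query_set, match_set = set(query_labels), set(match_labels)
    label_exact = int(bool(query_set) and query_set == match_set)
    label_partial = int(bool(query_set & match_set))
    return [float(similarity), label_exact, label_partial]


# Hypothetical query/match pairs with made-up similarity scores.
rows = [
    calibration_row(0.99991, ["PF00001"], ["PF00001"]),
    calibration_row(0.99972, ["PF00001", "PF00002"], ["PF00002"]),
    calibration_row(0.99850, ["PF00003"], ["PF00004"]),
]
for row in rows:
    print(row)
```

With NumPy installed, `numpy.save("my_data.npy", numpy.asarray(rows))` would write these rows as an array of shape `(n_samples, 3)`.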

	## Complete Workflow Example

	Here's a full example searching viral domains against the Pfam database with FDR control:

	```bash
	# Option A: One-step search from FASTA (embeds automatically)
	cpr search --input viral_domains.fasta --output viral_hits.csv --fdr 0.1

	# Option B: Two-step with explicit embedding
	cpr embed --input viral_domains.fasta --output viral_embeddings.npy
	cpr search --input viral_embeddings.npy --output viral_hits.csv --fdr 0.1
	```

	The output CSV will contain:
	- `query_idx`: Query sequence index
	- `match_idx`: Database match index
	- `similarity`: Cosine similarity score
	- `match_*`: Metadata columns from database (UniProt ID, Pfam domains, etc.)
	- `probability`: Calibrated probability of functional match
	- `uncertainty`: Width of the Venn-Abers probability interval (|p1 - p0|)
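Downstream filtering on these columns might look like the sketch below, which uses the standard-library `csv` module. The column names are the ones listed above; the sample rows, the cutoffs, and the helper name `confident_hits` are made up for illustration, and a real results file will also contain the `match_*` metadata columns.

```python
import csv
import io

# Hypothetical sample; replace with open("viral_hits.csv", newline="") for a real file.
SAMPLE = (
    "query_idx,match_idx,similarity,probability,uncertainty\n"
    "0,17,0.999991,0.97,0.02\n"
    "0,42,0.999982,0.81,0.15\n"
    "1,8,0.999985,0.93,0.04\n"
)


def confident_hits(handle, min_probability=0.9, max_uncertainty=0.05):
    """Return (query_idx, match_idx) pairs that pass both cutoffs."""
    kept = []
    for row in csv.DictReader(handle):
        probable = float(row["probability"]) >= min_probability
        certain = float(row["uncertainty"]) <= max_uncertainty
        if probable and certain:
            kept.append((int(row["query_idx"]), int(row["match_idx"])))
    return kept


print(confident_hits(io.StringIO(SAMPLE)))
```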

	## Advanced Usage

	### Using Legacy Scripts

	For advanced use cases, the original Python scripts are still available in `scripts/`:

	```bash
	# Legacy search script with more options
	python scripts/search.py \
	--fdr \
	--fdr_lambda 0.99998 \
	--output results.csv \
	--query_embedding query.npy \
	--query_fasta query.fasta \
	--lookup_embedding data/lookup_embeddings.npy \
	--lookup_fasta data/lookup_embeddings_meta_data.tsv \
	--k 1000

	# Precompute similarity-to-probability lookup table
	python scripts/precompute_SVA_probs.py \
	--cal_data data/pfam_new_proteins.npy \
	--output data/pfam_sims_to_probs.csv \
	--partial \
	--n_bins 1000 \
	--n_calib 1000

	# Apply precomputed probabilities (faster than on-the-fly computation)
	python scripts/get_probs.py \
	--precomputed \
	--precomputed_path data/pfam_sims_to_probs.csv \
	--input results.csv \
	--output results_with_probs.csv \
	--partial
	```
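The speed-up from precomputation comes from replacing a per-hit calibration with a table lookup. The sketch below shows the idea with a binned table and `bisect`. The table layout (lower bin edge, probability) and all values are assumptions for illustration; the actual format of `pfam_sims_to_probs.csv` may differ.

```python
from bisect import bisect_right


def lookup_probability(table, similarity):
    """Return the probability of the bin containing `similarity`.

    table: list of (bin_lower_edge, probability), sorted by edge.
    Similarities below the first edge fall into the first bin.
    """
    edges = [edge for edge, _ in table]
    index = max(bisect_right(edges, similarity) - 1, 0)
    return table[index][1]


# Hypothetical three-bin table; a real one would have many more bins.
table = [(0.9990, 0.05), (0.9995, 0.40), (0.9999, 0.90)]

for sim in (0.5, 0.99972, 0.99995):
    print(sim, lookup_probability(table, sim))
```

Each lookup costs a binary search over the bin edges, so the number of bins (`--n_bins` above) trades table size against resolution.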

	## Key Paper Results

	This repository reproduces the following results from the paper:

	| Claim | Paper | CLI Command | Status |
	|-------|-------|-------------|--------|
	| JCVI Syn3.0 annotation (Fig 2A) | 39.6% (59/149) at FDR α=0.1 | `cpr verify --check syn30` | ✓ Exact |
	| FDR threshold | λ = 0.9999802250 at α=0.1 | `cpr verify --check fdr` | ✓ (~0.002% diff) |
	| DALI prefiltering TPR (Table 4-6) | 82.8% | `cpr verify --check dali` | ✓ (~1% diff) |
	| DALI database reduction | 31.5% | `cpr verify --check dali` | ✓ Exact |
	| CLEAN enzyme loss (Table 1-2) | ≤ α=1.0 | `cpr verify --check clean` | ✓ (0.97) |

	## Repository Structure

	- `protein_conformal/` - Core utilities for conformal prediction and search
	- `scripts/` - Verification scripts and legacy search tools
	- `notebooks/scope/` - SCOPe structural classification experiments
	- `notebooks/pfam/` - Pfam domain annotation notebooks
	- `notebooks/ec/` - EC number classification with CLEAN model
	- `data/` - Data processing notebooks and scripts
	- `notebooks/clean_selection/` - CLEAN enzyme selection pipeline
	- `tests/` - Test suite (run with `pytest tests/ -v`)

	## Contributing & Feature Requests

	If you'd like expanded support for specific models or search tasks, please open an issue describing:
	1. The embedding model you'd like to use
	2. The search/annotation task you're working on
	3. Any specific conformal guarantees you need (FDR, FNR, coverage, etc.)

	We welcome contributions and look forward to hearing from you!

	## Citation

	If you use this code or method in your work, please cite:

	```bibtex
	@article{boger2025functional,
	title={Functional protein mining with conformal guarantees},
	author={Boger, Ron S and Chithrananda, Seyone and Angelopoulos, Anastasios N and Yoon, Peter H and Jordan, Michael I and Doudna, Jennifer A},
	journal={Nature Communications},
	volume={16},
	number={1},
	pages={85},
	year={2025},
	publisher={Nature Publishing Group},
	doi={10.1038/s41467-024-55676-y}
	}
	```

	## License

	See LICENSE file for details.