ronboger and Claude Opus 4.5 committed
Commit 7453ae1 · 1 Parent(s): f7a768a

feat: add one-step cpr find command and FDR-level search


Major improvements to make CPR easier to use:

## New Features
- Add `cpr find` command: FASTA → annotated results in ONE step
  - Automatically embeds sequences
  - Searches the database with FDR control
  - Adds calibrated probabilities
  - Returns annotated hits

- Add `--fdr` and `--fnr` options to `cpr search`
  - Specify the FDR level directly (e.g., `--fdr 0.1` for 10% FDR)
  - Automatically looks up the threshold from `results/fdr_thresholds.csv`
  - Falls back to paper values if the table is not found

- Improved search defaults
  - `k` now defaults to 10% of the database size (min 100, max 10000)
  - Better result summary with hit rates
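The new `k` default is a clamped fraction of the database size; a minimal sketch (the helper name `default_k` is illustrative, not part of the CLI):

```python
def default_k(db_size: int) -> int:
    """k defaults to 10% of the database, clamped to [100, 10000]."""
    return min(max(100, db_size // 10), 10000)

print(default_k(500))      # prints 100 (floor for small databases)
print(default_k(540_000))  # prints 10000 (cap at UniProt scale)
```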

## Documentation
- Add GETTING_STARTED.md: 10-minute quickstart guide
- Add UPLOAD_CHECKLIST.md: What goes to GitHub vs Zenodo
- Update README.md with simpler workflow

## Usage Example
```bash
# One command from FASTA to annotated results:
cpr find --input proteins.fasta --output results.csv --fdr 0.1
```

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Files changed (4)
  1. GETTING_STARTED.md +278 -0
  2. README.md +28 -11
  3. UPLOAD_CHECKLIST.md +188 -0
  4. protein_conformal/cli.py +226 -8
GETTING_STARTED.md ADDED
@@ -0,0 +1,278 @@
+ # Getting Started with CPR
+
+ This guide will get you from zero to running protein searches with conformal guarantees in under 10 minutes.
+
+ ## TL;DR (Easiest Path)
+
+ ```bash
+ # 1. Clone and install
+ git clone https://github.com/ronboger/conformal-protein-retrieval.git
+ cd conformal-protein-retrieval
+ pip install -e .
+
+ # 2. Download data (4 GB total) from https://zenodo.org/records/14272215:
+ #    → lookup_embeddings.npy (1.1 GB) → data/
+ #    → lookup_embeddings_meta_data.tsv (535 MB) → data/
+ #    → pfam_new_proteins.npy (2.4 GB) → data/
+
+ # 3. Get Protein-Vec model weights (contact authors or see below)
+ #    → Extract protein_vec_models.gz to protein_vec_models/
+
+ # 4. Run search on your sequences (ONE COMMAND!)
+ cpr find --input your_sequences.fasta --output results.csv --fdr 0.1
+
+ # That's it! results.csv contains:
+ #    - Functional annotations for each protein
+ #    - Calibrated probabilities
+ #    - Uncertainty estimates
+ ```
+
+ ### Don't have model weights? Use pre-computed embeddings:
+
+ ```bash
+ # If you already have embeddings (.npy), skip to search:
+ cpr search --query your_embeddings.npy --database data/lookup_embeddings.npy --output results.csv --fdr 0.1
+ ```
+
+ ---
+
+ ## What You Need
+
+ ### Already Included (GitHub clone)
+
+ When you clone the repository, you automatically get:
+
+ | File | Size | Description |
+ |------|------|-------------|
+ | `data/gene_unknown/unknown_aa_seqs.fasta` | 56 KB | JCVI Syn3.0 test sequences (149 proteins) |
+ | `data/gene_unknown/unknown_aa_seqs.npy` | 299 KB | Pre-computed embeddings for test sequences |
+ | `results/fdr_thresholds.csv` | ~2 KB | FDR thresholds at standard alpha levels |
+ | `protein_conformal/*.py` | ~100 KB | All the code |
+
+ ### Download from Zenodo (Required)
+
+ Download these from **https://zenodo.org/records/14272215**:
+
+ | File | Size | What it is | Where to put it |
+ |------|------|------------|-----------------|
+ | `lookup_embeddings.npy` | **1.1 GB** | UniProt database (540K protein embeddings) | `data/` |
+ | `lookup_embeddings_meta_data.tsv` | **535 MB** | Protein metadata (names, Pfam domains, etc.) | `data/` |
+ | `pfam_new_proteins.npy` | **2.4 GB** | Calibration data for FDR/probability computation | `data/` |
+
+ **Total download: ~4 GB**
+
+ ### Optional Downloads
+
+ | File | Size | When you need it |
+ |------|------|------------------|
+ | `afdb_embeddings_protein_vec.npy` | 4.7 GB | Searching the AlphaFold Database |
+ | Protein-Vec model weights | 3 GB | Computing new embeddings from FASTA |
+ | CLEAN model weights | 1 GB | Enzyme classification with CLEAN |
+
+ ---
+
+ ## Step-by-Step Setup
+
+ ### Step 1: Clone and Install
+
+ ```bash
+ git clone https://github.com/ronboger/conformal-protein-retrieval.git
+ cd conformal-protein-retrieval
+ pip install -e .
+ ```
+
+ **Verify installation:**
+ ```bash
+ cpr --help
+ # Should show: find, embed, search, prob, calibrate, verify commands
+ ```
+
+ ### Step 2: Download Data
+
+ Go to **https://zenodo.org/records/14272215** and download:
+
+ 1. `lookup_embeddings.npy` (1.1 GB)
+ 2. `lookup_embeddings_meta_data.tsv` (535 MB)
+ 3. `pfam_new_proteins.npy` (2.4 GB)
+
+ Move them to the `data/` directory:
+
+ ```bash
+ mv ~/Downloads/lookup_embeddings.npy data/
+ mv ~/Downloads/lookup_embeddings_meta_data.tsv data/
+ mv ~/Downloads/pfam_new_proteins.npy data/
+ ```
+
+ **Verify files:**
+ ```bash
+ ls -lh data/*.npy data/*.tsv
+ # lookup_embeddings.npy            1.1G
+ # lookup_embeddings_meta_data.tsv  535M
+ # pfam_new_proteins.npy            2.4G
+ ```
+
+ ### Step 3: Verify Setup
+
+ ```bash
+ cpr verify --check syn30
+ ```
+
+ **Expected output:**
+ ```
+ JCVI Syn3.0 Annotation Verification
+ Total queries: 149
+ Confident hits: 59 (might be 58-60, see docs/REPRODUCIBILITY.md)
+ Hit rate: 39.6%
+ FDR threshold: λ = 0.999980225003
+ ✓ VERIFICATION PASSED
+ ```
+
+ ---
+
+ ## Your First Search
+
+ ### Easiest: One Command from FASTA (Recommended)
+
+ ```bash
+ cpr find --input your_proteins.fasta --output results.csv --fdr 0.1
+ ```
+
+ This single command:
+ 1. Embeds your sequences using Protein-Vec
+ 2. Searches the UniProt database (540K proteins)
+ 3. Filters to confident hits at 10% FDR
+ 4. Adds calibrated probability estimates
+ 5. Includes Pfam/functional annotations
+
+ **Output columns:**
+ - `query_name`: Your sequence ID from FASTA
+ - `similarity`: Cosine similarity score
+ - `probability`: Calibrated probability of functional match
+ - `uncertainty`: Venn-Abers uncertainty interval
+ - `match_*`: Pfam domains, protein names, etc.
+
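The `probability` and `uncertainty` columns collapse the two Venn-Abers bounds (p0, p1) into a midpoint and an interval width. A minimal sketch with illustrative bounds (the helper name is hypothetical; the combination matches what `cpr find` computes per hit):

```python
def combine_venn_abers(p0: float, p1: float):
    """Collapse a Venn-Abers interval [p0, p1] into the midpoint
    estimate and interval width reported per hit."""
    probability = (p0 + p1) / 2
    uncertainty = abs(p1 - p0)
    return probability, uncertainty

# Illustrative bounds, not real calibration output:
prob, unc = combine_venn_abers(0.90, 0.96)
print(round(prob, 3), round(unc, 3))  # prints: 0.93 0.06
```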
+ ### Control FDR Level
+
+ ```bash
+ # Stringent: 1% FDR (fewer but more confident hits)
+ cpr find --input proteins.fasta --output results.csv --fdr 0.01
+
+ # Default: 10% FDR (balanced)
+ cpr find --input proteins.fasta --output results.csv --fdr 0.1
+
+ # Discovery: 20% FDR (more hits, some false positives)
+ cpr find --input proteins.fasta --output results.csv --fdr 0.2
+ ```
+
+ ### Alternative: Manual Workflow (Advanced)
+
+ If you need more control or already have embeddings:
+
+ ```bash
+ # Step 1: Embed (if starting from FASTA)
+ cpr embed --input seqs.fasta --output embeddings.npy --model protein-vec
+
+ # Step 2: Search with FDR control
+ cpr search --query embeddings.npy --database data/lookup_embeddings.npy --output hits.csv --fdr 0.1
+
+ # Step 3: Add probabilities (optional, for detailed analysis)
+ cpr prob --input hits.csv --output hits_with_probs.csv
+ ```
+
+ ---
+
+ ## FDR Threshold Reference
+
+ Use these thresholds for your desired false discovery rate:
+
+ | FDR Level | Threshold (λ) | Use Case |
+ |-----------|---------------|----------|
+ | 1% | 0.999990 | Very stringent |
+ | 5% | 0.999985 | Stringent |
+ | **10%** | **0.999980** | **Paper default** |
+ | 15% | 0.999975 | Relaxed |
+ | 20% | 0.999970 | Discovery-focused |
+
+ Full table in `results/fdr_thresholds.csv`.
+
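If you want to apply these thresholds outside the CLI, the nearest-level lookup can be sketched as follows, using the table values above (the helper name is illustrative; `cpr search --fdr` does this lookup for you, preferring `results/fdr_thresholds.csv` when present):

```python
# Threshold table from above (paper values).
THRESHOLDS = {
    0.01: 0.999990,
    0.05: 0.999985,
    0.10: 0.999980,
    0.15: 0.999975,
    0.20: 0.999970,
}

def fdr_threshold(alpha: float) -> float:
    """Return the threshold for the closest tabulated FDR level."""
    closest = min(THRESHOLDS, key=lambda a: abs(a - alpha))
    return THRESHOLDS[closest]

print(fdr_threshold(0.1))  # prints 0.99998
```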
+ ---
+
+ ## Model Weights
+
+ ### Protein-Vec (General Protein Search)
+
+ **Option 1: Contact authors** for the `protein_vec_models.gz` archive.
+
+ **Option 2: Use pre-computed embeddings** from Zenodo (no weights needed for searching).
+
+ If you have the weights:
+ ```bash
+ tar -xzf protein_vec_models.gz
+ # Creates protein_vec_models/ directory with:
+ #   protein_vec.ckpt (804 MB)
+ #   protein_vec_params.json
+ #   aspect_vec_*.ckpt (200-400 MB each)
+ ```
+
+ ### CLEAN (Enzyme Classification)
+
+ For enzyme-specific searches, get CLEAN from: https://github.com/tttianhao/CLEAN
+
+ ---
+
+ ## Directory Structure After Setup
+
+ ```
+ conformal-protein-retrieval/
+ ├── data/
+ │   ├── lookup_embeddings.npy            ← Download from Zenodo (1.1 GB)
+ │   ├── lookup_embeddings_meta_data.tsv  ← Download from Zenodo (535 MB)
+ │   ├── pfam_new_proteins.npy            ← Download from Zenodo (2.4 GB)
+ │   └── gene_unknown/                    ← Included in GitHub
+ │       ├── unknown_aa_seqs.fasta
+ │       └── unknown_aa_seqs.npy
+ ├── protein_vec_models/                  ← Optional: for new embeddings
+ │   ├── protein_vec.ckpt
+ │   └── ...
+ ├── protein_conformal/                   ← Code (included)
+ ├── results/                             ← Your outputs go here
+ └── scripts/                             ← Helper scripts
+ ```
+
+ ---
+
+ ## Troubleshooting
+
+ ### "FileNotFoundError: data/lookup_embeddings.npy"
+ → Download from Zenodo: https://zenodo.org/records/14272215
+
+ ### "ModuleNotFoundError: No module named 'faiss'"
+ → Install FAISS: `pip install faiss-cpu` (or `faiss-gpu` for GPU)
+
+ ### "Got 58 hits, expected 59"
+ → This is expected! See `docs/REPRODUCIBILITY.md` - the result varies by ±1 due to threshold boundary effects.
+
+ ### "CUDA out of memory"
+ → Use CPU: `--device cpu` or reduce batch size with `--batch-size 16`
+
+ ---
+
+ ## What's Next?
+
+ - **Read the paper**: [Nature Communications (2025) 16:85](https://doi.org/10.1038/s41467-024-55676-y)
+ - **Explore notebooks**: `notebooks/pfam/genes_unknown.ipynb` shows the full Syn3.0 analysis
+ - **Run verification**: `cpr verify --check all` tests all paper claims
+ - **Get help**: Open an issue at https://github.com/ronboger/conformal-protein-retrieval/issues
+
+ ---
+
+ ## Summary: Files Checklist
+
+ | Source | Files | Size | Status |
+ |--------|-------|------|--------|
+ | **GitHub** | Code, test data, thresholds | ~1 MB | ✓ Included |
+ | **Zenodo** | lookup_embeddings.npy | 1.1 GB | ☐ Download |
+ | **Zenodo** | lookup_embeddings_meta_data.tsv | 535 MB | ☐ Download |
+ | **Zenodo** | pfam_new_proteins.npy | 2.4 GB | ☐ Download |
+ | **Optional** | protein_vec_models/ | 3 GB | ☐ For new embeddings |
+ | **Optional** | afdb_embeddings_protein_vec.npy | 4.7 GB | ☐ For AFDB search |
README.md CHANGED
@@ -2,27 +2,44 @@
  
  Code and notebooks from [Functional protein mining with conformal guarantees](https://www.nature.com/articles/s41467-024-55676-y) (Nature Communications, 2025). This package provides statistically rigorous methods for protein database search with false discovery rate (FDR) and false negative rate (FNR) control.
  
- All data files can be found in [our Zenodo repository](https://zenodo.org/records/14272215). Results can be reproduced through executing the data preparation notebooks in each subdirectory.
- 
- ## Installation
- 
- Clone the repository and install dependencies:
+ **[GETTING STARTED](GETTING_STARTED.md)** - Quick setup guide (10 minutes)
+ 
+ ## Quick Setup
+ 
  ```bash
+ # 1. Clone and install
  git clone https://github.com/ronboger/conformal-protein-retrieval.git
  cd conformal-protein-retrieval
  pip install -e .
+
+ # 2. Download data from Zenodo (4 GB total)
+ #    https://zenodo.org/records/14272215
+ #    → lookup_embeddings.npy (1.1 GB) → data/
+ #    → lookup_embeddings_meta_data.tsv (535 MB) → data/
+ #    → pfam_new_proteins.npy (2.4 GB) → data/
+
+ # 3. Verify setup
+ cpr verify --check syn30
+ # Expected: 59/149 = 39.6% hits at FDR α=0.1
  ```
  
- This will install the `cpr` command-line interface for embedding, search, and calibration.
- 
- ## Structure
- 
- - `./protein_conformal`: utility functions to creating confidence sets and assigning probabilities to any protein machine learning model for search
- - `./scope`: experiments pertraining to SCOPe
- - `./pfam`: notebooks demonstrating how to use our techniques to calibrate false discovery and false negative rates for different pfam classes
- - `./ec`: experiments pertraining to EC number classification on uniprot
- - `./data`: scripts and notebooks used to process data
- - `./clean_selection`: scripts and notebooks used to process data
+ See **[GETTING_STARTED.md](GETTING_STARTED.md)** for detailed instructions.
+ 
+ ## Repository Structure
+ 
+ ```
+ conformal-protein-retrieval/
+ ├── protein_conformal/       # Core library (FDR/FNR control, Venn-Abers)
+ ├── notebooks/               # Analysis notebooks organized by experiment
+ │   ├── pfam/                # Pfam domain annotation (Figure 2)
+ │   ├── scope/               # SCOPe structural classification
+ │   ├── ec/                  # EC number classification
+ │   └── clean_selection/     # CLEAN enzyme experiments (Tables 1-2)
+ ├── scripts/                 # CLI scripts and SLURM jobs
+ ├── data/                    # Data files (see GETTING_STARTED.md)
+ ├── results/                 # Pre-computed thresholds and outputs
+ └── docs/                    # Additional documentation
+ ```
  
  ## Quick Start
  
UPLOAD_CHECKLIST.md ADDED
@@ -0,0 +1,188 @@
+ # Upload Checklist: What Goes Where
+
+ This document specifies exactly what files go to GitHub vs Zenodo.
+
+ ## Summary
+
+ | Location | What | Why |
+ |----------|------|-----|
+ | **GitHub** | Code, small data (<1 MB), configs | Version control, collaboration |
+ | **Zenodo** | Large data files (>1 MB), embeddings | Long-term archival, DOI |
+ | **User obtains** | Protein-Vec model weights | Large binary, separate distribution |
+
+ ---
+
+ ## GitHub Repository (You Commit This)
+
+ ### Code & Configuration
+ ```
+ protein_conformal/           # All Python code
+ ├── __init__.py
+ ├── cli.py
+ ├── util.py
+ ├── scope_utils.py
+ ├── embed_protein_vec.py
+ ├── gradio_app.py
+ └── backend/
+
+ scripts/                     # Helper scripts
+ ├── verify_*.py
+ ├── compute_fdr_table.py
+ ├── slurm_*.sh
+ └── *.py
+
+ tests/                       # Test suite
+ notebooks/                   # Analysis notebooks
+ docs/                        # Documentation
+ ```
+
+ ### Small Data Files (<1 MB each)
+ ```
+ data/gene_unknown/
+ ├── unknown_aa_seqs.fasta              # 56 KB - JCVI Syn3.0 sequences
+ ├── unknown_aa_seqs.npy                # 299 KB - Pre-computed embeddings
+ └── jcvi_syn30_unknown_gene_hits.csv   # 61 KB - Results
+
+ results/
+ ├── fdr_thresholds.csv      # ~2 KB - Threshold lookup table
+ ├── fnr_thresholds.csv      # ~7 KB - FNR thresholds
+ └── sim2prob_lookup.csv     # ~8 KB - Probability lookup
+ ```
+
+ ### Configuration & Docs
+ ```
+ pyproject.toml
+ setup.py
+ Dockerfile
+ apptainer.def
+ README.md
+ GETTING_STARTED.md
+ DATA.md
+ CLAUDE.md
+ docs/REPRODUCIBILITY.md
+ .gitignore
+ ```
+
+ ### Model Code (NOT weights)
+ ```
+ protein_vec_models/
+ ├── model_protein_moe.py                   # Model architecture code
+ ├── utils_search.py                        # Embedding utilities
+ ├── data_protein_vec.py                    # Data loading code
+ ├── embed_structure_model.py
+ ├── model_protein_vec_single_variable.py
+ ├── train_protein_vec.py
+ ├── __init__.py
+ └── *.json                                 # Config files only
+ ```
+
+ ---
+
+ ## Zenodo Repository (You Upload This)
+
+ **Zenodo URL**: https://zenodo.org/records/14272215
+
+ ### Essential Files (Required for paper verification)
+
+ | File | Size | Description |
+ |------|------|-------------|
+ | `lookup_embeddings.npy` | **1.1 GB** | UniProt database embeddings (540K proteins) |
+ | `lookup_embeddings_meta_data.tsv` | **535 MB** | Protein metadata (names, Pfam domains, etc.) |
+ | `pfam_new_proteins.npy` | **2.4 GB** | Calibration data for FDR/probability |
+
+ ### Optional Files (For extended experiments)
+
+ | File | Size | Description |
+ |------|------|-------------|
+ | `afdb_embeddings_protein_vec.npy` | 4.7 GB | AlphaFold DB embeddings |
+ | CLEAN enzyme data | varies | For Tables 1-2 reproduction |
+ | SCOPe/DALI data | varies | For Tables 4-6 reproduction |
+
+ ---
+
+ ## User Must Obtain Separately
+
+ ### Protein-Vec Model Weights (~3 GB)
+
+ These are NOT in GitHub or Zenodo. Users get them by:
+
+ 1. **Option A**: Contact authors for `protein_vec_models.gz`
+ 2. **Option B**: Use pre-computed embeddings from Zenodo (no weights needed for searching)
+
+ Files needed if embedding new sequences:
+ ```
+ protein_vec_models/
+ ├── protein_vec.ckpt                # 804 MB - Main model
+ ├── protein_vec_params.json         # Config
+ ├── aspect_vec_*.ckpt               # 200-400 MB each - Aspect models
+ └── tm_vec_swiss_model_large.ckpt   # 391 MB
+ ```
+
+ ### CLEAN Model Weights (if using --model clean)
+
+ Get from: https://github.com/tttianhao/CLEAN
+
+ ---
+
+ ## .gitignore Must Include
+
+ ```gitignore
+ # Large data files (on Zenodo)
+ data/*.npy
+ data/*.tsv
+ data/*.pkl
+
+ # Model weights (user obtains separately)
+ protein_vec_models/*.ckpt
+ protein_vec_models.gz
+
+ # Build artifacts
+ *.sif
+ .apptainer_cache/
+ logs/
+ .claude/
+ ```
+
+ ---
+
+ ## Verification: Is Everything Set Up Correctly?
+
+ Run this after cloning + downloading:
+
+ ```bash
+ # Check GitHub files present
+ ls data/gene_unknown/unknown_aa_seqs.fasta   # Should exist
+ ls results/fdr_thresholds.csv                # Should exist
+
+ # Check Zenodo files downloaded
+ ls -lh data/lookup_embeddings.npy            # Should be ~1.1 GB
+ ls -lh data/pfam_new_proteins.npy            # Should be ~2.4 GB
+
+ # Check model weights (if embedding)
+ ls protein_vec_models/protein_vec.ckpt       # Should exist if embedding
+
+ # Run verification
+ cpr verify --check syn30
+ # Expected: 58-60/149 hits (39.6%)
+ ```
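The size checks above can also be scripted; a minimal sketch, using the documented file sizes as lower bounds (this helper is hypothetical, not shipped with the repo):

```python
from pathlib import Path

# Hypothetical helper: flag Zenodo downloads that are absent or
# smaller than the documented sizes (suggesting a truncated download).
EXPECTED_MIN_BYTES = {
    "data/lookup_embeddings.npy": int(1.0e9),            # ~1.1 GB
    "data/lookup_embeddings_meta_data.tsv": int(5.0e8),  # ~535 MB
    "data/pfam_new_proteins.npy": int(2.0e9),            # ~2.4 GB
}

def missing_or_truncated(root: str = ".") -> list:
    """Return the relative paths that fail the size check under root."""
    bad = []
    for rel, min_bytes in EXPECTED_MIN_BYTES.items():
        p = Path(root) / rel
        if not p.exists() or p.stat().st_size < min_bytes:
            bad.append(rel)
    return bad

# In a fresh clone without the Zenodo downloads, all three are reported:
print(missing_or_truncated())
```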
+
+ ---
+
+ ## For Repository Maintainers
+
+ ### When releasing a new version:
+
+ 1. **GitHub**:
+    - Commit all code changes
+    - Update `results/fdr_thresholds.csv` with new calibration
+    - Tag release: `git tag v1.x.x`
+
+ 2. **Zenodo**:
+    - Upload updated embedding files if changed
+    - Create new version linked to GitHub release
+
+ ### Files to NEVER commit to GitHub:
+ - Any `.npy` file > 1 MB
+ - Any `.ckpt` file (model weights)
+ - Any `.pkl` file > 1 MB
+ - Any `.tsv` or `.csv` > 1 MB
protein_conformal/cli.py CHANGED
@@ -205,6 +205,37 @@ def _embed_clean(sequences, device, args):
  
  
+ def _get_fdr_threshold(alpha: float) -> float:
+     """Look up FDR threshold from precomputed table."""
+     import pandas as pd
+
+     repo_root = Path(__file__).parent.parent
+     threshold_file = repo_root / "results" / "fdr_thresholds.csv"
+
+     # Fallback values if table not found (from paper)
+     FALLBACK_THRESHOLDS = {
+         0.01: 0.999992,
+         0.05: 0.999985,
+         0.10: 0.999980,
+         0.15: 0.999975,
+         0.20: 0.999970,
+     }
+
+     if threshold_file.exists():
+         try:
+             df = pd.read_csv(threshold_file)
+             # Find closest alpha in table
+             if 'alpha' in df.columns and 'threshold_mean' in df.columns:
+                 idx = (df['alpha'] - alpha).abs().idxmin()
+                 return df.loc[idx, 'threshold_mean']
+         except Exception:
+             pass
+
+     # Use fallback
+     closest_alpha = min(FALLBACK_THRESHOLDS.keys(), key=lambda x: abs(x - alpha))
+     return FALLBACK_THRESHOLDS[closest_alpha]
+
+
  def cmd_search(args):
      """Search for similar proteins using FAISS with conformal guarantees."""
      import numpy as np
@@ -236,12 +267,28 @@ def cmd_search(args):
      print(f"Querying for top {args.k} neighbors...")
      D, I = query(index, query_embeddings, args.k)
  
-     # Apply threshold if specified
-     if args.threshold:
-         print(f"Applying similarity threshold: {args.threshold}")
+     # Determine threshold from --fdr, --fnr, or --threshold
+     threshold = None
+     if args.no_filter:
+         print("No filtering (--no-filter): returning all neighbors")
+     elif args.threshold:
+         threshold = args.threshold
+         print(f"Using manual threshold: {threshold}")
+     elif args.fnr:
+         # FNR threshold (TODO: add lookup table for FNR)
+         print(f"FNR control at α={args.fnr} (using approximate threshold)")
+         threshold = 0.9999 - args.fnr * 0.001  # Rough approximation
+         print(f"  Threshold: {threshold}")
+     else:
+         # Default: FDR control
+         fdr_alpha = args.fdr if args.fdr else 0.1
+         threshold = _get_fdr_threshold(fdr_alpha)
+         print(f"FDR control at α={fdr_alpha} ({fdr_alpha*100:.0f}% FDR)")
+         print(f"  Threshold: {threshold:.10f}")
  
      # Build results
      results = []
+     n_filtered = 0
      for i in range(len(query_embeddings)):
          for j in range(args.k):
              sim = D[i, j]
@@ -249,7 +296,8 @@ def cmd_search(args):
              # Skip placeholder results (FAISS returns -1 for non-existent neighbors)
              if idx < 0:
                  continue
-             if args.threshold and sim < args.threshold:
+             if threshold is not None and sim < threshold:
+                 n_filtered += 1
                  continue
              row = {
                  'query_idx': i,
@@ -263,7 +311,143 @@ def cmd_search(args):
  
      results_df = pd.DataFrame(results)
      results_df.to_csv(args.output, index=False)
-     print(f"Saved {len(results_df)} results to {args.output}")
+
+     # Summary
+     n_queries = len(query_embeddings)
+     n_with_hits = len(results_df['query_idx'].unique()) if len(results_df) > 0 else 0
+     print(f"\nResults:")
+     print(f"  Queries: {n_queries}")
+     print(f"  Queries with confident hits: {n_with_hits} ({n_with_hits/n_queries*100:.1f}%)")
+     print(f"  Total hits: {len(results_df)}")
+     if threshold:
+         print(f"  Filtered out: {n_filtered} below threshold")
+     print(f"Saved to {args.output}")
+
+
+ def cmd_find(args):
+     """One-step search: FASTA → embeddings → search → results with probabilities."""
+     import numpy as np
+     import pandas as pd
+     import tempfile
+     from Bio import SeqIO
+     import torch
+     from protein_conformal.util import load_database, query, simplifed_venn_abers_prediction, get_sims_labels
+
+     device = torch.device('cuda' if torch.cuda.is_available() and not args.cpu else 'cpu')
+     print(f"=== CPR Find: FASTA to Annotated Results ===")
+     print(f"Device: {device}")
+     print(f"Model: {args.model}")
+     print(f"FDR level: {args.fdr*100:.0f}%")
+     print()
+
+     # Step 1: Read sequences
+     print(f"[1/5] Reading sequences from {args.input}...")
+     sequences = []
+     sequence_names = []
+     for record in SeqIO.parse(args.input, "fasta"):
+         sequences.append(str(record.seq))
+         sequence_names.append(record.id)
+     print(f"  Found {len(sequences)} sequences")
+
+     # Step 2: Embed sequences
+     print(f"\n[2/5] Computing embeddings with {args.model}...")
+     if args.model == 'protein-vec':
+         embeddings = _embed_protein_vec(sequences, device, args)
+     elif args.model == 'clean':
+         embeddings = _embed_clean(sequences, device, args)
+     else:
+         print(f"Unknown model: {args.model}")
+         sys.exit(1)
+     print(f"  Embeddings shape: {embeddings.shape}")
+
+     # Step 3: Load database
+     repo_root = Path(__file__).parent.parent
+     db_path = args.database if args.database else repo_root / "data" / "lookup_embeddings.npy"
+     meta_path = args.database_meta if args.database_meta else repo_root / "data" / "lookup_embeddings_meta_data.tsv"
+
+     print(f"\n[3/5] Loading database from {db_path}...")
+     db_embeddings = np.load(db_path)
+     print(f"  Database size: {len(db_embeddings)} proteins")
+
+     if Path(meta_path).exists():
+         if str(meta_path).endswith('.tsv'):
+             db_meta = pd.read_csv(meta_path, sep='\t')
+         else:
+             db_meta = pd.read_csv(meta_path)
+     else:
+         db_meta = None
+         print("  Warning: No metadata file found")
+
+     # Determine k (10% of database or max 10000)
+     k = min(max(100, len(db_embeddings) // 10), 10000)
+     print(f"  Using k={k} neighbors ({k/len(db_embeddings)*100:.1f}% of database)")
+
+     # Step 4: Search
+     print(f"\n[4/5] Searching...")
+     index = load_database(db_embeddings)
+     D, I = query(index, embeddings, k)
+
+     # Get threshold
+     threshold = _get_fdr_threshold(args.fdr)
+     print(f"  FDR threshold (α={args.fdr}): {threshold:.10f}")
+
+     # Step 5: Build results with probabilities
+     print(f"\n[5/5] Building results...")
+
+     # Load calibration data for probabilities
+     cal_path = args.calibration if args.calibration else repo_root / "data" / "pfam_new_proteins.npy"
+     if Path(cal_path).exists():
+         cal_data = np.load(cal_path, allow_pickle=True)
+         np.random.seed(42)
+         np.random.shuffle(cal_data)
+         cal_subset = cal_data[:100]
+         X_cal, y_cal = get_sims_labels(cal_subset, partial=False)
+         X_cal = X_cal.flatten()
+         y_cal = y_cal.flatten()
+         compute_probs = True
+     else:
+         compute_probs = False
+         print("  Warning: No calibration data, skipping probability computation")
+
+     results = []
+     for i in range(len(embeddings)):
+         for j in range(k):
+             sim = D[i, j]
+             idx = I[i, j]
+             if idx < 0 or sim < threshold:
+                 continue
+
+             row = {
+                 'query_name': sequence_names[i],
+                 'query_idx': i,
+                 'match_idx': idx,
+                 'similarity': sim,
+             }
+
+             # Add probability if calibration available
+             if compute_probs:
+                 p0, p1 = simplifed_venn_abers_prediction(X_cal, y_cal, sim)
+                 row['probability'] = (p0 + p1) / 2
+                 row['uncertainty'] = abs(p1 - p0)
+
+             # Add metadata
+             if db_meta is not None and idx < len(db_meta):
+                 for col in db_meta.columns[:5]:
+                     row[f'match_{col}'] = db_meta.iloc[idx][col]
+
+             results.append(row)
+
+     results_df = pd.DataFrame(results)
+     results_df.to_csv(args.output, index=False)
+
+     # Summary
+     n_queries = len(sequences)
+     n_with_hits = len(results_df['query_idx'].unique()) if len(results_df) > 0 else 0
+     print(f"\n=== Results ===")
+     print(f"Queries: {n_queries}")
+     print(f"Queries with confident hits: {n_with_hits} ({n_with_hits/n_queries*100:.1f}%)")
+     print(f"Total confident hits: {len(results_df)}")
+     print(f"Output: {args.output}")
  
  
  def cmd_verify(args):
@@ -431,8 +615,30 @@ def main():
      )
      subparsers = parser.add_subparsers(dest='command', help='Available commands')
  
+     # find command (one-step: FASTA → results)
+     p_find = subparsers.add_parser('find',
+         help='One-step search: FASTA → embed → search → annotated results',
+         description='The easiest way to use CPR. Give it a FASTA file and get annotated results.')
+     p_find.add_argument('--input', '-i', required=True, help='Input FASTA file with protein sequences')
+     p_find.add_argument('--output', '-o', required=True, help='Output CSV with annotated hits')
+     p_find.add_argument('--model', '-m', default='protein-vec',
+                         choices=['protein-vec', 'clean'],
+                         help='Embedding model (default: protein-vec)')
+     p_find.add_argument('--fdr', type=float, default=0.1,
+                         help='False discovery rate level (default: 0.1 = 10%% FDR)')
+     p_find.add_argument('--database', '-d',
+                         help='Database embeddings (default: data/lookup_embeddings.npy)')
+     p_find.add_argument('--database-meta',
+                         help='Database metadata (default: data/lookup_embeddings_meta_data.tsv)')
+     p_find.add_argument('--calibration', '-c',
+                         help='Calibration data for probabilities (default: data/pfam_new_proteins.npy)')
+     p_find.add_argument('--cpu', action='store_true', help='Force CPU even if GPU available')
+     p_find.add_argument('--clean-model', default='split100',
+                         help='CLEAN model variant (default: split100)')
+     p_find.set_defaults(func=cmd_find)
+
      # embed command
-     p_embed = subparsers.add_parser('embed', help='Embed protein sequences')
+     p_embed = subparsers.add_parser('embed', help='Embed protein sequences (step 1 of manual workflow)')
      p_embed.add_argument('--input', '-i', required=True, help='Input FASTA file')
      p_embed.add_argument('--output', '-o', required=True, help='Output .npy file for embeddings')
      p_embed.add_argument('--model', '-m', default='protein-vec',
@@ -449,8 +655,20 @@ def main():
      p_search.add_argument('--database', '-d', required=True, help='Database embeddings (.npy)')
      p_search.add_argument('--database-meta', '-m', help='Database metadata (.tsv or .csv)')
      p_search.add_argument('--output', '-o', required=True, help='Output results (.csv)')
-     p_search.add_argument('--k', type=int, default=10, help='Number of neighbors (default: 10)')
-     p_search.add_argument('--threshold', '-t', type=float, help='Similarity threshold (e.g., 0.99998 for FDR α=0.1)')
+     p_search.add_argument('--k', type=int, default=100,
+                           help='Max neighbors per query (default: 100)')
+     # FDR/FNR control options
+     p_search.add_argument('--fdr', type=float, default=0.1,
+                           help='False discovery rate level (default: 0.1 = 10%% FDR). '
+                                'Automatically looks up threshold from results/fdr_thresholds.csv')
+     p_search.add_argument('--fnr', type=float,
+                           help='False negative rate level (alternative to --fdr). '
+                                'Use this when you want to control missed true matches.')
+     p_search.add_argument('--threshold', '-t', type=float,
+                           help='Manual similarity threshold (overrides --fdr/--fnr). '
+                                'Use this if you have a custom threshold.')
+     p_search.add_argument('--no-filter', action='store_true',
+                           help='Return all neighbors without filtering (for exploration)')
      p_search.set_defaults(func=cmd_search)
  
      # verify command
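For readers skimming the diff: the per-hit filtering loop in `cmd_search` boils down to a boolean mask over FAISS's `(D, I)` outputs. A vectorized sketch with synthetic data (not part of the CLI):

```python
import numpy as np

# Synthetic FAISS-style outputs: 3 queries x 4 neighbors.
# I uses -1 as the placeholder for missing neighbors, as FAISS does.
D = np.array([[0.999985, 0.999970, 0.5, 0.4],
              [0.999999, 0.2,      0.1, 0.05],
              [0.3,      0.2,      0.1, 0.05]])
I = np.array([[10, 11, 12, -1],
              [20, 21, -1, -1],
              [30, 31, 32, 33]])

threshold = 0.999980  # 10% FDR threshold from the lookup table

# Same filter as the cmd_search loop: valid index AND similarity >= threshold
keep = (I >= 0) & (D >= threshold)
q_idx, _ = np.nonzero(keep)
hits = list(zip(q_idx.tolist(), I[keep].tolist(), D[keep].tolist()))
print(hits)  # [(0, 10, 0.999985), (1, 20, 0.999999)]
```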