# Getting Started with CPR

This guide will get you from zero to running protein searches with conformal guarantees.

## Statistical Guarantees

CPR provides rigorous statistical guarantees based on conformal prediction:

| Guarantee | Meaning | How to Use |
|-----------|---------|------------|
| **Expected Marginal FDR ≤ α** | On average, at most an α fraction of your hits are false positives | Use `--fdr 0.1` for 10% expected FDR |
| **FNR Control** | Controls the expected fraction of true matches you miss | Use `--fnr 0.1` to miss ≤10% of true hits |
| **Calibrated Probabilities** | Venn-Abers calibration provides valid probability estimates | Output includes `probability` column |

**Key insight**: Unlike p-values or arbitrary thresholds, our FDR guarantees are *marginal*: they hold in expectation across all queries, not separately for each individual query. See the [paper](https://doi.org/10.1038/s41467-024-55676-y) for theoretical details.
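
To make the FDR row concrete, here is a minimal Python sketch of the quantity being controlled: the false discovery proportion (FDP) of the hits returned at a threshold λ. The function name, scores, and labels are made up for illustration and are not part of the CPR API. The guarantee concerns the expectation of this quantity, so an individual run can land above or below α.

```python
def false_discovery_proportion(similarities, is_true_match, threshold):
    """Fraction of returned hits (similarity >= threshold) that are false positives."""
    hits = [match for sim, match in zip(similarities, is_true_match) if sim >= threshold]
    if not hits:
        return 0.0  # convention: no hits means no false discoveries
    return sum(1 for match in hits if not match) / len(hits)

# Toy data (hypothetical): five query-hit pairs with known ground truth.
sims = [0.99999, 0.99995, 0.99990, 0.99985, 0.99900]
truth = [True, True, False, True, False]

# Four pairs pass the threshold and one of them is a false positive.
fdp = false_discovery_proportion(sims, truth, threshold=0.99985)
print(fdp)
```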

---

## Quick Start

```bash
# 1. Clone and install
git clone https://github.com/ronboger/conformal-protein-retrieval.git
cd conformal-protein-retrieval
pip install -e .

# 2. Download required data (see wget commands below)

# 3. Search with your sequences (FASTA or embeddings)
cpr search --input your_sequences.fasta --output results.csv --fdr 0.1
```

---

## What You Need

### Already Included (GitHub clone)

| File | Size | Description |
|------|------|-------------|
| `data/gene_unknown/unknown_aa_seqs.fasta` | 56 KB | JCVI Syn3.0 test sequences (149 proteins) |
| `data/gene_unknown/unknown_aa_seqs.npy` | 299 KB | Pre-computed embeddings for test sequences |
| `results/fdr_thresholds.csv` | ~2 KB | FDR thresholds at standard alpha levels |
| `protein_conformal/*.py` | ~100 KB | All the code |

### Download from Zenodo (Required)

**Zenodo URL**: https://zenodo.org/records/14272215

```bash
# Download all required files with wget
cd data/

# Database embeddings (1.1 GB) - 540K UniProt protein embeddings
wget "https://zenodo.org/records/14272215/files/lookup_embeddings.npy?download=1" -O lookup_embeddings.npy

# Database metadata (535 MB) - protein names, Pfam domains, etc.
wget "https://zenodo.org/records/14272215/files/lookup_embeddings_meta_data.tsv?download=1" -O lookup_embeddings_meta_data.tsv

# Calibration data (2.4 GB) - Pfam data for FDR/probability computation
wget "https://zenodo.org/records/14272215/files/pfam_new_proteins.npy?download=1" -O pfam_new_proteins.npy

# Verify downloads
ls -lh lookup_embeddings.npy lookup_embeddings_meta_data.tsv pfam_new_proteins.npy
# Expected: 1.1G, 535M, 2.4G
```

Or with curl:
```bash
cd data/
curl -L -o lookup_embeddings.npy "https://zenodo.org/records/14272215/files/lookup_embeddings.npy?download=1"
curl -L -o lookup_embeddings_meta_data.tsv "https://zenodo.org/records/14272215/files/lookup_embeddings_meta_data.tsv?download=1"
curl -L -o pfam_new_proteins.npy "https://zenodo.org/records/14272215/files/pfam_new_proteins.npy?download=1"
```

### Protein-Vec Model Weights (Required for embedding new sequences)

If you want to embed new FASTA sequences (not just use pre-computed embeddings), download the model weights:

**Zenodo URL**: https://zenodo.org/records/18478696

```bash
# Download and extract Protein-Vec model weights (2.9 GB compressed)
wget "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1" -O protein_vec_models.gz

# Extract to protein_vec_models/ directory
tar -xzf protein_vec_models.gz

# Verify extraction
ls protein_vec_models/
# Expected: protein_vec.ckpt, protein_vec_params.json, aspect_vec_*.ckpt, etc.
```

Or with curl:
```bash
curl -L -o protein_vec_models.gz "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1"
tar -xzf protein_vec_models.gz
```

### Other Optional Downloads

| File | Size | When you need it |
|------|------|------------------|
| `afdb_embeddings_protein_vec.npy` | 4.7 GB | Searching AlphaFold Database |
| CLEAN model weights | ~1 GB | Enzyme classification with CLEAN |

---

## CLI Commands

### `cpr search` - Search with Conformal Guarantees

The main command for protein search. Accepts both FASTA files and pre-computed embeddings:

```bash
# From FASTA (embeds automatically using Protein-Vec)
cpr search --input proteins.fasta --output results.csv --fdr 0.1

# From pre-computed embeddings
cpr search --input embeddings.npy --output results.csv --fdr 0.1
```

When given a FASTA file, `cpr search` will:
1. Embed your sequences using Protein-Vec (or CLEAN with `--model clean`)
2. Search the UniProt database (540K proteins)
3. Filter to confident hits at your specified FDR
4. Add calibrated probability estimates
5. Include Pfam/functional annotations
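
Steps 2 and 3 above can be pictured with a small Python sketch. This is an illustration under stated assumptions, not CPR's implementation: brute-force NumPy stands in for whatever index CPR uses internally, and `search_and_filter`, the toy vectors, and the threshold are all hypothetical.

```python
import numpy as np

def search_and_filter(queries, database, threshold, k=3):
    """Return (query_idx, db_idx, similarity) for top-k neighbors at or above threshold."""
    # L2-normalize rows so that the inner product equals cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = database / np.linalg.norm(database, axis=1, keepdims=True)
    sims = q @ d.T                                # shape: (n_queries, n_db)
    k = min(k, d.shape[0])
    top_idx = np.argsort(-sims, axis=1)[:, :k]    # k best database rows per query
    hits = []
    for qi, row in enumerate(top_idx):
        for di in row:
            if sims[qi, di] >= threshold:         # keep only confident hits
                hits.append((qi, int(di), float(sims[qi, di])))
    return hits

# Toy example with random vectors (hypothetical data, tiny dimensions).
rng = np.random.default_rng(0)
db = rng.normal(size=(5, 8))
# Query 0 is a scaled copy of database row 2; query 1 is unrelated noise.
qs = np.vstack([db[2] * 3.0, rng.normal(size=8)])
hits = search_and_filter(qs, db, threshold=0.999)
print(hits)
```

Because cosine similarity ignores vector length, the scaled copy still matches its source row with similarity 1 (up to rounding), while the unrelated query returns nothing at this threshold.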

**More examples:**

```bash
# With FNR control instead (control false negatives)
cpr search --input proteins.fasta --output results.csv --fnr 0.1

# With a specific threshold you've computed
cpr search --input proteins.fasta --output results.csv --threshold 0.999980

# Use CLEAN model for enzyme classification
cpr search --input enzymes.fasta --output results.csv --model clean --fdr 0.1

# Exploratory: get all neighbors without filtering
cpr search --input proteins.fasta --output results.csv --no-filter
```

**Threshold options** (mutually exclusive):
- `--fdr ALPHA`: Look up threshold for target FDR level (e.g., `--fdr 0.1` for 10% FDR)
- `--fnr ALPHA`: Look up threshold for target FNR level
- `--threshold VALUE`: Use a specific similarity threshold you provide
- `--no-filter`: Return all k nearest neighbors without filtering

### `cpr embed` - Generate Embeddings

Convert FASTA sequences to embeddings:

```bash
# Using Protein-Vec (default, general-purpose)
cpr embed --input proteins.fasta --output embeddings.npy --model protein-vec

# Using CLEAN (enzyme-specific)
cpr embed --input enzymes.fasta --output embeddings.npy --model clean
```

### `cpr verify` - Verify Paper Results

```bash
cpr verify --check syn30    # Verify JCVI Syn3.0 result (39.6% annotation)
cpr verify --check all      # Run all verification checks
```

### Test with Included Data

The repo includes JCVI Syn3.0 sequences for testing:

```bash
# Test search with included FASTA (requires Zenodo data downloaded)
cpr search --input data/gene_unknown/unknown_aa_seqs.fasta --output test_results.csv --fdr 0.1

# Or use pre-computed embeddings (faster, no model weights needed)
cpr search --input data/gene_unknown/unknown_aa_seqs.npy \
           --database data/lookup_embeddings.npy \
           --output test_results.csv --fdr 0.1

# Expected: ~59 hits (39.6% of 149 sequences)
```

---

## FDR/FNR Threshold Reference

These thresholds control the trade-off between hits and false positives.

### FDR Thresholds (False Discovery Rate)

Controls the expected fraction of hits that are false positives.

| α Level | Threshold (λ) | Std Dev | Use Case |
|---------|---------------|---------|----------|
| **0.1** | **0.9999801** | ±1.7e-06 | **Paper default** |

**Note**: The FDR threshold at α=0.1 agrees with the paper's value (0.9999802) to within the reported standard deviation. Additional alpha levels can be computed with `scripts/compute_fdr_table.py`.
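
For intuition about where such a threshold comes from, the sketch below picks one from labeled calibration pairs by scanning scores from the top down and stopping at the first point where the empirical false discovery proportion exceeds α. This is a deliberately naive stand-in: it omits the finite-sample correction that gives the formal guarantee, it assumes distinct scores, and it is not claimed to match what `scripts/compute_fdr_table.py` does. The function name and data are hypothetical.

```python
def naive_fdr_threshold(cal_sims, cal_is_match, alpha):
    """Lowest score, scanning from the top, at which the empirical FDP is still <= alpha."""
    pairs = sorted(zip(cal_sims, cal_is_match), key=lambda p: p[0], reverse=True)
    chosen = None
    false_hits = 0
    for n_kept, (sim, is_match) in enumerate(pairs, start=1):
        if not is_match:
            false_hits += 1
        if false_hits / n_kept > alpha:
            break            # stop at the first violation (conservative)
        chosen = sim
    return chosen            # None if even the top-scoring pair violates alpha

# Toy calibration set (hypothetical scores and labels).
cal_sims = [0.9, 0.8, 0.7, 0.6, 0.5]
cal_is_match = [True, True, False, True, False]
lam = naive_fdr_threshold(cal_sims, cal_is_match, alpha=0.25)
print(lam)
```

Stopping at the first violation means the scan never reaches 0.6, even though the proportion there is back at 0.25. FDP is not monotone in the threshold, which is one reason a principled procedure is needed in practice.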

### FNR Thresholds (False Negative Rate) - Exact Match

Controls the expected fraction of true matches you miss. "Exact match" requires all Pfam domains to match.

| α Level | Threshold (λ) | Std Dev | Use Case |
|---------|---------------|---------|----------|
| 0.001 | 0.9997904 | ±2.3e-05 | Ultra-stringent |
| 0.005 | 0.9998338 | ±8.2e-06 | Very stringent |
| 0.01 | 0.9998495 | ±5.5e-06 | Stringent |
| 0.02 | 0.9998679 | ±5.1e-06 | Moderate |
| 0.05 | 0.9998899 | ±3.3e-06 | Balanced |
| **0.1** | **0.9999076** | ±2.2e-06 | **Recommended** |
| 0.15 | 0.9999174 | ±1.4e-06 | Relaxed |
| 0.2 | 0.9999245 | ±1.3e-06 | Discovery-focused |

### FNR Thresholds - Partial Match

"Partial match" requires at least one Pfam domain to match (more permissive).

| α Level | Threshold (λ) | Std Dev | Use Case |
|---------|---------------|---------|----------|
| 0.001 | 0.9997646 | ±1.5e-06 | Ultra-stringent |
| 0.005 | 0.9997821 | ±2.8e-06 | Very stringent |
| 0.01 | 0.9997946 | ±3.1e-06 | Stringent |
| 0.02 | 0.9998108 | ±3.5e-06 | Moderate |
| 0.05 | 0.9998389 | ±3.0e-06 | Balanced |
| **0.1** | **0.9998626** | ±2.8e-06 | **Recommended** |
| 0.15 | 0.9998779 | ±2.2e-06 | Relaxed |
| 0.2 | 0.9998903 | ±2.1e-06 | Discovery-focused |

The full computed tables, including min/max values, are in `results/fdr_thresholds.csv`, `results/fnr_thresholds.csv`, and `results/fnr_thresholds_partial.csv`.
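
If you want to apply one of the tabulated thresholds yourself (for example via `--threshold`), a lookup can be sketched as below. The values are copied from the exact-match FNR table above; `lookup_threshold` is a hypothetical helper, not a CPR function. When the requested α is not tabulated, it falls back to the largest tabulated level that does not exceed it, which errs on the conservative side.

```python
# FNR thresholds (exact match), copied from the table above.
FNR_EXACT = {
    0.001: 0.9997904, 0.005: 0.9998338, 0.01: 0.9998495, 0.02: 0.9998679,
    0.05: 0.9998899, 0.1: 0.9999076, 0.15: 0.9999174, 0.2: 0.9999245,
}

def lookup_threshold(table, alpha):
    """Threshold for the largest tabulated alpha level that is <= the requested alpha."""
    candidates = [a for a in table if a <= alpha]
    if not candidates:
        raise ValueError(f"no tabulated alpha level at or below {alpha}")
    return table[max(candidates)]

# alpha=0.12 is not tabulated, so the alpha=0.1 threshold is used.
lam = lookup_threshold(FNR_EXACT, 0.12)

# Hypothetical similarity scores; keep those at or above the threshold.
scores = [0.9999500, 0.9999076, 0.9998000]
kept = [s for s in scores if s >= lam]
print(lam, kept)
```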

---

## CLEAN Enzyme Classification

For enzyme-specific searches with EC number predictions:

### Setup

```bash
# 1. Clone CLEAN repository with pretrained weights
git clone https://github.com/tttianhao/CLEAN.git CLEAN_repo

# 2. Install CLEAN and dependencies
cd CLEAN_repo
pip install -e .
pip install "fair-esm>=2.0.0"  # quoted so the shell does not treat >= as a redirect
cd ..

# 3. Verify weights are present
ls CLEAN_repo/app/data/pretrained/
# Expected: 100.pt (123 MB), 70.pt (40 MB), split100.pth, split70.pth
```

**Note**: CLEAN uses ESM-1b embeddings internally (computed automatically). The model produces 128-dimensional embeddings (vs 1024 for Protein-Vec).

### Usage with CPR

```bash
# Generate CLEAN embeddings (128-dim) - requires GPU
cpr embed --input enzymes.fasta --output clean_embeddings.npy --model clean

# Search with CLEAN model
cpr search --input enzymes.fasta --output enzyme_results.csv --model clean --fdr 0.1
```

### Verify CLEAN Results (Paper Tables 1-2)

```bash
python scripts/verify_clean.py

# Expected output:
# Mean test loss: 0.97 ± 0.XX
# ✓ VERIFICATION PASSED - Risk controlled at α=1.0
```

---

## DALI Structural Prefiltering

For structural homology search (DALI + AFDB), we use z-score thresholds:

| Metric | Value | Description |
|--------|-------|-------------|
| **elbow_z** | **~5.1** | Z-score threshold for prefiltering |
| TPR | 81.8% | True Positive Rate at elbow threshold |
| FNR | 18.2% | False Negative Rate (miss rate) |
| DB Reduction | 31.5% | Fraction of database filtered out |

Pre-computed results are in `results/dali_thresholds.csv` (73 trials from paper experiments).

**Usage**: When running DALI, filter candidates with z-score ≥ 5.1 to achieve ~82% TPR while reducing database size by ~31%.
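
The filtering rule above amounts to a one-line comparison. A minimal sketch, assuming candidates arrive as `(structure_id, z_score)` pairs; the identifiers and scores below are invented, and the reduction printed for this toy list says nothing about the 31.5% figure in the table.

```python
ELBOW_Z = 5.1  # approximate elbow threshold from the table above

def prefilter_by_zscore(candidates, z_threshold=ELBOW_Z):
    """Keep (structure_id, z_score) pairs whose z-score is at or above the threshold."""
    return [(sid, z) for sid, z in candidates if z >= z_threshold]

# Hypothetical candidates; real IDs and scores come from your own DALI run.
candidates = [
    ("structure_a", 12.3),
    ("structure_b", 5.1),
    ("structure_c", 4.9),
    ("structure_d", 2.0),
]
kept = prefilter_by_zscore(candidates)
reduction = 1 - len(kept) / len(candidates)
print(kept, reduction)
```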

---

## Legacy Scripts

These scripts from the original paper analysis can be used for advanced workflows:

### FDR/FNR Threshold Computation

```bash
# Compute FDR thresholds at custom alpha levels
python scripts/compute_fdr_table.py \
    --calibration data/pfam_new_proteins.npy \
    --output results/my_fdr_thresholds.csv \
    --n-trials 100 \
    --alpha-levels 0.01,0.05,0.1,0.2

# Compute FNR thresholds
python scripts/compute_fnr_table.py \
    --calibration data/pfam_new_proteins.npy \
    --output results/my_fnr_thresholds.csv \
    --n-trials 100

# Use partial matches (at least one Pfam domain matches)
python scripts/compute_fdr_table.py --partial ...
```

### Verification Scripts

```bash
# Verify JCVI Syn3.0 annotation (Paper Figure 2A)
python scripts/verify_syn30.py

# Verify DALI prefiltering (Paper Tables 4-6)
python scripts/verify_dali.py

# Verify CLEAN enzyme classification (Paper Tables 1-2)
python scripts/verify_clean.py

# Verify FDR algorithm correctness
python scripts/verify_fdr_algorithm.py
```

### Probability Computation

```bash
# Precompute SVA probabilities for a database
python scripts/precompute_SVA_probs.py \
    --calibration data/pfam_new_proteins.npy \
    --output data/sva_probabilities.csv

# Get probabilities for search results
python scripts/get_probs.py \
    --input results.csv \
    --calibration data/pfam_new_proteins.npy \
    --output results_with_probs.csv
```

### Original Paper Scripts (in `scripts/pfam/`)

```bash
# Original FDR threshold generation (paper methodology)
python scripts/pfam/generate_fdr.py

# Original FNR threshold generation
python scripts/pfam/generate_fnr.py

# SVA reliability analysis
python scripts/pfam/sva_results.py
```

---

## Docker / Container Usage

Run CPR without installing dependencies locally:

### Docker

```bash
# Build the image
docker build -t cpr:latest .

# Run with your data mounted
docker run -it --rm \
    -v "$(pwd)/data:/workspace/data" \
    -v "$(pwd)/protein_vec_models:/workspace/protein_vec_models" \
    -v "$(pwd)/results:/workspace/results" \
    cpr:latest bash

# Inside container: run searches
cpr search --input data/your_sequences.fasta --output results/hits.csv --fdr 0.1

# Or launch the Gradio web interface (run on the host, not inside the container)
docker run -p 7860:7860 \
    -v "$(pwd)/data:/workspace/data" \
    cpr:latest
# Then open http://localhost:7860
```

### Docker Compose

```bash
# Start the Gradio web interface
docker-compose up

# Access at http://localhost:7860
```

### Apptainer (HPC clusters)

```bash
# Build the container
apptainer build cpr.sif apptainer.def

# Run a search
apptainer exec --nv cpr.sif cpr search \
    --input data/sequences.fasta \
    --output results/hits.csv \
    --fdr 0.1

# Interactive shell
apptainer shell --nv cpr.sif
```

**Note**: Use `--nv` flag for GPU support on NVIDIA systems.

---

## Troubleshooting

### "FileNotFoundError: data/lookup_embeddings.npy"
→ Download from Zenodo (see wget commands above)

### "ModuleNotFoundError: No module named 'faiss'"
→ Install FAISS: `pip install faiss-cpu` (or `conda install -c pytorch faiss-gpu` for GPU)

### "Got 58 hits, expected 59"
→ This is expected. The hit count can vary by ±1 because of threshold boundary effects; see `docs/REPRODUCIBILITY.md`.

### "CUDA out of memory"
→ Use CPU: `--cpu` flag or reduce batch size

### "ModuleNotFoundError: No module named 'fair_esm'"
→ For CLEAN embeddings: `pip install fair-esm`

---

## Output Columns

Search results include:

| Column | Description |
|--------|-------------|
| `query_name` | Your sequence ID from FASTA |
| `similarity` | Cosine similarity score |
| `probability` | Calibrated probability of functional match |
| `uncertainty` | Venn-Abers uncertainty interval |
| `match_name` | Matched protein name |
| `match_pfam` | Pfam domain annotations |

---

## What's Next?

- **Read the paper**: [Nature Communications (2025) 16:85](https://doi.org/10.1038/s41467-024-55676-y)
- **Explore notebooks**: `notebooks/pfam/genes_unknown.ipynb` shows the full Syn3.0 analysis
- **Run verification**: `cpr verify --check all` tests all paper claims
- **Get help**: Open an issue at https://github.com/ronboger/conformal-protein-retrieval/issues

---

## Files Checklist

| Source | Files | Size | Status |
|--------|-------|------|--------|
| **GitHub** | Code, test data, thresholds | ~1 MB | ✓ Included |
| **Zenodo** | lookup_embeddings.npy | 1.1 GB | ☐ Download |
| **Zenodo** | lookup_embeddings_meta_data.tsv | 535 MB | ☐ Download |
| **Zenodo** | pfam_new_proteins.npy | 2.4 GB | ☐ Download |
| **Optional** | protein_vec_models/ | 3 GB | ☐ For new embeddings |
| **Optional** | afdb_embeddings_protein_vec.npy | 4.7 GB | ☐ For AFDB search |