ronboger and Claude Opus 4.5 committed
Commit 7453ae1 · 1 Parent(s): f7a768a

feat: add one-step cpr find command and FDR-level search


Major improvements to make CPR easier to use:

## New Features
- Add `cpr find` command: FASTA → annotated results in ONE step
  - Automatically embeds sequences
  - Searches the database with FDR control
  - Adds calibrated probabilities
  - Returns annotated hits

- Add `--fdr` and `--fnr` options to `cpr search`
  - Specify the FDR level directly (e.g., `--fdr 0.1` for 10% FDR)
  - Automatically looks up the threshold from `results/fdr_thresholds.csv`
  - Falls back to paper values if the table is not found

- Improved search defaults
  - `k` now defaults to 10% of the database size (min 100, max 10000)
  - Better result summary with hit rates
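The new `k` default is a clamped fraction of the database size; a minimal sketch (the helper name `default_k` is illustrative, not part of the CLI):

```python
def default_k(db_size: int) -> int:
    """k defaults to 10% of the database, clamped to [100, 10000]."""
    return min(max(100, db_size // 10), 10000)

print(default_k(500))      # prints 100 (floor for small databases)
print(default_k(540_000))  # prints 10000 (cap at UniProt scale)
```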

## Documentation
- Add GETTING_STARTED.md: 10-minute quickstart guide
- Add UPLOAD_CHECKLIST.md: What goes to GitHub vs Zenodo
- Update README.md with simpler workflow

## Usage Example
```bash
# One command from FASTA to annotated results:
cpr find --input proteins.fasta --output results.csv --fdr 0.1
```

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Files changed (4)
  1. GETTING_STARTED.md +278 -0
  2. README.md +28 -11
  3. UPLOAD_CHECKLIST.md +188 -0
  4. protein_conformal/cli.py +226 -8
GETTING_STARTED.md ADDED
@@ -0,0 +1,278 @@
+ # Getting Started with CPR
+
+ This guide will get you from zero to running protein searches with conformal guarantees in under 10 minutes.
+
+ ## TL;DR (Easiest Path)
+
+ ```bash
+ # 1. Clone and install
+ git clone https://github.com/ronboger/conformal-protein-retrieval.git
+ cd conformal-protein-retrieval
+ pip install -e .
+
+ # 2. Download data (4 GB total) from https://zenodo.org/records/14272215:
+ #    → lookup_embeddings.npy (1.1 GB) → data/
+ #    → lookup_embeddings_meta_data.tsv (535 MB) → data/
+ #    → pfam_new_proteins.npy (2.4 GB) → data/
+
+ # 3. Get Protein-Vec model weights (contact authors or see below)
+ #    → Extract protein_vec_models.gz to protein_vec_models/
+
+ # 4. Run search on your sequences (ONE COMMAND!)
+ cpr find --input your_sequences.fasta --output results.csv --fdr 0.1
+
+ # That's it! results.csv contains:
+ #    - Functional annotations for each protein
+ #    - Calibrated probabilities
+ #    - Uncertainty estimates
+ ```
+
+ ### Don't have model weights? Use pre-computed embeddings:
+
+ ```bash
+ # If you already have embeddings (.npy), skip to search:
+ cpr search --query your_embeddings.npy --database data/lookup_embeddings.npy --output results.csv --fdr 0.1
+ ```
+
+ ---
+
+ ## What You Need
+
+ ### Already Included (GitHub clone)
+
+ When you clone the repository, you automatically get:
+
+ | File | Size | Description |
+ |------|------|-------------|
+ | `data/gene_unknown/unknown_aa_seqs.fasta` | 56 KB | JCVI Syn3.0 test sequences (149 proteins) |
+ | `data/gene_unknown/unknown_aa_seqs.npy` | 299 KB | Pre-computed embeddings for test sequences |
+ | `results/fdr_thresholds.csv` | ~2 KB | FDR thresholds at standard alpha levels |
+ | `protein_conformal/*.py` | ~100 KB | All the code |
+
+ ### Download from Zenodo (Required)
+
+ Download these from **https://zenodo.org/records/14272215**:
+
+ | File | Size | What it is | Where to put it |
+ |------|------|------------|-----------------|
+ | `lookup_embeddings.npy` | **1.1 GB** | UniProt database (540K protein embeddings) | `data/` |
+ | `lookup_embeddings_meta_data.tsv` | **535 MB** | Protein metadata (names, Pfam domains, etc.) | `data/` |
+ | `pfam_new_proteins.npy` | **2.4 GB** | Calibration data for FDR/probability computation | `data/` |
+
+ **Total download: ~4 GB**
+
+ ### Optional Downloads
+
+ | File | Size | When you need it |
+ |------|------|------------------|
+ | `afdb_embeddings_protein_vec.npy` | 4.7 GB | Searching the AlphaFold Database |
+ | Protein-Vec model weights | 3 GB | Computing new embeddings from FASTA |
+ | CLEAN model weights | 1 GB | Enzyme classification with CLEAN |
+
+ ---
+
+ ## Step-by-Step Setup
+
+ ### Step 1: Clone and Install
+
+ ```bash
+ git clone https://github.com/ronboger/conformal-protein-retrieval.git
+ cd conformal-protein-retrieval
+ pip install -e .
+ ```
+
+ **Verify installation:**
+ ```bash
+ cpr --help
+ # Should show: find, embed, search, prob, calibrate, verify commands
+ ```
+
+ ### Step 2: Download Data
+
+ Go to **https://zenodo.org/records/14272215** and download:
+
+ 1. `lookup_embeddings.npy` (1.1 GB)
+ 2. `lookup_embeddings_meta_data.tsv` (535 MB)
+ 3. `pfam_new_proteins.npy` (2.4 GB)
+
+ Move them to the `data/` directory:
+
+ ```bash
+ mv ~/Downloads/lookup_embeddings.npy data/
+ mv ~/Downloads/lookup_embeddings_meta_data.tsv data/
+ mv ~/Downloads/pfam_new_proteins.npy data/
+ ```
+
+ **Verify files:**
+ ```bash
+ ls -lh data/*.npy data/*.tsv
+ # lookup_embeddings.npy            1.1G
+ # lookup_embeddings_meta_data.tsv  535M
+ # pfam_new_proteins.npy            2.4G
+ ```
+
+ ### Step 3: Verify Setup
+
+ ```bash
+ cpr verify --check syn30
+ ```
+
+ **Expected output:**
+ ```
+ JCVI Syn3.0 Annotation Verification
+ Total queries: 149
+ Confident hits: 59 (might be 58-60, see docs/REPRODUCIBILITY.md)
+ Hit rate: 39.6%
+ FDR threshold: λ = 0.999980225003
+ ✓ VERIFICATION PASSED
+ ```
+
+ ---
+
+ ## Your First Search
+
+ ### Easiest: One Command from FASTA (Recommended)
+
+ ```bash
+ cpr find --input your_proteins.fasta --output results.csv --fdr 0.1
+ ```
+
+ This single command:
+ 1. Embeds your sequences using Protein-Vec
+ 2. Searches the UniProt database (540K proteins)
+ 3. Filters to confident hits at 10% FDR
+ 4. Adds calibrated probability estimates
+ 5. Includes Pfam/functional annotations
+
+ **Output columns:**
+ - `query_name`: Your sequence ID from FASTA
+ - `similarity`: Cosine similarity score
+ - `probability`: Calibrated probability of functional match
+ - `uncertainty`: Venn-Abers uncertainty interval
+ - `match_*`: Pfam domains, protein names, etc.
+
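The `probability` and `uncertainty` columns collapse the two Venn-Abers bounds (p0, p1) into a midpoint and an interval width. A minimal sketch with illustrative bounds (the helper name is hypothetical; the combination matches what `cpr find` computes per hit):

```python
def combine_venn_abers(p0: float, p1: float):
    """Collapse a Venn-Abers interval [p0, p1] into the midpoint
    estimate and interval width reported per hit."""
    probability = (p0 + p1) / 2
    uncertainty = abs(p1 - p0)
    return probability, uncertainty

# Illustrative bounds, not real calibration output:
prob, unc = combine_venn_abers(0.90, 0.96)
print(round(prob, 3), round(unc, 3))  # prints: 0.93 0.06
```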
+ ### Control FDR Level
+
+ ```bash
+ # Stringent: 1% FDR (fewer but more confident hits)
+ cpr find --input proteins.fasta --output results.csv --fdr 0.01
+
+ # Default: 10% FDR (balanced)
+ cpr find --input proteins.fasta --output results.csv --fdr 0.1
+
+ # Discovery: 20% FDR (more hits, some false positives)
+ cpr find --input proteins.fasta --output results.csv --fdr 0.2
+ ```
+
+ ### Alternative: Manual Workflow (Advanced)
+
+ If you need more control or already have embeddings:
+
+ ```bash
+ # Step 1: Embed (if starting from FASTA)
+ cpr embed --input seqs.fasta --output embeddings.npy --model protein-vec
+
+ # Step 2: Search with FDR control
+ cpr search --query embeddings.npy --database data/lookup_embeddings.npy --output hits.csv --fdr 0.1
+
+ # Step 3: Add probabilities (optional, for detailed analysis)
+ cpr prob --input hits.csv --output hits_with_probs.csv
+ ```
+
+ ---
+
+ ## FDR Threshold Reference
+
+ Use these thresholds for your desired false discovery rate:
+
+ | FDR Level | Threshold (λ) | Use Case |
+ |-----------|---------------|----------|
+ | 1% | 0.999990 | Very stringent |
+ | 5% | 0.999985 | Stringent |
+ | **10%** | **0.999980** | **Paper default** |
+ | 15% | 0.999975 | Relaxed |
+ | 20% | 0.999970 | Discovery-focused |
+
+ Full table in `results/fdr_thresholds.csv`.
+
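If you want to apply these thresholds outside the CLI, the nearest-level lookup can be sketched as follows, using the table values above (the helper name is illustrative; `cpr search --fdr` does this lookup for you, preferring `results/fdr_thresholds.csv` when present):

```python
# Threshold table from above (paper values).
THRESHOLDS = {
    0.01: 0.999990,
    0.05: 0.999985,
    0.10: 0.999980,
    0.15: 0.999975,
    0.20: 0.999970,
}

def fdr_threshold(alpha: float) -> float:
    """Return the threshold for the closest tabulated FDR level."""
    closest = min(THRESHOLDS, key=lambda a: abs(a - alpha))
    return THRESHOLDS[closest]

print(fdr_threshold(0.1))  # prints 0.99998
```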
+ ---
+
+ ## Model Weights
+
+ ### Protein-Vec (General Protein Search)
+
+ **Option 1: Contact authors** for the `protein_vec_models.gz` archive.
+
+ **Option 2: Use pre-computed embeddings** from Zenodo (no weights needed for searching).
+
+ If you have the weights:
+ ```bash
+ tar -xzf protein_vec_models.gz
+ # Creates protein_vec_models/ directory with:
+ #   protein_vec.ckpt (804 MB)
+ #   protein_vec_params.json
+ #   aspect_vec_*.ckpt (200-400 MB each)
+ ```
+
+ ### CLEAN (Enzyme Classification)
+
+ For enzyme-specific searches, get CLEAN from: https://github.com/tttianhao/CLEAN
+
+ ---
+
+ ## Directory Structure After Setup
+
+ ```
+ conformal-protein-retrieval/
+ ├── data/
+ │   ├── lookup_embeddings.npy            ← Download from Zenodo (1.1 GB)
+ │   ├── lookup_embeddings_meta_data.tsv  ← Download from Zenodo (535 MB)
+ │   ├── pfam_new_proteins.npy            ← Download from Zenodo (2.4 GB)
+ │   └── gene_unknown/                    ← Included in GitHub
+ │       ├── unknown_aa_seqs.fasta
+ │       └── unknown_aa_seqs.npy
+ ├── protein_vec_models/                  ← Optional: for new embeddings
+ │   ├── protein_vec.ckpt
+ │   └── ...
+ ├── protein_conformal/                   ← Code (included)
+ ├── results/                             ← Your outputs go here
+ └── scripts/                             ← Helper scripts
+ ```
+
+ ---
+
+ ## Troubleshooting
+
+ ### "FileNotFoundError: data/lookup_embeddings.npy"
+ → Download from Zenodo: https://zenodo.org/records/14272215
+
+ ### "ModuleNotFoundError: No module named 'faiss'"
+ → Install FAISS: `pip install faiss-cpu` (or `faiss-gpu` for GPU)
+
+ ### "Got 58 hits, expected 59"
+ → This is expected! See `docs/REPRODUCIBILITY.md` - the result varies by ±1 due to threshold boundary effects.
+
+ ### "CUDA out of memory"
+ → Use CPU: `--device cpu` or reduce batch size with `--batch-size 16`
+
+ ---
+
+ ## What's Next?
+
+ - **Read the paper**: [Nature Communications (2025) 16:85](https://doi.org/10.1038/s41467-024-55676-y)
+ - **Explore notebooks**: `notebooks/pfam/genes_unknown.ipynb` shows the full Syn3.0 analysis
+ - **Run verification**: `cpr verify --check all` tests all paper claims
+ - **Get help**: Open an issue at https://github.com/ronboger/conformal-protein-retrieval/issues
+
+ ---
+
+ ## Summary: Files Checklist
+
+ | Source | Files | Size | Status |
+ |--------|-------|------|--------|
+ | **GitHub** | Code, test data, thresholds | ~1 MB | ✓ Included |
+ | **Zenodo** | lookup_embeddings.npy | 1.1 GB | ☐ Download |
+ | **Zenodo** | lookup_embeddings_meta_data.tsv | 535 MB | ☐ Download |
+ | **Zenodo** | pfam_new_proteins.npy | 2.4 GB | ☐ Download |
+ | **Optional** | protein_vec_models/ | 3 GB | ☐ For new embeddings |
+ | **Optional** | afdb_embeddings_protein_vec.npy | 4.7 GB | ☐ For AFDB search |
README.md CHANGED
@@ -2,27 +2,44 @@
  
  Code and notebooks from [Functional protein mining with conformal guarantees](https://www.nature.com/articles/s41467-024-55676-y) (Nature Communications, 2025). This package provides statistically rigorous methods for protein database search with false discovery rate (FDR) and false negative rate (FNR) control.
  
- All data files can be found in [our Zenodo repository](https://zenodo.org/records/14272215). Results can be reproduced through executing the data preparation notebooks in each subdirectory.
- 
- ## Installation
- 
- Clone the repository and install dependencies:
+ **[GETTING STARTED](GETTING_STARTED.md)** - Quick setup guide (10 minutes)
+ 
+ ## Quick Setup
+ 
  ```bash
+ # 1. Clone and install
  git clone https://github.com/ronboger/conformal-protein-retrieval.git
  cd conformal-protein-retrieval
  pip install -e .
+
+ # 2. Download data from Zenodo (4 GB total)
+ #    https://zenodo.org/records/14272215
+ #    → lookup_embeddings.npy (1.1 GB) → data/
+ #    → lookup_embeddings_meta_data.tsv (535 MB) → data/
+ #    → pfam_new_proteins.npy (2.4 GB) → data/
+
+ # 3. Verify setup
+ cpr verify --check syn30
+ # Expected: 59/149 = 39.6% hits at FDR α=0.1
  ```
  
- This will install the `cpr` command-line interface for embedding, search, and calibration.
- 
- ## Structure
- 
- - `./protein_conformal`: utility functions to creating confidence sets and assigning probabilities to any protein machine learning model for search
- - `./scope`: experiments pertraining to SCOPe
- - `./pfam`: notebooks demonstrating how to use our techniques to calibrate false discovery and false negative rates for different pfam classes
- - `./ec`: experiments pertraining to EC number classification on uniprot
- - `./data`: scripts and notebooks used to process data
- - `./clean_selection`: scripts and notebooks used to process data
+ See **[GETTING_STARTED.md](GETTING_STARTED.md)** for detailed instructions.
+ 
+ ## Repository Structure
+ 
+ ```
+ conformal-protein-retrieval/
+ ├── protein_conformal/       # Core library (FDR/FNR control, Venn-Abers)
+ ├── notebooks/               # Analysis notebooks organized by experiment
+ │   ├── pfam/                # Pfam domain annotation (Figure 2)
+ │   ├── scope/               # SCOPe structural classification
+ │   ├── ec/                  # EC number classification
+ │   └── clean_selection/     # CLEAN enzyme experiments (Tables 1-2)
+ ├── scripts/                 # CLI scripts and SLURM jobs
+ ├── data/                    # Data files (see GETTING_STARTED.md)
+ ├── results/                 # Pre-computed thresholds and outputs
+ └── docs/                    # Additional documentation
+ ```
  
  ## Quick Start
  
UPLOAD_CHECKLIST.md ADDED
@@ -0,0 +1,188 @@
+ # Upload Checklist: What Goes Where
+
+ This document specifies exactly what files go to GitHub vs Zenodo.
+
+ ## Summary
+
+ | Location | What | Why |
+ |----------|------|-----|
+ | **GitHub** | Code, small data (<1 MB), configs | Version control, collaboration |
+ | **Zenodo** | Large data files (>1 MB), embeddings | Long-term archival, DOI |
+ | **User obtains** | Protein-Vec model weights | Large binary, separate distribution |
+
+ ---
+
+ ## GitHub Repository (You Commit This)
+
+ ### Code & Configuration
+ ```
+ protein_conformal/           # All Python code
+ ├── __init__.py
+ ├── cli.py
+ ├── util.py
+ ├── scope_utils.py
+ ├── embed_protein_vec.py
+ ├── gradio_app.py
+ └── backend/
+
+ scripts/                     # Helper scripts
+ ├── verify_*.py
+ ├── compute_fdr_table.py
+ ├── slurm_*.sh
+ └── *.py
+
+ tests/                       # Test suite
+ notebooks/                   # Analysis notebooks
+ docs/                        # Documentation
+ ```
+
+ ### Small Data Files (<1 MB each)
+ ```
+ data/gene_unknown/
+ ├── unknown_aa_seqs.fasta              # 56 KB - JCVI Syn3.0 sequences
+ ├── unknown_aa_seqs.npy                # 299 KB - Pre-computed embeddings
+ └── jcvi_syn30_unknown_gene_hits.csv   # 61 KB - Results
+
+ results/
+ ├── fdr_thresholds.csv      # ~2 KB - Threshold lookup table
+ ├── fnr_thresholds.csv      # ~7 KB - FNR thresholds
+ └── sim2prob_lookup.csv     # ~8 KB - Probability lookup
+ ```
+
+ ### Configuration & Docs
+ ```
+ pyproject.toml
+ setup.py
+ Dockerfile
+ apptainer.def
+ README.md
+ GETTING_STARTED.md
+ DATA.md
+ CLAUDE.md
+ docs/REPRODUCIBILITY.md
+ .gitignore
+ ```
+
+ ### Model Code (NOT weights)
+ ```
+ protein_vec_models/
+ ├── model_protein_moe.py                   # Model architecture code
+ ├── utils_search.py                        # Embedding utilities
+ ├── data_protein_vec.py                    # Data loading code
+ ├── embed_structure_model.py
+ ├── model_protein_vec_single_variable.py
+ ├── train_protein_vec.py
+ ├── __init__.py
+ └── *.json                                 # Config files only
+ ```
+
+ ---
+
+ ## Zenodo Repository (You Upload This)
+
+ **Zenodo URL**: https://zenodo.org/records/14272215
+
+ ### Essential Files (Required for paper verification)
+
+ | File | Size | Description |
+ |------|------|-------------|
+ | `lookup_embeddings.npy` | **1.1 GB** | UniProt database embeddings (540K proteins) |
+ | `lookup_embeddings_meta_data.tsv` | **535 MB** | Protein metadata (names, Pfam domains, etc.) |
+ | `pfam_new_proteins.npy` | **2.4 GB** | Calibration data for FDR/probability |
+
+ ### Optional Files (For extended experiments)
+
+ | File | Size | Description |
+ |------|------|-------------|
+ | `afdb_embeddings_protein_vec.npy` | 4.7 GB | AlphaFold DB embeddings |
+ | CLEAN enzyme data | varies | For Tables 1-2 reproduction |
+ | SCOPe/DALI data | varies | For Tables 4-6 reproduction |
+
+ ---
+
+ ## User Must Obtain Separately
+
+ ### Protein-Vec Model Weights (~3 GB)
+
+ These are NOT in GitHub or Zenodo. Users get them by:
+
+ 1. **Option A**: Contact authors for `protein_vec_models.gz`
+ 2. **Option B**: Use pre-computed embeddings from Zenodo (no weights needed for searching)
+
+ Files needed if embedding new sequences:
+ ```
+ protein_vec_models/
+ ├── protein_vec.ckpt                # 804 MB - Main model
+ ├── protein_vec_params.json         # Config
+ ├── aspect_vec_*.ckpt               # 200-400 MB each - Aspect models
+ └── tm_vec_swiss_model_large.ckpt   # 391 MB
+ ```
+
+ ### CLEAN Model Weights (if using --model clean)
+
+ Get from: https://github.com/tttianhao/CLEAN
+
+ ---
+
+ ## .gitignore Must Include
+
+ ```gitignore
+ # Large data files (on Zenodo)
+ data/*.npy
+ data/*.tsv
+ data/*.pkl
+
+ # Model weights (user obtains separately)
+ protein_vec_models/*.ckpt
+ protein_vec_models.gz
+
+ # Build artifacts
+ *.sif
+ .apptainer_cache/
+ logs/
+ .claude/
+ ```
+
+ ---
+
+ ## Verification: Is Everything Set Up Correctly?
+
+ Run this after cloning + downloading:
+
+ ```bash
+ # Check GitHub files present
+ ls data/gene_unknown/unknown_aa_seqs.fasta   # Should exist
+ ls results/fdr_thresholds.csv                # Should exist
+
+ # Check Zenodo files downloaded
+ ls -lh data/lookup_embeddings.npy            # Should be ~1.1 GB
+ ls -lh data/pfam_new_proteins.npy            # Should be ~2.4 GB
+
+ # Check model weights (if embedding)
+ ls protein_vec_models/protein_vec.ckpt       # Should exist if embedding
+
+ # Run verification
+ cpr verify --check syn30
+ # Expected: 58-60/149 hits (39.6%)
+ ```
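The size checks above can also be scripted; a minimal sketch, using the documented file sizes as lower bounds (this helper is hypothetical, not shipped with the repo):

```python
from pathlib import Path

# Hypothetical helper: flag Zenodo downloads that are absent or
# smaller than the documented sizes (suggesting a truncated download).
EXPECTED_MIN_BYTES = {
    "data/lookup_embeddings.npy": int(1.0e9),            # ~1.1 GB
    "data/lookup_embeddings_meta_data.tsv": int(5.0e8),  # ~535 MB
    "data/pfam_new_proteins.npy": int(2.0e9),            # ~2.4 GB
}

def missing_or_truncated(root: str = ".") -> list:
    """Return the relative paths that fail the size check under root."""
    bad = []
    for rel, min_bytes in EXPECTED_MIN_BYTES.items():
        p = Path(root) / rel
        if not p.exists() or p.stat().st_size < min_bytes:
            bad.append(rel)
    return bad

# In a fresh clone without the Zenodo downloads, all three are reported:
print(missing_or_truncated())
```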
+
+ ---
+
+ ## For Repository Maintainers
+
+ ### When releasing a new version:
+
+ 1. **GitHub**:
+    - Commit all code changes
+    - Update `results/fdr_thresholds.csv` with new calibration
+    - Tag release: `git tag v1.x.x`
+
+ 2. **Zenodo**:
+    - Upload updated embedding files if changed
+    - Create new version linked to GitHub release
+
+ ### Files to NEVER commit to GitHub:
+ - Any `.npy` file > 1 MB
+ - Any `.ckpt` file (model weights)
+ - Any `.pkl` file > 1 MB
+ - Any `.tsv` or `.csv` > 1 MB
protein_conformal/cli.py CHANGED
@@ -205,6 +205,37 @@ def _embed_clean(sequences, device, args):
  
  
+ def _get_fdr_threshold(alpha: float) -> float:
+     """Look up FDR threshold from precomputed table."""
+     import pandas as pd
+
+     repo_root = Path(__file__).parent.parent
+     threshold_file = repo_root / "results" / "fdr_thresholds.csv"
+
+     # Fallback values if table not found (from paper)
+     FALLBACK_THRESHOLDS = {
+         0.01: 0.999992,
+         0.05: 0.999985,
+         0.10: 0.999980,
+         0.15: 0.999975,
+         0.20: 0.999970,
+     }
+
+     if threshold_file.exists():
+         try:
+             df = pd.read_csv(threshold_file)
+             # Find closest alpha in table
+             if 'alpha' in df.columns and 'threshold_mean' in df.columns:
+                 idx = (df['alpha'] - alpha).abs().idxmin()
+                 return df.loc[idx, 'threshold_mean']
+         except Exception:
+             pass
+
+     # Use fallback
+     closest_alpha = min(FALLBACK_THRESHOLDS.keys(), key=lambda x: abs(x - alpha))
+     return FALLBACK_THRESHOLDS[closest_alpha]
+
+
  def cmd_search(args):
      """Search for similar proteins using FAISS with conformal guarantees."""
      import numpy as np
@@ -236,12 +267,28 @@ def cmd_search(args):
      print(f"Querying for top {args.k} neighbors...")
      D, I = query(index, query_embeddings, args.k)
  
-     # Apply threshold if specified
-     if args.threshold:
-         print(f"Applying similarity threshold: {args.threshold}")
+     # Determine threshold from --fdr, --fnr, or --threshold
+     threshold = None
+     if args.no_filter:
+         print("No filtering (--no-filter): returning all neighbors")
+     elif args.threshold:
+         threshold = args.threshold
+         print(f"Using manual threshold: {threshold}")
+     elif args.fnr:
+         # FNR threshold (TODO: add lookup table for FNR)
+         print(f"FNR control at α={args.fnr} (using approximate threshold)")
+         threshold = 0.9999 - args.fnr * 0.001  # Rough approximation
+         print(f"  Threshold: {threshold}")
+     else:
+         # Default: FDR control
+         fdr_alpha = args.fdr if args.fdr else 0.1
+         threshold = _get_fdr_threshold(fdr_alpha)
+         print(f"FDR control at α={fdr_alpha} ({fdr_alpha*100:.0f}% FDR)")
+         print(f"  Threshold: {threshold:.10f}")
  
      # Build results
      results = []
+     n_filtered = 0
      for i in range(len(query_embeddings)):
          for j in range(args.k):
              sim = D[i, j]
@@ -249,7 +296,8 @@ def cmd_search(args):
              # Skip placeholder results (FAISS returns -1 for non-existent neighbors)
              if idx < 0:
                  continue
-             if args.threshold and sim < args.threshold:
+             if threshold is not None and sim < threshold:
+                 n_filtered += 1
                  continue
              row = {
                  'query_idx': i,
@@ -263,7 +311,143 @@ def cmd_search(args):
  
      results_df = pd.DataFrame(results)
      results_df.to_csv(args.output, index=False)
-     print(f"Saved {len(results_df)} results to {args.output}")
+
+     # Summary
+     n_queries = len(query_embeddings)
+     n_with_hits = len(results_df['query_idx'].unique()) if len(results_df) > 0 else 0
+     print(f"\nResults:")
+     print(f"  Queries: {n_queries}")
+     print(f"  Queries with confident hits: {n_with_hits} ({n_with_hits/n_queries*100:.1f}%)")
+     print(f"  Total hits: {len(results_df)}")
+     if threshold:
+         print(f"  Filtered out: {n_filtered} below threshold")
+     print(f"Saved to {args.output}")
+
+
+ def cmd_find(args):
+     """One-step search: FASTA → embeddings → search → results with probabilities."""
+     import numpy as np
+     import pandas as pd
+     import tempfile
+     from Bio import SeqIO
+     import torch
+     from protein_conformal.util import load_database, query, simplifed_venn_abers_prediction, get_sims_labels
+
+     device = torch.device('cuda' if torch.cuda.is_available() and not args.cpu else 'cpu')
+     print(f"=== CPR Find: FASTA to Annotated Results ===")
+     print(f"Device: {device}")
+     print(f"Model: {args.model}")
+     print(f"FDR level: {args.fdr*100:.0f}%")
+     print()
+
+     # Step 1: Read sequences
+     print(f"[1/5] Reading sequences from {args.input}...")
+     sequences = []
+     sequence_names = []
+     for record in SeqIO.parse(args.input, "fasta"):
+         sequences.append(str(record.seq))
+         sequence_names.append(record.id)
+     print(f"  Found {len(sequences)} sequences")
+
+     # Step 2: Embed sequences
+     print(f"\n[2/5] Computing embeddings with {args.model}...")
+     if args.model == 'protein-vec':
+         embeddings = _embed_protein_vec(sequences, device, args)
+     elif args.model == 'clean':
+         embeddings = _embed_clean(sequences, device, args)
+     else:
+         print(f"Unknown model: {args.model}")
+         sys.exit(1)
+     print(f"  Embeddings shape: {embeddings.shape}")
+
+     # Step 3: Load database
+     repo_root = Path(__file__).parent.parent
+     db_path = args.database if args.database else repo_root / "data" / "lookup_embeddings.npy"
+     meta_path = args.database_meta if args.database_meta else repo_root / "data" / "lookup_embeddings_meta_data.tsv"
+
+     print(f"\n[3/5] Loading database from {db_path}...")
+     db_embeddings = np.load(db_path)
+     print(f"  Database size: {len(db_embeddings)} proteins")
+
+     if Path(meta_path).exists():
+         if str(meta_path).endswith('.tsv'):
+             db_meta = pd.read_csv(meta_path, sep='\t')
+         else:
+             db_meta = pd.read_csv(meta_path)
+     else:
+         db_meta = None
+         print("  Warning: No metadata file found")
+
+     # Determine k (10% of database or max 10000)
+     k = min(max(100, len(db_embeddings) // 10), 10000)
+     print(f"  Using k={k} neighbors ({k/len(db_embeddings)*100:.1f}% of database)")
+
+     # Step 4: Search
+     print(f"\n[4/5] Searching...")
+     index = load_database(db_embeddings)
+     D, I = query(index, embeddings, k)
+
+     # Get threshold
+     threshold = _get_fdr_threshold(args.fdr)
+     print(f"  FDR threshold (α={args.fdr}): {threshold:.10f}")
+
+     # Step 5: Build results with probabilities
+     print(f"\n[5/5] Building results...")
+
+     # Load calibration data for probabilities
+     cal_path = args.calibration if args.calibration else repo_root / "data" / "pfam_new_proteins.npy"
+     if Path(cal_path).exists():
+         cal_data = np.load(cal_path, allow_pickle=True)
+         np.random.seed(42)
+         np.random.shuffle(cal_data)
+         cal_subset = cal_data[:100]
+         X_cal, y_cal = get_sims_labels(cal_subset, partial=False)
+         X_cal = X_cal.flatten()
+         y_cal = y_cal.flatten()
+         compute_probs = True
+     else:
+         compute_probs = False
+         print("  Warning: No calibration data, skipping probability computation")
+
+     results = []
+     for i in range(len(embeddings)):
+         for j in range(k):
+             sim = D[i, j]
+             idx = I[i, j]
+             if idx < 0 or sim < threshold:
+                 continue
+
+             row = {
+                 'query_name': sequence_names[i],
+                 'query_idx': i,
+                 'match_idx': idx,
+                 'similarity': sim,
+             }
+
+             # Add probability if calibration available
+             if compute_probs:
+                 p0, p1 = simplifed_venn_abers_prediction(X_cal, y_cal, sim)
+                 row['probability'] = (p0 + p1) / 2
+                 row['uncertainty'] = abs(p1 - p0)
+
+             # Add metadata
+             if db_meta is not None and idx < len(db_meta):
+                 for col in db_meta.columns[:5]:
+                     row[f'match_{col}'] = db_meta.iloc[idx][col]
+
+             results.append(row)
+
+     results_df = pd.DataFrame(results)
+     results_df.to_csv(args.output, index=False)
+
+     # Summary
+     n_queries = len(sequences)
+     n_with_hits = len(results_df['query_idx'].unique()) if len(results_df) > 0 else 0
+     print(f"\n=== Results ===")
+     print(f"Queries: {n_queries}")
+     print(f"Queries with confident hits: {n_with_hits} ({n_with_hits/n_queries*100:.1f}%)")
+     print(f"Total confident hits: {len(results_df)}")
+     print(f"Output: {args.output}")
  
  
  def cmd_verify(args):
@@ -431,8 +615,30 @@ def main():
      )
      subparsers = parser.add_subparsers(dest='command', help='Available commands')
  
+     # find command (one-step: FASTA → results)
+     p_find = subparsers.add_parser('find',
+         help='One-step search: FASTA → embed → search → annotated results',
+         description='The easiest way to use CPR. Give it a FASTA file and get annotated results.')
+     p_find.add_argument('--input', '-i', required=True, help='Input FASTA file with protein sequences')
+     p_find.add_argument('--output', '-o', required=True, help='Output CSV with annotated hits')
+     p_find.add_argument('--model', '-m', default='protein-vec',
+                         choices=['protein-vec', 'clean'],
+                         help='Embedding model (default: protein-vec)')
+     p_find.add_argument('--fdr', type=float, default=0.1,
+                         help='False discovery rate level (default: 0.1 = 10%% FDR)')
+     p_find.add_argument('--database', '-d',
+                         help='Database embeddings (default: data/lookup_embeddings.npy)')
+     p_find.add_argument('--database-meta',
+                         help='Database metadata (default: data/lookup_embeddings_meta_data.tsv)')
+     p_find.add_argument('--calibration', '-c',
+                         help='Calibration data for probabilities (default: data/pfam_new_proteins.npy)')
+     p_find.add_argument('--cpu', action='store_true', help='Force CPU even if GPU available')
+     p_find.add_argument('--clean-model', default='split100',
+                         help='CLEAN model variant (default: split100)')
+     p_find.set_defaults(func=cmd_find)
+
      # embed command
-     p_embed = subparsers.add_parser('embed', help='Embed protein sequences')
+     p_embed = subparsers.add_parser('embed', help='Embed protein sequences (step 1 of manual workflow)')
      p_embed.add_argument('--input', '-i', required=True, help='Input FASTA file')
      p_embed.add_argument('--output', '-o', required=True, help='Output .npy file for embeddings')
      p_embed.add_argument('--model', '-m', default='protein-vec',
@@ -449,8 +655,20 @@ def main():
      p_search.add_argument('--database', '-d', required=True, help='Database embeddings (.npy)')
      p_search.add_argument('--database-meta', '-m', help='Database metadata (.tsv or .csv)')
      p_search.add_argument('--output', '-o', required=True, help='Output results (.csv)')
-     p_search.add_argument('--k', type=int, default=10, help='Number of neighbors (default: 10)')
-     p_search.add_argument('--threshold', '-t', type=float, help='Similarity threshold (e.g., 0.99998 for FDR α=0.1)')
+     p_search.add_argument('--k', type=int, default=100,
+                           help='Max neighbors per query (default: 100)')
+     # FDR/FNR control options
+     p_search.add_argument('--fdr', type=float, default=0.1,
+                           help='False discovery rate level (default: 0.1 = 10%% FDR). '
+                                'Automatically looks up threshold from results/fdr_thresholds.csv')
+     p_search.add_argument('--fnr', type=float,
+                           help='False negative rate level (alternative to --fdr). '
+                                'Use this when you want to control missed true matches.')
+     p_search.add_argument('--threshold', '-t', type=float,
+                           help='Manual similarity threshold (overrides --fdr/--fnr). '
+                                'Use this if you have a custom threshold.')
+     p_search.add_argument('--no-filter', action='store_true',
+                           help='Return all neighbors without filtering (for exploration)')
      p_search.set_defaults(func=cmd_search)
  
      # verify command
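For readers skimming the diff: the per-hit filtering loop in `cmd_search` boils down to a boolean mask over FAISS's `(D, I)` outputs. A vectorized sketch with synthetic data (not part of the CLI):

```python
import numpy as np

# Synthetic FAISS-style outputs: 3 queries x 4 neighbors.
# I uses -1 as the placeholder for missing neighbors, as FAISS does.
D = np.array([[0.999985, 0.999970, 0.5, 0.4],
              [0.999999, 0.2,      0.1, 0.05],
              [0.3,      0.2,      0.1, 0.05]])
I = np.array([[10, 11, 12, -1],
              [20, 21, -1, -1],
              [30, 31, 32, 33]])

threshold = 0.999980  # 10% FDR threshold from the lookup table

# Same filter as the cmd_search loop: valid index AND similarity >= threshold
keep = (I >= 0) & (D >= threshold)
q_idx, _ = np.nonzero(keep)
hits = list(zip(q_idx.tolist(), I[keep].tolist(), D[keep].tolist()))
print(hits)  # [(0, 10, 0.999985), (1, 20, 0.999999)]
```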