feat: add one-step cpr find command and FDR-level search
Major improvements to make CPR easier to use:
## New Features
- Add `cpr find` command: FASTA → annotated results in ONE step
  - Automatically embeds sequences
  - Searches database with FDR control
  - Adds calibrated probabilities
  - Returns annotated hits
- Add `--fdr` and `--fnr` options to `cpr search`
  - Specify FDR level directly (e.g., `--fdr 0.1` for 10% FDR)
  - Automatically looks up the threshold from `results/fdr_thresholds.csv`
  - Falls back to paper values if the table is not found
- Improved search defaults
  - `k` now defaults to 10% of database size (min 100, max 10000)
  - Better result summary with hit rates
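The new `k` default described above (10% of the database size, clamped between 100 and 10000) can be sketched in a few lines; `default_k` is a hypothetical helper name, not taken from the diff:

```python
def default_k(db_size: int) -> int:
    # 10% of the database size, clamped to [100, 10000]
    return min(max(db_size // 10, 100), 10000)

print(default_k(540_000))  # → 10000 (capped for the 540K UniProt database)
```

For the 540K-protein UniProt lookup database this hits the 10000 cap; tiny databases get the floor of 100.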
## Documentation
- Add GETTING_STARTED.md: 10-minute quickstart guide
- Add UPLOAD_CHECKLIST.md: What goes to GitHub vs Zenodo
- Update README.md with simpler workflow
## Usage Example
```bash
# One command from FASTA to annotated results:
cpr find --input proteins.fasta --output results.csv --fdr 0.1
```
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- GETTING_STARTED.md +278 -0
- README.md +28 -11
- UPLOAD_CHECKLIST.md +188 -0
- protein_conformal/cli.py +226 -8
### GETTING_STARTED.md

@@ -0,0 +1,278 @@
# Getting Started with CPR

This guide will get you from zero to running protein searches with conformal guarantees in under 10 minutes.

## TL;DR (Easiest Path)

```bash
# 1. Clone and install
git clone https://github.com/ronboger/conformal-protein-retrieval.git
cd conformal-protein-retrieval
pip install -e .

# 2. Download data (4GB total) from https://zenodo.org/records/14272215:
#    → lookup_embeddings.npy (1.1 GB) → data/
#    → lookup_embeddings_meta_data.tsv (535 MB) → data/
#    → pfam_new_proteins.npy (2.4 GB) → data/

# 3. Get Protein-Vec model weights (contact authors or see below)
#    → Extract protein_vec_models.gz to protein_vec_models/

# 4. Run search on your sequences (ONE COMMAND!)
cpr find --input your_sequences.fasta --output results.csv --fdr 0.1

# That's it! results.csv contains:
#    - Functional annotations for each protein
#    - Calibrated probabilities
#    - Uncertainty estimates
```

### Don't have model weights? Use pre-computed embeddings:

```bash
# If you already have embeddings (.npy), skip to search:
cpr search --query your_embeddings.npy --output results.csv --fdr 0.1
```

---

## What You Need

### Already Included (GitHub clone)

When you clone the repository, you automatically get:

| File | Size | Description |
|------|------|-------------|
| `data/gene_unknown/unknown_aa_seqs.fasta` | 56 KB | JCVI Syn3.0 test sequences (149 proteins) |
| `data/gene_unknown/unknown_aa_seqs.npy` | 299 KB | Pre-computed embeddings for test sequences |
| `results/fdr_thresholds.csv` | ~2 KB | FDR thresholds at standard alpha levels |
| `protein_conformal/*.py` | ~100 KB | All the code |

### Download from Zenodo (Required)

Download these from **https://zenodo.org/records/14272215**:

| File | Size | What it is | Where to put it |
|------|------|------------|-----------------|
| `lookup_embeddings.npy` | **1.1 GB** | UniProt database (540K protein embeddings) | `data/` |
| `lookup_embeddings_meta_data.tsv` | **535 MB** | Protein metadata (names, Pfam domains, etc.) | `data/` |
| `pfam_new_proteins.npy` | **2.4 GB** | Calibration data for FDR/probability computation | `data/` |

**Total download: ~4 GB**

### Optional Downloads

| File | Size | When you need it |
|------|------|------------------|
| `afdb_embeddings_protein_vec.npy` | 4.7 GB | Searching AlphaFold Database |
| Protein-Vec model weights | 3 GB | Computing new embeddings from FASTA |
| CLEAN model weights | 1 GB | Enzyme classification with CLEAN |

---

## Step-by-Step Setup

### Step 1: Clone and Install

```bash
git clone https://github.com/ronboger/conformal-protein-retrieval.git
cd conformal-protein-retrieval
pip install -e .
```

**Verify installation:**
```bash
cpr --help
# Should show: embed, search, prob, calibrate, verify commands
```

### Step 2: Download Data

Go to **https://zenodo.org/records/14272215** and download:

1. `lookup_embeddings.npy` (1.1 GB)
2. `lookup_embeddings_meta_data.tsv` (535 MB)
3. `pfam_new_proteins.npy` (2.4 GB)

Move them to the `data/` directory:

```bash
mv ~/Downloads/lookup_embeddings.npy data/
mv ~/Downloads/lookup_embeddings_meta_data.tsv data/
mv ~/Downloads/pfam_new_proteins.npy data/
```

**Verify files:**
```bash
ls -lh data/*.npy data/*.tsv
# lookup_embeddings.npy              1.1G
# lookup_embeddings_meta_data.tsv    535M
# pfam_new_proteins.npy              2.4G
```

### Step 3: Verify Setup

```bash
cpr verify --check syn30
```

**Expected output:**
```
JCVI Syn3.0 Annotation Verification
Total queries: 149
Confident hits: 59 (might be 58-60, see docs/REPRODUCIBILITY.md)
Hit rate: 39.6%
FDR threshold: λ = 0.999980225003
✓ VERIFICATION PASSED
```

---

## Your First Search

### Easiest: One Command from FASTA (Recommended)

```bash
cpr find --input your_proteins.fasta --output results.csv --fdr 0.1
```

This single command:
1. Embeds your sequences using Protein-Vec
2. Searches the UniProt database (540K proteins)
3. Filters to confident hits at 10% FDR
4. Adds calibrated probability estimates
5. Includes Pfam/functional annotations

**Output columns:**
- `query_name`: Your sequence ID from FASTA
- `similarity`: Cosine similarity score
- `probability`: Calibrated probability of functional match
- `uncertainty`: Venn-Abers uncertainty interval
- `match_*`: Pfam domains, protein names, etc.

### Control FDR Level

```bash
# Stringent: 1% FDR (fewer but more confident hits)
cpr find --input proteins.fasta --output results.csv --fdr 0.01

# Default: 10% FDR (balanced)
cpr find --input proteins.fasta --output results.csv --fdr 0.1

# Discovery: 20% FDR (more hits, some false positives)
cpr find --input proteins.fasta --output results.csv --fdr 0.2
```

### Alternative: Manual Workflow (Advanced)

If you need more control or already have embeddings:

```bash
# Step 1: Embed (if starting from FASTA)
cpr embed --input seqs.fasta --output embeddings.npy --model protein-vec

# Step 2: Search with FDR control
cpr search --query embeddings.npy --output hits.csv --fdr 0.1

# Step 3: Add probabilities (optional, for detailed analysis)
cpr prob --input hits.csv --output hits_with_probs.csv
```

---

## FDR Threshold Reference

Use these thresholds for your desired false discovery rate:

| FDR Level | Threshold (λ) | Use Case |
|-----------|---------------|----------|
| 1% | 0.999990 | Very stringent |
| 5% | 0.999985 | Stringent |
| **10%** | **0.999980** | **Paper default** |
| 15% | 0.999975 | Relaxed |
| 20% | 0.999970 | Discovery-focused |

Full table in `results/fdr_thresholds.csv`.

---

## Model Weights

### Protein-Vec (General Protein Search)

**Option 1: Contact authors** for the `protein_vec_models.gz` archive.

**Option 2: Use pre-computed embeddings** from Zenodo (no weights needed for searching).

If you have the weights:
```bash
tar -xzf protein_vec_models.gz
# Creates protein_vec_models/ directory with:
#   protein_vec.ckpt (804 MB)
#   protein_vec_params.json
#   aspect_vec_*.ckpt (200-400 MB each)
```

### CLEAN (Enzyme Classification)

For enzyme-specific searches, get CLEAN from: https://github.com/tttianhao/CLEAN

---

## Directory Structure After Setup

```
conformal-protein-retrieval/
├── data/
│   ├── lookup_embeddings.npy             ← Download from Zenodo (1.1 GB)
│   ├── lookup_embeddings_meta_data.tsv   ← Download from Zenodo (535 MB)
│   ├── pfam_new_proteins.npy             ← Download from Zenodo (2.4 GB)
│   └── gene_unknown/                     ← Included in GitHub
│       ├── unknown_aa_seqs.fasta
│       └── unknown_aa_seqs.npy
├── protein_vec_models/                   ← Optional: for new embeddings
│   ├── protein_vec.ckpt
│   └── ...
├── protein_conformal/                    ← Code (included)
├── results/                              ← Your outputs go here
└── scripts/                              ← Helper scripts
```

---

## Troubleshooting

### "FileNotFoundError: data/lookup_embeddings.npy"
→ Download from Zenodo: https://zenodo.org/records/14272215

### "ModuleNotFoundError: No module named 'faiss'"
→ Install FAISS: `pip install faiss-cpu` (or `faiss-gpu` for GPU)

### "Got 58 hits, expected 59"
→ This is expected! See `docs/REPRODUCIBILITY.md` - the result varies by ±1 due to threshold boundary effects.

### "CUDA out of memory"
→ Use CPU: `--device cpu` or reduce batch size with `--batch-size 16`

---

## What's Next?

- **Read the paper**: [Nature Communications (2025) 16:85](https://doi.org/10.1038/s41467-024-55676-y)
- **Explore notebooks**: `notebooks/pfam/genes_unknown.ipynb` shows the full Syn3.0 analysis
- **Run verification**: `cpr verify --check all` tests all paper claims
- **Get help**: Open an issue at https://github.com/ronboger/conformal-protein-retrieval/issues

---

## Summary: Files Checklist

| Source | Files | Size | Status |
|--------|-------|------|--------|
| **GitHub** | Code, test data, thresholds | ~1 MB | ✓ Included |
| **Zenodo** | lookup_embeddings.npy | 1.1 GB | ☐ Download |
| **Zenodo** | lookup_embeddings_meta_data.tsv | 535 MB | ☐ Download |
| **Zenodo** | pfam_new_proteins.npy | 2.4 GB | ☐ Download |
| **Optional** | protein_vec_models/ | 3 GB | ☐ For new embeddings |
| **Optional** | afdb_embeddings_protein_vec.npy | 4.7 GB | ☐ For AFDB search |
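The results.csv columns documented in the getting-started guide above lend themselves to simple pandas post-processing. A minimal sketch, using a synthetic stand-in DataFrame rather than a real `cpr find` output (real files will carry more columns, e.g. `uncertainty` and `match_*`):

```python
import pandas as pd

# Synthetic stand-in for a results.csv written by `cpr find`
# (column names taken from the "Output columns" list in the guide)
hits = pd.DataFrame({
    "query_name": ["seqA", "seqA", "seqB"],
    "similarity": [0.999991, 0.999983, 0.999972],
    "probability": [0.97, 0.81, 0.42],
})

# Keep the single best hit per query, ranked by calibrated probability
best = hits.sort_values("probability", ascending=False).groupby("query_name").head(1)
print(best)
```

`groupby(...).head(1)` keeps the highest-probability row per query after the sort, which is a common first step when summarizing multi-hit search output.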
### README.md

````diff
@@ -2,27 +2,44 @@
 
 Code and notebooks from [Functional protein mining with conformal guarantees](https://www.nature.com/articles/s41467-024-55676-y) (Nature Communications, 2025). This package provides statistically rigorous methods for protein database search with false discovery rate (FDR) and false negative rate (FNR) control.
 
-##
-
-Clone the repository and install dependencies:
+**[→ GETTING STARTED](GETTING_STARTED.md)** - Quick setup guide (10 minutes)
+
+## Quick Setup
+
 ```bash
+# 1. Clone and install
 git clone https://github.com/ronboger/conformal-protein-retrieval.git
 cd conformal-protein-retrieval
 pip install -e .
+
+# 2. Download data from Zenodo (4GB total)
+# https://zenodo.org/records/14272215
+# → lookup_embeddings.npy (1.1 GB) → data/
+# → lookup_embeddings_meta_data.tsv (535 MB) → data/
+# → pfam_new_proteins.npy (2.4 GB) → data/
+
+# 3. Verify setup
+cpr verify --check syn30
+# Expected: 59/149 = 39.6% hits at FDR α=0.1
 ```
 
-## Structure
-
-
-
-
-
+See **[GETTING_STARTED.md](GETTING_STARTED.md)** for detailed instructions.
+
+## Repository Structure
+
+```
+conformal-protein-retrieval/
+├── protein_conformal/   # Core library (FDR/FNR control, Venn-Abers)
+├── notebooks/           # Analysis notebooks organized by experiment
+│   ├── pfam/            # Pfam domain annotation (Figure 2)
+│   ├── scope/           # SCOPe structural classification
+│   ├── ec/              # EC number classification
+│   └── clean_selection/ # CLEAN enzyme experiments (Tables 1-2)
+├── scripts/             # CLI scripts and SLURM jobs
+├── data/                # Data files (see GETTING_STARTED.md)
+├── results/             # Pre-computed thresholds and outputs
+└── docs/                # Additional documentation
+```
 
 ## Quick Start
````
### UPLOAD_CHECKLIST.md

@@ -0,0 +1,188 @@
# Upload Checklist: What Goes Where

This document specifies exactly what files go to GitHub vs Zenodo.

## Summary

| Location | What | Why |
|----------|------|-----|
| **GitHub** | Code, small data (<1MB), configs | Version control, collaboration |
| **Zenodo** | Large data files (>1MB), embeddings | Long-term archival, DOI |
| **User obtains** | Protein-Vec model weights | Large binary, separate distribution |

---

## GitHub Repository (You Commit This)

### Code & Configuration
```
protein_conformal/          # All Python code
├── __init__.py
├── cli.py
├── util.py
├── scope_utils.py
├── embed_protein_vec.py
├── gradio_app.py
└── backend/

scripts/                    # Helper scripts
├── verify_*.py
├── compute_fdr_table.py
├── slurm_*.sh
└── *.py

tests/                      # Test suite
notebooks/                  # Analysis notebooks
docs/                       # Documentation
```

### Small Data Files (<1MB each)
```
data/gene_unknown/
├── unknown_aa_seqs.fasta              # 56 KB - JCVI Syn3.0 sequences
├── unknown_aa_seqs.npy                # 299 KB - Pre-computed embeddings
└── jcvi_syn30_unknown_gene_hits.csv   # 61 KB - Results

results/
├── fdr_thresholds.csv     # ~2 KB - Threshold lookup table
├── fnr_thresholds.csv     # ~7 KB - FNR thresholds
└── sim2prob_lookup.csv    # ~8 KB - Probability lookup
```

### Configuration & Docs
```
pyproject.toml
setup.py
Dockerfile
apptainer.def
README.md
GETTING_STARTED.md
DATA.md
CLAUDE.md
docs/REPRODUCIBILITY.md
.gitignore
```

### Model Code (NOT weights)
```
protein_vec_models/
├── model_protein_moe.py   # Model architecture code
├── utils_search.py        # Embedding utilities
├── data_protein_vec.py    # Data loading code
├── embed_structure_model.py
├── model_protein_vec_single_variable.py
├── train_protein_vec.py
├── __init__.py
└── *.json                 # Config files only
```

---

## Zenodo Repository (You Upload This)

**Zenodo URL**: https://zenodo.org/records/14272215

### Essential Files (Required for paper verification)

| File | Size | Description |
|------|------|-------------|
| `lookup_embeddings.npy` | **1.1 GB** | UniProt database embeddings (540K proteins) |
| `lookup_embeddings_meta_data.tsv` | **535 MB** | Protein metadata (names, Pfam domains, etc.) |
| `pfam_new_proteins.npy` | **2.4 GB** | Calibration data for FDR/probability |

### Optional Files (For extended experiments)

| File | Size | Description |
|------|------|-------------|
| `afdb_embeddings_protein_vec.npy` | 4.7 GB | AlphaFold DB embeddings |
| CLEAN enzyme data | varies | For Tables 1-2 reproduction |
| SCOPe/DALI data | varies | For Tables 4-6 reproduction |

---

## User Must Obtain Separately

### Protein-Vec Model Weights (~3 GB)

These are NOT in GitHub or Zenodo. Users get them by:

1. **Option A**: Contact authors for `protein_vec_models.gz`
2. **Option B**: Use pre-computed embeddings from Zenodo (no weights needed for searching)

Files needed if embedding new sequences:
```
protein_vec_models/
├── protein_vec.ckpt                 # 804 MB - Main model
├── protein_vec_params.json          # Config
├── aspect_vec_*.ckpt                # 200-400 MB each - Aspect models
└── tm_vec_swiss_model_large.ckpt    # 391 MB
```

### CLEAN Model Weights (if using --model clean)

Get from: https://github.com/tttianhao/CLEAN

---

## .gitignore Must Include

```gitignore
# Large data files (on Zenodo)
data/*.npy
data/*.tsv
data/*.pkl

# Model weights (user obtains separately)
protein_vec_models/*.ckpt
protein_vec_models.gz

# Build artifacts
*.sif
.apptainer_cache/
logs/
.claude/
```

---

## Verification: Is Everything Set Up Correctly?

Run this after cloning + downloading:

```bash
# Check GitHub files present
ls data/gene_unknown/unknown_aa_seqs.fasta   # Should exist
ls results/fdr_thresholds.csv                # Should exist

# Check Zenodo files downloaded
ls -lh data/lookup_embeddings.npy    # Should be ~1.1 GB
ls -lh data/pfam_new_proteins.npy    # Should be ~2.4 GB

# Check model weights (if embedding)
ls protein_vec_models/protein_vec.ckpt   # Should exist if embedding

# Run verification
cpr verify --check syn30
# Expected: 58-60/149 hits (39.6%)
```

---

## For Repository Maintainers

### When releasing a new version:

1. **GitHub**:
   - Commit all code changes
   - Update `results/fdr_thresholds.csv` with new calibration
   - Tag release: `git tag v1.x.x`

2. **Zenodo**:
   - Upload updated embedding files if changed
   - Create new version linked to GitHub release

### Files to NEVER commit to GitHub:
- Any `.npy` file > 1 MB
- Any `.ckpt` file (model weights)
- Any `.pkl` file > 1 MB
- Any `.tsv` or `.csv` > 1 MB
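The "never commit" rules in the checklist above can be expressed as a small pre-commit check. A sketch only; `violates_checklist` is a hypothetical helper for illustration, not part of the repository:

```python
from pathlib import Path

def violates_checklist(name: str, size_bytes: int) -> bool:
    """True if the checklist forbids committing this file to GitHub.

    Hypothetical helper: .ckpt is blocked at any size; .npy/.pkl/.tsv/.csv
    are blocked above the 1 MB limit stated in the checklist.
    """
    suffix = Path(name).suffix
    if suffix == ".ckpt":                      # model weights: never commit
        return True
    if suffix in {".npy", ".pkl", ".tsv", ".csv"}:
        return size_bytes > 1_000_000          # 1 MB limit from the checklist
    return False

print(violates_checklist("data/lookup_embeddings.npy", 1_100_000_000))  # True
print(violates_checklist("results/fdr_thresholds.csv", 2_000))          # False
```

Wired into a pre-commit hook, this would catch the large Zenodo artifacts before they land in git history.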
|
@@ -205,6 +205,37 @@ def _embed_clean(sequences, device, args):
|
|
| 205 |
|
| 206 |
|
| 207 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 208 |
def cmd_search(args):
|
| 209 |
"""Search for similar proteins using FAISS with conformal guarantees."""
|
| 210 |
import numpy as np
|
|
@@ -236,12 +267,28 @@ def cmd_search(args):
|
|
| 236 |
print(f"Querying for top {args.k} neighbors...")
|
| 237 |
D, I = query(index, query_embeddings, args.k)
|
| 238 |
|
| 239 |
-
#
|
| 240 |
-
|
| 241 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 242 |
|
| 243 |
# Build results
|
| 244 |
results = []
|
|
|
|
| 245 |
for i in range(len(query_embeddings)):
|
| 246 |
for j in range(args.k):
|
| 247 |
sim = D[i, j]
|
|
@@ -249,7 +296,8 @@ def cmd_search(args):
|
|
| 249 |
# Skip placeholder results (FAISS returns -1 for non-existent neighbors)
|
| 250 |
if idx < 0:
|
| 251 |
continue
|
| 252 |
-
if
|
|
|
|
| 253 |
continue
|
| 254 |
row = {
|
| 255 |
'query_idx': i,
|
|
@@ -263,7 +311,143 @@ def cmd_search(args):
|
|
| 263 |
|
| 264 |
results_df = pd.DataFrame(results)
|
| 265 |
results_df.to_csv(args.output, index=False)
|
| 266 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 267 |
|
| 268 |
|
| 269 |
def cmd_verify(args):
|
|
@@ -431,8 +615,30 @@ def main():
|
|
| 431 |
)
|
| 432 |
subparsers = parser.add_subparsers(dest='command', help='Available commands')
|
| 433 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 434 |
# embed command
|
| 435 |
-
p_embed = subparsers.add_parser('embed', help='Embed protein sequences')
|
| 436 |
p_embed.add_argument('--input', '-i', required=True, help='Input FASTA file')
|
| 437 |
p_embed.add_argument('--output', '-o', required=True, help='Output .npy file for embeddings')
|
| 438 |
p_embed.add_argument('--model', '-m', default='protein-vec',
|
|
@@ -449,8 +655,20 @@ def main():
|
|
| 449 |
p_search.add_argument('--database', '-d', required=True, help='Database embeddings (.npy)')
|
| 450 |
p_search.add_argument('--database-meta', '-m', help='Database metadata (.tsv or .csv)')
|
| 451 |
p_search.add_argument('--output', '-o', required=True, help='Output results (.csv)')
|
| 452 |
-
p_search.add_argument('--k', type=int, default=
|
| 453 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 454 |
p_search.set_defaults(func=cmd_search)
|
| 455 |
|
| 456 |
# verify command
|
|
|
|
| 205 |
|
| 206 |
|
| 207 |
|
| 208 |
+
def _get_fdr_threshold(alpha: float) -> float:
|
| 209 |
+
"""Look up FDR threshold from precomputed table."""
|
| 210 |
+
import pandas as pd
|
| 211 |
+
|
| 212 |
+
repo_root = Path(__file__).parent.parent
|
| 213 |
+
threshold_file = repo_root / "results" / "fdr_thresholds.csv"
|
| 214 |
+
|
| 215 |
+
# Fallback values if table not found (from paper)
|
| 216 |
+
FALLBACK_THRESHOLDS = {
|
| 217 |
+
0.01: 0.999992,
|
| 218 |
+
0.05: 0.999985,
|
| 219 |
+
0.10: 0.999980,
|
| 220 |
+
0.15: 0.999975,
|
| 221 |
+
0.20: 0.999970,
|
| 222 |
+
}
|
| 223 |
+
|
| 224 |
+
if threshold_file.exists():
|
| 225 |
+
try:
|
| 226 |
+
df = pd.read_csv(threshold_file)
|
| 227 |
+
# Find closest alpha in table
|
| 228 |
+
if 'alpha' in df.columns and 'threshold_mean' in df.columns:
|
| 229 |
+
idx = (df['alpha'] - alpha).abs().idxmin()
|
| 230 |
+
return df.loc[idx, 'threshold_mean']
|
| 231 |
+
except Exception:
|
| 232 |
+
pass
|
| 233 |
+
|
| 234 |
+
# Use fallback
|
| 235 |
+
closest_alpha = min(FALLBACK_THRESHOLDS.keys(), key=lambda x: abs(x - alpha))
|
| 236 |
+
return FALLBACK_THRESHOLDS[closest_alpha]
|
| 237 |
+
|
| 238 |
+
|
| 239 |
def cmd_search(args):
|
| 240 |
"""Search for similar proteins using FAISS with conformal guarantees."""
|
| 241 |
import numpy as np
|
|
|
|
| 267 |
print(f"Querying for top {args.k} neighbors...")
|
| 268 |
D, I = query(index, query_embeddings, args.k)
|
| 269 |
|
| 270 |
+
# Determine threshold from --fdr, --fnr, or --threshold
|
| 271 |
+
threshold = None
|
| 272 |
+
if args.no_filter:
|
| 273 |
+
print("No filtering (--no-filter): returning all neighbors")
|
| 274 |
+
elif args.threshold:
|
| 275 |
+
threshold = args.threshold
|
| 276 |
+
print(f"Using manual threshold: {threshold}")
|
| 277 |
+
elif args.fnr:
|
| 278 |
+
# FNR threshold (TODO: add lookup table for FNR)
|
| 279 |
+
print(f"FNR control at α={args.fnr} (using approximate threshold)")
|
| 280 |
+
threshold = 0.9999 - args.fnr * 0.001 # Rough approximation
|
| 281 |
+
print(f" Threshold: {threshold}")
|
| 282 |
+
else:
|
| 283 |
+
# Default: FDR control
|
| 284 |
+
fdr_alpha = args.fdr if args.fdr else 0.1
|
| 285 |
+
threshold = _get_fdr_threshold(fdr_alpha)
|
| 286 |
+
print(f"FDR control at α={fdr_alpha} ({fdr_alpha*100:.0f}% FDR)")
|
| 287 |
+
print(f" Threshold: {threshold:.10f}")
|
| 288 |
|
| 289 |
# Build results
|
| 290 |
results = []
|
| 291 |
+
n_filtered = 0
|
| 292 |
for i in range(len(query_embeddings)):
|
| 293 |
for j in range(args.k):
|
| 294 |
sim = D[i, j]
|
|
|
|
| 296 |
# Skip placeholder results (FAISS returns -1 for non-existent neighbors)
|
| 297 |
if idx < 0:
|
| 298 |
continue
|
| 299 |
+
if threshold is not None and sim < threshold:
|
| 300 |
+
n_filtered += 1
|
| 301 |
continue
|
| 302 |
row = {
|
| 303 |
'query_idx': i,
|
|
|
|
| 311 |
|
| 312 |
results_df = pd.DataFrame(results)
|
| 313 |
results_df.to_csv(args.output, index=False)
|
| 314 |
+
|
| 315 |
+
# Summary
|
| 316 |
+
n_queries = len(query_embeddings)
|
| 317 |
+
n_with_hits = len(results_df['query_idx'].unique()) if len(results_df) > 0 else 0
|
| 318 |
+
print(f"\nResults:")
|
| 319 |
+
print(f" Queries: {n_queries}")
|
| 320 |
+
print(f" Queries with confident hits: {n_with_hits} ({n_with_hits/n_queries*100:.1f}%)")
|
| 321 |
+
print(f" Total hits: {len(results_df)}")
|
| 322 |
+
if threshold:
|
| 323 |
+
print(f" Filtered out: {n_filtered} below threshold")
|
| 324 |
+
print(f"Saved to {args.output}")
|
| 325 |
+
|
| 326 |
+
|
| 327 |
+def cmd_find(args):
+    """One-step search: FASTA → embeddings → search → results with probabilities."""
+    import numpy as np
+    import pandas as pd
+    import tempfile
+    from Bio import SeqIO
+    import torch
+    from protein_conformal.util import load_database, query, simplifed_venn_abers_prediction, get_sims_labels
+
+    device = torch.device('cuda' if torch.cuda.is_available() and not args.cpu else 'cpu')
+    print(f"=== CPR Find: FASTA to Annotated Results ===")
+    print(f"Device: {device}")
+    print(f"Model: {args.model}")
+    print(f"FDR level: {args.fdr*100:.0f}%")
+    print()
+
+    # Step 1: Read sequences
+    print(f"[1/5] Reading sequences from {args.input}...")
+    sequences = []
+    sequence_names = []
+    for record in SeqIO.parse(args.input, "fasta"):
+        sequences.append(str(record.seq))
+        sequence_names.append(record.id)
+    print(f" Found {len(sequences)} sequences")
+
+    # Step 2: Embed sequences
+    print(f"\n[2/5] Computing embeddings with {args.model}...")
+    if args.model == 'protein-vec':
+        embeddings = _embed_protein_vec(sequences, device, args)
+    elif args.model == 'clean':
+        embeddings = _embed_clean(sequences, device, args)
+    else:
+        print(f"Unknown model: {args.model}")
+        sys.exit(1)
+    print(f" Embeddings shape: {embeddings.shape}")
+
+    # Step 3: Load database
+    repo_root = Path(__file__).parent.parent
+    db_path = args.database if args.database else repo_root / "data" / "lookup_embeddings.npy"
+    meta_path = args.database_meta if args.database_meta else repo_root / "data" / "lookup_embeddings_meta_data.tsv"
+
+    print(f"\n[3/5] Loading database from {db_path}...")
+    db_embeddings = np.load(db_path)
+    print(f" Database size: {len(db_embeddings)} proteins")
+
+    if Path(meta_path).exists():
+        if str(meta_path).endswith('.tsv'):
+            db_meta = pd.read_csv(meta_path, sep='\t')
+        else:
+            db_meta = pd.read_csv(meta_path)
+    else:
+        db_meta = None
+        print(" Warning: No metadata file found")
+
+    # Determine k (10% of database, min 100, max 10000)
+    k = min(max(100, len(db_embeddings) // 10), 10000)
+    print(f" Using k={k} neighbors ({k/len(db_embeddings)*100:.1f}% of database)")
+
+    # Step 4: Search
+    print(f"\n[4/5] Searching...")
+    index = load_database(db_embeddings)
+    D, I = query(index, embeddings, k)
+
+    # Get threshold
+    threshold = _get_fdr_threshold(args.fdr)
+    print(f" FDR threshold (α={args.fdr}): {threshold:.10f}")
+
+    # Step 5: Build results with probabilities
+    print(f"\n[5/5] Building results...")
+
+    # Load calibration data for probabilities
+    cal_path = args.calibration if args.calibration else repo_root / "data" / "pfam_new_proteins.npy"
+    if Path(cal_path).exists():
+        cal_data = np.load(cal_path, allow_pickle=True)
+        np.random.seed(42)
+        np.random.shuffle(cal_data)
+        cal_subset = cal_data[:100]
+        X_cal, y_cal = get_sims_labels(cal_subset, partial=False)
+        X_cal = X_cal.flatten()
+        y_cal = y_cal.flatten()
+        compute_probs = True
+    else:
+        compute_probs = False
+        print(" Warning: No calibration data, skipping probability computation")
+
+    results = []
+    for i in range(len(embeddings)):
+        for j in range(k):
+            sim = D[i, j]
+            idx = I[i, j]
+            if idx < 0 or sim < threshold:
+                continue
+
+            row = {
+                'query_name': sequence_names[i],
+                'query_idx': i,
+                'match_idx': idx,
+                'similarity': sim,
+            }
+
+            # Add probability if calibration available
+            if compute_probs:
+                p0, p1 = simplifed_venn_abers_prediction(X_cal, y_cal, sim)
+                row['probability'] = (p0 + p1) / 2
+                row['uncertainty'] = abs(p1 - p0)
+
+            # Add metadata
+            if db_meta is not None and idx < len(db_meta):
+                for col in db_meta.columns[:5]:
+                    row[f'match_{col}'] = db_meta.iloc[idx][col]
+
+            results.append(row)
+
+    results_df = pd.DataFrame(results)
+    results_df.to_csv(args.output, index=False)
+
+    # Summary
+    n_queries = len(sequences)
+    n_with_hits = len(results_df['query_idx'].unique()) if len(results_df) > 0 else 0
+    print(f"\n=== Results ===")
+    print(f"Queries: {n_queries}")
+    print(f"Queries with confident hits: {n_with_hits} ({n_with_hits/n_queries*100:.1f}%)")
+    print(f"Total confident hits: {len(results_df)}")
+    print(f"Output: {args.output}")
 
 
 def cmd_verify(args):
...
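The default-`k` heuristic in `cmd_find` (10% of the database, clamped between 100 and 10000) is compact enough to check in isolation:

```python
def default_k(db_size: int) -> int:
    """10% of the database, clamped to the range [100, 10000]."""
    return min(max(100, db_size // 10), 10000)

# Databases under 1000 entries hit the floor of 100; above 100000 entries
# the cap of 10000 applies. For databases with fewer than 100 entries, FAISS
# pads the missing neighbors with -1, which the result loop already skips.
```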
     )
     subparsers = parser.add_subparsers(dest='command', help='Available commands')
 
+    # find command (one-step: FASTA → results)
+    p_find = subparsers.add_parser('find',
+                                   help='One-step search: FASTA → embed → search → annotated results',
+                                   description='The easiest way to use CPR. Give it a FASTA file and get annotated results.')
+    p_find.add_argument('--input', '-i', required=True, help='Input FASTA file with protein sequences')
+    p_find.add_argument('--output', '-o', required=True, help='Output CSV with annotated hits')
+    p_find.add_argument('--model', '-m', default='protein-vec',
+                        choices=['protein-vec', 'clean'],
+                        help='Embedding model (default: protein-vec)')
+    p_find.add_argument('--fdr', type=float, default=0.1,
+                        help='False discovery rate level (default: 0.1 = 10%% FDR)')
+    p_find.add_argument('--database', '-d',
+                        help='Database embeddings (default: data/lookup_embeddings.npy)')
+    p_find.add_argument('--database-meta',
+                        help='Database metadata (default: data/lookup_embeddings_meta_data.tsv)')
+    p_find.add_argument('--calibration', '-c',
+                        help='Calibration data for probabilities (default: data/pfam_new_proteins.npy)')
+    p_find.add_argument('--cpu', action='store_true', help='Force CPU even if GPU available')
+    p_find.add_argument('--clean-model', default='split100',
+                        help='CLEAN model variant (default: split100)')
+    p_find.set_defaults(func=cmd_find)
+
     # embed command
+    p_embed = subparsers.add_parser('embed', help='Embed protein sequences (step 1 of manual workflow)')
     p_embed.add_argument('--input', '-i', required=True, help='Input FASTA file')
     p_embed.add_argument('--output', '-o', required=True, help='Output .npy file for embeddings')
     p_embed.add_argument('--model', '-m', default='protein-vec',
...
     p_search.add_argument('--database', '-d', required=True, help='Database embeddings (.npy)')
     p_search.add_argument('--database-meta', '-m', help='Database metadata (.tsv or .csv)')
     p_search.add_argument('--output', '-o', required=True, help='Output results (.csv)')
+    p_search.add_argument('--k', type=int, default=100,
+                          help='Max neighbors per query (default: 100)')
+    # FDR/FNR control options
+    p_search.add_argument('--fdr', type=float, default=0.1,
+                          help='False discovery rate level (default: 0.1 = 10%% FDR). '
+                               'Automatically looks up threshold from results/fdr_thresholds.csv')
+    p_search.add_argument('--fnr', type=float,
+                          help='False negative rate level (alternative to --fdr). '
+                               'Use this when you want to control missed true matches.')
+    p_search.add_argument('--threshold', '-t', type=float,
+                          help='Manual similarity threshold (overrides --fdr/--fnr). '
+                               'Use this if you have a custom threshold.')
+    p_search.add_argument('--no-filter', action='store_true',
+                          help='Return all neighbors without filtering (for exploration)')
     p_search.set_defaults(func=cmd_search)
 
     # verify command
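Read together, the four `p_search` filtering options imply a precedence: `--no-filter` wins, then an explicit `--threshold`, then `--fnr`, then the `--fdr` default. A hedged sketch of that resolution order (`cmd_search`'s actual logic is only partly visible in this diff, so treat this as an assumption drawn from the help strings):

```python
from types import SimpleNamespace

def resolve_threshold(args, fdr_lookup, fnr_lookup):
    """Resolve the similarity cutoff implied by the p_search flags.

    Precedence assumed from the help text: --no-filter, then --threshold,
    then --fnr, then the --fdr default. This mirrors, not copies, cmd_search.
    """
    if args.no_filter:
        return None                  # keep every neighbor
    if args.threshold is not None:
        return args.threshold        # manual similarity cutoff
    if args.fnr is not None:
        return fnr_lookup(args.fnr)  # control missed true matches
    return fdr_lookup(args.fdr)      # default: FDR control

# Example: a manual threshold wins over the FDR default.
args = SimpleNamespace(no_filter=False, threshold=0.95, fnr=None, fdr=0.1)
cutoff = resolve_threshold(args, fdr_lookup=lambda a: 0.999, fnr_lookup=lambda a: 0.8)
# cutoff == 0.95
```

Because `--fdr` carries a default of 0.1, FDR control is what users get when they pass no filtering flag at all, matching the "improved search defaults" described in the commit message.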