ronboger Claude Opus 4.5 committed on
Commit 174c120 · 1 Parent(s): 0a03591

docs: clarify cpr search accepts both FASTA and embeddings

- Update examples to show --input flag (not --query)
- Add examples for --fdr, --fnr, --threshold, --no-filter options
- Simplify workflow example

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
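
The commit message says `cpr search` takes either a FASTA file or pre-computed embeddings through the same `--input` flag, but the diff does not show how the CLI tells the two apart. One plausible approach is to dispatch on the file extension. The sketch below is purely illustrative: `detect_input_kind` is a hypothetical helper, the extension list is an assumption, and none of it is taken from the `cpr` source.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of cpr: guess the input kind from the
# file extension. The extension list below is an assumption.
detect_input_kind() {
  local path lower
  path="$1"
  # Lowercase the path so that .FASTA and .fasta are treated alike.
  lower=$(printf '%s' "$path" | tr '[:upper:]' '[:lower:]')
  case "$lower" in
    *.fasta|*.fa|*.faa) echo "fasta" ;;
    *.npy)              echo "embeddings" ;;
    *)                  echo "unknown" ;;
  esac
}

# Example: report the detected kind for a few file names.
for f in sequences.fasta embeddings.npy notes.txt; do
  echo "$f -> $(detect_input_kind "$f")"
done
```

A real implementation might instead inspect the file contents (for example, a leading `>` in FASTA), which is more robust than trusting the extension.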

Files changed (1)
  1. README.md +22 -29
README.md CHANGED
@@ -57,15 +57,23 @@ cpr embed --input sequences.fasta --output embeddings.npy --model clean
 
  ### 2. Search for similar proteins with conformal guarantees
 
+ The `cpr search` command accepts **both FASTA files and pre-computed embeddings**:
+
  ```bash
- # Search with FDR control at α=0.1 (threshold λ ≈ 0.99998 for Protein-Vec)
- cpr search \
- --query query_embeddings.npy \
- --database data/lookup_embeddings.npy \
- --database-meta data/lookup_embeddings_meta_data.tsv \
- --output results.csv \
- --k 1000 \
- --threshold 0.99998
+ # From FASTA file (auto-embeds with Protein-Vec)
+ cpr search --input sequences.fasta --output results.csv --fdr 0.1
+
+ # From pre-computed embeddings
+ cpr search --input embeddings.npy --output results.csv --fdr 0.1
+
+ # With FNR control instead of FDR
+ cpr search --input sequences.fasta --output results.csv --fnr 0.1
+
+ # With explicit threshold
+ cpr search --input sequences.fasta --output results.csv --threshold 0.99998
+
+ # Exploratory mode (no filtering, return all k neighbors)
+ cpr search --input sequences.fasta --output results.csv --no-filter
  ```
 
  ### 3. Convert similarity scores to calibrated probabilities
@@ -141,27 +149,12 @@ To calibrate FDR/FNR thresholds for your own protein search tasks:
  Here's a full example searching viral domains against the Pfam database with FDR control:
 
  ```bash
- # Step 1: Embed query sequences
- cpr embed \
- --input viral_domains.fasta \
- --output viral_embeddings.npy \
- --model protein-vec
-
- # Step 2: Search with FDR α=0.1 (λ ≈ 0.99998 from calibration)
- cpr search \
- --query viral_embeddings.npy \
- --database data/lookup_embeddings.npy \
- --database-meta data/lookup_embeddings_meta_data.tsv \
- --output viral_hits.csv \
- --k 1000 \
- --threshold 0.99998
-
- # Step 3: Add calibrated probabilities for each hit
- cpr prob \
- --input viral_hits.csv \
- --calibration data/pfam_new_proteins.npy \
- --output viral_hits_with_probs.csv \
- --n-calib 1000
+ # Option A: One-step search from FASTA (embeds automatically)
+ cpr search --input viral_domains.fasta --output viral_hits.csv --fdr 0.1
+
+ # Option B: Two-step with explicit embedding
+ cpr embed --input viral_domains.fasta --output viral_embeddings.npy
+ cpr search --input viral_embeddings.npy --output viral_hits.csv --fdr 0.1
  ```
 
  The output CSV will contain: