docs: clarify cpr search accepts both FASTA and embeddings
- Update examples to show --input flag (not --query)
- Add examples for --fdr, --fnr, --threshold, --no-filter options
- Simplify workflow example
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
README.md
CHANGED
@@ -57,15 +57,23 @@ cpr embed --input sequences.fasta --output embeddings.npy --model clean
 
 ### 2. Search for similar proteins with conformal guarantees
 
+The `cpr search` command accepts **both FASTA files and pre-computed embeddings**:
+
 ```bash
-#
-cpr search
-
-
-
-
-
-
+# From FASTA file (auto-embeds with Protein-Vec)
+cpr search --input sequences.fasta --output results.csv --fdr 0.1
+
+# From pre-computed embeddings
+cpr search --input embeddings.npy --output results.csv --fdr 0.1
+
+# With FNR control instead of FDR
+cpr search --input sequences.fasta --output results.csv --fnr 0.1
+
+# With explicit threshold
+cpr search --input sequences.fasta --output results.csv --threshold 0.99998
+
+# Exploratory mode (no filtering, return all k neighbors)
+cpr search --input sequences.fasta --output results.csv --no-filter
 ```
 
 ### 3. Convert similarity scores to calibrated probabilities
@@ -141,27 +149,12 @@ To calibrate FDR/FNR thresholds for your own protein search tasks:
 Here's a full example searching viral domains against the Pfam database with FDR control:
 
 ```bash
-#
-cpr
-
-
-
-
-# Step 2: Search with FDR α=0.1 (λ ≈ 0.99998 from calibration)
-cpr search \
-  --query viral_embeddings.npy \
-  --database data/lookup_embeddings.npy \
-  --database-meta data/lookup_embeddings_meta_data.tsv \
-  --output viral_hits.csv \
-  --k 1000 \
-  --threshold 0.99998
-
-# Step 3: Add calibrated probabilities for each hit
-cpr prob \
-  --input viral_hits.csv \
-  --calibration data/pfam_new_proteins.npy \
-  --output viral_hits_with_probs.csv \
-  --n-calib 1000
+# Option A: One-step search from FASTA (embeds automatically)
+cpr search --input viral_domains.fasta --output viral_hits.csv --fdr 0.1
+
+# Option B: Two-step with explicit embedding
+cpr embed --input viral_domains.fasta --output viral_embeddings.npy
+cpr search --input viral_embeddings.npy --output viral_hits.csv --fdr 0.1
 ```
 
 The output CSV will contain: