ronboger (with Claude Opus 4.5) committed dd5ecfc · 1 parent: 68a03a2

chore: clean up redundant scripts and consolidate thresholds


- Archive 16 redundant/one-off scripts to scripts/archive/
- Archive 3 duplicate Python files from notebooks/pfam/
- Remove "simple" CSV files, add full threshold tables to GETTING_STARTED.md
- Keep 4 clean SLURM scripts: build, embed, fdr, fnr
- Update .gitignore for archive directories

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

.gitignore CHANGED
@@ -211,5 +211,7 @@ test_clean_output/
 protein_vec_models.gz
 CLEAN_repo/
 
-# Archived legacy notebooks (moved from root dirs)
+# Archived legacy code (redundant/one-off scripts)
 notebooks_archive/
+scripts/archive/
+notebooks/*/archive/
GETTING_STARTED.md CHANGED
@@ -165,20 +165,49 @@ cpr search --input data/gene_unknown/unknown_aa_seqs.npy \
 
 ## FDR/FNR Threshold Reference
 
-These thresholds control the trade-off between hits and false positives:
-
-| α Level | FDR Threshold (λ) | FNR Threshold (λ) | Use Case |
-|---------|-------------------|-------------------|----------|
-| 0.001 | (see csv) | 0.9997904 | Ultra-stringent |
-| 0.01 | (see csv) | 0.9998495 | Very stringent |
-| 0.05 | (see csv) | 0.9998899 | Stringent |
-| **0.1** | **0.9999802** | **0.9999076** | **Paper default** |
-| 0.15 | (see csv) | 0.9999174 | Relaxed |
-| 0.2 | (see csv) | 0.9999245 | Discovery-focused |
-
-**Note**: FDR threshold at α=0.1 is the paper-verified value. Other FDR values require running `scripts/compute_fdr_table.py`.
-
-Full computed tables in `results/fdr_thresholds.csv` and `results/fnr_thresholds.csv`.
+These thresholds control the trade-off between hits and false positives.
+
+### FDR Thresholds (False Discovery Rate)
+
+Controls the expected fraction of hits that are false positives.
+
+| α Level | Threshold (λ) | Std Dev | Use Case |
+|---------|---------------|---------|----------|
+| **0.1** | **0.9999801** | ±1.7e-06 | **Paper default** |
+
+**Note**: FDR threshold at α=0.1 is verified against the paper (0.9999802). Additional alpha levels can be computed with `scripts/compute_fdr_table.py`.
+
+### FNR Thresholds (False Negative Rate) - Exact Match
+
+Controls the expected fraction of true matches you miss. "Exact match" requires all Pfam domains to match.
+
+| α Level | Threshold (λ) | Std Dev | Use Case |
+|---------|---------------|---------|----------|
+| 0.001 | 0.9997904 | ±2.3e-05 | Ultra-stringent |
+| 0.005 | 0.9998338 | ±8.2e-06 | Very stringent |
+| 0.01 | 0.9998495 | ±5.5e-06 | Stringent |
+| 0.02 | 0.9998679 | ±5.1e-06 | Moderate |
+| 0.05 | 0.9998899 | ±3.3e-06 | Balanced |
+| **0.1** | **0.9999076** | ±2.2e-06 | **Recommended** |
+| 0.15 | 0.9999174 | ±1.4e-06 | Relaxed |
+| 0.2 | 0.9999245 | ±1.3e-06 | Discovery-focused |
+
+### FNR Thresholds - Partial Match
+
+"Partial match" requires at least one Pfam domain to match (more permissive).
+
+| α Level | Threshold (λ) | Std Dev | Use Case |
+|---------|---------------|---------|----------|
+| 0.001 | 0.9997646 | ±1.5e-06 | Ultra-stringent |
+| 0.005 | 0.9997821 | ±2.8e-06 | Very stringent |
+| 0.01 | 0.9997946 | ±3.1e-06 | Stringent |
+| 0.02 | 0.9998108 | ±3.5e-06 | Moderate |
+| 0.05 | 0.9998389 | ±3.0e-06 | Balanced |
+| **0.1** | **0.9998626** | ±2.8e-06 | **Recommended** |
+| 0.15 | 0.9998779 | ±2.2e-06 | Relaxed |
+| 0.2 | 0.9998903 | ±2.1e-06 | Discovery-focused |
+
+Full computed tables with min/max values in `results/fdr_thresholds.csv`, `results/fnr_thresholds.csv`, and `results/fnr_thresholds_partial.csv`.
 
 ---
 
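The thresholds documented in this commit are plain similarity cutoffs, so applying one is a single comparison. A minimal sketch (not part of this commit; `filter_hits` is a hypothetical helper, and the constants are copied from the α=0.1 rows of the tables above):

```python
import numpy as np

# Paper-default cutoffs copied from the alpha = 0.1 rows of the tables above.
FDR_LAMBDA = 0.9999801        # FDR control (exact match)
FNR_LAMBDA_EXACT = 0.9999076  # FNR control (exact match)

def filter_hits(similarities: np.ndarray, lam: float) -> np.ndarray:
    """Indices of query/lookup pairs whose similarity clears the cutoff."""
    return np.flatnonzero(similarities >= lam)

# Toy similarity scores for three candidate hits.
sims = np.array([0.99999, 0.99990, 0.99998])
print(filter_hits(sims, FDR_LAMBDA))        # only the first score clears the FDR cutoff
print(filter_hits(sims, FNR_LAMBDA_EXACT))  # the looser FNR cutoff also keeps the third
```

The FNR cutoff is deliberately lower than the FDR cutoff: missing fewer true matches means accepting more candidates.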
notebooks/pfam/generate_fdr.py DELETED
@@ -1,64 +0,0 @@
-import datetime
-import numpy as np
-import pandas as pd
-import argparse
-from tqdm import tqdm
-from protein_conformal.util import *
-
-def main():
-    parser = argparse.ArgumentParser(description='Process some integers.')
-    parser.add_argument('--alpha', type=float, default=0.1, help='Alpha value for the algorithm')
-    parser.add_argument('--partial', type=bool, default=False, help='Partial hits')
-    parser.add_argument('--num_trials', type=int, default=100, help='Number of trials to run')
-    parser.add_argument('--n_calib', type=int, default=1000, help='Number of calibration data points')
-    parser.add_argument('--delta', type=float, default=0.5, help='Delta value for the algorithm')
-    parser.add_argument('--output', type=str, default='/data/ron/protein-conformal/data/pfam_fdr.npy', help='Output file for the results')
-    parser.add_argument('--add_date', type=bool, default=True, help='Add date to output file name')
-    args = parser.parse_args()
-    alpha = args.alpha
-    num_trials = args.num_trials
-    n_calib = args.n_calib
-    delta = args.delta
-    partial = args.partial
-    # Load the data
-    # data = np.load('/data/ron/protein-conformal/data/conformal_pfam_with_lookup_dataset.npy', allow_pickle=True)
-    data = np.load('/data/ron/protein-conformal/data/pfam_new_proteins.npy', allow_pickle=True)
-
-    risks = []
-    tprs = []
-    lhats = []
-    fdr_cals = []
-    # alpha = 0.1
-    # num_trials = 100
-    # n_calib = 1000
-    for trial in tqdm(range(num_trials)):
-        np.random.shuffle(data)
-        cal_data = data[:n_calib]
-        test_data = data[n_calib:]
-        X_cal, y_cal = get_sims_labels(cal_data, partial=partial)
-        X_test, y_test_exact = get_sims_labels(test_data, partial=partial)
-        # sims, labels = get_sims_labels(cal_data, partial=False)
-        lhat, fdr_cal = get_thresh_FDR(y_cal, X_cal, alpha, delta, N=100)
-        lhats.append(lhat)
-        fdr_cals.append(fdr_cal)
-        # print(X_test.shape)
-        # print(y_test_exact.shape)
-        risks.append(risk(X_test, y_test_exact, lhat))
-        tprs.append(calculate_true_positives(X_test, y_test_exact, lhat))
-
-    print("Risk: ", np.mean(risks))
-    print("TPR: ", np.mean(tprs))
-    print("Lhat: ", np.mean(lhats))
-    print("FDR Cal: ", np.mean(fdr_cals))
-
-    output_file = args.output + ('_' + str(datetime.datetime.now().date()) if args.add_date else '' + '.npy')
-
-    np.save(output_file,
-            {'risks': risks,
-             'tprs': tprs,
-             'lhats': lhats,
-             'fdr_cals': fdr_cals})
-
-if __name__ == "__main__":
-    # add code for command line arguments
-    main()
notebooks/pfam/generate_fnr.py DELETED
@@ -1,22 +0,0 @@
-import numpy as np
-import pandas as pd
-import argparse
-from tqdm import tqdm
-from protein_conformal.util import *
-
-def main():
-    parser = argparse.ArgumentParser()
-    parser.add_argument('--alpha', type=float, default=0.1, help='Alpha value for the algorithm')
-    parser.add_argument('--num_trials', type=int, default=100, help='Number of trials to run')
-    parser.add_argument('--n_calib', type=int, default=1000, help='Number of calibration data points')
-    args = parser.parse_args()
-    alpha = args.alpha
-    num_trials = args.num_trials
-    n_calib = args.n_calib
-
-    # Load the data
-    data = np.load('/data/ron/protein-conformal/data/conformal_pfam_with_lookup_dataset.npy', allow_pickle=True)
-
-if __name__ == "__main__":
-    # add code for command line arguments
-    main()
notebooks/pfam/sva_results.py DELETED
@@ -1,59 +0,0 @@
-
-import numpy as np
-import pandas as pd
-import argparse
-from tqdm import tqdm
-from protein_conformal.util import *
-import datetime
-
-def run_trial(data, n_calib, args):
-    np.random.shuffle(data)
-    cal_data = data[:n_calib]
-    test_data = data[n_calib:3*n_calib]
-    X_cal, y_cal = get_sims_labels(cal_data, partial=False)
-    X_test, y_test_exact = get_sims_labels(test_data, partial=False)
-    # flatten the data
-    X_cal = X_cal.flatten()
-    y_cal = y_cal.flatten()
-    X_test = X_test.flatten()
-    y_test_exact = y_test_exact.flatten()
-
-    # generate random indices in the test set
-    i = np.random.randint(0, len(X_test))
-    # i_s = np.random.randint(0, len(X_test), int(len(X_test) * args.percent_sva_test))
-
-    p_0, p_1 = simplifed_venn_abers_prediction(X_cal, y_cal, X_test[i])
-    result = (np.mean([p_0, p_1]), X_test[i], y_test_exact[i])
-    return result
-
-def main(args):
-    data = np.load(args.input, allow_pickle=True)
-    n_calib = args.n_calib  # Number of calibration data points
-
-    sva_results = []
-    for trial in tqdm(range(args.num_trials)):
-        # print(f'Running trial {i+1} of {args.num_trials}')
-        sva_results.append(run_trial(data, n_calib, args))
-
-    df_sva = pd.DataFrame(sva_results, columns=['p', 'x', 'y'])
-    output_file = args.output + ('_' + str(datetime.datetime.now().date()) if args.add_date else '') + '.csv'
-    print(f'Saving results to {output_file}')
-    df_sva.to_csv(output_file, index=False)
-    # make bins for p
-    df_sva['p_bin'] = pd.cut(df_sva['p'], bins=10)
-    print(df_sva.groupby('p_bin')['y'].mean())
-
-if __name__ == "__main__":
-    # add code for command line arguments
-    parser = argparse.ArgumentParser()
-    parser.add_argument('--input', type=str, default='/data/ron/protein-conformal/data/conformal_pfam_with_lookup_dataset.npy', help='Input file for the data')
-    # '/data/ron/protein-conformal/data/conformal_pfam_with_lookup_dataset.npy'
-    # parser.add_argument('--percent_sva_test', type=float, default=.1, help='percent of data to use for SVA testing')
-    # parser.add_argument('--alpha', type=float, default=0.1, help='Alpha value for the algorithm')
-    parser.add_argument('--num_trials', type=int, default=100, help='Number of trials to run')
-    parser.add_argument('--n_calib', type=int, default=50, help='Number of calibration data points')
-    # parser.add_argument('--delta', type=float, default=0.5, help='Delta value for the algorithm')
-    parser.add_argument('--output', type=str, default='/data/ron/protein-conformal/data/sva_results', help='Output file for the results')
-    parser.add_argument('--add_date', type=bool, default=True, help='Add date to output file name')
-    args = parser.parse_args()
-    main(args)
results/fdr_thresholds.csv CHANGED
@@ -1 +1,2 @@
-alpha,lambda_threshold,exact_fdr,partial_fdr
+alpha,threshold_mean,threshold_std,threshold_min,threshold_max,empirical_fdr_mean,empirical_fdr_std
+0.1,0.99998005881454,1.7455746588230029e-06,0.9999783761573561,0.9999823510044753,0.08709266350751729,0.011992918257283684
scripts/embed_fasta.sh DELETED
@@ -1,63 +0,0 @@
-#!/bin/bash
-
-# Set the CUDA device
-CUDA_DEVICE=3
-
-# Specify the directory with all fasta files (default is current directory), assuming
-# all protein fasta files are named *_prot.fasta
-INPUT_DIR="${1:-.}"
-
-# Create the "emb" subfolder if it doesn't exist
-EMB_DIR="$INPUT_DIR/emb"
-mkdir -p "$EMB_DIR"
-
-# Loop through all *_prot.fasta files, embed
-for fasta_file in "$INPUT_DIR"/*_prot.fasta; do
-    # Ensure the file exists
-    if [[ -f "$fasta_file" ]]; then
-        # Extract base filename without extension
-        base_name=$(basename "$fasta_file" "_prot.fasta")
-
-        # Set output file name inside emb/ folder
-        output_file="$EMB_DIR/${base_name}_emb.npy"
-
-        # Run embedding command
-        echo "Processing: $fasta_file -> $output_file"
-        cd /home/yangk/proteins/protein-vec/src_run
-        CUDA_VISIBLE_DEVICES=$CUDA_DEVICE python embed_seqs.py --input_file "$fasta_file" --output_file "$output_file"
-    fi
-done
-
-echo "All files processed. Embeddings saved in $EMB_DIR."
-
-#!/bin/bash
-
-# Set the CUDA device
-CUDA_DEVICE=3
-
-# Specify the directory with all fasta files (default is current directory), assuming
-# all protein fasta files are named *_prot.fasta
-INPUT_DIR="${1:-.}"
-
-# Create the "emb" subfolder if it doesn't exist
-EMB_DIR="$INPUT_DIR/emb"
-mkdir -p "$EMB_DIR"
-
-# Loop through all *_prot.fasta files, embed
-for fasta_file in "$INPUT_DIR"/*_prot.fasta; do
-    # Ensure the file exists
-    if [[ -f "$fasta_file" ]]; then
-        # Extract base filename without extension
-        base_name=$(basename "$fasta_file" "_prot.fasta")
-
-        # Set output file name inside emb/ folder
-        output_file="$EMB_DIR/${base_name}_emb.npy"
-
-        # Run embedding command
-        echo "Processing: $fasta_file -> $output_file"
-        cd /home/yangk/proteins/protein-vec/src_run
-        CUDA_VISIBLE_DEVICES=$CUDA_DEVICE python embed_seqs.py --input_file "$fasta_file" --output_file "$output_file"
-    fi
-done
-
-echo "All files processed. Embeddings saved in $EMB_DIR."
scripts/investigate_fdr.py DELETED
@@ -1,106 +0,0 @@
-#!/usr/bin/env python
-"""
-Investigate FDR calibration discrepancy between datasets.
-Checks for data leakage and compares calibration results.
-"""
-
-import numpy as np
-import sys
-sys.path.insert(0, '.')
-from protein_conformal.util import get_sims_labels, get_thresh_FDR
-
-print("=" * 60)
-print("FDR Calibration Dataset Investigation")
-print("=" * 60)
-print()
-
-# Load both calibration datasets
-print("Loading datasets...")
-pfam_new = np.load('data/pfam_new_proteins.npy', allow_pickle=True)
-backup_data = np.load('/groups/doudna/projects/ronb/conformal_backup/protein-conformal/data/conformal_pfam_with_lookup_dataset.npy', allow_pickle=True)
-
-print(f'pfam_new_proteins.npy: {len(pfam_new)} samples')
-print(f'backup dataset: {len(backup_data)} samples')
-print()
-
-# Check for overlap (potential leakage)
-print("Checking for overlap between datasets...")
-# Meta is an array, so convert to tuple for hashing
-pfam_metas = set(tuple(d['meta'].tolist()) if hasattr(d['meta'], 'tolist') else (d['meta'],) for d in pfam_new)
-backup_metas = set(tuple(d['meta'].tolist()) if hasattr(d['meta'], 'tolist') else (d['meta'],) for d in backup_data)
-overlap = pfam_metas & backup_metas
-print(f" Unique query sets in pfam_new: {len(pfam_metas)}")
-print(f" Unique query sets in backup: {len(backup_metas)}")
-print(f" Overlap: {len(overlap)} ({len(overlap)/len(pfam_metas)*100:.1f}% of pfam_new)")
-print()
-
-# Compare similarity distributions
-print("Similarity score distributions:")
-sims_new, labels_new = get_sims_labels(pfam_new[:500], partial=False)
-sims_backup, labels_backup = get_sims_labels(backup_data[:500], partial=False)
-
-print(f" pfam_new (500 samples):")
-print(f" Similarity: min={sims_new.min():.6f}, max={sims_new.max():.6f}, mean={sims_new.mean():.6f}")
-print(f" Labels: {labels_new.sum()}/{labels_new.size} positive ({labels_new.mean()*100:.1f}%)")
-print()
-print(f" backup (500 samples):")
-print(f" Similarity: min={sims_backup.min():.6f}, max={sims_backup.max():.6f}, mean={sims_backup.mean():.6f}")
-print(f" Labels: {labels_backup.sum()}/{labels_backup.size} positive ({labels_backup.mean()*100:.1f}%)")
-print()
-
-# Run FDR calibration on both with same parameters
-print("Running FDR calibration (alpha=0.1, n_calib=1000, 10 trials)...")
-print()
-
-def run_fdr_trials(data, name, n_trials=10, n_calib=1000):
-    lhats = []
-    risks = []
-    tprs = []
-
-    for trial in range(n_trials):
-        np.random.seed(42 + trial)
-        np.random.shuffle(data)
-        cal_data = data[:n_calib]
-        test_data = data[n_calib:n_calib+500]
-
-        X_cal, y_cal = get_sims_labels(cal_data, partial=False)
-        X_test, y_test = get_sims_labels(test_data, partial=False)
-
-        lhat, fdr_cal = get_thresh_FDR(y_cal, X_cal, alpha=0.1, delta=0.5, N=100)
-        lhats.append(lhat)
-
-        # Calculate test risk and TPR
-        preds = (X_test >= lhat).astype(int)
-        tp = np.sum((preds == 1) & (y_test == 1))
-        fp = np.sum((preds == 1) & (y_test == 0))
-        fn = np.sum((preds == 0) & (y_test == 1))
-
-        tpr = tp / (tp + fn) if (tp + fn) > 0 else 0
-        risk = fp / (fp + tp) if (fp + tp) > 0 else 0
-
-        tprs.append(tpr)
-        risks.append(risk)
-
-    print(f"{name}:")
-    print(f" λ (threshold): {np.mean(lhats):.10f} ± {np.std(lhats):.10f}")
-    print(f" Risk (FDR): {np.mean(risks):.4f} ± {np.std(risks):.4f}")
-    print(f" TPR: {np.mean(tprs)*100:.1f}% ± {np.std(tprs)*100:.1f}%")
-    print()
-    return lhats, risks, tprs
-
-lhats_new, risks_new, tprs_new = run_fdr_trials(pfam_new.copy(), "pfam_new_proteins")
-lhats_backup, risks_backup, tprs_backup = run_fdr_trials(backup_data.copy(), "backup_dataset")
-
-print("=" * 60)
-print("CONCLUSION")
-print("=" * 60)
-if abs(np.mean(lhats_new) - np.mean(lhats_backup)) < 0.00001:
-    print("✓ Thresholds are similar - datasets likely compatible")
-else:
-    print("⚠ Thresholds differ significantly!")
-    print(f" Difference: {abs(np.mean(lhats_new) - np.mean(lhats_backup)):.10f}")
-
-if len(overlap) > len(pfam_metas) * 0.5:
-    print("⚠ High overlap between datasets - potential data source")
-else:
-    print("✓ Low overlap - datasets appear independent")
scripts/precompute_fdr_thresholds.sh DELETED
@@ -1,96 +0,0 @@
-#!/bin/bash
-
-# Get the directory where this script is located
-SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
-PROJECT_ROOT="$( cd "$SCRIPT_DIR/.." &> /dev/null && pwd )"
-
-# Parameters
-MIN_ALPHA=0.01
-MAX_ALPHA=1.0
-NUM_ALPHA_VALUES=100
-NUM_TRIALS=100
-N_CALIB=1000
-DELTA=0.5
-OUTPUT_DIR="$PROJECT_ROOT/results"
-TEMP_DIR="$SCRIPT_DIR/temp_fdr_results"
-CSV_OUTPUT="$OUTPUT_DIR/fdr_thresholds.csv"
-
-echo "Script directory: $SCRIPT_DIR"
-echo "Project root: $PROJECT_ROOT"
-echo "Output directory: $OUTPUT_DIR"
-
-
-mkdir -p "$OUTPUT_DIR"
-mkdir -p "$TEMP_DIR"
-
-# Initialize CSV file with header
-echo "alpha,lambda_threshold,exact_fdr,partial_fdr" > "$CSV_OUTPUT"
-
-# Generate alpha values using Python
-ALPHA_VALUES=$(python -c "
-import numpy as np
-alphas = np.linspace($MIN_ALPHA, $MAX_ALPHA, $NUM_ALPHA_VALUES)
-print(' '.join([str(a) for a in alphas]))
-")
-
-# Counter for progress tracking
-counter=0
-total=$NUM_ALPHA_VALUES
-
-# Loop over alpha values
-for alpha in $ALPHA_VALUES; do
-    counter=$((counter + 1))
-
-    # Run FDR generation for exact matches
-    python "$PROJECT_ROOT/pfam/generate_fdr.py" \
-        --alpha "$alpha" \
-        --partial false \
-        --num_trials "$NUM_TRIALS" \
-        --n_calib "$N_CALIB" \
-        --delta "$DELTA" \
-        --output "$TEMP_DIR/fdr_exact_$alpha" \
-        --add_date false
-
-    # Run FDR generation for partial matches
-    echo " Running partial matches..."
-    python "$PROJECT_ROOT/pfam/generate_fdr.py" \
-        --alpha "$alpha" \
-        --partial true \
-        --num_trials "$NUM_TRIALS" \
-        --n_calib "$N_CALIB" \
-        --delta "$DELTA" \
-        --output "$TEMP_DIR/fdr_partial_$alpha" \
-        --add_date false
-
-    # Extract results and append to CSV using Python
-    python -c "
-import numpy as np
-import sys
-
-try:
-    # Load exact match results
-    exact_data = np.load('$TEMP_DIR/fdr_exact_$alpha.npy', allow_pickle=True).item()
-    exact_lhat = np.mean(exact_data['lhats'])
-    exact_fdr = np.mean(exact_data['risks'])
-
-    # Load partial match results
-    partial_data = np.load('$TEMP_DIR/fdr_partial_$alpha.npy', allow_pickle=True).item()
-    partial_fdr = np.mean(partial_data['risks'])
-
-    # Write to CSV
-    with open('$CSV_OUTPUT', 'a') as f:
-        f.write(f'$alpha,{exact_lhat},{exact_fdr},{partial_fdr}\n')
-
-    print(f' Results: lambda={exact_lhat:.6f}, exact_fdr={exact_fdr:.6f}, partial_fdr={partial_fdr:.6f}')
-
-except Exception as e:
-    print(f'Error processing alpha=$alpha: {e}', file=sys.stderr)
-    sys.exit(1)
-"
-done
-
-# Clean up temporary files
-rm -rf "$TEMP_DIR"
-echo "Results saved to: $CSV_OUTPUT"
-echo "Total alpha values processed: $total"
-
scripts/precompute_fnr_thresholds.sh DELETED
@@ -1,106 +0,0 @@
-#!/bin/bash
-
-# Script to precompute FNR thresholds for different alpha values
-# Usage: ./precompute_fnr_thresholds.sh [OPTIONS]
-
-# Default parameters - can be modified as needed
-MIN_ALPHA=0.01
-MAX_ALPHA=1.0
-NUM_ALPHA_VALUES=100
-NUM_TRIALS=100
-N_CALIB=1000
-OUTPUT_DIR="../results"
-TEMP_DIR="./temp_fnr_results"
-CSV_OUTPUT="$OUTPUT_DIR/fnr_thresholds.csv"
-
-mkdir -p "$OUTPUT_DIR"
-mkdir -p "$TEMP_DIR"
-
-# Initialize CSV file with header
-echo "alpha,lambda_threshold,exact_fnr,partial_fnr" > "$CSV_OUTPUT"
-
-echo "Precomputing FNR thresholds..."
-echo "Alpha range: $MIN_ALPHA to $MAX_ALPHA"
-echo "Number of alpha values: $NUM_ALPHA_VALUES"
-echo "Trials per alpha: $NUM_TRIALS"
-echo "Calibration set size: $N_CALIB"
-echo "Output file: $CSV_OUTPUT"
-echo ""
-
-# Generate alpha values using Python
-ALPHA_VALUES=$(python -c "
-import numpy as np
-alphas = np.linspace($MIN_ALPHA, $MAX_ALPHA, $NUM_ALPHA_VALUES)
-print(' '.join([str(a) for a in alphas]))
-")
-
-# Counter for progress tracking
-counter=0
-total=$NUM_ALPHA_VALUES
-
-# Loop over alpha values
-for alpha in $ALPHA_VALUES; do
-    counter=$((counter + 1))
-    echo "Processing alpha=$alpha ($counter/$total)..."
-
-    # Run FNR generation for exact matches
-    echo " Running exact matches..."
-    python ../pfam/generate_fnr.py \
-        --alpha "$alpha" \
-        --partial false \
-        --num_trials "$NUM_TRIALS" \
-        --n_calib "$N_CALIB" \
-        --output "$TEMP_DIR/fnr_exact_$alpha" \
-        --add_date false
-
-    # Run FNR generation for partial matches
-    echo " Running partial matches..."
-    python ../pfam/generate_fnr.py \
-        --alpha "$alpha" \
-        --partial true \
-        --num_trials "$NUM_TRIALS" \
-        --n_calib "$N_CALIB" \
-        --output "$TEMP_DIR/fnr_partial_$alpha" \
-        --add_date false
-
-    # Extract results and append to CSV using Python
-    python -c "
-import numpy as np
-import sys
-
-try:
-    # Load exact match results
-    exact_data = np.load('$TEMP_DIR/fnr_exact_$alpha.npy', allow_pickle=True).item()
-    exact_lhat = np.mean(exact_data['lhats'])
-    exact_fnr = np.mean(exact_data['fnrs'])
-
-    # Load partial match results
-    partial_data = np.load('$TEMP_DIR/fnr_partial_$alpha.npy', allow_pickle=True).item()
-    partial_fnr = np.mean(partial_data['fnrs'])
-
-    # Write to CSV
-    with open('$CSV_OUTPUT', 'a') as f:
-        f.write(f'$alpha,{exact_lhat},{exact_fnr},{partial_fnr}\n')
-
-    print(f' Results: lambda={exact_lhat:.6f}, exact_fnr={exact_fnr:.6f}, partial_fnr={partial_fnr:.6f}')
-
-except Exception as e:
-    print(f'Error processing alpha=$alpha: {e}', file=sys.stderr)
-    sys.exit(1)
-"
-
-    if [ $? -ne 0 ]; then
-        echo "Error processing alpha=$alpha" >&2
-        exit 1
-    fi
-done
-
-# Clean up temporary files
-echo ""
-echo "Cleaning up temporary files..."
-rm -rf "$TEMP_DIR"
-
-echo "FNR threshold precomputation completed!"
-echo "Results saved to: $CSV_OUTPUT"
-echo "Total alpha values processed: $total"
-
scripts/quick_fdr_check.py DELETED
@@ -1,58 +0,0 @@
-#!/usr/bin/env python
-"""Quick FDR calibration check - compare the two datasets."""
-import numpy as np
-import sys
-sys.path.insert(0, '.')
-from protein_conformal.util import get_sims_labels, get_thresh_FDR
-
-print("Quick FDR Calibration Check")
-print("=" * 50)
-
-# Load both datasets
-pfam_new = np.load('data/pfam_new_proteins.npy', allow_pickle=True)
-backup = np.load('/groups/doudna/projects/ronb/conformal_backup/protein-conformal/data/conformal_pfam_with_lookup_dataset.npy', allow_pickle=True)
-
-print(f"pfam_new: {len(pfam_new)} samples")
-print(f"backup: {len(backup)} samples")
-print()
-
-# Compare similarity and label distributions
-sims1, labels1 = get_sims_labels(pfam_new[:500], partial=False)
-sims2, labels2 = get_sims_labels(backup[:500], partial=False)
-
-print("Stats from first 500 samples:")
-print(f" pfam_new - positives: {labels1.sum()}/{labels1.size} ({100*labels1.mean():.2f}%)")
-print(f" backup - positives: {labels2.sum()}/{labels2.size} ({100*labels2.mean():.2f}%)")
-print()
-
-# Run a single FDR calibration on each
-print("Single FDR trial (n_calib=1000, alpha=0.1):")
-np.random.seed(42)
-
-# pfam_new
-np.random.shuffle(pfam_new)
-X1, y1 = get_sims_labels(pfam_new[:1000], partial=False)
-lhat1, _ = get_thresh_FDR(y1, X1, alpha=0.1, delta=0.5, N=100)
-
-# backup
-np.random.shuffle(backup)
-X2, y2 = get_sims_labels(backup[:1000], partial=False)
-lhat2, _ = get_thresh_FDR(y2, X2, alpha=0.1, delta=0.5, N=100)
-
-print(f" pfam_new λ: {lhat1:.10f}")
-print(f" backup λ: {lhat2:.10f}")
-print(f" Paper λ: 0.9999802250 (from pfam_fdr_2024-06-25.npy)")
-print()
-
-# Which is closer to paper?
-diff1 = abs(lhat1 - 0.9999802250)
-diff2 = abs(lhat2 - 0.9999802250)
-print(f"Difference from paper threshold:")
-print(f" pfam_new: {diff1:.10f}")
-print(f" backup: {diff2:.10f}")
-print()
-
-if diff1 < diff2:
-    print("→ pfam_new is closer to paper threshold")
-else:
-    print("→ backup is closer to paper threshold")
scripts/search_and_assign_probs.sh DELETED
@@ -1,123 +0,0 @@
1
- #!/bin/bash
2
-
3
- # Define the lookup directory
4
- LOOKUP_DIR="/home/yangk/proteins/conformal_protein_search/data"
5
-
6
- # Navigate to the conformal-protein-retrieval directory
7
- cd /home/yangk/proteins/conformal_protein_search/conformal-protein-retrieval || exit
8
-
9
- # Specify the directory containing FASTA and embedding files
10
- INPUT_DIR="${1:-.}"
11
-
12
- # Create an output directory for the results
13
- OUTPUT_DIR="$INPUT_DIR/output"
14
- mkdir -p "$OUTPUT_DIR"
15
-
16
- # Precomputed probabilities file for get_probs.py
17
- PRECOMPUTED_PROBS="$LOOKUP_DIR/pfam_new_proteins_SVA_probs_1000_bins_200_calibration_pts.csv"
18
-
19
- # Loop through all *_prot.fasta files
20
- for fasta_file in "$INPUT_DIR"/*_prot.fasta; do
21
- if [[ -f "$fasta_file" ]]; then
22
- # Extract base name
23
- base_name=$(basename "$fasta_file" "_prot.fasta")
24
-
25
- # Corresponding embedding file
26
- emb_file="$INPUT_DIR/emb/${base_name}_emb.npy"
27
-
28
- # Ensure embedding file exists
29
- if [[ -f "$emb_file" ]]; then
30
- # Output CSV file for search.py results
31
- search_output_csv="$OUTPUT_DIR/partial_pfam_${base_name}_hits.csv"
32
-
33
- # Run the search script, with pfam partial FDR control as filtering threshold
34
- echo "Running search for: $fasta_file & $emb_file -> $search_output_csv"
35
- python scripts/search.py --fdr --fdr_lambda 0.9999642502418673 \
36
- --output "$search_output_csv" \
37
- --query_embedding "$emb_file" \
38
- --query_fasta "$fasta_file" \
39
- --lookup_embedding "$LOOKUP_DIR/lookup_embeddings.npy" \
40
- --lookup_fasta "$LOOKUP_DIR/lookup_embeddings_meta_data.fasta"
41
-
42
- # Ensure search output file exists before proceeding
43
- if [[ -f "$search_output_csv" ]]; then
44
- # Output CSV file for get_probs.py results
45
- probs_output_csv="$OUTPUT_DIR/partial_pfam_${base_name}_hits_with_probs.csv"
46
-
47
- # Run get_probs.py
48
- echo "Running get_probs for: $search_output_csv -> $probs_output_csv"
49
- python scripts/get_probs.py --precomputed --precomputed_path "$PRECOMPUTED_PROBS" \
50
- --input "$search_output_csv" --output "$probs_output_csv" --partial
51
- else
52
- echo "Warning: Search output file $search_output_csv not found. Skipping probability computation."
53
- fi
54
- else
55
- echo "Warning: No matching embedding found for $fasta_file. Skipping..."
56
- fi
57
- fi
58
- done
59
-
60
- echo "Pipeline completed. Results saved in $OUTPUT_DIR."
61
-
62
-
63
- #!/bin/bash
64
-
65
- # Define the lookup directory
66
- LOOKUP_DIR="/home/yangk/proteins/conformal_protein_search/data"
67
-
68
- # Navigate to the conformal-protein-retrieval directory
69
- cd /home/yangk/proteins/conformal_protein_search/conformal-protein-retrieval || exit
-
- # Specify the directory containing FASTA and embedding files
- INPUT_DIR="${1:-.}"
-
- # Create an output directory for the results
- OUTPUT_DIR="$INPUT_DIR/output"
- mkdir -p "$OUTPUT_DIR"
-
- # Precomputed probabilities file for get_probs.py
- PRECOMPUTED_PROBS="$LOOKUP_DIR/pfam_new_proteins_SVA_probs_1000_bins_200_calibration_pts.csv"
-
- # Loop through all *_prot.fasta files
- for fasta_file in "$INPUT_DIR"/*_prot.fasta; do
-     if [[ -f "$fasta_file" ]]; then
-         # Extract base name
-         base_name=$(basename "$fasta_file" "_prot.fasta")
-
-         # Corresponding embedding file
-         emb_file="$INPUT_DIR/emb/${base_name}_emb.npy"
-
-         # Ensure embedding file exists
-         if [[ -f "$emb_file" ]]; then
-             # Output CSV file for search.py results
-             search_output_csv="$OUTPUT_DIR/partial_pfam_${base_name}_hits.csv"
-
-             # Run the search script, with pfam partial FDR control as filtering threshold
-             echo "Running search for: $fasta_file & $emb_file -> $search_output_csv"
-             python scripts/search.py --fdr --fdr_lambda 0.9999642502418673 \
-                 --output "$search_output_csv" \
-                 --query_embedding "$emb_file" \
-                 --query_fasta "$fasta_file" \
-                 --lookup_embedding "$LOOKUP_DIR/lookup_embeddings.npy" \
-                 --lookup_fasta "$LOOKUP_DIR/lookup_embeddings_meta_data.fasta"
-
-             # Ensure search output file exists before proceeding
-             if [[ -f "$search_output_csv" ]]; then
-                 # Output CSV file for get_probs.py results
-                 probs_output_csv="$OUTPUT_DIR/partial_pfam_${base_name}_hits_with_probs.csv"
-
-                 # Run get_probs.py
-                 echo "Running get_probs for: $search_output_csv -> $probs_output_csv"
-                 python scripts/get_probs.py --precomputed --precomputed_path "$PRECOMPUTED_PROBS" \
-                     --input "$search_output_csv" --output "$probs_output_csv" --partial
-             else
-                 echo "Warning: Search output file $search_output_csv not found. Skipping probability computation."
-             fi
-         else
-             echo "Warning: No matching embedding found for $fasta_file. Skipping..."
-         fi
-     fi
- done
-
- echo "Pipeline completed. Results saved in $OUTPUT_DIR."
-
scripts/slurm_calibrate_fdr.sh DELETED
@@ -1,59 +0,0 @@
- #!/bin/bash
- #SBATCH --job-name=cpr-calibrate-fdr
- #SBATCH --output=logs/cpr-calibrate-fdr-%j.out
- #SBATCH --error=logs/cpr-calibrate-fdr-%j.err
- #SBATCH --time=2:00:00
- #SBATCH --mem=32G
- #SBATCH --cpus-per-task=4
-
- # CPR FDR Calibration - Reproduces paper threshold computation
- # Usage: sbatch scripts/slurm_calibrate_fdr.sh
-
- set -e
- mkdir -p logs results
-
- source ~/.bashrc
- eval "$(conda shell.bash hook)"
- conda activate conformal-s
-
- echo "========================================"
- echo "CPR FDR Calibration"
- echo "Date: $(date)"
- echo "Node: $(hostname)"
- echo "========================================"
-
- # IMPORTANT: Use pfam_new_proteins.npy - the CORRECT calibration dataset
- # The backup dataset (conformal_pfam_with_lookup_dataset.npy) has DATA LEAKAGE:
- # - First 50 samples all have same Pfam family "PF01266;" repeated
- # - Positive rate is 3.00% vs 0.22% in correct dataset
- # - Results in different FDR threshold (~0.999965 vs paper's ~0.999980)
- # See: scripts/quick_fdr_check.py for verification
- CALIB_DATA="data/pfam_new_proteins.npy"
-
- # Check if data exists
- if [ ! -f "$CALIB_DATA" ]; then
-     echo "ERROR: Calibration data not found at $CALIB_DATA"
-     echo "Download from Zenodo: https://zenodo.org/records/14272215"
-     exit 1
- fi
-
- echo "Using calibration data: $CALIB_DATA"
- echo "NOTE: Using pfam_new_proteins.npy (correct dataset without leakage)"
- echo ""
-
- # Run calibration using the ORIGINAL generate_fdr.py script (LTT method)
- # This matches what was used to generate the paper's threshold
- python scripts/pfam/generate_fdr.py \
-     --data_path "$CALIB_DATA" \
-     --alpha 0.1 \
-     --num_trials 100 \
-     --n_calib 1000 \
-     --delta 0.5 \
-     --output results/pfam_fdr \
-     --add_date True
-
- echo ""
- echo "========================================"
- echo "Expected result: mean lhat ≈ 0.999980"
- echo "Completed: $(date)"
- echo "========================================"
scripts/slurm_calibrate_fnr.sh DELETED
@@ -1,47 +0,0 @@
- #!/bin/bash
- #SBATCH --job-name=cpr-calibrate-fnr
- #SBATCH --output=logs/cpr-calibrate-fnr-%j.out
- #SBATCH --error=logs/cpr-calibrate-fnr-%j.err
- #SBATCH --time=2:00:00
- #SBATCH --mem=32G
- #SBATCH --cpus-per-task=4
-
- # CPR FNR Calibration - Computes FNR thresholds
- # Usage: sbatch scripts/slurm_calibrate_fnr.sh
-
- set -e
- mkdir -p logs results
-
- source ~/.bashrc
- eval "$(conda shell.bash hook)"
- conda activate conformal-s
-
- echo "========================================"
- echo "CPR FNR Calibration"
- echo "Date: $(date)"
- echo "Node: $(hostname)"
- echo "========================================"
-
- # Use the ORIGINAL calibration dataset from backup
- CALIB_DATA="/groups/doudna/projects/ronb/conformal_backup/protein-conformal/data/conformal_pfam_with_lookup_dataset.npy"
-
- if [ ! -f "$CALIB_DATA" ]; then
-     echo "ERROR: Calibration data not found at $CALIB_DATA"
-     exit 1
- fi
-
- echo "Using calibration data: $CALIB_DATA"
- echo ""
-
- python scripts/pfam/generate_fnr.py \
-     --data_path "$CALIB_DATA" \
-     --alpha 0.1 \
-     --num_trials 100 \
-     --n_calib 1000 \
-     --output results/pfam_fnr \
-     --add_date True
-
- echo ""
- echo "========================================"
- echo "Completed: $(date)"
- echo "========================================"
scripts/slurm_compute_fdr_all.sh DELETED
@@ -1,65 +0,0 @@
- #!/bin/bash
- #SBATCH --job-name=fdr-all
- #SBATCH --partition=standard
- #SBATCH --nodes=1
- #SBATCH --ntasks=1
- #SBATCH --cpus-per-task=4
- #SBATCH --mem=32G
- #SBATCH --time=24:00:00
- #SBATCH --output=logs/fdr_all_%j.out
- #SBATCH --error=logs/fdr_all_%j.err
-
- # Compute FDR thresholds at all alpha levels (both exact and partial matches)
- # This uses the FIXED compute_fdr_table.py with correct argument order
-
- set -e
-
- echo "=== FDR Threshold Computation (All Alpha Levels) ==="
- echo "Job ID: $SLURM_JOB_ID"
- echo "Started: $(date)"
- echo ""
-
- # Setup environment
- eval "$(/shared/software/miniconda3/latest/bin/conda shell.bash hook)"
- conda activate conformal-s
-
- cd /groups/doudna/projects/ronb/conformal-protein-retrieval
-
- # Calibration data
- CALIB_DATA="/groups/doudna/projects/ronb/conformal_backup/protein-conformal/data/pfam_new_proteins.npy"
-
- # Alpha levels to compute
- ALPHA_LEVELS="0.001,0.005,0.01,0.02,0.05,0.1,0.15,0.2"
-
- echo "Calibration data: $CALIB_DATA"
- echo "Alpha levels: $ALPHA_LEVELS"
- echo ""
-
- # Exact match FDR thresholds
- echo "=== Computing EXACT match FDR thresholds ==="
- python scripts/compute_fdr_table.py \
-     --calibration "$CALIB_DATA" \
-     --output results/fdr_thresholds.csv \
-     --n-trials 100 \
-     --n-calib 1000 \
-     --seed 42 \
-     --alpha-levels "$ALPHA_LEVELS"
-
- echo ""
- echo "=== Computing PARTIAL match FDR thresholds ==="
- python scripts/compute_fdr_table.py \
-     --calibration "$CALIB_DATA" \
-     --output results/fdr_thresholds_partial.csv \
-     --n-trials 100 \
-     --n-calib 1000 \
-     --seed 42 \
-     --alpha-levels "$ALPHA_LEVELS" \
-     --partial
-
- echo ""
- echo "=== FDR Computation Complete ==="
- echo "Results:"
- echo "  - results/fdr_thresholds.csv (exact match)"
- echo "  - results/fdr_thresholds_partial.csv (partial match)"
- echo ""
- echo "Finished: $(date)"
scripts/slurm_compute_fnr_partial.sh DELETED
@@ -1,38 +0,0 @@
- #!/bin/bash
- #SBATCH --job-name=fnr-partial
- #SBATCH --partition=standard
- #SBATCH --nodes=1
- #SBATCH --ntasks=1
- #SBATCH --cpus-per-task=4
- #SBATCH --mem=32G
- #SBATCH --time=12:00:00
- #SBATCH --output=/groups/doudna/projects/ronb/conformal-protein-retrieval/logs/fnr_partial_%j.log
- #SBATCH --error=/groups/doudna/projects/ronb/conformal-protein-retrieval/logs/fnr_partial_%j.err
-
- # Compute FNR thresholds for PARTIAL matches only
- # Use this if the main FNR job timed out before completing partial match computation
-
- eval "$(/shared/software/miniconda3/latest/bin/conda shell.bash hook)"
- conda activate conformal-s
-
- cd /groups/doudna/projects/ronb/conformal-protein-retrieval
-
- echo "============================================"
- echo "Computing FNR Thresholds (Partial Match Only)"
- echo "============================================"
- echo "Start time: $(date)"
- echo "Node: $(hostname)"
- echo ""
-
- python scripts/compute_fnr_table.py \
-     --calibration data/pfam_new_proteins.npy \
-     --output results/fnr_thresholds_partial.csv \
-     --n-trials 100 \
-     --n-calib 1000 \
-     --seed 42 \
-     --partial
-
- echo ""
- echo "============================================"
- echo "Completed: $(date)"
- echo "============================================"
scripts/slurm_investigate.sh DELETED
@@ -1,36 +0,0 @@
- #!/bin/bash
- #SBATCH --job-name=cpr-investigate
- #SBATCH --output=logs/cpr-investigate-%j.out
- #SBATCH --error=logs/cpr-investigate-%j.err
- #SBATCH --time=1:00:00
- #SBATCH --mem=32G
- #SBATCH --cpus-per-task=4
-
- # CPR Investigation - FDR calibration and precomputed probability verification
- set -e
- mkdir -p logs data
-
- source ~/.bashrc
- eval "$(conda shell.bash hook)"
- conda activate conformal-s
-
- cd /groups/doudna/projects/ronb/conformal-protein-retrieval
-
- echo "========================================"
- echo "CPR Investigation"
- echo "Date: $(date)"
- echo "Node: $(hostname)"
- echo "========================================"
- echo ""
-
- echo "=== Part 1: FDR Calibration Investigation ==="
- python scripts/investigate_fdr.py
- echo ""
-
- echo "=== Part 2: Precomputed Probability Verification ==="
- python scripts/test_precomputed_probs.py
- echo ""
-
- echo "========================================"
- echo "Completed: $(date)"
- echo "========================================"
scripts/slurm_test_clean_embed.sh DELETED
@@ -1,108 +0,0 @@
- #!/bin/bash
- #SBATCH --job-name=test_clean_embed
- #SBATCH --partition=gpu
- #SBATCH --nodes=1
- #SBATCH --ntasks=1
- #SBATCH --cpus-per-task=4
- #SBATCH --gres=gpu:1
- #SBATCH --time=01:00:00
- #SBATCH --output=logs/test_clean_embed_%j.out
- #SBATCH --error=logs/test_clean_embed_%j.err
-
- # Test CLEAN embedding with the CPR CLI
- # This script:
- # 1. Runs CLI tests
- # 2. Tests CLEAN embedding on a small FASTA file
-
- set -e
-
- echo "=== CPR CLEAN Embedding Test ==="
- echo "Date: $(date)"
- echo "Node: $(hostname)"
- echo "Job ID: $SLURM_JOB_ID"
-
- # Create logs directory if it doesn't exist
- mkdir -p logs
-
- # Activate conda environment
- eval "$(/shared/software/miniconda3/latest/bin/conda shell.bash hook)"
- conda activate conformal-s
-
- # Print environment info
- echo ""
- echo "=== Environment Info ==="
- which python
- python --version
- python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"
- python -c "import faiss; print(f'FAISS: {faiss.__version__}')"
-
- # Change to repo directory
- cd /groups/doudna/projects/ronb/conformal-protein-retrieval
-
- # 1. Run CLI tests
- echo ""
- echo "=== Running CLI Tests ==="
- python -m pytest tests/test_cli.py -v --tb=short 2>&1 || echo "Note: Some tests may fail if dependencies are missing"
-
- # 2. Create a small test FASTA file
- echo ""
- echo "=== Creating Test FASTA ==="
- TEST_DIR="test_clean_output"
- mkdir -p "$TEST_DIR"
-
- cat > "$TEST_DIR/test_sequences.fasta" << 'EOF'
- >seq1_test_enzyme
- MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTLTYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITLGMDELYK
- >seq2_test_enzyme
- MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
- >seq3_test_enzyme
- MTSKGECFVTVTYKNLFPPEQWSPKQYLFHNASDKGFVPTHICTHGCLSPKQLQEFDLVNQADQEGWSGDYTCQCNCTQQALCGFPVFLGCEACTFTPDCHGECVCKFPFGEYFVCDCDGSPDCG
- EOF
-
- echo "Created test FASTA with 3 sequences"
-
- # 3. Test CLEAN embedding (requires GPU)
- echo ""
- echo "=== Testing CLEAN Embedding ==="
- echo "Checking CLEAN installation..."
- python -c "from CLEAN.model import LayerNormNet; print('CLEAN model import OK')" 2>&1 || {
-     echo "CLEAN not installed, installing..."
-     cd CLEAN_repo/app
-     python build.py install
-     cd ../..
- }
-
- echo ""
- echo "Running cpr embed with CLEAN model..."
- time python -m protein_conformal.cli embed \
-     --input "$TEST_DIR/test_sequences.fasta" \
-     --output "$TEST_DIR/test_clean_embeddings.npy" \
-     --model clean
-
- # 4. Verify output
- echo ""
- echo "=== Verifying Output ==="
- if [ -f "$TEST_DIR/test_clean_embeddings.npy" ]; then
-     python -c "
- import numpy as np
- emb = np.load('$TEST_DIR/test_clean_embeddings.npy')
- print(f'Embeddings shape: {emb.shape}')
- print(f'Expected: (3, 128)')
- assert emb.shape == (3, 128), f'Shape mismatch: expected (3, 128), got {emb.shape}'
- print('SUCCESS: CLEAN embedding test passed!')
- "
- else
-     echo "ERROR: Output file not created"
-     exit 1
- fi
-
- # 5. Optional: Compare with reference (if exists)
- echo ""
- echo "=== Test Complete ==="
- echo "Output saved to: $TEST_DIR/test_clean_embeddings.npy"
- echo ""
-
- # Cleanup (optional - uncomment to remove test files)
- # rm -rf "$TEST_DIR"
-
- echo "Done at $(date)"
scripts/slurm_verify.sh DELETED
@@ -1,97 +0,0 @@
- #!/bin/bash
- #SBATCH --job-name=cpr-verify
- #SBATCH --output=logs/cpr-verify-%j.out
- #SBATCH --error=logs/cpr-verify-%j.err
- #SBATCH --time=1:00:00
- #SBATCH --mem=32G
- #SBATCH --cpus-per-task=4
-
- # CPR Verification - Reproduces paper results
- # Usage:
- #   sbatch scripts/slurm_verify.sh syn30   # Verify Syn3.0 (Figure 2A)
- #   sbatch scripts/slurm_verify.sh fdr     # Verify FDR algorithm
- #   sbatch scripts/slurm_verify.sh dali    # Verify DALI prefiltering (Tables 4-6)
- #   sbatch scripts/slurm_verify.sh clean   # Verify CLEAN enzyme (Tables 1-2)
- #   sbatch scripts/slurm_verify.sh probs   # Verify precomputed probability lookup
- #   sbatch scripts/slurm_verify.sh all     # Run all verifications
-
- set -e
- mkdir -p logs results
-
- source ~/.bashrc
- eval "$(conda shell.bash hook)"
- conda activate conformal-s
-
- CHECK="${1:-all}"
-
- echo "========================================"
- echo "CPR Verification"
- echo "Date: $(date)"
- echo "Node: $(hostname)"
- echo "Check: $CHECK"
- echo "========================================"
- echo ""
-
- run_syn30() {
-     echo "--- Syn3.0 Verification (Paper Figure 2A) ---"
-     python scripts/verify_syn30.py
-     echo ""
- }
-
- run_fdr() {
-     echo "--- FDR Algorithm Verification ---"
-     python scripts/verify_fdr_algorithm.py
-     echo ""
- }
-
- run_dali() {
-     echo "--- DALI Prefiltering Verification (Tables 4-6) ---"
-     python scripts/verify_dali.py
-     echo ""
- }
-
- run_clean() {
-     echo "--- CLEAN Enzyme Classification Verification (Tables 1-2) ---"
-     python scripts/verify_clean.py
-     echo ""
- }
-
- run_probs() {
-     echo "--- Precomputed Probability Lookup Verification ---"
-     python scripts/test_precomputed_probs.py
-     echo ""
- }
-
- case "$CHECK" in
-     syn30)
-         run_syn30
-         ;;
-     fdr)
-         run_fdr
-         ;;
-     dali)
-         run_dali
-         ;;
-     clean)
-         run_clean
-         ;;
-     probs)
-         run_probs
-         ;;
-     all)
-         run_syn30
-         run_fdr
-         run_dali
-         run_clean
-         run_probs
-         ;;
-     *)
-         echo "Unknown check: $CHECK"
-         echo "Available: syn30, fdr, dali, clean, probs, all"
-         exit 1
-         ;;
- esac
-
- echo "========================================"
- echo "Completed: $(date)"
- echo "========================================"
scripts/slurm_verify_probs.sh DELETED
@@ -1,34 +0,0 @@
- #!/bin/bash
- #SBATCH --job-name=cpr-verify-probs
- #SBATCH --output=logs/cpr-verify-probs-%j.out
- #SBATCH --error=logs/cpr-verify-probs-%j.err
- #SBATCH --time=1:00:00
- #SBATCH --mem=16G
- #SBATCH --cpus-per-task=2
-
- # CPR Precomputed Probability Verification
- # Verifies that the sim->prob lookup table matches direct Venn-Abers computation
- # Usage: sbatch scripts/slurm_verify_probs.sh
-
- set -e
- mkdir -p logs data
-
- echo "========================================"
- echo "CPR Probability Lookup Verification"
- echo "Date: $(date)"
- echo "Node: $(hostname)"
- echo "========================================"
- echo ""
-
- # Activate conda environment
- source ~/.bashrc
- eval "$(conda shell.bash hook)"
- conda activate conformal-s
-
- # Run the verification test
- python scripts/test_precomputed_probs.py
-
- echo ""
- echo "========================================"
- echo "Completed: $(date)"
- echo "========================================"
scripts/submit_fdr_parallel.sh DELETED
@@ -1,47 +0,0 @@
- #!/bin/bash
- # Submit FDR threshold jobs in parallel - one per alpha level
-
- ALPHAS="0.001 0.005 0.01 0.02 0.05 0.1 0.15 0.2"
-
- for alpha in $ALPHAS; do
-     # Exact match
-     sbatch --job-name="fdr-exact-${alpha}" \
-         --partition=standard \
-         --nodes=1 --ntasks=1 --cpus-per-task=4 --mem=32G \
-         --time=08:00:00 \
-         --output=/groups/doudna/projects/ronb/conformal-protein-retrieval/logs/fdr_exact_${alpha}_%j.log \
-         --wrap="
-             eval \"\$(/shared/software/miniconda3/latest/bin/conda shell.bash hook)\"
-             conda activate conformal-s
-             cd /groups/doudna/projects/ronb/conformal-protein-retrieval
-             python scripts/compute_fdr_table.py \
-                 --calibration data/pfam_new_proteins.npy \
-                 --output results/fdr_exact_alpha_${alpha}.csv \
-                 --n-trials 100 \
-                 --n-calib 1000 \
-                 --seed 42 \
-                 --alpha-levels ${alpha}
-         "
-
-     # Partial match
-     sbatch --job-name="fdr-partial-${alpha}" \
-         --partition=standard \
-         --nodes=1 --ntasks=1 --cpus-per-task=4 --mem=32G \
-         --time=08:00:00 \
-         --output=/groups/doudna/projects/ronb/conformal-protein-retrieval/logs/fdr_partial_${alpha}_%j.log \
-         --wrap="
-             eval \"\$(/shared/software/miniconda3/latest/bin/conda shell.bash hook)\"
-             conda activate conformal-s
-             cd /groups/doudna/projects/ronb/conformal-protein-retrieval
-             python scripts/compute_fdr_table.py \
-                 --calibration data/pfam_new_proteins.npy \
-                 --output results/fdr_partial_alpha_${alpha}.csv \
-                 --n-trials 100 \
-                 --n-calib 1000 \
-                 --seed 42 \
-                 --alpha-levels ${alpha} \
-                 --partial
-         "
- done
-
- echo "Submitted 16 FDR jobs (8 alphas × 2 match types)"