chore: clean up redundant scripts and consolidate thresholds

- Archive 16 redundant/one-off scripts to scripts/archive/
- Archive 3 duplicate Python files from notebooks/pfam/
- Remove "simple" CSV files, add full threshold tables to GETTING_STARTED.md
- Keep 4 clean SLURM scripts: build, embed, fdr, fnr
- Update .gitignore for archive directories

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- .gitignore +3 -1
- GETTING_STARTED.md +40 -11
- notebooks/pfam/generate_fdr.py +0 -64
- notebooks/pfam/generate_fnr.py +0 -22
- notebooks/pfam/sva_results.py +0 -59
- results/fdr_thresholds.csv +2 -1
- scripts/embed_fasta.sh +0 -63
- scripts/investigate_fdr.py +0 -106
- scripts/precompute_fdr_thresholds.sh +0 -96
- scripts/precompute_fnr_thresholds.sh +0 -106
- scripts/quick_fdr_check.py +0 -58
- scripts/search_and_assign_probs.sh +0 -123
- scripts/slurm_calibrate_fdr.sh +0 -59
- scripts/slurm_calibrate_fnr.sh +0 -47
- scripts/slurm_compute_fdr_all.sh +0 -65
- scripts/slurm_compute_fnr_partial.sh +0 -38
- scripts/slurm_investigate.sh +0 -36
- scripts/slurm_test_clean_embed.sh +0 -108
- scripts/slurm_verify.sh +0 -97
- scripts/slurm_verify_probs.sh +0 -34
- scripts/submit_fdr_parallel.sh +0 -47
.gitignore
CHANGED
@@ -211,5 +211,7 @@ test_clean_output/
 protein_vec_models.gz
 CLEAN_repo/
 
-# Archived legacy
+# Archived legacy code (redundant/one-off scripts)
 notebooks_archive/
+scripts/archive/
+notebooks/*/archive/
GETTING_STARTED.md
CHANGED
@@ -165,20 +165,49 @@ cpr search --input data/gene_unknown/unknown_aa_seqs.npy \
 
 ## FDR/FNR Threshold Reference
 
-These thresholds control the trade-off between hits and false positives
+These thresholds control the trade-off between hits and false positives.
 
-
-|---------|-------------------|-------------------|----------|
-| 0.001 | (see csv) | 0.9997904 | Ultra-stringent |
-| 0.01 | (see csv) | 0.9998495 | Very stringent |
-| 0.05 | (see csv) | 0.9998899 | Stringent |
-| **0.1** | **0.9999802** | **0.9999076** | **Paper default** |
-| 0.15 | (see csv) | 0.9999174 | Relaxed |
-| 0.2 | (see csv) | 0.9999245 | Discovery-focused |
-
-
+### FDR Thresholds (False Discovery Rate)
+
+Controls the expected fraction of hits that are false positives.
+
+| α Level | Threshold (λ) | Std Dev | Use Case |
+|---------|---------------|---------|----------|
+| **0.1** | **0.9999801** | ±1.7e-06 | **Paper default** |
+
+**Note**: FDR threshold at α=0.1 is verified against the paper (0.9999802). Additional alpha levels can be computed with `scripts/compute_fdr_table.py`.
+
+### FNR Thresholds (False Negative Rate) - Exact Match
+
+Controls the expected fraction of true matches you miss. "Exact match" requires all Pfam domains to match.
+
+| α Level | Threshold (λ) | Std Dev | Use Case |
+|---------|---------------|---------|----------|
+| 0.001 | 0.9997904 | ±2.3e-05 | Ultra-stringent |
+| 0.005 | 0.9998338 | ±8.2e-06 | Very stringent |
+| 0.01 | 0.9998495 | ±5.5e-06 | Stringent |
+| 0.02 | 0.9998679 | ±5.1e-06 | Moderate |
+| 0.05 | 0.9998899 | ±3.3e-06 | Balanced |
+| **0.1** | **0.9999076** | ±2.2e-06 | **Recommended** |
+| 0.15 | 0.9999174 | ±1.4e-06 | Relaxed |
+| 0.2 | 0.9999245 | ±1.3e-06 | Discovery-focused |
+
+### FNR Thresholds - Partial Match
+
+"Partial match" requires at least one Pfam domain to match (more permissive).
+
+| α Level | Threshold (λ) | Std Dev | Use Case |
+|---------|---------------|---------|----------|
+| 0.001 | 0.9997646 | ±1.5e-06 | Ultra-stringent |
+| 0.005 | 0.9997821 | ±2.8e-06 | Very stringent |
+| 0.01 | 0.9997946 | ±3.1e-06 | Stringent |
+| 0.02 | 0.9998108 | ±3.5e-06 | Moderate |
+| 0.05 | 0.9998389 | ±3.0e-06 | Balanced |
+| **0.1** | **0.9998626** | ±2.8e-06 | **Recommended** |
+| 0.15 | 0.9998779 | ±2.2e-06 | Relaxed |
+| 0.2 | 0.9998903 | ±2.1e-06 | Discovery-focused |
+
+Full computed tables with min/max values in `results/fdr_thresholds.csv`, `results/fnr_thresholds.csv`, and `results/fnr_thresholds_partial.csv`.
 
 ---
 
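To illustrate how a calibrated threshold from these tables is applied, here is a minimal sketch (hypothetical array values; assumes cosine-style similarity scores, using the FNR exact-match λ at α=0.1 from the table above):

```python
import numpy as np

# FNR exact-match threshold at alpha = 0.1 (value from the table above)
LAMBDA = 0.9999076

# Hypothetical similarity scores between a query and lookup embeddings
sims = np.array([0.9999100, 0.9998500, 0.9999300, 0.9990000])

# A hit is retained only if its similarity clears the calibrated threshold
keep = sims >= LAMBDA
print(int(keep.sum()))  # number of retained hits
```

Lower α values pick a higher λ and therefore retain fewer, more confident hits.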
notebooks/pfam/generate_fdr.py
DELETED
@@ -1,64 +0,0 @@
-import datetime
-import numpy as np
-import pandas as pd
-import argparse
-from tqdm import tqdm
-from protein_conformal.util import *
-
-def main():
-    parser = argparse.ArgumentParser(description='Process some integers.')
-    parser.add_argument('--alpha', type=float, default=0.1, help='Alpha value for the algorithm')
-    parser.add_argument('--partial', type=bool, default=False, help='Partial hits')
-    parser.add_argument('--num_trials', type=int, default=100, help='Number of trials to run')
-    parser.add_argument('--n_calib', type=int, default=1000, help='Number of calibration data points')
-    parser.add_argument('--delta', type=float, default=0.5, help='Delta value for the algorithm')
-    parser.add_argument('--output', type=str, default='/data/ron/protein-conformal/data/pfam_fdr.npy', help='Output file for the results')
-    parser.add_argument('--add_date', type=bool, default=True, help='Add date to output file name')
-    args = parser.parse_args()
-    alpha = args.alpha
-    num_trials = args.num_trials
-    n_calib = args.n_calib
-    delta = args.delta
-    partial = args.partial
-    # Load the data
-    # data = np.load('/data/ron/protein-conformal/data/conformal_pfam_with_lookup_dataset.npy', allow_pickle=True)
-    data = np.load('/data/ron/protein-conformal/data/pfam_new_proteins.npy', allow_pickle=True)
-
-    risks = []
-    tprs = []
-    lhats = []
-    fdr_cals = []
-    # alpha = 0.1
-    # num_trials = 100
-    # n_calib = 1000
-    for trial in tqdm(range(num_trials)):
-        np.random.shuffle(data)
-        cal_data = data[:n_calib]
-        test_data = data[n_calib:]
-        X_cal, y_cal = get_sims_labels(cal_data, partial=partial)
-        X_test, y_test_exact = get_sims_labels(test_data, partial=partial)
-        # sims, labels = get_sims_labels(cal_data, partial=False)
-        lhat, fdr_cal = get_thresh_FDR(y_cal, X_cal, alpha, delta, N=100)
-        lhats.append(lhat)
-        fdr_cals.append(fdr_cal)
-        # print(X_test.shape)
-        # print(y_test_exact.shape)
-        risks.append(risk(X_test, y_test_exact, lhat))
-        tprs.append(calculate_true_positives(X_test, y_test_exact, lhat))
-
-    print("Risk: ", np.mean(risks))
-    print("TPR: ", np.mean(tprs))
-    print("Lhat: ", np.mean(lhats))
-    print("FDR Cal: ", np.mean(fdr_cals))
-
-    output_file = args.output + ('_' + str(datetime.datetime.now().date()) if args.add_date else '' + '.npy')
-
-    np.save(output_file,
-            {'risks': risks,
-             'tprs': tprs,
-             'lhats': lhats,
-             'fdr_cals': fdr_cals})
-
-if __name__ == "__main__":
-    # add code for command line arguments
-    main()
notebooks/pfam/generate_fnr.py
DELETED
@@ -1,22 +0,0 @@
-import numpy as np
-import pandas as pd
-import argparse
-from tqdm import tqdm
-from protein_conformal.util import *
-
-def main():
-    parser = argparse.ArgumentParser()
-    parser.add_argument('--alpha', type=float, default=0.1, help='Alpha value for the algorithm')
-    parser.add_argument('--num_trials', type=int, default=100, help='Number of trials to run')
-    parser.add_argument('--n_calib', type=int, default=1000, help='Number of calibration data points')
-    args = parser.parse_args()
-    alpha = args.alpha
-    num_trials = args.num_trials
-    n_calib = args.n_calib
-
-    # Load the data
-    data = np.load('/data/ron/protein-conformal/data/conformal_pfam_with_lookup_dataset.npy', allow_pickle=True)
-
-if __name__ == "__main__":
-    # add code for command line arguments
-    main()
notebooks/pfam/sva_results.py
DELETED
@@ -1,59 +0,0 @@
-
-import numpy as np
-import pandas as pd
-import argparse
-from tqdm import tqdm
-from protein_conformal.util import *
-import datetime
-
-def run_trial(data, n_calib, args):
-    np.random.shuffle(data)
-    cal_data = data[:n_calib]
-    test_data = data[n_calib:3*n_calib]
-    X_cal, y_cal = get_sims_labels(cal_data, partial=False)
-    X_test, y_test_exact = get_sims_labels(test_data, partial=False)
-    # flatten the data
-    X_cal = X_cal.flatten()
-    y_cal = y_cal.flatten()
-    X_test = X_test.flatten()
-    y_test_exact = y_test_exact.flatten()
-
-    # generate random indices in the test set
-    i = np.random.randint(0, len(X_test))
-    # i_s = np.random.randint(0, len(X_test), int(len(X_test) * args.percent_sva_test))
-
-    p_0, p_1 = simplifed_venn_abers_prediction(X_cal, y_cal, X_test[i])
-    result = (np.mean([p_0, p_1]), X_test[i], y_test_exact[i])
-    return result
-
-def main(args):
-    data = np.load(args.input, allow_pickle=True)
-    n_calib = args.n_calib  # Number of calibration data points
-
-    sva_results = []
-    for trial in tqdm(range(args.num_trials)):
-        # print(f'Running trial {i+1} of {args.num_trials}')
-        sva_results.append(run_trial(data, n_calib, args))
-
-    df_sva = pd.DataFrame(sva_results, columns=['p', 'x', 'y'])
-    output_file = args.output + ('_' + str(datetime.datetime.now().date()) if args.add_date else '') + '.csv'
-    print(f'Saving results to {output_file}')
-    df_sva.to_csv(output_file, index=False)
-    # make bins for p
-    df_sva['p_bin'] = pd.cut(df_sva['p'], bins=10)
-    print(df_sva.groupby('p_bin')['y'].mean())
-
-if __name__ == "__main__":
-    # add code for command line arguments
-    parser = argparse.ArgumentParser()
-    parser.add_argument('--input', type=str, default='/data/ron/protein-conformal/data/conformal_pfam_with_lookup_dataset.npy', help='Input file for the data')
-    # '/data/ron/protein-conformal/data/conformal_pfam_with_lookup_dataset.npy'
-    # parser.add_argument('--percent_sva_test', type=float, default=.1, help='percent of data to use for SVA testing')
-    # parser.add_argument('--alpha', type=float, default=0.1, help='Alpha value for the algorithm')
-    parser.add_argument('--num_trials', type=int, default=100, help='Number of trials to run')
-    parser.add_argument('--n_calib', type=int, default=50, help='Number of calibration data points')
-    # parser.add_argument('--delta', type=float, default=0.5, help='Delta value for the algorithm')
-    parser.add_argument('--output', type=str, default='/data/ron/protein-conformal/data/sva_results', help='Output file for the results')
-    parser.add_argument('--add_date', type=bool, default=True, help='Add date to output file name')
-    args = parser.parse_args()
-    main(args)
results/fdr_thresholds.csv
CHANGED
@@ -1 +1,2 @@
-
+alpha,threshold_mean,threshold_std,threshold_min,threshold_max,empirical_fdr_mean,empirical_fdr_std
+0.1,0.99998005881454,1.7455746588230029e-06,0.9999783761573561,0.9999823510044753,0.08709266350751729,0.011992918257283684
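The new CSV parses with the standard library; a minimal sketch, inlining the file contents shown in this diff rather than reading from disk:

```python
import csv
import io

# Contents of results/fdr_thresholds.csv as added in this commit
csv_text = (
    "alpha,threshold_mean,threshold_std,threshold_min,threshold_max,"
    "empirical_fdr_mean,empirical_fdr_std\n"
    "0.1,0.99998005881454,1.7455746588230029e-06,0.9999783761573561,"
    "0.9999823510044753,0.08709266350751729,0.011992918257283684\n"
)

# Each row maps column name -> string value; cast to float for use
row = next(csv.DictReader(io.StringIO(csv_text)))
alpha = float(row["alpha"])
lam = float(row["threshold_mean"])
print(alpha, round(lam, 7))
```

The rounded mean matches the λ shown in the GETTING_STARTED.md table.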
scripts/embed_fasta.sh
DELETED
@@ -1,63 +0,0 @@
-#!/bin/bash
-
-# Set the CUDA device
-CUDA_DEVICE=3
-
-# Specify the directory with all fasta files (default is current directory), assuming
-# all protein fasta files are named *_prot.fasta
-INPUT_DIR="${1:-.}"
-
-# Create the "emb" subfolder if it doesn't exist
-EMB_DIR="$INPUT_DIR/emb"
-mkdir -p "$EMB_DIR"
-
-# Loop through all *_prot.fasta files, embed
-for fasta_file in "$INPUT_DIR"/*_prot.fasta; do
-    # Ensure the file exists
-    if [[ -f "$fasta_file" ]]; then
-        # Extract base filename without extension
-        base_name=$(basename "$fasta_file" "_prot.fasta")
-
-        # Set output file name inside emb/ folder
-        output_file="$EMB_DIR/${base_name}_emb.npy"
-
-        # Run embedding command
-        echo "Processing: $fasta_file -> $output_file"
-        cd /home/yangk/proteins/protein-vec/src_run
-        CUDA_VISIBLE_DEVICES=$CUDA_DEVICE python embed_seqs.py --input_file "$fasta_file" --output_file "$output_file"
-    fi
-done
-
-echo "All files processed. Embeddings saved in $EMB_DIR."
-
-#!/bin/bash
-
-# Set the CUDA device
-CUDA_DEVICE=3
-
-# Specify the directory with all fasta files (default is current directory), assuming
-# all protein fasta files are named *_prot.fasta
-INPUT_DIR="${1:-.}"
-
-# Create the "emb" subfolder if it doesn't exist
-EMB_DIR="$INPUT_DIR/emb"
-mkdir -p "$EMB_DIR"
-
-# Loop through all *_prot.fasta files, embed
-for fasta_file in "$INPUT_DIR"/*_prot.fasta; do
-    # Ensure the file exists
-    if [[ -f "$fasta_file" ]]; then
-        # Extract base filename without extension
-        base_name=$(basename "$fasta_file" "_prot.fasta")
-
-        # Set output file name inside emb/ folder
-        output_file="$EMB_DIR/${base_name}_emb.npy"
-
-        # Run embedding command
-        echo "Processing: $fasta_file -> $output_file"
-        cd /home/yangk/proteins/protein-vec/src_run
-        CUDA_VISIBLE_DEVICES=$CUDA_DEVICE python embed_seqs.py --input_file "$fasta_file" --output_file "$output_file"
-    fi
-done
-
-echo "All files processed. Embeddings saved in $EMB_DIR."
scripts/investigate_fdr.py
DELETED
@@ -1,106 +0,0 @@
-#!/usr/bin/env python
-"""
-Investigate FDR calibration discrepancy between datasets.
-Checks for data leakage and compares calibration results.
-"""
-
-import numpy as np
-import sys
-sys.path.insert(0, '.')
-from protein_conformal.util import get_sims_labels, get_thresh_FDR
-
-print("=" * 60)
-print("FDR Calibration Dataset Investigation")
-print("=" * 60)
-print()
-
-# Load both calibration datasets
-print("Loading datasets...")
-pfam_new = np.load('data/pfam_new_proteins.npy', allow_pickle=True)
-backup_data = np.load('/groups/doudna/projects/ronb/conformal_backup/protein-conformal/data/conformal_pfam_with_lookup_dataset.npy', allow_pickle=True)
-
-print(f'pfam_new_proteins.npy: {len(pfam_new)} samples')
-print(f'backup dataset: {len(backup_data)} samples')
-print()
-
-# Check for overlap (potential leakage)
-print("Checking for overlap between datasets...")
-# Meta is an array, so convert to tuple for hashing
-pfam_metas = set(tuple(d['meta'].tolist()) if hasattr(d['meta'], 'tolist') else (d['meta'],) for d in pfam_new)
-backup_metas = set(tuple(d['meta'].tolist()) if hasattr(d['meta'], 'tolist') else (d['meta'],) for d in backup_data)
-overlap = pfam_metas & backup_metas
-print(f" Unique query sets in pfam_new: {len(pfam_metas)}")
-print(f" Unique query sets in backup: {len(backup_metas)}")
-print(f" Overlap: {len(overlap)} ({len(overlap)/len(pfam_metas)*100:.1f}% of pfam_new)")
-print()
-
-# Compare similarity distributions
-print("Similarity score distributions:")
-sims_new, labels_new = get_sims_labels(pfam_new[:500], partial=False)
-sims_backup, labels_backup = get_sims_labels(backup_data[:500], partial=False)
-
-print(f" pfam_new (500 samples):")
-print(f" Similarity: min={sims_new.min():.6f}, max={sims_new.max():.6f}, mean={sims_new.mean():.6f}")
-print(f" Labels: {labels_new.sum()}/{labels_new.size} positive ({labels_new.mean()*100:.1f}%)")
-print()
-print(f" backup (500 samples):")
-print(f" Similarity: min={sims_backup.min():.6f}, max={sims_backup.max():.6f}, mean={sims_backup.mean():.6f}")
-print(f" Labels: {labels_backup.sum()}/{labels_backup.size} positive ({labels_backup.mean()*100:.1f}%)")
-print()
-
-# Run FDR calibration on both with same parameters
-print("Running FDR calibration (alpha=0.1, n_calib=1000, 10 trials)...")
-print()
-
-def run_fdr_trials(data, name, n_trials=10, n_calib=1000):
-    lhats = []
-    risks = []
-    tprs = []
-
-    for trial in range(n_trials):
-        np.random.seed(42 + trial)
-        np.random.shuffle(data)
-        cal_data = data[:n_calib]
-        test_data = data[n_calib:n_calib+500]
-
-        X_cal, y_cal = get_sims_labels(cal_data, partial=False)
-        X_test, y_test = get_sims_labels(test_data, partial=False)
-
-        lhat, fdr_cal = get_thresh_FDR(y_cal, X_cal, alpha=0.1, delta=0.5, N=100)
-        lhats.append(lhat)
-
-        # Calculate test risk and TPR
-        preds = (X_test >= lhat).astype(int)
-        tp = np.sum((preds == 1) & (y_test == 1))
-        fp = np.sum((preds == 1) & (y_test == 0))
-        fn = np.sum((preds == 0) & (y_test == 1))
-
-        tpr = tp / (tp + fn) if (tp + fn) > 0 else 0
-        risk = fp / (fp + tp) if (fp + tp) > 0 else 0
-
-        tprs.append(tpr)
-        risks.append(risk)
-
-    print(f"{name}:")
-    print(f" λ (threshold): {np.mean(lhats):.10f} ± {np.std(lhats):.10f}")
-    print(f" Risk (FDR): {np.mean(risks):.4f} ± {np.std(risks):.4f}")
-    print(f" TPR: {np.mean(tprs)*100:.1f}% ± {np.std(tprs)*100:.1f}%")
-    print()
-    return lhats, risks, tprs
-
-lhats_new, risks_new, tprs_new = run_fdr_trials(pfam_new.copy(), "pfam_new_proteins")
-lhats_backup, risks_backup, tprs_backup = run_fdr_trials(backup_data.copy(), "backup_dataset")
-
-print("=" * 60)
-print("CONCLUSION")
-print("=" * 60)
-if abs(np.mean(lhats_new) - np.mean(lhats_backup)) < 0.00001:
-    print("✓ Thresholds are similar - datasets likely compatible")
-else:
-    print("⚠ Thresholds differ significantly!")
-    print(f" Difference: {abs(np.mean(lhats_new) - np.mean(lhats_backup)):.10f}")
-
-if len(overlap) > len(pfam_metas) * 0.5:
-    print("⚠ High overlap between datasets - potential data source")
-else:
-    print("✓ Low overlap - datasets appear independent")
scripts/precompute_fdr_thresholds.sh
DELETED
@@ -1,96 +0,0 @@
-#!/bin/bash
-
-# Get the directory where this script is located
-SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
-PROJECT_ROOT="$( cd "$SCRIPT_DIR/.." &> /dev/null && pwd )"
-
-# Parameters
-MIN_ALPHA=0.01
-MAX_ALPHA=1.0
-NUM_ALPHA_VALUES=100
-NUM_TRIALS=100
-N_CALIB=1000
-DELTA=0.5
-OUTPUT_DIR="$PROJECT_ROOT/results"
-TEMP_DIR="$SCRIPT_DIR/temp_fdr_results"
-CSV_OUTPUT="$OUTPUT_DIR/fdr_thresholds.csv"
-
-echo "Script directory: $SCRIPT_DIR"
-echo "Project root: $PROJECT_ROOT"
-echo "Output directory: $OUTPUT_DIR"
-
-
-mkdir -p "$OUTPUT_DIR"
-mkdir -p "$TEMP_DIR"
-
-# Initialize CSV file with header
-echo "alpha,lambda_threshold,exact_fdr,partial_fdr" > "$CSV_OUTPUT"
-
-# Generate alpha values using Python
-ALPHA_VALUES=$(python -c "
-import numpy as np
-alphas = np.linspace($MIN_ALPHA, $MAX_ALPHA, $NUM_ALPHA_VALUES)
-print(' '.join([str(a) for a in alphas]))
-")
-
-# Counter for progress tracking
-counter=0
-total=$NUM_ALPHA_VALUES
-
-# Loop over alpha values
-for alpha in $ALPHA_VALUES; do
-    counter=$((counter + 1))
-
-    # Run FDR generation for exact matches
-    python "$PROJECT_ROOT/pfam/generate_fdr.py" \
-        --alpha "$alpha" \
-        --partial false \
-        --num_trials "$NUM_TRIALS" \
-        --n_calib "$N_CALIB" \
-        --delta "$DELTA" \
-        --output "$TEMP_DIR/fdr_exact_$alpha" \
-        --add_date false
-
-    # Run FDR generation for partial matches
-    echo " Running partial matches..."
-    python "$PROJECT_ROOT/pfam/generate_fdr.py" \
-        --alpha "$alpha" \
-        --partial true \
-        --num_trials "$NUM_TRIALS" \
-        --n_calib "$N_CALIB" \
-        --delta "$DELTA" \
-        --output "$TEMP_DIR/fdr_partial_$alpha" \
-        --add_date false
-
-    # Extract results and append to CSV using Python
-    python -c "
-import numpy as np
-import sys
-
-try:
-    # Load exact match results
-    exact_data = np.load('$TEMP_DIR/fdr_exact_$alpha.npy', allow_pickle=True).item()
-    exact_lhat = np.mean(exact_data['lhats'])
-    exact_fdr = np.mean(exact_data['risks'])
-
-    # Load partial match results
-    partial_data = np.load('$TEMP_DIR/fdr_partial_$alpha.npy', allow_pickle=True).item()
-    partial_fdr = np.mean(partial_data['risks'])
-
-    # Write to CSV
-    with open('$CSV_OUTPUT', 'a') as f:
-        f.write(f'$alpha,{exact_lhat},{exact_fdr},{partial_fdr}\n')
-
-    print(f' Results: lambda={exact_lhat:.6f}, exact_fdr={exact_fdr:.6f}, partial_fdr={partial_fdr:.6f}')
-
-except Exception as e:
-    print(f'Error processing alpha=$alpha: {e}', file=sys.stderr)
-    sys.exit(1)
-"
-done
-
-# Clean up temporary files
-rm -rf "$TEMP_DIR"
-echo "Results saved to: $CSV_OUTPUT"
-echo "Total alpha values processed: $total"
-
scripts/precompute_fnr_thresholds.sh
DELETED
@@ -1,106 +0,0 @@
-#!/bin/bash
-
-# Script to precompute FNR thresholds for different alpha values
-# Usage: ./precompute_fnr_thresholds.sh [OPTIONS]
-
-# Default parameters - can be modified as needed
-MIN_ALPHA=0.01
-MAX_ALPHA=1.0
-NUM_ALPHA_VALUES=100
-NUM_TRIALS=100
-N_CALIB=1000
-OUTPUT_DIR="../results"
-TEMP_DIR="./temp_fnr_results"
-CSV_OUTPUT="$OUTPUT_DIR/fnr_thresholds.csv"
-
-mkdir -p "$OUTPUT_DIR"
-mkdir -p "$TEMP_DIR"
-
-# Initialize CSV file with header
-echo "alpha,lambda_threshold,exact_fnr,partial_fnr" > "$CSV_OUTPUT"
-
-echo "Precomputing FNR thresholds..."
-echo "Alpha range: $MIN_ALPHA to $MAX_ALPHA"
-echo "Number of alpha values: $NUM_ALPHA_VALUES"
-echo "Trials per alpha: $NUM_TRIALS"
-echo "Calibration set size: $N_CALIB"
-echo "Output file: $CSV_OUTPUT"
-echo ""
-
-# Generate alpha values using Python
-ALPHA_VALUES=$(python -c "
-import numpy as np
-alphas = np.linspace($MIN_ALPHA, $MAX_ALPHA, $NUM_ALPHA_VALUES)
-print(' '.join([str(a) for a in alphas]))
-")
-
-# Counter for progress tracking
-counter=0
-total=$NUM_ALPHA_VALUES
-
-# Loop over alpha values
-for alpha in $ALPHA_VALUES; do
-    counter=$((counter + 1))
-    echo "Processing alpha=$alpha ($counter/$total)..."
-
-    # Run FNR generation for exact matches
-    echo " Running exact matches..."
-    python ../pfam/generate_fnr.py \
-        --alpha "$alpha" \
-        --partial false \
-        --num_trials "$NUM_TRIALS" \
-        --n_calib "$N_CALIB" \
-        --output "$TEMP_DIR/fnr_exact_$alpha" \
-        --add_date false
-
-    # Run FNR generation for partial matches
-    echo " Running partial matches..."
-    python ../pfam/generate_fnr.py \
-        --alpha "$alpha" \
-        --partial true \
-        --num_trials "$NUM_TRIALS" \
-        --n_calib "$N_CALIB" \
-        --output "$TEMP_DIR/fnr_partial_$alpha" \
-        --add_date false
-
-    # Extract results and append to CSV using Python
-    python -c "
-import numpy as np
-import sys
-
-try:
-    # Load exact match results
-    exact_data = np.load('$TEMP_DIR/fnr_exact_$alpha.npy', allow_pickle=True).item()
-    exact_lhat = np.mean(exact_data['lhats'])
-    exact_fnr = np.mean(exact_data['fnrs'])
-
-    # Load partial match results
-    partial_data = np.load('$TEMP_DIR/fnr_partial_$alpha.npy', allow_pickle=True).item()
-    partial_fnr = np.mean(partial_data['fnrs'])
-
-    # Write to CSV
-    with open('$CSV_OUTPUT', 'a') as f:
-        f.write(f'$alpha,{exact_lhat},{exact_fnr},{partial_fnr}\n')
-
-    print(f' Results: lambda={exact_lhat:.6f}, exact_fnr={exact_fnr:.6f}, partial_fnr={partial_fnr:.6f}')
-
-except Exception as e:
-    print(f'Error processing alpha=$alpha: {e}', file=sys.stderr)
-    sys.exit(1)
-"
-
-    if [ $? -ne 0 ]; then
-        echo "Error processing alpha=$alpha" >&2
-        exit 1
-    fi
-done
-
-# Clean up temporary files
-echo ""
-echo "Cleaning up temporary files..."
-rm -rf "$TEMP_DIR"
-
-echo "FNR threshold precomputation completed!"
-echo "Results saved to: $CSV_OUTPUT"
-echo "Total alpha values processed: $total"
-
scripts/quick_fdr_check.py
DELETED
@@ -1,58 +0,0 @@
#!/usr/bin/env python
"""Quick FDR calibration check - compare the two datasets."""
import numpy as np
import sys
sys.path.insert(0, '.')
from protein_conformal.util import get_sims_labels, get_thresh_FDR

print("Quick FDR Calibration Check")
print("=" * 50)

# Load both datasets
pfam_new = np.load('data/pfam_new_proteins.npy', allow_pickle=True)
backup = np.load('/groups/doudna/projects/ronb/conformal_backup/protein-conformal/data/conformal_pfam_with_lookup_dataset.npy', allow_pickle=True)

print(f"pfam_new: {len(pfam_new)} samples")
print(f"backup: {len(backup)} samples")
print()

# Compare similarity and label distributions
sims1, labels1 = get_sims_labels(pfam_new[:500], partial=False)
sims2, labels2 = get_sims_labels(backup[:500], partial=False)

print("Stats from first 500 samples:")
print(f"  pfam_new - positives: {labels1.sum()}/{labels1.size} ({100*labels1.mean():.2f}%)")
print(f"  backup   - positives: {labels2.sum()}/{labels2.size} ({100*labels2.mean():.2f}%)")
print()

# Run a single FDR calibration on each
print("Single FDR trial (n_calib=1000, alpha=0.1):")
np.random.seed(42)

# pfam_new
np.random.shuffle(pfam_new)
X1, y1 = get_sims_labels(pfam_new[:1000], partial=False)
lhat1, _ = get_thresh_FDR(y1, X1, alpha=0.1, delta=0.5, N=100)

# backup
np.random.shuffle(backup)
X2, y2 = get_sims_labels(backup[:1000], partial=False)
lhat2, _ = get_thresh_FDR(y2, X2, alpha=0.1, delta=0.5, N=100)

print(f"  pfam_new λ: {lhat1:.10f}")
print(f"  backup λ:   {lhat2:.10f}")
print(f"  Paper λ:    0.9999802250 (from pfam_fdr_2024-06-25.npy)")
print()

# Which is closer to paper?
diff1 = abs(lhat1 - 0.9999802250)
diff2 = abs(lhat2 - 0.9999802250)
print("Difference from paper threshold:")
print(f"  pfam_new: {diff1:.10f}")
print(f"  backup:   {diff2:.10f}")
print()

if diff1 < diff2:
    print("→ pfam_new is closer to paper threshold")
else:
    print("→ backup is closer to paper threshold")
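The deleted check above relies on the repo's `get_thresh_FDR(y, X, alpha, delta, N)` helper. As a rough illustration of what such a calibration does, here is a minimal NumPy sketch (synthetic data, names are hypothetical) that picks the loosest similarity cutoff whose *empirical* FDR on labeled calibration pairs stays at or below alpha. It uses no LTT-style finite-sample correction, so it is an illustration only, not a drop-in replacement for the repo's helper.

```python
import numpy as np

def empirical_fdr_threshold(sims, labels, alpha):
    """Loosest similarity cutoff whose empirical FDR on the
    calibration pairs is <= alpha (illustration only; no
    finite-sample guarantee, unlike an LTT-calibrated lambda)."""
    order = np.argsort(sims)[::-1]            # highest similarity first
    sorted_labels = labels[order]
    n_kept = np.arange(1, len(sims) + 1)      # hits kept at each cutoff
    false_hits = np.cumsum(1 - sorted_labels)
    fdr = false_hits / n_kept                 # empirical FDR per prefix
    ok = np.where(fdr <= alpha)[0]
    if len(ok) == 0:
        return np.inf                         # no cutoff controls FDR
    return sims[order][ok[-1]]                # loosest qualifying cutoff

# Toy calibration data: positives concentrate at high similarity
rng = np.random.default_rng(0)
sims = np.concatenate([rng.uniform(0.9, 1.0, 200),
                       rng.uniform(0.0, 0.95, 800)])
labels = np.concatenate([np.ones(200), np.zeros(800)]).astype(int)

lam = empirical_fdr_threshold(sims, labels, alpha=0.1)
kept = sims >= lam
print(f"threshold: {lam:.4f}, kept: {kept.sum()}, "
      f"empirical FDR: {(1 - labels[kept]).mean():.3f}")
```

Because Pfam positives are rare and similarities are tightly clustered near 1.0, even small shifts in the calibration set's positive rate move such a cutoff in the high decimal places, which is exactly the λ discrepancy the deleted script was probing.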
scripts/search_and_assign_probs.sh
DELETED
@@ -1,123 +0,0 @@
#!/bin/bash

# Define the lookup directory
LOOKUP_DIR="/home/yangk/proteins/conformal_protein_search/data"

# Navigate to the conformal-protein-retrieval directory
cd /home/yangk/proteins/conformal_protein_search/conformal-protein-retrieval || exit

# Specify the directory containing FASTA and embedding files
INPUT_DIR="${1:-.}"

# Create an output directory for the results
OUTPUT_DIR="$INPUT_DIR/output"
mkdir -p "$OUTPUT_DIR"

# Precomputed probabilities file for get_probs.py
PRECOMPUTED_PROBS="$LOOKUP_DIR/pfam_new_proteins_SVA_probs_1000_bins_200_calibration_pts.csv"

# Loop through all *_prot.fasta files
for fasta_file in "$INPUT_DIR"/*_prot.fasta; do
    if [[ -f "$fasta_file" ]]; then
        # Extract base name
        base_name=$(basename "$fasta_file" "_prot.fasta")

        # Corresponding embedding file
        emb_file="$INPUT_DIR/emb/${base_name}_emb.npy"

        # Ensure embedding file exists
        if [[ -f "$emb_file" ]]; then
            # Output CSV file for search.py results
            search_output_csv="$OUTPUT_DIR/partial_pfam_${base_name}_hits.csv"

            # Run the search script, with pfam partial FDR control as filtering threshold
            echo "Running search for: $fasta_file & $emb_file -> $search_output_csv"
            python scripts/search.py --fdr --fdr_lambda 0.9999642502418673 \
                --output "$search_output_csv" \
                --query_embedding "$emb_file" \
                --query_fasta "$fasta_file" \
                --lookup_embedding "$LOOKUP_DIR/lookup_embeddings.npy" \
                --lookup_fasta "$LOOKUP_DIR/lookup_embeddings_meta_data.fasta"

            # Ensure search output file exists before proceeding
            if [[ -f "$search_output_csv" ]]; then
                # Output CSV file for get_probs.py results
                probs_output_csv="$OUTPUT_DIR/partial_pfam_${base_name}_hits_with_probs.csv"

                # Run get_probs.py
                echo "Running get_probs for: $search_output_csv -> $probs_output_csv"
                python scripts/get_probs.py --precomputed --precomputed_path "$PRECOMPUTED_PROBS" \
                    --input "$search_output_csv" --output "$probs_output_csv" --partial
            else
                echo "Warning: Search output file $search_output_csv not found. Skipping probability computation."
            fi
        else
            echo "Warning: No matching embedding found for $fasta_file. Skipping..."
        fi
    fi
done

echo "Pipeline completed. Results saved in $OUTPUT_DIR."

(The deleted file repeated this entire block verbatim a second time, lines 63-123.)
scripts/slurm_calibrate_fdr.sh
DELETED
@@ -1,59 +0,0 @@
#!/bin/bash
#SBATCH --job-name=cpr-calibrate-fdr
#SBATCH --output=logs/cpr-calibrate-fdr-%j.out
#SBATCH --error=logs/cpr-calibrate-fdr-%j.err
#SBATCH --time=2:00:00
#SBATCH --mem=32G
#SBATCH --cpus-per-task=4

# CPR FDR Calibration - Reproduces paper threshold computation
# Usage: sbatch scripts/slurm_calibrate_fdr.sh

set -e
mkdir -p logs results

source ~/.bashrc
eval "$(conda shell.bash hook)"
conda activate conformal-s

echo "========================================"
echo "CPR FDR Calibration"
echo "Date: $(date)"
echo "Node: $(hostname)"
echo "========================================"

# IMPORTANT: Use pfam_new_proteins.npy - the CORRECT calibration dataset
# The backup dataset (conformal_pfam_with_lookup_dataset.npy) has DATA LEAKAGE:
#   - First 50 samples all have the same Pfam family "PF01266;" repeated
#   - Positive rate is 3.00% vs 0.22% in the correct dataset
#   - Results in a different FDR threshold (~0.999965 vs paper's ~0.999980)
# See: scripts/quick_fdr_check.py for verification
CALIB_DATA="data/pfam_new_proteins.npy"

# Check if data exists
if [ ! -f "$CALIB_DATA" ]; then
    echo "ERROR: Calibration data not found at $CALIB_DATA"
    echo "Download from Zenodo: https://zenodo.org/records/14272215"
    exit 1
fi

echo "Using calibration data: $CALIB_DATA"
echo "NOTE: Using pfam_new_proteins.npy (correct dataset without leakage)"
echo ""

# Run calibration using the ORIGINAL generate_fdr.py script (LTT method)
# This matches what was used to generate the paper's threshold
python scripts/pfam/generate_fdr.py \
    --data_path "$CALIB_DATA" \
    --alpha 0.1 \
    --num_trials 100 \
    --n_calib 1000 \
    --delta 0.5 \
    --output results/pfam_fdr \
    --add_date True

echo ""
echo "========================================"
echo "Expected result: mean lhat ≈ 0.999980"
echo "Completed: $(date)"
echo "========================================"
scripts/slurm_calibrate_fnr.sh
DELETED
@@ -1,47 +0,0 @@
#!/bin/bash
#SBATCH --job-name=cpr-calibrate-fnr
#SBATCH --output=logs/cpr-calibrate-fnr-%j.out
#SBATCH --error=logs/cpr-calibrate-fnr-%j.err
#SBATCH --time=2:00:00
#SBATCH --mem=32G
#SBATCH --cpus-per-task=4

# CPR FNR Calibration - Computes FNR thresholds
# Usage: sbatch scripts/slurm_calibrate_fnr.sh

set -e
mkdir -p logs results

source ~/.bashrc
eval "$(conda shell.bash hook)"
conda activate conformal-s

echo "========================================"
echo "CPR FNR Calibration"
echo "Date: $(date)"
echo "Node: $(hostname)"
echo "========================================"

# Use the ORIGINAL calibration dataset from backup
CALIB_DATA="/groups/doudna/projects/ronb/conformal_backup/protein-conformal/data/conformal_pfam_with_lookup_dataset.npy"

if [ ! -f "$CALIB_DATA" ]; then
    echo "ERROR: Calibration data not found at $CALIB_DATA"
    exit 1
fi

echo "Using calibration data: $CALIB_DATA"
echo ""

python scripts/pfam/generate_fnr.py \
    --data_path "$CALIB_DATA" \
    --alpha 0.1 \
    --num_trials 100 \
    --n_calib 1000 \
    --output results/pfam_fnr \
    --add_date True

echo ""
echo "========================================"
echo "Completed: $(date)"
echo "========================================"
scripts/slurm_compute_fdr_all.sh
DELETED
@@ -1,65 +0,0 @@
#!/bin/bash
#SBATCH --job-name=fdr-all
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=24:00:00
#SBATCH --output=logs/fdr_all_%j.out
#SBATCH --error=logs/fdr_all_%j.err

# Compute FDR thresholds at all alpha levels (both exact and partial matches)
# This uses the FIXED compute_fdr_table.py with correct argument order

set -e

echo "=== FDR Threshold Computation (All Alpha Levels) ==="
echo "Job ID: $SLURM_JOB_ID"
echo "Started: $(date)"
echo ""

# Setup environment
eval "$(/shared/software/miniconda3/latest/bin/conda shell.bash hook)"
conda activate conformal-s

cd /groups/doudna/projects/ronb/conformal-protein-retrieval

# Calibration data
CALIB_DATA="/groups/doudna/projects/ronb/conformal_backup/protein-conformal/data/pfam_new_proteins.npy"

# Alpha levels to compute
ALPHA_LEVELS="0.001,0.005,0.01,0.02,0.05,0.1,0.15,0.2"

echo "Calibration data: $CALIB_DATA"
echo "Alpha levels: $ALPHA_LEVELS"
echo ""

# Exact match FDR thresholds
echo "=== Computing EXACT match FDR thresholds ==="
python scripts/compute_fdr_table.py \
    --calibration "$CALIB_DATA" \
    --output results/fdr_thresholds.csv \
    --n-trials 100 \
    --n-calib 1000 \
    --seed 42 \
    --alpha-levels "$ALPHA_LEVELS"

echo ""
echo "=== Computing PARTIAL match FDR thresholds ==="
python scripts/compute_fdr_table.py \
    --calibration "$CALIB_DATA" \
    --output results/fdr_thresholds_partial.csv \
    --n-trials 100 \
    --n-calib 1000 \
    --seed 42 \
    --alpha-levels "$ALPHA_LEVELS" \
    --partial

echo ""
echo "=== FDR Computation Complete ==="
echo "Results:"
echo "  - results/fdr_thresholds.csv (exact match)"
echo "  - results/fdr_thresholds_partial.csv (partial match)"
echo ""
echo "Finished: $(date)"
scripts/slurm_compute_fnr_partial.sh
DELETED
@@ -1,38 +0,0 @@
#!/bin/bash
#SBATCH --job-name=fnr-partial
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=12:00:00
#SBATCH --output=/groups/doudna/projects/ronb/conformal-protein-retrieval/logs/fnr_partial_%j.log
#SBATCH --error=/groups/doudna/projects/ronb/conformal-protein-retrieval/logs/fnr_partial_%j.err

# Compute FNR thresholds for PARTIAL matches only
# Use this if the main FNR job timed out before completing partial match computation

eval "$(/shared/software/miniconda3/latest/bin/conda shell.bash hook)"
conda activate conformal-s

cd /groups/doudna/projects/ronb/conformal-protein-retrieval

echo "============================================"
echo "Computing FNR Thresholds (Partial Match Only)"
echo "============================================"
echo "Start time: $(date)"
echo "Node: $(hostname)"
echo ""

python scripts/compute_fnr_table.py \
    --calibration data/pfam_new_proteins.npy \
    --output results/fnr_thresholds_partial.csv \
    --n-trials 100 \
    --n-calib 1000 \
    --seed 42 \
    --partial

echo ""
echo "============================================"
echo "Completed: $(date)"
echo "============================================"
scripts/slurm_investigate.sh
DELETED
@@ -1,36 +0,0 @@
#!/bin/bash
#SBATCH --job-name=cpr-investigate
#SBATCH --output=logs/cpr-investigate-%j.out
#SBATCH --error=logs/cpr-investigate-%j.err
#SBATCH --time=1:00:00
#SBATCH --mem=32G
#SBATCH --cpus-per-task=4

# CPR Investigation - FDR calibration and precomputed probability verification
set -e
mkdir -p logs data

source ~/.bashrc
eval "$(conda shell.bash hook)"
conda activate conformal-s

cd /groups/doudna/projects/ronb/conformal-protein-retrieval

echo "========================================"
echo "CPR Investigation"
echo "Date: $(date)"
echo "Node: $(hostname)"
echo "========================================"
echo ""

echo "=== Part 1: FDR Calibration Investigation ==="
python scripts/investigate_fdr.py
echo ""

echo "=== Part 2: Precomputed Probability Verification ==="
python scripts/test_precomputed_probs.py
echo ""

echo "========================================"
echo "Completed: $(date)"
echo "========================================"
scripts/slurm_test_clean_embed.sh
DELETED
@@ -1,108 +0,0 @@
#!/bin/bash
#SBATCH --job-name=test_clean_embed
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00
#SBATCH --output=logs/test_clean_embed_%j.out
#SBATCH --error=logs/test_clean_embed_%j.err

# Test CLEAN embedding with the CPR CLI
# This script:
# 1. Runs CLI tests
# 2. Tests CLEAN embedding on a small FASTA file

set -e

echo "=== CPR CLEAN Embedding Test ==="
echo "Date: $(date)"
echo "Node: $(hostname)"
echo "Job ID: $SLURM_JOB_ID"

# Create logs directory if it doesn't exist
mkdir -p logs

# Activate conda environment
eval "$(/shared/software/miniconda3/latest/bin/conda shell.bash hook)"
conda activate conformal-s

# Print environment info
echo ""
echo "=== Environment Info ==="
which python
python --version
python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"
python -c "import faiss; print(f'FAISS: {faiss.__version__}')"

# Change to repo directory
cd /groups/doudna/projects/ronb/conformal-protein-retrieval

# 1. Run CLI tests
echo ""
echo "=== Running CLI Tests ==="
python -m pytest tests/test_cli.py -v --tb=short 2>&1 || echo "Note: Some tests may fail if dependencies are missing"

# 2. Create a small test FASTA file
echo ""
echo "=== Creating Test FASTA ==="
TEST_DIR="test_clean_output"
mkdir -p "$TEST_DIR"

cat > "$TEST_DIR/test_sequences.fasta" << 'EOF'
>seq1_test_enzyme
MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTLTYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITLGMDELYK
>seq2_test_enzyme
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
>seq3_test_enzyme
MTSKGECFVTVTYKNLFPPEQWSPKQYLFHNASDKGFVPTHICTHGCLSPKQLQEFDLVNQADQEGWSGDYTCQCNCTQQALCGFPVFLGCEACTFTPDCHGECVCKFPFGEYFVCDCDGSPDCG
EOF

echo "Created test FASTA with 3 sequences"

# 3. Test CLEAN embedding (requires GPU)
echo ""
echo "=== Testing CLEAN Embedding ==="
echo "Checking CLEAN installation..."
python -c "from CLEAN.model import LayerNormNet; print('CLEAN model import OK')" 2>&1 || {
    echo "CLEAN not installed, installing..."
    cd CLEAN_repo/app
    python build.py install
    cd ../..
}

echo ""
echo "Running cpr embed with CLEAN model..."
time python -m protein_conformal.cli embed \
    --input "$TEST_DIR/test_sequences.fasta" \
    --output "$TEST_DIR/test_clean_embeddings.npy" \
    --model clean

# 4. Verify output
echo ""
echo "=== Verifying Output ==="
if [ -f "$TEST_DIR/test_clean_embeddings.npy" ]; then
    python -c "
import numpy as np
emb = np.load('$TEST_DIR/test_clean_embeddings.npy')
print(f'Embeddings shape: {emb.shape}')
print(f'Expected: (3, 128)')
assert emb.shape == (3, 128), f'Shape mismatch: expected (3, 128), got {emb.shape}'
print('SUCCESS: CLEAN embedding test passed!')
"
else
    echo "ERROR: Output file not created"
    exit 1
fi

# 5. Optional: Compare with reference (if exists)
echo ""
echo "=== Test Complete ==="
echo "Output saved to: $TEST_DIR/test_clean_embeddings.npy"
echo ""

# Cleanup (optional - uncomment to remove test files)
# rm -rf "$TEST_DIR"

echo "Done at $(date)"
scripts/slurm_verify.sh
DELETED
@@ -1,97 +0,0 @@
#!/bin/bash
#SBATCH --job-name=cpr-verify
#SBATCH --output=logs/cpr-verify-%j.out
#SBATCH --error=logs/cpr-verify-%j.err
#SBATCH --time=1:00:00
#SBATCH --mem=32G
#SBATCH --cpus-per-task=4

# CPR Verification - Reproduces paper results
# Usage:
#   sbatch scripts/slurm_verify.sh syn30   # Verify Syn3.0 (Figure 2A)
#   sbatch scripts/slurm_verify.sh fdr     # Verify FDR algorithm
#   sbatch scripts/slurm_verify.sh dali    # Verify DALI prefiltering (Tables 4-6)
#   sbatch scripts/slurm_verify.sh clean   # Verify CLEAN enzyme (Tables 1-2)
#   sbatch scripts/slurm_verify.sh probs   # Verify precomputed probability lookup
#   sbatch scripts/slurm_verify.sh all     # Run all verifications

set -e
mkdir -p logs results

source ~/.bashrc
eval "$(conda shell.bash hook)"
conda activate conformal-s

CHECK="${1:-all}"

echo "========================================"
echo "CPR Verification"
echo "Date: $(date)"
echo "Node: $(hostname)"
echo "Check: $CHECK"
echo "========================================"
echo ""

run_syn30() {
    echo "--- Syn3.0 Verification (Paper Figure 2A) ---"
    python scripts/verify_syn30.py
    echo ""
}

run_fdr() {
    echo "--- FDR Algorithm Verification ---"
    python scripts/verify_fdr_algorithm.py
    echo ""
}

run_dali() {
    echo "--- DALI Prefiltering Verification (Tables 4-6) ---"
    python scripts/verify_dali.py
    echo ""
}

run_clean() {
    echo "--- CLEAN Enzyme Classification Verification (Tables 1-2) ---"
    python scripts/verify_clean.py
    echo ""
}

run_probs() {
    echo "--- Precomputed Probability Lookup Verification ---"
    python scripts/test_precomputed_probs.py
    echo ""
}

case "$CHECK" in
    syn30)
        run_syn30
        ;;
    fdr)
        run_fdr
        ;;
    dali)
        run_dali
        ;;
    clean)
        run_clean
        ;;
    probs)
        run_probs
        ;;
    all)
        run_syn30
        run_fdr
        run_dali
        run_clean
        run_probs
        ;;
    *)
        echo "Unknown check: $CHECK"
        echo "Available: syn30, fdr, dali, clean, probs, all"
        exit 1
        ;;
esac

echo "========================================"
echo "Completed: $(date)"
echo "========================================"
scripts/slurm_verify_probs.sh
DELETED
@@ -1,34 +0,0 @@
#!/bin/bash
#SBATCH --job-name=cpr-verify-probs
#SBATCH --output=logs/cpr-verify-probs-%j.out
#SBATCH --error=logs/cpr-verify-probs-%j.err
#SBATCH --time=1:00:00
#SBATCH --mem=16G
#SBATCH --cpus-per-task=2

# CPR Precomputed Probability Verification
# Verifies that the sim->prob lookup table matches direct Venn-Abers computation
# Usage: sbatch scripts/slurm_verify_probs.sh

set -e
mkdir -p logs data

echo "========================================"
echo "CPR Probability Lookup Verification"
echo "Date: $(date)"
echo "Node: $(hostname)"
echo "========================================"
echo ""

# Activate conda environment
source ~/.bashrc
eval "$(conda shell.bash hook)"
conda activate conformal-s

# Run the verification test
python scripts/test_precomputed_probs.py

echo ""
echo "========================================"
echo "Completed: $(date)"
echo "========================================"
scripts/submit_fdr_parallel.sh
DELETED
@@ -1,47 +0,0 @@
#!/bin/bash
# Submit FDR threshold jobs in parallel - one per alpha level

ALPHAS="0.001 0.005 0.01 0.02 0.05 0.1 0.15 0.2"

for alpha in $ALPHAS; do
    # Exact match
    sbatch --job-name="fdr-exact-${alpha}" \
        --partition=standard \
        --nodes=1 --ntasks=1 --cpus-per-task=4 --mem=32G \
        --time=08:00:00 \
        --output=/groups/doudna/projects/ronb/conformal-protein-retrieval/logs/fdr_exact_${alpha}_%j.log \
        --wrap="
            eval \"\$(/shared/software/miniconda3/latest/bin/conda shell.bash hook)\"
            conda activate conformal-s
            cd /groups/doudna/projects/ronb/conformal-protein-retrieval
            python scripts/compute_fdr_table.py \
                --calibration data/pfam_new_proteins.npy \
                --output results/fdr_exact_alpha_${alpha}.csv \
                --n-trials 100 \
                --n-calib 1000 \
                --seed 42 \
                --alpha-levels ${alpha}
        "

    # Partial match
    sbatch --job-name="fdr-partial-${alpha}" \
        --partition=standard \
        --nodes=1 --ntasks=1 --cpus-per-task=4 --mem=32G \
        --time=08:00:00 \
        --output=/groups/doudna/projects/ronb/conformal-protein-retrieval/logs/fdr_partial_${alpha}_%j.log \
        --wrap="
            eval \"\$(/shared/software/miniconda3/latest/bin/conda shell.bash hook)\"
            conda activate conformal-s
            cd /groups/doudna/projects/ronb/conformal-protein-retrieval
            python scripts/compute_fdr_table.py \
                --calibration data/pfam_new_proteins.npy \
                --output results/fdr_partial_alpha_${alpha}.csv \
                --n-trials 100 \
                --n-calib 1000 \
                --seed 42 \
                --alpha-levels ${alpha} \
                --partial
        "
done

echo "Submitted 16 FDR jobs (8 alphas × 2 match types)"