ronboger (with Claude Opus 4.5) committed dd5ecfc · 1 parent: 68a03a2

chore: clean up redundant scripts and consolidate thresholds


- Archive 16 redundant/one-off scripts to scripts/archive/
- Archive 3 duplicate Python files from notebooks/pfam/
- Remove "simple" CSV files, add full threshold tables to GETTING_STARTED.md
- Keep 4 clean SLURM scripts: build, embed, fdr, fnr
- Update .gitignore for archive directories

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

.gitignore CHANGED
@@ -211,5 +211,7 @@ test_clean_output/
 protein_vec_models.gz
 CLEAN_repo/
 
-# Archived legacy notebooks (moved from root dirs)
+# Archived legacy code (redundant/one-off scripts)
 notebooks_archive/
+scripts/archive/
+notebooks/*/archive/
GETTING_STARTED.md CHANGED
@@ -165,20 +165,49 @@ cpr search --input data/gene_unknown/unknown_aa_seqs.npy \
 
 ## FDR/FNR Threshold Reference
 
-These thresholds control the trade-off between hits and false positives:
-
-| α Level | FDR Threshold (λ) | FNR Threshold (λ) | Use Case |
-|---------|-------------------|-------------------|----------|
-| 0.001 | (see csv) | 0.9997904 | Ultra-stringent |
-| 0.01 | (see csv) | 0.9998495 | Very stringent |
-| 0.05 | (see csv) | 0.9998899 | Stringent |
-| **0.1** | **0.9999802** | **0.9999076** | **Paper default** |
-| 0.15 | (see csv) | 0.9999174 | Relaxed |
-| 0.2 | (see csv) | 0.9999245 | Discovery-focused |
-
-**Note**: FDR threshold at α=0.1 is the paper-verified value. Other FDR values require running `scripts/compute_fdr_table.py`.
-
-Full computed tables in `results/fdr_thresholds.csv` and `results/fnr_thresholds.csv`.
+These thresholds control the trade-off between hits and false positives.
+
+### FDR Thresholds (False Discovery Rate)
+
+Controls the expected fraction of hits that are false positives.
+
+| α Level | Threshold (λ) | Std Dev | Use Case |
+|---------|---------------|---------|----------|
+| **0.1** | **0.9999801** | ±1.7e-06 | **Paper default** |
+
+**Note**: FDR threshold at α=0.1 is verified against the paper (0.9999802). Additional alpha levels can be computed with `scripts/compute_fdr_table.py`.
+
+### FNR Thresholds (False Negative Rate) - Exact Match
+
+Controls the expected fraction of true matches you miss. "Exact match" requires all Pfam domains to match.
+
+| α Level | Threshold (λ) | Std Dev | Use Case |
+|---------|---------------|---------|----------|
+| 0.001 | 0.9997904 | ±2.3e-05 | Ultra-stringent |
+| 0.005 | 0.9998338 | ±8.2e-06 | Very stringent |
+| 0.01 | 0.9998495 | ±5.5e-06 | Stringent |
+| 0.02 | 0.9998679 | ±5.1e-06 | Moderate |
+| 0.05 | 0.9998899 | ±3.3e-06 | Balanced |
+| **0.1** | **0.9999076** | ±2.2e-06 | **Recommended** |
+| 0.15 | 0.9999174 | ±1.4e-06 | Relaxed |
+| 0.2 | 0.9999245 | ±1.3e-06 | Discovery-focused |
+
+### FNR Thresholds - Partial Match
+
+"Partial match" requires at least one Pfam domain to match (more permissive).
+
+| α Level | Threshold (λ) | Std Dev | Use Case |
+|---------|---------------|---------|----------|
+| 0.001 | 0.9997646 | ±1.5e-06 | Ultra-stringent |
+| 0.005 | 0.9997821 | ±2.8e-06 | Very stringent |
+| 0.01 | 0.9997946 | ±3.1e-06 | Stringent |
+| 0.02 | 0.9998108 | ±3.5e-06 | Moderate |
+| 0.05 | 0.9998389 | ±3.0e-06 | Balanced |
+| **0.1** | **0.9998626** | ±2.8e-06 | **Recommended** |
+| 0.15 | 0.9998779 | ±2.2e-06 | Relaxed |
+| 0.2 | 0.9998903 | ±2.1e-06 | Discovery-focused |
+
+Full computed tables with min/max values in `results/fdr_thresholds.csv`, `results/fnr_thresholds.csv`, and `results/fnr_thresholds_partial.csv`.
 
 ---
 
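The thresholds documented in this commit are plain similarity cutoffs, so applying one is a single comparison. A minimal sketch (not part of this commit; `filter_hits` is a hypothetical helper, and the constants are copied from the α=0.1 rows of the tables above):

```python
import numpy as np

# Paper-default cutoffs copied from the alpha = 0.1 rows of the tables above.
FDR_LAMBDA = 0.9999801        # FDR control (exact match)
FNR_LAMBDA_EXACT = 0.9999076  # FNR control (exact match)

def filter_hits(similarities: np.ndarray, lam: float) -> np.ndarray:
    """Indices of query/lookup pairs whose similarity clears the cutoff."""
    return np.flatnonzero(similarities >= lam)

# Toy similarity scores for three candidate hits.
sims = np.array([0.99999, 0.99990, 0.99998])
print(filter_hits(sims, FDR_LAMBDA))        # only the first score clears the FDR cutoff
print(filter_hits(sims, FNR_LAMBDA_EXACT))  # the looser FNR cutoff also keeps the third
```

The FNR cutoff is deliberately lower than the FDR cutoff: missing fewer true matches means accepting more candidates.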
notebooks/pfam/generate_fdr.py DELETED
@@ -1,64 +0,0 @@
-import datetime
-import numpy as np
-import pandas as pd
-import argparse
-from tqdm import tqdm
-from protein_conformal.util import *
-
-def main():
-    parser = argparse.ArgumentParser(description='Process some integers.')
-    parser.add_argument('--alpha', type=float, default=0.1, help='Alpha value for the algorithm')
-    parser.add_argument('--partial', type=bool, default=False, help='Partial hits')
-    parser.add_argument('--num_trials', type=int, default=100, help='Number of trials to run')
-    parser.add_argument('--n_calib', type=int, default=1000, help='Number of calibration data points')
-    parser.add_argument('--delta', type=float, default=0.5, help='Delta value for the algorithm')
-    parser.add_argument('--output', type=str, default='/data/ron/protein-conformal/data/pfam_fdr.npy', help='Output file for the results')
-    parser.add_argument('--add_date', type=bool, default=True, help='Add date to output file name')
-    args = parser.parse_args()
-    alpha = args.alpha
-    num_trials = args.num_trials
-    n_calib = args.n_calib
-    delta = args.delta
-    partial = args.partial
-    # Load the data
-    # data = np.load('/data/ron/protein-conformal/data/conformal_pfam_with_lookup_dataset.npy', allow_pickle=True)
-    data = np.load('/data/ron/protein-conformal/data/pfam_new_proteins.npy', allow_pickle=True)
-
-    risks = []
-    tprs = []
-    lhats = []
-    fdr_cals = []
-    # alpha = 0.1
-    # num_trials = 100
-    # n_calib = 1000
-    for trial in tqdm(range(num_trials)):
-        np.random.shuffle(data)
-        cal_data = data[:n_calib]
-        test_data = data[n_calib:]
-        X_cal, y_cal = get_sims_labels(cal_data, partial=partial)
-        X_test, y_test_exact = get_sims_labels(test_data, partial=partial)
-        # sims, labels = get_sims_labels(cal_data, partial=False)
-        lhat, fdr_cal = get_thresh_FDR(y_cal, X_cal, alpha, delta, N=100)
-        lhats.append(lhat)
-        fdr_cals.append(fdr_cal)
-        # print(X_test.shape)
-        # print(y_test_exact.shape)
-        risks.append(risk(X_test, y_test_exact, lhat))
-        tprs.append(calculate_true_positives(X_test, y_test_exact, lhat))
-
-    print("Risk: ", np.mean(risks))
-    print("TPR: ", np.mean(tprs))
-    print("Lhat: ", np.mean(lhats))
-    print("FDR Cal: ", np.mean(fdr_cals))
-
-    output_file = args.output + ('_' + str(datetime.datetime.now().date()) if args.add_date else '' + '.npy')
-
-    np.save(output_file,
-            {'risks': risks,
-             'tprs': tprs,
-             'lhats': lhats,
-             'fdr_cals': fdr_cals})
-
-if __name__ == "__main__":
-    # add code for command line arguments
-    main()
notebooks/pfam/generate_fnr.py DELETED
@@ -1,22 +0,0 @@
-import numpy as np
-import pandas as pd
-import argparse
-from tqdm import tqdm
-from protein_conformal.util import *
-
-def main():
-    parser = argparse.ArgumentParser()
-    parser.add_argument('--alpha', type=float, default=0.1, help='Alpha value for the algorithm')
-    parser.add_argument('--num_trials', type=int, default=100, help='Number of trials to run')
-    parser.add_argument('--n_calib', type=int, default=1000, help='Number of calibration data points')
-    args = parser.parse_args()
-    alpha = args.alpha
-    num_trials = args.num_trials
-    n_calib = args.n_calib
-
-    # Load the data
-    data = np.load('/data/ron/protein-conformal/data/conformal_pfam_with_lookup_dataset.npy', allow_pickle=True)
-
-if __name__ == "__main__":
-    # add code for command line arguments
-    main()
notebooks/pfam/sva_results.py DELETED
@@ -1,59 +0,0 @@
-
-import numpy as np
-import pandas as pd
-import argparse
-from tqdm import tqdm
-from protein_conformal.util import *
-import datetime
-
-def run_trial(data, n_calib, args):
-    np.random.shuffle(data)
-    cal_data = data[:n_calib]
-    test_data = data[n_calib:3*n_calib]
-    X_cal, y_cal = get_sims_labels(cal_data, partial=False)
-    X_test, y_test_exact = get_sims_labels(test_data, partial=False)
-    # flatten the data
-    X_cal = X_cal.flatten()
-    y_cal = y_cal.flatten()
-    X_test = X_test.flatten()
-    y_test_exact = y_test_exact.flatten()
-
-    # generate random indices in the test set
-    i = np.random.randint(0, len(X_test))
-    # i_s = np.random.randint(0, len(X_test), int(len(X_test) * args.percent_sva_test))
-
-    p_0, p_1 = simplifed_venn_abers_prediction(X_cal, y_cal, X_test[i])
-    result = (np.mean([p_0, p_1]), X_test[i], y_test_exact[i])
-    return result
-
-def main(args):
-    data = np.load(args.input, allow_pickle=True)
-    n_calib = args.n_calib  # Number of calibration data points
-
-    sva_results = []
-    for trial in tqdm(range(args.num_trials)):
-        # print(f'Running trial {i+1} of {args.num_trials}')
-        sva_results.append(run_trial(data, n_calib, args))
-
-    df_sva = pd.DataFrame(sva_results, columns=['p', 'x', 'y'])
-    output_file = args.output + ('_' + str(datetime.datetime.now().date()) if args.add_date else '') + '.csv'
-    print(f'Saving results to {output_file}')
-    df_sva.to_csv(output_file, index=False)
-    # make bins for p
-    df_sva['p_bin'] = pd.cut(df_sva['p'], bins=10)
-    print(df_sva.groupby('p_bin')['y'].mean())
-
-if __name__ == "__main__":
-    # add code for command line arguments
-    parser = argparse.ArgumentParser()
-    parser.add_argument('--input', type=str, default='/data/ron/protein-conformal/data/conformal_pfam_with_lookup_dataset.npy', help='Input file for the data')
-    # '/data/ron/protein-conformal/data/conformal_pfam_with_lookup_dataset.npy'
-    # parser.add_argument('--percent_sva_test', type=float, default=.1, help='percent of data to use for SVA testing')
-    # parser.add_argument('--alpha', type=float, default=0.1, help='Alpha value for the algorithm')
-    parser.add_argument('--num_trials', type=int, default=100, help='Number of trials to run')
-    parser.add_argument('--n_calib', type=int, default=50, help='Number of calibration data points')
-    # parser.add_argument('--delta', type=float, default=0.5, help='Delta value for the algorithm')
-    parser.add_argument('--output', type=str, default='/data/ron/protein-conformal/data/sva_results', help='Output file for the results')
-    parser.add_argument('--add_date', type=bool, default=True, help='Add date to output file name')
-    args = parser.parse_args()
-    main(args)
results/fdr_thresholds.csv CHANGED
@@ -1 +1,2 @@
-alpha,lambda_threshold,exact_fdr,partial_fdr
+alpha,threshold_mean,threshold_std,threshold_min,threshold_max,empirical_fdr_mean,empirical_fdr_std
+0.1,0.99998005881454,1.7455746588230029e-06,0.9999783761573561,0.9999823510044753,0.08709266350751729,0.011992918257283684
scripts/embed_fasta.sh DELETED
@@ -1,63 +0,0 @@
-#!/bin/bash
-
-# Set the CUDA device
-CUDA_DEVICE=3
-
-# Specify the directory with all fasta files (default is current directory), assuming
-# all protein fasta files are named *_prot.fasta
-INPUT_DIR="${1:-.}"
-
-# Create the "emb" subfolder if it doesn't exist
-EMB_DIR="$INPUT_DIR/emb"
-mkdir -p "$EMB_DIR"
-
-# Loop through all *_prot.fasta files, embed
-for fasta_file in "$INPUT_DIR"/*_prot.fasta; do
-    # Ensure the file exists
-    if [[ -f "$fasta_file" ]]; then
-        # Extract base filename without extension
-        base_name=$(basename "$fasta_file" "_prot.fasta")
-
-        # Set output file name inside emb/ folder
-        output_file="$EMB_DIR/${base_name}_emb.npy"
-
-        # Run embedding command
-        echo "Processing: $fasta_file -> $output_file"
-        cd /home/yangk/proteins/protein-vec/src_run
-        CUDA_VISIBLE_DEVICES=$CUDA_DEVICE python embed_seqs.py --input_file "$fasta_file" --output_file "$output_file"
-    fi
-done
-
-echo "All files processed. Embeddings saved in $EMB_DIR."
-
-#!/bin/bash
-
-# Set the CUDA device
-CUDA_DEVICE=3
-
-# Specify the directory with all fasta files (default is current directory), assuming
-# all protein fasta files are named *_prot.fasta
-INPUT_DIR="${1:-.}"
-
-# Create the "emb" subfolder if it doesn't exist
-EMB_DIR="$INPUT_DIR/emb"
-mkdir -p "$EMB_DIR"
-
-# Loop through all *_prot.fasta files, embed
-for fasta_file in "$INPUT_DIR"/*_prot.fasta; do
-    # Ensure the file exists
-    if [[ -f "$fasta_file" ]]; then
-        # Extract base filename without extension
-        base_name=$(basename "$fasta_file" "_prot.fasta")
-
-        # Set output file name inside emb/ folder
-        output_file="$EMB_DIR/${base_name}_emb.npy"
-
-        # Run embedding command
-        echo "Processing: $fasta_file -> $output_file"
-        cd /home/yangk/proteins/protein-vec/src_run
-        CUDA_VISIBLE_DEVICES=$CUDA_DEVICE python embed_seqs.py --input_file "$fasta_file" --output_file "$output_file"
-    fi
-done
-
-echo "All files processed. Embeddings saved in $EMB_DIR."
scripts/investigate_fdr.py DELETED
@@ -1,106 +0,0 @@
-#!/usr/bin/env python
-"""
-Investigate FDR calibration discrepancy between datasets.
-Checks for data leakage and compares calibration results.
-"""
-
-import numpy as np
-import sys
-sys.path.insert(0, '.')
-from protein_conformal.util import get_sims_labels, get_thresh_FDR
-
-print("=" * 60)
-print("FDR Calibration Dataset Investigation")
-print("=" * 60)
-print()
-
-# Load both calibration datasets
-print("Loading datasets...")
-pfam_new = np.load('data/pfam_new_proteins.npy', allow_pickle=True)
-backup_data = np.load('/groups/doudna/projects/ronb/conformal_backup/protein-conformal/data/conformal_pfam_with_lookup_dataset.npy', allow_pickle=True)
-
-print(f'pfam_new_proteins.npy: {len(pfam_new)} samples')
-print(f'backup dataset: {len(backup_data)} samples')
-print()
-
-# Check for overlap (potential leakage)
-print("Checking for overlap between datasets...")
-# Meta is an array, so convert to tuple for hashing
-pfam_metas = set(tuple(d['meta'].tolist()) if hasattr(d['meta'], 'tolist') else (d['meta'],) for d in pfam_new)
-backup_metas = set(tuple(d['meta'].tolist()) if hasattr(d['meta'], 'tolist') else (d['meta'],) for d in backup_data)
-overlap = pfam_metas & backup_metas
-print(f" Unique query sets in pfam_new: {len(pfam_metas)}")
-print(f" Unique query sets in backup: {len(backup_metas)}")
-print(f" Overlap: {len(overlap)} ({len(overlap)/len(pfam_metas)*100:.1f}% of pfam_new)")
-print()
-
-# Compare similarity distributions
-print("Similarity score distributions:")
-sims_new, labels_new = get_sims_labels(pfam_new[:500], partial=False)
-sims_backup, labels_backup = get_sims_labels(backup_data[:500], partial=False)
-
-print(f" pfam_new (500 samples):")
-print(f" Similarity: min={sims_new.min():.6f}, max={sims_new.max():.6f}, mean={sims_new.mean():.6f}")
-print(f" Labels: {labels_new.sum()}/{labels_new.size} positive ({labels_new.mean()*100:.1f}%)")
-print()
-print(f" backup (500 samples):")
-print(f" Similarity: min={sims_backup.min():.6f}, max={sims_backup.max():.6f}, mean={sims_backup.mean():.6f}")
-print(f" Labels: {labels_backup.sum()}/{labels_backup.size} positive ({labels_backup.mean()*100:.1f}%)")
-print()
-
-# Run FDR calibration on both with same parameters
-print("Running FDR calibration (alpha=0.1, n_calib=1000, 10 trials)...")
-print()
-
-def run_fdr_trials(data, name, n_trials=10, n_calib=1000):
-    lhats = []
-    risks = []
-    tprs = []
-
-    for trial in range(n_trials):
-        np.random.seed(42 + trial)
-        np.random.shuffle(data)
-        cal_data = data[:n_calib]
-        test_data = data[n_calib:n_calib+500]
-
-        X_cal, y_cal = get_sims_labels(cal_data, partial=False)
-        X_test, y_test = get_sims_labels(test_data, partial=False)
-
-        lhat, fdr_cal = get_thresh_FDR(y_cal, X_cal, alpha=0.1, delta=0.5, N=100)
-        lhats.append(lhat)
-
-        # Calculate test risk and TPR
-        preds = (X_test >= lhat).astype(int)
-        tp = np.sum((preds == 1) & (y_test == 1))
-        fp = np.sum((preds == 1) & (y_test == 0))
-        fn = np.sum((preds == 0) & (y_test == 1))
-
-        tpr = tp / (tp + fn) if (tp + fn) > 0 else 0
-        risk = fp / (fp + tp) if (fp + tp) > 0 else 0
-
-        tprs.append(tpr)
-        risks.append(risk)
-
-    print(f"{name}:")
-    print(f" λ (threshold): {np.mean(lhats):.10f} ± {np.std(lhats):.10f}")
-    print(f" Risk (FDR): {np.mean(risks):.4f} ± {np.std(risks):.4f}")
-    print(f" TPR: {np.mean(tprs)*100:.1f}% ± {np.std(tprs)*100:.1f}%")
-    print()
-    return lhats, risks, tprs
-
-lhats_new, risks_new, tprs_new = run_fdr_trials(pfam_new.copy(), "pfam_new_proteins")
-lhats_backup, risks_backup, tprs_backup = run_fdr_trials(backup_data.copy(), "backup_dataset")
-
-print("=" * 60)
-print("CONCLUSION")
-print("=" * 60)
-if abs(np.mean(lhats_new) - np.mean(lhats_backup)) < 0.00001:
-    print("✓ Thresholds are similar - datasets likely compatible")
-else:
-    print("⚠ Thresholds differ significantly!")
-    print(f" Difference: {abs(np.mean(lhats_new) - np.mean(lhats_backup)):.10f}")
-
-if len(overlap) > len(pfam_metas) * 0.5:
-    print("⚠ High overlap between datasets - potential data source")
-else:
-    print("✓ Low overlap - datasets appear independent")
scripts/precompute_fdr_thresholds.sh DELETED
@@ -1,96 +0,0 @@
-#!/bin/bash
-
-# Get the directory where this script is located
-SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
-PROJECT_ROOT="$( cd "$SCRIPT_DIR/.." &> /dev/null && pwd )"
-
-# Parameters
-MIN_ALPHA=0.01
-MAX_ALPHA=1.0
-NUM_ALPHA_VALUES=100
-NUM_TRIALS=100
-N_CALIB=1000
-DELTA=0.5
-OUTPUT_DIR="$PROJECT_ROOT/results"
-TEMP_DIR="$SCRIPT_DIR/temp_fdr_results"
-CSV_OUTPUT="$OUTPUT_DIR/fdr_thresholds.csv"
-
-echo "Script directory: $SCRIPT_DIR"
-echo "Project root: $PROJECT_ROOT"
-echo "Output directory: $OUTPUT_DIR"
-
-
-mkdir -p "$OUTPUT_DIR"
-mkdir -p "$TEMP_DIR"
-
-# Initialize CSV file with header
-echo "alpha,lambda_threshold,exact_fdr,partial_fdr" > "$CSV_OUTPUT"
-
-# Generate alpha values using Python
-ALPHA_VALUES=$(python -c "
-import numpy as np
-alphas = np.linspace($MIN_ALPHA, $MAX_ALPHA, $NUM_ALPHA_VALUES)
-print(' '.join([str(a) for a in alphas]))
-")
-
-# Counter for progress tracking
-counter=0
-total=$NUM_ALPHA_VALUES
-
-# Loop over alpha values
-for alpha in $ALPHA_VALUES; do
-    counter=$((counter + 1))
-
-    # Run FDR generation for exact matches
-    python "$PROJECT_ROOT/pfam/generate_fdr.py" \
-        --alpha "$alpha" \
-        --partial false \
-        --num_trials "$NUM_TRIALS" \
-        --n_calib "$N_CALIB" \
-        --delta "$DELTA" \
-        --output "$TEMP_DIR/fdr_exact_$alpha" \
-        --add_date false
-
-    # Run FDR generation for partial matches
-    echo " Running partial matches..."
-    python "$PROJECT_ROOT/pfam/generate_fdr.py" \
-        --alpha "$alpha" \
-        --partial true \
-        --num_trials "$NUM_TRIALS" \
-        --n_calib "$N_CALIB" \
-        --delta "$DELTA" \
-        --output "$TEMP_DIR/fdr_partial_$alpha" \
-        --add_date false
-
-    # Extract results and append to CSV using Python
-    python -c "
-import numpy as np
-import sys
-
-try:
-    # Load exact match results
-    exact_data = np.load('$TEMP_DIR/fdr_exact_$alpha.npy', allow_pickle=True).item()
-    exact_lhat = np.mean(exact_data['lhats'])
-    exact_fdr = np.mean(exact_data['risks'])
-
-    # Load partial match results
-    partial_data = np.load('$TEMP_DIR/fdr_partial_$alpha.npy', allow_pickle=True).item()
-    partial_fdr = np.mean(partial_data['risks'])
-
-    # Write to CSV
-    with open('$CSV_OUTPUT', 'a') as f:
-        f.write(f'$alpha,{exact_lhat},{exact_fdr},{partial_fdr}\n')
-
-    print(f' Results: lambda={exact_lhat:.6f}, exact_fdr={exact_fdr:.6f}, partial_fdr={partial_fdr:.6f}')
-
-except Exception as e:
-    print(f'Error processing alpha=$alpha: {e}', file=sys.stderr)
-    sys.exit(1)
-"
-done
-
-# Clean up temporary files
-rm -rf "$TEMP_DIR"
-echo "Results saved to: $CSV_OUTPUT"
-echo "Total alpha values processed: $total"
-
scripts/precompute_fnr_thresholds.sh DELETED
@@ -1,106 +0,0 @@
-#!/bin/bash
-
-# Script to precompute FNR thresholds for different alpha values
-# Usage: ./precompute_fnr_thresholds.sh [OPTIONS]
-
-# Default parameters - can be modified as needed
-MIN_ALPHA=0.01
-MAX_ALPHA=1.0
-NUM_ALPHA_VALUES=100
-NUM_TRIALS=100
-N_CALIB=1000
-OUTPUT_DIR="../results"
-TEMP_DIR="./temp_fnr_results"
-CSV_OUTPUT="$OUTPUT_DIR/fnr_thresholds.csv"
-
-mkdir -p "$OUTPUT_DIR"
-mkdir -p "$TEMP_DIR"
-
-# Initialize CSV file with header
-echo "alpha,lambda_threshold,exact_fnr,partial_fnr" > "$CSV_OUTPUT"
-
-echo "Precomputing FNR thresholds..."
-echo "Alpha range: $MIN_ALPHA to $MAX_ALPHA"
-echo "Number of alpha values: $NUM_ALPHA_VALUES"
-echo "Trials per alpha: $NUM_TRIALS"
-echo "Calibration set size: $N_CALIB"
-echo "Output file: $CSV_OUTPUT"
-echo ""
-
-# Generate alpha values using Python
-ALPHA_VALUES=$(python -c "
-import numpy as np
-alphas = np.linspace($MIN_ALPHA, $MAX_ALPHA, $NUM_ALPHA_VALUES)
-print(' '.join([str(a) for a in alphas]))
-")
-
-# Counter for progress tracking
-counter=0
-total=$NUM_ALPHA_VALUES
-
-# Loop over alpha values
-for alpha in $ALPHA_VALUES; do
-    counter=$((counter + 1))
-    echo "Processing alpha=$alpha ($counter/$total)..."
-
-    # Run FNR generation for exact matches
-    echo " Running exact matches..."
-    python ../pfam/generate_fnr.py \
-        --alpha "$alpha" \
-        --partial false \
-        --num_trials "$NUM_TRIALS" \
-        --n_calib "$N_CALIB" \
-        --output "$TEMP_DIR/fnr_exact_$alpha" \
-        --add_date false
-
-    # Run FNR generation for partial matches
-    echo " Running partial matches..."
-    python ../pfam/generate_fnr.py \
-        --alpha "$alpha" \
-        --partial true \
-        --num_trials "$NUM_TRIALS" \
-        --n_calib "$N_CALIB" \
-        --output "$TEMP_DIR/fnr_partial_$alpha" \
-        --add_date false
-
-    # Extract results and append to CSV using Python
-    python -c "
-import numpy as np
-import sys
-
-try:
-    # Load exact match results
-    exact_data = np.load('$TEMP_DIR/fnr_exact_$alpha.npy', allow_pickle=True).item()
-    exact_lhat = np.mean(exact_data['lhats'])
-    exact_fnr = np.mean(exact_data['fnrs'])
-
-    # Load partial match results
-    partial_data = np.load('$TEMP_DIR/fnr_partial_$alpha.npy', allow_pickle=True).item()
-    partial_fnr = np.mean(partial_data['fnrs'])
-
-    # Write to CSV
-    with open('$CSV_OUTPUT', 'a') as f:
-        f.write(f'$alpha,{exact_lhat},{exact_fnr},{partial_fnr}\n')
-
-    print(f' Results: lambda={exact_lhat:.6f}, exact_fnr={exact_fnr:.6f}, partial_fnr={partial_fnr:.6f}')
-
-except Exception as e:
-    print(f'Error processing alpha=$alpha: {e}', file=sys.stderr)
-    sys.exit(1)
-"
-
-    if [ $? -ne 0 ]; then
-        echo "Error processing alpha=$alpha" >&2
-        exit 1
-    fi
-done
-
-# Clean up temporary files
-echo ""
-echo "Cleaning up temporary files..."
-rm -rf "$TEMP_DIR"
-
-echo "FNR threshold precomputation completed!"
-echo "Results saved to: $CSV_OUTPUT"
-echo "Total alpha values processed: $total"
-
scripts/quick_fdr_check.py DELETED
@@ -1,58 +0,0 @@
-#!/usr/bin/env python
-"""Quick FDR calibration check - compare the two datasets."""
-import numpy as np
-import sys
-sys.path.insert(0, '.')
-from protein_conformal.util import get_sims_labels, get_thresh_FDR
-
-print("Quick FDR Calibration Check")
-print("=" * 50)
-
-# Load both datasets
-pfam_new = np.load('data/pfam_new_proteins.npy', allow_pickle=True)
-backup = np.load('/groups/doudna/projects/ronb/conformal_backup/protein-conformal/data/conformal_pfam_with_lookup_dataset.npy', allow_pickle=True)
-
-print(f"pfam_new: {len(pfam_new)} samples")
-print(f"backup: {len(backup)} samples")
-print()
-
-# Compare similarity and label distributions
-sims1, labels1 = get_sims_labels(pfam_new[:500], partial=False)
-sims2, labels2 = get_sims_labels(backup[:500], partial=False)
-
-print("Stats from first 500 samples:")
-print(f" pfam_new - positives: {labels1.sum()}/{labels1.size} ({100*labels1.mean():.2f}%)")
-print(f" backup - positives: {labels2.sum()}/{labels2.size} ({100*labels2.mean():.2f}%)")
-print()
-
-# Run a single FDR calibration on each
-print("Single FDR trial (n_calib=1000, alpha=0.1):")
-np.random.seed(42)
-
-# pfam_new
-np.random.shuffle(pfam_new)
-X1, y1 = get_sims_labels(pfam_new[:1000], partial=False)
-lhat1, _ = get_thresh_FDR(y1, X1, alpha=0.1, delta=0.5, N=100)
-
-# backup
-np.random.shuffle(backup)
-X2, y2 = get_sims_labels(backup[:1000], partial=False)
-lhat2, _ = get_thresh_FDR(y2, X2, alpha=0.1, delta=0.5, N=100)
-
-print(f" pfam_new λ: {lhat1:.10f}")
-print(f" backup λ: {lhat2:.10f}")
-print(f" Paper λ: 0.9999802250 (from pfam_fdr_2024-06-25.npy)")
-print()
-
-# Which is closer to paper?
-diff1 = abs(lhat1 - 0.9999802250)
-diff2 = abs(lhat2 - 0.9999802250)
-print(f"Difference from paper threshold:")
-print(f" pfam_new: {diff1:.10f}")
-print(f" backup: {diff2:.10f}")
-print()
-
-if diff1 < diff2:
-    print("→ pfam_new is closer to paper threshold")
-else:
-    print("→ backup is closer to paper threshold")
scripts/search_and_assign_probs.sh DELETED
@@ -1,123 +0,0 @@
1
- #!/bin/bash
2
-
3
- # Define the lookup directory
4
- LOOKUP_DIR="/home/yangk/proteins/conformal_protein_search/data"
5
-
6
- # Navigate to the conformal-protein-retrieval directory
7
- cd /home/yangk/proteins/conformal_protein_search/conformal-protein-retrieval || exit
8
-
9
- # Specify the directory containing FASTA and embedding files
10
- INPUT_DIR="${1:-.}"
11
-
12
- # Create an output directory for the results
13
- OUTPUT_DIR="$INPUT_DIR/output"
14
- mkdir -p "$OUTPUT_DIR"
15
-
16
- # Precomputed probabilities file for get_probs.py
17
- PRECOMPUTED_PROBS="$LOOKUP_DIR/pfam_new_proteins_SVA_probs_1000_bins_200_calibration_pts.csv"
18
-
19
- # Loop through all *_prot.fasta files
20
- for fasta_file in "$INPUT_DIR"/*_prot.fasta; do
21
- if [[ -f "$fasta_file" ]]; then
22
- # Extract base name
23
- base_name=$(basename "$fasta_file" "_prot.fasta")
24
-
25
- # Corresponding embedding file
26
- emb_file="$INPUT_DIR/emb/${base_name}_emb.npy"
27
-
28
- # Ensure embedding file exists
29
- if [[ -f "$emb_file" ]]; then
30
- # Output CSV file for search.py results
31
- search_output_csv="$OUTPUT_DIR/partial_pfam_${base_name}_hits.csv"
32
-
33
- # Run the search script, with pfam partial FDR control as filtering threshold
34
- echo "Running search for: $fasta_file & $emb_file -> $search_output_csv"
35
- python scripts/search.py --fdr --fdr_lambda 0.9999642502418673 \
36
- --output "$search_output_csv" \
37
- --query_embedding "$emb_file" \
38
- --query_fasta "$fasta_file" \
39
- --lookup_embedding "$LOOKUP_DIR/lookup_embeddings.npy" \
40
- --lookup_fasta "$LOOKUP_DIR/lookup_embeddings_meta_data.fasta"
41
-
42
- # Ensure search output file exists before proceeding
43
- if [[ -f "$search_output_csv" ]]; then
44
- # Output CSV file for get_probs.py results
45
- probs_output_csv="$OUTPUT_DIR/partial_pfam_${base_name}_hits_with_probs.csv"
46
-
47
- # Run get_probs.py
48
- echo "Running get_probs for: $search_output_csv -> $probs_output_csv"
49
- python scripts/get_probs.py --precomputed --precomputed_path "$PRECOMPUTED_PROBS" \
50
- --input "$search_output_csv" --output "$probs_output_csv" --partial
51
- else
52
- echo "Warning: Search output file $search_output_csv not found. Skipping probability computation."
53
- fi
54
- else
55
- echo "Warning: No matching embedding found for $fasta_file. Skipping..."
56
- fi
57
- fi
58
- done
59
-
60
- echo "Pipeline completed. Results saved in $OUTPUT_DIR."
61
-
62
-
63
- #!/bin/bash
64
-
65
- # Define the lookup directory
66
- LOOKUP_DIR="/home/yangk/proteins/conformal_protein_search/data"
67
-
68
- # Navigate to the conformal-protein-retrieval directory
69
- cd /home/yangk/proteins/conformal_protein_search/conformal-protein-retrieval || exit
-
- # Specify the directory containing FASTA and embedding files
- INPUT_DIR="${1:-.}"
-
- # Create an output directory for the results
- OUTPUT_DIR="$INPUT_DIR/output"
- mkdir -p "$OUTPUT_DIR"
-
- # Precomputed probabilities file for get_probs.py
- PRECOMPUTED_PROBS="$LOOKUP_DIR/pfam_new_proteins_SVA_probs_1000_bins_200_calibration_pts.csv"
-
- # Loop through all *_prot.fasta files
- for fasta_file in "$INPUT_DIR"/*_prot.fasta; do
-     if [[ -f "$fasta_file" ]]; then
-         # Extract base name
-         base_name=$(basename "$fasta_file" "_prot.fasta")
-
-         # Corresponding embedding file
-         emb_file="$INPUT_DIR/emb/${base_name}_emb.npy"
-
-         # Ensure embedding file exists
-         if [[ -f "$emb_file" ]]; then
-             # Output CSV file for search.py results
-             search_output_csv="$OUTPUT_DIR/partial_pfam_${base_name}_hits.csv"
-
-             # Run the search script, with pfam partial FDR control as filtering threshold
-             echo "Running search for: $fasta_file & $emb_file -> $search_output_csv"
-             python scripts/search.py --fdr --fdr_lambda 0.9999642502418673 \
-                 --output "$search_output_csv" \
-                 --query_embedding "$emb_file" \
-                 --query_fasta "$fasta_file" \
-                 --lookup_embedding "$LOOKUP_DIR/lookup_embeddings.npy" \
-                 --lookup_fasta "$LOOKUP_DIR/lookup_embeddings_meta_data.fasta"
-
-             # Ensure search output file exists before proceeding
-             if [[ -f "$search_output_csv" ]]; then
-                 # Output CSV file for get_probs.py results
-                 probs_output_csv="$OUTPUT_DIR/partial_pfam_${base_name}_hits_with_probs.csv"
-
-                 # Run get_probs.py
-                 echo "Running get_probs for: $search_output_csv -> $probs_output_csv"
-                 python scripts/get_probs.py --precomputed --precomputed_path "$PRECOMPUTED_PROBS" \
-                     --input "$search_output_csv" --output "$probs_output_csv" --partial
-             else
-                 echo "Warning: Search output file $search_output_csv not found. Skipping probability computation."
-             fi
-         else
-             echo "Warning: No matching embedding found for $fasta_file. Skipping..."
-         fi
-     fi
- done
-
- echo "Pipeline completed. Results saved in $OUTPUT_DIR."
-
scripts/slurm_calibrate_fdr.sh DELETED
@@ -1,59 +0,0 @@
- #!/bin/bash
- #SBATCH --job-name=cpr-calibrate-fdr
- #SBATCH --output=logs/cpr-calibrate-fdr-%j.out
- #SBATCH --error=logs/cpr-calibrate-fdr-%j.err
- #SBATCH --time=2:00:00
- #SBATCH --mem=32G
- #SBATCH --cpus-per-task=4
-
- # CPR FDR Calibration - Reproduces paper threshold computation
- # Usage: sbatch scripts/slurm_calibrate_fdr.sh
-
- set -e
- mkdir -p logs results
-
- source ~/.bashrc
- eval "$(conda shell.bash hook)"
- conda activate conformal-s
-
- echo "========================================"
- echo "CPR FDR Calibration"
- echo "Date: $(date)"
- echo "Node: $(hostname)"
- echo "========================================"
-
- # IMPORTANT: Use pfam_new_proteins.npy - the CORRECT calibration dataset
- # The backup dataset (conformal_pfam_with_lookup_dataset.npy) has DATA LEAKAGE:
- # - First 50 samples all have same Pfam family "PF01266;" repeated
- # - Positive rate is 3.00% vs 0.22% in correct dataset
- # - Results in different FDR threshold (~0.999965 vs paper's ~0.999980)
- # See: scripts/quick_fdr_check.py for verification
- CALIB_DATA="data/pfam_new_proteins.npy"
-
- # Check if data exists
- if [ ! -f "$CALIB_DATA" ]; then
-     echo "ERROR: Calibration data not found at $CALIB_DATA"
-     echo "Download from Zenodo: https://zenodo.org/records/14272215"
-     exit 1
- fi
-
- echo "Using calibration data: $CALIB_DATA"
- echo "NOTE: Using pfam_new_proteins.npy (correct dataset without leakage)"
- echo ""
-
- # Run calibration using the ORIGINAL generate_fdr.py script (LTT method)
- # This matches what was used to generate the paper's threshold
- python scripts/pfam/generate_fdr.py \
-     --data_path "$CALIB_DATA" \
-     --alpha 0.1 \
-     --num_trials 100 \
-     --n_calib 1000 \
-     --delta 0.5 \
-     --output results/pfam_fdr \
-     --add_date True
-
- echo ""
- echo "========================================"
- echo "Expected result: mean lhat ≈ 0.999980"
- echo "Completed: $(date)"
- echo "========================================"
scripts/slurm_calibrate_fnr.sh DELETED
@@ -1,47 +0,0 @@
- #!/bin/bash
- #SBATCH --job-name=cpr-calibrate-fnr
- #SBATCH --output=logs/cpr-calibrate-fnr-%j.out
- #SBATCH --error=logs/cpr-calibrate-fnr-%j.err
- #SBATCH --time=2:00:00
- #SBATCH --mem=32G
- #SBATCH --cpus-per-task=4
-
- # CPR FNR Calibration - Computes FNR thresholds
- # Usage: sbatch scripts/slurm_calibrate_fnr.sh
-
- set -e
- mkdir -p logs results
-
- source ~/.bashrc
- eval "$(conda shell.bash hook)"
- conda activate conformal-s
-
- echo "========================================"
- echo "CPR FNR Calibration"
- echo "Date: $(date)"
- echo "Node: $(hostname)"
- echo "========================================"
-
- # Use the ORIGINAL calibration dataset from backup
- CALIB_DATA="/groups/doudna/projects/ronb/conformal_backup/protein-conformal/data/conformal_pfam_with_lookup_dataset.npy"
-
- if [ ! -f "$CALIB_DATA" ]; then
-     echo "ERROR: Calibration data not found at $CALIB_DATA"
-     exit 1
- fi
-
- echo "Using calibration data: $CALIB_DATA"
- echo ""
-
- python scripts/pfam/generate_fnr.py \
-     --data_path "$CALIB_DATA" \
-     --alpha 0.1 \
-     --num_trials 100 \
-     --n_calib 1000 \
-     --output results/pfam_fnr \
-     --add_date True
-
- echo ""
- echo "========================================"
- echo "Completed: $(date)"
- echo "========================================"
scripts/slurm_compute_fdr_all.sh DELETED
@@ -1,65 +0,0 @@
- #!/bin/bash
- #SBATCH --job-name=fdr-all
- #SBATCH --partition=standard
- #SBATCH --nodes=1
- #SBATCH --ntasks=1
- #SBATCH --cpus-per-task=4
- #SBATCH --mem=32G
- #SBATCH --time=24:00:00
- #SBATCH --output=logs/fdr_all_%j.out
- #SBATCH --error=logs/fdr_all_%j.err
-
- # Compute FDR thresholds at all alpha levels (both exact and partial matches)
- # This uses the FIXED compute_fdr_table.py with correct argument order
-
- set -e
-
- echo "=== FDR Threshold Computation (All Alpha Levels) ==="
- echo "Job ID: $SLURM_JOB_ID"
- echo "Started: $(date)"
- echo ""
-
- # Setup environment
- eval "$(/shared/software/miniconda3/latest/bin/conda shell.bash hook)"
- conda activate conformal-s
-
- cd /groups/doudna/projects/ronb/conformal-protein-retrieval
-
- # Calibration data
- CALIB_DATA="/groups/doudna/projects/ronb/conformal_backup/protein-conformal/data/pfam_new_proteins.npy"
-
- # Alpha levels to compute
- ALPHA_LEVELS="0.001,0.005,0.01,0.02,0.05,0.1,0.15,0.2"
-
- echo "Calibration data: $CALIB_DATA"
- echo "Alpha levels: $ALPHA_LEVELS"
- echo ""
-
- # Exact match FDR thresholds
- echo "=== Computing EXACT match FDR thresholds ==="
- python scripts/compute_fdr_table.py \
-     --calibration "$CALIB_DATA" \
-     --output results/fdr_thresholds.csv \
-     --n-trials 100 \
-     --n-calib 1000 \
-     --seed 42 \
-     --alpha-levels "$ALPHA_LEVELS"
-
- echo ""
- echo "=== Computing PARTIAL match FDR thresholds ==="
- python scripts/compute_fdr_table.py \
-     --calibration "$CALIB_DATA" \
-     --output results/fdr_thresholds_partial.csv \
-     --n-trials 100 \
-     --n-calib 1000 \
-     --seed 42 \
-     --alpha-levels "$ALPHA_LEVELS" \
-     --partial
-
- echo ""
- echo "=== FDR Computation Complete ==="
- echo "Results:"
- echo "  - results/fdr_thresholds.csv (exact match)"
- echo "  - results/fdr_thresholds_partial.csv (partial match)"
- echo ""
- echo "Finished: $(date)"
scripts/slurm_compute_fnr_partial.sh DELETED
@@ -1,38 +0,0 @@
- #!/bin/bash
- #SBATCH --job-name=fnr-partial
- #SBATCH --partition=standard
- #SBATCH --nodes=1
- #SBATCH --ntasks=1
- #SBATCH --cpus-per-task=4
- #SBATCH --mem=32G
- #SBATCH --time=12:00:00
- #SBATCH --output=/groups/doudna/projects/ronb/conformal-protein-retrieval/logs/fnr_partial_%j.log
- #SBATCH --error=/groups/doudna/projects/ronb/conformal-protein-retrieval/logs/fnr_partial_%j.err
-
- # Compute FNR thresholds for PARTIAL matches only
- # Use this if the main FNR job timed out before completing partial match computation
-
- eval "$(/shared/software/miniconda3/latest/bin/conda shell.bash hook)"
- conda activate conformal-s
-
- cd /groups/doudna/projects/ronb/conformal-protein-retrieval
-
- echo "============================================"
- echo "Computing FNR Thresholds (Partial Match Only)"
- echo "============================================"
- echo "Start time: $(date)"
- echo "Node: $(hostname)"
- echo ""
-
- python scripts/compute_fnr_table.py \
-     --calibration data/pfam_new_proteins.npy \
-     --output results/fnr_thresholds_partial.csv \
-     --n-trials 100 \
-     --n-calib 1000 \
-     --seed 42 \
-     --partial
-
- echo ""
- echo "============================================"
- echo "Completed: $(date)"
- echo "============================================"
scripts/slurm_investigate.sh DELETED
@@ -1,36 +0,0 @@
- #!/bin/bash
- #SBATCH --job-name=cpr-investigate
- #SBATCH --output=logs/cpr-investigate-%j.out
- #SBATCH --error=logs/cpr-investigate-%j.err
- #SBATCH --time=1:00:00
- #SBATCH --mem=32G
- #SBATCH --cpus-per-task=4
-
- # CPR Investigation - FDR calibration and precomputed probability verification
- set -e
- mkdir -p logs data
-
- source ~/.bashrc
- eval "$(conda shell.bash hook)"
- conda activate conformal-s
-
- cd /groups/doudna/projects/ronb/conformal-protein-retrieval
-
- echo "========================================"
- echo "CPR Investigation"
- echo "Date: $(date)"
- echo "Node: $(hostname)"
- echo "========================================"
- echo ""
-
- echo "=== Part 1: FDR Calibration Investigation ==="
- python scripts/investigate_fdr.py
- echo ""
-
- echo "=== Part 2: Precomputed Probability Verification ==="
- python scripts/test_precomputed_probs.py
- echo ""
-
- echo "========================================"
- echo "Completed: $(date)"
- echo "========================================"
scripts/slurm_test_clean_embed.sh DELETED
@@ -1,108 +0,0 @@
- #!/bin/bash
- #SBATCH --job-name=test_clean_embed
- #SBATCH --partition=gpu
- #SBATCH --nodes=1
- #SBATCH --ntasks=1
- #SBATCH --cpus-per-task=4
- #SBATCH --gres=gpu:1
- #SBATCH --time=01:00:00
- #SBATCH --output=logs/test_clean_embed_%j.out
- #SBATCH --error=logs/test_clean_embed_%j.err
-
- # Test CLEAN embedding with the CPR CLI
- # This script:
- # 1. Runs CLI tests
- # 2. Tests CLEAN embedding on a small FASTA file
-
- set -e
-
- echo "=== CPR CLEAN Embedding Test ==="
- echo "Date: $(date)"
- echo "Node: $(hostname)"
- echo "Job ID: $SLURM_JOB_ID"
-
- # Create logs directory if it doesn't exist
- mkdir -p logs
-
- # Activate conda environment
- eval "$(/shared/software/miniconda3/latest/bin/conda shell.bash hook)"
- conda activate conformal-s
-
- # Print environment info
- echo ""
- echo "=== Environment Info ==="
- which python
- python --version
- python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"
- python -c "import faiss; print(f'FAISS: {faiss.__version__}')"
-
- # Change to repo directory
- cd /groups/doudna/projects/ronb/conformal-protein-retrieval
-
- # 1. Run CLI tests
- echo ""
- echo "=== Running CLI Tests ==="
- python -m pytest tests/test_cli.py -v --tb=short 2>&1 || echo "Note: Some tests may fail if dependencies are missing"
-
- # 2. Create a small test FASTA file
- echo ""
- echo "=== Creating Test FASTA ==="
- TEST_DIR="test_clean_output"
- mkdir -p "$TEST_DIR"
-
- cat > "$TEST_DIR/test_sequences.fasta" << 'EOF'
- >seq1_test_enzyme
- MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTLTYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITLGMDELYK
- >seq2_test_enzyme
- MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
- >seq3_test_enzyme
- MTSKGECFVTVTYKNLFPPEQWSPKQYLFHNASDKGFVPTHICTHGCLSPKQLQEFDLVNQADQEGWSGDYTCQCNCTQQALCGFPVFLGCEACTFTPDCHGECVCKFPFGEYFVCDCDGSPDCG
- EOF
-
- echo "Created test FASTA with 3 sequences"
-
- # 3. Test CLEAN embedding (requires GPU)
- echo ""
- echo "=== Testing CLEAN Embedding ==="
- echo "Checking CLEAN installation..."
- python -c "from CLEAN.model import LayerNormNet; print('CLEAN model import OK')" 2>&1 || {
-     echo "CLEAN not installed, installing..."
-     cd CLEAN_repo/app
-     python build.py install
-     cd ../..
- }
-
- echo ""
- echo "Running cpr embed with CLEAN model..."
- time python -m protein_conformal.cli embed \
-     --input "$TEST_DIR/test_sequences.fasta" \
-     --output "$TEST_DIR/test_clean_embeddings.npy" \
-     --model clean
-
- # 4. Verify output
- echo ""
- echo "=== Verifying Output ==="
- if [ -f "$TEST_DIR/test_clean_embeddings.npy" ]; then
-     python -c "
- import numpy as np
- emb = np.load('$TEST_DIR/test_clean_embeddings.npy')
- print(f'Embeddings shape: {emb.shape}')
- print(f'Expected: (3, 128)')
- assert emb.shape == (3, 128), f'Shape mismatch: expected (3, 128), got {emb.shape}'
- print('SUCCESS: CLEAN embedding test passed!')
- "
- else
-     echo "ERROR: Output file not created"
-     exit 1
- fi
-
- # 5. Optional: Compare with reference (if exists)
- echo ""
- echo "=== Test Complete ==="
- echo "Output saved to: $TEST_DIR/test_clean_embeddings.npy"
- echo ""
-
- # Cleanup (optional - uncomment to remove test files)
- # rm -rf "$TEST_DIR"
-
- echo "Done at $(date)"
scripts/slurm_verify.sh DELETED
@@ -1,97 +0,0 @@
- #!/bin/bash
- #SBATCH --job-name=cpr-verify
- #SBATCH --output=logs/cpr-verify-%j.out
- #SBATCH --error=logs/cpr-verify-%j.err
- #SBATCH --time=1:00:00
- #SBATCH --mem=32G
- #SBATCH --cpus-per-task=4
-
- # CPR Verification - Reproduces paper results
- # Usage:
- #   sbatch scripts/slurm_verify.sh syn30   # Verify Syn3.0 (Figure 2A)
- #   sbatch scripts/slurm_verify.sh fdr     # Verify FDR algorithm
- #   sbatch scripts/slurm_verify.sh dali    # Verify DALI prefiltering (Tables 4-6)
- #   sbatch scripts/slurm_verify.sh clean   # Verify CLEAN enzyme (Tables 1-2)
- #   sbatch scripts/slurm_verify.sh probs   # Verify precomputed probability lookup
- #   sbatch scripts/slurm_verify.sh all     # Run all verifications
-
- set -e
- mkdir -p logs results
-
- source ~/.bashrc
- eval "$(conda shell.bash hook)"
- conda activate conformal-s
-
- CHECK="${1:-all}"
-
- echo "========================================"
- echo "CPR Verification"
- echo "Date: $(date)"
- echo "Node: $(hostname)"
- echo "Check: $CHECK"
- echo "========================================"
- echo ""
-
- run_syn30() {
-     echo "--- Syn3.0 Verification (Paper Figure 2A) ---"
-     python scripts/verify_syn30.py
-     echo ""
- }
-
- run_fdr() {
-     echo "--- FDR Algorithm Verification ---"
-     python scripts/verify_fdr_algorithm.py
-     echo ""
- }
-
- run_dali() {
-     echo "--- DALI Prefiltering Verification (Tables 4-6) ---"
-     python scripts/verify_dali.py
-     echo ""
- }
-
- run_clean() {
-     echo "--- CLEAN Enzyme Classification Verification (Tables 1-2) ---"
-     python scripts/verify_clean.py
-     echo ""
- }
-
- run_probs() {
-     echo "--- Precomputed Probability Lookup Verification ---"
-     python scripts/test_precomputed_probs.py
-     echo ""
- }
-
- case "$CHECK" in
-     syn30)
-         run_syn30
-         ;;
-     fdr)
-         run_fdr
-         ;;
-     dali)
-         run_dali
-         ;;
-     clean)
-         run_clean
-         ;;
-     probs)
-         run_probs
-         ;;
-     all)
-         run_syn30
-         run_fdr
-         run_dali
-         run_clean
-         run_probs
-         ;;
-     *)
-         echo "Unknown check: $CHECK"
-         echo "Available: syn30, fdr, dali, clean, probs, all"
-         exit 1
-         ;;
- esac
-
- echo "========================================"
- echo "Completed: $(date)"
- echo "========================================"
scripts/slurm_verify_probs.sh DELETED
@@ -1,34 +0,0 @@
- #!/bin/bash
- #SBATCH --job-name=cpr-verify-probs
- #SBATCH --output=logs/cpr-verify-probs-%j.out
- #SBATCH --error=logs/cpr-verify-probs-%j.err
- #SBATCH --time=1:00:00
- #SBATCH --mem=16G
- #SBATCH --cpus-per-task=2
-
- # CPR Precomputed Probability Verification
- # Verifies that the sim->prob lookup table matches direct Venn-Abers computation
- # Usage: sbatch scripts/slurm_verify_probs.sh
-
- set -e
- mkdir -p logs data
-
- echo "========================================"
- echo "CPR Probability Lookup Verification"
- echo "Date: $(date)"
- echo "Node: $(hostname)"
- echo "========================================"
- echo ""
-
- # Activate conda environment
- source ~/.bashrc
- eval "$(conda shell.bash hook)"
- conda activate conformal-s
-
- # Run the verification test
- python scripts/test_precomputed_probs.py
-
- echo ""
- echo "========================================"
- echo "Completed: $(date)"
- echo "========================================"
scripts/submit_fdr_parallel.sh DELETED
@@ -1,47 +0,0 @@
- #!/bin/bash
- # Submit FDR threshold jobs in parallel - one per alpha level
-
- ALPHAS="0.001 0.005 0.01 0.02 0.05 0.1 0.15 0.2"
-
- for alpha in $ALPHAS; do
-     # Exact match
-     sbatch --job-name="fdr-exact-${alpha}" \
-         --partition=standard \
-         --nodes=1 --ntasks=1 --cpus-per-task=4 --mem=32G \
-         --time=08:00:00 \
-         --output=/groups/doudna/projects/ronb/conformal-protein-retrieval/logs/fdr_exact_${alpha}_%j.log \
-         --wrap="
-             eval \"\$(/shared/software/miniconda3/latest/bin/conda shell.bash hook)\"
-             conda activate conformal-s
-             cd /groups/doudna/projects/ronb/conformal-protein-retrieval
-             python scripts/compute_fdr_table.py \
-                 --calibration data/pfam_new_proteins.npy \
-                 --output results/fdr_exact_alpha_${alpha}.csv \
-                 --n-trials 100 \
-                 --n-calib 1000 \
-                 --seed 42 \
-                 --alpha-levels ${alpha}
-         "
-
-     # Partial match
-     sbatch --job-name="fdr-partial-${alpha}" \
-         --partition=standard \
-         --nodes=1 --ntasks=1 --cpus-per-task=4 --mem=32G \
-         --time=08:00:00 \
-         --output=/groups/doudna/projects/ronb/conformal-protein-retrieval/logs/fdr_partial_${alpha}_%j.log \
-         --wrap="
-             eval \"\$(/shared/software/miniconda3/latest/bin/conda shell.bash hook)\"
-             conda activate conformal-s
-             cd /groups/doudna/projects/ronb/conformal-protein-retrieval
-             python scripts/compute_fdr_table.py \
-                 --calibration data/pfam_new_proteins.npy \
-                 --output results/fdr_partial_alpha_${alpha}.csv \
-                 --n-trials 100 \
-                 --n-calib 1000 \
-                 --seed 42 \
-                 --alpha-levels ${alpha} \
-                 --partial
-         "
- done
-
- echo "Submitted 16 FDR jobs (8 alphas × 2 match types)"