Spaces:
Sleeping
fix: use correct calibration dataset (pfam_new_proteins.npy)
Browse filesCRITICAL: The backup dataset (conformal_pfam_with_lookup_dataset.npy)
has DATA LEAKAGE - first 50 samples all have same Pfam family "PF01266;"
repeated, resulting in 3.00% positive rate vs 0.22% in correct dataset.
This caused incorrect FDR threshold (~0.999965 vs paper's ~0.999980).
Changes:
- slurm_calibrate_fdr.sh: Use data/pfam_new_proteins.npy
- precompute_SVA_probs.py: Fix default paths to use local data/
- slurm_verify.sh: Add 'probs' verification option
- quick_fdr_check.py: NEW script to compare datasets
- slurm_verify_probs.sh: NEW SLURM script for probability verification
- DEVELOPMENT.md: Document data leakage warning
- CLAUDE.md: Update dev log with investigation findings
Verification results (all match paper):
- Syn3.0: 39.6% (59/149) exact match
- FDR threshold: 0.9999820199 (~0.002% diff from paper)
- DALI: 81.8% TPR, 31.5% reduction
- CLEAN: mean loss 0.97 ≤ α=1.0
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- CLAUDE.md +49 -0
- DEVELOPMENT.md +15 -1
- scripts/precompute_SVA_probs.py +3 -2
- scripts/quick_fdr_check.py +58 -0
- scripts/slurm_calibrate_fdr.sh +9 -2
- scripts/slurm_verify.sh +12 -1
- scripts/slurm_verify_probs.sh +34 -0
|
@@ -123,6 +123,55 @@
|
|
| 123 |
|
| 124 |
---
|
| 125 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 126 |
### Session Notes Template
|
| 127 |
|
| 128 |
```
|
|
|
|
| 123 |
|
| 124 |
---
|
| 125 |
|
| 126 |
+
### 2026-02-02 ~16:00 PST - FDR Data Investigation & Verification Scripts
|
| 127 |
+
|
| 128 |
+
**Completed:**
|
| 129 |
+
- [x] Created DALI verification script (`scripts/verify_dali.py`)
|
| 130 |
+
- Result: 81.8% TPR, 31.5% DB reduction ✓ (paper: 82.8% TPR)
|
| 131 |
+
- [x] Created CLEAN verification script (`scripts/verify_clean.py`)
|
| 132 |
+
- Result: mean loss 0.97 ≤ α=1.0 ✓
|
| 133 |
+
- [x] Added multi-model embedding support to CLI (`--model protein-vec|clean`)
|
| 134 |
+
- [x] Created Dockerfile and apptainer.def for containerization
|
| 135 |
+
- [x] **CRITICAL**: Investigated FDR calibration data discrepancy
|
| 136 |
+
- [x] Created `scripts/quick_fdr_check.py` for dataset comparison
|
| 137 |
+
- [x] Fixed `slurm_calibrate_fdr.sh` to use correct dataset
|
| 138 |
+
|
| 139 |
+
**Key Finding - Data Leakage in Backup Dataset:**
|
| 140 |
+
|
| 141 |
+
| Dataset | Samples | Positive Rate | FDR Threshold |
|
| 142 |
+
|---------|---------|---------------|---------------|
|
| 143 |
+
| `pfam_new_proteins.npy` (CORRECT) | 1,864 | 0.22% | 0.9999820199 |
|
| 144 |
+
| `conformal_pfam_with_lookup_dataset.npy` (LEAKY) | 10,000 | 3.00% | 0.9999644648 |
|
| 145 |
+
| Paper reported | — | — | 0.9999802250 |
|
| 146 |
+
|
| 147 |
+
The backup dataset has **data leakage**: first 50 samples all have "PF01266;" (same Pfam family).
|
| 148 |
+
The correct dataset (`pfam_new_proteins.npy`) has diverse families and matches paper threshold.
|
| 149 |
+
|
| 150 |
+
**Files Changed:**
|
| 151 |
+
- `scripts/slurm_calibrate_fdr.sh` - Fixed to use correct dataset
|
| 152 |
+
- `DEVELOPMENT.md` - Added data leakage warning
|
| 153 |
+
- `scripts/verify_dali.py` - NEW: DALI verification
|
| 154 |
+
- `scripts/verify_clean.py` - NEW: CLEAN verification
|
| 155 |
+
- `scripts/quick_fdr_check.py` - NEW: Dataset comparison
|
| 156 |
+
- `Dockerfile`, `apptainer.def` - NEW: Container definitions
|
| 157 |
+
|
| 158 |
+
**Verification Summary:**
|
| 159 |
+
| Claim | Paper | Reproduced | Status |
|
| 160 |
+
|-------|-------|------------|--------|
|
| 161 |
+
| Syn3.0 annotation | 39.6% (59/149) | 39.6% (59/149) | ✓ EXACT |
|
| 162 |
+
| FDR threshold | 0.9999802250 | 0.9999820199 | ✓ (~0.002% diff) |
|
| 163 |
+
| DALI TPR | 82.8% | 81.8% | ✓ (~1% diff) |
|
| 164 |
+
| DALI reduction | 31.5% | 31.5% | ✓ EXACT |
|
| 165 |
+
| CLEAN loss | ≤ α=1.0 | 0.97 | ✓ |
|
| 166 |
+
|
| 167 |
+
**Next Steps:**
|
| 168 |
+
1. Test precomputed probability lookup CSV is reproducible
|
| 169 |
+
2. Add `cpr prob --precomputed` for fast probability with model-specific calibration
|
| 170 |
+
3. Build and test Docker/Apptainer images
|
| 171 |
+
4. Integrate full CLEAN model verification (requires CLEAN package)
|
| 172 |
+
|
| 173 |
+
---
|
| 174 |
+
|
| 175 |
### Session Notes Template
|
| 176 |
|
| 177 |
```
|
|
@@ -110,7 +110,21 @@ cpr gui --port 7860
|
|
| 110 |
|
| 111 |
| File | Size | Purpose |
|
| 112 |
|------|------|---------|
|
| 113 |
-
| `pfam_new_proteins.npy` | 2.5 GB |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 114 |
| `lookup_embeddings.npy` | 1.1 GB | UniProt protein embeddings (lookup database) |
|
| 115 |
| `lookup_embeddings_meta_data.tsv` | 560 MB | Metadata for lookup proteins |
|
| 116 |
| `afdb_embeddings_protein_vec.npy` | 4.7 GB | AlphaFold DB embeddings |
|
|
|
|
| 110 |
|
| 111 |
| File | Size | Purpose |
|
| 112 |
|------|------|---------|
|
| 113 |
+
| `pfam_new_proteins.npy` | 2.5 GB | **CORRECT** calibration data for FDR/FNR control |
|
| 114 |
+
|
| 115 |
+
#### ⚠️ Data Leakage Warning
|
| 116 |
+
|
| 117 |
+
**DO NOT USE** `conformal_pfam_with_lookup_dataset.npy` from the backup directory. This dataset has **data leakage**:
|
| 118 |
+
- First 50 samples all have the same Pfam family "PF01266;" repeated
|
| 119 |
+
- Positive rate is 3.00% (vs 0.22% in correct dataset)
|
| 120 |
+
- Produces incorrect FDR threshold (~0.999965 vs paper's ~0.999980)
|
| 121 |
+
|
| 122 |
+
The correct dataset is `pfam_new_proteins.npy` with:
|
| 123 |
+
- 1,864 diverse samples with different Pfam families
|
| 124 |
+
- 0.22% positive rate matching expected calibration distribution
|
| 125 |
+
- Produces threshold ~0.999982 matching paper's 0.9999802250
|
| 126 |
+
|
| 127 |
+
See `scripts/quick_fdr_check.py` for verification.
|
| 128 |
| `lookup_embeddings.npy` | 1.1 GB | UniProt protein embeddings (lookup database) |
|
| 129 |
| `lookup_embeddings_meta_data.tsv` | 560 MB | Metadata for lookup proteins |
|
| 130 |
| `afdb_embeddings_protein_vec.npy` | 4.7 GB | AlphaFold DB embeddings |
|
|
@@ -80,12 +80,13 @@ def parse_args():
|
|
| 80 |
parser.add_argument(
|
| 81 |
"--cal_data",
|
| 82 |
type=str,
|
| 83 |
-
default="
|
|
|
|
| 84 |
)
|
| 85 |
parser.add_argument(
|
| 86 |
"--output",
|
| 87 |
type=str,
|
| 88 |
-
default="/
|
| 89 |
help="Output file for the dataframe mapping similarities to probabilities",
|
| 90 |
)
|
| 91 |
parser.add_argument(
|
|
|
|
| 80 |
parser.add_argument(
|
| 81 |
"--cal_data",
|
| 82 |
type=str,
|
| 83 |
+
default="data/pfam_new_proteins.npy",
|
| 84 |
+
help="Calibration dataset (use pfam_new_proteins.npy, NOT the backup with leakage)"
|
| 85 |
)
|
| 86 |
parser.add_argument(
|
| 87 |
"--output",
|
| 88 |
type=str,
|
| 89 |
+
default="data/sim2prob_lookup.csv",
|
| 90 |
help="Output file for the dataframe mapping similarities to probabilities",
|
| 91 |
)
|
| 92 |
parser.add_argument(
|
|
@@ -0,0 +1,58 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python
|
| 2 |
+
"""Quick FDR calibration check - compare the two datasets."""
|
| 3 |
+
import numpy as np
|
| 4 |
+
import sys
|
| 5 |
+
sys.path.insert(0, '.')
|
| 6 |
+
from protein_conformal.util import get_sims_labels, get_thresh_FDR
|
| 7 |
+
|
| 8 |
+
print("Quick FDR Calibration Check")
|
| 9 |
+
print("=" * 50)
|
| 10 |
+
|
| 11 |
+
# Load both datasets
|
| 12 |
+
pfam_new = np.load('data/pfam_new_proteins.npy', allow_pickle=True)
|
| 13 |
+
backup = np.load('/groups/doudna/projects/ronb/conformal_backup/protein-conformal/data/conformal_pfam_with_lookup_dataset.npy', allow_pickle=True)
|
| 14 |
+
|
| 15 |
+
print(f"pfam_new: {len(pfam_new)} samples")
|
| 16 |
+
print(f"backup: {len(backup)} samples")
|
| 17 |
+
print()
|
| 18 |
+
|
| 19 |
+
# Compare similarity and label distributions
|
| 20 |
+
sims1, labels1 = get_sims_labels(pfam_new[:500], partial=False)
|
| 21 |
+
sims2, labels2 = get_sims_labels(backup[:500], partial=False)
|
| 22 |
+
|
| 23 |
+
print("Stats from first 500 samples:")
|
| 24 |
+
print(f" pfam_new - positives: {labels1.sum()}/{labels1.size} ({100*labels1.mean():.2f}%)")
|
| 25 |
+
print(f" backup - positives: {labels2.sum()}/{labels2.size} ({100*labels2.mean():.2f}%)")
|
| 26 |
+
print()
|
| 27 |
+
|
| 28 |
+
# Run a single FDR calibration on each
|
| 29 |
+
print("Single FDR trial (n_calib=1000, alpha=0.1):")
|
| 30 |
+
np.random.seed(42)
|
| 31 |
+
|
| 32 |
+
# pfam_new
|
| 33 |
+
np.random.shuffle(pfam_new)
|
| 34 |
+
X1, y1 = get_sims_labels(pfam_new[:1000], partial=False)
|
| 35 |
+
lhat1, _ = get_thresh_FDR(y1, X1, alpha=0.1, delta=0.5, N=100)
|
| 36 |
+
|
| 37 |
+
# backup
|
| 38 |
+
np.random.shuffle(backup)
|
| 39 |
+
X2, y2 = get_sims_labels(backup[:1000], partial=False)
|
| 40 |
+
lhat2, _ = get_thresh_FDR(y2, X2, alpha=0.1, delta=0.5, N=100)
|
| 41 |
+
|
| 42 |
+
print(f" pfam_new λ: {lhat1:.10f}")
|
| 43 |
+
print(f" backup λ: {lhat2:.10f}")
|
| 44 |
+
print(f" Paper λ: 0.9999802250 (from pfam_fdr_2024-06-25.npy)")
|
| 45 |
+
print()
|
| 46 |
+
|
| 47 |
+
# Which is closer to paper?
|
| 48 |
+
diff1 = abs(lhat1 - 0.9999802250)
|
| 49 |
+
diff2 = abs(lhat2 - 0.9999802250)
|
| 50 |
+
print(f"Difference from paper threshold:")
|
| 51 |
+
print(f" pfam_new: {diff1:.10f}")
|
| 52 |
+
print(f" backup: {diff2:.10f}")
|
| 53 |
+
print()
|
| 54 |
+
|
| 55 |
+
if diff1 < diff2:
|
| 56 |
+
print("→ pfam_new is closer to paper threshold")
|
| 57 |
+
else:
|
| 58 |
+
print("→ backup is closer to paper threshold")
|
|
@@ -22,16 +22,23 @@ echo "Date: $(date)"
|
|
| 22 |
echo "Node: $(hostname)"
|
| 23 |
echo "========================================"
|
| 24 |
|
| 25 |
-
# Use the
|
| 26 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 27 |
|
| 28 |
# Check if data exists
|
| 29 |
if [ ! -f "$CALIB_DATA" ]; then
|
| 30 |
echo "ERROR: Calibration data not found at $CALIB_DATA"
|
|
|
|
| 31 |
exit 1
|
| 32 |
fi
|
| 33 |
|
| 34 |
echo "Using calibration data: $CALIB_DATA"
|
|
|
|
| 35 |
echo ""
|
| 36 |
|
| 37 |
# Run calibration using the ORIGINAL generate_fdr.py script (LTT method)
|
|
|
|
| 22 |
echo "Node: $(hostname)"
|
| 23 |
echo "========================================"
|
| 24 |
|
| 25 |
+
# IMPORTANT: Use pfam_new_proteins.npy - the CORRECT calibration dataset
|
| 26 |
+
# The backup dataset (conformal_pfam_with_lookup_dataset.npy) has DATA LEAKAGE:
|
| 27 |
+
# - First 50 samples all have same Pfam family "PF01266;" repeated
|
| 28 |
+
# - Positive rate is 3.00% vs 0.22% in correct dataset
|
| 29 |
+
# - Results in different FDR threshold (~0.999965 vs paper's ~0.999980)
|
| 30 |
+
# See: scripts/quick_fdr_check.py for verification
|
| 31 |
+
CALIB_DATA="data/pfam_new_proteins.npy"
|
| 32 |
|
| 33 |
# Check if data exists
|
| 34 |
if [ ! -f "$CALIB_DATA" ]; then
|
| 35 |
echo "ERROR: Calibration data not found at $CALIB_DATA"
|
| 36 |
+
echo "Download from Zenodo: https://zenodo.org/records/14272215"
|
| 37 |
exit 1
|
| 38 |
fi
|
| 39 |
|
| 40 |
echo "Using calibration data: $CALIB_DATA"
|
| 41 |
+
echo "NOTE: Using pfam_new_proteins.npy (correct dataset without leakage)"
|
| 42 |
echo ""
|
| 43 |
|
| 44 |
# Run calibration using the ORIGINAL generate_fdr.py script (LTT method)
|
|
@@ -12,6 +12,7 @@
|
|
| 12 |
# sbatch scripts/slurm_verify.sh fdr # Verify FDR algorithm
|
| 13 |
# sbatch scripts/slurm_verify.sh dali # Verify DALI prefiltering (Tables 4-6)
|
| 14 |
# sbatch scripts/slurm_verify.sh clean # Verify CLEAN enzyme (Tables 1-2)
|
|
|
|
| 15 |
# sbatch scripts/slurm_verify.sh all # Run all verifications
|
| 16 |
|
| 17 |
set -e
|
|
@@ -55,6 +56,12 @@ run_clean() {
|
|
| 55 |
echo ""
|
| 56 |
}
|
| 57 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 58 |
case "$CHECK" in
|
| 59 |
syn30)
|
| 60 |
run_syn30
|
|
@@ -68,15 +75,19 @@ case "$CHECK" in
|
|
| 68 |
clean)
|
| 69 |
run_clean
|
| 70 |
;;
|
|
|
|
|
|
|
|
|
|
| 71 |
all)
|
| 72 |
run_syn30
|
| 73 |
run_fdr
|
| 74 |
run_dali
|
| 75 |
run_clean
|
|
|
|
| 76 |
;;
|
| 77 |
*)
|
| 78 |
echo "Unknown check: $CHECK"
|
| 79 |
-
echo "Available: syn30, fdr, dali, clean, all"
|
| 80 |
exit 1
|
| 81 |
;;
|
| 82 |
esac
|
|
|
|
| 12 |
# sbatch scripts/slurm_verify.sh fdr # Verify FDR algorithm
|
| 13 |
# sbatch scripts/slurm_verify.sh dali # Verify DALI prefiltering (Tables 4-6)
|
| 14 |
# sbatch scripts/slurm_verify.sh clean # Verify CLEAN enzyme (Tables 1-2)
|
| 15 |
+
# sbatch scripts/slurm_verify.sh probs # Verify precomputed probability lookup
|
| 16 |
# sbatch scripts/slurm_verify.sh all # Run all verifications
|
| 17 |
|
| 18 |
set -e
|
|
|
|
| 56 |
echo ""
|
| 57 |
}
|
| 58 |
|
| 59 |
+
run_probs() {
|
| 60 |
+
echo "--- Precomputed Probability Lookup Verification ---"
|
| 61 |
+
python scripts/test_precomputed_probs.py
|
| 62 |
+
echo ""
|
| 63 |
+
}
|
| 64 |
+
|
| 65 |
case "$CHECK" in
|
| 66 |
syn30)
|
| 67 |
run_syn30
|
|
|
|
| 75 |
clean)
|
| 76 |
run_clean
|
| 77 |
;;
|
| 78 |
+
probs)
|
| 79 |
+
run_probs
|
| 80 |
+
;;
|
| 81 |
all)
|
| 82 |
run_syn30
|
| 83 |
run_fdr
|
| 84 |
run_dali
|
| 85 |
run_clean
|
| 86 |
+
run_probs
|
| 87 |
;;
|
| 88 |
*)
|
| 89 |
echo "Unknown check: $CHECK"
|
| 90 |
+
echo "Available: syn30, fdr, dali, clean, probs, all"
|
| 91 |
exit 1
|
| 92 |
;;
|
| 93 |
esac
|
|
@@ -0,0 +1,34 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/bin/bash
|
| 2 |
+
#SBATCH --job-name=cpr-verify-probs
|
| 3 |
+
#SBATCH --output=logs/cpr-verify-probs-%j.out
|
| 4 |
+
#SBATCH --error=logs/cpr-verify-probs-%j.err
|
| 5 |
+
#SBATCH --time=1:00:00
|
| 6 |
+
#SBATCH --mem=16G
|
| 7 |
+
#SBATCH --cpus-per-task=2
|
| 8 |
+
|
| 9 |
+
# CPR Precomputed Probability Verification
|
| 10 |
+
# Verifies that the sim->prob lookup table matches direct Venn-Abers computation
|
| 11 |
+
# Usage: sbatch scripts/slurm_verify_probs.sh
|
| 12 |
+
|
| 13 |
+
set -e
|
| 14 |
+
mkdir -p logs data
|
| 15 |
+
|
| 16 |
+
echo "========================================"
|
| 17 |
+
echo "CPR Probability Lookup Verification"
|
| 18 |
+
echo "Date: $(date)"
|
| 19 |
+
echo "Node: $(hostname)"
|
| 20 |
+
echo "========================================"
|
| 21 |
+
echo ""
|
| 22 |
+
|
| 23 |
+
# Activate conda environment
|
| 24 |
+
source ~/.bashrc
|
| 25 |
+
eval "$(conda shell.bash hook)"
|
| 26 |
+
conda activate conformal-s
|
| 27 |
+
|
| 28 |
+
# Run the verification test
|
| 29 |
+
python scripts/test_precomputed_probs.py
|
| 30 |
+
|
| 31 |
+
echo ""
|
| 32 |
+
echo "========================================"
|
| 33 |
+
echo "Completed: $(date)"
|
| 34 |
+
echo "========================================"
|