# CPR Project Learnings

This document captures key learnings, gotchas, and patterns discovered while working on the Conformal Protein Retrieval project.

---

## Git & Branch Management

### Branch Structure
- **`main`** - Your fork's main branch
- **`upstream/main`** - Ron Boger's original repo (https://github.com/ronboger/conformal-protein-retrieval)
- **`gradio`** - Active development branch for the Gradio web interface
- **`huggingface`** - Remote for HuggingFace Spaces deployment

### Syncing with Upstream
```bash
git fetch upstream
git log gradio..upstream/main --oneline  # Show commits in upstream NOT in gradio
git merge upstream/main                   # Merge upstream into current branch
```

---

## Data & Thresholds

### Critical: Data Leakage Warning
**DO NOT USE** `conformal_pfam_with_lookup_dataset.npy` from backup directories.
- First 50 samples all have the same Pfam family (`PF01266;`)
- Positive rate is 3.00% (the correct rate is 0.22%)
- Produces the WRONG FDR threshold

**USE**: `pfam_new_proteins.npy` from Zenodo
- 1,864 diverse samples
- 0.22% positive rate
- Matches paper threshold: 0.9999802

### Threshold Files
| File | Purpose |
|------|---------|
| `results/fdr_thresholds.csv` | FDR (exact match) thresholds |
| `results/fnr_thresholds.csv` | FNR (exact match) thresholds |
| `results/fnr_thresholds_partial.csv` | FNR (partial match) thresholds |
| `results/calibration_probs.csv` | Venn-Abers probability calibration data |

### Verified Paper Claims
| Claim | Paper Value | Verified Value |
|-------|-------------|----------------|
| Syn3.0 annotation (alpha=0.1) | 39.6% (59/149) | 39.6% (59/149) |
| FDR threshold (alpha=0.1) | 0.9999802250 | 0.9999801 |
| DALI TPR | 82.8% | 81.8% |
| DALI DB reduction | 31.5% | 31.5% |
| CLEAN loss <= alpha | 1.0 | 0.97 |

---

## Code Patterns

### FDR vs FNR Thresholds
- **FDR** (False Discovery Rate): Higher threshold = stricter = fewer but more confident results
- **FNR** (False Negative Rate): Lower threshold = more permissive = more results, fewer misses
- For FNR, a lower alpha gives a lower threshold: tolerating fewer misses means accepting more hits, the opposite of the FDR intuition
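To make the two directions concrete, here is a minimal sketch on made-up scores. It assumes a hit is retrieved when its similarity score is at or above the threshold; the function names and the toy data are illustrative and are not taken from `protein_conformal/util.py`.

```python
import numpy as np


def empirical_fdr(scores, is_match, threshold):
    """Fraction of retrieved hits (score >= threshold) that are not true matches."""
    scores = np.asarray(scores, dtype=float)
    is_match = np.asarray(is_match, dtype=bool)
    retrieved = scores >= threshold
    if not retrieved.any():
        return 0.0
    return float((retrieved & ~is_match).sum() / retrieved.sum())


def empirical_fnr(scores, is_match, threshold):
    """Fraction of true matches that fall below the threshold and are missed."""
    scores = np.asarray(scores, dtype=float)
    is_match = np.asarray(is_match, dtype=bool)
    if not is_match.any():
        return 0.0
    missed = is_match & (scores < threshold)
    return float(missed.sum() / is_match.sum())


scores = np.array([0.99, 0.95, 0.90, 0.80, 0.70, 0.60])
is_match = np.array([True, True, False, True, False, False])

# Strict threshold: no false discoveries, but one of three matches is missed.
strict = (empirical_fdr(scores, is_match, 0.95), empirical_fnr(scores, is_match, 0.95))
# Permissive threshold: no misses, but one of four retrieved hits is wrong.
loose = (empirical_fdr(scores, is_match, 0.75), empirical_fnr(scores, is_match, 0.75))
```

Raising the threshold trades misses for confidence, which is why the FDR and FNR tables move in opposite directions as alpha shrinks.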

### Array Dimension Bug (Fixed)
The `get_thresh_FDR()` function failed on 1D arrays. Fixed by checking array dimensions:
```python
is_1d = len(labels.shape) == 1  # labels is a NumPy array
if is_1d:
    ...  # 1D labels: use the risk_1d function
else:
    ...  # 2D labels: use the standard risk function

### Gradio File Uploads
Gradio may pass file objects in different formats:
- File-like objects with `.read()`
- Temp files with `.name` attribute
- Plain filesystem paths (when type='filepath')
- Dicts with 'path'/'name' metadata

Handle all cases with fallback logic (see `_persist_uploaded_file()` in gradio_interface.py).
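The fallback order might look roughly like the sketch below. `resolve_upload_path` is a hypothetical helper written for illustration, not the project's `_persist_uploaded_file()`; the `.fasta` suffix and the order of the checks are assumptions.

```python
import io
import os
import tempfile


def resolve_upload_path(upload, dest_dir=None):
    """Return a filesystem path for an upload, whatever shape it arrives in."""
    if upload is None:
        return None
    # Plain filesystem path (e.g. when type='filepath').
    if isinstance(upload, (str, os.PathLike)):
        return os.fspath(upload)
    # Dict carrying 'path' or 'name' metadata.
    if isinstance(upload, dict):
        path = upload.get("path") or upload.get("name")
        if not path:
            raise ValueError("upload dict has neither 'path' nor 'name'")
        return os.fspath(path)
    # Temp-file wrapper whose .name points at a file that exists on disk.
    name = getattr(upload, "name", None)
    if isinstance(name, str) and os.path.exists(name):
        return name
    # File-like object: read its contents and persist them to a temp file.
    if hasattr(upload, "read"):
        data = upload.read()
        if isinstance(data, str):
            data = data.encode("utf-8")
        fd, path = tempfile.mkstemp(suffix=".fasta", dir=dest_dir)
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        return path
    raise TypeError(f"unsupported upload type: {type(upload).__name__}")


# An in-memory upload is written to disk and its path returned.
demo_path = resolve_upload_path(io.BytesIO(b">seq1\nMKT\n"))
```

Checking the cheap cases (string path, dict) first means the helper only touches the disk when it actually has to persist in-memory data.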

---

## Environment & Dependencies

### Python Environment
- Conda environment: `conformal-s` (Python 3.11.10)
- Key packages: faiss 1.9.0, torch 2.5.0, numpy 1.26.4

### Missing Dependencies (Not in requirements.txt)
- `pytorch-lightning` - for Protein-Vec model loading
- `h5py` - for utils_search.py
- `gradio` - for web interface

### NumPy Compatibility
NumPy 1.22+ renamed `interpolation=` to `method=` in `np.quantile()`. Use `method=` for compatibility.
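If the code ever has to run on an older NumPy as well, a small wrapper can try the new keyword first and fall back to the old one. This is a sketch only; `quantile_lower` is a hypothetical name, not a function in the project.

```python
import numpy as np


def quantile_lower(values, q):
    """np.quantile with the 'lower' rule on both new and old NumPy versions."""
    try:
        # NumPy >= 1.22 spelling.
        return np.quantile(values, q, method="lower")
    except TypeError:
        # Older NumPy does not know `method=` and only accepts `interpolation=`.
        return np.quantile(values, q, interpolation="lower")


median_lower = quantile_lower([1, 2, 3, 4], 0.5)  # picks 2, never interpolates to 2.5
```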

---

## Gradio Interface

### Current UI Features (as of 2026-02)
1. **FASTA input**: Text area + file upload
2. **Risk control**: FDR/FNR toggle with alpha slider
3. **Match type**: Exact vs Partial Pfam matching
4. **Database selection**: UniProt, SCOPE, or Custom upload
5. **Results**: Sortable table with export (CSV/JSON)
6. **Probability calibration**: Uses pre-computed Venn-Abers data

### HuggingFace Deployment
- Set `HF_DATASET_ID` environment variable for automatic data download
- Uses `huggingface_hub.hf_hub_download()` for large files
- Files are cached locally after first download
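The download step can be sketched as follows. `fetch_asset` is a hypothetical helper, and `repo_type="dataset"` is an assumption based on the variable name `HF_DATASET_ID`; the project's actual logic lives in its own asset-loading code.

```python
import os


def fetch_asset(filename, repo_id=None):
    """Download one file from the configured HF dataset, or return None if unset."""
    repo_id = repo_id or os.environ.get("HF_DATASET_ID")
    if not repo_id:
        # No dataset configured: the caller should fall back to local files.
        return None
    # Imported lazily so the module still loads without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    # hf_hub_download returns the local path and reuses its cache on later calls.
    return hf_hub_download(repo_id=repo_id, filename=filename, repo_type="dataset")
```

Returning `None` when the variable is unset keeps local development working without any HuggingFace configuration.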

### Performance Optimizations
- `LOOKUP_RESOURCE_CACHE`: Caches FAISS index + metadata by file path + mtime
- `@lru_cache`: Caches threshold CSV parsing
- `StageTimer`: Logs timing for each pipeline stage

---

## Common Issues & Solutions

### Issue: "No module named 'protein_conformal'"
**Solution**: Install in development mode: `pip install -e .`

### Issue: Gradio import fails
**Solution**: Made gradio import optional in `__init__.py` with try/except

### Issue: FDR threshold doesn't match paper
**Solution**: Check that you are using the correct calibration data (`pfam_new_proteins.npy`, NOT backup files)

### Issue: NumPy deprecation warning for quantile
**Solution**: Use `method='lower'` instead of `interpolation='lower'`

### Issue: setup.py references non-existent src/ directory
**Solution**: Simplified to defer to pyproject.toml

---

## Testing

### Run Tests
```bash
pytest tests/ -v                    # All tests
pytest tests/test_util.py -v        # Just util tests
pytest tests/test_cli.py -v         # Just CLI tests
pytest tests/ --cov=protein_conformal --cov-report=html  # With coverage
```

### Test Count
- 27 util tests
- 24 CLI tests
- All passing as of 2026-02-03

---

## Files to Know

### Core Algorithm
- `protein_conformal/util.py` - All conformal prediction algorithms

### CLI
- `protein_conformal/cli.py` - `cpr embed`, `cpr search`, `cpr verify`

### Gradio
- `protein_conformal/gradio_app.py` - Entry point
- `protein_conformal/backend/gradio_interface.py` - Main UI logic

### Threshold Computation
- `scripts/compute_fdr_table.py` - FDR thresholds
- `scripts/compute_fnr_table.py` - FNR thresholds

### Verification
- `scripts/verify_syn30.py` - JCVI Syn3.0 (Figure 2A)
- `scripts/verify_dali.py` - DALI prefiltering
- `scripts/verify_clean.py` - CLEAN enzyme

---

## HuggingFace Spaces Deployment

### Key Lesson: Optional Imports
When deploying to HuggingFace Spaces, **wrap optional module imports in try/except**. The Space only installs what's in `requirements.txt`, so unused modules with extra dependencies will crash the app on import.

```python
# Bad - crashes if py3Dmol not installed
from .visualization import create_structure_with_heatmap

# Good - gracefully handles missing deps
try:
    from .visualization import create_structure_with_heatmap
except ImportError:
    create_structure_with_heatmap = None
```

### Gradio 4.x/5.x Breaking Changes
- `gr.Dataframe(height=...)` removed - don't use height parameter
- `gr.update()` removed in Gradio 5.x - use component constructors instead (e.g. `gr.File(visible=False)` not `gr.update(visible=False)`)
- **`gr.JSON` crashes on Python 3.13 / HF Spaces** - `gradio_client` can't handle `additionalProperties: true` (boolean) in JSON Schema. Causes `TypeError: argument of type 'bool' is not iterable` in `get_api_info()`, breaking ALL event handlers with "No API found". **Fix**: use `gr.Code(language="json")` instead and serialize dicts with `json.dumps()`.
- Use `gr.themes.Soft()` instead of custom CSS where possible
- Test locally with same Gradio version as HF Spaces
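For the `gr.JSON` workaround, the serialization half can be sketched without importing Gradio at all. `to_json_text` is a hypothetical helper; the `default=str` fallback for non-JSON types is an assumption about what the result dicts may contain.

```python
import json


def to_json_text(payload):
    """Serialize a result dict to pretty-printed JSON text for display."""
    # default=str keeps the call from raising on values json cannot encode natively.
    return json.dumps(payload, indent=2, sort_keys=True, default=str)


example_text = to_json_text({"hits": 3, "alpha": 0.1})
```

The returned string would then be the value handed to a `gr.Code(language="json")` component in place of passing the dict to `gr.JSON`.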

### Requirements.txt Best Practices
1. Include ALL imports used by the main app path
2. Comment out optional deps with clear notes
3. Test locally with a fresh venv before pushing

### Dataset Integration
- Set `HF_DATASET_ID` env variable in Space settings
- Dataset structure must match paths in `app.py` `ensure_assets()`
- Files downloaded on first run, then cached

---

## Session History

### 2026-02-05 (Gradio branch)
- Confirmed gradio branch is synced with upstream/main
- No remaining changes to integrate
- Last 3 commits were Gradio UI improvements

### 2026-02-03
- Archived 16 redundant scripts
- Consolidated threshold CSVs
- Added full tables to GETTING_STARTED.md

### 2026-02-02
- Verified Syn3.0: 59/149 = 39.6%
- Fixed FDR bug (1D/2D array handling)
- Created CLI with embed, search, verify commands

### 2026-01-28
- Initial cleanup
- Removed duplicate src/protein_conformal/
- Created pyproject.toml and test infrastructure