# CPR Project Learnings
This document captures key learnings, gotchas, and patterns discovered while working on the Conformal Protein Retrieval project.
---
## Git & Branch Management
### Branch Structure
- **`main`** - Your fork's main branch
- **`upstream/main`** - Ron Boger's original repo (https://github.com/ronboger/conformal-protein-retrieval)
- **`gradio`** - Active development branch for the Gradio web interface
- **`huggingface`** - Git remote (not a branch) used for HuggingFace Spaces deployment
### Syncing with Upstream
```bash
git fetch upstream
git log gradio..upstream/main --oneline # Show commits in upstream NOT in gradio
git merge upstream/main # Merge upstream into current branch
```
---
## Data & Thresholds
### Critical: Data Leakage Warning
**DO NOT USE** `conformal_pfam_with_lookup_dataset.npy` from backup directories.
- First 50 samples all have the same Pfam family "PF01266;"
- Positive rate: 3.00% (the correct dataset has 0.22%)
- Produces the WRONG FDR threshold
**USE**: `pfam_new_proteins.npy` from Zenodo
- 1,864 diverse samples
- 0.22% positive rate
- Matches paper threshold: 0.9999802
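A quick positive-rate check along the following lines can flag a suspicious calibration set before any threshold is computed. This is only a sketch: the layout of the `.npy` files is not described here, so it uses synthetic boolean label matrices, and the `positive_rate` helper and the array shapes are assumptions for illustration.

```python
import numpy as np

def positive_rate(labels) -> float:
    """Fraction of query/lookup pairs labelled as a match."""
    labels = np.asarray(labels, dtype=bool)
    return float(labels.mean())

# Synthetic stand-ins (hypothetical shapes): rows are queries, columns are lookup hits.
rng = np.random.default_rng(0)
diverse = rng.random((200, 500)) < 0.0022  # roughly 0.22% positives
leaky = rng.random((200, 500)) < 0.03      # roughly 3% positives

for name, labels in [("diverse", diverse), ("leaky", leaky)]:
    print(f"{name}: positive rate = {positive_rate(labels):.2%}")
```

A rate far above 0.22% on the real calibration labels would be a reason to stop and check which file was loaded.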
### Threshold Files
| File | Purpose |
|------|---------|
| `results/fdr_thresholds.csv` | FDR (exact match) thresholds |
| `results/fnr_thresholds.csv` | FNR (exact match) thresholds |
| `results/fnr_thresholds_partial.csv` | FNR (partial match) thresholds |
| `results/calibration_probs.csv` | Venn-Abers probability calibration data |
### Verified Paper Claims
| Claim | Paper Value | Verified Value |
|-------|-------------|----------------|
| Syn3.0 annotation (alpha=0.1) | 39.6% (59/149) | 39.6% (59/149) |
| FDR threshold (alpha=0.1) | 0.9999802250 | 0.9999801 |
| DALI TPR | 82.8% | 81.8% |
| DALI DB reduction | 31.5% | 31.5% |
| CLEAN loss <= alpha | 1.0 | 0.97 |
---
## Code Patterns
### FDR vs FNR Thresholds
- **FDR** (False Discovery Rate): Higher threshold = stricter = fewer but more confident results
- **FNR** (False Negative Rate): Lower threshold = more permissive = more results, fewer misses
- For FNR: lower alpha -> lower threshold (the opposite direction from FDR, where lower alpha raises the threshold; easy to get backwards!)
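The effect of the threshold direction can be seen on a toy example. The scores below are made up, and the rule "keep a hit when its score is at least the threshold" is an assumption consistent with "higher threshold = stricter" above, not a copy of the logic in `util.py`.

```python
import numpy as np

# Made-up similarity scores for five candidate hits.
scores = np.array([0.99999, 0.99998, 0.99990, 0.99950, 0.99000])

def keep_hits(scores, threshold):
    """Return the scores that pass the threshold (assumed rule: score >= threshold)."""
    return scores[scores >= threshold]

strict = keep_hits(scores, 0.99998)      # FDR-style: high threshold, few confident hits
permissive = keep_hits(scores, 0.99900)  # FNR-style: low threshold, more hits retained

print(len(strict), len(permissive))
```

The strict threshold keeps only the top hits, while the permissive one retains most of the list, which is what keeps the miss rate low.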
### Array Dimension Bug (Fixed)
The `get_thresh_FDR()` function failed on 1D arrays. Fixed by checking array dimensions:
```python
is_1d = labels.ndim == 1  # same as len(labels.shape) == 1
if is_1d:
    # Use risk_1d function
    ...
else:
    # Use standard risk function
    ...
```
### Gradio File Uploads
Gradio may pass file objects in different formats:
- File-like objects with `.read()`
- Temp files with `.name` attribute
- Plain filesystem paths (when `type='filepath'`)
- Dicts with `'path'`/`'name'` metadata
Handle all cases with fallback logic (see `_persist_uploaded_file()` in `gradio_interface.py`).
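The fallback logic for the four cases listed above could look roughly like this. It is a minimal sketch, not the body of `_persist_uploaded_file()`: the `resolve_upload` name and the order of the checks are assumptions.

```python
import os
from pathlib import Path

def resolve_upload(upload) -> bytes:
    """Return the raw bytes of an upload, whatever shape it arrived in (hypothetical helper)."""
    if upload is None:
        raise ValueError("no file uploaded")
    # Dict with 'path' or 'name' metadata
    if isinstance(upload, dict):
        path = upload.get("path") or upload.get("name")
        if not path:
            raise ValueError("upload dict has no 'path' or 'name'")
        return Path(path).read_bytes()
    # Plain filesystem path (type='filepath')
    if isinstance(upload, (str, os.PathLike)):
        return Path(upload).read_bytes()
    # File-like object with .read()
    if hasattr(upload, "read"):
        data = upload.read()
        return data.encode() if isinstance(data, str) else data
    # Temp-file wrapper that only exposes .name
    if hasattr(upload, "name"):
        return Path(upload.name).read_bytes()
    raise TypeError(f"unsupported upload type: {type(upload)!r}")
```

Checking `.read()` before `.name` matters because temp-file objects usually expose both, and reading directly avoids reopening the file.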
---
## Environment & Dependencies
### Python Environment
- Conda environment: `conformal-s` (Python 3.11.10)
- Key packages: faiss 1.9.0, torch 2.5.0, numpy 1.26.4
### Missing Dependencies (Not in requirements.txt)
- `pytorch-lightning` - for Protein-Vec model loading
- `h5py` - for utils_search.py
- `gradio` - for web interface
### NumPy Compatibility
NumPy 1.22+ renamed `interpolation=` to `method=` in `np.quantile()`. Use `method=` for compatibility.
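If the code ever has to run on a NumPy older than 1.22 as well, a small wrapper can try the new keyword first and fall back to the old one. The `quantile_lower` name is hypothetical; with the pinned numpy 1.26.4 only the first branch is exercised.

```python
import numpy as np

def quantile_lower(values, q):
    """np.quantile with 'lower' rounding on both old and new NumPy (hypothetical helper)."""
    try:
        return np.quantile(values, q, method="lower")         # NumPy >= 1.22
    except TypeError:
        return np.quantile(values, q, interpolation="lower")  # NumPy < 1.22

# The median position falls between 2.0 and 3.0; 'lower' picks the smaller value.
print(quantile_lower([1.0, 2.0, 3.0, 4.0], 0.5))
```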
---
## Gradio Interface
### Current UI Features (as of 2026-02)
1. **FASTA input**: Text area + file upload
2. **Risk control**: FDR/FNR toggle with alpha slider
3. **Match type**: Exact vs Partial Pfam matching
4. **Database selection**: UniProt, SCOPE, or Custom upload
5. **Results**: Sortable table with export (CSV/JSON)
6. **Probability calibration**: Uses pre-computed Venn-Abers data
### HuggingFace Deployment
- Set `HF_DATASET_ID` environment variable for automatic data download
- Uses `huggingface_hub.hf_hub_download()` for large files
- Files are cached locally after first download
### Performance Optimizations
- `LOOKUP_RESOURCE_CACHE`: Caches FAISS index + metadata by file path + mtime
- `@lru_cache`: Caches threshold CSV parsing
- `StageTimer`: Logs timing for each pipeline stage
---
## Common Issues & Solutions
### Issue: "No module named 'protein_conformal'"
**Solution**: Install in development mode: `pip install -e .`
### Issue: Gradio import fails
**Solution**: The gradio import in `__init__.py` is now optional, wrapped in `try`/`except`
### Issue: FDR threshold doesn't match paper
**Solution**: Check that the correct calibration data is in use (`pfam_new_proteins.npy`, NOT the backup files)
### Issue: NumPy deprecation warning for quantile
**Solution**: Use `method='lower'` instead of `interpolation='lower'`
### Issue: setup.py references non-existent src/ directory
**Solution**: `setup.py` was simplified to defer to `pyproject.toml`
---
## Testing
### Run Tests
```bash
pytest tests/ -v # All tests
pytest tests/test_util.py -v # Just util tests
pytest tests/test_cli.py -v # Just CLI tests
pytest tests/ --cov=protein_conformal --cov-report=html # With coverage
```
### Test Count
- 27 util tests
- 24 CLI tests
- All passing as of 2026-02-03
---
## Files to Know
### Core Algorithm
- `protein_conformal/util.py` - All conformal prediction algorithms
### CLI
- `protein_conformal/cli.py` - `cpr embed`, `cpr search`, `cpr verify`
### Gradio
- `protein_conformal/gradio_app.py` - Entry point
- `protein_conformal/backend/gradio_interface.py` - Main UI logic
### Threshold Computation
- `scripts/compute_fdr_table.py` - FDR thresholds
- `scripts/compute_fnr_table.py` - FNR thresholds
### Verification
- `scripts/verify_syn30.py` - JCVI Syn3.0 (Figure 2A)
- `scripts/verify_dali.py` - DALI prefiltering
- `scripts/verify_clean.py` - CLEAN enzyme
---
## HuggingFace Spaces Deployment
### Key Lesson: Optional Imports
When deploying to HuggingFace Spaces, **wrap optional module imports in try/except**. The Space only installs what's in `requirements.txt`, so unused modules with extra dependencies will crash the app on import.
```python
# Bad - crashes if py3Dmol not installed
from .visualization import create_structure_with_heatmap

# Good - gracefully handles missing deps
try:
    from .visualization import create_structure_with_heatmap
except ImportError:
    create_structure_with_heatmap = None
```
### Gradio 4.x/5.x Breaking Changes
- `gr.Dataframe(height=...)` removed - don't use height parameter
- `gr.update()` removed in Gradio 5.x - use component constructors instead (e.g. `gr.File(visible=False)` not `gr.update(visible=False)`)
- **`gr.JSON` crashes on Python 3.13 / HF Spaces** - `gradio_client` can't handle `additionalProperties: true` (boolean) in JSON Schema. Causes `TypeError: argument of type 'bool' is not iterable` in `get_api_info()`, breaking ALL event handlers with "No API found". **Fix**: use `gr.Code(language="json")` instead and serialize dicts with `json.dumps()`.
- Use `gr.themes.Soft()` instead of custom CSS where possible
- Test locally with same Gradio version as HF Spaces
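The `gr.JSON` workaround from the list above might look like this in practice. It is a sketch only: `to_json_text` and `build_demo` are hypothetical names, the payload is made up, and the gradio import is guarded so the serialization helper stays usable without gradio installed.

```python
import json

try:
    import gradio as gr
except ImportError:  # keep the helper importable without gradio
    gr = None

def to_json_text(payload) -> str:
    """Serialize a dict for display in gr.Code; default=str covers non-JSON types."""
    return json.dumps(payload, indent=2, sort_keys=True, default=str)

def build_demo():
    """Small Blocks app that shows JSON through gr.Code instead of gr.JSON."""
    if gr is None:
        raise RuntimeError("gradio is not installed")
    with gr.Blocks() as demo:
        button = gr.Button("Run")
        output = gr.Code(language="json", label="Raw results")
        # Made-up payload; a real handler would return the search results.
        button.click(fn=lambda: to_json_text({"hits": 3, "alpha": 0.1}), outputs=output)
    return demo
```

Because the handler returns a plain string, the dict never has to be described as a JSON Schema object when the API info is generated.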
### Requirements.txt Best Practices
1. Include ALL imports used by the main app path
2. Comment out optional deps with clear notes
3. Test locally with a fresh venv before pushing
### Dataset Integration
- Set `HF_DATASET_ID` env variable in Space settings
- Dataset structure must match paths in `app.py` `ensure_assets()`
- Files downloaded on first run, then cached
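A download helper driven by `HF_DATASET_ID` could be sketched as follows. This is not the code in `ensure_assets()`: the `fetch_asset` name and the choice to return `None` when no dataset is configured are assumptions. `hf_hub_download` returns the path of the locally cached copy, which is what makes later runs fast.

```python
import os

def fetch_asset(filename, dataset_id=None):
    """Download one file from the HF dataset, or return None if no dataset is configured."""
    dataset_id = dataset_id or os.environ.get("HF_DATASET_ID")
    if not dataset_id:
        return None  # caller falls back to files already present locally
    # Imported lazily so the module still loads where huggingface_hub is absent.
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id=dataset_id, filename=filename, repo_type="dataset")
```

The `filename` passed in has to match the path of the file inside the dataset repository.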
---
## Session History
### 2026-02-05 (Gradio branch)
- Confirmed gradio branch is synced with upstream/main
- No remaining changes to integrate
- Last 3 commits were Gradio UI improvements
### 2026-02-03
- Archived 16 redundant scripts
- Consolidated threshold CSVs
- Added full tables to GETTING_STARTED.md
### 2026-02-02
- Verified Syn3.0: 59/149 = 39.6%
- Fixed FDR bug (1D/2D array handling)
- Created CLI with embed, search, verify commands
### 2026-01-28
- Initial cleanup
- Removed duplicate src/protein_conformal/
- Created pyproject.toml and test infrastructure