| # CPR Project Learnings | |
| This document captures key learnings, gotchas, and patterns discovered while working on the Conformal Protein Retrieval project. | |
| --- | |
| ## Git & Branch Management | |
| ### Branch Structure | |
| - **`main`** - Your fork's main branch | |
| - **`upstream/main`** - Ron Boger's original repo (https://github.com/ronboger/conformal-protein-retrieval) | |
| - **`gradio`** - Active development branch for the Gradio web interface | |
| - **`huggingface`** - Remote for HuggingFace Spaces deployment | |
| ### Syncing with Upstream | |
| ```bash | |
| git fetch upstream | |
| git log gradio..upstream/main --oneline # Show commits in upstream NOT in gradio | |
| git merge upstream/main # Merge upstream into current branch | |
| ``` | |
| --- | |
| ## Data & Thresholds | |
| ### Critical: Data Leakage Warning | |
| **DO NOT USE** `conformal_pfam_with_lookup_dataset.npy` from backup directories. | |
| - First 50 samples all have the same Pfam family "PF01266;" | |
| - Positive rate: 3.00% (the correct value is 0.22%) | |
| - Produces the WRONG FDR threshold | |
| **USE**: `pfam_new_proteins.npy` from Zenodo | |
| - 1,864 diverse samples | |
| - 0.22% positive rate | |
| - Matches paper threshold: 0.9999802 | |
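A quick sanity check before calibrating can catch both leakage symptoms listed above. This is a minimal sketch that assumes, hypothetically, that the calibration labels are available as a boolean query-by-hit matrix and the per-query Pfam annotations as a list of strings. The real layout of the `.npy` files may differ, and both helper names are illustrative.

```python
import numpy as np


def positive_rate(labels):
    """Fraction of (query, hit) pairs labelled as true matches."""
    return float(np.asarray(labels, dtype=bool).mean())


def distinct_leading_families(families, n=50):
    """Number of distinct annotations among the first n queries."""
    return len(set(families[:n]))


# Toy data standing in for a calibration set (not real project data).
labels = np.zeros((10, 100), dtype=bool)
labels[0, :3] = True  # 3 positives out of 1000 pairs
families = ["PF01266;"] * 50 + ["OTHER_A;", "OTHER_B;"]

print(positive_rate(labels))
print(distinct_leading_families(families))  # 1 means the leading samples are not diverse
```

A positive rate far from the expected 0.22%, or a single family across the leading samples, would be a reason to stop and re-check which file is loaded.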
| ### Threshold Files | |
| | File | Purpose | | |
| |------|---------| | |
| | `results/fdr_thresholds.csv` | FDR (exact match) thresholds | | |
| | `results/fnr_thresholds.csv` | FNR (exact match) thresholds | | |
| | `results/fnr_thresholds_partial.csv` | FNR (partial match) thresholds | | |
| | `results/calibration_probs.csv` | Venn-Abers probability calibration data | | |
| ### Verified Paper Claims | |
| | Claim | Paper Value | Verified Value | | |
| |-------|-------------|----------------| | |
| | Syn3.0 annotation (alpha=0.1) | 39.6% (59/149) | 39.6% (59/149) | | |
| | FDR threshold (alpha=0.1) | 0.9999802250 | 0.9999801 | | |
| | DALI TPR | 82.8% | 81.8% | | |
| | DALI DB reduction | 31.5% | 31.5% | | |
| | CLEAN loss <= alpha | 1.0 | 0.97 | | |
| --- | |
| ## Code Patterns | |
| ### FDR vs FNR Thresholds | |
| - **FDR** (False Discovery Rate): a higher similarity threshold is stricter, giving fewer but more confident results | |
| - **FNR** (False Negative Rate): a lower similarity threshold is more permissive, giving more results and fewer misses | |
| - For FNR: lower alpha -> lower threshold (opposite intuition! Allowing fewer missed matches means the cutoff has to admit more hits) | |
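The FNR direction is easier to see on a toy example. The sketch below is a simplified empirical version, not the project's conformal procedure: it assumes hits are kept when similarity is at or above the cutoff, uses synthetic similarities, and the function names are hypothetical.

```python
import numpy as np


def fnr_threshold(positive_sims, alpha):
    """A cutoff whose empirical FNR on positive_sims is at most alpha.

    Hits are kept when similarity >= cutoff, so a true match is missed
    only when its similarity falls strictly below the cutoff.
    """
    return float(np.quantile(positive_sims, alpha, method="lower"))


def empirical_fnr(positive_sims, cutoff):
    """Fraction of true matches that the cutoff would miss."""
    return float(np.mean(np.asarray(positive_sims) < cutoff))


# Synthetic similarities of true matches (illustrative values only).
rng = np.random.default_rng(0)
positive_sims = rng.uniform(0.99, 1.0, size=1000)

for alpha in (0.2, 0.1, 0.01):
    cutoff = fnr_threshold(positive_sims, alpha)
    print(alpha, cutoff, empirical_fnr(positive_sims, cutoff))
```

As alpha shrinks, the printed cutoff moves down, which is the behaviour described in the last bullet.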
| ### Array Dimension Bug (Fixed) | |
| The `get_thresh_FDR()` function failed on 1D arrays. Fixed by checking array dimensions: | |
| ```python | |
| is_1d = labels.ndim == 1  # labels is a NumPy array | |
| if is_1d: | |
|     pass  # Placeholder: use the risk_1d function here | |
| else: | |
|     pass  # Placeholder: use the standard risk function here | |
| ``` | |
| ### Gradio File Uploads | |
| Gradio may pass file objects in different formats: | |
| - File-like objects with `.read()` | |
| - Temp files with `.name` attribute | |
| - Plain filesystem paths (when type='filepath') | |
| - Dicts with 'path'/'name' metadata | |
| Handle all cases with fallback logic (see `_persist_uploaded_file()` in gradio_interface.py). | |
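The fallback logic for the four cases can be sketched as follows. This is not the actual `_persist_uploaded_file()` implementation: the function name `persist_upload`, its parameters, and the `upload.fasta` fallback filename are assumptions made for illustration.

```python
import io
import shutil
import tempfile
from pathlib import Path


def persist_upload(upload, dest_dir=None, fallback_name="upload.fasta"):
    """Copy an uploaded file into dest_dir and return the new Path (hypothetical helper)."""
    dest_dir = Path(dest_dir) if dest_dir else Path(tempfile.mkdtemp(prefix="upload_"))
    dest_dir.mkdir(parents=True, exist_ok=True)

    # Dicts carrying 'path' or 'name' metadata reduce to the plain-path case.
    if isinstance(upload, dict):
        upload = upload.get("path") or upload.get("name")
        if upload is None:
            raise ValueError("Upload dict has neither 'path' nor 'name'")

    if isinstance(upload, (str, Path)):
        # Plain filesystem path.
        src = Path(upload)
    elif isinstance(getattr(upload, "name", None), str) and Path(upload.name).is_file():
        # Temp-file object whose .name points at a file on disk.
        src = Path(upload.name)
    elif hasattr(upload, "read"):
        # In-memory file-like object: write its contents out directly.
        data = upload.read()
        if isinstance(data, str):
            data = data.encode("utf-8")
        dest = dest_dir / fallback_name
        dest.write_bytes(data)
        return dest
    else:
        raise TypeError(f"Unsupported upload type: {type(upload).__name__}")

    dest = dest_dir / src.name
    if src.resolve() != dest.resolve():  # copying a file onto itself would raise
        shutil.copyfile(src, dest)
    return dest


saved = persist_upload(io.BytesIO(b">seq1\nMKT\n"))
print(saved.read_text())
```

Checking `.name` before `.read()` matters because temp-file objects expose both, and copying from disk avoids depending on the current read position of the handle.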
| --- | |
| ## Environment & Dependencies | |
| ### Python Environment | |
| - Conda environment: `conformal-s` (Python 3.11.10) | |
| - Key packages: faiss 1.9.0, torch 2.5.0, numpy 1.26.4 | |
| ### Missing Dependencies (Not in requirements.txt) | |
| - `pytorch-lightning` - for Protein-Vec model loading | |
| - `h5py` - for utils_search.py | |
| - `gradio` - for web interface | |
| ### NumPy Compatibility | |
| NumPy 1.22+ renamed `interpolation=` to `method=` in `np.quantile()`. Use `method=` for compatibility. | |
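If the code ever has to run on a NumPy older than 1.22 as well, a small wrapper can try the new keyword first. This is a sketch under that assumption, and `quantile_lower` is a hypothetical name rather than a function in this repo.

```python
import numpy as np


def quantile_lower(values, q):
    """np.quantile with 'lower' selection on both new and old NumPy versions."""
    try:
        # NumPy >= 1.22 spelling.
        return np.quantile(values, q, method="lower")
    except TypeError:
        # Older NumPy does not accept method=, so fall back to the old keyword.
        return np.quantile(values, q, interpolation="lower")


print(quantile_lower([1, 2, 3, 4], 0.5))
```

With the pinned numpy 1.26.4 the first branch is always taken, so calling `np.quantile(..., method="lower")` directly is enough there.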
| --- | |
| ## Gradio Interface | |
| ### Current UI Features (as of 2026-02) | |
| 1. **FASTA input**: Text area + file upload | |
| 2. **Risk control**: FDR/FNR toggle with alpha slider | |
| 3. **Match type**: Exact vs Partial Pfam matching | |
| 4. **Database selection**: UniProt, SCOPE, or Custom upload | |
| 5. **Results**: Sortable table with export (CSV/JSON) | |
| 6. **Probability calibration**: Uses pre-computed Venn-Abers data | |
| ### HuggingFace Deployment | |
| - Set `HF_DATASET_ID` environment variable for automatic data download | |
| - Uses `huggingface_hub.hf_hub_download()` for large files | |
| - Files are cached locally after first download | |
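The download path might look roughly like the sketch below. It assumes the assets live in a HuggingFace dataset repo named by `HF_DATASET_ID`. The helper name `fetch_asset` and the `data` directory are hypothetical, and this is not the project's actual asset-loading code.

```python
import os
from pathlib import Path


def fetch_asset(filename, local_dir="data"):
    """Return a local path for filename, downloading it only when necessary."""
    local_path = Path(local_dir) / filename
    if local_path.is_file():
        return local_path  # already present locally, no network access needed

    dataset_id = os.environ.get("HF_DATASET_ID")
    if not dataset_id:
        raise FileNotFoundError(
            f"{local_path} is missing and HF_DATASET_ID is not set"
        )

    # Imported lazily so that local runs do not require huggingface_hub.
    from huggingface_hub import hf_hub_download

    # hf_hub_download stores the file in the local HF cache and returns its
    # path; later calls reuse the cached copy.
    cached = hf_hub_download(repo_id=dataset_id, filename=filename, repo_type="dataset")
    return Path(cached)
```

Passing `repo_type="dataset"` is needed because `hf_hub_download` defaults to model repos.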
| ### Performance Optimizations | |
| - `LOOKUP_RESOURCE_CACHE`: Caches FAISS index + metadata by file path + mtime | |
| - `@lru_cache`: Caches threshold CSV parsing | |
| - `StageTimer`: Logs timing for each pipeline stage | |
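Keying a cache on file path plus modification time means an edited file is reloaded automatically while an unchanged one is served from memory. The sketch below illustrates that idea only: `_RESOURCE_CACHE` and `load_cached` are hypothetical stand-ins, not the actual `LOOKUP_RESOURCE_CACHE` code.

```python
import os
import tempfile

_RESOURCE_CACHE = {}  # (absolute path, mtime in ns) -> loaded resource


def load_cached(path, loader):
    """Load path with loader, reusing the result until the file's mtime changes."""
    abs_path = os.path.abspath(path)
    key = (abs_path, os.stat(abs_path).st_mtime_ns)
    if key not in _RESOURCE_CACHE:
        # Drop entries for older versions of the same file to bound memory use.
        for stale in [k for k in _RESOURCE_CACHE if k[0] == abs_path]:
            del _RESOURCE_CACHE[stale]
        _RESOURCE_CACHE[key] = loader(abs_path)
    return _RESOURCE_CACHE[key]


# Demo with a throwaway file; a real loader would read a FAISS index or metadata.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(demo_path, "w") as fh:
    fh.write("hello")


def read_text(p):
    with open(p) as fh:
        return fh.read()


first = load_cached(demo_path, read_text)
second = load_cached(demo_path, read_text)  # served from the cache
print(first, first is second)
```

Modification time is a cheap proxy for content changes. A content hash would be more robust but requires reading the whole file on every lookup.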
| --- | |
| ## Common Issues & Solutions | |
| ### Issue: "No module named 'protein_conformal'" | |
| **Solution**: Install in development mode: `pip install -e .` | |
| ### Issue: Gradio import fails | |
| **Solution**: Made gradio import optional in `__init__.py` with try/except | |
| ### Issue: FDR threshold doesn't match paper | |
| **Solution**: Check that you are using the correct calibration data (`pfam_new_proteins.npy`, NOT the backup files) | |
| ### Issue: NumPy deprecation warning for quantile | |
| **Solution**: Use `method='lower'` instead of `interpolation='lower'` | |
| ### Issue: setup.py references non-existent src/ directory | |
| **Solution**: Simplified to defer to pyproject.toml | |
| --- | |
| ## Testing | |
| ### Run Tests | |
| ```bash | |
| pytest tests/ -v # All tests | |
| pytest tests/test_util.py -v # Just util tests | |
| pytest tests/test_cli.py -v # Just CLI tests | |
| pytest tests/ --cov=protein_conformal --cov-report=html # With coverage | |
| ``` | |
| ### Test Count | |
| - 27 util tests | |
| - 24 CLI tests | |
| - All passing as of 2026-02-03 | |
| --- | |
| ## Files to Know | |
| ### Core Algorithm | |
| - `protein_conformal/util.py` - All conformal prediction algorithms | |
| ### CLI | |
| - `protein_conformal/cli.py` - `cpr embed`, `cpr search`, `cpr verify` | |
| ### Gradio | |
| - `protein_conformal/gradio_app.py` - Entry point | |
| - `protein_conformal/backend/gradio_interface.py` - Main UI logic | |
| ### Threshold Computation | |
| - `scripts/compute_fdr_table.py` - FDR thresholds | |
| - `scripts/compute_fnr_table.py` - FNR thresholds | |
| ### Verification | |
| - `scripts/verify_syn30.py` - JCVI Syn3.0 (Figure 2A) | |
| - `scripts/verify_dali.py` - DALI prefiltering | |
| - `scripts/verify_clean.py` - CLEAN enzyme | |
| --- | |
| ## HuggingFace Spaces Deployment | |
| ### Key Lesson: Optional Imports | |
| When deploying to HuggingFace Spaces, **wrap optional module imports in try/except**. The Space only installs what's in `requirements.txt`, so unused modules with extra dependencies will crash the app on import. | |
| ```python | |
| # Bad - crashes if py3Dmol not installed | |
| from .visualization import create_structure_with_heatmap | |
| # Good - gracefully handles missing deps | |
| try: | |
|     from .visualization import create_structure_with_heatmap | |
| except ImportError: | |
|     create_structure_with_heatmap = None | |
| ``` | |
| ### Gradio 4.x/5.x Breaking Changes | |
| - `gr.Dataframe(height=...)` removed - don't use height parameter | |
| - `gr.update()` removed in Gradio 5.x - use component constructors instead (e.g. `gr.File(visible=False)` not `gr.update(visible=False)`) | |
| - **`gr.JSON` crashes on Python 3.13 / HF Spaces** - `gradio_client` can't handle `additionalProperties: true` (boolean) in JSON Schema. Causes `TypeError: argument of type 'bool' is not iterable` in `get_api_info()`, breaking ALL event handlers with "No API found". **Fix**: use `gr.Code(language="json")` instead and serialize dicts with `json.dumps()`. | |
| - Use `gr.themes.Soft()` instead of custom CSS where possible | |
| - Test locally with same Gradio version as HF Spaces | |
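For the `gr.JSON` workaround, the serialization step can be kept in one small helper. This is a minimal sketch: `to_json_text` is a hypothetical name and the result dict is made-up example data.

```python
import json


def to_json_text(payload):
    """Serialize a result dict to pretty-printed JSON text for display."""
    # default=str covers values json cannot encode natively (e.g. Path, datetime).
    return json.dumps(payload, indent=2, sort_keys=True, default=str)


result = {"alpha": 0.1, "risk": "FDR", "hits": [{"query": "seq1", "score": 0.99999}]}
text = to_json_text(result)
print(text)

# In the app, this string would be the value shown by a component such as
# gr.Code(language="json"), instead of passing the dict to gr.JSON.
```

Because the component then only receives a string, the dict never enters the schema generation that triggers the `additionalProperties` error.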
| ### Requirements.txt Best Practices | |
| 1. Include ALL imports used by the main app path | |
| 2. Comment out optional deps with clear notes | |
| 3. Test locally with a fresh venv before pushing | |
| ### Dataset Integration | |
| - Set `HF_DATASET_ID` env variable in Space settings | |
| - Dataset structure must match paths in `app.py` `ensure_assets()` | |
| - Files downloaded on first run, then cached | |
| --- | |
| ## Session History | |
| ### 2026-02-05 (Gradio branch) | |
| - Confirmed gradio branch is synced with upstream/main | |
| - No remaining changes to integrate | |
| - Last 3 commits were Gradio UI improvements | |
| ### 2026-02-03 | |
| - Archived 16 redundant scripts | |
| - Consolidated threshold CSVs | |
| - Added full tables to GETTING_STARTED.md | |
| ### 2026-02-02 | |
| - Verified Syn3.0: 59/149 = 39.6% | |
| - Fixed FDR bug (1D/2D array handling) | |
| - Created CLI with embed, search, verify commands | |
| ### 2026-01-28 | |
| - Initial cleanup | |
| - Removed duplicate src/protein_conformal/ | |
| - Created pyproject.toml and test infrastructure | |