CPR Project Learnings

This document captures key learnings, gotchas, and patterns discovered while working on the Conformal Protein Retrieval project.


Git & Branch Management

Branch Structure

Syncing with Upstream

git fetch upstream
git log gradio..upstream/main --oneline  # Show commits in upstream NOT in gradio
git merge upstream/main                   # Merge upstream into current branch

Data & Thresholds

Critical: Data Leakage Warning

DO NOT USE conformal_pfam_with_lookup_dataset.npy from backup directories.

  • First 50 samples all have the same Pfam family "PF01266;"
  • Positive rate: 3.00% (vs. 0.22% in the correct dataset)
  • Produces WRONG FDR threshold

USE: pfam_new_proteins.npy from Zenodo

  • 1,864 diverse samples
  • 0.22% positive rate
  • Matches paper threshold: 0.9999802
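
A quick sanity check along these lines can flag a suspect calibration file before any threshold is computed. This is only a sketch: `positive_rate` and `distinct_families` are hypothetical helpers, and it assumes labels are available as rows of booleans and families as a list of strings, which may not match how the .npy files are actually laid out.

```python
def positive_rate(label_rows):
    """Fraction of True entries across all rows of a boolean label table."""
    total = sum(len(row) for row in label_rows)
    positives = sum(1 for row in label_rows for value in row if value)
    return positives / total if total else 0.0


def distinct_families(families, n=50):
    """Number of different family strings among the first n samples."""
    return len(set(families[:n]))


# Toy data (made up): one positive out of four pairs, and a head of
# 50 samples that all share a single family string.
toy_labels = [[True, False], [False, False]]
toy_families = ["PF01266;"] * 50

print(positive_rate(toy_labels))
print(distinct_families(toy_families))
```

A head of samples with a single family, or a positive rate far from 0.22%, would both be reasons to stop and check which file was loaded.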

Threshold Files

| File | Purpose |
| --- | --- |
| results/fdr_thresholds.csv | FDR (exact match) thresholds |
| results/fnr_thresholds.csv | FNR (exact match) thresholds |
| results/fnr_thresholds_partial.csv | FNR (partial match) thresholds |
| results/calibration_probs.csv | Venn-Abers probability calibration data |

Verified Paper Claims

| Claim | Paper Value | Verified Value |
| --- | --- | --- |
| Syn3.0 annotation (alpha=0.1) | 39.6% (59/149) | 39.6% (59/149) |
| FDR threshold (alpha=0.1) | 0.9999802250 | 0.9999801 |
| DALI TPR | 82.8% | 81.8% |
| DALI DB reduction | 31.5% | 31.5% |
| CLEAN loss <= alpha | 1.0 | 0.97 |

Code Patterns

FDR vs FNR Thresholds

  • FDR (False Discovery Rate): Higher threshold = stricter = fewer but more confident results
  • FNR (False Negative Rate): Lower threshold = more permissive = more results, fewer misses
  • For FNR: lower alpha -> lower threshold. This is the opposite of FDR, where lower alpha -> higher threshold.
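
The effect of moving the threshold can be seen on a handful of scores. This is a minimal sketch, assuming a hit is kept when its similarity is greater than or equal to the threshold; `hits_above` and all numbers below are made up for illustration and are not taken from the threshold CSVs.

```python
def hits_above(similarities, threshold):
    """Keep only the similarity scores that pass the threshold."""
    return [s for s in similarities if s >= threshold]


scores = [0.99999, 0.99998, 0.99990, 0.99950]

# A high (strict) threshold keeps few, high-confidence hits.
strict = hits_above(scores, 0.999985)

# A low (permissive) threshold keeps more hits and misses fewer.
permissive = hits_above(scores, 0.9995)

print(len(strict), len(permissive))
```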

Array Dimension Bug (Fixed)

The get_thresh_FDR() function failed on 1D arrays. Fixed by checking array dimensions:

# Dispatch on dimensionality: 1D label arrays take a separate code path.
is_1d = labels.ndim == 1
if is_1d:
    ...  # use the risk_1d function
else:
    ...  # use the standard risk function
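
An alternative to branching is to promote 1D input to 2D up front, so one code path serves both shapes. The sketch below uses that different technique (`np.atleast_2d`) and is not the project's `get_thresh_FDR()` or `risk_1d`. `empirical_fdr` is a hypothetical helper that assumes the risk is the fraction of selected pairs whose label is False.

```python
import numpy as np


def empirical_fdr(sims, labels, lam):
    """Fraction of selected pairs (sims >= lam) that are false matches.

    Accepts 1D or 2D input; 1D arrays are promoted to a single row.
    """
    sims = np.atleast_2d(np.asarray(sims, dtype=float))
    labels = np.atleast_2d(np.asarray(labels, dtype=bool))
    selected = sims >= lam
    n_selected = selected.sum()
    if n_selected == 0:
        return 0.0  # nothing selected, so no false discoveries
    return float((selected & ~labels).sum() / n_selected)


# Same function, 1D and 2D input.
print(empirical_fdr([0.9, 0.8, 0.1], [True, False, True], 0.5))
print(empirical_fdr([[0.9, 0.8], [0.1, 0.7]], [[True, False], [True, True]], 0.5))
```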

Gradio File Uploads

Gradio may pass file objects in different formats:

  • File-like objects with .read()
  • Temp files with .name attribute
  • Plain filesystem paths (when type='filepath')
  • Dicts with 'path'/'name' metadata

Handle all cases with fallback logic (see _persist_uploaded_file() in gradio_interface.py).
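
The fallback order can be sketched as follows. This is not the code of `_persist_uploaded_file()`; `upload_to_path` is a hypothetical helper, and the order of checks and the `.upload` suffix are assumptions made for illustration.

```python
import os
import tempfile


def upload_to_path(upload):
    """Return a filesystem path for an upload in any of the four shapes above."""
    if upload is None:
        raise ValueError("no file was uploaded")

    # Case 1: plain filesystem path (type='filepath').
    if isinstance(upload, (str, os.PathLike)):
        return os.fspath(upload)

    # Case 2: dict carrying 'path' or 'name' metadata.
    if isinstance(upload, dict):
        path = upload.get("path") or upload.get("name")
        if not path:
            raise ValueError("upload dict has neither 'path' nor 'name'")
        return path

    # Case 3: temp-file wrapper whose .name points at an existing file.
    name = getattr(upload, "name", None)
    if isinstance(name, str) and os.path.exists(name):
        return name

    # Case 4: file-like object; persist its contents to a temp file.
    if hasattr(upload, "read"):
        data = upload.read()
        if isinstance(data, str):
            data = data.encode("utf-8")
        with tempfile.NamedTemporaryFile(delete=False, suffix=".upload") as tmp:
            tmp.write(data)
            return tmp.name

    raise TypeError(f"unsupported upload type: {type(upload).__name__}")
```

Checking for a plain path first keeps the common `type='filepath'` case cheap. The `.read()` branch comes last because it is the only one that writes to disk.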


Environment & Dependencies

Python Environment

  • Conda environment: conformal-s (Python 3.11.10)
  • Key packages: faiss 1.9.0, torch 2.5.0, numpy 1.26.4

Missing Dependencies (Not in requirements.txt)

  • pytorch-lightning - for Protein-Vec model loading
  • h5py - for utils_search.py
  • gradio - for web interface
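
One way to spot such gaps before the app crashes is to check which import names resolve in the current environment. This is a small sketch using only the standard library; `missing_modules` is a hypothetical helper, and the names passed in are import names (`pytorch_lightning` is the import name of the `pytorch-lightning` package).

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Prints whichever of these are not installed in the active environment.
print(missing_modules(["pytorch_lightning", "h5py", "gradio"]))
```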

NumPy Compatibility

NumPy 1.22+ renamed interpolation= to method= in np.quantile(). Use method= for compatibility.
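
A short illustration of the renamed keyword, with made-up scores. It needs NumPy 1.22 or newer.

```python
import numpy as np

scores = np.array([1.0, 2.0, 3.0, 4.0])

# method='lower' returns an actual data point at or below the quantile
# position instead of interpolating between neighbours.
q_lower = np.quantile(scores, 0.5, method="lower")

# The default (method='linear') interpolates between 2.0 and 3.0.
q_linear = np.quantile(scores, 0.5)

print(q_lower, q_linear)
```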


Gradio Interface

Current UI Features (as of 2026-02)

  1. FASTA input: Text area + file upload
  2. Risk control: FDR/FNR toggle with alpha slider
  3. Match type: Exact vs Partial Pfam matching
  4. Database selection: UniProt, SCOPE, or Custom upload
  5. Results: Sortable table with export (CSV/JSON)
  6. Probability calibration: Uses pre-computed Venn-Abers data

HuggingFace Deployment

  • Set HF_DATASET_ID environment variable for automatic data download
  • Uses huggingface_hub.hf_hub_download() for large files
  • Files are cached locally after first download
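
A minimal sketch of the download step, assuming `HF_DATASET_ID` names a dataset repository on the Hub. `fetch_dataset_file` is a hypothetical helper and not the app's own asset-loading code.

```python
import os


def fetch_dataset_file(filename):
    """Download one file from the dataset named by HF_DATASET_ID.

    Returns the local path. huggingface_hub keeps a local cache, so
    later calls for the same file do not download it again.
    """
    repo_id = os.environ.get("HF_DATASET_ID")
    if not repo_id:
        raise RuntimeError("HF_DATASET_ID is not set")

    # Imported lazily so that importing this module does not require the package.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename, repo_type="dataset")
```

Failing early when the variable is unset gives a clearer error than a failed download with an empty repository ID.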

Performance Optimizations

  • LOOKUP_RESOURCE_CACHE: Caches FAISS index + metadata by file path + mtime
  • @lru_cache: Caches threshold CSV parsing
  • StageTimer: Logs timing for each pipeline stage
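
The path + mtime keying idea can be sketched in a few lines. This is not the project's `LOOKUP_RESOURCE_CACHE`; `load_cached`, `_CACHE`, and the `loader` callback are hypothetical names, and the sketch never evicts stale entries.

```python
import os

_CACHE = {}


def load_cached(path, loader):
    """Load a resource once per (absolute path, modification time).

    If the file changes on disk its mtime changes, the key no longer
    matches, and the loader runs again.
    """
    key = (os.path.abspath(path), os.path.getmtime(path))
    if key not in _CACHE:
        _CACHE[key] = loader(path)
    return _CACHE[key]
```

Including the mtime in the key means a replaced index file is picked up without restarting the app. The trade-off is one `stat` call on every lookup.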

Common Issues & Solutions

Issue: "No module named 'protein_conformal'"

Solution: Install in development mode: pip install -e .

Issue: Gradio import fails

Solution: Made gradio import optional in __init__.py with try/except

Issue: FDR threshold doesn't match paper

Solution: Check that you are using the correct calibration data (pfam_new_proteins.npy, NOT backup files)

Issue: NumPy deprecation warning for quantile

Solution: Use method='lower' instead of interpolation='lower'

Issue: setup.py references non-existent src/ directory

Solution: Simplified setup.py to defer to pyproject.toml


Testing

Run Tests

pytest tests/ -v                    # All tests
pytest tests/test_util.py -v        # Just util tests
pytest tests/test_cli.py -v         # Just CLI tests
pytest tests/ --cov=protein_conformal --cov-report=html  # With coverage

Test Count

  • 27 util tests
  • 24 CLI tests
  • All passing as of 2026-02-03

Files to Know

Core Algorithm

  • protein_conformal/util.py - All conformal prediction algorithms

CLI

  • protein_conformal/cli.py - cpr embed, cpr search, cpr verify

Gradio

  • protein_conformal/gradio_app.py - Entry point
  • protein_conformal/backend/gradio_interface.py - Main UI logic

Threshold Computation

  • scripts/compute_fdr_table.py - FDR thresholds
  • scripts/compute_fnr_table.py - FNR thresholds

Verification

  • scripts/verify_syn30.py - JCVI Syn3.0 (Figure 2A)
  • scripts/verify_dali.py - DALI prefiltering
  • scripts/verify_clean.py - CLEAN enzyme

HuggingFace Spaces Deployment

Key Lesson: Optional Imports

When deploying to HuggingFace Spaces, wrap optional module imports in try/except. The Space only installs what's in requirements.txt, so unused modules with extra dependencies will crash the app on import.

# Bad - crashes if py3Dmol not installed
from .visualization import create_structure_with_heatmap

# Good - gracefully handles missing deps
try:
    from .visualization import create_structure_with_heatmap
except ImportError:
    create_structure_with_heatmap = None
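
Once an import can resolve to None, every call site has to check for it. The sketch below shows the same pattern with the standard library only; `optional_attr` is a hypothetical helper, and `json` stands in for a module that may or may not be installed.

```python
import importlib


def optional_attr(module_name, attr_name):
    """Return module.attr, or None if the module cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, attr_name, None)


dumps = optional_attr("json", "dumps")

# Guard at the call site: skip the feature instead of crashing.
if dumps is not None:
    print(dumps({"ok": True}))
else:
    print("feature unavailable: optional dependency missing")
```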

Gradio 4.x/5.x Breaking Changes

  • gr.Dataframe(height=...) removed - don't use height parameter
  • gr.update() removed in Gradio 5.x - use component constructors instead (e.g. gr.File(visible=False) not gr.update(visible=False))
  • gr.JSON crashes on Python 3.13 / HF Spaces - gradio_client can't handle additionalProperties: true (boolean) in JSON Schema. Causes TypeError: argument of type 'bool' is not iterable in get_api_info(), breaking ALL event handlers with "No API found". Fix: use gr.Code(language="json") instead and serialize dicts with json.dumps().
  • Use gr.themes.Soft() instead of custom CSS where possible
  • Test locally with same Gradio version as HF Spaces
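
For the gr.JSON workaround, the serialization side might look like this. `to_json_text` is a hypothetical helper; `indent`, `sort_keys`, and `default=str` are illustrative choices, not something the app is known to use.

```python
import json


def to_json_text(payload):
    """Serialize a result dict to a JSON string for display.

    default=str is a fallback for values json cannot encode natively
    (for example pathlib.Path or datetime objects).
    """
    return json.dumps(payload, indent=2, sort_keys=True, default=str)


# Assumed wiring in the UI: declare the output as gr.Code(language="json")
# and have the event handler return to_json_text(result) instead of a dict.
print(to_json_text({"alpha": 0.1, "hits": 3}))
```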

Requirements.txt Best Practices

  1. Include ALL imports used by the main app path
  2. Comment out optional deps with clear notes
  3. Test locally with a fresh venv before pushing

Dataset Integration

  • Set HF_DATASET_ID env variable in Space settings
  • Dataset structure must match paths in app.py ensure_assets()
  • Files downloaded on first run, then cached

Session History

2026-02-05 (Gradio branch)

  • Confirmed gradio branch is synced with upstream/main
  • No remaining changes to integrate
  • Last 3 commits were Gradio UI improvements

2026-02-03

  • Archived 16 redundant scripts
  • Consolidated threshold CSVs
  • Added full tables to GETTING_STARTED.md

2026-02-02

  • Verified Syn3.0: 59/149 = 39.6%
  • Fixed FDR bug (1D/2D array handling)
  • Created CLI with embed, search, verify commands

2026-01-28

  • Initial cleanup
  • Removed duplicate src/protein_conformal/
  • Created pyproject.toml and test infrastructure