# Installation Guide
This guide covers how to install Conformal Protein Retrieval (CPR) and download the required data files.
## Prerequisites

- Python 3.9 or higher
- ~15 GB of disk space for the full dataset
- GPU recommended for embedding (CPU also works)
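If you want to check these requirements programmatically, here is a minimal sketch (the function name and thresholds are illustrative, not part of the package):

```python
import shutil
import sys

def check_prerequisites(min_python=(3, 9), min_free_gb=15, path="."):
    """Check the Python version and free disk space against the requirements above."""
    python_ok = sys.version_info[:2] >= min_python
    free_gb = shutil.disk_usage(path).free / 1e9  # bytes -> GB
    return python_ok, free_gb >= min_free_gb

python_ok, disk_ok = check_prerequisites()
print(f"Python OK: {python_ok}, disk OK: {disk_ok}")
```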
## Quick Install

```bash
# Clone the repository
git clone https://github.com/ronboger/conformal-protein-retrieval.git
cd conformal-protein-retrieval

# Install the package
pip install -e .

# Or with GUI support
pip install -e ".[gui]"

# Or with all optional dependencies
pip install -e ".[all]"
```
## Conda Environment (Recommended)

```bash
# Create the environment from file
conda env create -f environment.yml
conda activate cpr

# Install the package
pip install -e .
```
## Docker

```bash
# Build the image
docker build -t cpr .

# Run with the GUI
docker run -p 7860:7860 cpr python -m protein_conformal.gradio_app
```
## Downloading Data
All data files are hosted on Zenodo: https://zenodo.org/records/14272215
### Required Files (Minimum)

For basic FDR/FNR-controlled search against Pfam:

| File | Size | Download |
|---|---|---|
| `pfam_new_proteins.npy` | 2.5 GB | Download |
### For UniProt Search

| File | Size | Download |
|---|---|---|
| `lookup_embeddings.npy` | 1.1 GB | Download |
| `lookup_embeddings_meta_data.tsv` | 560 MB | Download |
### For AlphaFold DB Search

| File | Size | Download |
|---|---|---|
| `afdb_embeddings_protein_vec.npy` | 4.7 GB | Download |
| `AFDB_sequences.fasta` | 671 MB | Download |
### Supplementary Data

| File | Size | Description |
|---|---|---|
| `scope_supplement.zip` | 800 MB | SCOPe hierarchical risk data |
| `ec_supplement.zip` | 199 MB | EC number classification data |
| `clean_selection.zip` | 1.6 GB | Improved enzyme classification data |
### Download Script

```bash
# Create data directory
mkdir -p data

# Download minimum required files
cd data

# Pfam calibration data (required for FDR/FNR control)
wget https://zenodo.org/records/14272215/files/pfam_new_proteins.npy

# UniProt lookup database (for general protein search)
wget https://zenodo.org/records/14272215/files/lookup_embeddings.npy
wget https://zenodo.org/records/14272215/files/lookup_embeddings_meta_data.tsv
```
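As an alternative to `wget`, the same downloads can be scripted with only the Python standard library. This sketch (the helper name is illustrative) skips files that are already present, so it is safe to re-run:

```python
from pathlib import Path
from urllib.request import urlretrieve

ZENODO_BASE = "https://zenodo.org/records/14272215/files"
REQUIRED = [
    "pfam_new_proteins.npy",
    "lookup_embeddings.npy",
    "lookup_embeddings_meta_data.tsv",
]

def download_missing(dest="data", files=REQUIRED, base=ZENODO_BASE):
    """Download any of `files` not already present in `dest`."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    for name in files:
        target = dest / name
        if target.exists():
            continue  # already downloaded; skip
        print(f"Downloading {name} ...")
        urlretrieve(f"{base}/{name}", target)
```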
## Protein-Vec Model Weights

To generate embeddings for new proteins, you need the Protein-Vec model weights.

### Option 1: Download Pre-trained Weights

TODO: Add download link for Protein-Vec weights
The model files should be placed in `protein_vec_models/`:

```
protein_vec_models/
├── protein_vec.ckpt          # Model checkpoint
├── protein_vec_params.json   # Model configuration
├── model_protein_moe.py      # Model definition
└── utils_search.py           # Utility functions
```
### Option 2: Use Pre-computed Embeddings
If you only need to search against existing databases (UniProt, AFDB), you can skip the embedding step and use the pre-computed embeddings from Zenodo.
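Once you have an embedding matrix (pre-computed or your own), nearest-neighbor retrieval reduces to a similarity search. Here is a self-contained sketch using a small synthetic stand-in for `lookup_embeddings.npy` (the real file is ~1.1 GB; the 512-dimensional size here is illustrative):

```python
import numpy as np

# Synthetic stand-in for data/lookup_embeddings.npy
rng = np.random.default_rng(0)
lookup = rng.standard_normal((1000, 512)).astype(np.float32)

# A query close to row 42 of the lookup set
query = lookup[42] + 0.01 * rng.standard_normal(512).astype(np.float32)

# Cosine similarity against every lookup embedding
lookup_n = lookup / np.linalg.norm(lookup, axis=1, keepdims=True)
query_n = query / np.linalg.norm(query)
sims = lookup_n @ query_n

top5 = np.argsort(sims)[::-1][:5]
print(top5[0])  # -> 42 (the perturbed source row ranks first)
```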
## Verifying Installation

```bash
# Check that the package is installed
python -c "import protein_conformal; print('OK')"

# Run the test suite
pip install pytest
pytest tests/ -v

# Launch the GUI (if installed with [gui])
python -m protein_conformal.gradio_app
```
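A pure-Python variant of the import check that reports several dependencies at once without raising on a missing one (the module list is illustrative):

```python
import importlib.util

for mod in ["protein_conformal", "numpy", "faiss"]:
    found = importlib.util.find_spec(mod) is not None
    print(f"{mod}: {'found' if found else 'MISSING'}")
```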
## Directory Structure

After downloading, your directory should look like:

```
conformal-protein-retrieval/
├── data/
│   ├── pfam_new_proteins.npy            # Calibration data
│   ├── lookup_embeddings.npy            # UniProt embeddings
│   └── lookup_embeddings_meta_data.tsv
├── protein_vec_models/                  # Model weights (if embedding)
│   ├── protein_vec.ckpt
│   └── protein_vec_params.json
├── protein_conformal/                   # Source code
└── ...
```
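To confirm the layout programmatically, a small helper (the file list follows the tree above; the function itself is not part of the package):

```python
from pathlib import Path

EXPECTED = [
    "data/pfam_new_proteins.npy",
    "data/lookup_embeddings.npy",
    "data/lookup_embeddings_meta_data.tsv",
]

def missing_files(root="."):
    """Return the expected data files that are absent under `root`."""
    root = Path(root)
    return [p for p in EXPECTED if not (root / p).exists()]

print(missing_files())  # [] when everything is in place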
## Troubleshooting

### FAISS Installation Issues

If you encounter issues with `faiss-cpu`:

```bash
# Try conda instead of pip
conda install -c pytorch faiss-cpu

# Or for GPU support
conda install -c pytorch faiss-gpu
```
### Memory Issues

The calibration data (`pfam_new_proteins.npy`) is large. If you run into memory issues:

- Use a machine with at least 8 GB RAM
- Consider using memory-mapped arrays:

```python
data = np.load('pfam_new_proteins.npy', mmap_mode='r', allow_pickle=True)
```
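To illustrate why `mmap_mode='r'` helps, here is a self-contained demo on a small synthetic `.npy` file: slicing a memory-mapped array only reads the touched rows from disk. (One caveat: memory mapping works for plain numeric arrays; an array of Python objects that actually requires `allow_pickle=True` cannot be memory-mapped.)

```python
import os
import tempfile
import numpy as np

# Small synthetic stand-in for the real 2.5 GB file
path = os.path.join(tempfile.mkdtemp(), "demo.npy")
np.save(path, np.arange(1_000_000, dtype=np.float32).reshape(1000, 1000))

data = np.load(path, mmap_mode="r")  # maps the file; no bulk read yet
row = np.asarray(data[10])           # only this row is read from disk
print(row[:3])  # -> [10000. 10001. 10002.]
```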
### PyTorch/Transformers Issues

For embedding, ensure compatible versions (quote the specifiers so the shell does not treat `>` as a redirect):

```bash
pip install "torch>=2.0.0" "transformers>=4.30.0"
```
## Next Steps
- See Quick Start for usage examples
- See API Reference for programmatic use
- See the notebooks/ directory for detailed analysis examples