# Installation Guide
This guide covers how to install Conformal Protein Retrieval (CPR) and download the required data files.
## Prerequisites
- Python 3.9 or higher
- ~15 GB disk space for full dataset
- GPU recommended for embedding (but CPU works)
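The first two prerequisites can be checked before installing anything. The following is a minimal sketch using only the standard library; the helper name `check_prerequisites` is hypothetical and not part of CPR, and the 15 GB threshold is the approximate figure quoted above.

```python
import shutil
import sys

MIN_PYTHON = (3, 9)   # minimum version listed in the prerequisites
REQUIRED_GB = 15      # approximate size of the full dataset


def check_prerequisites(path=".", required_gb=REQUIRED_GB):
    """Return a list of human-readable problems; empty if none were found.

    Hypothetical helper for illustration, not part of the CPR package.
    """
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    # Free space on the filesystem that contains `path`, in decimal GB.
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < required_gb:
        problems.append(
            f"Only {free_gb:.1f} GB free at {path!r}; about {required_gb} GB needed"
        )
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

If you only plan to download the minimum files, pass a smaller `required_gb` value.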
## Quick Install
```bash
# Clone the repository
git clone https://github.com/ronboger/conformal-protein-retrieval.git
cd conformal-protein-retrieval
# Install the package
pip install -e .
# Or with GUI support
pip install -e ".[gui]"
# Or with all optional dependencies
pip install -e ".[all]"
```
## Conda Environment (Recommended)
```bash
# Create environment from file
conda env create -f environment.yml
conda activate cpr
# Install the package
pip install -e .
```
## Docker
```bash
# Build the image
docker build -t cpr .
# Run with GUI
docker run -p 7860:7860 cpr python -m protein_conformal.gradio_app
```
---
## Downloading Data
All data files are hosted on Zenodo: https://zenodo.org/records/14272215
### Required Files (Minimum)
For basic FDR/FNR-controlled search against Pfam:
| File | Size | Download |
|------|------|----------|
| `pfam_new_proteins.npy` | 2.5 GB | [Download](https://zenodo.org/records/14272215/files/pfam_new_proteins.npy) |
### For UniProt Search
| File | Size | Download |
|------|------|----------|
| `lookup_embeddings.npy` | 1.1 GB | [Download](https://zenodo.org/records/14272215/files/lookup_embeddings.npy) |
| `lookup_embeddings_meta_data.tsv` | 560 MB | [Download](https://zenodo.org/records/14272215/files/lookup_embeddings_meta_data.tsv) |
### For AlphaFold DB Search
| File | Size | Download |
|------|------|----------|
| `afdb_embeddings_protein_vec.npy` | 4.7 GB | [Download](https://zenodo.org/records/14272215/files/afdb_embeddings_protein_vec.npy) |
| `AFDB_sequences.fasta` | 671 MB | [Download](https://zenodo.org/records/14272215/files/AFDB_sequences.fasta) |
### Supplementary Data
| File | Size | Description |
|------|------|-------------|
| `scope_supplement.zip` | 800 MB | SCOPe hierarchical risk data |
| `ec_supplement.zip` | 199 MB | EC number classification data |
| `clean_selection.zip` | 1.6 GB | Improved enzyme classification data |
### Download Script
```bash
# Create data directory
mkdir -p data
# Download minimum required files
cd data
# Pfam calibration data (required for FDR/FNR control)
wget https://zenodo.org/records/14272215/files/pfam_new_proteins.npy
# UniProt lookup database (for general protein search)
wget https://zenodo.org/records/14272215/files/lookup_embeddings.npy
wget https://zenodo.org/records/14272215/files/lookup_embeddings_meta_data.tsv
```
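If `wget` is not available, the same downloads can be sketched in Python with the standard library. The file names and base URL come from the tables above; the function names, the `FILE_SETS` grouping, and the `.part` temporary suffix are assumptions made for this example. The sketch skips files that already exist but does not verify their integrity.

```python
import urllib.request
from pathlib import Path

ZENODO_BASE = "https://zenodo.org/records/14272215/files"

# Files from the tables above, grouped by use case (grouping is illustrative).
FILE_SETS = {
    "pfam": ["pfam_new_proteins.npy"],
    "uniprot": ["lookup_embeddings.npy", "lookup_embeddings_meta_data.tsv"],
    "afdb": ["afdb_embeddings_protein_vec.npy", "AFDB_sequences.fasta"],
}


def plan_downloads(sets, data_dir="data"):
    """Return (url, destination) pairs for files not already on disk."""
    data_dir = Path(data_dir)
    plan = []
    for name in sets:
        for filename in FILE_SETS[name]:
            dest = data_dir / filename
            if not dest.exists():
                plan.append((f"{ZENODO_BASE}/{filename}", dest))
    return plan


def download(sets=("pfam", "uniprot"), data_dir="data"):
    """Download the selected file sets, skipping files that already exist."""
    Path(data_dir).mkdir(parents=True, exist_ok=True)
    for url, dest in plan_downloads(sets, data_dir):
        print(f"Downloading {url} -> {dest}")
        # Write to a temporary name first so an interrupted download
        # is not mistaken for a complete file on the next run.
        tmp = dest.with_name(dest.name + ".part")
        urllib.request.urlretrieve(url, tmp)
        tmp.replace(dest)
```

Calling `download()` fetches the same three files as the shell script; `download(sets=("afdb",))` would fetch the AlphaFold DB files instead.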
---
## Protein-Vec Model Weights
To generate embeddings for new proteins, you need the Protein-Vec model weights.
### Option 1: Download Pre-trained Weights
**TODO**: Add download link for Protein-Vec weights
The model files should be placed in `protein_vec_models/`:
```
protein_vec_models/
├── protein_vec.ckpt              # Model checkpoint
├── protein_vec_params.json       # Model configuration
├── model_protein_moe.py          # Model definition
└── utils_search.py               # Utility functions
```
### Option 2: Use Pre-computed Embeddings
If you only need to search against existing databases (UniProt, AFDB), you can skip the embedding step and use the pre-computed embeddings from Zenodo.
---
## Verifying Installation
```bash
# Check that the package is installed
python -c "import protein_conformal; print('OK')"
# Run the test suite
pip install pytest
pytest tests/ -v
# Launch the GUI (if installed with [gui])
python -m protein_conformal.gradio_app
```
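The import check above can be extended to the optional dependencies without importing them. This is a small sketch, assuming the module names `faiss` and `gradio` (inferred from the FAISS troubleshooting section and the `gradio_app` module name); `is_importable` is a hypothetical helper, not a CPR function.

```python
import importlib.util


def is_importable(module_name):
    """Return True if `module_name` can be located without importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package of a dotted name is missing.
        return False


# Module names other than protein_conformal are assumptions for this example.
for module in ("protein_conformal", "faiss", "gradio"):
    status = "found" if is_importable(module) else "MISSING"
    print(f"{module:20s} {status}")
```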
---
## Directory Structure
After downloading, your directory should look like:
```
conformal-protein-retrieval/
├── data/
│   ├── pfam_new_proteins.npy           # Calibration data
│   ├── lookup_embeddings.npy           # UniProt embeddings
│   └── lookup_embeddings_meta_data.tsv
├── protein_vec_models/                 # Model weights (if embedding)
│   ├── protein_vec.ckpt
│   └── protein_vec_params.json
├── protein_conformal/                  # Source code
└── ...
```
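A quick way to compare your checkout against this layout is to test for each expected file. The sketch below assumes it is run from the repository root; `check_layout` and the split into required and optional files are illustrative choices, with the model weights treated as optional because they are only needed for embedding.

```python
from pathlib import Path

# Paths taken from the directory tree above, relative to the repository root.
REQUIRED = [
    "data/pfam_new_proteins.npy",
    "data/lookup_embeddings.npy",
    "data/lookup_embeddings_meta_data.tsv",
]
OPTIONAL = [  # only needed when embedding new proteins
    "protein_vec_models/protein_vec.ckpt",
    "protein_vec_models/protein_vec_params.json",
]


def check_layout(root="."):
    """Report which expected files are missing under `root`."""
    root = Path(root)
    return {
        "missing_required": [p for p in REQUIRED if not (root / p).is_file()],
        "missing_optional": [p for p in OPTIONAL if not (root / p).is_file()],
    }


if __name__ == "__main__":
    for group, paths in check_layout().items():
        for path in paths:
            print(f"{group}: {path}")
```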
---
## Troubleshooting
### FAISS Installation Issues
If you encounter issues with `faiss-cpu`:
```bash
# Try conda instead of pip
conda install -c pytorch faiss-cpu
# Or for GPU support
conda install -c pytorch faiss-gpu
```
### Memory Issues
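The calibration data (`pfam_new_proteins.npy`) is large (2.5 GB). If you run into memory issues:
1. Use a machine with at least 8 GB RAM
2. Consider using memory-mapped arrays. This only works when the file stores a plain numeric array; NumPy raises a `ValueError` when asked to memory-map an object array:
```python
import numpy as np
data = np.load('pfam_new_proteins.npy', mmap_mode='r', allow_pickle=True)
```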
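Because memory mapping fails for object arrays, a loader can try it first and fall back to a full load. The sketch below demonstrates this on small synthetic files rather than the real calibration data, whose internal format is not described here; `load_lazily` is a hypothetical helper. The fallback uses `allow_pickle=True`, which should only be used on files you trust.

```python
import os
import tempfile

import numpy as np


def load_lazily(path):
    """Memory-map a .npy file if possible, otherwise load it in full."""
    try:
        return np.load(path, mmap_mode="r")
    except ValueError:
        # Object arrays cannot be memory-mapped; load them into RAM instead.
        return np.load(path, allow_pickle=True)


with tempfile.TemporaryDirectory() as tmp:
    numeric_path = os.path.join(tmp, "numeric.npy")
    np.save(numeric_path, np.arange(12, dtype=np.float32).reshape(3, 4))

    arr = load_lazily(numeric_path)
    print(type(arr).__name__, arr.shape)
    # Release the mapping before the temporary directory is removed.
    del arr
```

With a memory-mapped array, only the slices you index are read from disk, so peak memory stays well below the file size.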
### PyTorch/Transformers Issues
For embedding, ensure compatible versions:
```bash
# Quote the specifiers so the shell does not treat ">" as output redirection
pip install "torch>=2.0.0" "transformers>=4.30.0"
```
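To see which versions are currently installed, you can query package metadata without importing the libraries. This is a rough sketch: the minimum versions are the ones given above, while `parse_version` and `check_versions` are hypothetical helpers that compare only the leading numeric components. For strict version semantics, the `packaging` library is a more robust choice.

```python
import re
from importlib import metadata

# Minimum versions from the pip command above.
MINIMUM_VERSIONS = {"torch": (2, 0, 0), "transformers": (4, 30, 0)}


def parse_version(text):
    """Turn '2.1.0+cu121' into (2, 1, 0), ignoring any non-numeric suffix."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad short versions such as '2.0' so tuple comparison behaves as expected.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_versions(minimums=MINIMUM_VERSIONS):
    """Map each package name to a short status string."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        ok = parse_version(installed) >= minimum
        report[name] = f"{installed} ({'ok' if ok else 'too old'})"
    return report


if __name__ == "__main__":
    for name, status in check_versions().items():
        print(f"{name}: {status}")
```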
---
## Next Steps
- See [Quick Start](quickstart.md) for usage examples
- See [API Reference](api.md) for programmatic use
- See the [notebooks/](../notebooks/) directory for detailed analysis examples