# Installation Guide

This guide covers how to install Conformal Protein Retrieval (CPR) and download the required data files.

## Prerequisites

- Python 3.9 or higher
- ~15 GB disk space for full dataset
- GPU recommended for computing embeddings (CPU also works, but more slowly)
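
The prerequisites above can be checked up front. The following is a minimal sketch, assuming the data will live on the same filesystem as the current directory; the function name `check_prerequisites` and its defaults are illustrative and not part of CPR.

```python
import shutil
import sys


def check_prerequisites(path=".", min_python=(3, 9), min_free_gb=15.0):
    """Return (python_ok, disk_ok, free_gb) for the filesystem holding `path`."""
    python_ok = sys.version_info[:2] >= min_python
    free_gb = shutil.disk_usage(path).free / 1e9  # decimal gigabytes
    return python_ok, free_gb >= min_free_gb, free_gb


if __name__ == "__main__":
    python_ok, disk_ok, free_gb = check_prerequisites()
    print(f"Python >= 3.9: {python_ok}")
    print(f"Free disk: {free_gb:.1f} GB (enough for the full dataset: {disk_ok})")
```

If only the minimum files are needed, a smaller `min_free_gb` is enough; the file sizes are listed under "Downloading Data".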

## Quick Install

```bash
# Clone the repository
git clone https://github.com/ronboger/conformal-protein-retrieval.git
cd conformal-protein-retrieval

# Install the package
pip install -e .

# Or with GUI support
pip install -e ".[gui]"

# Or with all optional dependencies
pip install -e ".[all]"
```

## Conda Environment (Recommended)

```bash
# Create environment from file
conda env create -f environment.yml
conda activate cpr

# Install the package
pip install -e .
```

## Docker

```bash
# Build the image
docker build -t cpr .

# Run with GUI
docker run -p 7860:7860 cpr python -m protein_conformal.gradio_app
```

---

## Downloading Data

All data files are hosted on Zenodo: https://zenodo.org/records/14272215

### Required Files (Minimum)

For basic FDR/FNR-controlled search against Pfam:

| File | Size | Download |
|------|------|----------|
| `pfam_new_proteins.npy` | 2.5 GB | [Download](https://zenodo.org/records/14272215/files/pfam_new_proteins.npy) |

### For UniProt Search

| File | Size | Download |
|------|------|----------|
| `lookup_embeddings.npy` | 1.1 GB | [Download](https://zenodo.org/records/14272215/files/lookup_embeddings.npy) |
| `lookup_embeddings_meta_data.tsv` | 560 MB | [Download](https://zenodo.org/records/14272215/files/lookup_embeddings_meta_data.tsv) |

### For AlphaFold DB Search

| File | Size | Download |
|------|------|----------|
| `afdb_embeddings_protein_vec.npy` | 4.7 GB | [Download](https://zenodo.org/records/14272215/files/afdb_embeddings_protein_vec.npy) |
| `AFDB_sequences.fasta` | 671 MB | [Download](https://zenodo.org/records/14272215/files/AFDB_sequences.fasta) |

### Supplementary Data

| File | Size | Description |
|------|------|-------------|
| `scope_supplement.zip` | 800 MB | SCOPe hierarchical risk data |
| `ec_supplement.zip` | 199 MB | EC number classification data |
| `clean_selection.zip` | 1.6 GB | Improved enzyme classification data |

### Download Script

```bash
# Create data directory
mkdir -p data

# Download the Pfam calibration data and the UniProt lookup database
cd data

# Pfam calibration data (required for FDR/FNR control)
wget https://zenodo.org/records/14272215/files/pfam_new_proteins.npy

# UniProt lookup database (for general protein search)
wget https://zenodo.org/records/14272215/files/lookup_embeddings.npy
wget https://zenodo.org/records/14272215/files/lookup_embeddings_meta_data.tsv
```
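
The same downloads can be scripted in Python, for example on systems without `wget`. This is a sketch using only the standard library; the helper names (`zenodo_url`, `download_missing`) are illustrative, and the URL pattern is simply the one used in the commands above.

```python
import urllib.request
from pathlib import Path

ZENODO_FILES = "https://zenodo.org/records/14272215/files"

# File names taken from the tables above.
MINIMUM_FILES = ["pfam_new_proteins.npy"]
UNIPROT_FILES = ["lookup_embeddings.npy", "lookup_embeddings_meta_data.tsv"]


def zenodo_url(filename):
    """Build the download URL for a file in the Zenodo record."""
    return f"{ZENODO_FILES}/{filename}"


def download_missing(filenames, data_dir="data", fetch=urllib.request.urlretrieve):
    """Download each file into data_dir unless it is already present.

    Returns the list of file names that were actually fetched.
    """
    target = Path(data_dir)
    target.mkdir(parents=True, exist_ok=True)
    fetched = []
    for name in filenames:
        dest = target / name
        if dest.exists():
            continue  # already downloaded in an earlier run
        tmp = dest.with_name(dest.name + ".part")
        fetch(zenodo_url(name), str(tmp))  # download under a temporary name
        tmp.replace(dest)                  # rename only once the download finished
        fetched.append(name)
    return fetched
```

Calling `download_missing(MINIMUM_FILES + UNIPROT_FILES)` from the repository root would fetch the same three files as the shell commands. Writing to a `.part` file first means an interrupted download is not mistaken for a complete one on the next run.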

---

## Protein-Vec Model Weights

To generate embeddings for new proteins, you need the Protein-Vec model weights.

### Option 1: Download Pre-trained Weights

**TODO**: Add download link for Protein-Vec weights

The model files should be placed in `protein_vec_models/`:
```
protein_vec_models/
├── protein_vec.ckpt           # Model checkpoint
├── protein_vec_params.json    # Model configuration
├── model_protein_moe.py       # Model definition
└── utils_search.py            # Utility functions
```

### Option 2: Use Pre-computed Embeddings

If you only need to search against existing databases (UniProt, AFDB), you can skip the embedding step and use the pre-computed embeddings from Zenodo.

---

## Verifying Installation

```bash
# Check that the package is installed
python -c "import protein_conformal; print('OK')"

# Run the test suite
pip install pytest
pytest tests/ -v

# Launch the GUI (if installed with [gui])
python -m protein_conformal.gradio_app
```
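
For a slightly more informative check than the one-line import, the sketch below reports which modules can be located without importing them. `protein_conformal` comes from this guide; probing `faiss`, `torch`, and `gradio` is an assumption about what a given install needs, so adjust the list to your setup.

```python
import importlib.util


def find_modules(names):
    """Map each top-level module name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # The module list is an assumption; edit it to match your install.
    modules = ["protein_conformal", "faiss", "torch", "gradio"]
    for name, found in find_modules(modules).items():
        print(f"{name:20s} {'found' if found else 'MISSING'}")
```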

---

## Directory Structure

After downloading, your directory should look like:

```
conformal-protein-retrieval/
├── data/
│   ├── pfam_new_proteins.npy          # Calibration data
│   ├── lookup_embeddings.npy          # UniProt embeddings
│   └── lookup_embeddings_meta_data.tsv
├── protein_vec_models/                 # Model weights (if embedding)
│   ├── protein_vec.ckpt
│   └── protein_vec_params.json
├── protein_conformal/                  # Source code
└── ...
```
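
A quick way to compare a checkout against this layout is to list the expected paths that are absent. The sketch below hard-codes the paths from the tree above; `missing_paths` is an illustrative helper, not part of CPR.

```python
from pathlib import Path

# Paths copied from the tree above, relative to the repository root.
DATA_FILES = [
    "data/pfam_new_proteins.npy",
    "data/lookup_embeddings.npy",
    "data/lookup_embeddings_meta_data.tsv",
]
MODEL_FILES = [  # only needed when embedding new proteins
    "protein_vec_models/protein_vec.ckpt",
    "protein_vec_models/protein_vec_params.json",
]


def missing_paths(root, relative_paths):
    """Return the entries of relative_paths that do not exist under root."""
    root = Path(root)
    return [p for p in relative_paths if not (root / p).exists()]


if __name__ == "__main__":
    for label, paths in [("data", DATA_FILES), ("model", MODEL_FILES)]:
        missing = missing_paths(".", paths)
        print(f"{label}: {'all present' if not missing else 'missing ' + ', '.join(missing)}")
```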

---

## Troubleshooting

### FAISS Installation Issues

If you encounter issues with `faiss-cpu`:

```bash
# Try conda instead of pip
conda install -c pytorch faiss-cpu

# Or for GPU support
conda install -c pytorch faiss-gpu
```

### Memory Issues

The calibration data (`pfam_new_proteins.npy`) is large. If you run into memory issues:

1. Use a machine with at least 8 GB RAM
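2. Consider using memory-mapped arrays, which read data from disk on demand instead of loading the whole file. Memory mapping only works for plain numeric arrays; if the file stores pickled Python objects, NumPy raises a `ValueError` and the file must be loaded without `mmap_mode`:
   ```python
   import numpy as np

   data = np.load('pfam_new_proteins.npy', mmap_mode='r', allow_pickle=True)
   ```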
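
Before loading a large `.npy` file, it can help to read only its header to learn the shape and dtype, and from those estimate how much RAM a full load needs. The sketch below parses the header with the standard library, following the documented NPY layout (magic string, two version bytes, header length, then a Python-literal dictionary). `read_npy_header` is an illustrative helper, not part of CPR or NumPy.

```python
import ast
import struct


def read_npy_header(path):
    """Return the header dict (descr, fortran_order, shape) of a .npy file."""
    with open(path, "rb") as f:
        if f.read(6) != b"\x93NUMPY":
            raise ValueError("not a .npy file")
        major, _minor = f.read(2)
        # Version 1.0 stores the header length in 2 bytes, later versions in 4.
        if major == 1:
            (length,) = struct.unpack("<H", f.read(2))
        else:
            (length,) = struct.unpack("<I", f.read(4))
        # Version 3.0 headers are UTF-8; earlier versions are Latin-1.
        encoding = "utf-8" if major >= 3 else "latin1"
        return ast.literal_eval(f.read(length).decode(encoding))
```

For example, `read_npy_header('pfam_new_proteins.npy')` returns a dictionary whose `shape` and `descr` entries describe the array. A `descr` of `|O` indicates Python objects, in which case memory mapping is not available.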

### PyTorch/Transformers Issues

For embedding, ensure compatible versions:

```bash
# Quote the specifiers so the shell does not treat ">" as output redirection
pip install "torch>=2.0.0" "transformers>=4.30.0"
```
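
To see which versions are actually installed, the standard library's `importlib.metadata` can be queried. This is a rough sketch: `meets_minimum` is an illustrative helper that compares only the leading numeric components and ignores pre-release or local tags such as `+cu118`, so treat its answer as a hint rather than a full version comparison.

```python
import re
from importlib import metadata

# Minimum versions from the pip command above.
MINIMUMS = {"torch": "2.0.0", "transformers": "4.30.0"}


def numeric_prefix(version):
    """Turn '2.1.0+cu118' into (2, 1, 0); stop at the first non-numeric part."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    return tuple(int(part) for part in match.group().split(".")) if match else ()


def meets_minimum(installed, minimum):
    """Compare numeric prefixes, padding with zeros so '2.0' equals '2.0.0'."""
    a, b = numeric_prefix(installed), numeric_prefix(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    for package, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            print(f"{package}: not installed")
            continue
        status = "ok" if meets_minimum(installed, minimum) else f"older than {minimum}"
        print(f"{package} {installed}: {status}")
```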

---

## Next Steps

- See [Quick Start](quickstart.md) for usage examples
- See [API Reference](api.md) for programmatic use
- See the [notebooks/](../notebooks/) directory for detailed analysis examples