# Installation Guide

This guide covers how to install Conformal Protein Retrieval (CPR) and download the required data files.

## Prerequisites

- Python 3.9 or higher
- About 15 GB of disk space for the full dataset
- GPU recommended for generating embeddings (CPU also works)
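
The first two prerequisites can be checked before installing anything. The following is a minimal sketch using only the standard library; the function name `check_prerequisites` is illustrative and not part of CPR, and the 15 GB threshold is the approximate figure from the list above.

```python
import shutil
import sys

MIN_PYTHON = (3, 9)      # from the prerequisites above
FULL_DATASET_GB = 15     # approximate figure from the prerequisites above


def check_prerequisites(path="."):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    # Free space on the filesystem that contains `path`, in decimal GB.
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < FULL_DATASET_GB:
        problems.append(
            f"Only {free_gb:.1f} GB free at {path!r}; "
            f"the full dataset needs about {FULL_DATASET_GB} GB"
        )
    return problems


for problem in check_prerequisites():
    print("WARNING:", problem)
```

A disk-space warning is not fatal if you only plan to download the minimum files listed under "Downloading Data".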

## Quick Install

```bash
# Clone the repository
git clone https://github.com/ronboger/conformal-protein-retrieval.git
cd conformal-protein-retrieval

# Install the package
pip install -e .

# Or with GUI support
pip install -e ".[gui]"

# Or with all optional dependencies
pip install -e ".[all]"
```

## Conda Environment (Recommended)

```bash
# Create environment from file
conda env create -f environment.yml
conda activate cpr

# Install the package
pip install -e .
```

## Docker

```bash
# Build the image
docker build -t cpr .

# Run with GUI
docker run -p 7860:7860 cpr python -m protein_conformal.gradio_app
```

---

## Downloading Data

All data files are hosted on Zenodo: https://zenodo.org/records/14272215

### Required Files (Minimum)

For basic FDR/FNR-controlled search against Pfam:

| File | Size | Download |
|------|------|----------|
| `pfam_new_proteins.npy` | 2.5 GB | [Download](https://zenodo.org/records/14272215/files/pfam_new_proteins.npy) |

### For UniProt Search

| File | Size | Download |
|------|------|----------|
| `lookup_embeddings.npy` | 1.1 GB | [Download](https://zenodo.org/records/14272215/files/lookup_embeddings.npy) |
| `lookup_embeddings_meta_data.tsv` | 560 MB | [Download](https://zenodo.org/records/14272215/files/lookup_embeddings_meta_data.tsv) |

### For AlphaFold DB Search

| File | Size | Download |
|------|------|----------|
| `afdb_embeddings_protein_vec.npy` | 4.7 GB | [Download](https://zenodo.org/records/14272215/files/afdb_embeddings_protein_vec.npy) |
| `AFDB_sequences.fasta` | 671 MB | [Download](https://zenodo.org/records/14272215/files/AFDB_sequences.fasta) |

### Supplementary Data

| File | Size | Description |
|------|------|-------------|
| `scope_supplement.zip` | 800 MB | SCOPe hierarchical risk data |
| `ec_supplement.zip` | 199 MB | EC number classification data |
| `clean_selection.zip` | 1.6 GB | Improved enzyme classification data |

### Download Script

```bash
# Create data directory
mkdir -p data

# Download the Pfam calibration data and the UniProt lookup database
cd data

# Pfam calibration data (required for FDR/FNR control)
wget https://zenodo.org/records/14272215/files/pfam_new_proteins.npy

# UniProt lookup database (for general protein search)
wget https://zenodo.org/records/14272215/files/lookup_embeddings.npy
wget https://zenodo.org/records/14272215/files/lookup_embeddings_meta_data.tsv
```
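
If `wget` is not available, the same downloads can be sketched in Python with the standard library. This is an illustrative helper, not part of CPR: `download_missing` and its `fetch` parameter are hypothetical names, and the URLs are the Zenodo links from the tables above.

```python
from pathlib import Path
from urllib.request import urlretrieve

ZENODO_FILES = "https://zenodo.org/records/14272215/files"

# The same three files as the shell script above.
DEFAULT_FILES = [
    "pfam_new_proteins.npy",
    "lookup_embeddings.npy",
    "lookup_embeddings_meta_data.tsv",
]


def download_missing(names=DEFAULT_FILES, dest="data", fetch=urlretrieve):
    """Download each file into `dest` unless it already exists there.

    Returns the names that were fetched. `fetch` takes (url, target_path)
    and defaults to urllib's urlretrieve; it is a parameter so the skip
    logic can be exercised without network access.
    """
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    fetched = []
    for name in names:
        target = dest_dir / name
        if target.exists():
            continue  # keep whatever is already on disk
        fetch(f"{ZENODO_FILES}/{name}", str(target))
        fetched.append(name)
    return fetched
```

Calling `download_missing()` from the repository root fetches the three files into `data/`. The existence check does not detect partial downloads, so delete any truncated file before re-running.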

---

## Protein-Vec Model Weights

To generate embeddings for new proteins, you need the Protein-Vec model weights.

### Option 1: Download Pre-trained Weights

**TODO**: Add download link for Protein-Vec weights

The model files should be placed in `protein_vec_models/`:
```
protein_vec_models/
├── protein_vec.ckpt           # Model checkpoint
├── protein_vec_params.json    # Model configuration
├── model_protein_moe.py       # Model definition
└── utils_search.py            # Utility functions
```

### Option 2: Use Pre-computed Embeddings

If you only need to search against existing databases (UniProt, AFDB), you can skip the embedding step and use the pre-computed embeddings from Zenodo.

---

## Verifying Installation

```bash
# Check that the package is installed
python -c "import protein_conformal; print('OK')"

# Run the test suite
pip install pytest
pytest tests/ -v

# Launch the GUI (if installed with [gui])
python -m protein_conformal.gradio_app
```
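
For a quicker check that does not import anything heavy, the sketch below asks Python whether each module can be located. The helper name `is_importable` is illustrative; the module list is an assumption based on the packages mentioned in this guide (`faiss` is the import name for both `faiss-cpu` and `faiss-gpu`).

```python
import importlib.util


def is_importable(module_name):
    """True if Python can locate the top-level module without importing it."""
    return importlib.util.find_spec(module_name) is not None


# protein_conformal is the package name used in the commands above;
# the remaining names are dependencies mentioned under Troubleshooting.
for name in ("protein_conformal", "faiss", "torch", "transformers"):
    status = "found" if is_importable(name) else "MISSING"
    print(f"{name:20s} {status}")
```

A module reported as found can still fail at import time (for example, because of a mismatched CUDA build), so this complements the commands above rather than replacing them.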

---

## Directory Structure

After downloading, your directory should look like:

```
conformal-protein-retrieval/
├── data/
│   ├── pfam_new_proteins.npy          # Calibration data
│   ├── lookup_embeddings.npy          # UniProt embeddings
│   └── lookup_embeddings_meta_data.tsv
├── protein_vec_models/                 # Model weights (if embedding)
│   ├── protein_vec.ckpt
│   └── protein_vec_params.json
├── protein_conformal/                  # Source code
└── ...
```
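
The layout above can be checked with a short script. This is a sketch, not a CPR utility: `missing_paths` is a hypothetical helper, and the path lists are copied from the tree above.

```python
from pathlib import Path

# Relative paths taken from the tree above.
EXPECTED = [
    "data/pfam_new_proteins.npy",
    "data/lookup_embeddings.npy",
    "data/lookup_embeddings_meta_data.tsv",
    "protein_conformal",
]
# Only needed when generating embeddings for new proteins.
EMBEDDING_ONLY = [
    "protein_vec_models/protein_vec.ckpt",
    "protein_vec_models/protein_vec_params.json",
]


def missing_paths(root=".", include_embedding=False):
    """Return the expected paths that do not exist under `root`."""
    wanted = EXPECTED + (EMBEDDING_ONLY if include_embedding else [])
    return [p for p in wanted if not (Path(root) / p).exists()]
```

Run `missing_paths()` from the repository root; an empty list means every expected file and directory is present. Pass `include_embedding=True` if you plan to embed new sequences.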

---

## Troubleshooting

### FAISS Installation Issues

If you encounter issues with `faiss-cpu`:

```bash
# Try conda instead of pip
conda install -c pytorch faiss-cpu

# Or for GPU support
conda install -c pytorch faiss-gpu
```

### Memory Issues

The calibration data (`pfam_new_proteins.npy`, 2.5 GB) is large. If you run out of memory:

1. Use a machine with at least 8 GB RAM
2. Consider using memory-mapped arrays:
   ```python
   import numpy as np
   data = np.load('pfam_new_proteins.npy', mmap_mode='r')  # plain numeric arrays only; pickled object arrays cannot be memory-mapped
   ```
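
To see what memory mapping does without downloading anything, the sketch below writes a small stand-in array and maps it back. The file name, shape, and dtype are made up for illustration. Memory mapping applies to plain numeric `.npy` arrays; NumPy raises `ValueError` for arrays that store Python objects.

```python
import os
import tempfile

import numpy as np

# Stand-in for a large embedding file; shape and dtype are illustrative only.
path = os.path.join(tempfile.mkdtemp(), "demo_embeddings.npy")
np.save(path, np.arange(12, dtype=np.float32).reshape(4, 3))

# mmap_mode="r" maps the file read-only instead of reading it all into RAM.
embeddings = np.load(path, mmap_mode="r")
print(type(embeddings).__name__, embeddings.shape)

# Slicing selects rows lazily; np.array() copies just those rows into memory.
first_rows = np.array(embeddings[:2])
```

With a real embedding file, processing the rows in slices like this keeps peak memory close to the size of one slice rather than the whole array.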

### PyTorch/Transformers Issues

For embedding, ensure compatible versions:

```bash
# Quote the specifiers so the shell does not treat ">" as output redirection
pip install "torch>=2.0.0" "transformers>=4.30.0"
```
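
To see which versions are currently installed, the following sketch compares them against the minimums above using `importlib.metadata`. The helpers `parse_release` and `check_versions` are illustrative names, and the version parsing is deliberately simple: it ignores local suffixes such as `+cu118` and treats pre-releases like the corresponding release.

```python
import re
from importlib import metadata

# Minimum versions from the pip command above.
MINIMUMS = {"torch": (2, 0, 0), "transformers": (4, 30, 0)}


def parse_release(version):
    """Turn '2.1.0+cu118' into (2, 1, 0), padding short versions with zeros."""
    parts = []
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_versions(minimums=MINIMUMS):
    """Map each package to 'ok', 'too old (<version>)', or 'not installed'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        ok = parse_release(installed) >= minimum
        report[name] = "ok" if ok else f"too old ({installed})"
    return report


for name, status in check_versions().items():
    print(f"{name}: {status}")
```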

---

## Next Steps

- See [Quick Start](quickstart.md) for usage examples
- See [API Reference](api.md) for programmatic use
- See the [notebooks/](../notebooks/) directory for detailed analysis examples