# Data Requirements

This document describes the data files needed to run CPR (Conformal Protein Retrieval) and reproduce the paper results.

## Quick Start

```bash
# 1. Download required data files
cd data/
wget "https://zenodo.org/records/14272215/files/lookup_embeddings.npy?download=1" -O lookup_embeddings.npy
wget "https://zenodo.org/records/14272215/files/lookup_embeddings_meta_data.tsv?download=1" -O lookup_embeddings_meta_data.tsv
wget "https://zenodo.org/records/14272215/files/pfam_new_proteins.npy?download=1" -O pfam_new_proteins.npy
cd ..

# 2. Download and extract Protein-Vec model weights (for embedding new sequences)
wget "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1" -O protein_vec_models.gz
tar -xzf protein_vec_models.gz

# 3. Verify setup
cpr verify --check syn30
```

## Data Sources

### Zenodo (https://zenodo.org/records/14272215)

Large data files that should NOT be committed to git:

| File | Size | Description | Location |
|------|------|-------------|----------|
| `lookup_embeddings.npy` | 1.1 GB | UniProt protein embeddings (540K proteins) | `data/` |
| `pfam_new_proteins.npy` | 2.4 GB | Pfam calibration data | `data/` |
| `lookup_embeddings_meta_data.tsv` | 535 MB | UniProt metadata (Pfam, protein names, etc.) | `data/` |
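
Before running anything, it can help to confirm that the three downloads are present and roughly the expected size. The following is a minimal sketch, not part of CPR: the function name `check_data_files`, the 15% tolerance, and the reading of "GB"/"MB" as decimal units are all assumptions made for illustration.

```python
from pathlib import Path

# Approximate sizes in bytes, taken from the table above.
# Assumption: "GB" and "MB" are decimal units (10**9 and 10**6 bytes).
EXPECTED_SIZES = {
    "lookup_embeddings.npy": 1.1e9,
    "pfam_new_proteins.npy": 2.4e9,
    "lookup_embeddings_meta_data.tsv": 535e6,
}


def check_data_files(data_dir, expected=EXPECTED_SIZES, tolerance=0.15):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    for name, size in expected.items():
        path = Path(data_dir) / name
        if not path.is_file():
            problems.append(f"{name}: missing")
            continue
        actual = path.stat().st_size
        # A large deviation from the listed size may indicate a truncated download.
        if abs(actual - size) > tolerance * size:
            problems.append(f"{name}: {actual} bytes, expected about {int(size)}")
    return problems
```

Calling `check_data_files("data")` from the repository root returns an empty list when all three files are in place. The tolerance is deliberately loose because the sizes in the table are rounded.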

### GitHub Repository

Small files that ARE committed to git:

| File | Size | Description |
|------|------|-------------|
| `data/gene_unknown/unknown_aa_seqs.fasta` | 56 KB | JCVI Syn3.0 unknown gene sequences |
| `data/gene_unknown/unknown_aa_seqs.npy` | 299 KB | Pre-computed embeddings for Syn3.0 genes |
| `data/gene_unknown/jcvi_syn30_unknown_gene_hits.csv` | 61 KB | Results: 59 annotated genes |

### Protein-Vec Models ([Zenodo #18478696](https://zenodo.org/records/18478696))

Model weights (2.9 GB compressed):

```bash
wget "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1" -O protein_vec_models.gz
tar -xzf protein_vec_models.gz
```

| File | Size | Required For |
|------|------|--------------|
| `protein_vec.ckpt` | 804 MB | Core embedding model |
| `protein_vec_params.json` | 240 B | Model configuration |
| `aspect_vec_*.ckpt` | ~200-400 MB each | Aspect-specific models |
| `tm_vec_swiss_model_large.ckpt` | 391 MB | TM-Vec model |
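
The archive is named `.gz` but is unpacked with `tar -xzf`, which implies a gzip-compressed tar archive. The sketch below lists its members before extraction, using only the Python standard library. The helper name `list_archive` is hypothetical, and the member names it returns have not been checked against the table above.

```python
import tarfile


def list_archive(path):
    """Return member names of a gzip-compressed tar archive.

    Raises ValueError if the file is not a tar archive that `tarfile` can read.
    """
    if not tarfile.is_tarfile(path):
        raise ValueError(f"{path} is not a readable tar archive")
    # "r:gz" corresponds to the -z flag passed to tar in the command above.
    with tarfile.open(path, "r:gz") as archive:
        return archive.getnames()
```

A `ValueError` here would suggest an incomplete download or a file that is not a tar archive at all.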

## Directory Structure

```
conformal-protein-retrieval/
├── data/
│   ├── lookup_embeddings.npy          # [Zenodo] UniProt embeddings
│   ├── lookup_embeddings_meta_data.tsv # [Zenodo] UniProt metadata
│   ├── pfam_new_proteins.npy          # [Zenodo] Calibration data
│   ├── gene_unknown/
│   │   ├── unknown_aa_seqs.fasta      # [GitHub] Syn3.0 sequences
│   │   ├── unknown_aa_seqs.npy        # [GitHub] Syn3.0 embeddings
│   │   └── jcvi_syn30_unknown_gene_hits.csv  # [GitHub] Results
│   └── ec/                            # CLEAN enzyme data
├── protein_vec_models/                # [Archive] Model weights
│   ├── protein_vec.ckpt
│   ├── protein_vec_params.json
│   ├── model_protein_moe.py           # Model code
│   ├── utils_search.py                # Embedding utilities
│   └── ...
└── results/                           # Output directory
```

## Reproducing Paper Results

### Figure 2A: JCVI Syn3.0 Annotation (39.6%)

**Required files:**
- `data/gene_unknown/unknown_aa_seqs.npy`
- `data/lookup_embeddings.npy`
- `data/lookup_embeddings_meta_data.tsv`
- `data/pfam_new_proteins.npy`

**Run:**
```bash
cpr verify --check syn30
# Expected: 59/149 = 39.6% hits at FDR α=0.1
```
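
The expected figure is a plain ratio: 59 annotated genes out of 149 queries. The sketch below reproduces that arithmetic and counts the records in a FASTA file. It assumes `data/gene_unknown/unknown_aa_seqs.fasta` holds one record per query gene and therefore 149 records. That count is inferred from the denominator above, not stated in this document. Both function names are hypothetical.

```python
def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines ('>')."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.startswith(">"))


def hit_rate_percent(n_hits, n_queries):
    """Percentage of queries with a hit, rounded to one decimal place."""
    return round(100.0 * n_hits / n_queries, 1)


# Numbers quoted above: 59 hits out of 149 queries.
print(hit_rate_percent(59, 149))  # 39.6
```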

### Tables 1-2: CLEAN Enzyme Classification

**Required files:**
- `clean_selection/clean_new_v_ec_cluster.npy`
- Additional CLEAN data from Zenodo

### Tables 4-6: DALI Prefiltering

**Required files:**
- SCOPe domain data
- DALI Z-scores
- AFDB embeddings

## What to Add to Zenodo

If you're updating Zenodo, include:

1. **Essential (required for paper verification):**
   - `lookup_embeddings.npy`
   - `lookup_embeddings_meta_data.tsv`
   - `pfam_new_proteins.npy`

2. **Optional (for full experiments):**
   - `afdb_embeddings_protein_vec.npy` (4.7 GB) - AlphaFold DB embeddings
   - CLEAN embeddings
   - SCOPe/DALI data

## What to Add to GitHub

Keep in GitHub (small files):
- `data/gene_unknown/*.fasta` - Query sequences
- `data/gene_unknown/*.npy` - Pre-computed query embeddings (< 1 MB)
- `results/*.csv` - Result summaries
- `protein_vec_models/*.py` - Model code (NOT weights)
- `protein_vec_models/*.json` - Model configs

Add to `.gitignore` (large files):
```
*.ckpt
data/*.npy
data/*.tsv
protein_vec_models.gz
```
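
To spot large files that the ignore patterns might miss, the working tree can be scanned for anything above a size threshold. This is a minimal sketch: the helper name `find_large_files` is hypothetical, and the 1 MB default borrows the "< 1 MB" guideline from the list above.

```python
from pathlib import Path


def find_large_files(root, max_bytes=1_000_000):
    """Return sorted (relative_path, size) pairs for files larger than max_bytes."""
    root = Path(root)
    large = []
    for path in root.rglob("*"):
        relative = path.relative_to(root)
        # Skip git's internal directory; its contents are not tracked files.
        if ".git" in relative.parts:
            continue
        if path.is_file():
            size = path.stat().st_size
            if size > max_bytes:
                large.append((relative.as_posix(), size))
    return sorted(large)
```

Any path it reports can then be checked with `git check-ignore -v <path>` to see whether an ignore pattern already covers it.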

## Verification Checklist

After setting up data, verify with:

```bash
# Check file sizes
ls -lh data/*.npy

# Expected:
# lookup_embeddings.npy      ~1.1 GB
# pfam_new_proteins.npy      ~2.4 GB

# Run verification
cpr verify --check fdr    # Tests algorithm
cpr verify --check syn30  # Tests paper result (39.6%)
```
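
File sizes alone do not show what an array contains. The sketch below reads only the header of a `.npy` file, so the dtype and shape can be inspected without loading gigabytes into memory. It follows the published NPY format (magic string, version bytes, little-endian header length, dict literal) and uses only the standard library. The function name `read_npy_header` is hypothetical, and this document does not state whether the files here are plain numeric arrays or pickled objects.

```python
import ast
import struct


def read_npy_header(path):
    """Return the header dict (descr, fortran_order, shape) of a .npy file."""
    with open(path, "rb") as handle:
        if handle.read(6) != b"\x93NUMPY":
            raise ValueError(f"{path} is not a .npy file")
        major, _minor = handle.read(2)
        # Format 1.0 stores the header length in 2 bytes; 2.0 and 3.0 use 4 bytes.
        # Both are little-endian unsigned integers.
        if major == 1:
            (length,) = struct.unpack("<H", handle.read(2))
        else:
            (length,) = struct.unpack("<I", handle.read(4))
        header = handle.read(length).decode("utf-8")
    # The header is a Python dict literal padded with spaces and a newline.
    return ast.literal_eval(header.strip())
```

With NumPy installed, `np.load(path, mmap_mode="r").shape` is a shorter route for plain numeric arrays.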