# Data Requirements
This document describes the data files needed to run CPR (Conformal Protein Retrieval) and reproduce the paper results.
## Quick Start
```bash
# 1. Download required data files
cd data/
wget "https://zenodo.org/records/14272215/files/lookup_embeddings.npy?download=1" -O lookup_embeddings.npy
wget "https://zenodo.org/records/14272215/files/lookup_embeddings_meta_data.tsv?download=1" -O lookup_embeddings_meta_data.tsv
wget "https://zenodo.org/records/14272215/files/pfam_new_proteins.npy?download=1" -O pfam_new_proteins.npy
cd ..
# 2. Download and extract Protein-Vec model weights (for embedding new sequences)
wget "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1" -O protein_vec_models.gz
tar -xzf protein_vec_models.gz
# 3. Verify setup
cpr verify --check syn30
```
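For environments without `wget`, the same downloads can be sketched in Python using only the standard library. The URL pattern below is copied from the commands above; the helper names `zenodo_file_url` and `download` are hypothetical and not part of the `cpr` package.

```python
from pathlib import Path
from urllib.parse import quote
from urllib.request import urlretrieve


def zenodo_file_url(record_id, filename):
    """Build a direct-download URL following the pattern used in the wget commands."""
    return f"https://zenodo.org/records/{record_id}/files/{quote(filename)}?download=1"


def download(record_id, filename, dest_dir="data"):
    """Download one file into dest_dir, skipping files that already exist.

    Assumption: an existing file is treated as complete, so a partial
    download from an interrupted run must be deleted by hand first.
    """
    dest = Path(dest_dir) / filename
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urlretrieve(zenodo_file_url(record_id, filename), str(dest))
    return dest
```

Calling `download(14272215, "lookup_embeddings.npy")` would then fetch the first file into `data/`. The model archive can be fetched the same way with record `18478696` and `dest_dir="."`.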
## Data Sources
### Zenodo (https://zenodo.org/records/14272215)
Large data files that should NOT be committed to git:
| File | Size | Description | Location |
|------|------|-------------|----------|
| `lookup_embeddings.npy` | 1.1 GB | UniProt protein embeddings (540K proteins) | `data/` |
| `pfam_new_proteins.npy` | 2.4 GB | Pfam calibration data | `data/` |
| `lookup_embeddings_meta_data.tsv` | 535 MB | UniProt metadata (Pfam, protein names, etc.) | `data/` |
### GitHub Repository
Small files that ARE committed to git:
| File | Size | Description |
|------|------|-------------|
| `data/gene_unknown/unknown_aa_seqs.fasta` | 56 KB | JCVI Syn3.0 unknown gene sequences |
| `data/gene_unknown/unknown_aa_seqs.npy` | 299 KB | Pre-computed embeddings for Syn3.0 genes |
| `data/gene_unknown/jcvi_syn30_unknown_gene_hits.csv` | 61 KB | Results: 59 annotated genes |
### Protein-Vec Models ([Zenodo #18478696](https://zenodo.org/records/18478696))
Model weights (2.9 GB compressed):
```bash
wget "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1" -O protein_vec_models.gz
tar -xzf protein_vec_models.gz
```
| File | Size | Required For |
|------|------|--------------|
| `protein_vec.ckpt` | 804 MB | Core embedding model |
| `protein_vec_params.json` | 240 B | Model configuration |
| `aspect_vec_*.ckpt` | ~200-400 MB each | Aspect-specific models |
| `tm_vec_swiss_model_large.ckpt` | 391 MB | TM-Vec model |
## Directory Structure
```
conformal-protein-retrieval/
├── data/
│   ├── lookup_embeddings.npy                  # [Zenodo] UniProt embeddings
│   ├── lookup_embeddings_meta_data.tsv        # [Zenodo] UniProt metadata
│   ├── pfam_new_proteins.npy                  # [Zenodo] Calibration data
│   ├── gene_unknown/
│   │   ├── unknown_aa_seqs.fasta              # [GitHub] Syn3.0 sequences
│   │   ├── unknown_aa_seqs.npy                # [GitHub] Syn3.0 embeddings
│   │   └── jcvi_syn30_unknown_gene_hits.csv   # [GitHub] Results
│   └── ec/                                    # CLEAN enzyme data
├── protein_vec_models/                        # [Archive] Model weights
│   ├── protein_vec.ckpt
│   ├── protein_vec_params.json
│   ├── model_protein_moe.py                   # Model code
│   ├── utils_search.py                        # Embedding utilities
│   └── ...
└── results/                                   # Output directory
```
## Reproducing Paper Results
### Figure 2A: JCVI Syn3.0 Annotation (39.6%)
**Required files:**
- `data/gene_unknown/unknown_aa_seqs.npy`
- `data/lookup_embeddings.npy`
- `data/lookup_embeddings_meta_data.tsv`
- `data/pfam_new_proteins.npy`
**Run:**
```bash
cpr verify --check syn30
# Expected: 59/149 = 39.6% hits at FDR α=0.1
```
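The headline percentage is the number of hits divided by the number of query sequences. As a quick arithmetic check, assuming 149 is the total number of Syn3.0 unknown-gene queries (the helper name is illustrative only):

```python
def hit_rate_percent(hits, total):
    """Return hits as a percentage of total, rounded to one decimal place."""
    return round(100 * hits / total, 1)


print(hit_rate_percent(59, 149))  # 39.6
```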
### Tables 1-2: CLEAN Enzyme Classification
**Required files:**
- `clean_selection/clean_new_v_ec_cluster.npy`
- Additional CLEAN data from Zenodo
### Tables 4-6: DALI Prefiltering
**Required files:**
- SCOPe domain data
- DALI Z-scores
- AFDB embeddings
## What to Add to Zenodo
If you're updating Zenodo, include:
1. **Essential (required for paper verification):**
   - `lookup_embeddings.npy`
   - `lookup_embeddings_meta_data.tsv`
   - `pfam_new_proteins.npy`
2. **Optional (for full experiments):**
   - `afdb_embeddings_protein_vec.npy` (4.7 GB) - AlphaFold DB embeddings
   - CLEAN embeddings
   - SCOPe/DALI data
## What to Add to GitHub
Keep in GitHub (small files):
- `data/gene_unknown/*.fasta` - Query sequences
- `data/gene_unknown/*.npy` - Pre-computed query embeddings (< 1 MB)
- `results/*.csv` - Result summaries
- `protein_vec_models/*.py` - Model code (NOT weights)
- `protein_vec_models/*.json` - Model configs
Add to `.gitignore` (large files):
```
*.ckpt
data/*.npy
data/*.tsv
protein_vec_models.gz
```
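Before committing, it can help to list files that are too large for Git and belong on Zenodo instead. The sketch below walks the working tree and reports anything over a size limit. The 1 MB default is an assumption borrowed from the "< 1 MB" note on the query embeddings above, and `find_large_files` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def find_large_files(root, limit_bytes=1_000_000):
    """Return (relative_path, size_in_bytes) for files under root larger than limit_bytes."""
    root = Path(root)
    results = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        # Skip Git's own object store, which is not part of the working tree.
        if ".git" in rel.parts:
            continue
        if path.is_file():
            size = path.stat().st_size
            if size > limit_bytes:
                results.append((rel.as_posix(), size))
    return results
```

Running `find_large_files(".")` from the repository root returns candidates to check against the `.gitignore` patterns. It does not consult `.gitignore` itself, so files that are already ignored will also appear.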
## Verification Checklist
After setting up data, verify with:
```bash
# Check file sizes
ls -lh data/*.npy
# Expected:
# lookup_embeddings.npy ~1.1 GB
# pfam_new_proteins.npy ~2.4 GB
# Run verification
cpr verify --check fdr # Tests algorithm
cpr verify --check syn30 # Tests paper result (39.6%)
```
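The size check can also be scripted. The following is a minimal sketch, assuming the sizes listed in this document are accurate to within about 15% (the tolerance is a guess that covers rounding and GB versus GiB reporting). `check_sizes` is a hypothetical helper and not a `cpr` command.

```python
from pathlib import Path

# Approximate sizes in bytes, taken from the tables in this document.
EXPECTED_SIZES = {
    "lookup_embeddings.npy": 1.1e9,
    "pfam_new_proteins.npy": 2.4e9,
    "lookup_embeddings_meta_data.tsv": 535e6,
}


def check_sizes(data_dir, expected=EXPECTED_SIZES, tolerance=0.15):
    """Return a list of problems; an empty list means every file looks plausible."""
    problems = []
    for name, want in expected.items():
        path = Path(data_dir) / name
        if not path.is_file():
            problems.append(f"{name}: missing")
            continue
        got = path.stat().st_size
        # Flag files whose size differs from the expected value by more than the tolerance.
        if abs(got - want) > tolerance * want:
            problems.append(f"{name}: {got} bytes, expected about {int(want)}")
    return problems
```

For example, `check_sizes("data")` returns `[]` when all three Zenodo files are present at roughly the listed sizes. A truncated download shows up as a size mismatch.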