# Data Requirements
This document describes the data files needed to run CPR (Conformal Protein Retrieval) and reproduce the paper results.
## Quick Start
```bash
# 1. Download required data files
cd data/
wget "https://zenodo.org/records/14272215/files/lookup_embeddings.npy?download=1" -O lookup_embeddings.npy
wget "https://zenodo.org/records/14272215/files/lookup_embeddings_meta_data.tsv?download=1" -O lookup_embeddings_meta_data.tsv
wget "https://zenodo.org/records/14272215/files/pfam_new_proteins.npy?download=1" -O pfam_new_proteins.npy
cd ..
# 2. Download and extract Protein-Vec model weights (for embedding new sequences)
wget "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1" -O protein_vec_models.gz
tar -xzf protein_vec_models.gz
# 3. Verify setup
cpr verify --check syn30
```
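Where `wget` is unavailable, the same downloads can be sketched in Python with only the standard library. The helper names `zenodo_url` and `fetch` are illustrative and not part of the `cpr` package; the record ID, file names, and URL pattern are copied from the commands above.

```python
import os
import urllib.request

# Record ID and file names are taken from the wget commands above.
DATA_RECORD = "14272215"
DATA_FILES = [
    "lookup_embeddings.npy",
    "lookup_embeddings_meta_data.tsv",
    "pfam_new_proteins.npy",
]


def zenodo_url(record: str, filename: str) -> str:
    """Build a direct-download URL in the same form as the wget commands."""
    return f"https://zenodo.org/records/{record}/files/{filename}?download=1"


def fetch(record: str, filename: str, dest_dir: str = "data") -> str:
    """Download one file into dest_dir, skipping it if it already exists."""
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, filename)
    if not os.path.exists(dest):
        urllib.request.urlretrieve(zenodo_url(record, filename), dest)
    return dest
```

Calling `fetch(DATA_RECORD, name)` for each `name` in `DATA_FILES` would mirror step 1. Note that the skip-if-exists check does not detect a partially downloaded file, so an interrupted transfer should be deleted before retrying.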
## Data Sources
### Zenodo (https://zenodo.org/records/14272215)
Large data files that should NOT be committed to git:
| File | Size | Description | Location |
|------|------|-------------|----------|
| `lookup_embeddings.npy` | 1.1 GB | UniProt protein embeddings (540K proteins) | `data/` |
| `pfam_new_proteins.npy` | 2.4 GB | Pfam calibration data | `data/` |
| `lookup_embeddings_meta_data.tsv` | 535 MB | UniProt metadata (Pfam, protein names, etc.) | `data/` |
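Before loading gigabyte-scale arrays, it can help to confirm what a downloaded `.npy` file contains. The sketch below reads only the file header, following the published NPY format (magic string, version bytes, header length, then a Python-literal dict), so it needs neither NumPy nor much memory. `read_npy_header` is a hypothetical helper, not part of `cpr`, and since this document does not state array shapes or dtypes, none are assumed here.

```python
import ast
import struct

NPY_MAGIC = b"\x93NUMPY"


def read_npy_header(path: str) -> dict:
    """Return the header dict ('descr', 'fortran_order', 'shape') of a .npy file."""
    with open(path, "rb") as f:
        if f.read(6) != NPY_MAGIC:
            raise ValueError(f"{path} is not a .npy file")
        major, _minor = f.read(2)
        # Version 1.x stores the header length as uint16, 2.x and 3.x as uint32.
        if major == 1:
            (length,) = struct.unpack("<H", f.read(2))
        elif major in (2, 3):
            (length,) = struct.unpack("<I", f.read(4))
        else:
            raise ValueError(f"unsupported .npy version {major}")
        header = f.read(length).decode("utf-8")
    # The header is a Python dict literal padded with spaces and a newline.
    return ast.literal_eval(header.strip())
```

For example, `read_npy_header("data/lookup_embeddings.npy")["shape"]` reports the array dimensions without reading the 1.1 GB payload.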
### GitHub Repository
Small files that ARE committed to git:
| File | Size | Description |
|------|------|-------------|
| `data/gene_unknown/unknown_aa_seqs.fasta` | 56 KB | JCVI Syn3.0 unknown gene sequences |
| `data/gene_unknown/unknown_aa_seqs.npy` | 299 KB | Pre-computed embeddings for Syn3.0 genes |
| `data/gene_unknown/jcvi_syn30_unknown_gene_hits.csv` | 61 KB | Results: 59 annotated genes |
### Protein-Vec Models ([Zenodo #18478696](https://zenodo.org/records/18478696))
Model weights (2.9 GB compressed). Despite the bare `.gz` extension, the archive is unpacked with `tar -xzf`:
```bash
wget "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1" -O protein_vec_models.gz
tar -xzf protein_vec_models.gz
```
| File | Size | Required For |
|------|------|--------------|
| `protein_vec.ckpt` | 804 MB | Core embedding model |
| `protein_vec_params.json` | 240 B | Model configuration |
| `aspect_vec_*.ckpt` | ~200-400 MB each | Aspect-specific models |
| `tm_vec_swiss_model_large.ckpt` | 391 MB | TM-Vec model |
## Directory Structure
```
conformal-protein-retrieval/
├── data/
│   ├── lookup_embeddings.npy                   # [Zenodo] UniProt embeddings
│   ├── lookup_embeddings_meta_data.tsv         # [Zenodo] UniProt metadata
│   ├── pfam_new_proteins.npy                   # [Zenodo] Calibration data
│   ├── gene_unknown/
│   │   ├── unknown_aa_seqs.fasta               # [GitHub] Syn3.0 sequences
│   │   ├── unknown_aa_seqs.npy                 # [GitHub] Syn3.0 embeddings
│   │   └── jcvi_syn30_unknown_gene_hits.csv    # [GitHub] Results
│   └── ec/                                     # CLEAN enzyme data
├── protein_vec_models/                         # [Archive] Model weights
│   ├── protein_vec.ckpt
│   ├── protein_vec_params.json
│   ├── model_protein_moe.py                    # Model code
│   ├── utils_search.py                         # Embedding utilities
│   └── ...
└── results/                                    # Output directory
```
## Reproducing Paper Results
### Figure 2A: JCVI Syn3.0 Annotation (39.6%)
**Required files:**
- `data/gene_unknown/unknown_aa_seqs.npy`
- `data/lookup_embeddings.npy`
- `data/lookup_embeddings_meta_data.tsv`
- `data/pfam_new_proteins.npy`
**Run:**
```bash
cpr verify --check syn30
# Expected: 59/149 = 39.6% hits at FDR α=0.1
```
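The expected figure is plain arithmetic on the two counts in the comment above: 59 hits out of 149 query genes. A quick check, with `hit_rate` as an illustrative name only:

```python
def hit_rate(hits: int, total: int) -> float:
    """Fraction of query genes counted as hits."""
    return hits / total


# Counts taken from the expected output above.
print(f"{hit_rate(59, 149):.1%}")  # prints 39.6%
```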
### Tables 1-2: CLEAN Enzyme Classification
**Required files:**
- `clean_selection/clean_new_v_ec_cluster.npy`
- Additional CLEAN data from Zenodo
### Tables 4-6: DALI Prefiltering
**Required files:**
- SCOPe domain data
- DALI Z-scores
- AFDB embeddings
## What to Add to Zenodo
If you're updating Zenodo, include:
1. **Essential (required for paper verification):**
   - `lookup_embeddings.npy`
   - `lookup_embeddings_meta_data.tsv`
   - `pfam_new_proteins.npy`
2. **Optional (for full experiments):**
   - `afdb_embeddings_protein_vec.npy` (4.7 GB) - AlphaFold DB embeddings
   - CLEAN embeddings
   - SCOPe/DALI data
## What to Add to GitHub
Keep in GitHub (small files):
- `data/gene_unknown/*.fasta` - Query sequences
- `data/gene_unknown/*.npy` - Pre-computed query embeddings (< 1 MB)
- `results/*.csv` - Result summaries
- `protein_vec_models/*.py` - Model code (NOT weights)
- `protein_vec_models/*.json` - Model configs
Add to `.gitignore` (large files):
```
*.ckpt
data/*.npy
data/*.tsv
protein_vec_models.gz
```
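To catch large files before they are committed, a sketch like the following lists everything under a directory above a size limit. The 1 MB default mirrors the `< 1 MB` note on query embeddings above but is otherwise an arbitrary assumption, and `find_large_files` is a hypothetical helper rather than part of `cpr`. It does not consult `.gitignore`; it only reports sizes.

```python
import os


def find_large_files(root: str = ".", limit_bytes: int = 1_000_000) -> list:
    """Return (path, size) pairs for files under root above limit_bytes, largest first."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip git's own object store.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # do not follow symlinks to data stored elsewhere
            size = os.path.getsize(path)
            if size > limit_bytes:
                found.append((path, size))
    return sorted(found, key=lambda item: item[1], reverse=True)
```

Any path this reports that is not covered by the ignore patterns above is a candidate for Zenodo rather than git.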
## Verification Checklist
After setting up data, verify with:
```bash
# Check file sizes
ls -lh data/*.npy
# Expected:
# lookup_embeddings.npy ~1.1 GB
# pfam_new_proteins.npy ~2.4 GB
# Run verification
cpr verify --check fdr # Tests algorithm
cpr verify --check syn30 # Tests paper result (39.6%)
```
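The size check can also be scripted. The sketch below compares on-disk sizes with the approximate figures listed in this document. The 15% tolerance is an assumption chosen to absorb rounding and the GB versus GiB ambiguity in the listed sizes; `check_sizes` and `EXPECTED_BYTES` are illustrative names, not part of `cpr`.

```python
import os

# Approximate sizes as listed in this document, converted to bytes.
EXPECTED_BYTES = {
    "lookup_embeddings.npy": 1.1e9,
    "pfam_new_proteins.npy": 2.4e9,
    "lookup_embeddings_meta_data.tsv": 535e6,
}


def check_sizes(data_dir: str = "data", expected: dict = EXPECTED_BYTES,
                tolerance: float = 0.15) -> dict:
    """Map each expected file to 'ok', 'missing', or 'size mismatch'."""
    report = {}
    for name, want in expected.items():
        path = os.path.join(data_dir, name)
        if not os.path.isfile(path):
            report[name] = "missing"
        elif abs(os.path.getsize(path) - want) > tolerance * want:
            report[name] = "size mismatch"
        else:
            report[name] = "ok"
    return report


if __name__ == "__main__":
    for name, status in check_sizes().items():
        print(f"{name}: {status}")
```

A `size mismatch` most often points to an interrupted download; a passing size check says nothing about file contents, so `cpr verify` remains the authoritative test.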