# Data Requirements

This document describes the data files needed to run CPR (Conformal Protein Retrieval) and reproduce the paper results.

## Quick Start

```bash
# 1. Download required data files
cd data/
wget "https://zenodo.org/records/14272215/files/lookup_embeddings.npy?download=1" -O lookup_embeddings.npy
wget "https://zenodo.org/records/14272215/files/lookup_embeddings_meta_data.tsv?download=1" -O lookup_embeddings_meta_data.tsv
wget "https://zenodo.org/records/14272215/files/pfam_new_proteins.npy?download=1" -O pfam_new_proteins.npy
cd ..

# 2. Download and extract Protein-Vec model weights (for embedding new sequences)
wget "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1" -O protein_vec_models.gz
tar -xzf protein_vec_models.gz

# 3. Verify setup
cpr verify --check syn30
```

## Data Sources

### Zenodo (https://zenodo.org/records/14272215)

Large data files that should NOT be committed to git:

| File | Size | Description | Location |
|------|------|-------------|----------|
| `lookup_embeddings.npy` | 1.1 GB | UniProt protein embeddings (540K proteins) | `data/` |
| `pfam_new_proteins.npy` | 2.4 GB | Pfam calibration data | `data/` |
| `lookup_embeddings_meta_data.tsv` | 535 MB | UniProt metadata (Pfam, protein names, etc.) | `data/` |

### GitHub Repository

Small files that ARE committed to git:

| File | Size | Description |
|------|------|-------------|
| `data/gene_unknown/unknown_aa_seqs.fasta` | 56 KB | JCVI Syn3.0 unknown gene sequences |
| `data/gene_unknown/unknown_aa_seqs.npy` | 299 KB | Pre-computed embeddings for Syn3.0 genes |
| `data/gene_unknown/jcvi_syn30_unknown_gene_hits.csv` | 61 KB | Results: 59 annotated genes |

### Protein-Vec Models ([Zenodo #18478696](https://zenodo.org/records/18478696))

Model weights (2.9 GB compressed):

```bash
wget "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1" -O protein_vec_models.gz
tar -xzf protein_vec_models.gz
```

| File | Size | Required For |
|------|------|--------------|
| `protein_vec.ckpt` | 804 MB | Core embedding model |
| `protein_vec_params.json` | 240 B | Model configuration |
| `aspect_vec_*.ckpt` | ~200-400 MB each | Aspect-specific models |
| `tm_vec_swiss_model_large.ckpt` | 391 MB | TM-Vec model |

## Directory Structure

```
conformal-protein-retrieval/
├── data/
│   ├── lookup_embeddings.npy          # [Zenodo] UniProt embeddings
│   ├── lookup_embeddings_meta_data.tsv # [Zenodo] UniProt metadata
│   ├── pfam_new_proteins.npy          # [Zenodo] Calibration data
│   ├── gene_unknown/
│   │   ├── unknown_aa_seqs.fasta      # [GitHub] Syn3.0 sequences
│   │   ├── unknown_aa_seqs.npy        # [GitHub] Syn3.0 embeddings
│   │   └── jcvi_syn30_unknown_gene_hits.csv  # [GitHub] Results
│   └── ec/                            # CLEAN enzyme data
├── protein_vec_models/                # [Archive] Model weights
│   ├── protein_vec.ckpt
│   ├── protein_vec_params.json
│   ├── model_protein_moe.py           # Model code
│   ├── utils_search.py                # Embedding utilities
│   └── ...
└── results/                           # Output directory
```

## Reproducing Paper Results

### Figure 2A: JCVI Syn3.0 Annotation (39.6%)

**Required files:**
- `data/gene_unknown/unknown_aa_seqs.npy`
- `data/lookup_embeddings.npy`
- `data/lookup_embeddings_meta_data.tsv`
- `data/pfam_new_proteins.npy`

**Run:**
```bash
cpr verify --check syn30
# Expected: 59/149 = 39.6% hits at FDR α=0.1
```

### Tables 1-2: CLEAN Enzyme Classification

**Required files:**
- `clean_selection/clean_new_v_ec_cluster.npy`
- Additional CLEAN data from Zenodo

### Tables 4-6: DALI Prefiltering

**Required files:**
- SCOPe domain data
- DALI Z-scores
- AFDB embeddings

## What to Add to Zenodo

If you're updating Zenodo, include:

1. **Essential (required for paper verification):**
   - `lookup_embeddings.npy`
   - `lookup_embeddings_meta_data.tsv`
   - `pfam_new_proteins.npy`

2. **Optional (for full experiments):**
   - `afdb_embeddings_protein_vec.npy` (4.7 GB) - AlphaFold DB embeddings
   - CLEAN embeddings
   - SCOPe/DALI data

## What to Add to GitHub

Keep in GitHub (small files):
- `data/gene_unknown/*.fasta` - Query sequences
- `data/gene_unknown/*.npy` - Pre-computed query embeddings (< 1 MB)
- `results/*.csv` - Result summaries
- `protein_vec_models/*.py` - Model code (NOT weights)
- `protein_vec_models/*.json` - Model configs

Add to `.gitignore` (large files):
```
*.ckpt
data/*.npy
data/*.tsv
protein_vec_models.gz
```

## Verification Checklist

After setting up data, verify with:

```bash
# Check file sizes
ls -lh data/*.npy

# Expected:
# lookup_embeddings.npy      ~1.1 GB
# pfam_new_proteins.npy      ~2.4 GB

# Run verification
cpr verify --check fdr    # Tests algorithm
cpr verify --check syn30  # Tests paper result (39.6%)
```