Data Requirements
This document describes the data files needed to run CPR (Conformal Protein Retrieval) and reproduce the paper results.
Quick Start
# 1. Download required data files
cd data/
wget "https://zenodo.org/records/14272215/files/lookup_embeddings.npy?download=1" -O lookup_embeddings.npy
wget "https://zenodo.org/records/14272215/files/lookup_embeddings_meta_data.tsv?download=1" -O lookup_embeddings_meta_data.tsv
wget "https://zenodo.org/records/14272215/files/pfam_new_proteins.npy?download=1" -O pfam_new_proteins.npy
cd ..
# 2. Download and extract Protein-Vec model weights (for embedding new sequences)
wget "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1" -O protein_vec_models.gz
tar -xzf protein_vec_models.gz
# 3. Verify setup
cpr verify --check syn30
Data Sources
Zenodo (https://zenodo.org/records/14272215)
Large data files that should NOT be committed to git:
| File | Size | Description | Location |
|---|---|---|---|
| `lookup_embeddings.npy` | 1.1 GB | UniProt protein embeddings (540K proteins) | `data/` |
| `pfam_new_proteins.npy` | 2.4 GB | Pfam calibration data | `data/` |
| `lookup_embeddings_meta_data.tsv` | 535 MB | UniProt metadata (Pfam, protein names, etc.) | `data/` |
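Since these files are large, it can be handy to confirm that a download is a well-formed `.npy` file before loading it. The sketch below reads only the file header, following the published NPY format, and needs nothing beyond the standard library. `read_npy_header` is a hypothetical helper, not part of the CPR package, and the source does not state the array layout, so treat the reported shape as something to inspect rather than a value to expect.

```python
import ast
import os
import struct


def read_npy_header(path):
    """Return the header dict of a .npy file (keys: descr, fortran_order, shape).

    Only the first few hundred bytes are read, so this stays cheap even for
    multi-gigabyte files. Hypothetical helper, not part of the CPR package.
    """
    with open(path, "rb") as f:
        if f.read(6) != b"\x93NUMPY":
            raise ValueError(f"{path} is not a .npy file")
        major, _minor = f.read(2)
        # Format 1.0 stores the header length as uint16, 2.0+ as uint32.
        if major == 1:
            (header_len,) = struct.unpack("<H", f.read(2))
        else:
            (header_len,) = struct.unpack("<I", f.read(4))
        header = f.read(header_len).decode("latin1")
    # The header is a Python dict literal padded with spaces and a newline.
    return ast.literal_eval(header.strip())


if __name__ == "__main__":
    path = "data/lookup_embeddings.npy"
    if os.path.exists(path):
        print(path, read_npy_header(path))
    else:
        print(f"{path} not found; download it from Zenodo first")
```

A truncated or HTML-error download will typically fail the magic-bytes check, which is a faster signal than a failed full load.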
GitHub Repository
Small files that ARE committed to git:
| File | Size | Description |
|---|---|---|
| `data/gene_unknown/unknown_aa_seqs.fasta` | 56 KB | JCVI Syn3.0 unknown gene sequences |
| `data/gene_unknown/unknown_aa_seqs.npy` | 299 KB | Pre-computed embeddings for Syn3.0 genes |
| `data/gene_unknown/jcvi_syn30_unknown_gene_hits.csv` | 61 KB | Results: 59 annotated genes |
Protein-Vec Models (Zenodo #18478696)
Model weights (2.9 GB compressed):
wget "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1" -O protein_vec_models.gz
tar -xzf protein_vec_models.gz
| File | Size | Required For |
|---|---|---|
| `protein_vec.ckpt` | 804 MB | Core embedding model |
| `protein_vec_params.json` | 240 B | Model configuration |
| `aspect_vec_*.ckpt` | ~200-400 MB each | Aspect-specific models |
| `tm_vec_swiss_model_large.ckpt` | 391 MB | TM-Vec model |
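After extracting the archive, a quick presence check can catch an incomplete extraction. This is a minimal sketch, assuming the archive unpacks into `protein_vec_models/` as shown in the directory layout in this document. `missing_model_files` is a hypothetical helper, and whether every listed file is needed depends on which models you intend to run.

```python
from pathlib import Path

# File names taken from the table above; the helper itself is hypothetical.
REQUIRED_MODEL_FILES = [
    "protein_vec.ckpt",
    "protein_vec_params.json",
    "tm_vec_swiss_model_large.ckpt",
]
ASPECT_PATTERN = "aspect_vec_*.ckpt"


def missing_model_files(model_dir="protein_vec_models"):
    """Return the expected model files that are absent from model_dir."""
    root = Path(model_dir)
    missing = [name for name in REQUIRED_MODEL_FILES if not (root / name).is_file()]
    # The number of aspect-specific checkpoints is not stated in the source,
    # so only the presence of at least one is checked.
    if not any(root.glob(ASPECT_PATTERN)):
        missing.append(ASPECT_PATTERN)
    return missing


if __name__ == "__main__":
    gaps = missing_model_files()
    print("all model files present" if not gaps else f"missing: {gaps}")
```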
Directory Structure
conformal-protein-retrieval/
├── data/
│   ├── lookup_embeddings.npy               # [Zenodo] UniProt embeddings
│   ├── lookup_embeddings_meta_data.tsv     # [Zenodo] UniProt metadata
│   ├── pfam_new_proteins.npy               # [Zenodo] Calibration data
│   ├── gene_unknown/
│   │   ├── unknown_aa_seqs.fasta           # [GitHub] Syn3.0 sequences
│   │   ├── unknown_aa_seqs.npy             # [GitHub] Syn3.0 embeddings
│   │   └── jcvi_syn30_unknown_gene_hits.csv  # [GitHub] Results
│   └── ec/                                 # CLEAN enzyme data
├── protein_vec_models/                     # [Archive] Model weights
│   ├── protein_vec.ckpt
│   ├── protein_vec_params.json
│   ├── model_protein_moe.py                # Model code
│   ├── utils_search.py                     # Embedding utilities
│   └── ...
└── results/                                # Output directory
Reproducing Paper Results
Figure 2A: JCVI Syn3.0 Annotation (39.6%)
Required files:
- `data/gene_unknown/unknown_aa_seqs.npy`
- `data/lookup_embeddings.npy`
- `data/lookup_embeddings_meta_data.tsv`
- `data/pfam_new_proteins.npy`
Run:
cpr verify --check syn30
# Expected: 59/149 = 39.6% hits at FDR α=0.1
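The expected percentage follows directly from the two counts. As a quick arithmetic check, reading 149 as the number of query genes and 59 as the number that received a hit (the function name is illustrative only):

```python
def hit_rate_percent(n_hits, n_queries):
    """Percentage of queries with a hit, rounded to one decimal place."""
    if n_queries <= 0:
        raise ValueError("n_queries must be positive")
    return round(100.0 * n_hits / n_queries, 1)


print(hit_rate_percent(59, 149))  # 39.6
```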
Tables 1-2: CLEAN Enzyme Classification
Required files:
- `clean_selection/clean_new_v_ec_cluster.npy`
- Additional CLEAN data from Zenodo
Tables 4-6: DALI Prefiltering
Required files:
- SCOPe domain data
- DALI Z-scores
- AFDB embeddings
What to Add to Zenodo
If you're updating Zenodo, include:
Essential (required for paper verification):
- `lookup_embeddings.npy`
- `lookup_embeddings_meta_data.tsv`
- `pfam_new_proteins.npy`
Optional (for full experiments):
- `afdb_embeddings_protein_vec.npy` (4.7 GB) - AlphaFold DB embeddings
- CLEAN embeddings
- SCOPe/DALI data
What to Add to GitHub
Keep in GitHub (small files):
- `data/gene_unknown/*.fasta` - Query sequences
- `data/gene_unknown/*.npy` - Pre-computed query embeddings (< 1 MB)
- `results/*.csv` - Result summaries
- `protein_vec_models/*.py` - Model code (NOT weights)
- `protein_vec_models/*.json` - Model configs
Add to .gitignore (large files):
*.ckpt
data/*.npy
data/*.tsv
protein_vec_models.gz
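One subtlety of these patterns: in git, `data/*.npy` matches only files directly inside `data/`, so the small embeddings under `data/gene_unknown/` stay tracked. The sketch below approximates that behaviour to make the split explicit. It is not git's own matcher (it ignores `**`, negation, and directory-only patterns), and `is_ignored` is a hypothetical helper; `git check-ignore` remains the authoritative check.

```python
import fnmatch
import os

# Patterns copied from the .gitignore list above.
IGNORE_PATTERNS = ["*.ckpt", "data/*.npy", "data/*.tsv", "protein_vec_models.gz"]


def is_ignored(rel_path, patterns=IGNORE_PATTERNS):
    """Approximate .gitignore matching for simple patterns (not full git semantics)."""
    parts = rel_path.replace(os.sep, "/").split("/")
    for pat in patterns:
        pat_parts = pat.split("/")
        if len(pat_parts) == 1:
            # No slash: the pattern applies to file names at any depth.
            if fnmatch.fnmatchcase(parts[-1], pat):
                return True
        elif len(pat_parts) == len(parts) and all(
            fnmatch.fnmatchcase(p, q) for p, q in zip(parts, pat_parts)
        ):
            # With a slash: anchored at the repo root, and "*" never
            # crosses a directory boundary, so depths must match.
            return True
    return False


if __name__ == "__main__":
    for path in [
        "data/lookup_embeddings.npy",
        "data/gene_unknown/unknown_aa_seqs.npy",
        "protein_vec_models/protein_vec.ckpt",
        "protein_vec_models/model_protein_moe.py",
    ]:
        print(f"{path}: {'ignored' if is_ignored(path) else 'tracked'}")
```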
Verification Checklist
After setting up data, verify with:
# Check file sizes
ls -lh data/*.npy
# Expected:
# lookup_embeddings.npy ~1.1 GB
# pfam_new_proteins.npy ~2.4 GB
# Run verification
cpr verify --check fdr # Tests algorithm
cpr verify --check syn30 # Tests paper result (39.6%)
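The size check can also be scripted. This is a minimal sketch using the approximate sizes listed in this document. The 15% tolerance is an assumption, chosen to absorb rounding and the difference between decimal and binary units, and `check_sizes` is a hypothetical helper rather than part of the `cpr` CLI.

```python
import os

GB = 1e9  # decimal gigabytes; `ls -lh` reports binary units, hence the tolerance

# Approximate sizes from the tables in this document.
EXPECTED_SIZES = {
    "data/lookup_embeddings.npy": 1.1 * GB,
    "data/pfam_new_proteins.npy": 2.4 * GB,
    "data/lookup_embeddings_meta_data.tsv": 0.535 * GB,
}


def check_sizes(expected=EXPECTED_SIZES, rel_tol=0.15, getsize=os.path.getsize):
    """Return {path: problem} for files that are missing or far from the expected size."""
    problems = {}
    for path, want in expected.items():
        try:
            got = getsize(path)
        except OSError:
            problems[path] = "missing"
            continue
        if abs(got - want) > rel_tol * want:
            problems[path] = f"size {got / GB:.2f} GB, expected ~{want / GB:.2f} GB"
    return problems


if __name__ == "__main__":
    issues = check_sizes()
    if issues:
        for path, problem in issues.items():
            print(f"{path}: {problem}")
    else:
        print("all data files present with plausible sizes")
```

A file that is much smaller than expected usually indicates an interrupted download.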