Data Requirements

This document describes the data files needed to run CPR (Conformal Protein Retrieval) and reproduce the paper results.

Quick Start

# 1. Download required data files
cd data/
wget "https://zenodo.org/records/14272215/files/lookup_embeddings.npy?download=1" -O lookup_embeddings.npy
wget "https://zenodo.org/records/14272215/files/lookup_embeddings_meta_data.tsv?download=1" -O lookup_embeddings_meta_data.tsv
wget "https://zenodo.org/records/14272215/files/pfam_new_proteins.npy?download=1" -O pfam_new_proteins.npy
cd ..

# 2. Download and extract Protein-Vec model weights (for embedding new sequences)
wget "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1" -O protein_vec_models.gz
tar -xzf protein_vec_models.gz

# 3. Verify setup
cpr verify --check syn30

Data Sources

Zenodo (https://zenodo.org/records/14272215)

Large data files that should NOT be committed to git:

| File | Size | Description | Location |
| --- | --- | --- | --- |
| lookup_embeddings.npy | 1.1 GB | UniProt protein embeddings (540K proteins) | data/ |
| pfam_new_proteins.npy | 2.4 GB | Pfam calibration data | data/ |
| lookup_embeddings_meta_data.tsv | 535 MB | UniProt metadata (Pfam, protein names, etc.) | data/ |
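
One way to confirm that a downloaded .npy file is intact without loading over a gigabyte into memory is to read only its header. The sketch below assumes the files use the standard NPY format; `read_npy_header` is a hypothetical helper and not part of the cpr package. It uses only the Python standard library.

```python
import ast
import struct


def read_npy_header(path):
    """Return the header dict (descr, fortran_order, shape) of an .npy file.

    Hypothetical helper: reads only the first few hundred bytes of the file.
    """
    with open(path, "rb") as f:
        if f.read(6) != b"\x93NUMPY":
            raise ValueError(f"{path} is not an NPY file")
        major, _minor = f.read(2)
        # Format version 1.x stores the header length in 2 bytes,
        # later versions store it in 4 bytes (both little-endian).
        if major == 1:
            (header_len,) = struct.unpack("<H", f.read(2))
        else:
            (header_len,) = struct.unpack("<I", f.read(4))
        header = f.read(header_len).decode("latin1")
    # The header is a Python dict literal padded with spaces and a newline.
    return ast.literal_eval(header.strip())


# Example, run from the repository root after downloading:
# print(read_npy_header("data/lookup_embeddings.npy")["shape"])
```

If lookup_embeddings.npy stores one row per protein (an assumption, not stated here), the first entry of the shape should be on the order of 540K. A truncated download may still have a valid header, so this complements a file-size check and does not replace it.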

GitHub Repository

Small files that ARE committed to git:

| File | Size | Description |
| --- | --- | --- |
| data/gene_unknown/unknown_aa_seqs.fasta | 56 KB | JCVI Syn3.0 unknown gene sequences |
| data/gene_unknown/unknown_aa_seqs.npy | 299 KB | Pre-computed embeddings for Syn3.0 genes |
| data/gene_unknown/jcvi_syn30_unknown_gene_hits.csv | 61 KB | Results: 59 annotated genes |

Protein-Vec Models (Zenodo #18478696)

Model weights (2.9 GB compressed):

wget "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1" -O protein_vec_models.gz
tar -xzf protein_vec_models.gz

| File | Size | Required For |
| --- | --- | --- |
| protein_vec.ckpt | 804 MB | Core embedding model |
| protein_vec_params.json | 240 B | Model configuration |
| aspect_vec_*.ckpt | ~200-400 MB each | Aspect-specific models |
| tm_vec_swiss_model_large.ckpt | 391 MB | TM-Vec model |

Directory Structure

conformal-protein-retrieval/
├── data/
│   ├── lookup_embeddings.npy          # [Zenodo] UniProt embeddings
│   ├── lookup_embeddings_meta_data.tsv # [Zenodo] UniProt metadata
│   ├── pfam_new_proteins.npy          # [Zenodo] Calibration data
│   ├── gene_unknown/
│   │   ├── unknown_aa_seqs.fasta      # [GitHub] Syn3.0 sequences
│   │   ├── unknown_aa_seqs.npy        # [GitHub] Syn3.0 embeddings
│   │   └── jcvi_syn30_unknown_gene_hits.csv  # [GitHub] Results
│   └── ec/                            # CLEAN enzyme data
├── protein_vec_models/                # [Archive] Model weights
│   ├── protein_vec.ckpt
│   ├── protein_vec_params.json
│   ├── model_protein_moe.py           # Model code
│   ├── utils_search.py                # Embedding utilities
│   └── ...
└── results/                           # Output directory

Reproducing Paper Results

Figure 2A: JCVI Syn3.0 Annotation (39.6%)

Required files:

  • data/gene_unknown/unknown_aa_seqs.npy
  • data/lookup_embeddings.npy
  • data/lookup_embeddings_meta_data.tsv
  • data/pfam_new_proteins.npy

Run:

cpr verify --check syn30
# Expected: 59/149 = 39.6% hits at FDR α=0.1
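
The 39.6% figure is the ratio of annotated genes to query genes. The sketch below reproduces that arithmetic and adds a helper for counting query sequences. It assumes the FASTA file holds one record per query gene, which this document does not state, and both function names are hypothetical.

```python
def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines that start with '>'."""
    with open(path) as f:
        return sum(1 for line in f if line.startswith(">"))


def hit_rate(hits, total):
    """Format hits/total as a percentage with one decimal place."""
    return f"{hits}/{total} = {hits / total:.1%}"


# hit_rate(59, 149) returns "59/149 = 39.6%"
# Example, run from the repository root:
# total = count_fasta_records("data/gene_unknown/unknown_aa_seqs.fasta")
```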

Tables 1-2: CLEAN Enzyme Classification

Required files:

  • clean_selection/clean_new_v_ec_cluster.npy
  • Additional CLEAN data from Zenodo

Tables 4-6: DALI Prefiltering

Required files:

  • SCOPe domain data
  • DALI Z-scores
  • AFDB embeddings

What to Add to Zenodo

If you're updating Zenodo, include:

  1. Essential (required for paper verification):

    • lookup_embeddings.npy
    • lookup_embeddings_meta_data.tsv
    • pfam_new_proteins.npy

  2. Optional (for full experiments):

    • afdb_embeddings_protein_vec.npy (4.7 GB) - AlphaFold DB embeddings
    • CLEAN embeddings
    • SCOPe/DALI data

What to Add to GitHub

Keep in GitHub (small files):

  • data/gene_unknown/*.fasta - Query sequences
  • data/gene_unknown/*.npy - Pre-computed query embeddings (< 1 MB)
  • results/*.csv - Result summaries
  • protein_vec_models/*.py - Model code (NOT weights)
  • protein_vec_models/*.json - Model configs

Add to .gitignore (large files):

*.ckpt
data/*.npy
data/*.tsv
protein_vec_models.gz
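
Before committing, it can help to scan the working tree for files that are too large for git. The sketch below is one possible check. The 1 MB default threshold is an assumption taken from the "< 1 MB" note on query embeddings, and `find_large_files` is a hypothetical helper, not part of cpr.

```python
import os


def find_large_files(root, limit_bytes=1_000_000):
    """Yield (path, size) for regular files under root larger than limit_bytes.

    The .git directory is skipped, and so are symbolic links.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune .git in place so os.walk does not descend into it.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            size = os.path.getsize(path)
            if size > limit_bytes:
                yield path, size


# Example, run from the repository root:
# for path, size in find_large_files("."):
#     print(f"{size / 1e6:8.1f} MB  {path}")
```

Any file it reports that is not covered by the .gitignore patterns is a candidate for Zenodo rather than GitHub.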

Verification Checklist

After setting up data, verify with:

# Check file sizes
ls -lh data/*.npy

# Expected:
# lookup_embeddings.npy      ~1.1 GB
# pfam_new_proteins.npy      ~2.4 GB

# Run verification
cpr verify --check fdr    # Tests algorithm
cpr verify --check syn30  # Tests paper result (39.6%)
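
The file-size check can also be scripted. This is a minimal sketch, assuming the approximate sizes listed in this document are in decimal units. The 15% tolerance is an arbitrary choice, loose enough to cover rounding and the GB versus GiB ambiguity. `EXPECTED_SIZES` and `check_sizes` are hypothetical names, not part of cpr.

```python
import os

# Approximate sizes in bytes, taken from this document (decimal units assumed).
EXPECTED_SIZES = {
    "data/lookup_embeddings.npy": 1.1e9,
    "data/pfam_new_proteins.npy": 2.4e9,
    "data/lookup_embeddings_meta_data.tsv": 535e6,
}


def check_sizes(expected, root=".", tolerance=0.15):
    """Return {relative_path: status}, where status is 'ok', 'missing', or 'size mismatch'."""
    report = {}
    for rel_path, want in expected.items():
        path = os.path.join(root, rel_path)
        if not os.path.isfile(path):
            report[rel_path] = "missing"
        elif abs(os.path.getsize(path) - want) > tolerance * want:
            report[rel_path] = "size mismatch"
        else:
            report[rel_path] = "ok"
    return report


# Example, run from the repository root:
# for path, status in check_sizes(EXPECTED_SIZES).items():
#     print(f"{status:14s} {path}")
```

A "size mismatch" usually points to an interrupted download. This check only compares sizes and does not verify file contents.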