Upload Checklist: What Goes Where

This document specifies exactly which files go to GitHub, which go to Zenodo, and which users must obtain separately.

Summary

| Location | What | Why |
| --- | --- | --- |
| GitHub | Code, small data (<1MB), configs | Version control, collaboration |
| Zenodo | Large data files (>1MB), embeddings | Long-term archival, DOI |
| User obtains | Protein-Vec model weights | Large binary, separate distribution |

GitHub Repository (You Commit This)

Code & Configuration

protein_conformal/          # All Python code
├── __init__.py
├── cli.py
├── util.py
├── scope_utils.py
├── embed_protein_vec.py
├── gradio_app.py
└── backend/

scripts/                    # Helper scripts
├── verify_*.py
├── compute_fdr_table.py
├── slurm_*.sh
└── *.py

tests/                      # Test suite
notebooks/                  # Analysis notebooks
docs/                       # Documentation

Small Data Files (<1MB each)

data/gene_unknown/
├── unknown_aa_seqs.fasta   # 56 KB - JCVI Syn3.0 sequences
├── unknown_aa_seqs.npy     # 299 KB - Pre-computed embeddings
└── jcvi_syn30_unknown_gene_hits.csv  # 61 KB - Results

results/
├── fdr_thresholds.csv      # ~2 KB - Threshold lookup table
├── fnr_thresholds.csv      # ~7 KB - FNR thresholds
└── sim2prob_lookup.csv     # ~8 KB - Probability lookup

Configuration & Docs

pyproject.toml
setup.py
Dockerfile
apptainer.def
README.md
GETTING_STARTED.md
DATA.md
CLAUDE.md
docs/REPRODUCIBILITY.md
.gitignore

Model Code (NOT weights)

protein_vec_models/
├── model_protein_moe.py      # Model architecture code
├── utils_search.py           # Embedding utilities
├── data_protein_vec.py       # Data loading code
├── embed_structure_model.py
├── model_protein_vec_single_variable.py
├── train_protein_vec.py
├── __init__.py
└── *.json                    # Config files only

Zenodo Repository (You Upload This)

Zenodo URL: https://zenodo.org/records/14272215

Essential Files (Required for paper verification)

| File | Size | Description |
| --- | --- | --- |
| lookup_embeddings.npy | 1.1 GB | UniProt database embeddings (540K proteins) |
| lookup_embeddings_meta_data.tsv | 535 MB | Protein metadata (names, Pfam domains, etc.) |
| pfam_new_proteins.npy | 2.4 GB | Calibration data for FDR/probability |
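
The essential files above could be fetched with a short script. This is a minimal sketch, not part of the `cpr` tooling: the helper names `file_url` and `download` are hypothetical, and the `<record>/files/<name>?download=1` URL pattern is an assumption about how Zenodo serves record files, so confirm the links on the record page before relying on it. The `data/` destination matches the paths used in the verification section.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import quote

RECORD_URL = "https://zenodo.org/records/14272215"  # from this checklist

ESSENTIAL_FILES = [
    "lookup_embeddings.npy",
    "lookup_embeddings_meta_data.tsv",
    "pfam_new_proteins.npy",
]


def file_url(name: str) -> str:
    # ASSUMPTION: files are served at <record>/files/<name>?download=1.
    return f"{RECORD_URL}/files/{quote(name)}?download=1"


def download(name: str, dest_dir: str = "data") -> Path:
    """Stream one file into dest_dir, skipping it if it is already present."""
    dest = Path(dest_dir) / name
    if dest.exists():
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download
    # is never mistaken for a complete file on the next run.
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(file_url(name)) as resp, open(tmp, "wb") as out:
        shutil.copyfileobj(resp, out)
    tmp.replace(dest)
    return dest
```

Calling `download(name)` for each entry in `ESSENTIAL_FILES` would place about 4 GB under `data/`, so check free disk space first.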

Optional Files (For extended experiments)

| File | Size | Description |
| --- | --- | --- |
| afdb_embeddings_protein_vec.npy | 4.7 GB | AlphaFold DB embeddings |
| CLEAN enzyme data | varies | For Tables 1-2 reproduction |
| SCOPe/DALI data | varies | For Tables 4-6 reproduction |

User Must Obtain Separately

Protein-Vec Model Weights (~3 GB)

These are NOT in GitHub or Zenodo. Users either obtain them from the authors or skip them entirely:

  1. Option A: Contact authors for protein_vec_models.gz
  2. Option B: Use pre-computed embeddings from Zenodo (no weights needed for searching)

Files needed if embedding new sequences:

protein_vec_models/
├── protein_vec.ckpt          # 804 MB - Main model
├── protein_vec_params.json   # Config
├── aspect_vec_*.ckpt         # 200-400 MB each - Aspect models
└── tm_vec_swiss_model_large.ckpt  # 391 MB
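
Before trying to embed new sequences, it can help to confirm that the weight files listed above are in place. The sketch below is illustrative only: `missing_weights` is a hypothetical helper, and because this checklist does not say how many `aspect_vec_*.ckpt` files exist, it only checks that at least one is present.

```python
from pathlib import Path

# Fixed file names taken from the listing above.
REQUIRED_WEIGHTS = [
    "protein_vec.ckpt",
    "protein_vec_params.json",
    "tm_vec_swiss_model_large.ckpt",
]


def missing_weights(model_dir: str = "protein_vec_models") -> list:
    """Return the names (or glob patterns) of weight files that were not found."""
    root = Path(model_dir)
    missing = [name for name in REQUIRED_WEIGHTS if not (root / name).is_file()]
    # ASSUMPTION: at least one aspect model is needed; the exact count is not documented here.
    if not any(root.glob("aspect_vec_*.ckpt")):
        missing.append("aspect_vec_*.ckpt")
    return missing
```

An empty list means the files named in this checklist are present. It does not validate their contents or sizes.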

CLEAN Model Weights (if using --model clean)

Get from: https://github.com/tttianhao/CLEAN


.gitignore Must Include

# Large data files (on Zenodo)
data/*.npy
data/*.tsv
data/*.pkl

# Model weights (user obtains separately)
protein_vec_models/*.ckpt
protein_vec_models.gz

# Build artifacts
*.sif
.apptainer_cache/
logs/
.claude/
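
One way to confirm that a `.gitignore` contains every entry above is a small check like the following. It is a sketch under a simplifying assumption: it compares lines literally, so an equivalent but differently written pattern (for example a broader `*.ckpt`) would be reported as missing. `missing_patterns` is a hypothetical helper name.

```python
# Patterns copied from the list above.
REQUIRED_PATTERNS = [
    "data/*.npy",
    "data/*.tsv",
    "data/*.pkl",
    "protein_vec_models/*.ckpt",
    "protein_vec_models.gz",
    "*.sif",
    ".apptainer_cache/",
    "logs/",
    ".claude/",
]


def missing_patterns(gitignore_text: str) -> list:
    """Return required patterns that do not appear on a line of their own."""
    present = {
        line.strip()
        for line in gitignore_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    }
    return [pattern for pattern in REQUIRED_PATTERNS if pattern not in present]
```

For example, `missing_patterns(open(".gitignore").read())` returns an empty list when all nine entries are present.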

Verification: Is Everything Set Up Correctly?

Run these checks after cloning the repository and downloading the Zenodo files:

# Check GitHub files present
ls data/gene_unknown/unknown_aa_seqs.fasta  # Should exist
ls results/fdr_thresholds.csv               # Should exist

# Check Zenodo files downloaded
ls -lh data/lookup_embeddings.npy           # Should be ~1.1 GB
ls -lh data/pfam_new_proteins.npy           # Should be ~2.4 GB

# Check model weights (if embedding)
ls protein_vec_models/protein_vec.ckpt      # Should exist if embedding

# Run verification
cpr verify --check syn30
# Expected: 58-60 of 149 hits (59/149 = 39.6%)
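
The file checks above can also be scripted. The following is a minimal sketch rather than a replacement for `cpr verify`: `check_setup` is a hypothetical helper, the sizes are the approximate figures from this checklist, and the 20% tolerance is an arbitrary assumption chosen to absorb rounding and GB versus GiB differences.

```python
from pathlib import Path

# (relative path, approximate size in bytes or None if only existence matters)
EXPECTED = [
    ("data/gene_unknown/unknown_aa_seqs.fasta", None),
    ("results/fdr_thresholds.csv", None),
    ("data/lookup_embeddings.npy", 1.1e9),
    ("data/pfam_new_proteins.npy", 2.4e9),
]


def check_setup(root: str = ".", tolerance: float = 0.2) -> list:
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    for rel, approx in EXPECTED:
        path = Path(root) / rel
        if not path.is_file():
            problems.append(f"missing: {rel}")
            continue
        if approx is not None:
            size = path.stat().st_size
            # Flag files whose size is far from the documented value,
            # which usually indicates a truncated download.
            if abs(size - approx) > tolerance * approx:
                problems.append(f"unexpected size for {rel}: {size} bytes")
    return problems
```

Run it from the repository root with `print(check_setup())`. The model weights are left out because they are only needed when embedding new sequences.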

For Repository Maintainers

When releasing a new version:

  1. GitHub:

    • Commit all code changes
    • Update results/fdr_thresholds.csv with new calibration
    • Tag release: git tag v1.x.x
  2. Zenodo:

    • Upload updated embedding files if changed
    • Create new version linked to GitHub release

Files to NEVER commit to GitHub:

  • Any .npy file > 1 MB
  • Any .ckpt file (model weights)
  • Any .pkl file > 1 MB
  • Any .tsv or .csv > 1 MB
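
These rules are easy to check automatically before a commit. The sketch below is one possible implementation, not an existing script in this repository: the names `violates_checklist` and `find_violations` are hypothetical, and "1 MB" is read as 10**6 bytes, which is an assumption.

```python
from pathlib import Path

ONE_MB = 1_000_000  # ASSUMPTION: "1 MB" interpreted as 10**6 bytes

# Size limit per extension; None means "never commit, regardless of size".
LIMITS = {
    ".npy": ONE_MB,
    ".pkl": ONE_MB,
    ".tsv": ONE_MB,
    ".csv": ONE_MB,
    ".ckpt": None,
}


def violates_checklist(path: Path, size_bytes: int) -> bool:
    """Return True if a file with this name and size should not be committed."""
    suffix = path.suffix.lower()
    if suffix not in LIMITS:
        return False
    limit = LIMITS[suffix]
    return limit is None or size_bytes > limit


def find_violations(paths):
    """Yield the existing files among `paths` that break the rules above."""
    for candidate in map(Path, paths):
        if candidate.is_file() and violates_checklist(candidate, candidate.stat().st_size):
            yield candidate
```

To use it as a pre-commit check, pass it the file names printed by `git diff --cached --name-only` and abort the commit if it yields anything.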