Spaces:

LoocasGoose
/

cpr

Running

App Files Files Community

cpr / UPLOAD_CHECKLIST.md

ronboger

feat: add one-step cpr find command and FDR-level search

7453ae1 4 months ago

preview code

raw

history blame contribute delete

4.82 kB

A newer version of the Gradio SDK is available: 6.14.0

Upgrade

Upload Checklist: What Goes Where

This document specifies exactly what files go to GitHub vs Zenodo.

Summary

Location	What	Why
GitHub	Code, small data (<1MB), configs	Version control, collaboration
Zenodo	Large data files (>1MB), embeddings	Long-term archival, DOI
User obtains	Protein-Vec model weights	Large binary, separate distribution

GitHub Repository (You Commit This)

Code & Configuration

protein_conformal/          # All Python code
├── __init__.py
├── cli.py
├── util.py
├── scope_utils.py
├── embed_protein_vec.py
├── gradio_app.py
└── backend/

scripts/                    # Helper scripts
├── verify_*.py
├── compute_fdr_table.py
├── slurm_*.sh
└── *.py

tests/                      # Test suite
notebooks/                  # Analysis notebooks
docs/                       # Documentation

Small Data Files (<1MB each)

data/gene_unknown/
├── unknown_aa_seqs.fasta   # 56 KB - JCVI Syn3.0 sequences
├── unknown_aa_seqs.npy     # 299 KB - Pre-computed embeddings
└── jcvi_syn30_unknown_gene_hits.csv  # 61 KB - Results

results/
├── fdr_thresholds.csv      # ~2 KB - Threshold lookup table
├── fnr_thresholds.csv      # ~7 KB - FNR thresholds
└── sim2prob_lookup.csv     # ~8 KB - Probability lookup

Configuration & Docs

pyproject.toml
setup.py
Dockerfile
apptainer.def
README.md
GETTING_STARTED.md
DATA.md
CLAUDE.md
docs/REPRODUCIBILITY.md
.gitignore

Model Code (NOT weights)

protein_vec_models/
├── model_protein_moe.py      # Model architecture code
├── utils_search.py           # Embedding utilities
├── data_protein_vec.py       # Data loading code
├── embed_structure_model.py
├── model_protein_vec_single_variable.py
├── train_protein_vec.py
├── __init__.py
└── *.json                    # Config files only

Zenodo Repository (You Upload This)

Zenodo URL: https://zenodo.org/records/14272215

Essential Files (Required for paper verification)

File	Size	Description
`lookup_embeddings.npy`	1.1 GB	UniProt database embeddings (540K proteins)
`lookup_embeddings_meta_data.tsv`	535 MB	Protein metadata (names, Pfam domains, etc.)
`pfam_new_proteins.npy`	2.4 GB	Calibration data for FDR/probability

Optional Files (For extended experiments)

File	Size	Description
`afdb_embeddings_protein_vec.npy`	4.7 GB	AlphaFold DB embeddings
CLEAN enzyme data	varies	For Tables 1-2 reproduction
SCOPe/DALI data	varies	For Tables 4-6 reproduction

User Must Obtain Separately

Protein-Vec Model Weights (~3 GB)

These are NOT in GitHub or Zenodo. Users get them by:

Option A: Contact authors for protein_vec_models.gz
Option B: Use pre-computed embeddings from Zenodo (no weights needed for searching)

Files needed if embedding new sequences:

protein_vec_models/
├── protein_vec.ckpt          # 804 MB - Main model
├── protein_vec_params.json   # Config
├── aspect_vec_*.ckpt         # 200-400 MB each - Aspect models
└── tm_vec_swiss_model_large.ckpt  # 391 MB

CLEAN Model Weights (if using --model clean)

Get from: https://github.com/tttianhao/CLEAN

.gitignore Must Include

# Large data files (on Zenodo)
data/*.npy
data/*.tsv
data/*.pkl

# Model weights (user obtains separately)
protein_vec_models/*.ckpt
protein_vec_models.gz

# Build artifacts
*.sif
.apptainer_cache/
logs/
.claude/

Verification: Is Everything Set Up Correctly?

Run this after cloning + downloading:

# Check GitHub files present
ls data/gene_unknown/unknown_aa_seqs.fasta  # Should exist
ls results/fdr_thresholds.csv               # Should exist

# Check Zenodo files downloaded
ls -lh data/lookup_embeddings.npy           # Should be ~1.1 GB
ls -lh data/pfam_new_proteins.npy           # Should be ~2.4 GB

# Check model weights (if embedding)
ls protein_vec_models/protein_vec.ckpt      # Should exist if embedding

# Run verification
cpr verify --check syn30
# Expected: 58-60/149 hits (39.6%)

For Repository Maintainers

When releasing a new version:

GitHub:
- Commit all code changes
- Update results/fdr_thresholds.csv with new calibration
- Tag release: git tag v1.x.x
Zenodo:
- Upload updated embedding files if changed
- Create new version linked to GitHub release

Files to NEVER commit to GitHub:

Any .npy file > 1 MB
Any .ckpt file (model weights)
Any .pkl file > 1 MB
Any .tsv or .csv > 1 MB