# Upload Checklist: What Goes Where
This document specifies exactly which files go to GitHub, which go to Zenodo, and which users must obtain separately.
## Summary
| Location | What | Why |
|----------|------|-----|
| **GitHub** | Code, small data (<1MB), configs | Version control, collaboration |
| **Zenodo** | Large data files (>1MB), embeddings | Long-term archival, DOI |
| **User obtains** | Protein-Vec model weights | Large binary, separate distribution |
---
## GitHub Repository (You Commit This)
### Code & Configuration
```
protein_conformal/  # All Python code
├── __init__.py
├── cli.py
├── util.py
├── scope_utils.py
├── embed_protein_vec.py
├── gradio_app.py
└── backend/
scripts/  # Helper scripts
├── verify_*.py
├── compute_fdr_table.py
├── slurm_*.sh
└── *.py
tests/ # Test suite
notebooks/ # Analysis notebooks
docs/ # Documentation
```
### Small Data Files (<1MB each)
```
data/gene_unknown/
├── unknown_aa_seqs.fasta  # 56 KB - JCVI Syn3.0 sequences
├── unknown_aa_seqs.npy  # 299 KB - Pre-computed embeddings
└── jcvi_syn30_unknown_gene_hits.csv  # 61 KB - Results
results/
├── fdr_thresholds.csv  # ~2 KB - Threshold lookup table
├── fnr_thresholds.csv  # ~7 KB - FNR thresholds
└── sim2prob_lookup.csv  # ~8 KB - Probability lookup
```
### Configuration & Docs
```
pyproject.toml
setup.py
Dockerfile
apptainer.def
README.md
GETTING_STARTED.md
DATA.md
CLAUDE.md
docs/REPRODUCIBILITY.md
.gitignore
```
### Model Code (NOT weights)
```
protein_vec_models/
├── model_protein_moe.py  # Model architecture code
├── utils_search.py  # Embedding utilities
├── data_protein_vec.py  # Data loading code
├── embed_structure_model.py
├── model_protein_vec_single_variable.py
├── train_protein_vec.py
├── __init__.py
└── *.json  # Config files only
```
---
## Zenodo Repository (You Upload This)
**Zenodo URL**: https://zenodo.org/records/14272215
### Essential Files (Required for paper verification)
| File | Size | Description |
|------|------|-------------|
| `lookup_embeddings.npy` | **1.1 GB** | UniProt database embeddings (540K proteins) |
| `lookup_embeddings_meta_data.tsv` | **535 MB** | Protein metadata (names, Pfam domains, etc.) |
| `pfam_new_proteins.npy` | **2.4 GB** | Calibration data for FDR/probability |
### Optional Files (For extended experiments)
| File | Size | Description |
|------|------|-------------|
| `afdb_embeddings_protein_vec.npy` | 4.7 GB | AlphaFold DB embeddings |
| CLEAN enzyme data | varies | For Tables 1-2 reproduction |
| SCOPe/DALI data | varies | For Tables 4-6 reproduction |
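
Because these files are several gigabytes each, a truncated download is easy to miss. The following is a minimal sketch, assuming you copy the MD5 checksum that Zenodo lists for each file on the record page. The function names `md5_of_file` and `matches_checksum` are illustrative and are not part of the `cpr` tooling.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so multi-GB files never have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Return True if the file exists and its MD5 equals the expected value."""
    expected = expected_md5.strip().lower()
    return Path(path).is_file() and md5_of_file(path) == expected
```

For example, `matches_checksum("data/lookup_embeddings.npy", "<md5 from the Zenodo record page>")` should return `True` after a complete download; the `data/` location follows the verification commands later in this document.
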
---
## User Must Obtain Separately
### Protein-Vec Model Weights (~3 GB)
These are NOT on GitHub or Zenodo. Users have two options:
1. **Option A**: Contact authors for `protein_vec_models.gz`
2. **Option B**: Use pre-computed embeddings from Zenodo (no weights needed for searching)
Files needed if embedding new sequences:
```
protein_vec_models/
├── protein_vec.ckpt  # 804 MB - Main model
├── protein_vec_params.json  # Config
├── aspect_vec_*.ckpt  # 200-400 MB each - Aspect models
└── tm_vec_swiss_model_large.ckpt  # 391 MB
```
### CLEAN Model Weights (if using `--model clean`)
Get from: https://github.com/tttianhao/CLEAN
---
## .gitignore Must Include
```gitignore
# Large data files (on Zenodo)
data/*.npy
data/*.tsv
data/*.pkl
# Model weights (user obtains separately)
protein_vec_models/*.ckpt
protein_vec_models.gz
# Build artifacts
*.sif
.apptainer_cache/
logs/
.claude/
```
---
## Verification: Is Everything Set Up Correctly?
Run these checks after cloning the repository and downloading the Zenodo files:
```bash
# Check GitHub files present
ls data/gene_unknown/unknown_aa_seqs.fasta # Should exist
ls results/fdr_thresholds.csv # Should exist
# Check Zenodo files downloaded
ls -lh data/lookup_embeddings.npy # Should be ~1.1 GB
ls -lh data/pfam_new_proteins.npy # Should be ~2.4 GB
# Check model weights (if embedding)
ls protein_vec_models/protein_vec.ckpt # Should exist if embedding
# Run verification
cpr verify --check syn30
# Expected: 58-60/149 hits (59/149 = 39.6%)
```
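
The file checks above can also be scripted so they return a list of problems instead of relying on reading `ls` output. This is a minimal sketch, not part of the `cpr` CLI: the paths and approximate sizes are copied from this checklist, while the `check_layout` name and the 20% size tolerance are assumptions (the tolerance is deliberately loose because the listed sizes are rounded and the checklist does not say whether GB is decimal or binary).

```python
from pathlib import Path

GB = 1e9  # decimal gigabyte; an assumption, see the tolerance note above

# Relative path -> approximate expected size in bytes (None = existence only).
EXPECTED = {
    "data/gene_unknown/unknown_aa_seqs.fasta": None,  # from GitHub
    "results/fdr_thresholds.csv": None,               # from GitHub
    "data/lookup_embeddings.npy": 1.1 * GB,           # from Zenodo
    "data/pfam_new_proteins.npy": 2.4 * GB,           # from Zenodo
}


def check_layout(root=".", expected=EXPECTED, tolerance=0.20):
    """Return a list of problem descriptions; an empty list means all checks passed."""
    problems = []
    for rel_path, size in expected.items():
        path = Path(root) / rel_path
        if not path.is_file():
            problems.append(f"missing: {rel_path}")
        elif size is not None:
            actual = path.stat().st_size
            # Flag files whose size is far from the value listed in this checklist.
            if abs(actual - size) > tolerance * size:
                problems.append(f"unexpected size for {rel_path}: {actual} bytes")
    return problems
```

Calling `check_layout()` from the repository root and printing each returned entry gives a quick pass/fail summary. It does not replace `cpr verify --check syn30`, which checks results rather than file presence.
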
---
## For Repository Maintainers
### When releasing a new version:
1. **GitHub**:
- Commit all code changes
- Update `results/fdr_thresholds.csv` with new calibration
- Tag release: `git tag v1.x.x`
2. **Zenodo**:
- Upload updated embedding files if changed
- Create new version linked to GitHub release
### Files to NEVER commit to GitHub:
- Any `.npy` file > 1 MB
- Any `.ckpt` file (model weights)
- Any `.pkl` file > 1 MB
- Any `.tsv` or `.csv` > 1 MB
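
The rules in this list can be expressed as a small check that a maintainer could run before committing. This is a sketch under stated assumptions: the helper names `violates_policy` and `find_violations` are hypothetical, 1 MB is taken as 1,000,000 bytes, and nothing here is part of the repository's existing tooling.

```python
from pathlib import Path

MAX_BYTES = 1_000_000  # the 1 MB limit from this checklist (decimal MB assumed)
SIZE_LIMITED = {".npy", ".pkl", ".tsv", ".csv"}  # allowed only when small
ALWAYS_BLOCKED = {".ckpt"}  # model weights are never committed


def violates_policy(path, size_bytes):
    """Return True if a file with this name and size must not be committed."""
    suffix = Path(path).suffix.lower()
    if suffix in ALWAYS_BLOCKED:
        return True
    return suffix in SIZE_LIMITED and size_bytes > MAX_BYTES


def find_violations(paths):
    """Filter an iterable of existing file paths down to policy violations."""
    return [str(p) for p in paths if violates_policy(p, Path(p).stat().st_size)]
```

One way to use it is to pass the file names printed by `git diff --cached --name-only` to `find_violations` and abort the commit if the returned list is not empty.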