# Upload Checklist: What Goes Where
This document specifies which files belong on GitHub, which belong on Zenodo, and which users must obtain separately.
## Summary
| Location | What | Why |
|----------|------|-----|
| **GitHub** | Code, small data (<1MB), configs | Version control, collaboration |
| **Zenodo** | Large data files (>1MB), embeddings | Long-term archival, DOI |
| **User obtains** | Protein-Vec model weights | Large binary, separate distribution |
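
The routing rule in this table can be written as a small function. This is a minimal sketch, not code from the repository: the name `destination`, reading "1MB" as 1,000,000 bytes, and treating every `.ckpt` file as a model weight are assumptions made for illustration.

```python
from pathlib import PurePosixPath

ONE_MB = 1_000_000  # assumption: the checklist's "1MB" read as 10**6 bytes


def destination(path: str, size_bytes: int) -> str:
    """Return where a file belongs: 'github', 'zenodo', or 'user'."""
    suffix = PurePosixPath(path).suffix.lower()
    if suffix == ".ckpt":
        # Model weights are distributed separately, whatever their size.
        return "user"
    if size_bytes > ONE_MB:
        # Large data files and embeddings are archived on Zenodo.
        return "zenodo"
    # Code, configs, and small data stay under version control.
    return "github"


if __name__ == "__main__":
    # Sizes below are the approximate figures quoted in this checklist.
    print(destination("data/gene_unknown/unknown_aa_seqs.fasta", 56_000))
    print(destination("data/lookup_embeddings.npy", 1_100_000_000))
    print(destination("protein_vec_models/protein_vec.ckpt", 804_000_000))
```

The `.ckpt` check runs first because weights are excluded from both GitHub and Zenodo regardless of size.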
---
## GitHub Repository (You Commit This)
### Code & Configuration
```
protein_conformal/           # All Python code
├── __init__.py
├── cli.py
├── util.py
├── scope_utils.py
├── embed_protein_vec.py
├── gradio_app.py
└── backend/
scripts/                     # Helper scripts
├── verify_*.py
├── compute_fdr_table.py
├── slurm_*.sh
└── *.py
tests/                       # Test suite
notebooks/                   # Analysis notebooks
docs/                        # Documentation
```
### Small Data Files (<1MB each)
```
data/gene_unknown/
├── unknown_aa_seqs.fasta                # 56 KB - JCVI Syn3.0 sequences
├── unknown_aa_seqs.npy                  # 299 KB - Pre-computed embeddings
└── jcvi_syn30_unknown_gene_hits.csv     # 61 KB - Results
results/
├── fdr_thresholds.csv                   # ~2 KB - Threshold lookup table
├── fnr_thresholds.csv                   # ~7 KB - FNR thresholds
└── sim2prob_lookup.csv                  # ~8 KB - Probability lookup
```
### Configuration & Docs
```
pyproject.toml
setup.py
Dockerfile
apptainer.def
README.md
GETTING_STARTED.md
DATA.md
CLAUDE.md
docs/REPRODUCIBILITY.md
.gitignore
```
### Model Code (NOT weights)
```
protein_vec_models/
├── model_protein_moe.py                    # Model architecture code
├── utils_search.py                         # Embedding utilities
├── data_protein_vec.py                     # Data loading code
├── embed_structure_model.py
├── model_protein_vec_single_variable.py
├── train_protein_vec.py
├── __init__.py
└── *.json                                  # Config files only
```
---
## Zenodo Repository (You Upload This)
**Zenodo URL**: https://zenodo.org/records/14272215
### Essential Files (Required for paper verification)
| File | Size | Description |
|------|------|-------------|
| `lookup_embeddings.npy` | **1.1 GB** | UniProt database embeddings (540K proteins) |
| `lookup_embeddings_meta_data.tsv` | **535 MB** | Protein metadata (names, Pfam domains, etc.) |
| `pfam_new_proteins.npy` | **2.4 GB** | Calibration data for FDR/probability |
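
The essential files could be fetched programmatically along the following lines. This is a sketch under assumptions: the `/files/<name>?download=1` URL layout is how Zenodo record pages commonly expose files and is not stated in this document, and the `data/` target directory is taken from the verification commands in this checklist. `file_url` and `download` are hypothetical helper names.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import quote

RECORD_URL = "https://zenodo.org/records/14272215"
ESSENTIAL_FILES = [
    "lookup_embeddings.npy",
    "lookup_embeddings_meta_data.tsv",
    "pfam_new_proteins.npy",
]


def file_url(name: str, record_url: str = RECORD_URL) -> str:
    # Assumed Zenodo layout: <record>/files/<name>?download=1
    return f"{record_url}/files/{quote(name)}?download=1"


def download(name: str, dest_dir: str = "data") -> Path:
    """Stream one file into dest_dir, skipping it if already present."""
    target = Path(dest_dir) / name
    if target.exists():
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name so an interrupted download is never
    # mistaken for a complete file.
    partial = target.with_name(target.name + ".part")
    with urllib.request.urlopen(file_url(name)) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)  # copies in chunks, not all in memory
    partial.replace(target)
    return target
```

Calling `download(name)` for each entry in `ESSENTIAL_FILES` would place roughly 4 GB under `data/`, so check free disk space first.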
### Optional Files (For extended experiments)
| File | Size | Description |
|------|------|-------------|
| `afdb_embeddings_protein_vec.npy` | 4.7 GB | AlphaFold DB embeddings |
| CLEAN enzyme data | varies | For Tables 1-2 reproduction |
| SCOPe/DALI data | varies | For Tables 4-6 reproduction |
---
## User Must Obtain Separately
### Protein-Vec Model Weights (~3 GB)
These are NOT in GitHub or Zenodo. Users have two options:
1. **Option A**: Contact the authors for `protein_vec_models.gz`.
2. **Option B**: Skip the weights and use the pre-computed embeddings from Zenodo. Weights are only needed to embed new sequences, not to search.
Files needed if embedding new sequences:
```
protein_vec_models/
├── protein_vec.ckpt                 # 804 MB - Main model
├── protein_vec_params.json          # Config
├── aspect_vec_*.ckpt                # 200-400 MB each - Aspect models
└── tm_vec_swiss_model_large.ckpt    # 391 MB
```
### CLEAN Model Weights (if using --model clean)
Get from: https://github.com/tttianhao/CLEAN
---
## .gitignore Must Include
```gitignore
# Large data files (on Zenodo)
data/*.npy
data/*.tsv
data/*.pkl
# Model weights (user obtains separately)
protein_vec_models/*.ckpt
protein_vec_models.gz
# Build artifacts
*.sif
.apptainer_cache/
logs/
.claude/
```
---
## Verification: Is Everything Set Up Correctly?
Run these checks after cloning the repository and downloading the Zenodo files:
```bash
# Check GitHub files present
ls data/gene_unknown/unknown_aa_seqs.fasta # Should exist
ls results/fdr_thresholds.csv # Should exist
# Check Zenodo files downloaded
ls -lh data/lookup_embeddings.npy # Should be ~1.1 GB
ls -lh data/pfam_new_proteins.npy # Should be ~2.4 GB
# Check model weights (if embedding)
ls protein_vec_models/protein_vec.ckpt # Should exist if embedding
# Run verification
cpr verify --check syn30
# Expected: 58-60/149 hits (59/149 = 39.6%)
```
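
The file checks above can also be scripted so that a missing or truncated download is reported explicitly. A minimal sketch, assuming a 10% size tolerance (the checklist only gives approximate sizes) and decimal gigabytes; `check_files` is a hypothetical helper, not part of the `cpr` CLI.

```python
from pathlib import Path

GB = 10**9  # assumption: sizes in this checklist are decimal gigabytes

# Relative path -> expected size in bytes, or None when only existence matters.
EXPECTED = {
    "data/gene_unknown/unknown_aa_seqs.fasta": None,
    "results/fdr_thresholds.csv": None,
    "data/lookup_embeddings.npy": 1.1 * GB,
    "data/pfam_new_proteins.npy": 2.4 * GB,
}


def check_files(root, expected=EXPECTED, tolerance=0.10):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    for rel_path, size in expected.items():
        path = Path(root) / rel_path
        if not path.is_file():
            problems.append(f"missing: {rel_path}")
        elif size is not None:
            actual = path.stat().st_size
            if abs(actual - size) > tolerance * size:
                problems.append(f"unexpected size: {rel_path} ({actual} bytes)")
    return problems


if __name__ == "__main__":
    for problem in check_files("."):
        print(problem)
```

The tolerance is deliberately loose: `ls -lh` reports binary units, so a file shown as 1.1G is about 1.18 decimal gigabytes.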
---
## For Repository Maintainers
### When releasing a new version:
1. **GitHub**:
   - Commit all code changes
   - Update `results/fdr_thresholds.csv` with new calibration
   - Tag release: `git tag v1.x.x`
2. **Zenodo**:
   - Upload updated embedding files if changed
   - Create new version linked to GitHub release
### Files to NEVER commit to GitHub:
- Any `.npy` file > 1 MB
- Any `.ckpt` file (model weights)
- Any `.pkl` file > 1 MB
- Any `.tsv` or `.csv` > 1 MB
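
These rules lend themselves to an automated check, for example before tagging a release. The sketch below walks a directory and reports violations. It is an illustration, not a script shipped in `scripts/`: the name `forbidden_files` and the reading of 1 MB as 1,000,000 bytes are assumptions. It inspects every file on disk rather than only files tracked by Git, so downloaded Zenodo files under `data/` will be listed even though `.gitignore` already excludes them.

```python
import os

ONE_MB = 1_000_000  # assumption: "1 MB" read as 10**6 bytes
SIZE_LIMITED = {".npy", ".pkl", ".tsv", ".csv"}


def forbidden_files(root):
    """Yield (relative_path, reason) for files that must not be committed."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune Git internals in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            ext = os.path.splitext(name)[1].lower()
            if ext == ".ckpt":
                yield rel, "model weights"
            elif ext in SIZE_LIMITED and os.path.getsize(full) > ONE_MB:
                yield rel, f"{ext} file larger than 1 MB"


if __name__ == "__main__":
    for rel, reason in forbidden_files("."):
        print(f"{rel}: {reason}")
```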