# Upload Checklist: What Goes Where

This document specifies exactly which files go to GitHub and which go to Zenodo.

## Summary

| Location | What | Why |
|----------|------|-----|
| **GitHub** | Code, small data (<1MB), configs | Version control, collaboration |
| **Zenodo** | Large data files (>1MB), embeddings | Long-term archival, DOI |
| **User obtains** | Protein-Vec model weights | Large binary, separate distribution |

---

## GitHub Repository (You Commit This)

### Code & Configuration
```
protein_conformal/          # All Python code
├── __init__.py
├── cli.py
├── util.py
├── scope_utils.py
├── embed_protein_vec.py
├── gradio_app.py
└── backend/

scripts/                    # Helper scripts
├── verify_*.py
├── compute_fdr_table.py
├── slurm_*.sh
└── *.py

tests/                      # Test suite
notebooks/                  # Analysis notebooks
docs/                       # Documentation
```

### Small Data Files (<1MB each)
```
data/gene_unknown/
├── unknown_aa_seqs.fasta   # 56 KB - JCVI Syn3.0 sequences
├── unknown_aa_seqs.npy     # 299 KB - Pre-computed embeddings
└── jcvi_syn30_unknown_gene_hits.csv  # 61 KB - Results

results/
├── fdr_thresholds.csv      # ~2 KB - Threshold lookup table
├── fnr_thresholds.csv      # ~7 KB - FNR thresholds
└── sim2prob_lookup.csv     # ~8 KB - Probability lookup
```

### Configuration & Docs
```
pyproject.toml
setup.py
Dockerfile
apptainer.def
README.md
GETTING_STARTED.md
DATA.md
CLAUDE.md
docs/REPRODUCIBILITY.md
.gitignore
```

### Model Code (NOT weights)
```
protein_vec_models/
├── model_protein_moe.py      # Model architecture code
├── utils_search.py           # Embedding utilities
├── data_protein_vec.py       # Data loading code
├── embed_structure_model.py
├── model_protein_vec_single_variable.py
├── train_protein_vec.py
├── __init__.py
└── *.json                    # Config files only
```

---

## Zenodo Repository (You Upload This)

**Zenodo URL**: https://zenodo.org/records/14272215

### Essential Files (Required for paper verification)

| File | Size | Description |
|------|------|-------------|
| `lookup_embeddings.npy` | **1.1 GB** | UniProt database embeddings (540K proteins) |
| `lookup_embeddings_meta_data.tsv` | **535 MB** | Protein metadata (names, Pfam domains, etc.) |
| `pfam_new_proteins.npy` | **2.4 GB** | Calibration data for FDR/probability |

### Optional Files (For extended experiments)

| File | Size | Description |
|------|------|-------------|
| `afdb_embeddings_protein_vec.npy` | 4.7 GB | AlphaFold DB embeddings |
| CLEAN enzyme data | varies | For Tables 1-2 reproduction |
| SCOPe/DALI data | varies | For Tables 4-6 reproduction |
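
Because these files are multi-gigabyte downloads, it can be worth confirming that each one arrived intact. The following is a minimal sketch, assuming you copy the expected MD5 string by hand from the file listing on the Zenodo record page; the helper names `md5_of_file` and `matches_checksum` are hypothetical and not part of this repository.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so multi-GB files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Compare a file's MD5 with an expected hex string (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `matches_checksum("data/lookup_embeddings.npy", expected)` returns `True` only when the local file hashes to the value you pasted into `expected`.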

---

## User Must Obtain Separately

### Protein-Vec Model Weights (~3 GB)

These are NOT on GitHub or Zenodo. Users have two options:

1. **Option A**: Contact authors for `protein_vec_models.gz`
2. **Option B**: Use pre-computed embeddings from Zenodo (no weights needed for searching)

Files needed if embedding new sequences:
```
protein_vec_models/
├── protein_vec.ckpt          # 804 MB - Main model
├── protein_vec_params.json   # Config
├── aspect_vec_*.ckpt         # 200-400 MB each - Aspect models
└── tm_vec_swiss_model_large.ckpt  # 391 MB
```

### CLEAN Model Weights (if using --model clean)

Get from: https://github.com/tttianhao/CLEAN

---

## .gitignore Must Include

```gitignore
# Large data files (on Zenodo)
data/*.npy
data/*.tsv
data/*.pkl

# Model weights (user obtains separately)
protein_vec_models/*.ckpt
protein_vec_models.gz

# Build artifacts
*.sif
.apptainer_cache/
logs/
.claude/
```

---

## Verification: Is Everything Set Up Correctly?

Run these checks after cloning the repository and downloading the Zenodo files:

```bash
# Check GitHub files present
ls data/gene_unknown/unknown_aa_seqs.fasta  # Should exist
ls results/fdr_thresholds.csv               # Should exist

# Check Zenodo files downloaded
ls -lh data/lookup_embeddings.npy           # Should be ~1.1 GB
ls -lh data/pfam_new_proteins.npy           # Should be ~2.4 GB

# Check model weights (if embedding)
ls protein_vec_models/protein_vec.ckpt      # Should exist if embedding

# Run verification
cpr verify --check syn30
# Expected: 58-60/149 hits (about 39-40%; 59/149 = 39.6%)
```
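
The file checks above can also be scripted so that they report every problem in one pass. This is a rough sketch rather than part of the `cpr` tool: `check_layout`, the `EXPECTED_FILES` table, and the 20% size tolerance are all assumptions made for illustration, with paths and sizes taken from this checklist.

```python
from pathlib import Path

GB = 1024 ** 3  # assumption: sizes quoted in this checklist are approximate

# (relative path, approximate size in bytes or None when only existence matters)
EXPECTED_FILES = [
    ("data/gene_unknown/unknown_aa_seqs.fasta", None),
    ("results/fdr_thresholds.csv", None),
    ("data/lookup_embeddings.npy", 1.1 * GB),
    ("data/pfam_new_proteins.npy", 2.4 * GB),
]


def check_layout(root=".", expected=EXPECTED_FILES, tolerance=0.2):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for rel_path, approx_size in expected:
        path = Path(root) / rel_path
        if not path.is_file():
            problems.append(f"missing: {rel_path}")
            continue
        if approx_size is not None:
            actual = path.stat().st_size
            # Flag files whose size is far from the documented value,
            # which usually points to a truncated download.
            if abs(actual - approx_size) > tolerance * approx_size:
                problems.append(f"unexpected size for {rel_path}: {actual} bytes")
    return problems


if __name__ == "__main__":
    for problem in check_layout():
        print(problem)
```

Run from the repository root, the script prints nothing when all four files are present and roughly the documented size.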

---

## For Repository Maintainers

### When releasing a new version:

1. **GitHub**:
   - Commit all code changes
   - Update `results/fdr_thresholds.csv` with new calibration
   - Tag release: `git tag v1.x.x`

2. **Zenodo**:
   - Upload updated embedding files if changed
   - Create new version linked to GitHub release

### Files to NEVER commit to GitHub:
- Any `.npy` file > 1 MB
- Any `.ckpt` file (model weights)
- Any `.pkl` file > 1 MB
- Any `.tsv` or `.csv` > 1 MB
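
The rules above can be checked mechanically before a commit. The sketch below is one possible approach, not an existing script in this repository: `find_forbidden` is a hypothetical helper, and it assumes "1 MB" means 10**6 bytes.

```python
from pathlib import Path

ONE_MB = 1_000_000  # assumption: "1 MB" is read as 10**6 bytes
SIZE_LIMITED = {".npy", ".pkl", ".tsv", ".csv"}  # blocked only above the limit
ALWAYS_BLOCKED = {".ckpt"}                       # model weights, blocked at any size


def find_forbidden(paths, limit=ONE_MB):
    """Return the paths that the list above says must never be committed."""
    forbidden = []
    for raw in paths:
        path = Path(raw)
        suffix = path.suffix.lower()
        if suffix in ALWAYS_BLOCKED:
            forbidden.append(str(path))
        elif suffix in SIZE_LIMITED and path.is_file() and path.stat().st_size > limit:
            forbidden.append(str(path))
    return forbidden
```

One way to use it is to pass in the file names printed by `git diff --cached --name-only` and abort the commit if the returned list is non-empty.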