ronboger Claude Opus 4.5 committed on
Commit 0d63974 · 1 Parent(s): 174c120

docs: add Protein-Vec model weights Zenodo link with wget commands

New Zenodo record: https://zenodo.org/records/18478696
- protein_vec_models.gz (2.9 GB)

Updated:
- GETTING_STARTED.md: wget/curl commands for model weights
- README.md: consolidated data download section
- DATA.md: complete wget commands for all data

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
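
The download commands this commit adds all follow one Zenodo URL pattern, `https://zenodo.org/records/<record>/files/<name>?download=1`. A minimal sketch of wrapping that pattern in shell helpers follows; `zenodo_url` and `fetch_zenodo` are hypothetical names used for illustration and are not part of the CPR repository.

```bash
# Sketch only: zenodo_url and fetch_zenodo are hypothetical helpers.
zenodo_url() {
  # $1 = Zenodo record ID, $2 = file name within that record
  printf 'https://zenodo.org/records/%s/files/%s?download=1\n' "$1" "$2"
}

fetch_zenodo() {
  # $1 = record ID, $2 = file name, $3 = destination directory (default: .)
  fz_dest="${3:-.}"
  mkdir -p "$fz_dest"
  # Same wget form as the commands in this commit: URL, then -O <output file>
  wget "$(zenodo_url "$1" "$2")" -O "$fz_dest/$2"
}

# Usage (commented out so sourcing this file does not start a download):
# fetch_zenodo 14272215 lookup_embeddings.npy data
# fetch_zenodo 18478696 protein_vec_models.gz
```

Keeping the record ID as a parameter makes it explicit that the data files and the model weights live in two different Zenodo records.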

Files changed (3)
  1. DATA.md +16 -8
  2. GETTING_STARTED.md +26 -3
  3. README.md +14 -5
DATA.md CHANGED
@@ -5,10 +5,15 @@ This document describes the data files needed to run CPR (Conformal Protein Retr
  ## Quick Start

  ```bash
- # 1. Download data from Zenodo
- # Visit: https://zenodo.org/records/14272215
-
- # 2. Extract Protein-Vec models (if not already done)
+ # 1. Download required data files
+ cd data/
+ wget "https://zenodo.org/records/14272215/files/lookup_embeddings.npy?download=1" -O lookup_embeddings.npy
+ wget "https://zenodo.org/records/14272215/files/lookup_embeddings_meta_data.tsv?download=1" -O lookup_embeddings_meta_data.tsv
+ wget "https://zenodo.org/records/14272215/files/pfam_new_proteins.npy?download=1" -O pfam_new_proteins.npy
+ cd ..
+
+ # 2. Download and extract Protein-Vec model weights (for embedding new sequences)
+ wget "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1" -O protein_vec_models.gz
  tar -xzf protein_vec_models.gz

  # 3. Verify setup
@@ -37,9 +42,14 @@ Small files that ARE committed to git:
  | `data/gene_unknown/unknown_aa_seqs.npy` | 299 KB | Pre-computed embeddings for Syn3.0 genes |
  | `data/gene_unknown/jcvi_syn30_unknown_gene_hits.csv` | 61 KB | Results: 59 annotated genes |

- ### Protein-Vec Models
+ ### Protein-Vec Models ([Zenodo #18478696](https://zenodo.org/records/18478696))
+
+ Model weights (2.9 GB compressed):

- Model weights (~3 GB compressed, ~3 GB extracted):
+ ```bash
+ wget "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1" -O protein_vec_models.gz
+ tar -xzf protein_vec_models.gz
+ ```

  | File | Size | Required For |
  |------|------|--------------|
@@ -48,8 +58,6 @@ Model weights (~3 GB compressed, ~3 GB extracted):
  | `aspect_vec_*.ckpt` | ~200-400 MB each | Aspect-specific models |
  | `tm_vec_swiss_model_large.ckpt` | 391 MB | TM-Vec model |

- **Source**: Contact authors or use the `protein_vec_models.gz` archive.
-
  ## Directory Structure

  ```
GETTING_STARTED.md CHANGED
@@ -73,13 +73,36 @@ curl -L -o lookup_embeddings_meta_data.tsv "https://zenodo.org/records/14272215/
  curl -L -o pfam_new_proteins.npy "https://zenodo.org/records/14272215/files/pfam_new_proteins.npy?download=1"
  ```

- ### Optional Downloads
+ ### Protein-Vec Model Weights (Required for embedding new sequences)
+
+ If you want to embed new FASTA sequences (not just use pre-computed embeddings), download the model weights:
+
+ **Zenodo URL**: https://zenodo.org/records/18478696
+
+ ```bash
+ # Download and extract Protein-Vec model weights (2.9 GB compressed)
+ wget "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1" -O protein_vec_models.gz
+
+ # Extract to protein_vec_models/ directory
+ tar -xzf protein_vec_models.gz
+
+ # Verify extraction
+ ls protein_vec_models/
+ # Expected: protein_vec.ckpt, protein_vec_params.json, aspect_vec_*.ckpt, etc.
+ ```
+
+ Or with curl:
+ ```bash
+ curl -L -o protein_vec_models.gz "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1"
+ tar -xzf protein_vec_models.gz
+ ```
+
+ ### Other Optional Downloads

  | File | Size | When you need it |
  |------|------|------------------|
  | `afdb_embeddings_protein_vec.npy` | 4.7 GB | Searching AlphaFold Database |
- | Protein-Vec model weights | 3 GB | Computing new embeddings from FASTA |
- | CLEAN model weights | 1 GB | Enzyme classification with CLEAN |
+ | CLEAN model weights | ~1 GB | Enzyme classification with CLEAN |

  ---

README.md CHANGED
@@ -111,12 +111,21 @@ cpr verify --check clean # CLEAN enzyme classification

  ## Data Files

- Download the following files from [Zenodo](https://zenodo.org/records/14272215) and place in the `data/` directory:
+ ### Required Data ([Zenodo #14272215](https://zenodo.org/records/14272215))

- `pfam_new_proteins.npy` (2.5 GB) - Pfam calibration data for FDR/FNR control
- `lookup_embeddings.npy` (1.1 GB) - UniProt database embeddings (Protein-Vec)
- `lookup_embeddings_meta_data.tsv` - Metadata for lookup database
- `afdb_embeddings_protein_vec.npy` (4.7 GB) - AlphaFold DB embeddings (optional)
+ ```bash
+ cd data/
+ wget "https://zenodo.org/records/14272215/files/lookup_embeddings.npy?download=1" -O lookup_embeddings.npy
+ wget "https://zenodo.org/records/14272215/files/lookup_embeddings_meta_data.tsv?download=1" -O lookup_embeddings_meta_data.tsv
+ wget "https://zenodo.org/records/14272215/files/pfam_new_proteins.npy?download=1" -O pfam_new_proteins.npy
+ ```
+
+ ### Model Weights ([Zenodo #18478696](https://zenodo.org/records/18478696)) - for embedding new sequences
+
+ ```bash
+ wget "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1" -O protein_vec_models.gz
+ tar -xzf protein_vec_models.gz
+ ```

  ## Protein-Vec vs CLEAN Models

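
The GETTING_STARTED.md change verifies extraction by listing `protein_vec_models/` and comparing against an "Expected:" comment. A scripted version of that check might look like the sketch below. `check_models` is a hypothetical helper, not part of the `cpr` CLI, and it tests only the two file names that the comment spells out in full, since the commit does not list the archive's complete contents.

```bash
# Sketch only: check_models is a hypothetical helper, not part of the cpr CLI.
# File names are taken from the "Expected:" comment in GETTING_STARTED.md.
check_models() {
  # $1 = directory the archive was extracted into (default: protein_vec_models)
  cm_dir="${1:-protein_vec_models}"
  cm_missing=0
  for cm_file in protein_vec.ckpt protein_vec_params.json; do
    if [ ! -f "$cm_dir/$cm_file" ]; then
      # Report each missing file on stderr and remember the failure
      echo "missing: $cm_dir/$cm_file" >&2
      cm_missing=1
    fi
  done
  # Exit status 0 if both files are present, 1 otherwise
  return "$cm_missing"
}

# Usage, after running tar -xzf protein_vec_models.gz:
# check_models protein_vec_models && echo "expected model files found"
```

Returning a non-zero status rather than only printing lets the check gate later steps in a setup script.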