Spaces: LoocasGoose/cpr

ronboger Claude Opus 4.5 committed on Feb 4

Commit: 0d63974

1 Parent(s): 174c120

docs: add Protein-Vec model weights Zenodo link with wget commands

New Zenodo record: https://zenodo.org/records/18478696
- protein_vec_models.gz (2.9 GB)

Updated:
- GETTING_STARTED.md: wget/curl commands for model weights
- README.md: consolidated data download section
- DATA.md: complete wget commands for all data

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Files changed (3)

DATA.md +16 -8
GETTING_STARTED.md +26 -3
README.md +14 -5

DATA.md CHANGED

@@ -5,10 +5,15 @@ This document describes the data files needed to run CPR (Conformal Protein Retr
 ## Quick Start
 ```bash
-# 1. Download data from Zenodo
-# Visit: https://zenodo.org/records/14272215
-# 2. Extract Protein-Vec models (if not already done)
 tar -xzf protein_vec_models.gz
 # 3. Verify setup
@@ -37,9 +42,14 @@ Small files that ARE committed to git:
 | `data/gene_unknown/unknown_aa_seqs.npy` | 299 KB | Pre-computed embeddings for Syn3.0 genes |
 | `data/gene_unknown/jcvi_syn30_unknown_gene_hits.csv` | 61 KB | Results: 59 annotated genes |
-### Protein-Vec Models
-Model weights (~3 GB compressed, ~3 GB extracted):
 | File | Size | Required For |
 |------|------|--------------|
@@ -48,8 +58,6 @@ Model weights (~3 GB compressed, ~3 GB extracted):
 | `aspect_vec_*.ckpt` | ~200-400 MB each | Aspect-specific models |
 | `tm_vec_swiss_model_large.ckpt` | 391 MB | TM-Vec model |
-**Source**: Contact authors or use the `protein_vec_models.gz` archive.
 ## Directory Structure
 ```

 ## Quick Start
 ```bash
+# 1. Download required data files
+cd data/
+wget "https://zenodo.org/records/14272215/files/lookup_embeddings.npy?download=1" -O lookup_embeddings.npy
+wget "https://zenodo.org/records/14272215/files/lookup_embeddings_meta_data.tsv?download=1" -O lookup_embeddings_meta_data.tsv
+wget "https://zenodo.org/records/14272215/files/pfam_new_proteins.npy?download=1" -O pfam_new_proteins.npy
+cd ..
+# 2. Download and extract Protein-Vec model weights (for embedding new sequences)
+wget "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1" -O protein_vec_models.gz
 tar -xzf protein_vec_models.gz
 # 3. Verify setup
 | `data/gene_unknown/unknown_aa_seqs.npy` | 299 KB | Pre-computed embeddings for Syn3.0 genes |
 | `data/gene_unknown/jcvi_syn30_unknown_gene_hits.csv` | 61 KB | Results: 59 annotated genes |
+### Protein-Vec Models ([Zenodo #18478696](https://zenodo.org/records/18478696))
+Model weights (2.9 GB compressed):
+```bash
+wget "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1" -O protein_vec_models.gz
+tar -xzf protein_vec_models.gz
+```
 | File | Size | Required For |
 |------|------|--------------|
 | `aspect_vec_*.ckpt` | ~200-400 MB each | Aspect-specific models |
 | `tm_vec_swiss_model_large.ckpt` | 391 MB | TM-Vec model |
 ## Directory Structure
 ```

GETTING_STARTED.md CHANGED

@@ -73,13 +73,36 @@ curl -L -o lookup_embeddings_meta_data.tsv "https://zenodo.org/records/14272215/
 curl -L -o pfam_new_proteins.npy "https://zenodo.org/records/14272215/files/pfam_new_proteins.npy?download=1"
 ```
-### Optional Downloads
 | File | Size | When you need it |
 |------|------|------------------|
 | `afdb_embeddings_protein_vec.npy` | 4.7 GB | Searching AlphaFold Database |
-| Protein-Vec model weights | 3 GB | Computing new embeddings from FASTA |
-| CLEAN model weights | 1 GB | Enzyme classification with CLEAN |
 ---

 curl -L -o pfam_new_proteins.npy "https://zenodo.org/records/14272215/files/pfam_new_proteins.npy?download=1"
 ```
+### Protein-Vec Model Weights (Required for embedding new sequences)
+If you want to embed new FASTA sequences (not just use pre-computed embeddings), download the model weights:
+**Zenodo URL**: https://zenodo.org/records/18478696
+```bash
+# Download and extract Protein-Vec model weights (2.9 GB compressed)
+wget "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1" -O protein_vec_models.gz
+# Extract to protein_vec_models/ directory
+tar -xzf protein_vec_models.gz
+# Verify extraction
+ls protein_vec_models/
+# Expected: protein_vec.ckpt, protein_vec_params.json, aspect_vec_*.ckpt, etc.
+```
+Or with curl:
+```bash
+curl -L -o protein_vec_models.gz "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1"
+tar -xzf protein_vec_models.gz
+```
+### Other Optional Downloads
 | File | Size | When you need it |
 |------|------|------------------|
 | `afdb_embeddings_protein_vec.npy` | 4.7 GB | Searching AlphaFold Database |
+| CLEAN model weights | ~1 GB | Enzyme classification with CLEAN |
 ---
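
Note on the "Verify extraction" step added to GETTING_STARTED.md: the `ls` check relies on the reader comparing the listing by eye. A minimal sketch of a scripted check follows. The function name `check_models` is hypothetical (it is not part of this commit or of the `cpr` CLI), and the file list is only the partial one named in the "Expected:" comment, so treat it as an assumption rather than a full manifest.

```shell
# Hypothetical helper: report which expected Protein-Vec files are missing.
# The expected names come from the "Expected:" comment in the diff above
# and are not a complete list of the archive's contents.
check_models() {
  local dir="${1:-protein_vec_models}"
  local missing=0
  local name
  for name in protein_vec.ckpt protein_vec_params.json; do
    if [ ! -f "$dir/$name" ]; then
      echo "missing: $dir/$name"
      missing=1
    fi
  done
  # At least one aspect-specific checkpoint should match the glob.
  set -- "$dir"/aspect_vec_*.ckpt
  if [ ! -e "$1" ]; then
    echo "missing: $dir/aspect_vec_*.ckpt"
    missing=1
  fi
  return "$missing"
}
```

Running `check_models protein_vec_models` after `tar -xzf protein_vec_models.gz` returns a non-zero status and prints one line per missing file if the extraction was incomplete.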

README.md CHANGED

@@ -111,12 +111,21 @@ cpr verify --check clean   # CLEAN enzyme classification
 ## Data Files
-Download the following files from [Zenodo](https://zenodo.org/records/14272215) and place in the `data/` directory:
-- `pfam_new_proteins.npy` (2.5 GB) - Pfam calibration data for FDR/FNR control
-- `lookup_embeddings.npy` (1.1 GB) - UniProt database embeddings (Protein-Vec)
-- `lookup_embeddings_meta_data.tsv` - Metadata for lookup database
-- `afdb_embeddings_protein_vec.npy` (4.7 GB) - AlphaFold DB embeddings (optional)
 ## Protein-Vec vs CLEAN Models

 ## Data Files
+### Required Data ([Zenodo #14272215](https://zenodo.org/records/14272215))
+```bash
+cd data/
+wget "https://zenodo.org/records/14272215/files/lookup_embeddings.npy?download=1" -O lookup_embeddings.npy
+wget "https://zenodo.org/records/14272215/files/lookup_embeddings_meta_data.tsv?download=1" -O lookup_embeddings_meta_data.tsv
+wget "https://zenodo.org/records/14272215/files/pfam_new_proteins.npy?download=1" -O pfam_new_proteins.npy
+```
+### Model Weights ([Zenodo #18478696](https://zenodo.org/records/18478696)) - for embedding new sequences
+```bash
+wget "https://zenodo.org/records/18478696/files/protein_vec_models.gz?download=1" -O protein_vec_models.gz
+tar -xzf protein_vec_models.gz
+```
 ## Protein-Vec vs CLEAN Models
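
Note on the download commands repeated across the three files: every URL follows the same `records/<id>/files/<name>?download=1` pattern, so the commands could be generated from a record ID and a file name. The sketch below illustrates that under stated assumptions. The helpers `zenodo_url` and `fetch` are hypothetical names, not part of this commit, and the URL pattern is inferred only from the commands shown in the diff.

```shell
# Hypothetical helper: build a Zenodo direct-download URL.
# The pattern is copied from the wget/curl commands in this commit.
zenodo_url() {
  local record="$1"
  local file="$2"
  printf 'https://zenodo.org/records/%s/files/%s?download=1\n' "$record" "$file"
}

# Hypothetical helper: download a file unless it is already present.
# Arguments: record ID, file name, destination directory (default: .)
fetch() {
  local record="$1"
  local file="$2"
  local dest="${3:-.}"
  if [ -f "$dest/$file" ]; then
    echo "skip: $dest/$file already exists"
    return 0
  fi
  mkdir -p "$dest"
  # -L follows redirects; --fail makes HTTP errors return non-zero.
  curl -L --fail -o "$dest/$file" "$(zenodo_url "$record" "$file")"
}
```

For example, `fetch 14272215 lookup_embeddings.npy data` and `fetch 18478696 protein_vec_models.gz` would correspond to the data and model-weight downloads above, and re-running them skips files that are already on disk.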