ronboger Claude Opus 4.5 committed on
Commit ab34d07 · 1 Parent(s): 0d63974

docs: add CLEAN setup details and Docker/Apptainer usage


CLEAN:
- Added verification of pretrained weights
- Note about ESM-1b and 128-dim embeddings
- GPU requirement note

Docker/Apptainer:
- docker build/run commands with volume mounts
- docker-compose for Gradio UI
- apptainer exec/shell commands for HPC
- GPU support with --nv flag

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Files changed (1)
  1. GETTING_STARTED.md +60 -19
GETTING_STARTED.md CHANGED
@@ -241,31 +241,35 @@ For enzyme-specific searches with EC number predictions:
241
  ### Setup
242
 
243
  ```bash
244
- # 1. Clone CLEAN repository
245
  git clone https://github.com/tttianhao/CLEAN.git CLEAN_repo
246
- cd CLEAN_repo && pip install -e . && cd ..
247
 
248
- # 2. Install ESM dependency
249
- pip install fair-esm
 
 
 
250
 
251
- # 3. Download CLEAN weights (if not included)
252
- # Weights should be at: CLEAN_repo/app/data/pretrained/CLEAN_pretrained/
 
253
  ```
254
 
 
 
255
  ### Usage with CPR
256
 
257
  ```bash
258
- # Generate CLEAN embeddings (128-dim)
259
  cpr embed --input enzymes.fasta --output clean_embeddings.npy --model clean
260
 
261
- # Search with CLEAN
262
  cpr search --input enzymes.fasta --output enzyme_results.csv --model clean --fdr 0.1
263
  ```
264
 
265
  ### Verify CLEAN Results (Paper Tables 1-2)
266
 
267
  ```bash
268
- # Run CLEAN verification script
269
  python scripts/verify_clean.py
270
 
271
  # Expected output:
@@ -362,23 +366,60 @@ python scripts/pfam/sva_results.py
362
 
363
  ---
364
 
365
- ## Model Weights
366
 
367
- ### Protein-Vec (General Protein Search)
368
 
369
- **Option 1: Contact authors** for the `protein_vec_models.gz` archive.
370
 
371
- **Option 2: Use pre-computed embeddings** from Zenodo (no weights needed).
372
 
373
- If you have the weights:
374
  ```bash
375
- tar -xzf protein_vec_models.gz
376
- # Creates protein_vec_models/ with:
377
- # protein_vec.ckpt (804 MB)
378
- # protein_vec_params.json
379
- # aspect_vec_*.ckpt (200-400 MB each)
380
  ```
381
 
382
  ---
383
 
384
  ## Troubleshooting
 
241
  ### Setup
242
 
243
  ```bash
244
+ # 1. Clone CLEAN repository with pretrained weights
245
  git clone https://github.com/tttianhao/CLEAN.git CLEAN_repo
 
246
 
247
+ # 2. Install CLEAN and dependencies
248
+ cd CLEAN_repo
249
+ pip install -e .
250
+ pip install "fair-esm>=2.0.0"
251
+ cd ..
252
 
253
+ # 3. Verify weights are present
254
+ ls CLEAN_repo/app/data/pretrained/
255
+ # Expected: 100.pt (123 MB), 70.pt (40 MB), split100.pth, split70.pth
256
  ```
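The `ls` check in step 3 can also be scripted so that a missing file fails loudly. The following is a minimal sketch, assuming the four file names listed in the expected output above are the complete set; `check_clean_weights` is a hypothetical helper, not part of CLEAN or CPR.

```bash
# Hypothetical helper: verify the pretrained weight files named in step 3.
# Assumption: the directory defaults to the path used in the setup block.
check_clean_weights() {
  local dir="${1:-CLEAN_repo/app/data/pretrained}"
  local missing=0
  local f
  for f in 100.pt 70.pt split100.pth split70.pth; do
    if [ ! -f "$dir/$f" ]; then
      # Report each absent file on stderr and remember the failure.
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  # Exit status 0 when all files exist, 1 otherwise.
  return "$missing"
}

# Example use (left commented so the snippet has no side effects):
# check_clean_weights && echo "CLEAN weights present"
```

The function only checks that the files exist; it does not validate their sizes or contents.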
257
 
258
+ **Note**: CLEAN uses ESM-1b embeddings internally; these are computed automatically. The model produces 128-dimensional embeddings (vs. 1024 for Protein-Vec).
259
+
260
  ### Usage with CPR
261
 
262
  ```bash
263
+ # Generate CLEAN embeddings (128-dim) - requires GPU
264
  cpr embed --input enzymes.fasta --output clean_embeddings.npy --model clean
265
 
266
+ # Search with CLEAN model
267
  cpr search --input enzymes.fasta --output enzyme_results.csv --model clean --fdr 0.1
268
  ```
269
 
270
  ### Verify CLEAN Results (Paper Tables 1-2)
271
 
272
  ```bash
 
273
  python scripts/verify_clean.py
274
 
275
  # Expected output:
 
366
 
367
  ---
368
 
369
+ ## Docker / Container Usage
370
 
371
+ Run CPR without installing dependencies locally:
372
 
373
+ ### Docker
374
 
375
+ ```bash
376
+ # Build the image
377
+ docker build -t cpr:latest .
378
+
379
+ # Run with your data mounted
380
+ docker run -it --rm \
381
+ -v $(pwd)/data:/workspace/data \
382
+ -v $(pwd)/protein_vec_models:/workspace/protein_vec_models \
383
+ -v $(pwd)/results:/workspace/results \
384
+ cpr:latest bash
385
+
386
+ # Inside container: run searches
387
+ cpr search --input data/your_sequences.fasta --output results/hits.csv --fdr 0.1
388
+
389
+ # Or launch the Gradio web interface
390
+ docker run -p 7860:7860 \
391
+ -v $(pwd)/data:/workspace/data \
392
+ cpr:latest
393
+ # Then open http://localhost:7860
394
+ ```
395
+
396
+ ### Docker Compose
397
 
 
398
  ```bash
399
+ # Start the Gradio web interface
400
+ docker-compose up
401
+
402
+ # Access at http://localhost:7860
 
403
  ```
404
 
405
+ ### Apptainer (HPC clusters)
406
+
407
+ ```bash
408
+ # Build the container
409
+ apptainer build cpr.sif apptainer.def
410
+
411
+ # Run a search
412
+ apptainer exec --nv cpr.sif cpr search \
413
+ --input data/sequences.fasta \
414
+ --output results/hits.csv \
415
+ --fdr 0.1
416
+
417
+ # Interactive shell
418
+ apptainer shell --nv cpr.sif
419
+ ```
420
+
421
+ **Note**: Use the `--nv` flag for GPU support on NVIDIA systems.
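On clusters that mix GPU and CPU-only nodes, the flag can be chosen at run time. This is a minimal sketch, assuming that finding `nvidia-smi` on the `PATH` is an adequate signal that an NVIDIA driver is available; `nv_flag` is a hypothetical helper, not part of CPR or Apptainer.

```bash
# Hypothetical helper: print "--nv" only when nvidia-smi is found on PATH.
# Assumption: nvidia-smi being present means the node has a usable NVIDIA driver.
nv_flag() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo "--nv"
  fi
}

# Example use (left commented so the snippet runs without Apptainer):
# apptainer exec $(nv_flag) cpr.sif cpr search \
#   --input data/sequences.fasta --output results/hits.csv --fdr 0.1
```

The unquoted `$(nv_flag)` is deliberate: when the helper prints nothing, the shell drops the argument instead of passing an empty string.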
422
+
423
  ---
424
 
425
  ## Troubleshooting