docs: add CLEAN setup details and Docker/Apptainer usage
CLEAN:
- Added verification of pretrained weights
- Note about ESM-1b and 128-dim embeddings
- GPU requirement note
Docker/Apptainer:
- docker build/run commands with volume mounts
- docker-compose for Gradio UI
- apptainer exec/shell commands for HPC
- GPU support with --nv flag
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- GETTING_STARTED.md +60 -19
@@ -241,31 +241,35 @@ For enzyme-specific searches with EC number predictions:
 ### Setup

 ```bash
-# 1. Clone CLEAN repository
 git clone https://github.com/tttianhao/CLEAN.git CLEAN_repo
-cd CLEAN_repo && pip install -e . && cd ..

-# 2. Install
-

-# 3.
-
 ```

 ### Usage with CPR

 ```bash
-# Generate CLEAN embeddings (128-dim)
 cpr embed --input enzymes.fasta --output clean_embeddings.npy --model clean

-# Search with CLEAN
 cpr search --input enzymes.fasta --output enzyme_results.csv --model clean --fdr 0.1
 ```

 ### Verify CLEAN Results (Paper Tables 1-2)

 ```bash
-# Run CLEAN verification script
 python scripts/verify_clean.py

 # Expected output:
@@ -362,23 +366,60 @@ python scripts/pfam/sva_results.py

 ---

-##

-

-

-

-If you have the weights:
 ```bash
-
-
-
-#
-# aspect_vec_*.ckpt (200-400 MB each)
 ```

 ---

 ## Troubleshooting
 ### Setup

 ```bash
+# 1. Clone CLEAN repository with pretrained weights
 git clone https://github.com/tttianhao/CLEAN.git CLEAN_repo

+# 2. Install CLEAN and dependencies
+cd CLEAN_repo
+pip install -e .
+pip install "fair-esm>=2.0.0"
+cd ..

+# 3. Verify weights are present
+ls CLEAN_repo/app/data/pretrained/
+# Expected: 100.pt (123 MB), 70.pt (40 MB), split100.pth, split70.pth
 ```
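The `ls` check in step 3 can also be scripted so that a missing file fails loudly instead of relying on a visual comparison. The following is a minimal sketch and not part of CLEAN or CPR: the function name `check_clean_weights` is hypothetical, and the file list is simply the one given in the expected output of step 3.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of CLEAN or CPR): confirm that the four
# pretrained files named in step 3 exist and are non-empty.
check_clean_weights() {
    # Default to the directory used in the setup steps.
    local dir="${1:-CLEAN_repo/app/data/pretrained}"
    local missing=0
    local name
    for name in 100.pt 70.pt split100.pth split70.pth; do
        # -s is true only for files that exist and have a size above zero.
        if [ ! -s "$dir/$name" ]; then
            echo "missing or empty: $dir/$name" >&2
            missing=1
        fi
    done
    # Exit status 0 when all files are present, 1 otherwise.
    return "$missing"
}
```

Calling `check_clean_weights` with no argument checks the default path from the setup steps; a non-zero exit status means at least one file is absent or empty.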

+**Note**: CLEAN uses ESM-1b embeddings internally (computed automatically). The model produces 128-dimensional embeddings (vs 1024 for Protein-Vec).
+
 ### Usage with CPR

 ```bash
+# Generate CLEAN embeddings (128-dim) - requires GPU
 cpr embed --input enzymes.fasta --output clean_embeddings.npy --model clean

+# Search with CLEAN model
 cpr search --input enzymes.fasta --output enzyme_results.csv --model clean --fdr 0.1
 ```

 ### Verify CLEAN Results (Paper Tables 1-2)

 ```bash
 python scripts/verify_clean.py

 # Expected output:

 ---

+## Docker / Container Usage

+Run CPR without installing dependencies locally:

+### Docker

+```bash
+# Build the image
+docker build -t cpr:latest .
+
+# Run with your data mounted
+docker run -it --rm \
+  -v $(pwd)/data:/workspace/data \
+  -v $(pwd)/protein_vec_models:/workspace/protein_vec_models \
+  -v $(pwd)/results:/workspace/results \
+  cpr:latest bash
+
+# Inside container: run searches
+cpr search --input data/your_sequences.fasta --output results/hits.csv --fdr 0.1
+
+# Or launch the Gradio web interface
+docker run -p 7860:7860 \
+  -v $(pwd)/data:/workspace/data \
+  cpr:latest
+# Then open http://localhost:7860
+```
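When a host path passed to `-v` does not exist, Docker creates it as a directory, typically owned by root, which can make results awkward to write or clean up afterwards. A minimal sketch that creates the three mounted directories beforehand; the helper name `prepare_mount_dirs` is an assumption for illustration, and the directory names are the ones used in the `docker run` command above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of CPR): create the host directories that the
# docker run example bind-mounts, so they belong to the current user.
prepare_mount_dirs() {
    # Default to the current directory, matching $(pwd) in the example.
    local base="${1:-$(pwd)}"
    local name
    for name in data protein_vec_models results; do
        # mkdir -p succeeds if the directory already exists.
        mkdir -p "$base/$name" || return 1
    done
}
```

Running `prepare_mount_dirs` from the repository root before `docker run` leaves existing directories untouched and creates only the missing ones.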
+
+### Docker Compose

 ```bash
+# Start the Gradio web interface
+docker-compose up
+
+# Access at http://localhost:7860
 ```

+### Apptainer (HPC clusters)
+
+```bash
+# Build the container
+apptainer build cpr.sif apptainer.def
+
+# Run a search
+apptainer exec --nv cpr.sif cpr search \
+  --input data/sequences.fasta \
+  --output results/hits.csv \
+  --fdr 0.1
+
+# Interactive shell
+apptainer shell --nv cpr.sif
+```
+
+**Note**: Use the `--nv` flag for GPU support on NVIDIA systems.
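A job script that may land on either GPU or CPU nodes can add `--nv` only when a GPU is actually visible. The following is a minimal sketch, assuming that `nvidia-smi` on the host is a sufficient signal; the function name `apptainer_gpu_flag` is hypothetical and not part of CPR or Apptainer.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of CPR or Apptainer): print "--nv" when
# nvidia-smi lists at least one GPU on the host, and nothing otherwise.
apptainer_gpu_flag() {
    # "nvidia-smi -L" prints one line per device, each starting with "GPU ".
    if command -v nvidia-smi >/dev/null 2>&1 &&
        nvidia-smi -L 2>/dev/null | grep -q '^GPU '; then
        echo "--nv"
    fi
}
```

It can be used as `apptainer exec $(apptainer_gpu_flag) cpr.sif cpr search --input data/sequences.fasta --output results/hits.csv --fdr 0.1`; the command substitution is left unquoted on purpose so that an empty result adds no argument.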
+
 ---

 ## Troubleshooting