| # CRISPR Array Detection - Status Update for DFG SPP 2141 Endbericht | |
| **Date:** April 2026 | |
| **Repository:** `/vol/hpcprojects/pmuench/crispr_tool/crispr-hf-space/` | |
| **HuggingFace Space:** https://huggingface.co/spaces/genomenet/crispr-array-detection | |
| **HuggingFace Model Repository:** https://huggingface.co/genomenet/crispr-bert-model | |
| --- | |
| ## Summary | |
| We have successfully deployed a publicly accessible web application for CRISPR array detection based on the BERT-based deep learning model developed in Ziyu Mu's Master's thesis. The application is hosted on HuggingFace Spaces and provides both interactive visualization and programmatic access to the model's predictions. | |
| --- | |
| ## Implemented Functionality | |
| ### 1. CRISPR Array Prediction (`/predict` endpoint equivalent) | |
| - **Input:** DNA sequence (minimum 1000 bp, supports FASTA format) | |
| - **Output:** Per-position CRISPR probability scores (0-1) | |
| - **Visualization:** Interactive probability curve along sequence position | |
| - **Parameters:** | |
|   - Configurable stride (50-500 bp, default 500 for CPU responsiveness) | |
|   - Adjustable detection threshold (0.1-0.9, default 0.3) | |
| - **Region Detection:** Automatic identification and annotation of predicted CRISPR regions above threshold | |
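The prediction steps above can be sketched as follows. This is a minimal illustration, not the Space's actual code: the helper names (`window_starts`, `per_position`, `call_regions`) are hypothetical, and it assumes the model yields one score per 1000 bp window that is averaged over overlapping windows, which the report does not specify.

```python
from typing import List, Tuple

WINDOW = 1000  # input window size in bp (from the model details)


def window_starts(seq_len: int, stride: int = 500, window: int = WINDOW) -> List[int]:
    """Start offsets of sliding windows; the last window is right-aligned."""
    if seq_len < window:
        raise ValueError(f"sequence must be at least {window} bp")
    starts = list(range(0, seq_len - window + 1, stride))
    if starts[-1] != seq_len - window:
        starts.append(seq_len - window)  # make sure the tail is covered
    return starts


def per_position(scores: List[float], starts: List[int], seq_len: int,
                 window: int = WINDOW) -> List[float]:
    """Average the scores of all windows overlapping each position.

    Assumption: one score per window. Requires stride <= window so that
    every position is covered by at least one window.
    """
    total = [0.0] * seq_len
    count = [0] * seq_len
    for score, start in zip(scores, starts):
        for pos in range(start, start + window):
            total[pos] += score
            count[pos] += 1
    return [t / c for t, c in zip(total, count)]


def call_regions(probs: List[float], threshold: float = 0.3) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals where probs >= threshold."""
    regions = []
    start = None
    for pos, p in enumerate(probs):
        if p >= threshold and start is None:
            start = pos
        elif p < threshold and start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(probs)))
    return regions


# Made-up window scores for a 2500 bp sequence
starts = window_starts(2500, stride=500)
probs = per_position([0.1, 0.1, 0.9, 0.9], starts, 2500)
print(call_regions(probs, threshold=0.3))  # [(1000, 2500)]
```

A smaller stride gives more overlapping windows per position and a smoother curve, at the cost of proportionally more model calls.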
| ### 2. Hidden-State Embedding Extraction (`/embed` endpoint equivalent) | |
| - **Input:** DNA sequence | |
| - **Output:** 768-dimensional embedding vectors from transformer layer 21 | |
| - **Modes:** | |
|   - `mean`: Mean-pooled embedding across all windows (single vector) | |
|   - `max`: Max-pooled embedding (single vector) | |
|   - `trajectory`: Per-window embeddings for sequence analysis | |
|   - `state-dynamics`: UMAP projection with clustering visualization | |
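The three numeric modes differ only in how per-window embeddings are reduced. The sketch below assumes each window has already been reduced to one 768-dimensional vector (how that happens inside the model is not described here). `pool_embeddings` is a hypothetical helper name, not the Space's API.

```python
import numpy as np


def pool_embeddings(window_embeddings: np.ndarray, mode: str = "mean") -> np.ndarray:
    """Reduce an (n_windows, dim) array according to the requested mode."""
    if window_embeddings.ndim != 2:
        raise ValueError("expected an array of shape (n_windows, dim)")
    if mode == "mean":
        return window_embeddings.mean(axis=0)  # single vector of length dim
    if mode == "max":
        return window_embeddings.max(axis=0)   # element-wise maximum over windows
    if mode == "trajectory":
        return window_embeddings               # keep one vector per window
    raise ValueError(f"unknown mode: {mode!r}")


# Random stand-in for 5 windows of 768-dimensional embeddings
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 768))
print(pool_embeddings(emb, "mean").shape)        # (768,)
print(pool_embeddings(emb, "trajectory").shape)  # (5, 768)
```

The `state-dynamics` mode is omitted because it is a visualization built on top of the `trajectory` output.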
| ### 3. State-Dynamic Plots (as described in DFG SPP 2141 report) | |
| The implemented visualization is inspired by Figure 3 of the progress report: | |
| - **UMAP Projection:** Dimensionality reduction of hidden-state embeddings to 2D/3D | |
| - **Agglomerative Clustering:** Automatic identification of structural regions | |
| - **Dual Visualization:** | |
|   - Left panel: Points colored by cluster assignment | |
|   - Right panel: Points colored by sequence position (trajectory) | |
| - **Sequence Map:** Linear representation showing cluster assignments along the sequence | |
| - **Interactive Plots:** Plotly-based visualization with zoom, pan, hover tooltips, and 3D rotation | |
| **Key Insight:** For CRISPR arrays, the State-Dynamic Plot shows alternating color patterns where repeats cluster together (conserved sequences) and spacers form distinct clusters. | |
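The sequence map can be thought of as a run-length encoding of the per-window cluster labels. The sketch below assumes the labels already exist (for example from scikit-learn's `AgglomerativeClustering.fit_predict` on the window embeddings) and that windows are evenly spaced by `stride`. The function name `sequence_map` and the label values are illustrative only.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def sequence_map(labels: Sequence[int], stride: int = 500) -> List[Tuple[int, int, int]]:
    """Collapse consecutive identical labels into (start_bp, end_bp, cluster) segments.

    Coordinates refer to window start offsets; segments are half-open.
    """
    segments = []
    index = 0
    for cluster, run in groupby(labels):
        n = len(list(run))
        segments.append((index * stride, (index + n) * stride, cluster))
        index += n
    return segments


# Made-up labels showing an alternating pattern (e.g. repeat-like vs. spacer-like windows)
for start, end, cluster in sequence_map([0, 0, 1, 0, 0, 1], stride=100):
    print(f"{start:>4}-{end:<4} cluster {cluster}")
```

An alternating sequence of short segments like this is the pattern the plot is expected to show over a CRISPR array.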
| --- | |
| ## Technical Implementation | |
| ### Model Details | |
| | Property | Value | | |
| |----------|-------| | |
| | Architecture | 24-layer BERT transformer with bottleneck classification head | | |
| | Parameters | ~430 million | | |
| | Pre-training | BERT model trained on metagenomic and genomic microbial sequences | | |
| | Fine-tuning | Trained on annotated CRISPR arrays from bacterial genomes | | |
| | Input window | 1000 bp | | |
| | Embedding layer | `layer_transformer_block_21` (768 dimensions) | | |
| ### Deployment Architecture | |
| ``` | |
| HuggingFace Spaces (Docker SDK) | |
| ├── Custom Dockerfile (Python 3.10-slim) | |
| ├── TensorFlow 2.15.1 + Keras 2.15.0 | |
| ├── Model downloaded from HF Hub at startup | |
| └── Gradio 4.x frontend | |
| ``` | |
| ### Infrastructure | |
| - **Hosting:** HuggingFace Spaces with T4 GPU ($0.60/hour, ~$432/month for 24/7 operation) | |
| - **Model Storage:** Separate HuggingFace Model Repository (5.15 GB) | |
| - **Cold Start:** ~2-3 minutes (model download + warm-up) | |
| - **Inference Time:** ~50-200 ms per 1000 bp window on T4 GPU | |
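A back-of-envelope check of hosting cost and expected runtime, using only the figures listed above. The helper names are hypothetical, and the latency estimate assumes windows are processed one at a time (no batching) and that a month has 30 days.

```python
import math
from typing import Tuple


def monthly_cost(hourly_usd: float, hours_per_day: float = 24, days: int = 30) -> float:
    """Cost of keeping the hardware running, rounded to cents."""
    return round(hourly_usd * hours_per_day * days, 2)


def n_windows(seq_len: int, stride: int, window: int = 1000) -> int:
    """Number of sliding windows needed to cover the sequence."""
    if seq_len < window:
        raise ValueError(f"sequence must be at least {window} bp")
    return math.ceil((seq_len - window) / stride) + 1


def latency_range_s(seq_len: int, stride: int,
                    per_window_ms: Tuple[float, float] = (50, 200)) -> Tuple[float, float]:
    """Lower and upper runtime estimate in seconds, assuming sequential windows."""
    n = n_windows(seq_len, stride)
    return (n * per_window_ms[0] / 1000, n * per_window_ms[1] / 1000)


print(monthly_cost(0.60))                    # 432.0 USD for 24/7 operation
print(n_windows(100_000, stride=500))        # 199 windows
print(latency_range_s(100_000, stride=500))  # roughly 10 to 40 seconds
```

Reducing the stride from 500 to 50 multiplies the window count, and therefore the runtime, by roughly ten.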
| ### Dependencies | |
| ``` | |
| tensorflow==2.15.1 | |
| keras==2.15.0 | |
| gradio>=4.0.0 | |
| numpy>=1.26.0,<2.0.0 | |
| huggingface_hub>=0.20.0 | |
| umap-learn>=0.5.0 | |
| scikit-learn>=1.3.0 | |
| plotly>=5.18.0 | |
| ``` | |
| --- | |
| ## Completed Checklist Items | |
| From the original TODO: | |
| - [x] **Obtain checkpoint** - Model `best.h5` located and uploaded to HF Model Hub | |
| - [x] **Create dedicated repository** - Created HuggingFace Space `genomenet/crispr-array-detection` | |
| - [x] **Code understanding** - Analyzed custom layers, tokenization, sliding window logic | |
| - [x] **Model loader (singleton)** - Implemented with HF Hub download | |
| - [x] **Tokenizer** - Extracted and adapted for inference | |
| - [x] **Sliding-window function** - Implemented with configurable stride | |
| - [x] **`predict_sequence()`** - Returns per-position probabilities | |
| - [x] **`embed_sequence()`** - Returns hidden-state embeddings | |
| - [x] **Per-window trajectory variant** - Implemented as `mode="trajectory"` | |
| - [x] **State-dynamics visualization** - UMAP + clustering + interactive Plotly plots | |
| - [x] **Input validation** - Sequence validation, FASTA header stripping | |
| - [x] **Health Endpoint equivalent** - GPU status shown in UI | |
| - [x] **Deployment** - Live on HuggingFace Spaces; GPU hardware is recommended for long sequences | |
| - [x] **Acknowledgements** - Ziyu Mu, DFG SPP 2141, BMBF GenomeNet, HZI BIFO | |
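The input-validation item (sequence validation and FASTA header stripping) could look roughly like the sketch below. The function name `clean_sequence` and the accepted alphabet `ACGTN` are assumptions; the actual rules used by the Space may differ.

```python
MIN_LENGTH = 1000          # minimum input length in bp (from the prediction section)
ALLOWED = set("ACGTN")     # assumed alphabet; not confirmed by the report


def clean_sequence(text: str, min_length: int = MIN_LENGTH) -> str:
    """Strip FASTA headers and whitespace, upper-case, and validate the sequence."""
    lines = (line.strip() for line in text.splitlines())
    seq = "".join(line for line in lines if line and not line.startswith(">")).upper()
    if not seq:
        raise ValueError("no sequence found in input")
    invalid = sorted(set(seq) - ALLOWED)
    if invalid:
        raise ValueError(f"invalid characters in sequence: {''.join(invalid)}")
    if len(seq) < min_length:
        raise ValueError(f"sequence is {len(seq)} bp; at least {min_length} bp required")
    return seq


# Short demo with a lowered minimum length
print(clean_sequence(">example record\nacgt\nACGT\n", min_length=8))  # ACGTACGT
```

Raising `ValueError` with a specific message lets the frontend show the user why an input was rejected instead of failing inside the model call.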
| --- | |
| ## Example Usage | |
| ### Web Interface | |
| 1. Navigate to https://huggingface.co/spaces/genomenet/crispr-array-detection | |
| 2. Paste DNA sequence (or use provided examples) | |
| 3. Click "Analyze Sequence" for CRISPR detection | |
| 4. Use "Embeddings" tab for State-Dynamic Plots | |
| ### Programmatic Access via Gradio Client | |
| ```python | |
| from gradio_client import Client | |
| client = Client("genomenet/crispr-array-detection") | |
| # Predict CRISPR regions ("ACGT..." is a placeholder; input must be at least 1000 bp) | |
| result = client.predict( | |
|     sequence="ACGT...", | |
|     stride=500, | |
|     threshold=0.3, | |
|     api_name="/predict", | |
| ) | |
| # Get a single mean-pooled embedding vector | |
| embedding = client.predict( | |
|     sequence="ACGT...", | |
|     mode="mean", | |
|     api_name="/embed", | |
| ) | |
| ``` | |
| --- | |
| ## For the Endbericht | |
| ### Suggested Text (German) | |
| > Im Rahmen des SPP 2141 wurde ein öffentlich zugänglicher Webservice zur CRISPR-Array-Detektion entwickelt und auf HuggingFace Spaces bereitgestellt (https://huggingface.co/spaces/genomenet/crispr-array-detection). Das System basiert auf einem BERT-basierten Deep-Learning-Modell (~430 Mio. Parameter), das auf metagenomischen und genomischen mikrobiellen Sequenzen vortrainiert und anschließend auf annotierten CRISPR-Arrays feinabgestimmt wurde. | |
| > | |
| > Der Service bietet: | |
| > - Vorhersage von CRISPR-Array-Wahrscheinlichkeiten entlang der Sequenzposition | |
| > - Extraktion von Hidden-State-Embeddings aus dem Transformer-Modell | |
| > - State-Dynamic-Plots zur Visualisierung der Einbettungstrajektorien mittels UMAP und Clustering | |
| > | |
| > Die State-Dynamic-Visualisierung ermöglicht die Identifikation wiederkehrender Strukturelemente (z.B. Repeats vs. Spacer) durch die Analyse der Aktivierungsmuster im neuronalen Netzwerk. GPU-Hardware wird für lange Sequenzen und hohe Auflösung empfohlen. Der Service ist für die wissenschaftliche Community frei zugänglich. | |
| > | |
| > **Referenz:** https://huggingface.co/spaces/genomenet/crispr-array-detection | |
| ### Acknowledgements (for publication) | |
| ``` | |
| This work was supported by the Deutsche Forschungsgemeinschaft (DFG) | |
| within the Priority Programme SPP 2141 "Much more than Defence: | |
| the Multiple Functions and Facets of CRISPR-Cas" (project MC 172). | |
| The CRISPR detection model is based on work by Ziyu Mu (Master's Thesis, | |
| HZI BIFO) and utilizes the BERT architecture pre-trained on microbial | |
| genomes as part of the BMBF GenomeNet initiative. | |
| ``` | |
| --- | |
| ## Future Work (Optional) | |
| 1. **Self-hosted HZI Deployment:** For lower latency and no cold start, deploy on an HZI T4 machine with FastAPI | |
| 2. **Zenodo DOI:** Create release and obtain citable DOI | |
| 3. **Reference Dataset Integration:** Add pre-computed reference embeddings for comparative analysis | |
| 4. **Batch Processing:** Support for multi-FASTA input files | |
| --- | |
| ## Contact | |
| For questions about the deployment or technical details, contact the repository maintainer. | |