Add status summary for DFG SPP 2141 Endbericht
- Created comprehensive REPORT_SUMMARY.md documenting:
- Implemented functionality (prediction, embeddings, State-Dynamic plots)
- Technical implementation details
- Completed checklist items from original TODO
- German text suggestion for the Endbericht
- Example usage and acknowledgements
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- REPORT_SUMMARY.md +194 -0
- requirements.txt +1 -0
REPORT_SUMMARY.md
ADDED
@@ -0,0 +1,194 @@
# CRISPR Array Detection - Status Update for DFG SPP 2141 Endbericht

**Date:** April 2026
**Repository:** `/vol/hpcprojects/pmuench/crispr_tool/crispr-hf-space/`
**HuggingFace Space:** https://huggingface.co/spaces/pmuench3/crispr-array-detection
**HuggingFace Model Repository:** https://huggingface.co/pmuench3/crispr-bert-model

---

## Summary

We have deployed a publicly accessible web application for CRISPR array detection, built on the BERT-based deep learning model developed in Ziyu Mu's Master's thesis. The application is hosted on HuggingFace Spaces with GPU acceleration (T4) and provides both interactive visualization of and programmatic access to the model's predictions.

---

## Implemented Functionality

### 1. CRISPR Array Prediction (`/predict` endpoint equivalent)

- **Input:** DNA sequence (minimum 1000 bp; FASTA format supported)
- **Output:** Per-position CRISPR probability scores (0-1)
- **Visualization:** Interactive probability curve along sequence position
- **Parameters:**
  - Configurable stride (50-500 bp, default 100)
  - Adjustable detection threshold (0.1-0.9, default 0.3)
- **Region Detection:** Automatic identification and annotation of predicted CRISPR regions above threshold
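
The stride and threshold parameters above can be illustrated with a minimal sketch. It is not the Space's actual code: the function names `window_starts` and `call_regions` are hypothetical, and aligning the last window to the sequence end is an assumption about how trailing bases might be covered.

```python
def window_starts(seq_len, window=1000, stride=100):
    """Start offsets of full-length windows along a sequence.

    Assumption: a final window is aligned to the sequence end so that
    trailing bases are covered even if the stride does not divide evenly.
    """
    if seq_len < window:
        raise ValueError("sequence is shorter than one window")
    starts = list(range(0, seq_len - window + 1, stride))
    if starts[-1] != seq_len - window:
        starts.append(seq_len - window)
    return starts


def call_regions(scores, threshold=0.3):
    """Return half-open (start, end) intervals where score >= threshold."""
    regions = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i                      # a region opens here
        elif score < threshold and start is not None:
            regions.append((start, i))     # the region closes before i
            start = None
    if start is not None:                  # region runs to the sequence end
        regions.append((start, len(scores)))
    return regions
```

With the defaults, a 1250 bp input yields windows starting at 0, 100, 200, and 250. Lowering the threshold merges or extends called regions, and raising it splits or shrinks them.
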

### 2. Hidden-State Embedding Extraction (`/embed` endpoint equivalent)

- **Input:** DNA sequence
- **Output:** 768-dimensional embedding vectors from transformer layer 21
- **Modes:**
  - `mean`: Mean-pooled embedding across all windows (single vector)
  - `max`: Max-pooled embedding across all windows (single vector)
  - `trajectory`: Per-window embeddings for sequence analysis
  - `state-dynamics`: UMAP projection with clustering visualization
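
The difference between the `mean`, `max`, and `trajectory` modes can be sketched in a few lines. This is an illustration only: `pool_embeddings` is a hypothetical helper, plain Python lists stand in for the 768-dimensional vectors, and the real application may pool differently.

```python
def pool_embeddings(window_embeddings, mode="mean"):
    """Combine per-window vectors (a list of equal-length lists of floats)."""
    if not window_embeddings:
        raise ValueError("need at least one window embedding")
    if mode == "trajectory":
        # Keep one vector per window, in sequence order.
        return [list(vec) for vec in window_embeddings]
    # Transpose so each tuple holds one dimension across all windows.
    columns = list(zip(*window_embeddings))
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown mode: {mode!r}")
```

`mean` and `max` collapse any number of windows into a single vector, which suits sequence-level comparison. `trajectory` keeps the window order, which is what the state-dynamics view needs.
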

### 3. State-Dynamic Plots (as described in DFG SPP 2141 report)

The visualization is modeled on Figure 3 of the progress report:

- **UMAP Projection:** Dimensionality reduction of hidden-state embeddings to 2D/3D
- **Agglomerative Clustering:** Automatic identification of structural regions
- **Dual Visualization:**
  - Left panel: Points colored by cluster assignment
  - Right panel: Points colored by sequence position (trajectory)
- **Sequence Map:** Linear representation showing cluster assignments along the sequence
- **Interactive Plots:** Plotly-based visualization with zoom, pan, hover tooltips, and 3D rotation

**Key Insight:** For CRISPR arrays, the State-Dynamic Plot shows an alternating color pattern: the conserved repeats fall into one cluster, while the spacers form separate clusters.
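
The sequence map can be illustrated independently of UMAP and the model. The sketch below assumes one cluster label per window, already computed, and collapses consecutive identical labels into segments. The name `sequence_map`, the hand-written labels in the usage note, and the convention of reporting segments by window start coordinates are all assumptions made for illustration.

```python
from itertools import groupby


def sequence_map(labels, stride=100):
    """Collapse per-window cluster labels into (start, end, label) segments.

    Coordinates are window start positions in bp; each segment is half-open
    and spans the starts of all consecutive windows sharing a label.
    """
    segments = []
    index = 0
    for label, run in groupby(labels):
        n = len(list(run))                 # windows in this run
        segments.append((index * stride, (index + n) * stride, label))
        index += n
    return segments
```

For the made-up label sequence `[0, 0, 1, 0, 0, 1]`, the segments alternate between cluster 0 and cluster 1, the kind of striped pattern described for CRISPR arrays.
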

---

## Technical Implementation

### Model Details

| Property | Value |
|----------|-------|
| Architecture | 24-layer BERT transformer with bottleneck classification head |
| Parameters | ~430 million |
| Pre-training | BERT model trained on metagenomic and genomic microbial sequences |
| Fine-tuning | Trained on annotated CRISPR arrays from bacterial genomes |
| Input window | 1000 bp |
| Embedding layer | `layer_transformer_block_21` (768 dimensions) |

### Deployment Architecture

```
HuggingFace Spaces (Docker SDK)
├── Custom Dockerfile (Python 3.10-slim)
├── TensorFlow 2.15.1 + Keras 2.15.0
├── Model downloaded from HF Hub at startup
└── Gradio 4.x frontend
```

### Infrastructure

- **Hosting:** HuggingFace Spaces with T4 GPU ($0.60/hour, ~$432/month for 24/7 operation)
- **Model Storage:** Separate HuggingFace Model Repository (5.15 GB)
- **Cold Start:** ~2-3 minutes (model download + warm-up)
- **Inference Time:** ~50-200 ms per 1 kb window on T4 GPU
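
The hosting and inference figures above translate into cost and runtime estimates as follows. This is back-of-the-envelope arithmetic only: it assumes a 30-day month of continuous uptime, windows processed one at a time without batching, and ignores cold start. The function names are hypothetical.

```python
HOURLY_RATE_CENTS = 60   # $0.60/hour for the T4 instance
WINDOW_BP = 1000         # model input window


def monthly_cost_usd(hours_per_day=24, days=30):
    """Hosting cost for one month, in US dollars."""
    return HOURLY_RATE_CENTS * hours_per_day * days / 100


def n_windows(seq_len, stride=100):
    """Number of full windows at the given stride."""
    return (seq_len - WINDOW_BP) // stride + 1


def inference_seconds(seq_len, stride=100, ms_per_window=(50, 200)):
    """Lower and upper runtime estimate in seconds (no batching assumed)."""
    n = n_windows(seq_len, stride)
    return tuple(n * ms / 1000 for ms in ms_per_window)
```

Under these assumptions, continuous hosting costs $432 per month, and a 100 kb sequence at stride 100 produces 991 windows, or roughly 50 to 200 seconds of inference.
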

### Dependencies

```
tensorflow==2.15.1
keras==2.15.0
gradio>=4.0.0
numpy>=1.26.0,<2.0.0
huggingface_hub>=0.20.0
umap-learn>=0.5.0
scikit-learn>=1.3.0
plotly>=5.18.0
```

---

## Completed Checklist Items

From the original TODO:

- [x] **Obtain checkpoint** - Model `best.h5` located and uploaded to HF Model Hub
- [x] **Create dedicated repo** - Created HuggingFace Space `pmuench3/crispr-array-detection`
- [x] **Code understanding** - Analyzed custom layers, tokenization, sliding window logic
- [x] **Model loader (singleton)** - Implemented with HF Hub download
- [x] **Tokenizer** - Extracted and adapted for inference
- [x] **Sliding-window function** - Implemented with configurable stride
- [x] **`predict_sequence()`** - Returns per-position probabilities
- [x] **`embed_sequence()`** - Returns hidden-state embeddings
- [x] **Per-window trajectory variant** - Implemented as `mode="trajectory"`
- [x] **State-dynamics visualization** - UMAP + clustering + interactive Plotly plots
- [x] **Input validation** - Sequence validation, FASTA header stripping
- [x] **Health endpoint equivalent** - GPU status shown in UI
- [x] **Deployment** - Live on HuggingFace Spaces with T4 GPU
- [x] **Acknowledgements** - Ziyu Mu, DFG SPP 2141, BMBF GenomeNet, HZI BIFO
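
The input-validation item in the checklist can be pictured with the following sketch. It is a hypothetical helper, not the deployed code: the accepted alphabet (`ACGTN`) is an assumption, and the 1000 bp minimum is taken from the input requirement stated earlier in this report.

```python
VALID_BASES = set("ACGTN")   # assumption: N is accepted as an ambiguity code
MIN_LENGTH = 1000            # minimum input length stated for the Space


def clean_sequence(text):
    """Strip FASTA headers and whitespace, then validate the sequence."""
    lines = [line.strip() for line in text.splitlines()]
    # Drop empty lines and FASTA header lines (those starting with '>').
    seq = "".join(l for l in lines if l and not l.startswith(">")).upper()
    invalid = set(seq) - VALID_BASES
    if invalid:
        raise ValueError(f"invalid characters: {''.join(sorted(invalid))}")
    if len(seq) < MIN_LENGTH:
        raise ValueError(f"sequence has {len(seq)} bp, need {MIN_LENGTH}")
    return seq
```

Rejecting bad input before it reaches the model keeps error messages specific and avoids spending GPU time on sequences that cannot fill a single window.
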

---

## Example Usage

### Web Interface

1. Navigate to https://huggingface.co/spaces/pmuench3/crispr-array-detection
2. Paste a DNA sequence (or use one of the provided examples)
3. Click "Analyze Sequence" for CRISPR detection
4. Use the "Embeddings" tab for State-Dynamic Plots

### Programmatic Access via Gradio Client

```python
from gradio_client import Client

# Connect to the public Space (it may need a few minutes to wake up).
client = Client("pmuench3/crispr-array-detection")

# Endpoint and parameter names may differ from those shown here;
# client.view_api() lists what the Space actually exposes.

# Predict CRISPR regions ("ACGT..." is a placeholder; at least 1000 bp required)
result = client.predict(
    sequence="ACGT...",
    stride=100,
    threshold=0.3,
    api_name="/predict"
)

# Get a single mean-pooled embedding for the sequence
embedding = client.predict(
    sequence="ACGT...",
    mode="mean",
    api_name="/embed"
)
```

---

## For the Endbericht

### Suggested Text (German)

> Im Rahmen des SPP 2141 wurde ein öffentlich zugänglicher Webservice zur CRISPR-Array-Detektion entwickelt und auf HuggingFace Spaces bereitgestellt (https://huggingface.co/spaces/pmuench3/crispr-array-detection). Das System basiert auf einem BERT-basierten Deep-Learning-Modell (~430 Mio. Parameter), das auf metagenomischen und genomischen mikrobiellen Sequenzen vortrainiert und anschließend auf annotierten CRISPR-Arrays feinabgestimmt wurde.
>
> Der Service bietet:
> - Vorhersage von CRISPR-Array-Wahrscheinlichkeiten entlang der Sequenzposition
> - Extraktion von Hidden-State-Embeddings aus dem Transformer-Modell
> - State-Dynamic-Plots zur Visualisierung der Einbettungstrajektorien mittels UMAP und Clustering
>
> Die State-Dynamic-Visualisierung ermöglicht die Identifikation wiederkehrender Strukturelemente (z.B. Repeats vs. Spacer) durch die Analyse der Aktivierungsmuster im neuronalen Netzwerk. Der Service läuft auf GPU-beschleunigter Hardware (NVIDIA T4) und ist für die wissenschaftliche Community frei zugänglich.
>
> **Referenz:** https://huggingface.co/spaces/pmuench3/crispr-array-detection

### Acknowledgements (for publication)

```
This work was supported by the Deutsche Forschungsgemeinschaft (DFG)
within the Priority Programme SPP 2141 "Much more than Defence:
the Multiple Functions and Facets of CRISPR-Cas" (project MC 172).

The CRISPR detection model is based on work by Ziyu Mu (Master's Thesis,
HZI BIFO) and utilizes the BERT architecture pre-trained on microbial
genomes as part of the BMBF GenomeNet initiative.
```

---

## Future Work (Optional)

1. **Self-hosted HZI Deployment:** Deploy on an HZI T4 machine behind FastAPI for lower latency and no cold start
2. **Zenodo DOI:** Create a release and obtain a citable DOI
3. **Reference Dataset Integration:** Add pre-computed reference embeddings for comparative analysis
4. **Batch Processing:** Support multi-FASTA input files

---

## Contact

For questions about the deployment or technical details, contact the repository maintainer.
requirements.txt
CHANGED
@@ -10,3 +10,4 @@ matplotlib>=3.7.0
 huggingface_hub>=0.20.0
 umap-learn>=0.5.0
 scikit-learn>=1.3.0
+plotly>=5.18.0