# CRISPR Array Detection - Status Update for the DFG SPP 2141 Final Report (Endbericht)
**Date:** April 2026
**Repository:** `/vol/hpcprojects/pmuench/crispr_tool/crispr-hf-space/`
**HuggingFace Space:** https://huggingface.co/spaces/genomenet/crispr-array-detection
**HuggingFace Model Repository:** https://huggingface.co/genomenet/crispr-bert-model
---
## Summary
We have successfully deployed a publicly accessible web application for CRISPR array detection based on the BERT-based deep learning model developed in Ziyu Mu's Master's thesis. The application is hosted on HuggingFace Spaces and provides both interactive visualization and programmatic access to the model's predictions.
---
## Implemented Functionality
### 1. CRISPR Array Prediction (`/predict` endpoint equivalent)
- **Input:** DNA sequence (minimum 1000 bp, supports FASTA format)
- **Output:** Per-position CRISPR probability scores (0-1)
- **Visualization:** Interactive probability curve along sequence position
- **Parameters:**
  - Configurable stride (50-500 bp, default 500 for CPU responsiveness)
  - Adjustable detection threshold (0.1-0.9, default 0.3)
- **Region Detection:** Automatic identification and annotation of predicted CRISPR regions above threshold
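As an illustration of the region-detection step, the following minimal sketch turns per-position probabilities into contiguous regions at or above the threshold. The function name `find_regions`, the half-open interval convention, and the inclusion of scores exactly equal to the threshold are assumptions made for this example; the deployed Space may merge or filter regions differently.

```python
from typing import List, Tuple


def find_regions(probs: List[float], threshold: float = 0.3) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals where probs[i] >= threshold.

    Hypothetical helper for illustration; not the Space's actual code.
    """
    regions: List[Tuple[int, int]] = []
    start = None  # start index of the region currently being extended
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # a new region opens here
        elif p < threshold and start is not None:
            regions.append((start, i))  # region closes before position i
            start = None
    if start is not None:
        # the sequence ended while still inside a region
        regions.append((start, len(probs)))
    return regions


scores = [0.1, 0.2, 0.8, 0.9, 0.4, 0.1, 0.05, 0.6, 0.7]
print(find_regions(scores))  # [(2, 5), (7, 9)]
```

Raising the threshold shortens or removes regions, which is why the threshold is exposed as a user parameter.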
### 2. Hidden-State Embedding Extraction (`/embed` endpoint equivalent)
- **Input:** DNA sequence
- **Output:** 768-dimensional embedding vectors from transformer layer 21
- **Modes:**
  - `mean`: Mean-pooled embedding across all windows (single vector)
  - `max`: Max-pooled embedding (single vector)
  - `trajectory`: Per-window embeddings for sequence analysis
  - `state-dynamics`: UMAP projection with clustering visualization
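The difference between the `mean`, `max`, and `trajectory` modes can be sketched as follows. This is a simplified illustration using plain Python lists and 2-dimensional toy vectors instead of the 768-dimensional embeddings; `pool_embeddings` is a hypothetical name, and the `state-dynamics` mode is omitted because it is a visualization rather than a pooling operation.

```python
from typing import List


def pool_embeddings(windows: List[List[float]], mode: str = "mean"):
    """Combine per-window embedding vectors according to the chosen mode.

    Hypothetical helper for illustration; the Space presumably works on arrays.
    """
    if not windows:
        raise ValueError("need at least one window embedding")
    if mode == "trajectory":
        # keep one vector per window, in sequence order
        return [list(w) for w in windows]
    # transpose: one tuple per embedding dimension, spanning all windows
    columns = list(zip(*windows))
    if mode == "mean":
        return [sum(c) / len(c) for c in columns]
    if mode == "max":
        return [max(c) for c in columns]
    raise ValueError(f"unknown mode: {mode}")


toy = [[1.0, 4.0], [3.0, 2.0]]  # two windows, two dimensions
print(pool_embeddings(toy, "mean"))  # [2.0, 3.0]
print(pool_embeddings(toy, "max"))   # [3.0, 4.0]
```

`mean` and `max` always yield one vector regardless of sequence length, while `trajectory` grows with the number of windows.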
### 3. State-Dynamic Plots (as described in DFG SPP 2141 report)
The implemented visualization is inspired by Figure 3 of the progress report:
- **UMAP Projection:** Dimensionality reduction of hidden-state embeddings to 2D/3D
- **Agglomerative Clustering:** Automatic identification of structural regions
- **Dual Visualization:**
  - Left panel: Points colored by cluster assignment
  - Right panel: Points colored by sequence position (trajectory)
- **Sequence Map:** Linear representation showing cluster assignments along the sequence
- **Interactive Plots:** Plotly-based visualization with zoom, pan, hover tooltips, and 3D rotation
**Key Insight:** For CRISPR arrays, the State-Dynamic Plot shows alternating color patterns where repeats cluster together (conserved sequences) and spacers form distinct clusters.
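To make the sequence map concrete, the sketch below collapses a list of per-window cluster labels into contiguous runs, which is one way such a linear representation could be derived. The label values are invented for the example, and `cluster_runs` is a hypothetical helper; the UMAP projection and agglomerative clustering that would produce the labels are not shown.

```python
from itertools import groupby
from typing import List, Tuple


def cluster_runs(labels: List[int]) -> List[Tuple[int, int, int]]:
    """Collapse per-window labels into (label, first_window, end_window) runs.

    end_window is exclusive. Hypothetical helper for illustration.
    """
    runs: List[Tuple[int, int, int]] = []
    pos = 0
    for label, group in groupby(labels):
        n = len(list(group))  # number of consecutive windows with this label
        runs.append((label, pos, pos + n))
        pos += n
    return runs


# Invented labels: clusters 0 and 1 alternate, then cluster 2 follows
labels = [0, 0, 1, 0, 0, 1, 2, 2]
print(cluster_runs(labels))
# [(0, 0, 2), (1, 2, 3), (0, 3, 5), (1, 5, 6), (2, 6, 8)]
```

An alternating pattern like the one described for repeats and spacers shows up here as the same labels recurring in short, interleaved runs.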
---
## Technical Implementation
### Model Details
| Property | Value |
|----------|-------|
| Architecture | 24-layer BERT transformer with bottleneck classification head |
| Parameters | ~430 million |
| Pre-training | BERT model trained on metagenomic and genomic microbial sequences |
| Fine-tuning | Trained on annotated CRISPR arrays from bacterial genomes |
| Input window | 1000 bp |
| Embedding layer | `layer_transformer_block_21` (768 dimensions) |
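Given the 1000 bp input window and a configurable stride, window start positions could be enumerated roughly as in the sketch below. The function name `window_starts` and the handling of the sequence tail (adding one final window aligned to the end) are assumptions; the actual sliding-window logic may pad or truncate instead.

```python
from typing import List


def window_starts(seq_len: int, window: int = 1000, stride: int = 500) -> List[int]:
    """Start offsets of fixed-size windows covering a sequence of seq_len bp.

    Hypothetical helper; tail handling is an assumption of this sketch.
    """
    if stride <= 0:
        raise ValueError("stride must be positive")
    if seq_len < window:
        raise ValueError(f"sequence must be at least {window} bp")
    starts = list(range(0, seq_len - window + 1, stride))
    if starts[-1] + window < seq_len:
        # add a last window flush with the sequence end so no bases are skipped
        starts.append(seq_len - window)
    return starts


print(window_starts(2300))             # [0, 500, 1000, 1300]
print(window_starts(2000, stride=1000))  # [0, 1000]
```

A smaller stride yields more overlapping windows and a finer probability curve at the cost of more model calls.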
### Deployment Architecture
```
HuggingFace Spaces (Docker SDK)
├── Custom Dockerfile (Python 3.10-slim)
├── TensorFlow 2.15.1 + Keras 2.15.0
├── Model downloaded from HF Hub at startup
└── Gradio 4.x frontend
```
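A model that is downloaded at startup is typically loaded once and then reused across requests. The sketch below shows one common way to do this with a lock-protected module-level cache. `get_model` and `download_and_load` are hypothetical names, the file name `best.h5` inside the model repository is assumed, and the custom layers the real model needs are not handled here.

```python
import threading
from typing import Any, Callable, Optional

_MODEL: Optional[Any] = None
_LOCK = threading.Lock()


def download_and_load() -> Any:
    """Hypothetical loader: fetch the checkpoint from the HF Hub and load it.

    Assumes the file is stored as "best.h5"; the real model also requires
    custom layers (custom_objects), which this sketch leaves out.
    """
    from huggingface_hub import hf_hub_download
    import tensorflow as tf

    path = hf_hub_download(repo_id="genomenet/crispr-bert-model", filename="best.h5")
    return tf.keras.models.load_model(path, compile=False)


def get_model(loader: Callable[[], Any] = download_and_load) -> Any:
    """Return the cached model, calling loader only on the first request."""
    global _MODEL
    if _MODEL is None:
        with _LOCK:
            # re-check inside the lock so concurrent callers load only once
            if _MODEL is None:
                _MODEL = loader()
    return _MODEL
```

Passing the loader as an argument keeps the caching logic testable without network access or a GPU.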
### Infrastructure
- **Hosting:** HuggingFace Spaces with T4 GPU ($0.60/hour, ~$432/month for 24/7 operation)
- **Model Storage:** Separate HuggingFace Model Repository (5.15 GB)
- **Cold Start:** ~2-3 minutes (model download + warm-up)
- **Inference Time:** ~50-200 ms per 1 kb window on T4 GPU
### Dependencies
```
tensorflow==2.15.1
keras==2.15.0
gradio>=4.0.0
numpy>=1.26.0,<2.0.0
huggingface_hub>=0.20.0
umap-learn>=0.5.0
scikit-learn>=1.3.0
plotly>=5.18.0
```
---
## Completed Checklist Items
From the original TODO:
- [x] **Obtain checkpoint** - Model `best.h5` located and uploaded to HF Model Hub
- [x] **Create dedicated repository** - Created HuggingFace Space `genomenet/crispr-array-detection`
- [x] **Code understanding** - Analyzed custom layers, tokenization, sliding window logic
- [x] **Model-Loader (Singleton)** - Implemented with HF Hub download
- [x] **Tokenizer** - Extracted and adapted for inference
- [x] **Sliding-window function** - Implemented with configurable stride
- [x] **`predict_sequence()`** - Returns per-position probabilities
- [x] **`embed_sequence()`** - Returns hidden-state embeddings
- [x] **Per-window trajectory variant** - Implemented as `mode="trajectory"`
- [x] **State-Dynamics Visualization** - UMAP + clustering + interactive Plotly plots
- [x] **Input validation** - Sequence validation, FASTA header stripping
- [x] **Health Endpoint equivalent** - GPU status shown in UI
- [x] **Deployment** - Live on HuggingFace Spaces; GPU hardware is recommended for long sequences
- [x] **Acknowledgements** - Ziyu Mu, DFG SPP 2141, BMBF GenomeNet, HZI BIFO
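The input-validation item (sequence validation and FASTA header stripping) could look roughly like the following sketch. The function name `clean_sequence`, the allowed alphabet `ACGTN`, and the treatment of the input as a single record are assumptions made for illustration; the 1000 bp minimum matches the input requirement stated above.

```python
def clean_sequence(text: str, min_length: int = 1000) -> str:
    """Strip FASTA header lines and whitespace, then validate the sequence.

    Hypothetical helper; the accepted alphabet (ACGTN) is an assumption.
    Input is treated as one record, so multiple records would be concatenated.
    """
    lines = [ln.strip() for ln in text.splitlines()]
    # drop empty lines and header lines starting with ">"
    seq = "".join(ln for ln in lines if ln and not ln.startswith(">")).upper()
    invalid = set(seq) - set("ACGTN")
    if invalid:
        raise ValueError(f"invalid characters: {sorted(invalid)}")
    if len(seq) < min_length:
        raise ValueError(f"sequence has {len(seq)} bp, need at least {min_length}")
    return seq


# min_length lowered only to keep the demonstration short
print(clean_sequence(">example\nacgt\nACGT\n", min_length=4))  # ACGTACGT
```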
---
## Example Usage
### Web Interface
1. Navigate to https://huggingface.co/spaces/genomenet/crispr-array-detection
2. Paste DNA sequence (or use provided examples)
3. Click "Analyze Sequence" for CRISPR detection
4. Use "Embeddings" tab for State-Dynamic Plots
### Programmatic Access via Gradio Client
```python
from gradio_client import Client

client = Client("genomenet/crispr-array-detection")

# Predict CRISPR regions
result = client.predict(
    sequence="ACGT...",
    stride=500,
    threshold=0.3,
    api_name="/predict"
)

# Get embeddings
embedding = client.predict(
    sequence="ACGT...",
    mode="mean",
    api_name="/embed"
)
```
---
## For the Final Report (Endbericht)
### Suggested Text (English translation of the German original)
> Within SPP 2141, a publicly accessible web service for CRISPR array detection was developed and deployed on HuggingFace Spaces (https://huggingface.co/spaces/genomenet/crispr-array-detection). The system is based on a BERT-based deep learning model (~430 million parameters) that was pre-trained on metagenomic and genomic microbial sequences and subsequently fine-tuned on annotated CRISPR arrays.
>
> The service offers:
> - Prediction of CRISPR array probabilities along the sequence position
> - Extraction of hidden-state embeddings from the transformer model
> - State-dynamic plots that visualize embedding trajectories using UMAP and clustering
>
> The state-dynamic visualization makes it possible to identify recurring structural elements (e.g. repeats vs. spacers) by analyzing activation patterns in the neural network. GPU hardware is recommended for long sequences and high resolution. The service is freely available to the scientific community.
>
> **Reference:** https://huggingface.co/spaces/genomenet/crispr-array-detection
### Acknowledgements (for publication)
```
This work was supported by the Deutsche Forschungsgemeinschaft (DFG)
within the Priority Programme SPP 2141 "Much more than Defence:
the Multiple Functions and Facets of CRISPR-Cas" (project MC 172).
The CRISPR detection model is based on work by Ziyu Mu (Master's Thesis,
HZI BIFO) and utilizes the BERT architecture pre-trained on microbial
genomes as part of the BMBF GenomeNet initiative.
```
---
## Future Work (Optional)
1. **Self-hosted HZI Deployment:** For lower latency and no cold start, deploy on an HZI T4 machine with FastAPI
2. **Zenodo DOI:** Create release and obtain citable DOI
3. **Reference Dataset Integration:** Add pre-computed reference embeddings for comparative analysis
4. **Batch Processing:** Support for multi-FASTA input files
---
## Contact
For questions about the deployment or technical details, contact the repository maintainer.