CRISPR Array Detection - Status Update for the DFG SPP 2141 Final Report (Endbericht)
Date: April 2026
Repository: /vol/hpcprojects/pmuench/crispr_tool/crispr-hf-space/
HuggingFace Space: https://huggingface.co/spaces/genomenet/crispr-array-detection
HuggingFace Model Repository: https://huggingface.co/genomenet/crispr-bert-model
Summary
We have successfully deployed a publicly accessible web application for CRISPR array detection based on the BERT-based deep learning model developed in Ziyu Mu's Master's thesis. The application is hosted on HuggingFace Spaces and provides both interactive visualization and programmatic access to the model's predictions.
Implemented Functionality
1. CRISPR Array Prediction (/predict endpoint equivalent)
- Input: DNA sequence (minimum 1000 bp, supports FASTA format)
- Output: Per-position CRISPR probability scores (0-1)
- Visualization: Interactive probability curve along sequence position
- Parameters:
- Configurable stride (50-500 bp, default 500 for CPU responsiveness)
- Adjustable detection threshold (0.1-0.9, default 0.3)
- Region Detection: Automatic identification and annotation of predicted CRISPR regions above threshold
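The prediction steps listed above (sliding windows, per-position scores, thresholded regions) can be sketched as follows. This is a minimal illustration under stated assumptions, not the Space's actual code: `score_window` is a hypothetical stand-in for the model call, the model is assumed to return one probability per window, and averaging the scores of overlapping windows at each position is an assumed aggregation rule.

```python
from typing import Callable, List, Tuple

WINDOW = 1000  # model input window in bp


def window_starts(seq_len: int, stride: int, window: int = WINDOW) -> List[int]:
    """Start offsets of all windows; a final window is added so the tail is covered."""
    if seq_len < window:
        raise ValueError(f"sequence must be at least {window} bp")
    starts = list(range(0, seq_len - window + 1, stride))
    if starts[-1] != seq_len - window:
        starts.append(seq_len - window)
    return starts


def per_position_scores(
    seq: str,
    score_window: Callable[[str], float],  # hypothetical model wrapper
    stride: int = 500,
    window: int = WINDOW,
) -> List[float]:
    """Average the window-level scores of every window covering a position."""
    totals = [0.0] * len(seq)
    counts = [0] * len(seq)
    for start in window_starts(len(seq), stride, window):
        p = score_window(seq[start:start + window])
        for i in range(start, start + window):
            totals[i] += p
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


def call_regions(scores: List[float], threshold: float = 0.3) -> List[Tuple[int, int]]:
    """Half-open (start, end) intervals where the score is at or above threshold."""
    regions: List[Tuple[int, int]] = []
    start = None
    for i, p in enumerate(scores):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(scores)))
    return regions
```

A smaller stride gives more overlapping windows per position and therefore a smoother curve, at the cost of proportionally more model calls.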
2. Hidden-State Embedding Extraction (/embed endpoint equivalent)
- Input: DNA sequence
- Output: 768-dimensional embedding vectors from transformer layer 21
- Modes:
- mean: Mean-pooled embedding across all windows (single vector)
- max: Max-pooled embedding (single vector)
- trajectory: Per-window embeddings for sequence analysis
- state-dynamics: UMAP projection with clustering visualization
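The difference between the mean, max, and trajectory modes can be illustrated with a small pooling helper. The function name `pool_embeddings` and its signature are hypothetical; the sketch only assumes that the model yields one embedding vector per window.

```python
from typing import List, Sequence


def pool_embeddings(
    window_embeddings: Sequence[Sequence[float]], mode: str = "mean"
) -> list:
    """Combine per-window embedding vectors according to the requested mode."""
    if not window_embeddings:
        raise ValueError("need at least one window embedding")
    if mode == "trajectory":
        # Keep one vector per window, in sequence order.
        return [list(v) for v in window_embeddings]
    # Transpose so that each tuple holds one embedding dimension across windows.
    columns = list(zip(*window_embeddings))
    if mode == "mean":
        return [sum(c) / len(c) for c in columns]
    if mode == "max":
        return [max(c) for c in columns]
    raise ValueError(f"unsupported mode: {mode!r}")
```

The state-dynamics mode is left out here because it adds a projection and clustering step on top of the trajectory output.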
3. State-Dynamic Plots (as described in DFG SPP 2141 report)
Implemented visualization inspired by Figure 3 from the progress report:
- UMAP Projection: Dimensionality reduction of hidden-state embeddings to 2D/3D
- Agglomerative Clustering: Automatic identification of structural regions
- Dual Visualization:
- Left panel: Points colored by cluster assignment
- Right panel: Points colored by sequence position (trajectory)
- Sequence Map: Linear representation showing cluster assignments along the sequence
- Interactive Plots: Plotly-based visualization with zoom, pan, hover tooltips, and 3D rotation
Key Insight: For CRISPR arrays, the State-Dynamic Plot shows alternating color patterns where repeats cluster together (conserved sequences) and spacers form distinct clusters.
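A rough sketch of the projection, clustering, and sequence-map steps is given below. It uses `umap-learn` and `scikit-learn` from the dependency list, but the function names, the choice of four clusters, and clustering on the original embeddings rather than on the UMAP coordinates are all assumptions, not the Space's confirmed behavior.

```python
from itertools import groupby
from typing import Iterable, List, Tuple


def project_and_cluster(embeddings, n_clusters: int = 4, n_components: int = 2,
                        random_state: int = 0):
    """Return (UMAP coordinates, cluster labels) for per-window embeddings."""
    # Heavy dependencies are imported lazily so cluster_runs() works without them.
    import numpy as np
    import umap
    from sklearn.cluster import AgglomerativeClustering

    X = np.asarray(embeddings, dtype=float)
    coords = umap.UMAP(n_components=n_components,
                       random_state=random_state).fit_transform(X)
    # Assumption: clusters are computed in the full embedding space.
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X)
    return coords, labels


def cluster_runs(labels: Iterable[int]) -> List[Tuple[int, int, int]]:
    """Collapse consecutive identical labels into (first_window, end_window, label) runs."""
    runs: List[Tuple[int, int, int]] = []
    pos = 0
    for label, group in groupby(labels):
        n = len(list(group))
        runs.append((pos, pos + n, int(label)))
        pos += n
    return runs
```

Applied to the labels of a CRISPR array, `cluster_runs` would be expected to return short alternating runs, which is the pattern the sequence map displays.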
Technical Implementation
Model Details
| Property | Value |
|---|---|
| Architecture | 24-layer BERT transformer with bottleneck classification head |
| Parameters | ~430 million |
| Pre-training | BERT model trained on metagenomic and genomic microbial sequences |
| Fine-tuning | Trained on annotated CRISPR arrays from bacterial genomes |
| Input window | 1000 bp |
| Embedding layer | layer_transformer_block_21 (768 dimensions) |
Deployment Architecture
HuggingFace Spaces (Docker SDK)
├── Custom Dockerfile (Python 3.10-slim)
├── TensorFlow 2.15.1 + Keras 2.15.0
├── Model downloaded from HF Hub at startup
└── Gradio 4.x frontend
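Downloading the model once at startup and reusing it across requests can be sketched with a lazily initialized singleton. The repository ID and the checkpoint name `best.h5` are taken from this report; that the file is stored under that name in the model repository, and the empty `custom_objects` mapping, are assumptions (the real model needs its custom layers registered there).

```python
import threading
from typing import Any, Callable, Optional

REPO_ID = "genomenet/crispr-bert-model"  # model repository named in this report
FILENAME = "best.h5"                     # assumed file name inside that repository

_model: Optional[Any] = None
_lock = threading.Lock()


def _download_and_load() -> Any:
    """Default loader: fetch the checkpoint from the Hub and load it with Keras."""
    from huggingface_hub import hf_hub_download
    import tensorflow as tf

    path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME)
    # Placeholder: the actual custom layers would have to be listed here.
    return tf.keras.models.load_model(path, custom_objects={}, compile=False)


def get_model(loader: Callable[[], Any] = _download_and_load) -> Any:
    """Return the shared model instance, loading it on first use only."""
    global _model
    if _model is None:
        with _lock:
            if _model is None:  # re-check after acquiring the lock
                _model = loader()
    return _model
```

The `loader` argument exists only so the caching logic can be exercised with a stub; in the application `get_model()` would be called without arguments.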
Infrastructure
- Hosting: HuggingFace Spaces with T4 GPU ($0.60/hour, ~$432/month for 24/7 operation at 720 hours)
- Model Storage: Separate HuggingFace Model Repository (5.15 GB)
- Cold Start: ~2-3 minutes (model download + warm-up)
- Inference Time: ~50-200 ms per 1 kb window on T4 GPU
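The per-window figure translates into a rough end-to-end estimate once the number of windows is known. The helper below is a back-of-the-envelope sketch: it assumes windows are processed one after another without batching, that a final window is added to cover the sequence tail, and it ignores cold-start time.

```python
import math
from typing import Tuple


def n_windows(seq_len: int, stride: int, window: int = 1000) -> int:
    """Number of windows needed to cover the sequence, including the tail."""
    if seq_len < window:
        raise ValueError(f"sequence must be at least {window} bp")
    return math.ceil((seq_len - window) / stride) + 1


def latency_range_s(
    seq_len: int, stride: int, per_window_ms: Tuple[float, float] = (50, 200)
) -> Tuple[float, float]:
    """Lower and upper bound on inference time in seconds."""
    n = n_windows(seq_len, stride)
    return n * per_window_ms[0] / 1000, n * per_window_ms[1] / 1000
```

For example, a 100 kb sequence at stride 500 needs 199 windows, or roughly 10 to 40 seconds; at stride 50 the same sequence needs 1981 windows, or roughly 100 to 400 seconds.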
Dependencies
tensorflow==2.15.1
keras==2.15.0
gradio>=4.0.0
numpy>=1.26.0,<2.0.0
huggingface_hub>=0.20.0
umap-learn>=0.5.0
scikit-learn>=1.3.0
plotly>=5.18.0
Completed Checklist Items
From the original TODO:
- Obtain checkpoint - Model best.h5 located and uploaded to HF Model Hub
- Create dedicated repository - Created HuggingFace Space genomenet/crispr-array-detection
- Code understanding - Analyzed custom layers, tokenization, sliding window logic
- Model-Loader (Singleton) - Implemented with HF Hub download
- Tokenizer - Extracted and adapted for inference
- Sliding-window function - Implemented with configurable stride
- predict_sequence() - Returns per-position probabilities
- embed_sequence() - Returns hidden-state embeddings
- Per-window trajectory variant - Implemented as mode="trajectory"
- State-Dynamics Visualization - UMAP + clustering + interactive Plotly plots
- Input-Validation - Sequence validation, FASTA header stripping
- Health Endpoint equivalent - GPU status shown in UI
- Deployment - Live on HuggingFace Spaces; GPU hardware is recommended for long sequences
- Acknowledgements - Ziyu Mu, DFG SPP 2141, BMBF GenomeNet, HZI BIFO
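The input-validation item in the checklist (sequence validation, FASTA header stripping) could look roughly like the sketch below. The function name `clean_sequence`, the accepted alphabet (A, C, G, T, N), and the rejection of multi-record input are assumptions for illustration; only the 1000 bp minimum comes from this report.

```python
MIN_LENGTH = 1000          # minimum input length in bp
ALLOWED = set("ACGTN")     # assumed alphabet; the deployed tool may differ


def clean_sequence(text: str, min_length: int = MIN_LENGTH) -> str:
    """Strip a FASTA header and whitespace, then validate alphabet and length."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    if sum(ln.startswith(">") for ln in lines) > 1:
        raise ValueError("multiple FASTA records are not supported")
    # Drop the header line and join the remaining sequence lines.
    seq = "".join(ln for ln in lines if ln and not ln.startswith(">"))
    seq = "".join(seq.split()).upper()
    invalid = set(seq) - ALLOWED
    if invalid:
        raise ValueError(f"invalid characters: {sorted(invalid)}")
    if len(seq) < min_length:
        raise ValueError(f"sequence is {len(seq)} bp; minimum is {min_length} bp")
    return seq
```

Checking the alphabet before the length means a user who pastes the wrong kind of text gets the more informative error first.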
Example Usage
Web Interface
- Navigate to https://huggingface.co/spaces/genomenet/crispr-array-detection
- Paste DNA sequence (or use provided examples)
- Click "Analyze Sequence" for CRISPR detection
- Use "Embeddings" tab for State-Dynamic Plots
Programmatic Access via Gradio Client
from gradio_client import Client

client = Client("genomenet/crispr-array-detection")

# Predict CRISPR regions
result = client.predict(
    sequence="ACGT...",
    stride=500,
    threshold=0.3,
    api_name="/predict"
)

# Get embeddings
embedding = client.predict(
    sequence="ACGT...",
    mode="mean",
    api_name="/embed"
)
For the Final Report (Endbericht)
Suggested Text (English translation of the German original)
Within SPP 2141, a publicly accessible web service for CRISPR array detection was developed and deployed on HuggingFace Spaces (https://huggingface.co/spaces/genomenet/crispr-array-detection). The system is based on a BERT-based deep learning model (~430 million parameters) that was pre-trained on metagenomic and genomic microbial sequences and subsequently fine-tuned on annotated CRISPR arrays.
The service offers:
- Prediction of CRISPR array probabilities along the sequence position
- Extraction of hidden-state embeddings from the transformer model
- State-Dynamic Plots for visualizing embedding trajectories using UMAP and clustering
The State-Dynamic visualization makes it possible to identify recurring structural elements (e.g. repeats vs. spacers) by analyzing activation patterns in the neural network. GPU hardware is recommended for long sequences and high resolution. The service is freely available to the scientific community.
Reference: https://huggingface.co/spaces/genomenet/crispr-array-detection
Acknowledgements (for publication)
This work was supported by the Deutsche Forschungsgemeinschaft (DFG)
within the Priority Programme SPP 2141 "Much more than Defence:
the Multiple Functions and Facets of CRISPR-Cas" (project MC 172).
The CRISPR detection model is based on work by Ziyu Mu (Master's Thesis,
HZI BIFO) and utilizes the BERT architecture pre-trained on microbial
genomes as part of the BMBF GenomeNet initiative.
Future Work (Optional)
- Self-hosted HZI Deployment: For lower latency and no cold-start, deploy on HZI T4 machine with FastAPI
- Zenodo DOI: Create release and obtain citable DOI
- Reference Dataset Integration: Add pre-computed reference embeddings for comparative analysis
- Batch Processing: Support for multi-FASTA input files
Contact
For questions about the deployment or technical details, contact the repository maintainer.