# CRISPR Array Detection - Status Update for DFG SPP 2141 Endbericht

**Date:** April 2026
**Repository:** `/vol/hpcprojects/pmuench/crispr_tool/crispr-hf-space/`
**HuggingFace Space:** https://huggingface.co/spaces/genomenet/crispr-array-detection
**HuggingFace Model Repository:** https://huggingface.co/genomenet/crispr-bert-model

---

## Summary

We have deployed a publicly accessible web application for CRISPR array detection, based on the BERT-based deep learning model developed in Ziyu Mu's Master's thesis. The application is hosted on HuggingFace Spaces and provides both interactive visualization and programmatic access to the model's predictions.

---

## Implemented Functionality

### 1. CRISPR Array Prediction (`/predict` endpoint equivalent)

- **Input:** DNA sequence (minimum 1000 bp, supports FASTA format)
- **Output:** Per-position CRISPR probability scores (0-1)
- **Visualization:** Interactive probability curve along sequence position
- **Parameters:**
  - Configurable stride (50-500 bp, default 500 for CPU responsiveness)
  - Adjustable detection threshold (0.1-0.9, default 0.3)
- **Region Detection:** Automatic identification and annotation of predicted CRISPR regions above threshold

### 2. Hidden-State Embedding Extraction (`/embed` endpoint equivalent)

- **Input:** DNA sequence
- **Output:** 768-dimensional embedding vectors from transformer layer 21
- **Modes:**
  - `mean`: Mean-pooled embedding across all windows (single vector)
  - `max`: Max-pooled embedding (single vector)
  - `trajectory`: Per-window embeddings for sequence analysis
  - `state-dynamics`: UMAP projection with clustering visualization

### 3. State-Dynamic Plots (as described in DFG SPP 2141 report)

Implemented visualization inspired by Figure 3 from the progress report:

- **UMAP Projection:** Dimensionality reduction of hidden-state embeddings to 2D/3D
- **Agglomerative Clustering:** Automatic identification of structural regions
- **Dual Visualization:**
  - Left panel: Points colored by cluster assignment
  - Right panel: Points colored by sequence position (trajectory)
- **Sequence Map:** Linear representation showing cluster assignments along the sequence
- **Interactive Plots:** Plotly-based visualization with zoom, pan, hover tooltips, and 3D rotation

**Key Insight:** For CRISPR arrays, the State-Dynamic Plot shows alternating color patterns: repeats, being conserved sequences, cluster together, while spacers form distinct clusters.

---

## Technical Implementation

### Model Details

| Property | Value |
|----------|-------|
| Architecture | 24-layer BERT transformer with bottleneck classification head |
| Parameters | ~430 million |
| Pre-training | BERT model trained on metagenomic and genomic microbial sequences |
| Fine-tuning | Trained on annotated CRISPR arrays from bacterial genomes |
| Input window | 1000 bp |
| Embedding layer | `layer_transformer_block_21` (768 dimensions) |

### Deployment Architecture

```
HuggingFace Spaces (Docker SDK)
├── Custom Dockerfile (Python 3.10-slim)
├── TensorFlow 2.15.1 + Keras 2.15.0
├── Model downloaded from HF Hub at startup
└── Gradio 4.x frontend
```

### Infrastructure

- **Hosting:** HuggingFace Spaces with T4 GPU ($0.60/hour, ~$432/month for 24/7 operation at 720 hours per month)
- **Model Storage:** Separate HuggingFace Model Repository (5.15 GB)
- **Cold Start:** ~2-3 minutes (model download + warm-up)
- **Inference Time:** ~50-200 ms per 1 kb window on T4 GPU

### Dependencies

```
tensorflow==2.15.1
keras==2.15.0
gradio>=4.0.0
numpy>=1.26.0,<2.0.0
huggingface_hub>=0.20.0
umap-learn>=0.5.0
scikit-learn>=1.3.0
plotly>=5.18.0
```

---

## Completed Checklist Items

From the original TODO:

- [x] **Checkpoint beschaffen** - Model `best.h5` located and uploaded to HF Model Hub
- [x] **Eigenes Repo anlegen** - Created HuggingFace Space `genomenet/crispr-array-detection`
- [x] **Code-Verständnis** - Analyzed custom layers, tokenization, sliding window logic
- [x] **Model-Loader (Singleton)** - Implemented with HF Hub download
- [x] **Tokenizer** - Extracted and adapted for inference
- [x] **Sliding-Window-Funktion** - Implemented with configurable stride
- [x] **`predict_sequence()`** - Returns per-position probabilities
- [x] **`embed_sequence()`** - Returns hidden-state embeddings
- [x] **Per-Window-Trajectory-Variante** - Implemented as `mode="trajectory"`
- [x] **State-Dynamics Visualization** - UMAP + clustering + interactive Plotly plots
- [x] **Input-Validation** - Sequence validation, FASTA header stripping
- [x] **Health Endpoint equivalent** - GPU status shown in UI
- [x] **Deployment** - Live on HuggingFace Spaces; GPU hardware is recommended for long sequences
- [x] **Acknowledgements** - Ziyu Mu, DFG SPP 2141, BMBF GenomeNet, HZI BIFO

---

## Example Usage

### Web Interface

1. Navigate to https://huggingface.co/spaces/genomenet/crispr-array-detection
2. Paste a DNA sequence (or use one of the provided examples)
3. Click "Analyze Sequence" for CRISPR detection
4. Use the "Embeddings" tab for State-Dynamic Plots

### Programmatic Access via Gradio Client

```python
from gradio_client import Client

# Connect to the public Space
client = Client("genomenet/crispr-array-detection")

# Predict CRISPR regions ("ACGT..." is a placeholder for a sequence of at least 1000 bp)
result = client.predict(
    sequence="ACGT...",
    stride=500,
    threshold=0.3,
    api_name="/predict"
)

# Get a mean-pooled embedding for the same sequence
embedding = client.predict(
    sequence="ACGT...",
    mode="mean",
    api_name="/embed"
)
```

---

## For the Endbericht

### Suggested Text (German)

> Im Rahmen des SPP 2141 wurde ein öffentlich zugänglicher Webservice zur CRISPR-Array-Detektion entwickelt und auf HuggingFace Spaces bereitgestellt (https://huggingface.co/spaces/genomenet/crispr-array-detection).
> Das System basiert auf einem BERT-basierten Deep-Learning-Modell (~430 Mio. Parameter), das auf metagenomischen und genomischen mikrobiellen Sequenzen vortrainiert und anschließend auf annotierten CRISPR-Arrays feinabgestimmt wurde.
>
> Der Service bietet:
> - Vorhersage von CRISPR-Array-Wahrscheinlichkeiten entlang der Sequenzposition
> - Extraktion von Hidden-State-Embeddings aus dem Transformer-Modell
> - State-Dynamic-Plots zur Visualisierung der Einbettungstrajektorien mittels UMAP und Clustering
>
> Die State-Dynamic-Visualisierung ermöglicht die Identifikation wiederkehrender Strukturelemente (z.B. Repeats vs. Spacer) durch die Analyse der Aktivierungsmuster im neuronalen Netzwerk. GPU-Hardware wird für lange Sequenzen und hohe Auflösung empfohlen. Der Service ist für die wissenschaftliche Community frei zugänglich.
>
> **Referenz:** https://huggingface.co/spaces/genomenet/crispr-array-detection

### Acknowledgements (for publication)

```
This work was supported by the Deutsche Forschungsgemeinschaft (DFG) within the Priority Programme SPP 2141 "Much more than Defence: the Multiple Functions and Facets of CRISPR-Cas" (project MC 172).
The CRISPR detection model is based on work by Ziyu Mu (Master's Thesis, HZI BIFO) and utilizes the BERT architecture pre-trained on microbial genomes as part of the BMBF GenomeNet initiative.
```

---

## Future Work (Optional)

1. **Self-hosted HZI Deployment:** For lower latency and no cold-start, deploy on HZI T4 machine with FastAPI
2. **Zenodo DOI:** Create release and obtain citable DOI
3. **Reference Dataset Integration:** Add pre-computed reference embeddings for comparative analysis
4. **Batch Processing:** Support for multi-FASTA input files

---

## Contact

For questions about the deployment or technical details, contact the repository maintainer.
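
---

## Appendix: Sliding-Window Scoring Sketch

The prediction path described under Implemented Functionality (1000 bp input window, configurable stride, threshold-based region detection) can be illustrated with a minimal sketch. This is not the Space's actual code: the function names `window_starts`, `per_position_scores` and `call_regions` are hypothetical, and the sketch assumes that the model yields one probability per window, that overlapping windows are averaged per position, and that a final window is aligned to the sequence end so the tail is scored.

```python
from typing import List, Sequence, Tuple


def window_starts(seq_len: int, window: int = 1000, stride: int = 500) -> List[int]:
    """Start offsets of fixed-size windows covering a sequence of length seq_len."""
    if seq_len < window:
        raise ValueError(f"sequence must be at least {window} bp long")
    if not 0 < stride <= window:
        raise ValueError("stride must be positive and no larger than the window")
    starts = list(range(0, seq_len - window + 1, stride))
    # Assumption: add one window flush with the sequence end so the tail is covered.
    if starts[-1] != seq_len - window:
        starts.append(seq_len - window)
    return starts


def per_position_scores(
    seq_len: int,
    starts: Sequence[int],
    window_probs: Sequence[float],
    window: int = 1000,
) -> List[float]:
    """Average window-level probabilities over every position each window covers."""
    totals = [0.0] * seq_len
    counts = [0] * seq_len
    for start, prob in zip(starts, window_probs):
        for pos in range(start, start + window):
            totals[pos] += prob
            counts[pos] += 1
    # Positions not covered by any window (none, given window_starts) score 0.0.
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


def call_regions(scores: Sequence[float], threshold: float = 0.3) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals where the score is above the threshold."""
    regions: List[Tuple[int, int]] = []
    start = None
    for pos, score in enumerate(scores):
        above = score > threshold
        if above and start is None:
            start = pos  # a region opens here
        elif not above and start is not None:
            regions.append((start, pos))  # the region closed at the previous position
            start = None
    if start is not None:
        regions.append((start, len(scores)))  # region runs to the sequence end
    return regions


if __name__ == "__main__":
    # Dummy window probabilities stand in for real model output.
    starts = window_starts(2000, window=1000, stride=500)
    scores = per_position_scores(2000, starts, [0.0, 1.0, 0.0])
    print(call_regions(scores, threshold=0.3))
```

A smaller stride gives each position more overlapping windows to average over, which smooths the probability curve at the cost of proportionally more inference calls; this is the trade-off behind the default stride of 500.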