
CRISPR Array Detection - Status Update for DFG SPP 2141 Endbericht (Final Report)

Date: April 2026
Repository: /vol/hpcprojects/pmuench/crispr_tool/crispr-hf-space/
HuggingFace Space: https://huggingface.co/spaces/genomenet/crispr-array-detection
HuggingFace Model Repository: https://huggingface.co/genomenet/crispr-bert-model


Summary

We have successfully deployed a publicly accessible web application for CRISPR array detection based on the BERT-based deep learning model developed in Ziyu Mu's Master's thesis. The application is hosted on HuggingFace Spaces and provides both interactive visualization and programmatic access to the model's predictions.


Implemented Functionality

1. CRISPR Array Prediction (/predict endpoint equivalent)

  • Input: DNA sequence (minimum 1000 bp, supports FASTA format)
  • Output: Per-position CRISPR probability scores (0-1)
  • Visualization: Interactive probability curve along sequence position
  • Parameters:
    • Configurable stride (50-500 bp, default 500 for CPU responsiveness)
    • Adjustable detection threshold (0.1-0.9, default 0.3)
  • Region Detection: Automatic identification and annotation of predicted CRISPR regions above threshold
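
The windowing and region-detection steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the Space's actual code: `window_starts` and `call_regions` are hypothetical names, the extra final window that covers the sequence tail is an assumption, and so is the use of `>=` for the threshold comparison.

```python
import numpy as np

WINDOW = 1000  # model input window in bp (see Model Details)

def window_starts(seq_len, stride=500, window=WINDOW):
    """Start offsets of sliding windows over a sequence of seq_len bp.

    Assumption: a final window is appended so the sequence tail is covered.
    """
    if seq_len < window:
        raise ValueError(f"sequence must be at least {window} bp")
    starts = list(range(0, seq_len - window + 1, stride))
    if starts[-1] != seq_len - window:
        starts.append(seq_len - window)
    return starts

def call_regions(probs, threshold=0.3):
    """Return half-open (start, end) intervals where probs >= threshold."""
    above = np.asarray(probs) >= threshold
    # Pad with False so every run has both a rising and a falling edge.
    padded = np.concatenate(([False], above, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    return [(int(s), int(e)) for s, e in zip(edges[::2], edges[1::2])]
```

For example, `call_regions([0.1, 0.5, 0.6, 0.2, 0.9])` yields `[(1, 3), (4, 5)]` with the default threshold of 0.3.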

2. Hidden-State Embedding Extraction (/embed endpoint equivalent)

  • Input: DNA sequence
  • Output: 768-dimensional embedding vectors from transformer layer 21
  • Modes:
    • mean: Mean-pooled embedding across all windows (single vector)
    • max: Max-pooled embedding (single vector)
    • trajectory: Per-window embeddings for sequence analysis
    • state-dynamics: UMAP projection with clustering visualization
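
The pooling modes above reduce a stack of per-window embeddings in different ways. The sketch below shows one plausible reading; `pool_embeddings` is a hypothetical helper, and it assumes each window has already been summarized as a single vector (768 dimensions in the deployed model). The `state-dynamics` mode is a visualization and is not covered here.

```python
import numpy as np

def pool_embeddings(window_embeddings, mode="mean"):
    """Combine per-window embeddings of shape (n_windows, embedding_dim).

    Hypothetical helper illustrating the mean, max, and trajectory modes.
    """
    emb = np.asarray(window_embeddings, dtype=np.float32)
    if emb.ndim != 2:
        raise ValueError("expected shape (n_windows, embedding_dim)")
    if mode == "mean":
        return emb.mean(axis=0)   # single vector, averaged over windows
    if mode == "max":
        return emb.max(axis=0)    # single vector, element-wise maximum
    if mode == "trajectory":
        return emb                # one vector per window, in sequence order
    raise ValueError(f"unknown mode: {mode}")
```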

3. State-Dynamic Plots (as described in DFG SPP 2141 report)

The visualization is modeled on Figure 3 of the progress report and comprises:

  • UMAP Projection: Dimensionality reduction of hidden-state embeddings to 2D/3D
  • Agglomerative Clustering: Automatic identification of structural regions
  • Dual Visualization:
    • Left panel: Points colored by cluster assignment
    • Right panel: Points colored by sequence position (trajectory)
  • Sequence Map: Linear representation showing cluster assignments along the sequence
  • Interactive Plots: Plotly-based visualization with zoom, pan, hover tooltips, and 3D rotation

Key Insight: For CRISPR arrays, the State-Dynamic Plot shows alternating color patterns where repeats cluster together (conserved sequences) and spacers form distinct clusters.
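
A minimal sketch of the projection-plus-clustering step behind such a plot is shown below. It is not the Space's implementation: `state_dynamics` and its parameters are hypothetical, and clustering the projected coordinates (rather than the raw embeddings) is an assumption. The `reducer` argument is included only so the projection can be swapped out; by default it uses `umap.UMAP` from umap-learn.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def state_dynamics(window_embeddings, n_clusters=4, n_components=2, reducer=None):
    """Project per-window embeddings to low dimensions and cluster them.

    Returns (coords, labels); row i corresponds to window i, so plotting
    labels against the row index gives the linear sequence map.
    """
    emb = np.asarray(window_embeddings, dtype=np.float32)
    if reducer is None:
        import umap  # provided by the umap-learn package
        reducer = umap.UMAP(n_components=n_components, random_state=0)
    coords = np.asarray(reducer.fit_transform(emb))
    # Assumption: clusters are computed on the projected coordinates.
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(coords)
    return coords, labels
```

Coloring `coords` by `labels` corresponds to the left panel, and coloring by row index to the right panel.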


Technical Implementation

Model Details

| Property | Value |
| --- | --- |
| Architecture | 24-layer BERT transformer with bottleneck classification head |
| Parameters | ~430 million |
| Pre-training | BERT model trained on metagenomic and genomic microbial sequences |
| Fine-tuning | Trained on annotated CRISPR arrays from bacterial genomes |
| Input window | 1000 bp |
| Embedding layer | layer_transformer_block_21 (768 dimensions) |

Deployment Architecture

HuggingFace Spaces (Docker SDK)
├── Custom Dockerfile (Python 3.10-slim)
├── TensorFlow 2.15.1 + Keras 2.15.0
├── Model downloaded from HF Hub at startup
└── Gradio 4.x frontend

Infrastructure

  • Hosting: HuggingFace Spaces with T4 GPU ($0.60/hour, ~$432/month if run 24/7)
  • Model Storage: Separate HuggingFace Model Repository (5.15 GB)
  • Cold Start: ~2-3 minutes (model download + warm-up)
  • Inference Time: ~50-200 ms per 1 kb window on T4 GPU
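
The per-window inference time above translates into a rough end-to-end estimate for a whole sequence. The sketch below is back-of-the-envelope arithmetic only: `n_windows` and `inference_seconds` are hypothetical helpers, and they assume windows are processed one after another with no batching and that a final window covers the sequence tail.

```python
import math

def n_windows(seq_len, stride=500, window=1000):
    """Number of sliding windows needed to cover seq_len bp."""
    if seq_len < window:
        raise ValueError("sequence shorter than one window")
    return math.ceil((seq_len - window) / stride) + 1

def inference_seconds(seq_len, stride=500, per_window_s=(0.05, 0.2)):
    """Lower and upper bound on total inference time in seconds."""
    n = n_windows(seq_len, stride)
    return n * per_window_s[0], n * per_window_s[1]
```

Under these assumptions, a 5 Mb genome at the default stride of 500 needs 9,999 windows, or roughly 8 to 33 minutes; at stride 50 the window count grows about tenfold.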

Dependencies

tensorflow==2.15.1
keras==2.15.0
gradio>=4.0.0
numpy>=1.26.0,<2.0.0
huggingface_hub>=0.20.0
umap-learn>=0.5.0
scikit-learn>=1.3.0
plotly>=5.18.0

Completed Checklist Items

From the original TODO:

  • Obtain checkpoint - Model checkpoint best.h5 located and uploaded to the HF Model Hub
  • Create dedicated repo - Created HuggingFace Space genomenet/crispr-array-detection
  • Code understanding - Analyzed custom layers, tokenization, and sliding-window logic
  • Model loader (singleton) - Implemented with HF Hub download
  • Tokenizer - Extracted and adapted for inference
  • Sliding-window function - Implemented with configurable stride
  • predict_sequence() - Returns per-position probabilities
  • embed_sequence() - Returns hidden-state embeddings
  • Per-window trajectory variant - Implemented as mode="trajectory"
  • State-Dynamics visualization - UMAP + clustering + interactive Plotly plots
  • Input validation - Sequence validation, FASTA header stripping
  • Health endpoint equivalent - GPU status shown in UI
  • Deployment - Live on HuggingFace Spaces; GPU hardware is recommended for long sequences
  • Acknowledgements - Ziyu Mu, DFG SPP 2141, BMBF GenomeNet, HZI BIFO
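
As an illustration of the input-validation item in the checklist, FASTA header stripping and sequence checks might look like the sketch below. `clean_sequence` is a hypothetical name, the accepted alphabet (A, C, G, T, N) is an assumption, and the input is treated as a single record, so multiple FASTA records would simply be concatenated.

```python
def clean_sequence(text, min_len=1000):
    """Strip FASTA headers and whitespace, then validate the DNA sequence.

    Assumptions: single-record input and an ACGTN alphabet.
    """
    lines = [ln.strip() for ln in text.splitlines()]
    # Drop empty lines and FASTA header lines starting with ">".
    seq = "".join(ln for ln in lines if ln and not ln.startswith(">")).upper()
    invalid = set(seq) - set("ACGTN")
    if invalid:
        raise ValueError(f"invalid characters: {sorted(invalid)}")
    if len(seq) < min_len:
        raise ValueError(f"sequence is {len(seq)} bp; minimum is {min_len} bp")
    return seq
```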

Example Usage

Web Interface

  1. Navigate to https://huggingface.co/spaces/genomenet/crispr-array-detection
  2. Paste DNA sequence (or use provided examples)
  3. Click "Analyze Sequence" for CRISPR detection
  4. Use "Embeddings" tab for State-Dynamic Plots

Programmatic Access via Gradio Client

from gradio_client import Client

client = Client("genomenet/crispr-array-detection")

# Predict CRISPR regions
result = client.predict(
    sequence="ACGT...",
    stride=500,
    threshold=0.3,
    api_name="/predict"
)

# Get embeddings
embedding = client.predict(
    sequence="ACGT...",
    mode="mean",
    api_name="/embed"
)

For the Endbericht

Suggested Text (translated from German)

Within SPP 2141, a publicly accessible web service for CRISPR array detection was developed and deployed on HuggingFace Spaces (https://huggingface.co/spaces/genomenet/crispr-array-detection). The system is based on a BERT-based deep learning model (~430 million parameters) that was pre-trained on metagenomic and genomic microbial sequences and subsequently fine-tuned on annotated CRISPR arrays.

The service provides:

  • Prediction of CRISPR array probabilities along the sequence position
  • Extraction of hidden-state embeddings from the transformer model
  • State-Dynamic Plots for visualizing embedding trajectories using UMAP and clustering

The State-Dynamic visualization makes it possible to identify recurring structural elements (e.g. repeats vs. spacers) by analyzing activation patterns in the neural network. GPU hardware is recommended for long sequences and high resolution. The service is freely available to the scientific community.

Reference: https://huggingface.co/spaces/genomenet/crispr-array-detection

Acknowledgements (for publication)

This work was supported by the Deutsche Forschungsgemeinschaft (DFG)
within the Priority Programme SPP 2141 "Much more than Defence:
the Multiple Functions and Facets of CRISPR-Cas" (project MC 172).

The CRISPR detection model is based on work by Ziyu Mu (Master's Thesis,
HZI BIFO) and utilizes the BERT architecture pre-trained on microbial
genomes as part of the BMBF GenomeNet initiative.

Future Work (Optional)

  1. Self-hosted HZI Deployment: To reduce latency and avoid cold starts, deploy on an HZI T4 machine with FastAPI
  2. Zenodo DOI: Create release and obtain citable DOI
  3. Reference Dataset Integration: Add pre-computed reference embeddings for comparative analysis
  4. Batch Processing: Support for multi-FASTA input files

Contact

For questions about the deployment or technical details, contact the repository maintainer.