+---
+license: mit
+library_name: keras
+pipeline_tag: token-classification
+tags:
+  - biology
+  - genomics
+  - dna
+  - crispr
+  - crispr-cas
+  - tensorflow
+  - keras
+  - sequence-classification
+  - token-classification
+model-index:
+  - name: CRISPR-BERT
+    results:
+      - task:
+          type: token-classification
+          name: Per-position CRISPR array detection
+        dataset:
+          name: Ground-truth CRISPR array test split
+          type: genomic-crispr-array-benchmark
+        metrics:
+          - type: auprc
+            name: Micro AUPRC
+            value: 0.9802
+          - type: auroc
+            name: Micro AUROC
+            value: 0.9910
+          - type: f1
+            name: Best micro F1 over threshold grid
+            value: 0.9543
+          - type: f1
+            name: Window-level F1 at threshold 0.5
+            value: 0.9255
+---
+# CRISPR-BERT
+CRISPR-BERT is a TensorFlow/Keras model for detecting CRISPR arrays in prokaryotic DNA sequences. It scores each nucleotide position in a 1000 bp input window with a probability in `[0, 1]`, where higher values indicate stronger evidence that the position belongs to a CRISPR array.
+The checkpoint in this repository is used by the CRISPR Array Detection web app and companion inference code.
+## Model Details
+| Property | Value |
+|---|---|
+| Model type | BERT-like transformer for DNA sequence labeling |
+| Framework | TensorFlow / Keras |
+| Checkpoint | `best.h5` |
+| Artifact size | 5.15 GB, stored with Git LFS |
+| Input | 1000 bp DNA window |
+| Output | Per-position CRISPR probability, shape `(batch, 1000, 1)` |
+| Tokenization | `PAD/OOV=0`, `A=1`, `C=2`, `G=3`, `T=4`, ambiguous IUPAC bases as `5` |
+| Architecture (from local training reports) | 24 transformer blocks, hidden size 600, about 300.7M parameters |
+| Fine-tuning objective | Per-position binary CRISPR array detection |
+The model uses custom Keras layers and should be loaded with the custom layer definitions from the companion inference code.
+## Intended Use
+Use this model to:
+- Predict CRISPR-array probability scores along bacterial or archaeal DNA sequences.
+- Detect candidate CRISPR array regions using sliding windows and thresholding.
+- Extract hidden-state embeddings for state-dynamic visualizations of repeat/spacer structure.
+This model is intended for research use. It is not intended for clinical, diagnostic, or safety-critical use.
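The thresholding step listed above can be illustrated with a minimal sketch. It assumes per-position scores are already available as a plain list of floats; `call_regions` is a hypothetical helper written for this example, not a function from the companion app, and the app's actual rules (gap merging, minimum region length, strict versus inclusive comparison) may differ.

```python
def call_regions(scores, threshold=0.3):
    """Return half-open (start, end) index pairs for runs where score >= threshold.

    Hypothetical helper for illustration only; not part of the companion app.
    """
    regions = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i  # a new run begins here
        elif start is not None:
            regions.append((start, i))  # the run ended at the previous position
            start = None
    if start is not None:
        regions.append((start, len(scores)))  # the run extends to the end
    return regions


# Toy scores, not model output.
scores = [0.05, 0.10, 0.80, 0.90, 0.70, 0.20, 0.10, 0.60, 0.65]
regions = call_regions(scores, threshold=0.3)
print(regions)
```

Half-open intervals are used so that `end - start` is the region length and adjacent regions never share an index.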
+## Training Data
+The model was initialized from a BERT-style model pretrained on metagenomic contigs and complete microbial genomes, then fine-tuned on annotated CRISPR array sequences. Positive examples come from annotated CRISPR arrays; negative examples are sampled from non-CRISPR genomic regions.
+## Evaluation
+The local benchmark report for the fine-tuned checkpoint family used a ground-truth test split with:
+| Quantity | Value |
+|---|---:|
+| Windows | 2,840 |
+| Total bases | 2,840,000 |
+| Positive bases | 375,163 |
+| Positive-base prevalence | 13.21% |
+Metrics from the available benchmark report:
+| Metric | Value |
+|---|---:|
+| Micro AUPRC | 0.9802 |
+| Micro AUROC | 0.9910 |
+| Precision at threshold 0.5 | 0.9809 |
+| Recall at threshold 0.5 | 0.8828 |
+| F1 at threshold 0.5 | 0.9293 |
+| Best F1 over threshold grid | 0.9543 |
+| Best threshold over grid | 0.31 |
+| Window-level detection F1 at threshold 0.5 | 0.9255 |
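+Threshold choice trades precision against recall. At 0.5 the model is precise (0.9809) but misses about 12% of positive bases (recall 0.8828); the best F1 on the threshold grid is reached at 0.31. The companion app defaults to a threshold around 0.3 for sensitive region discovery.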
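The reported numbers can be cross-checked with a few lines of arithmetic. This sketch only recomputes the F1 at threshold 0.5 from the tabulated precision and recall, and the positive-base prevalence from the split counts; it does not rerun the benchmark.

```python
# Values copied from the tables above.
precision = 0.9809
recall = 0.8828
windows = 2840
window_len = 1000
positive_bases = 375163

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Every window contributes window_len bases to the total.
total_bases = windows * window_len
prevalence = positive_bases / total_bases

print(f"F1 at 0.5 = {f1:.4f}")
print(f"total bases = {total_bases}")
print(f"prevalence = {prevalence:.2%}")
```

Both derived values agree with the tables: F1 rounds to 0.9293 and prevalence to 13.21%.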
+## Usage
+The easiest way to run the model is through the companion Hugging Face Space:
+```text
+https://huggingface.co/spaces/genomenet/crispr-array-detection
+```
+For local inference, use the custom inference code from the CRISPR Array Detection app. A minimal loading pattern is:
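+```python
+import tensorflow as tf
+from huggingface_hub import hf_hub_download
+
+# Custom Keras layer definitions from the companion inference code;
+# the `inference` package must be on the Python path.
+from inference.custom_layers import get_custom_objects
+
+# Download the checkpoint (about 5.15 GB) into the local Hugging Face cache.
+model_path = hf_hub_download(
+    repo_id="genomenet/crispr-bert-model",
+    filename="best.h5",
+)
+
+# compile=False loads the model for inference without optimizer or loss state.
+model = tf.keras.models.load_model(
+    model_path,
+    custom_objects=get_custom_objects(),
+    compile=False,
+)
+```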
+Input sequences should be converted to integer tokens using the same tokenizer used during training:
+```python
+TOKEN = {
+    "A": 1,
+    "C": 2,
+    "G": 3,
+    "T": 4,
+}
+# Unknown and ambiguous IUPAC bases are encoded as 5.
+# Padding/OOV is encoded as 0.
+```
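As a minimal sketch of how the mapping above might be applied to one window, the helper below encodes a string into a fixed-length list of token ids. `encode_window` and `IUPAC_AMBIGUOUS` are hypothetical names for this example. Several choices are assumptions rather than documented behavior of the training tokenizer: uppercasing the input, treating only the IUPAC ambiguity codes (including `N`) as `5`, mapping any other character to `0`, and right-padding short input with `0`.

```python
TOKEN = {"A": 1, "C": 2, "G": 3, "T": 4}
IUPAC_AMBIGUOUS = set("RYSWKMBDHVN")  # assumed set of ambiguity codes
AMBIGUOUS_ID = 5
PAD_OOV_ID = 0


def encode_window(seq, window=1000):
    """Encode a DNA string into exactly `window` integer token ids.

    Hypothetical helper: truncation, uppercasing, and right-padding are
    assumptions, not confirmed behavior of the training tokenizer.
    """
    ids = []
    for base in seq.upper()[:window]:
        if base in TOKEN:
            ids.append(TOKEN[base])
        elif base in IUPAC_AMBIGUOUS:
            ids.append(AMBIGUOUS_ID)
        else:
            ids.append(PAD_OOV_ID)  # anything outside the IUPAC alphabet
    ids.extend([PAD_OOV_ID] * (window - len(ids)))  # right-pad short input
    return ids


example = encode_window("ACGTNRX", window=10)
print(example)
```

The deployed app does not support very short sequences, so the padding branch is mainly useful for experimentation; check the companion tokenizer before relying on it.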
+For sequences longer than 1000 bp, use sliding windows and aggregate overlapping per-window predictions back to sequence coordinates.
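One way to aggregate sliding windows is sketched below, under stated assumptions: a stride of 500 bp, a final window shifted back so it ends exactly at the sequence end, and simple averaging where windows overlap. `window_starts` and `aggregate` are hypothetical helpers; the companion app may use a different stride or aggregation rule (for example, taking the maximum).

```python
def window_starts(seq_len, window=1000, stride=500):
    """Start offsets of windows that together cover [0, seq_len)."""
    if seq_len < window:
        raise ValueError("sequence is shorter than one window")
    if not 0 < stride <= window:
        raise ValueError("stride must be in (0, window] to avoid gaps")
    starts = list(range(0, seq_len - window + 1, stride))
    if starts[-1] != seq_len - window:
        starts.append(seq_len - window)  # shift the last window back to the end
    return starts


def aggregate(seq_len, starts, window_scores):
    """Average overlapping per-window scores at each sequence coordinate."""
    total = [0.0] * seq_len
    count = [0] * seq_len
    for start, scores in zip(starts, window_scores):
        for offset, score in enumerate(scores):
            total[start + offset] += score
            count[start + offset] += 1
    return [t / c for t, c in zip(total, count)]


# Toy example with small sizes; a real run would use window=1000 and the
# model's per-position output for each window instead of constant scores.
starts = window_starts(10, window=4, stride=3)
fake_scores = [[float(k)] * 4 for k in range(len(starts))]
merged = aggregate(10, starts, fake_scores)
print(starts)
print(merged)
```

Averaging smooths edge effects where a CRISPR array straddles a window boundary; the merged scores can then be thresholded in sequence coordinates.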
+## Limitations
+- The model was developed for prokaryotic genomic sequence contexts; performance may differ on eukaryotic, viral, synthetic, or heavily assembled/metagenomic fragments.
+- The model emits probability-like scores, not curated biological annotations. Downstream thresholding and manual review are recommended.
+- Very short sequences are not supported by the deployed inference app; the model expects 1000 bp windows.
+- Ambiguous bases are supported, but a high proportion of them in a window may lower prediction confidence.
+- Evaluation metrics depend on the benchmark split and annotation quality.
+## Citation and Acknowledgements
+If you use this model, please cite or acknowledge:
+- Ziyu Mu, Master's Thesis, Helmholtz Centre for Infection Research (HZI BIFO), 2024.
+- DFG SPP 2141 "Much more than Defence: the Multiple Functions and Facets of CRISPR-Cas" (project MC 172).
+- BMBF de.NBI / GenomeNet.
+## Contact
+For questions about the model or deployment, contact the GenomeNet / HZI maintainers of this repository.