---
license: mit
library_name: keras
pipeline_tag: token-classification
tags:
  - biology
  - genomics
  - dna
  - crispr
  - crispr-cas
  - tensorflow
  - keras
  - sequence-classification
  - token-classification
model-index:
  - name: CRISPR-BERT
    results:
      - task:
          type: token-classification
          name: Per-position CRISPR array detection
        dataset:
          name: Ground-truth CRISPR array test split
          type: genomic-crispr-array-benchmark
        metrics:
          - type: auprc
            name: Micro AUPRC
            value: 0.9802
          - type: auroc
            name: Micro AUROC
            value: 0.9910
          - type: f1
            name: Best micro F1 over threshold grid
            value: 0.9543
          - type: f1
            name: Window-level F1 at threshold 0.5
            value: 0.9255
---

# CRISPR-BERT

CRISPR-BERT is a TensorFlow/Keras model for detecting CRISPR arrays in prokaryotic DNA sequences. It scores each nucleotide position in a 1000 bp input window with a probability in `[0, 1]`, where higher values indicate stronger evidence that the position belongs to a CRISPR array.

The checkpoint in this repository is used by the CRISPR Array Detection web app and companion inference code.

## Model Details

| Property | Value |
|---|---|
| Model type | BERT-like transformer for DNA sequence labeling |
| Framework | TensorFlow / Keras |
| Checkpoint | `best.h5` |
| Artifact size | 5.15 GB, stored with Git LFS |
| Input | 1000 bp DNA window |
| Output | Per-position CRISPR probability, shape `(batch, 1000, 1)` |
| Tokenization | `PAD/OOV=0`, `A=1`, `C=2`, `G=3`, `T=4`, ambiguous IUPAC bases as `5` |
| Architecture (per local training reports) | 24 transformer blocks, hidden size 600, about 300.7M parameters |
| Fine-tuning objective | Per-position binary CRISPR array detection |

The model uses custom Keras layers and should be loaded with the custom layer definitions from the companion inference code.

## Intended Use

Use this model to:

- Predict CRISPR-array probability scores along bacterial or archaeal DNA sequences.
- Detect candidate CRISPR array regions using sliding windows and thresholding.
- Extract hidden-state embeddings for state-dynamic visualizations of repeat/spacer structure.

This model is intended for research use. It is not intended for clinical, diagnostic, or safety-critical use.

## Training Data

The model was initialized from a BERT-style model pretrained on metagenomic contigs and complete microbial genomes, then fine-tuned on annotated CRISPR array sequences. Positive examples come from annotated CRISPR arrays; negative examples are sampled from non-CRISPR genomic regions.

## Evaluation

The local benchmark report for the fine-tuned checkpoint family evaluates on a ground-truth test split with the following composition:

| Quantity | Value |
|---|---:|
| Windows | 2,840 |
| Total bases | 2,840,000 |
| Positive bases | 375,163 |
| Positive-base prevalence | 13.21% |

Metrics from the available benchmark report:

| Metric | Value |
|---|---:|
| Micro AUPRC | 0.9802 |
| Micro AUROC | 0.9910 |
| Precision at threshold 0.5 | 0.9809 |
| Recall at threshold 0.5 | 0.8828 |
| F1 at threshold 0.5 | 0.9293 |
| Best F1 over threshold grid | 0.9543 |
| Best threshold over grid | 0.31 |
| Window-level detection F1 at threshold 0.5 | 0.9255 |
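
As a quick consistency check, the derived figures in the two tables above can be recomputed from the reported counts, precision, and recall. This is only a sanity-check sketch; the variable names are illustrative and the inputs are copied from the tables.

```python
# Values copied from the benchmark tables in this card.
windows = 2_840
window_len = 1_000          # bp per window
positive_bases = 375_163
precision, recall = 0.9809, 0.8828   # at threshold 0.5

total_bases = windows * window_len
prevalence = positive_bases / total_bases
# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"total bases: {total_bases:,}")   # total bases: 2,840,000
print(f"prevalence:  {prevalence:.2%}")  # prevalence:  13.21%
print(f"F1 at 0.5:   {f1:.4f}")          # F1 at 0.5:   0.9293
```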

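Threshold choice trades precision against recall: lowering the threshold recovers more CRISPR positions at the cost of more false positives. At 0.5 the model reaches precision 0.9809 and recall 0.8828, while the best F1 on the threshold grid occurs at 0.31. The companion app defaults to a threshold around 0.3 for sensitive region discovery.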
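
To illustrate how a threshold turns per-position scores into candidate regions, here is a minimal sketch. The function name `call_regions`, the `min_len` filter, and the half-open `(start, end)` convention are assumptions made for this example; the companion app may post-process scores differently.

```python
import numpy as np

def call_regions(scores, threshold=0.3, min_len=1):
    """Return half-open (start, end) runs where scores >= threshold.

    Hypothetical helper, not the companion app's implementation.
    """
    above = np.asarray(scores, dtype=float) >= threshold
    # Pad with False on both sides so every run has a start and an end.
    padded = np.concatenate(([False], above, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = changes[0::2], changes[1::2]
    return [(int(s), int(e)) for s, e in zip(starts, ends) if e - s >= min_len]

regions = call_regions([0.1, 0.9, 0.8, 0.2, 0.4], threshold=0.3)
print(regions)  # [(1, 3), (4, 5)]
```

Raising `min_len` drops short runs, which is one simple way to suppress isolated high-scoring positions before manual review.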

## Usage

The easiest way to run the model is through the companion Hugging Face Space:

```text
https://huggingface.co/spaces/genomenet/crispr-array-detection
```

For local inference, use the custom inference code from the CRISPR Array Detection app. A minimal loading pattern is:

```python
import tensorflow as tf
from huggingface_hub import hf_hub_download

from inference.custom_layers import get_custom_objects

model_path = hf_hub_download(
    repo_id="genomenet/crispr-bert-model",
    filename="best.h5",
)

model = tf.keras.models.load_model(
    model_path,
    custom_objects=get_custom_objects(),
    compile=False,
)
```
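
Once the model is loaded, scoring a batch of tokenized windows might look like the sketch below. `score_windows` is a hypothetical helper, and the `int32` input dtype and batch size are assumptions; the only facts taken from this card are the 1000-token input and the `(batch, 1000, 1)` output shape.

```python
import numpy as np

def score_windows(model, token_windows, batch_size=8):
    """Run the model on tokenized windows and return scores of shape (batch, 1000).

    `model` is any object with a Keras-style `predict` method.
    """
    batch = np.stack(token_windows).astype(np.int32)
    probs = model.predict(batch, batch_size=batch_size, verbose=0)
    # Drop the trailing singleton axis of the (batch, 1000, 1) output.
    return np.asarray(probs)[..., 0]
```

A smaller `batch_size` may be needed on memory-constrained hardware, given the size of the checkpoint.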

Input sequences should be converted to integer tokens using the same tokenizer used during training:

```python
TOKEN = {
    "A": 1,
    "C": 2,
    "G": 3,
    "T": 4,
}
# Unknown and ambiguous IUPAC bases are encoded as 5.
# Padding/OOV is encoded as 0.
```
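
A minimal encoder built on this mapping could look as follows. It is a sketch under stated assumptions: the function name `encode_window`, upper-casing of the input, right-padding with `0`, and the `int32` dtype are illustrative choices, and every character outside `A/C/G/T` is mapped to `5` here. Check the companion tokenizer before relying on this, especially for how it treats non-IUPAC characters.

```python
import numpy as np

WINDOW = 1000                      # window length stated in this card
TOKEN = {"A": 1, "C": 2, "G": 3, "T": 4}
AMBIGUOUS = 5                      # unknown and ambiguous IUPAC bases
PAD = 0

def encode_window(seq, window=WINDOW):
    """Encode one DNA window as integer tokens (hypothetical helper)."""
    if len(seq) > window:
        raise ValueError(f"sequence longer than {window} bp; split it into windows first")
    tokens = np.full(window, PAD, dtype=np.int32)  # assumption: right-pad with 0
    for i, base in enumerate(seq.upper()):
        tokens[i] = TOKEN.get(base, AMBIGUOUS)
    return tokens

print(encode_window("ACGTN")[:6].tolist())  # [1, 2, 3, 4, 5, 0]
```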

For sequences longer than 1000 bp, use sliding windows and aggregate overlapping per-window predictions back to sequence coordinates.
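
One way to do this is sketched below: windows are placed at a fixed stride, a final window is aligned to the sequence end, and overlapping scores are averaged. The stride of 500, the averaging rule, and the helper names `window_starts` and `aggregate_scores` are assumptions for illustration; the companion app may tile and aggregate differently.

```python
import numpy as np

def window_starts(seq_len, window=1000, stride=500):
    """Start offsets that cover the whole sequence with fixed-size windows."""
    if seq_len < window:
        raise ValueError("sequence is shorter than one window")
    starts = list(range(0, seq_len - window + 1, stride))
    if starts[-1] != seq_len - window:
        starts.append(seq_len - window)  # final window flush with the sequence end
    return starts

def aggregate_scores(window_probs, starts, seq_len, window=1000):
    """Average overlapping per-window scores back onto sequence coordinates."""
    total = np.zeros(seq_len, dtype=np.float64)
    count = np.zeros(seq_len, dtype=np.int64)
    for probs, start in zip(window_probs, starts):
        total[start:start + window] += np.asarray(probs, dtype=np.float64).reshape(window)
        count[start:start + window] += 1
    # Every position is covered at least once when starts come from window_starts.
    return total / count

starts = window_starts(2300)
print(starts)  # [0, 500, 1000, 1300]
```

Averaging is the simplest choice; taking the per-position maximum instead would favor sensitivity over specificity.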

## Limitations

- The model was developed for prokaryotic genomic sequence contexts; performance may differ on eukaryotic, viral, or synthetic sequences, and on heavily assembled or metagenomic fragments.
- The model emits probability-like scores, not curated biological annotations. Downstream thresholding and manual review are recommended.
- Very short sequences are not supported by the deployed inference app; the model expects 1000 bp windows.
- Ambiguous bases are supported but may reduce confidence if frequent.
- Evaluation metrics depend on the benchmark split and annotation quality.

## Citation and Acknowledgements

If you use this model, please cite or acknowledge:

- Ziyu Mu, Master's Thesis, Helmholtz Centre for Infection Research (HZI BIFO), 2024.
- DFG SPP 2141 "Much more than Defence: the Multiple Functions and Facets of CRISPR-Cas" (project MC 172).
- BMBF de.NBI / GenomeNet.

## Contact

For questions about the model or deployment, contact the GenomeNet / HZI maintainers of this repository.