---
license: mit
library_name: keras
pipeline_tag: token-classification
tags:
  - biology
  - genomics
  - dna
  - crispr
  - crispr-cas
  - tensorflow
  - keras
  - sequence-classification
  - token-classification
model-index:
  - name: CRISPR-BERT
    results:
      - task:
          type: token-classification
          name: Per-position CRISPR array detection
        dataset:
          name: Ground-truth CRISPR array test split
          type: genomic-crispr-array-benchmark
        metrics:
          - type: auprc
            name: Micro AUPRC
            value: 0.9802
          - type: auroc
            name: Micro AUROC
            value: 0.9910
          - type: f1
            name: Best micro F1 over threshold grid
            value: 0.9543
          - type: f1
            name: Window-level F1 at threshold 0.5
            value: 0.9255
---

# CRISPR-BERT

CRISPR-BERT is a TensorFlow/Keras model for detecting CRISPR arrays in prokaryotic DNA sequences. It scores each nucleotide position in a 1000 bp input window with a probability in `[0, 1]`, where higher values indicate stronger evidence that the position belongs to a CRISPR array.

The checkpoint in this repository is used by the CRISPR Array Detection web app and companion inference code.

## Model Details

| Property | Value |
|---|---|
| Model type | BERT-like transformer for DNA sequence labeling |
| Framework | TensorFlow / Keras |
| Checkpoint | `best.h5` |
| Artifact size | 5.15 GB, stored with Git LFS |
| Input | 1000 bp DNA window |
| Output | Per-position CRISPR probability, shape `(batch, 1000, 1)` |
| Tokenization | `PAD/OOV=0`, `A=1`, `C=2`, `G=3`, `T=4`, ambiguous IUPAC bases as `5` |
| Architecture (per local training reports) | 24 transformer blocks, hidden size 600, about 300.7M parameters |
| Fine-tuning objective | Per-position binary CRISPR array detection |

The model uses custom Keras layers, so loading it requires the custom layer definitions from the companion inference code, passed to Keras as `custom_objects`.

## Intended Use

Use this model to:

- Predict CRISPR-array probability scores along bacterial or archaeal DNA sequences.
- Detect candidate CRISPR array regions using sliding windows and thresholding.
- Extract hidden-state embeddings for state-dynamic visualizations of repeat/spacer structure.

This model is intended for research use. It is not intended for clinical, diagnostic, or safety-critical use.

## Training Data

The model was initialized from a BERT-style model pretrained on metagenomic contigs and complete microbial genomes, then fine-tuned on annotated CRISPR array sequences. Positive examples come from annotated CRISPR arrays; negative examples are sampled from non-CRISPR genomic regions.

## Evaluation

The local benchmark report for the fine-tuned checkpoint family used a ground-truth test split with the following composition:

| Quantity | Value |
|---|---:|
| Windows | 2,840 |
| Total bases | 2,840,000 |
| Positive bases | 375,163 |
| Positive-base prevalence | 13.21% |

Metrics from the available benchmark report:

| Metric | Value |
|---|---:|
| Micro AUPRC | 0.9802 |
| Micro AUROC | 0.9910 |
| Precision at threshold 0.5 | 0.9809 |
| Recall at threshold 0.5 | 0.8828 |
| F1 at threshold 0.5 | 0.9293 |
| Best F1 over threshold grid | 0.9543 |
| Best threshold over grid | 0.31 |
| Window-level detection F1 at threshold 0.5 | 0.9255 |
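As a quick consistency check, the derived figures in the two tables above can be recomputed from the raw counts and from the precision and recall at threshold 0.5. This is only arithmetic on the reported values, assuming the per-base F1 is the usual harmonic mean of precision and recall; it does not re-run the benchmark.

```python
# Values copied from the evaluation tables in this model card.
windows = 2_840
window_len = 1_000
positive_bases = 375_163
precision = 0.9809
recall = 0.8828

total_bases = windows * window_len
prevalence = positive_bases / total_bases

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"total bases: {total_bases:,}")   # 2,840,000
print(f"prevalence:  {prevalence:.2%}")  # 13.21%
print(f"F1 at 0.5:   {f1:.4f}")          # 0.9293
```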

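Threshold choice trades precision against recall. At a threshold of 0.5 the model reaches precision 0.9809 but recall 0.8828, while the best F1 on the threshold grid (0.9543) occurs at 0.31. The companion app defaults to a threshold around 0.3 for sensitive region discovery.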
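To make the thresholding step concrete, the sketch below turns a per-position probability track into candidate regions. The function name `call_regions`, the `min_len` filter, and the use of `>=` at the threshold are illustrative assumptions; the companion app may apply different post-processing.

```python
import numpy as np


def call_regions(probs, threshold=0.3, min_len=1):
    """Return half-open (start, end) intervals where probs >= threshold.

    Hypothetical helper: `min_len` drops runs shorter than that many bases.
    """
    mask = np.asarray(probs) >= threshold
    # Pad with False on both sides so runs touching the ends are closed.
    padded = np.concatenate(([False], mask, [False]))
    # Indices where the mask flips; they alternate run start, run end.
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    return [
        (int(start), int(end))
        for start, end in zip(edges[::2], edges[1::2])
        if end - start >= min_len
    ]
```

For example, `call_regions([0.1, 0.9, 0.8, 0.2, 0.5])` returns `[(1, 3), (4, 5)]` at the default threshold, and raising the threshold to 0.6 keeps only `(1, 3)`.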

## Usage

The easiest way to run the model is through the companion Hugging Face Space:

```text
https://huggingface.co/spaces/genomenet/crispr-array-detection
```

For local inference, use the custom inference code from the CRISPR Array Detection app. A minimal loading pattern is:

```python
import tensorflow as tf
from huggingface_hub import hf_hub_download

from inference.custom_layers import get_custom_objects

model_path = hf_hub_download(
    repo_id="genomenet/crispr-bert-model",
    filename="best.h5",
)

model = tf.keras.models.load_model(
    model_path,
    custom_objects=get_custom_objects(),
    compile=False,
)
```

Input sequences should be converted to integer tokens using the same tokenizer used during training:

```python
TOKEN = {
    "A": 1,
    "C": 2,
    "G": 3,
    "T": 4,
}
# Unknown and ambiguous IUPAC bases are encoded as 5.
# Padding/OOV is encoded as 0.
```
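A minimal encoding sketch based on the mapping above is shown here. The helper name `encode` is hypothetical and is not the tokenizer shipped with the companion code; it assumes that lowercase bases are treated like uppercase and that every character other than `A`, `C`, `G`, `T` maps to 5.

```python
import numpy as np

# Token ids as listed in this model card.
TOKEN = {"A": 1, "C": 2, "G": 3, "T": 4}
AMBIGUOUS_ID = 5  # unknown and ambiguous IUPAC bases


def encode(seq: str) -> np.ndarray:
    """Map a DNA string to integer tokens (illustrative, not the training tokenizer)."""
    # Assumption: lowercase (soft-masked) bases are treated like uppercase.
    return np.array(
        [TOKEN.get(base, AMBIGUOUS_ID) for base in seq.upper()],
        dtype=np.int32,
    )
```

With this sketch, `encode("ACGTN")` yields the tokens `1, 2, 3, 4, 5`. The result has the same length as the input, so it still has to be cut into 1000 bp windows before being passed to the model.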

For sequences longer than 1000 bp, use sliding windows and aggregate overlapping per-window predictions back to sequence coordinates.
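One possible way to do this is sketched below: generate window start offsets with a fixed stride, then average the overlapping predictions at each sequence position. The stride of 500, the choice to shift the last window left so it ends at the sequence end, and averaging (rather than, say, taking the maximum) are all assumptions; the companion app may aggregate differently.

```python
import numpy as np

WINDOW = 1000  # window length stated in this model card


def window_starts(seq_len, window=WINDOW, stride=500):
    """Start offsets covering the whole sequence (hypothetical helper).

    The last window is shifted left so that it ends exactly at seq_len.
    """
    if seq_len < window:
        raise ValueError("sequence is shorter than one window")
    starts = list(range(0, seq_len - window + 1, stride))
    if starts[-1] != seq_len - window:
        starts.append(seq_len - window)
    return starts


def aggregate(window_probs, starts, seq_len):
    """Average overlapping per-window scores back onto sequence coordinates.

    window_probs has shape (n_windows, window), one row per start offset.
    """
    window_probs = np.asarray(window_probs, dtype=np.float64)
    window = window_probs.shape[1]
    total = np.zeros(seq_len, dtype=np.float64)
    count = np.zeros(seq_len, dtype=np.float64)
    for probs, start in zip(window_probs, starts):
        total[start:start + window] += probs
        count[start:start + window] += 1
    # Every position is covered at least once, so count is never zero.
    return total / count
```

Since the model output has shape `(batch, 1000, 1)`, the trailing axis would need to be dropped (for example with `[..., 0]`) before calling `aggregate`. The exact input format expected by the model should be taken from the companion inference code.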

## Limitations

- The model was developed for prokaryotic genomic sequence contexts; performance may differ on eukaryotic, viral, synthetic, or heavily assembled/metagenomic fragments.
- The model emits probability-like scores, not curated biological annotations. Downstream thresholding and manual review are recommended.
- Very short sequences are not supported by the deployed inference app; the model expects 1000 bp windows.
- Ambiguous bases are supported (encoded as token 5), but a high fraction of them may lower prediction confidence.
- Evaluation metrics depend on the benchmark split and annotation quality.

## Citation and Acknowledgements

If you use this model, please cite or acknowledge:

- Ziyu Mu, Master's Thesis, Helmholtz Centre for Infection Research (HZI BIFO), 2024.
- DFG SPP 2141 "Much more than Defence: the Multiple Functions and Facets of CRISPR-Cas" (project MC 172).
- BMBF de.NBI / GenomeNet.

## Contact

For questions about the model or deployment, contact the GenomeNet / HZI maintainers of this repository.