---
license: mit
library_name: keras
pipeline_tag: token-classification
tags:
- biology
- genomics
- dna
- crispr
- crispr-cas
- tensorflow
- keras
- sequence-classification
- token-classification
model-index:
- name: CRISPR-BERT
  results:
  - task:
      type: token-classification
      name: Per-position CRISPR array detection
    dataset:
      name: Ground-truth CRISPR array test split
      type: genomic-crispr-array-benchmark
    metrics:
    - type: auprc
      name: Micro AUPRC
      value: 0.9802
    - type: auroc
      name: Micro AUROC
      value: 0.9910
    - type: f1
      name: Best micro F1 over threshold grid
      value: 0.9543
    - type: f1
      name: Window-level F1 at threshold 0.5
      value: 0.9255
---
# CRISPR-BERT
CRISPR-BERT is a TensorFlow/Keras model for detecting CRISPR arrays in prokaryotic DNA sequences. It scores each nucleotide position in a 1000 bp input window with a probability in `[0, 1]`, where higher values indicate stronger evidence that the position belongs to a CRISPR array.
The checkpoint in this repository is used by the CRISPR Array Detection web app and companion inference code.
## Model Details
| Property | Value |
|---|---|
| Model type | BERT-like transformer for DNA sequence labeling |
| Framework | TensorFlow / Keras |
| Checkpoint | `best.h5` |
| Artifact size | 5.15 GB, stored with Git LFS |
| Input | 1000 bp DNA window |
| Output | Per-position CRISPR probability, shape `(batch, 1000, 1)` |
| Tokenization | `PAD/OOV=0`, `A=1`, `C=2`, `G=3`, `T=4`, ambiguous IUPAC bases as `5` |
| Architecture (from local training reports) | 24 transformer blocks, hidden size 600, about 300.7M parameters |
| Fine-tuning objective | Per-position binary CRISPR array detection |
The model uses custom Keras layers and should be loaded with the custom layer definitions from the companion inference code.
## Intended Use
Use this model to:
- Predict CRISPR-array probability scores along bacterial or archaeal DNA sequences.
- Detect candidate CRISPR array regions using sliding windows and thresholding.
- Extract hidden-state embeddings for state-dynamic visualizations of repeat/spacer structure.
This model is intended for research use. It is not intended for clinical, diagnostic, or safety-critical use.
## Training Data
The model was initialized from a BERT-style model pretrained on metagenomic contigs and complete microbial genomes, then fine-tuned on annotated CRISPR array sequences. Positive examples come from annotated CRISPR arrays; negative examples are sampled from non-CRISPR genomic regions.
## Evaluation
The local benchmark report for the fine-tuned checkpoint family was computed on a ground-truth test split of the following size:
| Quantity | Value |
|---|---:|
| Windows | 2,840 |
| Total bases | 2,840,000 |
| Positive bases | 375,163 |
| Positive-base prevalence | 13.21% |
Metrics from the available benchmark report:
| Metric | Value |
|---|---:|
| Micro AUPRC | 0.9802 |
| Micro AUROC | 0.9910 |
| Precision at threshold 0.5 | 0.9809 |
| Recall at threshold 0.5 | 0.8828 |
| F1 at threshold 0.5 | 0.9293 |
| Best F1 over threshold grid | 0.9543 |
| Best threshold over grid | 0.31 |
| Window-level detection F1 at threshold 0.5 | 0.9255 |
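Threshold choice trades precision against recall. At threshold 0.5 the model reaches precision 0.9809 but recall only 0.8828 (F1 0.9293, the harmonic mean of the two), while the best F1 on the threshold grid (0.9543) occurs at 0.31. The companion app therefore defaults to a threshold around 0.3 for sensitive region discovery.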
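Thresholding turns the per-position scores into candidate regions. The following is a minimal sketch of that step, not the companion app's implementation: the helper name `call_regions`, the `min_length` parameter, and the toy score vector are illustrative assumptions.

```python
import numpy as np


def call_regions(scores, threshold=0.3, min_length=1):
    """Return half-open (start, end) intervals where scores >= threshold.

    Hypothetical helper for illustration; not part of the released code.
    """
    above = np.asarray(scores) >= threshold
    # Pad with False on both sides so every run has a rising and a falling edge.
    padded = np.concatenate(([False], above, [False]))
    edges = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    return [(int(s), int(e)) for s, e in zip(starts, ends) if e - s >= min_length]


# Toy scores, invented for this example.
scores = np.array([0.05, 0.10, 0.35, 0.80, 0.90, 0.40, 0.20, 0.05, 0.60, 0.10])
sensitive = call_regions(scores, threshold=0.3)  # wider regions
strict = call_regions(scores, threshold=0.5)     # narrower regions
```

On the toy vector, the lower threshold extends the first region by one position on each side, which illustrates why a threshold near 0.3 favors recall.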
## Usage
The easiest way to run the model is through the companion Hugging Face Space:
```text
https://huggingface.co/spaces/genomenet/crispr-array-detection
```
For local inference, use the custom inference code from the CRISPR Array Detection app. A minimal loading pattern is:
```python
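import tensorflow as tf
from huggingface_hub import hf_hub_download

# Custom Keras layer definitions from the companion inference code.
from inference.custom_layers import get_custom_objects

# Download the checkpoint (cached locally after the first call).
model_path = hf_hub_download(
    repo_id="genomenet/crispr-bert-model",
    filename="best.h5",
)

# compile=False skips compiling the model after loading; enough for inference.
model = tf.keras.models.load_model(
    model_path,
    custom_objects=get_custom_objects(),
    compile=False,
)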
```
Input sequences should be converted to integer tokens using the same tokenizer used during training:
```python
TOKEN = {
    "A": 1,
    "C": 2,
    "G": 3,
    "T": 4,
}
# Unknown and ambiguous IUPAC bases are encoded as 5.
# Padding/OOV is encoded as 0.
```
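The mapping above can be applied to a sequence string as in the sketch below. The `encode` helper is hypothetical, and upper-casing the input first is an assumption; check the companion inference code for how the deployed tokenizer treats lower-case (soft-masked) bases.

```python
import numpy as np

TOKEN = {"A": 1, "C": 2, "G": 3, "T": 4}
AMBIGUOUS = 5  # unknown and ambiguous IUPAC bases


def encode(sequence):
    """Map a DNA string to an int32 token array (hypothetical helper)."""
    # Assumption: lower-case bases are treated like their upper-case forms.
    return np.array(
        [TOKEN.get(base, AMBIGUOUS) for base in sequence.upper()],
        dtype=np.int32,
    )


tokens = encode("ACGTNacgtR")
```

Here `N` and `R` fall outside the four canonical bases, so both map to `5`.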
For sequences longer than 1000 bp, use sliding windows and aggregate overlapping per-window predictions back to sequence coordinates.
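One way to do this is sketched below, under stated assumptions: a stride of 500 bp and simple averaging of overlapping scores are illustrative choices, not necessarily what the companion app does. The names `window_starts`, `predict_long`, and `fake_predict` are hypothetical; `fake_predict` stands in for the real model so the example runs without the checkpoint.

```python
import numpy as np

WINDOW = 1000  # model input length in bp


def window_starts(length, window=WINDOW, stride=500):
    """Start offsets covering [0, length), with a last window flush to the end."""
    if length < window:
        raise ValueError("sequence is shorter than one window")
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)
    return starts


def predict_long(tokens, predict_fn, window=WINDOW, stride=500):
    """Average overlapping per-window scores back onto sequence coordinates."""
    tokens = np.asarray(tokens)
    starts = window_starts(len(tokens), window, stride)
    batch = np.stack([tokens[s:s + window] for s in starts])
    scores = np.asarray(predict_fn(batch))  # expected shape: (n_windows, window)
    total = np.zeros(len(tokens), dtype=np.float64)
    count = np.zeros(len(tokens), dtype=np.float64)
    for s, window_scores in zip(starts, scores):
        total[s:s + window] += window_scores
        count[s:s + window] += 1
    return total / count


def fake_predict(batch):
    """Stand-in for the model: scores token value 3 as 1.0, everything else 0.0."""
    return (batch == 3).astype(np.float32)


tokens = np.ones(2300, dtype=np.int32)
tokens[1200:1500] = 3
scores = predict_long(tokens, fake_predict)
```

With the real checkpoint, `predict_fn` could be something like `lambda batch: model.predict(batch)[..., 0]`, assuming the model accepts integer token arrays of shape `(batch, 1000)` and returns the `(batch, 1000, 1)` output described in Model Details.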
## Limitations
- The model was developed for prokaryotic genomic sequence contexts; performance may differ on eukaryotic, viral, synthetic, or heavily assembled/metagenomic fragments.
- The model emits probability-like scores, not curated biological annotations. Downstream thresholding and manual review are recommended.
- Very short sequences are not supported by the deployed inference app; the model expects 1000 bp windows.
- Ambiguous bases are supported but may reduce confidence if frequent.
- Evaluation metrics depend on the benchmark split and annotation quality.
## Citation and Acknowledgements
If you use this model, please cite or acknowledge:
- Ziyu Mu, Master's Thesis, Helmholtz Centre for Infection Research (HZI BIFO), 2024.
- DFG SPP 2141 "Much more than Defence: the Multiple Functions and Facets of CRISPR-Cas" (project MC 172).
- BMBF de.NBI / GenomeNet.
## Contact
For questions about the model or deployment, contact the GenomeNet / HZI maintainers of this repository.