Add model card
README.md
ADDED
@@ -0,0 +1,162 @@
---
license: mit
library_name: keras
pipeline_tag: token-classification
tags:
- biology
- genomics
- dna
- crispr
- crispr-cas
- tensorflow
- keras
- sequence-classification
- token-classification
model-index:
- name: CRISPR-BERT
  results:
  - task:
      type: token-classification
      name: Per-position CRISPR array detection
    dataset:
      name: Ground-truth CRISPR array test split
      type: genomic-crispr-array-benchmark
    metrics:
    - type: auprc
      name: Micro AUPRC
      value: 0.9802
    - type: auroc
      name: Micro AUROC
      value: 0.9910
    - type: f1
      name: Best micro F1 over threshold grid
      value: 0.9543
    - type: f1
      name: Window-level F1 at threshold 0.5
      value: 0.9255
---

# CRISPR-BERT

CRISPR-BERT is a TensorFlow/Keras model for detecting CRISPR arrays in prokaryotic DNA sequences. It scores each nucleotide position in a 1000 bp input window with a probability in `[0, 1]`, where higher values indicate stronger evidence that the position belongs to a CRISPR array.

The checkpoint in this repository is used by the CRISPR Array Detection web app and companion inference code.

## Model Details

| Property | Value |
|---|---|
| Model type | BERT-like transformer for DNA sequence labeling |
| Framework | TensorFlow / Keras |
| Checkpoint | `best.h5` |
| Artifact size | 5.15 GB, stored with Git LFS |
| Input | 1000 bp DNA window |
| Output | Per-position CRISPR probability, shape `(batch, 1000, 1)` |
| Tokenization | `PAD/OOV=0`, `A=1`, `C=2`, `G=3`, `T=4`, ambiguous IUPAC bases as `5` |
| Architecture inspected from local training reports | 24 transformer blocks, hidden size 600, about 300.7M parameters |
| Fine-tuning objective | Per-position binary CRISPR array detection |

The model uses custom Keras layers and should be loaded with the custom layer definitions from the companion inference code.

## Intended Use

Use this model to:

- Predict CRISPR-array probability scores along bacterial or archaeal DNA sequences.
- Detect candidate CRISPR array regions using sliding windows and thresholding.
- Extract hidden-state embeddings for state-dynamic visualizations of repeat/spacer structure.

This model is intended for research use. It is not intended for clinical, diagnostic, or safety-critical use.

## Training Data

The model was initialized from a BERT-style model pretrained on metagenomic contigs and complete microbial genomes, then fine-tuned on annotated CRISPR array sequences. Positive examples come from annotated CRISPR arrays; negative examples are sampled from non-CRISPR genomic regions.

## Evaluation

The local benchmark report for the fine-tuned checkpoint family used a ground-truth test split with:

| Quantity | Value |
|---|---:|
| Windows | 2,840 |
| Total bases | 2,840,000 |
| Positive bases | 375,163 |
| Positive-base prevalence | 13.21% |

Metrics from the available benchmark report:

| Metric | Value |
|---|---:|
| Micro AUPRC | 0.9802 |
| Micro AUROC | 0.9910 |
| Precision at threshold 0.5 | 0.9809 |
| Recall at threshold 0.5 | 0.8828 |
| F1 at threshold 0.5 | 0.9293 |
| Best F1 over threshold grid | 0.9543 |
| Best threshold over grid | 0.31 |
| Window-level detection F1 at threshold 0.5 | 0.9255 |

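
The figures in the two tables above can be cross-checked with a few lines of arithmetic. The sketch below recomputes the total base count and positive-base prevalence from the split table, and the F1 at threshold 0.5 from the reported precision and recall using the standard harmonic-mean definition. The variable names are illustrative and are not taken from the benchmark code.

```python
# Values copied from the evaluation tables.
windows = 2_840
window_length = 1_000        # bp per window, as listed under Model Details
positive_bases = 375_163
precision = 0.9809
recall = 0.8828

total_bases = windows * window_length
prevalence = positive_bases / total_bases

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"total bases: {total_bases:,}")    # total bases: 2,840,000
print(f"prevalence:  {prevalence:.2%}")   # prevalence:  13.21%
print(f"F1 @ 0.5:    {f1:.4f}")           # F1 @ 0.5:    0.9293
```

All three recomputed values agree with the reported ones to the stated precision.
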
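
Threshold choice affects sensitivity and specificity: lowering the threshold recovers more true CRISPR positions at the cost of more false positives. On this benchmark the best F1 was obtained at a threshold of 0.31 rather than 0.5, and the companion app defaults to a threshold around 0.3 for sensitive region discovery.
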
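
To make the effect of the threshold concrete, here is a minimal sketch of turning per-base scores into candidate regions. It is not the companion app's post-processing: `call_regions` and its `min_length` filter are hypothetical names introduced for illustration, and the default of 0.3 only mirrors the approximate app default mentioned above.

```python
import numpy as np


def call_regions(probs, threshold=0.3, min_length=1):
    """Return half-open (start, end) intervals where probs >= threshold."""
    probs = np.asarray(probs, dtype=float)
    above = probs >= threshold
    # Pad with False on both sides so every run has a rising and a falling edge.
    padded = np.concatenate(([False], above, [False]))
    edges = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    # Drop runs shorter than min_length (hypothetical filter, off by default).
    return [(int(s), int(e)) for s, e in zip(starts, ends) if e - s >= min_length]


scores = np.array([0.05, 0.10, 0.40, 0.90, 0.95, 0.35, 0.20, 0.10, 0.60, 0.70])
print(call_regions(scores, threshold=0.3))  # [(2, 6), (8, 10)]
print(call_regions(scores, threshold=0.5))  # [(3, 5), (8, 10)]
```

Raising the threshold from 0.3 to 0.5 shrinks the first region from four bases to two, which is the sensitivity trade-off described above in miniature.
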

## Usage

The easiest way to run the model is through the companion Hugging Face Space:

```text
https://huggingface.co/spaces/genomenet/crispr-array-detection
```

For local inference, use the custom inference code from the CRISPR Array Detection app. A minimal loading pattern is:

```python
import tensorflow as tf
from huggingface_hub import hf_hub_download

# Custom layer definitions ship with the companion inference code.
from inference.custom_layers import get_custom_objects

# Download the checkpoint from the Hub (cached locally after the first call).
model_path = hf_hub_download(
    repo_id="genomenet/crispr-bert-model",
    filename="best.h5",
)

# compile=False loads the model without an optimizer or loss,
# which is sufficient for inference.
model = tf.keras.models.load_model(
    model_path,
    custom_objects=get_custom_objects(),
    compile=False,
)
```

Input sequences should be converted to integer tokens using the same tokenizer used during training:

```python
TOKEN = {
    "A": 1,
    "C": 2,
    "G": 3,
    "T": 4,
}
# Unknown and ambiguous IUPAC bases are encoded as 5.
# Padding/OOV is encoded as 0.
```

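
The mapping above can be wrapped in a small encoder. The sketch below is an illustration only, and the tokenizer in the companion inference code remains authoritative. Several details are assumptions that the model card does not state: input is upper-cased first, the IUPAC ambiguity codes `RYSWKMBDHVN` map to 5, any other character is left as 0, short inputs are right-padded with 0, and the output dtype is `int32`. The names `encode`, `AMBIGUOUS`, `AMBIGUOUS_ID`, and `PAD_ID` are hypothetical.

```python
import numpy as np

TOKEN = {"A": 1, "C": 2, "G": 3, "T": 4}
AMBIGUOUS = set("RYSWKMBDHVN")  # IUPAC ambiguity codes (assumed set)
AMBIGUOUS_ID = 5
PAD_ID = 0


def encode(seq, length=1000):
    """Encode a DNA string as a fixed-length int32 vector, right-padded with 0."""
    if len(seq) > length:
        raise ValueError(f"sequence has {len(seq)} bases; expected at most {length}")
    ids = np.full(length, PAD_ID, dtype=np.int32)
    for i, base in enumerate(seq.upper()):
        if base in TOKEN:
            ids[i] = TOKEN[base]
        elif base in AMBIGUOUS:
            ids[i] = AMBIGUOUS_ID
        # Any other character keeps the value 0 (assumption: treated as OOV).
    return ids


print(encode("acgtN", length=8))  # [1 2 3 4 5 0 0 0]
```

A batch for the model would then be a stack of such vectors with shape `(batch, 1000)`, assuming the model takes integer token IDs directly.
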

For sequences longer than 1000 bp, use sliding windows and aggregate overlapping per-window predictions back to sequence coordinates.

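
One way to do this is sketched below, under stated assumptions. The stride of 500 and the choice to average overlapping scores are illustrative (taking the maximum is another option) and are not confirmed as the companion app's behaviour. `predict_fn` is a placeholder for something like `model.predict` and is assumed to return an array of shape `(batch, window, 1)`; `window_starts` and `predict_long` are hypothetical helper names.

```python
import numpy as np


def window_starts(seq_len, window=1000, stride=500):
    """Start offsets covering [0, seq_len), with the last window flush to the end."""
    if seq_len < window:
        raise ValueError("sequence is shorter than one window")
    if not 0 < stride <= window:
        raise ValueError("stride must be in (0, window] so every base is covered")
    starts = list(range(0, seq_len - window + 1, stride))
    if starts[-1] != seq_len - window:
        starts.append(seq_len - window)
    return starts


def predict_long(tokens, predict_fn, window=1000, stride=500):
    """Average overlapping per-window scores back onto sequence coordinates."""
    tokens = np.asarray(tokens)
    starts = window_starts(len(tokens), window, stride)
    batch = np.stack([tokens[s:s + window] for s in starts])
    scores = np.asarray(predict_fn(batch))[..., 0]  # (batch, window)
    total = np.zeros(len(tokens), dtype=np.float64)
    count = np.zeros(len(tokens), dtype=np.float64)
    for s, row in zip(starts, scores):
        total[s:s + window] += row
        count[s:s + window] += 1
    return total / count


# Stand-in for the real model: scores 1.0 wherever the token is "A" (1).
def dummy_predict(batch):
    return (batch == 1).astype(np.float64)[..., None]


tokens = np.tile([1, 2, 3, 4], 575)          # 2300 tokens
print(window_starts(len(tokens)))            # [0, 500, 1000, 1300]
print(predict_long(tokens, dummy_predict).shape)  # (2300,)
```

Anchoring the final window to the end of the sequence avoids padding, at the cost of an uneven overlap for the last stretch.
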

## Limitations

- The model was developed for prokaryotic genomic sequence contexts; performance may differ on eukaryotic, viral, synthetic, or heavily assembled/metagenomic fragments.
- The model emits probability-like scores, not curated biological annotations. Downstream thresholding and manual review are recommended.
- Very short sequences are not supported by the deployed inference app; the model expects 1000 bp windows.
- Ambiguous bases are supported, but frequent ambiguous bases may reduce prediction confidence.
- Evaluation metrics depend on the benchmark split and annotation quality.

## Citation and Acknowledgements

If you use this model, please cite or acknowledge:

- Ziyu Mu, Master's Thesis, Helmholtz Centre for Infection Research (HZI BIFO), 2024.
- DFG SPP 2141 "Much more than Defence: the Multiple Functions and Facets of CRISPR-Cas" (project MC 172).
- BMBF de.NBI / GenomeNet.

## Contact

For questions about the model or deployment, contact the GenomeNet / HZI maintainers of this repository.