---
license: mit
library_name: keras
pipeline_tag: token-classification
tags:
- biology
- genomics
- dna
- crispr
- crispr-cas
- tensorflow
- keras
- sequence-classification
- token-classification
model-index:
- name: CRISPR-BERT
  results:
  - task:
      type: token-classification
      name: Per-position CRISPR array detection
    dataset:
      name: Ground-truth CRISPR array test split
      type: genomic-crispr-array-benchmark
    metrics:
    - type: auprc
      name: Micro AUPRC
      value: 0.9802
    - type: auroc
      name: Micro AUROC
      value: 0.9910
    - type: f1
      name: Best micro F1 over threshold grid
      value: 0.9543
    - type: f1
      name: Window-level F1 at threshold 0.5
      value: 0.9255
---

# CRISPR-BERT

CRISPR-BERT is a TensorFlow/Keras model for detecting CRISPR arrays in prokaryotic DNA sequences. It scores each nucleotide position in a 1000 bp input window with a probability in `[0, 1]`, where higher values indicate stronger evidence that the position belongs to a CRISPR array.

The checkpoint in this repository is used by the CRISPR Array Detection web app and companion inference code.

## Model Details

| Property | Value |
|---|---|
| Model type | BERT-like transformer for DNA sequence labeling |
| Framework | TensorFlow / Keras |
| Checkpoint | `best.h5` |
| Artifact size | 5.15 GB, stored with Git LFS |
| Input | 1000 bp DNA window |
| Output | Per-position CRISPR probability, shape `(batch, 1000, 1)` |
| Tokenization | `PAD/OOV=0`, `A=1`, `C=2`, `G=3`, `T=4`, ambiguous IUPAC bases as `5` |
| Architecture (from local training reports) | 24 transformer blocks, hidden size 600, about 300.7M parameters |
| Fine-tuning objective | Per-position binary CRISPR array detection |

The model uses custom Keras layers, so it should be loaded with the custom layer definitions from the companion inference code.

## Intended Use

Use this model to:

- Predict CRISPR-array probability scores along bacterial or archaeal DNA sequences.
- Detect candidate CRISPR array regions using sliding windows and thresholding.
- Extract hidden-state embeddings for state-dynamic visualizations of repeat/spacer structure.

This model is intended for research use. It is not intended for clinical, diagnostic, or safety-critical use.

## Training Data

The model was initialized from a BERT-style model pretrained on metagenomic contigs and complete microbial genomes, then fine-tuned on annotated CRISPR array sequences. Positive examples come from annotated CRISPR arrays; negative examples are sampled from non-CRISPR genomic regions.

## Evaluation

The local benchmark report for the fine-tuned checkpoint family used a ground-truth test split with:

| Quantity | Value |
|---|---:|
| Windows | 2,840 |
| Total bases | 2,840,000 |
| Positive bases | 375,163 |
| Positive-base prevalence | 13.21% |

Metrics from the available benchmark report:

| Metric | Value |
|---|---:|
| Micro AUPRC | 0.9802 |
| Micro AUROC | 0.9910 |
| Precision at threshold 0.5 | 0.9809 |
| Recall at threshold 0.5 | 0.8828 |
| F1 at threshold 0.5 | 0.9293 |
| Best F1 over threshold grid | 0.9543 |
| Best threshold over grid | 0.31 |
| Window-level detection F1 at threshold 0.5 | 0.9255 |
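
The threshold-0.5 F1 and the positive-base prevalence can be derived from the other reported numbers. The short check below recomputes both from the table values, assuming the usual harmonic-mean definition of F1 and 1000 bases per window; the variable names are illustrative only.

```python
# Recompute derived quantities from the values reported in the tables.
precision = 0.9809   # precision at threshold 0.5
recall = 0.8828      # recall at threshold 0.5

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

windows = 2_840
window_length = 1_000        # bp per input window
positive_bases = 375_163

total_bases = windows * window_length
prevalence = positive_bases / total_bases

print(f"F1 at threshold 0.5: {f1:.4f}")
print(f"Total bases: {total_bases:,}")
print(f"Positive-base prevalence: {prevalence:.2%}")
```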
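
Threshold choice trades precision against recall. At a threshold of 0.5 the model reaches a precision of 0.9809 but a recall of 0.8828, while the best F1 on the threshold grid occurs at 0.31. The companion app defaults to a threshold around 0.3 for sensitive region discovery.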
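
As a minimal sketch of how a threshold turns per-position scores into candidate regions, the helper below merges consecutive positions at or above the threshold into half-open intervals. The function name `scores_to_regions` and the `min_length` filter are assumptions made for illustration; they are not taken from the companion app, whose post-processing may differ.

```python
import numpy as np


def scores_to_regions(scores, threshold=0.3, min_length=1):
    """Return half-open (start, end) intervals where scores >= threshold.

    Hypothetical helper: the companion app may post-process differently.
    """
    above = np.asarray(scores, dtype=float) >= threshold
    # Pad with False so every run has both a rising and a falling edge.
    padded = np.concatenate(([False], above, [False]))
    edges = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(edges == 1)    # first position of each run
    ends = np.flatnonzero(edges == -1)     # one past the last position
    return [(int(s), int(e)) for s, e in zip(starts, ends) if e - s >= min_length]


example = [0.1, 0.4, 0.9, 0.2, 0.35, 0.05]
regions = scores_to_regions(example, threshold=0.3)
long_regions = scores_to_regions(example, threshold=0.3, min_length=2)
```

Raising `min_length` drops isolated high-scoring positions, which may be useful when only array-sized regions are of interest.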

## Usage

The easiest way to run the model is through the companion Hugging Face Space:

```text
https://huggingface.co/spaces/genomenet/crispr-array-detection
```

For local inference, use the custom inference code from the CRISPR Array Detection app. A minimal loading pattern is:

```python
import tensorflow as tf
from huggingface_hub import hf_hub_download

# Custom layer definitions ship with the companion inference code.
from inference.custom_layers import get_custom_objects

# Download the checkpoint from the Hub (cached after the first call).
model_path = hf_hub_download(
    repo_id="genomenet/crispr-bert-model",
    filename="best.h5",
)

# compile=False skips restoring the training configuration, which inference does not need.
model = tf.keras.models.load_model(
    model_path,
    custom_objects=get_custom_objects(),
    compile=False,
)
```

Input sequences should be converted to integer tokens using the same tokenizer used during training:

```python
TOKEN = {
    "A": 1,
    "C": 2,
    "G": 3,
    "T": 4,
}
# Unknown and ambiguous IUPAC bases are encoded as 5.
# Padding/OOV is encoded as 0.
```
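
The mapping above can be wrapped in a small encoder. The sketch below is an assumption-laden illustration, not the training tokenizer: it upper-cases the input, maps every character other than `A`, `C`, `G`, `T` to `5`, right-pads with `0`, and returns `int32`. The `encode` name, the padding side, and the dtype are all hypothetical, and the deployed app does not accept very short sequences, so padding may not reflect how the model was trained.

```python
import numpy as np

TOKEN = {"A": 1, "C": 2, "G": 3, "T": 4}
AMBIGUOUS = 5    # assumption: any non-ACGT character, including N and IUPAC codes
PAD = 0
WINDOW = 1000    # model input length in bp


def encode(sequence, length=WINDOW):
    """Encode a DNA string as a fixed-length int32 vector, right-padded with PAD."""
    if len(sequence) > length:
        raise ValueError(f"sequence has {len(sequence)} bases, expected at most {length}")
    tokens = np.full(length, PAD, dtype=np.int32)
    # Assumption: lower-case bases are treated like their upper-case forms.
    codes = [TOKEN.get(base, AMBIGUOUS) for base in sequence.upper()]
    tokens[: len(codes)] = codes
    return tokens


short = encode("ACGTN", length=8)
```

A batch for the model could then be built with `np.stack([encode(s) for s in windows])`, where `windows` is a list of 1000 bp strings.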

For sequences longer than 1000 bp, use sliding windows and aggregate overlapping per-window predictions back to sequence coordinates.
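
One way to do this aggregation is to average the scores of all windows that cover a position. The sketch below assumes a stride of 500 bp, mean aggregation, and a final window placed flush with the sequence end; none of these choices are confirmed by the companion code. `predict_fn` is a hypothetical callable that takes a `(batch, window)` token array and returns scores of shape `(batch, window, 1)`, for example `lambda b: model.predict(b, verbose=0)`.

```python
import numpy as np


def sliding_window_scores(tokens, predict_fn, window=1000, stride=500):
    """Average overlapping per-window scores back onto sequence coordinates."""
    tokens = np.asarray(tokens)
    n = len(tokens)
    if n < window:
        raise ValueError("sequence is shorter than one window")
    if not 0 < stride <= window:
        raise ValueError("stride must be in (0, window] so every base is covered")

    starts = list(range(0, n - window + 1, stride))
    if starts[-1] != n - window:
        starts.append(n - window)    # final window flush with the sequence end

    batch = np.stack([tokens[s:s + window] for s in starts])
    # Drop the trailing singleton axis: (batch, window, 1) -> (batch, window).
    preds = np.asarray(predict_fn(batch)).reshape(len(starts), window)

    total = np.zeros(n, dtype=np.float64)
    count = np.zeros(n, dtype=np.int64)
    for s, p in zip(starts, preds):
        total[s:s + window] += p
        count[s:s + window] += 1
    return total / count
```

For whole genomes, the windows would likely need to be passed to the model in smaller batches rather than stacked all at once as done here.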

## Limitations

- The model was developed for prokaryotic genomic sequence contexts; performance may differ on eukaryotic, viral, synthetic, or heavily assembled/metagenomic fragments.
- The model emits probability-like scores, not curated biological annotations. Downstream thresholding and manual review are recommended.
- Very short sequences are not supported by the deployed inference app; the model expects 1000 bp windows.
- Ambiguous bases are supported but may reduce confidence if frequent.
- Evaluation metrics depend on the benchmark split and annotation quality.

## Citation and Acknowledgements

If you use this model, please cite or acknowledge:

- Ziyu Mu, Master's Thesis, Helmholtz Centre for Infection Research (HZI BIFO), 2024.
- DFG SPP 2141 "Much more than Defence: the Multiple Functions and Facets of CRISPR-Cas" (project MC 172).
- BMBF de.NBI / GenomeNet.

## Contact

For questions about the model or deployment, contact the GenomeNet / HZI maintainers of this repository.