---
license: mit
library_name: keras
pipeline_tag: token-classification
tags:
- biology
- genomics
- dna
- crispr
- crispr-cas
- tensorflow
- keras
- sequence-classification
- token-classification
model-index:
- name: CRISPR-BERT
  results:
  - task:
      type: token-classification
      name: Per-position CRISPR array detection
    dataset:
      name: Ground-truth CRISPR array test split
      type: genomic-crispr-array-benchmark
    metrics:
    - type: auprc
      name: Micro AUPRC
      value: 0.9802
    - type: auroc
      name: Micro AUROC
      value: 0.9910
    - type: f1
      name: Best micro F1 over threshold grid
      value: 0.9543
    - type: f1
      name: Window-level F1 at threshold 0.5
      value: 0.9255
---

# CRISPR-BERT

CRISPR-BERT is a TensorFlow/Keras model for detecting CRISPR arrays in prokaryotic DNA sequences. It scores each nucleotide position in a 1000 bp input window with a probability in `[0, 1]`, where higher values indicate stronger evidence that the position belongs to a CRISPR array.

The checkpoint in this repository is used by the CRISPR Array Detection web app and companion inference code.

## Model Details

| Property | Value |
|---|---|
| Model type | BERT-like transformer for DNA sequence labeling |
| Framework | TensorFlow / Keras |
| Checkpoint | `best.h5` |
| Artifact size | 5.15 GB, stored with Git LFS |
| Input | 1000 bp DNA window |
| Output | Per-position CRISPR probability, shape `(batch, 1000, 1)` |
| Tokenization | `PAD/OOV=0`, `A=1`, `C=2`, `G=3`, `T=4`, ambiguous IUPAC bases as `5` |
| Architecture inspected from local training reports | 24 transformer blocks, hidden size 600, about 300.7M parameters |
| Fine-tuning objective | Per-position binary CRISPR array detection |

The model uses custom Keras layers and should be loaded with the custom layer definitions from the companion inference code.

## Intended Use

Use this model to:

- Predict CRISPR-array probability scores along bacterial or archaeal DNA sequences.
- Detect candidate CRISPR array regions using sliding windows and thresholding.
- Extract hidden-state embeddings for state-dynamic visualizations of repeat/spacer structure.

This model is intended for research use. It is not intended for clinical, diagnostic, or safety-critical use.

## Training Data

The model was initialized from a BERT-style model pretrained on metagenomic contigs and complete microbial genomes, then fine-tuned on annotated CRISPR array sequences. Positive examples come from annotated CRISPR arrays; negative examples are sampled from non-CRISPR genomic regions.

## Evaluation

The local benchmark report for the fine-tuned checkpoint family used a ground-truth test split with:

| Quantity | Value |
|---|---:|
| Windows | 2,840 |
| Total bases | 2,840,000 |
| Positive bases | 375,163 |
| Positive-base prevalence | 13.21% |

Metrics from the available benchmark report:

| Metric | Value |
|---|---:|
| Micro AUPRC | 0.9802 |
| Micro AUROC | 0.9910 |
| Precision at threshold 0.5 | 0.9809 |
| Recall at threshold 0.5 | 0.8828 |
| F1 at threshold 0.5 | 0.9293 |
| Best F1 over threshold grid | 0.9543 |
| Best threshold over grid | 0.31 |
| Window-level detection F1 at threshold 0.5 | 0.9255 |

Threshold choice affects sensitivity and specificity. The companion app defaults to a threshold around 0.3 for sensitive region discovery.

## Usage

The easiest way to run the model is through the companion Hugging Face Space:

```text
https://huggingface.co/spaces/genomenet/crispr-array-detection
```

For local inference, use the custom inference code from the CRISPR Array Detection app.
A minimal loading pattern is:

```python
import tensorflow as tf
from huggingface_hub import hf_hub_download

# Custom layer definitions ship with the companion inference code.
from inference.custom_layers import get_custom_objects

# Download the checkpoint (about 5.15 GB) from the Hub, or reuse the local cache.
model_path = hf_hub_download(
    repo_id="genomenet/crispr-bert-model",
    filename="best.h5",
)

# compile=False skips restoring the training configuration, which inference does not need.
model = tf.keras.models.load_model(
    model_path,
    custom_objects=get_custom_objects(),
    compile=False,
)
```

Input sequences should be converted to integer tokens using the same tokenizer used during training:

```python
TOKEN = {
    "A": 1,
    "C": 2,
    "G": 3,
    "T": 4,
}
# Unknown and ambiguous IUPAC bases are encoded as 5.
# Padding/OOV is encoded as 0.
```

For sequences longer than 1000 bp, use sliding windows and aggregate overlapping per-window predictions back to sequence coordinates.

## Limitations

- The model was developed for prokaryotic genomic sequence contexts; performance may differ on eukaryotic, viral, synthetic, or heavily assembled/metagenomic fragments.
- The model emits probability-like scores, not curated biological annotations. Downstream thresholding and manual review are recommended.
- Very short sequences are not supported by the deployed inference app; the model expects 1000 bp windows.
- Ambiguous bases are supported but may reduce confidence if frequent.
- Evaluation metrics depend on the benchmark split and annotation quality.

## Citation and Acknowledgements

If you use this model, please cite or acknowledge:

- Ziyu Mu, Master's Thesis, Helmholtz Centre for Infection Research (HZI BIFO), 2024.
- DFG SPP 2141 "Much more than Defence: the Multiple Functions and Facets of CRISPR-Cas" (project MC 172).
- BMBF de.NBI / GenomeNet.

## Contact

For questions about the model or deployment, contact the GenomeNet / HZI maintainers of this repository.
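
## Example: Sliding-Window Scoring Sketch

The Usage section describes tokenizing a sequence, scoring it in 1000 bp sliding windows, and aggregating overlapping predictions back to sequence coordinates. The sketch below shows one way this could look. It is not the companion app's implementation. The function names, the 500 bp stride, the choice to average overlapping scores, and the assumption that the model accepts an integer array of shape `(batch, 1000)` are all illustrative assumptions; only the token map, the window length, the output shape, and the 0.3 threshold come from this card.

```python
import numpy as np

WINDOW = 1000  # model input length in bp (from this card)
TOKEN = {"A": 1, "C": 2, "G": 3, "T": 4}
AMBIGUOUS = 5  # unknown / ambiguous IUPAC bases (from this card)


def encode(seq):
    """Map a DNA string to integer tokens; anything outside ACGT becomes 5."""
    return np.array([TOKEN.get(b, AMBIGUOUS) for b in seq.upper()], dtype=np.int32)


def window_starts(length, window=WINDOW, stride=500):
    """Start offsets covering the sequence; the last window ends at the final base."""
    if length < window:
        raise ValueError(f"sequence must be at least {window} bp")
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)
    return starts


def score_sequence(seq, predict, window=WINDOW, stride=500):
    """Average overlapping per-window scores onto sequence coordinates.

    `predict` is a hypothetical callable taking tokens of shape (batch, window)
    and returning scores of shape (batch, window, 1).
    """
    tokens = encode(seq)
    total = np.zeros(len(tokens), dtype=np.float64)
    count = np.zeros(len(tokens), dtype=np.float64)
    for start in window_starts(len(tokens), window, stride):
        probs = np.asarray(predict(tokens[None, start:start + window]))
        total[start:start + window] += probs[0, :, 0]
        count[start:start + window] += 1
    return total / count  # every position is covered by at least one window


def call_regions(scores, threshold=0.3):
    """Return half-open (start, end) runs where the score is >= threshold."""
    mask = np.concatenate(([False], scores >= threshold, [False]))
    edges = np.flatnonzero(mask[1:] != mask[:-1])
    return [(int(s), int(e)) for s, e in zip(edges[::2], edges[1::2])]
```

With a model loaded as shown under Usage, a call might look like `scores = score_sequence(seq, lambda x: model.predict(x, verbose=0))` followed by `call_regions(scores)`, assuming the model accepts that input shape. Averaging is only one possible aggregation; taking the per-position maximum would favor sensitivity instead.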