	---
	license: mit
	library_name: keras
	pipeline_tag: token-classification
	tags:
	- biology
	- genomics
	- dna
	- crispr
	- crispr-cas
	- tensorflow
	- keras
	- sequence-classification
	- token-classification
	model-index:
	- name: CRISPR-BERT
	  results:
	  - task:
	      type: token-classification
	      name: Per-position CRISPR array detection
	    dataset:
	      name: Ground-truth CRISPR array test split
	      type: genomic-crispr-array-benchmark
	    metrics:
	    - type: auprc
	      name: Micro AUPRC
	      value: 0.9802
	    - type: auroc
	      name: Micro AUROC
	      value: 0.9910
	    - type: f1
	      name: Best micro F1 over threshold grid
	      value: 0.9543
	    - type: f1
	      name: Window-level F1 at threshold 0.5
	      value: 0.9255
	---

	# CRISPR-BERT

	CRISPR-BERT is a TensorFlow/Keras model for detecting CRISPR arrays in prokaryotic DNA sequences. It scores each nucleotide position in a 1000 bp input window with a probability in `[0, 1]`, where higher values indicate stronger evidence that the position belongs to a CRISPR array.

	The checkpoint in this repository is used by the CRISPR Array Detection web app and companion inference code.

	## Model Details

	| Property | Value |
	|---|---|
	| Model type | BERT-like transformer for DNA sequence labeling |
	| Framework | TensorFlow / Keras |
	| Checkpoint | `best.h5` |
	| Artifact size | 5.15 GB, stored with Git LFS |
	| Input | 1000 bp DNA window |
	| Output | Per-position CRISPR probability, shape `(batch, 1000, 1)` |
	| Tokenization | `PAD/OOV=0`, `A=1`, `C=2`, `G=3`, `T=4`, ambiguous IUPAC bases as `5` |
	| Architecture (per local training reports) | 24 transformer blocks, hidden size 600, about 300.7M parameters |
	| Fine-tuning objective | Per-position binary CRISPR array detection |

	The model uses custom Keras layers and should be loaded with the custom layer definitions from the companion inference code.

	## Intended Use

	Use this model to:

	- Predict CRISPR-array probability scores along bacterial or archaeal DNA sequences.
	- Detect candidate CRISPR array regions using sliding windows and thresholding.
	- Extract hidden-state embeddings for state-dynamic visualizations of repeat/spacer structure.

	This model is intended for research use. It is not intended for clinical, diagnostic, or safety-critical use.

	## Training Data

	The model was initialized from a BERT-style model pretrained on metagenomic contigs and complete microbial genomes, then fine-tuned on annotated CRISPR array sequences. Positive examples come from annotated CRISPR arrays; negative examples are sampled from non-CRISPR genomic regions.

	## Evaluation

	The local benchmark report for the fine-tuned checkpoint family used a ground-truth test split with:

	| Quantity | Value |
	|---|---:|
	| Windows | 2,840 |
	| Total bases | 2,840,000 |
	| Positive bases | 375,163 |
	| Positive-base prevalence | 13.21% |

	Metrics from the available benchmark report:

	| Metric | Value |
	|---|---:|
	| Micro AUPRC | 0.9802 |
	| Micro AUROC | 0.9910 |
	| Precision at threshold 0.5 | 0.9809 |
	| Recall at threshold 0.5 | 0.8828 |
	| F1 at threshold 0.5 | 0.9293 |
	| Best F1 over threshold grid | 0.9543 |
	| Best threshold over grid | 0.31 |
	| Window-level detection F1 at threshold 0.5 | 0.9255 |
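
As a quick consistency check on the tables above, the per-base F1 at threshold 0.5 and the positive-base prevalence can be recomputed from the reported figures. This is only arithmetic on the published numbers, not a re-evaluation of the model.

```python
# Values copied from the benchmark tables above.
precision = 0.9809
recall = 0.8828
positive_bases = 375_163
total_bases = 2_840_000

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Prevalence is the fraction of bases labeled as CRISPR array.
# A scorer with no skill would be expected to reach an AUPRC near this value.
prevalence = positive_bases / total_bases

print(f"F1 at threshold 0.5: {f1:.4f}")
print(f"Positive-base prevalence: {prevalence:.2%}")
```

Both values round to the figures reported in the tables (0.9293 and 13.21%).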

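	Threshold choice trades precision against recall. At 0.5 the model is precise (0.9809) but misses some positive bases (recall 0.8828), while the best F1 on the threshold grid occurs at 0.31. The companion app defaults to a threshold around 0.3 for sensitive region discovery.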
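
To illustrate how a threshold turns per-position scores into candidate regions, here is a minimal sketch in plain Python. The function name `call_regions`, the `min_length` filter, and the half-open `(start, end)` convention are assumptions made for this example; the companion app may post-process scores differently.

```python
def call_regions(probs, threshold=0.3, min_length=1):
    """Return half-open (start, end) intervals where probs[i] >= threshold.

    Hypothetical helper for illustration; not part of the companion code.
    """
    regions = []
    start = None  # start index of the run currently being extended, if any
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i
        elif start is not None:
            # The run ended at position i; keep it if it is long enough.
            if i - start >= min_length:
                regions.append((start, i))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(probs) - start >= min_length:
        regions.append((start, len(probs)))
    return regions


scores = [0.1, 0.9, 0.8, 0.2, 0.4, 0.1]
print(call_regions(scores))                # two short candidate regions
print(call_regions(scores, min_length=2))  # drops the single-base run
```

Raising `min_length` is one simple way to suppress isolated high-scoring positions before manual review.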

	## Usage

	The easiest way to run the model is through the companion Hugging Face Space:

	```text
	https://huggingface.co/spaces/genomenet/crispr-array-detection
	```

	For local inference, use the custom inference code from the CRISPR Array Detection app. A minimal loading pattern is:

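	```python
	import tensorflow as tf
	from huggingface_hub import hf_hub_download

	# Custom layer definitions ship with the companion inference code.
	from inference.custom_layers import get_custom_objects

	# Download the checkpoint, or reuse it from the local Hugging Face cache.
	model_path = hf_hub_download(
	    repo_id="genomenet/crispr-bert-model",
	    filename="best.h5",
	)

	# compile=False skips restoring the training configuration, which inference does not need.
	model = tf.keras.models.load_model(
	    model_path,
	    custom_objects=get_custom_objects(),
	    compile=False,
	)
	```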

	Input sequences should be converted to integer tokens using the same tokenizer used during training:

	```python
	TOKEN = {
	    "A": 1,
	    "C": 2,
	    "G": 3,
	    "T": 4,
	}
	# Unknown and ambiguous IUPAC bases are encoded as 5.
	# Padding/OOV is encoded as 0.
	```
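
The mapping above can be wrapped in a small encoder. The sketch below is one possible implementation, not the tokenizer from the companion code: the name `encode_window`, the upper-casing of input, and right-padding of short inputs with `0` are assumptions for illustration.

```python
WINDOW = 1000
TOKEN = {"A": 1, "C": 2, "G": 3, "T": 4}
OTHER = 5  # unknown and ambiguous IUPAC bases
PAD = 0    # padding


def encode_window(seq, length=WINDOW):
    """Encode one DNA string as a fixed-length list of integer tokens."""
    if len(seq) > length:
        raise ValueError(f"sequence has {len(seq)} bp; expected at most {length}")
    # Assumption: lowercase (soft-masked) bases are treated like uppercase.
    ids = [TOKEN.get(base, OTHER) for base in seq.upper()]
    # Assumption: short inputs are right-padded with the PAD token.
    return ids + [PAD] * (length - len(ids))


print(encode_window("ACGTN", length=8))
```

Before relying on an encoder like this, it is worth checking its output against the tokenizer in the companion inference code on a few known sequences.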

	For sequences longer than 1000 bp, use sliding windows and aggregate overlapping per-window predictions back to sequence coordinates.
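
The sliding-window step can be sketched as follows. This is a plain-Python illustration under stated assumptions: a stride of 500 bp, mean aggregation of overlapping scores, and a final window anchored to the sequence end are choices made for the example, and `predict_fn` is a hypothetical callable that tokenizes one chunk, runs the model, and returns one score per base. None of these are confirmed to match the companion app.

```python
def window_starts(seq_len, window=1000, stride=500):
    """Start offsets of windows covering [0, seq_len); the last is end-anchored."""
    if seq_len <= window:
        return [0]
    starts = list(range(0, seq_len - window + 1, stride))
    # Add a final window flush with the sequence end if the tail is uncovered.
    if starts[-1] + window < seq_len:
        starts.append(seq_len - window)
    return starts


def score_sequence(seq, predict_fn, window=1000, stride=500):
    """Average overlapping per-window scores back onto sequence coordinates."""
    n = len(seq)
    total = [0.0] * n  # sum of scores seen at each position
    count = [0] * n    # number of windows covering each position
    for start in window_starts(n, window, stride):
        chunk = seq[start:start + window]
        scores = predict_fn(chunk)  # hypothetical: one score per base in `chunk`
        # Ignore any scores beyond the chunk length (e.g. for padded positions).
        for offset, score in enumerate(scores[:len(chunk)]):
            total[start + offset] += score
            count[start + offset] += 1
    return [t / c for t, c in zip(total, count)]


# Toy stand-in for the model: scores 1.0 on "G" and 0.0 elsewhere.
def toy_predict(chunk):
    return [1.0 if base == "G" else 0.0 for base in chunk]


print(window_starts(2300))
print(score_sequence("AAGGAAGGAA", toy_predict, window=4, stride=2))
```

Averaging is only one aggregation choice; taking the maximum over overlapping windows would favor sensitivity instead.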

	## Limitations

	- The model was developed for prokaryotic genomic sequence contexts; performance may differ on eukaryotic, viral, synthetic, or heavily assembled/metagenomic fragments.
	- The model emits probability-like scores, not curated biological annotations. Downstream thresholding and manual review are recommended.
	- Very short sequences are not supported by the deployed inference app; the model expects 1000 bp windows.
	- Ambiguous bases are supported but may reduce confidence if frequent.
	- Evaluation metrics depend on the benchmark split and annotation quality.

	## Citation and Acknowledgements

	If you use this model, please cite or acknowledge:

	- Ziyu Mu, Master's Thesis, Helmholtz Centre for Infection Research (HZI BIFO), 2024.
	- DFG SPP 2141 "Much more than Defence: the Multiple Functions and Facets of CRISPR-Cas" (project MC 172).
	- BMBF de.NBI / GenomeNet.

	## Contact

	For questions about the model or deployment, contact the GenomeNet / HZI maintainers of this repository.