genomenet committed on
Commit b15619b · 1 Parent(s): 35b6185

Add model card

Files changed (1)
  1. README.md +162 -0
README.md ADDED
@@ -0,0 +1,162 @@
+ ---
+ license: mit
+ library_name: keras
+ pipeline_tag: token-classification
+ tags:
+ - biology
+ - genomics
+ - dna
+ - crispr
+ - crispr-cas
+ - tensorflow
+ - keras
+ - sequence-classification
+ - token-classification
+ model-index:
+ - name: CRISPR-BERT
+   results:
+   - task:
+       type: token-classification
+       name: Per-position CRISPR array detection
+     dataset:
+       name: Ground-truth CRISPR array test split
+       type: genomic-crispr-array-benchmark
+     metrics:
+     - type: auprc
+       name: Micro AUPRC
+       value: 0.9802
+     - type: auroc
+       name: Micro AUROC
+       value: 0.9910
+     - type: f1
+       name: Best micro F1 over threshold grid
+       value: 0.9543
+     - type: f1
+       name: Window-level F1 at threshold 0.5
+       value: 0.9255
+ ---
+
+ # CRISPR-BERT
+
+ CRISPR-BERT is a TensorFlow/Keras model for detecting CRISPR arrays in prokaryotic DNA sequences. It scores each nucleotide position in a 1000 bp input window with a probability in `[0, 1]`, where higher values indicate stronger evidence that the position belongs to a CRISPR array.
+
+ The checkpoint in this repository is used by the CRISPR Array Detection web app and companion inference code.
+
+ ## Model Details
+
+ | Property | Value |
+ |---|---|
+ | Model type | BERT-like transformer for DNA sequence labeling |
+ | Framework | TensorFlow / Keras |
+ | Checkpoint | `best.h5` |
+ | Artifact size | 5.15 GB, stored with Git LFS |
+ | Input | 1000 bp DNA window |
+ | Output | Per-position CRISPR probability, shape `(batch, 1000, 1)` |
+ | Tokenization | `PAD/OOV=0`, `A=1`, `C=2`, `G=3`, `T=4`, ambiguous IUPAC bases as `5` |
+ | Architecture (per local training reports) | 24 transformer blocks, hidden size 600, about 300.7M parameters |
+ | Fine-tuning objective | Per-position binary CRISPR array detection |
+
+ The model uses custom Keras layers and should be loaded with the custom layer definitions from the companion inference code.
+
+ ## Intended Use
+
+ Use this model to:
+
+ - Predict CRISPR-array probability scores along bacterial or archaeal DNA sequences.
+ - Detect candidate CRISPR array regions using sliding windows and thresholding.
+ - Extract hidden-state embeddings for state-dynamic visualizations of repeat/spacer structure.
+
+ This model is intended for research use. It is not intended for clinical, diagnostic, or safety-critical use.
+
+ ## Training Data
+
+ The model was initialized from a BERT-style model pretrained on metagenomic contigs and complete microbial genomes, then fine-tuned on annotated CRISPR array sequences. Positive examples come from annotated CRISPR arrays; negative examples are sampled from non-CRISPR genomic regions.
+
+ ## Evaluation
+
+ The local benchmark report for the fine-tuned checkpoint family used a ground-truth test split with:
+
+ | Quantity | Value |
+ |---|---:|
+ | Windows | 2,840 |
+ | Total bases | 2,840,000 |
+ | Positive bases | 375,163 |
+ | Positive-base prevalence | 13.21% |
+
+ Metrics from the available benchmark report:
+
+ | Metric | Value |
+ |---|---:|
+ | Micro AUPRC | 0.9802 |
+ | Micro AUROC | 0.9910 |
+ | Precision at threshold 0.5 | 0.9809 |
+ | Recall at threshold 0.5 | 0.8828 |
+ | F1 at threshold 0.5 | 0.9293 |
+ | Best F1 over threshold grid | 0.9543 |
+ | Best threshold over grid | 0.31 |
+ | Window-level detection F1 at threshold 0.5 | 0.9255 |
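
As a consistency check on the table above, the F1 at threshold 0.5 can be recomputed from the reported precision $P$ and recall $R$ as their harmonic mean. This is plain arithmetic on the published numbers, not a new measurement:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.9809 \times 0.8828}{0.9809 + 0.8828}
    = \frac{1.7319}{1.8637}
    \approx 0.9293
```

The result matches the reported F1 at threshold 0.5.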
+
+ Threshold choice trades sensitivity against specificity: on this benchmark, moving the threshold from 0.5 to 0.31 raises F1 from 0.9293 to 0.9543. The companion app defaults to a threshold of about 0.3 to favor sensitive region discovery.
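
To illustrate how a threshold turns per-position scores into candidate regions, here is a minimal sketch using NumPy. The function name `call_regions` and the `min_length` filter are hypothetical and are not taken from the companion app, whose actual post-processing may differ; only the default of 0.3 comes from the text above.

```python
import numpy as np


def call_regions(probs, threshold=0.3, min_length=1):
    """Return half-open (start, end) intervals where probs >= threshold.

    `min_length` is a hypothetical filter that drops very short runs.
    """
    probs = np.asarray(probs, dtype=float).ravel()
    mask = probs >= threshold
    # Pad with False on both sides so every run has a rising and a falling edge.
    padded = np.concatenate(([False], mask, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = edges[0::2], edges[1::2]
    return [(int(s), int(e)) for s, e in zip(starts, ends) if e - s >= min_length]
```

For example, `call_regions(scores, threshold=0.3, min_length=50)` would keep only runs of at least 50 consecutive positions at or above the threshold; the value 50 is arbitrary and would need tuning.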
+
+ ## Usage
+
+ The easiest way to run the model is through the companion Hugging Face Space:
+
+ ```text
+ https://huggingface.co/spaces/genomenet/crispr-array-detection
+ ```
+
+ For local inference, use the custom inference code from the CRISPR Array Detection app. A minimal loading pattern is:
+
+ ```python
+ import tensorflow as tf
+ from huggingface_hub import hf_hub_download
+
+ # Custom Keras layers ship with the companion inference code, not with TensorFlow.
+ from inference.custom_layers import get_custom_objects
+
+ # Download the checkpoint (or reuse the locally cached copy).
+ model_path = hf_hub_download(
+     repo_id="genomenet/crispr-bert-model",
+     filename="best.h5",
+ )
+
+ # compile=False: the model is loaded for inference only.
+ model = tf.keras.models.load_model(
+     model_path,
+     custom_objects=get_custom_objects(),
+     compile=False,
+ )
+ ```
+
+ Input sequences should be converted to integer tokens using the same tokenizer used during training:
+
+ ```python
+ TOKEN = {
+     "A": 1,
+     "C": 2,
+     "G": 3,
+     "T": 4,
+ }
+ # Unknown and ambiguous IUPAC bases are encoded as 5.
+ # Padding/OOV is encoded as 0.
+ ```
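
The mapping above can be wrapped in a small encoder. The sketch below is an illustration, not the tokenizer from the companion code: the name `encode_window`, the uppercase folding, the right-padding of short inputs with 0, the `int32` dtype, and the choice to send any non-IUPAC character to 0 are all assumptions.

```python
import numpy as np

WINDOW = 1000  # window length stated in the model card
TOKEN = {"A": 1, "C": 2, "G": 3, "T": 4}
AMBIGUOUS = set("NRYSWKMBDHV")  # IUPAC ambiguity codes, all mapped to 5


def encode_window(seq, window=WINDOW):
    """Encode one DNA string as a fixed-length vector of integer tokens."""
    seq = seq.upper()  # assumption: case is not significant
    if len(seq) > window:
        raise ValueError(f"sequence longer than {window} bp; split it into windows first")
    ids = np.zeros(window, dtype=np.int32)  # 0 = padding / OOV
    for i, base in enumerate(seq):
        if base in TOKEN:
            ids[i] = TOKEN[base]
        elif base in AMBIGUOUS:
            ids[i] = 5
        # any other character keeps the value 0 (assumed OOV handling)
    return ids
```

A batch for the model would then be a stack of such vectors, for example `np.stack([encode_window(s) for s in windows])`, giving shape `(batch, 1000)`.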
+
+ For sequences longer than 1000 bp, use sliding windows and aggregate overlapping per-window predictions back to sequence coordinates.
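
One way to do this is sketched below, under stated assumptions: overlapping predictions are combined by averaging, the stride is 500, and a final window is placed flush with the sequence end. The function name `sliding_window_scores` and the `predict_fn` callable are hypothetical; the companion app may use a different stride or aggregation rule.

```python
import numpy as np


def sliding_window_scores(tokens, predict_fn, window=1000, stride=500):
    """Average overlapping per-window predictions back onto sequence coordinates.

    `predict_fn` takes an integer array of shape (1, window) and returns
    scores of shape (1, window, 1), matching the model's stated output shape.
    """
    tokens = np.asarray(tokens)
    n = len(tokens)
    if n < window:
        raise ValueError("sequence is shorter than one window")
    if not 0 < stride <= window:
        raise ValueError("stride must be in (0, window] so no position is skipped")
    starts = list(range(0, n - window + 1, stride))
    if starts[-1] != n - window:
        starts.append(n - window)  # last window ends exactly at the sequence end
    total = np.zeros(n, dtype=float)
    count = np.zeros(n, dtype=float)
    for s in starts:
        batch = tokens[s:s + window][None, :]
        probs = np.asarray(predict_fn(batch), dtype=float).reshape(window)
        total[s:s + window] += probs
        count[s:s + window] += 1
    return total / count  # every position is covered at least once
```

With a loaded Keras model, `predict_fn` could be `lambda batch: model.predict(batch, verbose=0)`. Averaging is a simple default; taking the maximum over overlapping windows would be a more sensitive alternative.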
+
+ ## Limitations
+
+ - The model was developed for prokaryotic genomic sequence contexts; performance may differ on eukaryotic, viral, synthetic, or heavily assembled/metagenomic fragments.
+ - The model emits probability-like scores, not curated biological annotations. Downstream thresholding and manual review are recommended.
+ - Very short sequences are not supported by the deployed inference app; the model expects 1000 bp windows.
+ - Ambiguous bases are supported but may reduce confidence if frequent.
+ - Evaluation metrics depend on the benchmark split and annotation quality.
+
+ ## Citation and Acknowledgements
+
+ If you use this model, please cite or acknowledge:
+
+ - Ziyu Mu, Master's Thesis, Helmholtz Centre for Infection Research (HZI BIFO), 2024.
+ - DFG SPP 2141 "Much more than Defence: the Multiple Functions and Facets of CRISPR-Cas" (project MC 172).
+ - BMBF de.NBI / GenomeNet.
+
+ ## Contact
+
+ For questions about the model or deployment, contact the GenomeNet / HZI maintainers of this repository.