Shoriful025 committed
Commit a3d784e · verified · 1 Parent(s): bac7f30

Create README.md

Files changed (1)
  1. README.md +34 -0
README.md ADDED
@@ -0,0 +1,34 @@
+ ---
+ language: en
+ license: apache-2.0
+ tags:
+ - biology
+ - genomics
+ - dnabert
+ - sequence-analysis
+ ---
+
+ # Genomic DNA Sequence Transformer
+
+ ## Overview
+ This model is a BERT-based encoder pre-trained on the human reference genome (GRCh38). It uses k-mer tokenization to learn contextual representations of DNA sequence, and it is intended to be fine-tuned for downstream tasks such as promoter identification, splice site prediction, and variant effect scoring.
+
+ ## Model Architecture
+ Based on the **DNABERT** framework:
+ - **Tokenization**: Sequences are converted into 6-mer tokens (e.g., `ATGCGT`).
+ - **Pre-training**: Masked Language Modeling (MLM) was performed on over 3 billion base pairs.
+ - **Encoding**: Bidirectional self-attention lets each k-mer token attend to every other token in the input window, capturing complex regulatory motifs.
+ - **Objective**: Pre-training minimizes the negative log-likelihood of the masked tokens:
+ $$\mathcal{L}_{MLM} = -\mathbb{E}_{x \sim \mathcal{D}} \left[ \sum_{i \in \text{masked}} \log p(x_i \mid x_{\setminus i}) \right]$$
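The tokenization step can be illustrated with a minimal sketch. It assumes overlapping 6-mers taken with a stride of one base, which matches the "overlaps" mentioned under Limitations; the function name `kmer_tokenize` is hypothetical and is not part of this model's released code.

```python
def kmer_tokenize(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mers (stride of one base).

    Assumption: stride-1 overlapping k-mers; the model card does not
    spell out the stride, so treat this as an illustration only.
    """
    sequence = sequence.upper()
    if len(sequence) < k:
        return []
    # A sequence of n bases yields n - k + 1 overlapping k-mers.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


tokens = kmer_tokenize("ATGCGTAC")
print(tokens)  # ['ATGCGT', 'TGCGTA', 'GCGTAC']
```

Under this assumption a sequence of n bases produces n - 5 tokens, so 517 bases map to 512 tokens.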
+
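As a worked example of the MLM objective for a single sequence, the sketch below sums the negative log-probabilities that a model assigns to the true tokens at the masked positions. The probabilities are made-up illustrative values, and `mlm_loss` is a hypothetical helper, not a function from the training code.

```python
import math


def mlm_loss(true_token_probs: list[float]) -> float:
    """Negative log-likelihood summed over the masked positions.

    Each entry is p(x_i | x_without_i): the probability the model gave
    to the correct token at one masked position (illustrative values).
    """
    return -sum(math.log(p) for p in true_token_probs)


# Two masked positions, predicted with probability 0.5 and 0.25:
# loss = -(ln 0.5 + ln 0.25) = ln 8
loss = mlm_loss([0.5, 0.25])
```

The expectation in the formula corresponds to averaging this per-sequence quantity over sequences drawn from the training data.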
+ ## Intended Use
+ - **Motif Discovery**: Locating transcription factor binding sites.
+ - **Functional Annotation**: Predicting the biological function of non-coding regions.
+ - **Comparative Genomics**: Evaluating evolutionary conservation at a sequence level.
+
+ ## Limitations
+ - **Sequence Length**: Inputs are restricted to 512 tokens (~517 base pairs, because consecutive 6-mers overlap by 5 bases), so whole chromosomes cannot be analyzed without sliding windows.
+ - **Species Specificity**: Performance may vary on non-human genomes (e.g., extremophile bacteria or complex plant genomes) without further fine-tuning.
+ - **Structural Variants**: The model primarily captures local, nucleotide-level patterns rather than large-scale structural rearrangements.
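The sliding-window workaround for the sequence length limit might look like the sketch below. The window of 517 bp follows the figure quoted above; the stride of 256 bp and the name `sliding_windows` are assumptions chosen for illustration. If special tokens such as `[CLS]` and `[SEP]` count toward the 512-token limit, the window would need to be a few bases shorter.

```python
from typing import Iterator


def sliding_windows(
    sequence: str, window_bp: int = 517, stride_bp: int = 256
) -> Iterator[tuple[int, str]]:
    """Yield (start, subsequence) windows that together cover the sequence.

    Assumptions: window_bp=517 matches the ~517 bp limit stated above;
    stride_bp=256 is an arbitrary choice giving roughly 50% overlap.
    """
    if window_bp <= 0 or stride_bp <= 0:
        raise ValueError("window_bp and stride_bp must be positive")
    n = len(sequence)
    if n == 0:
        return
    start = 0
    while True:
        yield start, sequence[start:start + window_bp]
        # Stop once the current window reaches the end of the sequence.
        if start + window_bp >= n:
            break
        start += stride_bp


windows = list(sliding_windows("A" * 1000))
print([start for start, _ in windows])  # [0, 256, 512]
```

Overlapping windows mean most bases are scored more than once, so per-window predictions have to be merged afterwards, for example by averaging.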