---
language: en
license: apache-2.0
tags:
- biology
- genomics
- dnabert
- sequence-analysis
---

# Genomic DNA Sequence Transformer

## Overview

This model is a BERT-based encoder pre-trained on the human reference genome (GRCh38). It uses k-mer tokenization to learn contextual representations of DNA sequence, which can then be fine-tuned for downstream tasks such as promoter identification, splice site prediction, and variant effect scoring.

## Model Architecture

The model is based on the **DNABERT** framework:

- **Tokenization**: Sequences are converted into 6-mer tokens (e.g., `ATGCGT`).
- **Pre-training**: Masked Language Modeling (MLM) was performed on over 3 billion base pairs.
- **Encoding**: The bidirectional attention mechanism allows each token position to attend to the entire input sequence, which helps the model capture context-dependent regulatory motifs.
- **Objective**: Pre-training minimizes the negative log-likelihood of the masked tokens given the unmasked context:

$$\mathcal{L}_{MLM} = -\mathbb{E}_{x \sim \mathcal{D}} \left[ \sum_{i \in \text{masked}} \log p(x_i \mid x_{\setminus \text{masked}}) \right]$$
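
The 6-mer tokenization step can be sketched in a few lines of Python. This is a minimal illustration, not the model's actual tokenizer: the function name `kmer_tokenize` is hypothetical, and the stride of 1 (fully overlapping k-mers) is an assumption.

```python
def kmer_tokenize(sequence: str, k: int = 6, stride: int = 1) -> list[str]:
    """Split a DNA sequence into k-mers taken every `stride` bases.

    Assumption: stride=1 (fully overlapping k-mers); the real tokenizer
    may differ, e.g. in how it handles special tokens or ambiguous bases.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()
    # A sequence shorter than k yields no tokens.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


tokens = kmer_tokenize("ATGCGTAC")
print(tokens)       # ['ATGCGT', 'TGCGTA', 'GCGTAC']
print(len(tokens))  # 3, i.e. len(sequence) - k + 1
```

With a stride of 1, a sequence of `n` bases produces `n - k + 1` tokens, so neighbouring tokens share `k - 1` bases.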
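
To make the MLM objective concrete, the sketch below evaluates the inner sum of the loss for a single toy sequence. The probabilities are made up for illustration, and `mlm_loss` is a hypothetical helper rather than part of any training code for this model.

```python
import math


def mlm_loss(token_probs: list[dict[str, float]],
             targets: list[str],
             masked: list[int]) -> float:
    """Sum of -log p(target token) over the masked positions of one sequence."""
    return -sum(math.log(token_probs[i][targets[i]]) for i in masked)


# Toy example with three token positions; positions 0 and 2 are masked.
targets = ["ATGCGT", "TGCGTA", "GCGTAC"]
predicted = [
    {"ATGCGT": 0.5, "ATGCGA": 0.5},     # illustrative distribution at position 0
    {"TGCGTA": 1.0},                    # unmasked position, not scored
    {"GCGTAC": 0.25, "GCGTAA": 0.75},   # illustrative distribution at position 2
]

loss = mlm_loss(predicted, targets, masked=[0, 2])
print(round(loss, 4))  # 2.0794 = -(ln 0.5 + ln 0.25)
```

In training, this quantity would be averaged over sequences drawn from the corpus, which corresponds to the expectation over $\mathcal{D}$ in the formula.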
## Intended Use

- **Motif Discovery**: Locating transcription factor binding sites.
- **Functional Annotation**: Predicting the biological function of non-coding regions.
- **Comparative Genomics**: Evaluating evolutionary conservation at a sequence level.

## Limitations

- **Sequence Length**: Inputs are limited to 512 tokens. With overlapping 6-mers, this corresponds to roughly 517 base pairs (512 + 6 - 1), so longer regions such as whole chromosomes must be processed with sliding windows.
- **Species Specificity**: The model was pre-trained only on the human reference genome, so performance may degrade on non-human genomes (e.g., extremophile bacteria or complex plant genomes) without further fine-tuning.
- **Structural Variants**: The model primarily captures short-range, nucleotide-level patterns within its context window and is not designed to detect large-scale structural rearrangements.
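
One way to work around the sequence length limit is to split a long region into overlapping windows and run the model on each window separately. The sketch below only computes window offsets. The helper names and the 256 bp step are illustrative assumptions, and the 517 bp figure ignores any special tokens (such as `[CLS]` and `[SEP]`) that may count toward the 512-token limit.

```python
def max_bases(max_tokens: int = 512, k: int = 6) -> int:
    """Bases spanned by `max_tokens` overlapping k-mers at stride 1."""
    return max_tokens + k - 1


def window_starts(seq_len: int, window_bp: int = 517, step_bp: int = 256) -> list[int]:
    """Start offsets of windows that together cover a sequence of `seq_len` bases.

    Assumption: step_bp=256 is an arbitrary choice giving roughly 50% overlap.
    """
    if seq_len <= window_bp:
        return [0]
    starts = list(range(0, seq_len - window_bp + 1, step_bp))
    # Add a final window flush with the end so the tail is not dropped.
    if starts[-1] + window_bp < seq_len:
        starts.append(seq_len - window_bp)
    return starts


print(max_bases())          # 517
print(window_starts(1200))  # [0, 256, 512, 683]
```

Each offset `s` selects the slice `sequence[s:s + window_bp]`. How predictions from overlapping windows are combined (for example by averaging) depends on the downstream task.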