Shoriful025 / genomic_dna_sequence_transformer_base


	---
	language: en
	license: apache-2.0
	tags:
	- biology
	- genomics
	- dnabert
	- sequence-analysis
	---

	# Genomic DNA Sequence Transformer

	## Overview
	This model is a BERT-based encoder pre-trained on the human reference genome (GRCh38). It tokenizes DNA into k-mers and learns contextual sequence representations that can be fine-tuned for downstream tasks such as promoter identification, splice site prediction, and variant effect scoring.
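As a rough illustration of k-mer tokenization, the sketch below splits a sequence into overlapping 6-mers. The function name `kmer_tokenize` and the stride of 1 are assumptions made for this example (DNABERT-style models typically use overlapping k-mers); the tokenizer shipped with this model may differ in details such as special tokens or handling of ambiguous bases.

```python
def kmer_tokenize(sequence: str, k: int = 6, stride: int = 1) -> list[str]:
    """Split a DNA sequence into k-mers taken every `stride` bases.

    Hypothetical helper for illustration; not the model's actual tokenizer.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()
    # A sequence shorter than k yields no tokens.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


tokens = kmer_tokenize("ATGCGTAC")
print(tokens)  # ['ATGCGT', 'TGCGTA', 'GCGTAC']
```

With a stride of 1, a sequence of n bases produces n - k + 1 tokens, so neighbouring tokens share k - 1 bases.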



	## Model Architecture
	Based on the DNABERT framework:
	- Tokenization: Sequences are converted into overlapping 6-mer tokens (e.g., `ATGCGT`).
	- Pre-training: Masked Language Modeling (MLM) was performed on over 3 billion base pairs.
	- Encoding: The bidirectional attention mechanism allows each k-mer token to attend to every other token in the input window, capturing complex regulatory motifs.
	- Objective: Pre-training minimizes the negative log-likelihood of the masked tokens:
	$$\mathcal{L}_{MLM} = -\mathbb{E}_{x \sim \mathcal{D}} \left[ \sum_{i \in \text{masked}} \log p(x_i \mid x_{\setminus i}) \right]$$
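To make the objective concrete, here is a minimal sketch that evaluates the inner sum for a single sequence: the negative log-probability the model assigns to the true token, summed over masked positions. The function name `masked_nll` and the probabilities are invented for illustration and are not outputs of this model. The expectation over the dataset corresponds to averaging this quantity across training sequences.

```python
import math


def masked_nll(predicted, true_tokens, masked_positions):
    """Sum of -log p(true token) over the masked positions of one sequence.

    predicted: one dict per position, mapping candidate token -> probability.
    Hypothetical helper for illustration only.
    """
    total = 0.0
    for i in masked_positions:
        total -= math.log(predicted[i][true_tokens[i]])
    return total


true_tokens = ["ATGCGT", "TGCGTA", "GCGTAC"]
# Toy predictive distributions (made-up numbers).
predicted = [
    {},  # position 0 is not masked, so its prediction is unused
    {"TGCGTA": 0.5, "TGCGTC": 0.5},
    {"GCGTAC": 0.25, "GCGTAA": 0.75},
]

loss = masked_nll(predicted, true_tokens, masked_positions=[1, 2])
print(round(loss, 4))  # 2.0794, i.e. -ln(0.5) - ln(0.25) = ln(8)
```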

	## Intended Use
	- Motif Discovery: Locating transcription factor binding sites.
	- Functional Annotation: Predicting the biological function of non-coding regions.
	- Comparative Genomics: Evaluating evolutionary conservation at a sequence level.

	## Limitations
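	- Sequence Length: Restricted to 512 tokens. Assuming overlapping 6-mers at stride 1, this spans about 517 base pairs (512 + 6 - 1), or slightly fewer if special tokens such as `[CLS]` and `[SEP]` count toward the limit. Longer regions, up to whole chromosomes, must be processed with sliding windows.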
	- Species Specificity: Performance may vary on non-human genomes (e.g., extremophile bacteria or complex plant genomes) without further fine-tuning.
	- Structural Variants: Primarily focused on single-nucleotide patterns rather than large-scale structural rearrangements.
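The sliding-window workaround mentioned under Sequence Length could be sketched as follows. The helper `sliding_windows` and the step of 256 bp are assumptions chosen for this example; the 517 bp window comes from the token limit stated above. How per-window predictions should be merged (averaging, max, and so on) depends on the downstream task and is not covered here.

```python
from typing import Iterator


def sliding_windows(
    sequence: str, window_bp: int = 517, step_bp: int = 256
) -> Iterator[tuple[int, str]]:
    """Yield (start, subsequence) pairs covering `sequence` with overlapping windows.

    Hypothetical helper for illustration. The last window is aligned to the
    end of the sequence so that no trailing bases are dropped.
    """
    if window_bp <= 0 or not 0 < step_bp <= window_bp:
        raise ValueError("require window_bp > 0 and 0 < step_bp <= window_bp")
    n = len(sequence)
    if n <= window_bp:
        # Short input: a single window holds the whole sequence.
        yield 0, sequence
        return
    start = 0
    while start + window_bp < n:
        yield start, sequence[start:start + window_bp]
        start += step_bp
    # Final window, shifted back so it ends exactly at the last base.
    yield n - window_bp, sequence[n - window_bp:]


windows = list(sliding_windows("ACGT" * 250))  # 1,000 bp toy sequence
print([start for start, _ in windows])  # [0, 256, 483]
```

Each window can then be tokenized and passed to the model independently. A smaller step gives more overlap between windows at the cost of more forward passes.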