GENATATOR-ModernGENA (Human Transcript Boundary Model)
Overview
GENATATOR-ModernGENA is a DNA language model fine-tuned for transcript boundary detection directly from genomic DNA sequences.
The model performs token-level multilabel classification to identify strand-aware transcript boundaries, enabling reconstruction of transcript intervals for both:
- mRNA genes
- lncRNA genes
This model focuses on transcript position discovery, namely:
- transcription start sites (TSS)
- transcript termination sites (PolyA)
Model
Model name on Hugging Face:
genatator-moderngena-base-human-edge-model
Architecture properties:
- backbone: ModernBERT (ModernGENA)
- layers: 22
- hidden size: 768
- parameters: ~135M
- tokenization: BPE
- output head: linear projection to 4 classes
- output resolution: token-level
The model predicts four classes.
The correct order of output classes is:
["TSS+", "TSS-", "PolyA+", "PolyA-"]
Where:
- TSS+: transcription start site on the forward strand
- TSS-: transcription start site on the reverse strand
- PolyA+: transcript termination site on the forward strand
- PolyA-: transcript termination site on the reverse strand
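As a minimal sketch of how the class order can be applied, the snippet below maps the four logits of a single token to named probabilities. It assumes that, because the task is described as multilabel, each class is scored with an independent sigmoid rather than a softmax; the helper names `sigmoid` and `label_token_logits` are illustrative and not part of the model repository.

```python
import math

# Class order as stated in this model card.
CLASS_NAMES = ["TSS+", "TSS-", "PolyA+", "PolyA-"]

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def label_token_logits(token_logits):
    """Map one token's 4 logits to {class name: probability}.

    Assumption: the head is multilabel, so each class gets an
    independent sigmoid instead of a softmax across the 4 classes.
    """
    if len(token_logits) != len(CLASS_NAMES):
        raise ValueError("expected one logit per class")
    return {name: sigmoid(z) for name, z in zip(CLASS_NAMES, token_logits)}

# Made-up logits for a single token, for illustration only.
probs = label_token_logits([2.0, -3.0, 0.0, -1.0])
```

With independent sigmoids, the four probabilities need not sum to 1, so a token can in principle score high for more than one boundary class.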
Training Data
This model was fine-tuned on full genomic sequences, including intergenic regions.
Training data includes annotations for:
- mRNA transcripts
- lncRNA transcripts
Dataset characteristics:
- human genome only
- strand-aware annotations
- genome-wide supervision
- human chromosomes 8, 20, and 21 held out
- for non-held-out data, chromosomes longer than 100 kbp were included
The model is therefore trained to distinguish both transcript boundaries and non-transcribed background sequence.
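When evaluating the model, it may help to keep the same split. The sketch below separates sequence names into training and held-out groups using the two rules listed above. The `chr` prefix (UCSC-style naming) and the function name `split_sequences` are assumptions for illustration, and the lengths in the example are not real assembly values.

```python
# Held-out chromosomes as listed in this model card. The "chr" prefix is an
# assumption about how sequences are named.
HELD_OUT = {"chr8", "chr20", "chr21"}
MIN_TRAIN_LENGTH = 100_000  # 100 kbp, from the dataset description

def split_sequences(lengths):
    """Split {name: length in bp} into (train, held_out) name lists."""
    train, held_out = [], []
    for name, length in lengths.items():
        if name in HELD_OUT:
            held_out.append(name)
        elif length > MIN_TRAIN_LENGTH:
            train.append(name)
        # Shorter non-held-out sequences are skipped.
    return train, held_out

# Illustrative lengths only.
train, held_out = split_sequences(
    {"chr1": 2_000_000, "chr8": 1_500_000, "chrUn_short": 50_000}
)
```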
Method Overview
This model is the edge model in the ModernGENA transcript discovery pipeline.
It predicts boundary probability tracks for:
- transcript start sites
- transcript termination sites
- both strands independently
The full pipeline consists of:
- Edge model predicts strand-specific transcript boundary tracks
- Region model predicts strand-specific intragenic coverage
- Post-processing
  - signal denoising
  - peak calling
  - interval construction
  - filtering using region predictions
This design allows recovery of full transcript intervals and multiple transcript boundary isoforms for the same gene.
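The interval construction step is not specified in detail here, so the following is only a sketch of one plausible approach: pair every TSS peak with every compatible PolyA peak on the same strand. The `Interval` type, the `pair_boundaries` function, and the `max_len` cap are hypothetical. The sketch does show why several boundary isoforms can come out of one locus: two TSS peaks sharing one PolyA peak yield two intervals.

```python
from typing import NamedTuple

class Interval(NamedTuple):
    start: int   # leftmost coordinate
    end: int     # rightmost coordinate (inclusive)
    strand: str  # "+" or "-"

def pair_boundaries(tss, polya, strand, max_len=2_000_000):
    """Pair each TSS peak with each compatible PolyA peak on one strand.

    On "+" the PolyA site lies at a larger coordinate than the TSS;
    on "-" it lies at a smaller one. max_len is a hypothetical cap
    on transcript length.
    """
    out = []
    for s in sorted(tss):
        for e in sorted(polya):
            length = (e - s) if strand == "+" else (s - e)
            if 0 < length <= max_len:
                out.append(Interval(min(s, e), max(s, e), strand))
    return out

# Two TSS peaks sharing one PolyA peak give two candidate isoforms.
forward = pair_boundaries([100, 400], [900], "+")
reverse = pair_boundaries([900], [100], "-")
```

All-pairs matching over-generates candidates by design, which is why a later filtering stage is needed.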
Key Properties
- strand-aware predictions
- supports both mRNA and lncRNA
- trained on human genome
- ab initio inference from DNA only
- designed for genome-wide transcript discovery
Important Notes
- This model predicts transcript boundaries only
- It does not predict exon–intron structure
- It does not produce final transcript annotations by itself
- Full inference requires post-processing and, typically, the paired region model
The model outputs boundary logits, which are intended to be converted into probability tracks and further processed into transcript intervals.
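To illustrate how region predictions might be used for filtering, the sketch below keeps only candidate intervals whose mean region probability is high enough. It assumes the region model yields a per-nucleotide probability track for one strand; that format, the `min_mean` threshold, and both function names are assumptions, not the documented interface of the region model.

```python
def mean_coverage(region_track, start, end):
    """Mean region probability over [start, end], end inclusive."""
    window = region_track[start:end + 1]
    return sum(window) / len(window) if window else 0.0

def filter_by_region(candidates, region_track, min_mean=0.5):
    """Keep (start, end) candidates whose mean coverage reaches min_mean."""
    return [
        (s, e) for s, e in candidates
        if mean_coverage(region_track, s, e) >= min_mean
    ]

# Toy per-nucleotide track for one strand: high at positions 2..5.
region_track = [0.0, 0.1, 0.9, 0.9, 0.8, 0.9, 0.1, 0.0]
kept = filter_by_region([(2, 5), (0, 7), (6, 7)], region_track)
```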
Usage
from transformers import AutoTokenizer, AutoModelForTokenClassification

repo_id = "shmelev/genatator-moderngena-base-human-edge-model"

# trust_remote_code=True lets transformers run the custom code shipped in the repo
tokenizer = AutoTokenizer.from_pretrained(
    repo_id,
    trust_remote_code=True,
)
model = AutoModelForTokenClassification.from_pretrained(
    repo_id,
    trust_remote_code=True,
)
model.eval()  # inference mode (disables dropout)
Example Inference
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

repo_id = "shmelev/genatator-moderngena-base-human-edge-model"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForTokenClassification.from_pretrained(repo_id, trust_remote_code=True)
model.eval()  # inference mode (disables dropout)

sequence = "ACGTACGTACGTACGTACGTACGTACGTACGT"
enc = tokenizer(sequence, return_tensors="pt")

# Forward pass without gradient tracking
with torch.no_grad():
    outputs = model(**enc)

logits = outputs.logits  # one row of 4 logits per token
print("Input shape:", enc["input_ids"].shape)
print("Logits shape:", logits.shape)
Example output:
Input shape: torch.Size([1, sequence_length])
Logits shape: torch.Size([1, sequence_length, 4])
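Each token receives 4 logits corresponding to the four boundary classes. Because the tokenizer is BPE, sequence_length here is the number of tokens, not the number of nucleotides.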
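Since predictions are per token, getting nucleotide resolution requires knowing which bases each token covers. The sketch below spreads each token's values over its character span. It assumes the spans are available as `(start, end)` offsets, for example via `return_offsets_mapping=True`, which fast tokenizers in `transformers` support; whether this particular tokenizer does is an assumption. The function name `expand_to_nucleotides` is hypothetical.

```python
def expand_to_nucleotides(token_probs, offsets, seq_len):
    """Copy per-token probabilities onto the characters each token covers.

    token_probs: one list of class probabilities per token
    offsets: one (start, end) character span per token, end exclusive;
             special tokens are expected to have start == end
    Positions not covered by any token stay at 0.0.
    """
    n_classes = len(token_probs[0]) if token_probs else 0
    track = [[0.0] * n_classes for _ in range(seq_len)]
    for probs, (start, end) in zip(token_probs, offsets):
        for pos in range(start, min(end, seq_len)):
            track[pos] = list(probs)
    return track

# Toy example: a special token, two tokens covering 3 and 2 bases,
# and a closing special token (two classes only, for brevity).
offsets = [(0, 0), (0, 3), (3, 5), (0, 0)]
token_probs = [[0.0, 0.0], [0.9, 0.1], [0.2, 0.7], [0.0, 0.0]]
track = expand_to_nucleotides(token_probs, offsets, seq_len=5)
```

Copying a token's value to every base it covers is the simplest choice; assigning it to a single position inside the token would be an equally defensible alternative.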
Recommended Inference Workflow
To obtain transcript intervals from this model:
- run the model on genomic sequence
- convert logits to probabilities
- map token predictions to nucleotide resolution
- smooth the tracks
- call peaks
- connect TSS and PolyA peaks on the same strand
- filter candidates using region-model predictions
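The smoothing and peak-calling steps can be sketched in a few lines. This is not the pipeline's actual post-processing, which is not described here: the moving-average smoother, the `threshold`, `window`, and `min_distance` parameters, and both function names are assumptions chosen for illustration.

```python
def moving_average(track, window=5):
    """Centered moving average; edges average the available neighbours."""
    half = window // 2
    out = []
    for i in range(len(track)):
        lo, hi = max(0, i - half), min(len(track), i + half + 1)
        out.append(sum(track[lo:hi]) / (hi - lo))
    return out

def call_peaks(track, threshold=0.5, min_distance=3):
    """Local maxima at or above threshold.

    When two maxima are closer than min_distance, the stronger one wins.
    """
    candidates = [
        i for i in range(len(track))
        if track[i] >= threshold
        and (i == 0 or track[i] > track[i - 1])
        and (i == len(track) - 1 or track[i] >= track[i + 1])
    ]
    kept = []
    for i in sorted(candidates, key=lambda i: -track[i]):
        if all(abs(i - j) >= min_distance for j in kept):
            kept.append(i)
    return sorted(kept)

# Toy probability track for one class at nucleotide resolution.
raw = [0.0, 0.1, 0.9, 0.2, 0.0, 0.0, 0.0, 0.7, 0.8, 0.1]
smooth = moving_average(raw, window=3)
peaks = call_peaks(smooth, threshold=0.3)
```

Smoothing lowers peak heights, so the threshold has to be chosen on the smoothed track rather than on the raw probabilities.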
Limitations
- predicts boundaries, not full transcript structure
- requires post-processing to obtain transcript intervals
- does not model alternative splicing structure
- output quality depends on downstream peak-calling settings
Summary
The GENATATOR-ModernGENA edge model is a ModernBERT-based DNA language model for strand-aware transcript boundary detection in the human genome.
It is intended as the first stage of a transcript discovery pipeline and supports:
- mRNA boundary detection
- lncRNA boundary detection
- genome-wide transcript discovery
- multi-isoform transcript discovery through downstream post-processing