GENATATOR-ModernGENA (Human Transcript Boundary Model)

Overview

GENATATOR-ModernGENA is a DNA language model fine-tuned for transcript boundary detection directly from genomic DNA sequences.

The model performs token-level multilabel classification to identify strand-aware transcript boundaries, enabling reconstruction of transcript intervals for both:

mRNA genes
lncRNA genes

This model focuses on transcript position discovery, namely:

transcription start sites (TSS)
transcript termination sites (PolyA)

Model

Model name on Hugging Face:

genatator-moderngena-base-human-edge-model

Architecture properties:

backbone: ModernBERT (ModernGENA)
layers: 22
hidden size: 768
parameters: ~135M
tokenization: BPE
output head: linear projection to 4 classes
output resolution: token-level

The model predicts four classes.

The correct order of output classes is:

["TSS+", "TSS-", "PolyA+", "PolyA-"]

Where:

TSS+ — transcription start site on the forward strand
TSS- — transcription start site on the reverse strand
PolyA+ — transcript termination site on the forward strand
PolyA- — transcript termination site on the reverse strand
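
As an illustration of the class order above, here is a minimal sketch that turns one token's four logits into named probabilities. It assumes an independent sigmoid per class, which is the usual choice for multilabel heads; the card does not state the activation explicitly, and the helper name `label_token_logits` is hypothetical.

```python
import math

# Class order as stated in the model card.
CLASS_NAMES = ["TSS+", "TSS-", "PolyA+", "PolyA-"]

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def label_token_logits(token_logits):
    """Map one token's 4 logits to {class name: probability}.

    Assumption: classes are scored independently (multilabel), so the
    four probabilities do not have to sum to 1.
    """
    if len(token_logits) != len(CLASS_NAMES):
        raise ValueError("expected 4 logits per token")
    return {name: sigmoid(v) for name, v in zip(CLASS_NAMES, token_logits)}

# Toy logits for a single token (not real model output).
probs = label_token_logits([2.0, -3.0, 0.0, -1.5])
print(probs)
```

On a full logits tensor from the model, the same conversion would be `torch.sigmoid(logits)`, with the last dimension indexed in the order shown in `CLASS_NAMES`.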

Training Data

This model was fine-tuned on full genomic sequences, including intergenic regions.

Training data includes annotations for:

mRNA transcripts
lncRNA transcripts

Dataset characteristics:

human genome only
strand-aware annotations
genome-wide supervision
human chromosomes 8, 20, and 21 held out
for non-held-out data, chromosomes longer than 100 kbp were included

The model is therefore trained to distinguish both transcript boundaries and non-transcribed background sequence.
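
The chromosome hold-out listed under dataset characteristics could be reproduced along these lines. This is a sketch only: the `chr`-prefixed names, the toy lengths, and the function name `split_chromosomes` are assumptions, and the length filter is applied only to non-held-out sequences, as the card describes.

```python
HELD_OUT = {"chr8", "chr20", "chr21"}  # held-out chromosomes from the card
MIN_LENGTH_BP = 100_000                # "longer than 100 kbp"

def split_chromosomes(lengths):
    """Split {sequence name: length in bp} into (train, held_out) name lists."""
    train, held_out = [], []
    for name, length in sorted(lengths.items()):
        if name in HELD_OUT:
            held_out.append(name)
        elif length > MIN_LENGTH_BP:
            train.append(name)
        # Short non-held-out sequences are dropped entirely.
    return train, held_out

# Toy lengths for illustration only (not real assembly sizes).
toy_lengths = {
    "chr1": 5_000_000,
    "chr2": 4_000_000,
    "chr8": 3_000_000,
    "chr21": 1_000_000,
    "scaffold_toy": 50_000,
}
train, held_out = split_chromosomes(toy_lengths)
print("train:", train)
print("held out:", held_out)
```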

Method Overview

This model is the edge model in the ModernGENA transcript discovery pipeline.

It predicts boundary probability tracks for:

transcript start sites
transcript termination sites
both strands independently

The full pipeline consists of:

Edge model predicts strand-specific transcript boundary tracks
Region model predicts strand-specific intragenic coverage
Post-processing:
- signal denoising
- peak calling
- interval construction
- filtering using region predictions

This design allows recovery of full transcript intervals and multiple transcript boundary isoforms for the same gene.
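
The last post-processing step, filtering using region predictions, might look like the sketch below. It assumes a nucleotide-resolution region probability track for one strand and keeps a candidate interval when the mean region probability inside it reaches a threshold. The criterion, the threshold, and the name `filter_by_region` are illustrative assumptions, not the pipeline's documented rule.

```python
def filter_by_region(intervals, region_track, min_mean=0.8):
    """Keep (start, end) intervals whose mean region probability >= min_mean.

    Intervals are inclusive, 0-based nucleotide coordinates on one strand.
    """
    kept = []
    for start, end in intervals:
        segment = region_track[start:end + 1]
        if segment and sum(segment) / len(segment) >= min_mean:
            kept.append((start, end))
    return kept

# Toy region track: high intragenic coverage from position 10 to 40.
region_track = [0.0] * 10 + [0.9] * 31 + [0.1] * 19
candidates = [(10, 40), (10, 50)]
kept = filter_by_region(candidates, region_track)
print(kept)  # -> [(10, 40)]
```

The second candidate is dropped because it extends past the region the toy track marks as transcribed, which lowers its mean coverage to about 0.70.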

Key Properties

strand-aware predictions
supports both mRNA and lncRNA
trained on human genome
ab initio inference from DNA only
designed for genome-wide transcript discovery

Important Notes

This model predicts transcript boundaries only
It does not predict exon–intron structure
It does not produce final transcript annotations by itself
Full inference requires post-processing and, typically, the paired region model

The model outputs boundary logits, which are intended to be converted into probability tracks and further processed into transcript intervals.

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification

repo_id = "shmelev/genatator-moderngena-base-human-edge-model"

tokenizer = AutoTokenizer.from_pretrained(
    repo_id,
    trust_remote_code=True,
)

model = AutoModelForTokenClassification.from_pretrained(
    repo_id,
    trust_remote_code=True,
)

model.eval()

Example Inference

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

repo_id = "shmelev/genatator-moderngena-base-human-edge-model"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForTokenClassification.from_pretrained(repo_id, trust_remote_code=True)

sequence = "ACGTACGTACGTACGTACGTACGTACGTACGT"

enc = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**enc)

# Shape: (batch, tokens, 4); class order is TSS+, TSS-, PolyA+, PolyA-.
logits = outputs.logits

print("Input shape:", enc["input_ids"].shape)
print("Logits shape:", logits.shape)

Example output:

Input shape: torch.Size([1, sequence_length])
Logits shape: torch.Size([1, sequence_length, 4])

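Each token receives 4 logits, one per boundary class, in the order listed above. Here sequence_length is the number of BPE tokens produced by the tokenizer, not the number of nucleotides in the input.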
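
Because predictions are per BPE token, a nucleotide-level track needs the character span of each token. The sketch below broadcasts each token's probability to every nucleotide that token covers. This broadcasting rule is an assumption (the pipeline may instead assign the value to a token's first or last base), and `expand_to_nucleotides` is a hypothetical helper. Special tokens are assumed to carry a (0, 0) offset, as Hugging Face fast tokenizers report.

```python
def expand_to_nucleotides(token_probs, offsets, seq_len):
    """Broadcast each token's probability to the nucleotides it spans."""
    if len(token_probs) != len(offsets):
        raise ValueError("one offset pair is needed per token")
    track = [0.0] * seq_len
    for prob, (start, end) in zip(token_probs, offsets):
        for pos in range(start, end):  # empty range for (0, 0) special tokens
            track[pos] = prob
    return track

# Toy example: 12 nt split into 3 BPE tokens plus two special tokens.
offsets = [(0, 0), (0, 5), (5, 8), (8, 12), (0, 0)]
tss_plus_probs = [0.0, 0.1, 0.9, 0.2, 0.0]
track = expand_to_nucleotides(tss_plus_probs, offsets, seq_len=12)
print(track)
```

With a fast tokenizer, the offsets can usually be requested via `tokenizer(sequence, return_offsets_mapping=True)`; whether this repository's tokenizer supports that option is not confirmed here.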

Recommended Inference Workflow

To obtain transcript intervals from this model:

run the model on genomic sequence
convert logits to probabilities
map token predictions to nucleotide resolution
smooth the tracks
call peaks
connect TSS and PolyA peaks on the same strand
filter candidates using region-model predictions
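
The smoothing, peak-calling, and pairing steps above can be sketched in plain Python as follows. Gaussian smoothing, the run-based peak caller, the threshold, and the `max_len` cap are all illustrative assumptions, not the settings used by the authors. Coordinates are 0-based positions on the forward reference, so on the reverse strand the TSS lies at a higher coordinate than the PolyA site.

```python
import math

def gaussian_smooth(track, sigma=2.0):
    """Smooth a probability track with a truncated, renormalized Gaussian."""
    radius = int(3 * sigma)
    offsets = range(-radius, radius + 1)
    weights = [math.exp(-0.5 * (d / sigma) ** 2) for d in offsets]
    out = []
    for i in range(len(track)):
        num = den = 0.0
        for w, d in zip(weights, offsets):
            j = i + d
            if 0 <= j < len(track):
                num += w * track[j]
                den += w
        out.append(num / den)
    return out

def call_peaks(track, threshold):
    """One peak per contiguous run >= threshold: the position of the run's maximum."""
    peaks, run_start = [], None
    # A trailing sentinel closes a run that reaches the end of the track.
    for i, v in enumerate(list(track) + [float("-inf")]):
        if v >= threshold and run_start is None:
            run_start = i
        elif v < threshold and run_start is not None:
            peaks.append(max(range(run_start, i), key=lambda j: track[j]))
            run_start = None
    return peaks

def pair_boundaries(tss_peaks, polya_peaks, strand, max_len=100):
    """All (start, end) intervals whose TSS lies upstream of the PolyA site."""
    intervals = []
    for tss in tss_peaks:
        for polya in polya_peaks:
            if strand == "+" and tss < polya <= tss + max_len:
                intervals.append((tss, polya))
            elif strand == "-" and polya < tss <= polya + max_len:
                intervals.append((polya, tss))
    return intervals

# Toy forward-strand tracks: one TSS and two alternative termination sites.
tss_raw = [0.0] * 60
tss_raw[10] = 0.9
polya_raw = [0.0] * 60
polya_raw[40], polya_raw[50] = 0.8, 0.7

tss_peaks = call_peaks(gaussian_smooth(tss_raw), threshold=0.1)
polya_peaks = call_peaks(gaussian_smooth(polya_raw), threshold=0.1)
intervals = pair_boundaries(tss_peaks, polya_peaks, strand="+")
print(intervals)  # -> [(10, 40), (10, 50)]
```

Pairing every compatible TSS and PolyA peak yields several candidate intervals per locus, which is how multiple boundary isoforms can arise before the region-model filter is applied.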

Limitations

predicts boundaries, not full transcript structure
requires post-processing to obtain transcript intervals
does not model alternative splicing structure
output quality depends on downstream peak-calling settings

Summary

The GENATATOR-ModernGENA edge model is a ModernBERT-based DNA language model for strand-aware transcript boundary detection in the human genome.

It is intended as the first stage of a transcript discovery pipeline and supports:

mRNA boundary detection
lncRNA boundary detection
genome-wide transcript discovery
multi-isoform transcript discovery through downstream post-processing
