GENATATOR-ModernGENA (Human Transcript Boundary Model)

Overview

GENATATOR-ModernGENA is a DNA language model fine-tuned for transcript boundary detection directly from genomic DNA sequences.

The model performs token-level multilabel classification to identify strand-aware transcript boundaries, enabling reconstruction of transcript intervals for both:

  • mRNA genes
  • lncRNA genes

This model focuses on locating transcript boundaries, namely:

  • transcription start sites (TSS)
  • transcript termination sites (PolyA)

Model

Model name on Hugging Face:

genatator-moderngena-base-human-edge-model

Architecture properties:

  • backbone: ModernBERT (ModernGENA)
  • layers: 22
  • hidden size: 768
  • parameters: ~135M
  • tokenization: BPE
  • output head: linear projection to 4 classes
  • output resolution: token-level

The model predicts four classes, output in the following order:

["TSS+", "TSS-", "PolyA+", "PolyA-"]

Where:

  • TSS+ — transcription start site on the forward strand
  • TSS- — transcription start site on the reverse strand
  • PolyA+ — transcript termination site on the forward strand
  • PolyA- — transcript termination site on the reverse strand

Training Data

This model was fine-tuned on full genomic sequences, including intergenic regions.

Training data includes annotations for:

  • mRNA transcripts
  • lncRNA transcripts

Dataset characteristics:

  • human genome only
  • strand-aware annotations
  • genome-wide supervision
  • human chromosomes 8, 20, and 21 held out
  • among the remaining chromosomes, only those longer than 100 kbp were included

The model is therefore trained to distinguish transcript boundaries from non-transcribed background sequence.
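The split described above can be sketched as a small helper. This is only an illustration under assumptions: the "chr" naming scheme, the example lengths, and the function name split_chromosomes are hypothetical, and the actual data preparation code is not part of this card.

```python
# Held-out chromosomes as listed in this card; the "chr" prefix is an assumption.
HELD_OUT = {"chr8", "chr20", "chr21"}
# Length cutoff from this card: only sequences longer than 100 kbp are kept.
MIN_LENGTH_BP = 100_000


def split_chromosomes(chrom_lengths):
    """Split sequence names into (training, held_out) lists.

    chrom_lengths maps a sequence name to its length in base pairs.
    """
    train, held_out = [], []
    for name, length in chrom_lengths.items():
        if name in HELD_OUT:
            held_out.append(name)
        elif length > MIN_LENGTH_BP:
            train.append(name)
        # Shorter non-held-out sequences are dropped.
    return train, held_out


# Illustrative lengths only, not taken from a real assembly.
lengths = {
    "chr1": 248_000_000,
    "chr8": 145_000_000,
    "chr21": 46_000_000,
    "short_scaffold": 50_000,
}
train, held_out = split_chromosomes(lengths)
```

Here train contains only chr1, held_out contains chr8 and chr21, and the short scaffold is discarded.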


Method Overview

This model is the edge model in the ModernGENA transcript discovery pipeline.

It predicts boundary probability tracks for:

  • transcript start sites
  • transcript termination sites
  • both strands independently

The full pipeline consists of:

  1. Edge model predicts strand-specific transcript boundary tracks

  2. Region model predicts strand-specific intragenic coverage

  3. Post-processing

    • signal denoising
    • peak calling
    • interval construction
    • filtering using region predictions

This design allows recovery of full transcript intervals and multiple transcript boundary isoforms for the same gene.
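A minimal sketch of the last two post-processing steps, interval construction and filtering by region predictions, is given below. It is not the pipeline's actual implementation: the function names, the all-pairs strategy, the max_length limit, and the mean-coverage threshold are all assumptions made for illustration. Peak positions are assumed to be 0-based nucleotide coordinates on the forward strand.

```python
import numpy as np


def build_intervals(tss_peaks, polya_peaks, strand, max_length=2_000_000):
    """Pair each TSS peak with each PolyA peak that lies downstream of it
    in the direction of transcription, at most max_length nucleotides away."""
    intervals = []
    for tss in tss_peaks:
        for polya in polya_peaks:
            # On "+" transcription runs toward larger coordinates,
            # on "-" toward smaller ones.
            length = polya - tss if strand == "+" else tss - polya
            if 0 < length <= max_length:
                intervals.append((min(tss, polya), max(tss, polya), strand))
    return sorted(intervals)


def filter_by_region(intervals, region_track, min_mean=0.5):
    """Keep intervals whose mean region-model probability is at least min_mean."""
    region_track = np.asarray(region_track, dtype=np.float64)
    return [
        iv for iv in intervals
        if float(region_track[iv[0]:iv[1]].mean()) >= min_mean
    ]


# Two alternative TSS peaks and one PolyA peak on the forward strand.
candidates = build_intervals([100, 400], [900], "+")

# Made-up region track: high intragenic probability only between 400 and 900.
region = np.zeros(1000)
region[400:900] = 0.9
kept = filter_by_region(candidates, region, min_mean=0.8)
```

Pairing every compatible TSS and PolyA peak is what lets several boundary isoforms of one gene appear as separate candidates. In this toy case candidates holds two intervals and only (400, 900, "+") survives the region filter.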


Key Properties

  • strand-aware predictions
  • supports both mRNA and lncRNA
  • trained on human genome
  • ab initio inference from DNA only
  • designed for genome-wide transcript discovery

Important Notes

  • This model predicts transcript boundaries only
  • It does not predict exon–intron structure
  • It does not produce final transcript annotations by itself
  • Full inference requires post-processing and, typically, the paired region model

The model outputs boundary logits, which are intended to be converted into probability tracks and further processed into transcript intervals.
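Since the task is multilabel, a natural way to turn logits into probability tracks is an element-wise sigmoid rather than a softmax. The card does not state which activation the authors use, so the sketch below treats the sigmoid as an assumption. It uses NumPy on made-up logits so that it runs without the model; with PyTorch tensors the equivalent call would be torch.sigmoid(outputs.logits).

```python
import numpy as np

# Class order as documented in this card.
CLASS_NAMES = ["TSS+", "TSS-", "PolyA+", "PolyA-"]


def logits_to_named_probs(logits):
    """Apply an element-wise sigmoid to logits of shape (num_tokens, 4)
    and return one probability track per class name."""
    logits = np.asarray(logits, dtype=np.float64)
    if logits.ndim != 2 or logits.shape[1] != len(CLASS_NAMES):
        raise ValueError("expected logits of shape (num_tokens, 4)")
    probs = 1.0 / (1.0 + np.exp(-logits))
    return {name: probs[:, i] for i, name in enumerate(CLASS_NAMES)}


# Made-up logits for three tokens, for illustration only.
example_logits = [
    [2.0, -3.0, -4.0, -5.0],
    [0.0, 0.0, 0.0, 0.0],
    [-4.0, -4.0, 3.0, -2.0],
]
tracks = logits_to_named_probs(example_logits)
```

Each class is scored independently, so the four probabilities of a token do not need to sum to 1.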


Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification

repo_id = "shmelev/genatator-moderngena-base-human-edge-model"

tokenizer = AutoTokenizer.from_pretrained(
    repo_id,
    trust_remote_code=True,
)

model = AutoModelForTokenClassification.from_pretrained(
    repo_id,
    trust_remote_code=True,
)

model.eval()

Example Inference

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

repo_id = "shmelev/genatator-moderngena-base-human-edge-model"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForTokenClassification.from_pretrained(repo_id, trust_remote_code=True)
model.eval()  # disable dropout so inference is deterministic

# Toy input; real inputs are genomic DNA sequences
sequence = "ACGTACGTACGTACGTACGTACGTACGTACGT"

enc = tokenizer(sequence, return_tensors="pt")

# Forward pass without gradient tracking
with torch.no_grad():
    outputs = model(**enc)

# Raw per-token scores for the four boundary classes
logits = outputs.logits

print("Input shape:", enc["input_ids"].shape)
print("Logits shape:", logits.shape)

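Example output, where sequence_length is the number of tokens produced by the tokenizer, not the number of nucleotides: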

Input shape: torch.Size([1, sequence_length])
Logits shape: torch.Size([1, sequence_length, 4])

Each token receives 4 logits, one per boundary class, in the order ["TSS+", "TSS-", "PolyA+", "PolyA-"].
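Because tokenization is BPE, one token generally covers several nucleotides, so token-level predictions have to be expanded before they can be read as genomic coordinates. The sketch below assumes that a (start, end) character span is available for every token and that special tokens carry an empty span such as (0, 0). Fast Hugging Face tokenizers can usually return such spans with return_offsets_mapping=True, but whether this model's tokenizer supports that is not confirmed here. The function name expand_to_nucleotides and the example values are hypothetical.

```python
import numpy as np


def expand_to_nucleotides(token_probs, offsets, seq_len):
    """Copy each token's class probabilities onto every nucleotide it covers.

    token_probs: array of shape (num_tokens, num_classes)
    offsets:     list of (start, end) character spans, one per token
    seq_len:     length of the input sequence in nucleotides
    """
    token_probs = np.asarray(token_probs, dtype=np.float64)
    track = np.zeros((seq_len, token_probs.shape[1]))
    for row, (start, end) in zip(token_probs, offsets):
        if end > start:  # skip special tokens with an empty span
            track[start:end] = row
    return track


# Made-up example: two special tokens around two BPE tokens of length 5 and 3.
token_probs = [
    [0.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.2, 0.0, 0.8, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
offsets = [(0, 0), (0, 5), (5, 8), (0, 0)]
nucleotide_track = expand_to_nucleotides(token_probs, offsets, seq_len=8)
```

The result has one row per nucleotide and one column per class. Repeating the token value across its span is the simplest choice; assigning it only to the token's first or central nucleotide would be an equally reasonable alternative.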


Recommended Inference Workflow

To obtain transcript intervals from this model:

  1. run the model on genomic sequence
  2. convert logits to probabilities
  3. map token predictions to nucleotide resolution
  4. smooth the tracks
  5. call peaks
  6. connect TSS and PolyA peaks on the same strand
  7. filter candidates using region-model predictions
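Steps 4 and 5 can be sketched as follows for a single nucleotide-resolution probability track. This is a simplified illustration, not the pipeline's actual denoising or peak caller: the moving-average filter, the threshold, the minimum peak distance, and both function names are assumptions.

```python
import numpy as np


def smooth(track, window=5):
    """Moving-average smoothing; an odd window keeps peaks centered."""
    kernel = np.ones(window) / window
    return np.convolve(track, kernel, mode="same")


def call_peaks(track, threshold=0.5, min_distance=50):
    """Return positions of local maxima that reach the threshold,
    keeping the strongest peak within any min_distance neighborhood."""
    track = np.asarray(track, dtype=np.float64)
    last = len(track) - 1
    candidates = [
        i for i in range(len(track))
        if track[i] >= threshold
        and (i == 0 or track[i] > track[i - 1])
        and (i == last or track[i] >= track[i + 1])
    ]
    # Greedy suppression: visit candidates from strongest to weakest.
    candidates.sort(key=lambda i: track[i], reverse=True)
    kept = []
    for i in candidates:
        if all(abs(i - j) >= min_distance for j in kept):
            kept.append(i)
    return sorted(kept)


# Synthetic track with two bumps centered at positions 100 and 220.
x = np.arange(300)
raw = np.exp(-0.5 * ((x - 100) / 5.0) ** 2) + 0.8 * np.exp(-0.5 * ((x - 220) / 5.0) ** 2)
smoothed = smooth(raw, window=5)
peaks = call_peaks(smoothed, threshold=0.5, min_distance=50)
```

On this synthetic input the two bumps are recovered at positions 100 and 220. The same two functions would be applied separately to each of the four class tracks before TSS and PolyA peaks are connected.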

Limitations

  • predicts boundaries, not full transcript structure
  • requires post-processing to obtain transcript intervals
  • does not model alternative splicing structure
  • output quality depends on downstream peak-calling settings

Summary

The GENATATOR-ModernGENA edge model is a ModernBERT-based DNA language model for strand-aware transcript boundary detection in the human genome.

It is intended as the first stage of a transcript discovery pipeline and supports:

  • mRNA boundary detection
  • lncRNA boundary detection
  • genome-wide transcript discovery
  • multi-isoform transcript discovery through downstream post-processing