GENATATOR-Caduceus-PS (Multispecies Transcript Type Classification)

Overview

GENATATOR-Caduceus-PS is a DNA language model fine-tuned to classify transcript sequences.

The model performs binary classification to distinguish between two transcript types:

  • mRNA
  • lncRNA

It takes a DNA sequence corresponding to a transcript and produces one logit per sequence.


Model

Model name on Hugging Face:

genatator-caduceus-ps-multispecies-transcript-type

Architecture properties:

  • backbone: Caduceus PS
  • layers: 16
  • hidden size: 512
  • tokenization: single nucleotide
  • classification head: linear projection to 1 output
  • maximum supported sequence length: 250,000 nucleotides

This is a sequence classification model, not a segmentation model.
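Since the model supports inputs up to 250,000 nucleotides, longer transcripts need to be shortened before tokenization. A minimal sketch (the helper name and the 5'-end truncation strategy are illustrative, not part of the model's API):

```python
# Maximum supported sequence length, from the architecture properties above.
MAX_LEN = 250_000

def truncate_transcript(seq: str, max_len: int = MAX_LEN) -> str:
    """Keep at most max_len nucleotides from the 5' end (illustrative choice)."""
    return seq[:max_len]

seq = "ACGT" * 100_000  # 400,000 nt, longer than the supported maximum
print(len(truncate_transcript(seq)))  # 250000
```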


Training Data

This model was fine-tuned on gene sequences only, not on complete genomes.

Training data includes:

  • mRNA transcripts
  • lncRNA transcripts

Dataset characteristics:

  • one transcript per gene
  • no intergenic regions
  • multispecies training dataset

Each training sample corresponds to a single transcript sequence labeled as either mRNA or lncRNA.


Labels

The model predicts a single binary output:

  • 0: mRNA
  • 1: lncRNA

The model returns one logit per input sequence.

To obtain probabilities for the lncRNA class, apply a sigmoid to the logits.
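The logit-to-label convention above can be sketched as follows (the logit values and the 0.5 threshold are illustrative; the model itself only returns the raw logits):

```python
import torch

# One logit per sequence; sigmoid gives P(lncRNA) under the 0/1 convention
# above (0 = mRNA, 1 = lncRNA). The logit values here are made up.
logits = torch.tensor([[-2.4], [1.3]])
probs = torch.sigmoid(logits)               # P(lncRNA) per sequence
labels = (probs > 0.5).long().squeeze(-1)   # 1 = lncRNA, 0 = mRNA
names = ["mRNA", "lncRNA"]
print([names[i] for i in labels.tolist()])  # ['mRNA', 'lncRNA']
```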


Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification

repo_id = "shmelev/genatator-caduceus-ps-multispecies-transcript-type"

tokenizer = AutoTokenizer.from_pretrained(
    repo_id,
    trust_remote_code=True,
)

model = AutoModelForSequenceClassification.from_pretrained(
    repo_id,
    trust_remote_code=True,
)

model.eval()

Example Inference

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

repo_id = "shmelev/genatator-caduceus-ps-multispecies-transcript-type"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(repo_id, trust_remote_code=True)

sequences = [
    "ACGTACGTACGTACGTACGTACGTACGT",
    "TTGCGATCGATCGATCGATCGATCGATCGATCGATCGAATCG",
]

# Pad to equal length so the batch can be stacked into a single tensor
enc = tokenizer(sequences, padding=True, return_tensors="pt")
input_ids = enc["input_ids"]

with torch.no_grad():
    outputs = model(input_ids=input_ids)

logits = outputs["logits"]
probs = torch.sigmoid(logits)

print("Input shape:", input_ids.shape)
print("Logits shape:", logits.shape)
print("Probabilities shape:", probs.shape)
print("Probabilities of lncRNA:", probs.squeeze(-1))

Example output:

Input shape: torch.Size([2, sequence_length])
Logits shape: torch.Size([2, 1])
Probabilities shape: torch.Size([2, 1])
Probabilities of lncRNA: tensor([0.08, 0.91])