GENATATOR-Caduceus-PS (Multispecies Transcript Type Classification)
Overview
GENATATOR-Caduceus-PS is a DNA language model fine-tuned to classify transcript sequences by type.
The model performs binary classification to distinguish between two transcript types:
- mRNA
- lncRNA
It takes a DNA sequence corresponding to a transcript and produces one logit per sequence.
Model
Model name on Hugging Face:
genatator-caduceus-ps-multispecies-transcript-type
Architecture properties:
- backbone: Caduceus PS
- layers: 16
- hidden size: 512
- tokenization: single nucleotide
- classification head: linear projection to 1 output
- maximum supported sequence length: 250,000 nucleotides
This is a sequence classification model, not a segmentation model.
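Because tokenization is per nucleotide and the supported length is capped at 250,000 nucleotides, it can help to validate inputs before tokenizing them. The following is a minimal sketch, not part of the model's API: `check_sequence`, `MAX_LENGTH`, and `ALLOWED` are hypothetical names, the `ACGTN` alphabet and the upper-casing step are assumptions that should be checked against the tokenizer's vocabulary, and any special tokens the tokenizer adds may count toward the length limit.

```python
MAX_LENGTH = 250_000    # maximum supported length stated in this card
ALLOWED = set("ACGTN")  # assumed alphabet; verify against the tokenizer vocabulary


def check_sequence(seq: str) -> str:
    """Return the cleaned, upper-cased sequence or raise ValueError."""
    seq = seq.strip().upper()  # upper-casing is an assumption, not a documented requirement
    if not seq:
        raise ValueError("empty sequence")
    if len(seq) > MAX_LENGTH:
        raise ValueError(f"sequence has {len(seq)} nt, limit is {MAX_LENGTH}")
    unexpected = set(seq) - ALLOWED
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    return seq


print(check_sequence("acgtacgt"))  # ACGTACGT
```

Rejecting over-long inputs up front gives a clearer error than a failure inside the model's forward pass.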
Training Data
This model was fine-tuned on gene sequences only, not on complete genomes.
Training data includes:
- mRNA transcripts
- lncRNA transcripts
Dataset characteristics:
- one transcript per gene
- no intergenic regions
- multispecies training dataset
Each training sample corresponds to a single transcript sequence labeled as either mRNA or lncRNA.
Labels
The model predicts a single binary output:
- 0: mRNA
- 1: lncRNA
The model returns one logit per input sequence.
To obtain probabilities for the lncRNA class, apply a sigmoid to the logits.
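The mapping from a raw logit to a class label can be sketched in plain Python as below. The 0.5 decision threshold and the names `sigmoid`, `decode`, and `LABELS` are assumptions for illustration; the card only specifies the sigmoid step and the 0/1 label meaning. The two logits are hypothetical values, not model outputs.

```python
import math
from typing import Tuple

LABELS = {0: "mRNA", 1: "lncRNA"}  # label mapping from this card


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def decode(logit: float, threshold: float = 0.5) -> Tuple[str, float]:
    """Return (label, P(lncRNA)); the 0.5 threshold is an assumed default."""
    p = sigmoid(logit)
    label = LABELS[1] if p >= threshold else LABELS[0]
    return label, p


for logit in (-2.4, 2.3):  # hypothetical logits
    label, p = decode(logit)
    print(f"logit={logit:+.1f} -> P(lncRNA)={p:.2f} -> {label}")
# logit=-2.4 -> P(lncRNA)=0.08 -> mRNA
# logit=+2.3 -> P(lncRNA)=0.91 -> lncRNA
```

When working with tensors, `torch.sigmoid(logits)` performs the same transformation in one call. A different threshold may suit applications where one kind of misclassification is more costly than the other.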
Usage
from transformers import AutoTokenizer, AutoModelForSequenceClassification

repo_id = "shmelev/genatator-caduceus-ps-multispecies-transcript-type"

tokenizer = AutoTokenizer.from_pretrained(
    repo_id,
    trust_remote_code=True,
)
model = AutoModelForSequenceClassification.from_pretrained(
    repo_id,
    trust_remote_code=True,
)
model.eval()
Example Inference
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

repo_id = "shmelev/genatator-caduceus-ps-multispecies-transcript-type"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(repo_id, trust_remote_code=True)
model.eval()  # inference mode

sequences = [
    "ACGTACGTACGTACGTACGTACGTACGT",
    "TTGCGATCGATCGATCGATCGATCGATCGATCGATCGAATCG",
]

# The sequences differ in length, so their token lists cannot be stacked into
# one tensor without padding. Run them one at a time and concatenate the logits.
all_logits = []
with torch.no_grad():
    for seq in sequences:
        enc = tokenizer(seq)
        input_ids = torch.tensor([enc["input_ids"]])  # shape: [1, sequence_length]
        print("Input shape:", input_ids.shape)
        outputs = model(input_ids=input_ids)
        all_logits.append(outputs["logits"])

logits = torch.cat(all_logits, dim=0)  # shape: [2, 1]
probs = torch.sigmoid(logits)

print("Logits shape:", logits.shape)
print("Probabilities shape:", probs.shape)
print("Probabilities of lncRNA:", probs.squeeze(-1))

Example output:
Input shape: torch.Size([1, sequence_length])
Input shape: torch.Size([1, sequence_length])
Logits shape: torch.Size([2, 1])
Probabilities shape: torch.Size([2, 1])
Probabilities of lncRNA: tensor([0.08, 0.91])