---
language:
  - en
license: apache-2.0
library_name: transformers
tags:
  - genomics
  - rna
  - nucleotide
  - sequence-modeling
  - biology
  - bioinformatics
  - electra
pipeline_tag: feature-extraction
---

# RNAElectra: Single-Nucleotide ELECTRA-Style Pre-training for RNA Representation Learning

RNAElectra is a nucleotide-resolution RNA language model trained using an ELECTRA-style objective for efficient and discriminative representation learning. The model produces contextualized embeddings for RNA sequences and is designed for downstream transcriptomic and regulatory modeling tasks.

## Model Details

- Model Type: Transformer-based discriminator model
- Training Objective: ELECTRA-style replaced-token detection
- Resolution: Single-nucleotide
- Domain: RNA and transcriptomic sequences
- Architecture: ModernBERT-style backbone adapted for nucleotide sequences
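Single-nucleotide resolution means every base in the sequence becomes its own token. A minimal illustration of the idea — note that the vocabulary and token IDs below are purely hypothetical; the real mapping is defined by the model's `NucEL_Tokenizer`:

```python
# Hypothetical illustration of single-nucleotide tokenization: one base,
# one token. The actual vocabulary/IDs belong to NucEL_Tokenizer; this
# mapping is illustrative only.
vocab = {"A": 0, "U": 1, "G": 2, "C": 3}

sequence = "AUGC"
token_ids = [vocab[base] for base in sequence]
print(token_ids)  # -> [0, 1, 2, 3]
```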

RNAElectra focuses on efficient pre-training by learning to discriminate corrupted tokens rather than reconstruct them, leading to strong representations with improved training efficiency.
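The discriminative objective above amounts to a per-token binary classification: score every position and predict whether it was replaced. A sketch of that loss, assuming a simple linear detection head and random stand-in tensors (not the released architecture):

```python
import torch
import torch.nn as nn

# Hypothetical sketch of replaced-token detection: the discriminator scores
# every position and is trained with per-token binary cross-entropy.
# Shapes and the linear head are illustrative, not the released model.
batch, seq_len, hidden = 2, 16, 32
hidden_states = torch.randn(batch, seq_len, hidden)  # stand-in encoder output

rtd_head = nn.Linear(hidden, 1)               # replaced-token-detection head
logits = rtd_head(hidden_states).squeeze(-1)  # (batch, seq_len)

labels = torch.randint(0, 2, (batch, seq_len)).float()  # 1 = replaced, 0 = original
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
print(f"RTD loss: {loss.item():.4f}")
```

Because every position contributes to the loss (not just the ~15% that are masked in MLM), each training example yields a denser learning signal.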

## Key Features

- Single-nucleotide tokenization
- Contextual RNA sequence embeddings
- ELECTRA-style discriminative pre-training
- Suitable for RNA function prediction, RBP binding modeling, stability prediction, regulatory element analysis, and downstream fine-tuning tasks
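For the fine-tuning tasks listed above, one common pattern is to wrap the encoder with a pooling step and a task head. The `RNAClassifier` class below is a hypothetical sketch (not part of this repository); the dummy encoder only exists to make the example self-contained — in practice you would pass the pretrained model:

```python
import torch
import torch.nn as nn
from types import SimpleNamespace

class RNAClassifier(nn.Module):
    """Hypothetical fine-tuning wrapper: mask-aware mean pooling over the
    encoder's token embeddings, followed by a linear classification head."""
    def __init__(self, encoder, hidden_size, num_labels):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        h = out.last_hidden_state                    # (B, L, H)
        mask = attention_mask.unsqueeze(-1).float()  # (B, L, 1)
        pooled = (h * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        return self.head(pooled)                     # (B, num_labels)

# Stand-in encoder for a self-contained smoke run; replace with the
# pretrained RNAElectra model in real use.
class DummyEncoder(nn.Module):
    def forward(self, input_ids, attention_mask):
        return SimpleNamespace(
            last_hidden_state=torch.randn(*input_ids.shape, 8))

clf = RNAClassifier(DummyEncoder(), hidden_size=8, num_labels=3)
ids = torch.randint(0, 4, (2, 16))
mask = torch.ones_like(ids)
logits = clf(ids, mask)
print(logits.shape)  # torch.Size([2, 3])
```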

## Usage

### Basic Feature Extraction

```python
import torch
from transformers import AutoModel
from tokenizer import NucEL_Tokenizer  # custom tokenizer shipped with the model repo

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pretrained discriminator (custom model code, hence trust_remote_code=True)
model = AutoModel.from_pretrained(
    "FreakingPotato/RNAElectra",
    trust_remote_code=True
).to(device)
model.eval()

tokenizer = NucEL_Tokenizer.from_pretrained(
    "FreakingPotato/RNAElectra",
    trust_remote_code=True
)

sequence = "AUGCAUGCAUGCAUGC"

# Tokenize at single-nucleotide resolution and move tensors to the model's device
inputs = tokenizer(sequence, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)

# Contextual per-nucleotide embeddings: (batch, seq_len, hidden_size)
embeddings = outputs.last_hidden_state
print(f"Sequence embeddings shape: {embeddings.shape}")
```
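The per-nucleotide embeddings can be reduced to one fixed-size vector per sequence, which is convenient for comparing sequences. The snippet below sketches mask-aware mean pooling and cosine similarity; the random tensors are stand-ins for `outputs.last_hidden_state` and `inputs["attention_mask"]` from the example above:

```python
import torch
import torch.nn.functional as F

# Synthetic stand-ins for the model outputs above (batch of 2 sequences)
last_hidden = torch.randn(2, 18, 64)  # (batch, seq_len, hidden_size)
attention_mask = torch.ones(2, 18)    # (batch, seq_len)

# Mask-aware mean pooling: average only over real (non-padding) tokens
mask = attention_mask.unsqueeze(-1)                 # (batch, seq_len, 1)
pooled = (last_hidden * mask).sum(1) / mask.sum(1)  # (batch, hidden_size)

# Compare the two pooled sequence embeddings
similarity = F.cosine_similarity(pooled[0], pooled[1], dim=0)
print(f"Cosine similarity: {similarity.item():.4f}")  # value in [-1, 1]
```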

## Installation

```bash
pip install transformers torch
```

## Requirements

- transformers >= 5.0.0
- torch >= 2.10.0
- Python >= 3.12.3

A GPU is recommended for large-scale inference.

## Pre-training Overview

RNAElectra was trained using an ELECTRA-style generator–discriminator framework. A generator predicts corrupted tokens, and a discriminator learns to detect replaced tokens. Only the discriminator weights are released in this repository. This objective improves training efficiency compared to masked language modeling while preserving strong contextual representations.
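The corruption step of that generator–discriminator framework can be sketched as follows. This is an illustrative toy (the real training code is not released here): random sampling stands in for the generator, and the labels show what the discriminator must predict:

```python
import torch

# Illustrative ELECTRA-style corruption step (toy, not the released training
# code): a "generator" proposes tokens at masked positions, then the
# discriminator is trained to label each position as original vs. replaced.
torch.manual_seed(0)
vocab_size, seq_len = 6, 12
original = torch.randint(0, vocab_size, (seq_len,))

# Mask ~15% of positions; random sampling stands in for the generator
mask = torch.rand(seq_len) < 0.15
proposals = torch.randint(0, vocab_size, (seq_len,))
corrupted = torch.where(mask, proposals, original)

# Discriminator targets: 1 where the token actually changed, else 0.
# (A sampled token can coincide with the original, so labels come from
# comparing sequences, not from the mask itself.)
labels = (corrupted != original).long()
print(corrupted.tolist(), labels.tolist())
```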

## Intended Use

RNAElectra is intended for feature extraction, downstream fine-tuning, and representation learning in RNA and transcriptomic modeling tasks. It is not intended for clinical decision-making or medical diagnostics.

## License

This model is released under the Apache 2.0 License.