---
language:
  - en
license: apache-2.0
library_name: transformers
tags:
  - genomics
  - rna
  - nucleotide
  - sequence-modeling
  - biology
  - bioinformatics
  - electra
pipeline_tag: feature-extraction
---

# RNAElectra: Single-Nucleotide ELECTRA-Style Pre-training for RNA Representation Learning

RNAElectra is a nucleotide-resolution RNA language model trained using an ELECTRA-style objective for efficient and discriminative representation learning. The model produces contextualized embeddings for RNA sequences and is designed for downstream transcriptomic and regulatory modeling tasks.

## Model Details

- Model Type: Transformer-based discriminator model
- Training Objective: ELECTRA-style replaced-token detection
- Resolution: Single-nucleotide
- Domain: RNA and transcriptomic sequences
- Architecture: ModernBERT-style backbone adapted for nucleotide sequences
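Single-nucleotide resolution means every base in the sequence becomes its own token. A minimal illustration of the idea — note that the vocabulary and token IDs below are purely hypothetical; the real mapping is defined by the model's `NucEL_Tokenizer`:

```python
# Hypothetical illustration of single-nucleotide tokenization: one base,
# one token. The actual vocabulary/IDs belong to NucEL_Tokenizer; this
# mapping is illustrative only.
vocab = {"A": 0, "U": 1, "G": 2, "C": 3}

sequence = "AUGC"
token_ids = [vocab[base] for base in sequence]
print(token_ids)  # -> [0, 1, 2, 3]
```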

RNAElectra focuses on efficient pre-training by learning to discriminate corrupted tokens rather than reconstruct them, leading to strong representations with improved training efficiency.
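The discriminative objective above amounts to a per-token binary classification: score every position and predict whether it was replaced. A sketch of that loss, assuming a simple linear detection head and random stand-in tensors (not the released architecture):

```python
import torch
import torch.nn as nn

# Hypothetical sketch of replaced-token detection: the discriminator scores
# every position and is trained with per-token binary cross-entropy.
# Shapes and the linear head are illustrative, not the released model.
batch, seq_len, hidden = 2, 16, 32
hidden_states = torch.randn(batch, seq_len, hidden)  # stand-in encoder output

rtd_head = nn.Linear(hidden, 1)               # replaced-token-detection head
logits = rtd_head(hidden_states).squeeze(-1)  # (batch, seq_len)

labels = torch.randint(0, 2, (batch, seq_len)).float()  # 1 = replaced, 0 = original
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
print(f"RTD loss: {loss.item():.4f}")
```

Because every position contributes to the loss (not just the ~15% that are masked in MLM), each training example yields a denser learning signal.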

## Key Features

- Single-nucleotide tokenization
- Contextual RNA sequence embeddings
- ELECTRA-style discriminative pre-training
- Suitable for RNA function prediction, RBP binding modeling, stability prediction, regulatory element analysis, and downstream fine-tuning tasks
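For the fine-tuning tasks listed above, one common pattern is to wrap the encoder with a pooling step and a task head. The `RNAClassifier` class below is a hypothetical sketch (not part of this repository); the dummy encoder only exists to make the example self-contained — in practice you would pass the pretrained model:

```python
import torch
import torch.nn as nn
from types import SimpleNamespace

class RNAClassifier(nn.Module):
    """Hypothetical fine-tuning wrapper: mask-aware mean pooling over the
    encoder's token embeddings, followed by a linear classification head."""
    def __init__(self, encoder, hidden_size, num_labels):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        h = out.last_hidden_state                    # (B, L, H)
        mask = attention_mask.unsqueeze(-1).float()  # (B, L, 1)
        pooled = (h * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        return self.head(pooled)                     # (B, num_labels)

# Stand-in encoder for a self-contained smoke run; replace with the
# pretrained RNAElectra model in real use.
class DummyEncoder(nn.Module):
    def forward(self, input_ids, attention_mask):
        return SimpleNamespace(
            last_hidden_state=torch.randn(*input_ids.shape, 8))

clf = RNAClassifier(DummyEncoder(), hidden_size=8, num_labels=3)
ids = torch.randint(0, 4, (2, 16))
mask = torch.ones_like(ids)
logits = clf(ids, mask)
print(logits.shape)  # torch.Size([2, 3])
```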

## Usage

### Basic Feature Extraction

```python
import torch
from transformers import AutoModel
from tokenizer import NucEL_Tokenizer  # custom tokenizer shipped with the model repo

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pretrained discriminator (custom model code, hence trust_remote_code=True)
model = AutoModel.from_pretrained(
    "FreakingPotato/RNAElectra",
    trust_remote_code=True
).to(device)
model.eval()

tokenizer = NucEL_Tokenizer.from_pretrained(
    "FreakingPotato/RNAElectra",
    trust_remote_code=True
)

sequence = "AUGCAUGCAUGCAUGC"

# Tokenize at single-nucleotide resolution and move tensors to the model's device
inputs = tokenizer(sequence, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)

# Contextual per-nucleotide embeddings: (batch, seq_len, hidden_size)
embeddings = outputs.last_hidden_state
print(f"Sequence embeddings shape: {embeddings.shape}")
```
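The per-nucleotide embeddings can be reduced to one fixed-size vector per sequence, which is convenient for comparing sequences. The snippet below sketches mask-aware mean pooling and cosine similarity; the random tensors are stand-ins for `outputs.last_hidden_state` and `inputs["attention_mask"]` from the example above:

```python
import torch
import torch.nn.functional as F

# Synthetic stand-ins for the model outputs above (batch of 2 sequences)
last_hidden = torch.randn(2, 18, 64)  # (batch, seq_len, hidden_size)
attention_mask = torch.ones(2, 18)    # (batch, seq_len)

# Mask-aware mean pooling: average only over real (non-padding) tokens
mask = attention_mask.unsqueeze(-1)                 # (batch, seq_len, 1)
pooled = (last_hidden * mask).sum(1) / mask.sum(1)  # (batch, hidden_size)

# Compare the two pooled sequence embeddings
similarity = F.cosine_similarity(pooled[0], pooled[1], dim=0)
print(f"Cosine similarity: {similarity.item():.4f}")  # value in [-1, 1]
```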

## Installation

```bash
pip install transformers torch
```

## Requirements

- transformers >= 5.0.0
- torch >= 2.10.0
- Python >= 3.12.3

A GPU is recommended for large-scale inference.

## Pre-training Overview

RNAElectra was trained using an ELECTRA-style generator–discriminator framework. A generator predicts corrupted tokens, and a discriminator learns to detect replaced tokens. Only the discriminator weights are released in this repository. This objective improves training efficiency compared to masked language modeling while preserving strong contextual representations.
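The corruption step of that generator–discriminator framework can be sketched as follows. This is an illustrative toy (the real training code is not released here): random sampling stands in for the generator, and the labels show what the discriminator must predict:

```python
import torch

# Illustrative ELECTRA-style corruption step (toy, not the released training
# code): a "generator" proposes tokens at masked positions, then the
# discriminator is trained to label each position as original vs. replaced.
torch.manual_seed(0)
vocab_size, seq_len = 6, 12
original = torch.randint(0, vocab_size, (seq_len,))

# Mask ~15% of positions; random sampling stands in for the generator
mask = torch.rand(seq_len) < 0.15
proposals = torch.randint(0, vocab_size, (seq_len,))
corrupted = torch.where(mask, proposals, original)

# Discriminator targets: 1 where the token actually changed, else 0.
# (A sampled token can coincide with the original, so labels come from
# comparing sequences, not from the mask itself.)
labels = (corrupted != original).long()
print(corrupted.tolist(), labels.tolist())
```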

## Intended Use

RNAElectra is intended for feature extraction, downstream fine-tuning, and representation learning in RNA and transcriptomic modeling tasks. It is not intended for clinical decision-making or medical diagnostics.

## License

This model is released under the Apache 2.0 License.