---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- genomics
- rna
- nucleotide
- sequence-modeling
- biology
- bioinformatics
- electra
pipeline_tag: feature-extraction
---

# RNAElectra: Single-Nucleotide ELECTRA-Style Pre-training for RNA Representation Learning

RNAElectra is a nucleotide-resolution RNA language model trained using an ELECTRA-style objective for efficient and discriminative representation learning. The model produces contextualized embeddings for RNA sequences and is designed for downstream transcriptomic and regulatory modeling tasks.

## Model Details

- **Model Type**: Transformer-based discriminator model
- **Training Objective**: ELECTRA-style replaced-token detection
- **Resolution**: Single-nucleotide
- **Domain**: RNA and transcriptomic sequences
- **Architecture**: ModernBERT-style backbone adapted for nucleotide sequences

RNAElectra focuses on efficient pre-training by learning to discriminate corrupted tokens rather than reconstruct them, leading to strong representations with improved training efficiency.

## Key Features

- Single-nucleotide tokenization
- Contextual RNA sequence embeddings
- ELECTRA-style discriminative pre-training
- Suitable for RNA function prediction, RBP binding modeling, stability prediction, regulatory element analysis, and downstream fine-tuning tasks
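
Single-nucleotide tokenization assigns each base its own token id. Here is a minimal sketch of the idea with a hypothetical vocabulary; the model's actual `NucEL_Tokenizer` defines its own ids and special tokens:

```python
# Hypothetical single-nucleotide vocabulary for illustration only;
# the real NucEL_Tokenizer ships its own vocabulary and special tokens.
VOCAB = {"[CLS]": 0, "[SEP]": 1, "[UNK]": 2, "A": 3, "C": 4, "G": 5, "U": 6}

def encode(sequence: str) -> list[int]:
    """Tokenize an RNA sequence one nucleotide at a time."""
    ids = [VOCAB["[CLS]"]]
    ids += [VOCAB.get(base, VOCAB["[UNK]"]) for base in sequence.upper()]
    ids.append(VOCAB["[SEP]"])
    return ids

print(encode("AUGC"))  # [0, 3, 6, 5, 4, 1]
```

Because every nucleotide is its own token, embedding positions line up one-to-one with sequence positions, which is what enables per-base downstream predictions.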

## Usage

### Basic Feature Extraction

```python
import torch
from transformers import AutoModel
from tokenizer import NucEL_Tokenizer  # tokenizer.py is distributed with the model repository

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the released discriminator weights
model = AutoModel.from_pretrained(
    "FreakingPotato/RNAElectra",
    trust_remote_code=True,
).to(device)
model.eval()

tokenizer = NucEL_Tokenizer.from_pretrained(
    "FreakingPotato/RNAElectra",
    trust_remote_code=True,
)

sequence = "AUGCAUGCAUGCAUGC"

inputs = tokenizer(sequence, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)

# Per-nucleotide embeddings: (batch_size, sequence_length, hidden_size)
embeddings = outputs.last_hidden_state
print(f"Sequence embeddings shape: {embeddings.shape}")
```
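
`last_hidden_state` gives one embedding per nucleotide, but many downstream tasks need a single fixed-size vector per sequence. One common approach is attention-mask-aware mean pooling, sketched here on a dummy tensor (the hidden size of 256 is illustrative, not the model's actual dimension):

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings across the sequence, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()     # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)        # (batch, 1), avoid divide-by-zero
    return summed / counts

# Dummy example: batch of 2 sequences, 16 tokens, hypothetical hidden size 256.
hidden = torch.randn(2, 16, 256)
mask = torch.ones(2, 16, dtype=torch.long)
mask[1, 10:] = 0  # second sequence is padded after 10 tokens
pooled = mean_pool(hidden, mask)
print(pooled.shape)  # torch.Size([2, 256])
```

Masking before averaging matters whenever sequences in a batch have different lengths; otherwise padding tokens dilute the embedding.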
|
| | ## Installation
|
| |
|
| | ```bash
|
| | pip install transformers torch
|
| | ```
|
| |
|
| | ## Requirements
|
| |
|
| | - transformers >= 5.0.0
|
| | - torch >= 2.10.0
|
| | - Python >= 3.12.3
|
| |
|
| | GPU is recommended for large-scale inference.
|
| |
|
| | ## Pre-training Overview
|
| |
|
| | RNAElectra was trained using an ELECTRA-style generator–discriminator framework. A generator predicts corrupted tokens, and a discriminator learns to detect replaced tokens. Only the discriminator weights are released in this repository. This objective improves training efficiency compared to masked language modeling while preserving strong contextual representations.
|
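
The discriminator's training signal can be illustrated with a toy corruption step. This is a simplified sketch: in real ELECTRA pre-training the replacements are sampled from the generator's predictions, whereas here the corrupted positions are fixed and the substitutes are drawn at random:

```python
import random

def corrupt(sequence: str, positions: list[int], rng: random.Random) -> tuple[str, list[int]]:
    """Replace nucleotides at the given positions and emit per-token labels:
    1 = replaced, 0 = original (the discriminator's binary targets)."""
    bases = list(sequence)
    labels = [0] * len(bases)
    for pos in positions:
        original = bases[pos]
        # Substitute with a different base (a stand-in for a generator sample).
        bases[pos] = rng.choice([b for b in "ACGU" if b != original])
        labels[pos] = 1
    return "".join(bases), labels

rng = random.Random(0)
corrupted, labels = corrupt("AUGCAUGC", positions=[1, 5], rng=rng)
print(corrupted, labels)  # labels mark positions 1 and 5 as replaced
```

Because the discriminator receives a label for every token rather than only the masked ones, it learns from all positions in each sequence, which is the source of ELECTRA's training efficiency.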

## Intended Use

RNAElectra is intended for feature extraction, downstream fine-tuning, and representation learning in RNA and transcriptomic modeling tasks. It is not intended for clinical decision-making or medical diagnostics.

## License

This model is released under the Apache 2.0 License.