---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- genomics
- rna
- nucleotide
- sequence-modeling
- biology
- bioinformatics
- electra
pipeline_tag: feature-extraction
---

# RNAElectra: Single-Nucleotide ELECTRA-Style Pre-training for RNA Representation Learning

RNAElectra is a nucleotide-resolution RNA language model trained using an ELECTRA-style objective for efficient and discriminative representation learning. The model produces contextualized embeddings for RNA sequences and is designed for downstream transcriptomic and regulatory modeling tasks.

## Model Details

- **Model Type**: Transformer-based discriminator model
- **Training Objective**: ELECTRA-style replaced-token detection
- **Resolution**: Single-nucleotide
- **Domain**: RNA and transcriptomic sequences
- **Architecture**: ModernBERT-style backbone adapted for nucleotide sequences

RNAElectra focuses on efficient pre-training by learning to discriminate corrupted tokens rather than reconstruct them, leading to strong representations with improved training efficiency.

## Key Features

- Single-nucleotide tokenization
- Contextual RNA sequence embeddings
- ELECTRA-style discriminative pre-training
- Suitable for RNA function prediction, RBP binding modeling, stability prediction, regulatory element analysis, and downstream fine-tuning tasks
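
Single-nucleotide tokenization assigns each base its own token id. Here is a minimal sketch of the idea with a hypothetical vocabulary; the model's actual `NucEL_Tokenizer` defines its own ids and special tokens:

```python
# Hypothetical single-nucleotide vocabulary for illustration only;
# the real NucEL_Tokenizer ships its own vocabulary and special tokens.
VOCAB = {"[CLS]": 0, "[SEP]": 1, "[UNK]": 2, "A": 3, "C": 4, "G": 5, "U": 6}

def encode(sequence: str) -> list[int]:
    """Tokenize an RNA sequence one nucleotide at a time."""
    ids = [VOCAB["[CLS]"]]
    ids += [VOCAB.get(base, VOCAB["[UNK]"]) for base in sequence.upper()]
    ids.append(VOCAB["[SEP]"])
    return ids

print(encode("AUGC"))  # [0, 3, 6, 5, 4, 1]
```

Because every nucleotide is its own token, embedding positions line up one-to-one with sequence positions, which is what enables per-base downstream predictions.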

## Usage

### Basic Feature Extraction

```python
import torch
from transformers import AutoModel
from tokenizer import NucEL_Tokenizer  # tokenizer.py is distributed with the model repository

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the released discriminator weights
model = AutoModel.from_pretrained(
    "FreakingPotato/RNAElectra",
    trust_remote_code=True,
).to(device)
model.eval()

tokenizer = NucEL_Tokenizer.from_pretrained(
    "FreakingPotato/RNAElectra",
    trust_remote_code=True,
)

sequence = "AUGCAUGCAUGCAUGC"

inputs = tokenizer(sequence, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)

# Per-nucleotide embeddings: (batch_size, sequence_length, hidden_size)
embeddings = outputs.last_hidden_state
print(f"Sequence embeddings shape: {embeddings.shape}")
```
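
`last_hidden_state` gives one embedding per nucleotide, but many downstream tasks need a single fixed-size vector per sequence. One common approach is attention-mask-aware mean pooling, sketched here on a dummy tensor (the hidden size of 256 is illustrative, not the model's actual dimension):

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings across the sequence, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()     # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)        # (batch, 1), avoid divide-by-zero
    return summed / counts

# Dummy example: batch of 2 sequences, 16 tokens, hypothetical hidden size 256.
hidden = torch.randn(2, 16, 256)
mask = torch.ones(2, 16, dtype=torch.long)
mask[1, 10:] = 0  # second sequence is padded after 10 tokens
pooled = mean_pool(hidden, mask)
print(pooled.shape)  # torch.Size([2, 256])
```

Masking before averaging matters whenever sequences in a batch have different lengths; otherwise padding tokens dilute the embedding.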
|
| | ## Installation
|
| |
|
| | ```bash
|
| | pip install transformers torch
|
| | ```
|
| |
|
| | ## Requirements
|
| |
|
| | - transformers >= 5.0.0
|
| | - torch >= 2.10.0
|
| | - Python >= 3.12.3
|
| |
|
| | GPU is recommended for large-scale inference.
|
| |
|
| | ## Pre-training Overview
|
| |
|
| | RNAElectra was trained using an ELECTRA-style generator–discriminator framework. A generator predicts corrupted tokens, and a discriminator learns to detect replaced tokens. Only the discriminator weights are released in this repository. This objective improves training efficiency compared to masked language modeling while preserving strong contextual representations.
|
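
The discriminator's training signal can be illustrated with a toy corruption step. This is a simplified sketch: in real ELECTRA pre-training the replacements are sampled from the generator's predictions, whereas here the corrupted positions are fixed and the substitutes are drawn at random:

```python
import random

def corrupt(sequence: str, positions: list[int], rng: random.Random) -> tuple[str, list[int]]:
    """Replace nucleotides at the given positions and emit per-token labels:
    1 = replaced, 0 = original (the discriminator's binary targets)."""
    bases = list(sequence)
    labels = [0] * len(bases)
    for pos in positions:
        original = bases[pos]
        # Substitute with a different base (a stand-in for a generator sample).
        bases[pos] = rng.choice([b for b in "ACGU" if b != original])
        labels[pos] = 1
    return "".join(bases), labels

rng = random.Random(0)
corrupted, labels = corrupt("AUGCAUGC", positions=[1, 5], rng=rng)
print(corrupted, labels)  # labels mark positions 1 and 5 as replaced
```

Because the discriminator receives a label for every token rather than only the masked ones, it learns from all positions in each sequence, which is the source of ELECTRA's training efficiency.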

## Intended Use

RNAElectra is intended for feature extraction, downstream fine-tuning, and representation learning in RNA and transcriptomic modeling tasks. It is not intended for clinical decision-making or medical diagnostics.

## License

This model is released under the Apache 2.0 License.