---

language:
- en
license: apache-2.0
library_name: transformers
tags:
- genomics
- rna
- nucleotide
- sequence-modeling
- biology
- bioinformatics
- electra
pipeline_tag: feature-extraction
---


# RNAElectra: Single-Nucleotide ELECTRA-Style Pre-training for RNA Representation Learning

RNAElectra is a nucleotide-resolution RNA language model trained using an ELECTRA-style objective for efficient and discriminative representation learning. The model produces contextualized embeddings for RNA sequences and is designed for downstream transcriptomic and regulatory modeling tasks.

## Model Details

- **Model Type**: Transformer-based discriminator model  
- **Training Objective**: ELECTRA-style replaced-token detection  
- **Resolution**: Single-nucleotide  
- **Domain**: RNA and transcriptomic sequences  
- **Architecture**: ModernBERT-style backbone adapted for nucleotide sequences  

RNAElectra focuses on efficient pre-training by learning to discriminate corrupted tokens rather than reconstruct them, leading to strong representations with improved training efficiency.

## Key Features

- Single-nucleotide tokenization  
- Contextual RNA sequence embeddings  
- ELECTRA-style discriminative pre-training  
- Suitable for RNA function prediction, RBP binding modeling, stability prediction, regulatory element analysis, and downstream fine-tuning tasks  

## Usage

### Basic Feature Extraction

```python
import torch
from transformers import AutoModel
from tokenizer import NucEL_Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the released discriminator (custom model code on the Hub)
model = AutoModel.from_pretrained(
    "FreakingPotato/RNAElectra",
    trust_remote_code=True,
).to(device)
model.eval()

# Single-nucleotide tokenizer shipped with the repository
tokenizer = NucEL_Tokenizer.from_pretrained(
    "FreakingPotato/RNAElectra",
    trust_remote_code=True,
)

sequence = "AUGCAUGCAUGCAUGC"
inputs = tokenizer(sequence, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)

# (batch, sequence_length, hidden_size)
embeddings = outputs.last_hidden_state
print(f"Sequence embeddings shape: {embeddings.shape}")
```
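
For tasks that need one fixed-length vector per sequence, a common approach is to mean-pool the token embeddings over the attention mask. The helper below is a generic sketch using toy tensors, not part of the RNAElectra API:

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).float()     # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)        # guard against empty masks
    return summed / counts

# Toy example with random embeddings standing in for last_hidden_state
hidden = torch.randn(2, 5, 8)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
pooled = mean_pool(hidden, mask)
print(pooled.shape)  # torch.Size([2, 8])
```

In practice, `hidden` and `mask` would be `outputs.last_hidden_state` and `inputs["attention_mask"]` from the snippet above.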

## Installation

```bash
pip install transformers torch
```

## Requirements

- transformers >= 5.0.0
- torch >= 2.10.0
- Python >= 3.12.3

GPU is recommended for large-scale inference.

## Pre-training Overview

RNAElectra was trained using an ELECTRA-style generator–discriminator framework. A generator predicts corrupted tokens, and a discriminator learns to detect replaced tokens. Only the discriminator weights are released in this repository. This objective improves training efficiency compared to masked language modeling while preserving strong contextual representations.
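
To make the objective concrete, the sketch below illustrates replaced-token detection on toy data: some positions are swapped for random nucleotides, and a per-token binary loss scores whether each token was replaced. Shapes, the 15% corruption rate, and the random logits are illustrative only, not the actual training code:

```python
import torch
import torch.nn.functional as F

batch, seq_len = 2, 8
original_ids = torch.randint(0, 4, (batch, seq_len))  # A/C/G/U encoded as 0..3

# Corrupt ~15% of positions with random nucleotides (the generator's role)
corrupted_ids = original_ids.clone()
replace_mask = torch.rand(batch, seq_len) < 0.15
corrupted_ids[replace_mask] = torch.randint(0, 4, (int(replace_mask.sum()),))

# Labels: 1 where the token now differs from the original, else 0
labels = (corrupted_ids != original_ids).float()

# A real discriminator emits one logit per token; random stand-ins here
logits = torch.randn(batch, seq_len)
loss = F.binary_cross_entropy_with_logits(logits, labels)
print(loss.item())
```

Because every position contributes a label (not just masked ones), the discriminator receives a denser training signal than a masked language model, which is the source of the efficiency gain.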

## Intended Use

RNAElectra is intended for feature extraction, downstream fine-tuning, and representation learning in RNA and transcriptomic modeling tasks. It is not intended for clinical decision-making or medical diagnostics.
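
A common fine-tuning pattern is to attach a small classification head on top of pooled embeddings. The module below is a hypothetical sketch (the `hidden_size=768` and `num_labels=2` values are placeholders, not taken from this repository); in real use, `pooled` would come from the backbone's output:

```python
import torch
import torch.nn as nn

class RNAClassificationHead(nn.Module):
    """Illustrative sequence-level head; hidden_size and num_labels are placeholders."""

    def __init__(self, hidden_size: int, num_labels: int, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.dropout(pooled))

head = RNAClassificationHead(hidden_size=768, num_labels=2)
logits = head(torch.randn(4, 768))  # 4 pooled sequence embeddings
print(logits.shape)  # torch.Size([4, 2])
```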

## License

This model is released under the Apache 2.0 License.