Commit 1735c76 by FreakingPotato (parent: 10234c4): Updated README
---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- genomics
- rna
- nucleotide
- sequence-modeling
- biology
- bioinformatics
- electra
pipeline_tag: feature-extraction
---
 
# RNAElectra: Single-Nucleotide ELECTRA-Style Pre-training for RNA Representation Learning

RNAElectra is a nucleotide-resolution RNA language model trained using an ELECTRA-style objective for efficient and discriminative representation learning. The model produces contextualized embeddings for RNA sequences and is designed for downstream transcriptomic and regulatory modeling tasks.

## Model Details

- **Model Type**: Transformer-based discriminator model
- **Training Objective**: ELECTRA-style replaced-token detection
- **Resolution**: Single-nucleotide
- **Domain**: RNA and transcriptomic sequences
- **Architecture**: ModernBERT-style backbone adapted for nucleotide sequences

RNAElectra focuses on efficient pre-training by learning to discriminate corrupted tokens rather than reconstruct them, leading to strong representations with improved training efficiency.

## Key Features

- Single-nucleotide tokenization
- Contextual RNA sequence embeddings
- ELECTRA-style discriminative pre-training
- Suitable for RNA function prediction, RBP binding modeling, stability prediction, regulatory element analysis, and downstream fine-tuning tasks
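
The usage example below feeds an RNA-alphabet sequence (A/U/G/C). If your data is DNA-derived, it can help to normalize it before tokenization; `to_rna` is a hypothetical helper sketched here for illustration, not part of the repository:

```python
def to_rna(sequence: str) -> str:
    """Normalize a nucleotide string to the RNA alphabet (A/U/G/C).

    Hypothetical helper: uppercases, converts DNA thymine (T) to
    uracil (U), and rejects characters outside the expected alphabet.
    """
    seq = sequence.upper().replace("T", "U")
    unexpected = set(seq) - set("AUGC")
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    return seq

print(to_rna("atgcATGC"))  # AUGCAUGC
```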

## Usage

### Basic Feature Extraction
 
```python
import torch
# ... remaining imports (AutoModel, NucEL_Tokenizer) and the `device`
# setup are not shown in this diff view

model = AutoModel.from_pretrained(
    "FreakingPotato/RNAElectra",
    trust_remote_code=True
).to(device)
model.eval()

tokenizer = NucEL_Tokenizer.from_pretrained(
    "FreakingPotato/RNAElectra",
    # ... one argument not shown in this diff view
)

sequence = "AUGCAUGCAUGCAUGC"

inputs = tokenizer(sequence, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)

embeddings = outputs.last_hidden_state
print(f"Sequence embeddings shape: {embeddings.shape}")
```
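
`last_hidden_state` holds one embedding per nucleotide token. For sequence-level downstream tasks, a common approach (an assumption here, not something the repository specifies) is attention-mask-weighted mean pooling over the token dimension:

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (B, L, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (B, H)
    counts = mask.sum(dim=1).clamp(min=1e-9)                         # (B, 1)
    return summed / counts

# Toy check with random embeddings (batch=2, length=4, hidden=8)
hidden = torch.randn(2, 4, 8)
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 1, 1]])
pooled = mean_pool(hidden, mask)
print(pooled.shape)  # torch.Size([2, 8])
```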

## Installation

```bash
pip install transformers torch
```

## Requirements

- transformers >= 5.0.0
- torch >= 2.10.0
- Python >= 3.12.3

A GPU is recommended for large-scale inference.
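
Large-scale inference benefits from feeding the model batches of sequences rather than one at a time; a minimal, generic batching helper (an illustration, not repository code) might look like:

```python
from typing import Iterable, Iterator

def batched(sequences: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Yield fixed-size batches of sequences (the last batch may be smaller)."""
    batch: list[str] = []
    for seq in sequences:
        batch.append(seq)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

print(list(batched(["AUGC", "GGCC", "AUAU"], 2)))  # [['AUGC', 'GGCC'], ['AUAU']]
```

Each yielded batch can then be passed to the tokenizer with padding enabled and run through the model in one forward pass.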

## Pre-training Overview

RNAElectra was trained using an ELECTRA-style generator–discriminator framework: a generator proposes replacements for corrupted tokens, and a discriminator learns to detect which tokens were replaced. Only the discriminator weights are released in this repository. This objective improves training efficiency compared to masked language modeling while preserving strong contextual representations.
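
The replaced-token-detection objective can be sketched with a toy per-token binary classifier. This is an illustration of the loss only, not the actual RNAElectra training code; `discriminator_head`, `replaced`, and the tensor shapes are invented for the example:

```python
import torch
import torch.nn.functional as F

# Toy replaced-token-detection loss: the discriminator scores every
# position, and the target is 1 where the generator swapped the token.
hidden_size, seq_len = 8, 6
discriminator_head = torch.nn.Linear(hidden_size, 1)  # per-token logit

token_embeddings = torch.randn(1, seq_len, hidden_size)  # stand-in for backbone output
replaced = torch.tensor([[0., 1., 0., 0., 1., 0.]])      # which positions were swapped

logits = discriminator_head(token_embeddings).squeeze(-1)  # (1, seq_len)
loss = F.binary_cross_entropy_with_logits(logits, replaced)
print(loss.shape)  # torch.Size([]) — a scalar loss over all positions
```

Because every position contributes to the loss (not just the ~15% masked in MLM), the model receives a denser training signal per sequence, which is the efficiency argument above.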

## Intended Use

RNAElectra is intended for feature extraction, downstream fine-tuning, and representation learning in RNA and transcriptomic modeling tasks. It is not intended for clinical decision-making or medical diagnostics.

## License

This model is released under the Apache 2.0 License.