---
language:
- rna
library_name: transformers
tags:
- RNA
- language-model
license: apache-2.0
---

# RNAErnie2

RNAErnie2 is a BERT-based RNA language model trained from scratch on a large-scale RNA
sequence dataset with up to 2048-nucleotide context length. It is a retrained successor
to RNAErnie that replaces the PaddlePaddle-based ERNIE backbone with a standard PyTorch
BERT architecture, extends the pretraining corpus to RNACentral v22 (~31M sequences,
length <= 2048), and switches to an RNA-native vocabulary (U instead of T).

## Architecture

| Parameter | Value |
|---|---|
| Layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Intermediate size | 3072 |
| Vocabulary size | 11 |
| Positional encoding | Absolute learned |
| Architecture | Post-LN BERT / BertForMaskedLM |
| Max sequence length | 2048 |

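As a rough sanity check on the table, the encoder size can be estimated from these values alone. The sketch below assumes standard BERT layer shapes (biased linear projections and two LayerNorms per layer) and a position table of exactly 2048 rows. It leaves out the token-type embeddings, the embedding LayerNorm, the pooler, and the MLM head, so it is an approximation rather than the exact checkpoint size.

```python
# Values taken from the architecture table.
LAYERS, HIDDEN, INTERMEDIATE = 12, 768, 3072
VOCAB_SIZE, MAX_POSITIONS = 11, 2048  # 2048 position rows is an assumption


def linear(n_in: int, n_out: int) -> int:
    """Parameters of a dense layer with bias."""
    return n_in * n_out + n_out


def layer_norm(n: int) -> int:
    """Parameters of a LayerNorm (scale and shift)."""
    return 2 * n


attention = 4 * linear(HIDDEN, HIDDEN)  # query, key, value, output projections
ffn = linear(HIDDEN, INTERMEDIATE) + linear(INTERMEDIATE, HIDDEN)
per_layer = attention + ffn + 2 * layer_norm(HIDDEN)

encoder_params = LAYERS * per_layer
embedding_params = (VOCAB_SIZE + MAX_POSITIONS) * HIDDEN  # word + position tables only

print(encoder_params)    # 85054464
print(embedding_params)  # 1581312
```
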
**Vocabulary:** `[PAD]=0, [UNK]=1, [CLS]=2, [EOS]=3, [SEP]=4, [MASK]=5, A=6, U=7, C=8, G=9, N=10`

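To make the vocabulary concrete, here is a minimal sketch of the nucleotide-to-id mapping. `encode_nucleotides` is a hypothetical helper, not the tokenizer shipped with the model. It applies the T-to-U conversion described in this card, and it additionally assumes that input is upper-cased and that unknown characters map to `[UNK]`. It does not add special tokens such as `[CLS]`.

```python
# Token-to-id table copied from the vocabulary line above.
VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[EOS]": 3, "[SEP]": 4, "[MASK]": 5,
    "A": 6, "U": 7, "C": 8, "G": 9, "N": 10,
}


def encode_nucleotides(seq: str) -> list[int]:
    """Hypothetical helper: map each nucleotide to its id, without special tokens."""
    rna = seq.upper().replace("T", "U")  # DNA-style T becomes RNA-style U
    return [VOCAB.get(base, VOCAB["[UNK]"]) for base in rna]


print(encode_nucleotides("ATGC"))  # [6, 7, 9, 8]
```

For real inputs, use the tokenizer loaded with `AutoTokenizer` as shown under Usage.
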
## Pretraining

- **Objective:** Masked language modelling (MLM)
- **Data:** RNACentral v22, ~31 million RNA sequences with length <= 2048
- **Source checkpoint:** [`LLM-EDA/RNAErnie`](https://huggingface.co/LLM-EDA/RNAErnie) on HuggingFace Hub
- **Tokenisation note:** Sequences use U (not T). Input T is silently converted to U by the tokenizer.

### Checkpoint selection

There is a single publicly released RNAErnie2 checkpoint. The weights are taken from
[`LLM-EDA/RNAErnie`](https://huggingface.co/LLM-EDA/RNAErnie) with one minor
adjustment: `cls.predictions.decoder.bias` is stored explicitly (it was implicitly
tied to `cls.predictions.bias` in the original save and was absent from the file).

## Parity Verification

Hidden-state representations and MLM logits were verified to match those of the original
`BertForMaskedLM` (maximum absolute difference < 2e-5) at all 13 representation levels
(embedding output plus 12 layers). Verification was run on GPU with PyTorch 2.7 / CUDA 12.

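The kind of check described above can be sketched as follows. This is an illustration of a per-level maximum-absolute-difference comparison, not the script that was actually used. `max_abs_diff` and `check_parity` are hypothetical names, and plain nested lists stand in for tensors so the example runs without a GPU or model download.

```python
def max_abs_diff(a, b):
    """Largest absolute element-wise difference between two equally shaped nested lists."""
    if isinstance(a, (list, tuple)):
        return max((max_abs_diff(x, y) for x, y in zip(a, b)), default=0.0)
    return abs(a - b)


def check_parity(reference_levels, converted_levels, tol=2e-5):
    """Compare matching representation levels; return per-level diffs and a pass flag."""
    assert len(reference_levels) == len(converted_levels)
    diffs = [max_abs_diff(r, c) for r, c in zip(reference_levels, converted_levels)]
    return diffs, all(d < tol for d in diffs)


# Toy stand-in for 13 levels (embedding output + 12 layers) of hidden states.
ref = [[[0.10, -0.20], [0.30, 0.40]] for _ in range(13)]
new = [[[v + 1e-6 for v in row] for row in level] for level in ref]

diffs, ok = check_parity(ref, new)
print(len(diffs), ok)  # 13 True
```

With PyTorch tensors, the per-level difference would typically be computed as `(a - b).abs().max()` on the `hidden_states` tuples returned with `output_hidden_states=True`.
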
## Implementation Notes

The model uses a custom BERT implementation (`modeling_rnaernie2.py`) with eager, SDPA,
and Flash Attention 2 backends, following the architecture of
[`Taykhoom/BERT-updated`](https://huggingface.co/Taykhoom/BERT-updated).
The original [`LLM-EDA/RNAErnie`](https://huggingface.co/LLM-EDA/RNAErnie) used the
standard HF BERT implementation with no custom attention backends.

## Related Models

See the full [RNAErnie collection](https://huggingface.co/collections/Taykhoom/rnaernie-6a219927c11fdcccedb243db).

| Model | Context | Training data | Notes |
|---|---|---|---|
| [RNAErnie](https://huggingface.co/Taykhoom/RNAErnie) | 512 | RNACentral (nts<=512) | Original; PaddlePaddle backbone |
| **[RNAErnie2](https://huggingface.co/Taykhoom/RNAErnie2)** | **2048** | **RNACentral v22 (~31M seqs)** | **This model; PyTorch BERT** |

## Usage

### Embedding generation

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/RNAErnie2", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/RNAErnie2", trust_remote_code=True)
model.eval()

sequences = ["AUGCAUGCAUGC", "GCUGCAUGCUAGC"]
enc = tokenizer(sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**enc)

cls_emb   = out.last_hidden_state[:, 0, :]  # (batch, 768) -- CLS token
token_emb = out.last_hidden_state           # (batch, seq_len, 768)

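# Intermediate layers: hidden_states[0] is the embedding output, hidden_states[i] is layer i
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer6_emb = out_all.hidden_states[6]       # (batch, seq_len, 768) -- output of layer 6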
```

### MLM logits

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/RNAErnie2", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/RNAErnie2", trust_remote_code=True)
model.eval()

enc = tokenizer(["AUG[MASK]AUG"], return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits  # (1, seq_len, 11)
```

### SDPA / Flash Attention 2

```python
model = AutoModel.from_pretrained(
    "Taykhoom/RNAErnie2",
    attn_implementation="sdpa",   # or "flash_attention_2"
    trust_remote_code=True,
)
```

### Fine-tuning

Fine-tuning follows standard HF conventions. For sequence-level tasks, use the CLS token
embedding (`last_hidden_state[:, 0, :]`) as input to a classification head.

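One possible shape for such a classification head is sketched below. `ClsClassifier`, its dropout rate, and the label count are illustrative assumptions, not part of the released code. A random tensor stands in for `model(**enc).last_hidden_state`, so the snippet runs without downloading the checkpoint.

```python
import torch
from torch import nn


class ClsClassifier(nn.Module):
    """Hypothetical head: dropout followed by a linear layer on the CLS embedding."""

    def __init__(self, hidden_size: int = 768, num_labels: int = 2, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(hidden_size, num_labels)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        cls_emb = last_hidden_state[:, 0, :]  # (batch, hidden_size)
        return self.linear(self.dropout(cls_emb))


head = ClsClassifier(num_labels=3)
dummy = torch.randn(2, 14, 768)  # stand-in for model(**enc).last_hidden_state
logits = head(dummy)
print(logits.shape)  # torch.Size([2, 3])
```

In a real fine-tuning run, the backbone output would replace `dummy`, and the head and backbone would be trained jointly with a task loss such as cross-entropy.
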
## Citation

```bibtex
@article{wang2024_rnaernie,
  title   = {Multi-purpose {RNA} language modelling with motif-aware pretraining and type-guided fine-tuning},
  author  = {Wang, Ning and Bian, Jiang and Li, Yuchen and Li, Xuhong and Mumtaz, Shahid and Kong, Linghe and Xiong, Haoyi},
  journal = {Nature Machine Intelligence},
  volume  = {6},
  pages   = {548--557},
  year    = {2024},
  doi     = {10.1038/s42256-024-00836-4}
}
```

## Credits

Original model and code by Wang et al. Source: [GitHub](https://github.com/CatIIIIIIII/RNAErnie) /
[HuggingFace](https://huggingface.co/LLM-EDA/RNAErnie).
The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
and reviewed manually by Taykhoom Dalal.

## License

Apache 2.0, following the original repository.