---
language:
- rna
library_name: transformers
tags:
- RNA
- language-model
license: apache-2.0
---

# RNAErnie

RNAErnie is a BERT-based RNA language model pretrained on RNACentral with a
motif-aware masking strategy and designed for type-guided fine-tuning. It uses a
DNA-style vocabulary (T instead of U) and extends the token vocabulary with 28
ncRNA type labels to enable type-guided learning.

## Architecture

| Parameter | Value |
|---|---|
| Layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Intermediate size | 3072 |
| Vocabulary size | 39 |
| Positional encoding | Absolute learned |
| Architecture | Post-LN BERT / ERNIE |
| Max sequence length | 512 |
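
As a rough sanity check on the table, the arithmetic below derives the per-head dimension and the encoder parameter count implied by these sizes. It is a sketch that assumes a standard BERT encoder layer (four attention projections and two feed-forward projections, all with biases, plus two LayerNorms per block). It excludes embeddings, the pooler, and the MLM head, and it is derived from the table rather than measured on the checkpoint.

```python
# Sizes taken from the architecture table above.
hidden, layers, heads, intermediate = 768, 12, 12, 3072

head_dim = hidden // heads  # dimension handled by each attention head

# Q, K, V and output projections, each with a weight matrix and a bias.
attention = 4 * (hidden * hidden + hidden)
# Two feed-forward projections (hidden -> intermediate -> hidden) with biases.
ffn = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
# Two LayerNorms per Post-LN block, each with a scale and a shift vector.
layer_norms = 2 * (2 * hidden)

per_layer = attention + ffn + layer_norms
encoder_total = layers * per_layer

print(head_dim, per_layer, encoder_total)  # 64 7087872 85054464
```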

**Vocabulary:** Special tokens `[PAD]=0, [UNK]=1, [CLS]=2, [SEP]=3, [MASK]=4, [DEL]=5, [IND]=6`;
ncRNA type labels at indices 7-34 (RNaseMRPRNA, RNasePRNA, SRPRNA, YRNA, antisenseRNA,
autocatalyticallysplicedintron, guideRNA, hammerheadribozyme, lncRNA, miRNA, miscRNA,
ncRNA, other, piRNA, premiRNA, precursorRNA, rRNA, ribozyme, sRNA, scRNA, scaRNA,
siRNA, snRNA, snoRNA, tRNA, telomeraseRNA, tmRNA, vaultRNA);
nucleotides `A=35, T=36, C=37, G=38`.

**Tokenisation note:** Input U is silently converted to T. The model was pretrained
with DNA-style T notation.
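
The index layout and the U-to-T conversion can be sketched in plain Python. This is an illustration, not the tokenizer shipped with the model: it assumes the type labels occupy indices 7-34 in the order listed above, and the `encode` helper (including its `[CLS] ... [SEP]` wrapping) is hypothetical.

```python
# Sketch: rebuild the 39-token vocabulary from the listing above.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "[DEL]", "[IND]"]
# Assumption: type labels take indices 7-34 in the order they are listed.
NCRNA_TYPES = [
    "RNaseMRPRNA", "RNasePRNA", "SRPRNA", "YRNA", "antisenseRNA",
    "autocatalyticallysplicedintron", "guideRNA", "hammerheadribozyme",
    "lncRNA", "miRNA", "miscRNA", "ncRNA", "other", "piRNA", "premiRNA",
    "precursorRNA", "rRNA", "ribozyme", "sRNA", "scRNA", "scaRNA",
    "siRNA", "snRNA", "snoRNA", "tRNA", "telomeraseRNA", "tmRNA", "vaultRNA",
]
NUCLEOTIDES = ["A", "T", "C", "G"]

VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + NCRNA_TYPES + NUCLEOTIDES)}


def encode(sequence: str) -> list[int]:
    """Hypothetical helper: map an RNA string to ids wrapped in [CLS] ... [SEP]."""
    normalised = sequence.replace("U", "T")  # U -> T, mirroring the note above
    body = [VOCAB.get(base, VOCAB["[UNK]"]) for base in normalised]
    return [VOCAB["[CLS]"], *body, VOCAB["[SEP]"]]


print(len(VOCAB))      # 39
print(encode("AUGC"))  # [2, 35, 36, 38, 37, 3]
```

For real inputs, use the tokenizer loaded with `AutoTokenizer`; the sketch only makes the index arithmetic explicit.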

## Pretraining

- **Objective:** Masked language modelling (MLM) with motif-aware masking
- **Data:** RNACentral (sequences with length <= 512)
- **Source checkpoint:** `model_state.pdparams` from the original PaddlePaddle repository

### Checkpoint selection

There is a single publicly released RNAErnie checkpoint
(`output/BERT,ERNIE,MOTIF,PROMPT/checkpoint_final/model_state.pdparams`),
corresponding to the `BERT,ERNIE,MOTIF,PROMPT` pretraining variant described in the
paper.

## Parity Verification

Hidden-state representations were verified to match (max abs diff < 7e-6) at all
13 representation levels (embedding + 12 layers) against a standalone
pure-PyTorch reference that implements the PaddlePaddle ERNIE forward pass
directly from the raw `.pdparams` weights, without running PaddlePaddle.
The reference uses PaddlePaddle's linear convention (`x @ W`, weight stored
`(in, out)`) and loads weights from the original checkpoint file in the same way as
the conversion script, so the comparison is mathematically equivalent to a live
PaddlePaddle run. Verified on GPU with PyTorch 2.7 / CUDA 12.

**Note on weight conversion:** PaddlePaddle stores `nn.Linear` weights as
`(in_features, out_features)`, the transpose of PyTorch's `(out_features, in_features)`.
All linear layer weights (attention projections, FFN, pooler, MLM transform) are
transposed during conversion; embedding tables and bias vectors are copied as-is.
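
The transposition rule can be checked on a toy layer. The sketch below is not the conversion script: the tensor names, the small sizes, and the use of float64 are illustrative assumptions, and random tensors stand in for checkpoint weights.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
in_features, out_features = 8, 16  # toy sizes, not the model's

# Stand-in for a PaddlePaddle linear weight, stored as (in_features, out_features).
paddle_weight = torch.randn(in_features, out_features, dtype=torch.float64)
paddle_bias = torch.randn(out_features, dtype=torch.float64)
x = torch.randn(2, 5, in_features, dtype=torch.float64)

# PaddlePaddle convention: y = x @ W + b
expected = x @ paddle_weight + paddle_bias

# PyTorch convention: weight is (out_features, in_features), so transpose it;
# the bias vector is copied unchanged.
torch_weight = paddle_weight.T.contiguous()
actual = F.linear(x, torch_weight, paddle_bias)

max_abs_diff = (actual - expected).abs().max().item()
print(torch_weight.shape, max_abs_diff)
```

The same max-abs-diff measure, applied to each of the 13 hidden-state tensors, is the kind of check the parity figure above refers to.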

## Implementation Notes

The original implementation uses PaddlePaddle's ERNIE/TransformerEncoderLayer
backbone. This HF port re-implements the identical Post-LN BERT architecture in
pure PyTorch and adds `attn_implementation="sdpa"` and
`attn_implementation="flash_attention_2"` support, which were not part of the
original codebase.

## Related Models

See the full [RNAErnie collection](https://huggingface.co/collections/Taykhoom/rnaernie-6a219927c11fdcccedb243db).

| Model | Context | Training data | Notes |
|---|---|---|---|
| **[RNAErnie](https://huggingface.co/Taykhoom/RNAErnie)** | **512** | **RNACentral (nts<=512)** | **This model; PaddlePaddle ERNIE backbone** |
| [RNAErnie2](https://huggingface.co/Taykhoom/RNAErnie2) | 2048 | RNACentral v22 (~31M seqs) | Retrained; PyTorch BERT |

## Usage

### Embedding generation

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/RNAErnie", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/RNAErnie", trust_remote_code=True)
model.eval()

sequences = ["AUGCAUGCAUGC", "GCUGCAUGCUAGC"]
enc = tokenizer(sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**enc)
    # Request all representation levels: hidden_states[0] is the embedding
    # output, hidden_states[1..12] are the 12 encoder layers.
    out_all = model(**enc, output_hidden_states=True)

cls_emb   = out.last_hidden_state[:, 0, :]  # (batch, 768) -- CLS token
token_emb = out.last_hidden_state           # (batch, seq_len, 768)

# Intermediate layers
layer6_emb = out_all.hidden_states[6]       # (batch, seq_len, 768)
```

### MLM logits

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/RNAErnie", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/RNAErnie", trust_remote_code=True)
model.eval()

enc = tokenizer(["ATG[MASK]ATG"], return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits  # (1, seq_len, 39)
```
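
Because the 39-way logits also cover special tokens and ncRNA type labels, it can be convenient to renormalise over the four nucleotide indices (35-38) at masked positions. The helper below is a hypothetical sketch of that post-processing step. It runs on stand-in tensors rather than model output, and the example ids assume `[CLS] ... [SEP]` wrapping.

```python
import torch

NUCLEOTIDES = ["A", "T", "C", "G"]  # vocabulary indices 35-38
MASK_ID = 4                         # [MASK]


def nucleotide_probs(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: probabilities over A/T/C/G at each [MASK] position."""
    mask_positions = input_ids == MASK_ID           # (batch, seq_len) boolean mask
    masked_logits = logits[mask_positions]          # (num_masks, 39)
    return masked_logits[:, 35:39].softmax(dim=-1)  # (num_masks, 4)


# Stand-in tensors shaped like the MLM example above (not real model output).
input_ids = torch.tensor([[2, 35, 36, 38, 4, 35, 36, 38, 3]])
logits = torch.randn(1, 9, 39)

probs = nucleotide_probs(logits, input_ids)
best = [NUCLEOTIDES[i] for i in probs.argmax(dim=-1).tolist()]
print(probs.shape, best)
```

With the real model, pass `enc["input_ids"]` and the `logits` tensor from the block above.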

### SDPA / Flash Attention 2

```python
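from transformers import AutoModel

# "flash_attention_2" additionally requires the flash-attn package, a CUDA GPU,
# and half-precision weights (fp16 or bf16).
model = AutoModel.from_pretrained(
    "Taykhoom/RNAErnie",
    attn_implementation="sdpa",   # or "flash_attention_2"
    trust_remote_code=True,
)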
```

### Fine-tuning

Fine-tuning follows standard Hugging Face conventions. For sequence-level tasks, use
the CLS token embedding (`last_hidden_state[:, 0, :]`) as input to a classification
head. For type-guided fine-tuning (as in the paper), prepend the ncRNA type label
token to the input.
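
A sequence-level head on the CLS embedding might look like the sketch below. The `ClsClassifier` class, its dropout rate, and the label count are illustrative assumptions rather than part of the released code, and a random tensor stands in for `last_hidden_state`.

```python
import torch
from torch import nn


class ClsClassifier(nn.Module):
    """Hypothetical head: dropout followed by a linear layer on the [CLS] embedding."""

    def __init__(self, hidden_size: int = 768, num_labels: int = 2, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(hidden_size, num_labels)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        cls_emb = last_hidden_state[:, 0, :]    # (batch, hidden_size)
        return self.out(self.dropout(cls_emb))  # (batch, num_labels)


head = ClsClassifier(num_labels=3)
dummy_hidden = torch.randn(4, 16, 768)  # stand-in for model(**enc).last_hidden_state
labels = torch.tensor([0, 2, 1, 0])

logits = head(dummy_hidden)
loss = nn.functional.cross_entropy(logits, labels)
print(logits.shape, loss.item())
```

In a training loop, the backbone output replaces `dummy_hidden`, and both the backbone and the head receive gradients unless the backbone is frozen.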

## Citation

```bibtex
@article{wang2024_rnaernie,
  title   = {Multi-purpose {RNA} language modelling with motif-aware pretraining and type-guided fine-tuning},
  author  = {Wang, Ning and Bian, Jiang and Li, Yuchen and Li, Xuhong and Mumtaz, Shahid and Kong, Linghe and Xiong, Haoyi},
  journal = {Nature Machine Intelligence},
  volume  = {6},
  pages   = {548--557},
  year    = {2024},
  doi     = {10.1038/s42256-024-00836-4}
}
```

## Credits

Original model and code by Wang et al. Source: [GitHub](https://github.com/CatIIIIIIII/RNAErnie).
The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
and reviewed manually by Taykhoom Dalal.

## License

Apache 2.0, following the original repository.