---
language:
- rna
library_name: transformers
tags:
- RNA
- language-model
license: apache-2.0
---
# RNAErnie2
RNAErnie2 is a BERT-based RNA language model trained from scratch on a large-scale RNA
sequence dataset with up to 2048-nucleotide context length. It is a retrained successor
to RNAErnie that replaces the PaddlePaddle-based ERNIE backbone with a standard PyTorch
BERT architecture, extends the pretraining corpus to RNACentral v22 (~31M sequences,
length <= 2048), and switches to an RNA-native vocabulary (U instead of T).
## Architecture
| Parameter | Value |
|---|---|
| Layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Intermediate size | 3072 |
| Vocabulary size | 11 |
| Positional encoding | Absolute learned |
| Architecture | Post-LN BERT / BertForMaskedLM |
| Max sequence length | 2048 |
**Vocabulary:** `[PAD]=0, [UNK]=1, [CLS]=2, [EOS]=3, [SEP]=4, [MASK]=5, A=6, U=7, C=8, G=9, N=10`
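As an illustration of the mapping above, the sketch below encodes a raw sequence into token IDs using only the vocabulary listed in this card. The helper name `encode_nucleotides` is hypothetical, and upper-casing the input and mapping unrecognised characters to `[UNK]` are assumptions rather than documented tokenizer behaviour. Special tokens such as `[CLS]` are deliberately omitted.

```python
# Vocabulary exactly as listed in this model card.
VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[EOS]": 3, "[SEP]": 4, "[MASK]": 5,
    "A": 6, "U": 7, "C": 8, "G": 9, "N": 10,
}


def encode_nucleotides(sequence: str) -> list[int]:
    """Map a nucleotide string to token IDs, one ID per nucleotide.

    Assumptions (not confirmed against the released tokenizer): input is
    upper-cased, and any character outside A/U/C/G/N becomes [UNK].
    Special tokens are left out.
    """
    rna = sequence.upper().replace("T", "U")  # DNA-style T becomes U
    return [VOCAB.get(base, VOCAB["[UNK]"]) for base in rna]


print(encode_nucleotides("AUGC"))  # [6, 7, 9, 8]
print(encode_nucleotides("ATGN"))  # [6, 7, 9, 10]
```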
## Pretraining
- **Objective:** Masked language modelling (MLM)
- **Data:** RNACentral v22, ~31 million RNA sequences with length <= 2048
- **Source checkpoint:** [`LLM-EDA/RNAErnie`](https://huggingface.co/LLM-EDA/RNAErnie) on HuggingFace Hub
- **Tokenisation note:** Sequences use U (not T). Input T is silently converted to U by the tokenizer.
### Checkpoint selection
There is a single publicly released RNAErnie2 checkpoint. The weights are taken from
[`LLM-EDA/RNAErnie`](https://huggingface.co/LLM-EDA/RNAErnie) with one minor
adjustment: `cls.predictions.decoder.bias` is stored explicitly (it was implicitly
tied to `cls.predictions.bias` in the original save and was absent from the file).
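The adjustment described above can be sketched as a small state-dict patch. This is an illustration, not the conversion script that was actually used: the helper name `add_explicit_decoder_bias` is hypothetical, and only the two key names come from this card.

```python
import torch

DECODER_BIAS = "cls.predictions.decoder.bias"
TIED_BIAS = "cls.predictions.bias"


def add_explicit_decoder_bias(state_dict: dict) -> dict:
    """Return a copy of state_dict with the decoder bias stored explicitly.

    If the decoder bias key is absent, it is filled with a clone of the
    bias it was tied to; an existing entry is left untouched.
    """
    patched = dict(state_dict)
    if DECODER_BIAS not in patched:
        # clone() gives the new key its own storage instead of an alias
        patched[DECODER_BIAS] = patched[TIED_BIAS].clone()
    return patched


# Toy state dict standing in for the original checkpoint (vocab size 11).
original = {TIED_BIAS: torch.zeros(11)}
patched = add_explicit_decoder_bias(original)
print(sorted(patched))
```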
## Parity Verification
Hidden-state representations and MLM logits match the original `BertForMaskedLM`
to within a maximum absolute difference of 2e-5 at all 13 representation levels
(embedding output plus 12 layers). Verification was run on GPU with PyTorch 2.7 / CUDA 12.
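A check of this kind can be sketched as follows. The function name `max_abs_diff_per_level` is hypothetical, and the random tensors merely stand in for the `hidden_states` tuples (embedding output plus 12 layers) that the two implementations would return with `output_hidden_states=True`.

```python
import torch


def max_abs_diff_per_level(reference, candidate):
    """Return the maximum absolute difference for each pair of tensors."""
    if len(reference) != len(candidate):
        raise ValueError("both models must return the same number of levels")
    return [(ref - cand).abs().max().item() for ref, cand in zip(reference, candidate)]


torch.manual_seed(0)
# 13 levels of shape (batch, seq_len, hidden)
reference = [torch.randn(2, 16, 768) for _ in range(13)]
# Simulate a reimplementation whose outputs differ only by small noise
candidate = [h + 1e-6 * torch.randn_like(h) for h in reference]

diffs = max_abs_diff_per_level(reference, candidate)
assert len(diffs) == 13
assert max(diffs) < 2e-5  # tolerance quoted in this card
```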
## Implementation Notes
Custom BERT implementation (`modeling_rnaernie2.py`) with eager, SDPA, and Flash
Attention 2 backends, following the architecture of
[`Taykhoom/BERT-updated`](https://huggingface.co/Taykhoom/BERT-updated).
The original [`LLM-EDA/RNAErnie`](https://huggingface.co/LLM-EDA/RNAErnie) used
standard HF BERT with no custom attention backends.
## Related Models
See the full [RNAErnie collection](https://huggingface.co/collections/Taykhoom/rnaernie-6a219927c11fdcccedb243db).
| Model | Context | Training data | Notes |
|---|---|---|---|
| [RNAErnie](https://huggingface.co/Taykhoom/RNAErnie) | 512 | RNACentral (nts<=512) | Original; PaddlePaddle backbone |
| **[RNAErnie2](https://huggingface.co/Taykhoom/RNAErnie2)** | **2048** | **RNACentral v22 (~31M seqs)** | **This model; PyTorch BERT** |
## Usage
### Embedding generation
```python
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Taykhoom/RNAErnie2", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/RNAErnie2", trust_remote_code=True)
model.eval()
sequences = ["AUGCAUGCAUGC", "GCUGCAUGCUAGC"]
enc = tokenizer(sequences, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**enc)

cls_emb = out.last_hidden_state[:, 0, :]  # (batch, 768) -- CLS token
token_emb = out.last_hidden_state         # (batch, seq_len, 768)

# Intermediate layers (index 0 is the embedding output)
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer6_emb = out_all.hidden_states[6]     # (batch, seq_len, 768)
```
### MLM logits
```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("Taykhoom/RNAErnie2", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/RNAErnie2", trust_remote_code=True)
model.eval()
enc = tokenizer(["AUG[MASK]AUG"], return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits  # (1, seq_len, 11)
```
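To turn logits like these into nucleotide predictions, one option is to take the argmax over the vocabulary at each `[MASK]` position. The sketch below assumes the ID-to-token mapping from the Vocabulary line of this card (`[MASK]=5`). The helper name `predict_masked` is hypothetical, and synthetic tensors stand in for real model output so the snippet runs without downloading the checkpoint.

```python
import torch

ID_TO_TOKEN = ["[PAD]", "[UNK]", "[CLS]", "[EOS]", "[SEP]", "[MASK]", "A", "U", "C", "G", "N"]
MASK_ID = 5


def predict_masked(input_ids: torch.Tensor, logits: torch.Tensor) -> list[tuple[int, int, str]]:
    """Return (batch_index, position, predicted_token) for every [MASK]."""
    predictions = logits.argmax(dim=-1)  # (batch, seq_len)
    batch_idx, positions = (input_ids == MASK_ID).nonzero(as_tuple=True)
    return [
        (b, p, ID_TO_TOKEN[predictions[b, p].item()])
        for b, p in zip(batch_idx.tolist(), positions.tolist())
    ]


# Synthetic example: A U G [MASK] A U G, with no special tokens added.
input_ids = torch.tensor([[6, 7, 9, 5, 6, 7, 9]])
logits = torch.zeros(1, 7, 11)
logits[0, 3, 8] = 5.0  # make "C" (ID 8) the top prediction at the mask
print(predict_masked(input_ids, logits))  # [(0, 3, 'C')]
```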
### SDPA / Flash Attention 2
```python
model = AutoModel.from_pretrained(
    "Taykhoom/RNAErnie2",
    attn_implementation="sdpa",  # or "flash_attention_2"
    trust_remote_code=True,
)
```
### Fine-tuning
Fine-tuning follows standard Hugging Face conventions. For sequence-level tasks, use the
CLS token embedding (`last_hidden_state[:, 0, :]`) as input to a classification head.
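A minimal sketch of such a head is shown below. The class name `ClsClassificationHead`, the dropout rate, and the number of labels are illustrative assumptions; only the hidden size of 768 and the use of the CLS position come from this card. A random tensor stands in for `last_hidden_state`.

```python
import torch
from torch import nn


class ClsClassificationHead(nn.Module):
    """Linear classifier over the CLS token embedding."""

    def __init__(self, hidden_size: int = 768, num_labels: int = 2, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        cls_emb = last_hidden_state[:, 0, :]  # (batch, hidden_size)
        return self.classifier(self.dropout(cls_emb))  # (batch, num_labels)


head = ClsClassificationHead(num_labels=3)
dummy_hidden = torch.randn(4, 20, 768)  # stand-in for encoder output
class_logits = head(dummy_hidden)
print(class_logits.shape)  # torch.Size([4, 3])
```

In an actual fine-tuning run, the encoder output would replace `dummy_hidden`, and the head could be trained with a loss such as `nn.CrossEntropyLoss`, either jointly with the encoder or on top of frozen embeddings.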
## Citation
```bibtex
@article{wang2024_rnaernie,
title = {Multi-purpose {RNA} language modelling with motif-aware pretraining and type-guided fine-tuning},
author = {Wang, Ning and Bian, Jiang and Li, Yuchen and Li, Xuhong and Mumtaz, Shahid and Kong, Linghe and Xiong, Haoyi},
journal = {Nature Machine Intelligence},
volume = {6},
pages = {548--557},
year = {2024},
doi = {10.1038/s42256-024-00836-4}
}
```
## Credits
Original model and code by Wang et al. Source: [GitHub](https://github.com/CatIIIIIIII/RNAErnie) /
[HuggingFace](https://huggingface.co/LLM-EDA/RNAErnie).
The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
and reviewed manually by Taykhoom Dalal.
## License
Apache 2.0, following the original repository.