Instructions to use Taykhoom/RNAErnie2 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Taykhoom/RNAErnie2 with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("fill-mask", model="Taykhoom/RNAErnie2", trust_remote_code=True)# Load model directly from transformers import AutoModelForMaskedLM model = AutoModelForMaskedLM.from_pretrained("Taykhoom/RNAErnie2", trust_remote_code=True, dtype="auto") - Notebooks
- Google Colab
- Kaggle
| language: | |
| - rna | |
| library_name: transformers | |
| tags: | |
| - RNA | |
| - language-model | |
| license: apache-2.0 | |
| # RNAErnie2 | |
| RNAErnie2 is a BERT-based RNA language model trained from scratch on a large-scale RNA | |
| sequence dataset with up to 2048-nucleotide context length. It is a retrained successor | |
| to RNAErnie that replaces the PaddlePaddle-based ERNIE backbone with a standard PyTorch | |
| BERT architecture, extends the pretraining corpus to RNACentral v22 (~31M sequences, | |
| length <= 2048), and switches to an RNA-native vocabulary (U instead of T). | |
| ## Architecture | |
| | Parameter | Value | | |
| |---|---| | |
| | Layers | 12 | | |
| | Attention heads | 12 | | |
| | Embedding dimension | 768 | | |
| | Intermediate size | 3072 | | |
| | Vocabulary size | 11 | | |
| | Positional encoding | Absolute learned | | |
| | Architecture | Post-LN BERT / BertForMaskedLM | | |
| | Max sequence length | 2048 | | |
| **Vocabulary:** `[PAD]=0, [UNK]=1, [CLS]=2, [EOS]=3, [SEP]=4, [MASK]=5, A=6, U=7, C=8, G=9, N=10` | |
| ## Pretraining | |
| - **Objective:** Masked language modelling (MLM) | |
| - **Data:** RNACentral v22, ~31 million RNA sequences with length <= 2048 | |
| - **Source checkpoint:** [`LLM-EDA/RNAErnie`](https://huggingface.co/LLM-EDA/RNAErnie) on HuggingFace Hub | |
| - **Tokenisation note:** Sequences use U (not T). Input T is silently converted to U by the tokenizer. | |
| ### Checkpoint selection | |
| There is a single publicly released RNAErnie2 checkpoint. The weights are taken from | |
| [`LLM-EDA/RNAErnie`](https://huggingface.co/LLM-EDA/RNAErnie) with one minor | |
| adjustment: `cls.predictions.decoder.bias` is stored explicitly (it was implicitly | |
| tied to `cls.predictions.bias` in the original save and was absent from the file). | |
| ## Parity Verification | |
| Hidden-state representations and MLM logits verified identical (max abs diff < 2e-5) | |
| to the original `BertForMaskedLM` at all 13 representation levels (embedding + 12 layers). | |
| Verified on GPU with PyTorch 2.7 / CUDA 12. | |
| ## Implementation Notes | |
| Custom BERT implementation (`modeling_rnaernie2.py`) with eager, SDPA, and Flash | |
| Attention 2 backends, following the architecture of | |
| [`Taykhoom/BERT-updated`](https://huggingface.co/Taykhoom/BERT-updated). | |
| The original [`LLM-EDA/RNAErnie`](https://huggingface.co/LLM-EDA/RNAErnie) used | |
| standard HF BERT with no custom attention backends. | |
| ## Related Models | |
| See the full [RNAErnie collection](https://huggingface.co/collections/Taykhoom/rnaernie-6a219927c11fdcccedb243db). | |
| | Model | Context | Training data | Notes | | |
| |---|---|---|---| | |
| | [RNAErnie](https://huggingface.co/Taykhoom/RNAErnie) | 512 | RNACentral (nts<=512) | Original; PaddlePaddle backbone | | |
| | **[RNAErnie2](https://huggingface.co/Taykhoom/RNAErnie2)** | **2048** | **RNACentral v22 (~31M seqs)** | **This model; PyTorch BERT** | | |
| ## Usage | |
| ### Embedding generation | |
| ```python | |
| import torch | |
| from transformers import AutoTokenizer, AutoModel | |
| tokenizer = AutoTokenizer.from_pretrained("Taykhoom/RNAErnie2", trust_remote_code=True) | |
| model = AutoModel.from_pretrained("Taykhoom/RNAErnie2", trust_remote_code=True) | |
| model.eval() | |
| sequences = ["AUGCAUGCAUGC", "GCUGCAUGCUAGC"] | |
| enc = tokenizer(sequences, return_tensors="pt", padding=True) | |
| with torch.no_grad(): | |
| out = model(**enc) | |
| cls_emb = out.last_hidden_state[:, 0, :] # (batch, 768) -- CLS token | |
| token_emb = out.last_hidden_state # (batch, seq_len, 768) | |
| # Intermediate layers | |
| out_all = model(**enc, output_hidden_states=True) | |
| layer6_emb = out_all.hidden_states[6] # (batch, seq_len, 768) | |
| ``` | |
| ### MLM logits | |
| ```python | |
| import torch | |
| from transformers import AutoTokenizer, AutoModelForMaskedLM | |
| tokenizer = AutoTokenizer.from_pretrained("Taykhoom/RNAErnie2", trust_remote_code=True) | |
| model = AutoModelForMaskedLM.from_pretrained("Taykhoom/RNAErnie2", trust_remote_code=True) | |
| model.eval() | |
| enc = tokenizer(["AUG[MASK]AUG"], return_tensors="pt") | |
| with torch.no_grad(): | |
| logits = model(**enc).logits # (1, seq_len, 11) | |
| ``` | |
| ### SDPA / Flash Attention 2 | |
| ```python | |
| model = AutoModel.from_pretrained( | |
| "Taykhoom/RNAErnie2", | |
| attn_implementation="sdpa", # or "flash_attention_2" | |
| trust_remote_code=True, | |
| ) | |
| ``` | |
| ### Fine-tuning | |
| Standard HF conventions. For sequence-level tasks, use the CLS token embedding | |
| (`last_hidden_state[:, 0, :]`) as input to a classification head. | |
| ## Citation | |
| ```bibtex | |
| @article{wang2024_rnaernie, | |
| title = {Multi-purpose {RNA} language modelling with motif-aware pretraining and type-guided fine-tuning}, | |
| author = {Wang, Ning and Bian, Jiang and Li, Yuchen and Li, Xuhong and Mumtaz, Shahid and Kong, Linghe and Xiong, Haoyi}, | |
| journal = {Nature Machine Intelligence}, | |
| volume = {6}, | |
| pages = {548--557}, | |
| year = {2024}, | |
| doi = {10.1038/s42256-024-00836-4} | |
| } | |
| ``` | |
| ## Credits | |
| Original model and code by Wang et al. Source: [GitHub](https://github.com/CatIIIIIIII/RNAErnie) / | |
| [HuggingFace](https://huggingface.co/LLM-EDA/RNAErnie). | |
| The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code) | |
| and reviewed manually by Taykhoom Dalal. | |
| ## License | |
| Apache 2.0, following the original repository. | |