RiNALMo-mega

Minimal HuggingFace port of the mega (~150M parameter) variant of RiNALMo -- a general-purpose RNA language model pre-trained on 36 million non-coding RNA sequences.

Architecture

Parameter	Value
Layers	30
Attention heads	20
Embedding dimension	640
FFN hidden dimension	1706 (SwiGLU, floor(2/3 x 4 x embed))
Vocabulary size	22
Positional encoding	RoPE (base=10000, non-interleaved)
Architecture	Pre-LN Transformer with SwiGLU FFN
Max sequence length	~8192 (practical; RoPE has no hard limit)
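The derived sizes in the table can be checked with a few lines of arithmetic. The sketch below recomputes the per-head dimension and the SwiGLU hidden size, then makes a rough parameter estimate from the weight matrices alone. That estimate assumes four 640 x 640 attention projections and three SwiGLU matrices per block and ignores biases, norms, and embeddings, so it is an approximation rather than an exact count.

```python
EMBED_DIM = 640
NUM_HEADS = 20
NUM_LAYERS = 30

# Per-head dimension: the embedding is split evenly across the heads.
head_dim = EMBED_DIM // NUM_HEADS

# SwiGLU hidden size: floor(2/3 * 4 * embed), done in integers to avoid float rounding.
ffn_dim = (2 * 4 * EMBED_DIM) // 3

# Rough estimate. ASSUMPTION: 4 attention projections and 3 SwiGLU matrices
# per block; biases, layer norms, and embeddings are ignored.
attn_params = 4 * EMBED_DIM * EMBED_DIM
ffn_params = 3 * EMBED_DIM * ffn_dim
approx_params = NUM_LAYERS * (attn_params + ffn_params)

print(head_dim, ffn_dim, approx_params)  # prints: 32 1706 147417600
```

Under these assumptions the estimate lands close to the ~150M figure quoted for this variant.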

Vocabulary (index order): <cls> (0), <pad> (1), <eos> (2), <unk> (3), <mask> (4), A (5), C (6), G (7), T (8), I (9), R (10), Y (11), K (12), M (13), S (14), W (15), B (16), D (17), H (18), V (19), N (20), - (21).

Note: the tokenizer converts U -> T before encoding (the model was trained on T).
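To make the index order and the U -> T conversion concrete, here is a minimal pure-Python sketch of the mapping. The helper name `encode_residues` is hypothetical, it does not add `<cls>` or `<eos>`, and mapping unknown characters to `<unk>` is an assumption; the shipped tokenizer remains the reference.

```python
# Vocabulary in index order, as listed above.
VOCAB = ["<cls>", "<pad>", "<eos>", "<unk>", "<mask>",
         "A", "C", "G", "T", "I", "R", "Y", "K", "M", "S", "W",
         "B", "D", "H", "V", "N", "-"]
TOKEN_TO_ID = {token: index for index, token in enumerate(VOCAB)}


def encode_residues(sequence):
    """Map an RNA string to token ids (hypothetical helper, no special tokens)."""
    # The model was trained on T, so U is rewritten before lookup.
    sequence = sequence.replace("U", "T")
    # ASSUMPTION: characters outside the vocabulary fall back to <unk>.
    return [TOKEN_TO_ID.get(ch, TOKEN_TO_ID["<unk>"]) for ch in sequence]


print(encode_residues("ACUG"))  # prints: [5, 6, 8, 7]
```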

Pretraining

Objective: Masked language modeling (BERT-style, 15% mask rate)
Data: 36 million non-coding RNA sequences from multiple databases
Source checkpoint: rinalmo_mega_pretrained.pt from Zenodo 15043668
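For readers unfamiliar with the objective, the following sketch shows the generic BERT masking recipe applied to this vocabulary. It is not the original training code: the 80/10/10 split (replace with `<mask>`, replace with a random residue, keep unchanged) is the usual BERT convention and is assumed here, as is the use of -100 as the ignored label. The function name `bert_style_mask` is hypothetical.

```python
import random

MASK_ID = 4
SPECIAL_IDS = {0, 1, 2, 3, 4}      # <cls>, <pad>, <eos>, <unk>, <mask>
RESIDUE_IDS = list(range(5, 22))   # A ... '-' in the vocabulary above
IGNORE = -100                      # label for positions that carry no loss


def bert_style_mask(ids, mask_rate=0.15, rng=None):
    """Return (corrupted_inputs, labels) for one tokenized sequence."""
    rng = rng or random.Random()
    inputs, labels = list(ids), [IGNORE] * len(ids)
    for pos, token in enumerate(ids):
        # Special tokens are never selected; others are selected with prob. mask_rate.
        if token in SPECIAL_IDS or rng.random() >= mask_rate:
            continue
        labels[pos] = token          # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:               # ASSUMED 80%: replace with <mask>
            inputs[pos] = MASK_ID
        elif roll < 0.9:             # ASSUMED 10%: replace with a random residue
            inputs[pos] = rng.choice(RESIDUE_IDS)
        # remaining ASSUMED 10%: leave the token unchanged
    return inputs, labels
```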

Checkpoint selection

The mega variant (150M params) offers a strong quality/cost tradeoff. Use micro for fast inference; use giga for maximum representation quality.

Parity Verification

All 31 representation levels (the embedding output plus 30 transformer layers) were verified to be bit-exact (max abs diff = 0.00) against a pure-PyTorch reference that loads the original weights. The weight mapping was verified for all 390 per-block tensors (13 per block across 30 blocks). The eager and SDPA implementations agree to within 4e-6 on padded batches.
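The "max abs diff" figure is the largest element-wise absolute difference between two outputs. A minimal sketch of that metric on flat lists of floats follows; the helper name `max_abs_diff` is hypothetical and this is not the verification script itself.

```python
def max_abs_diff(values_a, values_b):
    """Largest element-wise absolute difference between two equal-length lists."""
    if len(values_a) != len(values_b):
        raise ValueError("inputs must have the same number of elements")
    return max((abs(a - b) for a, b in zip(values_a, values_b)), default=0.0)


# Bit-exact outputs give exactly 0.0; a backend comparison is checked against a tolerance.
print(max_abs_diff([0.25, -1.5], [0.25, -1.5]) == 0.0)  # prints: True
print(max_abs_diff([1.0, 2.0], [1.0, 2.5]) <= 4e-6)     # prints: False
```

With PyTorch tensors the same quantity is `(a - b).abs().max().item()`.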

Related Models

See the full RiNALMo collection.

Model	Parameters	Notes
RiNALMo-micro	35M	Smallest variant
RiNALMo-mega	150M	This model
RiNALMo-giga	650M	Full model

Usage

Embedding generation

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/RiNALMo-mega", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/RiNALMo-mega", trust_remote_code=True)
model.eval()

sequences = ["ACUUUGGCCA", "CCCGGU"]
enc = tokenizer(sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**enc)

cls_emb   = out.last_hidden_state[:, 0, :]   # (batch, 640) -- CLS token
token_emb = out.last_hidden_state             # (batch, seq_len, 640)

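# Intermediate layers: hidden_states[0] is the embedding output,
# hidden_states[i] is the output of transformer block i
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer15_emb = out_all.hidden_states[15]       # after block 15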

MLM logits

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/RiNALMo-mega", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/RiNALMo-mega", trust_remote_code=True)
model.eval()

enc = tokenizer(["ACU<mask>UGGCCA"], return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits   # (1, seq_len, 22)
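To read a prediction out of the logits, take the row at the `<mask>` position, apply a softmax, and rank the 22 vocabulary entries. The sketch below does this in pure Python for a single row; `top_tokens` is a hypothetical helper, and the example row is made up for illustration rather than taken from the model.

```python
import math

VOCAB = ["<cls>", "<pad>", "<eos>", "<unk>", "<mask>",
         "A", "C", "G", "T", "I", "R", "Y", "K", "M", "S", "W",
         "B", "D", "H", "V", "N", "-"]


def top_tokens(logits_row, k=3):
    """Return the k most probable (token, probability) pairs for one position."""
    if len(logits_row) != len(VOCAB):
        raise ValueError("expected one logit per vocabulary entry")
    peak = max(logits_row)                       # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits_row]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(VOCAB)), key=lambda i: probs[i], reverse=True)
    return [(VOCAB[i], probs[i]) for i in ranked[:k]]


# Made-up logits that favour G (index 7).
example_row = [0.0] * 22
example_row[7] = 5.0
print(top_tokens(example_row, k=1)[0][0])  # prints: G
```

With the snippet above, a row can be obtained as `logits[0, pos].tolist()`, where `pos` is the index of the `<mask>` token in `enc["input_ids"][0]`. Note that predictions use T rather than U, since the model was trained on T.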

Faster attention backends

import torch
from transformers import AutoModel

# SDPA (PyTorch 2.0+)
model = AutoModel.from_pretrained("Taykhoom/RiNALMo-mega", trust_remote_code=True,
                                   attn_implementation="sdpa")

# Flash Attention 2 (requires the flash-attn package and a CUDA GPU; fp16/bf16 only)
# Note: older transformers releases spell the dtype keyword as torch_dtype.
model = AutoModel.from_pretrained("Taykhoom/RiNALMo-mega", trust_remote_code=True,
                                   attn_implementation="flash_attention_2",
                                   dtype=torch.bfloat16)

Fine-tuning

The model follows standard HF conventions. For sequence-level tasks, either mean-pool the final hidden states over non-padding positions or use the CLS token embedding as input to a prediction head.
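Pooling over non-padding positions means averaging only the token vectors whose attention-mask entry is 1. A minimal pure-Python sketch for a single sequence is shown below; `masked_mean_pool` is a hypothetical helper, and the toy vectors are illustrative.

```python
def masked_mean_pool(hidden, attention_mask):
    """Average token vectors where attention_mask is 1.

    hidden: list of seq_len vectors (each a list of floats)
    attention_mask: list of seq_len ints, 1 for real tokens and 0 for padding
    """
    if len(hidden) != len(attention_mask):
        raise ValueError("hidden and attention_mask must have the same length")
    kept = [vector for vector, keep in zip(hidden, attention_mask) if keep]
    if not kept:
        raise ValueError("sequence contains only padding")
    dim = len(kept[0])
    return [sum(vector[d] for vector in kept) / len(kept) for d in range(dim)]


# The padded third position is excluded from the average.
print(masked_mean_pool([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], [1, 1, 0]))
# prints: [2.0, 3.0]
```

On a batch of PyTorch tensors the equivalent is `(h * m.unsqueeze(-1)).sum(dim=1) / m.sum(dim=1, keepdim=True)`, with `h = out.last_hidden_state` and `m = enc["attention_mask"].to(h.dtype)`, assuming the tokenizer returns an attention mask as HF tokenizers normally do when padding.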

Implementation Notes

The original RiNALMo uses flash_attn for attention during training. This HF port implements eager (standard PyTorch), SDPA, and flash_attention_2 as separate backends selectable via attn_implementation. The SDPA and flash_attention_2 backends were not part of the original codebase.

The model uses a non-standard Pre-LN residual in the attention sub-block: the residual connection is taken from the normalized input (i.e., x = attn_ln(x); x = x + attn(x)) rather than from the un-normalized input as in standard Pre-LN (x = x + attn(attn_ln(x))). The FFN sub-block uses standard Pre-LN.
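The difference between the two residual forms is easy to see on a toy vector. The sketch below is illustrative only: `layer_norm` omits the learnable scale and shift, and `toy_sublayer` is a stand-in for attention, not the model's actual module.

```python
import math


def layer_norm(x, eps=1e-5):
    """Plain layer norm over one vector (no learnable affine, for illustration)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def toy_sublayer(h):
    """Stand-in for the attention sub-layer (hypothetical)."""
    return [0.5 * v for v in h]


def standard_pre_ln(x, sublayer):
    # Standard Pre-LN: the residual is the un-normalized input x.
    h = layer_norm(x)
    return [xi + si for xi, si in zip(x, sublayer(h))]


def normalized_residual(x, sublayer):
    # Variant described above: the residual is the normalized input h.
    h = layer_norm(x)
    return [hi + si for hi, si in zip(h, sublayer(h))]


x = [1.0, 2.0, 3.0]
print(standard_pre_ln(x, toy_sublayer))
print(normalized_residual(x, toy_sublayer))
```

The two outputs differ by exactly `x - layer_norm(x)`, which is why a port that silently uses the standard form would not reproduce the original activations.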

TokenDropout rescales embeddings by (1 - mask_ratio_train) / (1 - mask_ratio_observed) even at inference, consistent with the original training code.
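The rescaling factor can be computed directly from the input ids. In the sketch below, `token_dropout_scale` is a hypothetical helper; measuring the observed ratio over non-padding tokens and the example training ratio of 0.12 (15% mask rate times an assumed 80% `<mask>` replacement) are assumptions, not values stated in this card.

```python
MASK_ID = 4
PAD_ID = 1


def token_dropout_scale(input_ids, mask_ratio_train):
    """(1 - mask_ratio_train) / (1 - mask_ratio_observed) for one sequence."""
    # ASSUMPTION: the observed ratio is measured over non-padding tokens only.
    real = [token for token in input_ids if token != PAD_ID]
    if not real:
        raise ValueError("sequence contains only padding")
    observed = sum(token == MASK_ID for token in real) / len(real)
    if observed == 1.0:
        raise ValueError("every token is masked; scale is undefined")
    return (1.0 - mask_ratio_train) / (1.0 - observed)


ASSUMED_TRAIN_RATIO = 0.12  # assumption for illustration only

# No <mask> tokens at inference: embeddings are still scaled by 1 - mask_ratio_train.
print(token_dropout_scale([0, 5, 6, 7, 2], ASSUMED_TRAIN_RATIO))
# One <mask> among five real tokens: the scale rises above that baseline.
print(token_dropout_scale([0, 5, 4, 7, 2, 1, 1], ASSUMED_TRAIN_RATIO))
```

The practical consequence is that embeddings of unmasked inputs are scaled by a constant below 1 at inference, so a port that skips this step would not match the original outputs.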

Citation

@article{penic2025_rinalmo,
  title   = {RiNALMo: general-purpose {RNA} language models can generalize well on structure prediction tasks},
  author  = {Penić, Rafael Josip and Vlašić, Tin and Huber, Roland G. and Wan, Yue and Šikić, Mile},
  journal = {Nature Communications},
  volume  = {16},
  pages   = {5671},
  year    = {2025},
  doi     = {10.1038/s41467-025-60872-5}
}

Credits

Original model and code by Penić et al. Source: GitHub lbcb-sci/RiNALMo. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.

License

Apache 2.0 (code) / CC BY 4.0 (model weights), following the original repository.


Paper: RiNALMo: General-Purpose RNA Language Models Can Generalize Well on Structure Prediction Tasks (arXiv 2403.00043, published Feb 29, 2024)