RiNALMo-mega

Minimal Hugging Face port of the mega (~150M parameter) variant of RiNALMo, a general-purpose RNA language model pre-trained on 36 million non-coding RNA sequences.

Architecture

| Parameter | Value |
| --- | --- |
| Layers | 30 |
| Attention heads | 20 |
| Embedding dimension | 640 |
| FFN hidden dimension | 1706 (SwiGLU, floor(2/3 x 4 x embed)) |
| Vocabulary size | 22 |
| Positional encoding | RoPE (base=10000, non-interleaved) |
| Architecture | Pre-LN Transformer with SwiGLU FFN |
| Max sequence length | ~8192 (practical; RoPE has no hard limit) |
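
The FFN width in the table follows from the embedding dimension, which a few lines of arithmetic can confirm. This is only a sanity-check sketch; the per-head dimension is not listed in the table and is derived here on the assumption that the embedding is split evenly across heads.

```python
# Values taken from the architecture table.
embed_dim = 640
num_heads = 20

# SwiGLU FFN width: floor(2/3 * 4 * embed_dim), done in integer arithmetic
# so that no floating-point rounding is involved.
ffn_hidden = (2 * 4 * embed_dim) // 3

# Per-head dimension (assumption: embedding split evenly across heads).
head_dim = embed_dim // num_heads

print(ffn_hidden, head_dim)  # 1706 32
```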

Vocabulary (index order): <cls> (0), <pad> (1), <eos> (2), <unk> (3), <mask> (4), A (5), C (6), G (7), T (8), I (9), R (10), Y (11), K (12), M (13), S (14), W (15), B (16), D (17), H (18), V (19), N (20), - (21).

Note: the tokenizer converts U -> T before encoding (the model was trained on T).
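
The mapping from residues to indices can be sketched in plain Python. The helper `encode_residues` below is hypothetical and is not the tokenizer shipped with the model: it only applies the U -> T substitution and the vocabulary lookup described above, and it adds no special tokens.

```python
# Vocabulary in index order, copied from the list above.
VOCAB = ["<cls>", "<pad>", "<eos>", "<unk>", "<mask>",
         "A", "C", "G", "T", "I", "R", "Y", "K", "M", "S", "W",
         "B", "D", "H", "V", "N", "-"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def encode_residues(seq: str) -> list[int]:
    """Hypothetical helper: replace U with T, then look up each character.

    Characters outside the vocabulary map to <unk>.
    """
    seq = seq.replace("U", "T")
    unk = TOKEN_TO_ID["<unk>"]
    return [TOKEN_TO_ID.get(ch, unk) for ch in seq]


print(encode_residues("ACUUUGGCCA"))  # [5, 6, 8, 8, 8, 7, 7, 6, 6, 5]
```

For real inputs, use the `AutoTokenizer` shown in the Usage section, which also handles special tokens and padding.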

Pretraining

  • Objective: Masked language modeling (BERT-style, 15% mask rate)
  • Data: 36 million non-coding RNA sequences from multiple databases
  • Source checkpoint: rinalmo_mega_pretrained.pt from Zenodo 15043668
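
As an illustration of the objective, here is a minimal sketch of BERT-style masking over token ids. The 80/10/10 split (mask / random token / unchanged) is the convention from the BERT paper and is an assumption here; the model card only states "BERT-style, 15% mask rate". The function name and the `-100` ignore label are illustrative choices, not part of the original training code.

```python
import random

MASK_ID = 4                       # <mask> in the vocabulary listed above
SPECIAL_IDS = {0, 1, 2, 3, 4}     # <cls>, <pad>, <eos>, <unk>, <mask>
RESIDUE_IDS = list(range(5, 22))  # A ... "-"
IGNORE = -100                     # common "no loss at this position" label


def bert_style_mask(ids, mask_rate=0.15, seed=0):
    """Return (corrupted_ids, labels) for one sequence of token ids."""
    rng = random.Random(seed)
    inputs, labels = list(ids), [IGNORE] * len(ids)
    for i, tok in enumerate(ids):
        # Never select special tokens; select others with prob. mask_rate.
        if tok in SPECIAL_IDS or rng.random() >= mask_rate:
            continue
        labels[i] = tok                 # the model must predict the original
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK_ID         # 80%: replace with <mask>
        elif r < 0.9:
            inputs[i] = rng.choice(RESIDUE_IDS)  # 10%: random residue
        # remaining 10%: leave the token unchanged
    return inputs, labels


corrupted, labels = bert_style_mask([0, 5, 6, 8, 8, 8, 7, 7, 6, 6, 5, 2])
```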

Checkpoint selection

The mega variant (150M params) offers a strong quality/cost tradeoff. Use micro for fast inference; use giga for maximum representation quality.

Parity Verification

All 31 representation levels (embedding + 30 transformer layers) were verified to be bit-exact (max abs diff = 0.00) against a pure-PyTorch reference that loads the original weights. The weight mapping was verified for all 390 per-block tensors. The eager and SDPA implementations agree within 4e-6 on padded batches.
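
A comparison of this kind can be sketched as follows. This is not the verification script used for the port; the helper name is hypothetical, and random stand-in tensors replace the hidden states that would normally come from the port and from the reference model.

```python
import torch


def max_abs_diff_per_level(hidden_a, hidden_b):
    """Max absolute difference for each pair of same-shaped hidden states."""
    assert len(hidden_a) == len(hidden_b)
    return [(a - b).abs().max().item() for a, b in zip(hidden_a, hidden_b)]


# Stand-ins for the 31 levels (embedding + 30 layers) of each implementation.
torch.manual_seed(0)
reference = [torch.randn(2, 12, 640) for _ in range(31)]
port = [t.clone() for t in reference]

diffs = max_abs_diff_per_level(port, reference)
print(max(diffs))  # 0.0 for identical tensors
```

With real models, `hidden_a` would be `out.hidden_states` from a call with `output_hidden_states=True`, and a tolerance such as the 4e-6 quoted above would replace the exact-zero check when comparing attention backends.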

Related Models

See the full RiNALMo collection.

| Model | Parameters | Notes |
| --- | --- | --- |
| RiNALMo-micro | 35M | Smallest variant |
| RiNALMo-mega | 150M | This model |
| RiNALMo-giga | 650M | Full model |

Usage

Embedding generation

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/RiNALMo-mega", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/RiNALMo-mega", trust_remote_code=True)
model.eval()

sequences = ["ACUUUGGCCA", "CCCGGU"]
enc = tokenizer(sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**enc)

cls_emb   = out.last_hidden_state[:, 0, :]   # (batch, 640) -- CLS token
token_emb = out.last_hidden_state             # (batch, seq_len, 640)

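# Intermediate layers: hidden_states[0] is the embedding output and
# hidden_states[i] is the output of transformer block i.
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer15_emb = out_all.hidden_states[15]       # after block 15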

MLM logits

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/RiNALMo-mega", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/RiNALMo-mega", trust_remote_code=True)
model.eval()

enc = tokenizer(["ACU<mask>UGGCCA"], return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits   # (1, seq_len, 22)
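
To turn logits into a prediction, take the softmax at the `<mask>` position and look up the top indices in the vocabulary. The sketch below uses stand-in ids and logits so that it runs without downloading the model; `top_k_at_mask` is a hypothetical helper, not part of the model's API.

```python
import torch

VOCAB = ["<cls>", "<pad>", "<eos>", "<unk>", "<mask>",
         "A", "C", "G", "T", "I", "R", "Y", "K", "M", "S", "W",
         "B", "D", "H", "V", "N", "-"]
MASK_ID = 4


def top_k_at_mask(input_ids, logits, k=3):
    """input_ids: (seq_len,), logits: (seq_len, vocab).

    Returns [(token, probability), ...] for the first <mask> position.
    """
    positions = (input_ids == MASK_ID).nonzero(as_tuple=True)[0]
    pos = int(positions[0])
    probs = logits[pos].softmax(dim=-1)
    top = probs.topk(k)
    return [(VOCAB[i], p) for i, p in zip(top.indices.tolist(), top.values.tolist())]


# Stand-in data: one sequence with <mask> at position 4, and logits that
# favour index 8 ("T") there.
ids = torch.tensor([0, 5, 6, 8, 4, 8, 7, 7])
fake_logits = torch.zeros(len(ids), len(VOCAB))
fake_logits[4, 8] = 5.0

predictions = top_k_at_mask(ids, fake_logits)
print(predictions[0][0])  # T
```

With the real model, pass `enc["input_ids"][0]` and `logits[0]` from the snippet above. Because the tokenizer maps U to T, a predicted "T" corresponds to uracil in RNA.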

Faster attention backends

import torch
from transformers import AutoModel

# SDPA (PyTorch 2.0+)
model = AutoModel.from_pretrained("Taykhoom/RiNALMo-mega", trust_remote_code=True,
                                   attn_implementation="sdpa")

# Flash Attention 2 (requires flash-attn package)
model = AutoModel.from_pretrained("Taykhoom/RiNALMo-mega", trust_remote_code=True,
                                   attn_implementation="flash_attention_2",
                                   dtype=torch.bfloat16)
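
When the same script has to run on machines with different setups, the backend can be chosen at runtime. The helper below is a hypothetical sketch, not part of the port: it picks `flash_attention_2` only when the `flash_attn` package is importable and a CUDA device is visible, falls back to `sdpa` when PyTorch provides `scaled_dot_product_attention`, and otherwise returns `eager`.

```python
import importlib.util

import torch


def pick_attn_implementation() -> str:
    """Return a value suitable for the attn_implementation argument."""
    has_flash = importlib.util.find_spec("flash_attn") is not None
    if has_flash and torch.cuda.is_available():
        return "flash_attention_2"
    if hasattr(torch.nn.functional, "scaled_dot_product_attention"):
        return "sdpa"   # available from PyTorch 2.0
    return "eager"


backend = pick_attn_implementation()
print(backend)
```

The result can be passed as `attn_implementation=backend` to `from_pretrained`. As in the example above, Flash Attention 2 should be paired with a half-precision dtype such as `torch.bfloat16`.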

Fine-tuning

Fine-tuning follows standard Hugging Face conventions. For sequence-level tasks, feed a prediction head with either the mean of the hidden states over non-padding positions or the CLS token embedding.
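
A minimal sketch of the pooling option, assuming the usual Hugging Face `attention_mask` convention (1 for real tokens, 0 for padding). `masked_mean_pool` and `SequenceHead` are hypothetical names for illustration, and the demo uses stand-in tensors in place of `out.last_hidden_state`.

```python
import torch
from torch import nn


def masked_mean_pool(hidden, attention_mask):
    """hidden: (batch, seq, dim); attention_mask: (batch, seq), 1 = real token."""
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
    summed = (hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)   # avoid division by zero
    return summed / counts


class SequenceHead(nn.Module):
    """Hypothetical linear head on top of mean-pooled embeddings."""

    def __init__(self, dim=640, num_labels=2):
        super().__init__()
        self.proj = nn.Linear(dim, num_labels)

    def forward(self, hidden, attention_mask):
        return self.proj(masked_mean_pool(hidden, attention_mask))


# Stand-in hidden states: real positions are 1.0, padded positions are 100.0.
hidden = torch.ones(2, 4, 640)
hidden[1, 2:] = 100.0
attention_mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])

pooled = masked_mean_pool(hidden, attention_mask)   # padding is ignored
head_logits = SequenceHead()(hidden, attention_mask)  # shape (2, 2)
```

Note that this pools over every non-padding position, including special tokens; excluding them would require masking those positions as well.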

Implementation Notes

The original RiNALMo uses flash_attn for attention during training. This HF port implements eager (standard PyTorch), SDPA, and flash_attention_2 as separate backends selectable via attn_implementation. The SDPA and flash_attention_2 backends were not part of the original codebase.

The model uses a non-standard Pre-LN residual in the attention sublayer: the residual connection is taken from the normalized input (i.e., x = attn_ln(x); x = x + attn(x)) rather than from the un-normalized input (standard Pre-LN: x = x + attn(attn_ln(x))). The FFN sublayer uses standard Pre-LN.
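
The difference between the two residual patterns is easiest to see side by side. The toy block below is a sketch, not the port's code: linear layers stand in for attention and the SwiGLU FFN, and `ffn_ln` is an assumed name (only `attn_ln` appears in the description above).

```python
import torch
from torch import nn


class ToyBlock(nn.Module):
    """Toy Pre-LN block with a switchable attention residual."""

    def __init__(self, dim=640, nonstandard_attn_residual=True):
        super().__init__()
        self.attn_ln = nn.LayerNorm(dim)
        self.ffn_ln = nn.LayerNorm(dim)
        self.attn = nn.Linear(dim, dim)   # stand-in for self-attention
        self.ffn = nn.Linear(dim, dim)    # stand-in for the SwiGLU FFN
        self.nonstandard = nonstandard_attn_residual

    def forward(self, x):
        if self.nonstandard:
            # Residual is taken from the normalized input.
            x = self.attn_ln(x)
            x = x + self.attn(x)
        else:
            # Standard Pre-LN: residual is taken from the raw input.
            x = x + self.attn(self.attn_ln(x))
        # The FFN sublayer is standard Pre-LN in both cases.
        x = x + self.ffn(self.ffn_ln(x))
        return x


torch.manual_seed(0)
x = torch.randn(2, 5, 640)
a = ToyBlock(nonstandard_attn_residual=True)
b = ToyBlock(nonstandard_attn_residual=False)
b.load_state_dict(a.state_dict())   # same weights, different wiring
with torch.no_grad():
    gap = (a(x) - b(x)).abs().max().item()
print(gap)
```

With identical weights the two variants produce different outputs, so a port that used the standard pattern would not reproduce the original model's activations.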

TokenDropout rescales embeddings by (1 - mask_ratio_train) / (1 - mask_ratio_observed) even at inference, consistent with the original training code.
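
The scale factor can be sketched as below. Two things are assumptions rather than facts from this card: the observed ratio is computed as the number of `<mask>` tokens divided by the number of non-padding tokens, and the training ratio of 0.12 in the demo is purely illustrative. The helper name is hypothetical.

```python
import torch

MASK_ID, PAD_ID = 4, 1   # indices from the vocabulary listed earlier


def token_dropout_scale(input_ids, mask_ratio_train):
    """Per-sequence factor (1 - mask_ratio_train) / (1 - mask_ratio_observed).

    Assumes observed ratio = #<mask> / #non-padding tokens per sequence.
    A sequence made up entirely of <mask> tokens would divide by zero.
    """
    non_pad = (input_ids != PAD_ID).sum(dim=-1).clamp(min=1)
    n_mask = (input_ids == MASK_ID).sum(dim=-1)
    observed = n_mask.float() / non_pad.float()
    return (1.0 - mask_ratio_train) / (1.0 - observed)


ids = torch.tensor([
    [0, 5, 6, 7, 8, 2, 1, 1],   # no <mask>, two <pad>
    [0, 5, 4, 7, 8, 2, 5, 6],   # one <mask> among eight tokens
])
scale = token_dropout_scale(ids, mask_ratio_train=0.12)  # illustrative ratio
# Row 0 has no <mask>, so its factor is simply 1 - mask_ratio_train.
print(scale)
```

This shows why the rescaling matters at inference: even an unmasked sequence has its embeddings multiplied by a factor below 1, matching what the model saw on average during training.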

Citation

@article{penic2025_rinalmo,
  title={RiNALMo: general-purpose RNA language models can generalize well on structure prediction tasks},
  author={Penić, Rafael Josip and Vlašić, Tin and Huber, Roland G. and Wan, Yue and Šikić, Mile},
  journal={Nature Communications},
  volume={16},
  pages={5671},
  year={2025},
  doi={10.1038/s41467-025-60872-5}
}

Credits

Original model and code by Penić et al. Source: GitHub lbcb-sci/RiNALMo. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.

License

Apache 2.0 (code) / CC BY 4.0 (model weights), following the original repository.
