RiNALMo-mega
Minimal HuggingFace port of the mega (~150M parameter) variant of RiNALMo -- a general-purpose RNA language model pre-trained on 36 million non-coding RNA sequences.
Architecture
| Parameter | Value |
|---|---|
| Layers | 30 |
| Attention heads | 20 |
| Embedding dimension | 640 |
| FFN hidden dimension | 1706 (SwiGLU, floor(2/3 x 4 x embed)) |
| Vocabulary size | 22 |
| Positional encoding | RoPE (base=10000, non-interleaved) |
| Architecture | Pre-LN Transformer with SwiGLU FFN |
| Max sequence length | ~8192 (practical; RoPE has no hard limit) |
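The derived sizes in the table can be reproduced with integer arithmetic, as a quick sanity check. The parameter estimate below is only a rough sketch: it assumes four square attention projections and three SwiGLU weight matrices per block, and it ignores biases, normalization parameters, and the LM head. None of those structural details are stated in this card.

```python
# Derived sizes for RiNALMo-mega, computed from the architecture table.
embed_dim = 640
num_heads = 20
num_layers = 30
vocab_size = 22

# Per-head dimension: the embedding is split evenly across heads.
head_dim = embed_dim // num_heads

# SwiGLU hidden size: floor(2/3 * 4 * embed_dim), using exact integer math.
ffn_dim = (2 * 4 * embed_dim) // 3

# Rough weight count (ASSUMPTION: Q, K, V, O projections plus three
# SwiGLU matrices per block; biases and norm parameters are ignored).
attn_weights = 4 * embed_dim * embed_dim
ffn_weights = 3 * embed_dim * ffn_dim
approx_params = num_layers * (attn_weights + ffn_weights) + vocab_size * embed_dim

print(head_dim, ffn_dim, f"{approx_params / 1e6:.1f}M")  # 32 1706 147.4M
```

Under these assumptions the estimate lands close to the ~150M figure quoted for the mega variant.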
Vocabulary (index order): <cls> (0), <pad> (1), <eos> (2), <unk> (3),
<mask> (4), A (5), C (6), G (7), T (8), I (9), R (10), Y (11), K (12), M (13),
S (14), W (15), B (16), D (17), H (18), V (19), N (20), - (21).
Note: the tokenizer converts U -> T before encoding (the model was trained on T).
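To illustrate the U -> T conversion and the index order above, here is a minimal sketch of residue encoding. `encode_residues` is a hypothetical helper, not the tokenizer shipped with the model; it assumes upper-casing, maps unknown characters to <unk>, and adds no special tokens. The real tokenizer remains the source of truth.

```python
# Vocabulary in the index order listed above.
VOCAB = ["<cls>", "<pad>", "<eos>", "<unk>", "<mask>",
         "A", "C", "G", "T", "I", "R", "Y", "K", "M",
         "S", "W", "B", "D", "H", "V", "N", "-"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def encode_residues(seq):
    """Hypothetical helper: map residues to IDs after U -> T conversion.

    ASSUMPTIONS: input is upper-cased, unknown characters map to <unk>,
    and no special tokens (<cls>, <eos>) are added here.
    """
    normalized = seq.upper().replace("U", "T")
    unk = TOKEN_TO_ID["<unk>"]
    return [TOKEN_TO_ID.get(ch, unk) for ch in normalized]


print(encode_residues("ACUUUGGCCA"))  # [5, 6, 8, 8, 8, 7, 7, 6, 6, 5]
```

Because of the conversion, "ACGU" and "ACGT" produce identical IDs in this sketch.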
Pretraining
- Objective: Masked language modeling (BERT-style, 15% mask rate)
- Data: 36 million non-coding RNA sequences from multiple databases
- Source checkpoint: rinalmo_mega_pretrained.pt from Zenodo 15043668
Checkpoint selection
The mega variant (150M params) offers a strong quality/cost tradeoff. Use micro for fast inference; use giga for maximum representation quality.
Parity Verification
All 31 representation levels (embedding output + 30 transformer layers) were verified to be bit-exact (max abs diff = 0.00) against a pure-PyTorch reference that loads the original weights. The weight mapping was verified for all 390 per-block tensors (13 per block). The eager and SDPA implementations agree within 4e-6 on padded batches.
Related Models
See the full RiNALMo collection.
| Model | Parameters | Notes |
|---|---|---|
| RiNALMo-micro | 35M | Smallest variant |
| RiNALMo-mega | 150M | This model |
| RiNALMo-giga | 650M | Full model |
Usage
Embedding generation
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Taykhoom/RiNALMo-mega", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/RiNALMo-mega", trust_remote_code=True)
model.eval()
sequences = ["ACUUUGGCCA", "CCCGGU"]
enc = tokenizer(sequences, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**enc)
cls_emb = out.last_hidden_state[:, 0, :]  # (batch, 640) -- CLS token
token_emb = out.last_hidden_state  # (batch, seq_len, 640)
# Intermediate layers: hidden_states[0] is the embedding output, hidden_states[i] follows block i
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer15_emb = out_all.hidden_states[15]  # after block 15
MLM logits
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("Taykhoom/RiNALMo-mega", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/RiNALMo-mega", trust_remote_code=True)
model.eval()
enc = tokenizer(["ACU<mask>UGGCCA"], return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits  # (1, seq_len, 22)
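To turn MLM logits into readable predictions, one option is to take the top-k tokens at each <mask> position. The sketch below is an illustration, not part of the model's API: `top_k_at_masks` is a hypothetical helper, and it hard-codes the vocabulary and mask ID from this card (in practice `tokenizer.mask_token_id` is likely the safer source). It runs on synthetic tensors so no model download is needed.

```python
import torch

# Vocabulary in the index order given in this card.
VOCAB = ["<cls>", "<pad>", "<eos>", "<unk>", "<mask>",
         "A", "C", "G", "T", "I", "R", "Y", "K", "M",
         "S", "W", "B", "D", "H", "V", "N", "-"]
MASK_ID = VOCAB.index("<mask>")  # 4


def top_k_at_masks(logits, input_ids, k=3):
    """Hypothetical helper: k most probable tokens at each <mask> position.

    logits: (batch, seq_len, vocab); input_ids: (batch, seq_len).
    Returns a list of (batch_index, position, [(token, prob), ...]).
    """
    probs = logits.softmax(dim=-1)
    results = []
    for b, pos in (input_ids == MASK_ID).nonzero().tolist():
        top = probs[b, pos].topk(k)
        pairs = [(VOCAB[i], round(p, 4))
                 for i, p in zip(top.indices.tolist(), top.values.tolist())]
        results.append((b, pos, pairs))
    return results


# Synthetic demo: one sequence with a mask at position 2, where "T" is favoured.
input_ids = torch.tensor([[0, 5, 4, 7, 2]])
logits = torch.zeros(1, 5, len(VOCAB))
logits[0, 2, VOCAB.index("T")] = 5.0
preds = top_k_at_masks(logits, input_ids, k=1)
print(preds)
```

With the real model, the same helper would be called as `top_k_at_masks(logits, enc["input_ids"])` using the tensors from the snippet above.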
Faster attention backends
import torch
from transformers import AutoModel
# SDPA (PyTorch 2.0+)
model = AutoModel.from_pretrained("Taykhoom/RiNALMo-mega", trust_remote_code=True,
                                  attn_implementation="sdpa")
# Flash Attention 2 (requires the flash-attn package; fp16/bf16 only)
model = AutoModel.from_pretrained("Taykhoom/RiNALMo-mega", trust_remote_code=True,
                                  attn_implementation="flash_attention_2",
                                  dtype=torch.bfloat16)
Fine-tuning
Fine-tuning follows standard Hugging Face conventions. For sequence-level tasks, either mean-pool the token embeddings over non-padding positions or feed the CLS token embedding to a prediction head.
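Pooling over non-padding positions can be sketched as a masked mean. `masked_mean_pool` below is a hypothetical helper, and it assumes the tokenizer returns a standard 0/1 `attention_mask`. Special tokens such as <cls> are included in the average unless the caller masks them out.

```python
import torch


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over positions where attention_mask == 1.

    last_hidden_state: (batch, seq_len, dim); attention_mask: (batch, seq_len).
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts


# Synthetic check: the second sequence has one padded position that must be ignored.
hidden = torch.tensor([[[1.0, 1.0], [3.0, 3.0]],
                       [[2.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1], [1, 0]])
pooled = masked_mean_pool(hidden, mask)  # rows: [2, 2] and [2, 4]
```

With model outputs this would be called as `masked_mean_pool(out.last_hidden_state, enc["attention_mask"])`, and the result passed to a task-specific head.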
Implementation Notes
The original RiNALMo uses flash_attn for attention during training. This HF port
implements eager (standard PyTorch), SDPA, and flash_attention_2 as separate backends
selectable via attn_implementation. The SDPA and flash_attention_2 backends were not
part of the original codebase.
The model uses a non-standard Pre-LN residual in the attention sublayer: the residual connection is
taken from the normalized input (i.e., x = attn_ln(x); x = x + attn(x)) rather
than from the un-normalized input, as standard Pre-LN would (x = x + attn(attn_ln(x))). The FFN sublayer uses standard Pre-LN.
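The difference between the two residual placements is easy to miss, so here is a minimal sketch contrasting them. The function names and the toy scalar `norm` and `attn` stand-ins are hypothetical; the same two functions would work unchanged with tensors and real modules.

```python
def standard_pre_ln(x, norm, sublayer):
    """Standard Pre-LN: the residual is taken from the un-normalized input."""
    return x + sublayer(norm(x))


def normalized_residual(x, norm, sublayer):
    """Variant described above: the residual is taken from the normalized input."""
    h = norm(x)
    return h + sublayer(h)


# Toy scalar stand-ins (hypothetical) so the difference is easy to see.
def norm(v):
    return v / 2.0


def attn(v):
    return 3.0 * v


x = 4.0
print(standard_pre_ln(x, norm, attn))      # 4 + 3*2 = 10.0
print(normalized_residual(x, norm, attn))  # 2 + 3*2 = 8.0
```

A port that uses the standard form for the attention sublayer would therefore not reproduce the original activations, even with identical weights.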
TokenDropout rescales embeddings by (1 - mask_ratio_train) / (1 - mask_ratio_observed)
even at inference, consistent with the original training code.
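The rescaling factor can be worked through in a few lines. This sketch only computes the scalar factor; it does not reproduce the original TokenDropout module. Two things are assumptions: `mask_ratio_train=0.12` is an illustrative value (not stated in this card), and the observed ratio is computed over non-padding tokens.

```python
MASK_ID = 4  # <mask> in the vocabulary above
PAD_ID = 1   # <pad>


def token_dropout_scale(token_ids, mask_ratio_train=0.12):
    """Scale factor (1 - mask_ratio_train) / (1 - mask_ratio_observed).

    ASSUMPTIONS: mask_ratio_train=0.12 is an illustrative value, and the
    observed ratio is computed over non-padding tokens only.
    """
    real = [t for t in token_ids if t != PAD_ID]
    observed = sum(t == MASK_ID for t in real) / len(real)
    return (1.0 - mask_ratio_train) / (1.0 - observed)


# No <mask> tokens (typical inference): embeddings are still scaled by 0.88.
scale_unmasked = token_dropout_scale([0, 5, 6, 8, 7, 2])
# One <mask> among six non-padding tokens: observed ratio is 1/6.
scale_masked = token_dropout_scale([0, 5, 4, 8, 7, 2, 1, 1])
print(scale_unmasked, scale_masked)
```

The first case shows why the factor matters at inference: with no masked tokens the observed ratio is zero, so embeddings are multiplied by (1 - mask_ratio_train) rather than left unchanged. The sketch does not guard against empty or fully masked inputs.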
Citation
@article{penic2025_rinalmo,
  title={RiNALMo: general-purpose RNA language models can generalize well on structure prediction tasks},
  author={Penić, Rafael Josip and Vlašić, Tin and Huber, Roland G. and Wan, Yue and Šikić, Mile},
  journal={Nature Communications},
  volume={16},
  pages={5671},
  year={2025},
  doi={10.1038/s41467-025-60872-5}
}
Credits
Original model and code by Penić et al. Source: GitHub lbcb-sci/RiNALMo. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.
License
Apache 2.0 (code) / CC BY 4.0 (model weights), following the original repository.