RiNALMo-micro

Minimal Hugging Face port of the micro (35M-parameter) variant of RiNALMo, a general-purpose RNA language model pre-trained on 36 million non-coding RNA sequences.

Architecture

Parameter	Value
Layers	12
Attention heads	20
Embedding dimension	480
FFN hidden dimension	1280 (SwiGLU; 2/3 × 4 × 480 = 1280)
Vocabulary size	22
Positional encoding	RoPE (base=10000, non-interleaved)
Architecture	Pre-LN Transformer with SwiGLU FFN
Max sequence length	~8192 (practical; RoPE has no hard limit)
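
The dimensions in the table and the "non-interleaved" RoPE layout can be illustrated with a minimal sketch. It assumes that "non-interleaved" means the rotate-half (GPT-NeoX-style) layout, where the first and second halves of each head are paired. The function name `rope_non_interleaved` is hypothetical and is not taken from the port's code.

```python
import torch

EMBED_DIM = 480
NUM_HEADS = 20
HEAD_DIM = EMBED_DIM // NUM_HEADS   # 24 dimensions per attention head
FFN_DIM = 2 * 4 * EMBED_DIM // 3    # 2/3 x 4 x 480 = 1280


def rope_non_interleaved(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, head_dim).

    Assumed layout: dimension i is paired with dimension i + head_dim // 2.
    """
    seq_len, head_dim = x.shape
    half = head_dim // 2
    # One rotation frequency per pair of dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    positions = torch.arange(seq_len, dtype=torch.float32)
    angles = torch.outer(positions, inv_freq)  # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


torch.manual_seed(0)
q = torch.randn(5, HEAD_DIM)
q_rot = rope_non_interleaved(q)
```

Because the rotation angle depends only on the position index, nothing in this computation imposes a hard length limit, which matches the note on maximum sequence length.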

Vocabulary (index order): <cls> (0), <pad> (1), <eos> (2), <unk> (3), <mask> (4), A (5), C (6), G (7), T (8), I (9), R (10), Y (11), K (12), M (13), S (14), W (15), B (16), D (17), H (18), V (19), N (20), - (21).

Note: the tokenizer converts U -> T before encoding (the model was trained on T).
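
As a rough illustration of the vocabulary and the U -> T conversion, the sketch below maps residues to the indices listed above. The helper `encode_residues` is hypothetical. It does not add special tokens, and falling back to `<unk>` for unknown characters is an assumption, so use the bundled tokenizer for real inputs.

```python
from typing import List

# Vocabulary in index order, as listed in this model card.
VOCAB = ["<cls>", "<pad>", "<eos>", "<unk>", "<mask>",
         "A", "C", "G", "T", "I", "R", "Y", "K", "M", "S", "W",
         "B", "D", "H", "V", "N", "-"]
TOKEN_TO_ID = {token: index for index, token in enumerate(VOCAB)}


def encode_residues(sequence: str) -> List[int]:
    """Map an RNA string to token ids, converting U to T first."""
    converted = sequence.replace("U", "T")
    unk_id = TOKEN_TO_ID["<unk>"]  # assumed fallback for unknown characters
    return [TOKEN_TO_ID.get(ch, unk_id) for ch in converted]


ids = encode_residues("ACUG")
```

Here `"ACUG"` is first rewritten to `"ACTG"`, so the uracil ends up with the index of `T` (8).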

Pretraining

Objective: Masked language modeling (BERT-style, 15% mask rate)
Data: 36 million non-coding RNA sequences from multiple databases
Source checkpoint: rinalmo_micro_pretrained.pt from Zenodo 15043668
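
For readers unfamiliar with BERT-style masking, the sketch below shows one common form of it: 15% of non-special tokens are selected, and of those, 80% become `<mask>`, 10% become a random token, and 10% are left unchanged. The 80/10/10 split and the choice of replacement tokens are assumptions here. This card states only "BERT-style, 15% mask rate". The function name is hypothetical.

```python
import random
from typing import List, Optional, Tuple

MASK_ID = 4
SPECIAL_IDS = {0, 1, 2, 3, 4}          # <cls>, <pad>, <eos>, <unk>, <mask>
REPLACEMENT_IDS = list(range(5, 22))   # assumed pool for random replacement
IGNORE_INDEX = -100                    # label value ignored by PyTorch cross-entropy


def bert_style_mask(ids: List[int], mask_rate: float = 0.15,
                    rng: Optional[random.Random] = None) -> Tuple[List[int], List[int]]:
    """Return (corrupted inputs, labels) for masked language modeling."""
    rng = rng or random.Random()
    inputs = list(ids)
    labels = [IGNORE_INDEX] * len(ids)
    for i, token in enumerate(ids):
        if token in SPECIAL_IDS or rng.random() >= mask_rate:
            continue                    # position not selected for prediction
        labels[i] = token               # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_ID
        elif roll < 0.9:
            inputs[i] = rng.choice(REPLACEMENT_IDS)
        # otherwise the token is left unchanged
    return inputs, labels


example_ids = [0] + [5, 6, 7, 8] * 50 + [2]
masked_inputs, mlm_labels = bert_style_mask(example_ids, rng=random.Random(0))
```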

Checkpoint selection

The micro variant is the smallest (35M params) and fastest to use. Choose mega or giga for stronger representations on challenging tasks.

Parity Verification

All 13 representation levels (embedding output plus 12 transformer layers) were verified to be bit-exact (max abs diff = 0.00) against a pure-PyTorch reference that loads the original weights. The weight mapping was verified for all 156 per-block tensors. The eager and SDPA implementations agree to within 4e-6 on padded batches.
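
A check of this kind can be sketched as follows. This is not the verification script used for the port. The helper name is hypothetical, and the tensors are random stand-ins for the hidden states that the reference model and the port would produce.

```python
import torch


def max_abs_diff_per_level(ref_states, port_states):
    """Return the maximum absolute difference for each pair of hidden states."""
    if len(ref_states) != len(port_states):
        raise ValueError("both models must expose the same number of levels")
    return [(r - p).abs().max().item() for r, p in zip(ref_states, port_states)]


# Stand-ins: 13 levels (embedding + 12 layers), batch 2, length 10, width 480.
torch.manual_seed(0)
ref = [torch.randn(2, 10, 480) for _ in range(13)]
port = [t.clone() for t in ref]

diffs = max_abs_diff_per_level(ref, port)
worst = max(diffs)  # 0.0 means bit-exact agreement at every level
```

With real models, `port_states` would come from `output_hidden_states=True` and `ref_states` from the reference implementation's per-layer outputs.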

Related Models

See the full RiNALMo collection.

Model	Parameters	Notes
RiNALMo-micro	35M	This model
RiNALMo-mega	150M	Medium variant
RiNALMo-giga	650M	Full model

Usage

Embedding generation

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/RiNALMo-micro", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/RiNALMo-micro", trust_remote_code=True)
model.eval()

sequences = ["ACUUUGGCCA", "CCCGGU"]
enc = tokenizer(sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**enc)

cls_emb   = out.last_hidden_state[:, 0, :]   # (batch, 480) -- CLS token
token_emb = out.last_hidden_state             # (batch, seq_len, 480)

# Intermediate layers (hidden_states[0] is the embedding output)
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer6_emb = out_all.hidden_states[6]         # after block 6

MLM logits

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/RiNALMo-micro", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/RiNALMo-micro", trust_remote_code=True)
model.eval()

enc = tokenizer(["ACU<mask>UGGCCA"], return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits   # (1, seq_len, 22)
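
To turn the logits into readable predictions, something like the sketch below could be used. The helper `top_predictions_at_masks` is hypothetical, and the stand-in `input_ids` assume a `<cls> ... <eos>` layout, which this card does not state explicitly. The logits here are synthetic so that the example runs without downloading the model.

```python
import torch

VOCAB = ["<cls>", "<pad>", "<eos>", "<unk>", "<mask>",
         "A", "C", "G", "T", "I", "R", "Y", "K", "M", "S", "W",
         "B", "D", "H", "V", "N", "-"]
MASK_ID = 4


def top_predictions_at_masks(input_ids, logits, k=3):
    """For each <mask> position, list the k most probable tokens."""
    probs = logits.softmax(dim=-1)
    results = []
    for batch_idx, pos in (input_ids == MASK_ID).nonzero().tolist():
        top = probs[batch_idx, pos].topk(k)
        candidates = [(VOCAB[i], p) for p, i in
                      zip(top.values.tolist(), top.indices.tolist())]
        results.append((batch_idx, pos, candidates))
    return results


# Synthetic stand-in: one sequence with a <mask> at position 4.
demo_ids = torch.tensor([[0, 5, 6, 8, 4, 8, 7, 2]])
demo_logits = torch.zeros(1, 8, 22)
demo_logits[0, 4, 7] = 5.0  # make "G" the most likely token at the mask
predictions = top_predictions_at_masks(demo_ids, demo_logits, k=1)
```

With the real model, the call would be `top_predictions_at_masks(enc["input_ids"], logits)`.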

Faster attention backends

import torch
from transformers import AutoModel

# SDPA (PyTorch 2.0+)
model = AutoModel.from_pretrained("Taykhoom/RiNALMo-micro", trust_remote_code=True,
                                   attn_implementation="sdpa")

# Flash Attention 2 (requires the flash-attn package, a CUDA GPU, and fp16/bf16 weights)
model = AutoModel.from_pretrained("Taykhoom/RiNALMo-micro", trust_remote_code=True,
                                   attn_implementation="flash_attention_2",
                                   torch_dtype=torch.bfloat16)
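
One possible way to choose a backend at runtime is sketched below. The helper `pick_attn_implementation` is hypothetical. It only checks whether the `flash_attn` package can be found and whether the caller reports a CUDA device, and otherwise falls back to SDPA.

```python
import importlib.util


def pick_attn_implementation(cuda_available: bool) -> str:
    """Return a value for attn_implementation based on what is installed."""
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    if cuda_available and has_flash_attn:
        return "flash_attention_2"
    return "sdpa"


backend = pick_attn_implementation(cuda_available=False)
```

The result can be passed as `attn_implementation=backend`, with `cuda_available` taken from `torch.cuda.is_available()`. On PyTorch versions older than 2.0, `"eager"` would be the fallback instead.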

Fine-tuning

Fine-tuning follows standard HF conventions. For sequence-level tasks, either pool over non-padding positions (for example, with a mean) or use the CLS token embedding as input to a prediction head.
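
Pooling over non-padding positions can be sketched as a masked mean. The function and the linear head below are illustrative assumptions, not part of the port. Note that the attention mask usually also marks special tokens as non-padding, so they are included in the mean unless masked out separately.

```python
import torch


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over positions where attention_mask == 1."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


# Stand-in batch: the second sequence has two padded positions filled with junk.
hidden = torch.ones(2, 4, 480)
hidden[1, 2:] = 100.0
attention_mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])

pooled = masked_mean_pool(hidden, attention_mask)   # (2, 480)
head = torch.nn.Linear(480, 1)                      # hypothetical prediction head
prediction = head(pooled)                           # (2, 1)
```

With the real model, `hidden` would be `out.last_hidden_state` and `attention_mask` would be `enc["attention_mask"]`.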

Implementation Notes

The original RiNALMo uses flash_attn for attention during training. This HF port implements eager (standard PyTorch), SDPA, and flash_attention_2 as separate backends selectable via attn_implementation. The SDPA and flash_attention_2 backends were not part of the original codebase.

The model uses a non-standard Pre-LN residual in the attention sub-block: the residual connection is taken from the normalized input (i.e., x = attn_ln(x); x = x + attn(x)) rather than from the un-normalized input (standard Pre-LN: x = x + attn(attn_ln(x))). The FFN sub-block uses standard Pre-LN.
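
The difference between the two residual forms can be made concrete with a small sketch. The class names are hypothetical, and `nn.Identity` stands in for the attention sub-layer so that the outputs are easy to compare. This is not the port's actual module code.

```python
import torch
from torch import nn


class StandardPreLNResidual(nn.Module):
    """Standard Pre-LN: the residual adds the un-normalized input."""

    def __init__(self, dim, sublayer):
        super().__init__()
        self.ln = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(self.ln(x))


class NormalizedInputResidual(nn.Module):
    """Variant described above: the residual adds the normalized input."""

    def __init__(self, dim, sublayer):
        super().__init__()
        self.ln = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x):
        x = self.ln(x)
        return x + self.sublayer(x)


torch.manual_seed(0)
x = torch.randn(2, 5, 480) * 3.0 + 1.0
standard = StandardPreLNResidual(480, nn.Identity())
variant = NormalizedInputResidual(480, nn.Identity())
y_standard = standard(x)   # x + ln(x)
y_variant = variant(x)     # ln(x) + ln(x)
```

The two outputs differ by `x - ln(x)`, so swapping one form for the other when porting the weights would change every downstream activation.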

TokenDropout rescales embeddings by (1 - mask_ratio_train) / (1 - mask_ratio_observed) even at inference, consistent with the original training code.
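
The effect of this rescaling at inference can be seen from the scale factor alone. In the sketch below, the function name is hypothetical. It assumes that `mask_ratio_observed` is the fraction of `<mask>` tokens among non-padding tokens, and that `mask_ratio_train` is 0.15 × 0.8 = 0.12 (as in ESM-style token dropout). Neither value is stated in this card.

```python
def token_dropout_scale(num_mask_tokens: int, num_non_pad_tokens: int,
                        mask_ratio_train: float = 0.12) -> float:
    """Return (1 - mask_ratio_train) / (1 - mask_ratio_observed).

    mask_ratio_train = 0.12 is an assumed value, not taken from this card.
    """
    if num_non_pad_tokens <= 0 or num_mask_tokens >= num_non_pad_tokens:
        raise ValueError("need at least one non-padding, non-mask token")
    mask_ratio_observed = num_mask_tokens / num_non_pad_tokens
    return (1.0 - mask_ratio_train) / (1.0 - mask_ratio_observed)


scale_unmasked = token_dropout_scale(num_mask_tokens=0, num_non_pad_tokens=10)
scale_train_like = token_dropout_scale(num_mask_tokens=12, num_non_pad_tokens=100)
```

Under these assumptions, an input with no `<mask>` tokens has its embeddings scaled by 0.88, while an input masked at the training ratio is left at scale 1.0.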

Citation

@article{penic2025_rinalmo,
  title   = {RiNALMo: general-purpose {RNA} language models can generalize well on structure prediction tasks},
  author  = {Penić, Rafael Josip and Vlašić, Tin and Huber, Roland G. and Wan, Yue and Šikić, Mile},
  journal = {Nature Communications},
  volume  = {16},
  pages   = {5671},
  year    = {2025},
  doi     = {10.1038/s41467-025-60872-5}
}

Credits

Original model and code by Penić et al. Source: GitHub lbcb-sci/RiNALMo. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.

License

Apache 2.0 (code) / CC BY 4.0 (model weights), following the original repository.

Safetensors: 33.5M params, tensor type F32

Paper: RiNALMo: General-Purpose RNA Language Models Can Generalize Well on Structure Prediction Tasks (arXiv:2403.00043, published Feb 29, 2024)