AIDO.RNA-1.6B

1.6B-parameter base model -- the largest AIDO.RNA variant. This is a standalone HuggingFace port that loads without the ModelGenerator package.

Architecture

Parameter	Value
Layers	32
Attention heads	32
Embedding dimension	2048
Intermediate (MLP) size	5440
Vocabulary size	16
Positional encoding	RoPE (rotary_percent=1.0)
Normalization	LayerNorm
MLP activation	SwiGLU
Architecture	Pre-LN Transformer
Max sequence length	1024 (training truncation; RoPE has no hard limit)
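
As a rough sanity check, the dimensions in the table can be turned into a parameter estimate. The sketch below rests on assumptions the table does not state: each layer has four 2048 x 2048 attention projections (Q, K, V, output) and a three-matrix SwiGLU MLP (gate, up, down), and biases, LayerNorm weights, and any output head are ignored.

```python
# Back-of-the-envelope parameter estimate from the architecture table.
# Assumptions (not from the model card): Q, K, V and output projections are
# each d_model x d_model; SwiGLU uses gate, up and down matrices; biases,
# LayerNorm weights and the MLM head are ignored.
LAYERS = 32
D_MODEL = 2048
D_MLP = 5440
VOCAB = 16

attention_params = 4 * D_MODEL * D_MODEL   # Q, K, V, output projections
mlp_params = 3 * D_MODEL * D_MLP           # gate, up, down matrices
embedding_params = VOCAB * D_MODEL         # token embedding table

total_params = LAYERS * (attention_params + mlp_params) + embedding_params
print(f"{total_params / 1e9:.2f}B parameters (estimate)")
```

Under these assumptions the estimate lands close to the nominal 1.6B, which suggests the table is internally consistent.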

Vocabulary: [PAD], [MASK], [CLS], [SEP], [UNK], A, G, C, T, U, N, [BOS], [EOS], [UNUSED1], [UNUSED2], [UNUSED3]
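
For illustration, the 16 tokens can be laid out as a lookup table. The ids below assume the tokens are numbered in the order listed, which the source does not confirm; the tokenizer shipped with the checkpoint is authoritative. The `encode` helper is hypothetical and does not add special tokens such as [CLS] or [SEP].

```python
# Token list copied from the vocabulary line above.
# Assumption: ids follow the listed order (not confirmed by the model card).
VOCAB = [
    "[PAD]", "[MASK]", "[CLS]", "[SEP]", "[UNK]",
    "A", "G", "C", "T", "U", "N",
    "[BOS]", "[EOS]", "[UNUSED1]", "[UNUSED2]", "[UNUSED3]",
]
TOKEN_TO_ID = {token: index for index, token in enumerate(VOCAB)}


def encode(sequence):
    """Map each nucleotide character to an id; unknown characters become [UNK]."""
    unk_id = TOKEN_TO_ID["[UNK]"]
    return [TOKEN_TO_ID.get(base, unk_id) for base in sequence.upper()]


example_ids = encode("ACGU")
```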

Pretraining

Objective: Masked language modeling (MLM) on RNA sequences
Data: 42M non-coding RNA sequences
Source checkpoint: genbio-ai/AIDO.RNA-1.6B

Checkpoint selection

Largest base model; use when embedding quality is more important than speed.

Parity Verification

Hidden-state representations were compared against the original genbio-ai/AIDO.RNA-1.6B weights at all 33 representation levels (embedding + 32 transformer layers). Differences at intermediate layers are floating-point accumulation noise, which the final layer norm normalises away; the final output matches the original to within 1e-5. not verified (sharded checkpoint; architecture identical to 650M). Verified on GPU with PyTorch 2.7 / CUDA 12.
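
The comparison script itself is not shown in the source. One plausible form of such a check, assuming both implementations return one hidden-state tensor per representation level, is sketched below. The function name and the toy tensors are illustrative only; a real run would pass the hidden states produced by the original and the ported model.

```python
import torch


def max_abs_diff_per_level(reference, ported):
    """Return the maximum absolute difference for each pair of hidden states."""
    if len(reference) != len(ported):
        raise ValueError("both models must expose the same number of levels")
    return [
        (ref.float() - new.float()).abs().max().item()
        for ref, new in zip(reference, ported)
    ]


# Toy stand-ins for the 33 levels (embedding + 32 layers), with a small
# hidden size to keep the example light. These are NOT real model outputs.
torch.manual_seed(0)
reference = [torch.randn(2, 8, 16) for _ in range(33)]
ported = [h + 1e-7 * torch.randn_like(h) for h in reference]

diffs = max_abs_diff_per_level(reference, ported)
print(f"final-level max abs diff: {diffs[-1]:.2e}")
```

The last entry of `diffs` is the one that would be checked against the 1e-5 tolerance quoted above.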

Related Models

See the full AIDO.RNA collection.

Model	Parameters	Data	Notes
AIDO.RNA-1M-MARS	1M	MARS ncRNA	Smallest MARS variant
AIDO.RNA-25M-MARS	25M	MARS ncRNA	Mid-size MARS variant
AIDO.RNA-300M-MARS	300M	MARS ncRNA	Large MARS variant
AIDO.RNA-650M	650M	42M ncRNA	Base model
AIDO.RNA-650M-CDS	650M	42M ncRNA + CDS	CDS-adapted
AIDO.RNA-1.6B	1.6B	42M ncRNA	Largest base model
AIDO.RNA-1.6B-CDS	1.6B	42M ncRNA + CDS	Largest CDS-adapted

Usage

Embedding generation

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/AIDO.RNA-1.6B", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/AIDO.RNA-1.6B", trust_remote_code=True)
model.eval()

sequences = ["ACGUGCUAGCUAGCUA", "AUGCUAGCUAGCUAGC"]
enc = tokenizer(sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**enc)

cls_emb   = out.last_hidden_state[:, 0, :]   # (batch, 2048) -- CLS token
token_emb = out.last_hidden_state             # (batch, seq_len, 2048)

# Intermediate layers: hidden_states[0] is the embedding output, hidden_states[i] is layer i
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer3_emb = out_all.hidden_states[3]

MLM logits

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/AIDO.RNA-1.6B", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/AIDO.RNA-1.6B", trust_remote_code=True)
model.eval()

enc = tokenizer(["ACGU[MASK]GCUA"], return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits   # (1, seq_len, 16)
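
To turn the logits at a masked position into ranked predictions, a softmax followed by a top-k is usually enough. The sketch below uses fake logits over a 16-token vocabulary so that it runs without downloading the model; `top_tokens` and the `id_to_token` callback are hypothetical names.

```python
import torch


def top_tokens(position_logits, id_to_token, k=3):
    """Return the k most likely (token, probability) pairs for one position."""
    probs = torch.softmax(position_logits.float(), dim=-1)
    values, indices = probs.topk(k)
    return [(id_to_token(i), p) for i, p in zip(indices.tolist(), values.tolist())]


# Toy example: fake logits over 16 tokens, with id 9 strongly favoured.
fake_logits = torch.zeros(16)
fake_logits[9] = 5.0
best = top_tokens(fake_logits, id_to_token=lambda i: f"id{i}", k=1)
```

With the real model, and assuming the port's tokenizer exposes the usual `mask_token_id` and `convert_ids_to_tokens` members, the masked position could be located with `(enc["input_ids"][0] == tokenizer.mask_token_id).nonzero()`, and `tokenizer.convert_ids_to_tokens` passed as `id_to_token`.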

Fine-tuning

Fine-tuning follows standard HF conventions. For sequence-level tasks, use the CLS-token embedding, cls_emb = out.last_hidden_state[:, 0, :], as input to a task-specific head.
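
A minimal sketch of such a head is shown below, assuming a simple dropout-plus-linear classifier over the CLS embedding. The class name, label count, and dropout rate are illustrative choices, not part of the port, and the random tensor stands in for `out.last_hidden_state`.

```python
import torch
from torch import nn


class SequenceClassifierHead(nn.Module):
    """Hypothetical head: dropout followed by a linear layer on the CLS embedding."""

    def __init__(self, hidden_size=2048, num_labels=2, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(hidden_size, num_labels)

    def forward(self, last_hidden_state):
        cls_emb = last_hidden_state[:, 0, :]   # (batch, hidden_size)
        return self.proj(self.dropout(cls_emb))


head = SequenceClassifierHead()
dummy_hidden = torch.randn(4, 18, 2048)   # stand-in for out.last_hidden_state
class_logits = head(dummy_hidden)         # shape (4, 2)
```

During fine-tuning, the head's output would typically feed a standard loss such as `nn.CrossEntropyLoss`, with the backbone either frozen or trained jointly.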

Implementation Notes

The original genbio-ai/AIDO.RNA-1.6B checkpoint requires the ModelGenerator package to load. This port is a clean standalone re-implementation:

All model logic is contained in modeling_aidorna.py and configuration_aidorna.py.
Support for attn_implementation="sdpa" and attn_implementation="flash_attention_2" is added; neither is present in the original genbio-ai implementation.
Architecture: pre-LN Transformer with SwiGLU MLP and RoPE positional embeddings.

Citation

@article{zou2024_aidorna,
  title   = {A Large-Scale Foundation Model for {RNA} Function and Structure Prediction},
  author  = {Zou, Shuxian and Tao, Tianhua and Mahbub, Sazan and Ellington, Caleb N. and Algayres, Robin and Li, Dian and Zhuang, Yonghao and Wang, Hongyi and Song, Le and Xing, Eric P.},
  journal = {bioRxiv},
  year    = {2024},
  doi     = {10.1101/2024.11.28.625345}
}

Credits

Original model and code by Zou et al. Source: GitHub. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.

License

GenBio AI Community License, following the original repository. See LICENSE for details.
