AIDO.RNA-650M-CDS

AIDO.RNA-650M-CDS is a 650M-parameter RNA language model fine-tuned on coding sequences (CDS). This repository is a standalone Hugging Face port that loads without the ModelGenerator package.

Architecture

| Parameter | Value |
| --- | --- |
| Layers | 33 |
| Attention heads | 20 |
| Embedding dimension | 1280 |
| Intermediate (MLP) size | 3392 |
| Vocabulary size | 16 |
| Positional encoding | RoPE (rotary_percent=1.0) |
| Normalization | LayerNorm |
| MLP activation | SwiGLU |
| Architecture | Pre-LN Transformer |
| Max sequence length | 1024 (training truncation; RoPE has no hard limit) |
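As a rough sanity check on the dimensions above, the sketch below estimates the parameter count. It assumes (none of this is stated in the model card) bias-free projections, four attention projection matrices per layer, and three SwiGLU weight matrices (gate, up, down); LayerNorm weights and any output head are ignored.

```python
# Back-of-the-envelope parameter count from the architecture table.
# Assumptions (not from the model card): no biases, 4 attention
# projections (Q, K, V, output), 3 SwiGLU matrices (gate, up, down).
LAYERS = 33
HEADS = 20
D_MODEL = 1280
D_MLP = 3392
VOCAB = 16

head_dim = D_MODEL // HEADS                 # width of each attention head
attn_params = 4 * D_MODEL * D_MODEL         # Q, K, V and output projections
mlp_params = 3 * D_MODEL * D_MLP            # gate, up and down projections
per_layer = attn_params + mlp_params
embedding_params = VOCAB * D_MODEL          # token embedding table
total = LAYERS * per_layer + embedding_params

print(f"head_dim = {head_dim}")
print(f"per layer = {per_layer:,}")
print(f"total = {total:,}")
```

Under these assumptions each head is 64 wide and the estimate comes to roughly 646M parameters, which is consistent with the 650M label.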

Vocabulary: [PAD], [MASK], [CLS], [SEP], [UNK], A, G, C, T, U, N, [BOS], [EOS], [UNUSED1], [UNUSED2], [UNUSED3]
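Because the vocabulary has only six nucleotide tokens, it can help to check input sequences before tokenizing. The sketch below is a hypothetical helper (`unexpected_characters` is not part of this repository) that reports characters outside the nucleotide tokens listed above. How the tokenizer itself treats such characters, for example lowercase letters, is not stated here, so the check is deliberately strict.

```python
# Nucleotide tokens taken from the vocabulary list above.
NUCLEOTIDE_TOKENS = frozenset("AGCTUN")


def unexpected_characters(sequence: str) -> set[str]:
    """Return the characters in `sequence` that are not nucleotide tokens."""
    return set(sequence) - NUCLEOTIDE_TOKENS


print(unexpected_characters("AUGGCUNNA"))  # set(): every character is in the vocabulary
print(unexpected_characters("augXCU"))     # lowercase letters and X are reported
```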

Pretraining

  • Objective: Masked language modeling (MLM) on RNA sequences
  • Data: 42M ncRNA sequences (pre-training) + CDS fine-tuning
  • Source checkpoint: genbio-ai/AIDO.RNA-650M-CDS

Checkpoint selection

Prefer this checkpoint over the base AIDO.RNA-650M for tasks involving coding regions or translation.

Parity Verification

Hidden-state representations were compared against the original genbio-ai/AIDO.RNA-650M-CDS weights at all 34 representation levels (the embedding output plus 33 transformer layers). The maximum difference is 4.77e-06 at the final output, within the 1e-5 tolerance, and 3.81e-05 across the intermediate layers. The larger intermediate differences are floating-point accumulation noise that the final layer norm normalises away. Verified on GPU with PyTorch 2.7 / CUDA 12.
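A comparison of this kind can be sketched as follows. This is not the script used for the verification above; the helper name `max_abs_diff_per_level` is hypothetical, the use of the maximum absolute difference is an assumption, and random tensors stand in for the hidden states of the two models, since loading the original checkpoint requires ModelGenerator.

```python
import torch


def max_abs_diff_per_level(reference, candidate):
    """Max absolute difference at each representation level.

    `reference` and `candidate` are sequences of tensors with matching
    shapes, e.g. the `hidden_states` tuples of two models run on the
    same input.
    """
    if len(reference) != len(candidate):
        raise ValueError("hidden-state sequences differ in length")
    return [
        (r.float() - c.float()).abs().max().item()
        for r, c in zip(reference, candidate)
    ]


# Toy stand-in for real hidden states: 34 levels of shape (batch, seq_len, 1280).
torch.manual_seed(0)
ref = [torch.randn(2, 8, 1280) for _ in range(34)]
cand = [h + 1e-6 * torch.randn_like(h) for h in ref]  # small synthetic noise

diffs = max_abs_diff_per_level(ref, cand)
print(f"levels compared: {len(diffs)}")
print(f"final-output max diff: {diffs[-1]:.2e}")
```

With real models, `ref` and `cand` would be the `hidden_states` tuples returned when both are called with `output_hidden_states=True` on identical inputs.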

Related Models

See the full AIDO.RNA collection.

| Model | Parameters | Data | Notes |
| --- | --- | --- | --- |
| AIDO.RNA-1M-MARS | 1M | MARS ncRNA | Smallest MARS variant |
| AIDO.RNA-25M-MARS | 25M | MARS ncRNA | Mid-size MARS variant |
| AIDO.RNA-300M-MARS | 300M | MARS ncRNA | Large MARS variant |
| AIDO.RNA-650M | 650M | 42M ncRNA | Base model |
| AIDO.RNA-650M-CDS | 650M | 42M ncRNA + CDS | CDS-adapted |
| AIDO.RNA-1.6B | 1.6B | 42M ncRNA | Largest base model |
| AIDO.RNA-1.6B-CDS | 1.6B | 42M ncRNA + CDS | Largest CDS-adapted |

Usage

Embedding generation

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/AIDO.RNA-650M-CDS", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/AIDO.RNA-650M-CDS", trust_remote_code=True)
model.eval()

sequences = ["ACGUGCUAGCUAGCUA", "AUGCUAGCUAGCUAGC"]
enc = tokenizer(sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**enc)

cls_emb   = out.last_hidden_state[:, 0, :]   # (batch, 1280) -- CLS token
token_emb = out.last_hidden_state             # (batch, seq_len, 1280)

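# Intermediate layers: hidden_states has 34 entries (embedding output + 33 layers)
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer3_emb = out_all.hidden_states[3]         # (batch, seq_len, 1280)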
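Since the batch above is tokenized with padding=True, averaging token_emb directly would also average over padding positions. As an alternative to the CLS embedding, the sketch below shows masked mean pooling. It is a common technique rather than something this repository prescribes, `masked_mean_pool` is a hypothetical helper, and it assumes the tokenizer returns an `attention_mask` as Hugging Face tokenizers usually do. Special tokens such as [CLS] are included in the average.

```python
import torch


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over non-padding positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


# Toy tensors standing in for out.last_hidden_state and enc["attention_mask"].
hidden = torch.ones(2, 4, 1280)
hidden[1, 2:, :] = 100.0                       # values at padded positions
mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])

pooled = masked_mean_pool(hidden, mask)
print(pooled.shape)  # torch.Size([2, 1280])
```

With the real model, the call would be `masked_mean_pool(out.last_hidden_state, enc["attention_mask"])`.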

MLM logits

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/AIDO.RNA-650M-CDS", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/AIDO.RNA-650M-CDS", trust_remote_code=True)
model.eval()

enc = tokenizer(["ACGU[MASK]GCUA"], return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits   # (1, seq_len, 16)
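To turn the logits into predictions, one option is to take the argmax at each masked position. The sketch below uses toy tensors so that it runs without downloading the model; `predict_masked_tokens` is a hypothetical helper and the token ids in the example are made up. In practice, use `tokenizer.mask_token_id` rather than the placeholder id.

```python
import torch


def predict_masked_tokens(logits, input_ids, mask_token_id):
    """Return (positions, predicted_ids) for every masked position.

    `positions` has shape (num_masks, 2) and holds (batch_index, token_index).
    """
    positions = (input_ids == mask_token_id).nonzero(as_tuple=False)
    predicted_ids = logits[positions[:, 0], positions[:, 1]].argmax(dim=-1)
    return positions, predicted_ids


# Toy example: ids are invented for illustration only.
MASK_ID = 1                                   # hypothetical; use tokenizer.mask_token_id
input_ids = torch.tensor([[2, 5, 7, 1, 6, 3]])
logits = torch.zeros(1, 6, 16)
logits[0, 3, 9] = 5.0                         # make id 9 the top prediction at the mask

positions, predicted = predict_masked_tokens(logits, input_ids, MASK_ID)
print(positions.tolist(), predicted.tolist())  # [[0, 3]] [9]
```

Assuming the tokenizer follows the usual Hugging Face interface, `tokenizer.convert_ids_to_tokens(predicted.tolist())` maps the predicted ids back to tokens.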

Fine-tuning

Fine-tuning follows standard Hugging Face conventions. For sequence-level tasks, feed the CLS-token embedding, cls_emb = out.last_hidden_state[:, 0, :], into a task-specific head.
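A task-specific head of this kind can be as small as a dropout layer followed by a linear layer. The following is a minimal sketch, not part of this repository: the class name, the dropout rate, and the number of classes are all illustrative assumptions, and a random tensor stands in for the CLS embedding.

```python
import torch
from torch import nn


class SequenceClassificationHead(nn.Module):
    """Hypothetical head mapping a CLS embedding to class logits."""

    def __init__(self, hidden_size: int = 1280, num_classes: int = 2, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, cls_emb: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.dropout(cls_emb))


head = SequenceClassificationHead(num_classes=3)
cls_emb = torch.randn(4, 1280)   # stand-in for out.last_hidden_state[:, 0, :]
logits = head(cls_emb)
print(logits.shape)              # torch.Size([4, 3])
```

During fine-tuning, the head's parameters would be optimised together with, or on top of a frozen copy of, the encoder, depending on the task and the compute budget.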

Implementation Notes

The original genbio-ai/AIDO.RNA-650M-CDS checkpoint requires the ModelGenerator package to load. This port is a clean standalone re-implementation:

  • All model logic is contained in modeling_aidorna.py and configuration_aidorna.py.
  • attn_implementation="sdpa" and attn_implementation="flash_attention_2" are supported; neither is available in the original genbio-ai implementation.
  • Architecture: pre-LN Transformer with SwiGLU MLP and RoPE positional embeddings.
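Selecting an attention backend might look like the sketch below. `attn_implementation` and `torch_dtype` are standard `from_pretrained` keyword arguments in transformers; the wrapper `load_model` and its defaults are hypothetical. FlashAttention-2 generally requires a CUDA GPU, the flash-attn package, and half precision (fp16 or bf16). The parity figures reported above may not carry over to other backends or dtypes.

```python
import torch
from transformers import AutoModel

REPO_ID = "Taykhoom/AIDO.RNA-650M-CDS"


def load_model(attn_implementation: str = "sdpa", dtype: torch.dtype = torch.float32):
    """Load the port with a chosen attention backend (hypothetical helper)."""
    model = AutoModel.from_pretrained(
        REPO_ID,
        trust_remote_code=True,
        attn_implementation=attn_implementation,
        torch_dtype=dtype,
    )
    return model.eval()
```

For example, `load_model("sdpa")` keeps full precision, while `load_model("flash_attention_2", torch.bfloat16)` would be the half-precision variant on a supported GPU.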

Citation

@article{zou2024_aidorna,
  title   = {A Large-Scale Foundation Model for {RNA} Function and Structure Prediction},
  author  = {Zou, Shuxian and Tao, Tianhua and Mahbub, Sazan and Ellington, Caleb N. and Algayres, Robin and Li, Dian and Zhuang, Yonghao and Wang, Hongyi and Song, Le and Xing, Eric P.},
  journal = {bioRxiv},
  year    = {2024},
  doi     = {10.1101/2024.11.28.625345}
}

Credits

Original model and code by Zou et al. Source: GitHub. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.

License

GenBio AI Community License, following the original repository. See LICENSE for details.
