
structRFM: Structure-guided RNA Foundation Model

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/) [![PyTorch](https://img.shields.io/badge/PyTorch-2.0.1-ee4c2c.svg)](https://pytorch.org/)

bioRxiv | PDF | GitHub | PyPI

Overview

structRFM is a fully open-source structure-guided RNA foundation model that integrates sequence and structural knowledge through innovative pre-training strategies. By leveraging 21 million sequence-structure pairs and a novel Structure-guided Masked Language Modeling (SgMLM) approach, structRFM achieves state-of-the-art performance across a broad spectrum of RNA structural and functional inference tasks, setting new benchmarks for reliability and generalizability.
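
The exact SgMLM procedure is defined in the paper and the repository. The sketch below only illustrates the general idea of mixing sequence-level and structure-level masking, under the assumption that structures are given in dot-bracket notation without pseudoknots. The function names, `mask_ratio`, `p_structure`, and the `[MASK]` placeholder are hypothetical and are not taken from the structRFM code.

```python
import random


def parse_pairs(dot_bracket):
    """Map each paired position to its partner for a dot-bracket string."""
    stack, pairs = [], {}
    for i, ch in enumerate(dot_bracket):
        if ch == '(':
            stack.append(i)
        elif ch == ')':
            j = stack.pop()
            pairs[i], pairs[j] = j, i
    return pairs


def mask_sequence(seq, dot_bracket, mask_ratio=0.15, p_structure=0.5, rng=None):
    """Return (tokens, masked_positions) for one illustrative masking draw."""
    rng = rng or random.Random()
    n_mask = max(1, round(len(seq) * mask_ratio))
    pairs = parse_pairs(dot_bracket)
    masked = set()
    if pairs and rng.random() < p_structure:
        # Structure-level draw: mask both partners of randomly chosen base pairs.
        left_ends = [i for i, j in pairs.items() if i < j]
        rng.shuffle(left_ends)
        for i in left_ends:
            if len(masked) >= n_mask:
                break
            masked.update((i, pairs[i]))
    else:
        # Sequence-level draw: mask positions uniformly at random.
        masked.update(rng.sample(range(len(seq)), n_mask))
    tokens = ['[MASK]' if i in masked else ch for i, ch in enumerate(seq)]
    return tokens, sorted(masked)


tokens, masked = mask_sequence('GGGAAACCC', '(((...)))', p_structure=1.0)
print(tokens, masked)
```

With `p_structure=1.0` every draw masks whole base pairs, so the model would have to recover both partners from context; with `p_structure=0.0` the sketch reduces to ordinary random masking.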

Figure: Overview of architecture and downstream applications

Key Features

  • Structure-Guided Pre-Training: SgMLM strategy dynamically balances sequence-level and structure-level masking, capturing base-pair interactions without task-specific biases.
  • Multi-Source Structure Ensemble: MUSES (Multi-source ensemble of secondary structures) integrates thermodynamics-based, probability-based, and deep learning-based predictors to mitigate annotation biases.
  • Versatile Feature Output: Generates classification-level, sequence-level, and pairwise matrix features to support sequence-wise, nucleotide-wise, and structure-wise tasks.
  • State-of-the-Art Performance: Achieves state-of-the-art performance on zero-shot, secondary structure prediction, tertiary structure prediction, and function prediction tasks.
  • Zero-Shot Capability: Ranks top 4 in zero-shot homology classification across Rfam and ArchiveII datasets, with strong secondary structure prediction without labeled data.
  • Long RNA Handling: Overlapping sliding window strategy enables high-accuracy classification of long non-coding RNAs (lncRNAs) up to 3,000 nt.
  • Fully Open Resources: 21M sequence-structure dataset, pre-trained models, and fine-tuned checkpoints are publicly available for the research community.
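
The overlapping sliding window idea for long RNAs can be sketched as follows. This is a minimal illustration, not the structRFM implementation: the function name, the `window=512` and `stride=256` defaults, and the choice to add one final window flush with the 3' end are all assumptions. How per-window predictions are aggregated is not shown here.

```python
def sliding_windows(seq, window=512, stride=256):
    """Split seq into overlapping (start, subsequence) windows covering every position."""
    if not 0 < stride <= window:
        raise ValueError('stride must be in the range 1..window')
    if len(seq) <= window:
        return [(0, seq)]
    starts = list(range(0, len(seq) - window + 1, stride))
    # Add a last window aligned to the end if the regular grid stops short.
    if starts[-1] + window < len(seq):
        starts.append(len(seq) - window)
    return [(s, seq[s:s + window]) for s in starts]


long_rna = 'ACGU' * 750  # toy 3,000 nt sequence
windows = sliding_windows(long_rna)
print(len(windows), windows[0][0], windows[-1][0])
```

Each window can then be passed to the model separately, and the window-level outputs combined (for example by averaging) to classify the full sequence.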

Quick Start

AutoModel and AutoTokenizer

Install package: pip install transformers

import os

from transformers import AutoModel, AutoTokenizer

model_path = 'heqin-zhu/structRFM'
# model_path = os.getenv('structRFM_checkpoint')

model = AutoModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# single sequence
seq = 'GUCCCAACUCUUGCGGGGAGGGAU'
inputs = tokenizer(seq, return_tensors="pt")
outputs = model(**inputs)
print('>>> single seq, length:', len(seq))
for k, v in outputs.items():
    print(k, v.shape)
print(outputs.last_hidden_state.shape)

# batch mode
seqs = ["GUCCCAA", 'AGUGUUG', 'AUGUAGUTCUN']
inputs = tokenizer(
    seqs,
    add_special_tokens=True,
    max_length=512,
    padding='max_length',
    truncation=True,
    return_tensors='pt',
)
outputs = model(**inputs)  # note: the sequence-level features are padded to max_length
print('>>> batch seqs, batch:', len(seqs))
for k, v in outputs.items():
    print(k, v.shape)

'''
>>> single seq, length: 24
last_hidden_state torch.Size([1, 24, 768])
pooler_output torch.Size([1, 768])
torch.Size([1, 24, 768])
>>> batch seqs, batch: 3
last_hidden_state torch.Size([3, 512, 768])
pooler_output torch.Size([3, 768])
'''
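
Because batch outputs are padded to `max_length`, averaging `last_hidden_state` directly would mix padding positions into the result. The sketch below shows one common way to pool only over real tokens. It assumes the tokenizer returns an `attention_mask` (standard for Hugging Face tokenizers, but not confirmed here for this model); `masked_mean_pool` is a hypothetical helper, and the toy tensors stand in for real model outputs.

```python
import torch


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token features over non-padding positions only."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # [B, L, 1]
    summed = (last_hidden_state * mask).sum(dim=1)                   # [B, D]
    counts = mask.sum(dim=1).clamp(min=1.0)                          # [B, 1]
    return summed / counts


# Toy stand-ins for outputs.last_hidden_state and inputs['attention_mask'].
hidden = torch.randn(3, 512, 768)
attention_mask = torch.zeros(3, 512, dtype=torch.long)
for row, n_tokens in enumerate([9, 9, 13]):
    attention_mask[row, :n_tokens] = 1

pooled = masked_mean_pool(hidden, attention_mask)
print(pooled.shape)  # torch.Size([3, 768])
```

With the real model this would be called as `masked_mean_pool(outputs.last_hidden_state, inputs['attention_mask'])`. Note that the mask also covers any special tokens the tokenizer adds.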

Manual Usage

  1. Install packages.
pip install transformers structRFM BPfold
  2. Download and decompress the pretrained structRFM checkpoint (~300 MB).
wget https://github.com/heqin-zhu/structRFM/releases/latest/download/structRFM_checkpoint.tar.gz
tar -xzf structRFM_checkpoint.tar.gz
  3. Set the environment variable structRFM_checkpoint.
export structRFM_checkpoint=PATH_TO_CHECKPOINT # add this line to ~/.bashrc to make it permanent
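
The examples in the following sections read the checkpoint location with `os.getenv('structRFM_checkpoint')`, which silently returns `None` when the variable is unset. A small guard like the one below can fail early with a clearer message. `resolve_checkpoint` is a hypothetical helper and is not part of the structRFM package.

```python
import os
from pathlib import Path


def resolve_checkpoint(env_var='structRFM_checkpoint'):
    """Return the checkpoint path from the environment, or raise a clear error."""
    value = os.getenv(env_var)
    if not value:
        raise RuntimeError(f'{env_var} is not set; export it before running')
    path = Path(value).expanduser()
    if not path.exists():
        raise FileNotFoundError(f'{env_var} points to a missing path: {path}')
    return path
```

The returned path can be converted with `str(...)` and used wherever the examples pass `from_pretrained`.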

Extracting Features

import os

from structRFM.infer import structRFM_infer

from_pretrained = os.getenv('structRFM_checkpoint')
model_paras = dict(max_length=514, dim=768, layer=12, num_attention_heads=12)
model = structRFM_infer(from_pretrained=from_pretrained, **model_paras)

seq = 'AGUACGUAGUA'

print('seq len:', len(seq))
feat_dic = model.extract_feature(seq)
for k, v in feat_dic.items():
    print(k, v.shape)

'''
seq len: 11
cls_feat torch.Size([768])
seq_feat torch.Size([11, 768])
mat_feat torch.Size([11, 11])
'''
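
One simple way to use `cls_feat` for zero-shot comparison is to compute cosine similarities between the embeddings of several sequences. The sketch below is an illustration only and does not reproduce the evaluation protocol of the paper; `cosine_similarity_matrix` is a hypothetical helper, and random vectors stand in for real features.

```python
import torch
import torch.nn.functional as F


def cosine_similarity_matrix(cls_feats):
    """Pairwise cosine similarity between a list of 1-D feature vectors."""
    stacked = torch.stack(list(cls_feats))   # [N, D]
    normed = F.normalize(stacked, dim=-1)    # unit-length rows
    return normed @ normed.T                 # [N, N]


# With the real model (assumed usage, based on extract_feature above):
# cls_feats = [model.extract_feature(s)['cls_feat'].detach().cpu() for s in seqs]
cls_feats = [torch.randn(768) for _ in range(3)]  # toy stand-ins
sim = cosine_similarity_matrix(cls_feats)
print(sim.shape)  # torch.Size([3, 3])
```

Higher off-diagonal values indicate more similar embeddings; whether that tracks homology for a given dataset needs to be checked empirically.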

Building Model and Tokenizer

import os

from structRFM.model import get_structRFM
from structRFM.data import preprocess_and_load_dataset, get_mlm_tokenizer

from_pretrained = os.getenv('structRFM_checkpoint')

tokenizer = get_mlm_tokenizer(max_length=514)
model = get_structRFM(dim=768, layer=12, num_attention_heads=12, from_pretrained=from_pretrained, pretrained_length=None, max_length=514, tokenizer=tokenizer)

Pre-training and Fine-tuning

Requirements

  • Python 3.8+
  • Anaconda

Installation

  1. Clone the GitHub repo.
git clone https://github.com/heqin-zhu/structRFM.git
cd structRFM
  2. Create and activate the conda environment.
conda env create -f structRFM_environment.yaml
conda activate structRFM
  3. Download and decompress the pretrained structRFM checkpoint (~300 MB).
wget https://github.com/heqin-zhu/structRFM/releases/latest/download/structRFM_checkpoint.tar.gz
tar -xzf structRFM_checkpoint.tar.gz
  4. Set the environment variable structRFM_checkpoint.
export structRFM_checkpoint=PATH_TO_CHECKPOINT # add this line to ~/.bashrc to make it permanent

Download sequence-structure dataset

The pre-training sequence-structure dataset is constructed using RNAcentral and BPfold. We keep only sequences of at most 512 nt, which yields about 21 million sequence-structure pairs. It can be downloaded from Zenodo (4.5 GB).
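
When preparing your own sequence-structure pairs in the same spirit, a basic sanity check can catch malformed entries before training. The sketch below assumes dot-bracket structures and applies the 512 nt limit described above; `is_valid_pair` is a hypothetical helper, it does not describe the file format of the Zenodo dataset, and it only checks round brackets (other symbols are ignored).

```python
def is_valid_pair(seq, structure, max_nt=512):
    """Check length limit, equal lengths, and balanced round brackets."""
    if not seq or len(seq) > max_nt or len(seq) != len(structure):
        return False
    depth = 0
    for ch in structure:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:  # closing bracket without a matching opener
                return False
    return depth == 0


pairs = [
    ('GGGAAACCC', '(((...)))'),
    ('GGGAAACCC', '(((...))'),   # length mismatch
    ('GGGAAACCC', '((.....))'),
]
kept = [(s, st) for s, st in pairs if is_valid_pair(s, st)]
print(len(kept))  # 2
```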

Run Pre-training

  • Modify the variables USER_DIR and PROGRAM_DIR in scripts/run.sh.
  • Specify DATA_PATH and run_name in the following command.

Then run:

bash scripts/run.sh --batch_size 96 --epoch 100 --lr 0.0001 --tag mlm --mlm_structure --max_length 514 --model_scale base --data_path DATA_PATH --run_name structRFM_512

For more information, run python3 main.py -h.

Downstream Tasks Fine-tuning

Download all data (3.7 GB) and checkpoints (2.2 GB) from Zenodo, then place them in the corresponding folder for each task.

Acknowledgement

We appreciate the following open-source projects for their valuable contributions:

License

MIT License

Citation

If you find our work helpful, please cite our paper:

@article {structRFM,
    author = {Zhu, Heqin and Li, Ruifeng and Zhang, Feng and Tang, Fenghe and Ye, Tong and Li, Xin and Gu, Yujie and Xiong, Peng and Zhou, S Kevin},
    title = {A fully-open structure-guided RNA foundation model for robust structural and functional inference},
    elocation-id = {2025.08.06.668731},
    year = {2025},
    doi = {10.1101/2025.08.06.668731},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2025/08/07/2025.08.06.668731},
    journal = {bioRxiv}
}

Model size: 86.1M params (Safetensors, tensor type F32)