RNAIX
A deep learning model for RNA 3D structure prediction using diffusion and multi-modal embeddings. Developed for the Stanford RNA 3D Folding Kaggle competition.
RNAIX is heavily inspired by, and builds upon, the RibonanzaNet and Protenix models.
Description
RNAIX is a deep learning model designed to predict RNA 3D structures by integrating multiple sources of information, including sequence data, MSA-derived embeddings, frequency profiles, and structural priors from external predictions. It is built around a Pairformer-based encoder and uses a diffusion process to generate 3D coordinates.
Features
- Sequence and MSA-based token embeddings
- Alignment-derived frequency profile embeddings
- Structural priors from Protenix predictions
- Pairformer backbone with recycling
- Diffusion-based coordinate generation head
Architecture
The model consists of the following core modules:
- RNASequenceEmbedder: token-level embedding of the input RNA sequence
- MSAEmbedder: embedding of multiple sequence alignment (MSA) inputs
- MSAProfileEmbedder: embedding of alignment-derived frequency profiles
- ProtenixStructuralEncoder: encodes template structure predictions from Protenix
- RNAIX: feature fusion and prediction model using a Pairformer backbone with recycling and loss on 3D coordinates
Inputs
RNAIX takes a single RNA target with the following inputs:
- RNA sequence: Tokenized primary RNA sequence
- MSA matrix: Tokenized multiple sequence alignment
- Protenix coordinates: Predicted 3D structure used as a structural prior
The model assumes that the MSA and the Protenix structure prediction have been precomputed for each target.
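As a rough illustration of how the tokenized sequence, the tokenized MSA, and an alignment-derived frequency profile relate to each other, here is a minimal pure-Python sketch. The vocabulary `VOCAB`, the `UNK` index, and the helper names `tokenize` and `msa_profile` are assumptions made for this example; they are not taken from the RNAIX code base, whose actual token mapping may differ.

```python
# Hypothetical vocabulary: the real RNAIX token mapping may differ.
VOCAB = {"A": 0, "C": 1, "G": 2, "U": 3, "-": 4}
UNK = 5  # index used for any other symbol, e.g. N


def tokenize(seq):
    """Map an RNA string to integer tokens; a DNA-style T is read as U."""
    return [VOCAB.get(ch, UNK) for ch in seq.upper().replace("T", "U")]


def msa_profile(msa):
    """Return per-column token frequencies for equal-length aligned rows."""
    tokens = [tokenize(row) for row in msa]
    length = len(tokens[0])
    if any(len(row) != length for row in tokens):
        raise ValueError("all MSA rows must have the same length")
    n_tokens = len(VOCAB) + 1  # vocabulary plus the UNK slot
    profile = [[0.0] * n_tokens for _ in range(length)]
    for row in tokens:
        for col, tok in enumerate(row):
            profile[col][tok] += 1.0 / len(tokens)
    return profile


# Toy alignment: the first row is the target sequence.
msa = ["GGAC", "GGAU", "G-AC"]
sequence_tokens = tokenize(msa[0])
profile = msa_profile(msa)
```

Each row of `profile` corresponds to one alignment column and sums to 1, so it can be read as a distribution over tokens at that position.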
Usage
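import torch
from rnaix.model.model import RNAIX

path_checkpoint = "../sample_model/model_v01.pt"
device = "cuda" if torch.cuda.is_available() else "cpu"

# weights_only=False unpickles arbitrary Python objects (needed here for the
# stored config), so only load checkpoints from a source you trust.
checkpoint = torch.load(path_checkpoint, map_location=device, weights_only=False)

# Rebuild the model from the saved config, then restore the trained weights.
model = RNAIX(checkpoint["config"])
model.load_state_dict(checkpoint["model_state_dict"])
model.to(device)
model.eval()  # inference mode: disables dropout and similar training-only behavior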
Training
RNAIX was trained on the Stanford RNA 3D Folding dataset, using only sequences with complete 3D coordinate annotations. Sequences with missing coordinates were excluded during training.
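The exclusion rule described above can be sketched as a simple filter. This is only an illustration under stated assumptions: the `targets` dictionary layout, the use of NaN or `None` to mark missing coordinates, and the helper names `has_complete_coords` and `filter_complete` are hypothetical and do not reflect the actual RNAIX data pipeline.

```python
import math


def has_complete_coords(coords):
    """True if every residue has a finite (x, y, z) triple."""
    return len(coords) > 0 and all(
        xyz is not None
        and len(xyz) == 3
        and all(math.isfinite(v) for v in xyz)
        for xyz in coords
    )


def filter_complete(targets):
    """Keep targets with one finite coordinate triple per residue."""
    return {
        name: t
        for name, t in targets.items()
        if len(t["coords"]) == len(t["sequence"])
        and has_complete_coords(t["coords"])
    }


# Toy data (hypothetical layout): missing coordinates are marked with NaN.
targets = {
    "complete": {"sequence": "GA", "coords": [(0.0, 0.0, 0.0), (5.9, 0.0, 0.0)]},
    "partial": {"sequence": "GA", "coords": [(0.0, 0.0, 0.0), (float("nan"),) * 3]},
}
kept = filter_complete(targets)
```

Here only the `complete` entry survives, since the `partial` entry has a residue without finite coordinates.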
Limitations
- Requires precomputed MSAs and Protenix structure predictions
- Does not support training on sequences with partially missing coordinates