language:
  - rna
library_name: transformers
tags:
  - RNA
  - language-model
  - splicing
license: mit

SpliceBERT-1024nt

SpliceBERT is a BERT-based RNA language model pre-trained on over 2 million vertebrate primary RNA sequences using a masked language modeling (MLM) objective. The 1024nt variant is trained on variable-length fragments (64-1024 nt) from 72 vertebrates.

Architecture

| Parameter | Value |
|---|---|
| Architecture | BERT encoder |
| Layers | 6 |
| Attention heads | 16 |
| Embedding dimension | 512 |
| Intermediate dimension | 2048 |
| Vocabulary size | 10 |
| Positional encoding | Learned absolute |
| Max sequence length | 1024 |
| Parameters | ~19.4M |

Vocabulary: [PAD]=0, [UNK]=1, [CLS]=2, [SEP]=3, [MASK]=4, N=5, A=6, C=7, G=8, T/U=9
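
To make the mapping concrete, here is a minimal pure-Python sketch that encodes a raw sequence into these IDs. It only illustrates the vocabulary and is not the tokenizer shipped with the model; the `encode` helper and the choice to map unrecognized characters to [UNK] are assumptions.

```python
# Vocabulary as listed above; T and U share ID 9.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
         "N": 5, "A": 6, "C": 7, "G": 8, "T": 9}

def encode(seq: str) -> list[int]:
    """Map a raw RNA/DNA string to token IDs, wrapped in [CLS] ... [SEP]."""
    ids = [VOCAB["[CLS]"]]
    for base in seq.upper().replace("U", "T"):
        # Assumption: characters outside N/A/C/G/T become [UNK].
        ids.append(VOCAB.get(base, VOCAB["[UNK]"]))
    ids.append(VOCAB["[SEP]"])
    return ids

print(encode("ACGU"))  # [2, 6, 7, 8, 9, 3]
```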

Pretraining

  • Objective: Masked language modeling (MLM)
  • Data: >2 million vertebrate primary RNA sequences from 72 species
  • Sequence format: Single-nucleotide tokenization with spaces; U converted to T
  • Source checkpoint: SpliceBERT.1024nt/pytorch_model.bin (from zenodo:7995778)
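
As an illustration of the MLM objective on this sequence format, the sketch below corrupts a sequence and records the targets to predict. The card does not state the masking rate or replacement scheme, so the original BERT recipe is assumed here (15% of positions selected; of those, 80% replaced by [MASK], 10% by a random nucleotide, 10% left unchanged). `mask_sequence` and its parameters are hypothetical names, not part of the SpliceBERT codebase.

```python
import random

NUCLEOTIDES = ["A", "C", "G", "T"]

def mask_sequence(seq, mask_prob=0.15, seed=None):
    """Return (model_input, labels) for one MLM training example.

    model_input is a space-separated token string; labels holds the original
    nucleotide at positions selected for prediction and None elsewhere.
    """
    rng = random.Random(seed)
    tokens = list(seq.upper().replace("U", "T"))  # U converted to T
    labels = [None] * len(tokens)
    for i, base in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = base
        roll = rng.random()
        if roll < 0.8:
            tokens[i] = "[MASK]"
        elif roll < 0.9:
            tokens[i] = rng.choice(NUCLEOTIDES)
        # Otherwise keep the original token (assumed 10% case).
    return " ".join(tokens), labels
```

The loss would then be computed only at positions where `labels` is not None.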

Checkpoint selection

The 1024nt variant is the primary SpliceBERT model, trained on variable-length vertebrate sequences; use it for general-purpose RNA embedding. The 510nt variants were trained on fixed-length fragments and require inputs of exactly 510 nt.
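
Transcripts longer than the context have to be split before embedding. The helper below is a minimal sketch of one way to do that; `split_into_windows` is a hypothetical name. The card does not say whether [CLS] and [SEP] count against the 1024 limit, so the window size is left as a required argument rather than a default.

```python
def split_into_windows(seq, window, stride=None, drop_short=False):
    """Split seq into consecutive windows of at most `window` nucleotides.

    stride defaults to `window` (no overlap). With drop_short=True, a trailing
    window shorter than `window` is discarded, which suits fixed-length models.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    stride = window if stride is None else stride
    if stride <= 0:
        raise ValueError("stride must be positive")
    chunks = []
    for start in range(0, len(seq), stride):
        chunk = seq[start:start + window]
        if len(chunk) < window and drop_short:
            continue
        chunks.append(chunk)
        if start + window >= len(seq):
            break  # the end of the sequence has been covered
    return chunks
```

For this model, something like `split_into_windows(seq, window=1024)` would apply; for the fixed-length variants, `split_into_windows(seq, window=510, drop_short=True)` keeps only full 510 nt windows.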

Parity Verification

Hidden-state representations were verified against the original checkpoint (maximum absolute difference < 1e-5) at all 7 representation levels (embedding output plus 6 transformer layers), for both the eager and sdpa attention backends. Verification was run on GPU with PyTorch 2.7 / CUDA 11.8.
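
A check of this kind can be sketched as follows. This is not the script used for the verification above, only an illustration of the metric; `max_abs_diff` is a hypothetical helper that takes the `hidden_states` tuples returned by two models called with `output_hidden_states=True`.

```python
import torch

def max_abs_diff(reference_states, ported_states):
    """Largest absolute element-wise difference across paired hidden states."""
    if len(reference_states) != len(ported_states):
        raise ValueError("both models must return the same number of levels")
    return max(
        (ref.float() - new.float()).abs().max().item()
        for ref, new in zip(reference_states, ported_states)
    )
```

With 6 layers, each tuple has 7 entries, and parity at the tolerance quoted above corresponds to `max_abs_diff(...) < 1e-5`.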

Related Models

See the full SpliceBERT collection.

| Model | Context | Training data | Notes |
|---|---|---|---|
| SpliceBERT-1024nt | 1024 nt | 72 vertebrates | This model |
| SpliceBERT-510nt | 510 nt (fixed) | 72 vertebrates | Fixed-length; requires exact 510 nt input |
| SpliceBERT-human-510nt | 510 nt (fixed) | Human only | Human-specific; requires exact 510 nt input |

Usage

Embedding generation

The tokenizer automatically handles U->T conversion and single-nucleotide spacing. Pass raw sequences directly.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model.eval()

seq = "ACGUACGUACGUACGU"  # U->T handled automatically
enc = tokenizer(seq, return_tensors="pt")

with torch.no_grad():
    out = model(**enc, output_hidden_states=True)

# Mean pooling over non-special tokens
hidden = out.last_hidden_state[0]  # (seq_len+2, 512)
token_emb = hidden[1:-1]           # strip [CLS] and [SEP]
mean_emb = token_emb.mean(dim=0)   # (512,)

# Intermediate layers: hidden_states[0] is the embedding output,
# so index 3 is the output of transformer layer 3
layer3_emb = out.hidden_states[3]  # (1, seq_len+2, 512)

MLM logits

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model.eval()

seq = "A C G [MASK] A C G T"
enc = tokenizer(seq, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits  # (1, seq_len+2, 10); includes [CLS] and [SEP]
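
To read a prediction off the logits at a masked position, one option is a softmax restricted to the four nucleotide IDs from the vocabulary above. The sketch below is pure Python and works on a single length-10 logit vector; `nucleotide_probs` is a hypothetical helper, and renormalizing over A/C/G/T only (ignoring N and the special tokens) is a design choice, not something the model requires.

```python
import math

# Token IDs of the four nucleotides, from the vocabulary listed in this card.
NUCLEOTIDE_IDS = {"A": 6, "C": 7, "G": 8, "T": 9}

def nucleotide_probs(logit_row):
    """Softmax over the A/C/G/T entries of one length-10 logit vector."""
    scores = {base: float(logit_row[idx]) for base, idx in NUCLEOTIDE_IDS.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {base: math.exp(s - top) for base, s in scores.items()}
    total = sum(exps.values())
    return {base: e / total for base, e in exps.items()}
```

With the variables from the example above, the input would be `logits[0, mask_pos].tolist()`, where `mask_pos` is the index of the [MASK] token (ID 4) in `enc["input_ids"][0]`.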

Fine-tuning

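Fine-tuning follows standard Hugging Face conventions. For sequence-level tasks, mean-pool the hidden states at the non-special token positions (the slice [1:-1] for a single unpadded sequence, which drops [CLS] and [SEP]) and pass the result to a prediction head. In padded batches, also exclude [PAD] positions from the mean.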
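
A minimal sketch of such a head, assuming a single linear layer on top of masked mean pooling; `MeanPoolHead`, `num_labels`, and `token_mask` are illustrative names, not part of this repository.

```python
import torch
from torch import nn

class MeanPoolHead(nn.Module):
    """Masked mean pooling over token embeddings followed by a linear layer."""

    def __init__(self, hidden_size=512, num_labels=2):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden, token_mask):
        # hidden: (batch, seq_len, hidden_size); token_mask: (batch, seq_len),
        # true for nucleotide positions, false for [CLS], [SEP] and [PAD].
        mask = token_mask.unsqueeze(-1).to(hidden.dtype)
        summed = (hidden * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
        return self.classifier(summed / counts)
```

Since IDs 0 to 4 are the special tokens in this vocabulary, `token_mask = enc["input_ids"] > 4` selects the nucleotide positions, and the head can be applied as `head(out.last_hidden_state, token_mask)`.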

Implementation Notes

The original checkpoint was saved as BertForMaskedLM with transformers==4.24.0. This port uses BERT-updated, which adds attn_implementation="sdpa" and attn_implementation="flash_attention_2" support not present in the original codebase.

from transformers import AutoModel

# Select the attention backend at load time
model = AutoModel.from_pretrained("Taykhoom/SpliceBERT-1024nt",
                                  trust_remote_code=True,
                                  attn_implementation="sdpa")

Citation

@article{chen2024_splicebert,
  title   = {Self-supervised learning on millions of primary {RNA} sequences from 72 vertebrates improves sequence-based {RNA} splicing prediction},
  author  = {Chen, Ken and Zhou, Yue and Ding, Maolin and Wang, Yu and Ren, Zhixiang and Yang, Yuedong},
  journal = {Briefings in Bioinformatics},
  volume  = {25},
  number  = {3},
  pages   = {bbae163},
  year    = {2024},
  doi     = {10.1093/bib/bbae163}
}

Credits

Original model and code by Chen et al. Source: GitHub. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.

License

MIT, following the original repository.