---
language:
  - rna
library_name: transformers
tags:
  - RNA
  - language-model
  - splicing
license: mit
---

SpliceBERT-510nt

SpliceBERT is a BERT-based RNA language model pre-trained on over 2 million primary RNA sequences from 72 vertebrate species using a masked language modeling (MLM) objective. The 510nt vertebrate variant was trained exclusively on fixed-length 510 nt fragments.

WARNING: This model requires exactly 510 nt of input (excluding [CLS] and [SEP]). Sequences shorter or longer than 510 nt may produce incorrect outputs without fine-tuning. For general-purpose RNA embedding, use SpliceBERT-1024nt instead.
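A length guard in front of the tokenizer makes the constraint above hard to violate by accident. The following is a minimal sketch; `require_fixed_length` and `EXPECTED_LENGTH` are hypothetical names and are not part of this repository.

```python
EXPECTED_LENGTH = 510  # nucleotides, excluding [CLS] and [SEP]


def require_fixed_length(seq: str, expected: int = EXPECTED_LENGTH) -> str:
    """Return `seq` unchanged if it has exactly `expected` nucleotides.

    Raises ValueError otherwise, so a wrong-length input fails loudly
    instead of silently producing unreliable embeddings.
    """
    n = len(seq)
    if n != expected:
        raise ValueError(
            f"SpliceBERT-510nt expects {expected} nt, got {n} nt; "
            "re-window the input or use SpliceBERT-1024nt"
        )
    return seq
```

Calling `require_fixed_length(seq)` just before `tokenizer(seq, ...)` is enough for most pipelines.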

Architecture

| Parameter | Value |
|---|---|
| Layers | 6 |
| Attention heads | 16 |
| Embedding dimension | 512 |
| Intermediate dimension | 2048 |
| Vocabulary size | 10 |
| Positional encoding | Learned absolute |
| Architecture | BERT encoder |
| Max sequence length | 510 (fixed-length training) |
| Parameters | ~19M |

Vocabulary: [PAD]=0, [UNK]=1, [CLS]=2, [SEP]=3, [MASK]=4, N=5, A=6, C=7, G=8, T/U=9
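To make the mapping concrete, here is a hand-rolled encoder built only from the vocabulary listed above. It is an illustrative sketch, not the repository's tokenizer: the `encode` helper is hypothetical, and sending unrecognized characters to `[UNK]` is an assumption about how such input should be handled.

```python
# Token IDs copied from the vocabulary listing; T and U share ID 9.
VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
    "N": 5, "A": 6, "C": 7, "G": 8, "T": 9,
}


def encode(seq: str) -> list[int]:
    """Map a nucleotide string to [CLS] + per-base IDs + [SEP]."""
    bases = seq.upper().replace("U", "T")  # U is converted to T
    ids = [VOCAB["[CLS]"]]
    # Assumption: anything outside N/A/C/G/T becomes [UNK].
    ids.extend(VOCAB.get(base, VOCAB["[UNK]"]) for base in bases)
    ids.append(VOCAB["[SEP]"])
    return ids


print(encode("ACGU"))  # [2, 6, 7, 8, 9, 3]
```

A 510 nt input therefore yields 512 token IDs, which is why the hidden states in the usage example have 512 positions.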

Pretraining

  • Objective: Masked language modeling (MLM)
  • Data: >2 million vertebrate primary RNA sequences from 72 species
  • Sequence format: Single-nucleotide tokenization with spaces; U converted to T; fixed 510 nt fragments
  • Source checkpoint: SpliceBERT.510nt/pytorch_model.bin (from zenodo:7995778)
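The sequence format above can be reproduced with a few lines of Python. This is a sketch under stated assumptions: both helper names are hypothetical, and tiling a transcript into consecutive non-overlapping fragments (dropping a short tail) is one plausible reading of "fixed 510 nt fragments", not a description of the authors' pipeline.

```python
def to_spaced_tokens(seq: str) -> str:
    """Uppercase, convert U to T, and separate nucleotides with spaces."""
    return " ".join(seq.upper().replace("U", "T"))


def fixed_fragments(seq: str, length: int = 510) -> list[str]:
    """Split `seq` into consecutive non-overlapping `length`-nt fragments.

    Assumption: a trailing remainder shorter than `length` is dropped.
    """
    return [seq[i:i + length] for i in range(0, len(seq) - length + 1, length)]


print(to_spaced_tokens("acgu"))  # A C G T
```

The tokenizer shipped with this model accepts unspaced input, as the usage example shows, so `to_spaced_tokens` mainly matters when working with the original codebase.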

Checkpoint selection

The 510nt vertebrate variant is intended for splice site prediction tasks that use exact 510 nt windows (e.g., centered on a splice site). For variable-length sequences, use SpliceBERT-1024nt.
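Extracting a window centered on a candidate site might look like the sketch below. The `centered_window` helper is hypothetical, and padding with `N` near transcript ends is an assumption; `N` is in the vocabulary, but how well the model handles padded windows is not stated here.

```python
def centered_window(seq: str, site: int, length: int = 510, pad: str = "N") -> str:
    """Return a `length`-nt window with 0-based position `site` at index length // 2.

    Positions that fall outside `seq` are filled with `pad` (assumption).
    """
    if not 0 <= site < len(seq):
        raise IndexError(f"site {site} is outside a sequence of length {len(seq)}")
    start = site - length // 2
    end = start + length
    prefix = pad * max(0, -start)           # padding before the transcript start
    suffix = pad * max(0, end - len(seq))   # padding after the transcript end
    return prefix + seq[max(0, start):min(len(seq), end)] + suffix
```

With the default length, the site lands at window index 255, with 255 nt upstream and 254 nt downstream.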

Parity Verification

Hidden-state representations were verified against the original checkpoint (max abs diff < 1e-5) at all 7 representation levels (embedding output + 6 transformer layers), for both the eager and sdpa attention backends. Verification was run on GPU with PyTorch 2.7 / CUDA 11.8.
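A check of this kind can be sketched as follows. This is not the script used for the verification above; the function names are hypothetical, and the two inputs are assumed to be the `hidden_states` tuples returned by the original and ported models for the same input with `output_hidden_states=True`.

```python
import torch


def max_abs_diff_per_level(reference, ported):
    """Return the max absolute difference at each representation level."""
    if len(reference) != len(ported):
        raise ValueError("both models must expose the same number of levels")
    return [
        (a.float() - b.float()).abs().max().item()
        for a, b in zip(reference, ported)
    ]


def check_parity(reference, ported, tol: float = 1e-5) -> bool:
    """True if every level differs by less than `tol`."""
    return all(d < tol for d in max_abs_diff_per_level(reference, ported))
```

Reporting the per-level differences rather than a single boolean helps locate the first layer at which two implementations diverge.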

Related Models

See the full SpliceBERT collection.

| Model | Context | Training data | Notes |
|---|---|---|---|
| SpliceBERT-1024nt | 1024 nt | 72 vertebrates | Variable-length; general purpose |
| SpliceBERT-510nt | 510 nt (fixed) | 72 vertebrates | This model |
| SpliceBERT-human-510nt | 510 nt (fixed) | Human only | Human-specific |

Usage

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/SpliceBERT-510nt", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/SpliceBERT-510nt", trust_remote_code=True)
model.eval()

# Sequence must be exactly 510 nt; tokenizer handles U->T automatically
seq = ("ATCGATCG" * 64)[:510]  # exactly 510 nt
enc = tokenizer(seq, return_tensors="pt")

with torch.no_grad():
    out = model(**enc, output_hidden_states=True)

hidden = out.last_hidden_state[0]  # (512, 512): 510 nt + [CLS] + [SEP], 512-dim
token_emb = hidden[1:-1]           # strip [CLS] and [SEP] -> (510, 512)
mean_emb = token_emb.mean(dim=0)   # (512,)
```
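The call above also requests `output_hidden_states=True`, which exposes all 7 representation levels in `out.hidden_states`. A minimal sketch for pooling each level follows; `mean_embedding_per_level` is a hypothetical helper, and it assumes unpadded inputs, which holds when every sequence is exactly 510 nt.

```python
import torch


def mean_embedding_per_level(hidden_states):
    """Average each level over nucleotide positions.

    hidden_states: sequence of (batch, seq_len, dim) tensors, one per level.
    Returns a tensor of shape (levels, batch, dim).
    """
    stacked = torch.stack(list(hidden_states))  # (levels, batch, seq_len, dim)
    # Drop the first and last positions ([CLS] and [SEP]) before averaging.
    return stacked[:, :, 1:-1, :].mean(dim=2)
```

With the objects from the usage example, `mean_embedding_per_level(out.hidden_states)` would have shape `(7, 1, 512)`; index 0 is the embedding output and index 6 matches the pooled `last_hidden_state`.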

Fine-tuning

Fine-tuning follows standard Hugging Face conventions. For splice site prediction, the typical setup is token-level classification over all 510 nucleotide positions (excluding the special tokens).
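One way to set up such a token-level classifier is a linear head on top of the encoder, sketched below. `SpliceSiteTagger` and `token_loss` are hypothetical names, and the three-class label scheme (0 = neither, 1 = donor, 2 = acceptor) is an assumption chosen for illustration.

```python
import torch
from torch import nn


class SpliceSiteTagger(nn.Module):
    """Per-nucleotide classifier on top of a BERT-style encoder."""

    def __init__(self, encoder, hidden_size: int = 512, num_labels: int = 3,
                 dropout: float = 0.1):
        super().__init__()
        self.encoder = encoder  # e.g. the AutoModel loaded in the usage section
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                      # (batch, 512, hidden_size)
        logits = self.classifier(self.dropout(hidden))
        return logits[:, 1:-1, :]                # drop [CLS]/[SEP] -> (batch, 510, labels)


def token_loss(logits, labels):
    """Cross-entropy over all nucleotide positions; labels: (batch, 510)."""
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), labels.reshape(-1)
    )
```

Splice sites are rare relative to other positions, so class weighting or a different loss may be worth considering in practice.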

Implementation Notes

The original checkpoint was saved as BertForMaskedLM with transformers==4.20.1. This port uses BERT-updated, which adds attn_implementation="sdpa" and attn_implementation="flash_attention_2" support not present in the original codebase.
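Selecting a backend at load time could be done as in the sketch below. `attn_implementation` and `torch_dtype` are standard `from_pretrained` keyword arguments in recent `transformers` releases; the helper names are hypothetical, and the fallback policy (FlashAttention 2 only when CUDA and the `flash_attn` package are both present, half precision in that case) is an assumption rather than a recommendation from this repository.

```python
import importlib.util


def choose_attn_implementation(has_cuda: bool, has_flash_attn: bool) -> str:
    """Pick FlashAttention 2 when usable, otherwise fall back to SDPA."""
    if has_cuda and has_flash_attn:
        return "flash_attention_2"
    return "sdpa"


def load_model(repo_id: str = "Taykhoom/SpliceBERT-510nt"):
    """Load the model with the backend chosen for the current environment."""
    # Imported lazily so the selection logic can be used without these packages.
    import torch
    from transformers import AutoModel

    has_cuda = torch.cuda.is_available()
    has_flash = importlib.util.find_spec("flash_attn") is not None
    impl = choose_attn_implementation(has_cuda, has_flash)
    # FlashAttention 2 needs fp16/bf16 weights; keep fp32 otherwise.
    dtype = torch.float16 if impl == "flash_attention_2" else torch.float32
    model = AutoModel.from_pretrained(
        repo_id,
        trust_remote_code=True,
        attn_implementation=impl,
        torch_dtype=dtype,
    )
    return model.to("cuda" if has_cuda else "cpu").eval()
```

Passing `attn_implementation="eager"` instead reproduces the attention path of the original codebase, at the cost of speed.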

Citation

@article{chen2024_splicebert,
  title   = {Self-supervised learning on millions of primary {RNA} sequences from 72 vertebrates improves sequence-based {RNA} splicing prediction},
  author  = {Chen, Ken and Zhou, Yue and Ding, Maolin and Wang, Yu and Ren, Zhixiang and Yang, Yuedong},
  journal = {Briefings in Bioinformatics},
  volume  = {25},
  number  = {3},
  pages   = {bbae163},
  year    = {2024},
  doi     = {10.1093/bib/bbae163}
}

Credits

Original model and code by Chen et al. Source: GitHub. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.

License

MIT, following the original repository.