---
language:
- rna
library_name: transformers
tags:
- RNA
- language-model
- splicing
license: mit
---

# SpliceBERT-1024nt

SpliceBERT is a BERT-based RNA language model pre-trained on over 2 million vertebrate primary RNA sequences using a masked language modeling (MLM) objective. The 1024nt variant is trained on variable-length fragments (64-1024 nt) from 72 vertebrates.

## Architecture

| Parameter | Value |
|---|---|
| Layers | 6 |
| Attention heads | 16 |
| Embedding dimension | 512 |
| Intermediate dimension | 2048 |
| Vocabulary size | 10 |
| Positional encoding | Learned absolute |
| Architecture | BERT encoder |
| Max sequence length | 1024 |
| Parameters | ~19.4M |

Vocabulary: `[PAD]`=0, `[UNK]`=1, `[CLS]`=2, `[SEP]`=3, `[MASK]`=4, `N`=5, `A`=6, `C`=7, `G`=8, `T/U`=9

## Pretraining

- **Objective:** Masked language modeling (MLM)
- **Data:** >2 million vertebrate primary RNA sequences from 72 species
- **Sequence format:** Single-nucleotide tokenization with spaces; U converted to T
- **Source checkpoint:** `SpliceBERT.1024nt/pytorch_model.bin` (from [zenodo:7995778](https://doi.org/10.5281/zenodo.7995778))

### Checkpoint selection

The 1024nt variant is the primary SpliceBERT model, trained on variable-length vertebrate sequences. Use this variant for general-purpose RNA embedding. The 510nt variants are trained on fixed-length fragments and require input of exactly 510 nt.

## Parity Verification

Hidden-state representations were verified (max abs diff < 1e-5) against the original checkpoint at all 7 representation levels (embedding + 6 transformer layers), for both the `eager` and `sdpa` attention backends. Verification was run on GPU with PyTorch 2.7 / CUDA 11.8.

## Related Models

See the full [SpliceBERT collection](https://huggingface.co/collections/Taykhoom/splicebert-6a20b72e9bec05b79ce009aa).


| Model | Context | Training data | Notes |
|---|---|---|---|
| **[SpliceBERT-1024nt](https://huggingface.co/Taykhoom/SpliceBERT-1024nt)** | 1024 nt | 72 vertebrates | This model |
| [SpliceBERT-510nt](https://huggingface.co/Taykhoom/SpliceBERT-510nt) | 510 nt (fixed) | 72 vertebrates | Fixed-length; requires exact 510 nt input |
| [SpliceBERT-human-510nt](https://huggingface.co/Taykhoom/SpliceBERT-human-510nt) | 510 nt (fixed) | Human only | Human-specific; requires exact 510 nt input |

## Usage

### Embedding generation

The tokenizer automatically handles U->T conversion and single-nucleotide spacing. Pass raw sequences directly.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model.eval()

seq = "ACGUACGUACGUACGU"  # U->T handled automatically
enc = tokenizer(seq, return_tensors="pt")

with torch.no_grad():
    out = model(**enc, output_hidden_states=True)

# Mean pooling over non-special tokens
hidden = out.last_hidden_state[0]  # (seq_len+2, 512)
token_emb = hidden[1:-1]           # strip [CLS] and [SEP]
mean_emb = token_emb.mean(dim=0)   # (512,)

# Intermediate layers
layer3_emb = out.hidden_states[3]  # (1, seq_len+2, 512)
```

### MLM logits

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model.eval()

seq = "A C G [MASK] A C G T"
enc = tokenizer(seq, return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits  # (1, seq_len+2, 10), including [CLS] and [SEP]
```

### Fine-tuning

Fine-tuning follows standard HF conventions. For sequence-level tasks, mean-pool the hidden states of the non-special token positions (`hidden[1:-1]` for a single unpadded sequence, which excludes `[CLS]` and `[SEP]`) and feed the result to a prediction head.

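
With padded batches, slicing `1:-1` no longer isolates the nucleotide positions, so the mean pooling described above needs a mask. The following is a minimal sketch under stated assumptions, not part of the original SpliceBERT code: `masked_mean_pool` and `PooledClassifier` are hypothetical names, the random tensors stand in for `out.last_hidden_state` and `enc["attention_mask"]`, and the batch is assumed to be right-padded with `[CLS]` first and `[SEP]` as the last unpadded token.

```python
import torch
from torch import nn


def masked_mean_pool(hidden, attention_mask):
    """Average hidden states over nucleotide positions only.

    hidden:         (batch, seq_len, dim) last hidden states
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    Assumes right padding, [CLS] at index 0, [SEP] as the last real token.
    """
    mask = attention_mask.clone().bool()
    mask[:, 0] = False                                  # drop [CLS]
    sep_index = attention_mask.sum(dim=1) - 1           # position of [SEP] per row
    mask[torch.arange(mask.size(0)), sep_index] = False  # drop [SEP]
    weights = mask.unsqueeze(-1).to(hidden.dtype)       # (batch, seq_len, 1)
    # clamp avoids division by zero for an empty sequence
    return (hidden * weights).sum(dim=1) / weights.sum(dim=1).clamp(min=1)


class PooledClassifier(nn.Module):
    """Hypothetical sequence-level head: masked mean pooling + linear layer."""

    def __init__(self, hidden_dim=512, num_labels=2):
        super().__init__()
        self.head = nn.Linear(hidden_dim, num_labels)

    def forward(self, hidden, attention_mask):
        return self.head(masked_mean_pool(hidden, attention_mask))


# Stand-in tensors; in practice use out.last_hidden_state and enc["attention_mask"].
hidden = torch.randn(2, 8, 512)
attention_mask = torch.tensor([[1, 1, 1, 1, 1, 1, 1, 1],
                               [1, 1, 1, 1, 1, 0, 0, 0]])

pooled = masked_mean_pool(hidden, attention_mask)      # (2, 512)
logits = PooledClassifier()(hidden, attention_mask)    # (2, 2)
```

In the second row, only positions 1 to 3 contribute to the average, because position 0 is `[CLS]`, position 4 is `[SEP]`, and positions 5 to 7 are padding. If the tokenizer is configured differently (for example left padding), the mask construction would need to change accordingly.
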

## Implementation Notes

The original checkpoint was saved as `BertForMaskedLM` with `transformers==4.24.0`. This port uses [BERT-updated](https://huggingface.co/Taykhoom/BERT-updated), which adds `attn_implementation="sdpa"` and `attn_implementation="flash_attention_2"` support not present in the original codebase.

```python
model = AutoModel.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True, attn_implementation="sdpa")
```

## Citation

```bibtex
@article{chen2024_splicebert,
  title   = {Self-supervised learning on millions of primary {RNA} sequences from 72 vertebrates improves sequence-based {RNA} splicing prediction},
  author  = {Chen, Ken and Zhou, Yue and Ding, Maolin and Wang, Yu and Ren, Zhixiang and Yang, Yuedong},
  journal = {Briefings in Bioinformatics},
  volume  = {25},
  number  = {3},
  pages   = {bbae163},
  year    = {2024},
  doi     = {10.1093/bib/bbae163}
}
```

## Credits

Original model and code by Chen et al. Source: [GitHub](https://github.com/biomed-AI/SpliceBERT). The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code) and reviewed manually by Taykhoom Dalal.

## License

MIT, following the original repository.