---
language:
- rna
library_name: transformers
tags:
- RNA
- language-model
- splicing
license: mit
---
# SpliceBERT-1024nt

SpliceBERT is a BERT-based RNA language model pre-trained on over 2 million vertebrate
primary RNA sequences using a masked language modeling (MLM) objective. The 1024nt
variant is trained on variable-length fragments (64-1024 nt) from 72 vertebrates.
## Architecture

| Parameter | Value |
|---|---|
| Layers | 6 |
| Attention heads | 16 |
| Embedding dimension | 512 |
| Intermediate dimension | 2048 |
| Vocabulary size | 10 |
| Positional encoding | Learned absolute |
| Architecture | BERT encoder |
| Max sequence length | 1024 |
| Parameters | ~19.4M |

Vocabulary: `[PAD]`=0, `[UNK]`=1, `[CLS]`=2, `[SEP]`=3, `[MASK]`=4, `N`=5, `A`=6, `C`=7, `G`=8, `T/U`=9
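
The listed dimensions determine most of the parameter count, so a rough estimate can be worked out by hand. The sketch below assumes the standard BERT encoder layout (a bias on every linear layer, two LayerNorms per block, two token-type embeddings, 1024 learned positions) and leaves out the pooler and the MLM head. These are assumptions rather than details stated in this card, and `estimate_bert_encoder_params` is a hypothetical helper.

```python
def estimate_bert_encoder_params(
    layers=6, hidden=512, intermediate=2048, vocab=10,
    max_positions=1024, type_vocab=2,
):
    """Estimate the parameters of a standard BERT encoder (no pooler, no MLM head)."""
    layer_norm = 2 * hidden  # LayerNorm weight + bias
    # Word, position and token-type embeddings, followed by one LayerNorm.
    embeddings = (vocab + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two feed-forward projections, each with a bias.
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    return embeddings + layers * per_layer


total = estimate_bert_encoder_params()
print(f"{total / 1e6:.1f}M parameters")
```

An exact figure for the loaded model can be obtained with `sum(p.numel() for p in model.parameters())`.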
## Pretraining

- **Objective:** Masked language modeling (MLM)
- **Data:** >2 million vertebrate primary RNA sequences from 72 species
- **Sequence format:** Single-nucleotide tokenization with spaces; U converted to T
- **Source checkpoint:** `SpliceBERT.1024nt/pytorch_model.bin` (from [zenodo:7995778](https://doi.org/10.5281/zenodo.7995778))
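
To make the sequence format concrete, the following sketch reproduces the preprocessing by hand: convert U to T, split into single nucleotides, and map each one to the token IDs listed in the vocabulary. It is an illustration only, not the tokenizer shipped with the model. `encode_rna` is a hypothetical helper, and both the uppercasing step and the mapping of unrecognised characters to `[UNK]` are assumptions.

```python
# Token IDs as listed in the vocabulary of this model card.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
         "N": 5, "A": 6, "C": 7, "G": 8, "T": 9}


def encode_rna(seq):
    """Return (spaced_text, input_ids) for a raw RNA or DNA string."""
    nucleotides = list(seq.upper().replace("U", "T"))  # assumption: input is uppercased
    spaced = " ".join(nucleotides)  # single-nucleotide tokens separated by spaces
    ids = [VOCAB["[CLS]"]]
    # Assumption: characters outside the vocabulary map to [UNK].
    ids += [VOCAB.get(nt, VOCAB["[UNK]"]) for nt in nucleotides]
    ids.append(VOCAB["[SEP]"])
    return spaced, ids


spaced, ids = encode_rna("ACGU")
print(spaced)  # A C G T
print(ids)     # [2, 6, 7, 8, 9, 3]
```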
### Checkpoint selection

The 1024nt variant is the primary SpliceBERT model, trained on variable-length vertebrate
sequences. Use this variant for general-purpose RNA embedding. The 510nt variants are
trained on fixed-length fragments and require inputs of exactly 510 nt.
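
Sequences longer than the model's 1024-position limit have to be split before embedding. A minimal sketch using overlapping windows follows. `split_into_windows` is a hypothetical helper, and the default window of 1022 nt assumes that `[CLS]` and `[SEP]` count toward the 1024 positions, which this card does not state.

```python
def split_into_windows(seq, window=1022, stride=511):
    """Split `seq` into overlapping windows of at most `window` nucleotides."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(seq) <= window:
        return [seq]
    windows = []
    start = 0
    while True:
        windows.append(seq[start:start + window])
        if start + window >= len(seq):
            break  # the last window reached the end of the sequence
        start += stride
    return windows


chunks = split_into_windows("ACGT" * 600)  # 2400 nt
print([len(c) for c in chunks])
```

Per-window embeddings can then be averaged or kept separately, depending on the downstream task.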
## Parity Verification

Hidden-state representations were verified against the original checkpoint
(maximum absolute difference < 1e-5) at all 7 representation levels
(embedding output plus 6 transformer layers), for both the `eager` and `sdpa`
attention backends.
Verification was run on GPU with PyTorch 2.7 / CUDA 11.8.
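
A check of this kind can be sketched as follows. The function compares two lists of hidden-state tensors level by level. The random tensors only stand in for the outputs of the original and ported models, and `max_abs_diff_per_level` is a hypothetical helper, not the script used for the verification reported here.

```python
import torch


def max_abs_diff_per_level(reference_states, ported_states):
    """Return the maximum absolute difference for each pair of hidden-state tensors."""
    if len(reference_states) != len(ported_states):
        raise ValueError("both models must expose the same number of levels")
    return [(ref - new).abs().max().item()
            for ref, new in zip(reference_states, ported_states)]


# Stand-in data: 7 levels (embedding output + 6 layers), shape (batch, seq_len, hidden).
torch.manual_seed(0)
reference = [torch.randn(1, 18, 512) for _ in range(7)]
ported = [h + 1e-6 * torch.randn_like(h) for h in reference]

diffs = max_abs_diff_per_level(reference, ported)
print(all(d < 1e-5 for d in diffs))
```

With real models, the two lists would be the `hidden_states` tuples returned when calling each model with `output_hidden_states=True` on the same input.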
## Related Models

See the full [SpliceBERT collection](https://huggingface.co/collections/Taykhoom/splicebert-6a20b72e9bec05b79ce009aa).

| Model | Context | Training data | Notes |
|---|---|---|---|
| **[SpliceBERT-1024nt](https://huggingface.co/Taykhoom/SpliceBERT-1024nt)** | 1024 nt | 72 vertebrates | This model |
| [SpliceBERT-510nt](https://huggingface.co/Taykhoom/SpliceBERT-510nt) | 510 nt (fixed) | 72 vertebrates | Fixed-length; requires exact 510 nt input |
| [SpliceBERT-human-510nt](https://huggingface.co/Taykhoom/SpliceBERT-human-510nt) | 510 nt (fixed) | Human only | Human-specific; requires exact 510 nt input |
## Usage

### Embedding generation

The tokenizer automatically handles U->T conversion and single-nucleotide spacing.
Pass raw sequences directly.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model.eval()

seq = "ACGUACGUACGUACGU"  # U->T handled automatically
enc = tokenizer(seq, return_tensors="pt")
with torch.no_grad():
    out = model(**enc, output_hidden_states=True)

# Mean pooling over non-special tokens (single, unpadded sequence)
hidden = out.last_hidden_state[0]  # (seq_len+2, 512)
token_emb = hidden[1:-1]           # strip [CLS] and [SEP]
mean_emb = token_emb.mean(dim=0)   # (512,)

# Intermediate layers: index 0 is the embedding output, 1-6 are the transformer layers
layer3_emb = out.hidden_states[3]  # (1, seq_len+2, 512)
```
### MLM logits

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model.eval()

seq = "A C G [MASK] A C G T"
enc = tokenizer(seq, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits  # (1, seq_len, 10), seq_len includes [CLS] and [SEP]
```
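
To turn the logits into a prediction for the masked position, one option is to take a softmax over the vocabulary at each `[MASK]` index. The sketch below runs on stand-in tensors so that it works without downloading the model. The token IDs follow the vocabulary listed in this card, and `predict_masked` is a hypothetical helper.

```python
import torch

# Index i holds the token with ID i, following the vocabulary in this card.
ID_TO_TOKEN = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "N", "A", "C", "G", "T"]
MASK_ID = 4


def predict_masked(input_ids, logits):
    """Return {position: (token, probability)} for every [MASK] in a single sequence."""
    probs = logits[0].softmax(dim=-1)  # (seq_len, vocab)
    mask_positions = (input_ids[0] == MASK_ID).nonzero(as_tuple=True)[0]
    result = {}
    for pos in mask_positions.tolist():
        prob, token_id = probs[pos].max(dim=-1)
        result[pos] = (ID_TO_TOKEN[token_id.item()], prob.item())
    return result


# Stand-in for tokenizer and model output: [CLS] A C G [MASK] A C G T [SEP]
input_ids = torch.tensor([[2, 6, 7, 8, 4, 6, 7, 8, 9, 3]])
logits = torch.zeros(1, 10, 10)
logits[0, 4, 9] = 5.0  # pretend the model favours T at the masked position

prediction = predict_masked(input_ids, logits)
print(prediction)
```

With the real model, the call would be `predict_masked(enc["input_ids"], logits)`.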
### Fine-tuning

Fine-tuning follows standard Hugging Face conventions. For sequence-level tasks,
mean-pool the hidden states of the non-special tokens (positions 1 to -1, that is,
excluding `[CLS]` and `[SEP]`) and feed the result to a prediction head.
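
For batches with padding, slicing positions 1 to -1 is no longer enough, so a mask-based pooling is one way to implement the idea. The following sketch is an assumption about how such a head could look, not code from the original repository. `MeanPoolClassifier` is a hypothetical module, and the random tensors stand in for the encoder output.

```python
import torch
from torch import nn


class MeanPoolClassifier(nn.Module):
    """Hypothetical head: mean-pool non-special tokens, then apply a linear layer."""

    def __init__(self, hidden_size=512, num_labels=2):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states, attention_mask, special_tokens_mask):
        # Keep real tokens that are neither padding nor [CLS]/[SEP].
        keep = (attention_mask.bool() & ~special_tokens_mask.bool()).unsqueeze(-1)
        summed = (hidden_states * keep).sum(dim=1)
        counts = keep.sum(dim=1).clamp(min=1)  # avoid division by zero
        return self.classifier(summed / counts)


# Stand-in encoder output: batch of 2, 6 positions, hidden size 512.
torch.manual_seed(0)
hidden = torch.randn(2, 6, 512)
attention_mask = torch.tensor([[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0]])
special_tokens_mask = torch.tensor([[1, 0, 0, 0, 0, 1], [1, 0, 0, 1, 1, 1]])

head = MeanPoolClassifier()
logits = head(hidden, attention_mask, special_tokens_mask)
print(logits.shape)
```

Hugging Face tokenizers generally return `special_tokens_mask` when called with `return_special_tokens_mask=True`; whether the remote-code tokenizer of this model does so has not been checked here.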
## Implementation Notes

The original checkpoint was saved as `BertForMaskedLM` with `transformers==4.24.0`.
This port uses [BERT-updated](https://huggingface.co/Taykhoom/BERT-updated), which
adds `attn_implementation="sdpa"` and `attn_implementation="flash_attention_2"` support
not present in the original codebase.

```python
from transformers import AutoModel
model = AutoModel.from_pretrained("Taykhoom/SpliceBERT-1024nt",
                                  trust_remote_code=True,
                                  attn_implementation="sdpa")
```
## Citation

```bibtex
@article{chen2024_splicebert,
  title = {Self-supervised learning on millions of primary {RNA} sequences from 72 vertebrates improves sequence-based {RNA} splicing prediction},
  author = {Chen, Ken and Zhou, Yue and Ding, Maolin and Wang, Yu and Ren, Zhixiang and Yang, Yuedong},
  journal = {Briefings in Bioinformatics},
  volume = {25},
  number = {3},
  pages = {bbae163},
  year = {2024},
  doi = {10.1093/bib/bbae163}
}
```
## Credits

Original model and code by Chen et al. Source: [GitHub](https://github.com/biomed-AI/SpliceBERT).
The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
and reviewed manually by Taykhoom Dalal.

## License

MIT, following the original repository.