---
language:
- rna
library_name: transformers
tags:
- RNA
- language-model
- splicing
license: mit
---
# SpliceBERT-510nt
SpliceBERT is a BERT-based RNA language model pre-trained on over 2 million vertebrate
primary RNA sequences using a masked language modeling (MLM) objective. The 510nt
vertebrate variant was trained exclusively on fixed-length 510 nt fragments.
**WARNING:** This model expects exactly 510 nt of input (excluding [CLS] and [SEP]).
Without fine-tuning, sequences shorter or longer than 510 nt may produce unreliable outputs.
For general-purpose RNA embedding, use [SpliceBERT-1024nt](https://huggingface.co/Taykhoom/SpliceBERT-1024nt) instead.
## Architecture
| Parameter | Value |
|---|---|
| Layers | 6 |
| Attention heads | 16 |
| Embedding dimension | 512 |
| Intermediate dimension | 2048 |
| Vocabulary size | 10 |
| Positional encoding | Learned absolute |
| Architecture | BERT encoder |
| Max sequence length | 510 (fixed-length training) |
| Parameters | ~19M |
Vocabulary: `[PAD]`=0, `[UNK]`=1, `[CLS]`=2, `[SEP]`=3, `[MASK]`=4, `N`=5, `A`=6, `C`=7, `G`=8, `T/U`=9
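
As an illustration of the vocabulary above, the following minimal sketch maps a nucleotide string to token IDs by hand. It is not the repository's tokenizer: the `encode` helper is hypothetical, and mapping unrecognized characters to `[UNK]` is an assumption about how out-of-vocabulary symbols might be handled.

```python
# Token IDs copied from the vocabulary line of this card.
VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
    "N": 5, "A": 6, "C": 7, "G": 8, "T": 9,
}


def encode(seq: str) -> list[int]:
    """Hypothetical helper: wrap a sequence in [CLS]/[SEP] and map bases to IDs."""
    ids = [VOCAB["[CLS]"]]
    # U and T share ID 9, so U is converted to T before lookup.
    for base in seq.upper().replace("U", "T"):
        # Assumption: anything outside N/A/C/G/T falls back to [UNK].
        ids.append(VOCAB.get(base, VOCAB["[UNK]"]))
    ids.append(VOCAB["[SEP]"])
    return ids


print(encode("ACGU"))  # [2, 6, 7, 8, 9, 3]
```

For real inputs, prefer the tokenizer shipped with the model, as in the Usage section.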
## Pretraining
- **Objective:** Masked language modeling (MLM)
- **Data:** >2 million vertebrate primary RNA sequences from 72 species
- **Sequence format:** Single-nucleotide tokenization with spaces; U converted to T; fixed 510 nt fragments
- **Source checkpoint:** `SpliceBERT.510nt/pytorch_model.bin` (from [zenodo:7995778](https://doi.org/10.5281/zenodo.7995778))
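
The sequence format described above can be sketched in plain Python. Both helpers are hypothetical and only illustrate the stated conventions (uppercase, U converted to T, space-separated nucleotides, 510 nt fragments). Splitting into non-overlapping fragments and dropping a short remainder is an assumption; the original preprocessing may have sampled fragments differently.

```python
FRAGMENT_LEN = 510  # fixed fragment length stated in this card


def to_pretraining_format(seq: str) -> str:
    """Uppercase, convert U to T, and separate nucleotides with spaces."""
    return " ".join(seq.upper().replace("U", "T"))


def fixed_fragments(seq: str, length: int = FRAGMENT_LEN) -> list[str]:
    """Assumed chunking: non-overlapping fragments, incomplete tail dropped."""
    return [seq[i:i + length] for i in range(0, len(seq) - length + 1, length)]


print(to_pretraining_format("acgu"))                    # A C G T
print([len(f) for f in fixed_fragments("A" * 1100)])    # [510, 510]
```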
### Checkpoint selection
The 510nt vertebrate variant is intended for splice site prediction tasks where exact
510 nt windows are used (e.g., centered on a splice site). For variable-length sequences,
use [SpliceBERT-1024nt](https://huggingface.co/Taykhoom/SpliceBERT-1024nt).
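
One way to obtain an exact 510 nt window around a site of interest is sketched below. The `centered_window` helper is hypothetical. Because 510 is even, "centered" here means the site lands at offset 255 of the window, which is an assumed convention. The helper raises an error instead of padding, since this card advises against inputs shorter than 510 nt.

```python
WINDOW = 510  # fixed input length stated in this card


def centered_window(seq: str, center: int, length: int = WINDOW) -> str:
    """Return `length` nt of `seq` with 0-based index `center` at offset length // 2."""
    start = center - length // 2
    end = start + length
    if start < 0 or end > len(seq):
        raise ValueError(
            f"cannot take a {length} nt window around position {center} "
            f"in a sequence of length {len(seq)}"
        )
    return seq[start:end]


transcript = "ACGT" * 300  # 1200 nt toy sequence
window = centered_window(transcript, center=600)
print(len(window))  # 510
```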
## Parity Verification
Hidden-state representations were verified against the original checkpoint
(max abs diff < 1e-5) at all 7 representation levels (embedding + 6 transformer layers),
for both the `eager` and `sdpa` attention backends.
Verification was run on GPU with PyTorch 2.7 / CUDA 11.8.
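
The parity criterion can be illustrated with a framework-free sketch. This is not the script used for the verification above: `max_abs_diff` and `check_parity` are hypothetical helpers that operate on nested Python lists, and the toy values stand in for real hidden states. With PyTorch tensors the same quantity is `(a - b).abs().max()`.

```python
def max_abs_diff(a, b) -> float:
    """Largest elementwise |a - b| over two equally shaped nested lists of numbers."""
    if isinstance(a, (int, float)):
        return abs(a - b)
    if len(a) != len(b):
        raise ValueError("shape mismatch")
    return max((max_abs_diff(x, y) for x, y in zip(a, b)), default=0.0)


def check_parity(reference, ported, tol: float = 1e-5) -> list[bool]:
    """Compare each representation level and report whether it is within `tol`."""
    if len(reference) != len(ported):
        raise ValueError("different number of representation levels")
    return [max_abs_diff(r, p) < tol for r, p in zip(reference, ported)]


# Toy stand-ins for 7 levels (embedding + 6 layers), each a 2 x 2 matrix.
reference = [[[0.1, 0.2], [0.3, 0.4]] for _ in range(7)]
ported = [[[0.1, 0.2 + 1e-7], [0.3, 0.4]] for _ in range(7)]
print(all(check_parity(reference, ported)))  # True
```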
## Related Models
See the full [SpliceBERT collection](https://huggingface.co/collections/Taykhoom/splicebert-6a20b72e9bec05b79ce009aa).
| Model | Context | Training data | Notes |
|---|---|---|---|
| [SpliceBERT-1024nt](https://huggingface.co/Taykhoom/SpliceBERT-1024nt) | 1024 nt | 72 vertebrates | Variable-length; general purpose |
| **[SpliceBERT-510nt](https://huggingface.co/Taykhoom/SpliceBERT-510nt)** | 510 nt (fixed) | 72 vertebrates | This model |
| [SpliceBERT-human-510nt](https://huggingface.co/Taykhoom/SpliceBERT-human-510nt) | 510 nt (fixed) | Human only | Human-specific |
## Usage
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/SpliceBERT-510nt", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/SpliceBERT-510nt", trust_remote_code=True)
model.eval()

# Sequence must be exactly 510 nt; tokenizer handles U->T automatically
seq = ("ATCGATCG" * 64)[:510]  # exactly 510 nt
enc = tokenizer(seq, return_tensors="pt")

with torch.no_grad():
    out = model(**enc, output_hidden_states=True)

hidden = out.last_hidden_state[0]  # (512 tokens, 512 dims)
token_emb = hidden[1:-1]           # strip [CLS] and [SEP] -> (510, 512)
mean_emb = token_emb.mean(dim=0)   # (512,)
```
### Fine-tuning
Fine-tuning follows standard Hugging Face conventions. For splice site prediction, the
typical setup is token-level classification over all 510 nucleotide positions (excluding special tokens).
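
A minimal sketch of label construction for such a token-level setup follows. The three-class scheme (0 = no site, 1 = donor, 2 = acceptor) and the `build_token_labels` helper are assumptions for illustration, not taken from the original code. The value `-100` is the default `ignore_index` of `torch.nn.CrossEntropyLoss`, which keeps `[CLS]` and `[SEP]` out of the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss
WINDOW = 510         # fixed input length stated in this card


def build_token_labels(donors, acceptors, window: int = WINDOW) -> list[int]:
    """Per-token labels aligned with [CLS] + `window` nucleotides + [SEP].

    Assumed label scheme: 0 = no site, 1 = donor, 2 = acceptor.
    `donors` and `acceptors` hold 0-based positions within the window.
    """
    labels = [0] * window
    for pos in donors:
        labels[pos] = 1
    for pos in acceptors:
        labels[pos] = 2
    # Special tokens get IGNORE_INDEX so they do not contribute to the loss.
    return [IGNORE_INDEX] + labels + [IGNORE_INDEX]


labels = build_token_labels(donors=[255], acceptors=[400])
print(len(labels))  # 512
```

Note the offset of one: nucleotide position `p` in the window corresponds to index `p + 1` in the label list, because `[CLS]` occupies index 0.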
## Implementation Notes
The original checkpoint was saved as `BertForMaskedLM` with `transformers==4.20.1`.
This port uses [BERT-updated](https://huggingface.co/Taykhoom/BERT-updated), which
adds `attn_implementation="sdpa"` and `attn_implementation="flash_attention_2"` support
not present in the original codebase.
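
Selecting an attention backend at load time might look like the sketch below. The `load_splicebert` wrapper is hypothetical, while `attn_implementation` is a standard keyword argument of `from_pretrained` in recent `transformers` releases. Whether a given backend works depends on the environment: `flash_attention_2` generally requires the separate `flash-attn` package, a CUDA GPU, and half-precision weights.

```python
SUPPORTED_BACKENDS = {"eager", "sdpa", "flash_attention_2"}  # backends named in this card


def load_splicebert(attn_implementation: str = "sdpa"):
    """Hypothetical wrapper: load the checkpoint with the requested attention backend."""
    if attn_implementation not in SUPPORTED_BACKENDS:
        raise ValueError(f"unknown attention backend: {attn_implementation!r}")
    # Imported lazily so the check above runs even without `transformers` installed.
    from transformers import AutoModel

    model = AutoModel.from_pretrained(
        "Taykhoom/SpliceBERT-510nt",
        trust_remote_code=True,
        attn_implementation=attn_implementation,
    )
    return model.eval()
```

Calling `load_splicebert("sdpa")` downloads the checkpoint on first use; `"eager"` is the fallback if the other backends are unavailable.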
## Citation
```bibtex
@article{chen2024_splicebert,
title = {Self-supervised learning on millions of primary {RNA} sequences from 72 vertebrates improves sequence-based {RNA} splicing prediction},
author = {Chen, Ken and Zhou, Yue and Ding, Maolin and Wang, Yu and Ren, Zhixiang and Yang, Yuedong},
journal = {Briefings in Bioinformatics},
volume = {25},
number = {3},
pages = {bbae163},
year = {2024},
doi = {10.1093/bib/bbae163}
}
```
## Credits
Original model and code by Chen et al. Source: [GitHub](https://github.com/biomed-AI/SpliceBERT).
The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
and reviewed manually by Taykhoom Dalal.
## License
MIT, following the original repository.