---
language:
- rna
library_name: transformers
tags:
- RNA
- language-model
- splicing
license: mit
---
# SpliceBERT-1024nt
SpliceBERT is a BERT-based RNA language model pre-trained on over 2 million vertebrate
primary RNA sequences using a masked language modeling (MLM) objective. The 1024nt
variant is trained on variable-length fragments (64-1024 nt) from 72 vertebrates.
## Architecture
| Parameter | Value |
|---|---|
| Layers | 6 |
| Attention heads | 16 |
| Embedding dimension | 512 |
| Intermediate dimension | 2048 |
| Vocabulary size | 10 |
| Positional encoding | Learned absolute |
| Architecture | BERT encoder |
| Max sequence length | 1024 |
| Parameters | ~19.4M |
Vocabulary: `[PAD]`=0, `[UNK]`=1, `[CLS]`=2, `[SEP]`=3, `[MASK]`=4, `N`=5, `A`=6, `C`=7, `G`=8, `T/U`=9
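To make the ID assignment concrete, here is a minimal pure-Python sketch of the mapping. The `encode` helper is hypothetical and is not the repository's tokenizer; it assumes the sequence is wrapped in `[CLS]` and `[SEP]` and that any character outside the table maps to `[UNK]`.

```python
# Token-to-ID table copied from the vocabulary line above.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
         "N": 5, "A": 6, "C": 7, "G": 8, "T": 9}


def encode(seq: str) -> list[int]:
    """Hypothetical helper: map a nucleotide string to input IDs."""
    ids = [VOCAB["[CLS]"]]
    # Uppercase and fold U into T so RNA and DNA alphabets share one ID.
    for base in seq.upper().replace("U", "T"):
        # Assumption: unknown characters fall back to [UNK].
        ids.append(VOCAB.get(base, VOCAB["[UNK]"]))
    ids.append(VOCAB["[SEP]"])
    return ids


print(encode("ACGU"))  # [2, 6, 7, 8, 9, 3]
```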
## Pretraining
- **Objective:** Masked language modeling (MLM)
- **Data:** >2 million vertebrate primary RNA sequences from 72 species
- **Sequence format:** Single-nucleotide tokenization with spaces; U converted to T
- **Source checkpoint:** `SpliceBERT.1024nt/pytorch_model.bin` (from [zenodo:7995778](https://doi.org/10.5281/zenodo.7995778))
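The sequence format above can be sketched in a few lines. `to_model_format` is a hypothetical helper for illustration only; this card states that the tokenizer performs the conversion itself, so the function only shows what the formatted input is assumed to look like.

```python
def to_model_format(seq: str) -> str:
    """Hypothetical helper: uppercase, convert U to T, space-separate nucleotides."""
    cleaned = seq.strip().upper().replace("U", "T")
    # Joining a string with " " puts one space between consecutive characters.
    return " ".join(cleaned)


print(to_model_format("acguACGU"))  # A C G T A C G T
```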
### Checkpoint selection
The 1024nt variant is the primary SpliceBERT model trained on variable-length vertebrate
sequences. Use this variant for general-purpose RNA embedding. The 510nt variants are
trained on fixed-length fragments and require inputs of exactly 510 nt.
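Transcripts longer than the context window have to be split before embedding. The sketch below is one possible approach, not something prescribed by the original authors: `split_into_windows` is a hypothetical helper, the 512 nt stride is an arbitrary choice, and whether the 1024 limit also has to leave room for `[CLS]` and `[SEP]` is not stated here, so `max_len` is left as a parameter.

```python
def split_into_windows(seq: str, max_len: int = 1024, stride: int = 512) -> list[str]:
    """Hypothetical helper: cut a long sequence into overlapping windows."""
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    if len(seq) <= max_len:
        return [seq]
    windows = []
    start = 0
    while True:
        windows.append(seq[start:start + max_len])
        # Stop once the current window reaches the end of the sequence.
        if start + max_len >= len(seq):
            break
        start += stride
    return windows


windows = split_into_windows("ACGT" * 500)  # 2000 nt input
print([len(w) for w in windows])  # [1024, 1024, 976]
```

The final window is shorter than `max_len`, which suits the variable-length 1024nt model; the fixed-length 510nt variants would need a different strategy for the tail.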
## Parity Verification
Hidden-state representations were verified against the original checkpoint
(max abs diff < 1e-5) at all 7 representation levels (embedding output + 6
transformer layers), for both the `eager` and `sdpa` attention backends.
Verification was run on GPU with PyTorch 2.7 / CUDA 11.8.
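A check of this kind can be sketched as follows. This is not the script used for the verification above: `max_abs_diff_per_level` is a hypothetical helper, and the tensors are random stand-ins for the hidden states of the two models, so the example runs without downloading any checkpoint.

```python
import torch


def max_abs_diff_per_level(reference, ported):
    """Hypothetical helper: largest absolute difference at each representation level."""
    if len(reference) != len(ported):
        raise ValueError("both models must expose the same number of levels")
    return [(r - p).abs().max().item() for r, p in zip(reference, ported)]


torch.manual_seed(0)
# Stand-ins for 7 levels (embedding output + 6 layers) of shape (1, seq_len+2, 512).
reference = [torch.randn(1, 18, 512) for _ in range(7)]
# Simulate a port that differs only by tiny floating-point noise.
ported = [h + 1e-7 * torch.randn_like(h) for h in reference]

diffs = max_abs_diff_per_level(reference, ported)
print(all(d < 1e-5 for d in diffs))
```

In a real comparison, `reference` and `ported` would be the `hidden_states` tuples returned by the two models for the same input.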
## Related Models
See the full [SpliceBERT collection](https://huggingface.co/collections/Taykhoom/splicebert-6a20b72e9bec05b79ce009aa).
| Model | Context | Training data | Notes |
|---|---|---|---|
| **[SpliceBERT-1024nt](https://huggingface.co/Taykhoom/SpliceBERT-1024nt)** | 1024 nt | 72 vertebrates | This model |
| [SpliceBERT-510nt](https://huggingface.co/Taykhoom/SpliceBERT-510nt) | 510 nt (fixed) | 72 vertebrates | Fixed-length; requires exact 510 nt input |
| [SpliceBERT-human-510nt](https://huggingface.co/Taykhoom/SpliceBERT-human-510nt) | 510 nt (fixed) | Human only | Human-specific; requires exact 510 nt input |
## Usage
### Embedding generation
The tokenizer automatically handles U->T conversion and single-nucleotide spacing.
Pass raw sequences directly.
```python
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model.eval()
seq = "ACGUACGUACGUACGU" # U->T handled automatically
enc = tokenizer(seq, return_tensors="pt")
with torch.no_grad():
    out = model(**enc, output_hidden_states=True)
# Mean pooling over non-special tokens
hidden = out.last_hidden_state[0] # (seq_len+2, 512)
token_emb = hidden[1:-1] # strip [CLS] and [SEP]
mean_emb = token_emb.mean(dim=0) # (512,)
# Intermediate layers
layer3_emb = out.hidden_states[3] # (1, seq_len+2, 512)
```
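The `[1:-1]` slice above works for a single unpadded sequence. For a padded batch, the pooling has to follow the attention mask instead. The sketch below is one way to do that; `masked_mean_pool` is a hypothetical helper, and it assumes right padding with `[CLS]` at position 0 and `[SEP]` as the last non-padding token. Synthetic tensors stand in for model output.

```python
import torch


def masked_mean_pool(hidden, attention_mask):
    """Hypothetical helper: mean over nucleotide tokens only.

    hidden: (batch, seq, dim); attention_mask: (batch, seq), 1 = real token.
    """
    mask = attention_mask.bool().clone()
    mask[:, 0] = False  # drop [CLS]
    # Index of the last real token, assumed to be [SEP] under right padding.
    last = attention_mask.sum(dim=1) - 1
    mask[torch.arange(mask.size(0)), last] = False
    weights = mask.unsqueeze(-1).to(hidden.dtype)
    # clamp avoids division by zero for sequences with no nucleotide tokens.
    return (hidden * weights).sum(dim=1) / weights.sum(dim=1).clamp(min=1)


hidden = torch.arange(30, dtype=torch.float32).reshape(2, 5, 3)
attention_mask = torch.tensor([[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]])
pooled = masked_mean_pool(hidden, attention_mask)
print(pooled.shape)  # torch.Size([2, 3])
```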
### MLM logits
```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model.eval()
seq = "A C G [MASK] A C G T"
enc = tokenizer(seq, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits  # (1, seq_len+2, 10), including [CLS] and [SEP]
```
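To turn the logits into predictions, one option is to take a softmax over the four nucleotide IDs at each `[MASK]` position. The sketch below uses the token IDs from the vocabulary table; `nucleotide_probs_at_masks` is a hypothetical helper, restricting the softmax to A/C/G/T is a design choice rather than part of the model, and the logits are synthetic so the block runs without the checkpoint.

```python
import torch

NUCLEOTIDE_IDS = {"A": 6, "C": 7, "G": 8, "T": 9}  # from the vocabulary table
MASK_ID = 4


def nucleotide_probs_at_masks(logits, input_ids):
    """Hypothetical helper: per-[MASK] probabilities over A/C/G/T.

    logits: (1, length, 10); input_ids: (1, length).
    """
    positions = (input_ids[0] == MASK_ID).nonzero(as_tuple=True)[0]
    ids = torch.tensor(list(NUCLEOTIDE_IDS.values()))
    # Keep only the four nucleotide columns, then renormalize.
    probs = torch.softmax(logits[0, positions][:, ids], dim=-1)
    return [dict(zip(NUCLEOTIDE_IDS, row.tolist())) for row in probs]


# Synthetic example: [CLS] A C G [MASK] A C G T [SEP]
input_ids = torch.tensor([[2, 6, 7, 8, 4, 6, 7, 8, 9, 3]])
logits = torch.zeros(1, 10, 10)
logits[0, 4, NUCLEOTIDE_IDS["T"]] = 5.0  # pretend the model favors T
result = nucleotide_probs_at_masks(logits, input_ids)
print(max(result[0], key=result[0].get))  # T
```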
### Fine-tuning
Standard HF conventions apply. For sequence-level tasks, mean-pool the hidden states
of the non-special tokens (the `[1:-1]` slice for a single unpadded sequence) and feed
the result to a prediction head.
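A prediction head of this kind can be sketched as below. `SequenceHead`, the two-label setup, the dropout rate, and the learning rate are all illustrative assumptions, and random tensors stand in for the mean-pooled 512-dimensional embeddings.

```python
import torch
from torch import nn


class SequenceHead(nn.Module):
    """Hypothetical classification head over mean-pooled embeddings."""

    def __init__(self, hidden_dim=512, num_labels=2, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_dim, num_labels)

    def forward(self, pooled):
        # pooled: (batch, hidden_dim) -> logits: (batch, num_labels)
        return self.classifier(self.dropout(pooled))


torch.manual_seed(0)
head = SequenceHead()
pooled = torch.randn(4, 512)         # stand-in for pooled SpliceBERT embeddings
labels = torch.tensor([0, 1, 1, 0])  # made-up labels for illustration

# One optimization step on the head alone.
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss = nn.functional.cross_entropy(head(pooled), labels)
loss.backward()
optimizer.step()
print(head(pooled).shape)  # torch.Size([4, 2])
```

To fine-tune the encoder as well, its parameters would be added to the optimizer and the pooled embeddings computed inside the training loop rather than precomputed.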
## Implementation Notes
The original checkpoint was saved as `BertForMaskedLM` with `transformers==4.24.0`.
This port uses [BERT-updated](https://huggingface.co/Taykhoom/BERT-updated), which
adds `attn_implementation="sdpa"` and `attn_implementation="flash_attention_2"` support
not present in the original codebase.
```python
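from transformers import AutoModel
# Select the attention backend at load time ("sdpa" or "flash_attention_2")
model = AutoModel.from_pretrained("Taykhoom/SpliceBERT-1024nt",
                                  trust_remote_code=True,
                                  attn_implementation="sdpa")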
```
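For intuition about why switching backends should not change the outputs beyond floating-point noise, the sketch below compares a hand-written ("eager"-style) attention with PyTorch's fused `torch.nn.functional.scaled_dot_product_attention`. It is a standalone illustration on random tensors, not code from BERT-updated; the shapes assume 16 heads of dimension 32 (512 / 16), and no attention mask or dropout is applied.

```python
import math

import torch
import torch.nn.functional as F


def eager_attention(q, k, v):
    """Plain softmax(QK^T / sqrt(d)) V, written out step by step."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v


torch.manual_seed(0)
# (batch, heads, tokens, head_dim)
q, k, v = (torch.randn(1, 16, 18, 32) for _ in range(3))

eager_out = eager_attention(q, k, v)
sdpa_out = F.scaled_dot_product_attention(q, k, v)  # requires PyTorch >= 2.0

max_diff = (eager_out - sdpa_out).abs().max().item()
print(f"max abs diff: {max_diff:.2e}")
```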
## Citation
```bibtex
@article{chen2024_splicebert,
title = {Self-supervised learning on millions of primary {RNA} sequences from 72 vertebrates improves sequence-based {RNA} splicing prediction},
author = {Chen, Ken and Zhou, Yue and Ding, Maolin and Wang, Yu and Ren, Zhixiang and Yang, Yuedong},
journal = {Briefings in Bioinformatics},
volume = {25},
number = {3},
pages = {bbae163},
year = {2024},
doi = {10.1093/bib/bbae163}
}
```
## Credits
Original model and code by Chen et al. Source: [GitHub](https://github.com/biomed-AI/SpliceBERT).
The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
and reviewed manually by Taykhoom Dalal.
## License
MIT, following the original repository.