---
language:
- rna
library_name: transformers
tags:
- RNA
- language-model
- splicing
license: mit
---

# SpliceBERT-1024nt

SpliceBERT is a BERT-based RNA language model pre-trained on over 2 million vertebrate
primary RNA sequences using a masked language modeling (MLM) objective. The 1024nt
variant is trained on variable-length fragments (64-1024 nt) from 72 vertebrates.

## Architecture

| Parameter | Value |
|---|---|
| Layers | 6 |
| Attention heads | 16 |
| Embedding dimension | 512 |
| Intermediate dimension | 2048 |
| Vocabulary size | 10 |
| Positional encoding | Learned absolute |
| Architecture | BERT encoder |
| Max sequence length | 1024 |
| Parameters | ~19.4M |

Vocabulary: `[PAD]`=0, `[UNK]`=1, `[CLS]`=2, `[SEP]`=3, `[MASK]`=4, `N`=5, `A`=6, `C`=7, `G`=8, `T/U`=9
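
To make the mapping concrete, here is a minimal sketch of how a raw sequence could be turned into input IDs by hand. It only illustrates the vocabulary listed above and is not the tokenizer's actual implementation. The `encode` helper, the uppercasing step, and the fallback to `[UNK]` for unrecognized characters are assumptions.

```python
# Illustrative only: mirrors the vocabulary listed above, not the real tokenizer.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
         "N": 5, "A": 6, "C": 7, "G": 8, "T": 9}


def encode(seq: str) -> list[int]:
    """Hypothetical helper: uppercase, convert U to T, wrap in [CLS] ... [SEP]."""
    bases = seq.upper().replace("U", "T")
    # Assumption: characters outside N/A/C/G/T fall back to [UNK].
    ids = [VOCAB.get(base, VOCAB["[UNK]"]) for base in bases]
    return [VOCAB["[CLS]"], *ids, VOCAB["[SEP]"]]


print(encode("ACGU"))  # [2, 6, 7, 8, 9, 3]
```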

## Pretraining

- **Objective:** Masked language modeling (MLM)
- **Data:** >2 million vertebrate primary RNA sequences from 72 species
- **Sequence format:** Single-nucleotide tokenization with spaces; U converted to T
- **Source checkpoint:** `SpliceBERT.1024nt/pytorch_model.bin` (from [zenodo:7995778](https://doi.org/10.5281/zenodo.7995778))

### Checkpoint selection

The 1024nt variant is the primary SpliceBERT model, trained on variable-length vertebrate
sequences. Use this variant for general-purpose RNA embedding. The 510nt variants are
trained on fixed-length fragments and require input of exactly 510 nt.
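
Sequences longer than the context length have to be split before embedding. The sketch below shows one way to do that with overlapping windows. The `window_sequence` helper and the 512 nt stride are illustrative assumptions, not part of the SpliceBERT codebase. It also assumes that 1024 nucleotides per window still fit once `[CLS]` and `[SEP]` are added, which should be checked against the model config.

```python
def window_sequence(seq: str, window: int = 1024, stride: int = 512) -> list[tuple[int, str]]:
    """Hypothetical helper: split seq into overlapping windows of at most `window` nt.

    Returns (start_offset, fragment) pairs so per-token embeddings can be
    mapped back to positions in the full sequence.
    """
    if window <= 0 or not 0 < stride <= window:
        raise ValueError("need window > 0 and 0 < stride <= window")
    if len(seq) <= window:
        return [(0, seq)]
    starts = list(range(0, len(seq) - window + 1, stride))
    # Add a final window so the end of the sequence is always covered.
    if starts[-1] + window < len(seq):
        starts.append(len(seq) - window)
    return [(start, seq[start:start + window]) for start in starts]


chunks = window_sequence("ACGT" * 625)  # 2500 nt
print([start for start, _ in chunks])   # [0, 512, 1024, 1476]
```

Each fragment can then be passed to the tokenizer on its own; how overlapping positions are merged afterwards (for example by averaging) is a design choice left to the caller.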

## Parity Verification

Hidden-state representations were verified against the original checkpoint
(maximum absolute difference < 1e-5) at all 7 representation levels
(embedding output plus 6 transformer layers), for both the `eager` and `sdpa`
attention backends. Verification was run on GPU with PyTorch 2.7 / CUDA 11.8.
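
A check of this kind can be sketched as follows. This is not the verification script used for this port: `max_abs_diff_per_level` is a hypothetical helper, and the tensors are random stand-ins for the `hidden_states` tuples that two models would return with `output_hidden_states=True`.

```python
import torch


def max_abs_diff_per_level(ref_states, new_states):
    """Hypothetical helper: largest absolute difference at each representation level."""
    if len(ref_states) != len(new_states):
        raise ValueError("hidden-state tuples differ in length")
    return [(ref - new).abs().max().item() for ref, new in zip(ref_states, new_states)]


# Stand-in tensors shaped like 7 levels of (batch, tokens, hidden) output.
torch.manual_seed(0)
ref = tuple(torch.randn(1, 18, 512) for _ in range(7))
new = tuple(h + 1e-7 for h in ref)  # simulate tiny numerical drift

diffs = max_abs_diff_per_level(ref, new)
print(all(d < 1e-5 for d in diffs))
```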

## Related Models

See the full [SpliceBERT collection](https://huggingface.co/collections/Taykhoom/splicebert-6a20b72e9bec05b79ce009aa).

| Model | Context | Training data | Notes |
|---|---|---|---|
| **[SpliceBERT-1024nt](https://huggingface.co/Taykhoom/SpliceBERT-1024nt)** | 1024 nt | 72 vertebrates | This model |
| [SpliceBERT-510nt](https://huggingface.co/Taykhoom/SpliceBERT-510nt) | 510 nt (fixed) | 72 vertebrates | Fixed-length; requires exact 510 nt input |
| [SpliceBERT-human-510nt](https://huggingface.co/Taykhoom/SpliceBERT-human-510nt) | 510 nt (fixed) | Human only | Human-specific; requires exact 510 nt input |

## Usage

### Embedding generation

The tokenizer automatically handles U->T conversion and single-nucleotide spacing.
Pass raw sequences directly.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model.eval()

seq = "ACGUACGUACGUACGU"  # U->T handled automatically
enc = tokenizer(seq, return_tensors="pt")

with torch.no_grad():
    out = model(**enc, output_hidden_states=True)

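# Mean pooling over non-special tokens (valid for a single, unpadded sequence)
hidden = out.last_hidden_state[0]  # (seq_len+2, 512)
token_emb = hidden[1:-1]           # strip [CLS] and [SEP]
mean_emb = token_emb.mean(dim=0)   # (512,)

# Intermediate layers: hidden_states[0] is the embedding output,
# hidden_states[k] is the output of transformer layer k
layer3_emb = out.hidden_states[3]  # (1, seq_len+2, 512)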
```

### MLM logits

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model.eval()

seq = "A C G [MASK] A C G T"
enc = tokenizer(seq, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits  # (1, seq_len+2, 10), including [CLS] and [SEP]
```

### Fine-tuning

Fine-tuning follows standard Hugging Face conventions. For sequence-level tasks,
mean-pool the hidden states of the non-special tokens (the slice `[1:-1]` of an
unpadded sequence) and feed the result to a prediction head.
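
One way to wire this up is sketched below. `MeanPoolHead` is a hypothetical module, not part of this repository. It assumes encoder output of shape `(batch, tokens, 512)` and relies on the vocabulary layout above, where IDs 0-4 are special tokens, to build the pooling mask. Masking by ID rather than slicing `[1:-1]` keeps the pooling correct for padded batches.

```python
import torch
from torch import nn


class MeanPoolHead(nn.Module):
    """Hypothetical head: mean-pool nucleotide tokens, then apply a linear layer."""

    def __init__(self, hidden_size: int = 512, num_labels: int = 2):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        # IDs 0-4 are [PAD], [UNK], [CLS], [SEP], [MASK]; keep only N/A/C/G/T (IDs 5-9).
        keep = (input_ids > 4).unsqueeze(-1).to(hidden.dtype)  # (batch, tokens, 1)
        summed = (hidden * keep).sum(dim=1)                    # (batch, hidden)
        counts = keep.sum(dim=1).clamp(min=1.0)                # avoid division by zero
        return self.classifier(summed / counts)                # (batch, num_labels)


# Dummy batch: the second sequence is shorter and padded with [PAD]=0.
input_ids = torch.tensor([[2, 6, 7, 8, 9, 3],
                          [2, 6, 7, 3, 0, 0]])
hidden = torch.randn(2, 6, 512)  # stand-in for model(**enc).last_hidden_state
logits = MeanPoolHead()(hidden, input_ids)
print(logits.shape)  # torch.Size([2, 2])
```

During fine-tuning, `hidden` would come from the encoder's `last_hidden_state`, and the head's parameters would be optimized together with (or instead of) the encoder weights.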

## Implementation Notes

The original checkpoint was saved as `BertForMaskedLM` with `transformers==4.24.0`.
This port uses [BERT-updated](https://huggingface.co/Taykhoom/BERT-updated), which
adds `attn_implementation="sdpa"` and `attn_implementation="flash_attention_2"` support
not present in the original codebase.

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("Taykhoom/SpliceBERT-1024nt",
                                  trust_remote_code=True,
                                  attn_implementation="sdpa")
```

## Citation

```bibtex
@article{chen2024_splicebert,
  title   = {Self-supervised learning on millions of primary {RNA} sequences from 72 vertebrates improves sequence-based {RNA} splicing prediction},
  author  = {Chen, Ken and Zhou, Yue and Ding, Maolin and Wang, Yu and Ren, Zhixiang and Yang, Yuedong},
  journal = {Briefings in Bioinformatics},
  volume  = {25},
  number  = {3},
  pages   = {bbae163},
  year    = {2024},
  doi     = {10.1093/bib/bbae163}
}
```

## Credits

Original model and code by Chen et al. Source: [GitHub](https://github.com/biomed-AI/SpliceBERT).
The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
and reviewed manually by Taykhoom Dalal.

## License

MIT, following the original repository.