---
language:
- rna
library_name: transformers
tags:
- RNA
- language-model
- splicing
license: mit
---

# SpliceBERT-1024nt

SpliceBERT is a BERT-based RNA language model pre-trained on over 2 million vertebrate
primary RNA sequences using a masked language modeling (MLM) objective. The 1024nt
variant is trained on variable-length fragments (64-1024 nt) from 72 vertebrates.

## Architecture

| Parameter | Value |
|---|---|
| Layers | 6 |
| Attention heads | 16 |
| Embedding dimension | 512 |
| Intermediate dimension | 2048 |
| Vocabulary size | 10 |
| Positional encoding | Learned absolute |
| Architecture | BERT encoder |
| Max sequence length | 1024 |
| Parameters | ~19.4M |

Vocabulary: `[PAD]`=0, `[UNK]`=1, `[CLS]`=2, `[SEP]`=3, `[MASK]`=4, `N`=5, `A`=6, `C`=7, `G`=8, `T/U`=9
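
To make the id layout concrete, here is a minimal sketch of the mapping written out by hand. The `encode` helper is hypothetical and is not part of this repository; the fallback to `[UNK]` for unlisted characters and the upper-casing step are assumptions, so prefer the bundled tokenizer for real use.

```python
# Token-to-id table copied from the vocabulary line above; T and U share id 9.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
         "N": 5, "A": 6, "C": 7, "G": 8, "T": 9, "U": 9}


def encode(seq: str) -> list[int]:
    """Map a nucleotide string to ids, wrapped in [CLS] ... [SEP]."""
    # Assumption: input is upper-cased and any character outside the table
    # falls back to [UNK]; the real tokenizer may behave differently.
    body = [VOCAB.get(ch, VOCAB["[UNK]"]) for ch in seq.upper()]
    return [VOCAB["[CLS]"], *body, VOCAB["[SEP]"]]


print(encode("ACGU"))  # [2, 6, 7, 8, 9, 3]
```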

## Pretraining

- **Objective:** Masked language modeling (MLM)
- **Data:** >2 million vertebrate primary RNA sequences from 72 species
- **Sequence format:** Single-nucleotide tokenization with spaces; U converted to T
- **Source checkpoint:** `SpliceBERT.1024nt/pytorch_model.bin` (from [zenodo:7995778](https://doi.org/10.5281/zenodo.7995778))
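
The sequence format described above amounts to a small string transformation. The sketch below shows it with a hypothetical helper, `to_spaced_dna`; the upper-casing and whitespace stripping are assumptions added for robustness. The tokenizer shipped with this port is described as doing this for you, so the helper is mainly useful when preparing text for other tooling.

```python
def to_spaced_dna(seq: str) -> str:
    """Convert a raw RNA/DNA string into space-separated DNA letters."""
    # Assumption: strip surrounding whitespace and upper-case before U -> T.
    cleaned = seq.strip().upper().replace("U", "T")
    # Joining a string with " " puts one space between consecutive characters.
    return " ".join(cleaned)


print(to_spaced_dna("acgu"))  # A C G T
```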

### Checkpoint selection

The 1024nt variant is the primary SpliceBERT model trained on variable-length vertebrate
sequences. Use this variant for general-purpose RNA embedding. The 510nt variants are
trained on fixed-length fragments and require exact 510nt input.
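
Transcripts longer than the context window have to be split before embedding. The following is a minimal sketch of one way to do that; `split_into_windows` is a hypothetical helper, and whether the 1024 limit includes `[CLS]` and `[SEP]` is an assumption you should check against the model config (lower `max_len` to 1022 if it does).

```python
def split_into_windows(seq: str, max_len: int = 1024, overlap: int = 0) -> list[str]:
    """Split a sequence into consecutive windows of at most max_len nt."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    windows = []
    for start in range(0, len(seq), step):
        windows.append(seq[start:start + max_len])
        # Stop once a window reaches the end of the sequence.
        if start + max_len >= len(seq):
            break
    return windows


chunks = split_into_windows("A" * 2500)
print([len(c) for c in chunks])  # [1024, 1024, 452]
```

Note that a trailing window can be shorter than 64 nt, which is below the fragment lengths used in pretraining; merging it into the previous window via `overlap` may be preferable.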

## Parity Verification

Hidden-state representations match the original checkpoint (max abs diff < 1e-5)
at all 7 representation levels (embedding output + 6 transformer layers),
for both the `eager` and `sdpa` attention backends.
Verified on GPU with PyTorch 2.7 / CUDA 11.8.
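
A check of this kind can be sketched as follows. This is not the script used for the verification above; `max_abs_diff_per_level` is a hypothetical helper, and the tensors here are synthetic stand-ins for the `hidden_states` tuples of the original and ported models.

```python
import torch


def max_abs_diff_per_level(ref, port):
    """Return the max absolute difference for each pair of hidden states."""
    if len(ref) != len(port):
        raise ValueError("both models must expose the same number of levels")
    return [(r.float() - p.float()).abs().max().item() for r, p in zip(ref, port)]


# Synthetic example: 7 levels of shape (batch, tokens, hidden).
torch.manual_seed(0)
ref = [torch.randn(1, 18, 512) for _ in range(7)]
port = [h + 1e-7 for h in ref]  # simulate tiny numerical drift

diffs = max_abs_diff_per_level(ref, port)
print(all(d < 1e-5 for d in diffs))
```

With real models, pass `out.hidden_states` from each model run on the same input with `output_hidden_states=True`.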

## Related Models

See the full [SpliceBERT collection](https://huggingface.co/collections/Taykhoom/splicebert-6a20b72e9bec05b79ce009aa).

| Model | Context | Training data | Notes |
|---|---|---|---|
| **[SpliceBERT-1024nt](https://huggingface.co/Taykhoom/SpliceBERT-1024nt)** | 1024 nt | 72 vertebrates | This model |
| [SpliceBERT-510nt](https://huggingface.co/Taykhoom/SpliceBERT-510nt) | 510 nt (fixed) | 72 vertebrates | Fixed-length; requires exact 510 nt input |
| [SpliceBERT-human-510nt](https://huggingface.co/Taykhoom/SpliceBERT-human-510nt) | 510 nt (fixed) | Human only | Human-specific; requires exact 510 nt input |

## Usage

### Embedding generation

The tokenizer automatically handles U->T conversion and single-nucleotide spacing.
Pass raw sequences directly.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model.eval()

seq = "ACGUACGUACGUACGU"  # U->T handled automatically
enc = tokenizer(seq, return_tensors="pt")

with torch.no_grad():
    out = model(**enc, output_hidden_states=True)

# Mean pooling over non-special tokens
hidden = out.last_hidden_state[0]  # (seq_len+2, 512)
token_emb = hidden[1:-1]           # strip [CLS] and [SEP]
mean_emb = token_emb.mean(dim=0)   # (512,)

# Intermediate layers: hidden_states[0] is the embedding output,
# hidden_states[i] is the output of transformer layer i
layer3_emb = out.hidden_states[3]  # (1, seq_len+2, 512)
```

### MLM logits

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
model.eval()

seq = "A C G [MASK] A C G T"
enc = tokenizer(seq, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits  # (1, seq_len+2, 10), including [CLS] and [SEP]
```
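
To turn the logits into nucleotide predictions, take a softmax at each `[MASK]` position and look up the top id in the vocabulary. The sketch below assumes the id layout from the Architecture section; `predict_masked` is a hypothetical helper, demonstrated here on synthetic logits so it runs without downloading the model.

```python
import torch

# Id-to-token table, following the vocabulary listed under Architecture.
ID_TO_TOKEN = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "N", "A", "C", "G", "T"]
MASK_ID = 4


def predict_masked(logits, input_ids):
    """Return (position, token, probability) for each [MASK] in one sequence.

    logits: (num_tokens, 10) tensor; input_ids: (num_tokens,) tensor.
    """
    probs = logits.softmax(dim=-1)
    positions = (input_ids == MASK_ID).nonzero(as_tuple=True)[0]
    results = []
    for pos in positions.tolist():
        p, idx = probs[pos].max(dim=-1)
        results.append((pos, ID_TO_TOKEN[idx.item()], p.item()))
    return results


# Synthetic stand-in for "[CLS] A C G [MASK] A C G T [SEP]".
input_ids = torch.tensor([2, 6, 7, 8, 4, 6, 7, 8, 9, 3])
fake_logits = torch.zeros(10, 10)
fake_logits[4, 9] = 5.0  # make T the most likely token at the masked position
print(predict_masked(fake_logits, input_ids))
```

With the real model, call `predict_masked(logits[0], enc["input_ids"][0])`.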

### Fine-tuning

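Fine-tuning follows standard Hugging Face conventions. For sequence-level tasks,
mean-pool the hidden states of the non-special tokens (everything between `[CLS]`
and `[SEP]`, i.e. `hidden[1:-1]` for a single unpadded sequence) and feed the
result to a prediction head.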
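
The `hidden[1:-1]` slice from the embedding example only works for a single unpadded sequence. For padded batches, a mask-based pooling along the following lines is one option. `mean_pool` is a hypothetical helper; it assumes you request `return_special_tokens_mask=True` from the tokenizer, which standard Hugging Face tokenizers support but which has not been confirmed here for this repository's custom tokenizer.

```python
import torch


def mean_pool(hidden, attention_mask, special_tokens_mask):
    """Average hidden states over real, non-special tokens.

    hidden: (batch, tokens, dim); both masks: (batch, tokens) with 0/1 entries.
    """
    # Keep positions that are attended to and are not [CLS]/[SEP]/[PAD].
    keep = attention_mask.bool() & ~special_tokens_mask.bool()
    keep = keep.unsqueeze(-1).to(hidden.dtype)
    summed = (hidden * keep).sum(dim=1)
    counts = keep.sum(dim=1).clamp(min=1)  # avoid division by zero
    return summed / counts


# Synthetic batch: row 0 has two nucleotides plus padding, row 1 has three.
hidden = torch.ones(2, 5, 4)
hidden[0, 1:3] = 3.0
attention_mask = torch.tensor([[1, 1, 1, 1, 0], [1, 1, 1, 1, 1]])
special_tokens_mask = torch.tensor([[1, 0, 0, 1, 0], [1, 0, 0, 0, 1]])
pooled = mean_pool(hidden, attention_mask, special_tokens_mask)
print(pooled.shape)  # torch.Size([2, 4])
```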

## Implementation Notes

The original checkpoint was saved as `BertForMaskedLM` with `transformers==4.24.0`.
This port uses [BERT-updated](https://huggingface.co/Taykhoom/BERT-updated), which
adds `attn_implementation="sdpa"` and `attn_implementation="flash_attention_2"` support
not present in the original codebase.

```python
model = AutoModel.from_pretrained("Taykhoom/SpliceBERT-1024nt",
                                  trust_remote_code=True,
                                  attn_implementation="sdpa")
```
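
If the same script has to run on machines with and without FlashAttention, the backend string can be chosen at runtime. The sketch below is one possible policy, not part of this repository: `pick_attn_implementation` is a hypothetical helper, and it relies on the general constraint that FlashAttention 2 needs a CUDA GPU, half-precision weights, and the `flash_attn` package. Only `eager` and `sdpa` are covered by the parity check above.

```python
import importlib.util


def pick_attn_implementation(on_gpu: bool, half_precision: bool) -> str:
    """Choose a value for attn_implementation based on the environment."""
    # flash_attn is an optional dependency; only detect it, do not import it.
    has_flash = importlib.util.find_spec("flash_attn") is not None
    if on_gpu and half_precision and has_flash:
        return "flash_attention_2"
    # Fall back to the PyTorch scaled-dot-product attention backend.
    return "sdpa"


print(pick_attn_implementation(on_gpu=False, half_precision=False))  # sdpa
```

Pass the returned string as `attn_implementation=...` to `from_pretrained`.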

## Citation

```bibtex
@article{chen2024_splicebert,
  title   = {Self-supervised learning on millions of primary {RNA} sequences from 72 vertebrates improves sequence-based {RNA} splicing prediction},
  author  = {Chen, Ken and Zhou, Yue and Ding, Maolin and Wang, Yu and Ren, Zhixiang and Yang, Yuedong},
  journal = {Briefings in Bioinformatics},
  volume  = {25},
  number  = {3},
  pages   = {bbae163},
  year    = {2024},
  doi     = {10.1093/bib/bbae163}
}
```

## Credits

Original model and code by Chen et al. Source: [GitHub](https://github.com/biomed-AI/SpliceBERT).
The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
and reviewed manually by Taykhoom Dalal.

## License

MIT, following the original repository.