Taykhoom committed · Commit 049dd62 · verified · 1 Parent(s): 1e0de24

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,127 @@
+ ---
+ language:
+ - rna
+ library_name: transformers
+ tags:
+ - RNA
+ - language-model
+ - splicing
+ license: mit
+ ---
+
+ # SpliceBERT-human-510nt
+
+ SpliceBERT is a BERT-based RNA language model pre-trained on primary RNA sequences
+ using a masked language modeling (MLM) objective. This human-specific 510nt variant
+ is trained exclusively on fixed-length 510 nt fragments from human primary RNA sequences.
+
+ **WARNING:** This model requires exactly 510 nt of input (excluding [CLS] and [SEP]).
+ Sequences shorter or longer than 510 nt may produce incorrect outputs without fine-tuning.
+ For general-purpose RNA embedding, use [SpliceBERT-1024nt](https://huggingface.co/Taykhoom/SpliceBERT-1024nt) instead.
+
+ ## Architecture
+
+ | Parameter | Value |
+ |---|---|
+ | Layers | 6 |
+ | Attention heads | 16 |
+ | Embedding dimension | 512 |
+ | Intermediate dimension | 2048 |
+ | Vocabulary size | 10 |
+ | Positional encoding | Learned absolute |
+ | Architecture | BERT encoder |
+ | Max sequence length | 510 nt (fixed-length training) |
+ | Parameters | ~19M |
+
+ Vocabulary: `[PAD]`=0, `[UNK]`=1, `[CLS]`=2, `[SEP]`=3, `[MASK]`=4, `N`=5, `A`=6, `C`=7, `G`=8, `T/U`=9
+
+ ## Pretraining
+
+ - **Objective:** Masked language modeling (MLM)
+ - **Data:** Human primary RNA sequences
+ - **Sequence format:** Single-nucleotide tokenization with spaces; U converted to T; fixed 510 nt fragments
+ - **Source checkpoint:** `SpliceBERT-human.510nt/pytorch_model.bin` (from [zenodo:7995778](https://doi.org/10.5281/zenodo.7995778))
+
+ ### Checkpoint selection
+
+ This human-only variant may outperform the multi-species 510nt model on human-specific
+ splicing tasks. For cross-species generalization or variable-length sequences, use
+ [SpliceBERT-1024nt](https://huggingface.co/Taykhoom/SpliceBERT-1024nt).
+
+ ## Parity Verification
+
+ Hidden-state representations verified (max abs diff < 1e-5) against the original
+ checkpoint at all 7 representation levels (embedding + 6 transformer layers),
+ for both `eager` and `sdpa` attention backends.
+ Verified on GPU with PyTorch 2.7 / CUDA 11.8.
+
+ ## Related Models
+
+ See the full [SpliceBERT collection](<COLLECTION_URL>).
+
+ | Model | Context | Training data | Notes |
+ |---|---|---|---|
+ | [SpliceBERT-1024nt](https://huggingface.co/Taykhoom/SpliceBERT-1024nt) | 1024 nt | 72 vertebrates | Variable-length; general purpose |
+ | [SpliceBERT-510nt](https://huggingface.co/Taykhoom/SpliceBERT-510nt) | 510 nt (fixed) | 72 vertebrates | Multi-species 510 nt |
+ | **[SpliceBERT-human-510nt](https://huggingface.co/Taykhoom/SpliceBERT-human-510nt)** | 510 nt (fixed) | Human only | This model |
+
+ ## Usage
+
+ ```python
+ import torch
+ from transformers import AutoModel, AutoTokenizer
+
+ # config.json and tokenizer_config.json map the Auto classes to custom code
+ # (BERT-updated model, SpliceBERTTokenizer), so trust_remote_code=True is needed.
+ tokenizer = AutoTokenizer.from_pretrained("Taykhoom/SpliceBERT-human-510nt", trust_remote_code=True)
+ model = AutoModel.from_pretrained("Taykhoom/SpliceBERT-human-510nt", trust_remote_code=True)
+ model.eval()
+
+ # Sequence must be exactly 510 nt; U->T conversion; space-separated
+ seq = ("ATCGATCG" * 64)[:510]  # exactly 510 nt
+ seq_spaced = " ".join(list(seq.upper().replace("U", "T")))
+
+ enc = tokenizer(seq_spaced, return_tensors="pt")
+
+ with torch.no_grad():
+     out = model(**enc, output_hidden_states=True)
+
+ hidden = out.last_hidden_state[0]  # (512, 512)
+ token_emb = hidden[1:-1]           # strip [CLS] and [SEP] -> (510, 512)
+ mean_emb = token_emb.mean(dim=0)   # (512,)
+ ```
+
+ ### Fine-tuning
+
+ Fine-tuning follows standard HF conventions. For splice site prediction, token-level
+ classification using all 510 token positions (excluding special tokens) is the typical setup.
+
+ ## Implementation Notes
+
+ The original checkpoint was saved as `BertForMaskedLM` with `transformers==4.18.0`.
+ This port uses [BERT-updated](https://huggingface.co/Taykhoom/BERT-updated), which
+ adds `attn_implementation="sdpa"` and `attn_implementation="flash_attention_2"` support
+ not present in the original codebase.
+
+ ## Citation
+
+ ```bibtex
+ @article{chen2024_splicebert,
+   title = {Self-supervised learning on millions of primary {RNA} sequences from 72 vertebrates improves sequence-based {RNA} splicing prediction},
+   author = {Chen, Ken and Zhou, Yue and Ding, Maolin and Wang, Yu and Ren, Zhixiang and Yang, Yuedong},
+   journal = {Briefings in Bioinformatics},
+   volume = {25},
+   number = {3},
+   pages = {bbae163},
+   year = {2024},
+   doi = {10.1093/bib/bbae163}
+ }
+ ```
+
+ ## Credits
+
+ Original model and code by Chen et al. Source: [GitHub](https://github.com/biomed-AI/SpliceBERT).
+ The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
+ and reviewed manually by Taykhoom Dalal.
+
+ ## License
+
+ MIT, following the original repository.
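
Note on the README's input constraints: the model card asks for exactly 510 nt, U converted to T, and space-separated nucleotides. These can be checked before tokenization. The following is a minimal sketch under that reading; `prepare_sequence` and `FRAGMENT_LENGTH` are hypothetical names for illustration and are not part of this commit.

```python
FRAGMENT_LENGTH = 510  # fixed input length stated in the README


def prepare_sequence(seq: str, length: int = FRAGMENT_LENGTH) -> str:
    """Hypothetical helper: normalize a nucleotide string and space-separate it.

    Upper-cases the input, converts U to T, and rejects any sequence whose
    length differs from the fixed fragment length.
    """
    normalized = seq.upper().replace("U", "T")
    if len(normalized) != length:
        raise ValueError(f"expected {length} nt, got {len(normalized)}")
    return " ".join(normalized)


# Example input: an RNA-style string trimmed to exactly 510 nt.
spaced = prepare_sequence(("AUCGAUCG" * 64)[:510])
n_nucleotides = len(spaced.split())

# With [CLS] and [SEP] added by the tokenizer, the model sees 512 positions,
# which matches max_position_embeddings in config.json.
n_positions = n_nucleotides + 2
print(n_nucleotides, n_positions)
```

Rejecting wrong-length input early, rather than padding or truncating silently, follows the README's warning that other lengths may give incorrect outputs.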
config.json ADDED
@@ -0,0 +1,27 @@
+ {
+   "_name_or_path": "Taykhoom/SpliceBERT-human-510nt",
+   "architectures": [
+     "BertModel"
+   ],
+   "model_type": "bert_updated",
+   "auto_map": {
+     "AutoConfig": "Taykhoom/BERT-updated--configuration_bert_updated.BertUpdatedConfig",
+     "AutoModel": "Taykhoom/BERT-updated--modeling_bert.BertModel",
+     "AutoModelForMaskedLM": "Taykhoom/BERT-updated--modeling_bert.BertForMaskedLM"
+   },
+   "vocab_size": 10,
+   "hidden_size": 512,
+   "num_hidden_layers": 6,
+   "num_attention_heads": 16,
+   "intermediate_size": 2048,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "attention_probs_dropout_prob": 0.1,
+   "max_position_embeddings": 512,
+   "type_vocab_size": 2,
+   "initializer_range": 0.02,
+   "layer_norm_eps": 1e-12,
+   "pad_token_id": 0,
+   "model_max_length": 510,
+   "transformers_version": "4.57.6"
+ }
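
Note on model size: the values in this config are enough for a rough parameter count. The sketch below assumes the standard BERT encoder layout (word, position and token-type embeddings with one LayerNorm; per layer, four attention projections and a two-layer feed-forward block, each followed by a LayerNorm). The function name and the `with_pooler` switch are illustrative only; whether the checkpoint stores a pooler is not stated in the commit.

```python
def bert_encoder_param_count(vocab_size, hidden, layers, intermediate,
                             max_positions, type_vocab, with_pooler=False):
    """Count weights and biases of a BERT encoder under the assumed standard layout."""
    layer_norm = 2 * hidden  # scale + bias
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, each with a bias, plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Up-projection, down-projection (both with bias), plus LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)
    total = embeddings + layers * (attention + feed_forward)
    if with_pooler:
        total += hidden * hidden + hidden
    return total


# Values copied from config.json in this commit.
n_params = bert_encoder_param_count(
    vocab_size=10, hidden=512, layers=6, intermediate=2048,
    max_positions=512, type_vocab=2,
)
print(f"{n_params:,} parameters, ~{n_params * 4 / 1e6:.1f} MB in float32")
```

Under these assumptions the count is about 19.2M parameters, or roughly 76.7 MB in float32, which appears consistent with the 76,749,736-byte `model.safetensors` in this commit once a small file header is allowed for.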
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:43e4cd7d06d59d2bbed34cb5d20d8032f3a7966ff226ec0d1a9645efd211779a
+ size 76749736
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "sep_token": "[SEP]",
+   "pad_token": "[PAD]",
+   "mask_token": "[MASK]",
+   "unk_token": "[UNK]"
+ }
tokenization_splicebert.py ADDED
@@ -0,0 +1,98 @@
+ import json
+ import os
+ from transformers import PreTrainedTokenizer
+
+ _DEFAULT_VOCAB = {
+     "[PAD]": 0,
+     "[UNK]": 1,
+     "[CLS]": 2,
+     "[SEP]": 3,
+     "[MASK]": 4,
+     "N": 5,
+     "A": 6,
+     "C": 7,
+     "G": 8,
+     "T": 9,
+ }
+
+
+ class SpliceBERTTokenizer(PreTrainedTokenizer):
+     """Single-nucleotide tokenizer for SpliceBERT.
+
+     Automatically converts U->T and adds [CLS]/[SEP] special tokens.
+     Raw sequences (not pre-spaced) are accepted.
+     """
+
+     vocab_files_names = {"vocab_file": "vocab.json"}
+     model_input_names = ["input_ids", "attention_mask"]
+
+     def __init__(
+         self,
+         vocab_file=None,
+         cls_token="[CLS]",
+         sep_token="[SEP]",
+         pad_token="[PAD]",
+         mask_token="[MASK]",
+         unk_token="[UNK]",
+         **kwargs,
+     ):
+         self._vocab = dict(_DEFAULT_VOCAB)
+         if vocab_file and os.path.isfile(vocab_file):
+             with open(vocab_file) as f:
+                 self._vocab = json.load(f)
+         self._ids_to_tokens = {v: k for k, v in self._vocab.items()}
+         super().__init__(
+             cls_token=cls_token,
+             sep_token=sep_token,
+             pad_token=pad_token,
+             mask_token=mask_token,
+             unk_token=unk_token,
+             **kwargs,
+         )
+
+     @property
+     def vocab_size(self):
+         return len(self._vocab)
+
+     def get_vocab(self):
+         return dict(self._vocab)
+
+     def _tokenize(self, text):
+         return list(text.upper().replace("U", "T").replace(" ", ""))
+
+     def _convert_token_to_id(self, token):
+         return self._vocab.get(token, self._vocab["[UNK]"])
+
+     def _convert_id_to_token(self, index):
+         return self._ids_to_tokens.get(index, "[UNK]")
+
+     def save_vocabulary(self, save_directory, filename_prefix=None):
+         os.makedirs(save_directory, exist_ok=True)
+         fname = (filename_prefix + "-" if filename_prefix else "") + "vocab.json"
+         path = os.path.join(save_directory, fname)
+         with open(path, "w") as f:
+             json.dump(self._vocab, f, indent=2)
+         return (path,)
+
+     def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
+         cls = [self.cls_token_id]
+         sep = [self.sep_token_id]
+         if token_ids_1 is None:
+             return cls + token_ids_0 + sep
+         return cls + token_ids_0 + sep + cls + token_ids_1 + sep
+
+     def get_special_tokens_mask(self, token_ids_0, token_ids_1=None,
+                                 already_has_special_tokens=False):
+         if already_has_special_tokens:
+             return super().get_special_tokens_mask(
+                 token_ids_0, token_ids_1, already_has_special_tokens=True
+             )
+         mask = [1] + [0] * len(token_ids_0) + [1]
+         if token_ids_1 is not None:
+             mask += [1] + [0] * len(token_ids_1) + [1]
+         return mask
+
+     def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1=None):
+         # One segment id (0) per position, matching build_inputs_with_special_tokens.
+         if token_ids_1 is None:
+             return [0] * (len(token_ids_0) + 2)
+         return [0] * (len(token_ids_0) + len(token_ids_1) + 4)
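
Note on the tokenizer's behavior: the encoding path for a single sequence (normalize, split into nucleotides, map to ids, wrap in `[CLS]`/`[SEP]`) can be illustrated without `transformers`. The following stand-alone sketch mirrors `_tokenize`, `_convert_token_to_id`, `build_inputs_with_special_tokens`, and the single-sequence branch of `get_special_tokens_mask`. The function names `encode` and `special_tokens_mask` are illustrative, and the sketch does not reproduce the base class's handling of special tokens written literally in the text, such as `[MASK]`.

```python
# Vocabulary copied from _DEFAULT_VOCAB / vocab.json in this commit.
VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
    "N": 5, "A": 6, "C": 7, "G": 8, "T": 9,
}


def encode(seq):
    """Illustrative single-sequence encoding: U->T, drop spaces, add [CLS]/[SEP]."""
    tokens = list(seq.upper().replace("U", "T").replace(" ", ""))
    ids = [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens]
    return [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]


def special_tokens_mask(ids):
    """1 for the leading [CLS] and trailing [SEP], 0 for every position between."""
    return [1] + [0] * (len(ids) - 2) + [1]


ids = encode("acgu n")       # lower case, a U and a space are all normalized
unknown = encode("AXG")      # characters outside the vocabulary map to [UNK]
mask = special_tokens_mask(ids)
print(ids, unknown, mask)
```

Because spaces are removed during normalization, raw and space-separated sequences yield the same ids in this sketch, which matches the class docstring's statement that raw sequences are accepted.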
tokenizer_config.json ADDED
@@ -0,0 +1,16 @@
+ {
+   "auto_map": {
+     "AutoTokenizer": [
+       "tokenization_splicebert.SpliceBERTTokenizer",
+       null
+     ]
+   },
+   "model_max_length": 510,
+   "tokenizer_class": "SpliceBERTTokenizer",
+   "cls_token": "[CLS]",
+   "sep_token": "[SEP]",
+   "eos_token": "[SEP]",
+   "pad_token": "[PAD]",
+   "mask_token": "[MASK]",
+   "unk_token": "[UNK]"
+ }
vocab.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "[PAD]": 0,
+   "[UNK]": 1,
+   "[CLS]": 2,
+   "[SEP]": 3,
+   "[MASK]": 4,
+   "N": 5,
+   "A": 6,
+   "C": 7,
+   "G": 8,
+   "T": 9
+ }
+ }