Upload folder using huggingface_hub

Files changed (7)

README.md +127 -0
config.json +27 -0
model.safetensors +3 -0
special_tokens_map.json +7 -0
tokenization_splicebert.py +98 -0
tokenizer_config.json +16 -0
vocab.json +12 -0

README.md ADDED

	@@ -0,0 +1,127 @@

+---
+language:
+- rna
+library_name: transformers
+tags:
+- RNA
+- language-model
+- splicing
+license: mit
+---
+# SpliceBERT-human-510nt
+SpliceBERT is a BERT-based RNA language model pre-trained on primary RNA sequences
+using a masked language modeling (MLM) objective. This human-specific 510nt variant
+is trained exclusively on fixed-length 510 nt fragments of human primary RNA (pre-mRNA) sequences.
+**WARNING:** This model requires exactly 510 nt of input (excluding [CLS] and [SEP]).
+Sequences shorter or longer than 510 nt may produce incorrect outputs without fine-tuning.
+For general-purpose RNA embedding, use [SpliceBERT-1024nt](https://huggingface.co/Taykhoom/SpliceBERT-1024nt) instead.
+## Architecture
+| Parameter | Value |
+|---|---|
+| Layers | 6 |
+| Attention heads | 16 |
+| Embedding dimension | 512 |
+| Intermediate dimension | 2048 |
+| Vocabulary size | 10 |
+| Positional encoding | Learned absolute |
+| Architecture | BERT encoder |
+| Max sequence length | 510 (fixed-length training) |
+| Parameters | ~19M |
+Vocabulary: `[PAD]`=0, `[UNK]`=1, `[CLS]`=2, `[SEP]`=3, `[MASK]`=4, `N`=5, `A`=6, `C`=7, `G`=8, `T/U`=9
+## Pretraining
+- **Objective:** Masked language modeling (MLM)
+- **Data:** Human primary RNA sequences
+- **Sequence format:** Single-nucleotide tokenization with spaces; U converted to T; fixed 510 nt fragments
+- **Source checkpoint:** `SpliceBERT-human.510nt/pytorch_model.bin` (from [zenodo:7995778](https://doi.org/10.5281/zenodo.7995778))
+### Checkpoint selection
+This human-only variant may outperform the multi-species 510nt model on human-specific
+splicing tasks. For cross-species generalization or variable-length sequences, use
+[SpliceBERT-1024nt](https://huggingface.co/Taykhoom/SpliceBERT-1024nt).
+## Parity Verification
+Hidden-state representations verified (max abs diff < 1e-5) against the original
+checkpoint at all 7 representation levels (embedding + 6 transformer layers),
+for both `eager` and `sdpa` attention backends.
+Verified on GPU with PyTorch 2.7 / CUDA 11.8.
+## Related Models
+See the full [SpliceBERT collection](<COLLECTION_URL>).
+| Model | Context | Training data | Notes |
+|---|---|---|---|
+| [SpliceBERT-1024nt](https://huggingface.co/Taykhoom/SpliceBERT-1024nt) | 1024 nt | 72 vertebrates | Variable-length; general purpose |
+| [SpliceBERT-510nt](https://huggingface.co/Taykhoom/SpliceBERT-510nt) | 510 nt (fixed) | 72 vertebrates | Multi-species 510 nt |
+| **[SpliceBERT-human-510nt](https://huggingface.co/Taykhoom/SpliceBERT-human-510nt)** | 510 nt (fixed) | Human only | This model |
+## Usage
+```python
+import torch
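+from transformers import AutoModel, AutoTokenizer
+# This repo ships a custom tokenizer (tokenization_splicebert.py, vocab.json) and a
+# custom `bert_updated` model class, so both are loaded via Auto* with trust_remote_code.
+tokenizer = AutoTokenizer.from_pretrained("Taykhoom/SpliceBERT-human-510nt", trust_remote_code=True)
+model = AutoModel.from_pretrained("Taykhoom/SpliceBERT-human-510nt", trust_remote_code=True)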
+model.eval()
+# Sequence must be exactly 510 nt; U->T conversion; space-separated
+seq = ("ATCGATCG" * 64)[:510]  # exactly 510 nt
+seq_spaced = " ".join(list(seq.upper().replace("U", "T")))
+enc = tokenizer(seq_spaced, return_tensors="pt")
+with torch.no_grad():
+    out = model(**enc, output_hidden_states=True)
+hidden = out.last_hidden_state[0]  # (512, 512)
+token_emb = hidden[1:-1]           # strip [CLS] and [SEP] -> (510, 512)
+mean_emb = token_emb.mean(dim=0)   # (512,)
+```
+### Fine-tuning
+Fine-tuning follows standard Hugging Face conventions. For splice site prediction, the
+typical setup is token-level classification over all 510 nucleotide positions (excluding special tokens).
+## Implementation Notes
+The original checkpoint was saved as `BertForMaskedLM` with `transformers==4.18.0`.
+This port uses [BERT-updated](https://huggingface.co/Taykhoom/BERT-updated), which
+adds `attn_implementation="sdpa"` and `attn_implementation="flash_attention_2"` support
+not present in the original codebase.
+## Citation
+```bibtex
+@article{chen2024_splicebert,
+  title   = {Self-supervised learning on millions of primary {RNA} sequences from 72 vertebrates improves sequence-based {RNA} splicing prediction},
+  author  = {Chen, Ken and Zhou, Yue and Ding, Maolin and Wang, Yu and Ren, Zhixiang and Yang, Yuedong},
+  journal = {Briefings in Bioinformatics},
+  volume  = {25},
+  number  = {3},
+  pages   = {bbae163},
+  year    = {2024},
+  doi     = {10.1093/bib/bbae163}
+}
+```
+## Credits
+Original model and code by Chen et al. Source: [GitHub](https://github.com/biomed-AI/SpliceBERT).
+The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
+and reviewed manually by Taykhoom Dalal.
+## License
+MIT, following the original repository.
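
Note on the README's fixed-length requirement: the following is a minimal sketch of input preparation, assuming plain Python strings as input. `prepare_sequence` and `split_into_windows` are hypothetical helper names that do not exist in this repository, and tiling a longer transcript into non-overlapping 510 nt windows is only one possible approach that the README does not prescribe.

```python
EXPECTED_LEN = 510  # fixed input length stated in the README


def prepare_sequence(seq, expected_len=EXPECTED_LEN):
    """Normalize a nucleotide string and return it space-separated.

    Hypothetical helper: upper-cases, converts U->T, and rejects any
    sequence that is not exactly `expected_len` nucleotides long.
    """
    normalized = seq.strip().upper().replace("U", "T")
    unknown = set(normalized) - set("ACGTN")
    if unknown:
        raise ValueError(f"unsupported characters: {sorted(unknown)}")
    if len(normalized) != expected_len:
        raise ValueError(f"expected {expected_len} nt, got {len(normalized)}")
    return " ".join(normalized)


def split_into_windows(seq, window=EXPECTED_LEN, stride=EXPECTED_LEN):
    """Yield (start, fragment) pairs of full-length windows.

    Assumption: a trailing fragment shorter than `window` is dropped
    rather than padded, since the README warns against short inputs.
    """
    for start in range(0, len(seq) - window + 1, stride):
        yield start, seq[start:start + window]
```

The output of `prepare_sequence` has the same space-separated form that the README's usage snippet passes to the tokenizer.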

config.json ADDED

	@@ -0,0 +1,27 @@

+{
+  "_name_or_path": "Taykhoom/SpliceBERT-human-510nt",
+  "architectures": [
+    "BertModel"
+  ],
+  "model_type": "bert_updated",
+  "auto_map": {
+    "AutoConfig": "Taykhoom/BERT-updated--configuration_bert_updated.BertUpdatedConfig",
+    "AutoModel": "Taykhoom/BERT-updated--modeling_bert.BertModel",
+    "AutoModelForMaskedLM": "Taykhoom/BERT-updated--modeling_bert.BertForMaskedLM"
+  },
+  "vocab_size": 10,
+  "hidden_size": 512,
+  "num_hidden_layers": 6,
+  "num_attention_heads": 16,
+  "intermediate_size": 2048,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "attention_probs_dropout_prob": 0.1,
+  "max_position_embeddings": 512,
+  "type_vocab_size": 2,
+  "initializer_range": 0.02,
+  "layer_norm_eps": 1e-12,
+  "pad_token_id": 0,
+  "model_max_length": 510,
+  "transformers_version": "4.57.6"
+}
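
Note on the values in this config: a rough sanity check, assuming the standard BERT encoder layout (the BERT-updated modeling code is not part of this diff, so the layout is an assumption). `count_bert_encoder_params` is a hypothetical helper. It derives the per-head dimension and an analytic parameter count from the fields above.

```python
import json

# Subset of the config.json fields shown in the diff above.
config = json.loads("""
{
  "vocab_size": 10,
  "hidden_size": 512,
  "num_hidden_layers": 6,
  "num_attention_heads": 16,
  "intermediate_size": 2048,
  "max_position_embeddings": 512,
  "type_vocab_size": 2
}
""")


def count_bert_encoder_params(cfg, with_pooler=False):
    """Count weights and biases of a standard BERT encoder (assumed layout)."""
    h, i = cfg["hidden_size"], cfg["intermediate_size"]
    layer_norm = 2 * h  # scale + bias
    embeddings = (
        cfg["vocab_size"] + cfg["max_position_embeddings"] + cfg["type_vocab_size"]
    ) * h + layer_norm
    attention = 4 * (h * h + h) + layer_norm  # Q, K, V and output projections
    feed_forward = (h * i + i) + (i * h + h) + layer_norm
    total = embeddings + cfg["num_hidden_layers"] * (attention + feed_forward)
    if with_pooler:
        total += h * h + h  # optional pooler dense layer
    return total


head_dim = config["hidden_size"] // config["num_attention_heads"]
n_params = count_bert_encoder_params(config)
print(head_dim, n_params)
```

Under these assumptions the encoder has about 19.2M parameters without a pooler and about 19.4M with one.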

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:43e4cd7d06d59d2bbed34cb5d20d8032f3a7966ff226ec0d1a9645efd211779a
+size 76749736
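
Note on the pointer file: the three lines above are a Git LFS pointer, not the weights themselves. The sketch below parses the pointer and estimates the parameter count from the byte size, assuming the tensors are stored as float32 and ignoring the safetensors header. Both are assumptions, and `parse_lfs_pointer` is a hypothetical helper.

```python
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:43e4cd7d06d59d2bbed34cb5d20d8032f3a7966ff226ec0d1a9645efd211779a
size 76749736
"""


def parse_lfs_pointer(text):
    """Split each 'key value' line of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER)
# Assumption: 4 bytes per parameter (float32); header overhead is ignored.
approx_params = pointer["size"] / 4
print(f"{approx_params / 1e6:.1f}M parameters (approx.)")
```

The estimate comes out near 19M parameters.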

special_tokens_map.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "cls_token": "[CLS]",
+  "sep_token": "[SEP]",
+  "pad_token": "[PAD]",
+  "mask_token": "[MASK]",
+  "unk_token": "[UNK]"
+}

tokenization_splicebert.py ADDED

	@@ -0,0 +1,98 @@

+import json
+import os
+from transformers import PreTrainedTokenizer
+_DEFAULT_VOCAB = {
+    "[PAD]": 0,
+    "[UNK]": 1,
+    "[CLS]": 2,
+    "[SEP]": 3,
+    "[MASK]": 4,
+    "N": 5,
+    "A": 6,
+    "C": 7,
+    "G": 8,
+    "T": 9,
+}
+class SpliceBERTTokenizer(PreTrainedTokenizer):
+    """Single-nucleotide tokenizer for SpliceBERT.
+    Automatically converts U->T and adds [CLS]/[SEP] special tokens.
+    Raw sequences (not pre-spaced) are accepted.
+    """
+    vocab_files_names = {"vocab_file": "vocab.json"}
+    model_input_names = ["input_ids", "attention_mask"]
+    def __init__(
+        self,
+        vocab_file=None,
+        cls_token="[CLS]",
+        sep_token="[SEP]",
+        pad_token="[PAD]",
+        mask_token="[MASK]",
+        unk_token="[UNK]",
+        **kwargs,
+    ):
+        self._vocab = dict(_DEFAULT_VOCAB)
+        if vocab_file and os.path.isfile(vocab_file):
+            with open(vocab_file) as f:
+                self._vocab = json.load(f)
+        self._ids_to_tokens = {v: k for k, v in self._vocab.items()}
+        super().__init__(
+            cls_token=cls_token,
+            sep_token=sep_token,
+            pad_token=pad_token,
+            mask_token=mask_token,
+            unk_token=unk_token,
+            **kwargs,
+        )
+    @property
+    def vocab_size(self):
+        return len(self._vocab)
+    def get_vocab(self):
+        return dict(self._vocab)
+    def _tokenize(self, text):
+        return list(text.upper().replace("U", "T").replace(" ", ""))
+    def _convert_token_to_id(self, token):
+        return self._vocab.get(token, self._vocab["[UNK]"])
+    def _convert_id_to_token(self, index):
+        return self._ids_to_tokens.get(index, "[UNK]")
+    def save_vocabulary(self, save_directory, filename_prefix=None):
+        os.makedirs(save_directory, exist_ok=True)
+        fname = (filename_prefix + "-" if filename_prefix else "") + "vocab.json"
+        path = os.path.join(save_directory, fname)
+        with open(path, "w") as f:
+            json.dump(self._vocab, f, indent=2)
+        return (path,)
+    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
+        cls = [self.cls_token_id]
+        sep = [self.sep_token_id]
+        if token_ids_1 is None:
+            return cls + token_ids_0 + sep
+        return cls + token_ids_0 + sep + cls + token_ids_1 + sep
+    def get_special_tokens_mask(self, token_ids_0, token_ids_1=None,
+                                already_has_special_tokens=False):
+        if already_has_special_tokens:
+            return super().get_special_tokens_mask(
+                token_ids_0, token_ids_1, already_has_special_tokens=True
+            )
+        mask = [1] + [0] * len(token_ids_0) + [1]
+        if token_ids_1 is not None:
+            mask += [1] + [0] * len(token_ids_1) + [1]
+        return mask
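+    def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1=None):
+        # Segment ids are all zeros, one per position including [CLS]/[SEP];
+        # the lengths mirror build_inputs_with_special_tokens.
+        if token_ids_1 is None:
+            return [0] * (len(token_ids_0) + 2)
+        return [0] * (len(token_ids_0) + len(token_ids_1) + 4)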
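
Note on the tokenizer logic: the encoding path above can be traced without `transformers`. The sketch below is a dependency-free illustration of what `_tokenize`, `_convert_token_to_id`, `build_inputs_with_special_tokens`, and `get_special_tokens_mask` do for a single sequence. `encode` is a hypothetical stand-in, not the class's API.

```python
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
         "N": 5, "A": 6, "C": 7, "G": 8, "T": 9}


def encode(seq):
    """Mimic the single-sequence encoding steps of the tokenizer above."""
    # Same normalization as _tokenize: upper-case, U->T, drop spaces.
    tokens = list(seq.upper().replace("U", "T").replace(" ", ""))
    # Unknown characters fall back to [UNK], as in _convert_token_to_id.
    ids = [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens]
    input_ids = [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "special_tokens_mask": [1] + [0] * len(ids) + [1],
    }


encoded = encode("a u g c n x")
print(encoded["input_ids"])  # [2, 6, 9, 8, 7, 5, 1, 3]
```

Lower-case input, `U`, and spaces are all normalized away, and the unsupported character `x` maps to `[UNK]` (id 1).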

tokenizer_config.json ADDED

	@@ -0,0 +1,16 @@

+{
+  "auto_map": {
+    "AutoTokenizer": [
+      "tokenization_splicebert.SpliceBERTTokenizer",
+      null
+    ]
+  },
+  "model_max_length": 512,
+  "tokenizer_class": "SpliceBERTTokenizer",
+  "cls_token": "[CLS]",
+  "sep_token": "[SEP]",
+  "eos_token": "[SEP]",
+  "pad_token": "[PAD]",
+  "mask_token": "[MASK]",
+  "unk_token": "[UNK]"
+}

vocab.json ADDED

	@@ -0,0 +1,12 @@

+{
+  "[PAD]": 0,
+  "[UNK]": 1,
+  "[CLS]": 2,
+  "[SEP]": 3,
+  "[MASK]": 4,
+  "N": 5,
+  "A": 6,
+  "C": 7,
+  "G": 8,
+  "T": 9
+}