---
language:
- rna
library_name: transformers
tags:
- RNA
- language-model
- splicing
license: mit
---

# SpliceBERT-human-510nt

SpliceBERT is a BERT-based RNA language model pre-trained on primary RNA sequences using a masked language modeling (MLM) objective. This human-specific 510nt variant is trained exclusively on fixed-length 510 nt fragments from human primary RNA sequences.

**WARNING:** This model requires exactly 510 nt of input (excluding [CLS] and [SEP]). Sequences shorter or longer than 510 nt may produce incorrect outputs without fine-tuning. For general-purpose RNA embedding, use [SpliceBERT-1024nt](https://huggingface.co/Taykhoom/SpliceBERT-1024nt) instead.

## Architecture

| Parameter | Value |
|---|---|
| Layers | 6 |
| Attention heads | 16 |
| Embedding dimension | 512 |
| Intermediate dimension | 2048 |
| Vocabulary size | 10 |
| Positional encoding | Learned absolute |
| Architecture | BERT encoder |
| Max sequence length | 510 (fixed-length training) |
| Parameters | ~19M |

Vocabulary: `[PAD]`=0, `[UNK]`=1, `[CLS]`=2, `[SEP]`=3, `[MASK]`=4, `N`=5, `A`=6, `C`=7, `G`=8, `T/U`=9

## Pretraining

- **Objective:** Masked language modeling (MLM)
- **Data:** Human primary RNA sequences
- **Sequence format:** Single-nucleotide tokenization with spaces; U converted to T; fixed 510 nt fragments
- **Source checkpoint:** `SpliceBERT-human.510nt/pytorch_model.bin` (from [zenodo:7995778](https://doi.org/10.5281/zenodo.7995778))

### Checkpoint selection

This human-only variant may outperform the multi-species 510nt model on human-specific splicing tasks. For cross-species generalization or variable-length sequences, use [SpliceBERT-1024nt](https://huggingface.co/Taykhoom/SpliceBERT-1024nt).

## Parity Verification

Hidden-state representations were verified (max abs diff < 1e-5) against the original checkpoint at all 7 representation levels (embedding + 6 transformer layers), for both the `eager` and `sdpa` attention backends. Verified on GPU with PyTorch 2.7 / CUDA 11.8.
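
A parity check of this kind can be sketched as follows. This is a minimal illustration, not the script used for the verification above: the helper names `max_abs_diff` and `check_parity` are hypothetical, and the inputs are assumed to be plain nested lists (for example obtained via `tensor.tolist()`), one per representation level.

```python
def max_abs_diff(a, b):
    """Largest absolute element-wise difference between two equally shaped nested lists."""
    a_is_num = isinstance(a, (int, float))
    b_is_num = isinstance(b, (int, float))
    if a_is_num and b_is_num:
        return abs(a - b)
    if a_is_num or b_is_num or len(a) != len(b):
        raise ValueError("shape mismatch between reference and ported outputs")
    # Recurse into each pair of sub-lists; an empty list contributes 0.0.
    return max((max_abs_diff(x, y) for x, y in zip(a, b)), default=0.0)


def check_parity(reference, ported, tol=1e-5):
    """Return the max abs diff per representation level; raise if any is not below `tol`.

    `reference` and `ported` are sequences with one nested list per level
    (assumed here: embedding output followed by each transformer layer).
    """
    if len(reference) != len(ported):
        raise ValueError("different number of representation levels")
    diffs = [max_abs_diff(r, p) for r, p in zip(reference, ported)]
    for level, diff in enumerate(diffs):
        if not diff < tol:
            raise AssertionError(f"level {level}: max abs diff {diff:.3e} is not below {tol}")
    return diffs
```

With Hugging Face BERT-style models, the per-level inputs would typically come from `out.hidden_states` when the model is called with `output_hidden_states=True`; for a 6-layer encoder that tuple has 7 entries.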

## Related Models

See the full [SpliceBERT collection](https://huggingface.co/collections/Taykhoom/splicebert-6a20b72e9bec05b79ce009aa).

| Model | Context | Training data | Notes |
|---|---|---|---|
| [SpliceBERT-1024nt](https://huggingface.co/Taykhoom/SpliceBERT-1024nt) | 1024 nt | 72 vertebrates | Variable-length; general purpose |
| [SpliceBERT-510nt](https://huggingface.co/Taykhoom/SpliceBERT-510nt) | 510 nt (fixed) | 72 vertebrates | Multi-species 510 nt |
| **[SpliceBERT-human-510nt](https://huggingface.co/Taykhoom/SpliceBERT-human-510nt)** | 510 nt (fixed) | Human only | This model |

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/SpliceBERT-human-510nt", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/SpliceBERT-human-510nt", trust_remote_code=True)
model.eval()

# Sequence must be exactly 510 nt; tokenizer handles U->T automatically
seq = ("ATCGATCG" * 64)[:510]  # exactly 510 nt
enc = tokenizer(seq, return_tensors="pt")

with torch.no_grad():
    out = model(**enc, output_hidden_states=True)

hidden = out.last_hidden_state[0]  # (512, 512)
token_emb = hidden[1:-1]           # strip [CLS] and [SEP] -> (510, 512)
mean_emb = token_emb.mean(dim=0)   # (512,)
```

### Fine-tuning

Standard HF conventions. For splice site prediction, token-level classification using all 510 token positions (excluding special tokens) is the typical setup.

## Implementation Notes

The original checkpoint was saved as `BertForMaskedLM` with `transformers==4.18.0`. This port uses [BERT-updated](https://huggingface.co/Taykhoom/BERT-updated), which adds `attn_implementation="sdpa"` and `attn_implementation="flash_attention_2"` support not present in the original codebase.
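
Because the model expects exactly 510 nt of input, it can help to normalize and validate sequences before passing them to the tokenizer. The helper below is a hypothetical sketch: the name `prepare_510nt` and the decision to raise on invalid input are assumptions for illustration, not part of this repository. It mirrors the U-to-T conversion and the `N`/`A`/`C`/`G`/`T` alphabet listed in the vocabulary.

```python
EXPECTED_LENGTH = 510    # fixed training length stated in this model card
ALPHABET = set("NACGT")  # nucleotide tokens in the vocabulary (U is mapped to T)


def prepare_510nt(seq: str) -> str:
    """Uppercase, map U to T, then check alphabet and length (hypothetical helper)."""
    cleaned = seq.strip().upper().replace("U", "T")
    unexpected = set(cleaned) - ALPHABET
    if unexpected:
        # Assumption: reject unknown symbols rather than letting them map to [UNK].
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    if len(cleaned) != EXPECTED_LENGTH:
        raise ValueError(f"expected {EXPECTED_LENGTH} nt, got {len(cleaned)}")
    return cleaned
```

The returned string can be passed to the tokenizer in place of `seq` in the usage example. Raising early is one design choice; a pipeline that prefers to keep ambiguous bases could instead replace unknown symbols with `N`.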

## Citation

```bibtex
@article{chen2024_splicebert,
  title   = {Self-supervised learning on millions of primary {RNA} sequences from 72 vertebrates improves sequence-based {RNA} splicing prediction},
  author  = {Chen, Ken and Zhou, Yue and Ding, Maolin and Wang, Yu and Ren, Zhixiang and Yang, Yuedong},
  journal = {Briefings in Bioinformatics},
  volume  = {25},
  number  = {3},
  pages   = {bbae163},
  year    = {2024},
  doi     = {10.1093/bib/bbae163}
}
```

## Credits

Original model and code by Chen et al. Source: [GitHub](https://github.com/biomed-AI/SpliceBERT). The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code) and reviewed manually by Taykhoom Dalal.

## License

MIT, following the original repository.