	---
	language:
	- rna
	library_name: transformers
	tags:
	- RNA
	- language-model
	- splicing
	license: mit
	---

	# SpliceBERT-1024nt

	SpliceBERT is a BERT-based RNA language model pre-trained on over 2 million vertebrate
	primary RNA sequences using a masked language modeling (MLM) objective. The 1024nt
	variant is trained on variable-length fragments (64-1024 nt) from 72 vertebrates.

	## Architecture

	| Parameter | Value |
	|---|---|
	| Layers | 6 |
	| Attention heads | 16 |
	| Embedding dimension | 512 |
	| Intermediate dimension | 2048 |
	| Vocabulary size | 10 |
	| Positional encoding | Learned absolute |
	| Architecture | BERT encoder |
	| Max sequence length | 1024 |
	| Parameters | ~19.4M |
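
As a sanity check on the table, the encoder's weight count can be estimated from the listed dimensions alone. The sketch below assumes a standard BERT encoder layout (a bias on every linear layer, two LayerNorms per layer, a 2-entry token-type embedding) and leaves out the pooler and the MLM head. The function name `bert_encoder_params` is illustrative and not part of any library.

```python
def bert_encoder_params(layers=6, hidden=512, intermediate=2048,
                        vocab=10, max_positions=1024, type_vocab=2):
    """Estimate the parameters of a standard BERT encoder (no pooler, no MLM head)."""
    layer_norm = 2 * hidden  # scale + bias
    # Word, position and token-type embeddings, followed by one LayerNorm.
    embeddings = (vocab + max_positions + type_vocab) * hidden + layer_norm
    # Q, K, V and output projections; the head count does not change the total.
    attention = 4 * (hidden * hidden + hidden)
    feed_forward = (hidden * intermediate + intermediate) \
        + (intermediate * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    return embeddings + layers * per_layer


print(f"{bert_encoder_params() / 1e6:.1f}M parameters")
```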

	Vocabulary: `[PAD]`=0, `[UNK]`=1, `[CLS]`=2, `[SEP]`=3, `[MASK]`=4, `N`=5, `A`=6, `C`=7, `G`=8, `T/U`=9
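
The mapping above can be illustrated without loading the tokenizer. This is a minimal sketch, not the tokenizer's actual implementation: the helper name `encode` is hypothetical, and both uppercasing the input and sending unrecognized characters to `[UNK]` are assumptions.

```python
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
         "N": 5, "A": 6, "C": 7, "G": 8, "T": 9}


def encode(seq):
    """Map a raw RNA/DNA string to token IDs, wrapped in [CLS] ... [SEP]."""
    seq = seq.upper().replace("U", "T")  # U and T share ID 9
    # Assumption: any character outside N/A/C/G/T becomes [UNK].
    ids = [VOCAB.get(base, VOCAB["[UNK]"]) for base in seq]
    return [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]


print(encode("ACGU"))  # [2, 6, 7, 8, 9, 3]
```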

	## Pretraining

	- Objective: Masked language modeling (MLM)
	- Data: >2 million vertebrate primary RNA sequences from 72 species
	- Sequence format: Single-nucleotide tokenization with spaces; U converted to T
	- Source checkpoint: `SpliceBERT.1024nt/pytorch_model.bin` (from [zenodo:7995778](https://doi.org/10.5281/zenodo.7995778))

	### Checkpoint selection

	The 1024nt variant is the primary SpliceBERT model, trained on variable-length vertebrate
	sequences; use it for general-purpose RNA embedding. The 510nt variants are
	trained on fixed-length fragments and require inputs of exactly 510 nt.
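
The 1024nt variant accepts variable-length inputs only up to its context limit, so longer sequences have to be split before embedding. This card does not prescribe a strategy; the sketch below shows one common option, overlapping fixed-size windows. The helper `split_windows` and the 256 nt overlap are illustrative assumptions, as is treating the 1024 limit as a nucleotide count (check the tokenizer's `model_max_length` to see whether `[CLS]` and `[SEP]` count against it).

```python
def split_windows(seq, window=1024, overlap=256):
    """Split seq into windows of at most `window` nt that overlap by `overlap` nt."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if len(seq) <= window:
        return [seq]
    step = window - overlap
    windows = []
    for start in range(0, len(seq), step):
        windows.append(seq[start:start + window])
        if start + window >= len(seq):  # this window already reaches the end
            break
    return windows


chunks = split_windows("ACGT" * 600)  # 2400 nt input
print([len(c) for c in chunks])
```

Per-window embeddings can then be averaged or concatenated, depending on the downstream task.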

	## Parity Verification

	Hidden-state representations match those of the original checkpoint (maximum absolute
	difference < 1e-5) at all 7 representation levels (embedding output + 6 transformer
	layers), for both the `eager` and `sdpa` attention backends.
	The comparison was run on GPU with PyTorch 2.7 / CUDA 11.8.
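
A check of this kind amounts to comparing per-level hidden states from the two implementations and taking the largest element-wise deviation. The sketch below works on nested Python lists; `max_abs_diff` and `check_parity` are hypothetical helpers, not the script used for the actual verification, and the toy values are made up. With real tensors the same quantity is `(a - b).abs().max()`.

```python
def max_abs_diff(a, b):
    """Largest element-wise |a - b| over two equally shaped nested lists."""
    if isinstance(a, (int, float)):
        return abs(a - b)
    if len(a) != len(b):
        raise ValueError("shape mismatch")
    return max((max_abs_diff(x, y) for x, y in zip(a, b)), default=0.0)


def check_parity(reference, ported, tol=1e-5):
    """Compare hidden states level by level and return the per-level max diffs."""
    if len(reference) != len(ported):
        raise ValueError("different number of representation levels")
    diffs = [max_abs_diff(r, p) for r, p in zip(reference, ported)]
    assert all(d < tol for d in diffs), f"parity failed: {diffs}"
    return diffs


# Toy data: 2 levels, each a (2 tokens x 3 dims) matrix.
ref = [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], [[1.0, 1.1, 1.2], [1.3, 1.4, 1.5]]]
new = [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], [[1.0, 1.1, 1.2], [1.3, 1.4, 1.500001]]]
print(check_parity(ref, new))
```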

	## Related Models

	See the full [SpliceBERT collection](https://huggingface.co/collections/Taykhoom/splicebert-6a20b72e9bec05b79ce009aa).

	| Model | Context | Training data | Notes |
	|---|---|---|---|
	| [SpliceBERT-1024nt](https://huggingface.co/Taykhoom/SpliceBERT-1024nt) | 1024 nt | 72 vertebrates | This model |
	| [SpliceBERT-510nt](https://huggingface.co/Taykhoom/SpliceBERT-510nt) | 510 nt (fixed) | 72 vertebrates | Fixed-length; requires exact 510 nt input |
	| [SpliceBERT-human-510nt](https://huggingface.co/Taykhoom/SpliceBERT-human-510nt) | 510 nt (fixed) | Human only | Human-specific; requires exact 510 nt input |

	## Usage

	### Embedding generation

	The tokenizer automatically handles U->T conversion and single-nucleotide spacing.
	Pass raw sequences directly.

	```python
	import torch
	from transformers import AutoTokenizer, AutoModel

	tokenizer = AutoTokenizer.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
	model = AutoModel.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
	model.eval()

	seq = "ACGUACGUACGUACGU" # U->T handled automatically
	enc = tokenizer(seq, return_tensors="pt")

	with torch.no_grad():
	    out = model(**enc, output_hidden_states=True)

	# Mean pooling over non-special tokens
	hidden = out.last_hidden_state[0] # (seq_len+2, 512)
	token_emb = hidden[1:-1] # strip [CLS] and [SEP]
	mean_emb = token_emb.mean(dim=0) # (512,)

	# Intermediate layers
	layer3_emb = out.hidden_states[3] # (1, seq_len+2, 512)
	```

	### MLM logits

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForMaskedLM

	tokenizer = AutoTokenizer.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
	model = AutoModelForMaskedLM.from_pretrained("Taykhoom/SpliceBERT-1024nt", trust_remote_code=True)
	model.eval()

	seq = "A C G [MASK] A C G T"
	enc = tokenizer(seq, return_tensors="pt")
	with torch.no_grad():
	    logits = model(**enc).logits # (1, seq_len, 10)
	```
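
To turn the logits at a masked position into nucleotide probabilities, one option is a softmax restricted to the four base tokens (IDs 6 to 9 in the vocabulary listed under Architecture), so that special tokens and `N` receive no probability mass. The sketch below operates on a plain list of 10 logits; the helper name `base_probs` and the example logits are invented for illustration. With the model output, the row would come from something like `logits[0, pos].tolist()`, where `pos` is the index of `[MASK]` in `enc["input_ids"]`.

```python
import math

BASE_IDS = {"A": 6, "C": 7, "G": 8, "T": 9}


def base_probs(logit_row):
    """Softmax over the A/C/G/T entries of one 10-way logit vector."""
    scores = {base: logit_row[i] for base, i in BASE_IDS.items()}
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {base: math.exp(s - peak) for base, s in scores.items()}
    total = sum(exps.values())
    return {base: e / total for base, e in exps.items()}


# Made-up logits for a single position (list index = token ID).
row = [-9.0, -9.0, -9.0, -9.0, -9.0, -2.0, 0.5, 0.1, 0.3, 2.0]
probs = base_probs(row)
print(max(probs, key=probs.get))  # T
```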

	### Fine-tuning

	Fine-tuning follows standard Hugging Face conventions. For sequence-level tasks,
	mean-pool the hidden states of the non-special tokens (`hidden[1:-1]` for a single
	unpadded sequence) and feed the result to a prediction head.
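
For padded batches, dropping only the first and last position is not enough, because shorter rows end in `[PAD]` tokens. One way to handle this is to build the pooling mask from the token IDs, treating IDs 0 to 4 as special per the vocabulary listed under Architecture. The following is a pure-Python sketch of that logic with made-up 2-dimensional hidden states; `mean_pool` is a hypothetical helper, excluding `[UNK]` and `[MASK]` along with the other special tokens is a design choice, and a real fine-tuning setup would do the same with tensor operations before a linear head.

```python
NUM_SPECIAL = 5  # [PAD], [UNK], [CLS], [SEP], [MASK] occupy IDs 0-4


def mean_pool(input_ids, hidden):
    """Average the hidden vectors of nucleotide tokens (ID >= 5) in one sequence."""
    kept = [vec for tok, vec in zip(input_ids, hidden) if tok >= NUM_SPECIAL]
    if not kept:
        raise ValueError("sequence contains no nucleotide tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# [CLS] A C [SEP] [PAD] with toy 2-d hidden states.
ids = [2, 6, 7, 3, 0]
states = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]]
print(mean_pool(ids, states))  # [2.0, 3.0]
```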

	## Implementation Notes

	The original checkpoint was saved as `BertForMaskedLM` with `transformers==4.24.0`.
	This port uses [BERT-updated](https://huggingface.co/Taykhoom/BERT-updated), which
	adds `attn_implementation="sdpa"` and `attn_implementation="flash_attention_2"` support
	not present in the original codebase.

	```python
	from transformers import AutoModel

	model = AutoModel.from_pretrained("Taykhoom/SpliceBERT-1024nt",
	                                  trust_remote_code=True,
	                                  attn_implementation="sdpa")
	```

	## Citation

	```bibtex
	@article{chen2024_splicebert,
	  title = {Self-supervised learning on millions of primary {RNA} sequences from 72 vertebrates improves sequence-based {RNA} splicing prediction},
	  author = {Chen, Ken and Zhou, Yue and Ding, Maolin and Wang, Yu and Ren, Zhixiang and Yang, Yuedong},
	  journal = {Briefings in Bioinformatics},
	  volume = {25},
	  number = {3},
	  pages = {bbae163},
	  year = {2024},
	  doi = {10.1093/bib/bbae163}
	}
	```

	## Credits

	Original model and code by Chen et al. Source: [GitHub](https://github.com/biomed-AI/SpliceBERT).
	The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
	and reviewed manually by Taykhoom Dalal.

	## License

	MIT, following the original repository.