	---
	language:
	- rna
	library_name: transformers
	tags:
	- RNA
	- language-model
	- 3-UTR
	license: mit
	---

	# UTRBERT-5mer

	A BERT-base language model pre-trained on human 3' UTR sequences using 5-mer tokenization.
	Part of the 3UTRBERT model family introduced in Yang et al. (2024).

	## Architecture

	| Parameter | Value |
	|---|---|
	| Layers | 12 |
	| Attention heads | 12 |
	| Embedding dimension | 768 |
	| Intermediate size | 3072 |
	| Vocabulary size | 1029 (5 special tokens + 1024 RNA 5-mers) |
	| Positional encoding | Learned absolute (BERT-style) |
	| Architecture | BERT-base |
	| Max sequence length | 512 tokens including [CLS] and [SEP] (510 5-mers, ~514 nucleotides) |

	Tokenization: raw RNA (or DNA) sequences are converted T->U, then split into
	overlapping 5-mers (stride 1), so a sequence of length L produces L-4 k-mer tokens.
	The tokenizer then prepends a [CLS] token and appends a [SEP] token.
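
The tokenization rule above can be sketched in plain Python. This is an illustrative sketch only, not the tokenizer shipped with this repository: the function name `kmer_tokenize` is hypothetical, and the sketch does no case normalization or handling of non-ACGU characters.

```python
def kmer_tokenize(seq: str, k: int = 5) -> list[str]:
    """Split a sequence into overlapping k-mers (stride 1) after T->U conversion.

    Hypothetical helper for illustration; the real tokenizer is loaded with
    trust_remote_code=True and may differ in its details.
    """
    rna = seq.replace("T", "U")
    # A sequence of length L yields L - k + 1 k-mers (L - 4 for k = 5).
    kmers = [rna[i:i + k] for i in range(len(rna) - k + 1)]
    # Special tokens wrap the k-mers, mirroring what the card describes.
    return ["[CLS]", *kmers, "[SEP]"]


tokens = kmer_tokenize("ATGCATGCAT")  # 10 nucleotides -> 6 k-mers + 2 special tokens
print(tokens)
```

A DNA input such as `"ATGCATGCAT"` is first rewritten to `"AUGCAUGCAU"`, so the same sketch covers both RNA and DNA inputs.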

	## Pretraining

	- Objective: Masked Language Modeling (MLM) on 5-mer tokens
	- Data: Human 3' UTR sequences
	- Source checkpoint: `5-new-12w-0/pytorch_model.bin` from figshare article 22851191

	### Checkpoint selection

	The only publicly released pre-trained checkpoint for the 5-mer variant is `5-new-12w-0`.

	## Parity Verification

	Hidden-state representations were verified to be identical (max abs diff = 0.00) to those
	of the original BertForMaskedLM implementation at all 13 representation levels (embedding
	output + 12 transformer layers). Verification was run on GPU with PyTorch 2.7 / CUDA 12.6.
	The SDPA attention path was also verified against eager attention (max abs diff < 2e-5).
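
The check described above reduces to a per-layer maximum absolute difference between two sets of hidden states. Below is a minimal sketch of that comparison in plain Python, not the verification script actually used for this model. The names `max_abs_diff` and `layerwise_parity` are hypothetical, and each layer is assumed to have been flattened to a list of floats (for a PyTorch tensor, for example via `.flatten().tolist()`).

```python
from typing import Sequence


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two flat vectors."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same number of elements")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def layerwise_parity(ref_layers, new_layers) -> list[float]:
    """Return one max-abs-diff value per representation level."""
    if len(ref_layers) != len(new_layers):
        raise ValueError("both models must expose the same number of layers")
    return [max_abs_diff(r, n) for r, n in zip(ref_layers, new_layers)]


# Toy data standing in for two models' flattened hidden states (2 levels).
ref = [[0.1, -0.2, 0.3], [1.0, 2.0, 3.0]]
new = [[0.1, -0.2, 0.3], [1.0, 2.0, 3.5]]
diffs = layerwise_parity(ref, new)
print(diffs)  # [0.0, 0.5]
```

For this model, the list would have 13 entries (embedding output plus 12 layers), and exact parity corresponds to every entry being 0.0.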

	## Related Models

	See the full [UTRBERT collection](https://huggingface.co/collections/Taykhoom/utrbert-PLACEHOLDER).

	| Model | k-mer | Vocab size | Notes |
	|---|---|---|---|
	| [UTRBERT-3mer](https://huggingface.co/Taykhoom/UTRBERT-3mer) | 3 | 69 | |
	| [UTRBERT-4mer](https://huggingface.co/Taykhoom/UTRBERT-4mer) | 4 | 261 | |
	| [UTRBERT-5mer](https://huggingface.co/Taykhoom/UTRBERT-5mer) | 5 | 1029 | This model |
	| [UTRBERT-6mer](https://huggingface.co/Taykhoom/UTRBERT-6mer) | 6 | 4101 | |
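
The vocabulary sizes in the table are consistent with counting every possible k-mer over the four RNA bases and adding the 5 special tokens. A short sketch of that arithmetic follows; the function name `vocab_size` is hypothetical and used only for illustration.

```python
def vocab_size(k: int, num_special: int = 5, alphabet: str = "ACGU") -> int:
    """Number of tokens: all k-mers over the alphabet plus the special tokens."""
    return len(alphabet) ** k + num_special


for k in (3, 4, 5, 6):
    print(k, vocab_size(k))  # 69, 261, 1029, 4101
```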

	## Usage

	### Embedding generation

	```python
	import torch
	from transformers import AutoTokenizer, AutoModel

	tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTRBERT-5mer", trust_remote_code=True)
	model = AutoModel.from_pretrained("Taykhoom/UTRBERT-5mer")
	model.eval()

	sequences = ["AUGCAUGCAUGCAUGCAUGC", "GCGCGCGCGCGCGCGCGCGC"]
	enc = tokenizer(sequences, return_tensors="pt", padding=True, truncation=True, max_length=512)

	with torch.no_grad():
	    out = model(**enc)

	cls_emb = out.last_hidden_state[:, 0, :] # (batch, 768) -- CLS token
	token_emb = out.last_hidden_state # (batch, seq_len, 768)

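	# Intermediate layers: hidden_states[0] is the embedding output and
	# hidden_states[i] is the output of transformer layer i (1..12).
	with torch.no_grad():
	    out_all = model(**enc, output_hidden_states=True)
	layer6_emb = out_all.hidden_states[6] # (batch, seq_len, 768)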
	```

	### Fine-tuning

	Standard HF conventions apply. For sequence-level tasks, use the CLS token embedding
	as input to a classification or regression head.

	```python
	from transformers import BertForSequenceClassification

	model = BertForSequenceClassification.from_pretrained(
	    "Taykhoom/UTRBERT-5mer",
	    num_labels=2,
	)
	```

	## Implementation Notes

	This is a minimal HF port using standard `BertModel` with no custom modeling code.
	The original checkpoint (`BertForMaskedLM`) was converted by stripping the `bert.`
	prefix and dropping the `cls.*` MLM head. `trust_remote_code=True` is required only
	for the tokenizer (k-mer splitting), not for the model.
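
The key renaming described above can be sketched as follows. This is an illustrative reconstruction, not the conversion script itself: the function name `convert_state_dict` is hypothetical, and the example uses placeholder strings instead of real tensors. With a real checkpoint, the input dict would presumably come from loading `pytorch_model.bin` with `torch.load`.

```python
def convert_state_dict(state_dict: dict) -> dict:
    """Map BertForMaskedLM-style keys to BertModel-style keys.

    Drops the MLM head (`cls.*`) and strips the leading `bert.` prefix.
    """
    converted = {}
    for key, value in state_dict.items():
        if key.startswith("cls."):
            continue  # MLM head is not part of the plain encoder
        if key.startswith("bert."):
            key = key[len("bert."):]
        converted[key] = value
    return converted


# Placeholder values stand in for weight tensors.
example = {
    "bert.embeddings.word_embeddings.weight": "W_emb",
    "bert.encoder.layer.0.attention.self.query.weight": "W_q",
    "cls.predictions.bias": "b_mlm",
}
converted = convert_state_dict(example)
print(sorted(converted))
```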

	## Citation

	```bibtex
	@article{yang2024_utrbert,
	  title = {Deciphering 3'{UTR} Mediated Gene Regulation Using Interpretable Deep Representation Learning},
	  author = {Yang, Yuning and Li, Gen and Pang, Kuan and Cao, Wuxinhao and Zhang, Zhaolei and Li, Xiangtao},
	  journal = {Advanced Science},
	  volume = {11},
	  number = {39},
	  pages = {e2407013},
	  year = {2024},
	  doi = {10.1002/advs.202407013}
	}
	```

	## Credits

	Original model and code by Yang et al. Source: [GitHub](https://github.com/yangyn533/3UTRBERT).
	The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
	and reviewed manually by Taykhoom Dalal.

	## License

	MIT, following the original repository.