---
language:
- rna
library_name: transformers
tags:
- RNA
- language-model
- 3-UTR
license: mit
---
# UTRBERT-5mer
A BERT-base language model pre-trained on human 3' UTR sequences using 5-mer tokenization.
Part of the 3UTRBERT model family introduced in Yang et al. (2024).
## Architecture
| Parameter | Value |
|---|---|
| Layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Intermediate size | 3072 |
| Vocabulary size | 1029 (5 special tokens + RNA 5-mers) |
| Positional encoding | Learned absolute (BERT-style) |
| Architecture | BERT-base |
| Max sequence length | 512 tokens including [CLS] and [SEP] (510 5-mers, i.e. up to 514 nucleotides) |
**Tokenization:** raw RNA (or DNA) sequences are normalized by replacing T with U, then
split into overlapping 5-mers (stride 1), so a sequence of length L yields L-4 tokens.
The tokenizer then prepends a [CLS] token and appends a [SEP] token.
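
The tokenization rule described above can be sketched in plain Python. This is an illustrative sketch only, not the tokenizer shipped with the model: the function name `kmer_tokenize` is hypothetical, and upper-casing the input is an assumption made here for robustness.

```python
def kmer_tokenize(sequence: str, k: int = 5) -> list[str]:
    """Split a nucleotide sequence into overlapping k-mers (stride 1).

    Hypothetical helper for illustration; not the model's actual tokenizer.
    """
    # Normalize to the RNA alphabet. Upper-casing is an assumption of this sketch.
    rna = sequence.upper().replace("T", "U")
    # A sequence of length L yields L - k + 1 k-mers (none if L < k).
    kmers = [rna[i:i + k] for i in range(len(rna) - k + 1)]
    # Wrap the k-mers in the special tokens named in the text.
    return ["[CLS]"] + kmers + ["[SEP]"]


tokens = kmer_tokenize("ATGCATGCAT")
print(tokens)
# ['[CLS]', 'AUGCA', 'UGCAU', 'GCAUG', 'CAUGC', 'AUGCA', 'UGCAU', '[SEP]']
```

A 10-nucleotide input gives 10 - 4 = 6 k-mers, plus the two special tokens.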
## Pretraining
- **Objective:** Masked Language Modeling (MLM) on 5-mer tokens
- **Data:** Human 3' UTR sequences
- **Source checkpoint:** `5-new-12w-0/pytorch_model.bin` from figshare article 22851191
### Checkpoint selection
The only publicly released pre-trained checkpoint for the 5-mer variant is `5-new-12w-0`.
## Parity Verification
Hidden-state representations were verified to be identical (max abs diff = 0.00) to those of
the original BertForMaskedLM implementation at all 13 representation levels (embedding output
+ 12 transformer layers). Verification was run on GPU with PyTorch 2.7 / CUDA 12.6.
The SDPA attention path was also verified against eager attention (max diff < 2e-5).
## Related Models
See the full [UTRBERT collection](https://huggingface.co/collections/Taykhoom/utrbert-PLACEHOLDER).
| Model | k-mer | Vocab size | Notes |
|---|---|---|---|
| [UTRBERT-3mer](https://huggingface.co/Taykhoom/UTRBERT-3mer) | 3 | 69 | |
| [UTRBERT-4mer](https://huggingface.co/Taykhoom/UTRBERT-4mer) | 4 | 261 | |
| **[UTRBERT-5mer](https://huggingface.co/Taykhoom/UTRBERT-5mer)** | 5 | 1029 | This model |
| [UTRBERT-6mer](https://huggingface.co/Taykhoom/UTRBERT-6mer) | 6 | 4101 | |
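
The vocabulary sizes in the table are consistent with 4^k possible k-mers over the alphabet A, C, G, U plus 5 special tokens. The sketch below checks that arithmetic. It assumes every variant uses the same number of special tokens as the 5-mer model, which matches the listed sizes but is not stated explicitly here; `kmer_vocab_size` is a hypothetical helper name.

```python
from itertools import product

# Assumption: all variants share the 5 special tokens listed for the 5-mer model.
NUM_SPECIAL_TOKENS = 5


def kmer_vocab_size(k: int, alphabet: str = "ACGU") -> int:
    """Count special tokens plus every possible k-mer over the alphabet."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    return NUM_SPECIAL_TOKENS + len(kmers)


for k in (3, 4, 5, 6):
    print(k, kmer_vocab_size(k))
# 3 69
# 4 261
# 5 1029
# 6 4101
```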
## Usage
### Embedding generation
```python
import torch
from transformers import AutoTokenizer, AutoModel

# trust_remote_code is needed only for the custom k-mer tokenizer
tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTRBERT-5mer", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/UTRBERT-5mer")
model.eval()

sequences = ["AUGCAUGCAUGCAUGCAUGC", "GCGCGCGCGCGCGCGCGCGC"]
enc = tokenizer(sequences, return_tensors="pt", padding=True, truncation=True, max_length=512)

with torch.no_grad():
    out = model(**enc)
    # Intermediate layers
    out_all = model(**enc, output_hidden_states=True)

cls_emb = out.last_hidden_state[:, 0, :]  # (batch, 768) -- CLS token
token_emb = out.last_hidden_state         # (batch, seq_len, 768)
# hidden_states[0] is the embedding output; index 6 is the output of layer 6
layer6_emb = out_all.hidden_states[6]     # (batch, seq_len, 768)
```
### Fine-tuning
Standard HF conventions apply. For sequence-level tasks, use the CLS token embedding
as input to a classification or regression head.
```python
from transformers import BertForSequenceClassification

# The classification head is newly initialized and must be trained
model = BertForSequenceClassification.from_pretrained(
    "Taykhoom/UTRBERT-5mer",
    num_labels=2,
)
```
## Implementation Notes
This is a minimal HF port using standard `BertModel` with no custom modeling code.
The original checkpoint (`BertForMaskedLM`) was converted by stripping the `bert.`
prefix and dropping the `cls.*` MLM head. `trust_remote_code=True` is required only
for the tokenizer (k-mer splitting), not for the model.
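
The key renaming described above can be sketched as follows. This is an illustration under stated assumptions, not the actual conversion script: `convert_state_dict` is a hypothetical name, and the example uses placeholder strings in place of real tensors so that it runs without PyTorch.

```python
def convert_state_dict(mlm_state_dict: dict) -> dict:
    """Map BertForMaskedLM-style keys to BertModel-style keys.

    Hypothetical helper: drops the `cls.*` MLM head and strips the `bert.` prefix.
    """
    converted = {}
    for name, tensor in mlm_state_dict.items():
        if name.startswith("cls."):
            continue  # MLM head weights are not part of BertModel
        if name.startswith("bert."):
            name = name[len("bert."):]
        converted[name] = tensor
    return converted


# Placeholder values stand in for real tensors in this sketch.
example = {
    "bert.embeddings.word_embeddings.weight": "tensor-0",
    "bert.encoder.layer.0.attention.self.query.weight": "tensor-1",
    "cls.predictions.bias": "tensor-2",
}
print(convert_state_dict(example))
# {'embeddings.word_embeddings.weight': 'tensor-0',
#  'encoder.layer.0.attention.self.query.weight': 'tensor-1'}
```

In practice the input would be the dictionary loaded from the original `pytorch_model.bin`, and the result would be passed to a `BertModel` via `load_state_dict`.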
## Citation
```bibtex
@article{yang2024_utrbert,
title = {Deciphering 3'{UTR} Mediated Gene Regulation Using Interpretable Deep Representation Learning},
author = {Yang, Yuning and Li, Gen and Pang, Kuan and Cao, Wuxinhao and Zhang, Zhaolei and Li, Xiangtao},
journal = {Advanced Science},
volume = {11},
number = {39},
pages = {e2407013},
year = {2024},
doi = {10.1002/advs.202407013}
}
```
## Credits
Original model and code by Yang et al. Source: [GitHub](https://github.com/yangyn533/3UTRBERT).
The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
and reviewed manually by Taykhoom Dalal.
## License
MIT, following the original repository.