---
language:
- rna
library_name: transformers
tags:
- RNA
- language-model
- 3-UTR
license: mit
---

# UTRBERT-4mer

A BERT-base language model pre-trained on human 3' UTR sequences using 4-mer tokenization.
Part of the 3UTRBERT model family introduced in Yang et al. (2024).

## Architecture

| Parameter | Value |
|---|---|
| Layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Intermediate size | 3072 |
| Vocabulary size | 261 (5 special tokens + RNA 4-mers) |
| Positional encoding | Learned absolute (BERT-style) |
| Architecture | BERT-base |
| Max sequence length | 512 tokens including [CLS] and [SEP] (510 4-mers, about 513 nucleotides) |

**Tokenization:** raw RNA (or DNA) sequences first have T converted to U and are then
split into overlapping 4-mers (stride 1), so a sequence of length L yields L-3 4-mer tokens.
The tokenizer then prepends a [CLS] token and appends a [SEP] token.
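
The splitting rule above can be sketched in plain Python. This is a minimal illustration only, not the tokenizer shipped with the model: the function name `to_kmers` is hypothetical, and special tokens and vocabulary lookup are left out.

```python
def to_kmers(sequence: str, k: int = 4) -> list[str]:
    """Convert T->U, then split into overlapping k-mers with stride 1.

    Hypothetical helper for illustration; the released tokenizer may
    handle case, unknown characters, and short inputs differently.
    """
    rna = sequence.replace("T", "U")
    # A sequence of length L yields L - k + 1 k-mers (L - 3 for k = 4).
    return [rna[i:i + k] for i in range(len(rna) - k + 1)]


kmers = to_kmers("ATGCATGC")
print(kmers)  # ['AUGC', 'UGCA', 'GCAU', 'CAUG', 'AUGC']
```

An 8-nucleotide input gives 8 - 3 = 5 tokens, matching the L-3 rule; an input shorter than 4 nucleotides gives an empty list in this sketch.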

## Pretraining

- **Objective:** Masked Language Modeling (MLM) on 4-mer tokens
- **Data:** Human 3' UTR sequences
- **Source checkpoint:** `4-new-12w-0/pytorch_model.bin` from figshare article 22851119

### Checkpoint selection

The only publicly released pre-trained checkpoint for the 4-mer variant is `4-new-12w-0`.

## Parity Verification

Hidden-state representations were verified to be identical (max abs diff = 0.00) to those of
the original BertForMaskedLM implementation at all 13 representation levels (embedding output
+ 12 transformer layers). Verification was run on GPU with PyTorch 2.7 / CUDA 12.6.
The SDPA attention path was also verified against eager attention (max diff < 2e-5).
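
A per-layer comparison of this kind could look roughly like the sketch below. It is an assumption about how such a check might be written, not the script used for the verification above: `max_abs_diff` is a hypothetical helper, and random tensors stand in for the hidden states of the original and ported models.

```python
import torch


def max_abs_diff(states_a, states_b):
    """Largest element-wise absolute difference for each pair of tensors."""
    assert len(states_a) == len(states_b), "expected the same number of levels"
    return [(a - b).abs().max().item() for a, b in zip(states_a, states_b)]


# Toy stand-ins for 13 hidden-state tensors (embedding output + 12 layers).
torch.manual_seed(0)
reference = [torch.randn(2, 20, 768) for _ in range(13)]
ported = [t.clone() for t in reference]

diffs = max_abs_diff(reference, ported)
print(max(diffs))  # 0.0 for identical tensors
```

In a real check, both lists would come from `output_hidden_states=True` on the same tokenized batch.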

## Related Models

See the full [UTRBERT collection](https://huggingface.co/collections/Taykhoom/utrbert-PLACEHOLDER).

| Model | k-mer | Vocab size | Notes |
|---|---|---|---|
| [UTRBERT-3mer](https://huggingface.co/Taykhoom/UTRBERT-3mer) | 3 | 69 | |
| **[UTRBERT-4mer](https://huggingface.co/Taykhoom/UTRBERT-4mer)** | 4 | 261 | This model |
| [UTRBERT-5mer](https://huggingface.co/Taykhoom/UTRBERT-5mer) | 5 | 1029 | |
| [UTRBERT-6mer](https://huggingface.co/Taykhoom/UTRBERT-6mer) | 6 | 4101 | |
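
The vocabulary sizes in the table are consistent with 5 special tokens plus every possible k-mer over the four RNA bases. A small sketch of that arithmetic follows; the 5-token count is stated above only for the 4-mer model, and applying it to the other variants is an inference from the table.

```python
NUM_SPECIAL_TOKENS = 5  # stated for the 4-mer model; assumed for the others


def vocab_size(k: int, alphabet_size: int = 4) -> int:
    """Special tokens plus all alphabet_size**k possible k-mers."""
    return NUM_SPECIAL_TOKENS + alphabet_size ** k


sizes = {k: vocab_size(k) for k in (3, 4, 5, 6)}
print(sizes)  # {3: 69, 4: 261, 5: 1029, 6: 4101}
```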

## Usage

### Embedding generation

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTRBERT-4mer", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/UTRBERT-4mer")
model.eval()

sequences = ["AUGCAUGCAUGCAUGCAUGC", "GCGCGCGCGCGCGCGCGCGC"]
enc = tokenizer(sequences, return_tensors="pt", padding=True, truncation=True, max_length=512)

with torch.no_grad():
    # One forward pass; also return the hidden states of every layer
    out = model(**enc, output_hidden_states=True)

cls_emb   = out.last_hidden_state[:, 0, :]   # (batch, 768) -- [CLS] token
token_emb = out.last_hidden_state             # (batch, seq_len, 768), includes padded positions

# Intermediate layers: hidden_states[0] is the embedding output, [1]..[12] the transformer layers
layer6_emb = out.hidden_states[6]             # (batch, seq_len, 768)
```

### Fine-tuning

Standard Hugging Face conventions apply. For sequence-level tasks, feed the [CLS] token
embedding into a classification or regression head. The head is newly initialized and
must be trained on the downstream task.

```python
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "Taykhoom/UTRBERT-4mer",
    num_labels=2,
)
```

## Implementation Notes

This is a minimal HF port using standard `BertModel` with no custom modeling code.
The original checkpoint (`BertForMaskedLM`) was converted by stripping the `bert.`
prefix and dropping the `cls.*` MLM head. `trust_remote_code=True` is required only
for the tokenizer (k-mer splitting), not for the model.
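
The key renaming described above can be sketched on a plain dictionary. This is a minimal illustration under stated assumptions, not the actual conversion script: `convert_state_dict` is a hypothetical name, and the string values stand in for weight tensors.

```python
def convert_state_dict(mlm_state_dict: dict) -> dict:
    """Drop the `cls.*` MLM head and strip the `bert.` prefix from other keys."""
    converted = {}
    for name, tensor in mlm_state_dict.items():
        if name.startswith("cls."):
            continue  # MLM head is not part of the bare encoder
        if name.startswith("bert."):
            name = name[len("bert."):]
        converted[name] = tensor
    return converted


# Placeholder values instead of real tensors, for illustration only.
example = {
    "bert.embeddings.word_embeddings.weight": "W_emb",
    "bert.encoder.layer.0.attention.self.query.weight": "W_q",
    "cls.predictions.bias": "b_mlm",
}
converted = convert_state_dict(example)
print(sorted(converted))
```

The resulting keys (`embeddings.*`, `encoder.*`) follow the naming that a bare `BertModel` expects.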

## Citation

```bibtex
@article{yang2024_utrbert,
  title   = {Deciphering 3'{UTR} Mediated Gene Regulation Using Interpretable Deep Representation Learning},
  author  = {Yang, Yuning and Li, Gen and Pang, Kuan and Cao, Wuxinhao and Zhang, Zhaolei and Li, Xiangtao},
  journal = {Advanced Science},
  volume  = {11},
  number  = {39},
  pages   = {e2407013},
  year    = {2024},
  doi     = {10.1002/advs.202407013}
}
```

## Credits

Original model and code by Yang et al. Source: [GitHub](https://github.com/yangyn533/3UTRBERT).
The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
and reviewed manually by Taykhoom Dalal.

## License

MIT, following the original repository.