language:
- rna
library_name: transformers
tags:
- RNA
- language-model
- 3-UTR
license: mit
UTRBERT-3mer
A BERT-base language model pre-trained on human 3' UTR sequences using 3-mer tokenization. Part of the 3UTRBERT model family introduced in Yang et al. (2024).
Architecture
| Parameter | Value |
|---|---|
| Layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Intermediate size | 3072 |
| Vocabulary size | 69 (5 special tokens + 64 RNA 3-mers) |
| Positional encoding | Learned absolute (BERT-style) |
| Architecture | BERT-base |
| Max sequence length | 512 tokens (~514 nucleotides for 3-mer) |
Tokenization: raw RNA (or DNA) sequences are converted T->U, then split into overlapping 3-mers (stride 1), so a sequence of length L produces L-2 tokens. The tokenizer then prepends a [CLS] token and appends a [SEP] token.
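The tokenization rule above can be sketched in plain Python. This is an illustration only, not the tokenizer shipped with the model: `kmer_tokenize` is a hypothetical helper name, and uppercasing the input is an assumption.

```python
def kmer_tokenize(sequence: str, k: int = 3) -> list[str]:
    """Split a nucleotide sequence into overlapping k-mers (stride 1).

    Hypothetical helper for illustration; the real tokenizer is loaded
    with AutoTokenizer and may differ in details such as case handling.
    """
    # Convert the DNA alphabet to RNA: T -> U (uppercasing is an assumption).
    rna = sequence.upper().replace("T", "U")
    # A sequence of length L yields L - k + 1 k-mers (L - 2 for k = 3).
    kmers = [rna[i:i + k] for i in range(len(rna) - k + 1)]
    # Special tokens are added around the k-mers.
    return ["[CLS]", *kmers, "[SEP]"]


tokens = kmer_tokenize("ATGCAT")
print(tokens)  # ['[CLS]', 'AUG', 'UGC', 'GCA', 'CAU', '[SEP]']
```

Under this rule, if the 512-token limit includes [CLS] and [SEP], 510 3-mers remain, which cover 512 nucleotides. If the limit counted only 3-mers, 512 of them would cover 514 nucleotides.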
Pretraining
- Objective: Masked Language Modeling (MLM) on 3-mer tokens
- Data: Human 3' UTR sequences
- Source checkpoint: 3-new-12w-0/pytorch_model.bin from figshare article 22847354
Checkpoint selection
The only publicly released pre-trained checkpoint for the 3-mer variant is 3-new-12w-0.
Parity Verification
Hidden-state representations were verified to be identical (max abs diff = 0.00) to the original BertForMaskedLM implementation at all 13 representation levels (embedding + 12 transformer layers). Verified on GPU with PyTorch 2.7 / CUDA 12.6. The SDPA attention path was also verified against eager attention (max abs diff < 2e-5).
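A check of this kind can be sketched as a per-level comparison of hidden states. The snippet below is a minimal illustration, not the verification script that was actually used: `max_abs_diff_per_layer` is a hypothetical helper, and random tensors stand in for the outputs of the two models.

```python
import torch


def max_abs_diff_per_layer(hidden_a, hidden_b):
    """Return the max absolute difference for each pair of hidden states.

    Both arguments are sequences of tensors with matching shapes, such as
    the `hidden_states` tuple returned with output_hidden_states=True.
    """
    if len(hidden_a) != len(hidden_b):
        raise ValueError("inputs have a different number of representation levels")
    return [(a - b).abs().max().item() for a, b in zip(hidden_a, hidden_b)]


# Toy stand-in for two models' outputs: 13 levels of shape (batch, seq, dim).
torch.manual_seed(0)
reference = [torch.randn(2, 8, 16) for _ in range(13)]
ported = [h.clone() for h in reference]

diffs = max_abs_diff_per_layer(reference, ported)
print(max(diffs))  # 0.0 for identical tensors
```

With real models, `reference` and `ported` would be the `hidden_states` of the original and converted checkpoints on the same input batch.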
Related Models
See the full UTRBERT collection.
| Model | k-mer | Vocab size | Notes |
|---|---|---|---|
| UTRBERT-3mer | 3 | 69 | This model |
| UTRBERT-4mer | 4 | 261 | |
| UTRBERT-5mer | 5 | 1029 | |
| UTRBERT-6mer | 6 | 4101 | |
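The vocabulary sizes in the table are consistent with 4^k possible RNA k-mers plus 5 special tokens. The count of 5 special tokens is stated only for the 3-mer model; applying it to the other variants is an assumption that happens to reproduce the listed sizes.

```python
NUM_SPECIAL_TOKENS = 5  # stated for the 3-mer model; assumed for the others
ALPHABET_SIZE = 4       # A, C, G, U


def vocab_size(k: int) -> int:
    """Vocabulary size for a k-mer tokenizer: all k-mers plus special tokens."""
    return ALPHABET_SIZE ** k + NUM_SPECIAL_TOKENS


for k in (3, 4, 5, 6):
    print(k, vocab_size(k))  # sizes: 69, 261, 1029, 4101
```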
Usage
Embedding generation
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTRBERT-3mer", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/UTRBERT-3mer")
model.eval()
sequences = ["AUGCAUGCAUGCAUGCAUGC", "GCGCGCGCGCGCGCGCGCGC"]
enc = tokenizer(sequences, return_tensors="pt", padding=True, truncation=True, max_length=512)
with torch.no_grad():
    out = model(**enc)
cls_emb = out.last_hidden_state[:, 0, :] # (batch, 768) -- CLS token
token_emb = out.last_hidden_state # (batch, seq_len, 768)
# Intermediate layers (index 0 is the embedding output, 1-12 are the transformer layers)
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer6_emb = out_all.hidden_states[6] # (batch, seq_len, 768)
Fine-tuning
Standard HF conventions apply. For sequence-level tasks, use the CLS token embedding as input to a classification or regression head.
from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained(
    "Taykhoom/UTRBERT-3mer",
    num_labels=2,
)
Implementation Notes
This is a minimal HF port using standard BertModel with no custom modeling code.
The original checkpoint (BertForMaskedLM) was converted by stripping the bert.
prefix and dropping the cls.* MLM head. trust_remote_code=True is required only
for the tokenizer (k-mer splitting), not for the model.
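The key renaming described above can be sketched on a plain dictionary. This is an illustration under stated assumptions, not the conversion script used for this port: `convert_state_dict_keys` is a hypothetical helper, and string placeholders stand in for weight tensors.

```python
def convert_state_dict_keys(state_dict: dict) -> dict:
    """Rename BertForMaskedLM keys to the layout expected by BertModel.

    Hypothetical helper for illustration; not the actual conversion script.
    """
    converted = {}
    for key, value in state_dict.items():
        if key.startswith("cls."):
            # MLM head weights are not part of the bare encoder.
            continue
        if key.startswith("bert."):
            # Drop the wrapper prefix so keys match BertModel's parameters.
            key = key[len("bert."):]
        converted[key] = value
    return converted


# Placeholder values stand in for weight tensors.
example = {
    "bert.embeddings.word_embeddings.weight": "tensor-0",
    "bert.encoder.layer.0.attention.self.query.weight": "tensor-1",
    "cls.predictions.bias": "tensor-2",
}
print(sorted(convert_state_dict_keys(example)))
```

In a real conversion, the input would be the dictionary loaded from the original checkpoint file and the result would be loaded into a `BertModel`.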
Citation
@article{yang2024_utrbert,
title = {Deciphering 3'{UTR} Mediated Gene Regulation Using Interpretable Deep Representation Learning},
author = {Yang, Yuning and Li, Gen and Pang, Kuan and Cao, Wuxinhao and Zhang, Zhaolei and Li, Xiangtao},
journal = {Advanced Science},
volume = {11},
number = {39},
pages = {e2407013},
year = {2024},
doi = {10.1002/advs.202407013}
}
Credits
Original model and code by Yang et al. Source: GitHub. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.
License
MIT, following the original repository.