language:
  - rna
library_name: transformers
tags:
  - RNA
  - language-model
  - 3-UTR
license: mit

UTRBERT-3mer

A BERT-base language model pre-trained on human 3' UTR sequences using 3-mer tokenization. It is part of the 3UTRBERT model family introduced by Yang et al. (2024).

Architecture

Parameter	Value
Layers	12
Attention heads	12
Embedding dimension	768
Intermediate size	3072
Vocabulary size	69 (5 special tokens + 64 RNA 3-mers)
Positional encoding	Learned absolute (BERT-style)
Architecture	BERT-base
Max sequence length	512 tokens including [CLS] and [SEP] (510 3-mers, i.e. 512 nucleotides)

Tokenization: raw RNA (or DNA) sequences are converted T->U and then split into overlapping 3-mers (stride 1), so a sequence of length L produces L-2 3-mer tokens. The tokenizer then prepends a [CLS] token and appends a [SEP] token.
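The tokenization rule above can be sketched in a few lines of plain Python. This is an illustration only, not the tokenizer shipped with the model: the function names `kmer_tokenize` and `with_special_tokens` are hypothetical, and upper-casing the input is an assumption that the text does not state.

```python
def kmer_tokenize(seq: str, k: int = 3) -> list[str]:
    """Split a nucleotide sequence into overlapping k-mers (stride 1)."""
    # Assumption: input is upper-cased before the T->U conversion.
    rna = seq.upper().replace("T", "U")
    # A sequence of length L yields L - k + 1 k-mers (L - 2 for k = 3).
    return [rna[i:i + k] for i in range(len(rna) - k + 1)]


def with_special_tokens(kmers: list[str]) -> list[str]:
    """Prepend [CLS] and append [SEP], as described in the text."""
    return ["[CLS]", *kmers, "[SEP]"]


kmers = kmer_tokenize("ATGCAT")
tokens = with_special_tokens(kmers)
print(kmers)   # the 6-nucleotide input gives 6 - 2 = 4 overlapping 3-mers
print(tokens)
```

In practice, use the tokenizer loaded with `AutoTokenizer`, which also maps the 3-mers to vocabulary IDs.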

Pretraining

Objective: Masked Language Modeling (MLM) on 3-mer tokens
Data: Human 3' UTR sequences
Source checkpoint: 3-new-12w-0/pytorch_model.bin from figshare article 22847354

Checkpoint selection

The only publicly released pre-trained checkpoint for the 3-mer variant is 3-new-12w-0.

Parity Verification

Hidden-state representations were verified to be identical (max abs diff = 0.00) to those of the original BertForMaskedLM implementation at all 13 representation levels (embedding output + 12 transformer layers). The check was run on GPU with PyTorch 2.7 / CUDA 12.6. The SDPA attention implementation was also verified against eager attention (max diff < 2e-5).

Related Models

See the full UTRBERT collection.

Model	k-mer	Vocab size	Notes
UTRBERT-3mer	3	69	This model
UTRBERT-4mer	4	261
UTRBERT-5mer	5	1029
UTRBERT-6mer	6	4101
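The vocabulary sizes in the table follow from the 4-letter RNA alphabet plus the 5 special tokens mentioned in the architecture table. A short sketch of the arithmetic, assuming every model in the family uses the same 5 special tokens (the 3-mer figure is stated in this card; the rest is inferred from the table values):

```python
NUM_SPECIAL_TOKENS = 5  # stated for the 3-mer model; assumed for the others
ALPHABET_SIZE = 4       # A, C, G, U


def vocab_size(k: int) -> int:
    """Number of possible k-mers over the RNA alphabet plus special tokens."""
    return ALPHABET_SIZE ** k + NUM_SPECIAL_TOKENS


for k in (3, 4, 5, 6):
    print(f"{k}-mer: {vocab_size(k)}")
```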

Usage

Embedding generation

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTRBERT-3mer", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/UTRBERT-3mer")
model.eval()

sequences = ["AUGCAUGCAUGCAUGCAUGC", "GCGCGCGCGCGCGCGCGCGC"]
enc = tokenizer(sequences, return_tensors="pt", padding=True, truncation=True, max_length=512)

with torch.no_grad():
    out = model(**enc)

cls_emb   = out.last_hidden_state[:, 0, :]   # (batch, 768) -- CLS token
token_emb = out.last_hidden_state             # (batch, seq_len, 768)

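# Intermediate layers: hidden_states[0] is the embedding output,
# hidden_states[i] is the output of transformer layer i (13 entries in total)
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer6_emb = out_all.hidden_states[6]         # (batch, seq_len, 768) -- output of layer 6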

Fine-tuning

Standard HF conventions apply. For sequence-level tasks, use the CLS token embedding as input to a classification or regression head.

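from transformers import BertForSequenceClassification

# The classification head is newly initialized and must be trained.
model = BertForSequenceClassification.from_pretrained(
    "Taykhoom/UTRBERT-3mer",
    num_labels=2,  # binary classification; use num_labels=1 for regression
)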

Implementation Notes

This is a minimal HF port using standard BertModel with no custom modeling code. The original checkpoint (BertForMaskedLM) was converted by stripping the bert. prefix and dropping the cls.* MLM head. trust_remote_code=True is required only for the tokenizer (k-mer splitting), not for the model.
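The key renaming described above can be illustrated on a plain dictionary. This is a minimal sketch of the idea, not the conversion script actually used: `convert_keys` is a hypothetical helper, and the string values stand in for weight tensors.

```python
def convert_keys(state_dict: dict) -> dict:
    """Map BertForMaskedLM-style keys to BertModel-style keys."""
    converted = {}
    for key, value in state_dict.items():
        if key.startswith("cls."):
            continue  # drop the MLM head
        if key.startswith("bert."):
            key = key[len("bert."):]  # strip the "bert." prefix
        converted[key] = value
    return converted


# Placeholder values; a real state dict would hold tensors.
example = {
    "bert.embeddings.word_embeddings.weight": "W_emb",
    "bert.encoder.layer.0.attention.self.query.weight": "W_q",
    "cls.predictions.bias": "b_mlm",
}
converted = convert_keys(example)
print(sorted(converted))
```

With a real checkpoint, the same function could be applied to the dictionary returned by `torch.load` before saving the result for use with `BertModel`.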

Citation

@article{yang2024_utrbert,
  title   = {Deciphering 3'{UTR} Mediated Gene Regulation Using Interpretable Deep Representation Learning},
  author  = {Yang, Yuning and Li, Gen and Pang, Kuan and Cao, Wuxinhao and Zhang, Zhaolei and Li, Xiangtao},
  journal = {Advanced Science},
  volume  = {11},
  number  = {39},
  pages   = {e2407013},
  year    = {2024},
  doi     = {10.1002/advs.202407013}
}

Credits

Original model and code by Yang et al. Source: GitHub. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.

License

MIT, following the original repository.