---
language:
- rna
library_name: transformers
tags:
- RNA
- language-model
- 3-UTR
license: mit
---

# UTRBERT-4mer

A BERT-base language model pre-trained on human 3' UTR sequences using 4-mer tokenization. Part of the 3UTRBERT model family introduced in Yang et al. (2024).

## Architecture

| Parameter | Value |
|---|---|
| Layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Intermediate size | 3072 |
| Vocabulary size | 261 (5 special tokens + RNA 4-mers) |
| Positional encoding | Learned absolute (BERT-style) |
| Architecture | BERT-base |
| Max sequence length | 512 tokens (~515 nucleotides for 4-mer) |

**Tokenization:** Raw RNA (or DNA) sequences are first converted from T to U, then split into overlapping 4-mers with stride 1, so a sequence of length L produces L-3 tokens. The tokenizer then prepends a [CLS] token and appends a [SEP] token.

## Pretraining

- **Objective:** Masked Language Modeling (MLM) on 4-mer tokens
- **Data:** Human 3' UTR sequences
- **Source checkpoint:** `4-new-12w-0/pytorch_model.bin` from figshare article 22851119

### Checkpoint selection

The only publicly released pre-trained checkpoint for the 4-mer variant is `4-new-12w-0`.

## Parity Verification

Hidden-state representations were verified to be identical (max abs diff = 0.00) to those of the original BertForMaskedLM implementation at all 13 representation levels (embedding + 12 transformer layers). Verification was run on GPU with PyTorch 2.7 / CUDA 12.6. The SDPA attention path was also verified (max diff < 2e-5 vs eager).

## Related Models

See the full [UTRBERT collection](https://huggingface.co/collections/Taykhoom/utrbert-PLACEHOLDER).

| Model | k-mer | Vocab size | Notes |
|---|---|---|---|
| [UTRBERT-3mer](https://huggingface.co/Taykhoom/UTRBERT-3mer) | 3 | 69 | |
| **[UTRBERT-4mer](https://huggingface.co/Taykhoom/UTRBERT-4mer)** | 4 | 261 | This model |
| [UTRBERT-5mer](https://huggingface.co/Taykhoom/UTRBERT-5mer) | 5 | 1029 | |
| [UTRBERT-6mer](https://huggingface.co/Taykhoom/UTRBERT-6mer) | 6 | 4101 | |

## Usage

### Embedding generation

```python
import torch
from transformers import AutoTokenizer, AutoModel

# trust_remote_code is needed only for the custom k-mer tokenizer
tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTRBERT-4mer", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/UTRBERT-4mer")
model.eval()

sequences = ["AUGCAUGCAUGCAUGCAUGC", "GCGCGCGCGCGCGCGCGCGC"]
enc = tokenizer(sequences, return_tensors="pt", padding=True, truncation=True, max_length=512)

with torch.no_grad():
    out = model(**enc)

    # Intermediate layers (hidden_states[0] is the embedding output)
    out_all = model(**enc, output_hidden_states=True)

cls_emb   = out.last_hidden_state[:, 0, :]   # (batch, 768) -- CLS token
token_emb = out.last_hidden_state            # (batch, seq_len, 768)
layer6_emb = out_all.hidden_states[6]        # (batch, seq_len, 768)
```

### Fine-tuning

Standard HF conventions apply. For sequence-level tasks, use the CLS token embedding as input to a classification or regression head.

```python
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "Taykhoom/UTRBERT-4mer",
    num_labels=2,
)
```

## Implementation Notes

This is a minimal HF port using the standard `BertModel` with no custom modeling code. The original checkpoint (`BertForMaskedLM`) was converted by stripping the `bert.` prefix from parameter names and dropping the `cls.*` MLM head. `trust_remote_code=True` is required only for the tokenizer (k-mer splitting), not for the model.

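
The two steps these notes mention, k-mer splitting and checkpoint key conversion, can be illustrated in plain Python. This is only a sketch under stated assumptions: the helper names `kmer_split` and `strip_mlm_keys` are hypothetical and are not taken from this repository or from the original 3UTRBERT code, and the state dict is modeled as an ordinary mapping from parameter names to values.

```python
def kmer_split(sequence, k=4):
    """Hypothetical helper: convert T->U, then take overlapping k-mers (stride 1)."""
    rna = sequence.replace("T", "U")
    # A sequence of length L yields L - k + 1 k-mers (L - 3 for k = 4).
    return [rna[i:i + k] for i in range(len(rna) - k + 1)]


def strip_mlm_keys(state_dict):
    """Hypothetical helper: drop `cls.*` entries and strip the `bert.` prefix."""
    converted = {}
    for name, value in state_dict.items():
        if name.startswith("cls."):
            continue  # MLM head parameters are not used by BertModel
        if name.startswith("bert."):
            name = name[len("bert."):]
        converted[name] = value
    return converted


tokens = kmer_split("ATGCATGC")
print(len(tokens), tokens)

# Toy stand-in for a checkpoint; real values would be tensors.
toy_checkpoint = {
    "bert.embeddings.word_embeddings.weight": 1,
    "bert.encoder.layer.0.attention.self.query.weight": 2,
    "cls.predictions.bias": 3,
}
print(sorted(strip_mlm_keys(toy_checkpoint)))
```

In practice the released tokenizer handles the splitting and special tokens itself, so a helper like `kmer_split` is useful mainly for checking how many tokens a given sequence will produce before truncation.
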
## Citation

```bibtex
@article{yang2024_utrbert,
  title   = {Deciphering 3'{UTR} Mediated Gene Regulation Using Interpretable Deep Representation Learning},
  author  = {Yang, Yuning and Li, Gen and Pang, Kuan and Cao, Wuxinhao and Zhang, Zhaolei and Li, Xiangtao},
  journal = {Advanced Science},
  volume  = {11},
  number  = {39},
  pages   = {e2407013},
  year    = {2024},
  doi     = {10.1002/advs.202407013}
}
```

## Credits

Original model and code by Yang et al. Source: [GitHub](https://github.com/yangyn533/3UTRBERT). The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code) and reviewed manually by Taykhoom Dalal.

## License

MIT, following the original repository.