
+---
+language:
+- rna
+library_name: transformers
+tags:
+- RNA
+- language-model
+- 3-UTR
+license: mit
+---
+# UTRBERT-3mer
+A BERT-base language model pre-trained on human 3' UTR sequences using 3-mer tokenization.
+Part of the 3UTRBERT model family introduced in Yang et al. (2024).
+## Architecture
+| Parameter | Value |
+|---|---|
+| Layers | 12 |
+| Attention heads | 12 |
+| Embedding dimension | 768 |
+| Intermediate size | 3072 |
+| Vocabulary size | 69 (5 special tokens + 64 RNA 3-mers) |
+| Positional encoding | Learned absolute (BERT-style) |
+| Architecture | BERT-base |
+| Max sequence length | 512 tokens (~514 nucleotides for 3-mer; 512 if [CLS] and [SEP] count toward the limit) |
+**Tokenization:** Raw RNA (or DNA) sequences are converted T->U, then split into
+overlapping 3-mers (stride 1), so a sequence of length L produces L-2 3-mer tokens.
+The tokenizer then prepends a [CLS] token and appends a [SEP] token.
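The tokenization rule described above can be illustrated with a few lines of plain Python. This is a minimal sketch rather than the tokenizer code shipped with the model: the function names `kmer_tokenize` and `add_special_tokens` are hypothetical, and upper-casing the input is an assumption not stated in this card.

```python
def kmer_tokenize(sequence: str, k: int = 3) -> list[str]:
    """Split a nucleotide sequence into overlapping k-mers (stride 1)."""
    # Assumption: input is upper-cased before the T -> U conversion.
    rna = sequence.upper().replace("T", "U")
    # A sequence of length L yields L - k + 1 k-mers (L - 2 for k = 3).
    return [rna[i:i + k] for i in range(len(rna) - k + 1)]


def add_special_tokens(kmers: list[str]) -> list[str]:
    """Wrap the k-mers with [CLS] and [SEP], as described in this card."""
    return ["[CLS]", *kmers, "[SEP]"]


tokens = add_special_tokens(kmer_tokenize("ATGCAUG"))
print(tokens)
# ['[CLS]', 'AUG', 'UGC', 'GCA', 'CAU', 'AUG', '[SEP]']
```

The 7-nucleotide input yields 5 overlapping 3-mers (7 - 2), plus the two special tokens.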
+## Pretraining
+- **Objective:** Masked Language Modeling (MLM) on 3-mer tokens
+- **Data:** Human 3' UTR sequences
+- **Source checkpoint:** `3-new-12w-0/pytorch_model.bin` from figshare article 22847354
+### Checkpoint selection
+The only publicly released pre-trained checkpoint for the 3-mer variant is `3-new-12w-0`, so it is the checkpoint converted here.
+## Parity Verification
+Hidden-state representations were verified to be identical (max abs diff = 0.00) to those
+of the original `BertForMaskedLM` implementation at all 13 representation levels (embedding
+output + 12 transformer layers). The check was run on GPU with PyTorch 2.7 / CUDA 12.6.
+The SDPA attention path was also verified against eager attention (max diff < 2e-5).
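A parity check of this kind can be sketched as a per-level comparison of two `hidden_states` tuples. The snippet below is not the verification script used for this port. The helper name `max_abs_diff_per_layer` is hypothetical, and random tensors stand in for the real model outputs, which would come from running both implementations with `output_hidden_states=True`.

```python
import torch


def max_abs_diff_per_layer(hidden_a, hidden_b):
    """Return the max absolute difference for each pair of hidden-state tensors."""
    if len(hidden_a) != len(hidden_b):
        raise ValueError("hidden-state tuples must have the same length")
    return [(a - b).abs().max().item() for a, b in zip(hidden_a, hidden_b)]


# Stand-in data: 13 levels (embedding output + 12 layers), shape (batch, seq_len, 768).
torch.manual_seed(0)
reference = tuple(torch.randn(2, 20, 768) for _ in range(13))
ported = tuple(t.clone() for t in reference)

diffs = max_abs_diff_per_layer(reference, ported)
print(len(diffs), max(diffs))
```

For a comparison between attention implementations, where small numerical differences are expected, the same list can be checked against a tolerance instead of exact equality.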
+## Related Models
+See the full [UTRBERT collection](https://huggingface.co/collections/Taykhoom/utrbert-PLACEHOLDER).
+| Model | k-mer | Vocab size | Notes |
+|---|---|---|---|
+| **[UTRBERT-3mer](https://huggingface.co/Taykhoom/UTRBERT-3mer)** | 3 | 69 | This model |
+| [UTRBERT-4mer](https://huggingface.co/Taykhoom/UTRBERT-4mer) | 4 | 261 | |
+| [UTRBERT-5mer](https://huggingface.co/Taykhoom/UTRBERT-5mer) | 5 | 1029 | |
+| [UTRBERT-6mer](https://huggingface.co/Taykhoom/UTRBERT-6mer) | 6 | 4101 | |
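The vocabulary sizes in the table follow from counting all k-mers over a four-letter alphabet and adding the special tokens. The short check below reproduces them, assuming the alphabet is `ACGU` and that every variant uses the same 5 special tokens as the 3-mer model (the table is consistent with this, but the card does not state it for the other variants). The function name `kmer_vocab_size` is hypothetical.

```python
NUM_SPECIAL_TOKENS = 5  # stated for the 3-mer model; assumed for the other variants
ALPHABET = "ACGU"       # assumed RNA alphabet


def kmer_vocab_size(k: int) -> int:
    """Number of possible k-mers over the alphabet plus the special tokens."""
    return len(ALPHABET) ** k + NUM_SPECIAL_TOKENS


for k in (3, 4, 5, 6):
    print(k, kmer_vocab_size(k))
# 3 69
# 4 261
# 5 1029
# 6 4101
```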
+## Usage
+### Embedding generation
+```python
+import torch
+from transformers import AutoTokenizer, AutoModel
+# trust_remote_code=True is needed only for the custom k-mer tokenizer
+tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTRBERT-3mer", trust_remote_code=True)
+model = AutoModel.from_pretrained("Taykhoom/UTRBERT-3mer")
+model.eval()
+sequences = ["AUGCAUGCAUGCAUGCAUGC", "GCGCGCGCGCGCGCGCGCGC"]
+enc = tokenizer(sequences, return_tensors="pt", padding=True, truncation=True, max_length=512)
+# Single forward pass without gradient tracking; also returns all hidden states
+with torch.no_grad():
+    out = model(**enc, output_hidden_states=True)
+cls_emb   = out.last_hidden_state[:, 0, :]   # (batch, 768) -- CLS token
+token_emb = out.last_hidden_state             # (batch, seq_len, 768)
+# Intermediate layers: hidden_states[0] is the embedding output, [1]..[12] are the transformer layers
+layer6_emb = out.hidden_states[6]             # (batch, seq_len, 768)
+```
+### Fine-tuning
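+Standard Hugging Face conventions apply. For sequence-level tasks, use the CLS token embedding
+as input to a classification or regression head. The task head is not part of the released
+checkpoint, so it is newly initialized and must be trained before its predictions are meaningful.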
+```python
+from transformers import BertForSequenceClassification
+# num_labels=2 sets up binary classification; num_labels=1 selects regression
+model = BertForSequenceClassification.from_pretrained(
+    "Taykhoom/UTRBERT-3mer",
+    num_labels=2,
+)
+```
+## Implementation Notes
+This is a minimal Hugging Face port that uses the standard `BertModel` class with no custom
+modeling code. The original checkpoint (`BertForMaskedLM`) was converted by stripping the
+`bert.` prefix from parameter names and dropping the `cls.*` MLM head.
+`trust_remote_code=True` is required only for the tokenizer (k-mer splitting), not for the model.
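The key renaming described above can be sketched as a small dictionary transformation. This is an illustration of the idea, not the conversion script used for this port. The function name `convert_state_dict` is hypothetical, and string placeholders stand in for the weight tensors so the example runs without the checkpoint.

```python
def convert_state_dict(mlm_state_dict: dict) -> dict:
    """Map BertForMaskedLM-style keys onto BertModel-style keys."""
    converted = {}
    for key, value in mlm_state_dict.items():
        if key.startswith("cls."):
            continue  # drop the MLM head
        if key.startswith("bert."):
            key = key[len("bert."):]  # strip the encoder prefix
        converted[key] = value
    return converted


# Placeholder values; in a real state dict these would be tensors.
example = {
    "bert.embeddings.word_embeddings.weight": "W_emb",
    "bert.encoder.layer.0.attention.self.query.weight": "W_q",
    "cls.predictions.bias": "b_mlm",
}
converted = convert_state_dict(example)
print(sorted(converted))
# ['embeddings.word_embeddings.weight', 'encoder.layer.0.attention.self.query.weight']
```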
+## Citation
+```bibtex
+@article{yang2024_utrbert,
+  title   = {Deciphering 3'{UTR} Mediated Gene Regulation Using Interpretable Deep Representation Learning},
+  author  = {Yang, Yuning and Li, Gen and Pang, Kuan and Cao, Wuxinhao and Zhang, Zhaolei and Li, Xiangtao},
+  journal = {Advanced Science},
+  volume  = {11},
+  number  = {39},
+  pages   = {e2407013},
+  year    = {2024},
+  doi     = {10.1002/advs.202407013}
+}
+```
+## Credits
+Original model and code by Yang et al. Source: [GitHub](https://github.com/yangyn533/3UTRBERT).
+The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
+and reviewed manually by Taykhoom Dalal.
+## License
+MIT, following the original repository.