| language: | |
| - rna | |
| library_name: transformers | |
| tags: | |
| - RNA | |
| - language-model | |
| - 3-UTR | |
| license: mit | |
| # UTRBERT-5mer | |
| A BERT-base language model pre-trained on human 3' UTR sequences using 5-mer tokenization. | |
| Part of the 3UTRBERT model family introduced in Yang et al. (2024). | |
| ## Architecture | |
| | Parameter | Value | | |
| |---|---| | |
| | Layers | 12 | | |
| | Attention heads | 12 | | |
| | Embedding dimension | 768 | | |
| | Intermediate size | 3072 | | |
| | Vocabulary size | 1029 (5 special tokens + RNA 5-mers) | | |
| | Positional encoding | Learned absolute (BERT-style) | | |
| | Architecture | BERT-base | | |
| | Max sequence length | 512 tokens (~516 nucleotides for 5-mer) | | |
| **Tokenization:** raw RNA (or DNA) sequences are first converted T->U, then split into | |
| overlapping 5-mers with stride 1, so a sequence of length L produces L-4 tokens. | |
| The tokenizer then prepends a [CLS] token and appends a [SEP] token. | |
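The splitting rule described above can be sketched in plain Python. This is a minimal illustration, not the tokenizer shipped with the model: the function name `kmer_tokenize` is hypothetical, and uppercasing the input is an assumption made here for convenience.

```python
def kmer_tokenize(sequence: str, k: int = 5) -> list[str]:
    """Split a sequence into overlapping k-mers with stride 1 (illustrative only)."""
    # Convert the DNA alphabet to RNA (T -> U); uppercasing is an assumption of this sketch.
    rna = sequence.upper().replace("T", "U")
    # A sequence of length L yields L - k + 1 k-mers (L - 4 for k = 5).
    kmers = [rna[i:i + k] for i in range(len(rna) - k + 1)]
    # Mirror the special tokens that the tokenizer adds around the k-mers.
    return ["[CLS]"] + kmers + ["[SEP]"]


tokens = kmer_tokenize("ATGCATGCAT")
print(tokens)
```

For the 10-nucleotide input above, the sketch yields 6 overlapping 5-mers plus the two special tokens.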
| ## Pretraining | |
| - **Objective:** Masked Language Modeling (MLM) on 5-mer tokens | |
| - **Data:** Human 3' UTR sequences | |
| - **Source checkpoint:** `5-new-12w-0/pytorch_model.bin` from figshare article 22851191 | |
| ### Checkpoint selection | |
| The only publicly released pre-trained checkpoint for the 5-mer variant is `5-new-12w-0`. | |
| ## Parity Verification | |
| Hidden-state representations were verified to be identical (max abs diff = 0.00) to the original | |
| BertForMaskedLM implementation at all 13 representation levels (embedding output + 12 transformer | |
| layers). Verification was run on GPU with PyTorch 2.7 / CUDA 12.6. | |
| The SDPA attention path was also verified against eager attention (max diff < 2e-5). | |
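A check of this kind can be sketched as follows. This is not the verification script used for this port: the function name `max_abs_diff_per_layer` is hypothetical, and the random tensors merely stand in for the `hidden_states` tuples that two models would return when called with `output_hidden_states=True`.

```python
import torch


def max_abs_diff_per_layer(hidden_a, hidden_b):
    """Return the maximum absolute difference for each pair of hidden-state tensors."""
    if len(hidden_a) != len(hidden_b):
        raise ValueError("hidden-state tuples must have the same length")
    return [(a - b).abs().max().item() for a, b in zip(hidden_a, hidden_b)]


# Stand-in data (assumption): 13 levels = embedding output + 12 transformer layers,
# each of shape (batch, seq_len, 768).
torch.manual_seed(0)
reference = tuple(torch.randn(2, 18, 768) for _ in range(13))
ported = tuple(t.clone() for t in reference)

diffs = max_abs_diff_per_layer(reference, ported)
print(max(diffs))  # 0.0, because the stand-in tensors are exact copies
```

In a real comparison, `reference` and `ported` would come from the original and converted models run on the same tokenized batch.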
| ## Related Models | |
| See the full [UTRBERT collection](https://huggingface.co/collections/Taykhoom/utrbert-PLACEHOLDER). | |
| | Model | k-mer | Vocab size | Notes | | |
| |---|---|---|---| | |
| | [UTRBERT-3mer](https://huggingface.co/Taykhoom/UTRBERT-3mer) | 3 | 69 | | | |
| | [UTRBERT-4mer](https://huggingface.co/Taykhoom/UTRBERT-4mer) | 4 | 261 | | | |
| | **[UTRBERT-5mer](https://huggingface.co/Taykhoom/UTRBERT-5mer)** | 5 | 1029 | This model | | |
| | [UTRBERT-6mer](https://huggingface.co/Taykhoom/UTRBERT-6mer) | 6 | 4101 | | | |
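The vocabulary sizes in the table are consistent with 4^k possible RNA k-mers plus the 5 special tokens listed in the Architecture section. The short check below makes that arithmetic explicit. The helper name `vocab_size` is hypothetical, and applying the same count of 5 special tokens to the other variants is an assumption inferred from the table.

```python
NUM_SPECIAL_TOKENS = 5  # stated for the 5-mer model; assumed identical for the other variants


def vocab_size(k: int, alphabet_size: int = 4) -> int:
    """Number of k-mers over a 4-letter alphabet plus the special tokens."""
    return alphabet_size ** k + NUM_SPECIAL_TOKENS


sizes = {k: vocab_size(k) for k in (3, 4, 5, 6)}
print(sizes)  # {3: 69, 4: 261, 5: 1029, 6: 4101}
```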
| ## Usage | |
| ### Embedding generation | |
| ```python | |
| import torch | |
| from transformers import AutoTokenizer, AutoModel | |
| tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTRBERT-5mer", trust_remote_code=True) | |
| model = AutoModel.from_pretrained("Taykhoom/UTRBERT-5mer") | |
| model.eval() | |
| sequences = ["AUGCAUGCAUGCAUGCAUGC", "GCGCGCGCGCGCGCGCGCGC"] | |
| enc = tokenizer(sequences, return_tensors="pt", padding=True, truncation=True, max_length=512) | |
| with torch.no_grad(): | |
|     out = model(**enc) | |
| cls_emb = out.last_hidden_state[:, 0, :] # (batch, 768) -- CLS token | |
| token_emb = out.last_hidden_state # (batch, seq_len, 768) | |
| # Intermediate layers: hidden_states[0] is the embedding output, [1]..[12] are the transformer layers | |
| with torch.no_grad(): | |
|     out_all = model(**enc, output_hidden_states=True) | |
| layer6_emb = out_all.hidden_states[6] # (batch, seq_len, 768) | |
| ``` | |
| ### Fine-tuning | |
| Standard HF conventions apply. For sequence-level tasks, use the CLS token embedding | |
| as input to a classification or regression head. | |
| ```python | |
| from transformers import BertForSequenceClassification | |
| model = BertForSequenceClassification.from_pretrained( | |
|     "Taykhoom/UTRBERT-5mer", | |
|     num_labels=2, | |
| ) | |
| ``` | |
| ## Implementation Notes | |
| This is a minimal HF port using standard `BertModel` with no custom modeling code. | |
| The original checkpoint (`BertForMaskedLM`) was converted by stripping the `bert.` | |
| prefix and dropping the `cls.*` MLM head. `trust_remote_code=True` is required only | |
| for the tokenizer (k-mer splitting), not for the model. | |
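The key renaming described above can be illustrated with a small, framework-free sketch. It is not the conversion script used for this port: `convert_keys` is a hypothetical helper, the example key names follow the usual Hugging Face BERT naming and are shown only for illustration, and plain strings stand in for weight tensors.

```python
def convert_keys(state_dict: dict) -> dict:
    """Map BertForMaskedLM-style keys to BertModel-style keys (illustrative only)."""
    converted = {}
    for name, value in state_dict.items():
        if name.startswith("cls."):
            # Drop the MLM head; it has no counterpart in BertModel.
            continue
        if name.startswith("bert."):
            # Strip the wrapper prefix so the key matches BertModel's own naming.
            name = name[len("bert."):]
        converted[name] = value
    return converted


# Stand-in state dict (assumption): strings replace the real weight tensors.
original = {
    "bert.embeddings.word_embeddings.weight": "W_emb",
    "bert.encoder.layer.0.attention.self.query.weight": "W_q0",
    "cls.predictions.bias": "b_mlm",
}
converted = convert_keys(original)
print(sorted(converted))
```

With a real checkpoint, the same loop would run over the dictionary loaded from `pytorch_model.bin`, and the result could then be loaded into a `BertModel`.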
| ## Citation | |
| ```bibtex | |
| @article{yang2024_utrbert, | |
| title = {Deciphering 3'{UTR} Mediated Gene Regulation Using Interpretable Deep Representation Learning}, | |
| author = {Yang, Yuning and Li, Gen and Pang, Kuan and Cao, Wuxinhao and Zhang, Zhaolei and Li, Xiangtao}, | |
| journal = {Advanced Science}, | |
| volume = {11}, | |
| number = {39}, | |
| pages = {e2407013}, | |
| year = {2024}, | |
| doi = {10.1002/advs.202407013} | |
| } | |
| ``` | |
| ## Credits | |
| Original model and code by Yang et al. Source: [GitHub](https://github.com/yangyn533/3UTRBERT). | |
| The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code) | |
| and reviewed manually by Taykhoom Dalal. | |
| ## License | |
| MIT, following the original repository. | |