Upload README.md with huggingface_hub
README.md
---
language:
- rna
library_name: transformers
tags:
- RNA
- language-model
- 3-UTR
license: mit
---

# UTRBERT-4mer

A BERT-base language model pre-trained on human 3' UTR sequences using 4-mer tokenization.
Part of the 3UTRBERT model family introduced in Yang et al. (2024).

## Architecture

| Parameter | Value |
|---|---|
| Layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Intermediate size | 3072 |
| Vocabulary size | 261 (5 special tokens + RNA 4-mers) |
| Positional encoding | Learned absolute (BERT-style) |
| Architecture | BERT-base |
| Max sequence length | 512 tokens (~515 nucleotides for 4-mer; 513 once [CLS] and [SEP] are counted) |

**Tokenization:** raw RNA (or DNA) sequences are first converted T->U, then split into
overlapping 4-mers (stride 1). A sequence of length L produces L-3 tokens.
The tokenizer prepends a [CLS] token and appends a [SEP] token.
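
The tokenization rule described above can be sketched in a few lines of plain Python. This is an illustration only, not the tokenizer shipped with the model: the function names `kmer_tokens` and `with_special_tokens` are hypothetical, and upper-casing the input is an assumption.

```python
def kmer_tokens(sequence: str, k: int = 4) -> list[str]:
    """Split a nucleotide sequence into overlapping k-mers with stride 1."""
    # Upper-casing is an assumption; T->U follows the description above.
    rna = sequence.upper().replace("T", "U")
    # A sequence of length L yields L - k + 1 tokens (L - 3 for k = 4).
    return [rna[i:i + k] for i in range(len(rna) - k + 1)]


def with_special_tokens(tokens: list[str]) -> list[str]:
    """Prepend [CLS] and append [SEP], as the tokenizer is described to do."""
    return ["[CLS]", *tokens, "[SEP]"]


print(with_special_tokens(kmer_tokens("ATGCATGC")))
# ['[CLS]', 'AUGC', 'UGCA', 'GCAU', 'CAUG', 'AUGC', '[SEP]']
```

An 8-nucleotide input gives 8 - 3 = 5 k-mer tokens, or 7 tokens once the two special tokens are added.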

## Pretraining

- **Objective:** Masked Language Modeling (MLM) on 4-mer tokens
- **Data:** Human 3' UTR sequences
- **Source checkpoint:** `4-new-12w-0/pytorch_model.bin` from figshare article 22851119
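
To make the MLM objective on overlapping k-mers concrete, here is a minimal masking sketch. With stride-1 4-mers, a single masked token is fully determined by its two neighbours, so the sketch masks contiguous runs of tokens instead. This card does not state the masking strategy or rate used during pretraining; the span length, the 15% rate (the usual BERT default), and the function name `mask_spans` are all assumptions for illustration.

```python
import random

MASK = "[MASK]"


def mask_spans(tokens, span=4, mask_rate=0.15, seed=0):
    """Mask contiguous runs of `span` tokens; return (masked, labels)."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)  # None marks positions that are not scored
    # Number of spans chosen so that roughly `mask_rate` of tokens are masked.
    n_spans = max(1, round(len(tokens) * mask_rate / span))
    for _ in range(n_spans):
        start = rng.randrange(0, max(1, len(tokens) - span + 1))
        for i in range(start, min(start + span, len(tokens))):
            labels[i] = tokens[i]  # the model is trained to recover this token
            masked[i] = MASK
    return masked, labels


tokens = ["AUGC", "UGCA", "GCAU", "CAUG", "AUGC", "UGCA", "GCAU", "CAUG"]
masked, labels = mask_spans(tokens)
print(masked)
print(labels)
```

The sketch omits the random-replacement and keep-as-is branches of the original BERT recipe to stay short.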

### Checkpoint selection

The only publicly released pre-trained checkpoint for the 4-mer variant is `4-new-12w-0`.

## Parity Verification

Hidden-state representations were verified to be identical (max abs diff = 0.00) to the original
`BertForMaskedLM` implementation at all 13 representation levels (embedding output + 12 transformer
layers). The check was run on GPU with PyTorch 2.7 / CUDA 12.6.
The SDPA attention path was also verified against eager attention (max diff < 2e-5).

## Related Models

See the full [UTRBERT collection](https://huggingface.co/collections/Taykhoom/utrbert-PLACEHOLDER).

| Model | k-mer | Vocab size | Notes |
|---|---|---|---|
| [UTRBERT-3mer](https://huggingface.co/Taykhoom/UTRBERT-3mer) | 3 | 69 | |
| **[UTRBERT-4mer](https://huggingface.co/Taykhoom/UTRBERT-4mer)** | 4 | 261 | This model |
| [UTRBERT-5mer](https://huggingface.co/Taykhoom/UTRBERT-5mer) | 5 | 1029 | |
| [UTRBERT-6mer](https://huggingface.co/Taykhoom/UTRBERT-6mer) | 6 | 4101 | |
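
The vocabulary sizes in the table follow from 4^k possible RNA k-mers plus 5 special tokens. The sketch below reproduces the counts. The special-token names, their order, and the `ACGU` enumeration order are assumptions, so the token ids will not necessarily match the released vocabulary files.

```python
from itertools import product

# Assumed names for the 5 special tokens (standard BERT set).
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def build_vocab(k: int) -> list[str]:
    """Enumerate special tokens followed by every RNA k-mer over A, C, G, U."""
    kmers = ["".join(p) for p in product("ACGU", repeat=k)]
    return SPECIAL_TOKENS + kmers


for k in (3, 4, 5, 6):
    print(k, len(build_vocab(k)))
# 3 69
# 4 261
# 5 1029
# 6 4101
```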

## Usage

### Embedding generation

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTRBERT-4mer", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/UTRBERT-4mer")
model.eval()

sequences = ["AUGCAUGCAUGCAUGCAUGC", "GCGCGCGCGCGCGCGCGCGC"]
enc = tokenizer(sequences, return_tensors="pt", padding=True, truncation=True, max_length=512)

with torch.no_grad():
    out = model(**enc)

cls_emb   = out.last_hidden_state[:, 0, :]  # (batch, 768) -- CLS token
token_emb = out.last_hidden_state           # (batch, seq_len, 768)

# Intermediate layers (index 0 is the embedding output, 1-12 are the transformer layers)
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer6_emb = out_all.hidden_states[6]       # (batch, seq_len, 768)
```
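
With `truncation=True`, anything beyond the 512-token limit is cut off. For longer 3' UTRs, one option is to split the nucleotide sequence into windows before tokenizing. The sketch below is a hypothetical helper, not part of this repository. It assumes the 512-token limit includes [CLS] and [SEP], so each window holds at most 510 4-mers (513 nucleotides), and it overlaps windows by k-1 nucleotides so that no k-mer spanning a boundary is lost.

```python
def split_into_windows(sequence, k=4, max_tokens=512, overlap=None):
    """Split a nucleotide sequence into windows that each fit in `max_tokens`."""
    if overlap is None:
        overlap = k - 1                  # keeps every k-mer exactly once
    max_kmers = max_tokens - 2           # reserve room for [CLS] and [SEP]
    window_nt = max_kmers + k - 1        # 510 4-mers <-> 513 nucleotides
    step = window_nt - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than the window length")
    windows = []
    for start in range(0, len(sequence), step):
        windows.append(sequence[start:start + window_nt])
        if start + window_nt >= len(sequence):
            break                        # this window already reaches the end
    return windows


long_utr = "ACGU" * 300                  # 1200 nucleotides
windows = split_into_windows(long_utr)
print([len(w) for w in windows])         # [513, 513, 180]
```

Each window can then be passed to the tokenizer as a separate sequence. How to combine the per-window embeddings (for example, averaging the CLS vectors) is a modelling choice this card does not prescribe.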

### Fine-tuning

Standard HF conventions apply. For sequence-level tasks, use the CLS token embedding
as input to a classification or regression head.

```python
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "Taykhoom/UTRBERT-4mer",
    num_labels=2,
)
```

## Implementation Notes

This is a minimal HF port using standard `BertModel` with no custom modeling code.
The original checkpoint (`BertForMaskedLM`) was converted by stripping the `bert.`
prefix and dropping the `cls.*` MLM head. `trust_remote_code=True` is required only
for the tokenizer (k-mer splitting), not for the model.
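
The key renaming described above can be sketched as follows. This is not the actual conversion script: `convert_state_dict` is a hypothetical name, and the toy dictionary uses placeholder strings where a real checkpoint would hold tensors.

```python
def convert_state_dict(mlm_state: dict) -> dict:
    """Map BertForMaskedLM-style keys to plain BertModel-style keys."""
    converted = {}
    for key, value in mlm_state.items():
        if key.startswith("cls."):
            continue                        # drop the MLM head
        if key.startswith("bert."):
            key = key[len("bert."):]        # strip the encoder prefix
        converted[key] = value
    return converted


# Toy example with placeholder values instead of tensors.
toy = {
    "bert.embeddings.word_embeddings.weight": "W_emb",
    "bert.encoder.layer.0.attention.self.query.weight": "W_q",
    "cls.predictions.bias": "b_mlm",
}
print(sorted(convert_state_dict(toy)))
# ['embeddings.word_embeddings.weight', 'encoder.layer.0.attention.self.query.weight']
```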

## Citation

```bibtex
@article{yang2024_utrbert,
  title   = {Deciphering 3'{UTR} Mediated Gene Regulation Using Interpretable Deep Representation Learning},
  author  = {Yang, Yuning and Li, Gen and Pang, Kuan and Cao, Wuxinhao and Zhang, Zhaolei and Li, Xiangtao},
  journal = {Advanced Science},
  volume  = {11},
  number  = {39},
  pages   = {e2407013},
  year    = {2024},
  doi     = {10.1002/advs.202407013}
}
```

## Credits

Original model and code by Yang et al. Source: [GitHub](https://github.com/yangyn533/3UTRBERT).
The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
and reviewed manually by Taykhoom Dalal.

## License

MIT, following the original repository.