Taykhoom committed
Commit c1876df · verified · 1 Parent(s): 627b30c

Upload README.md with huggingface_hub

  1. README.md +132 -0
README.md ADDED
@@ -0,0 +1,132 @@
+ ---
+ language:
+ - rna
+ library_name: transformers
+ tags:
+ - RNA
+ - language-model
+ - 3-UTR
+ license: mit
+ ---
+
+ # UTRBERT-3mer
+
+ A BERT-base language model pre-trained on human 3' UTR sequences using 3-mer tokenization.
+ Part of the 3UTRBERT model family introduced in Yang et al. (2024).
+
+ ## Architecture
+
+ | Parameter | Value |
+ |---|---|
+ | Layers | 12 |
+ | Attention heads | 12 |
+ | Embedding dimension | 768 |
+ | Intermediate size | 3072 |
+ | Vocabulary size | 69 (5 special tokens + 64 RNA 3-mers) |
+ | Positional encoding | Learned absolute (BERT-style) |
+ | Architecture | BERT-base |
+ | Max sequence length | 512 tokens (~514 nucleotides for 3-mer) |
+
+ **Tokenization:** Raw RNA (or DNA) sequences are first converted T->U, then split into
+ overlapping 3-mers (stride 1), so a sequence of length L produces L-2 tokens. The
+ tokenizer prepends a [CLS] token and appends a [SEP] token.
+
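The tokenization rule described above can be sketched in a few lines of plain Python. This is an illustrative reimplementation, not the tokenizer shipped with the model: the function name `kmer_tokenize` is hypothetical, and uppercasing the input is an assumption that the text does not state.

```python
def kmer_tokenize(seq: str, k: int = 3) -> list[str]:
    """Split a nucleotide sequence into overlapping k-mers (stride 1).

    Hypothetical helper for illustration only; the real tokenizer is
    loaded with trust_remote_code=True and may differ in detail.
    """
    # Assumption: input is uppercased before the T->U conversion.
    rna = seq.upper().replace("T", "U")
    # A sequence of length L yields L - k + 1 k-mers (L - 2 for k = 3).
    kmers = [rna[i:i + k] for i in range(len(rna) - k + 1)]
    # [CLS] is prepended and [SEP] is appended, as described above.
    return ["[CLS]"] + kmers + ["[SEP]"]


tokens = kmer_tokenize("ATGCA")
print(tokens)  # ['[CLS]', 'AUG', 'UGC', 'GCA', '[SEP]']
```

A 5-nucleotide input gives 3 k-mers, matching the L-2 rule, plus the two special tokens.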
+ ## Pretraining
+
+ - **Objective:** Masked Language Modeling (MLM) on 3-mer tokens
+ - **Data:** Human 3' UTR sequences
+ - **Source checkpoint:** `3-new-12w-0/pytorch_model.bin` from figshare article 22847354
+
+ ### Checkpoint selection
+
+ The only publicly released pre-trained checkpoint for the 3-mer variant is `3-new-12w-0`.
+
+ ## Parity Verification
+
+ Hidden-state representations were verified to be identical (max abs diff = 0.00) to those
+ of the original `BertForMaskedLM` implementation at all 13 representation levels (embedding
+ output + 12 transformer layers), on GPU with PyTorch 2.7 / CUDA 12.6.
+ The SDPA attention path was also verified against eager attention (max abs diff < 2e-5).
+
+ ## Related Models
+
+ See the full [UTRBERT collection](https://huggingface.co/collections/Taykhoom/utrbert-PLACEHOLDER).
+
+ | Model | k-mer | Vocab size | Notes |
+ |---|---|---|---|
+ | **[UTRBERT-3mer](https://huggingface.co/Taykhoom/UTRBERT-3mer)** | 3 | 69 | This model |
+ | [UTRBERT-4mer](https://huggingface.co/Taykhoom/UTRBERT-4mer) | 4 | 261 | |
+ | [UTRBERT-5mer](https://huggingface.co/Taykhoom/UTRBERT-5mer) | 5 | 1029 | |
+ | [UTRBERT-6mer](https://huggingface.co/Taykhoom/UTRBERT-6mer) | 6 | 4101 | |
+
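The vocabulary sizes in the table are consistent with 4^k possible k-mers over the alphabet A, C, G, U plus 5 special tokens. The short check below reproduces that arithmetic. It assumes every variant uses the same five special tokens as the 3-mer model, which the table suggests but does not state. The helper name `vocab_size` is hypothetical.

```python
# Assumption: all variants share the 5 special tokens listed for the 3-mer model.
NUM_SPECIAL_TOKENS = 5
ALPHABET_SIZE = 4  # A, C, G, U


def vocab_size(k: int) -> int:
    """Number of distinct k-mers over a 4-letter alphabet plus special tokens."""
    return NUM_SPECIAL_TOKENS + ALPHABET_SIZE ** k


for k in (3, 4, 5, 6):
    print(k, vocab_size(k))
# 3 69
# 4 261
# 5 1029
# 6 4101
```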
+ ## Usage
+
+ ### Embedding generation
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModel
+
+ # trust_remote_code is needed only for the custom k-mer tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTRBERT-3mer", trust_remote_code=True)
+ model = AutoModel.from_pretrained("Taykhoom/UTRBERT-3mer")
+ model.eval()
+
+ sequences = ["AUGCAUGCAUGCAUGCAUGC", "GCGCGCGCGCGCGCGCGCGC"]
+ enc = tokenizer(sequences, return_tensors="pt", padding=True, truncation=True, max_length=512)
+
+ with torch.no_grad():
+     out = model(**enc)
+
+ cls_emb = out.last_hidden_state[:, 0, :]  # (batch, 768) -- CLS token
+ token_emb = out.last_hidden_state  # (batch, seq_len, 768)
+
+ # Intermediate layers: hidden_states[0] is the embedding output,
+ # hidden_states[i] is the output of transformer layer i
+ with torch.no_grad():
+     out_all = model(**enc, output_hidden_states=True)
+ layer6_emb = out_all.hidden_states[6]  # (batch, seq_len, 768)
+ ```
+
+ ### Fine-tuning
+
+ Standard HF conventions apply. For sequence-level tasks, use the CLS token embedding
+ as input to a classification or regression head.
+
+ ```python
+ from transformers import BertForSequenceClassification
+
+ # The encoder weights are loaded from the checkpoint;
+ # the classification head is newly initialized and must be trained
+ model = BertForSequenceClassification.from_pretrained(
+     "Taykhoom/UTRBERT-3mer",
+     num_labels=2,
+ )
+ ```
+
+ ## Implementation Notes
+
+ This is a minimal HF port using standard `BertModel` with no custom modeling code.
+ The original checkpoint (`BertForMaskedLM`) was converted by stripping the `bert.`
+ prefix and dropping the `cls.*` MLM head. `trust_remote_code=True` is required only
+ for the tokenizer (k-mer splitting), not for the model.
+
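The key renaming described above can be sketched as follows. This is a minimal illustration of the idea, not the actual conversion script, which is not shown here. The function name `convert_state_dict` is hypothetical, and the toy dictionary uses placeholder strings instead of real tensors so that the example runs without PyTorch.

```python
def convert_state_dict(mlm_state: dict) -> dict:
    """Map BertForMaskedLM-style keys to BertModel-style keys.

    Hypothetical helper: drops the MLM head (keys under `cls.`) and
    strips the leading `bert.` prefix from the remaining keys.
    """
    converted = {}
    for key, value in mlm_state.items():
        if key.startswith("cls."):
            continue  # MLM head is not part of the bare encoder
        if key.startswith("bert."):
            key = key[len("bert."):]
        converted[key] = value
    return converted


# Toy example with placeholder values standing in for weight tensors.
toy_state = {
    "bert.embeddings.word_embeddings.weight": "W_emb",
    "bert.encoder.layer.0.attention.self.query.weight": "W_q",
    "cls.predictions.bias": "b_mlm",
}
print(sorted(convert_state_dict(toy_state)))
# ['embeddings.word_embeddings.weight', 'encoder.layer.0.attention.self.query.weight']
```

With a real checkpoint, the same function could in principle be applied to the dictionary returned by `torch.load` before loading it into a `BertModel`.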
+ ## Citation
+
+ ```bibtex
+ @article{yang2024_utrbert,
+   title = {Deciphering 3'{UTR} Mediated Gene Regulation Using Interpretable Deep Representation Learning},
+   author = {Yang, Yuning and Li, Gen and Pang, Kuan and Cao, Wuxinhao and Zhang, Zhaolei and Li, Xiangtao},
+   journal = {Advanced Science},
+   volume = {11},
+   number = {39},
+   pages = {e2407013},
+   year = {2024},
+   doi = {10.1002/advs.202407013}
+ }
+ ```
+
+ ## Credits
+
+ Original model and code by Yang et al. Source: [GitHub](https://github.com/yangyn533/3UTRBERT).
+ The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
+ and reviewed manually by Taykhoom Dalal.
+
+ ## License
+
+ MIT, following the original repository.