---
license: other
tags:
- rna
- gquad
- g-quadruplex
- transformer
- genomics
- rna-biology
library_name: transformers
extra_gated_fields:
  I agree to use this model for non-commercial use ONLY: checkbox
---

# mRNAbert

**mRNAbert** is a transformer-based RNA language model trained on millions of transcriptomic sequences from the human genome. It is used as the foundation model for downstream fine-tuning tasks in the [G4mer](https://huggingface.co/Biociphers/g4mer) project, including rG4 structure prediction and variant effect analysis.

## Model Details

- Architecture: BERT-base
- Tokenization: Overlapping 6-mers
- Pretraining data: Human transcriptome (GENCODE v40, hg38)
- Task: Masked language modeling (MLM)
- Input: RNA sequences written in the DNA alphabet (A, C, G, T)
- Max length: 512 nt
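
To make the tokenization concrete, the sketch below splits a sequence into overlapping 6-mers and counts them. The helper names `normalize_rna` and `to_kmers` are illustrative, and mapping `U` to `T` before tokenization is an assumption based on the ACGT input alphabet listed above, not documented behavior of the tokenizer.

```python
from typing import List


def normalize_rna(seq: str) -> str:
    """Uppercase the sequence and map U to T (assumed input alphabet: ACGT)."""
    return seq.strip().upper().replace("U", "T")


def to_kmers(seq: str, k: int = 6) -> List[str]:
    """Return all overlapping k-mers, sliding one nucleotide at a time."""
    if len(seq) < k:
        return []
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


seq = normalize_rna("gggagggcgcgu")
kmers = to_kmers(seq)
print(seq)         # GGGAGGGCGCGT
print(kmers[:3])   # ['GGGAGG', 'GGAGGG', 'GAGGGC']
print(len(kmers))  # 7, i.e. 12 nt - 6 + 1
```

A sequence of length `n` yields `n - 6 + 1` tokens, so a 70-nt region becomes 65 overlapping 6-mers before any special tokens are added.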

## Disclaimer

This is the official implementation of the **G4mer** model as described in the manuscript:

> Zhuang, Farica, et al. _G4mer: an RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data._ bioRxiv (2024).

See our [Bitbucket repo](https://bitbucket.org/biociphers/g4mer) for code, data, and tutorials.

## G4mer Model Details

G4mer is a transformer-based model trained on transcriptome-wide RNA sequences to predict:

- **Binary classification**: Whether a 70-nt sequence region forms an rG4 structure

All models use overlapping 6-mer tokenization and are trained from scratch on the human transcriptome.
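
Because the classifier scores 70-nt regions, a longer transcript has to be cut into windows before prediction. The following is a minimal sketch of one way to do that, assuming a plain sliding window; the function name `sliding_windows` and the 10-nt stride are hypothetical choices, not the procedure used in the manuscript.

```python
from typing import Iterator, Tuple


def sliding_windows(seq: str, size: int = 70, stride: int = 10) -> Iterator[Tuple[int, str]]:
    """Yield (start, window) pairs of fixed-size windows over seq.

    Sequences no longer than `size` yield a single window as-is.
    A trailing remainder that does not fill a whole window is skipped.
    """
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    if len(seq) <= size:
        yield 0, seq
        return
    for start in range(0, len(seq) - size + 1, stride):
        yield start, seq[start:start + size]


transcript = "ACGT" * 25  # 100-nt toy sequence
windows = list(sliding_windows(transcript))
print([start for start, _ in windows])  # [0, 10, 20, 30]
print(len(windows[0][1]))               # 70
```

Each window can then be converted to 6-mers and passed to the classifier; keeping the start offset makes it possible to map predictions back to transcript coordinates.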

### Variants

| Model                         | Task                       | Size |
|-------------------------------|----------------------------|------|
| `Biociphers/g4mer`            | rG4 binary classification  | ~46M |
| `Biociphers/g4mer-subtype`    | rG4 subtype classification | ~46M |
| `Biociphers/g4mer-regression` | rG4 strength regression    | ~46M |

## Usage

### Fine-tune

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from torch.utils.data import DataLoader, Dataset
from torch.optim import AdamW

# Example dataset: label 1 = rG4, label 0 = non-rG4
sequences = ["GGGAGGGCGCGTGTGGTGAGAGGAGGGAGGGAAGGAAGGCGGAGGAAGGA",  # rG4
             "TCTGGGAAAAGCTACTGTAAGTAGGAGCAGATTCTGGGTTTAATCGGAGG"]  # non-rG4
labels = [1, 0]

# Tokenization with overlapping 6-mers, joined by spaces
def to_kmers(seq, k=6):
    return ' '.join([seq[i:i+k] for i in range(len(seq)-k+1)])

tokenizer = AutoTokenizer.from_pretrained("Biociphers/mRNAbert")
tokenized = [tokenizer(to_kmers(seq), return_tensors='pt', padding='max_length',
                       truncation=True, max_length=512) for seq in sequences]

# Dataset class
class rG4Dataset(Dataset):
    def __init__(self, tokenized_inputs, labels):
        self.inputs = tokenized_inputs
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Drop the batch dimension added by return_tensors='pt'
        item = {key: val.squeeze(0) for key, val in self.inputs[idx].items()}
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

dataset = rG4Dataset(tokenized, labels)
loader = DataLoader(dataset, batch_size=2, shuffle=True)

# Load base model for classification
model = AutoModelForSequenceClassification.from_pretrained("Biociphers/mRNAbert", num_labels=2)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)

# Training loop (1 epoch for demo)
model.train()
for batch in loader:
    batch = {k: v.to(device) for k, v in batch.items()}
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print("Loss:", loss.item())
```
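
After fine-tuning, the classifier returns two logits per sequence, and a softmax turns them into class probabilities. The sketch below uses plain Python and made-up logits so it runs without a model; the logit values and the 0.5 decision threshold are illustrative assumptions, and index 1 is taken to be the rG4 class to match the labels in the fine-tuning example.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for two sequences, ordered as [non-rG4, rG4]
example_logits = [[-1.2, 2.3], [0.8, -0.5]]
for row in example_logits:
    p_rg4 = softmax(row)[1]  # index 1 = rG4
    label = "rG4" if p_rg4 >= 0.5 else "non-rG4"
    print(f"P(rG4) = {p_rg4:.3f} -> {label}")
```

With a real model, the same step is usually done on tensors, for example `torch.softmax(outputs.logits, dim=-1)` inside a `torch.no_grad()` block after calling `model.eval()`.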

## Web Tool

The `mRNAbert` model was fine-tuned to create **[G4mer](https://huggingface.co/Biociphers/g4mer)**, a state-of-the-art model for predicting **RNA G-quadruplexes** and their subtypes.

You can explore G4mer predictions interactively through our web tool:

**[G4mer Web Tool](https://tools.biociphers.org/g4mer)**

Features include:

- **RNA sequence prediction** (binary rG4-forming vs. non-forming)
- **Transcriptome-wide prediction** of rG4s and subtypes
- **Variant effect annotation** using gnomAD SNVs
- **Search and filter** by gene, transcript, region (5′UTR, CDS, 3′UTR), and sequence context

No installation is needed; just visit the site and start exploring.

## Citation - MLA

```
Zhuang, Farica, et al. "G4mer: An RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data." Nature Communications 16.1 (2025): 10221.
```

## Contact

For questions, feedback, or discussions about G4mer, please post on the [Biociphers Google Group](https://groups.google.com/g/majiq_voila).