---
license: other
tags:
- rna
- gquad
- g-quadruplex
- transformer
- genomics
- rna-biology
library_name: transformers
extra_gated_fields:
  I agree to use this model for non-commercial use ONLY: checkbox
---

# G4mer

**G4mer** is a transformer-based RNA foundation model that identifies RNA G-quadruplexes (rG4s) from sequence input. It is fine-tuned from mRNAbert (`Biociphers/mRNAbert`).

## Disclaimer

This is the official implementation of the **G4mer** model as described in the manuscript:

> Zhuang, Farica, et al. _G4mer: an RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data._ bioRxiv (2024).

See our [Bitbucket repo](https://bitbucket.org/biociphers/g4mer) for code, data, and tutorials.

## Model Details

G4mer is a transformer-based model trained on transcriptome-wide RNA sequences to predict:

- **Binary classification**: Whether a 70-nt sequence region forms an rG4 structure

All models use overlapping 6-mer tokenization and are trained from scratch on the human transcriptome.

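To make the tokenization concrete, here is a minimal sketch of overlapping 6-mer splitting. It follows the same logic as the `to_kmers` helper in the Usage section, but returns a list so the k-mers are easy to inspect. A sequence of length L yields L - k + 1 k-mers, so a full 70-nt window gives 65 of them. Any special tokens the tokenizer adds on top are not counted here; whether it adds them is an assumption about the tokenizer.

```python
def to_kmers(seq: str, k: int = 6) -> list[str]:
    """Split a sequence into overlapping k-mers with a step of 1."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

example = "GGGAGGGCGC"          # 10-nt toy sequence
kmers = to_kmers(example)
print(kmers)                    # ['GGGAGG', 'GGAGGG', 'GAGGGC', 'AGGGCG', 'GGGCGC']
print(len(to_kmers("A" * 70)))  # 65 k-mers for a full 70-nt window
```

Sequences shorter than 6 nt produce no k-mers at all, so very short inputs cannot be tokenized this way.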
### Variants

| Model                         | Task                       | Size |
|-------------------------------|----------------------------|------|
| `Biociphers/g4mer`            | rG4 binary classification  | ~46M |
| `Biociphers/g4mer-subtype`    | rG4 subtype classification | ~46M |
| `Biociphers/g4mer-regression` | rG4 strength regression    | ~46M |

## Usage

### Binary rG4 Prediction

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("Biociphers/g4mer")
model = AutoModelForSequenceClassification.from_pretrained("Biociphers/g4mer")
model.eval()

sequence = "GGGAGGGCGCGTGTGGTGAGAGGAGGGAGGGAAGGAAGGCGGAGGAAGGA"  # at most 70 nt per input

def to_kmers(seq, k=6):
    """Split a sequence into space-separated overlapping k-mers."""
    return ' '.join([seq[i:i+k] for i in range(len(seq) - k + 1)])

kmer_seq = to_kmers(sequence, k=6)  # Convert to 6-mers
inputs = tokenizer(kmer_seq, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits

# Class 1 = rG4-forming
rG4_probability = torch.softmax(logits, dim=1)[:, 1].item()
print(rG4_probability)
```

G4mer was trained on sequences of at most 70 nt. For longer sequences, we recommend scanning the input with a sliding 70-nt window and taking the maximum rG4 score across all windows.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Biociphers/g4mer")
model = AutoModelForSequenceClassification.from_pretrained("Biociphers/g4mer")
model.eval()

# Define k-mer function
def to_kmers(seq, k=6):
    return ' '.join([seq[i:i+k] for i in range(len(seq) - k + 1)])

# Define a long sequence (must contain only A/C/G/T)
sequence = "GGGAGGGCGCGTGTGGTGAGAGGAGGGAGGGAAGGAAGGCGGAGGAAGGA" * 2  # 100 nt

# Slide 70nt window with stride 1
window_size = 70
stride = 1
windows = [sequence[i:i+window_size] for i in range(0, len(sequence) - window_size + 1, stride)]

# Score each window using G4mer
scores = []
for w in windows:
    kmer_seq = to_kmers(w, k=6)
    tokens = tokenizer(kmer_seq, return_tensors="pt")
    with torch.no_grad():
        output = model(**tokens)
    prob = torch.nn.functional.softmax(output.logits, dim=-1)
    scores.append(prob[0][1].item())  # class 1 = rG4-forming

# Final rG4 score for the long sequence
max_score = max(scores)
print(f"Max rG4 score across windows: {max_score:.3f}")
```

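The sliding-window recipe above can be factored into small reusable helpers. The sketch below is an illustration only and is not part of the G4mer codebase: `normalize_sequence`, `sliding_windows`, and `max_window_score` are hypothetical names, and `score_fn` stands in for any callable that maps one window to an rG4 probability, such as a wrapper around the model call shown above. Converting U to T assumes the tokenizer expects the A/C/G/T alphabet, as the comment in the example suggests. A toy scorer (G fraction) replaces the model so the snippet runs without downloading weights.

```python
from typing import Callable, Iterator, Tuple

def normalize_sequence(seq: str) -> str:
    """Uppercase, map U to T (assumed A/C/G/T vocabulary), and validate."""
    seq = seq.strip().upper().replace("U", "T")
    if not seq:
        raise ValueError("Empty sequence")
    invalid = set(seq) - set("ACGT")
    if invalid:
        raise ValueError(f"Unsupported characters: {sorted(invalid)}")
    return seq

def sliding_windows(seq: str, size: int = 70, stride: int = 1) -> Iterator[Tuple[int, str]]:
    """Yield (start, window) pairs; a sequence no longer than `size` is yielded whole."""
    if len(seq) <= size:
        yield 0, seq
        return
    for start in range(0, len(seq) - size + 1, stride):
        yield start, seq[start:start + size]

def max_window_score(seq: str, score_fn: Callable[[str], float],
                     size: int = 70, stride: int = 1) -> Tuple[int, float]:
    """Return (start, score) of the highest-scoring window."""
    clean = normalize_sequence(seq)
    scored = [(start, score_fn(w)) for start, w in sliding_windows(clean, size, stride)]
    return max(scored, key=lambda item: item[1])

# Toy scorer for demonstration only: fraction of G in the window.
def g_fraction(window: str) -> float:
    return window.count("G") / len(window)

start, score = max_window_score("acgu" * 20 + "G" * 70, g_fraction)
print(start, score)  # 80 1.0
```

Returning the start position alongside the score shows where in the sequence the strongest predicted rG4 lies. With a stride greater than 1, the last few nucleotides may not be covered by any window, so a stride of 1 is the safest default.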
## Web Tool

You can explore G4mer predictions interactively through our web tool:

**[G4mer Web Tool](https://tools.biociphers.org/g4mer)**

Features include:

- **RNA sequence prediction**: runs `G4mer` on a GPU to compute the probability that a sequence forms an rG4
- **Transcriptome-wide prediction** of rG4s and subtypes
- **Variant effect annotation** using gnomAD SNVs
- **Search and filter** by gene, transcript, region (5′UTR, CDS, 3′UTR), and sequence context

No installation is needed; just visit and start exploring.

## Citation - MLA

```
Zhuang, Farica, et al. "G4mer: An RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data." Nature Communications 16.1 (2025): 10221.
```

## Contact

For questions, feedback, or discussions about G4mer, please post on the [Biociphers Google Group](https://groups.google.com/g/majiq_voila).