---
license: other
tags:
- rna
- gquad
- g-quadruplex
- transformer
- genomics
- rna-biology
library_name: transformers
extra_gated_fields:
  I agree to use this model for non-commercial use ONLY: checkbox
---

# G4mer Subtype

**G4mer-Subtype** is a transformer-based RNA language model that predicts RNA G-quadruplex (rG4) **subtypes** from sequence input. It is fine-tuned from [`Biociphers/mRNAbert`](https://huggingface.co/Biociphers/mRNAbert) and trained on 70-nt sequences labeled with experimentally derived rG4 subtype categories.

## Disclaimer

This is the official subtype classification model from the **G4mer** framework, as described in the manuscript:

> Zhuang, Farica, et al. _G4mer: an RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data._ bioRxiv (2024).

See our [Bitbucket repo](https://bitbucket.org/biociphers/g4mer) for code, data, and tutorials.

## Model Details

G4mer-Subtype is trained to classify each 70-nt RNA sequence into one of **eight rG4 subtypes**, each representing a distinct sequence/structure motif observed in experimental rG4 data.

### Subtype Mapping

| Class Index | Subtype Description            |
|-------------|--------------------------------|
| 0           | G≥40%                          |
| 1           | Unknown                        |
| 2           | Bulges                         |
| 3           | Canonical                      |
| 4           | Long loop                      |
| 5           | Potential G-quadruplex & G≥40% |
| 6           | Potential G-triplex & G≥40%    |
| 7           | Two-quartet                    |

All models use overlapping 6-mer tokenization and were fine-tuned on human transcriptome-derived sequences with subtype labels.

### Variants

| Model                         | Task                 | Size |
|-------------------------------|----------------------|------|
| `Biociphers/g4mer`            | rG4 binary class     | ~46M |
| `Biociphers/g4mer-subtype`    | rG4 subtype class    | ~46M |
| `Biociphers/g4mer-regression` | rG4 strength (score) | ~46M |

## Usage

### Predict rG4 Subtypes

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load binary rG4 model and tokenizer
binary_tokenizer = AutoTokenizer.from_pretrained("biociphers/g4mer")
binary_model = AutoModelForSequenceClassification.from_pretrained("biociphers/g4mer")
binary_model.eval()

# Load subtype model and tokenizer
subtype_tokenizer = AutoTokenizer.from_pretrained("biociphers/g4mer-subtype")
subtype_model = AutoModelForSequenceClassification.from_pretrained("biociphers/g4mer-subtype")
subtype_model.eval()

# Input sequence (max 70 nt)
sequence = "GGGAGGGCGCGTGTGGTGAGAGGAGGGAGGGAAGGAAGGCGGAGGAAGGA"

# Convert to space-separated overlapping 6-mers
def to_kmers(seq, k=6):
    return ' '.join([seq[i:i+k] for i in range(len(seq) - k + 1)])

kmer_sequence = to_kmers(sequence)

# Predict the rG4 probability (class index 1 of the binary model's output)
binary_inputs = binary_tokenizer(kmer_sequence, return_tensors="pt")
with torch.no_grad():
    binary_output = binary_model(**binary_inputs)
    rG4_prob = torch.nn.functional.softmax(binary_output.logits, dim=-1)[0][1].item()

# Classify the subtype only if the sequence is confidently predicted to be an rG4.
# Here the threshold is set to 0.7 (moderately confident).
if rG4_prob > 0.7:
    subtype_inputs = subtype_tokenizer(kmer_sequence, return_tensors="pt")
    with torch.no_grad():
        subtype_output = subtype_model(**subtype_inputs)
        subtype_probs = torch.nn.functional.softmax(subtype_output.logits, dim=-1)
        predicted_class = torch.argmax(subtype_probs, dim=-1).item()

    # Class index -> subtype label (see the Subtype Mapping table)
    subtype_mapping = {
        0: "G≥40%",
        1: "Unknown",
        2: "Bulges",
        3: "Canonical",
        4: "Long loop",
        5: "Potential G-quadruplex & G≥40%",
        6: "Potential G-triplex & G≥40%",
        7: "Two-quartet"
    }
    print(f"Predicted subtype: {subtype_mapping[predicted_class]}")
else:
    print(f"Not a confident rG4 (score = {rG4_prob:.2f}); skipping subtype classification.")
```
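
The snippet above reports only the single most likely subtype. When several subtypes have similar probabilities, it can help to look at a ranked list instead. The following is a minimal sketch of such post-processing, not part of the G4mer code base: the helper name `rank_subtypes` and the `top_k` parameter are hypothetical, and the probabilities in the example are made up for illustration. It works on a plain Python list, which can be obtained from the tensor above with `subtype_probs[0].tolist()`.

```python
# Class-index-to-label mapping, copied from the Subtype Mapping table.
SUBTYPE_MAPPING = {
    0: "G≥40%",
    1: "Unknown",
    2: "Bulges",
    3: "Canonical",
    4: "Long loop",
    5: "Potential G-quadruplex & G≥40%",
    6: "Potential G-triplex & G≥40%",
    7: "Two-quartet",
}


def rank_subtypes(probs, top_k=3):
    """Return the top_k (label, probability) pairs, highest probability first.

    `probs` is a plain list with one probability per subtype class.
    This helper is hypothetical and only illustrates the post-processing.
    """
    if len(probs) != len(SUBTYPE_MAPPING):
        raise ValueError(f"expected {len(SUBTYPE_MAPPING)} probabilities, got {len(probs)}")
    # Pair each probability with its class index, then sort by probability.
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(SUBTYPE_MAPPING[idx], p) for idx, p in ranked[:top_k]]


# Made-up probabilities for illustration only (not real model output).
example_probs = [0.05, 0.02, 0.10, 0.60, 0.08, 0.05, 0.03, 0.07]
for label, p in rank_subtypes(example_probs):
    print(f"{label}: {p:.2f}")
```

With real model output, the call would be `rank_subtypes(subtype_probs[0].tolist())`.
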

## Training data

The model was trained on experimentally validated rG4 regions annotated with subtype labels based on loop lengths, bulges, guanine content, and overall folding potential.
Each 70-nt training window was associated with one of the eight subtype labels shown above.

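
Because the model operates on 70-nt windows, a longer transcript has to be split into windows before prediction. The sketch below shows one way to do that. It is an illustration under stated assumptions, not the authors' preprocessing pipeline: the function name `sliding_windows`, the stride of 10 nt, and the handling of short or trailing sequence are all hypothetical choices.

```python
def sliding_windows(seq, window=70, stride=10):
    """Yield (start, subsequence) pairs covering `seq` with fixed-length windows.

    Assumptions (not taken from the G4mer paper or code):
    - sequences no longer than `window` are yielded once, unpadded;
    - a final window flush with the 3' end is added if the stride skips it.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    seq = seq.upper()
    if len(seq) <= window:
        yield 0, seq
        return
    last_start = len(seq) - window
    for start in range(0, last_start + 1, stride):
        yield start, seq[start:start + window]
    # Make sure the end of the sequence is covered.
    if last_start % stride != 0:
        yield last_start, seq[last_start:]


# Illustrative input: the 50-nt example sequence repeated three times.
transcript = "GGGAGGGCGCGTGTGGTGAGAGGAGGGAGGGAAGGAAGGCGGAGGAAGGA" * 3
windows = list(sliding_windows(transcript))
print(f"{len(transcript)} nt -> {len(windows)} windows of up to 70 nt")
```

Each window can then be converted to overlapping 6-mers and passed to the models as in the Usage section. A 70-nt window yields 70 - 6 + 1 = 65 overlapping 6-mers.
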

## Intended use

G4mer-Subtype is intended for researchers studying:

- RNA G-quadruplex structural diversity
- Subtype-specific regulatory roles in the transcriptome
- Effects of sequence variation on rG4 formation patterns
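
For the last use case, one simple setup is to score a reference window and the same window carrying a single-nucleotide variant, then compare the predictions. The sketch below only builds the two sequences. It is a hypothetical helper, not the variant-effect method used in the G4mer manuscript: the name `apply_snv`, the 0-based `offset` convention, and the example variant are assumptions made for illustration.

```python
def apply_snv(window, offset, ref, alt):
    """Return a copy of `window` with the base at `offset` changed from `ref` to `alt`.

    `offset` is the 0-based position of the variant inside the window.
    Raises ValueError if the window does not carry `ref` at that position.
    """
    window, ref, alt = window.upper(), ref.upper(), alt.upper()
    if len(ref) != 1 or len(alt) != 1:
        raise ValueError("only single-nucleotide variants are supported")
    if not 0 <= offset < len(window):
        raise IndexError("offset lies outside the window")
    if window[offset] != ref:
        raise ValueError(f"expected {ref} at offset {offset}, found {window[offset]}")
    return window[:offset] + alt + window[offset + 1:]


# Illustrative example: a made-up G>A change at the second base of the window.
ref_window = "GGGAGGGCGCGTGTGGTGAGAGGAGGGAGGGAAGGAAGGCGGAGGAAGGA"
alt_window = apply_snv(ref_window, offset=1, ref="G", alt="A")
print(ref_window[:10], "->", alt_window[:10])
```

Both windows can then be run through the workflow in the Usage section, and the resulting rG4 probabilities or subtype predictions compared. How large a difference is meaningful is not addressed by this sketch.
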

## Web Tool

You can explore G4mer predictions interactively through our web tool:

**[G4mer Web Tool](https://tools.biociphers.org/g4mer)**

Features include:

- **RNA sequence prediction**, which runs `G4mer` on GPU to compute the probability of rG4 formation
- **Transcriptome-wide prediction** of rG4s and subtypes
- **Variant effect annotation** using gnomAD SNVs
- **Search and filter** by gene, transcript, region (5′UTR, CDS, 3′UTR), and sequence context

No installation is needed; just visit and start exploring.

## Citation - MLA

```
Zhuang, Farica, et al. "G4mer: An RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data." Nature Communications 16.1 (2025): 10221.
```

## Contact

For questions, feedback, or discussions about G4mer, please post on the [Biociphers Google Group](https://groups.google.com/g/majiq_voila).