	---
	license: other
	tags:
	- rna
	- gquad
	- g-quadruplex
	- transformer
	- genomics
	- rna-biology
	library_name: transformers
	extra_gated_fields:
	  I agree to use this model for non-commercial use ONLY: checkbox
	---

	# G4mer Subtype

	G4mer-Subtype is a transformer-based RNA language model that predicts RNA G-quadruplex (rG4) subtypes from sequence input. It is fine-tuned from [`Biociphers/mRNAbert`](https://huggingface.co/Biociphers/mRNAbert) and trained on 70-nt sequences labeled with experimentally derived rG4 subtype categories.

	## Disclaimer

	This is the official subtype classification model from the G4mer framework as described in the manuscript:

	> Zhuang, Farica, et al. _G4mer: an RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data._ bioRxiv (2024).

	See our [Bitbucket repo](https://bitbucket.org/biociphers/g4mer) for code, data, and tutorials.

	## Model Details

	G4mer-Subtype is trained to classify each 70-nt RNA sequence into one of eight rG4 subtypes, each representing a distinct sequence/structure motif observed in experimental rG4 data.

	### Subtype Mapping

	| Class Index | Subtype Description            |
	|-------------|--------------------------------|
	| 0           | G≥40%                          |
	| 1           | Unknown                        |
	| 2           | Bulges                         |
	| 3           | Canonical                      |
	| 4           | Long loop                      |
	| 5           | Potential G-quadruplex & G≥40% |
	| 6           | Potential G-triplex & G≥40%    |
	| 7           | Two-quartet                    |
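
When only the top class is reported, information about close runner-up subtypes is lost. The sketch below shows one way to turn the eight logits into a ranked list of subtype labels. It is an illustration only: `SUBTYPES`, `rank_subtypes`, and the example logits are hypothetical names and values introduced here, and the softmax is written in plain Python so the snippet runs without loading the model.

```python
import math

# Class index -> label, in the order given in the subtype table.
SUBTYPES = [
    "G≥40%",
    "Unknown",
    "Bulges",
    "Canonical",
    "Long loop",
    "Potential G-quadruplex & G≥40%",
    "Potential G-triplex & G≥40%",
    "Two-quartet",
]


def rank_subtypes(logits, top_k=3):
    """Apply a softmax to the logits and return the top_k (label, probability) pairs."""
    if len(logits) != len(SUBTYPES):
        raise ValueError(f"expected {len(SUBTYPES)} logits, got {len(logits)}")
    largest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(SUBTYPES, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Made-up logits for illustration; real values would come from the model output.
example_logits = [0.1, -1.0, 0.3, 2.5, 0.0, -0.5, -0.7, 1.2]
for label, prob in rank_subtypes(example_logits):
    print(f"{label}: {prob:.3f}")
```

With a real model output, the logits for one sequence could be passed in as `subtype_output.logits[0].tolist()`.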

	All models use overlapping 6-mer tokenization and were fine-tuned on human transcriptome-derived sequences with subtype labels.
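
To make "overlapping 6-mer tokenization" concrete, here is a minimal sketch that splits a sequence into 6-mers with a stride of 1, so a sequence of length L gives L - 6 + 1 k-mers. The function name `to_kmers` mirrors the usage example in this card. The counts ignore any special tokens the tokenizer may add, which is an assumption on our part.

```python
def to_kmers(seq, k=6):
    """Return all overlapping k-mers of seq, moving one nucleotide at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


short_kmers = to_kmers("GGGAGGGCGC")       # 10-nt toy sequence
full_window_kmers = to_kmers("G" * 70)     # a full 70-nt window

print(short_kmers)
print(len(short_kmers), len(full_window_kmers))
```

A full 70-nt window therefore produces 65 k-mers, and a sequence shorter than 6 nt produces none.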

	### Variants

	| Model                         | Task                  | Size |
	|-------------------------------|-----------------------|------|
	| `Biociphers/g4mer`            | rG4 binary class      | ~46M |
	| `Biociphers/g4mer-subtype`    | rG4 subtype class     | ~46M |
	| `Biociphers/g4mer-regression` | rG4 strength (score)  | ~46M |

	## Usage

	### Predict rG4 Subtypes

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	# Load binary rG4 model and tokenizer
	binary_tokenizer = AutoTokenizer.from_pretrained("biociphers/g4mer")
	binary_model = AutoModelForSequenceClassification.from_pretrained("biociphers/g4mer")
	binary_model.eval()

	# Load subtype model and tokenizer
	subtype_tokenizer = AutoTokenizer.from_pretrained("biociphers/g4mer-subtype")
	subtype_model = AutoModelForSequenceClassification.from_pretrained("biociphers/g4mer-subtype")
	subtype_model.eval()

	# Input sequence (max 70 nt)
	sequence = "GGGAGGGCGCGTGTGGTGAGAGGAGGGAGGGAAGGAAGGCGGAGGAAGGA"

	# Convert to space-separated 6-mers
	def to_kmers(seq, k=6):
	    return ' '.join([seq[i:i+k] for i in range(len(seq) - k + 1)])

	kmer_sequence = to_kmers(sequence)

	# Predict the rG4 probability with the binary model
	binary_inputs = binary_tokenizer(kmer_sequence, return_tensors="pt")
	with torch.no_grad():
	    binary_output = binary_model(**binary_inputs)
	    rG4_prob = torch.nn.functional.softmax(binary_output.logits, dim=-1)[0][1].item()

	# Classify the subtype only if the sequence is confidently predicted to be an rG4.
	# The 0.7 threshold used here corresponds to moderate confidence.
	if rG4_prob > 0.7:
	    subtype_inputs = subtype_tokenizer(kmer_sequence, return_tensors="pt")
	    with torch.no_grad():
	        subtype_output = subtype_model(**subtype_inputs)
	        subtype_probs = torch.nn.functional.softmax(subtype_output.logits, dim=-1)
	        predicted_class = torch.argmax(subtype_probs, dim=-1).item()

	    subtype_mapping = {
	        0: "G≥40%",
	        1: "Unknown",
	        2: "Bulges",
	        3: "Canonical",
	        4: "Long loop",
	        5: "Potential G-quadruplex & G≥40%",
	        6: "Potential G-triplex & G≥40%",
	        7: "Two-quartet"
	    }
	    print(f"Predicted subtype: {subtype_mapping[predicted_class]}")
	else:
	    print(f"Not a confident rG4 (score = {rG4_prob:.2f}); skipping subtype classification.")
	```
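
The example above takes a single sequence of at most 70 nt. For longer transcripts, one possible approach is to split the sequence into overlapping 70-nt windows and score each window separately. This is an assumption on our part, not a procedure described in this model card; the helper name `sliding_windows` and the stride of 10 are hypothetical choices.

```python
def sliding_windows(seq, window=70, stride=10):
    """Yield (start, subsequence) pairs that cover seq with fixed-size windows.

    A sequence no longer than `window` is returned whole as a single window.
    """
    if len(seq) <= window:
        yield 0, seq
        return
    last_start = len(seq) - window
    for start in range(0, last_start + 1, stride):
        yield start, seq[start:start + window]
    # Add one final window flush with the 3' end if the stride skipped it.
    if last_start % stride != 0:
        yield last_start, seq[last_start:]


transcript = "ACGT" * 25  # 100-nt toy sequence
windows = list(sliding_windows(transcript))
for start, sub in windows:
    print(start, len(sub))
```

Each window can then be converted to 6-mers and passed through the binary and subtype models as in the usage example. How per-window predictions should be aggregated is left open here.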

	## Training data

	The model was trained on experimentally validated rG4 regions annotated with subtype labels based on loop lengths, bulges, guanine content, and overall folding potential.
	Each 70-nt training window was associated with one of the eight subtype labels shown above.
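
The exact labeling rules are defined in the G4mer manuscript and are not reproduced in this card. As a rough illustration of two of the features mentioned above (guanine content and loop length), the sketch below computes the G fraction of a sequence and tests for the widely used canonical pattern of four runs of at least three guanines separated by loops of 1 to 7 nt. The names `g_fraction`, `has_canonical_motif`, and `CANONICAL` are hypothetical, and this is not the authors' labeling pipeline.

```python
import re

# Commonly used canonical G-quadruplex pattern (an assumption here):
# four G-runs of length >= 3 separated by three loops of 1-7 nucleotides.
CANONICAL = re.compile(
    r"G{3,}[ACGTU]{1,7}G{3,}[ACGTU]{1,7}G{3,}[ACGTU]{1,7}G{3,}"
)


def g_fraction(seq):
    """Return the fraction of guanines in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    return seq.count("G") / len(seq) if seq else 0.0


def has_canonical_motif(seq):
    """Return True if seq contains the canonical pattern defined above."""
    return CANONICAL.search(seq.upper()) is not None


toy = "AAGGGAGGGAGGGAGGGAA"
print(f"G fraction: {g_fraction(toy):.2f}")
print(f"G >= 40%: {g_fraction(toy) >= 0.40}")
print(f"Canonical motif: {has_canonical_motif(toy)}")
```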

	## Intended use

	G4mer-Subtype is intended for researchers studying:

	- RNA G-quadruplex structural diversity
	- Subtype-specific regulatory roles in the transcriptome
	- Effects of sequence variation on rG4 formation patterns
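
For the last use case, a common pattern is to score the reference and alternate alleles of a variant and compare the two. The sketch below shows that comparison under stated assumptions: `apply_snv`, `variant_delta`, and `toy_score` are hypothetical helpers, positions are 0-based, and `toy_score` is only a stand-in so the snippet runs without the model. In practice the scoring function would wrap a model call, for example the rG4 probability from the binary model.

```python
def apply_snv(seq, pos, ref, alt):
    """Return seq with the nucleotide at 0-based position pos changed from ref to alt."""
    if seq[pos] != ref:
        raise ValueError(f"expected {ref!r} at position {pos}, found {seq[pos]!r}")
    return seq[:pos] + alt + seq[pos + 1:]


def variant_delta(score_fn, seq, pos, ref, alt):
    """Return score(alternate allele) minus score(reference allele)."""
    return score_fn(apply_snv(seq, pos, ref, alt)) - score_fn(seq)


def toy_score(seq):
    """Stand-in scorer for illustration only: the fraction of G in the sequence."""
    return seq.count("G") / len(seq)


reference = "GGGAGGGAGGGAGGG"
delta = variant_delta(toy_score, reference, 1, "G", "A")
print(f"Score change for G>A at position 1: {delta:+.3f}")
```

A negative value means the alternate allele scores lower than the reference under the chosen scoring function.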

	## Web Tool

	You can explore G4mer predictions interactively through our web tool:

	[G4mer Web Tool](https://tools.biociphers.org/g4mer)

	Features include:
	- RNA sequence prediction: runs `G4mer` on GPU to compute the probability that a sequence forms an rG4
	- Transcriptome-wide prediction of rG4s and subtypes
	- Variant effect annotation using gnomAD SNVs
	- Search and filter by gene, transcript, region (5′UTR, CDS, 3′UTR), and sequence context

	No installation is needed; just visit the site and start exploring.

	## Citation - MLA

	```
	Zhuang, Farica, et al. "G4mer: An RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data." Nature Communications 16.1 (2025): 10221.
	```

	## Contact

	For questions, feedback, or discussions about G4mer, please post on the [Biociphers Google Group](https://groups.google.com/g/majiq_voila).