	---
	license: other
	tags:
	- rna
	- gquad
	- g-quadruplex
	- transformer
	- genomics
	- rna-biology
	library_name: transformers
	extra_gated_fields:
	  I agree to use this model for non-commercial use ONLY: checkbox
	---

	# G4mer

	G4mer is a transformer-based RNA foundation model that identifies RNA G-quadruplexes (rG4s) from sequence input. It is fine-tuned from mRNAbert (Biociphers/mRNAbert).

	## Disclaimer

	This is the official implementation of the G4mer model as described in the manuscript:

	> Zhuang, Farica, et al. _G4mer: an RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data._ bioRxiv (2024).

	See our [Bitbucket repo](https://bitbucket.org/biociphers/g4mer) for code, data, and tutorials.

	## Model Details

	G4mer is a transformer-based model trained on transcriptome-wide RNA sequences to predict:

	- Binary classification: whether a 70-nt sequence region forms an rG4 structure

	All models use overlapping 6-mer tokenization and are trained from scratch on the human transcriptome.
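
To make the tokenization concrete, here is a minimal sketch of overlapping 6-mer splitting in plain Python. The `to_kmers` helper mirrors the one in the Usage section, and the short example sequence is made up for illustration. A sequence of length L yields L - k + 1 k-mers, so a 70-nt window becomes 65 6-mers (before any special tokens the tokenizer may add).

```python
def to_kmers(seq, k=6):
    """Split a sequence into overlapping k-mers joined by spaces."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

example = "GGGAGGGC"  # hypothetical 8-nt sequence, for illustration only
kmers = to_kmers(example)
print(kmers)  # GGGAGG GGAGGG GAGGGC

# A full 70-nt window produces 70 - 6 + 1 = 65 overlapping 6-mers.
n_tokens = len(to_kmers("G" * 70).split())
print(n_tokens)  # 65
```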

	### Variants

	| Model                          | Task              | Size |
	|--------------------------------|-------------------|------|
	| `Biociphers/g4mer`             | rG4 binary class  | ~46M |
	| `Biociphers/g4mer-subtype`     | rG4 subtype class | ~46M |
	| `Biociphers/g4mer-regression`  | rG4 strength      | ~46M |

	## Usage

	### Binary rG4 Prediction

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	# Load the tokenizer and model, then switch to inference mode
	tokenizer = AutoTokenizer.from_pretrained("Biociphers/g4mer")
	model = AutoModelForSequenceClassification.from_pretrained("Biociphers/g4mer")
	model.eval()

	sequence = "GGGAGGGCGCGTGTGGTGAGAGGAGGGAGGGAAGGAAGGCGGAGGAAGGA"  # max length: 70-nt window

	def to_kmers(seq, k=6):
	    """Split a sequence into overlapping k-mers joined by spaces."""
	    return ' '.join([seq[i:i+k] for i in range(len(seq) - k + 1)])

	kmer_sequence = to_kmers(sequence, k=6)  # Convert to 6-mers
	inputs = tokenizer(kmer_sequence, return_tensors="pt")
	with torch.no_grad():  # gradients are not needed for inference
	    outputs = model(**inputs)
	logits = outputs.logits

	# Probability of class 1 (rG4-forming)
	rG4_probability = torch.softmax(logits, dim=1)[:, 1].item()
	print(rG4_probability)
	```

	G4mer was trained on sequences of at most 70 nt. For longer sequences, we recommend scanning the input with a 70-nt sliding window and taking the maximum rG4 score across all windows.

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	# Load model and tokenizer
	tokenizer = AutoTokenizer.from_pretrained("Biociphers/g4mer")
	model = AutoModelForSequenceClassification.from_pretrained("Biociphers/g4mer")
	model.eval()

	# Define k-mer function
	def to_kmers(seq, k=6):
	    return ' '.join([seq[i:i+k] for i in range(len(seq) - k + 1)])

	# Define a long sequence (must contain only A/C/G/T)
	sequence = "GGGAGGGCGCGTGTGGTGAGAGGAGGGAGGGAAGGAAGGCGGAGGAAGGA" * 2 # ~100nt

	# Slide 70nt window with stride 1
	window_size = 70
	stride = 1
	windows = [sequence[i:i+window_size] for i in range(0, len(sequence) - window_size + 1, stride)]

	# Score each window using G4mer
	scores = []
	for w in windows:
	    kmer_seq = to_kmers(w, k=6)
	    tokens = tokenizer(kmer_seq, return_tensors="pt")
	    with torch.no_grad():
	        output = model(**tokens)
	    prob = torch.nn.functional.softmax(output.logits, dim=-1)
	    scores.append(prob[0][1].item())  # class 1 = rG4-forming

	# Final rG4 score for the long sequence
	max_score = max(scores)
	print(f"Max rG4 score across windows: {max_score:.3f}")
	```
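
The sliding-window loop assumes the input is at least 70 nt long and contains only A/C/G/T. The sketch below shows one way to prepare arbitrary input before scoring. `normalize_sequence` and `make_windows` are hypothetical helpers, not part of G4mer. Mapping U to T and keeping a short sequence as a single window are assumptions, not documented behavior.

```python
def normalize_sequence(seq):
    """Uppercase the input, map U to T, and reject any other characters."""
    seq = seq.strip().upper().replace("U", "T")
    if not seq:
        raise ValueError("Empty sequence")
    invalid = set(seq) - set("ACGT")
    if invalid:
        raise ValueError(f"Unsupported characters: {sorted(invalid)}")
    return seq

def make_windows(seq, window_size=70, stride=1):
    """Return windows of window_size; a shorter input is kept as one window."""
    if len(seq) <= window_size:
        return [seq]
    return [seq[i:i + window_size]
            for i in range(0, len(seq) - window_size + 1, stride)]

rna = "gggagggcgcgugugg" * 5  # made-up 80-nt RNA string, for illustration only
windows = make_windows(normalize_sequence(rna))
print(len(windows), len(windows[0]))  # 11 70
```

Each window can then be passed through `to_kmers` and the model as in the loop above. With a stride greater than 1, the last few nucleotides may not be covered by any window.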

	## Web Tool

	You can explore G4mer predictions interactively through our web tool:

	[G4mer Web Tool](https://tools.biociphers.org/g4mer)

	Features include:
	- RNA sequence prediction: runs `G4mer` on GPU to compute the probability that a sequence forms an rG4
	- Transcriptome-wide prediction of rG4s and subtypes
	- Variant effect annotation using gnomAD SNVs
	- Search and filter by gene, transcript, region (5′UTR, CDS, 3′UTR), and sequence context

	No installation needed — just visit and start exploring.

	## Citation - MLA

	```
	Zhuang, Farica, et al. "G4mer: An RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data." Nature Communications 16.1 (2025): 10221.
	```

	## Contact

	For questions, feedback, or discussions about G4mer, please post on the [Biociphers Google Group](https://groups.google.com/g/majiq_voila).