	---
	license: other
	tags:
	- rna
	- gquad
	- g-quadruplex
	- transformer
	- genomics
	- rna-biology
	library_name: transformers
	extra_gated_fields:
	  I agree to use this model for non-commercial use ONLY: checkbox
	---

	# mRNAbert

	mRNAbert is a transformer-based RNA language model trained on millions of transcriptomic sequences from the human genome. It is used as the foundation model for downstream fine-tuning tasks in the [G4mer](https://huggingface.co/Biociphers/g4mer) project, including rG4 structure prediction and variant effect analysis.

	## Model Details

	- Architecture: BERT-base
	- Tokenization: Overlapping 6-mers
	- Pretraining data: Human transcriptome (GENCODE v40, hg38)
	- Task: Masked language modeling (MLM)
	- Input: RNA sequences written in the DNA alphabet (A, C, G, T)
	- Max length: 512 nt
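To make the tokenization and input conventions above concrete, here is a minimal sketch in plain Python. `to_kmers` mirrors the helper used in the fine-tuning example in this README. `normalize_rna` and `num_kmers` are hypothetical helpers that are not part of the repository; mapping U to T is an assumption based on the ACGT input alphabet listed above.

```python
def normalize_rna(seq: str) -> str:
    """Uppercase the sequence and map U to T.

    Assumed preprocessing to match the ACGT alphabet; not confirmed by the authors.
    """
    return seq.upper().replace("U", "T")


def to_kmers(seq: str, k: int = 6) -> str:
    """Split a sequence into overlapping k-mers (stride 1), joined by spaces."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))


def num_kmers(seq_len: int, k: int = 6) -> int:
    """Number of overlapping k-mers in a sequence of length seq_len."""
    return max(seq_len - k + 1, 0)


seq = normalize_rna("gggagggcgcgu")
print(to_kmers(seq))   # GGGAGG GGAGGG GAGGGC AGGGCG GGGCGC GGCGCG GCGCGT
print(num_kmers(70))   # 65
```

A 70-nt window therefore yields 65 overlapping 6-mer tokens, plus any special tokens the tokenizer adds.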

	## Disclaimer

	This is the official implementation of the G4mer model as described in the manuscript:

	> Zhuang, Farica, et al. _G4mer: an RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data._ bioRxiv (2024).

	See our [Bitbucket repo](https://bitbucket.org/biociphers/g4mer) for code, data, and tutorials.

	## G4mer Model Details

	G4mer is a transformer-based model trained on transcriptome-wide RNA sequences to predict:

	- Binary classification: whether a 70-nt sequence region forms an rG4 structure

	All models use overlapping 6-mer tokenization and are trained from scratch on the human transcriptome.
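Because the classifier scores 70-nt regions, a longer transcript has to be cut into windows before prediction. The sketch below shows one way to do that. The function name `sliding_windows` and the stride of 10 are illustrative assumptions, not the scanning scheme used in the manuscript.

```python
from typing import Iterator, Tuple


def sliding_windows(seq: str, window: int = 70, stride: int = 10) -> Iterator[Tuple[int, str]]:
    """Yield (start, subsequence) for every full-length window over seq.

    The stride is an arbitrary choice for illustration.
    Sequences shorter than `window` yield nothing.
    """
    for start in range(0, len(seq) - window + 1, stride):
        yield start, seq[start:start + window]


transcript = "ACGT" * 25  # 100-nt toy sequence
windows = list(sliding_windows(transcript))
print([start for start, _ in windows])  # [0, 10, 20, 30]
print(len(windows[0][1]))               # 70
```

Each window can then be converted to 6-mers and passed to the classifier.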

	### Variants

	| Model                          | Task                       | Size |
	|--------------------------------|----------------------------|------|
	| `Biociphers/g4mer`             | rG4 binary classification  | ~46M |
	| `Biociphers/g4mer-subtype`     | rG4 subtype classification | ~46M |
	| `Biociphers/g4mer-regression`  | rG4 strength               | ~46M |
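The following is a hedged sketch of how one of the fine-tuned checkpoints above might be used for inference. It rests on several assumptions that this README does not confirm: that the checkpoint loads with `AutoModelForSequenceClassification` (as the base model does in the fine-tuning example), that a compatible tokenizer is bundled with it, and that class index 1 means "rG4", following the label convention of the fine-tuning example. `predict_rg4` and `softmax` are hypothetical helpers. The models are gated, so downloading them may require accepting the license and authenticating with the Hub.

```python
import math


def to_kmers(seq, k=6):
    """Overlapping 6-mer tokenization, as in the fine-tuning example."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_rg4(sequence, model_id="Biociphers/g4mer"):
    """Return the assumed probability that `sequence` forms an rG4.

    Heavy dependencies are imported lazily so the helpers above stay usable
    without torch or transformers installed.
    """
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    inputs = tokenizer(to_kmers(sequence), return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    # Assumption: index 1 is the rG4 class.
    return softmax(logits)[1]
```

Calling `predict_rg4(seq)` on a 70-nt sequence would return a value between 0 and 1 under these assumptions. For `Biociphers/g4mer-regression` the output is presumably a single score rather than class logits, so the softmax step would not apply.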

	## Usage

	### Fine-tune

	```python
	import torch
	from torch.optim import AdamW
	from torch.utils.data import DataLoader, Dataset
	from transformers import AutoModelForSequenceClassification, AutoTokenizer

	# Toy dataset: one rG4-forming sequence (label 1) and one non-rG4 sequence (label 0)
	sequences = [
	    "GGGAGGGCGCGTGTGGTGAGAGGAGGGAGGGAAGGAAGGCGGAGGAAGGA",  # rG4
	    "TCTGGGAAAAGCTACTGTAAGTAGGAGCAGATTCTGGGTTTAATCGGAGG",  # non-rG4
	]
	labels = [1, 0]

	# Split a sequence into space-separated overlapping 6-mers
	def to_kmers(seq, k=6):
	    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

	tokenizer = AutoTokenizer.from_pretrained("Biociphers/mRNAbert")
	# Tokenize each sequence separately; every tensor has shape (1, 512)
	tokenized = [
	    tokenizer(to_kmers(seq), return_tensors="pt", padding="max_length",
	              truncation=True, max_length=512)
	    for seq in sequences
	]

	# Dataset that pairs tokenized inputs with their labels
	class rG4Dataset(Dataset):
	    def __init__(self, tokenized_inputs, labels):
	        self.inputs = tokenized_inputs
	        self.labels = labels

	    def __len__(self):
	        return len(self.labels)

	    def __getitem__(self, idx):
	        # Drop the leading batch dimension so the DataLoader can re-batch
	        item = {key: val.squeeze(0) for key, val in self.inputs[idx].items()}
	        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.long)
	        return item

	dataset = rG4Dataset(tokenized, labels)
	loader = DataLoader(dataset, batch_size=2, shuffle=True)

	# Load the base model with a 2-class classification head
	model = AutoModelForSequenceClassification.from_pretrained("Biociphers/mRNAbert", num_labels=2)
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
	model.to(device)

	# Optimizer
	optimizer = AdamW(model.parameters(), lr=2e-5)

	# Training loop (1 epoch for demo)
	model.train()
	for batch in loader:
	    batch = {k: v.to(device) for k, v in batch.items()}
	    outputs = model(**batch)  # passing labels makes the model return a loss
	    loss = outputs.loss
	    loss.backward()
	    optimizer.step()
	    optimizer.zero_grad()
	    print("Loss:", loss.item())
	```

	## Web Tool

	The `mRNAbert` model was fine-tuned to create [G4mer](https://huggingface.co/Biociphers/g4mer), a state-of-the-art model for predicting RNA G-quadruplexes and their subtypes.

	You can explore G4mer predictions interactively through our web tool:

	[G4mer Web Tool](https://tools.biociphers.org/g4mer)

	Features include:
	- RNA sequence prediction (binary rG4-forming vs. non-forming)
	- Transcriptome-wide prediction of rG4s and subtypes
	- Variant effect annotation using gnomAD SNVs
	- Search and filter by gene, transcript, region (5′UTR, CDS, 3′UTR), and sequence context

	No installation needed — just visit and start exploring.
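To illustrate the idea behind variant effect annotation, the sketch below applies a single-nucleotide variant to a sequence and reports the change in a score between the alternate and reference alleles. This is not how the web tool computes its annotations; all function names are hypothetical, and `g_fraction` is a toy stand-in for a model score, used only so the example runs without any model.

```python
from typing import Callable


def apply_snv(seq: str, pos: int, ref: str, alt: str) -> str:
    """Return seq with the base at 0-based position pos changed from ref to alt."""
    if seq[pos] != ref:
        raise ValueError(f"expected {ref} at position {pos}, found {seq[pos]}")
    return seq[:pos] + alt + seq[pos + 1:]


def variant_delta(seq: str, pos: int, ref: str, alt: str,
                  score: Callable[[str], float]) -> float:
    """Score of the alternate sequence minus score of the reference sequence."""
    return score(apply_snv(seq, pos, ref, alt)) - score(seq)


def g_fraction(seq: str) -> float:
    """Toy scorer: fraction of G bases. NOT a model, only a placeholder."""
    return seq.count("G") / len(seq)


ref_seq = "GGGAGGGAGGGAGGG"
delta = variant_delta(ref_seq, 1, "G", "A", g_fraction)
print(round(delta, 3))  # -0.067
```

In practice `score` would be replaced by a function that returns a model's predicted rG4 probability for the sequence.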

	## Citation - MLA

	```
	Zhuang, Farica, et al. "G4mer: An RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data." Nature Communications 16.1 (2025): 10221.
	```

	## Contact

	For questions, feedback, or discussions about G4mer, please post on the [Biociphers Google Group](https://groups.google.com/g/majiq_voila).