Improve model card for discoverability and add citation metadata
- Rewrite README into a structured Hugging Face model card with YAML
metadata (license, tags, pipeline_tag), badges, Model Summary/Details,
Intended Use, Limitations, and an FAQ
- Fix usage instructions and directory tree (use model_updated.pt +
tokenizer/; historical split model is under experiment_data/historical_version/)
- Align all figures with the PLOS ONE paper (16.09M sequences / 17.4B nt,
350M params, model dimension 1280, 1024-token context, BPE vocab 1024)
- Add BibTeX / plain-text citation and paper + preprint DOI links
- Add CITATION.cff and .gitignore
- .gitignore +12 -0
- CITATION.cff +57 -0
- README.md +222 -52
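
As a rough sanity check on the figures quoted in the commit message and in the README's Training Data section, the sketch below derives a few ratios from them. The constant and function names are illustrative only, and the inputs are the rounded values from the text, so the results are approximate rather than numbers reported in the paper.

```python
# Figures as quoted in the commit message and README (all rounded).
RAW_SEQUENCES = 34.39e6       # sequences before deduplication
CORPUS_SEQUENCES = 16.09e6    # sequences after MMseqs2 deduplication
CORPUS_NUCLEOTIDES = 17.4e9   # nucleotides in the pre-training corpus
CONTEXT_TOKENS = 1024         # context window in tokens
CONTEXT_NUCLEOTIDES = 4000    # approximate context window in nucleotides


def derived_figures() -> dict[str, float]:
    """Derive rough ratios from the quoted (rounded) figures."""
    return {
        "mean_sequence_length_nt": CORPUS_NUCLEOTIDES / CORPUS_SEQUENCES,
        "fraction_kept_after_dedup": CORPUS_SEQUENCES / RAW_SEQUENCES,
        "nucleotides_per_token": CONTEXT_NUCLEOTIDES / CONTEXT_TOKENS,
    }


if __name__ == "__main__":
    for name, value in derived_figures().items():
        print(f"{name}: {value:.2f}")
```

Under these inputs the mean sequence length comes out a little above 1,000 nucleotides, slightly under half of the raw sequences survive deduplication, and one BPE token covers roughly four nucleotides.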
.gitignore
ADDED
|
@@ -0,0 +1,12 @@
| 1 |
+
# Python
|
| 2 |
+
__pycache__/
|
| 3 |
+
*.py[cod]
|
| 4 |
+
|
| 5 |
+
# Local generation outputs (not part of the published repository)
|
| 6 |
+
outputs/
|
| 7 |
+
|
| 8 |
+
# Claude Code local settings
|
| 9 |
+
.claude/settings.local.json
|
| 10 |
+
|
| 11 |
+
# OS / editor cruft
|
| 12 |
+
.DS_Store
|
CITATION.cff
ADDED
|
@@ -0,0 +1,57 @@
| 1 |
+
cff-version: 1.2.0
|
| 2 |
+
message: "If you use GenerRNA in your research, please cite the article below."
|
| 3 |
+
title: "GenerRNA: A generative pre-trained language model for de novo RNA design"
|
| 4 |
+
abstract: >-
|
| 5 |
+
GenerRNA is a generative pre-trained language model for de novo RNA design,
|
| 6 |
+
based on a Transformer decoder-only architecture. It generates novel RNA
|
| 7 |
+
sequences in a zero-shot manner or after fine-tuning, without requiring prior
|
| 8 |
+
structural information.
|
| 9 |
+
type: software
|
| 10 |
+
authors:
|
| 11 |
+
- family-names: Zhao
|
| 12 |
+
given-names: Yichong
|
| 13 |
+
affiliation: "The University of Tokyo"
|
| 14 |
+
- family-names: Oono
|
| 15 |
+
given-names: Kenta
|
| 16 |
+
affiliation: "Preferred Networks, Inc."
|
| 17 |
+
- family-names: Takizawa
|
| 18 |
+
given-names: Hiroki
|
| 19 |
+
affiliation: "Preferred Networks, Inc."
|
| 20 |
+
- family-names: Kotera
|
| 21 |
+
given-names: Masaaki
|
| 22 |
+
affiliation: "Preferred Networks, Inc."
|
| 23 |
+
email: kotera@preferred.jp
|
| 24 |
+
repository-code: "https://huggingface.co/pfnet/GenerRNA"
|
| 25 |
+
url: "https://huggingface.co/pfnet/GenerRNA"
|
| 26 |
+
license: MIT
|
| 27 |
+
keywords:
|
| 28 |
+
- RNA design
|
| 29 |
+
- de novo design
|
| 30 |
+
- generative model
|
| 31 |
+
- language model
|
| 32 |
+
- transformer
|
| 33 |
+
- RNA generation
|
| 34 |
+
- computational biology
|
| 35 |
+
- bioinformatics
|
| 36 |
+
- drug discovery
|
| 37 |
+
preferred-citation:
|
| 38 |
+
type: article
|
| 39 |
+
title: "GenerRNA: A generative pre-trained language model for de novo RNA design"
|
| 40 |
+
authors:
|
| 41 |
+
- family-names: Zhao
|
| 42 |
+
given-names: Yichong
|
| 43 |
+
- family-names: Oono
|
| 44 |
+
given-names: Kenta
|
| 45 |
+
- family-names: Takizawa
|
| 46 |
+
given-names: Hiroki
|
| 47 |
+
- family-names: Kotera
|
| 48 |
+
given-names: Masaaki
|
| 49 |
+
journal: "PLOS ONE"
|
| 50 |
+
year: 2024
|
| 51 |
+
month: 10
|
| 52 |
+
volume: 19
|
| 53 |
+
issue: 10
|
| 54 |
+
start: "e0310814"
|
| 55 |
+
doi: "10.1371/journal.pone.0310814"
|
| 56 |
+
publisher:
|
| 57 |
+
name: "Public Library of Science"
|
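
The `preferred-citation` block above carries the same fields as the BibTeX entry added to the README. As a minimal sketch of that correspondence, the helper below renders such a mapping as BibTeX. `cff_to_bibtex` and `PREFERRED_CITATION` are hypothetical names, and the mapping is typed in by hand here; reading the real file would need a YAML parser (for example PyYAML's `yaml.safe_load`), which is an assumption and not part of this commit.

```python
# Hand-copied subset of the `preferred-citation` block in CITATION.cff.
PREFERRED_CITATION = {
    "title": "GenerRNA: A generative pre-trained language model for de novo RNA design",
    "authors": [
        {"family-names": "Zhao", "given-names": "Yichong"},
        {"family-names": "Oono", "given-names": "Kenta"},
        {"family-names": "Takizawa", "given-names": "Hiroki"},
        {"family-names": "Kotera", "given-names": "Masaaki"},
    ],
    "journal": "PLOS ONE",
    "year": 2024,
    "volume": 19,
    "issue": 10,
    "start": "e0310814",
    "doi": "10.1371/journal.pone.0310814",
    "publisher": {"name": "Public Library of Science"},
}


def cff_to_bibtex(citation: dict, key: str) -> str:
    """Render a CFF `preferred-citation` mapping of type `article` as BibTeX."""
    authors = " and ".join(
        f"{a['family-names']}, {a['given-names']}" for a in citation["authors"]
    )
    # Map CFF field names onto the BibTeX field names used in the README.
    fields = {
        "title": citation["title"],
        "author": authors,
        "journal": citation["journal"],
        "volume": citation["volume"],
        "number": citation["issue"],
        "pages": citation["start"],
        "year": citation["year"],
        "doi": citation["doi"],
        "publisher": citation["publisher"]["name"],
    }
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@article{{{key},\n{body}\n}}"


if __name__ == "__main__":
    print(cff_to_bibtex(PREFERRED_CITATION, "zhao2024generrna"))
```

Keeping the two citation formats derivable from one source is one way to avoid them drifting apart in later edits.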
README.md
CHANGED
|
@@ -1,11 +1,100 @@
|
|
| 1 |
-
|
| 2 |
-
|
| 3 |
|
| 4 |
-
|
| 5 |
|
| 6 |
-
# Requirements
|
| 7 |
-
A CUDA environment, and a minimum VRAM of 8GB was required.
|
| 8 |
-
### Dependencies
|
| 9 |
```
|
| 10 |
torch>=2.0
|
| 11 |
numpy
|
|
@@ -14,68 +103,149 @@ datasets==2.14.4
|
|
| 14 |
tqdm
|
| 15 |
```
|
| 16 |
|
| 17 |
-
#
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
```
|
| 21 |
-
.
|
| 22 |
-
├── LICENSE
|
| 23 |
-
├── README.md
|
| 24 |
-
├── configs
|
| 25 |
-
│   ├── example_finetuning.py
|
| 26 |
-
│   └── example_pretraining.py
|
| 27 |
-
├── experiments_data
|
| 28 |
-
├── model.pt.part-aa # splited bin data of *HISTORICAL* model (shorter context window, less VRAM comsuption)
|
| 29 |
-
├── model.pt.part-ab
|
| 30 |
-
├── model.pt.part-ac
|
| 31 |
-
├── model.pt.part-ad
|
| 32 |
-
├── model_updated.pt # *NEWER* model, with longer context windows and being trained on a deduplicated dataset
|
| 33 |
-
├── model.py # define the architecture
|
| 34 |
-
├── sampling.py # script to generate sequences
|
| 35 |
-
├── tokenization.py # preparete data
|
| 36 |
-
├── tokenizer_bpe_1024
|
| 37 |
-
│   ├── tokenizer.json
|
| 38 |
-
│   └── ....
|
| 39 |
-
└── train.py # script for training/fine-tuning
|
| 40 |
-
```
|
| 41 |
|
| 42 |
-
|
| 43 |
-
|
| 44 |
```
|
| 45 |
python sampling.py \
|
| 46 |
--out_path {output_file_path} \
|
| 47 |
--max_new_tokens 256 \
|
| 48 |
-
--ckpt_path
|
| 49 |
-
--tokenizer_path
|
| 50 |
-
```
|
| 51 |
-
### Pre-training or Fine-tuning on your own sequences
|
| 52 |
-
First, tokenize your sequence data, ensuring each sequence is on a separate line and there is no header.
|
| 53 |
```
|
| 54 |
python tokenization.py \
|
| 55 |
-
--data_dir {
|
| 56 |
--file_name {file_name_of_sequence_data} \
|
| 57 |
-
--tokenizer_path
|
| 58 |
--out_dir {directory_to_save_tokenized_data} \
|
| 59 |
--block_size 256
|
| 60 |
```
|
| 61 |
|
| 62 |
-
|
| 63 |
|
| 64 |
-
|
| 65 |
-
```
|
| 66 |
-
python train.py \
|
| 67 |
-
--config {path_to_your_config_file}
|
| 68 |
-
```
|
| 69 |
|
| 70 |
-
|
| 71 |
-
|
| 72 |
```
|
| 73 |
python train_BPE.py \
|
| 74 |
-
--txt_file_path {
|
| 75 |
--vocab_size 50256 \
|
| 76 |
-
--new_tokenizer_path {directory_to_save_trained_tokenizer}
|
| 77 |
-
|
| 78 |
```
|
| 79 |
|
| 80 |
-
|
| 81 |
-
The source code is licensed MIT. See `LICENSE`
|
| 1 |
+
---
|
| 2 |
+
license: mit
|
| 3 |
+
tags:
|
| 4 |
+
- biology
|
| 5 |
+
- rna
|
| 6 |
+
- rna-design
|
| 7 |
+
- rna-generation
|
| 8 |
+
- de-novo-design
|
| 9 |
+
- generative-model
|
| 10 |
+
- language-model
|
| 11 |
+
- transformer
|
| 12 |
+
- gpt
|
| 13 |
+
- nucleotide
|
| 14 |
+
- bioinformatics
|
| 15 |
+
- computational-biology
|
| 16 |
+
- drug-discovery
|
| 17 |
+
- molecular-design
|
| 18 |
+
pipeline_tag: text-generation
|
| 19 |
+
---
|
| 20 |
|
| 21 |
+
# GenerRNA: A Generative Language Model for *de novo* RNA Design
|
| 22 |
+
|
| 23 |
+
[Paper: PLOS ONE](https://doi.org/10.1371/journal.pone.0310814)
|
| 24 |
+
[Preprint: bioRxiv](https://doi.org/10.1101/2024.02.01.578496)
|
| 25 |
+
[License: MIT](./LICENSE)
|
| 26 |
+
[Model: Hugging Face](https://huggingface.co/pfnet/GenerRNA)
|
| 27 |
+
|
| 28 |
+
**GenerRNA is a generative pre-trained language model for *de novo* RNA sequence design.** It is a Transformer (decoder-only, GPT-style) model that learns the "language" of RNA from millions of natural sequences and can generate novel, realistic RNA sequences **without any structural input, functional label, or sequence alignment**. To our knowledge, GenerRNA is the first application of a generative language model to RNA generation.
|
| 29 |
+
|
| 30 |
+
With GenerRNA you can:
|
| 31 |
+
|
| 32 |
+
- **Generate RNA in a zero-shot manner** to explore the RNA sequence space, or
|
| 33 |
+
- **Fine-tune on your own dataset** to generate RNAs belonging to a particular family or possessing specific characteristics (e.g., high binding affinity to a target protein).
|
| 34 |
+
|
| 35 |
+
> Developed by [Preferred Networks, Inc.](https://www.preferred.jp/en/) and The University of Tokyo. Introduced in *PLOS ONE* (2024): [GenerRNA: A generative pre-trained language model for *de novo* RNA design](https://doi.org/10.1371/journal.pone.0310814).
|
| 36 |
+
|
| 37 |
+
---
|
| 38 |
+
|
| 39 |
+
## Table of Contents
|
| 40 |
+
|
| 41 |
+
- [Model Summary](#model-summary)
|
| 42 |
+
- [Key Features](#key-features)
|
| 43 |
+
- [Model Details](#model-details)
|
| 44 |
+
- [Intended Use & Use Cases](#intended-use--use-cases)
|
| 45 |
+
- [Requirements](#requirements)
|
| 46 |
+
- [Quickstart](#quickstart)
|
| 47 |
+
- [Training & Fine-tuning](#training--fine-tuning)
|
| 48 |
+
- [Repository Structure](#repository-structure)
|
| 49 |
+
- [Training Data](#training-data)
|
| 50 |
+
- [Limitations](#limitations)
|
| 51 |
+
- [FAQ](#faq)
|
| 52 |
+
- [Citation](#citation)
|
| 53 |
+
- [License](#license)
|
| 54 |
+
|
| 55 |
+
---
|
| 56 |
+
|
| 57 |
+
## Model Summary
|
| 58 |
+
|
| 59 |
+
GenerRNA is a **Transformer decoder-only (GPT-style) language model** trained on RNA nucleotide sequences. By treating RNA as a sequence of tokens, it learns statistical and structural regularities of RNA directly from data and can then **sample entirely new sequences**. GenerRNA was pre-trained on ~16 million RNA sequences (16.09M), encompassing ~17.4 billion nucleotides. Generated RNAs are novel (distinct from training sequences) yet fold into stable secondary structures, and the model can be fine-tuned to design functional RNAs such as protein binders, all without requiring prior structural knowledge.
|
| 60 |
+
|
| 61 |
+
## Key Features
|
| 62 |
+
|
| 63 |
+
- **De novo RNA generation**: create novel RNA sequences from scratch; no structure, label, or alignment required.
|
| 64 |
+
- **Zero-shot or fine-tuned**: explore RNA space out of the box, or specialize the model for a target family or function.
|
| 65 |
+
- **Structurally plausible outputs**: generated sequences fold into stable secondary structures (low minimum free energy).
|
| 66 |
+
- **Transformer / GPT architecture**: a familiar, scalable decoder-only design (~350M parameters).
|
| 67 |
+
- **Two checkpoints provided**: an updated long-context model and the original historical model.
|
| 68 |
+
- **Open & reproducible**: MIT-licensed code, tokenizer, checkpoints, and the data behind the paper's figures.
|
| 69 |
+
|
| 70 |
+
## Model Details
|
| 71 |
+
|
| 72 |
+
| | |
|
| 73 |
+
|---|---|
|
| 74 |
+
| **Model type** | Generative language model (decoder-only Transformer, GPT-style) |
|
| 75 |
+
| **Domain** | RNA / nucleotide sequences |
|
| 76 |
+
| **Parameters** | 350M (24 transformer layers, model dimension 1280) |
|
| 77 |
+
| **Context window** | 1024 tokens (~4000 nucleotides) |
|
| 78 |
+
| **Tokenizer** | Byte-Pair Encoding (BPE), vocabulary size 1024 |
|
| 79 |
+
| **Checkpoints** | `model_updated.pt` (recommended; longer context, deduplicated data) · original split model in `experiment_data/historical_version/` |
|
| 80 |
+
| **Framework** | PyTorch (≥ 2.0) |
|
| 81 |
+
| **License** | MIT |
|
| 82 |
+
| **Paper** | *PLOS ONE* 19(10):e0310814 (2024) · [doi:10.1371/journal.pone.0310814](https://doi.org/10.1371/journal.pone.0310814) |
|
| 83 |
+
| **Developed by** | Preferred Networks, Inc. & The University of Tokyo |
|
| 84 |
+
|
| 85 |
+
## Intended Use & Use Cases
|
| 86 |
+
|
| 87 |
+
GenerRNA is intended for **research in RNA biology, synthetic biology, and RNA-based therapeutics / drug discovery**. Typical use cases include:
|
| 88 |
+
|
| 89 |
+
- Exploring the diversity of the RNA sequence space.
|
| 90 |
+
- Generating candidate RNAs from a target family by fine-tuning on family-specific data.
|
| 91 |
+
- Designing RNAs with desired functional properties, such as aptamers/binders with high affinity to a target protein (demonstrated for the RNA-binding proteins **ELAVL1** and **SRSF1** in the paper).
|
| 92 |
+
- Serving as a pre-trained backbone for downstream RNA modeling and design tasks.
|
| 93 |
+
|
| 94 |
+
## Requirements
|
| 95 |
+
|
| 96 |
+
A CUDA environment with a minimum of **8 GB VRAM** is required.
|
| 97 |
|
| 98 |
```
|
| 99 |
torch>=2.0
|
| 100 |
numpy
|
| 103 |
tqdm
|
| 104 |
```
|
| 105 |
|
| 106 |
+
## Quickstart
|
| 107 |
+
|
| 108 |
+
Clone the repository (it ships with the recommended checkpoint `model_updated.pt` and its `tokenizer/`):
|
| 109 |
|
| 110 |
+
```bash
|
| 111 |
+
git clone https://huggingface.co/pfnet/GenerRNA
|
| 112 |
+
cd GenerRNA
|
| 113 |
```
|
| 114 |
+
|
| 115 |
+
### De novo generation (zero-shot)
|
| 116 |
+
|
| 117 |
+
```bash
|
| 118 |
python sampling.py \
|
| 119 |
--out_path {output_file_path} \
|
| 120 |
--max_new_tokens 256 \
|
| 121 |
+
--ckpt_path model_updated.pt \
|
| 122 |
+
--tokenizer_path tokenizer
|
| 123 |
```
|
| 124 |
+
|
| 125 |
+
> **Want to use the original (historical) model instead?** It is stored as split files. Recombine it and use its dedicated tokenizer:
|
| 126 |
+
>
|
| 127 |
+
> ```bash
|
| 128 |
+
> cat experiment_data/historical_version/model.pt.part-* > model.pt
|
| 129 |
+
> python sampling.py \
|
| 130 |
+
> --out_path {output_file_path} \
|
| 131 |
+
> --max_new_tokens 256 \
|
| 132 |
+
> --ckpt_path model.pt \
|
| 133 |
+
> --tokenizer_path experiment_data/historical_version/tokenizer_bpe_1024
|
| 134 |
+
> ```
|
| 135 |
+
|
| 136 |
+
## Training & Fine-tuning
|
| 137 |
+
|
| 138 |
+
**1. Tokenize your sequences** (one sequence per line, no header):
|
| 139 |
+
|
| 140 |
+
```bash
|
| 141 |
python tokenization.py \
|
| 142 |
+
--data_dir {path_to_directory_containing_sequence_data} \
|
| 143 |
--file_name {file_name_of_sequence_data} \
|
| 144 |
+
--tokenizer_path tokenizer \
|
| 145 |
--out_dir {directory_to_save_tokenized_data} \
|
| 146 |
--block_size 256
|
| 147 |
```
|
| 148 |
|
| 149 |
+
**2. Create a config** based on `configs/example_pretraining.py` (training from scratch) or `configs/example_finetuning.py` (fine-tuning).
|
| 150 |
|
| 151 |
+
**3. Train / fine-tune:**
|
| 152 |
|
| 153 |
+
```bash
|
| 154 |
+
python train.py --config {path_to_your_config_file}
|
| 155 |
```
|
| 156 |
+
|
| 157 |
+
### Train your own tokenizer (optional)
|
| 158 |
+
|
| 159 |
+
```bash
|
| 160 |
python train_BPE.py \
|
| 161 |
+
--txt_file_path {path_to_training_file_one_sequence_per_line} \
|
| 162 |
--vocab_size 50256 \
|
| 163 |
+
--new_tokenizer_path {directory_to_save_trained_tokenizer}
|
| 164 |
+
```
|
| 165 |
+
|
| 166 |
+
## Repository Structure
|
| 167 |
+
|
| 168 |
```
|
| 169 |
+
.
|
| 170 |
+
├── LICENSE
|
| 171 |
+
├── README.md
|
| 172 |
+
├── CITATION.cff                  # machine-readable citation metadata
|
| 173 |
+
├── model.py                      # model architecture (decoder-only Transformer)
|
| 174 |
+
├── sampling.py                   # generate sequences from a trained model
|
| 175 |
+
├── tokenization.py               # tokenize sequence data for training
|
| 176 |
+
├── train.py                      # pre-training / fine-tuning entry point
|
| 177 |
+
├── train_BPE.py                  # train a new BPE tokenizer
|
| 178 |
+
├── model_updated.pt              # recommended checkpoint (longer context, deduplicated data)
|
| 179 |
+
├── tokenizer/                    # BPE tokenizer for model_updated.pt
|
| 180 |
+
├── configs/
|
| 181 |
+
│   ├── example_pretraining.py
|
| 182 |
+
│   └── example_finetuning.py
|
| 183 |
+
└── experiment_data/
|
| 184 |
+
    ├── *.csv                     # data underlying the paper's figures
|
| 185 |
+
    ├── pretraining_data.sh       # how the pre-training corpus was built (RNAcentral + MMseqs2)
|
| 186 |
+
    └── historical_version/       # original model (split into parts) + its tokenizer
|
| 187 |
+
        ├── model.pt.part-a{a,b,c,d}
|
| 188 |
+
        └── tokenizer_bpe_1024/
|
| 189 |
+
```
|
| 190 |
+
|
| 191 |
+
## Training Data
|
| 192 |
+
|
| 193 |
+
GenerRNA was pre-trained on RNA sequences from **[RNAcentral](https://rnacentral.org/)** (release 22, which aggregates 51 expert databases). Starting from **34.39 million** raw sequences, deduplication with **[MMseqs2](https://github.com/soedinglab/MMseqs2)** at **80% sequence identity** yielded a pre-training corpus of **~16 million sequences (16.09M), encompassing ~17.4 billion nucleotides**. GenerRNA has a context window of **1024 tokens (~4000 nucleotides)**. The pre-processing pipeline is in [`experiment_data/pretraining_data.sh`](experiment_data/pretraining_data.sh), and the data underlying the paper's figures is provided in `experiment_data/`. See the [paper](https://doi.org/10.1371/journal.pone.0310814) for full dataset details.
|
| 194 |
+
|
| 195 |
+
## Limitations
|
| 196 |
+
|
| 197 |
+
- GenerRNA models RNA **sequence**; it does not explicitly predict tertiary structure or function. Validate candidates with downstream structure/function tools and wet-lab experiments.
|
| 198 |
+
- A CUDA GPU is required for generation and training as provided.
|
| 199 |
+
- Zero-shot outputs reflect the natural distribution of the training data; targeting a specific family or property generally requires fine-tuning.
|
| 200 |
+
- Generated sequences are computational hypotheses and should be experimentally validated before any real-world application.
|
| 201 |
+
|
| 202 |
+
## FAQ
|
| 203 |
+
|
| 204 |
+
**What is GenerRNA?**
|
| 205 |
+
GenerRNA is a generative, pre-trained language model (a decoder-only Transformer) that designs novel RNA sequences *de novo*, without requiring structural information, functional labels, or sequence alignments.
|
| 206 |
+
|
| 207 |
+
**How is GenerRNA different from other RNA models?**
|
| 208 |
+
Most RNA models are *discriminative*: they predict structure or properties from a given sequence. GenerRNA is *generative*: it samples entirely new sequences. To our knowledge, it is the first application of a generative language model to RNA generation.
|
| 209 |
+
|
| 210 |
+
**Do I need RNA structure or alignments as input?**
|
| 211 |
+
No. GenerRNA generates sequences directly from its learned distribution; no structure or alignment is needed.
|
| 212 |
+
|
| 213 |
+
**Can I generate RNAs from a specific family or with a specific function?**
|
| 214 |
+
Yes. Fine-tune GenerRNA on a family- or function-specific dataset. The paper demonstrates designing RNAs with high binding affinity to the proteins ELAVL1 and SRSF1.
|
| 215 |
+
|
| 216 |
+
**Which checkpoint should I use?**
|
| 217 |
+
Use `model_updated.pt` (longer context, trained on deduplicated data). The original split model is kept in `experiment_data/historical_version/` for reproducibility.
|
| 218 |
+
|
| 219 |
+
**Is GenerRNA free to use?**
|
| 220 |
+
Yes. The code and weights are released under the MIT License. Please cite the paper if you use GenerRNA in your work.
|
| 221 |
+
|
| 222 |
+
**How do I cite GenerRNA?**
|
| 223 |
+
See [Citation](#citation) below.
|
| 224 |
+
|
| 225 |
+
## Citation
|
| 226 |
+
|
| 227 |
+
If you use GenerRNA, its checkpoints, or this repository in your research, please cite:
|
| 228 |
+
|
| 229 |
+
```bibtex
|
| 230 |
+
@article{zhao2024generrna,
|
| 231 |
+
title = {GenerRNA: A generative pre-trained language model for de novo RNA design},
|
| 232 |
+
author = {Zhao, Yichong and Oono, Kenta and Takizawa, Hiroki and Kotera, Masaaki},
|
| 233 |
+
journal = {PLOS ONE},
|
| 234 |
+
volume = {19},
|
| 235 |
+
number = {10},
|
| 236 |
+
pages = {e0310814},
|
| 237 |
+
year = {2024},
|
| 238 |
+
doi = {10.1371/journal.pone.0310814},
|
| 239 |
+
publisher = {Public Library of Science}
|
| 240 |
+
}
|
| 241 |
+
```
|
| 242 |
+
|
| 243 |
+
**Plain text:** Zhao Y, Oono K, Takizawa H, Kotera M (2024) GenerRNA: A generative pre-trained language model for *de novo* RNA design. PLOS ONE 19(10): e0310814. https://doi.org/10.1371/journal.pone.0310814
|
| 244 |
+
|
| 245 |
+
- **Paper (PLOS ONE):** https://doi.org/10.1371/journal.pone.0310814
|
| 246 |
+
- **Preprint (bioRxiv):** https://doi.org/10.1101/2024.02.01.578496
|
| 247 |
+
- **Model:** https://huggingface.co/pfnet/GenerRNA
|
| 248 |
+
|
| 249 |
+
## License
|
| 250 |
|
| 251 |
+
The source code is licensed under the **MIT License**; see [`LICENSE`](LICENSE). © 2024 Yichong Zhao, Masaaki Kotera, Kenta Oono, Hiroki Takizawa.
|
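
The README recombines the historical checkpoint with `cat experiment_data/historical_version/model.pt.part-* > model.pt`. On systems without `cat`, the same concatenation could be sketched in Python as below. `join_parts` is a hypothetical helper, not a script in this repository, and it assumes the part names sort into the intended order by file name (as `part-aa` through `part-ad` do).

```python
import shutil
from pathlib import Path


def join_parts(parts_dir: Path, pattern: str, output: Path) -> list[str]:
    """Concatenate files matching `pattern`, in sorted name order, into `output`.

    Returns the names of the parts that were joined.
    """
    parts = sorted(parts_dir.glob(pattern))
    if not parts:
        raise FileNotFoundError(f"no files matching {pattern!r} in {parts_dir}")
    with output.open("wb") as out:
        for part in parts:
            with part.open("rb") as src:
                # copyfileobj streams in chunks, so large parts are not loaded into memory
                shutil.copyfileobj(src, out)
    return [part.name for part in parts]
```

From the repository root, `join_parts(Path("experiment_data/historical_version"), "model.pt.part-*", Path("model.pt"))` would produce the `model.pt` file that the sampling command for the historical model expects.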