Eric2333 committed
Commit becf8d9 · 1 Parent(s): 1591961

Improve model card for discoverability and add citation metadata


- Rewrite README into a structured Hugging Face model card with YAML
metadata (license, tags, pipeline_tag), badges, Model Summary/Details,
Intended Use, Limitations, and an FAQ
- Fix usage instructions and directory tree (use model_updated.pt +
tokenizer/; historical split model is under experiment_data/historical_version/)
- Align all figures with the PLOS ONE paper (16.09M sequences / 17.4B nt,
350M params, model dimension 1280, 1024-token context, BPE vocab 1024)
- Add BibTeX / plain-text citation and paper + preprint DOI links
- Add CITATION.cff and .gitignore

Files changed (3)
  1. .gitignore +12 -0
  2. CITATION.cff +57 -0
  3. README.md +222 -52
.gitignore ADDED
@@ -0,0 +1,12 @@
+ # Python
+ __pycache__/
+ *.py[cod]
+
+ # Local generation outputs (not part of the published repository)
+ outputs/
+
+ # Claude Code local settings
+ .claude/settings.local.json
+
+ # OS / editor cruft
+ .DS_Store
CITATION.cff ADDED
@@ -0,0 +1,57 @@
+ cff-version: 1.2.0
+ message: "If you use GenerRNA in your research, please cite the article below."
+ title: "GenerRNA: A generative pre-trained language model for de novo RNA design"
+ abstract: >-
+   GenerRNA is a generative pre-trained language model for de novo RNA design,
+   based on a Transformer decoder-only architecture. It generates novel RNA
+   sequences in a zero-shot manner or after fine-tuning, without requiring prior
+   structural information.
+ type: software
+ authors:
+   - family-names: Zhao
+     given-names: Yichong
+     affiliation: "The University of Tokyo"
+   - family-names: Oono
+     given-names: Kenta
+     affiliation: "Preferred Networks, Inc."
+   - family-names: Takizawa
+     given-names: Hiroki
+     affiliation: "Preferred Networks, Inc."
+   - family-names: Kotera
+     given-names: Masaaki
+     affiliation: "Preferred Networks, Inc."
+     email: kotera@preferred.jp
+ repository-code: "https://huggingface.co/pfnet/GenerRNA"
+ url: "https://huggingface.co/pfnet/GenerRNA"
+ license: MIT
+ keywords:
+   - RNA design
+   - de novo design
+   - generative model
+   - language model
+   - transformer
+   - RNA generation
+   - computational biology
+   - bioinformatics
+   - drug discovery
+ preferred-citation:
+   type: article
+   title: "GenerRNA: A generative pre-trained language model for de novo RNA design"
+   authors:
+     - family-names: Zhao
+       given-names: Yichong
+     - family-names: Oono
+       given-names: Kenta
+     - family-names: Takizawa
+       given-names: Hiroki
+     - family-names: Kotera
+       given-names: Masaaki
+   journal: "PLOS ONE"
+   year: 2024
+   month: 10
+   volume: 19
+   issue: 10
+   start: "e0310814"
+   doi: "10.1371/journal.pone.0310814"
+   publisher:
+     name: "Public Library of Science"
README.md CHANGED
@@ -1,11 +1,100 @@
- # GenerRNA
- GenerRNA is a generative RNA language model based on a Transformer decoder-only architecture. It was pre-trained on 30M sequences, encompassing 17B nucleotides.

- Here, you can find all the relevant scripts for running GenerRNA on your machine. GenerRNA enable you to generate RNA sequences in a zero-shot manner for exploring the RNA space, or to fine-tune the model using a specific dataset for generating RNAs belonging to a particular family or possessing specific characteristics.

- # Requirements
- A CUDA environment, and a minimum VRAM of 8GB was required.
- ### Dependencies
  ```
  torch>=2.0
  numpy
@@ -14,68 +103,149 @@ datasets==2.14.4
  tqdm
  ```

- # Usage
- Firstly, combine the split model using the command `cat model.pt.part-* > model.pt.recombined`
- #### Directory tree
- ```
- .
- ├── LICENSE
- ├── README.md
- ├── configs
- │   ├── example_finetuning.py
- │   └── example_pretraining.py
- ├── experiments_data
- ├── model.pt.part-aa # splited bin data of *HISTORICAL* model (shorter context window, less VRAM comsuption)
- ├── model.pt.part-ab
- ├── model.pt.part-ac
- ├── model.pt.part-ad
- ├── model_updated.pt # *NEWER* model, with longer context windows and being trained on a deduplicated dataset
- ├── model.py # define the architecture
- ├── sampling.py # script to generate sequences
- ├── tokenization.py # preparete data
- ├── tokenizer_bpe_1024
- │   ├── tokenizer.json
- │   ├── ....
- ├── train.py # script for training/fine-tuning
- ```

- ### De novo Generation in a zero-shot fashion
- Usage example:
  ```
  python sampling.py \
      --out_path {output_file_path} \
      --max_new_tokens 256 \
-     --ckpt_path {model.pt} \
-     --tokenizer_path {path_to_tokenizer_directory, e.g /tokenizer_bpe_1024}
- ```
- ### Pre-training or Fine-tuning on your own sequences
- First, tokenize your sequence data, ensuring each sequence is on a separate line and there is no header.
  ```
  python tokenization.py \
-     --data_dir {path_to_the_directory_containing_sequence_data} \
      --file_name {file_name_of_sequence_data} \
-     --tokenizer_path {path_to_tokenizer_directory} \
      --out_dir {directory_to_save_tokenized_data} \
      --block_size 256
  ```

- Next, refer to `./configs/example_**.py` to create a config file of GPT model.

- Lastly, excute following command:
- ```
- python train.py \
-     --config {path_to_your_config_file}
- ```

- ### Train your own tokenizer
- Usage example:
  ```
  python train_BPE.py \
-     --txt_file_path {path_to_training_file(txt,each sequence is on a separate line)} \
      --vocab_size 50256 \
-     --new_tokenizer_path {directory_to_save_trained_tokenizer} \
-
  ```

- # License
- The source code is licensed MIT. See `LICENSE`
+ ---
+ license: mit
+ tags:
+ - biology
+ - rna
+ - rna-design
+ - rna-generation
+ - de-novo-design
+ - generative-model
+ - language-model
+ - transformer
+ - gpt
+ - nucleotide
+ - bioinformatics
+ - computational-biology
+ - drug-discovery
+ - molecular-design
+ pipeline_tag: text-generation
+ ---

+ # GenerRNA: A Generative Language Model for *de novo* RNA Design
+
+ [![Paper (PLOS ONE)](https://img.shields.io/badge/Paper-PLOS%20ONE%202024-orange)](https://doi.org/10.1371/journal.pone.0310814)
+ [![Preprint (bioRxiv)](https://img.shields.io/badge/Preprint-bioRxiv-red)](https://doi.org/10.1101/2024.02.01.578496)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](./LICENSE)
+ [![Model on Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-pfnet%2FGenerRNA-yellow)](https://huggingface.co/pfnet/GenerRNA)
+
+ **GenerRNA is a generative pre-trained language model for *de novo* RNA sequence design.** It is a Transformer (decoder-only, GPT-style) model that learns the "language" of RNA from millions of natural sequences and can generate novel, realistic RNA sequences **without any structural input, functional label, or sequence alignment**. To our knowledge, GenerRNA is the first application of a generative language model to RNA generation.
+
+ With GenerRNA you can:
+
+ - **Generate RNA in a zero-shot manner** to explore the RNA sequence space, or
+ - **Fine-tune on your own dataset** to generate RNAs belonging to a particular family or possessing specific characteristics (e.g., high binding affinity to a target protein).
+
+ > Developed by [Preferred Networks, Inc.](https://www.preferred.jp/en/) and The University of Tokyo. Introduced in *PLOS ONE* (2024): [GenerRNA: A generative pre-trained language model for *de novo* RNA design](https://doi.org/10.1371/journal.pone.0310814).
+
+ ---
+
+ ## Table of Contents
+
+ - [Model Summary](#model-summary)
+ - [Key Features](#key-features)
+ - [Model Details](#model-details)
+ - [Intended Use & Use Cases](#intended-use--use-cases)
+ - [Requirements](#requirements)
+ - [Quickstart](#quickstart)
+ - [Training & Fine-tuning](#training--fine-tuning)
+ - [Repository Structure](#repository-structure)
+ - [Training Data](#training-data)
+ - [Limitations](#limitations)
+ - [FAQ](#faq)
+ - [Citation](#citation)
+ - [License](#license)
+
+ ---
+
+ ## Model Summary
+
+ GenerRNA is a **Transformer decoder-only (GPT-style) language model** trained on RNA nucleotide sequences. By treating RNA as a sequence of tokens, it learns statistical and structural regularities of RNA directly from data and can then **sample entirely new sequences**. GenerRNA was pre-trained on ~16 million RNA sequences (16.09M), encompassing ~17.4 billion nucleotides. Generated RNAs are novel (distinct from training sequences) yet fold into stable secondary structures, and the model can be fine-tuned to design functional RNAs such as protein binders, all without requiring prior structural knowledge.
+
+ ## Key Features
+
+ - 🧬 **De novo RNA generation**: create novel RNA sequences from scratch; no structure, label, or alignment required.
+ - 🎯 **Zero-shot or fine-tuned**: explore RNA space out of the box, or specialize the model for a target family or function.
+ - 🔬 **Structurally plausible outputs**: generated sequences fold into stable secondary structures (low minimum free energy).
+ - 🧩 **Transformer / GPT architecture**: a familiar, scalable decoder-only design (~350M parameters).
+ - ⚡ **Two checkpoints provided**: an updated long-context model and the original historical model.
+ - 📖 **Open & reproducible**: MIT-licensed code, tokenizer, checkpoints, and the data behind the paper's figures.
+
+ ## Model Details
+
+ | | |
+ |---|---|
+ | **Model type** | Generative language model (decoder-only Transformer, GPT-style) |
+ | **Domain** | RNA / nucleotide sequences |
+ | **Parameters** | 350M (24 transformer layers, model dimension 1280) |
+ | **Context window** | 1024 tokens (~4000 nucleotides) |
+ | **Tokenizer** | Byte-Pair Encoding (BPE), vocabulary size 1024 |
+ | **Checkpoints** | `model_updated.pt` (recommended; longer context, deduplicated data) · original split model in `experiment_data/historical_version/` |
+ | **Framework** | PyTorch (≥ 2.0) |
+ | **License** | MIT |
+ | **Paper** | *PLOS ONE* 19(10):e0310814 (2024) · [doi:10.1371/journal.pone.0310814](https://doi.org/10.1371/journal.pone.0310814) |
+ | **Developed by** | Preferred Networks, Inc. & The University of Tokyo |
+
+ ## Intended Use & Use Cases
+
+ GenerRNA is intended for **research in RNA biology, synthetic biology, and RNA-based therapeutics / drug discovery**. Typical use cases include:
+
+ - Exploring the diversity of the RNA sequence space.
+ - Generating candidate RNAs from a target family by fine-tuning on family-specific data.
+ - Designing RNAs with desired functional properties, such as aptamers/binders with high affinity to a target protein (demonstrated for the RNA-binding proteins **ELAVL1** and **SRSF1** in the paper).
+ - Serving as a pre-trained backbone for downstream RNA modeling and design tasks.
+
+ ## Requirements
+
+ A CUDA environment with a minimum of **8 GB VRAM** is required.

  ```
  torch>=2.0
  numpy
@@ -14,68 +103,149 @@ datasets==2.14.4
  tqdm
  ```

+ ## Quickstart
+
+ Clone the repository (it ships with the recommended checkpoint `model_updated.pt` and its `tokenizer/`):

+ ```bash
+ git clone https://huggingface.co/pfnet/GenerRNA
+ cd GenerRNA
  ```
+
+ ### De novo generation (zero-shot)
+
+ ```bash
  python sampling.py \
      --out_path {output_file_path} \
      --max_new_tokens 256 \
+     --ckpt_path model_updated.pt \
+     --tokenizer_path tokenizer
  ```
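
For scripted or batched runs, the zero-shot command in the new README can also be assembled from Python. The following is a minimal sketch, not part of the commit: it passes only the four flags shown in the README and assumes it is run from the repository root. The helper name `build_sampling_cmd` and the output file name are hypothetical, and whether `sampling.py` creates missing output directories is not stated, so the sketch creates one itself.

```python
import subprocess
import sys
from pathlib import Path


def build_sampling_cmd(out_path, ckpt_path="model_updated.pt",
                       tokenizer_path="tokenizer", max_new_tokens=256):
    """Return the argv list for sampling.py, using only the flags shown in the README."""
    return [
        sys.executable, "sampling.py",
        "--out_path", str(out_path),
        "--max_new_tokens", str(max_new_tokens),
        "--ckpt_path", str(ckpt_path),
        "--tokenizer_path", str(tokenizer_path),
    ]


if __name__ == "__main__":
    # Hypothetical output location; outputs/ is listed in the new .gitignore.
    cmd = build_sampling_cmd(Path("outputs") / "zero_shot.txt")
    print(" ".join(cmd))
    # Launch only when sampling.py is present, i.e. when run from the repo root.
    if Path("sampling.py").exists():
        Path("outputs").mkdir(exist_ok=True)
        subprocess.run(cmd, check=True)
```

Building the argument list separately from launching it makes it easy to loop over several output files or checkpoints without re-typing the shell command.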
+
+ > **Want to use the original (historical) model instead?** It is stored as split files. Recombine it and use its dedicated tokenizer:
+ >
+ > ```bash
+ > cat experiment_data/historical_version/model.pt.part-* > model.pt
+ > python sampling.py \
+ >     --out_path {output_file_path} \
+ >     --max_new_tokens 256 \
+ >     --ckpt_path model.pt \
+ >     --tokenizer_path experiment_data/historical_version/tokenizer_bpe_1024
+ > ```
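
On systems without `cat` (for example a plain Windows shell), the recombination step can be reproduced in Python. This is a sketch under the assumption that the parts are meant to be concatenated in lexicographic file-name order, which is what the shell glob in the README does for `part-aa` through `part-ad`; the function name `recombine_parts` is hypothetical.

```python
import shutil
from pathlib import Path


def recombine_parts(parts_dir, out_file, pattern="model.pt.part-*"):
    """Concatenate split files in sorted name order, like `cat model.pt.part-* > out_file`."""
    parts = sorted(Path(parts_dir).glob(pattern))
    if not parts:
        raise FileNotFoundError(f"no files matching {pattern!r} in {parts_dir}")
    with open(out_file, "wb") as dst:
        for part in parts:
            with open(part, "rb") as src:
                # Stream in chunks so large checkpoint parts are never fully held in memory.
                shutil.copyfileobj(src, dst)
    return [p.name for p in parts]


if __name__ == "__main__":
    source_dir = Path("experiment_data/historical_version")
    if source_dir.exists():
        print(recombine_parts(source_dir, "model.pt"))
```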
+
+ ## Training & Fine-tuning
+
+ **1. Tokenize your sequences** (one sequence per line, no header):
+
+ ```bash
  python tokenization.py \
+     --data_dir {path_to_directory_containing_sequence_data} \
      --file_name {file_name_of_sequence_data} \
+     --tokenizer_path tokenizer \
      --out_dir {directory_to_save_tokenized_data} \
      --block_size 256
  ```
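
RNA datasets are often distributed as FASTA, whereas the tokenization step above expects one sequence per line with no header. A minimal conversion sketch follows. It is an illustration rather than part of the repository: the function name `fasta_to_lines` is hypothetical, and the README does not say whether letters should be upper-cased or `T` converted to `U`, so the sketch leaves the characters unchanged.

```python
def fasta_to_lines(fasta_text):
    """Convert FASTA text to a list of sequences, one string per record.

    Header lines (starting with '>') are dropped and wrapped sequence
    lines are joined, giving the headerless one-per-line layout.
    """
    sequences, current = [], []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record begins: flush the sequence collected so far.
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line)
    if current:
        sequences.append("".join(current))
    return sequences


if __name__ == "__main__":
    example = ">seq1 example\nGGGAAAC\nUUUCCC\n>seq2\nACGU\n"
    print("\n".join(fasta_to_lines(example)))
```

Writing the joined result to a text file gives an input in the shape that `--file_name` is described as expecting.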

+ **2. Create a config** based on `configs/example_pretraining.py` (training from scratch) or `configs/example_finetuning.py` (fine-tuning).

+ **3. Train / fine-tune:**

+ ```bash
+ python train.py --config {path_to_your_config_file}
  ```
+
+ ### Train your own tokenizer (optional)
+
+ ```bash
  python train_BPE.py \
+     --txt_file_path {path_to_training_file_one_sequence_per_line} \
      --vocab_size 50256 \
+     --new_tokenizer_path {directory_to_save_trained_tokenizer}
+ ```
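
To give a feel for what training a BPE tokenizer on nucleotide strings involves, here is a toy sketch of the generic byte-pair-encoding merge loop. It is not the code in `train_BPE.py`, whose implementation is not shown in this diff; the function name `learn_bpe_merges` and the alphabetical tie-breaking are choices made for this illustration only.

```python
from collections import Counter


def learn_bpe_merges(sequences, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent token pair."""
    corpus = [list(seq) for seq in sequences]  # start from single nucleotides
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for tokens in corpus:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break  # nothing left to merge
        # Most frequent pair; ties are broken alphabetically for determinism.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        merged = best[0] + best[1]
        for i, tokens in enumerate(corpus):
            out, j = [], 0
            while j < len(tokens):
                if j + 1 < len(tokens) and (tokens[j], tokens[j + 1]) == best:
                    out.append(merged)
                    j += 2
                else:
                    out.append(tokens[j])
                    j += 1
            corpus[i] = out
    return merges, corpus


if __name__ == "__main__":
    rules, tokenized = learn_bpe_merges(["ACGACG", "ACGU"], num_merges=2)
    print(rules)
    print(tokenized)
```

In a scheme like this, each merge adds one token to the vocabulary, so a larger target vocabulary means more merges and longer multi-nucleotide tokens.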
+
+ ## Repository Structure
+
  ```
+ .
+ ├── LICENSE
+ ├── README.md
+ ├── CITATION.cff            # machine-readable citation metadata
+ ├── model.py                # model architecture (decoder-only Transformer)
+ ├── sampling.py             # generate sequences from a trained model
+ ├── tokenization.py         # tokenize sequence data for training
+ ├── train.py                # pre-training / fine-tuning entry point
+ ├── train_BPE.py            # train a new BPE tokenizer
+ ├── model_updated.pt        # recommended checkpoint (longer context, deduplicated data)
+ ├── tokenizer/              # BPE tokenizer for model_updated.pt
+ ├── configs/
+ │   ├── example_pretraining.py
+ │   └── example_finetuning.py
+ └── experiment_data/
+     ├── *.csv               # data underlying the paper's figures
+     ├── pretraining_data.sh # how the pre-training corpus was built (RNAcentral + MMseqs2)
+     └── historical_version/ # original model (split into parts) + its tokenizer
+         ├── model.pt.part-a{a,b,c,d}
+         └── tokenizer_bpe_1024/
+ ```
+
+ ## Training Data
+
+ GenerRNA was pre-trained on RNA sequences from **[RNAcentral](https://rnacentral.org/)** (release 22, which aggregates 51 expert databases). Starting from **34.39 million** raw sequences, deduplication with **[MMseqs2](https://github.com/soedinglab/MMseqs2)** at **80% sequence identity** yielded a pre-training corpus of **~16 million sequences (16.09M), encompassing ~17.4 billion nucleotides**. GenerRNA has a context window of **1024 tokens (~4000 nucleotides)**. The pre-processing pipeline is in [`experiment_data/pretraining_data.sh`](experiment_data/pretraining_data.sh), and the data underlying the paper's figures is provided in `experiment_data/`. See the [paper](https://doi.org/10.1371/journal.pone.0310814) for full dataset details.
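
The figures quoted in the Training Data paragraph can be related to each other with a few lines of arithmetic. The sketch below only restates numbers from the README; the derived quantities (retained fraction, mean sequence length, nucleotides per token) are back-of-the-envelope estimates and are not reported in the diff itself.

```python
raw_sequences = 34.39e6      # RNAcentral sequences before deduplication
kept_sequences = 16.09e6     # sequences left after MMseqs2 at 80% identity
total_nucleotides = 17.4e9   # nucleotides in the pre-training corpus
context_tokens = 1024        # context window, in tokens
context_nucleotides = 4000   # approximate nucleotides covered by one window

retained_fraction = kept_sequences / raw_sequences
mean_length = total_nucleotides / kept_sequences
nt_per_token = context_nucleotides / context_tokens

print(f"retained after dedup: {retained_fraction:.1%}")
print(f"mean sequence length: {mean_length:.0f} nt")
print(f"nucleotides per token: {nt_per_token:.1f}")
```

In other words, a little under half of the raw sequences survive deduplication, the average corpus sequence is on the order of a thousand nucleotides, and one BPE token covers roughly four nucleotides.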
+
+ ## Limitations
+
+ - GenerRNA models RNA **sequence**; it does not explicitly predict tertiary structure or function. Validate candidates with downstream structure/function tools and wet-lab experiments.
+ - A CUDA GPU is required for generation and training as provided.
+ - Zero-shot outputs reflect the natural distribution of the training data; targeting a specific family or property generally requires fine-tuning.
+ - Generated sequences are computational hypotheses and should be experimentally validated before any real-world application.
+
+ ## FAQ
+
+ **What is GenerRNA?**
+ GenerRNA is a generative, pre-trained language model (a decoder-only Transformer) that designs novel RNA sequences *de novo*, without requiring structural information, functional labels, or sequence alignments.
+
+ **How is GenerRNA different from other RNA models?**
+ Most RNA models are *discriminative*: they predict structure or properties from a given sequence. GenerRNA is *generative*: it samples entirely new sequences. To our knowledge, it is the first application of a generative language model to RNA generation.
+
+ **Do I need RNA structure or alignments as input?**
+ No. GenerRNA generates sequences directly from its learned distribution; no structure or alignment is needed.
+
+ **Can I generate RNAs from a specific family or with a specific function?**
+ Yes. Fine-tune GenerRNA on a family- or function-specific dataset. The paper demonstrates designing RNAs with high binding affinity to the proteins ELAVL1 and SRSF1.
+
+ **Which checkpoint should I use?**
+ Use `model_updated.pt` (longer context, trained on deduplicated data). The original split model is kept in `experiment_data/historical_version/` for reproducibility.
+
+ **Is GenerRNA free to use?**
+ Yes. The code and weights are released under the MIT License. Please cite the paper if you use GenerRNA in your work.
+
+ **How do I cite GenerRNA?**
+ See [Citation](#citation) below.
+
+ ## Citation
+
+ If you use GenerRNA, its checkpoints, or this repository in your research, please cite:
+
+ ```bibtex
+ @article{zhao2024generrna,
+   title = {GenerRNA: A generative pre-trained language model for de novo RNA design},
+   author = {Zhao, Yichong and Oono, Kenta and Takizawa, Hiroki and Kotera, Masaaki},
+   journal = {PLOS ONE},
+   volume = {19},
+   number = {10},
+   pages = {e0310814},
+   year = {2024},
+   doi = {10.1371/journal.pone.0310814},
+   publisher = {Public Library of Science}
+ }
+ ```
+
+ **Plain text:** Zhao Y, Oono K, Takizawa H, Kotera M (2024) GenerRNA: A generative pre-trained language model for *de novo* RNA design. PLOS ONE 19(10): e0310814. https://doi.org/10.1371/journal.pone.0310814
+
+ - 📄 **Paper (PLOS ONE):** https://doi.org/10.1371/journal.pone.0310814
+ - 📝 **Preprint (bioRxiv):** https://doi.org/10.1101/2024.02.01.578496
+ - 🤗 **Model:** https://huggingface.co/pfnet/GenerRNA
+
+ ## License

+ The source code is licensed under the **MIT License**; see [`LICENSE`](LICENSE). © 2024 Yichong Zhao, Masaaki Kotera, Kenta Oono, Hiroki Takizawa.