Improve model card for discoverability and add citation metadata

- Rewrite README into a structured Hugging Face model card with YAML
metadata (license, tags, pipeline_tag), badges, Model Summary/Details,
Intended Use, Limitations, and an FAQ
- Fix usage instructions and directory tree (use model_updated.pt +
tokenizer/; historical split model is under experiment_data/historical_version/)
- Align all figures with the PLOS ONE paper (16.09M sequences / 17.4B nt,
350M params, model dimension 1280, 1024-token context, BPE vocab 1024)
- Add BibTeX / plain-text citation and paper + preprint DOI links
- Add CITATION.cff and .gitignore

Files changed (3)

.gitignore +12 -0
CITATION.cff +57 -0
README.md +222 -52

.gitignore ADDED

	@@ -0,0 +1,12 @@

+# Python
+__pycache__/
+*.py[cod]
+# Local generation outputs (not part of the published repository)
+outputs/
+# Claude Code local settings
+.claude/settings.local.json
+# OS / editor cruft
+.DS_Store

CITATION.cff ADDED

	@@ -0,0 +1,57 @@

+cff-version: 1.2.0
+message: "If you use GenerRNA in your research, please cite the article below."
+title: "GenerRNA: A generative pre-trained language model for de novo RNA design"
+abstract: >-
+  GenerRNA is a generative pre-trained language model for de novo RNA design,
+  based on a Transformer decoder-only architecture. It generates novel RNA
+  sequences in a zero-shot manner or after fine-tuning, without requiring prior
+  structural information.
+type: software
+authors:
+  - family-names: Zhao
+    given-names: Yichong
+    affiliation: "The University of Tokyo"
+  - family-names: Oono
+    given-names: Kenta
+    affiliation: "Preferred Networks, Inc."
+  - family-names: Takizawa
+    given-names: Hiroki
+    affiliation: "Preferred Networks, Inc."
+  - family-names: Kotera
+    given-names: Masaaki
+    affiliation: "Preferred Networks, Inc."
+    email: kotera@preferred.jp
+repository-code: "https://huggingface.co/pfnet/GenerRNA"
+url: "https://huggingface.co/pfnet/GenerRNA"
+license: MIT
+keywords:
+  - RNA design
+  - de novo design
+  - generative model
+  - language model
+  - transformer
+  - RNA generation
+  - computational biology
+  - bioinformatics
+  - drug discovery
+preferred-citation:
+  type: article
+  title: "GenerRNA: A generative pre-trained language model for de novo RNA design"
+  authors:
+    - family-names: Zhao
+      given-names: Yichong
+    - family-names: Oono
+      given-names: Kenta
+    - family-names: Takizawa
+      given-names: Hiroki
+    - family-names: Kotera
+      given-names: Masaaki
+  journal: "PLOS ONE"
+  year: 2024
+  month: 10
+  volume: 19
+  issue: 10
+  start: "e0310814"
+  doi: "10.1371/journal.pone.0310814"
+  publisher:
+    name: "Public Library of Science"

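The `preferred-citation` block above carries the same fields as the BibTeX entry this change adds to README.md. As a minimal sketch of that mapping (not part of the commit), the Python below renders a BibTeX entry from those fields. The helper name `cff_to_bibtex` and the dictionary layout are hypothetical, and the values are copied by hand from the CITATION.cff above rather than parsed, so no YAML library is assumed.

```python
# Illustrative only: values copied by hand from the preferred-citation block.
PREFERRED_CITATION = {
    "title": "GenerRNA: A generative pre-trained language model for de novo RNA design",
    "authors": [
        ("Zhao", "Yichong"),
        ("Oono", "Kenta"),
        ("Takizawa", "Hiroki"),
        ("Kotera", "Masaaki"),
    ],
    "journal": "PLOS ONE",
    "year": 2024,
    "volume": 19,
    "issue": 10,
    "start": "e0310814",
    "doi": "10.1371/journal.pone.0310814",
    "publisher": "Public Library of Science",
}


def cff_to_bibtex(citation: dict, key: str) -> str:
    """Render a CFF-style article record as a BibTeX @article entry (hypothetical helper)."""
    authors = " and ".join(f"{family}, {given}" for family, given in citation["authors"])
    fields = [
        ("title", citation["title"]),
        ("author", authors),
        ("journal", citation["journal"]),
        ("volume", citation["volume"]),
        ("number", citation["issue"]),  # CFF "issue" corresponds to BibTeX "number"
        ("pages", citation["start"]),   # the article number stands in for a page range
        ("year", citation["year"]),
        ("doi", citation["doi"]),
        ("publisher", citation["publisher"]),
    ]
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields)
    return f"@article{{{key},\n{body}\n}}"


BIBTEX = cff_to_bibtex(PREFERRED_CITATION, "zhao2024generrna")

if __name__ == "__main__":
    print(BIBTEX)
```

Keeping the two citation formats derivable from one record is one way to stop them drifting apart; here the record is hard-coded purely for illustration.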
README.md CHANGED

@@ -1,11 +1,100 @@
-# GenerRNA
-GenerRNA is a generative RNA language model based on a Transformer decoder-only architecture. It was pre-trained on 30M sequences, encompassing 17B nucleotides.
-Here, you can find all the relevant scripts for running GenerRNA on your machine. GenerRNA enable you to generate RNA sequences in a zero-shot manner for exploring the RNA space, or to fine-tune the model using a specific dataset for generating RNAs belonging to a particular family or possessing specific characteristics.
-# Requirements
-A CUDA environment, and a minimum VRAM of 8GB was required.
-### Dependencies
 ```
 torch>=2.0
 numpy
@@ -14,68 +103,149 @@ datasets==2.14.4
 tqdm
 ```
-# Usage
-Firstly, combine the split model using the command `cat model.pt.part-* > model.pt.recombined`
-#### Directory tree
-```
-.
-├── LICENSE
-├── README.md
-├── configs
-│   ├── example_finetuning.py
-│   └── example_pretraining.py
-├── experiments_data
-├── model.pt.part-aa # splited bin data of *HISTORICAL* model (shorter context window, less VRAM comsuption)
-├── model.pt.part-ab
-├── model.pt.part-ac
-├── model.pt.part-ad
-├── model_updated.pt # *NEWER* model, with longer context windows and being trained on a deduplicated dataset
-├── model.py         # define the architecture
-├── sampling.py      # script to generate sequences
-├── tokenization.py  # preparete data
-├── tokenizer_bpe_1024
-│   ├── tokenizer.json
-│   ├── ....
-├── train.py # script for training/fine-tuning
-```
-### De novo Generation in a zero-shot fashion
-Usage example:
 ```
 python sampling.py \
     --out_path {output_file_path} \
     --max_new_tokens 256 \
-    --ckpt_path {model.pt} \
-    --tokenizer_path {path_to_tokenizer_directory, e.g /tokenizer_bpe_1024}
-```
-### Pre-training or Fine-tuning on your own sequences
-First, tokenize your sequence data, ensuring each sequence is on a separate line and there is no header.
 ```
 python tokenization.py \
-    --data_dir {path_to_the_directory_containing_sequence_data} \
     --file_name {file_name_of_sequence_data} \
-    --tokenizer_path {path_to_tokenizer_directory}  \
     --out_dir {directory_to_save_tokenized_data} \
     --block_size 256
 ```
-Next, refer to `./configs/example_**.py` to create a config file of GPT model.
-Lastly, excute following command:
-```
-python train.py \
-    --config {path_to_your_config_file}
-```
-### Train your own tokenizer
-Usage example:
 ```
 python train_BPE.py \
-    --txt_file_path {path_to_training_file(txt,each sequence is on a separate line)} \
     --vocab_size 50256 \
-    --new_tokenizer_path {directory_to_save_trained_tokenizer} \
 ```
-# License
-The source code is licensed MIT. See `LICENSE`

+---
+license: mit
+tags:
+  - biology
+  - rna
+  - rna-design
+  - rna-generation
+  - de-novo-design
+  - generative-model
+  - language-model
+  - transformer
+  - gpt
+  - nucleotide
+  - bioinformatics
+  - computational-biology
+  - drug-discovery
+  - molecular-design
+pipeline_tag: text-generation
+---
+# GenerRNA: A Generative Language Model for *de novo* RNA Design
+[![Paper (PLOS ONE)](https://img.shields.io/badge/Paper-PLOS%20ONE%202024-orange)](https://doi.org/10.1371/journal.pone.0310814)
+[![Preprint (bioRxiv)](https://img.shields.io/badge/Preprint-bioRxiv-red)](https://doi.org/10.1101/2024.02.01.578496)
+[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](./LICENSE)
+[![Model on Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-pfnet%2FGenerRNA-yellow)](https://huggingface.co/pfnet/GenerRNA)
+**GenerRNA is a generative pre-trained language model for *de novo* RNA sequence design.** It is a Transformer (decoder-only, GPT-style) model that learns the "language" of RNA from millions of natural sequences and can generate novel, realistic RNA sequences **without any structural input, functional label, or sequence alignment**. To our knowledge, GenerRNA is the first application of a generative language model to RNA generation.
+With GenerRNA you can:
+- **Generate RNA in a zero-shot manner** to explore the RNA sequence space, or
+- **Fine-tune on your own dataset** to generate RNAs belonging to a particular family or possessing specific characteristics (e.g., high binding affinity to a target protein).
+> Developed by [Preferred Networks, Inc.](https://www.preferred.jp/en/) and The University of Tokyo. Introduced in *PLOS ONE* (2024): [GenerRNA: A generative pre-trained language model for *de novo* RNA design](https://doi.org/10.1371/journal.pone.0310814).
+---
+## Table of Contents
+- [Model Summary](#model-summary)
+- [Key Features](#key-features)
+- [Model Details](#model-details)
+- [Intended Use & Use Cases](#intended-use--use-cases)
+- [Requirements](#requirements)
+- [Quickstart](#quickstart)
+- [Training & Fine-tuning](#training--fine-tuning)
+- [Repository Structure](#repository-structure)
+- [Training Data](#training-data)
+- [Limitations](#limitations)
+- [FAQ](#faq)
+- [Citation](#citation)
+- [License](#license)
+---
+## Model Summary
+GenerRNA is a **Transformer decoder-only (GPT-style) language model** trained on RNA nucleotide sequences. By treating RNA as a sequence of tokens, it learns statistical and structural regularities of RNA directly from data and can then **sample entirely new sequences**. GenerRNA was pre-trained on ~16 million RNA sequences (16.09M), encompassing ~17.4 billion nucleotides. Generated RNAs are novel (distinct from training sequences) yet fold into stable secondary structures, and the model can be fine-tuned to design functional RNAs such as protein binders — all without requiring prior structural knowledge.
+## Key Features
+- 🧬 **De novo RNA generation** — create novel RNA sequences from scratch; no structure, label, or alignment required.
+- 🎯 **Zero-shot or fine-tuned** — explore RNA space out of the box, or specialize the model for a target family or function.
+- 🔬 **Structurally plausible outputs** — generated sequences fold into stable secondary structures (low minimum free energy).
+- 🧩 **Transformer / GPT architecture** — a familiar, scalable decoder-only design (~350M parameters).
+- ⚡ **Two checkpoints provided** — an updated long-context model and the original historical model.
+- 📖 **Open & reproducible** — MIT-licensed code, tokenizer, checkpoints, and the data behind the paper's figures.
+## Model Details
+|  |  |
+|---|---|
+| **Model type** | Generative language model (decoder-only Transformer, GPT-style) |
+| **Domain** | RNA / nucleotide sequences |
+| **Parameters** | 350M (24 transformer layers, model dimension 1280) |
+| **Context window** | 1024 tokens (~4000 nucleotides) |
+| **Tokenizer** | Byte-Pair Encoding (BPE), vocabulary size 1024 |
+| **Checkpoints** | `model_updated.pt` (recommended; longer context, deduplicated data) · original split model in `experiment_data/historical_version/` |
+| **Framework** | PyTorch (≥ 2.0) |
+| **License** | MIT |
+| **Paper** | *PLOS ONE* 19(10):e0310814 (2024) · [doi:10.1371/journal.pone.0310814](https://doi.org/10.1371/journal.pone.0310814) |
+| **Developed by** | Preferred Networks, Inc. & The University of Tokyo |
+## Intended Use & Use Cases
+GenerRNA is intended for **research in RNA biology, synthetic biology, and RNA-based therapeutics / drug discovery**. Typical use cases include:
+- Exploring the diversity of the RNA sequence space.
+- Generating candidate RNAs from a target family by fine-tuning on family-specific data.
+- Designing RNAs with desired functional properties, such as aptamers/binders with high affinity to a target protein (demonstrated for the RNA-binding proteins **ELAVL1** and **SRSF1** in the paper).
+- Serving as a pre-trained backbone for downstream RNA modeling and design tasks.
+## Requirements
+A CUDA environment with a minimum of **8 GB VRAM** is required.
 ```
 torch>=2.0
 numpy
 tqdm
 ```
+## Quickstart
+Clone the repository (it ships with the recommended checkpoint `model_updated.pt` and its `tokenizer/`):
+```bash
+git clone https://huggingface.co/pfnet/GenerRNA
+cd GenerRNA
 ```
+### De novo generation (zero-shot)
+```bash
 python sampling.py \
     --out_path {output_file_path} \
     --max_new_tokens 256 \
+    --ckpt_path model_updated.pt \
+    --tokenizer_path tokenizer
 ```
+> **Want to use the original (historical) model instead?** It is stored as split files. Recombine it and use its dedicated tokenizer:
+>
+> ```bash
+> cat experiment_data/historical_version/model.pt.part-* > model.pt
+> python sampling.py \
+>     --out_path {output_file_path} \
+>     --max_new_tokens 256 \
+>     --ckpt_path model.pt \
+>     --tokenizer_path experiment_data/historical_version/tokenizer_bpe_1024
+> ```
+## Training & Fine-tuning
+**1. Tokenize your sequences** (one sequence per line, no header):
+```bash
 python tokenization.py \
+    --data_dir {path_to_directory_containing_sequence_data} \
     --file_name {file_name_of_sequence_data} \
+    --tokenizer_path tokenizer \
     --out_dir {directory_to_save_tokenized_data} \
     --block_size 256
 ```
+**2. Create a config** based on `configs/example_pretraining.py` (training from scratch) or `configs/example_finetuning.py` (fine-tuning).
+**3. Train / fine-tune:**
+```bash
+python train.py --config {path_to_your_config_file}
 ```
+### Train your own tokenizer (optional)
+```bash
 python train_BPE.py \
+    --txt_file_path {path_to_training_file_one_sequence_per_line} \
     --vocab_size 50256 \
+    --new_tokenizer_path {directory_to_save_trained_tokenizer}
+```
+## Repository Structure
 ```
+.
+├── LICENSE
+├── README.md
+├── CITATION.cff               # machine-readable citation metadata
+├── model.py                   # model architecture (decoder-only Transformer)
+├── sampling.py                # generate sequences from a trained model
+├── tokenization.py            # tokenize sequence data for training
+├── train.py                   # pre-training / fine-tuning entry point
+├── train_BPE.py               # train a new BPE tokenizer
+├── model_updated.pt           # recommended checkpoint (longer context, deduplicated data)
+├── tokenizer/                 # BPE tokenizer for model_updated.pt
+├── configs/
+│   ├── example_pretraining.py
+│   └── example_finetuning.py
+└── experiment_data/
+    ├── *.csv                  # data underlying the paper's figures
+    ├── pretraining_data.sh    # how the pre-training corpus was built (RNAcentral + MMseqs2)
+    └── historical_version/    # original model (split into parts) + its tokenizer
+        ├── model.pt.part-a{a,b,c,d}
+        └── tokenizer_bpe_1024/
+```
+## Training Data
+GenerRNA was pre-trained on RNA sequences from **[RNAcentral](https://rnacentral.org/)** (release 22, which aggregates 51 expert databases). Starting from **34.39 million** raw sequences, deduplication with **[MMseqs2](https://github.com/soedinglab/MMseqs2)** at **80% sequence identity** yielded a pre-training corpus of **~16 million sequences (16.09M), encompassing ~17.4 billion nucleotides**. GenerRNA has a context window of **1024 tokens (~4000 nucleotides)**. The pre-processing pipeline is in [`experiment_data/pretraining_data.sh`](experiment_data/pretraining_data.sh), and the data underlying the paper's figures is provided in `experiment_data/`. See the [paper](https://doi.org/10.1371/journal.pone.0310814) for full dataset details.
+## Limitations
+- GenerRNA models RNA **sequence**; it does not explicitly predict tertiary structure or function. Validate candidates with downstream structure/function tools and wet-lab experiments.
+- A CUDA GPU is required for generation and training as provided.
+- Zero-shot outputs reflect the natural distribution of the training data; targeting a specific family or property generally requires fine-tuning.
+- Generated sequences are computational hypotheses and should be experimentally validated before any real-world application.
+## FAQ
+**What is GenerRNA?**
+GenerRNA is a generative, pre-trained language model (a decoder-only Transformer) that designs novel RNA sequences *de novo*, without requiring structural information, functional labels, or sequence alignments.
+**How is GenerRNA different from other RNA models?**
+Most RNA models are *discriminative* — they predict structure or properties from a given sequence. GenerRNA is *generative*: it samples entirely new sequences. To our knowledge, it is the first application of a generative language model to RNA generation.
+**Do I need RNA structure or alignments as input?**
+No. GenerRNA generates sequences directly from its learned distribution; no structure or alignment is needed.
+**Can I generate RNAs from a specific family or with a specific function?**
+Yes. Fine-tune GenerRNA on a family- or function-specific dataset. The paper demonstrates designing RNAs with high binding affinity to the proteins ELAVL1 and SRSF1.
+**Which checkpoint should I use?**
+Use `model_updated.pt` (longer context, trained on deduplicated data). The original split model is kept in `experiment_data/historical_version/` for reproducibility.
+**Is GenerRNA free to use?**
+Yes. The code and weights are released under the MIT License. Please cite the paper if you use GenerRNA in your work.
+**How do I cite GenerRNA?**
+See [Citation](#citation) below.
+## Citation
+If you use GenerRNA, its checkpoints, or this repository in your research, please cite:
+```bibtex
+@article{zhao2024generrna,
+  title     = {GenerRNA: A generative pre-trained language model for de novo RNA design},
+  author    = {Zhao, Yichong and Oono, Kenta and Takizawa, Hiroki and Kotera, Masaaki},
+  journal   = {PLOS ONE},
+  volume    = {19},
+  number    = {10},
+  pages     = {e0310814},
+  year      = {2024},
+  doi       = {10.1371/journal.pone.0310814},
+  publisher = {Public Library of Science}
+}
+```
+**Plain text:** Zhao Y, Oono K, Takizawa H, Kotera M (2024) GenerRNA: A generative pre-trained language model for *de novo* RNA design. PLOS ONE 19(10): e0310814. https://doi.org/10.1371/journal.pone.0310814
+- 📄 **Paper (PLOS ONE):** https://doi.org/10.1371/journal.pone.0310814
+- 📝 **Preprint (bioRxiv):** https://doi.org/10.1101/2024.02.01.578496
+- 🤗 **Model:** https://huggingface.co/pfnet/GenerRNA
+## License
+The source code is licensed under the **MIT License** — see [`LICENSE`](LICENSE). © 2024 Yichong Zhao, Masaaki Kotera, Kenta Oono, Hiroki Takizawa.
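The headline figures in the new README (34.39M raw sequences, 16.09M after deduplication, 17.4B nucleotides, and a 1024-token context of roughly 4000 nucleotides) can be cross-checked with simple arithmetic. The sketch below is not part of the commit. The variable names are hypothetical, and the derived values are back-of-envelope estimates from the rounded numbers quoted in the README, not figures reported in the paper.

```python
# Figures quoted in the README's Model Details and Training Data sections.
RAW_SEQUENCES = 34.39e6      # RNAcentral sequences before deduplication
DEDUP_SEQUENCES = 16.09e6    # sequences kept after MMseqs2 deduplication
TOTAL_NUCLEOTIDES = 17.4e9   # nucleotides in the pre-training corpus
CONTEXT_TOKENS = 1024        # context window, in tokens
CONTEXT_NUCLEOTIDES = 4000   # approximate context window, in nucleotides

# Fraction of the raw sequences that survive deduplication.
retained_fraction = DEDUP_SEQUENCES / RAW_SEQUENCES

# Implied mean sequence length in the deduplicated corpus.
mean_length_nt = TOTAL_NUCLEOTIDES / DEDUP_SEQUENCES

# Implied average number of nucleotides covered by one BPE token.
nt_per_token = CONTEXT_NUCLEOTIDES / CONTEXT_TOKENS

# Rough nucleotide span of the 256-token settings used in the usage examples,
# assuming (an assumption, not a stated fact) that the same ratio holds there.
approx_nt_for_256_tokens = 256 * nt_per_token

if __name__ == "__main__":
    print(f"retained after deduplication: {retained_fraction:.1%}")
    print(f"mean sequence length: {mean_length_nt:.0f} nt")
    print(f"nucleotides per token: {nt_per_token:.2f}")
    print(f"256 tokens span about {approx_nt_for_256_tokens:.0f} nt")
```

Under these numbers, a little under half of the raw sequences remain after deduplication, and one token covers just under four nucleotides on average.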