---
license: apache-2.0
datasets:
- Moreza009/AAV_datasets
base_model:
- nferruz/ProtGPT2
---

<h1 align="center">AAVGen: Precision Engineering of Adeno-associated Virus for Renal Selective Targeting</h1>

<br>

<p align="center">
  <a href="https://opensource.org/licenses/Apache-2.0">
    <img src="https://img.shields.io/badge/License-Apache%202.0-blue.svg" alt="License: Apache 2.0">
  </a>
  <a href="https://www.python.org/downloads/">
    <img src="https://img.shields.io/badge/python-3.8+-blue.svg" alt="Python 3.8+">
  </a>
  <a href="https://github.com/mohammad-gh009/AAVGen">
    <img src="https://img.shields.io/badge/GitHub-Code-blue.svg?logo=github" alt="GitHub">
  </a>
  <a href="https://arxiv.org/abs/2508.18579">
    <img src="https://img.shields.io/badge/arXiv-2508.18579-b31b1b.svg" alt="arXiv">
  </a>
</p>

<p align="center">
  <img src="https://github.com/mohammad-gh009/AAVGen/blob/main/assets/Logo.png" alt="Logo" width="500">
</p>

---

## Abstract

Adeno-associated viruses (AAVs) are promising vectors for gene therapy, but their native serotypes face limitations in tissue tropism, immune evasion, and production efficiency. Engineering capsids to overcome these hurdles is challenging due to the vast sequence space and the difficulty of simultaneously optimizing multiple functional properties. This complexity is compounded in the kidney, which presents unique anatomical barriers and cellular targets that demand precise and efficient vector engineering. Here, we present AAVGen, a generative artificial intelligence framework for de novo design of AAV capsids with enhanced multi-trait profiles. AAVGen integrates a protein language model (PLM) with supervised fine-tuning (SFT) and a reinforcement learning technique termed Group Sequence Policy Optimization (GSPO). The model is guided by a composite reward signal derived from three ESM-2-based regression predictors, each trained to predict a key property: production fitness, kidney tropism, and thermostability. Our results demonstrate that AAVGen produces a diverse library of novel VP1 protein sequences. In silico validation revealed that the majority of generated variants show superior predicted performance across all three indices, indicating successful multi-objective optimization. Furthermore, structural analysis via AlphaFold3 confirms that the generated sequences preserve the canonical capsid fold despite sequence diversification. AAVGen establishes a foundation for data-driven viral vector engineering, accelerating the development of next-generation AAV vectors with tailored functional characteristics.

<br>

---

## Model Details

### Model Description

AAVGen is a generative protein language model designed for precision engineering of Adeno-associated Virus (AAV) capsid sequences with optimized multi-property profiles. It was developed to generate novel AAV capsid variants with improved production fitness, kidney tropism, and thermostability relative to wild-type AAV2. The model was trained using a two-stage pipeline: Supervised Fine-Tuning (SFT) on AAV2 and AAV9 VP1 capsid datasets, followed by reinforcement learning via Group Sequence Policy Optimization (GSPO) guided by ESM-2-based regression reward models.

- **Developed by:** Mohammadreza Ghaffarzadeh-Esfahani, Yousof Gheisari
- **Institution:** Regenerative Medicine Research Center & Department of Genetics and Molecular Biology, Isfahan University of Medical Sciences, Isfahan, Iran
- **Corresponding Author:** Yousof Gheisari (ygheisari@med.mui.ac.ir)
- **Model type:** Causal Language Model (Generative Protein Language Model)
- **Language(s):** Protein sequences (amino acid alphabet)
- **License:** Apache-2.0
- **Finetuned from model:** [nferruz/ProtGPT2](https://huggingface.co/nferruz/ProtGPT2)

### Model Sources

- **Repository:** [Moreza009/AAVGen](https://huggingface.co/Moreza009/AAVGen)
- **Dataset:** [Moreza009/AAV_datasets](https://huggingface.co/datasets/Moreza009/AAV_datasets)

---

## Uses

### Direct Use

AAVGen can be used to generate novel AAV capsid protein sequences (VP1) by providing a start token (`<|endoftext|>\nM`). The generated sequences are intended for in silico screening, functional evaluation, and downstream experimental validation in AAV-based gene therapy development. The model is particularly suited for generating capsid variants optimized for renal tropism, high production fitness, and thermal stability.

### Downstream Use

AAVGen-generated sequences can be used as candidates for:
- Directed evolution and rational capsid engineering pipelines
- Scoring and selection using the companion regression models ([Moreza009/AAV-Fitness](https://huggingface.co/Moreza009/AAV-Fitness), [Moreza009/AAV-Thermostability](https://huggingface.co/Moreza009/AAV-Thermostability), [Moreza009/AAV-Kidney-Tropism](https://huggingface.co/Moreza009/AAV-Kidney-Tropism))
- Structural modeling with tools such as AlphaFold3
- Gene therapy vector development targeting the kidney

### Out-of-Scope Use

- Generation of capsid sequences for serotypes substantially different from AAV2/AAV9 without additional fine-tuning
- Direct clinical or therapeutic use without extensive experimental validation
- Applications requiring absolute sequence novelty guarantees (a small fraction of generated sequences may match training set variants)

---

## Bias, Risks, and Limitations

- The model was trained primarily on AAV2 and AAV9 VP1 sequences; generated sequences will be heavily biased toward these serotypes.
- Regression-based reward models carry inherent prediction uncertainty (MAE-based margins are used to flag uncertain predictions). Functional classifications should be treated as predictions, not experimental ground truth.
- Kidney tropism and thermostability regression models showed moderate predictive correlation (Spearman ρ = 0.35 and 0.26, respectively), meaning reward signals for these properties are noisier than for production fitness.
- Approximately 4% of generated sequences are repetitive duplicates; downstream pipelines should deduplicate outputs.
- None of the generated sequences have been experimentally validated at the time of publication.

### Recommendations

Users should employ the companion ESM-2-based regression models for in silico pre-screening of generated sequences before experimental follow-up. Sequences classified as "Best" or "Good" (relative to WT scores and MAE margins) are recommended for prioritization. Structural validation using AlphaFold3 or equivalent tools is strongly encouraged before any experimental work.

---

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "Moreza009/AAVGen"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
model.eval()

# Generate AAV capsid sequences from the start token "<|endoftext|>\nM"
prompt = tokenizer.eos_token + "\n" + "M"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=754,   # matches the GSPO max completion length
        do_sample=True,
        temperature=1.0,
        top_p=1.0,
        repetition_penalty=1.0,
    )

generated_sequence = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_sequence)
```

---

## Training Details

### Training Data

AAVGen was trained on AAV2 and AAV9 VP1 capsid sequence datasets available at [Moreza009/AAV_datasets](https://huggingface.co/datasets/Moreza009/AAV_datasets). The dataset includes sequences paired with experimental scores for production fitness, kidney tropism, and thermostability. For AAV9 sequences, the variable insert region was reconstructed by inserting the variable AA segment at position 588 of the full VP1 backbone. Only sequences with a non-negative fitness score were retained, and duplicate sequences were removed prior to training.
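
The AAV9 reconstruction step above can be sketched in a few lines. This is a minimal illustration, not the project's code: the helper name is made up here, and treating "position 588" as a 1-indexed insertion point after residue 588 is an assumption.

```python
# Hypothetical sketch of the AAV9 insert reconstruction described above.
# Assumption: "position 588" means the variable amino-acid segment is
# spliced in after residue 588 (1-indexed) of the full VP1 backbone.

def reconstruct_vp1(backbone: str, insert: str, position: int = 588) -> str:
    """Insert a variable amino-acid segment after `position` (1-indexed)."""
    if not 0 < position <= len(backbone):
        raise ValueError("position outside backbone")
    return backbone[:position] + insert + backbone[position:]

# Toy example with a short fake backbone (a real VP1 is ~735 aa)
print(reconstruct_vp1("MAADGYLPDW", "QQQ", position=5))  # MAADGQQQYLPDW
```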

### Training Procedure

Training proceeded in two stages:

**Stage 1 — Supervised Fine-Tuning (SFT):**
ProtGPT2 was fine-tuned on the combined AAV2 and AAV9 VP1 sequence dataset to learn foundational residue–residue co-evolutionary relationships. Sequences were formatted in FASTA-like style with `<|endoftext|>` tokens as delimiters and line breaks every 60 residues.
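
The SFT input formatting can be sketched as follows. The helper name and the exact placement of the delimiter tokens are assumptions; only the 60-residue wrapping and the `<|endoftext|>` delimiters come from the description above.

```python
# Minimal sketch of the FASTA-like SFT formatting described above.
import textwrap

def format_for_sft(sequence: str, width: int = 60) -> str:
    # Wrap the amino-acid sequence at 60 residues per line and delimit
    # it with <|endoftext|> tokens (exact placement is an assumption).
    body = "\n".join(textwrap.wrap(sequence, width))
    return f"<|endoftext|>\n{body}\n<|endoftext|>"

print(format_for_sft("M" * 130))  # two 60-residue lines plus a 10-residue line
```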

**Stage 2 — Reinforcement Learning via GSPO:**
The SFT model was further optimized using the GSPO framework from TRL, guided by a composite reward function consisting of five components:
1. **Production fitness reward** (weight: 1.0) — predicted by `Moreza009/AAV-Fitness`
2. **Kidney tropism reward** (weight: 1.0) — predicted by `Moreza009/AAV-Kidney-Tropism`
3. **Thermostability reward** (weight: 1.0) — predicted by `Moreza009/AAV-Thermostability`
4. **Length control reward** (weight: 0.1) — penalizes sequences deviating from target VP1 length (735 aa; σ=3)
5. **Uniqueness reward** (weight: 0.1) — penalizes repeated sequences within a training batch

Reward signals from the three regression models were mapped through a custom **reward logic mapper** that translates raw predicted scores into reward values by comparing them against the WT AAV2 score. Only sequences exceeding the WT score receive positive reward, ensuring that optimization is anchored to the natural reference.
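
The composite reward can be sketched as below. The function names, the linear shape of the per-property mapping, and the Gaussian form of the length reward are assumptions; the WT anchoring rule, the 735 aa / σ=3 length target, the batch-uniqueness penalty, and the 1.0/1.0/1.0/0.1/0.1 weights come from the description above.

```python
import math

def property_reward(predicted: float, wt_score: float) -> float:
    # WT-anchored mapping: only scores exceeding the WT AAV2 score earn
    # positive reward (the exact mapping shape is an assumption).
    return max(0.0, predicted - wt_score)

def length_reward(length: int, target: int = 735, sigma: float = 3.0) -> float:
    # Gaussian-shaped reward around the target VP1 length (735 aa, sigma=3).
    return math.exp(-((length - target) ** 2) / (2 * sigma ** 2))

def uniqueness_reward(seq: str, batch: list[str]) -> float:
    # Reward sequences appearing exactly once in the training batch.
    return 1.0 if batch.count(seq) == 1 else 0.0

def composite_reward(fitness: float, tropism: float, stability: float,
                     seq: str, batch: list[str],
                     wt: tuple = (0.0, 0.0, 0.0)) -> float:
    # Weights from the model card: 1.0 / 1.0 / 1.0 / 0.1 / 0.1.
    return (1.0 * property_reward(fitness, wt[0])
            + 1.0 * property_reward(tropism, wt[1])
            + 1.0 * property_reward(stability, wt[2])
            + 0.1 * length_reward(len(seq))
            + 0.1 * uniqueness_reward(seq, batch))
```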

#### Preprocessing

- Sequences formatted with `<|endoftext|>` as start/end tokens
- FASTA-style line wrapping at 60 residues for SFT
- AAV9 inserts reconstructed by inserting variable regions at position 588 of the full VP1 backbone
- Duplicate sequences removed; fitness score ≥ 0 filter applied

#### Training Hyperparameters

**SFT Phase:**
- **Training regime:** fp16 mixed precision
- Base model: `nferruz/ProtGPT2`
- Learning rate: 1e-4 (linear schedule)
- Batch size per device: 4; gradient accumulation: 4
- Epochs: 3
- Max sequence length: 300 tokens
- Optimizer: AdamW (β1=0.9, β2=0.999, ε=1e-8)
- Weight decay: 0.01; warmup ratio: 0.01

**GSPO Phase:**
- **Training regime:** fp16 mixed precision
- Learning rate: 2e-6 (cosine schedule)
- Batch size per device: 4; gradient accumulation: 8
- Number of generations per step: 32
- Epochs: 5
- Max completion length: 754 tokens
- Optimizer: AdamW (β1=0.9, β2=0.999, ε=1e-8)
- Weight decay: 0.01; warmup steps: 50
- Importance sampling level: sequence
- Gradient checkpointing: enabled

#### Speeds, Sizes, and Times

All training was performed on a server with an NVIDIA V100 GPU (32 GB VRAM) and an AMD EPYC 7502 CPU (32 GB RAM):
- SFT training: ~9 hours 5 minutes
- GSPO training: ~9 hours 38 minutes

---

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

Evaluation was performed on a set of 500,000 sequences generated by AAVGen, initiated with the fixed start token `"M"`, using sampling-based decoding (temperature=1.0, top_p=1.0, top_k=None) with a maximum length of 500 tokens and a batch size of 64.

#### Factors

Evaluation was stratified across three dimensions: sequence quality/novelty, predicted functional properties, and structural fidelity to WT AAV2.

#### Metrics

- **Uniqueness:** Fraction of non-duplicate sequences in the generated pool
- **Length distribution:** Comparison of generated sequence lengths to the training set (median, IQR)
- **Sequence identity and similarity:** Global pairwise alignment to WT AAV2 (Biopython PairwiseAligner; match=2, mismatch=-1, gap open=-2, gap extend=-0.5)
- **Edit distance:** Minimum residue-level edits from generated sequence to WT AAV2
- **Functional classification:** Predicted scores from regression models classified as "Best" (>WT + 4×MAE), "Good" (WT + 1–4×MAE), "Uncertain" (WT to WT + 1×MAE), or "Bad" (<WT)
- **Spearman correlation:** Between predicted scores for each pair of optimized properties
- **Structural RMSD:** Cα RMSD between AlphaFold3-predicted structures of generated variants and the WT AAV2 PDB structure (VP3 subunit)
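
The functional classification above reduces to a simple threshold rule on the predicted score, anchored to the WT score and the reward model's mean absolute error (MAE). The function name is illustrative, and the strict-vs-inclusive boundary conventions are assumptions.

```python
# Sketch of the Best/Good/Uncertain/Bad thresholds described above.

def classify(score: float, wt: float, mae: float) -> str:
    if score > wt + 4 * mae:
        return "Best"       # more than 4 MAE margins above WT
    if score > wt + 1 * mae:
        return "Good"       # between 1 and 4 MAE margins above WT
    if score >= wt:
        return "Uncertain"  # above WT but within 1 MAE margin
    return "Bad"            # below WT

print(classify(10.0, 5.0, 1.0))  # Best
print(classify(7.0, 5.0, 1.0))   # Good
print(classify(5.5, 5.0, 1.0))   # Uncertain
print(classify(4.0, 5.0, 1.0))   # Bad
```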

### Results

**Sequence Diversity and Fidelity:**
- ~4% of the 500,000 generated sequences were duplicates
- After deduplication, 1,787 sequences matched training set entries; 230 were identical to WT AAV2; none matched WT AAV9
- Length distribution closely matched the training data (generated median: 741, IQR: 740–743; training median: 741, IQR: 737–743)
- High sequence similarity to WT AAV2: median identity 99.18% (IQR: 98.91–99.32%), median similarity 99.32% (IQR: 99.05–99.46%)
- Median edit distance from WT AAV2: 13% (IQR: 10–15%)
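
For illustration, a residue-level edit distance can be computed with a standard Levenshtein dynamic program; note the reported identity/similarity values come from Biopython's PairwiseAligner with the alignment parameters listed under Metrics, not from this pure-Python sketch.

```python
def edit_distance(a: str, b: str) -> int:
    # Classic Levenshtein DP: minimum substitutions, insertions, deletions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution / match
            ))
        prev = curr
    return prev[-1]

print(edit_distance("MAADGYLPDW", "MAADGYLPEW"))  # 1
```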

**Functional Property Analysis (436,765 unique, non-WT, non-training sequences):**

| Property | Best | Good | Uncertain | Bad |
|---|---|---|---|---|
| Production Fitness | 435,448 (99.70%) | 669 (0.15%) | 128 (0.03%) | 559 (0.13%) |
| Kidney Tropism | 1 (<0.01%) | 491,439 (98.27%) | 5,416 (1.24%) | 2,155 (0.43%) |
| Thermostability | 0 (0%) | 386,844 (88.57%) | 43,626 (9.99%) | 6,295 (1.44%) |

Strong positive Spearman correlations were observed between all three predicted property pairs, indicating co-optimization without property trade-offs.

**Structural Analysis:**
AlphaFold3-based structural modeling of 500 randomly sampled "Good"/"Best" sequences showed high structural conservation relative to WT AAV2 (VP3), with low RMSD values. RMSD was negatively correlated with predicted functional scores, confirming that sequences with higher predicted performance better preserved the WT structural scaffold.

#### Summary

AAVGen generates a diverse library of novel AAV capsid variants that retain high structural and sequence similarity to WT AAV2 while exhibiting substantially improved predicted production fitness, kidney tropism, and thermostability. The vast majority of generated sequences are classified as "Good" or "Best" across all three design objectives, with strong co-optimization across properties.

---

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute).

- **Hardware Type:** NVIDIA V100 GPU (32 GB VRAM)
- **Hours used:** ~46 hours total (across all regression models, SFT, and GSPO phases)
- **Cloud Provider:** On-premise institutional server
- **Compute Region:** Isfahan, Iran
- **Carbon Emitted:** Not calculated

---

## Technical Specifications

### Model Architecture and Objective

AAVGen is based on ProtGPT2, a GPT-2 architecture pre-trained on UniRef50 protein sequences. The model uses a causal language modeling (CLM) objective during SFT and a GSPO-based policy optimization objective during RL fine-tuning. The GSPO framework optimizes the model toward a composite reward derived from three ESM-2-based regression models predicting production fitness, kidney tropism, and thermostability, plus two auxiliary rewards for sequence length control and batch uniqueness.

### Compute Infrastructure

#### Hardware

- GPU: NVIDIA V100, 32 GB VRAM
- CPU: AMD EPYC 7502, 32 GB RAM

#### Software

- Python, PyTorch
- Transformers (Hugging Face)
- TRL (GRPO/GSPO framework)
- Datasets (Hugging Face)
- scikit-learn, Biopython 1.85
- AlphaFold3 (structural evaluation)
- PyMOL (structural alignment)

---

## Citation

If you use AAVGen in your research, please cite:

**BibTeX:**
```bibtex
@article{ghaffarzadeh2025aavgen,
  title={AAVGen: Precision Engineering of Adeno-associated Virus for Renal Selective Targeting},
  author={Ghaffarzadeh-Esfahani, Mohammadreza and Gheisari, Yousof},
  journal={[Journal Name]},
  year={2025},
  institution={Regenerative Medicine Research Center, Isfahan University of Medical Sciences}
}
```

**APA:**
Ghaffarzadeh-Esfahani, M., & Gheisari, Y. (2025). AAVGen: Precision Engineering of Adeno-associated Virus for Renal Selective Targeting. *[Journal Name]*. Isfahan University of Medical Sciences.

---

## Model Card Authors

Mohammadreza Ghaffarzadeh-Esfahani

## Model Card Contact

Mohammadreza Ghaffarzadeh-Esfahani  
Email: mreghafarzadeh@gmail.com