---
license: apache-2.0
datasets:
- Moreza009/AAV_datasets
base_model:
- nferruz/ProtGPT2
tags:
- biology
- protein
- medicine
- capsid_engineering
pipeline_tag: text-generation
---
AAVGen: Precision Engineering of Adeno-associated Virus for Renal Selective Targeting
---
## Abstract
Adeno-associated viruses (AAVs) are promising vectors for gene therapy, but their native serotypes face limitations in tissue tropism, immune evasion, and production efficiency. Engineering capsids to overcome these hurdles is challenging due to the vast sequence space and the difficulty of simultaneously optimizing multiple functional properties. This complexity is compounded in the kidney, which presents unique anatomical barriers and cellular targets that require precise and efficient vector engineering. Here, we present AAVGen, a generative artificial intelligence framework for de novo design of AAV capsids with enhanced multi-trait profiles. AAVGen integrates a protein language model (PLM) with supervised fine-tuning (SFT) and a reinforcement learning technique termed Group Sequence Policy Optimization (GSPO). The model is guided by a composite reward signal derived from three ESM-2-based regression predictors, each trained to predict a key property: production fitness, kidney tropism, and thermostability. Our results demonstrate that AAVGen produces a diverse library of novel VP1 protein sequences. In silico validation revealed that the majority of generated variants have superior predicted performance across all three indices, indicating successful multi-objective optimization. Furthermore, structural analysis via AlphaFold3 confirms that the generated sequences preserve the canonical capsid fold despite sequence diversification. AAVGen establishes a foundation for data-driven viral vector engineering, accelerating the development of next-generation AAV vectors with tailored functional characteristics.
---
## Model Details
### Model Description
AAVGen is a generative protein language model designed for precision engineering of Adeno-associated Virus (AAV) capsid sequences with optimized multi-property profiles. It was developed to generate novel AAV capsid variants with improved production fitness, kidney tropism, and thermostability relative to wild-type AAV2. The model was trained using a two-stage pipeline: Supervised Fine-Tuning (SFT) on AAV2 and AAV9 VP1 capsid datasets, followed by reinforcement learning via Group Sequence Policy Optimization (GSPO) guided by ESM-2-based regression reward models.
- **Developed by:** Mohammadreza Ghaffarzadeh-Esfahani, Yousof Gheisari
- **Institution:** Regenerative Medicine Research Center & Department of Genetics and Molecular Biology, Isfahan University of Medical Sciences, Isfahan, Iran
- **Corresponding Author:** Yousof Gheisari (ygheisari@med.mui.ac.ir)
- **Model type:** Causal Language Model (Generative Protein Language Model)
- **Language(s):** Protein sequences (amino acid alphabet)
- **License:** Apache-2.0
- **Finetuned from model:** [nferruz/ProtGPT2](https://huggingface.co/nferruz/ProtGPT2)
### Model Sources
- **Repository:** [Moreza009/AAVGen](https://huggingface.co/Moreza009/AAVGen)
- **Dataset:** [Moreza009/AAV_datasets](https://huggingface.co/datasets/Moreza009/AAV_datasets)
---
## Uses
### Direct Use
AAVGen can be used to generate novel AAV capsid protein sequences (VP1) by providing a start token (`<|endoftext|>\nM`). The generated sequences are intended for in silico screening, functional evaluation, and downstream experimental validation in AAV-based gene therapy development. The model is particularly suited for generating capsid variants optimized for renal tropism, high production fitness, and thermal stability.
### Downstream Use
AAVGen-generated sequences can be used as candidates for:
- Directed evolution and rational capsid engineering pipelines
- Scoring and selection using the companion regression models ([Moreza009/AAV-Fitness](https://huggingface.co/Moreza009/AAV-Fitness), [Moreza009/AAV-Thermostability](https://huggingface.co/Moreza009/AAV-Thermostability), [Moreza009/AAV-Kidney-Tropism](https://huggingface.co/Moreza009/AAV-Kidney-Tropism))
- Structural modeling with tools such as AlphaFold3
- Gene therapy vector development targeting the kidney
### Out-of-Scope Use
- Generation of capsid sequences for serotypes substantially different from AAV2/AAV9 without additional fine-tuning
- Direct clinical or therapeutic use without extensive experimental validation
- Applications requiring absolute sequence novelty guarantees (a small fraction of generated sequences may match training set variants)
---
## Bias, Risks, and Limitations
- The model was trained primarily on AAV2 and AAV9 VP1 sequences; generated sequences will be heavily biased toward these serotypes.
- Regression-based reward models carry inherent prediction uncertainty (MAE-based margins are used to flag uncertain predictions). Functional classifications should be treated as predictions, not experimental ground truth.
- Kidney tropism and thermostability regression models showed modest predictive correlations (Spearman ρ = 0.35 and 0.26, respectively), meaning reward signals for these properties are noisier than for production fitness.
- Approximately 4% of generated sequences are duplicates; downstream pipelines should deduplicate outputs.
- None of the generated sequences have been experimentally validated at the time of publication.
### Recommendations
Users should employ the companion ESM-2-based regression models for in silico pre-screening of generated sequences before experimental follow-up. Sequences classified as "Best" or "Good" (relative to WT scores and MAE margins) are recommended for prioritization. Structural validation using AlphaFold3 or equivalent tools is strongly encouraged before any experimental work.
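The "Best"/"Good"/"Uncertain"/"Bad" binning described above (and detailed in the Evaluation section) can be sketched as a simple threshold function. `classify_variant` is a hypothetical helper name; the WT score and MAE values would come from the corresponding regression model.

```python
def classify_variant(pred: float, wt: float, mae: float) -> str:
    """Bin a predicted property score relative to the WT AAV2 reference,
    using the regressor's MAE as an uncertainty margin (thresholds follow
    the classification scheme described in the Evaluation section)."""
    if pred > wt + 4 * mae:
        return "Best"       # confidently above WT
    elif pred > wt + mae:
        return "Good"       # above WT by more than one error margin
    elif pred >= wt:
        return "Uncertain"  # above WT but within the error margin
    else:
        return "Bad"        # below the WT reference

# Example with placeholder values: WT score 0.0, regressor MAE 0.1
print(classify_variant(0.55, 0.0, 0.1))  # Best
```

Variants falling in the "Best" or "Good" bins would then be prioritized for structural validation and experimental follow-up.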
---
## How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_name = "Moreza009/AAVGen"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
model.eval()
# Generate AAV capsid sequences
prompt = tokenizer.eos_token + "\n" + "M"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=754,
        do_sample=True,
        temperature=1.0,
        top_p=1.0,
        repetition_penalty=1.0,
    )
generated_sequence = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_sequence)
```
---
## Training Details
### Training Data
AAVGen was trained on AAV2 and AAV9 VP1 capsid sequence datasets available at [Moreza009/AAV_datasets](https://huggingface.co/datasets/Moreza009/AAV_datasets). The dataset includes sequences paired with experimental scores for production fitness, kidney tropism, and thermostability. For AAV9, full-length variants were reconstructed by inserting each variable amino-acid segment at position 588 of the VP1 backbone. Only sequences with a non-negative fitness score were retained, and duplicate sequences were removed prior to training.
### Training Procedure
Training proceeded in two stages:
**Stage 1 — Supervised Fine-Tuning (SFT):**
ProtGPT2 was fine-tuned on the combined AAV2 and AAV9 VP1 sequence dataset to learn foundational residue–residue co-evolutionary relationships. Sequences were formatted in FASTA-like style with `<|endoftext|>` tokens as delimiters and line breaks every 60 residues.
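The SFT input format described above can be sketched as follows. The exact placement of the `<|endoftext|>` delimiters is an assumption based on the preprocessing notes and the generation prompt (`<|endoftext|>\nM`); `format_for_sft` is a hypothetical helper name.

```python
def format_for_sft(seq: str, width: int = 60) -> str:
    """Format a VP1 amino-acid sequence in the FASTA-like SFT style:
    <|endoftext|> delimiters with line breaks every 60 residues."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "<|endoftext|>\n" + "\n".join(lines) + "\n<|endoftext|>"

# Example: a toy 120-residue sequence wraps into two 60-residue lines
print(format_for_sft("M" + "A" * 119))
```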
**Stage 2 — Reinforcement Learning via GSPO:**
The SFT model was further optimized using the GSPO framework from TRL, guided by a composite reward function consisting of five components:
1. **Production fitness reward** (weight: 1.0) — predicted by `Moreza009/AAV-Fitness`
2. **Kidney tropism reward** (weight: 1.0) — predicted by `Moreza009/AAV-Kidney-Tropism`
3. **Thermostability reward** (weight: 1.0) — predicted by `Moreza009/AAV-Thermostability`
4. **Length control reward** (weight: 0.1) — penalizes sequences deviating from target VP1 length (735 aa; σ=3)
5. **Uniqueness reward** (weight: 0.1) — penalizes repeated sequences within a training batch
Reward signals from the three regression models were mapped through a custom **reward logic mapper** that translates raw predicted scores into reward values by comparing them against the WT AAV2 score. Only sequences exceeding the WT score receive positive reward, ensuring that optimization is anchored to the natural reference.
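A minimal sketch of this composite reward, under stated assumptions: the card gives the component weights, the WT-anchoring rule, the target length (735 aa), and σ=3, but the exact functional forms below (a clipped margin over WT, a Gaussian length bonus) and all function names are illustrative, not the authors' implementation. WT scores are placeholders.

```python
import math

WT_SCORES = {"fitness": 0.0, "tropism": 0.0, "thermostability": 0.0}  # placeholder WT AAV2 scores

def property_reward(pred: float, wt: float) -> float:
    """Reward logic mapper: only predictions exceeding the WT AAV2
    reference earn positive reward; everything else maps to zero."""
    return max(0.0, pred - wt)

def length_reward(length: int, target: int = 735, sigma: float = 3.0) -> float:
    """Gaussian bonus that decays as the sequence deviates from the
    target VP1 length (735 aa, sigma = 3)."""
    return math.exp(-((length - target) ** 2) / (2 * sigma ** 2))

def composite_reward(preds: dict, length: int, is_duplicate: bool) -> float:
    """Weighted sum of the five components (weights from the card:
    1.0 for each property, 0.1 for length and uniqueness)."""
    r = sum(1.0 * property_reward(preds[k], WT_SCORES[k]) for k in WT_SCORES)
    r += 0.1 * length_reward(length)
    r += 0.1 * (0.0 if is_duplicate else 1.0)  # uniqueness within the batch
    return r
```

With σ=3, a sequence nine residues off target already receives a near-zero length bonus, which keeps generations tightly clustered around the 735-aa VP1 length.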
#### Preprocessing
- Sequences formatted with `<|endoftext|>` as start/end tokens
- FASTA-style line wrapping at 60 residues for SFT
- AAV9 inserts reconstructed by inserting variable regions at position 588 of the full VP1 backbone
- Duplicate sequences removed; fitness score ≥ 0 filter applied
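The preprocessing steps above can be sketched in a few lines. Whether position 588 is counted 1-based or 0-based is an assumption here, and both helper names are hypothetical.

```python
def reconstruct_aav9_vp1(backbone: str, insert: str, position: int = 588) -> str:
    """Rebuild a full-length AAV9 VP1 by splicing the variable amino-acid
    segment into the backbone at the given position (1-based numbering assumed)."""
    return backbone[:position] + insert + backbone[position:]

def preprocess(records: list) -> list:
    """Apply the card's filters: keep sequences with fitness >= 0 and
    drop duplicates, preserving first-seen order.
    records: list of (sequence, fitness_score) tuples."""
    seen, kept = set(), []
    for seq, fitness in records:
        if fitness >= 0 and seq not in seen:
            seen.add(seq)
            kept.append(seq)
    return kept
```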
#### Training Hyperparameters
**SFT Phase:**
- **Training regime:** fp16 mixed precision
- Base model: `nferruz/ProtGPT2`
- Learning rate: 1e-4 (linear schedule)
- Batch size per device: 4; gradient accumulation: 4
- Epochs: 3
- Max sequence length: 300 tokens
- Optimizer: AdamW (β1=0.9, β2=0.999, ε=1e-8)
- Weight decay: 0.01; warmup ratio: 0.01
**GSPO Phase:**
- **Training regime:** fp16 mixed precision
- Learning rate: 2e-6 (cosine schedule)
- Batch size per device: 4; gradient accumulation: 8
- Number of generations per step: 32
- Epochs: 5
- Max completion length: 754 tokens
- Optimizer: AdamW (β1=0.9, β2=0.999, ε=1e-8)
- Weight decay: 0.01; warmup steps: 50
- Importance sampling level: sequence
- Gradient checkpointing: enabled
#### Speeds, Sizes, and Times
All training was performed on a server with an NVIDIA V100 GPU (32 GB VRAM) and AMD EPYC 7502 CPU (32 GB RAM):
- SFT training: ~9 hours 5 minutes
- GSPO training: ~9 hours 38 minutes
---
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
Evaluation was performed on a set of 500,000 sequences generated by AAVGen, initiated with the fixed start token `"M"`, using sampling-based decoding (temperature=1.0, top_p=1.0, top_k=None) with a maximum length of 500 tokens and a batch size of 64.
#### Factors
Evaluation was stratified across three dimensions: sequence quality/novelty, predicted functional properties, and structural fidelity to WT AAV2.
#### Metrics
- **Uniqueness:** Fraction of non-duplicate sequences in the generated pool
- **Length distribution:** Comparison of generated sequence lengths to training set (median, IQR)
- **Sequence identity and similarity:** Global pairwise alignment to WT AAV2 (Biopython PairwiseAligner; match=2, mismatch=-1, gap open=-2, gap extend=-0.5)
- **Edit distance:** Minimum residue-level edits from generated sequence to WT AAV2
- **Functional classification:** Predicted scores from regression models classified as "Best" (>WT + 4×MAE), "Good" (WT + 1–4×MAE), "Uncertain" (WT to WT + 1×MAE), or "Bad" (<WT)
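The edit-distance metric above can be illustrated with standard dynamic programming. The full evaluation uses Biopython's PairwiseAligner with the scoring parameters listed; the Levenshtein sketch below is a simplified stand-in that counts residue-level substitutions, insertions, and deletions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of residue-level edits
    (substitutions, insertions, deletions) turning sequence a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match / substitution
        prev = curr
    return prev[len(b)]

print(edit_distance("MAVPQ", "MAVAQ"))  # 1 (single substitution)
```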