---
license: apache-2.0
datasets:
- Moreza009/AAV_datasets
base_model:
- nferruz/ProtGPT2
---

<h1 align="center">AAVGen: Precision Engineering of Adeno-associated Virus for Renal Selective Targeting</h1>

<br>

<p align="center">
  <a href="https://opensource.org/licenses/Apache-2.0">
    <img src="https://img.shields.io/badge/License-Apache%202.0-blue.svg" alt="License: Apache 2.0">
  </a>
  <a href="https://www.python.org/downloads/">
    <img src="https://img.shields.io/badge/python-3.8+-blue.svg" alt="Python 3.8+">
  </a>
  <a href="https://github.com/mohammad-gh009/AAVGen">
    <img src="https://img.shields.io/badge/GitHub-Code-blue.svg?logo=github" alt="GitHub">
  </a>
  <a href="https://arxiv.org/abs/2508.18579">
    <img src="https://img.shields.io/badge/arXiv-2508.18579-b31b1b.svg" alt="arXiv">
  </a>
</p>

<p align="center">
  <img src="https://github.com/mohammad-gh009/AAVGen/blob/main/assets/Logo.png" alt="Logo" width="500">
</p>

---

## Abstract

Adeno-associated viruses (AAVs) are promising vectors for gene therapy, but their native serotypes face limitations in tissue tropism, immune evasion, and production efficiency. Engineering capsids to overcome these hurdles is challenging due to the vast sequence space and the difficulty of simultaneously optimizing multiple functional properties. This complexity is compounded in the kidney, which presents unique anatomical barriers and cellular targets that demand precise and efficient vector engineering. Here, we present AAVGen, a generative artificial intelligence framework for de novo design of AAV capsids with enhanced multi-trait profiles. AAVGen integrates a protein language model (PLM) with supervised fine-tuning (SFT) and a reinforcement learning technique termed Group Sequence Policy Optimization (GSPO). The model is guided by a composite reward signal derived from three ESM-2-based regression predictors, each trained to predict a key property: production fitness, kidney tropism, and thermostability. Our results demonstrate that AAVGen produces a diverse library of novel VP1 protein sequences. In silico validation revealed that the majority of generated variants show superior predicted performance across all three indices, indicating successful multi-objective optimization. Furthermore, structural analysis via AlphaFold3 confirms that the generated sequences preserve the canonical capsid fold despite sequence diversification. AAVGen establishes a foundation for data-driven viral vector engineering, accelerating the development of next-generation AAV vectors with tailored functional characteristics.

<br>

---

## Model Details

### Model Description

AAVGen is a generative protein language model designed for precision engineering of Adeno-associated Virus (AAV) capsid sequences with optimized multi-property profiles. It was developed to generate novel AAV capsid variants with improved production fitness, kidney tropism, and thermostability relative to wild-type AAV2. The model was trained using a two-stage pipeline: Supervised Fine-Tuning (SFT) on AAV2 and AAV9 VP1 capsid datasets, followed by reinforcement learning via Group Sequence Policy Optimization (GSPO) guided by ESM-2-based regression reward models.

- **Developed by:** Mohammadreza Ghaffarzadeh-Esfahani, Yousof Gheisari
- **Institution:** Regenerative Medicine Research Center & Department of Genetics and Molecular Biology, Isfahan University of Medical Sciences, Isfahan, Iran
- **Corresponding Author:** Yousof Gheisari (ygheisari@med.mui.ac.ir)
- **Model type:** Causal Language Model (Generative Protein Language Model)
- **Language(s):** Protein sequences (amino acid alphabet)
- **License:** Apache-2.0
- **Finetuned from model:** [nferruz/ProtGPT2](https://huggingface.co/nferruz/ProtGPT2)

### Model Sources

- **Repository:** [Moreza009/AAVGen](https://huggingface.co/Moreza009/AAVGen)
- **Dataset:** [Moreza009/AAV_datasets](https://huggingface.co/datasets/Moreza009/AAV_datasets)

---

## Uses

### Direct Use

AAVGen can be used to generate novel AAV capsid protein sequences (VP1) by providing a start token (`<|endoftext|>\nM`). The generated sequences are intended for in silico screening, functional evaluation, and downstream experimental validation in AAV-based gene therapy development. The model is particularly suited for generating capsid variants optimized for renal tropism, high production fitness, and thermal stability.

### Downstream Use

AAVGen-generated sequences can be used as candidates for:
- Directed evolution and rational capsid engineering pipelines
- Scoring and selection using the companion regression models ([Moreza009/AAV-Fitness](https://huggingface.co/Moreza009/AAV-Fitness), [Moreza009/AAV-Thermostability](https://huggingface.co/Moreza009/AAV-Thermostability), [Moreza009/AAV-Kidney-Tropism](https://huggingface.co/Moreza009/AAV-Kidney-Tropism))
- Structural modeling with tools such as AlphaFold3
- Gene therapy vector development targeting the kidney

### Out-of-Scope Use

- Generation of capsid sequences for serotypes substantially different from AAV2/AAV9 without additional fine-tuning
- Direct clinical or therapeutic use without extensive experimental validation
- Applications requiring absolute sequence novelty guarantees (a small fraction of generated sequences may match training set variants)

---

## Bias, Risks, and Limitations

- The model was trained primarily on AAV2 and AAV9 VP1 sequences; generated sequences will be heavily biased toward these serotypes.
- Regression-based reward models carry inherent prediction uncertainty (MAE-based margins are used to flag uncertain predictions). Functional classifications should be treated as predictions, not experimental ground truth.
- Kidney tropism and thermostability regression models showed moderate predictive correlation (Spearman ρ = 0.35 and 0.26, respectively), meaning reward signals for these properties are noisier than for production fitness.
- Approximately 4% of generated sequences are repetitive duplicates; downstream pipelines should deduplicate outputs.
- None of the generated sequences have been experimentally validated at the time of publication.

### Recommendations

Users should employ the companion ESM-2-based regression models for in silico pre-screening of generated sequences before experimental follow-up. Sequences classified as "Best" or "Good" (relative to WT scores and MAE margins) are recommended for prioritization. Structural validation using AlphaFold3 or equivalent tools is strongly encouraged before any experimental work.

---

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "Moreza009/AAVGen"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
model.eval()

# Generate AAV capsid sequences from the start token "<|endoftext|>\nM"
prompt = tokenizer.eos_token + "\n" + "M"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=754,   # matches the GSPO max completion length
        do_sample=True,
        temperature=1.0,
        top_p=1.0,
        repetition_penalty=1.0,
    )

generated_sequence = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_sequence)
```

---

## Training Details

### Training Data

AAVGen was trained on AAV2 and AAV9 VP1 capsid sequence datasets available at [Moreza009/AAV_datasets](https://huggingface.co/datasets/Moreza009/AAV_datasets). The dataset includes sequences paired with experimental scores for production fitness, kidney tropism, and thermostability. For AAV9 sequences, the variable insert region was reconstructed by inserting the variable AA segment at position 588 of the full VP1 backbone. Only sequences with a non-negative fitness score were retained, and duplicate sequences were removed prior to training.
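
The AAV9 reconstruction step above can be sketched in a few lines. This is a minimal illustration, not the project's code: the helper name is made up here, and treating "position 588" as a 1-indexed insertion point after residue 588 is an assumption.

```python
# Hypothetical sketch of the AAV9 insert reconstruction described above.
# Assumption: "position 588" means the variable amino-acid segment is
# spliced in after residue 588 (1-indexed) of the full VP1 backbone.

def reconstruct_vp1(backbone: str, insert: str, position: int = 588) -> str:
    """Insert a variable amino-acid segment after `position` (1-indexed)."""
    if not 0 < position <= len(backbone):
        raise ValueError("position outside backbone")
    return backbone[:position] + insert + backbone[position:]

# Toy example with a short fake backbone (a real VP1 is ~735 aa)
print(reconstruct_vp1("MAADGYLPDW", "QQQ", position=5))  # MAADGQQQYLPDW
```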

### Training Procedure

Training proceeded in two stages:

**Stage 1 — Supervised Fine-Tuning (SFT):**
ProtGPT2 was fine-tuned on the combined AAV2 and AAV9 VP1 sequence dataset to learn foundational residue–residue co-evolutionary relationships. Sequences were formatted in FASTA-like style with `<|endoftext|>` tokens as delimiters and line breaks every 60 residues.
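
The SFT input formatting can be sketched as follows. The helper name and the exact placement of the delimiter tokens are assumptions; only the 60-residue wrapping and the `<|endoftext|>` delimiters come from the description above.

```python
# Minimal sketch of the FASTA-like SFT formatting described above.
import textwrap

def format_for_sft(sequence: str, width: int = 60) -> str:
    # Wrap the amino-acid sequence at 60 residues per line and delimit
    # it with <|endoftext|> tokens (exact placement is an assumption).
    body = "\n".join(textwrap.wrap(sequence, width))
    return f"<|endoftext|>\n{body}\n<|endoftext|>"

print(format_for_sft("M" * 130))  # two 60-residue lines plus a 10-residue line
```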

**Stage 2 — Reinforcement Learning via GSPO:**
The SFT model was further optimized using the GSPO framework from TRL, guided by a composite reward function consisting of five components:
1. **Production fitness reward** (weight: 1.0) — predicted by `Moreza009/AAV-Fitness`
2. **Kidney tropism reward** (weight: 1.0) — predicted by `Moreza009/AAV-Kidney-Tropism`
3. **Thermostability reward** (weight: 1.0) — predicted by `Moreza009/AAV-Thermostability`
4. **Length control reward** (weight: 0.1) — penalizes sequences deviating from target VP1 length (735 aa; σ=3)
5. **Uniqueness reward** (weight: 0.1) — penalizes repeated sequences within a training batch

Reward signals from the three regression models were mapped through a custom **reward logic mapper** that translates raw predicted scores into reward values by comparing them against the WT AAV2 score. Only sequences exceeding the WT score receive positive reward, ensuring that optimization is anchored to the natural reference.
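
The composite reward can be sketched as below. The function names, the linear shape of the per-property mapping, and the Gaussian form of the length reward are assumptions; the WT anchoring rule, the 735 aa / σ=3 length target, the batch-uniqueness penalty, and the 1.0/1.0/1.0/0.1/0.1 weights come from the description above.

```python
import math

def property_reward(predicted: float, wt_score: float) -> float:
    # WT-anchored mapping: only scores exceeding the WT AAV2 score earn
    # positive reward (the exact mapping shape is an assumption).
    return max(0.0, predicted - wt_score)

def length_reward(length: int, target: int = 735, sigma: float = 3.0) -> float:
    # Gaussian-shaped reward around the target VP1 length (735 aa, sigma=3).
    return math.exp(-((length - target) ** 2) / (2 * sigma ** 2))

def uniqueness_reward(seq: str, batch: list[str]) -> float:
    # Reward sequences appearing exactly once in the training batch.
    return 1.0 if batch.count(seq) == 1 else 0.0

def composite_reward(fitness: float, tropism: float, stability: float,
                     seq: str, batch: list[str],
                     wt: tuple = (0.0, 0.0, 0.0)) -> float:
    # Weights from the model card: 1.0 / 1.0 / 1.0 / 0.1 / 0.1.
    return (1.0 * property_reward(fitness, wt[0])
            + 1.0 * property_reward(tropism, wt[1])
            + 1.0 * property_reward(stability, wt[2])
            + 0.1 * length_reward(len(seq))
            + 0.1 * uniqueness_reward(seq, batch))
```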

#### Preprocessing

- Sequences formatted with `<|endoftext|>` as start/end tokens
- FASTA-style line wrapping at 60 residues for SFT
- AAV9 inserts reconstructed by inserting variable regions at position 588 of the full VP1 backbone
- Duplicate sequences removed; fitness score ≥ 0 filter applied

#### Training Hyperparameters

**SFT Phase:**
- **Training regime:** fp16 mixed precision
- Base model: `nferruz/ProtGPT2`
- Learning rate: 1e-4 (linear schedule)
- Batch size per device: 4; gradient accumulation: 4
- Epochs: 3
- Max sequence length: 300 tokens
- Optimizer: AdamW (β1=0.9, β2=0.999, ε=1e-8)
- Weight decay: 0.01; warmup ratio: 0.01

**GSPO Phase:**
- **Training regime:** fp16 mixed precision
- Learning rate: 2e-6 (cosine schedule)
- Batch size per device: 4; gradient accumulation: 8
- Number of generations per step: 32
- Epochs: 5
- Max completion length: 754 tokens
- Optimizer: AdamW (β1=0.9, β2=0.999, ε=1e-8)
- Weight decay: 0.01; warmup steps: 50
- Importance sampling level: sequence
- Gradient checkpointing: enabled

#### Speeds, Sizes, and Times

All training was performed on a server with an NVIDIA V100 GPU (32 GB VRAM) and an AMD EPYC 7502 CPU (32 GB RAM):
- SFT training: ~9 hours 5 minutes
- GSPO training: ~9 hours 38 minutes

---

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

Evaluation was performed on a set of 500,000 sequences generated by AAVGen, initiated with the fixed start token `"M"`, using sampling-based decoding (temperature=1.0, top_p=1.0, top_k=None) with a maximum length of 500 tokens and a batch size of 64.

#### Factors

Evaluation was stratified across three dimensions: sequence quality/novelty, predicted functional properties, and structural fidelity to WT AAV2.

#### Metrics

- **Uniqueness:** Fraction of non-duplicate sequences in the generated pool
- **Length distribution:** Comparison of generated sequence lengths to the training set (median, IQR)
- **Sequence identity and similarity:** Global pairwise alignment to WT AAV2 (Biopython PairwiseAligner; match=2, mismatch=-1, gap open=-2, gap extend=-0.5)
- **Edit distance:** Minimum residue-level edits from generated sequence to WT AAV2
- **Functional classification:** Predicted scores from regression models classified as "Best" (>WT + 4×MAE), "Good" (WT + 1–4×MAE), "Uncertain" (WT to WT + 1×MAE), or "Bad" (<WT)
- **Spearman correlation:** Between predicted scores for each pair of optimized properties
- **Structural RMSD:** Cα RMSD between AlphaFold3-predicted structures of generated variants and the WT AAV2 PDB structure (VP3 subunit)
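
The functional classification above reduces to a simple threshold rule on the predicted score, anchored to the WT score and the reward model's mean absolute error (MAE). The function name is illustrative, and the strict-vs-inclusive boundary conventions are assumptions.

```python
# Sketch of the Best/Good/Uncertain/Bad thresholds described above.

def classify(score: float, wt: float, mae: float) -> str:
    if score > wt + 4 * mae:
        return "Best"       # more than 4 MAE margins above WT
    if score > wt + 1 * mae:
        return "Good"       # between 1 and 4 MAE margins above WT
    if score >= wt:
        return "Uncertain"  # above WT but within 1 MAE margin
    return "Bad"            # below WT

print(classify(10.0, 5.0, 1.0))  # Best
print(classify(7.0, 5.0, 1.0))   # Good
print(classify(5.5, 5.0, 1.0))   # Uncertain
print(classify(4.0, 5.0, 1.0))   # Bad
```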

### Results

**Sequence Diversity and Fidelity:**
- ~4% of the 500,000 generated sequences were duplicates
- After deduplication, 1,787 sequences matched training set entries; 230 were identical to WT AAV2; none matched WT AAV9
- Length distribution closely matched the training data (generated median: 741, IQR: 740–743; training median: 741, IQR: 737–743)
- High sequence similarity to WT AAV2: median identity 99.18% (IQR: 98.91–99.32%), median similarity 99.32% (IQR: 99.05–99.46%)
- Median edit distance from WT AAV2: 13% (IQR: 10–15%)
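
For illustration, a residue-level edit distance can be computed with a standard Levenshtein dynamic program; note the reported identity/similarity values come from Biopython's PairwiseAligner with the alignment parameters listed under Metrics, not from this pure-Python sketch.

```python
def edit_distance(a: str, b: str) -> int:
    # Classic Levenshtein DP: minimum substitutions, insertions, deletions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution / match
            ))
        prev = curr
    return prev[-1]

print(edit_distance("MAADGYLPDW", "MAADGYLPEW"))  # 1
```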

**Functional Property Analysis (436,765 unique, non-WT, non-training sequences):**

| Property | Best | Good | Uncertain | Bad |
|---|---|---|---|---|
| Production Fitness | 435,448 (99.70%) | 669 (0.15%) | 128 (0.03%) | 559 (0.13%) |
| Kidney Tropism | 1 (<0.01%) | 491,439 (98.27%) | 5,416 (1.24%) | 2,155 (0.43%) |
| Thermostability | 0 (0%) | 386,844 (88.57%) | 43,626 (9.99%) | 6,295 (1.44%) |

Strong positive Spearman correlations were observed between all three predicted property pairs, indicating co-optimization without property trade-offs.

**Structural Analysis:**
AlphaFold3-based structural modeling of 500 randomly sampled "Good"/"Best" sequences showed high structural conservation relative to WT AAV2 (VP3), with low RMSD values. RMSD was negatively correlated with predicted functional scores, confirming that sequences with higher predicted performance better preserved the WT structural scaffold.

#### Summary

AAVGen generates a diverse library of novel AAV capsid variants that retain high structural and sequence similarity to WT AAV2 while exhibiting substantially improved predicted production fitness, kidney tropism, and thermostability. The vast majority of generated sequences are classified as "Good" or "Best" across all three design objectives, with strong co-optimization across properties.

---

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute).

- **Hardware Type:** NVIDIA V100 GPU (32 GB VRAM)
- **Hours used:** ~46 hours total (across all regression models, SFT, and GSPO phases)
- **Cloud Provider:** On-premise institutional server
- **Compute Region:** Isfahan, Iran
- **Carbon Emitted:** Not calculated

---

## Technical Specifications

### Model Architecture and Objective

AAVGen is based on ProtGPT2, a GPT-2 architecture pre-trained on UniRef50 protein sequences. The model uses a causal language modeling (CLM) objective during SFT and a GSPO-based policy optimization objective during RL fine-tuning. The GSPO framework optimizes the model toward a composite reward derived from three ESM-2-based regression models predicting production fitness, kidney tropism, and thermostability, plus two auxiliary rewards for sequence length control and batch uniqueness.

### Compute Infrastructure

#### Hardware

- GPU: NVIDIA V100, 32 GB VRAM
- CPU: AMD EPYC 7502, 32 GB RAM

#### Software

- Python, PyTorch
- Transformers (Hugging Face)
- TRL (GRPO/GSPO framework)
- Datasets (Hugging Face)
- scikit-learn, Biopython 1.85
- AlphaFold3 (structural evaluation)
- PyMOL (structural alignment)

---

## Citation

If you use AAVGen in your research, please cite:

**BibTeX:**
```bibtex
@article{ghaffarzadeh2025aavgen,
  title={AAVGen: Precision Engineering of Adeno-associated Virus for Renal Selective Targeting},
  author={Ghaffarzadeh-Esfahani, Mohammadreza and Gheisari, Yousof},
  journal={[Journal Name]},
  year={2025},
  institution={Regenerative Medicine Research Center, Isfahan University of Medical Sciences}
}
```

**APA:**
Ghaffarzadeh-Esfahani, M., & Gheisari, Y. (2025). AAVGen: Precision Engineering of Adeno-associated Virus for Renal Selective Targeting. *[Journal Name]*. Isfahan University of Medical Sciences.

---

## Model Card Authors

Mohammadreza Ghaffarzadeh-Esfahani

## Model Card Contact

Mohammadreza Ghaffarzadeh-Esfahani  
Email: mreghafarzadeh@gmail.com