#!/usr/bin/env python3
"""
Upload FlowFinal model components to Hugging Face Hub.
"""

import json
import os
from datetime import datetime

from huggingface_hub import HfApi, upload_file

def create_model_card():
    """Create a comprehensive model card for FlowFinal."""
    model_card = """---
license: mit
tags:
- protein-generation
- antimicrobial-peptides
- flow-matching
- protein-design
- esm
- amp
library_name: pytorch
---

# FlowFinal: AMP Flow Matching Model

FlowFinal is a flow matching model for generating antimicrobial peptides (AMPs). The model uses continuous normalizing flows to generate protein sequences in the ESM-2 embedding space.

## Model Description

- **Model Type**: Flow Matching for Protein Generation
- **Domain**: Antimicrobial Peptide (AMP) Generation
- **Base Model**: ESM-2 (650M parameters)
- **Architecture**: Transformer-based flow matching with classifier-free guidance (CFG)
- **Training Data**: Curated AMP dataset with ~7K sequences

## Key Features

- **Classifier-Free Guidance (CFG)**: Enables controlled generation with different conditioning strengths
- **ESM-2 Integration**: Leverages pre-trained protein language model embeddings
- **Compression Architecture**: Efficient 16x compression of ESM-2 embeddings (1280 → 80 dimensions)
- **Multiple CFG Scales**: Support for no conditioning (0.0), weak (3.0), strong (7.5), and very strong (15.0) guidance
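
The CFG scales listed above can be read through the usual classifier-free guidance rule, in which the guided prediction is extrapolated from the unconditional output toward the conditional one. The sketch below only illustrates that rule, assuming FlowFinal combines the two velocity predictions this way; `guided_velocity` and the toy vectors are hypothetical and not part of the released code.

```python
# Illustrative sketch of classifier-free guidance (CFG); not FlowFinal's actual code.
# Assumption: guided = uncond + scale * (cond - uncond), the common CFG rule.

def guided_velocity(v_uncond, v_cond, cfg_scale):
    # Blend unconditional and conditional velocity predictions element-wise.
    return [u + cfg_scale * (c - u) for u, c in zip(v_uncond, v_cond)]

# Toy 3-dimensional predictions (hypothetical values).
v_uncond = [0.0, 1.0, -1.0]
v_cond = [1.0, 1.0, 0.0]

for scale in (0.0, 3.0, 7.5, 15.0):
    print(scale, guided_velocity(v_uncond, v_cond, scale))
```

Under this rule a scale of 0.0 returns the unconditional prediction unchanged, which matches the "no conditioning" setting, and larger scales push further along the conditional direction.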

## Model Components

### Core Architecture
- `final_flow_model.py`: Main flow matching model implementation
- `compressor_with_embeddings.py`: Embedding compression/decompression modules
- `final_sequence_decoder.py`: ESM-2 embedding to sequence decoder

### Trained Weights
- `final_compressor_model.pth`: Trained compressor (315MB)
- `final_decompressor_model.pth`: Trained decompressor (158MB)
- `amp_flow_model_final_optimized.pth`: Main flow model checkpoint

### Generated Samples (2025-08-29 Results)
- Generated AMP sequences with different CFG scales
- HMD-AMP validation results showing 8.8% AMP prediction rate

## Performance Results

### HMD-AMP Validation (80 sequences tested)
- **Total AMPs Predicted**: 7/80 (8.8%)
- **By CFG Configuration**:
  - No CFG: 1/20 (5.0%)
  - Weak CFG: 2/20 (10.0%)  
  - Strong CFG: 4/20 (20.0%) ← Best performance
  - Very Strong CFG: 0/20 (0.0%)
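
The percentages above follow directly from the per-configuration counts. A short sketch that recomputes them from the counts reported in this card (the variable names are illustrative only):

```python
# Recompute the HMD-AMP rates from the counts listed above.
counts = {
    "No CFG": (1, 20),
    "Weak CFG": (2, 20),
    "Strong CFG": (4, 20),
    "Very Strong CFG": (0, 20),
}

# Per-configuration rate in percent.
rates = {name: 100.0 * hits / total for name, (hits, total) in counts.items()}

# Overall rate across all tested sequences.
total_hits = sum(hits for hits, _ in counts.values())
total_tested = sum(total for _, total in counts.values())
overall_rate = 100.0 * total_hits / total_tested

# Configuration with the highest predicted-AMP rate.
best = max(rates, key=rates.get)

for name, rate in rates.items():
    print(f"{name}: {rate:.1f}%")
print(f"Overall: {total_hits}/{total_tested} = {overall_rate:.2f}%")
print("Best configuration:", best)
```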

### Best Performing Sequences
1. `ILVLVLARRIVGVIVAKVVLYAIVRSVVAAAKSISAVTVAKVTVFFQTTA` (No CFG)
2. `EDLSKAKAELQRYLLLSEIVSAFTALTRFYVVLTKIFQIRVKLIAVGQIL` (Weak CFG)
3. `IKLSRIAGIIVKRIRVASGDAQRLITASIGFTLSVVLAARFITIILGIVI` (Strong CFG)

## Usage

```python
from generate_amps import AMPGenerator

# Initialize generator
generator = AMPGenerator(
    model_path="amp_flow_model_final_optimized.pth",
    device='cuda'
)

# Generate AMP samples
samples = generator.generate_amps(
    num_samples=20,
    num_steps=25,
    cfg_scale=7.5  # Strong CFG recommended
)
```
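
The `num_steps` argument suggests a fixed-step ODE sampler. As a rough sketch only, assuming plain Euler integration from t = 0 to t = 1 (the solver actually used in `generate_amps.py` may differ), sampling from a flow model can look like this; `euler_sample` and the toy velocity field are hypothetical stand-ins for the trained network.

```python
# Illustrative Euler sampler for a flow model; not the generate_amps.py implementation.

def euler_sample(velocity_fn, x0, num_steps):
    # Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-size steps.
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy velocity field (assumption): moves any point toward a fixed target
# along a straight line, reaching it at t=1.
target = [1.0, -2.0, 0.5]

def toy_velocity(x, t):
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]

sample = euler_sample(toy_velocity, x0=[0.0, 0.0, 0.0], num_steps=25)
print(sample)
```

In the real model the velocity would come from the flow network (optionally with CFG applied), and the integrated result would be an embedding to decompress and decode rather than a toy vector.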

## Training Details

- **Optimizer**: AdamW with cosine annealing
- **Learning Rate**: 4e-4 (final)
- **Epochs**: 2000
- **Final Loss**: 1.318
- **Training Time**: 2.3 hours on H100
- **Dataset Size**: 6,983 samples
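
For reference, cosine annealing lowers the learning rate along a half cosine over the course of training. The sketch below assumes that 4e-4 is the starting rate and that the floor is zero; neither the floor nor any warmup is stated in this card, so `base_lr`, `min_lr`, and `cosine_lr` are placeholders rather than the training script's actual settings.

```python
import math

def cosine_lr(step, total_steps, base_lr=4e-4, min_lr=0.0):
    # Standard cosine annealing from base_lr down to min_lr.
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Learning rate at a few points of a 2000-epoch run (values are assumptions).
for epoch in (0, 500, 1000, 1500, 2000):
    print(epoch, cosine_lr(epoch, total_steps=2000))
```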

## Files Structure

```
FlowFinal/
├── models/
│   ├── final_compressor_model.pth
│   ├── final_decompressor_model.pth
│   └── amp_flow_model_final_optimized.pth
├── generated_samples/
│   ├── generated_sequences_20250829.fasta
│   └── hmd_amp_detailed_results.csv
├── src/
│   ├── final_flow_model.py
│   ├── compressor_with_embeddings.py
│   ├── final_sequence_decoder.py
│   └── generate_amps.py
└── README.md
```

## Citation

If you use FlowFinal in your research, please cite:

```bibtex
@misc{flowfinal2025,
  title={FlowFinal: Flow Matching for Antimicrobial Peptide Generation},
  author={Edward Sun},
  year={2025},
  url={https://huggingface.co/esunAI/FlowFinal}
}
```

## License

This model is released under the MIT License.
"""
    return model_card

def main():
    print("🚀 Starting comprehensive upload to Hugging Face Hub...")
    
    # Initialize API
    api = HfApi()
    repo_id = "esunAI/FlowFinal"
    today = "20250829"
    
    # Create model card
    print("📝 Creating model card...")
    model_card = create_model_card()
    # The card contains non-ASCII characters, so write it explicitly as UTF-8.
    with open("README.md", "w", encoding="utf-8") as f:
        f.write(model_card)
    
    # Upload model card
    print("📤 Uploading model card...")
    upload_file(
        path_or_fileobj="README.md",
        path_in_repo="README.md",
        repo_id=repo_id,
        commit_message="Add comprehensive model card"
    )
    
    # Upload main model components
    print("📤 Uploading main model files...")
    model_files = [
        "final_flow_model.py",
        "compressor_with_embeddings.py", 
        "final_sequence_decoder.py",
        "generate_amps.py",
        "amp_flow_training_single_gpu_full_data.py",
        "cfg_dataset.py",
        "decode_and_test_sequences.py"
    ]
    
    for file in model_files:
        if os.path.exists(file):
            print(f"  Uploading {file}...")
            upload_file(
                path_or_fileobj=file,
                path_in_repo=f"src/{file}",
                repo_id=repo_id,
                commit_message=f"Add {file}"
            )
    
    # Upload trained model weights
    print("📤 Uploading model weights...")
    weight_files = [
        ("final_compressor_model.pth", "models/final_compressor_model.pth"),
        ("final_decompressor_model.pth", "models/final_decompressor_model.pth"),
        ("normalization_stats.pt", "models/normalization_stats.pt")
    ]
    
    for local_file, repo_path in weight_files:
        if os.path.exists(local_file):
            print(f"  Uploading {local_file} -> {repo_path}...")
            upload_file(
                path_or_fileobj=local_file,
                path_in_repo=repo_path,
                repo_id=repo_id,
                commit_message=f"Add {local_file}"
            )
    
    # Upload ALL flow model checkpoints from today
    print("📤 Uploading flow model checkpoints...")
    checkpoint_files = [
        ("/data2/edwardsun/flow_checkpoints/amp_flow_model_final_optimized.pth", "models/amp_flow_model_final_optimized.pth"),
        ("/data2/edwardsun/flow_checkpoints/amp_flow_model_best_optimized.pth", "models/amp_flow_model_best_optimized.pth"),
        ("/data2/edwardsun/flow_checkpoints/amp_flow_model_best_optimized_20250829_RETRAINED.pth", "models/amp_flow_model_best_optimized_20250829_RETRAINED.pth")
    ]
    
    for checkpoint_path, repo_path in checkpoint_files:
        if os.path.exists(checkpoint_path):
            print(f"  Uploading {os.path.basename(checkpoint_path)}...")
            upload_file(
                path_or_fileobj=checkpoint_path,
                path_in_repo=repo_path,
                repo_id=repo_id,
                commit_message=f"Add {os.path.basename(checkpoint_path)}"
            )
    
    # Upload paper and documentation files
    print("📤 Uploading paper and documentation files...")
    paper_files = [
        "paper_results.tex",
        "supplementary_data.tex",
        "latex_tables.tex"
    ]
    
    for file in paper_files:
        if os.path.exists(file):
            print(f"  Uploading {file}...")
            upload_file(
                path_or_fileobj=file,
                path_in_repo=f"paper/{file}",
                repo_id=repo_id,
                commit_message=f"Add {file}"
            )
    
    # Upload training logs
    print("📤 Uploading training logs...")
    log_files = [
        "fresh_training_aug29.log",
        "h100_maximized_training.log", 
        "training_output_h100_max.log",
        "training_output.log",
        "launch_full_data_training.sh"
    ]
    
    for file in log_files:
        if os.path.exists(file):
            print(f"  Uploading {file}...")
            upload_file(
                path_or_fileobj=file,
                path_in_repo=f"training_logs/{file}",
                repo_id=repo_id,
                commit_message=f"Add {file}"
            )
    
    # Upload datasets
    print("📤 Uploading datasets...")
    dataset_files = [
        ("all_peptides_data.json", "datasets/all_peptides_data.json"),
        ("combined_final.fasta", "datasets/combined_final.fasta"),
        ("cfgdata.fasta", "datasets/cfgdata.fasta"),
        ("uniprotkb_AND_reviewed_true_AND_model_o_2025_08_29.fasta", "datasets/uniprotkb_reviewed_proteins.fasta")
    ]
    
    for local_file, repo_path in dataset_files:
        if os.path.exists(local_file):
            print(f"  Uploading {local_file}...")
            upload_file(
                path_or_fileobj=local_file,
                path_in_repo=repo_path,
                repo_id=repo_id,
                commit_message=f"Add {local_file}"
            )
    
    # Upload today's results and analysis
    print("📤 Uploading today's results and analysis...")
    result_files = [
        "generated_sequences_20250829_144923.fasta",
        "hmd_amp_detailed_results.csv",
        "hmd_amp_cfg_analysis.csv",
        "complete_amp_results.csv",
        "summary_statistics.csv"
    ]
    
    for file in result_files:
        if os.path.exists(file):
            print(f"  Uploading {file}...")
            upload_file(
                path_or_fileobj=file,
                path_in_repo=f"results/{file}",
                repo_id=repo_id,
                commit_message=f"Add {file}"
            )
    
    # Upload today's raw embeddings
    print("📤 Uploading today's raw embeddings...")
    embedding_dir = "/data2/edwardsun/generated_samples"
    
    embedding_files = [
        f"generated_amps_best_model_no_cfg_{today}.pt",
        f"generated_amps_best_model_weak_cfg_{today}.pt",
        f"generated_amps_best_model_strong_cfg_{today}.pt",
        f"generated_amps_best_model_very_strong_cfg_{today}.pt"
    ]
    
    for file in embedding_files:
        file_path = os.path.join(embedding_dir, file)
        if os.path.exists(file_path):
            print(f"  Uploading {file}...")
            upload_file(
                path_or_fileobj=file_path,
                path_in_repo=f"generated_samples/embeddings/{file}",
                repo_id=repo_id,
                commit_message=f"Add {file}"
            )
    
    # Upload decoded sequences from today
    print("📤 Uploading decoded sequences from today...")
    decoded_dir = "/data2/edwardsun/decoded_sequences"
    decoded_files = [
        f"decoded_sequences_no_cfg_00_{today}.txt",
        f"decoded_sequences_weak_cfg_30_{today}.txt",
        f"decoded_sequences_strong_cfg_75_{today}.txt",
        f"decoded_sequences_very_strong_cfg_150_{today}.txt"
    ]
    
    for file in decoded_files:
        file_path = os.path.join(decoded_dir, file)
        if os.path.exists(file_path):
            print(f"  Uploading {file}...")
            upload_file(
                path_or_fileobj=file_path,
                path_in_repo=f"generated_samples/decoded_sequences/{file}",
                repo_id=repo_id,
                commit_message=f"Add {file}"
            )
    
    # Upload APEX analysis results from today
    print("📤 Uploading APEX analysis results...")
    apex_dir = "/data2/edwardsun/apex_results"
    apex_files = [
        f"apex_results_no_cfg_00_{today}.json",
        f"apex_results_weak_cfg_30_{today}.json", 
        f"apex_results_strong_cfg_75_{today}.json",
        f"apex_results_very_strong_cfg_150_{today}.json",
        f"apex_results_all_cfg_comparison_{today}.json",
        f"mic_summary_{today}.json"
    ]
    
    for file in apex_files:
        file_path = os.path.join(apex_dir, file)
        if os.path.exists(file_path):
            print(f"  Uploading {file}...")
            upload_file(
                path_or_fileobj=file_path,
                path_in_repo=f"analysis/apex_results/{file}",
                repo_id=repo_id,
                commit_message=f"Add {file}"
            )
    
    # Upload additional dataset file from data2
    print("📤 Uploading additional dataset files...")
    additional_dataset_path = "/data2/edwardsun/decoded_sequences/all_dataset_peptides_sequences.txt"
    if os.path.exists(additional_dataset_path):
        print("  Uploading all_dataset_peptides_sequences.txt...")
        upload_file(
            path_or_fileobj=additional_dataset_path,
            path_in_repo="datasets/all_dataset_peptides_sequences.txt",
            repo_id=repo_id,
            commit_message="Add complete dataset sequences"
        )
    
    # Create comprehensive summary
    print("📤 Creating comprehensive summary...")
    
    # Count, per category, the files that exist locally and were therefore passed to upload_file
    uploaded_files = {
        "model_components": len([f for f in model_files if os.path.exists(f)]),
        "weight_files": len([f for f, _ in weight_files if os.path.exists(f)]),
        "checkpoints": len([f for f, _ in checkpoint_files if os.path.exists(f)]),
        "paper_files": len([f for f in paper_files if os.path.exists(f)]),
        "training_logs": len([f for f in log_files if os.path.exists(f)]),
        "datasets": len([f for f, _ in dataset_files if os.path.exists(f)]),
        "results": len([f for f in result_files if os.path.exists(f)]),
        "embeddings": len([f for f in embedding_files if os.path.exists(os.path.join(embedding_dir, f))]),
        "decoded_sequences": len([f for f in decoded_files if os.path.exists(os.path.join(decoded_dir, f))]),
        "apex_results": len([f for f in apex_files if os.path.exists(os.path.join(apex_dir, f))])
    }
    
    summary = {
        "model_name": "FlowFinal",
        "upload_date": datetime.now().isoformat(),
        "training_date": today,
        "total_sequences_generated": 80,
        "hmd_amp_predictions": 7,
        "hmd_amp_rate": 8.8,
        "best_cfg_configuration": "strong_cfg (20% AMP rate)",
        "training_details": {
            "epochs": 2000,
            "final_loss": 1.318,
            "training_time": "2.3 hours",
            "hardware": "H100",
            "dataset_size": 6983
        },
        "uploaded_files": uploaded_files,
        "total_files_uploaded": sum(uploaded_files.values()),
        "repository_structure": {
            "src/": "Main model implementation files",
            "models/": "Trained model weights and checkpoints", 
            "paper/": "LaTeX files and paper documentation",
            "training_logs/": "Complete training logs and scripts",
            "datasets/": "Training datasets and protein sequences",
            "results/": "Generated sequences and validation results",
            "generated_samples/": "Raw embeddings and decoded sequences",
            "analysis/": "APEX antimicrobial activity analysis"
        }
    }
    
    with open("comprehensive_summary.json", "w") as f:
        json.dump(summary, f, indent=2)
    
    upload_file(
        path_or_fileobj="comprehensive_summary.json",
        path_in_repo="comprehensive_summary.json",
        repo_id=repo_id,
        commit_message="Add comprehensive model and results summary"
    )
    
    print("✅ Comprehensive upload complete!")
    print(f"🌐 Your complete FlowFinal repository is now available at: https://huggingface.co/{repo_id}")
    print("\n📊 Upload Summary:")
    for category, count in uploaded_files.items():
        print(f"  - {category.replace('_', ' ').title()}: {count} files")
    print(f"  - Total files uploaded: {sum(uploaded_files.values())} files")
    print("\n🎯 Key Results:")
    print("  - Generated 80 sequences with different CFG scales")
    print("  - HMD-AMP validated 7 sequences as AMPs (8.8% success rate)")
    print("  - Strong CFG (7.5) performed best with 20% AMP rate")
    print("  - Complete training logs, datasets, and analysis included")
    print("  - Ready for final paper submission!")

if __name__ == "__main__":
    main()