# FastGen Inference
Generate images and videos using pretrained or distilled models.
| Script | Modality | Modes |
|--------|----------|-------|
| [`image_model_inference.py`](inference/image_model_inference.py) | Image | Unconditional, class-conditional, T2I |
| [`video_model_inference.py`](inference/video_model_inference.py) | Video | T2V, I2V, V2V, Video2World |
## Key Arguments
| Argument | Description |
|----------|-------------|
| `--do_student_sampling` | Sample with the distilled student model (few-step) |
| `--do_teacher_sampling` | Sample with the teacher model (multi-step) |
| `--ckpt_path` | Path to distilled checkpoint |
| `--num_steps` | Sampling steps for teacher |
| `--classes N` | Class-conditional with N classes |
| `--unconditional` | Unconditional generation |
| `--input_image_file` | Input images for I2V |
| `--source_video_file` | Source videos for V2V |
| `--fps` | Output video frame rate |
| `model.guidance_scale` | CFG scale (config override) |
| `trainer.seed` | Random seed for reproducibility (config override) |
## Example Commands
### Unconditional
```bash
python scripts/inference/image_model_inference.py \
--config fastgen/configs/experiments/EDM/config_sft_edm_cifar10.py \
--do_student_sampling False --unconditional --num_samples 16 --num_steps 18
```
### Class-Conditional
```bash
python scripts/inference/image_model_inference.py \
--config fastgen/configs/experiments/DiT/config_sft_sit_xl.py \
--do_student_sampling False --classes 1000 --num_steps 50 \
--prompt_file scripts/inference/prompts/classes.txt \
-- model.guidance_scale=4.0
```
### Text-to-Image (T2I)
```bash
python scripts/inference/image_model_inference.py \
--config fastgen/configs/experiments/Flux/config_sft.py \
--do_student_sampling False --num_steps 50 \
-- model.guidance_scale=3.5
```
### Text-to-Video (T2V)
```bash
python scripts/inference/video_model_inference.py \
--config fastgen/configs/experiments/WanT2V/config_dmd2.py \
--do_student_sampling False --num_steps 50 --fps 16 \
--neg_prompt_file scripts/inference/prompts/negative_prompt.txt \
-- model.guidance_scale=5.0
```
### Image-to-Video (I2V)
```bash
python scripts/inference/video_model_inference.py \
--config fastgen/configs/experiments/WanI2V/config_dmd2_wan22_5b.py \
--do_student_sampling False --num_steps 50 --fps 16 \
--neg_prompt_file scripts/inference/prompts/negative_prompt.txt \
--input_image_file scripts/inference/prompts/source_image_paths.txt \
-- model.guidance_scale=5.0
```
### Video-to-Video (V2V)
```bash
python scripts/inference/video_model_inference.py \
--config fastgen/configs/experiments/WanV2V/config_sft.py \
--do_student_sampling False --num_steps 50 --fps 16 \
--neg_prompt_file scripts/inference/prompts/negative_prompt.txt \
--source_video_file scripts/inference/prompts/source_video_paths.txt \
-- model.guidance_scale=5.0
```
### Video2World (Cosmos)
```bash
python scripts/inference/video_model_inference.py \
--config fastgen/configs/experiments/CosmosPredict2/config_sft.py \
--do_student_sampling False --num_steps 35 --fps 24 \
--neg_prompt_file scripts/inference/prompts/negative_prompt_cosmos.txt \
--input_image_file scripts/inference/prompts/source_image_paths.txt \
--num_conditioning_frames 1 \
-- model.guidance_scale=5.0 model.net.is_video2world=True model.input_shape="[16, 24, 88, 160]"
```
### Causal / Autoregressive
Use causal configs (e.g., `config_sft_causal_wan22_5b.py`) for autoregressive generation.
```bash
python scripts/inference/video_model_inference.py \
--config fastgen/configs/experiments/WanI2V/config_sft_causal_wan22_5b.py \
--do_student_sampling False --num_steps 50 --fps 16 \
--neg_prompt_file scripts/inference/prompts/negative_prompt.txt \
--input_image_file scripts/inference/prompts/source_image_paths.txt \
-- model.guidance_scale=5.0
```
To generate longer videos via extrapolation:
- `--num_segments N`: generate N consecutive video segments autoregressively (default: 1)
- `--overlap_frames K`: overlap K latent frames between consecutive segments for temporal consistency (default: 0)
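For intuition on how the two flags interact: with F latent frames per segment, N segments overlapping by K frames cover N*F - (N-1)*K distinct latent frames, since every segment after the first re-generates K frames from its predecessor. A minimal sketch (the per-segment frame count F is hypothetical here; in practice it comes from the model config):

```python
def total_latent_frames(num_segments: int, frames_per_segment: int, overlap_frames: int) -> int:
    """Distinct latent frames produced by autoregressive extrapolation.

    Each segment after the first re-generates `overlap_frames` latent frames
    from the previous segment, so they are counted only once.
    """
    return num_segments * frames_per_segment - (num_segments - 1) * overlap_frames


# e.g. --num_segments 3 --overlap_frames 3 with 21 latent frames per segment:
print(total_latent_frames(3, 21, 3))  # 57
```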
---
## FID Evaluation
Compute Fréchet Inception Distance (FID) for image models using [`fid/compute_fid_from_ckpts.py`](fid/compute_fid_from_ckpts.py).
### Usage
```bash
torchrun --nproc_per_node=8 scripts/fid/compute_fid_from_ckpts.py \
--config fastgen/configs/experiments/EDM/config_dmd2_cifar10.py
```
This script:
1. Loads checkpoints from `trainer.checkpointer.save_dir`
2. Generates `eval.num_samples` images using student sampling
3. Computes FID against reference statistics
4. Saves results to `{save_path}/{eval.samples_dir}/fid.json`
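When evaluating many checkpoints, it is convenient to pull the best one out of the results file programmatically. A sketch, assuming `fid.json` maps checkpoint iterations to FID scores (this layout is an assumption; inspect your own `fid.json` for the actual schema):

```python
import json


def best_checkpoint(fid_json_path: str) -> tuple[int, float]:
    """Return (iteration, fid) for the checkpoint with the lowest FID.

    Assumed layout: {"<iteration>": <fid>, ...} -- verify against your file.
    """
    with open(fid_json_path) as f:
        scores = json.load(f)
    # Lower FID is better.
    iteration, fid = min(scores.items(), key=lambda kv: kv[1])
    return int(iteration), fid
```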
### Config Options
| Parameter | Description |
|-----------|-------------|
| `eval.num_samples` | Number of samples to generate (default: 50000) |
| `eval.min_ckpt` | Minimum checkpoint iteration to evaluate |
| `eval.max_ckpt` | Maximum checkpoint iteration to evaluate |
| `eval.samples_dir` | Subdirectory name for generated samples |
| `eval.save_images` | Save a visualization grid instead of computing FID |
### Reference Statistics
FID reference statistics are computed following the [EDM](https://github.com/NVlabs/edm) and [EDM2](https://github.com/NVlabs/edm2) repositories. Store them in `$DATA_ROOT_DIR/fid-refs/`:
| Dataset | Reference File |
|---------|----------------|
| CIFAR-10 | `fid-refs/cifar10-32x32.npz` |
| ImageNet-64 (EDM) | `fid-refs/imagenet-64x64.npz` |
| ImageNet-64 (EDM2) | `fid-refs/imagenet-64x64-edmv2.npz` |
| ImageNet-256 | `fid-refs/imagenet_256.pkl` |
| COCO-2014 | `fid-refs/coco2014_eval_30k.npz` |