---
license: apple-amlr
language:
- en
- th
tags:
- normalizing-flows
- generative-models
- art
- autoregressive-models
---
# STARFlow: Scalable Transformer Auto-Regressive Flow

<div align="center">
<img src="starflow_logo.png" alt="STARFlow Logo" width="300">
</div>

<div align="center">

[arXiv:2506.06276](https://arxiv.org/abs/2506.06276)
[arXiv:2511.20462](https://arxiv.org/abs/2511.20462)
[NeurIPS 2025](https://neurips.cc/Conferences/2025)

</div>

This is the official open-source release of **STARFlow** and **STARFlow-V**, state-of-the-art transformer autoregressive flow models for high-quality image and video generation.

## Overview

**STARFlow** introduces a novel transformer autoregressive flow architecture that combines the expressiveness of autoregressive models with the efficiency of normalizing flows. The model achieves state-of-the-art results on both text-to-image and text-to-video generation tasks.

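To make the core idea concrete, here is a minimal, self-contained sketch of a single affine autoregressive flow step: a causal network predicts per-token scale and shift, so the transform is invertible with a tractable log-determinant. This is a toy illustration with assumed names and shapes (e.g. the LSTM stand-in for a causal transformer block), not the STARFlow implementation:

```python
import torch
import torch.nn as nn

class ToyAutoregressiveFlowStep(nn.Module):
    """Illustrative affine autoregressive flow step (not the STARFlow code)."""

    def __init__(self, dim: int):
        super().__init__()
        # Stand-in for a causal transformer block; any causal sequence model works.
        self.causal_net = nn.LSTM(dim, dim, batch_first=True)
        self.to_scale_shift = nn.Linear(dim, 2 * dim)

    def forward(self, x: torch.Tensor):
        # Condition each token on strictly earlier tokens via a right-shift.
        x_prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        h, _ = self.causal_net(x_prev)
        log_scale, shift = self.to_scale_shift(h).chunk(2, dim=-1)
        z = x * torch.exp(log_scale) + shift   # invertible elementwise affine map
        log_det = log_scale.sum(dim=(1, 2))    # exact log|det Jacobian| per sequence
        return z, log_det
```

The forward (density-evaluation) direction is fully parallel, but inversion (sampling) proceeds token by token, which is exactly where the block-wise Jacobi acceleration described later comes in.
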
- **[STARFlow](https://arxiv.org/abs/2506.06276)**: Scaling Latent Normalizing Flows for High-resolution Image Synthesis (NeurIPS 2025 Spotlight)
- **[STARFlow-V](https://arxiv.org/abs/2511.20462)**: End-to-End Video Generative Modeling with Normalizing Flows (arXiv preprint)

**[View Video Results Gallery](https://starflow-v.github.io)** - see examples of generated videos and comparisons

## Quick Start

### Environment Setup

```bash
# Clone the repository
git clone https://github.com/apple/ml-starflow
cd ml-starflow

# Set up the conda environment (recommended)
bash scripts/setup_conda.sh

# Or install dependencies manually
pip install -r requirements.txt
```

### Model Checkpoints

**Important**: You will need to download the pretrained model checkpoints and place them in the `ckpts/` directory. For example:

- `ckpts/starflow_3B_t2i_256x256.pth` - for text-to-image generation
- `ckpts/starflow-v_7B_t2v_caus_480p_v3.pth` - for text-to-video generation

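As a quick sanity check before sampling, you can verify that the files are in place. A minimal sketch using only the example paths above:

```python
from pathlib import Path

# Example checkpoint paths from above; adjust to the models you downloaded.
expected = [
    "ckpts/starflow_3B_t2i_256x256.pth",
    "ckpts/starflow-v_7B_t2v_caus_480p_v3.pth",
]

for name in expected:
    p = Path(name)
    status = f"{p.stat().st_size / 1e9:.1f} GB" if p.exists() else "MISSING"
    print(f"{name}: {status}")
```
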
### Text-to-Image Generation

Generate high-quality images from text prompts:

```bash
# Basic image generation (256x256)
bash scripts/test_sample_image.sh "a film still of a cat playing piano"

# Custom prompt and settings
torchrun --standalone --nproc_per_node 1 sample.py \
    --model_config_path "configs/starflow_3B_t2i_256x256.yaml" \
    --checkpoint_path "ckpts/starflow_3B_t2i_256x256.pth" \
    --caption "your custom prompt here" \
    --sample_batch_size 8 \
    --cfg 3.6 \
    --aspect_ratio "1:1" \
    --seed 999
```

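To sweep seeds or guidance scales without editing the command each time, a small driver can help. The sketch below only shells out to the documented command and flags; treat it as a convenience wrapper, not a supported entry point:

```python
import subprocess

# Base command copied from the example above.
base = [
    "torchrun", "--standalone", "--nproc_per_node", "1", "sample.py",
    "--model_config_path", "configs/starflow_3B_t2i_256x256.yaml",
    "--checkpoint_path", "ckpts/starflow_3B_t2i_256x256.pth",
    "--caption", "a film still of a cat playing piano",
    "--sample_batch_size", "8",
    "--aspect_ratio", "1:1",
]

# Guidance range follows the Tips section (2.0-5.0).
for cfg in ("2.0", "3.6", "5.0"):
    for seed in ("0", "999"):
        subprocess.run(base + ["--cfg", cfg, "--seed", seed], check=True)
```
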
### Text-to-Video Generation

Generate videos from text descriptions:

```bash
# Basic video generation (480p, ~5 seconds)
bash scripts/test_sample_video.sh "a corgi dog looks at the camera"

# With a custom input image for TI2V (text+image-to-video) generation
bash scripts/test_sample_video.sh "a cat playing piano" "/path/to/input/image.jpg"

# Longer video generation (specify the target length in frames)
bash scripts/test_sample_video.sh "a corgi dog looks at the camera" "none" 241  # ~15 seconds at 16 fps
bash scripts/test_sample_video.sh "a corgi dog looks at the camera" "none" 481  # ~30 seconds at 16 fps

# Advanced video generation
torchrun --standalone --nproc_per_node 8 sample.py \
    --model_config_path "configs/starflow-v_7B_t2v_caus_480p.yaml" \
    --checkpoint_path "ckpts/starflow-v_7B_t2v_caus_480p_v3.pth" \
    --caption "your video prompt here" \
    --sample_batch_size 1 \
    --cfg 3.5 \
    --aspect_ratio "16:9" \
    --out_fps 16 \
    --jacobi 1 --jacobi_th 0.001 \
    --target_length 161  # customize video length
```

## Training

### Image Training

Train your own STARFlow model for text-to-image generation:

```bash
# Quick training test
bash scripts/test_train_image.sh 10 16

# Full training with custom parameters
torchrun --standalone --nproc_per_node 8 train.py \
    --model_config_path "configs/starflow_3B_t2i_256x256.yaml" \
    --epochs 100 \
    --batch_size 1024 \
    --wandb_name "my_starflow_training"
```

### Video Training

Train STARFlow-V for text-to-video generation:

```bash
# Quick training test
bash scripts/test_train_video.sh 10 8

# Resume training from a checkpoint
torchrun --standalone --nproc_per_node 8 train.py \
    --model_config_path "configs/starflow-v_7B_t2v_caus_480p.yaml" \
    --resume_path "ckpts/starflow-v_7B_t2v_caus_480p_v3.pth" \
    --epochs 100 \
    --batch_size 192
```

## Utilities

### Video Processing

Extract individual frames from multi-video grids:

```bash
# Extract frames from a video containing multiple video grids
python scripts/extract_image_from_video.py --input_video path/to/video.mp4 --output_dir output/

# Extract images with custom settings
python scripts/extract_images.py input_file.mp4
```

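For reference, the core idea behind grid extraction is just cropping each frame into tiles. The standalone sketch below illustrates that with OpenCV under assumed grid dimensions and paths; prefer the repo scripts above for real use:

```python
import cv2

def split_grid_frame(frame, rows=2, cols=2):
    """Crop one frame containing a rows x cols grid of videos into tiles."""
    h, w = frame.shape[:2]
    th, tw = h // rows, w // cols
    return [frame[r * th:(r + 1) * th, c * tw:(c + 1) * tw]
            for r in range(rows) for c in range(cols)]

cap = cv2.VideoCapture("path/to/video.mp4")
ok, frame = cap.read()
if ok:
    for i, tile in enumerate(split_grid_frame(frame)):
        cv2.imwrite(f"output/tile_{i}.png", tile)
cap.release()
```
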
## Model Architecture

### STARFlow (3B parameters, text-to-image)
- **Resolution**: 256×256
- **Architecture**: 6-block deep-shallow architecture
- **Text Encoder**: T5-XL
- **VAE**: SD-VAE
- **Features**: RoPE positional encoding, mixed-precision training

### STARFlow-V (7B parameters, text-to-video)
- **Resolution**: Up to 640×480 (480p)
- **Temporal**: 81 frames (~5 seconds at 16 FPS)
- **Architecture**: 6-block deep-shallow architecture (full sequence)
- **Text Encoder**: T5-XL
- **VAE**: WAN2.2-VAE
- **Features**: Causal attention, autoregressive generation, variable-length support

## Key Features

- **Autoregressive Flow Architecture**: Novel combination of autoregressive models and normalizing flows
- **High-Quality Generation**: FID scores and visual quality competitive with state-of-the-art diffusion models
- **Flexible Resolution**: Support for various aspect ratios and resolutions
- **Efficient Training**: FSDP support for large-scale distributed training
- **Fast Sampling**: Block-wise Jacobi iteration for accelerated inference (see the sketch after this list)
- **Text Conditioning**: Advanced text-to-image/video capabilities
- **Video Generation**: Temporal consistency and smooth motion

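For intuition, block-wise Jacobi decoding replaces strictly sequential token-by-token inversion with a parallel fixed-point iteration over a whole block, stopping once the block stops changing. The sketch below is a generic illustration of the idea, not the repo's implementation; `step_fn`, `max_iters`, and the threshold (playing the role of `--jacobi_th`) are placeholders:

```python
import torch

def jacobi_decode_block(step_fn, x0: torch.Tensor, max_iters: int = 50, th: float = 1e-3):
    """Generic block-wise Jacobi fixed-point iteration (illustration only).

    step_fn updates every position in the block in parallel; iterating to a
    fixed point matches the result of sequential autoregressive decoding.
    """
    x = x0
    for _ in range(max_iters):
        x_next = step_fn(x)                # parallel update of the whole block
        if (x_next - x).abs().max() < th:  # converged; cf. --jacobi_th
            return x_next
        x = x_next
    return x
```
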
## Configuration

### Key Parameters

#### Image Generation (`starflow_3B_t2i_256x256.yaml`)
- `img_size: 256` - Output image resolution
- `txt_size: 128` - Text sequence length
- `channels: 3072` - Model hidden dimension
- `cfg: 3.6` - Classifier-free guidance scale
- `noise_std: 0.3` - Flow noise standard deviation

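To double-check what a config actually sets, you can load it directly. A minimal sketch, assuming the configs are plain YAML (requires PyYAML):

```python
import yaml

# Assumes the config files are plain YAML (pip install pyyaml).
with open("configs/starflow_3B_t2i_256x256.yaml") as f:
    cfg = yaml.safe_load(f)

for key in ("img_size", "txt_size", "channels", "cfg", "noise_std"):
    print(key, "=", cfg.get(key))
```
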
#### Video Generation (`starflow-v_7B_t2v_caus_480p.yaml`)
- `img_size: 640` - Video frame resolution
- `vid_size: '81:16'` - Temporal dimensions (frames:downsampling)
- `fps_cond: 1` - FPS conditioning enabled
- `temporal_causal: 1` - Causal temporal attention

### Sampling Options
- `--cfg` - Classifier-free guidance scale (higher = more prompt adherence; see the schematic below)
- `--jacobi` - Enable Jacobi iteration for faster sampling
- `--jacobi_th` - Jacobi convergence threshold
- `--jacobi_block_size` - Block size for Jacobi iteration
- `--aspect_ratio` - Output aspect ratio ("1:1", "16:9", "4:3", etc.)
- `--seed` - Random seed for reproducible generation

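The guidance scale follows the standard classifier-free guidance recipe, combining conditional and unconditional predictions. Shown schematically below; how STARFlow applies this inside flow sampling is described in the paper:

```python
def classifier_free_guidance(pred_uncond, pred_cond, cfg):
    # Standard CFG combination: cfg = 1 recovers the conditional prediction;
    # larger values push samples toward the prompt at the cost of diversity.
    return pred_uncond + cfg * (pred_cond - pred_uncond)
```
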
## Project Structure

```
├── train.py               # Main training script
├── sample.py              # Sampling and inference
├── transformer_flow.py    # Core model implementation
├── dataset.py             # Dataset loading and preprocessing
├── finetune_decoder.py    # Decoder fine-tuning script
├── utils/                 # Utility modules
│   ├── common.py          # Core utility functions
│   ├── model_setup.py     # Model configuration and setup
│   ├── training.py        # Training utilities and metrics
│   └── inference.py       # Evaluation and metrics
├── configs/               # Model configuration files
│   ├── starflow_3B_t2i_256x256.yaml
│   └── starflow-v_7B_t2v_caus_480p.yaml
├── scripts/               # Example training and sampling scripts
│   ├── test_sample_image.sh
│   ├── test_sample_video.sh
│   ├── test_train_image.sh
│   ├── test_train_video.sh
│   ├── setup_conda.sh
│   ├── extract_images.py
│   └── extract_image_from_video.py
└── misc/                  # Additional utilities
    ├── pe.py              # Positional encodings
    ├── lpips.py           # LPIPS loss
    └── wan_vae2.py        # Video VAE implementation
```

## Tips

### Image Generation
1. Use guidance scales between 2.0 and 5.0 for a balance of quality and diversity
2. Experiment with different aspect ratios for your use case
3. Enable Jacobi iteration (`--jacobi 1`) for faster sampling
4. Use higher-resolution models for detailed outputs
5. The default script uses optimized settings: `--jacobi_th 0.001` and `--jacobi_block_size 16`

### Video Generation
1. Start with shorter sequences (81 frames) and gradually increase the length (161, 241, 481+ frames)
2. Use input images (`--input_image`) for more controlled generation
3. Adjust FPS settings based on content type (8-24 FPS)
4. Consider temporal consistency when crafting prompts
5. The default script uses `--jacobi_block_size 64`
6. **Longer videos**: Use `--target_length` to generate videos beyond the training length (requires `--jacobi 1`)
7. **Frame reference**: 81 frames ≈ 5 s, 161 frames ≈ 10 s, 241 frames ≈ 15 s, 481 frames ≈ 30 s at 16 fps (see the helper below)

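The frame counts above follow a simple pattern: every documented length satisfies frames = fps × seconds + 1 at 16 fps. A small helper for converting between the two (the +1 convention is inferred from the reference numbers above):

```python
def frames_for_duration(seconds: float, fps: int = 16) -> int:
    # All documented lengths (81, 161, 241, 481) satisfy frames = fps * seconds + 1.
    return int(fps * seconds) + 1

def duration_for_frames(frames: int, fps: int = 16) -> float:
    return (frames - 1) / fps

for frames in (81, 161, 241, 481):
    print(f"{frames} frames ~ {duration_for_frames(frames):g} s at 16 fps")
```
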
### Training
1. Use FSDP for efficient large-model training
2. Start with smaller batch sizes and scale up
3. Monitor loss curves and adjust learning rates accordingly
4. Use gradient checkpointing to reduce memory usage
5. The test scripts include `--dry_run 1` for validation

## Citation

If you use STARFlow in your research, please cite:

```bibtex
@inproceedings{gu2025starflow,
  title={STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis},
  author={Gu, Jiatao and Chen, Tianrong and Berthelot, David and Zheng, Huangjie and Wang, Yuyang and Zhang, Ruixiang and Dinh, Laurent and Bautista, Miguel Angel and Susskind, Josh and Zhai, Shuangfei},
  booktitle={NeurIPS},
  year={2025}
}
```

## License

Please review the repository [LICENSE](LICENSE) before using the provided code, and [LICENSE_MODEL](LICENSE_MODEL) before using the released models.

## Contributing

We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.