| license: other | |
| tags: | |
| - text-to-image | |
| - diffusion | |
| - linear-attention | |
| - pytorch | |
| - safetensors | |
| language: | |
| - en | |
| pipeline_tag: text-to-image | |
| # Boomer FLA | |
| Boomer FLA is a 657M-parameter text-to-image diffusion model that generates **1024×1024px** images from text prompts. | |
| Instead of standard quadratic self-attention, it uses **GatedDeltaNet**, a bidirectional Flash Linear Attention mixer, as the backbone of its transformer blocks. Its state has a fixed size, so attention memory stays flat regardless of sequence length. Every 6th block adds a full SDPA layer for global spatial coherence. | |
| Text conditioning uses **Gemma 4 2B** (1536-dim embeddings, up to 300 tokens). Decoding uses the **DC-AE f32c32** VAE with 32× spatial compression, producing 32×32 latents from 1024px images. Inference uses **STORK-2** flow sampling ([Tan et al., 2025](https://arxiv.org/abs/2505.24210)). | |
| Pre-trained on ~3.8M JourneyDB latents at 512px, then fine-tuned on ~600k high-resolution latents from maxu/Fine-T2i at 1024px. | |
| --- | |
| ## Sample Outputs | |
|  | |
| 1. **Misty Pine Forest:** A cinematic, wide-angle shot of a misty pine forest at sunrise, deep green valleys, soft morning light piercing through fog, photorealistic, 8k resolution. | |
| 2. **Black Sand Beach:** A dramatic black sand beach in Iceland, towering basalt columns, massive white waves crashing on the shore, moody overcast sky, high detail landscape architecture. | |
| 3. **Alpine Lake:** A serene alpine lake reflecting jagged snow-capped mountain peaks, crystal clear turquoise water, vibrant wildflower meadows in the foreground, golden hour lighting. | |
| 4. **Tuscan Hills:** A sweeping view of rolling terracotta hills in Tuscany, isolated cypress trees lining a dirt road, warm late afternoon sun casting long shadows, classic landscape photography. | |
| 5. **Desert Oasis:** A futuristic desert oasis at dusk, neon-infused crystalline structures contrasting with deep orange sand dunes, hyper-detailed, sprawling cinematic atmosphere. | |
| 6. **Hidden Lagoon:** A majestic waterfall cascading down a sheer mossy cliff into a hidden tropical lagoon, lush emerald foliage, sunbeams cutting through the canopy, long exposure water effect. | |
| ## Architecture | |
|  | |
| --- | |
| | Property | Value | | |
| |---|---| | |
| | Parameters | 657M | | |
| | Backbone | Bidirectional GatedDeltaNet (Flash Linear Attention) | | |
| | Depth | 24 layers | | |
| | Hidden dim | 896 | | |
| | Heads | 14 | | |
| | Image attention | Every 6th layer (full SDPA + 2D RoPE) | | |
| | Patch size | 1 (one token per latent pixel: 256 tokens at 512px, 1024 tokens at 1024px) | | |
| | Text encoder | Gemma 4 2B (`google/gemma-4-E2B-it`) | | |
| | VAE | DC-AE f32c32 (`mit-han-lab/dc-ae-f32c32-sana-1.1-diffusers`) | | |
| | Sampler | STORK-2, 32 steps | | |
| | Dtype | bfloat16 | | |
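The token counts in the table follow from the VAE's 32× spatial compression combined with a patch size of 1. A minimal sketch of that arithmetic is below; the helper name `latent_tokens` is illustrative and not part of the Boomer code.

```python
def latent_tokens(image_px: int, vae_downsample: int = 32, patch_size: int = 1) -> int:
    """Number of transformer tokens for a square image.

    Hypothetical helper for illustration only; the defaults mirror the
    values stated in the table (DC-AE f32 compression, patch size 1).
    """
    latent_side = image_px // vae_downsample      # 1024 -> 32, 512 -> 16
    tokens_per_side = latent_side // patch_size   # patch size 1: one token per latent pixel
    return tokens_per_side * tokens_per_side


print(latent_tokens(512))   # 256
print(latent_tokens(1024))  # 1024
```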
| --- | |
| ## Training details | |
| | Setting | Value | | |
| |---|---| | |
| | Pre-train dataset | JourneyDB (~3.8M images, 512px, patch size 1) | | |
| | Fine-tune dataset | FineT2I (~600k images, 1024px, patch size 1) | | |
| | Optimizer | Fused AdamW | | |
| | Hardware | RTX PRO 6000 Blackwell Server Edition | | |
| | Precision | bfloat16 | | |
| --- | |
| ## VRAM and RAM requirements | |
| Measured at 1024×1024px, bfloat16, STORK-2, CFG batch=2. | |
| | Component | VRAM | | |
| |---|---| | |
| | DiT weights (EMA, bf16) | 1.25 GB | | |
| | Gemma 4 2B text encoder | 8.62 GB | | |
| | Denoising peak (CFG on) | 1.36 GB | | |
| | VAE decode peak | 3.51 GB | | |
| Peak usage by inference mode: | |
| | Mode | Peak VRAM | Minimum GPU | | |
| |---|---|---| | |
| | **Condition-cache**: pre-encoded embeddings, no text encoder in VRAM | **4.76 GB** | RTX 3060 8GB, T4 | | |
| | **Fresh-prompt**: text encoder + DiT + VAE together | **13.38 GB** | RTX 3090, A100 | | |
| **System RAM**: loading the text encoder (Gemma 4 2B) requires ~9 GB of system RAM even when running on a GPU. In condition-cache mode, encoding can be done on a CPU with ~9 GB of RAM; the generation step then needs only about 5 GB of VRAM. | |
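The two peak figures appear to be sums of the component rows: the resident weights plus the largest transient stage, which here is the VAE decode. The sketch below reproduces that arithmetic. The composition is inferred from the numbers in the tables rather than stated by the author, and the variable names are illustrative.

```python
# Component figures copied from the tables in this section (GB, bfloat16, 1024x1024).
DIT_WEIGHTS = 1.25
TEXT_ENCODER = 8.62
DENOISE_PEAK = 1.36
VAE_DECODE_PEAK = 3.51

# Assumption: peak = resident weights + the largest transient stage.
transient = max(DENOISE_PEAK, VAE_DECODE_PEAK)

condition_cache_peak = round(DIT_WEIGHTS + transient, 2)
fresh_prompt_peak = round(DIT_WEIGHTS + TEXT_ENCODER + transient, 2)

print(condition_cache_peak)  # 4.76
print(fresh_prompt_peak)     # 13.38
```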
| --- | |
| ## Usage | |
| ### Install | |
| ```bash | |
| pip install -U torch diffusers transformers accelerate safetensors torchvision scipy | |
| pip install git+https://github.com/fla-org/flash-linear-attention.git | |
| ``` | |
| ### Generate | |
| ```python | |
| import torch | |
| from diffusers import DiffusionPipeline | |
| pipe = DiffusionPipeline.from_pretrained( | |
|     "akrao9/Boomer-T2I", | |
|     custom_pipeline="pipeline_boomer", | |
|     trust_remote_code=True, | |
|     torch_dtype=torch.bfloat16, | |
| ).to("cuda") | |
| image = pipe("A hyper-detailed, cinematic landscape photography shot of a pristine, mirror-like alpine lake nestled deeply between towering, jagged snow-capped mountain peaks. The scene is captured during the perfect golden hour, with the low-angled warm sun casting deep amber and violet hues across the rugged granite rock faces. In the foreground, vibrant clusters of purple lupines and orange poppies dot a lush emerald meadow that meets the crystal-clear turquoise edge of the water. Wisps of soft, low-hanging morning mist drift lazily across the lake's surface, breaking the perfect reflection of the monumental peaks above. Shot on 35mm lens, ultra-sharp focus, dramatic depth of field, 8k resolution, path-traced lighting textures.")[0] | |
| image.save("output.png") | |
| ``` | |
| ### Note | |
| The transformer weights (1.25 GB) are downloaded from this repo. The VAE and the Gemma 4 2B text encoder are fetched automatically from their upstream Hugging Face repos on first use (10 GB total). Before that first run, accept the [Gemma Terms of Use](https://ai.google.dev/gemma/terms) and run `hf auth login`. | |
| ### Parameters | |
| ```python | |
| image = pipe( | |
|     "a rocky coastline at sunset with crashing waves", | |
|     steps=32,          # denoising steps; 32 is recommended with STORK-2 | |
|     cfg_scale=4.5,     # classifier-free guidance scale (4.0 to 5.0) | |
|     cfg_rescale=0.5,   # reduces over-saturation at high CFG | |
|     seed=42, | |
| )[0] | |
| ``` | |
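To show how `cfg_scale` and `cfg_rescale` usually interact, here is a minimal sketch of classifier-free guidance with standard-deviation rescaling, in the widely used formulation. Whether `pipeline_boomer` computes it exactly this way is an assumption. The function name and the flat-list inputs are illustrative stand-ins for the model's conditional and unconditional predictions.

```python
from statistics import pstdev


def guided_prediction(cond, uncond, cfg_scale=4.5, cfg_rescale=0.5):
    """Classifier-free guidance with std-based rescaling on flat lists of floats.

    Illustrative sketch only; not taken from the Boomer pipeline source.
    """
    # Standard CFG: push the prediction away from the unconditional branch.
    guided = [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]

    # Rescale so the guided prediction has the same spread as the conditional one.
    std_guided = pstdev(guided)
    if std_guided == 0.0:
        return guided
    factor = pstdev(cond) / std_guided
    rescaled = [g * factor for g in guided]

    # Blend: cfg_rescale=0 keeps plain CFG, cfg_rescale=1 uses the fully rescaled output.
    return [cfg_rescale * r + (1.0 - cfg_rescale) * g for r, g in zip(rescaled, guided)]


cond = [0.2, -0.4, 0.9, 0.1]
uncond = [0.1, -0.1, 0.3, 0.0]
plain = guided_prediction(cond, uncond, cfg_rescale=0.0)
blended = guided_prediction(cond, uncond, cfg_rescale=0.5)
full = guided_prediction(cond, uncond, cfg_rescale=1.0)

# Rescaling pulls the spread of the guided output back toward the conditional one.
print(pstdev(plain) > pstdev(blended) > pstdev(cond))  # True
```

In this formulation a higher `cfg_rescale` trades some guidance strength for output statistics closer to the unguided prediction, which is why it helps with over-saturation.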
| ### Low VRAM: condition-cache mode | |
| Encode prompts once on any machine (including CPU), save the embedding, then generate with the 1.25 GB DiT loaded and no text encoder in VRAM. Peak VRAM drops from 13.38 GB to 4.76 GB. | |
| ```python | |
| # Step 1: encode on any machine (even CPU with 9 GB RAM) | |
| import torch | |
| from transformers import AutoModelForCausalLM, AutoProcessor | |
| TE_REPO = "google/gemma-4-E2B-it" | |
| tokenizer = AutoProcessor.from_pretrained(TE_REPO) | |
| # Load the causal LM, then keep only its decoder stack as the text encoder | |
| text_encoder = AutoModelForCausalLM.from_pretrained( | |
|     TE_REPO, torch_dtype=torch.bfloat16 | |
| ).get_decoder() | |
| tokens = tokenizer( | |
|     "a mountain lake surrounded by alpine peaks", | |
|     max_length=300, padding="max_length", | |
|     truncation=True, return_tensors="pt", | |
| ) | |
| with torch.inference_mode(): | |
|     hidden = text_encoder( | |
|         tokens["input_ids"], attention_mask=tokens["attention_mask"] | |
|     )[0] | |
| # Keep the first token plus the last 299 positions (300 total) | |
| idx = [0] + list(range(-299, 0)) | |
| emb = hidden[:, idx] | |
| mask = tokens["attention_mask"][:, idx] | |
| torch.save({"emb": emb.cpu(), "mask": mask.cpu()}, "condition.pt") | |
| ``` | |
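The `idx` selection in Step 1 keeps the first token plus the last 299 positions, 300 in total. With `padding="max_length"` and `max_length=300` the sequence is already exactly 300 tokens long, so the selection leaves it unchanged; it only trims when a longer sequence is passed in. A small pure-Python check of that indexing follows, where `select_positions` is an illustrative helper and not part of the repo.

```python
def select_positions(seq_len: int, keep: int = 300) -> list[int]:
    """Resolve idx = [0] + range(-(keep - 1), 0) to absolute positions."""
    idx = [0] + list(range(-(keep - 1), 0))
    # Negative indices count from the end, as in tensor indexing.
    return [i if i >= 0 else seq_len + i for i in idx]


# Padded to exactly 300 tokens: every position is kept, in order.
print(select_positions(300) == list(range(300)))  # True

# A hypothetical 320-token sequence: token 0 plus the final 299 positions.
longer = select_positions(320)
print(longer[:3], longer[-1], len(longer))  # [0, 21, 22] 319 300
```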
| ```python | |
| # Step 2: generate on a low-VRAM GPU | |
| # Load only the DiT weights; skip the text encoder by providing embeddings directly | |
| import sys | |
| import torch | |
| from huggingface_hub import snapshot_download | |
| local = snapshot_download("akrao9/Boomer-T2I") | |
| sys.path.insert(0, local)  # make the repo's pipeline and model modules importable | |
| from pipeline_boomer import BoomerPipeline | |
| from modeling_boomer_fla import BoomerFLADiT | |
| transformer = BoomerFLADiT.from_pretrained(local, subfolder="transformer").to("cuda", dtype=torch.bfloat16) | |
| pipe = BoomerPipeline(transformer=transformer).to("cuda") | |
| # pipe.text_encoder and pipe.vae remain None until __call__; override with the pre-encoded emb manually | |
| ``` | |
| --- | |
| ## Capabilities and limitations | |
| Trained on **JourneyDB** (pretrain, 512px) and **FineT2I** (finetune, 1024px). | |
| - **Strongest at**: dramatic landscapes, natural environments, and scenic/architectural scenes | |
| - **Also works for**: many everyday objects when embedded in a scene (quality varies) | |
| - **Not reliable**: humans, portraits, and people (faces and anatomy are inconsistent) | |
| - **Less reliable**: animals, legible text in images, very fine small details | |
| - Landscapes can show a painterly/HDR bias from heavily post-processed stock images in the training set | |
| - Not safety filtered; outputs may reflect biases in the training data | |
| - Maximum tested resolution: **1024×1024px** | |
| --- | |
| ## Acknowledgements | |
| - **STORK-2**: inference sampling ([Tan et al., 2025](https://arxiv.org/abs/2505.24210)) | |
| - **Plateau logit-normal**: training timestep distribution from [FLUX.2 representation comparison](https://bfl.ai/research/representation-comparison) (Black Forest Labs, 2025). Boomer uses μ=0, σ=1 with flow shift 1.5. | |
| - **Sana**: flow shift, logit-normal sampling, forward noising, and related helpers adapted from [NVlabs/Sana](https://github.com/NVlabs/Sana) (`boomer/sana_flow.py`, `boomer/sana_latent_cache.py`); several DiT blocks follow Sana conventions ([Xie et al., 2024](https://arxiv.org/abs/2410.10629)) | |
| --- | |
| ## License | |
| The Boomer FLA model weights are released for research and personal use. | |
| Upstream component licenses: | |
| - DC-AE VAE: [mit-han-lab/dc-ae-f32c32-sana-1.1-diffusers](https://huggingface.co/mit-han-lab/dc-ae-f32c32-sana-1.1-diffusers) | |
| - Gemma 4 text encoder: [Gemma Terms of Use](https://ai.google.dev/gemma/terms) | |