---
license: other
tags:
- text-to-image
- diffusion
- linear-attention
- pytorch
- safetensors
language:
- en
pipeline_tag: text-to-image
---
# Boomer FLA
Boomer FLA is a 657M-parameter text-to-image diffusion model that generates **1024×1024px** images from text prompts.
Instead of standard quadratic self-attention, it uses **GatedDeltaNet**, a bidirectional Flash Linear Attention mixer, as the backbone of its transformer blocks. This avoids the quadratic attention matrix, so the mixer's memory and compute grow linearly with sequence length. Every 6th block adds a full SDPA layer for global spatial coherence.
Text conditioning uses **Gemma 4 2B** (1536-dim embeddings, up to 300 tokens). Decoding uses the **DC-AE f32c32** VAE with 32× spatial compression, producing 32×32 latents from 1024px images. Inference uses **STORK-2** flow sampling ([Tan et al., 2025](https://arxiv.org/abs/2505.24210)).
Pre-trained on ~3.8M JourneyDB latents at 512px, then fine-tuned on ~600k high-resolution latents from maxu/Fine-T2i at 1024px.
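
The hybrid layout described above can be pictured with a short sketch. It assumes that "every 6th block" means blocks 6, 12, 18, and 24 in 1-based counting, and that those blocks keep the GatedDeltaNet mixer and add SDPA on top; the released code may place them differently. `layer_kinds` is a hypothetical helper, not part of the Boomer repository.

```python
# Hypothetical helper, not part of the Boomer code base.
# ASSUMPTION: "every 6th block" means 1-based blocks 6, 12, 18, ... carry
# the additional full SDPA layer; all other blocks use GatedDeltaNet only.
def layer_kinds(depth: int = 24, sdpa_every: int = 6) -> list[str]:
    kinds = []
    for block in range(1, depth + 1):
        if block % sdpa_every == 0:
            kinds.append("gated_deltanet+sdpa")
        else:
            kinds.append("gated_deltanet")
    return kinds


kinds = layer_kinds()
# 1-based positions of the blocks that include full attention
sdpa_blocks = [i + 1 for i, kind in enumerate(kinds) if kind.endswith("+sdpa")]
print(sdpa_blocks)  # [6, 12, 18, 24]
```

Under this assumption only 4 of the 24 blocks pay the quadratic attention cost.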
---
## Sample Outputs
![Boomer-T2I Portfolio Landscape Grid](https://cdn-uploads.huggingface.co/production/uploads/68a39870573ae211ac576be6/fN3TCPf5lGBFCq_thgYkw.png)
1. **Misty Pine Forest:** A cinematic, wide-angle shot of a misty pine forest at sunrise, deep green valleys, soft morning light piercing through fog, photorealistic, 8k resolution.
2. **Black Sand Beach:** A dramatic black sand beach in Iceland, towering basalt columns, massive white waves crashing on the shore, moody overcast sky, high detail landscape architecture.
3. **Alpine Lake:** A serene alpine lake reflecting jagged snow-capped mountain peaks, crystal clear turquoise water, vibrant wildflower meadows in the foreground, golden hour lighting.
4. **Tuscan Hills:** A sweeping view of rolling terracotta hills in Tuscany, isolated cypress trees lining a dirt road, warm late afternoon sun casting long shadows, classic landscape photography.
5. **Desert Oasis:** A futuristic desert oasis at dusk, neon-infused crystalline structures contrasting with deep orange sand dunes, hyper-detailed, sprawling cinematic atmosphere.
6. **Hidden Lagoon:** A majestic waterfall cascading down a sheer mossy cliff into a hidden tropical lagoon, lush emerald foliage, sunbeams cutting through the canopy, long exposure water effect.
## Architecture
![image](https://cdn-uploads.huggingface.co/production/uploads/68a39870573ae211ac576be6/9svsLkAt5aV9iWYD25xS_.png)
---
| Property | Value |
|---|---|
| Parameters | 657M |
| Backbone | Bidirectional GatedDeltaNet (Flash Linear Attention) |
| Depth | 24 layers |
| Hidden dim | 896 |
| Heads | 14 |
| Image attention | Every 6th layer (full SDPA + 2D RoPE) |
| Patch size | 1 (one token per latent pixel; 256 tokens at 512px, 1024 tokens at 1024px) |
| Text encoder | Gemma 4 2B (`google/gemma-4-E2B-it`) |
| VAE | DC-AE f32c32 (`mit-han-lab/dc-ae-f32c32-sana-1.1-diffusers`) |
| Sampler | STORK-2, 32 steps |
| Dtype | bfloat16 |
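
The token counts in the table follow from the VAE compression factor and the patch size. A minimal sketch of that arithmetic, assuming square images whose side is divisible by the downscale factor (`latent_tokens` is an illustrative name, not a function from the repository):

```python
# Illustrative helper: image tokens seen by the transformer for a square image.
def latent_tokens(image_px: int, vae_downscale: int = 32, patch_size: int = 1) -> int:
    stride = vae_downscale * patch_size
    if image_px % stride != 0:
        raise ValueError(f"image size {image_px} is not divisible by {stride}")
    side = image_px // stride  # latent grid side length, in tokens
    return side * side


print(latent_tokens(512))   # 256  (16 x 16 latent grid)
print(latent_tokens(1024))  # 1024 (32 x 32 latent grid)
```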
---
## Training details
| Setting | Value |
|---|---|
| Pre-train dataset | JourneyDB (~3.8M images, 512px, patch size 1) |
| Fine-tune dataset | FineT2I (~600k images, 1024px, patch size 1) |
| Optimizer | Fused AdamW |
| Hardware | RTX PRO 6000 Blackwell Server Edition |
| Precision | bfloat16 |
---
## VRAM and RAM requirements
Measured at 1024×1024px, bfloat16, STORK-2, CFG batch=2.
| Component | VRAM |
|---|---|
| DiT weights (EMA, bf16) | 1.25 GB |
| Gemma 4 2B text encoder | 8.62 GB |
| Denoising peak (CFG on) | 1.36 GB |
| VAE decode peak | 3.51 GB |

| Mode | Peak VRAM | Minimum GPU |
|---|---|---|
| **Condition-cache**: pre-encoded embeddings, no text encoder in VRAM | **4.76 GB** | RTX 3060 8GB, T4 |
| **Fresh-prompt**: text encoder + DiT + VAE together | **13.38 GB** | RTX 3090, A100 |
**System RAM**: loading the text encoder (Gemma 4 2B) requires ~9 GB of system RAM even when running on GPU. In condition-cache mode, encoding can be done on CPU with ~9 GB of RAM; the generation step then needs only about 5 GB of VRAM.
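
The two peak figures are consistent with a simple reading of the component table: resident weights plus the larger of the two transient peaks. The sketch below reproduces the arithmetic under that assumption. This is an interpretation of the measurements, not a description of how they were taken, and `peak_vram` is an illustrative name.

```python
# Measured values from the component table, in GB.
DIT_WEIGHTS = 1.25
TEXT_ENCODER = 8.62
DENOISE_PEAK = 1.36
VAE_DECODE_PEAK = 3.51


def peak_vram(resident_gb: float) -> float:
    # ASSUMPTION: denoising and VAE decoding run one after the other,
    # so only the larger transient peak adds to the resident weights.
    return round(resident_gb + max(DENOISE_PEAK, VAE_DECODE_PEAK), 2)


condition_cache = peak_vram(DIT_WEIGHTS)             # DiT only
fresh_prompt = peak_vram(DIT_WEIGHTS + TEXT_ENCODER)  # DiT + text encoder
print(condition_cache, fresh_prompt)  # 4.76 13.38
```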
---
## Usage
### Install
```bash
pip install -U torch diffusers transformers accelerate safetensors torchvision scipy
pip install git+https://github.com/fla-org/flash-linear-attention.git
```
### Generate
```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "akrao9/Boomer-T2I",
    custom_pipeline="pipeline_boomer",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("A hyper-detailed, cinematic landscape photography shot of a pristine, mirror-like alpine lake nestled deeply between towering, jagged snow-capped mountain peaks. The scene is captured during the perfect golden hour, with the low-angled warm sun casting deep amber and violet hues across the rugged granite rock faces. In the foreground, vibrant clusters of purple lupines and orange poppies dot a lush emerald meadow that meets the crystal-clear turquoise edge of the water. Wisps of soft, low-hanging morning mist drift lazily across the lake's surface, breaking the perfect reflection of the monumental peaks above. Shot on 35mm lens, ultra-sharp focus, dramatic depth of field, 8k resolution, path-traced lighting textures.")[0]
image.save("output.png")
```
### Note
The transformer weights (1.25 GB) are downloaded from this repo. The VAE and the Gemma 4 2B text encoder are fetched automatically from their upstream Hugging Face repos on first use (about 10 GB in total). Accept the [Gemma Terms of Use](https://ai.google.dev/gemma/terms) and run `hf auth login` before first use.
### Parameters
```python
image = pipe(
    "a rocky coastline at sunset with crashing waves",
    steps=32,         # denoising steps; 32 is recommended with STORK-2
    cfg_scale=4.5,    # classifier-free guidance scale (4.0–5.0)
    cfg_rescale=0.5,  # reduces over-saturation at high CFG
    seed=42,
)[0]
```
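
To give a feel for how `cfg_scale` and `cfg_rescale` interact, here is a dependency-free sketch of classifier-free guidance with standard-deviation rescaling, in the style of the widely used guidance-rescale formulation (for example the `rescale_noise_cfg` helper in diffusers). Whether Boomer's pipeline uses exactly this formula is an assumption. `guided_prediction` is a hypothetical name, and plain Python lists stand in for tensors.

```python
from statistics import pstdev


def guided_prediction(cond, uncond, cfg_scale=4.5, cfg_rescale=0.5):
    """Classifier-free guidance with std-based rescaling on flat lists of floats."""
    # Standard CFG: push the unconditional prediction toward the conditional one.
    guided = [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]
    # Rescale the guided prediction to the spread of the conditional prediction.
    ratio = pstdev(cond) / pstdev(guided)
    rescaled = [g * ratio for g in guided]
    # Blend: 0.0 keeps plain CFG, 1.0 uses the fully rescaled prediction.
    return [cfg_rescale * r + (1.0 - cfg_rescale) * g for r, g in zip(rescaled, guided)]


# Toy stand-ins for the model's conditional and unconditional outputs.
cond = [0.2, -0.4, 0.9, -0.1]
uncond = [0.1, -0.1, 0.3, 0.0]

plain = guided_prediction(cond, uncond, cfg_rescale=0.0)
blended = guided_prediction(cond, uncond, cfg_rescale=0.5)
full = guided_prediction(cond, uncond, cfg_rescale=1.0)
print(pstdev(plain) > pstdev(blended) > pstdev(full))  # True
```

In this formulation a high guidance scale inflates the spread of the prediction, and the rescale term pulls it back toward the spread of the conditional output, which is one common explanation for reduced over-saturation.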
### Low VRAM: condition cache mode
Encode prompts once on any machine (including CPU), save the embedding, then generate without the text encoder in VRAM, so that the 1.25 GB DiT is the only large set of weights loaded. Peak VRAM drops from 13.38 GB to 4.76 GB.
```python
# Step 1: encode on any machine (even CPU with ~9 GB RAM)
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

TE_REPO = "google/gemma-4-E2B-it"

# The processor is used here as the tokenizer for text-only input
tokenizer = AutoProcessor.from_pretrained(TE_REPO)
# Keep only the decoder stack; its hidden states serve as the text embeddings
text_encoder = AutoModelForCausalLM.from_pretrained(
    TE_REPO, torch_dtype=torch.bfloat16
).get_decoder()

# Pad or truncate the prompt to exactly 300 tokens
tokens = tokenizer(
    "a mountain lake surrounded by alpine peaks",
    max_length=300, padding="max_length",
    truncation=True, return_tensors="pt",
)

with torch.inference_mode():
    hidden = text_encoder(
        tokens["input_ids"], attention_mask=tokens["attention_mask"]
    )[0]

# Keep the first position plus the last 299 positions (300 in total)
idx = [0] + list(range(-299, 0))
emb = hidden[:, idx]
mask = tokens["attention_mask"][:, idx]
torch.save({"emb": emb.cpu(), "mask": mask.cpu()}, "condition.pt")
```
```python
# Step 2: generate on a low-VRAM GPU
# Load only the DiT weights; skip the text encoder by providing embeddings directly
import sys

import torch
from huggingface_hub import snapshot_download

# Download the repo and make its custom modules importable
local = snapshot_download("akrao9/Boomer-T2I")
sys.path.insert(0, local)

from pipeline_boomer import BoomerPipeline
from modeling_boomer_fla import BoomerFLADiT

transformer = BoomerFLADiT.from_pretrained(local, subfolder="transformer").to("cuda", dtype=torch.bfloat16)
pipe = BoomerPipeline(transformer=transformer).to("cuda")
# pipe.text_encoder and pipe.vae remain None until __call__; override with the pre-encoded emb manually
```
---
## Capabilities and limitations
Trained on **JourneyDB** (pretrain, 512px) and **FineT2I** (finetune, 1024px).
- **Strongest at:** dramatic landscapes, natural environments, and scenic/architectural scenes
- **Also works for:** many everyday objects when embedded in a scene (quality varies)
- **Not reliable:** humans, portraits, and people (faces and anatomy are inconsistent)
- **Less reliable:** animals, legible text in images, very fine small details
- Landscapes can show a painterly/HDR bias from heavily post-processed stock images in the training set
- Not safety filtered; outputs may reflect biases in the training data
- Maximum tested resolution: **1024×1024px**
---
## Acknowledgements
- **STORK-2:** inference sampling ([Tan et al., 2025](https://arxiv.org/abs/2505.24210))
- **Plateau logit-normal:** training timestep distribution from [FLUX.2 representation comparison](https://bfl.ai/research/representation-comparison) (Black Forest Labs, 2025). Boomer uses μ=0, σ=1 with flow shift 1.5.
- **Sana:** flow shift, logit-normal sampling, forward noising, and related helpers adapted from [NVlabs/Sana](https://github.com/NVlabs/Sana) (`boomer/sana_flow.py`, `boomer/sana_latent_cache.py`); several DiT blocks follow Sana conventions ([Xie et al., 2024](https://arxiv.org/abs/2410.10629))
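
As a rough illustration of the timestep settings mentioned above (μ=0, σ=1, flow shift 1.5), the sketch below draws plain logit-normal timesteps and applies the shift t' = s·t / (1 + (s − 1)·t) commonly used in flow-matching code. It does not implement the plateau variant, and it assumes this shift formula and the function names; the actual training code may differ.

```python
import math
import random


def shift_timestep(t: float, shift: float = 1.5) -> float:
    """Flow shift: t' = shift * t / (1 + (shift - 1) * t). ASSUMED formula."""
    return shift * t / (1.0 + (shift - 1.0) * t)


def sample_timestep(rng: random.Random, mu: float = 0.0, sigma: float = 1.0,
                    shift: float = 1.5) -> float:
    # Plain logit-normal: a Gaussian sample squashed through a sigmoid into (0, 1).
    z = rng.gauss(mu, sigma)
    t = 1.0 / (1.0 + math.exp(-z))
    return shift_timestep(t, shift)


rng = random.Random(0)
samples = [sample_timestep(rng) for _ in range(10_000)]
print(min(samples) > 0.0, max(samples) < 1.0)
print(shift_timestep(0.5))  # the unshifted median 0.5 maps to 0.6
```

With a shift above 1, every timestep in (0, 1) is pushed toward 1. Whether that end of the range denotes pure noise depends on the convention used by the training code.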
---
## License
The Boomer FLA model weights are released for research and personal use.
Upstream component licenses:
- DC-AE VAE: [mit-han-lab/dc-ae-f32c32-sana-1.1-diffusers](https://huggingface.co/mit-han-lab/dc-ae-f32c32-sana-1.1-diffusers)
- Gemma 4 text encoder: [Gemma Terms of Use](https://ai.google.dev/gemma/terms)