---
license: other
tags:
  - text-to-image
  - diffusion
  - linear-attention
  - pytorch
  - safetensors
language:
  - en
pipeline_tag: text-to-image
---

# Boomer FLA

Boomer FLA is a 657M-parameter text-to-image diffusion model that generates **1024×1024px** images from text prompts.

Instead of standard quadratic self-attention, it uses **GatedDeltaNet**, a bidirectional Flash Linear Attention mixer, as the backbone of its transformer blocks. This avoids the quadratic cost of full attention, so memory and compute scale linearly with the number of latent tokens. Every 6th block adds a full SDPA (scaled dot-product attention) layer for global spatial coherence.
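
To make the layer layout concrete, the sketch below lists which blocks would carry the extra full-attention layer. It assumes 1-indexed counting over the 24 layers listed in the Architecture table; the model's actual indexing convention is not stated here, and `sdpa_block_indices` is a hypothetical helper rather than part of the Boomer code.

```python
def sdpa_block_indices(depth: int = 24, every: int = 6) -> list[int]:
    """Return the 1-indexed blocks that add a full SDPA layer (assumed convention)."""
    return [i for i in range(1, depth + 1) if i % every == 0]


blocks = sdpa_block_indices()
print(blocks)       # [6, 12, 18, 24]
print(len(blocks))  # 4 of the 24 blocks use quadratic attention
```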

Text conditioning uses **Gemma 4 2B** (1536-dim embeddings, up to 300 tokens). Decoding uses the **DC-AE f32c32** VAE with 32× spatial compression, producing 32×32 latents from 1024px images. Inference uses **STORK-2** flow sampling ([Tan et al., 2025](https://arxiv.org/abs/2505.24210)).

The model was pretrained on ~3.8M JourneyDB latents at 512px and fine-tuned on ~600k high-resolution latents from FineT2I (`maxu/Fine-T2i`) at 1024px.

---

## Sample Outputs

![Boomer-T2I Portfolio Landscape Grid](https://cdn-uploads.huggingface.co/production/uploads/68a39870573ae211ac576be6/fN3TCPf5lGBFCq_thgYkw.png)

1. **Misty Pine Forest:** A cinematic, wide-angle shot of a misty pine forest at sunrise, deep green valleys, soft morning light piercing through fog, photorealistic, 8k resolution.
2. **Black Sand Beach:** A dramatic black sand beach in Iceland, towering basalt columns, massive white waves crashing on the shore, moody overcast sky, high detail landscape architecture.
3. **Alpine Lake:** A serene alpine lake reflecting jagged snow-capped mountain peaks, crystal clear turquoise water, vibrant wildflower meadows in the foreground, golden hour lighting.
4. **Tuscan Hills:** A sweeping view of rolling terracotta hills in Tuscany, isolated cypress trees lining a dirt road, warm late afternoon sun casting long shadows, classic landscape photography.
5. **Desert Oasis:** A futuristic desert oasis at dusk, neon-infused crystalline structures contrasting with deep orange sand dunes, hyper-detailed, sprawling cinematic atmosphere.
6. **Hidden Lagoon:** A majestic waterfall cascading down a sheer mossy cliff into a hidden tropical lagoon, lush emerald foliage, sunbeams cutting through the canopy, long exposure water effect.

## Architecture


![image](https://cdn-uploads.huggingface.co/production/uploads/68a39870573ae211ac576be6/9svsLkAt5aV9iWYD25xS_.png)

---

| Property | Value |
|---|---|
| Parameters | 657M |
| Backbone | Bidirectional GatedDeltaNet (Flash Linear Attention) |
| Depth | 24 layers |
| Hidden dim | 896 |
| Heads | 14 |
| Image attention | Every 6th layer (full SDPA + 2D RoPE) |
| Patch size | 1 — one token per latent pixel (256 tokens at 512px, 1024 tokens at 1024px) |
| Text encoder | Gemma 4 2B (`google/gemma-4-E2B-it`) |
| VAE | DC-AE f32c32 (`mit-han-lab/dc-ae-f32c32-sana-1.1-diffusers`) |
| Sampler | STORK-2, 32 steps |
| Dtype | bfloat16 |
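
The token counts in the Patch size row follow directly from the 32× VAE compression. A minimal sketch of that arithmetic; `latent_tokens` is a hypothetical helper, not part of the Boomer code:

```python
def latent_tokens(image_px: int, vae_downsample: int = 32, patch_size: int = 1) -> int:
    """Number of transformer tokens for a square image of side `image_px`."""
    side = image_px // (vae_downsample * patch_size)
    return side * side


for px in (512, 1024):
    n = latent_tokens(px)
    # n * n is the number of token pairs a full-attention layer has to score
    print(px, n, n * n)
# 512 256 65536
# 1024 1024 1048576
```

Going from 512px to 1024px quadruples the token count, so a full-attention layer scores 16× as many token pairs. That is the cost the linear-attention blocks avoid.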

---

## Training details

| Setting | Value |
|---|---|
| Pre-train dataset | JourneyDB (~3.8M images, 512px, patch size 1) |
| Fine-tune dataset | FineT2I (~600k images, 1024px, patch size 1) |
| Optimizer | Fused AdamW |
| Hardware | RTX PRO 6000 Blackwell Server Edition |
| Precision | bfloat16 |

---

## VRAM and RAM requirements

Measured at 1024×1024px, bfloat16, STORK-2, CFG batch=2.

| Component | VRAM |
|---|---|
| DiT weights (EMA, bf16) | 1.25 GB |
| Gemma 4 2B text encoder | 8.62 GB |
| Denoising peak (CFG on) | 1.36 GB |
| VAE decode peak | 3.51 GB |

| Mode | Peak VRAM | Minimum GPU |
|---|---|---|
| **Condition-cache** — pre-encoded embeddings, no text encoder in VRAM | **4.76 GB** | RTX 3060 8GB, T4 |
| **Fresh-prompt** — text encoder + DiT + VAE together | **13.38 GB** | RTX 3090, A100 |
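
The two peak figures are consistent with adding the resident weights to the larger of the two transient peaks (VAE decode). The sketch below reproduces them under that assumption; it is one reading of the tables, not a description of how the measurements were taken, and `peak_vram_gb` is a hypothetical helper.

```python
# Component figures in GB, copied from the component table above.
COMPONENTS_GB = {
    "dit_weights": 1.25,
    "text_encoder": 8.62,
    "denoise_peak": 1.36,
    "vae_decode_peak": 3.51,
}


def peak_vram_gb(with_text_encoder: bool) -> float:
    """Assumed model: resident weights plus the larger transient peak."""
    resident = COMPONENTS_GB["dit_weights"]
    if with_text_encoder:
        resident += COMPONENTS_GB["text_encoder"]
    transient = max(COMPONENTS_GB["denoise_peak"], COMPONENTS_GB["vae_decode_peak"])
    return round(resident + transient, 2)


print(peak_vram_gb(with_text_encoder=False))  # 4.76
print(peak_vram_gb(with_text_encoder=True))   # 13.38
```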

**System RAM**: loading the text encoder (Gemma 4 2B) requires ~9 GB of system RAM, even when it runs on the GPU. In condition-cache mode, encoding can be done on a CPU with ~9 GB of RAM; the generation step then needs only about 5 GB of VRAM (4.76 GB measured).

---

## Usage

### Install

```bash
pip install -U torch diffusers transformers accelerate safetensors torchvision scipy
pip install git+https://github.com/fla-org/flash-linear-attention.git
```

### Generate

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "akrao9/Boomer-T2I",
    custom_pipeline="pipeline_boomer",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("A hyper-detailed, cinematic landscape photography shot of a pristine, mirror-like alpine lake nestled deeply between towering, jagged snow-capped mountain peaks. The scene is captured during the perfect golden hour, with the low-angled warm sun casting deep amber and violet hues across the rugged granite rock faces. In the foreground, vibrant clusters of purple lupines and orange poppies dot a lush emerald meadow that meets the crystal-clear turquoise edge of the water. Wisps of soft, low-hanging morning mist drift lazily across the lake's surface, breaking the perfect reflection of the monumental peaks above. Shot on 35mm lens, ultra-sharp focus, dramatic depth of field, 8k resolution, path-traced lighting textures.")[0]
image.save("output.png")
```

### Note

The transformer weights (1.25 GB) are downloaded from this repo. The VAE and Gemma 4 2B text encoder are fetched automatically from their upstream Hugging Face repos on first use (about 10 GB in total). Accept the [Gemma Terms of Use](https://ai.google.dev/gemma/terms) and run `hf auth login` before first use.

### Parameters

```python
image = pipe(
    "a rocky coastline at sunset with crashing waves",
    steps=32,        # denoising steps — 32 is recommended with STORK-2
    cfg_scale=4.5,   # classifier-free guidance scale (4.0–5.0)
    cfg_rescale=0.5, # reduces over-saturation at high CFG
    seed=42,
)[0]
```
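
For intuition about how `cfg_scale` and `cfg_rescale` interact, here is a pure-Python sketch of the commonly used guidance-rescale formulation. It assumes the pipeline follows that formulation, which this card does not confirm; `cfg_with_rescale` is a hypothetical name, and flat lists stand in for one sample's prediction tensor.

```python
from statistics import pstdev


def cfg_with_rescale(cond, uncond, cfg_scale=4.5, cfg_rescale=0.5):
    """Combine conditional and unconditional predictions (flat lists of floats)."""
    # Standard classifier-free guidance: extrapolate from uncond toward cond.
    guided = [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]
    std_cond, std_guided = pstdev(cond), pstdev(guided)
    if std_guided == 0.0:
        return guided
    # Rescale so the guided prediction has the same spread as the conditional one.
    factor = std_cond / std_guided
    rescaled = [g * factor for g in guided]
    # Blend: cfg_rescale=0 keeps plain CFG, cfg_rescale=1 uses the rescaled output.
    return [cfg_rescale * r + (1.0 - cfg_rescale) * g for r, g in zip(rescaled, guided)]


cond = [1.0, -1.0, 0.5, 2.0]
uncond = [0.5, -0.5, 0.0, 1.0]
plain = cfg_with_rescale(cond, uncond, cfg_rescale=0.0)
blended = cfg_with_rescale(cond, uncond)
print(pstdev(cond), pstdev(plain), pstdev(blended))
```

With these toy inputs the blended output has a smaller spread than plain CFG but a larger one than the conditional prediction, which is the mechanism behind the reduced over-saturation.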

### Low VRAM — condition cache mode

Encode prompts once on any machine (including CPU), save the embedding, then generate without loading the text encoder. Peak VRAM drops from 13.38 GB to 4.76 GB.

```python
# Step 1: encode on any machine (even a CPU with ~9 GB RAM)
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

TE_REPO      = "google/gemma-4-E2B-it"
tokenizer    = AutoProcessor.from_pretrained(TE_REPO)
text_encoder = AutoModelForCausalLM.from_pretrained(
    TE_REPO, torch_dtype=torch.bfloat16
).get_decoder()

tokens = tokenizer(
    "a mountain lake surrounded by alpine peaks",
    max_length=300, padding="max_length",
    truncation=True, return_tensors="pt",
)
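with torch.inference_mode():
    # Last hidden states from the Gemma decoder: (batch, 300, hidden_dim)
    hidden = text_encoder(
        tokens["input_ids"], attention_mask=tokens["attention_mask"]
    )[0]
    # Keep position 0 plus the last 299 positions (300 tokens in total)
    idx  = [0] + list(range(-299, 0))
    emb  = hidden[:, idx]
    mask = tokens["attention_mask"][:, idx]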

torch.save({"emb": emb.cpu(), "mask": mask.cpu()}, "condition.pt")
```
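
The `idx` selection in Step 1 can look opaque. The standalone sketch below resolves the negative indices to show which positions it keeps. `resolve` is a hypothetical helper, and the 512-token case is only an illustration, since Step 1 always pads or truncates to 300 tokens.

```python
def resolve(idx, seq_len):
    """Map possibly negative indices to absolute positions, as tensor indexing does."""
    return [i % seq_len for i in idx]


idx = [0] + list(range(-299, 0))
print(len(idx))  # 300

# With the 300-token padding used in Step 1, every position is kept in order.
print(resolve(idx, 300) == list(range(300)))  # True

# On a longer sequence the same indices keep the first token and the last 299.
print(resolve(idx, 512)[:3])  # [0, 213, 214]
```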

```python
# Step 2: generate on a low-VRAM GPU
# Load only the DiT weights; skip the text encoder by providing embeddings directly
import sys

import torch
from huggingface_hub import snapshot_download

# The pipeline and model code live in the model repo, so add the snapshot to sys.path
local = snapshot_download("akrao9/Boomer-T2I")
sys.path.insert(0, local)
from pipeline_boomer import BoomerPipeline
from modeling_boomer_fla import BoomerFLADiT

transformer = BoomerFLADiT.from_pretrained(local, subfolder="transformer").to("cuda", dtype=torch.bfloat16)
pipe = BoomerPipeline(transformer=transformer).to("cuda")

# Load the embedding and mask saved in Step 1
cond = torch.load("condition.pt")
emb, mask = cond["emb"].to("cuda"), cond["mask"].to("cuda")
# pipe.text_encoder and pipe.vae remain None until __call__;
# override the prompt encoding with the pre-encoded `emb` and `mask` manually
```

---

## Capabilities and limitations

Trained on **JourneyDB** (pretrain, 512px) and **FineT2I** (finetune, 1024px).

- **Strongest at:** dramatic landscapes, natural environments, and scenic/architectural scenes
- **Also works for:** many everyday objects when embedded in a scene (quality varies)
- **Not reliable:** people and portraits (faces and anatomy are inconsistent)
- **Less reliable:** animals, legible text in images, very fine small details
- Landscapes can show a painterly/HDR bias from heavily post-processed stock images in the training set
- Not safety filtered — outputs may reflect biases in the training data
- Maximum tested resolution: **1024×1024px**

---

## Acknowledgements

- **STORK-2** — inference sampling ([Tan et al., 2025](https://arxiv.org/abs/2505.24210))
- **Plateau logit-normal** — training timestep distribution from [FLUX.2 representation comparison](https://bfl.ai/research/representation-comparison) (Black Forest Labs, 2025). Boomer uses μ=0, σ=1 with flow shift 1.5.
- **Sana** — flow shift, logit-normal sampling, forward noising, and related helpers adapted from [NVlabs/Sana](https://github.com/NVlabs/Sana) (`boomer/sana_flow.py`, `boomer/sana_latent_cache.py`); several DiT blocks follow Sana conventions ([Xie et al., 2024](https://arxiv.org/abs/2410.10629))
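
As a rough illustration of the timestep settings quoted above (μ=0, σ=1, flow shift 1.5), the sketch below draws plain logit-normal timesteps and applies the usual shift mapping `t' = s·t / (1 + (s − 1)·t)`. It is an assumption that Boomer composes the two in this order, and the plateau modification from the FLUX.2 write-up is not reproduced here. `flow_shift` and `sample_timestep` are hypothetical names.

```python
import math
import random


def flow_shift(t: float, shift: float = 1.5) -> float:
    """Shift a timestep in (0, 1); shift > 1 moves samples toward larger t."""
    return shift * t / (1.0 + (shift - 1.0) * t)


def sample_timestep(rng: random.Random, mu: float = 0.0, sigma: float = 1.0,
                    shift: float = 1.5) -> float:
    """Logit-normal draw (sigmoid of a Gaussian) followed by the flow shift."""
    z = rng.gauss(mu, sigma)
    t = 1.0 / (1.0 + math.exp(-z))
    return flow_shift(t, shift)


rng = random.Random(0)
samples = [sample_timestep(rng) for _ in range(10_000)]
print(min(samples), max(samples))   # all values stay strictly inside (0, 1)
print(sum(samples) / len(samples))  # mean sits above 0.5 because of the shift
```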

---

## License

The Boomer FLA model weights are released for research and personal use.

Upstream component licenses:
- DC-AE VAE: [mit-han-lab/dc-ae-f32c32-sana-1.1-diffusers](https://huggingface.co/mit-han-lab/dc-ae-f32c32-sana-1.1-diffusers)
- Gemma 4 text encoder: [Gemma Terms of Use](https://ai.google.dev/gemma/terms)