	---
	license: other
	tags:
	- text-to-image
	- diffusion
	- linear-attention
	- pytorch
	- safetensors
	language:
	- en
	pipeline_tag: text-to-image
	---

	# Boomer FLA

	Boomer FLA is a 657M parameter text-to-image diffusion model that generates 1024×1024px images from text prompts.

	Instead of standard quadratic self-attention, it uses GatedDeltaNet, a bidirectional Flash Linear Attention mixer, as the backbone of its transformer blocks. This keeps memory flat regardless of sequence length. Every 6th block adds a full SDPA (scaled dot-product attention) layer for global spatial coherence.

	Text conditioning uses Gemma 4 2B (1536-dim embeddings, up to 300 tokens). Images are encoded and decoded with the DC-AE f32c32 VAE, whose 32× spatial compression maps a 1024px image to a 32×32 latent. Inference uses STORK-2 flow sampling ([Tan et al., 2025](https://arxiv.org/abs/2505.24210)).

	Pre-trained on about 3.8 million JourneyDB latents at 512px, then fine-tuned on about 600k high-resolution latents from maxu/Fine-T2i at 1024px.
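
The latent and token counts follow directly from the 32× compression. Below is a minimal sketch of that arithmetic, assuming square images and one token per latent pixel. The function name is illustrative and not part of the pipeline.

```python
def latent_tokens(image_px: int, compression: int = 32, patch_size: int = 1) -> tuple[int, int]:
    """Return (latent side length, number of DiT tokens) for a square image."""
    if image_px % (compression * patch_size) != 0:
        raise ValueError("image size must be a multiple of compression * patch_size")
    side = image_px // compression          # latent height and width
    tokens = (side // patch_size) ** 2      # one token per patch of latent pixels
    return side, tokens


for px in (512, 1024):
    side, tokens = latent_tokens(px)
    print(f"{px}px -> {side}x{side} latent, {tokens} tokens")
```

At 1024px this gives a 32×32 latent and 1024 tokens, so the sequence length stays modest even at the highest supported resolution.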

	---

	## Sample Outputs

	![Boomer-T2I Portfolio Landscape Grid](https://cdn-uploads.huggingface.co/production/uploads/68a39870573ae211ac576be6/fN3TCPf5lGBFCq_thgYkw.png)

	1. Misty Pine Forest: A cinematic, wide-angle shot of a misty pine forest at sunrise, deep green valleys, soft morning light piercing through fog, photorealistic, 8k resolution.
	2. Black Sand Beach: A dramatic black sand beach in Iceland, towering basalt columns, massive white waves crashing on the shore, moody overcast sky, high detail landscape architecture.
	3. Alpine Lake: A serene alpine lake reflecting jagged snow-capped mountain peaks, crystal clear turquoise water, vibrant wildflower meadows in the foreground, golden hour lighting.
	4. Tuscan Hills: A sweeping view of rolling terracotta hills in Tuscany, isolated cypress trees lining a dirt road, warm late afternoon sun casting long shadows, classic landscape photography.
	5. Desert Oasis: A futuristic desert oasis at dusk, neon-infused crystalline structures contrasting with deep orange sand dunes, hyper-detailed, sprawling cinematic atmosphere.
	6. Hidden Lagoon: A majestic waterfall cascading down a sheer mossy cliff into a hidden tropical lagoon, lush emerald foliage, sunbeams cutting through the canopy, long exposure water effect.

	## Architecture


	![image](https://cdn-uploads.huggingface.co/production/uploads/68a39870573ae211ac576be6/9svsLkAt5aV9iWYD25xS_.png)

	---

	| Property | Value |
	|---|---|
	| Parameters | 657M |
	| Backbone | Bidirectional GatedDeltaNet (Flash Linear Attention) |
	| Depth | 24 layers |
	| Hidden dim | 896 |
	| Heads | 14 |
	| Image attention | Every 6th layer (full SDPA + 2D RoPE) |
	| Patch size | 1 (one token per latent pixel: 256 tokens at 512px, 1024 tokens at 1024px) |
	| Text encoder | Gemma 4 2B (`google/gemma-4-E2B-it`) |
	| VAE | DC-AE f32c32 (`mit-han-lab/dc-ae-f32c32-sana-1.1-diffusers`) |
	| Sampler | STORK-2, 32 steps |
	| Dtype | bfloat16 |
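
To make the layer layout concrete, here is a small sketch that derives the block schedule and per-head width from the table. It assumes "every 6th layer" means layers 6, 12, 18, and 24 counted from 1. The actual offset is defined by the model code, and the names below are hypothetical.

```python
DEPTH = 24        # transformer blocks
HIDDEN_DIM = 896
HEADS = 14
SDPA_EVERY = 6    # every 6th block also carries full attention


def sdpa_blocks(depth: int = DEPTH, every: int = SDPA_EVERY) -> list[int]:
    """1-based indices of blocks that include full SDPA (assumed offset)."""
    return [i for i in range(1, depth + 1) if i % every == 0]


head_dim = HIDDEN_DIM // HEADS
print("SDPA blocks:", sdpa_blocks())
print("head dim:", head_dim)
```

Under that assumption only 4 of the 24 blocks pay the quadratic attention cost, and each head works on 64 channels.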

	---

	## Training details

	| Setting | Value |
	|---|---|
	| Pre-train dataset | JourneyDB (~3.8M images, 512px, patch size 1) |
	| Fine-tune dataset | FineT2I (~600k images, 1024px, patch size 1) |
	| Optimizer | Fused AdamW |
	| Hardware | NVIDIA RTX PRO 6000 Blackwell Server Edition |
	| Precision | bfloat16 |

	---

	## VRAM and RAM requirements

	Measured at 1024×1024px, bfloat16, STORK-2, CFG batch=2.

	| Component | VRAM |
	|---|---|
	| DiT weights (EMA, bf16) | 1.25 GB |
	| Gemma 4 2B text encoder | 8.62 GB |
	| Denoising peak (CFG on) | 1.36 GB |
	| VAE decode peak | 3.51 GB |

	| Mode | Peak VRAM | Minimum GPU |
	|---|---|---|
	| Condition-cache: pre-encoded embeddings, no text encoder in VRAM | 4.76 GB | RTX 3060 8GB, T4 |
	| Fresh-prompt: text encoder + DiT + VAE together | 13.38 GB | RTX 3090, A100 |

	System RAM: loading the text encoder (Gemma 4 2B) requires ~9 GB of system RAM even when using a GPU. In condition-cache mode, encoding can be done on CPU with ~9 GB of RAM; the generation step then needs only about 5 GB of VRAM (4.76 GB peak).
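
The peak figures above are consistent with adding the resident weights to the largest transient peak, which is the VAE decode. Here is a sketch of that bookkeeping, assuming the components simply add. The dictionary keys and function name are illustrative.

```python
# Component figures from the table above, in GB.
VRAM_GB = {
    "dit_weights": 1.25,
    "text_encoder": 8.62,
    "denoise_peak": 1.36,
    "vae_decode_peak": 3.51,
}


def peak_vram(text_encoder_on_gpu: bool) -> float:
    """Estimate peak VRAM as resident weights plus the largest transient peak."""
    resident = VRAM_GB["dit_weights"]
    if text_encoder_on_gpu:
        resident += VRAM_GB["text_encoder"]
    transient = max(VRAM_GB["denoise_peak"], VRAM_GB["vae_decode_peak"])
    return round(resident + transient, 2)


print("condition-cache:", peak_vram(False))  # DiT + VAE decode peak
print("fresh-prompt:", peak_vram(True))      # text encoder + DiT + VAE decode peak
```

The text encoder accounts for most of the difference between the two modes, which is why caching embeddings is the main lever on small GPUs.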

	---

	## Usage

	### Install

	```bash
	pip install -U torch diffusers transformers accelerate safetensors torchvision scipy
	pip install git+https://github.com/fla-org/flash-linear-attention.git
	```
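
After installing, a quick import check can catch a missing package before the first multi-gigabyte download. The sketch below is not part of this repo. It assumes the import names match the pip package names and that flash-linear-attention is imported as `fla`.

```python
from importlib.util import find_spec

# Import names assumed for the packages installed above
# ("fla" is the assumed module name for flash-linear-attention).
REQUIRED = [
    "torch", "diffusers", "transformers", "accelerate",
    "safetensors", "torchvision", "scipy", "fla",
]


def missing_packages(names=REQUIRED):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("All packages found." if not missing else f"Missing: {missing}")
```

`find_spec` only locates the packages and does not import them, so the check runs quickly even on a machine without a GPU.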

	### Generate

	```python
	import torch
	from diffusers import DiffusionPipeline

	pipe = DiffusionPipeline.from_pretrained(
	    "akrao9/Boomer-T2I",
	    custom_pipeline="pipeline_boomer",
	    trust_remote_code=True,
	    torch_dtype=torch.bfloat16,
	).to("cuda")

	image = pipe("A hyper-detailed, cinematic landscape photography shot of a pristine, mirror-like alpine lake nestled deeply between towering, jagged snow-capped mountain peaks. The scene is captured during the perfect golden hour, with the low-angled warm sun casting deep amber and violet hues across the rugged granite rock faces. In the foreground, vibrant clusters of purple lupines and orange poppies dot a lush emerald meadow that meets the crystal-clear turquoise edge of the water. Wisps of soft, low-hanging morning mist drift lazily across the lake's surface, breaking the perfect reflection of the monumental peaks above. Shot on 35mm lens, ultra-sharp focus, dramatic depth of field, 8k resolution, path-traced lighting textures.")[0]
	image.save("output.png")
	```

	### Note

	The transformer weights (1.25 GB) are downloaded from this repo. The VAE and Gemma 4 2B text encoder are fetched automatically from their upstream Hugging Face repos on first use (about 10 GB total). Accept the [Gemma Terms of Use](https://ai.google.dev/gemma/terms) and run `hf auth login` before first use.

	### Parameters

	```python
	image = pipe(
	    "a rocky coastline at sunset with crashing waves",
	    steps=32,         # denoising steps; 32 is recommended with STORK-2
	    cfg_scale=4.5,    # classifier-free guidance scale (4.0–5.0)
	    cfg_rescale=0.5,  # reduces over-saturation at high CFG
	    seed=42,
	)[0]
	```
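
For intuition on how `cfg_scale` and `cfg_rescale` interact, the sketch below shows the common guidance-rescale formulation on plain Python lists. Whether `pipeline_boomer` uses exactly this formula is an assumption. The function name and the list-based representation are illustrative, and a real pipeline would operate on tensors.

```python
from statistics import pstdev


def guided_prediction(cond, uncond, cfg_scale=4.5, cfg_rescale=0.5):
    """Classifier-free guidance followed by std-based rescaling.

    cond / uncond are flat lists of floats standing in for the model's
    conditional and unconditional predictions.
    """
    # Standard CFG: push the prediction away from the unconditional one.
    guided = [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]
    if cfg_rescale == 0.0:
        return guided
    # Rescale so the guided prediction has the same spread as the conditional one.
    std_guided = pstdev(guided)
    factor = pstdev(cond) / std_guided if std_guided > 0 else 1.0
    rescaled = [g * factor for g in guided]
    # Blend the rescaled and raw guided predictions.
    return [cfg_rescale * r + (1.0 - cfg_rescale) * g for r, g in zip(rescaled, guided)]


print(guided_prediction([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0]))
```

With `cfg_rescale=0` this reduces to plain classifier-free guidance. Larger values pull the output's spread back toward that of the conditional prediction, which is the usual explanation for reduced over-saturation.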

	### Low VRAM — condition cache mode

	Encode prompts once on any machine (including CPU), save the embeddings, then generate without the text encoder in VRAM (only the 1.25 GB DiT and the VAE are needed). Peak VRAM drops from 13.38 GB to 4.76 GB.

	```python
	# Step 1: encode on any machine (even CPU with ~9 GB RAM)
	import torch
	from transformers import AutoModelForCausalLM, AutoProcessor

	TE_REPO = "google/gemma-4-E2B-it"
	tokenizer = AutoProcessor.from_pretrained(TE_REPO)
	# Keep only the decoder stack; the LM head is not needed for embeddings.
	text_encoder = AutoModelForCausalLM.from_pretrained(
	    TE_REPO, torch_dtype=torch.bfloat16
	).get_decoder()

	# Pad or truncate every prompt to exactly 300 tokens.
	tokens = tokenizer(
	    "a mountain lake surrounded by alpine peaks",
	    max_length=300, padding="max_length",
	    truncation=True, return_tensors="pt",
	)
	with torch.inference_mode():
	    hidden = text_encoder(
	        tokens["input_ids"], attention_mask=tokens["attention_mask"]
	    )[0]
	# Keep the first token plus the last 299 positions (300 in total).
	idx = [0] + list(range(-299, 0))
	emb = hidden[:, idx]
	mask = tokens["attention_mask"][:, idx]

	torch.save({"emb": emb.cpu(), "mask": mask.cpu()}, "condition.pt")
	```

	```python
	# Step 2: generate on a low-VRAM GPU
	# Load only the DiT weights; skip the text encoder by providing embeddings directly
	import sys

	import torch
	from huggingface_hub import snapshot_download

	local = snapshot_download("akrao9/Boomer-T2I")
	sys.path.insert(0, local)  # make the repo's custom modules importable
	from pipeline_boomer import BoomerPipeline
	from modeling_boomer_fla import BoomerFLADiT

	transformer = BoomerFLADiT.from_pretrained(local, subfolder="transformer").to("cuda", dtype=torch.bfloat16)
	pipe = BoomerPipeline(transformer=transformer).to("cuda")
	cond = torch.load("condition.pt")  # {"emb": ..., "mask": ...} saved in step 1
	# pipe.text_encoder and pipe.vae remain None until __call__; override with the pre-encoded emb manually
	```

	---

	## Capabilities and limitations

	Trained on JourneyDB (pretrain, 512px) and FineT2I (finetune, 1024px).

	- Strongest at — dramatic landscapes, natural environments, and scenic/architectural scenes
	- Also works for — many everyday objects when embedded in a scene (quality varies)
	- Not reliable — humans, portraits, and people (faces and anatomy are inconsistent)
	- Less reliable — animals, legible text in images, very fine small details
	- Landscapes can show a painterly/HDR bias from heavily post-processed stock images in the training set
	- Not safety filtered — outputs may reflect biases in the training data
	- Maximum tested resolution: 1024×1024px

	---

	## Acknowledgements

	- STORK-2 — inference sampling ([Tan et al., 2025](https://arxiv.org/abs/2505.24210))
	- Plateau logit-normal — training timestep distribution from [FLUX.2 representation comparison](https://bfl.ai/research/representation-comparison) (Black Forest Labs, 2025). Boomer uses μ=0, σ=1 with flow shift 1.5.
	- Sana — flow shift, logit-normal sampling, forward noising, and related helpers adapted from [NVlabs/Sana](https://github.com/NVlabs/Sana) (`boomer/sana_flow.py`, `boomer/sana_latent_cache.py`); several DiT blocks follow Sana conventions ([Xie et al., 2024](https://arxiv.org/abs/2410.10629))
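
As a rough illustration of the timestep settings quoted above, the sketch below draws logit-normal timesteps with μ=0, σ=1 and applies a flow shift of 1.5, using the shift formula common to flow-matching codebases. It does not reproduce the plateau modification and is not taken from the Boomer training code. The function names are hypothetical.

```python
import math
import random


def logit_normal_timestep(rng: random.Random, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Sample t in (0, 1) as sigmoid(z) with z ~ Normal(mu, sigma)."""
    z = rng.gauss(mu, sigma)
    return 1.0 / (1.0 + math.exp(-z))


def apply_flow_shift(t: float, shift: float = 1.5) -> float:
    """Shift t toward larger values: shift * t / (1 + (shift - 1) * t)."""
    return shift * t / (1.0 + (shift - 1.0) * t)


rng = random.Random(0)
samples = [apply_flow_shift(logit_normal_timestep(rng)) for _ in range(5)]
print([round(t, 3) for t in samples])
```

The shift leaves the endpoints 0 and 1 fixed and moves the midpoint from 0.5 to 0.6, so more of the sampled timesteps land in the upper part of the range.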

	---

	## License

	The Boomer FLA model weights are released for research and personal use.

	Upstream component licenses:
	- DC-AE VAE: [mit-han-lab/dc-ae-f32c32-sana-1.1-diffusers](https://huggingface.co/mit-han-lab/dc-ae-f32c32-sana-1.1-diffusers)
	- Gemma 4 text encoder: [Gemma Terms of Use](https://ai.google.dev/gemma/terms)