DreamLite

DreamLite is a text-to-image and image-editing model from ByteDance. It pairs a custom 2D U-Net (DreamLiteUNetModel) with the Qwen3-VL multimodal encoder as its prompt / image-instruction encoder, and uses an AutoencoderTiny (TAESD-style) VAE for fast latent encode/decode.

Two pipelines are exposed:

| Pipeline | Modes | CFG | Use case |
|---|---|---|---|
| DreamLitePipeline | text-to-image and image editing (auto-selected by whether `image` is `None`) | 3-branch dual CFG (`guidance_scale` on the text branch, `image_guidance_scale` on the image branch, à la InstructPix2Pix) | Highest quality |
| DreamLiteMobilePipeline | text-to-image and image editing (auto-selected by whether `image` is `None`) | None (distilled; single UNet forward per step) | On-device / low-latency |

Official checkpoints:

Base model: carlofkl/DreamLite-base
Distilled mobile model: carlofkl/DreamLite-mobile

Both pipelines auto-detect text-to-image vs. image-editing mode from whether the image argument is provided. There is no separate Img2Img class.

When loading an input image for editing, prefer diffusers.utils.load_image(...) over raw PIL.Image.open(...). load_image enforces an RGB conversion and applies EXIF orientation, both of which the pipeline assumes. A plain Image.open of an RGBA / palette / EXIF-rotated source will silently produce a different latent conditioning and degrade output quality.

Text-to-image (Base)

import torch
from diffusers import DreamLitePipeline

pipe = DreamLitePipeline.from_pretrained("carlofkl/DreamLite-base", revision="diffusers", torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

image = pipe(
    prompt="a dog running on the grass",
    negative_prompt="",
    height=1024,
    width=1024,
    num_inference_steps=28,
    generator=torch.Generator("cpu").manual_seed(42),
).images[0]
image.save("dreamlite_t2i.png")

Image editing (Base)

Pass an image to enter edit mode. Both guidance_scale (text branch) and image_guidance_scale (image branch) are active here.

import torch
from diffusers import DreamLitePipeline
from diffusers.utils import load_image

pipe = DreamLitePipeline.from_pretrained("carlofkl/DreamLite-base", revision="diffusers", torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

source = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")

image = pipe(
    prompt="turn the cat into a corgi",
    image=source,
    height=1024,
    width=1024,
    num_inference_steps=28,
    generator=torch.Generator("cpu").manual_seed(42),
).images[0]
image.save("dreamlite_edit.png")
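
How the two scales interact in edit mode can be sketched with the InstructPix2Pix combination rule, which the overview table names as the model for the 3-branch CFG. This is an illustration only: the function name `combine_dual_cfg`, the plain-list stand-ins for tensors, and the assumption that DreamLite applies exactly this formula are not taken from the pipeline source.

```python
from typing import List


def combine_dual_cfg(
    uncond: List[float],
    image_cond: List[float],
    full_cond: List[float],
    guidance_scale: float = 3.5,
    image_guidance_scale: float = 1.5,
) -> List[float]:
    """InstructPix2Pix-style combination of three UNet predictions.

    uncond     : prediction with neither text nor image conditioning
    image_cond : prediction with image conditioning only
    full_cond  : prediction with both text and image conditioning
    Assumption: DreamLite's 3-branch CFG follows this rule.
    """
    return [
        u + image_guidance_scale * (i - u) + guidance_scale * (f - i)
        for u, i, f in zip(uncond, image_cond, full_cond)
    ]


# Toy per-element predictions standing in for UNet outputs.
uncond, image_cond, full_cond = [0.0, 1.0], [0.5, 1.5], [1.0, 3.0]

# With both scales at 1.0 the terms telescope to the fully conditioned
# prediction, so guidance is effectively switched off.
neutral = combine_dual_cfg(uncond, image_cond, full_cond, 1.0, 1.0)

# Defaults from the documented __call__ signature (3.5 and 1.5).
guided = combine_dual_cfg(uncond, image_cond, full_cond)
```

Under this rule, raising `image_guidance_scale` pulls the result toward the source image, while raising `guidance_scale` pushes it toward the text instruction.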

Text-to-image (Mobile)

The mobile pipeline is distilled and skips CFG entirely — a single UNet forward per step. It accepts the same prompt / height / width / num_inference_steps arguments, but ignores guidance_scale and image_guidance_scale if passed (a warning is logged).

import torch
from diffusers import DreamLiteMobilePipeline

pipe = DreamLiteMobilePipeline.from_pretrained("carlofkl/DreamLite-mobile", revision="diffusers", torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

image = pipe(
    prompt="a dog running on the grass",
    height=1024,
    width=1024,
    num_inference_steps=4,
    generator=torch.Generator("cpu").manual_seed(42),
).images[0]
image.save("dreamlite_mobile_t2i.png")

Image editing (Mobile)

import torch
from diffusers import DreamLiteMobilePipeline
from diffusers.utils import load_image

pipe = DreamLiteMobilePipeline.from_pretrained("carlofkl/DreamLite-mobile", revision="diffusers", torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

source = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")

image = pipe(
    prompt="turn the cat into a corgi",
    image=source,
    height=1024,
    width=1024,
    num_inference_steps=4,
    generator=torch.Generator("cpu").manual_seed(42),
).images[0]
image.save("dreamlite_mobile_edit.png")

Notes and limitations

Both pipelines force batch_size = 1 internally; num_images_per_prompt controls how many samples are drawn from the same prompt rather than parallel batching.
The prompt encoder is Qwen3-VL, which is a multimodal model. Loading the full pipeline therefore requires sufficient GPU memory for both the U-Net and the Qwen3-VL text encoder (~4 GB + ~0.7 GB in bf16 for the base release).
The VAE is AutoencoderTiny and exposes encoder_block_out_channels; vae_scale_factor is derived from it at pipeline init time.
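
The relationship between encoder_block_out_channels, vae_scale_factor, and the 1024-pixel default resolution can be sketched as follows. The formula `2 ** (len(channels) - 1)` mirrors how other diffusers pipelines derive the factor from a VAE's block list and is an assumption here, as are the four-stage channel tuple and `default_sample_size = 128`; only the resulting 1024 comes from the parameter documentation.

```python
from typing import Sequence


def derive_vae_scale_factor(encoder_block_out_channels: Sequence[int]) -> int:
    # Assumption: every encoder stage after the first halves the spatial size.
    return 2 ** (len(encoder_block_out_channels) - 1)


def default_resolution(default_sample_size: int, vae_scale_factor: int) -> int:
    # Documented default: default_sample_size * vae_scale_factor.
    return default_sample_size * vae_scale_factor


# Hypothetical config values chosen so the product matches the documented 1024.
hypothetical_channels = (64, 64, 64, 64)
scale = derive_vae_scale_factor(hypothetical_channels)
resolution = default_resolution(128, scale)

# Side length of the latent grid that the UNet would see at that resolution.
latent_side = resolution // scale
```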

DreamLitePipeline

diffusers.DreamLitePipeline

DreamLite pipeline for text-to-image and instruction-based image editing.

The same pipeline supports both modes; the operating mode is auto-detected from the inputs:

image is None -> text-to-image (single CFG on text).
image is not None -> image-to-image / instruction edit (dual CFG: text + image).
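
The mode selection above amounts to a single `None` check. The sketch below is hypothetical (the helper names are not part of the pipeline) and adds an assumed count of UNet forwards per step: one for the distilled pipeline, two for single CFG on text (conditional plus unconditional), and three for the dual-CFG edit mode.

```python
from typing import Optional


def select_mode(image: Optional[object]) -> str:
    # Mirrors the documented rule: no image means text-to-image.
    return "text-to-image" if image is None else "edit"


def forwards_per_step(image: Optional[object], distilled: bool = False) -> int:
    # Assumption: branch counts follow standard CFG bookkeeping.
    if distilled:
        return 1  # mobile pipeline: CFG distilled away
    return 2 if image is None else 3


t2i_mode = select_mode(None)
edit_mode = select_mode(object())  # any non-None value stands in for a PIL image
```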

Components:

text_encoder (Qwen3VLForConditionalGeneration): Multimodal text/vision encoder used to produce conditioning embeddings.
tokenizer (AutoTokenizer): Tokenizer for text-only (generate) mode.
processor (Qwen3VLProcessor): Multimodal processor for edit mode (text + image template).
vae (AutoencoderTiny): Mobile-friendly tiny VAE for latent encode/decode.
unet (DreamLiteUNetModel): DreamLite UNet (GQA + qk_norm + depthwise-separable convs).
scheduler (FlowMatchEulerDiscreteScheduler): Flow-matching Euler scheduler with dynamic shift.

Note: batch_size is currently forced to 1; num_images_per_prompt is supported.

diffusers.DreamLitePipeline.__call__(
    prompt: Optional[str] = None,
    negative_prompt: Optional[str] = None,
    image: Optional[PIL.Image.Image] = None,
    height: Optional[int] = None,
    width: Optional[int] = None,
    guidance_scale: float = 3.5,
    image_guidance_scale: float = 1.5,
    num_inference_steps: int = 30,
    sigmas: Optional[List[float]] = None,
    num_images_per_prompt: Optional[int] = 1,
    generator: Union[torch.Generator, List[torch.Generator], None] = None,
    output_type: Optional[str] = "pil",
    return_dict: bool = True,
    max_sequence_length: int = 200,
    text_pad_embedding: Optional[torch.Tensor] = None,
)

Source: https://github.com/huggingface/diffusers/blob/vr_13751/src/diffusers/pipelines/dreamlite/pipeline_dreamlite.py#L388

Run the DreamLite pipeline.

Parameters:

prompt : Text prompt.

negative_prompt : Negative text prompt (defaults to empty string).

image : Optional input image. If provided, the pipeline runs in edit / image-to-image mode with dual classifier-free guidance; otherwise it runs in text-to-image mode.

height : Output resolution (height). Defaults to default_sample_size * vae_scale_factor (1024). The same default applies in both T2I and I2I; pass an explicit value to override.

width : Output resolution (width). Defaults to default_sample_size * vae_scale_factor (1024). The same default applies in both T2I and I2I; pass an explicit value to override.

guidance_scale : CFG scale on the text branch (both modes).

image_guidance_scale : Additional CFG scale on the image branch (edit mode only).

num_inference_steps : Number of denoising steps.

sigmas : Optional explicit FlowMatch sigmas; defaults to a uniform linspace.

num_images_per_prompt : Output images per prompt (note: batch_size is forced to 1).

generator : Random generator(s).

output_type : "pil", "np", "pt" or "latent".

return_dict : If True, returns a DreamLitePipelineOutput; else a tuple (images,).

max_sequence_length : Maximum number of user-prompt tokens kept after dropping the chat-template prefix. Only applies to generate mode (the edit mode uses the multimodal processor's native padding).

text_pad_embedding : Optional learned pad embedding for masked positions.

Returns:

DreamLitePipelineOutput or tuple.
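
The truncation described for max_sequence_length can be pictured as below. This is a hypothetical sketch: the helper name `keep_user_tokens`, the integer token ids, and the `prefix_len` argument are illustrative and do not reflect the pipeline's actual internals or the real length of the chat-template prefix.

```python
from typing import List


def keep_user_tokens(
    token_ids: List[int],
    prefix_len: int,
    max_sequence_length: int = 200,
) -> List[int]:
    # Drop the chat-template prefix, then cap the remaining user-prompt tokens.
    user_tokens = token_ids[prefix_len:]
    return user_tokens[:max_sequence_length]


# Ten fake token ids; the first three play the role of the template prefix.
ids = list(range(10))
kept = keep_user_tokens(ids, prefix_len=3, max_sequence_length=5)
```

Per the parameter description, this cap applies only in generate (text-only) mode; edit mode relies on the multimodal processor's own padding.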

DreamLiteMobilePipeline

diffusers.DreamLiteMobilePipeline

DreamLite Mobile pipeline: a distilled, classifier-free-guidance-free variant of DreamLitePipeline for fast few-step inference (default 4 steps).

The operating mode is auto-detected from inputs (same as the base pipeline):

image is None -> text-to-image.
image is not None -> image-to-image / instruction edit.

Because classifier-free guidance is distilled away, guidance_scale and image_guidance_scale are accepted for API parity with DreamLitePipeline but are ignored in the denoising loop. negative_prompt is intentionally absent.
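
The "accepted but ignored" behaviour can be sketched with the standard logging module. The function name `ignore_guidance_args` and the logger name are hypothetical; the sketch only illustrates the documented contract (arguments accepted, a warning logged, no effect on denoising) and is not the pipeline's code.

```python
import logging
from typing import List, Optional

logger = logging.getLogger("dreamlite_mobile_sketch")  # hypothetical logger name


def ignore_guidance_args(
    guidance_scale: Optional[float] = None,
    image_guidance_scale: Optional[float] = None,
) -> List[str]:
    """Warn about guidance arguments that were passed; return their names."""
    ignored = []
    for name, value in (
        ("guidance_scale", guidance_scale),
        ("image_guidance_scale", image_guidance_scale),
    ):
        if value is not None:
            # The value is never used: CFG was distilled away.
            logger.warning("%s=%s is ignored by the distilled pipeline.", name, value)
            ignored.append(name)
    return ignored


nothing_ignored = ignore_guidance_args()
both_ignored = ignore_guidance_args(guidance_scale=3.5, image_guidance_scale=1.5)
```

Leaving both arguments at their `None` defaults keeps the call silent, which is the simplest way to share calling code between the base and mobile pipelines.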

Components (identical to the base pipeline):

text_encoder (Qwen3VLForConditionalGeneration): Multimodal text/vision encoder.
tokenizer (AutoTokenizer): Tokenizer for text-only (generate) mode.
processor (Qwen3VLProcessor): Multimodal processor for edit mode.
vae (AutoencoderTiny): Mobile-friendly tiny VAE.
unet (DreamLiteUNetModel): DreamLite UNet.
scheduler (FlowMatchEulerDiscreteScheduler): Flow-matching Euler scheduler with dynamic shift.

Note: batch_size is currently forced to 1; num_images_per_prompt is supported.

diffusers.DreamLiteMobilePipeline.__call__(
    prompt: Union[str, List[str]] = None,
    image: Optional[PIL.Image.Image] = None,
    height: Optional[int] = None,
    width: Optional[int] = None,
    num_inference_steps: int = 4,
    guidance_scale: Optional[float] = None,
    image_guidance_scale: Optional[float] = None,
    sigmas: Optional[List[float]] = None,
    num_images_per_prompt: Optional[int] = 1,
    generator: Union[torch.Generator, List[torch.Generator], None] = None,
    output_type: Optional[str] = "pil",
    return_dict: bool = True,
    max_sequence_length: int = 200,
    text_pad_embedding: Optional[torch.Tensor] = None,
)

Source: https://github.com/huggingface/diffusers/blob/vr_13751/src/diffusers/pipelines/dreamlite/pipeline_dreamlite_mobile.py#L384

Run the distilled DreamLite Mobile pipeline.

Parameters:

prompt : Text prompt.

image : Optional input image. If provided, runs in edit / image-to-image mode; otherwise runs in text-to-image mode.

height : Output resolution (height). Defaults to default_sample_size * vae_scale_factor (1024).

width : Output resolution (width). Defaults to default_sample_size * vae_scale_factor (1024).

num_inference_steps : Number of denoising steps. Defaults to 4 (distilled).

guidance_scale : Accepted for API parity with DreamLitePipeline; ignored because CFG was distilled away.

image_guidance_scale : Accepted for API parity with DreamLitePipeline; ignored because CFG was distilled away.

sigmas : Optional explicit FlowMatch sigmas; defaults to a uniform linspace.

num_images_per_prompt : Output images per prompt (note: batch_size is forced to 1).

generator : Random generator(s).

output_type : "pil", "np", "pt" or "latent".

return_dict : If True, returns a DreamLitePipelineOutput; else a tuple (images,).

max_sequence_length : Maximum number of user-prompt tokens kept after dropping the chat-template prefix. Only applies to generate mode (the edit mode uses the multimodal processor's native padding).

text_pad_embedding : Optional learned pad embedding for masked positions.

Returns:

DreamLitePipelineOutput or tuple.
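
As an illustration of the "uniform linspace" default for sigmas, the sketch below builds an evenly spaced schedule in pure Python. The endpoints (1.0 down to 1 / num_inference_steps) follow the convention of other flow-matching pipelines in diffusers and are an assumption here; the scheduler's dynamic shift is not modelled, and `uniform_sigmas` is a hypothetical helper name.

```python
from typing import List


def uniform_sigmas(num_inference_steps: int) -> List[float]:
    """Evenly spaced sigmas from 1.0 down to 1 / num_inference_steps."""
    if num_inference_steps < 1:
        raise ValueError("num_inference_steps must be at least 1")
    start, stop = 1.0, 1.0 / num_inference_steps
    if num_inference_steps == 1:
        return [start]
    step = (stop - start) / (num_inference_steps - 1)
    return [start + k * step for k in range(num_inference_steps)]


# Schedule for the distilled default of 4 steps.
mobile_sigmas = uniform_sigmas(4)  # [1.0, 0.75, 0.5, 0.25]
```

A list like this could be passed through the `sigmas` argument to override the default spacing; whether a hand-written schedule suits the distilled checkpoint is something to verify empirically.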

DreamLitePipelineOutput

diffusers.DreamLitePipelineOutput

Output class for DreamLite pipelines.

Parameters:

images (List[PIL.Image.Image] or np.ndarray) : The denoised images produced by the pipeline, either as a list of PIL images of length batch_size or as a NumPy array of shape (batch_size, height, width, num_channels).
