	# DreamLite

	DreamLite is a text-to-image and image-editing model from ByteDance. It pairs a custom 2D U-Net
	(`DreamLiteUNetModel`) with the `Qwen3-VL` multimodal encoder as its prompt / image-instruction encoder,
	and uses an `AutoencoderTiny` (TAESD-style) VAE for fast latent encode/decode.

	Two pipelines are exposed:

	| Pipeline | Modes | CFG | Use case |
	|---|---|---|---|
	| [DreamLitePipeline](/docs/diffusers/pr_13751/en/api/pipelines/dreamlite#diffusers.DreamLitePipeline) | text-to-image and image editing (auto-selected by whether `image` is `None`) | 3-branch dual CFG (`guidance_scale` on the text branch, `image_guidance_scale` on the image branch, à la InstructPix2Pix) | Highest quality |
	| [DreamLiteMobilePipeline](/docs/diffusers/pr_13751/en/api/pipelines/dreamlite#diffusers.DreamLiteMobilePipeline) | text-to-image and image editing (auto-selected by whether `image` is `None`) | None (distilled; a single UNet forward per step) | On-device / low-latency |
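
To make the latency difference in the CFG column concrete, the sketch below counts denoiser branch evaluations per image. It assumes one evaluation per guidance branch: two for text-only CFG, three for the dual-CFG edit mode, and one for the distilled model. The names `BRANCHES` and `unet_evaluations` are illustrative and not part of diffusers, and a real implementation may batch the branches into a single forward call.

```python
# Illustrative only: none of these names exist in diffusers.
# Assumed branch counts per denoising step:
#   base, text-to-image -> 2 (conditional + unconditional)
#   base, image editing -> 3 (text+image, image-only, unconditional)
#   mobile, either mode -> 1 (CFG distilled away)
BRANCHES = {
    ("base", "generate"): 2,
    ("base", "edit"): 3,
    ("mobile", "generate"): 1,
    ("mobile", "edit"): 1,
}


def unet_evaluations(variant: str, mode: str, num_inference_steps: int) -> int:
    """Total denoiser branch evaluations needed for one image."""
    return BRANCHES[(variant, mode)] * num_inference_steps


if __name__ == "__main__":
    # Step counts taken from the usage examples on this page (28 and 4).
    print(unet_evaluations("base", "edit", 28))   # 3 branches x 28 steps
    print(unet_evaluations("mobile", "edit", 4))  # 1 branch x 4 steps
```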

	Official checkpoints:

	* Base model: [carlofkl/DreamLite-base](https://huggingface.co/carlofkl/DreamLite-base)
	* Distilled mobile model: [carlofkl/DreamLite-mobile](https://huggingface.co/carlofkl/DreamLite-mobile)

	> [!TIP]
	> Both pipelines auto-detect text-to-image vs. image-editing mode from whether the `image` argument is
	> provided. There is no separate `Img2Img` class.

	> [!TIP]
	> When loading an input image for editing, prefer `diffusers.utils.load_image(...)` over raw `PIL.Image.open(...)`.
	> `load_image` enforces an RGB conversion and applies EXIF orientation, both of which the pipeline assumes.
	> A plain `Image.open` of an RGBA / palette / EXIF-rotated source will silently produce a different latent
	> conditioning and degrade output quality.

	## Text-to-image (Base)

	```python
	import torch
	from diffusers import DreamLitePipeline

	pipe = DreamLitePipeline.from_pretrained("carlofkl/DreamLite-base", revision="diffusers", torch_dtype=torch.bfloat16)
	pipe = pipe.to("cuda")

	image = pipe(
	    prompt="a dog running on the grass",
	    negative_prompt="",
	    height=1024,
	    width=1024,
	    num_inference_steps=28,
	    generator=torch.Generator("cpu").manual_seed(42),
	).images[0]
	image.save("dreamlite_t2i.png")
	```

	## Image editing (Base)

	Pass an `image` to enter edit mode. Both `guidance_scale` (text branch) and `image_guidance_scale`
	(image branch) are active here.

	```python
	import torch
	from diffusers import DreamLitePipeline
	from diffusers.utils import load_image

	pipe = DreamLitePipeline.from_pretrained("carlofkl/DreamLite-base", revision="diffusers", torch_dtype=torch.bfloat16)
	pipe = pipe.to("cuda")

	source = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")

	image = pipe(
	    prompt="turn the cat into a corgi",
	    image=source,
	    height=1024,
	    width=1024,
	    num_inference_steps=28,
	    generator=torch.Generator("cpu").manual_seed(42),
	).images[0]
	image.save("dreamlite_edit.png")
	```
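
For intuition about how the two scales interact in edit mode, here is a minimal sketch of the InstructPix2Pix-style combination mentioned in the overview. Whether DreamLite combines its three branches with exactly this formula is an assumption. `dual_cfg` and its argument names are hypothetical, and plain Python lists stand in for tensors.

```python
from typing import List, Sequence


def dual_cfg(
    uncond: Sequence[float],
    image_only: Sequence[float],
    full: Sequence[float],
    guidance_scale: float,
    image_guidance_scale: float,
) -> List[float]:
    """InstructPix2Pix-style guidance over three predictions.

    uncond     : prediction with neither text nor image conditioning
    image_only : prediction with image conditioning only
    full       : prediction with both text and image conditioning
    """
    return [
        # Start from the unconditional branch, move toward the image-only
        # branch by image_guidance_scale, then toward the fully conditioned
        # branch by guidance_scale.
        u + image_guidance_scale * (i - u) + guidance_scale * (f - i)
        for u, i, f in zip(uncond, image_only, full)
    ]


if __name__ == "__main__":
    # 3.5 and 1.5 are the documented defaults of DreamLitePipeline.__call__.
    print(dual_cfg([0.0], [1.0], [3.0], guidance_scale=3.5, image_guidance_scale=1.5))
```

With both scales set to `1.0` the expression collapses to the fully conditioned prediction, so values above `1.0` push the result further along each conditioning direction.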

	## Text-to-image (Mobile)

	The mobile pipeline is distilled and skips CFG entirely: each step is a single UNet forward pass. It accepts
	the same `prompt` / `height` / `width` / `num_inference_steps` arguments, but ignores `guidance_scale` and
	`image_guidance_scale` if passed (a warning is logged), and it takes no `negative_prompt`.

	```python
	import torch
	from diffusers import DreamLiteMobilePipeline

	pipe = DreamLiteMobilePipeline.from_pretrained("carlofkl/DreamLite-mobile", revision="diffusers", torch_dtype=torch.bfloat16)
	pipe = pipe.to("cuda")

	image = pipe(
	    prompt="a dog running on the grass",
	    height=1024,
	    width=1024,
	    num_inference_steps=4,
	    generator=torch.Generator("cpu").manual_seed(42),
	).images[0]
	image.save("dreamlite_mobile_t2i.png")
	```

	## Image editing (Mobile)

	```python
	import torch
	from diffusers import DreamLiteMobilePipeline
	from diffusers.utils import load_image

	pipe = DreamLiteMobilePipeline.from_pretrained("carlofkl/DreamLite-mobile", revision="diffusers", torch_dtype=torch.bfloat16)
	pipe = pipe.to("cuda")

	source = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")

	image = pipe(
	    prompt="turn the cat into a corgi",
	    image=source,
	    height=1024,
	    width=1024,
	    num_inference_steps=4,
	    generator=torch.Generator("cpu").manual_seed(42),
	).images[0]
	image.save("dreamlite_mobile_edit.png")
	```

	## Notes and limitations

	* Both pipelines force `batch_size = 1` internally; `num_images_per_prompt` sets how many samples
	are drawn from the one prompt and does not batch multiple prompts in parallel.
	* The prompt encoder is `Qwen3-VL`, which is a multimodal model. Loading the full pipeline therefore
	requires sufficient GPU memory for both the U-Net and the Qwen3-VL text encoder (~4 GB + ~0.7 GB
	in bf16 for the base release).
	* The VAE is `AutoencoderTiny` and exposes `encoder_block_out_channels`; `vae_scale_factor` is derived
	from it at pipeline init time.
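
The relationship between the VAE config and the default output resolution can be sketched as follows. The formula `2 ** (len(encoder_block_out_channels) - 1)` follows a convention common to other diffusers pipelines and is assumed here, not confirmed for DreamLite. The channel tuple and `default_sample_size = 128` are likewise illustrative values chosen to reproduce the documented 1024 default.

```python
from typing import Sequence


def vae_scale_factor(encoder_block_out_channels: Sequence[int]) -> int:
    # Assumption: every encoder block after the first halves the resolution.
    return 2 ** (len(encoder_block_out_channels) - 1)


def default_resolution(default_sample_size: int, scale_factor: int) -> int:
    # Documented default for height/width: default_sample_size * vae_scale_factor.
    return default_sample_size * scale_factor


if __name__ == "__main__":
    channels = (64, 64, 64, 64)  # hypothetical config value
    factor = vae_scale_factor(channels)
    print(factor, default_resolution(128, factor))
```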

	## DreamLitePipeline[[diffusers.DreamLitePipeline]]

	#### diffusers.DreamLitePipeline[[diffusers.DreamLitePipeline]]

	[Source](https://github.com/huggingface/diffusers/blob/vr_13751/src/diffusers/pipelines/dreamlite/pipeline_dreamlite.py#L155)

	DreamLite pipeline for text-to-image and instruction-based image editing.

	The same pipeline supports both modes; the operating mode is auto-detected from the inputs:

	- `image is None` -> text-to-image (single CFG on text).
	- `image is not None` -> image-to-image / instruction edit (dual CFG: text + image).

	Components:

	- text_encoder ([~transformers.Qwen3VLForConditionalGeneration]): Multimodal text/vision encoder used to produce conditioning embeddings.
	- tokenizer ([~transformers.AutoTokenizer]): Tokenizer for text-only (generate) mode.
	- processor ([~transformers.Qwen3VLProcessor]): Multimodal processor for edit mode (text + image template).
	- vae ([~diffusers.AutoencoderTiny]): Mobile-friendly tiny VAE for latent encode/decode.
	- unet ([~diffusers.DreamLiteUNetModel]): DreamLite UNet (GQA + qk_norm + depthwise-separable convs).
	- scheduler ([~diffusers.FlowMatchEulerDiscreteScheduler]): Flow-matching Euler scheduler with dynamic shift.

	Note:
	`batch_size` is currently forced to `1`; `num_images_per_prompt` is supported.

	#### __call__[[diffusers.DreamLitePipeline.__call__]]

	[Source](https://github.com/huggingface/diffusers/blob/vr_13751/src/diffusers/pipelines/dreamlite/pipeline_dreamlite.py#L388)

	Run the DreamLite pipeline.

	| Argument | Type | Default |
	|---|---|---|
	| prompt | `Optional[str]` | `None` |
	| negative_prompt | `Optional[str]` | `None` |
	| image | `Optional[PIL.Image.Image]` | `None` |
	| height | `Optional[int]` | `None` |
	| width | `Optional[int]` | `None` |
	| guidance_scale | `float` | `3.5` |
	| image_guidance_scale | `float` | `1.5` |
	| num_inference_steps | `int` | `30` |
	| sigmas | `Optional[List[float]]` | `None` |
	| num_images_per_prompt | `Optional[int]` | `1` |
	| generator | `Union[torch.Generator, List[torch.Generator], None]` | `None` |
	| output_type | `Optional[str]` | `'pil'` |
	| return_dict | `bool` | `True` |
	| max_sequence_length | `int` | `200` |
	| text_pad_embedding | `Optional[torch.Tensor]` | `None` |

	Parameters:

	prompt : Text prompt.

	negative_prompt : Negative text prompt (defaults to empty string).

	image : Optional input image. If provided, the pipeline runs in edit / image-to-image mode with dual classifier-free guidance; otherwise it runs in text-to-image mode.

	height : Output resolution (height). Defaults to `default_sample_size * vae_scale_factor` (1024). The same default applies in both T2I and I2I; pass an explicit value to override.

	width : Output resolution (width). Defaults to `default_sample_size * vae_scale_factor` (1024). The same default applies in both T2I and I2I; pass an explicit value to override.

	guidance_scale : CFG scale on the text branch (both modes).

	image_guidance_scale : Additional CFG scale on the image branch (edit mode only).

	num_inference_steps : Number of denoising steps.

	sigmas : Optional explicit FlowMatch sigmas; defaults to a uniform linspace.

	num_images_per_prompt : Output images per prompt (note: `batch_size` is forced to 1).

	generator : Random generator(s).

	output_type : `"pil"`, `"np"`, `"pt"` or `"latent"`.

	return_dict : If True, returns a [DreamLitePipelineOutput](/docs/diffusers/pr_13751/en/api/pipelines/dreamlite#diffusers.DreamLitePipelineOutput); else a tuple `(images,)`.

	max_sequence_length : Maximum number of user-prompt tokens kept after dropping the chat-template prefix. Only applies to `generate` mode (the `edit` mode uses the multimodal processor's native padding).

	text_pad_embedding : Optional learned pad embedding for masked positions.

	Returns:

	[DreamLitePipelineOutput](/docs/diffusers/pr_13751/en/api/pipelines/dreamlite#diffusers.DreamLitePipelineOutput) or `tuple`.
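
The `sigmas` default is described above as a uniform linspace. A minimal sketch of such a schedule in plain Python follows. It assumes the schedule runs from 1.0 down to `1 / num_inference_steps`, as in several other flow-matching pipelines in diffusers; the exact endpoints DreamLite uses are not stated on this page. `uniform_sigmas` is a hypothetical helper.

```python
from typing import List


def uniform_sigmas(num_inference_steps: int) -> List[float]:
    """Evenly spaced sigmas from 1.0 down to 1 / num_inference_steps (assumed endpoints)."""
    if num_inference_steps < 1:
        raise ValueError("num_inference_steps must be >= 1")
    if num_inference_steps == 1:
        return [1.0]
    end = 1.0 / num_inference_steps
    step = (1.0 - end) / (num_inference_steps - 1)
    return [1.0 - i * step for i in range(num_inference_steps)]


if __name__ == "__main__":
    print(uniform_sigmas(4))
```

A list of this shape could be passed as `sigmas=` to override the default schedule with a custom one.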

	## DreamLiteMobilePipeline[[diffusers.DreamLiteMobilePipeline]]

	#### diffusers.DreamLiteMobilePipeline[[diffusers.DreamLiteMobilePipeline]]

	[Source](https://github.com/huggingface/diffusers/blob/vr_13751/src/diffusers/pipelines/dreamlite/pipeline_dreamlite_mobile.py#L156)

	DreamLite Mobile pipeline: a distilled, classifier-free-guidance-free variant of
	[DreamLitePipeline](/docs/diffusers/pr_13751/en/api/pipelines/dreamlite#diffusers.DreamLitePipeline) for fast few-step inference (default 4 steps).

	The operating mode is auto-detected from inputs (same as the base pipeline):

	- `image is None` -> text-to-image.
	- `image is not None` -> image-to-image / instruction edit.

	Because classifier-free guidance is distilled away, `guidance_scale` and `image_guidance_scale` are
	accepted for API parity with [DreamLitePipeline](/docs/diffusers/pr_13751/en/api/pipelines/dreamlite#diffusers.DreamLitePipeline) but are ignored in the denoising loop. `negative_prompt`
	is intentionally absent.

	Components (identical to the base pipeline):

	- text_encoder ([~transformers.Qwen3VLForConditionalGeneration]): Multimodal text/vision encoder.
	- tokenizer ([~transformers.AutoTokenizer]): Tokenizer for text-only (generate) mode.
	- processor ([~transformers.Qwen3VLProcessor]): Multimodal processor for edit mode.
	- vae ([~diffusers.AutoencoderTiny]): Mobile-friendly tiny VAE.
	- unet ([~diffusers.DreamLiteUNetModel]): DreamLite UNet.
	- scheduler ([~diffusers.FlowMatchEulerDiscreteScheduler]): Flow-matching Euler scheduler with dynamic shift.

	Note:
	`batch_size` is currently forced to `1`; `num_images_per_prompt` is supported.

	#### __call__[[diffusers.DreamLiteMobilePipeline.__call__]]

	[Source](https://github.com/huggingface/diffusers/blob/vr_13751/src/diffusers/pipelines/dreamlite/pipeline_dreamlite_mobile.py#L384)

	Run the distilled DreamLite Mobile pipeline.

	| Argument | Type | Default |
	|---|---|---|
	| prompt | `Union[str, List[str]]` | `None` |
	| image | `Optional[PIL.Image.Image]` | `None` |
	| height | `Optional[int]` | `None` |
	| width | `Optional[int]` | `None` |
	| num_inference_steps | `int` | `4` |
	| guidance_scale | `Optional[float]` | `None` |
	| image_guidance_scale | `Optional[float]` | `None` |
	| sigmas | `Optional[List[float]]` | `None` |
	| num_images_per_prompt | `Optional[int]` | `1` |
	| generator | `Union[torch.Generator, List[torch.Generator], None]` | `None` |
	| output_type | `Optional[str]` | `'pil'` |
	| return_dict | `bool` | `True` |
	| max_sequence_length | `int` | `200` |
	| text_pad_embedding | `Optional[torch.Tensor]` | `None` |

	Parameters:

	prompt : Text prompt.

	image : Optional input image. If provided, runs in edit / image-to-image mode; otherwise runs in text-to-image mode.

	height : Output resolution (height). Defaults to `default_sample_size * vae_scale_factor` (1024).

	width : Output resolution (width). Defaults to `default_sample_size * vae_scale_factor` (1024).

	num_inference_steps : Number of denoising steps. Defaults to 4 (distilled).

	guidance_scale : Accepted for API parity with [DreamLitePipeline](/docs/diffusers/pr_13751/en/api/pipelines/dreamlite#diffusers.DreamLitePipeline); ignored because CFG was distilled away.

	image_guidance_scale : Accepted for API parity with [DreamLitePipeline](/docs/diffusers/pr_13751/en/api/pipelines/dreamlite#diffusers.DreamLitePipeline); ignored because CFG was distilled away.

	sigmas : Optional explicit FlowMatch sigmas; defaults to a uniform linspace.

	num_images_per_prompt : Output images per prompt (note: `batch_size` is forced to 1).

	generator : Random generator(s).

	output_type : `"pil"`, `"np"`, `"pt"` or `"latent"`.

	return_dict : If True, returns a [DreamLitePipelineOutput](/docs/diffusers/pr_13751/en/api/pipelines/dreamlite#diffusers.DreamLitePipelineOutput); else `(images,)`.

	max_sequence_length : Maximum number of user-prompt tokens kept after dropping the chat-template prefix. Only applies to `generate` mode (the `edit` mode uses the multimodal processor's native padding).

	text_pad_embedding : Optional learned pad embedding for masked positions.

	Returns:

	[DreamLitePipelineOutput](/docs/diffusers/pr_13751/en/api/pipelines/dreamlite#diffusers.DreamLitePipelineOutput) or `tuple`.
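
The `max_sequence_length` behaviour described above (drop the chat-template prefix, then keep at most that many user-prompt tokens) can be illustrated with plain token-id lists. This is a sketch of the idea only: `trim_prompt_tokens`, the prefix length, and the padding to a fixed length with a mask are assumptions, not the pipeline's actual code.

```python
from typing import List, Tuple


def trim_prompt_tokens(
    token_ids: List[int],
    prefix_len: int,
    max_sequence_length: int = 200,
    pad_id: int = 0,
) -> Tuple[List[int], List[int]]:
    """Drop the template prefix, truncate, then pad to a fixed length.

    Returns (ids, mask), where mask is 1 for real tokens and 0 for padding.
    """
    # Remove the chat-template prefix, then keep at most max_sequence_length tokens.
    user_ids = token_ids[prefix_len:][:max_sequence_length]
    pad = max_sequence_length - len(user_ids)
    return user_ids + [pad_id] * pad, [1] * len(user_ids) + [0] * pad


if __name__ == "__main__":
    # Three hypothetical prefix tokens (9) followed by three prompt tokens.
    ids, mask = trim_prompt_tokens([9, 9, 9, 5, 6, 7], prefix_len=3, max_sequence_length=4)
    print(ids, mask)
```

In this picture, the positions where the mask is 0 are the "masked positions" that `text_pad_embedding`, when provided, would presumably fill.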

	## DreamLitePipelineOutput[[diffusers.DreamLitePipelineOutput]]

	#### diffusers.DreamLitePipelineOutput[[diffusers.DreamLitePipelineOutput]]

	[Source](https://github.com/huggingface/diffusers/blob/vr_13751/src/diffusers/pipelines/dreamlite/pipeline_output.py#L25)

	Output class for DreamLite pipelines.

	Parameters:

	images (`List[PIL.Image.Image]` or `np.ndarray`) : The denoised images produced by the pipeline, either as a list of PIL images of length `batch_size` or as a NumPy array of shape `(batch_size, height, width, num_channels)`.
