
JoyAI-Image-Edit

JoyAI-Image is a unified multimodal foundation model for image understanding, text-to-image generation, and instruction-guided image editing. It combines an 8B Multimodal Large Language Model (MLLM) with a 16B Multimodal Diffusion Transformer (MMDiT). A central principle of JoyAI-Image is the closed-loop collaboration between understanding, generation, and editing.

JoyAI-Image-Edit supports general image editing as well as spatial editing capabilities including object move, object rotation, and camera control.

Model | Description | Download
JoyAI-Image-Edit | Instruction-guided image editing with precise and controllable spatial manipulation | Hugging Face
import torch
from diffusers import JoyImageEditPipeline
from diffusers.utils import load_image

pipeline = JoyImageEditPipeline.from_pretrained(
    "jdopensource/JoyAI-Image-Edit-Diffusers", torch_dtype=torch.bfloat16
)
pipeline.to("cuda")

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg")
prompt = "Add wings to the astronaut."

output = pipeline(
    image=image,
    prompt=prompt,
    num_inference_steps=40,
    guidance_scale=4.0,
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
output.save("joyimage_edit_output.png")

Spatial editing

JoyAI-Image supports three spatial editing prompt patterns: Object Move, Object Rotation, and Camera Control. For best results, follow the prompt templates below as closely as possible. For more information, refer to SpatialEdit.

Object Move

Move a target object into a specified region marked by a red box in the input image.

Move the <object> into the red box and finally remove the red box.
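A minimal sketch of preparing an Object Move request, assuming Pillow is installed. The helper names `move_prompt` and `draw_red_box` are hypothetical, and the exact box colour and line width are assumptions: the text above only says the target region is marked by a red box.

```python
def move_prompt(obj: str) -> str:
    """Fill the Object Move template with the name of the object to move."""
    return f"Move the {obj} into the red box and finally remove the red box."


def draw_red_box(image, box, width=4):
    """Return a copy of a PIL image with a red rectangle outline drawn on it.

    `box` is (left, top, right, bottom) in pixels. Pure red and a 4 px outline
    are assumptions, not documented requirements.
    """
    from PIL import ImageDraw  # imported lazily so move_prompt works without Pillow

    marked = image.copy()
    ImageDraw.Draw(marked).rectangle(box, outline=(255, 0, 0), width=width)
    return marked
```

The marked image would then be passed as `image` and the string from `move_prompt("apple")` as `prompt`.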

Object Rotation

Rotate an object to a specific canonical view. Supported <view> values: front, right, left, rear, front right, front left, rear right, rear left.

Rotate the <object> to show the <view> side view.
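As an illustration, the template and the list of supported views can be wrapped in a small helper that rejects unsupported views before the prompt reaches the model. `rotation_prompt` and `SUPPORTED_VIEWS` are hypothetical names, not part of diffusers.

```python
SUPPORTED_VIEWS = (
    "front", "right", "left", "rear",
    "front right", "front left", "rear right", "rear left",
)


def rotation_prompt(obj: str, view: str) -> str:
    """Fill the Object Rotation template after checking the requested view."""
    view = view.strip().lower()
    if view not in SUPPORTED_VIEWS:
        raise ValueError(f"Unsupported view {view!r}; expected one of {SUPPORTED_VIEWS}")
    return f"Rotate the {obj} to show the {view} side view."
```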

Camera Control

Change the camera viewpoint while keeping the 3D scene unchanged.

Move the camera.
- Camera rotation: Yaw {y_rotation}°, Pitch {p_rotation}°.
- Camera zoom: in/out/unchanged.
- Keep the 3D scene static; only change the viewpoint.
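The Camera Control template can be filled programmatically. The sketch below uses a hypothetical helper, `camera_prompt`; the sign convention for yaw and pitch and the number formatting are assumptions, since the template above does not specify them.

```python
def camera_prompt(yaw: float, pitch: float, zoom: str = "unchanged") -> str:
    """Fill the Camera Control template with rotation angles and a zoom mode."""
    if zoom not in ("in", "out", "unchanged"):
        raise ValueError("zoom must be 'in', 'out', or 'unchanged'")
    return "\n".join(
        [
            "Move the camera.",
            # ':g' drops a trailing '.0', so 30.0 is written as 30
            f"- Camera rotation: Yaw {yaw:g}°, Pitch {pitch:g}°.",
            f"- Camera zoom: {zoom}.",
            "- Keep the 3D scene static; only change the viewpoint.",
        ]
    )
```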

JoyImageEditPipeline[[diffusers.JoyImageEditPipeline]]

diffusers.JoyImageEditPipeline[[diffusers.JoyImageEditPipeline]]

Source

Diffusion pipeline for image editing using the JoyImage architecture.

The pipeline encodes text and image conditioning via a Qwen3-VL text encoder, denoises latents with a 3-D transformer, and decodes the result with a WAN VAE.

Model offloading order: text_encoder -> transformer -> vae.
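The offloading order matters when model CPU offload is enabled. The sketch below assumes this pipeline inherits the standard `enable_model_cpu_offload()` method from `DiffusionPipeline` and that `accelerate` is installed; `load_with_cpu_offload` is a hypothetical wrapper name.

```python
def load_with_cpu_offload(model_id: str = "jdopensource/JoyAI-Image-Edit-Diffusers"):
    """Load the pipeline and enable component-wise CPU offloading."""
    import torch
    from diffusers import JoyImageEditPipeline

    pipe = JoyImageEditPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    # Do not call pipe.to("cuda") here: offloading moves each component
    # (text_encoder, then transformer, then vae) to the GPU only while it runs.
    pipe.enable_model_cpu_offload()
    return pipe
```

This trades some speed for lower peak GPU memory, which may help given the size of the model.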

__call__[[diffusers.JoyImageEditPipeline.__call__]]

Source: https://github.com/huggingface/diffusers/blob/vr_13745/src/diffusers/pipelines/joyimage/pipeline_joyimage_edit.py#L600

__call__(
    image: PIL.Image.Image | numpy.ndarray | torch.Tensor | list[PIL.Image.Image] | list[numpy.ndarray] | list[torch.Tensor] | None = None,
    prompt: str | list[str] = None,
    height: int | None = None,
    width: int | None = None,
    num_inference_steps: int = 40,
    timesteps: typing.List[int] = None,
    sigmas: typing.List[float] = None,
    guidance_scale: float = 4.0,
    negative_prompt: typing.Union[str, typing.List[str], NoneType] = None,
    num_images_per_prompt: typing.Optional[int] = 1,
    generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None,
    latents: typing.Optional[torch.Tensor] = None,
    prompt_embeds: typing.Optional[torch.Tensor] = None,
    prompt_embeds_mask: typing.Optional[torch.Tensor] = None,
    negative_prompt_embeds: typing.Optional[torch.Tensor] = None,
    negative_prompt_embeds_mask: typing.Optional[torch.Tensor] = None,
    output_type: typing.Optional[str] = 'pil',
    return_dict: bool = True,
    callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None,
    callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'],
    max_sequence_length: int = 4096,
    enable_denormalization: bool = True,
)

Generate an edited image conditioned on a reference image and a text prompt.

Examples:

>>> import torch
>>> from diffusers import JoyImageEditPipeline
>>> from diffusers.utils import load_image

>>> model_id = "jdopensource/JoyAI-Image-Edit-Diffusers"
>>> pipe = JoyImageEditPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")

>>> image = load_image("https://huggingface.co/datasets/diffusers/docs-images/resolve/main/astronaut.jpg")
>>> output = pipe(
...     image=image,  # pass an image for editing; omit for text-to-image generation
...     prompt="Add wings to the astronaut.",
...     num_inference_steps=40,
...     guidance_scale=4.0,
...     generator=torch.manual_seed(0),
... )
>>> output.images[0].save("joyimage_edit.png")

Parameters:

prompt (str or List[str]) : The prompt or prompts to guide generation.

height (int, optional) : Height of the generated output in pixels.

width (int, optional) : Width of the generated output in pixels.

image (PipelineImageInput, optional) : Reference image used for conditioning. When provided, the pipeline operates in image-editing mode with num_items=2.

num_inference_steps (int, optional, defaults to 40) : Number of denoising steps. More steps generally improve quality at the cost of slower inference.

timesteps (List[int], optional) : Custom timesteps for the denoising process. When provided, num_inference_steps is inferred from the list length.

sigmas (List[float], optional) : Custom sigmas for the denoising process. Mutually exclusive with timesteps.

guidance_scale (float, optional, defaults to 4.0) : Classifier-free guidance scale.

negative_prompt (str or List[str], optional) : Negative prompt(s) used to suppress undesired content.

num_images_per_prompt (int, optional, defaults to 1) : Number of generated samples per prompt.

generator (torch.Generator or List[torch.Generator], optional) : RNG generator(s) for deterministic sampling.

latents (torch.Tensor, optional) : Pre-generated noisy latents for the target slot. Sampled from a Gaussian distribution when not provided. Can be used to seed generation from a specific starting noise tensor.

prompt_embeds (torch.Tensor, optional) : Pre-computed prompt embeddings. When provided, prompt can be omitted.

prompt_embeds_mask (torch.Tensor, optional) : Attention mask for prompt_embeds.

negative_prompt_embeds (torch.Tensor, optional) : Pre-computed negative prompt embeddings.

negative_prompt_embeds_mask (torch.Tensor, optional) : Attention mask for negative_prompt_embeds.

output_type (str, optional, defaults to "pil") : Output format. Pass "latent" to return raw latents.

return_dict (bool, optional, defaults to True) : Whether to return a JoyImageEditPipelineOutput or a plain tensor.

callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) : Callback invoked at the end of each denoising step with signature (self, step: int, timestep: int, callback_kwargs: Dict).

callback_on_step_end_tensor_inputs (List[str], optional, defaults to ["latents"]) : Tensor keys included in callback_kwargs for callback_on_step_end.

max_sequence_length (int, optional, defaults to 4096) : Maximum sequence length for prompt encoding.

enable_denormalization (bool, optional, defaults to True) : Denormalise latents before VAE decoding.

Returns:

JoyImageEditPipelineOutput or torch.Tensor

If return_dict is True, returns a pipeline output object containing the generated image(s). Otherwise returns the image tensor directly.
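To illustrate `callback_on_step_end`, the sketch below builds a callback that records the latent shape at every denoising step. It follows the documented signature and assumes the usual diffusers convention that the callback returns `callback_kwargs`; `make_step_logger` is a hypothetical helper name.

```python
def make_step_logger(log: list):
    """Return a callback that appends (step, latent shape) to `log` each step."""

    def on_step_end(pipeline, step: int, timestep, callback_kwargs: dict) -> dict:
        latents = callback_kwargs.get("latents")
        shape = tuple(latents.shape) if latents is not None else None
        log.append((step, shape))
        # Hand the (unmodified) tensors back to the pipeline.
        return callback_kwargs

    return on_step_end
```

It would be passed as `callback_on_step_end=make_step_logger(history)` together with `callback_on_step_end_tensor_inputs=["latents"]`.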

check_inputs[[diffusers.JoyImageEditPipeline.check_inputs]]

Source

Validate pipeline inputs before the forward pass.
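The kind of validation involved can be sketched from the constraints stated in the `__call__` parameter descriptions. This is not the library's implementation; `sketch_check_inputs` is a hypothetical function that only covers rules documented on this page.

```python
def sketch_check_inputs(prompt=None, prompt_embeds=None, timesteps=None, sigmas=None):
    """Raise ValueError for argument combinations the parameter docs rule out."""
    # prompt may be omitted only when pre-computed embeddings are supplied.
    if prompt is None and prompt_embeds is None:
        raise ValueError("Provide either `prompt` or `prompt_embeds`.")
    # prompt is documented as str or List[str].
    if prompt is not None and not isinstance(prompt, (str, list)):
        raise ValueError("`prompt` must be a str or a list of str.")
    # sigmas is documented as mutually exclusive with timesteps.
    if timesteps is not None and sigmas is not None:
        raise ValueError("`timesteps` and `sigmas` are mutually exclusive.")
```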

denormalize_latents[[diffusers.JoyImageEditPipeline.denormalize_latents]]

Source

Invert normalize_latents to recover the original latent scale.

Parameters:

latent : Normalised latent tensor.

Returns:

Latent tensor in the scale expected by vae.decode.

encode_prompt[[diffusers.JoyImageEditPipeline.encode_prompt]]

Source

Encode a text prompt into embeddings (text-only path).

Pre-computed prompt_embeds bypass encoding entirely.

Parameters:

prompt : Prompt string or list of prompt strings.

device : Target device.

num_images_per_prompt : Number of outputs to generate per prompt.

prompt_embeds : Pre-computed prompt embeddings.

prompt_embeds_mask : Attention mask for pre-computed embeddings.

max_sequence_length : Maximum output sequence length.

template_type : Prompt template key ("image" or "multiple_images").

Returns:

Tuple of (prompt_embeds, prompt_embeds_mask).

encode_prompt_multiple_images[[diffusers.JoyImageEditPipeline.encode_prompt_multiple_images]]

Source

Encode prompts that contain inline image tokens via the Qwen processor.

<image>\n placeholders in each prompt string are replaced by the Qwen vision special tokens before the prompt is fed to the multimodal encoder.

Parameters:

prompt : Prompt string(s), optionally containing <image>\n tokens.

device : Target device.

num_images_per_prompt : Number of outputs to generate per prompt.

images : Pixel tensors corresponding to the inline image tokens.

prompt_embeds : Pre-computed prompt embeddings.

prompt_embeds_mask : Attention mask for pre-computed embeddings.

template_type : Must be "multiple_images".

max_sequence_length : If set, truncate the output to this length (keeping the last max_sequence_length tokens).

Returns:

Tuple of (prompt_embeds, prompt_embeds_mask).
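The truncation rule for `max_sequence_length` keeps the tail of the sequence rather than the head. A minimal sketch on plain Python lists, assuming one row per prompt; `keep_last_tokens` is a hypothetical name, and on tensors the equivalent slice would be `embeds[:, -max_sequence_length:]`.

```python
def keep_last_tokens(embed_rows, mask_rows, max_sequence_length=None):
    """Truncate each row to its last `max_sequence_length` entries."""
    if max_sequence_length is None:
        return embed_rows, mask_rows
    if max_sequence_length <= 0:
        raise ValueError("max_sequence_length must be positive")
    return (
        [row[-max_sequence_length:] for row in embed_rows],
        [row[-max_sequence_length:] for row in mask_rows],
    )
```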

normalize_latents[[diffusers.JoyImageEditPipeline.normalize_latents]]

Source

Normalise latents using per-channel statistics from the VAE config.

Uses (latent - mean) / std when the VAE exposes latents_mean and latents_std; otherwise falls back to scaling by scaling_factor.

Parameters:

latent : Raw latent tensor from vae.encode.

Returns:

Normalised latent tensor.
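The arithmetic behind `normalize_latents` and its inverse can be sketched on plain Python lists, one inner list per channel. The function names are hypothetical, and the fallback direction (multiplying by `scaling_factor` to normalise, dividing to denormalise) is an assumption based on common VAE conventions, not confirmed by this page.

```python
def normalize_channels(latent, mean=None, std=None, scaling_factor=1.0):
    """Apply (x - mean) / std per channel, or scale when no statistics exist."""
    if mean is not None and std is not None:
        return [[(x - m) / s for x in ch] for ch, m, s in zip(latent, mean, std)]
    return [[x * scaling_factor for x in ch] for ch in latent]  # assumed fallback


def denormalize_channels(latent, mean=None, std=None, scaling_factor=1.0):
    """Invert normalize_channels to recover the original latent scale."""
    if mean is not None and std is not None:
        return [[x * s + m for x in ch] for ch, m, s in zip(latent, mean, std)]
    return [[x / scaling_factor for x in ch] for ch in latent]  # assumed fallback
```

Round-tripping a latent through both functions with the same statistics should return the original values, which is the property `denormalize_latents` relies on before VAE decoding.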

prepare_latents[[diffusers.JoyImageEditPipeline.prepare_latents]]

Source

Prepare the initial noisy latent tensor for the denoising loop.

Parameters:

batch_size : Number of samples in the batch.

num_channels_latents : Latent channel dimension from the transformer config.

height : Spatial height in pixels.

width : Spatial width in pixels.

video_length : Number of frames (1 for image inference).

dtype : Floating-point dtype for the latent tensor.

device : Target device.

generator : RNG generator(s) for reproducible sampling.

latents : Optional user-provided initial noise for the target slot. When None random noise is sampled.

image : Optional list of PIL reference images to VAE-encode as conditioning slots.

enable_denormalization : Whether to normalise encoded reference latents.

Returns:

Tuple of (latents, image_latents) where latents has shape (B, 1, C, T, H', W') and image_latents has shape (B, N_ref, C, T, H', W') or None when no reference images are given.
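To make the target-slot shape (B, 1, C, T, H', W') concrete for single-image inference (T = 1), the sketch below derives it from the pixel size. The spatial compression factor of 8 and the channel count used in the example are assumptions, not values taken from this page; `target_latent_shape` is a hypothetical helper.

```python
def target_latent_shape(batch_size, num_channels_latents, height, width, spatial_factor=8):
    """Return (B, 1, C, T, H', W') for image inference, where T is 1."""
    if height % spatial_factor or width % spatial_factor:
        raise ValueError(f"height and width must be multiples of {spatial_factor}")
    return (
        batch_size,
        1,  # single target slot
        num_channels_latents,
        1,  # T: one frame for image inference
        height // spatial_factor,
        width // spatial_factor,
    )
```

Under these assumptions, a 1024x1024 request with 16 latent channels gives `target_latent_shape(1, 16, 1024, 1024)`, that is `(1, 1, 16, 1, 128, 128)`.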

JoyImageEditPipelineOutput[[diffusers.JoyImageEditPipelineOutput]]

diffusers.JoyImageEditPipelineOutput[[diffusers.JoyImageEditPipelineOutput]]

Source

Output class for JoyImageEdit generation pipelines.
