JoyAI-Image-Edit
JoyAI-Image is a unified multimodal foundation model for image understanding, text-to-image generation, and instruction-guided image editing. It combines an 8B Multimodal Large Language Model (MLLM) with a 16B Multimodal Diffusion Transformer (MMDiT). A central principle of JoyAI-Image is the closed-loop collaboration between understanding, generation, and editing.
JoyAI-Image-Edit supports general image editing as well as spatial editing capabilities including object move, object rotation, and camera control.
| Model | Description | Download |
|---|---|---|
| JoyAI-Image-Edit | Instruction-guided image editing with precise and controllable spatial manipulation | Hugging Face |
import torch
from diffusers import JoyImageEditPipeline
from diffusers.utils import load_image

pipeline = JoyImageEditPipeline.from_pretrained(
    "jdopensource/JoyAI-Image-Edit-Diffusers", torch_dtype=torch.bfloat16
)
pipeline.to("cuda")

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg")
prompt = "Add wings to the astronaut."

output = pipeline(
    image=image,
    prompt=prompt,
    num_inference_steps=40,
    guidance_scale=4.0,
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
output.save("joyimage_edit_output.png")
Spatial editing
JoyAI-Image supports three spatial editing prompt patterns: Object Move, Object Rotation, and Camera Control. For best results, follow the prompt templates below as closely as possible. For more information, refer to SpatialEdit.
Object Move
Move a target object into a specified region marked by a red box in the input image.
Move the <object> into the red box and finally remove the red box.
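The Object Move template can be filled in programmatically, and the red box can be drawn onto the input image before it is passed to the pipeline. The sketch below is illustrative only: `object_move_prompt` and `draw_red_box` are hypothetical helper names, and the outline width and exact red colour are assumptions rather than values documented for the model.

```python
def object_move_prompt(obj: str) -> str:
    """Fill the Object Move template with the name of the object to move."""
    return f"Move the {obj} into the red box and finally remove the red box."


def draw_red_box(image, box, width=4):
    """Return a copy of a PIL image with a red rectangle outlining `box`.

    `box` is (left, top, right, bottom) in pixels. The outline width and the
    pure-red colour are illustrative assumptions.
    """
    # Imported here so the prompt helper above has no third-party dependency.
    from PIL import ImageDraw

    marked = image.convert("RGB")  # convert() returns a new image
    ImageDraw.Draw(marked).rectangle(box, outline=(255, 0, 0), width=width)
    return marked


prompt = object_move_prompt("astronaut")
print(prompt)
```

The marked image and the prompt would then be passed as `image=` and `prompt=` in the pipeline call shown earlier.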
Object Rotation
Rotate an object to a specific canonical view. Supported <view> values: front, right, left, rear, front right, front left, rear right, rear left.
Rotate the <object> to show the <view> side view.
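Because only the eight listed views are supported, it can help to validate the view before building the prompt. A minimal sketch, assuming a hypothetical helper named `object_rotation_prompt` (the lower-casing of the view is a convenience added here, not documented behaviour):

```python
SUPPORTED_VIEWS = (
    "front", "right", "left", "rear",
    "front right", "front left", "rear right", "rear left",
)


def object_rotation_prompt(obj: str, view: str) -> str:
    """Fill the Object Rotation template, rejecting views outside the supported list."""
    view = view.strip().lower()
    if view not in SUPPORTED_VIEWS:
        raise ValueError(f"Unsupported view {view!r}; expected one of {SUPPORTED_VIEWS}")
    return f"Rotate the {obj} to show the {view} side view."


print(object_rotation_prompt("chair", "front left"))
```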
Camera Control
Change the camera viewpoint while keeping the 3D scene unchanged.
Move the camera.
- Camera rotation: Yaw {y_rotation}°, Pitch {p_rotation}°.
- Camera zoom: in/out/unchanged.
- Keep the 3D scene static; only change the viewpoint.
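The Camera Control template above can be assembled from yaw, pitch, and zoom values. The helper below is a hypothetical sketch: the function name, the number formatting, and the sign convention for the angles are assumptions, since the template only specifies the placeholders.

```python
def camera_control_prompt(yaw: float, pitch: float, zoom: str = "unchanged") -> str:
    """Build a Camera Control prompt from yaw/pitch in degrees and a zoom mode."""
    if zoom not in ("in", "out", "unchanged"):
        raise ValueError("zoom must be 'in', 'out', or 'unchanged'")
    return "\n".join(
        [
            "Move the camera.",
            # ':g' drops trailing zeros, so 30.0 is written as 30.
            f"- Camera rotation: Yaw {yaw:g}°, Pitch {pitch:g}°.",
            f"- Camera zoom: {zoom}.",
            "- Keep the 3D scene static; only change the viewpoint.",
        ]
    )


print(camera_control_prompt(30, -15, "in"))
```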
JoyImageEditPipeline[[diffusers.JoyImageEditPipeline]]
diffusers.JoyImageEditPipeline[[diffusers.JoyImageEditPipeline]]
Diffusion pipeline for image editing using the JoyImage architecture.
The pipeline encodes text and image conditioning via a Qwen3-VL text encoder, denoises latents with a 3-D transformer, and decodes the result with a WAN VAE.
Model offloading order: text_encoder -> transformer -> vae.
__call__[[diffusers.JoyImageEditPipeline.__call__]]
Source: https://github.com/huggingface/diffusers/blob/vr_13751/src/diffusers/pipelines/joyimage/pipeline_joyimage_edit.py#L600
__call__(
    image: PIL.Image.Image | numpy.ndarray | torch.Tensor | list[PIL.Image.Image] | list[numpy.ndarray] | list[torch.Tensor] | None = None,
    prompt: str | list[str] = None,
    height: int | None = None,
    width: int | None = None,
    num_inference_steps: int = 40,
    timesteps: List[int] = None,
    sigmas: List[float] = None,
    guidance_scale: float = 4.0,
    negative_prompt: Union[str, List[str], None] = None,
    num_images_per_prompt: Optional[int] = 1,
    generator: Union[torch.Generator, List[torch.Generator], None] = None,
    latents: Optional[torch.Tensor] = None,
    prompt_embeds: Optional[torch.Tensor] = None,
    prompt_embeds_mask: Optional[torch.Tensor] = None,
    negative_prompt_embeds: Optional[torch.Tensor] = None,
    negative_prompt_embeds_mask: Optional[torch.Tensor] = None,
    output_type: Optional[str] = 'pil',
    return_dict: bool = True,
    callback_on_step_end: Union[Callable[[int, int, Dict], None], PipelineCallback, MultiPipelineCallbacks, None] = None,
    callback_on_step_end_tensor_inputs: List[str] = ['latents'],
    max_sequence_length: int = 4096,
    enable_denormalization: bool = True,
)
Generate an edited image conditioned on a reference image and a text prompt.
Examples:
>>> import torch
>>> from diffusers import JoyImageEditPipeline
>>> from diffusers.utils import load_image
>>> model_id = "jdopensource/JoyAI-Image-Edit-Diffusers"
>>> pipe = JoyImageEditPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")
>>> image = load_image("https://huggingface.co/datasets/diffusers/docs-images/resolve/main/astronaut.jpg")
>>> output = pipe(
... image=image, # pass an image for editing; omit for text-to-image generation
... prompt="Add wings to the astronaut.",
... num_inference_steps=40,
... guidance_scale=4.0,
... generator=torch.manual_seed(0),
... )
>>> output.images[0].save("joyimage_edit.png")
Parameters:
prompt (str or List[str]) : The prompt or prompts to guide generation.
height (int) : Height of the generated output in pixels.
width (int) : Width of the generated output in pixels.
image (PipelineImageInput, optional) : Reference image used for conditioning. When provided the pipeline operates in image-editing mode with num_items=2.
num_inference_steps (int, optional, defaults to 40) : Number of denoising steps. More steps generally improve quality at the cost of slower inference.
timesteps (List[int], optional) : Custom timesteps for the denoising process. When provided, num_inference_steps is inferred from the list length.
sigmas (List[float], optional) : Custom sigmas for the denoising process. Mutually exclusive with timesteps.
guidance_scale (float, optional, defaults to 4.0) : Classifier-free guidance scale.
negative_prompt (str or List[str], optional) : Negative prompt(s) used to suppress undesired content.
num_images_per_prompt (int, optional, defaults to 1) : Number of generated samples per prompt.
generator (torch.Generator or List[torch.Generator], optional) : RNG generator(s) for deterministic sampling.
latents (torch.Tensor, optional) : Pre-generated noisy latents for the target slot. Sampled from a Gaussian distribution when not provided. Can be used to seed generation from a specific starting noise tensor.
prompt_embeds (torch.Tensor, optional) : Pre-computed prompt embeddings. When provided prompt can be omitted.
prompt_embeds_mask (torch.Tensor, optional) : Attention mask for prompt_embeds.
negative_prompt_embeds (torch.Tensor, optional) : Pre-computed negative prompt embeddings.
negative_prompt_embeds_mask (torch.Tensor, optional) : Attention mask for negative_prompt_embeds.
output_type (str, optional, defaults to "pil") : Output format. Pass "latent" to return raw latents.
return_dict (bool, optional, defaults to True) : Whether to return a JoyImageEditPipelineOutput or a plain tensor.
callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) : Callback invoked at the end of each denoising step with signature (self, step: int, timestep: int, callback_kwargs: Dict).
callback_on_step_end_tensor_inputs (List[str], optional, defaults to ["latents"]) : Tensor keys included in callback_kwargs for callback_on_step_end.
max_sequence_length (int, optional, defaults to 4096) : Maximum sequence length for prompt encoding.
enable_denormalization (bool, optional, defaults to True) : Denormalise latents before VAE decoding.
Returns:
JoyImageEditPipelineOutput or torch.Tensor
If return_dict is True, returns a pipeline output object containing the generated image(s).
Otherwise returns the image tensor directly.
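As an illustration of `callback_on_step_end`, the sketch below defines a callback that matches the documented signature and records each step. It assumes, as in other diffusers pipelines, that the denoising loop reads tensors back from the returned dictionary, so the callback returns `callback_kwargs` unchanged. `log_step` and `step_log` are hypothetical names, and the final call is a stand-in for what the pipeline would do at one step.

```python
step_log = []


def log_step(pipe, step, timestep, callback_kwargs):
    """Record the step index, timestep, and available tensor keys."""
    step_log.append((step, int(timestep), sorted(callback_kwargs)))
    # Assumption: the returned dict is consumed by the denoising loop,
    # so it is handed back unchanged.
    return callback_kwargs


# Stand-in for a single invocation by the pipeline.
returned = log_step(None, 0, 999, {"latents": object()})
print(step_log)
```

In a real run the function would be passed as `callback_on_step_end=log_step`, with `callback_on_step_end_tensor_inputs=["latents"]` selecting which tensors appear in `callback_kwargs`.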
check_inputs[[diffusers.JoyImageEditPipeline.check_inputs]]
Validate pipeline inputs before the forward pass.
denormalize_latents[[diffusers.JoyImageEditPipeline.denormalize_latents]]
Invert normalize_latents to recover the original latent scale.
Parameters:
latent : Normalised latent tensor.
Returns:
Latent tensor in the scale expected by vae.decode.
encode_prompt[[diffusers.JoyImageEditPipeline.encode_prompt]]
Encode a text prompt into embeddings (text-only path).
Pre-computed prompt_embeds bypass encoding entirely.
Parameters:
prompt : Prompt string or list of prompt strings.
device : Target device.
num_images_per_prompt : Number of outputs to generate per prompt.
prompt_embeds : Pre-computed prompt embeddings.
prompt_embeds_mask : Attention mask for pre-computed embeddings.
max_sequence_length : Maximum output sequence length.
template_type : Prompt template key ("image" or "multiple_images").
Returns:
Tuple of (prompt_embeds, prompt_embeds_mask).
encode_prompt_multiple_images[[diffusers.JoyImageEditPipeline.encode_prompt_multiple_images]]
Encode prompts that contain inline image tokens via the Qwen processor.
<image>\n placeholders in each prompt string are replaced by the Qwen vision special tokens before being fed to the multimodal encoder.
Parameters:
prompt : Prompt string(s), optionally containing <image>\n tokens.
device : Target device.
num_images_per_prompt : Number of outputs to generate per prompt.
images : Pixel tensors corresponding to the inline image tokens.
prompt_embeds : Pre-computed prompt embeddings.
prompt_embeds_mask : Attention mask for pre-computed embeddings.
template_type : Must be "multiple_images".
max_sequence_length : If set, truncate the output to this length (keeping the last max_sequence_length tokens).
Returns:
Tuple of (prompt_embeds, prompt_embeds_mask).
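The truncation rule for `max_sequence_length` keeps the end of the sequence rather than the beginning. The sketch below shows that behaviour on a plain Python list; `keep_last_tokens` is a hypothetical helper, and the pipeline itself would presumably apply the same slice along the sequence dimension of a tensor.

```python
def keep_last_tokens(tokens, max_sequence_length=None):
    """Keep only the last `max_sequence_length` items; no-op when the limit is None."""
    if max_sequence_length is None:
        return list(tokens)
    if max_sequence_length <= 0:
        # Guard: tokens[-0:] would return the whole list instead of nothing.
        raise ValueError("max_sequence_length must be positive")
    return list(tokens[-max_sequence_length:])


print(keep_last_tokens([1, 2, 3, 4, 5], 3))
```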
normalize_latents[[diffusers.JoyImageEditPipeline.normalize_latents]]
Normalise latents using per-channel statistics from the VAE config.
Uses (latent - mean) / std when the VAE exposes latents_mean and latents_std; otherwise falls back to
scaling by scaling_factor.
Parameters:
latent : Raw latent tensor from vae.encode.
Returns:
Normalised latent tensor.
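The mean/std path and its inverse in `denormalize_latents` can be checked with a small round trip. This is a sketch on nested Python lists (one inner list per channel) rather than the pipeline's tensors, with made-up statistics; it does not cover the `scaling_factor` fallback.

```python
def normalize(latent, mean, std):
    """Apply (x - mean) / std per channel."""
    return [[(x - m) / s for x in channel] for channel, m, s in zip(latent, mean, std)]


def denormalize(latent, mean, std):
    """Invert normalize(): x * std + mean per channel."""
    return [[x * s + m for x in channel] for channel, m, s in zip(latent, mean, std)]


# Two channels with illustrative (made-up) statistics.
latent = [[0.5, 1.5], [-2.0, 4.0]]
mean = [1.0, 1.0]
std = [0.5, 3.0]

normalized = normalize(latent, mean, std)
restored = denormalize(normalized, mean, std)
print(normalized, restored)
```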
prepare_latents[[diffusers.JoyImageEditPipeline.prepare_latents]]
Prepare the initial noisy latent tensor for the denoising loop.
Parameters:
batch_size : Number of samples in the batch.
num_channels_latents : Latent channel dimension from the transformer config.
height : Spatial height in pixels.
width : Spatial width in pixels.
video_length : Number of frames (1 for image inference).
dtype : Floating-point dtype for the latent tensor.
device : Target device.
generator : RNG generator(s) for reproducible sampling.
latents : Optional user-provided initial noise for the target slot. When None random noise is sampled.
image : Optional list of PIL reference images to VAE-encode as conditioning slots.
enable_denormalization : Whether to normalise encoded reference latents.
Returns:
Tuple of (latents, image_latents) where latents has shape (B, 1, C, T, H', W') and
image_latents has shape (B, N_ref, C, T, H', W') or None when no reference images are given.
JoyImageEditPipelineOutput[[diffusers.JoyImageEditPipelineOutput]]
diffusers.JoyImageEditPipelineOutput[[diffusers.JoyImageEditPipelineOutput]]
Output class for JoyImageEdit generation pipelines.