# JoyAI-Image-Edit

[JoyAI-Image](https://github.com/jd-opensource/JoyAI-Image) is a unified multimodal foundation model for image understanding, text-to-image generation, and instruction-guided image editing. It combines an 8B Multimodal Large Language Model (MLLM) with a 16B Multimodal Diffusion Transformer (MMDiT). A central principle of JoyAI-Image is the closed-loop collaboration between understanding, generation, and editing.

JoyAI-Image-Edit supports general image editing as well as spatial editing capabilities, including object move, object rotation, and camera control.

| Model | Description | Download |
|:-----:|:-----------:|:--------:|
| JoyAI-Image-Edit | Instruction-guided image editing with precise and controllable spatial manipulation | [Hugging Face](https://huggingface.co/jdopensource/JoyAI-Image-Edit-Diffusers) |
```python
import torch
from diffusers import JoyImageEditPipeline
from diffusers.utils import load_image

# Load the pipeline in bfloat16 and move it to the GPU.
pipeline = JoyImageEditPipeline.from_pretrained(
    "jdopensource/JoyAI-Image-Edit-Diffusers", torch_dtype=torch.bfloat16
)
pipeline.to("cuda")

# Reference image to edit and the editing instruction.
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg")
prompt = "Add wings to the astronaut."

# Run the edit; the seeded generator makes the result reproducible.
output = pipeline(
    image=image,
    prompt=prompt,
    num_inference_steps=40,
    guidance_scale=4.0,
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
output.save("joyimage_edit_output.png")
```
## Spatial editing

JoyAI-Image supports three spatial editing prompt patterns: **Object Move**, **Object Rotation**, and **Camera Control**. For best results, follow the prompt templates below as closely as possible. For more information, refer to [SpatialEdit](https://github.com/EasonXiao-888/SpatialEdit).

### Object Move

Move a target object into a specified region marked by a red box in the input image.

```text
Move the <object> into the red box and finally remove the red box.
```
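The Object Move template can be filled in programmatically. The following is a minimal sketch, not part of diffusers or JoyAI-Image: `build_move_prompt` and `box_from_fractions` are hypothetical helper names, and describing the target region as fractions of the image size is an assumption made for illustration. The model itself only sees the red rectangle drawn into the input image.

```python
def build_move_prompt(obj: str) -> str:
    """Fill the Object Move template with a description of the target object."""
    obj = obj.strip()
    if not obj:
        raise ValueError("obj must be a non-empty description of the object")
    return f"Move the {obj} into the red box and finally remove the red box."


def box_from_fractions(width, height, left, top, right, bottom):
    """Convert a box given as fractions of the image size into pixel coordinates.

    Hypothetical helper: the fractional convention is an assumption of this sketch.
    """
    if not (0.0 <= left < right <= 1.0 and 0.0 <= top < bottom <= 1.0):
        raise ValueError("expected 0 <= left < right <= 1 and 0 <= top < bottom <= 1")
    return (
        round(left * width),
        round(top * height),
        round(right * width),
        round(bottom * height),
    )


prompt = build_move_prompt("red mug")
box = box_from_fractions(1024, 768, 0.5, 0.25, 0.75, 0.5)
```

The resulting pixel box can be drawn onto the input image before calling the pipeline, for example with Pillow's `ImageDraw.Draw(image).rectangle(box, outline="red", width=4)`. The line width is a free choice here.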
### Object Rotation

Rotate an object to a specific canonical view. Supported `<view>` values: `front`, `right`, `left`, `rear`, `front right`, `front left`, `rear right`, `rear left`.

```text
Rotate the <object> to show the <view> side view.
```
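Because only eight `<view>` values are listed, it can help to validate the view before building the prompt. A minimal sketch follows; `SUPPORTED_VIEWS` and `build_rotation_prompt` are hypothetical names introduced here, and normalising case and whitespace is a convenience of this sketch rather than documented behaviour.

```python
# The eight canonical views listed in the documentation above.
SUPPORTED_VIEWS = (
    "front", "right", "left", "rear",
    "front right", "front left", "rear right", "rear left",
)


def build_rotation_prompt(obj: str, view: str) -> str:
    """Fill the Object Rotation template after checking the requested view."""
    # Lower-case and collapse repeated whitespace so "Front  Left" is accepted.
    view = " ".join(view.lower().split())
    if view not in SUPPORTED_VIEWS:
        raise ValueError(f"unsupported view {view!r}; expected one of {SUPPORTED_VIEWS}")
    return f"Rotate the {obj.strip()} to show the {view} side view."


prompt = build_rotation_prompt("chair", "front left")
```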
### Camera Control

Change the camera viewpoint while keeping the 3D scene unchanged.

```text
Move the camera.
- Camera rotation: Yaw {y_rotation}°, Pitch {p_rotation}°.
- Camera zoom: in/out/unchanged.
- Keep the 3D scene static; only change the viewpoint.
```
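The Camera Control template has three variable parts: yaw, pitch, and zoom. The sketch below assembles it from those values. `build_camera_prompt` is a hypothetical helper, and the number formatting (`:g`, which drops trailing zeros) is an assumption; the supported angle ranges and sign conventions are not stated on this page.

```python
ZOOM_OPTIONS = ("in", "out", "unchanged")


def build_camera_prompt(yaw: float, pitch: float, zoom: str = "unchanged") -> str:
    """Assemble the multi-line Camera Control prompt from yaw, pitch, and zoom."""
    if zoom not in ZOOM_OPTIONS:
        raise ValueError(f"zoom must be one of {ZOOM_OPTIONS}, got {zoom!r}")
    lines = [
        "Move the camera.",
        f"- Camera rotation: Yaw {yaw:g}°, Pitch {pitch:g}°.",
        f"- Camera zoom: {zoom}.",
        "- Keep the 3D scene static; only change the viewpoint.",
    ]
    return "\n".join(lines)


prompt = build_camera_prompt(yaw=30, pitch=-15, zoom="in")
```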
## JoyImageEditPipeline[[diffusers.JoyImageEditPipeline]]

#### diffusers.JoyImageEditPipeline[[diffusers.JoyImageEditPipeline]]

[Source](https://github.com/huggingface/diffusers/blob/vr_13098/src/diffusers/pipelines/joyimage/pipeline_joyimage_edit.py#L100)

Diffusion pipeline for image editing using the JoyImage architecture.

The pipeline encodes text and image conditioning via a Qwen3-VL text encoder, denoises latents with a 3-D transformer, and decodes the result with a WAN VAE.

Model offloading order: `text_encoder -> transformer -> vae`.
#### __call__[[diffusers.JoyImageEditPipeline.__call__]]

[Source](https://github.com/huggingface/diffusers/blob/vr_13098/src/diffusers/pipelines/joyimage/pipeline_joyimage_edit.py#L600)

Signature: `__call__(image=None, prompt=None, height=None, width=None, num_inference_steps=40, timesteps=None, sigmas=None, guidance_scale=4.0, negative_prompt=None, num_images_per_prompt=1, generator=None, latents=None, prompt_embeds=None, prompt_embeds_mask=None, negative_prompt_embeds=None, negative_prompt_embeds_mask=None, output_type="pil", return_dict=True, callback_on_step_end=None, callback_on_step_end_tensor_inputs=["latents"], max_sequence_length=4096, enable_denormalization=True)`

Generate an edited image conditioned on a reference image and a text prompt.

Examples:
```python
>>> import torch
>>> from diffusers import JoyImageEditPipeline
>>> from diffusers.utils import load_image

>>> model_id = "jdopensource/JoyAI-Image-Edit-Diffusers"
>>> pipe = JoyImageEditPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")

>>> image = load_image("https://huggingface.co/datasets/diffusers/docs-images/resolve/main/astronaut.jpg")
>>> output = pipe(
...     image=image,  # pass an image for editing; omit for text-to-image generation
...     prompt="Add wings to the astronaut.",
...     num_inference_steps=40,
...     guidance_scale=4.0,
...     generator=torch.manual_seed(0),
... )
>>> output.images[0].save("joyimage_edit.png")
```
**Parameters:**

prompt (*str* or *List[str]*) : The prompt or prompts to guide generation.

height (*int*) : Height of the generated output in pixels.

width (*int*) : Width of the generated output in pixels.

image (*PipelineImageInput*, *optional*) : Reference image used for conditioning. When provided, the pipeline operates in image-editing mode with `num_items=2`.

num_inference_steps (*int*, *optional*, defaults to 40) : Number of denoising steps. More steps generally improve quality at the cost of slower inference.

timesteps (*List[int]*, *optional*) : Custom timesteps for the denoising process. When provided, `num_inference_steps` is inferred from the list length.

sigmas (*List[float]*, *optional*) : Custom sigmas for the denoising process. Mutually exclusive with `timesteps`.

guidance_scale (*float*, *optional*, defaults to 4.0) : Classifier-free guidance scale.

negative_prompt (*str* or *List[str]*, *optional*) : Negative prompt(s) used to suppress undesired content.

num_images_per_prompt (*int*, *optional*, defaults to 1) : Number of generated samples per prompt.

generator (*torch.Generator* or *List[torch.Generator]*, *optional*) : RNG generator(s) for deterministic sampling.

latents (*torch.Tensor*, *optional*) : Pre-generated noisy latents for the target slot. Sampled from a Gaussian distribution when not provided. Can be used to seed generation from a specific starting noise tensor.

prompt_embeds (*torch.Tensor*, *optional*) : Pre-computed prompt embeddings. When provided, `prompt` can be omitted.

prompt_embeds_mask (*torch.Tensor*, *optional*) : Attention mask for `prompt_embeds`.

negative_prompt_embeds (*torch.Tensor*, *optional*) : Pre-computed negative prompt embeddings.

negative_prompt_embeds_mask (*torch.Tensor*, *optional*) : Attention mask for `negative_prompt_embeds`.

output_type (*str*, *optional*, defaults to `"pil"`) : Output format. Pass `"latent"` to return raw latents.

return_dict (*bool*, *optional*, defaults to *True*) : Whether to return a [JoyImageEditPipelineOutput](/docs/diffusers/pr_13098/en/api/pipelines/joyimage_edit#diffusers.JoyImageEditPipelineOutput) or a plain tensor.

callback_on_step_end (*Callable*, *PipelineCallback*, *MultiPipelineCallbacks*, *optional*) : Callback invoked at the end of each denoising step with signature `(self, step: int, timestep: int, callback_kwargs: Dict)`.

callback_on_step_end_tensor_inputs (*List[str]*, *optional*, defaults to `["latents"]`) : Tensor keys included in `callback_kwargs` for `callback_on_step_end`.

max_sequence_length (*int*, *optional*, defaults to 4096) : Maximum sequence length for prompt encoding.

enable_denormalization (*bool*, *optional*, defaults to *True*) : Denormalise latents before VAE decoding.

**Returns:**

[JoyImageEditPipelineOutput](/docs/diffusers/pr_13098/en/api/pipelines/joyimage_edit#diffusers.JoyImageEditPipelineOutput) or *torch.Tensor*

If `return_dict` is `True`, returns a pipeline output object containing the generated image(s). Otherwise returns the image tensor directly.
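The `callback_on_step_end` parameter takes a callable with the signature `(self, step, timestep, callback_kwargs)`. The sketch below shows one way such a callback might be written. `make_step_logger` is a hypothetical helper, and returning `callback_kwargs` follows the usual diffusers callback convention, which is assumed (not confirmed on this page) to apply to this pipeline as well.

```python
def make_step_logger(log: list):
    """Build a step-end callback that records each step and the tensor keys it received."""

    def on_step_end(pipe, step, timestep, callback_kwargs):
        # callback_kwargs holds the tensors named in callback_on_step_end_tensor_inputs.
        log.append((step, timestep, sorted(callback_kwargs)))
        # Assumption: as in other diffusers pipelines, the (possibly modified)
        # dict is handed back so the pipeline can pick up changed tensors.
        return callback_kwargs

    return on_step_end


# Stand-in call with dummy values; a real pipeline would pass itself and real tensors.
history = []
callback = make_step_logger(history)
returned = callback(None, 0, 999, {"latents": "dummy"})
```

With a real pipeline, the callback would be passed as `callback_on_step_end=make_step_logger(history)`, optionally together with `callback_on_step_end_tensor_inputs=["latents"]`.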
#### check_inputs[[diffusers.JoyImageEditPipeline.check_inputs]]

[Source](https://github.com/huggingface/diffusers/blob/vr_13098/src/diffusers/pipelines/joyimage/pipeline_joyimage_edit.py#L409)

Validate pipeline inputs before the forward pass.

#### denormalize_latents[[diffusers.JoyImageEditPipeline.denormalize_latents]]

[Source](https://github.com/huggingface/diffusers/blob/vr_13098/src/diffusers/pipelines/joyimage/pipeline_joyimage_edit.py#L476)

Invert `normalize_latents` to recover the original latent scale.

**Parameters:**

latent : Normalised latent tensor.

**Returns:**

Latent tensor in the scale expected by `vae.decode`.
#### encode_prompt[[diffusers.JoyImageEditPipeline.encode_prompt]]

[Source](https://github.com/huggingface/diffusers/blob/vr_13098/src/diffusers/pipelines/joyimage/pipeline_joyimage_edit.py#L364)

Encode a text prompt into embeddings (text-only path).

Pre-computed `prompt_embeds` bypass encoding entirely.

**Parameters:**

prompt : Prompt string or list of prompt strings.

device : Target device.

num_images_per_prompt : Number of outputs to generate per prompt.

prompt_embeds : Pre-computed prompt embeddings.

prompt_embeds_mask : Attention mask for pre-computed embeddings.

max_sequence_length : Maximum output sequence length.

template_type : Prompt template key (`"image"` or `"multiple_images"`).

**Returns:**

Tuple of `(prompt_embeds, prompt_embeds_mask)`.
#### encode_prompt_multiple_images[[diffusers.JoyImageEditPipeline.encode_prompt_multiple_images]]

[Source](https://github.com/huggingface/diffusers/blob/vr_13098/src/diffusers/pipelines/joyimage/pipeline_joyimage_edit.py#L286)

Encode prompts that contain inline image tokens via the Qwen processor.

`<image>\n` placeholders in each prompt string are replaced by the Qwen vision special tokens before being fed to the multimodal encoder.

**Parameters:**

prompt : Prompt string(s), optionally containing `<image>\n` tokens.

device : Target device.

num_images_per_prompt : Number of outputs to generate per prompt.

images : Pixel tensors corresponding to the inline image tokens.

prompt_embeds : Pre-computed prompt embeddings.

prompt_embeds_mask : Attention mask for pre-computed embeddings.

template_type : Must be `"multiple_images"`.

max_sequence_length : If set, truncate the output to this length (keeping the last `max_sequence_length` tokens).

**Returns:**

Tuple of `(prompt_embeds, prompt_embeds_mask)`.
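The `max_sequence_length` behaviour described above keeps the tail of the sequence and drops the head. The sketch below illustrates that idea on plain Python lists, with one entry per token. `keep_last_tokens` is a hypothetical helper written for this page, not the pipeline's code, and the list representation is an assumption chosen so that the example runs without PyTorch.

```python
def keep_last_tokens(prompt_embeds, prompt_embeds_mask, max_sequence_length=None):
    """Truncate per-token embeddings and their mask to the last max_sequence_length tokens."""
    if max_sequence_length is None:
        return prompt_embeds, prompt_embeds_mask
    if max_sequence_length <= 0:
        raise ValueError("max_sequence_length must be positive when set")
    if len(prompt_embeds) <= max_sequence_length:
        # Already short enough: nothing to drop.
        return prompt_embeds, prompt_embeds_mask
    # A negative slice start keeps the tail of the sequence and drops the head.
    return (
        prompt_embeds[-max_sequence_length:],
        prompt_embeds_mask[-max_sequence_length:],
    )


embeds = [[0.0], [1.0], [2.0], [3.0]]  # four tokens, one feature each
mask = [1, 1, 1, 1]
short_embeds, short_mask = keep_last_tokens(embeds, mask, max_sequence_length=2)
```

For a batched tensor laid out as `(batch, sequence, hidden)`, which is an assumed layout here, the same idea would be written as `prompt_embeds[:, -max_sequence_length:]`.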
#### normalize_latents[[diffusers.JoyImageEditPipeline.normalize_latents]]

[Source](https://github.com/huggingface/diffusers/blob/vr_13098/src/diffusers/pipelines/joyimage/pipeline_joyimage_edit.py#L447)

Normalise latents using per-channel statistics from the VAE config.

Uses `(latent - mean) / std` when the VAE exposes `latents_mean` and `latents_std`; otherwise falls back to scaling by `scaling_factor`.

**Parameters:**

latent : Raw latent tensor from `vae.encode`.

**Returns:**

Normalised latent tensor.
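The per-channel formula `(latent - mean) / std` and its inverse can be checked with a small worked example. This is a sketch on nested Python lists (one inner list per channel) rather than the pipeline's tensor code; `normalize_channels` and `denormalize_channels` are hypothetical names, and the `scaling_factor` fallback is left out because its exact form is not given on this page.

```python
def normalize_channels(latent, latents_mean, latents_std):
    """Apply (value - mean) / std to every value, using one mean/std pair per channel."""
    return [
        [(value - mean) / std for value in channel]
        for channel, mean, std in zip(latent, latents_mean, latents_std)
    ]


def denormalize_channels(latent, latents_mean, latents_std):
    """Invert normalize_channels: value * std + mean, per channel."""
    return [
        [value * std + mean for value in channel]
        for channel, mean, std in zip(latent, latents_mean, latents_std)
    ]


raw = [[1.0, 3.0], [10.0, 20.0]]  # two channels, two values each
mean = [2.0, 15.0]
std = [0.5, 5.0]

normalised = normalize_channels(raw, mean, std)
restored = denormalize_channels(normalised, mean, std)
```

Here `normalised` is `[[-2.0, 2.0], [-1.0, 1.0]]`, and `restored` equals `raw`, which is the round trip that `denormalize_latents` is described as providing.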
#### prepare_latents[[diffusers.JoyImageEditPipeline.prepare_latents]]

[Source](https://github.com/huggingface/diffusers/blob/vr_13098/src/diffusers/pipelines/joyimage/pipeline_joyimage_edit.py#L502)

Prepare the initial noisy latent tensor for the denoising loop.

**Parameters:**

batch_size : Number of samples in the batch.

num_channels_latents : Latent channel dimension from the transformer config.

height : Spatial height in pixels.

width : Spatial width in pixels.

video_length : Number of frames (1 for image inference).

dtype : Floating-point dtype for the latent tensor.

device : Target device.

generator : RNG generator(s) for reproducible sampling.

latents : Optional user-provided initial noise for the target slot. When `None`, random noise is sampled.

image : Optional list of PIL reference images to VAE-encode as conditioning slots.

enable_denormalization : Whether to normalise encoded reference latents.

**Returns:**

Tuple of `(latents, image_latents)` where `latents` has shape `(B, 1, C, T, H', W')` and `image_latents` has shape `(B, N_ref, C, T, H', W')` or `None` when no reference images are given.
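To make the returned shapes concrete, the sketch below computes them for a single-frame image edit. It rests on two assumptions that this page does not confirm: that the VAE downsamples height and width by a factor of 8 (`spatial_factor`, a hypothetical parameter) and that a single input frame maps to `T = 1` in latent space. `latent_shapes` is a hypothetical helper, not a pipeline method.

```python
def latent_shapes(batch_size, num_channels_latents, height, width,
                  num_ref_images=0, spatial_factor=8):
    """Return the (latents, image_latents) shapes for single-frame inference.

    spatial_factor=8 is an assumed VAE downsampling factor, not a documented value.
    """
    if height % spatial_factor or width % spatial_factor:
        raise ValueError("height and width must be divisible by spatial_factor")
    latent_h = height // spatial_factor
    latent_w = width // spatial_factor
    frames = 1  # assumption: one image frame corresponds to T = 1
    latents = (batch_size, 1, num_channels_latents, frames, latent_h, latent_w)
    if num_ref_images == 0:
        # Mirrors the documented behaviour: no reference images, no image_latents.
        return latents, None
    image_latents = (batch_size, num_ref_images, num_channels_latents,
                     frames, latent_h, latent_w)
    return latents, image_latents


target_shape, ref_shape = latent_shapes(1, 16, 1024, 768, num_ref_images=1)
```

Under these assumptions, a 1024 by 768 edit with one reference image and 16 latent channels gives `(1, 1, 16, 1, 128, 96)` for both the target and the reference slot.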
## JoyImageEditPipelineOutput[[diffusers.JoyImageEditPipelineOutput]]

#### diffusers.JoyImageEditPipelineOutput[[diffusers.JoyImageEditPipelineOutput]]

[Source](https://github.com/huggingface/diffusers/blob/vr_13098/src/diffusers/pipelines/joyimage/pipeline_output.py#L11)

Output class for JoyImageEdit generation pipelines.