	---
	license: apache-2.0
	pipeline_tag: image-to-image
	tags:
	- comfyui
	- image-editing
	- joyai
	- multi-image
	---

	# JoyAI-Image-Edit-Plus (ComfyUI weights)

	Single-file `.safetensors` checkpoints of [JoyAI-Image-Edit-Plus](https://huggingface.co/jdopensource/JoyAI-Image-Edit-Plus-Diffusers), repackaged for native ComfyUI support (no custom node required).

	JoyAI-Image-Edit-Plus is the multi-image instruction-guided editing model of the [JoyAI-Image](https://github.com/jd-opensource/JoyAI-Image) family. It accepts 1–6 reference images and a text instruction, and generates a new image that combines elements from the references according to the instruction.

	## Files

	| File | Size | Goes into | Component |
	|------|------|-----------|-----------|
	| `diffusion_models/joy_image_edit_plus_bf16.safetensors` | ~31 GB | `ComfyUI/models/diffusion_models/` | `JoyImageEditPlusTransformer3DModel` (bf16) |
	| `text_encoders/qwen3vl_joyimage_bf16.safetensors` | ~17 GB | `ComfyUI/models/text_encoders/` | Qwen3-VL-8B text encoder (bf16) |
	| `vae/joy_image_edit_vae.safetensors` | ~243 MB | `ComfyUI/models/vae/` | `AutoencoderKLWan` |

	The repo layout already matches `ComfyUI/models/`, so a single `hf download` into your models root drops every file where it needs to go.
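After downloading, it can help to confirm that the three files landed where the table says they should. The following is a minimal sketch that only checks for file presence under a models root you pass in; the helper name `missing_files` is illustrative and not part of ComfyUI or `huggingface_hub`.

```python
from pathlib import Path

# Relative paths copied from the Files table above.
EXPECTED_FILES = [
    "diffusion_models/joy_image_edit_plus_bf16.safetensors",
    "text_encoders/qwen3vl_joyimage_bf16.safetensors",
    "vae/joy_image_edit_vae.safetensors",
]


def missing_files(models_root):
    """Return the expected files that are not present under models_root."""
    root = Path(models_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `missing_files("/path/to/ComfyUI/models")` returns an empty list when every file is in place, and otherwise the relative paths that still need to be downloaded.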

	## Model architecture

	- Transformer: 40-layer DiT, hidden size 4096, 32 heads, in/out channels 16, patch size `[1, 2, 2]`, 3D RoPE (`rope_dim_list = [16, 56, 56]`, theta 10000). Each reference image is patchified independently and concatenated on the sequence dimension with a per-image temporal offset in the 3D RoPE grid, so references may differ in resolution.
	- Text encoder: `Qwen3VLForConditionalGeneration` (text dim 4096). The instruction is wrapped with one `<|vision_start|><|image_pad|><|vision_end|>` block per reference image.
	- VAE: `AutoencoderKLWan` (z_dim 16, spatial downscale 8, temporal downscale 4) — the same VAE used by the single-image edit model.
	- Scheduler: FlowMatch (Euler), sampling shift 1.5.
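The numbers in the list above fit together in a way that is easy to check by hand. The sketch below works through the arithmetic under stated assumptions: that the three `rope_dim_list` entries split each attention head across the (time, height, width) axes, that one sequence token covers 16×16 pixels (8× VAE downscale times the 2×2 spatial patch), and that image sides are multiples of 16. The function names are illustrative, and the placeholder helper only builds the run of vision blocks, since the text does not say where the instruction sits relative to them.

```python
HIDDEN_SIZE = 4096
NUM_HEADS = 32
ROPE_DIM_LIST = [16, 56, 56]   # assumed (time, height, width) split per head
PATCH_SIZE = (1, 2, 2)
VAE_SPATIAL_DOWNSCALE = 8
VISION_BLOCK = "<|vision_start|><|image_pad|><|vision_end|>"


def head_dim():
    """Per-head channel count of the transformer."""
    return HIDDEN_SIZE // NUM_HEADS


def tokens_per_image(width, height):
    """Sequence tokens for one image: VAE downscale, then 2x2 patchify."""
    stride_h = VAE_SPATIAL_DOWNSCALE * PATCH_SIZE[1]
    stride_w = VAE_SPATIAL_DOWNSCALE * PATCH_SIZE[2]
    if width % stride_w or height % stride_h:
        raise ValueError("width and height must be multiples of 16")
    return (width // stride_w) * (height // stride_h)


def total_reference_tokens(sizes):
    """Total tokens when references are concatenated on the sequence axis."""
    return sum(tokens_per_image(w, h) for w, h in sizes)


def vision_prefix(num_images):
    """One placeholder block per reference image (1 to 6 images)."""
    if not 1 <= num_images <= 6:
        raise ValueError("the model accepts 1 to 6 reference images")
    return VISION_BLOCK * num_images
```

Under these assumptions the RoPE split sums to the head dimension (16 + 56 + 56 = 128 = 4096 / 32), and a 1024×1024 reference contributes 64 × 64 = 4096 tokens. Because each reference is patchified on its own, mixed resolutions simply add different token counts to the sequence.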

	Weight names are identical to those in the Diffusers checkpoint (894 transformer keys, none renamed), and ComfyUI auto-detects the model as `joyimage`.
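The key-name claim can be checked without loading any weights, because a `.safetensors` file starts with an 8-byte little-endian header length followed by a JSON header whose keys are the tensor names. The sketch below reads only that header using the standard library. The helper names are illustrative, and if the Diffusers transformer is stored as several shards you would need to take the union of the names across them before comparing.

```python
import json
import struct


def tensor_names(path):
    """List tensor names from a .safetensors header without loading weights."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return sorted(name for name in header if name != "__metadata__")


def diff_names(ours, theirs):
    """Return (only_in_ours, only_in_theirs) as sorted lists."""
    a, b = set(ours), set(theirs)
    return sorted(a - b), sorted(b - a)
```

If the names match as described, `len(tensor_names(...))` on the transformer file should report 894 and `diff_names` against the Diffusers key set should return two empty lists.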

	## Installation

	The model runs natively in ComfyUI. Native support is proposed upstream in [Comfy-Org/ComfyUI#14428](https://github.com/Comfy-Org/ComfyUI/pull/14428); until it is merged, install the fork branch:

	```bash
	git clone -b joyimage-edit-pr https://github.com/feice-huang/ComfyUI.git
	cd ComfyUI
	pip install -r requirements.txt
	```

	Once the PR is merged, a stock ComfyUI release will run these weights and the fork is no longer needed.

	Next, download the weights straight into `ComfyUI/models/`:

	```bash
	hf download jdopensource/JoyAI-Image-Edit-Plus-ComfyUI \
	--local-dir /path/to/ComfyUI/models
	```

	Restart ComfyUI.
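If you prefer to script the download, the same result should be achievable from Python with `huggingface_hub.snapshot_download`. This is a sketch rather than an official recipe: the wrapper name `download_weights` and its `patterns` argument are illustrative, while `repo_id`, `local_dir`, and `allow_patterns` are parameters of `snapshot_download` itself.

```python
REPO_ID = "jdopensource/JoyAI-Image-Edit-Plus-ComfyUI"


def download_weights(models_root, patterns=None):
    """Download the repo into models_root, optionally limited to glob patterns."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        local_dir=models_root,
        allow_patterns=patterns,  # None downloads every file in the repo
    )
```

For example, `download_weights("/path/to/ComfyUI/models", patterns=["vae/*"])` would fetch only the VAE, which is useful if you already have the larger files.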

	## Usage

	Example workflow: [workflow_joyimage_edit_plus.json](https://github.com/user-attachments/files/29588811/workflow_joyimage_edit_plus.json)

	Build the graph from these native nodes:

	1. Load Diffusion Model (`UNETLoader`) → `diffusion_models/joy_image_edit_plus_bf16.safetensors`
	2. Load CLIP (`CLIPLoader`) → `text_encoders/qwen3vl_joyimage_bf16.safetensors`, type `joyimage`
	3. Load VAE (`VAELoader`) → `vae/joy_image_edit_vae.safetensors`
	4. Load Image (`LoadImage`) for each reference (1–6)
	5. TextEncodeJoyImageEditPlus: connect `clip`, `vae`, and the instruction, and wire the reference images to `image1`…`image6`. Use one instance for the positive prompt and a second one, with an empty prompt and the same images, for the negative. Each node bucket-resizes the references to the 1024-base buckets, VAE-encodes them, and appends the reference latents to the conditioning; its `image` output feeds `VAEDecode` / empty-latent sizing.
	6. KSampler → VAEDecode → SaveImage
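To give a feel for what "bucket-resizing to the 1024-base buckets" in step 5 might do to a reference, here is a rough sketch. It is not the node's implementation: the actual bucket table is not given here, so this version assumes an area-preserving resize toward 1024×1024 pixels with both sides snapped to a multiple of 16. The names `bucket_size` and `plan_references` are hypothetical.

```python
import math

BASE = 1024     # "1024-base" from the text
MULTIPLE = 16   # assumption: 8x VAE downscale times 2x2 patching


def bucket_size(width, height, base=BASE, multiple=MULTIPLE):
    """Pick a (width, height) near base*base pixels that keeps the aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = math.sqrt(base * base / (width * height))

    def snap(value):
        # Round to the nearest multiple, never below one multiple.
        return max(multiple, round(value * scale / multiple) * multiple)

    return snap(width), snap(height)


def plan_references(sizes):
    """Validate the 1-6 reference limit and return a bucket per reference."""
    if not 1 <= len(sizes) <= 6:
        raise ValueError("the model accepts 1 to 6 reference images")
    return [bucket_size(w, h) for w, h in sizes]
```

With this approximation a 1920×1080 reference maps to 1360×768, and each reference is sized independently, which matches the note that references may differ in resolution.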

	## Recommended parameters

	| Parameter | Value |
	|-----------|-------|
	| Steps | 30 |
	| CFG | 4.0 |
	| Sampler | `euler` |
	| Scheduler | `simple` |
	| dtype | bf16 |
	| Resolution | auto (1024-base buckets, per reference) |
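For intuition about how these settings interact, the sketch below shows a shifted sigma schedule, classifier-free guidance, and a single Euler update in plain Python. Several things here are assumptions rather than facts about this model or ComfyUI: the shift is taken to be the common flow-matching form `shift * t / (1 + (shift - 1) * t)`, the `simple` scheduler is approximated by uniform spacing in `t`, and the model is assumed to predict a velocity. Lists of floats stand in for latent tensors.

```python
def shifted_sigmas(steps=30, shift=1.5):
    """Sigmas from 1.0 down to 0.0 with a flow-matching time shift applied."""
    sigmas = []
    for i in range(steps + 1):
        t = 1.0 - i / steps  # uniform time from 1 to 0 (stand-in for `simple`)
        sigmas.append(shift * t / (1.0 + (shift - 1.0) * t))
    return sigmas


def cfg_combine(cond, uncond, cfg=4.0):
    """Classifier-free guidance: push the prediction away from the negative."""
    return [u + cfg * (c - u) for c, u in zip(cond, uncond)]


def euler_step(x, velocity, sigma, sigma_next):
    """One Euler step of the ODE dx/dsigma = velocity."""
    return [xi + (sigma_next - sigma) * vi for xi, vi in zip(x, velocity)]
```

A shift above 1 keeps sigma higher for longer: at the midpoint of the schedule the unshifted value 0.5 becomes 0.6, so more of the 30 steps are spent at high noise levels.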

	## Example

	Prompt: "The woman is lovingly holding the cute puppy in her arms"

	| Input 0 | Input 1 | Output |
	|---------|---------|--------|
	| ![input_0](https://huggingface.co/jdopensource/JoyAI-Image-Edit-Plus-Diffusers/resolve/main/examples/input_0.png) | ![input_1](https://huggingface.co/jdopensource/JoyAI-Image-Edit-Plus-Diffusers/resolve/main/examples/input_1.png) | ![output](https://huggingface.co/jdopensource/JoyAI-Image-Edit-Plus-Diffusers/resolve/main/examples/output.png) |

	## Model details

	- Developed by: JD.com
	- License: Apache-2.0
	- Framework: PyTorch / ComfyUI

	## Links

	- Source code and documentation: [github.com/jd-opensource/JoyAI-Image](https://github.com/jd-opensource/JoyAI-Image)
	- Original Diffusers-format weights: [jdopensource/JoyAI-Image-Edit-Plus-Diffusers](https://huggingface.co/jdopensource/JoyAI-Image-Edit-Plus-Diffusers)
	- Single-image edit model (ComfyUI): [jdopensource/JoyAI-Image-Edit-ComfyUI](https://huggingface.co/jdopensource/JoyAI-Image-Edit-ComfyUI)

	## Citation

	```bibtex
	@misc{joyai-image-2025,
	title={JoyAI-Image: A Unified Multimodal Foundation Model for Image Understanding, Generation, and Editing},
	author={Joy Future Academy, JD},
	year={2025},
	url={https://github.com/jd-opensource/JoyAI-Image}
	}
	```