<h1 align="center">JoyAI-Image-Edit<br><sub><sup>Awakening Spatial Intelligence in Unified Multimodal Understanding and Generation</sup></sub></h1>

<div align="center">

[![Report PDF](https://img.shields.io/badge/Report-PDF-red)](https://joyai-image.s3.cn-north-1.jdcloud-oss.com/JoyAI-Image.pdf)
[![Project](https://img.shields.io/badge/Project-JoyAI--Image-333399)](https://github.com/jd-opensource/JoyAI-Image)
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Checkpoint-JoyAI--Image--Edit-yellow)](https://huggingface.co/jdopensource/JoyAI-Image-Edit)&#160;
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)


</div>

## 🐶 JoyAI-Image-Edit

JoyAI-Image-Edit is a multimodal foundation model specialized in instruction-guided image editing. It leverages strong spatial understanding (scene parsing, relational grounding, and instruction decomposition) to apply complex modifications precisely and controllably to the specified regions of an image.

## 🚀 Quick Start

### 1. Environment Setup

**Requirements**: Python >= 3.10, CUDA-capable GPU

Create a virtual environment and install:

```bash
git clone https://github.com/jd-opensource/JoyAI-Image
cd JoyAI-Image
conda create -n joyai python=3.10 -y
conda activate joyai

pip install -e .
```

> **Note on Flash Attention**: `flash-attn >= 2.8.0` is a required dependency and provides the fast attention kernel used for best performance.

#### Core Dependencies

| Package | Version | Purpose |
|---------|---------|---------|
| `torch` | >= 2.8 | PyTorch |
| `transformers` | >= 4.57.0, < 4.58.0 | Text encoder |
| `diffusers` | >= 0.34.0 | Pipeline utilities |
| `flash-attn` | >= 2.8.0  | Fast attention kernel |
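To sanity-check an environment against the table above, a minimal sketch using only the standard library (the minimum versions are taken from the table; the authoritative pins live in the project's own setup metadata):

```python
from importlib.metadata import version, PackageNotFoundError


def version_tuple(v: str) -> tuple:
    """Parse the leading numeric part of a version, e.g. '2.8.0+cu121' -> (2, 8, 0)."""
    parts = []
    for piece in v.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


# Minimum versions from the dependency table above
REQUIREMENTS = {
    "torch": (2, 8),
    "transformers": (4, 57, 0),
    "diffusers": (0, 34, 0),
    "flash-attn": (2, 8, 0),
}


def check_environment(requirements=REQUIREMENTS) -> dict:
    """Return {package: status}, where status is 'ok', 'too old', or 'missing'."""
    report = {}
    for pkg, minimum in requirements.items():
        try:
            installed = version_tuple(version(pkg))
            report[pkg] = "ok" if installed >= minimum else "too old"
        except PackageNotFoundError:
            report[pkg] = "missing"
    return report


if __name__ == "__main__":
    for pkg, status in check_environment().items():
        print(f"{pkg}: {status}")
```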


### 2. Inference

#### Image Editing

```bash
python inference.py \
  --ckpt-root /path/to/ckpts_infer \
  --prompt "Turn the plate blue" \
  --image test_images/test_1.jpg \
  --output outputs/result.png \
  --seed 123 \
  --steps 30 \
  --guidance-scale 5.0 \
  --basesize 1024
```

---

### CLI Reference (`inference.py`)

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--ckpt-root` | str | *required* | Checkpoint root |
| `--prompt` | str | *required* | Edit instruction or T2I prompt |
| `--image` | str | None | Input image path (required for editing, omit for T2I) |
| `--output` | str | `example.png` | Output image path |
| `--steps` | int | 50 | Denoising steps |
| `--guidance-scale` | float | 5.0 | Classifier-free guidance scale |
| `--seed` | int | 42 | Random seed for reproducibility |
| `--neg-prompt` | str | `""` | Negative prompt |
| `--basesize` | int | 1024 | Bucket base size for input image resizing (256/512/768/1024) |
| `--config` | str | auto | Config path; defaults to `<ckpt-root>/infer_config.py` |
| `--rewrite-prompt` | flag | off | Enable LLM-based prompt rewriting |
| `--rewrite-model` | str | `gpt-5` | Model name for prompt rewriting |
| `--hsdp-shard-dim` | int | 1 | FSDP shard dimension for multi-GPU (set to GPU count) |


### Spatial Editing Reference

JoyAI-Image supports three spatial editing prompt patterns: **Object Move**, **Object Rotation**, and **Camera Control**. For the most stable behavior, we recommend following the prompt templates below as closely as possible.

#### 1. Object Move

Use this pattern when you want to move a target object into a specified region.

**Prompt template:**

```text
Move the <object> into the red box and finally remove the red box.
```

**Rules:**

* Replace `<object>` with a clear description of the target object to be moved.
* The **red box** indicates the target destination in the image.
* The phrase **"finally remove the red box"** means the guidance box should not appear in the final edited result.

**Example:**

```text
Move the apple into the red box and finally remove the red box.
```
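The red guidance box has to be drawn onto the input image before inference. A minimal sketch with Pillow (the stroke width is an assumption; anything that renders a clearly visible red rectangle should work):

```python
from PIL import Image, ImageDraw


def draw_target_box(image_path, box, out_path, width=5):
    """Draw a red guidance box at `box` = (left, top, right, bottom) and save."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    draw.rectangle(box, outline=(255, 0, 0), width=width)
    img.save(out_path)
    return out_path


# draw_target_box("test_images/test_1.jpg", (300, 200, 500, 400),
#                 "test_images/test_1_boxed.jpg")
```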

#### 2. Object Rotation

Use this pattern when you want to rotate an object to a specific canonical view.

**Prompt template:**

```text
Rotate the <object> to show the <view> side view.
```

**Supported `<view>` values:**

* `front`
* `right`
* `left`
* `rear`
* `front right`
* `front left`
* `rear right`
* `rear left`

**Rules:**

* Replace `<object>` with a clear description of the object to rotate.
* Replace `<view>` with one of the supported directions above.
* This instruction is intended to change the **object orientation**, while keeping the object identity and surrounding scene as consistent as possible.

**Examples:**

```text
Rotate the chair to show the front side view.
Rotate the car to show the rear left side view.
```
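To keep rotation prompts inside the supported vocabulary, a small validating helper (a sketch; the template string and view list are exactly the ones above):

```python
# The eight view values supported by the Object Rotation template
SUPPORTED_VIEWS = {
    "front", "right", "left", "rear",
    "front right", "front left", "rear right", "rear left",
}


def rotation_prompt(obj: str, view: str) -> str:
    """Build an Object Rotation prompt, rejecting unsupported views."""
    if view not in SUPPORTED_VIEWS:
        raise ValueError(
            f"unsupported view {view!r}; choose one of {sorted(SUPPORTED_VIEWS)}"
        )
    return f"Rotate the {obj} to show the {view} side view."
```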

#### 3. Camera Control

Use this pattern when you want to change only the camera viewpoint while keeping the 3D scene itself unchanged.

**Prompt template:**

```text
Move the camera.
- Camera rotation: Yaw {y_rotation}°, Pitch {p_rotation}°.
- Camera zoom: in/out/unchanged.
- Keep the 3D scene static; only change the viewpoint.
```

**Rules:**

* `{y_rotation}` specifies the yaw rotation angle in degrees.
* `{p_rotation}` specifies the pitch rotation angle in degrees.
* `Camera zoom` must be one of:

  * `in`
  * `out`
  * `unchanged`
* The last line is important: it explicitly tells the model to preserve the 3D scene content and geometry, and only adjust the camera viewpoint.

**Examples:**

```text
Move the camera.
- Camera rotation: Yaw 45°, Pitch 0°.
- Camera zoom: in.
- Keep the 3D scene static; only change the viewpoint.
```

```text
Move the camera.
- Camera rotation: Yaw -90°, Pitch 20°.
- Camera zoom: unchanged.
- Keep the 3D scene static; only change the viewpoint.
```
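The three-line camera prompt can likewise be generated programmatically. A minimal sketch (the template follows the block above verbatim):

```python
def camera_prompt(yaw: float, pitch: float, zoom: str = "unchanged") -> str:
    """Build a Camera Control prompt from yaw/pitch angles (degrees) and a zoom mode."""
    if zoom not in ("in", "out", "unchanged"):
        raise ValueError("zoom must be 'in', 'out', or 'unchanged'")
    return (
        "Move the camera.\n"
        f"- Camera rotation: Yaw {yaw}\u00b0, Pitch {pitch}\u00b0.\n"
        f"- Camera zoom: {zoom}.\n"
        "- Keep the 3D scene static; only change the viewpoint."
    )
```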

## License Agreement

JoyAI-Image is licensed under Apache 2.0.

## ☎️  We're Hiring!

We are actively hiring Research Scientists, Engineers, and Interns to join us in building next-generation generative foundation models and bringing them into real-world applications. If you’re interested, please send your resume to: [huanghaoyang.ocean@jd.com](mailto:huanghaoyang.ocean@jd.com)