jdopensource/JoyAI-Image-Edit

stevengrove committed 2 days ago

Commit 18b1449 (verified) · 1 Parent(s): d96a0f9

Update README.md

Files changed (1)

README.md +134 -36


@@ -13,7 +13,7 @@ Welcome to the official project page for **JoyAI-Image**.
 ## 🐶 JoyAI-Image
-JoyAI-Image is a **unified multimodal foundation model** for image understanding, text-to-image generation, and instruction-guided image editing. It combines an 8B Multimodal Large Language Model (MLLM) with a 16B Multimodal Diffusion Transformer (MMDiT). A central principle of JoyAI-Image is the **closed-loop collaboration between understanding, generation, and editing**. Stronger spatial understanding improves grounded generation and contrallable editing through better scene parsing, relational grounding, and instruction decomposition, while generative transformations such as viewpoint changes provide complementary evidence for spatial reasoning.
 ![JoyAI-Image Architecture](https://joyai-image.s3.cn-north-1.jdcloud-oss.com/assests/architecture.png)
@@ -64,19 +64,18 @@ conda create -n joyai python=3.10 -y
 conda activate joyai
 pip install -e .
-```
 > **Note on Flash Attention**: `flash-attn >= 2.8.0` is listed as a dependency for best performance.
 #### Core Dependencies
-| Package | Version | Purpose |
-|---------|---------|---------|
-| `torch` | >= 2.8 | PyTorch |
-| `transformers` | >= 4.57.0, < 4.58.0 | Text encoder |
-| `diffusers` | >= 0.34.0 | Pipeline utilities |
-| `flash-attn` | >= 2.8.0  | Fast attention kernel |
 ### 2. Inference
@@ -108,40 +107,139 @@ python inference.py \
 ### CLI Reference (`inference.py`)
-| Argument | Type | Default | Description |
-|----------|------|---------|-------------|
-| `--ckpt-root` | str | *required* | Checkpoint root |
-| `--prompt` | str | *required* | Edit instruction or T2I prompt |
-| `--image` | str | None | Input image path (required for editing, omit for T2I) |
-| `--output` | str | `example.png` | Output image path |
-| `--steps` | int | 50 | Denoising steps |
-| `--guidance-scale` | float | 5.0 | Classifier-free guidance scale |
-| `--seed` | int | 42 | Random seed for reproducibility |
-| `--neg-prompt` | str | `""` | Negative prompt |
-| `--basesize` | int | 1024 | Bucket base size for input image resizing (256/512/768/1024) |
-| `--config` | str | auto | Config path; defaults to `<ckpt-root>/infer_config.py` |
-| `--rewrite-prompt` | flag | off | Enable LLM-based prompt rewriting |
-| `--rewrite-model` | str | `gpt-5` | Model name for prompt rewriting |
-| `--hsdp-shard-dim` | int | 1 | FSDP shard dimension for multi-GPU (set to GPU count) |
 ### CLI Reference (`inference_und.py`)
-| Argument | Type | Default | Description |
-|----------|------|---------|-------------|
-| `--ckpt-root` | str | *required* | Checkpoint root containing `text_encoder/` |
-| `--image` | str | *required* | Input image path, or comma-separated paths for multiple images |
-| `--prompt` | str | `"Describe this image in detail."` | User question or instruction. When omitted, defaults to image captioning |
-| `--max-new-tokens` | int | 2048 | Maximum number of tokens to generate |
-| `--temperature` | float | 0.7 | Sampling temperature. Use `0` for greedy decoding |
-| `--top-p` | float | 0.8 | Top-p (nucleus) sampling threshold |
-| `--top-k` | int | 50 | Top-k sampling threshold |
-| `--output` | str | None | Optional output file to save the response text |
 ## License Agreement
-JoyAI-Image is licensed under Apache 2.0.
 ## ☎️ We're Hiring!
-We are actively hiring Research Scientists, Engineers, and Interns to join us in building next-generation generative foundation models and bringing them into real-world applications. If you’re interested, please send your resume to: huanghaoyang.ocean@jd.com

 ## 🐶 JoyAI-Image
+JoyAI-Image is a **unified multimodal foundation model** for image understanding, text-to-image generation, and instruction-guided image editing. It combines an 8B Multimodal Large Language Model (MLLM) with a 16B Multimodal Diffusion Transformer (MMDiT). A central principle of JoyAI-Image is the **closed-loop collaboration between understanding, generation, and editing**. Stronger spatial understanding improves grounded generation and controllable editing through better scene parsing, relational grounding, and instruction decomposition, while generative transformations such as viewpoint changes provide complementary evidence for spatial reasoning.
 ![JoyAI-Image Architecture](https://joyai-image.s3.cn-north-1.jdcloud-oss.com/assests/architecture.png)
 conda activate joyai
 pip install -e .
+````
 > **Note on Flash Attention**: `flash-attn >= 2.8.0` is listed as a dependency for best performance.
 #### Core Dependencies
+| Package        | Version             | Purpose               |
+| -------------- | ------------------- | --------------------- |
+| `torch`        | >= 2.8              | PyTorch               |
+| `transformers` | >= 4.57.0, < 4.58.0 | Text encoder          |
+| `diffusers`    | >= 0.34.0           | Pipeline utilities    |
+| `flash-attn`   | >= 2.8.0            | Fast attention kernel |
 ### 2. Inference
 ### CLI Reference (`inference.py`)
+| Argument           | Type  | Default       | Description                                                  |
+| ------------------ | ----- | ------------- | ------------------------------------------------------------ |
+| `--ckpt-root`      | str   | *required*    | Checkpoint root                                              |
+| `--prompt`         | str   | *required*    | Edit instruction or T2I prompt                               |
+| `--image`          | str   | None          | Input image path (required for editing, omit for T2I)        |
+| `--output`         | str   | `example.png` | Output image path                                            |
+| `--steps`          | int   | 50            | Denoising steps                                              |
+| `--guidance-scale` | float | 5.0           | Classifier-free guidance scale                               |
+| `--seed`           | int   | 42            | Random seed for reproducibility                              |
+| `--neg-prompt`     | str   | `""`          | Negative prompt                                              |
+| `--basesize`       | int   | 1024          | Bucket base size for input image resizing (256/512/768/1024) |
+| `--config`         | str   | auto          | Config path; defaults to `<ckpt-root>/infer_config.py`       |
+| `--rewrite-prompt` | flag  | off           | Enable LLM-based prompt rewriting                            |
+| `--rewrite-model`  | str   | `gpt-5`       | Model name for prompt rewriting                              |
+| `--hsdp-shard-dim` | int   | 1             | FSDP shard dimension for multi-GPU (set to GPU count)        |
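Reviewer note: the flags in the `inference.py` table can be assembled programmatically. The following is a minimal sketch, not part of the repository: `build_inference_cmd` is a hypothetical helper, the paths `ckpts` and `input.png` are placeholders, and treating 256/512/768/1024 as the only valid `--basesize` values is an assumption read off the table's description. It only builds the argument list and does not launch the script.

```python
import shlex

# Bucket sizes listed in the --basesize description (assumed to be the only valid values).
ALLOWED_BASESIZES = (256, 512, 768, 1024)


def build_inference_cmd(ckpt_root, prompt, image=None, output="example.png",
                        steps=50, guidance_scale=5.0, seed=42, basesize=1024):
    """Assemble an argv list for inference.py using only flags from the table."""
    if basesize not in ALLOWED_BASESIZES:
        raise ValueError(f"basesize must be one of {ALLOWED_BASESIZES}, got {basesize}")
    cmd = [
        "python", "inference.py",
        "--ckpt-root", ckpt_root,
        "--prompt", prompt,
        "--output", output,
        "--steps", str(steps),
        "--guidance-scale", str(guidance_scale),
        "--seed", str(seed),
        "--basesize", str(basesize),
    ]
    if image is not None:
        # Per the table: pass --image for editing, omit it for text-to-image.
        cmd += ["--image", image]
    return cmd


# Hypothetical editing call; shlex.join quotes the prompt for a POSIX shell.
print(shlex.join(build_inference_cmd("ckpts", "Turn the sky orange.", image="input.png")))
```

Returning a list rather than a single string keeps prompts with spaces or quotes intact if the list is later handed to `subprocess.run`.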
 ### CLI Reference (`inference_und.py`)
+| Argument           | Type  | Default                            | Description                                                              |
+| ------------------ | ----- | ---------------------------------- | ------------------------------------------------------------------------ |
+| `--ckpt-root`      | str   | *required*                         | Checkpoint root containing `text_encoder/`                               |
+| `--image`          | str   | *required*                         | Input image path, or comma-separated paths for multiple images           |
+| `--prompt`         | str   | `"Describe this image in detail."` | User question or instruction. When omitted, defaults to image captioning |
+| `--max-new-tokens` | int   | 2048                               | Maximum number of tokens to generate                                     |
+| `--temperature`    | float | 0.7                                | Sampling temperature. Use `0` for greedy decoding                        |
+| `--top-p`          | float | 0.8                                | Top-p (nucleus) sampling threshold                                       |
+| `--top-k`          | int   | 50                                 | Top-k sampling threshold                                                 |
+| `--output`         | str   | None                               | Optional output file to save the response text                           |
+### Spatial Editing Reference
+JoyAI-Image supports three spatial editing prompt patterns: **Object Move**, **Object Rotation**, and **Camera Control**. For the most stable behavior, we recommend following the prompt templates below as closely as possible.
+#### 1. Object Move
+Use this pattern when you want to move a target object into a specified region.
+**Prompt template:**
+```text
+Move the <object> into the red box and finally remove the red box.
+```
+**Rules:**
+* Replace `<object>` with a clear description of the target object to be moved.
+* The **red box** indicates the target destination in the image.
+* The phrase **"finally remove the red box"** means the guidance box should not appear in the final edited result.
+**Example:**
+```text
+Move the apple into the red box and finally remove the red box.
+```
+#### 2. Object Rotation
+Use this pattern when you want to rotate an object to a specific canonical view.
+**Prompt template:**
+```text
+Rotate the <object> to show the <view> side view.
+```
+**Supported `<view>` values:**
+* `front`
+* `right`
+* `left`
+* `rear`
+* `front right`
+* `front left`
+* `rear right`
+* `rear left`
+**Rules:**
+* Replace `<object>` with a clear description of the object to rotate.
+* Replace `<view>` with one of the supported directions above.
+* This instruction is intended to change the **object orientation**, while keeping the object identity and surrounding scene as consistent as possible.
+**Examples:**
+```text
+Rotate the chair to show the front side view.
+Rotate the car to show the rear left side view.
+```
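Reviewer note: since the README recommends following the Object Move and Object Rotation templates closely, generating the prompts from code avoids wording drift. This is a small sketch under that assumption; the function names and the `SUPPORTED_VIEWS` constant are hypothetical and do not come from the repository. Only the template strings and the list of views are taken from the text above.

```python
# The eight view values listed in the README.
SUPPORTED_VIEWS = (
    "front", "right", "left", "rear",
    "front right", "front left", "rear right", "rear left",
)


def object_move_prompt(obj: str) -> str:
    """Fill the Object Move template with a target object description."""
    return f"Move the {obj} into the red box and finally remove the red box."


def object_rotation_prompt(obj: str, view: str) -> str:
    """Fill the Object Rotation template, rejecting views the README does not list."""
    if view not in SUPPORTED_VIEWS:
        raise ValueError(f"unsupported view {view!r}; expected one of {SUPPORTED_VIEWS}")
    return f"Rotate the {obj} to show the {view} side view."


print(object_move_prompt("apple"))
print(object_rotation_prompt("car", "rear left"))
```

The move prompt still requires an input image in which the destination is marked with a red box; this sketch only produces the text.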
+#### 3. Camera Control
+Use this pattern when you want to change only the camera viewpoint while keeping the 3D scene itself unchanged.
+**Prompt template:**
+```text
+Move the camera.
+- Camera rotation: Yaw {y_rotation}°, Pitch {p_rotation}°.
+- Camera zoom: in/out/unchanged.
+- Keep the 3D scene static; only change the viewpoint.
+```
+**Rules:**
+* `{y_rotation}` specifies the yaw rotation angle in degrees.
+* `{p_rotation}` specifies the pitch rotation angle in degrees.
+* `Camera zoom` must be one of:
+  * `in`
+  * `out`
+  * `unchanged`
+* The last line is important: it explicitly tells the model to preserve the 3D scene content and geometry, and only adjust the camera viewpoint.
+**Examples:**
+```text
+Move the camera.
+- Camera rotation: Yaw 45°, Pitch 0°.
+- Camera zoom: in.
+- Keep the 3D scene static; only change the viewpoint.
+```
+```text
+Move the camera.
+- Camera rotation: Yaw -90°, Pitch 20°.
+- Camera zoom: unchanged.
+- Keep the 3D scene static; only change the viewpoint.
+```
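Reviewer note: the Camera Control template has three variable fields (yaw, pitch, zoom) and one fixed closing line, so it is easy to render from a function. The sketch below is illustrative only: `camera_control_prompt` is a hypothetical name, and formatting angles with `:g` (so that `45.0` prints as `45`) is an assumption about the preferred number style, since the README shows only integer angles.

```python
# Zoom values allowed by the README's rules.
ZOOM_OPTIONS = ("in", "out", "unchanged")


def camera_control_prompt(yaw: float, pitch: float, zoom: str = "unchanged") -> str:
    """Render the four-line Camera Control prompt from yaw, pitch, and zoom."""
    if zoom not in ZOOM_OPTIONS:
        raise ValueError(f"zoom must be one of {ZOOM_OPTIONS}, got {zoom!r}")
    return "\n".join([
        "Move the camera.",
        # ':g' drops a trailing '.0' so integer-valued angles match the README examples.
        f"- Camera rotation: Yaw {yaw:g}°, Pitch {pitch:g}°.",
        f"- Camera zoom: {zoom}.",
        # Fixed last line: the README stresses that it must be kept.
        "- Keep the 3D scene static; only change the viewpoint.",
    ])


print(camera_control_prompt(-90, 20))
```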
 ## License Agreement
+JoyAI-Image is licensed under Apache 2.0.
 ## ☎️ We're Hiring!
+We are actively hiring Research Scientists, Engineers, and Interns to join us in building next-generation generative foundation models and bringing them into real-world applications. If you’re interested, please send your resume to: [huanghaoyang.ocean@jd.com](mailto:huanghaoyang.ocean@jd.com)