---
license: apache-2.0
datasets:
- opendiffusionai/laion2b-squareish-1536px
base_model:
- Tongyi-MAI/Z-Image
tags:
- z-image
- controlnet
thumbnail: https://huggingface.co/neuralvfx/Z-Image-SAM-ControlNet/resolve/main/assets/stacked_vertical.png
---

# Z-Image-SAM-ControlNet
![side by side](assets/side_by_side_d.png)
## Fun Facts
- This ControlNet is trained exclusively on images generated by [Segment Anything (SAM)](https://aidemos.meta.com/segment-anything/)
- Base model used was [Tongyi-MAI/Z-Image](https://huggingface.co/Tongyi-MAI/Z-Image)
- Takes SAM-style segmentation images as input and outputs photorealistic images
- Trained at 1024x1024 resolution; inference works best at 1.5k resolution and above
- Trained on 220K segmented images from [laion2b-squareish-1536px](https://huggingface.co/datasets/opendiffusionai/laion2b-squareish-1536px)
- Trained using this repo: [https://github.com/aigc-apps/VideoX-Fun](https://github.com/aigc-apps/VideoX-Fun)
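
Since the model was trained at 1024x1024 but infers best at 1.5k and up, one practical step is upscaling the control image's dimensions before inference. A minimal sketch, assuming the pipeline wants dimensions that are multiples of 16 (a common latent-grid constraint for diffusion transformers; check your pipeline's actual requirement):

```python
def upscale_for_inference(width: int, height: int,
                          target_short_side: int = 1536,
                          multiple: int = 16) -> tuple[int, int]:
    """Scale (width, height) so the short side reaches at least
    `target_short_side`, snapping each side to a multiple of
    `multiple`. Never downscales."""
    scale = max(1.0, target_short_side / min(width, height))
    snap = lambda v: max(multiple, round(v * scale / multiple) * multiple)
    return snap(width), snap(height)

# A 1024x1024 SAM image is bumped to 1536x1536 for inference.
print(upscale_for_inference(1024, 1024))  # (1536, 1536)
```

The `target_short_side` of 1536 is chosen to match the 1536px training dataset; any value around 1.5k should behave similarly.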

# Showcases
<table style="width:100%; table-layout:fixed;">
  <tr>
    <td><img src="./assets/resized_kitten_seg.png" ></td>
    <td><img src="./assets/resized_kitten.png" ></td>
  </tr>
  <tr>
    <td><img src="./assets/resized_dread_girl_seg.png" ></td>
    <td><img src="./assets/resized_dread_girl.png" ></td>
  </tr>
  <tr>
    <td><img src="./assets/resized_house_seg.png" ></td>
    <td><img src="./assets/resized_house.png" ></td>
  </tr>
</table>

# ComfyUI Usage
1) Copy the weights from [comfy-ui-patch/z-image-sam-controlnet.safetensors](comfy-ui-patch/z-image-sam-controlnet.safetensors) to `ComfyUI/models/model_patches`
2) Use `ModelPatchLoader` to load the patch
3) Plug `MODEL_PATCH` into the `model_patch` input on `ZImageFunControlnet`
4) Plug the model, VAE, and control image into `ZImageFunControlnet`
5) Plug the `ZImageFunControlnet` output into the `KSampler`
![videoXFun Nodes](assets/comfyui.png)

## Add Auto Segmentation (optional)
1) Use the ComfyUI Manager to add [ComfyUI-segment-anything-2](https://github.com/kijai/ComfyUI-segment-anything-2)
2) Use `Sam2AutoSegmentation` node to create segmented image

Here's an example workflow JSON: [comfy-ui-patch/z-image-control.json](https://huggingface.co/neuralvfx/Z-Image-SAM-ControlNet/blob/main/comfy-ui-patch/z-image-control.json) (it includes an option that performs segmentation first).

# Hugging Face Usage
## Compatibility
```bash
pip install -U diffusers==0.37.0
```

## Download
```bash
sudo apt-get install git-lfs
git lfs install

git clone https://huggingface.co/neuralvfx/Z-Image-SAM-ControlNet

cd Z-Image-SAM-ControlNet
```

## Inference
```python
import torch
from diffusers.utils import load_image
from diffusers_local.pipeline_z_image_control_unified import ZImageControlUnifiedPipeline
from diffusers_local.z_image_control_transformer_2d import ZImageControlTransformer2DModel

transformer = ZImageControlTransformer2DModel.from_pretrained(
    ".",
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
    add_control_noise_refiner=True,
)

pipe = ZImageControlUnifiedPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image",
    torch_dtype=torch.bfloat16,
    transformer=transformer,
)

pipe.enable_model_cpu_offload()

image = pipe(
    prompt="some beach wood washed up on the sunny sand, spelling the words z-image, with footprints and waves crashing",
    # Negative prompt (Chinese): "low resolution, low quality, deformed limbs,
    # deformed fingers, oversaturated, waxy look, faces without detail, overly
    # smooth, AI-looking image, chaotic composition, blurry or distorted text."
    negative_prompt="低分辨率,低画质,肢体畸形,手指畸形,画面过饱和,蜡像感,人脸无细节,过度光滑,画面具有AI感。构图混乱。文字模糊,扭曲。",
    control_image=load_image("assets/z-image.png"),
    height=1024,
    width=1024,
    num_inference_steps=50,
    guidance_scale=4.0,
    controlnet_conditioning_scale=1.0,
    generator=torch.Generator("cuda").manual_seed(45),
).images[0]

image.save("output.png")
```
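
The `control_image` above is a SAM-style segmentation rendering. If you produce an integer label mask with any segmenter, here is a minimal sketch of turning it into a flat-colored image like SAM's demo output (the exact palette is an assumption; SAM assigns an arbitrary distinct color per region, and this ControlNet was trained on such renderings rather than a fixed color map):

```python
import numpy as np

def labels_to_sam_style(labels: np.ndarray, seed: int = 0) -> np.ndarray:
    """Map an (H, W) integer label mask to an (H, W, 3) uint8 image,
    giving each region a random flat color, as in SAM's demo renderings."""
    rng = np.random.default_rng(seed)
    palette = rng.integers(0, 256, size=(int(labels.max()) + 1, 3),
                           dtype=np.uint8)
    return palette[labels]

# Toy 4-region mask -> flat-colored control image.
mask = np.zeros((64, 64), dtype=np.int64)
mask[:32, 32:] = 1
mask[32:, :32] = 2
mask[32:, 32:] = 3
control = labels_to_sam_style(mask)
print(control.shape, control.dtype)  # (64, 64, 3) uint8
```

Wrap the resulting array in `PIL.Image.fromarray(control)` to pass it as `control_image` to the pipeline.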