bubbliiiing committed · Commit a3e5080 · 1 parent: 1fc23a9

Update 2601
README.md CHANGED
@@ -8,10 +8,21 @@ library_name: videox_fun
  [![Github](https://img.shields.io/badge/🎬%20Code-VideoX_Fun-blue)](https://github.com/aigc-apps/VideoX-Fun)

  ## Update
  - During testing, we found that applying ControlNet to Z-Image-Turbo caused the model to lose its acceleration capability and become blurry. We performed 8-step distillation on the version 2.1 model, and the distilled model demonstrates better performance when using 8-step prediction. Additionally, we have uploaded a tile model that can be used for super-resolution generation. [2025.12.22]
  - Due to a typo in version 2.0, `control_layers` was used instead of `control_noise_refiner` to process refiner latents during training. Although the model converged normally, inference was slow because the `control_layers` forward pass was performed twice. In version 2.1, we made an urgent fix and the speed has returned to normal. [2025.12.17]

  ## Model Card
  | Name | Description |
  |--|--|
  | Z-Image-Turbo-Fun-Controlnet-Union-2.1-8steps.safetensors | Based on version 2.1, the model was distilled using an 8-step distillation algorithm. 8-step prediction is recommended. Compared to version 2.1, when using 8-step prediction, the images are clearer and the composition is more reasonable. |
@@ -20,7 +31,7 @@ library_name: videox_fun
  | Z-Image-Turbo-Fun-Controlnet-Union-2.0.safetensors | ControlNet weights for Z-Image-Turbo. Compared to version 1.0, it adds modifications to more layers and was trained for a longer time. However, due to a typo in the code, the layer blocks were forwarded twice, resulting in slower speed. The model supports multiple control conditions such as Canny, Depth, Pose, MLSD, etc. Additionally, the model lost some of its acceleration capability after training, thus requiring more steps. |

  ## Model Features
- - This ControlNet is added on 15 layer blocks and 2 refiner layer blocks. It supports multiple control conditions, including Canny, HED, Depth, Pose, and MLSD, and can be used like a standard ControlNet.
  - Inpainting mode is also supported.
  - Training Process:
  - 2.0: The model was trained from scratch for 70,000 steps on a dataset of 1 million high-quality images covering both general and human-centric content. Training was performed at 1328 resolution using BFloat16 precision, with a batch size of 64, a learning rate of 2e-5, and a text dropout ratio of 0.10.
@@ -32,11 +43,36 @@ library_name: videox_fun
  - You can adjust `control_context_scale` for stronger control and better detail preservation. For better stability, we highly recommend using a detailed prompt. The optimal range for `control_context_scale` is 0.65 to 0.90.
  - During testing, in versions 2.0 and 2.1, we found that applying ControlNet to Z-Image-Turbo caused the model to lose its acceleration capability and produce blurry images. For details on strength and step-count testing, see [Scale Test Results](#scale-test-results); these results were generated with version 2.0.

- ## TODO
- - [ ] Train on better data.
-
  ## Results
- ### Difference between 2.1 and 2.1-8steps.

  8 steps results:
  <table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
@@ -50,7 +86,66 @@ library_name: videox_fun
  </tr>
  </table>

- ### Generation Results
  <table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
  <tr>
  <td>Pose + Inpaint</td>
 
  [![Github](https://img.shields.io/badge/🎬%20Code-VideoX_Fun-blue)](https://github.com/aigc-apps/VideoX-Fun)

  ## Update
+ - A new lite model has been added with Control Latents applied on 5 layers (only 1.9 GB). The previous Control model had two issues: insufficient mask randomness, which caused the model to learn mask patterns and auto-fill during inpainting, and overfitting between control and tile distillation, which caused artifacts at large `control_context_scale` values. Both the Control and Tile models have been retrained with enriched mask varieties and improved training schedules. Additionally, the dataset has been restructured with multi-resolution control images (512~1536) instead of a single resolution (512) for better robustness. [2026.01.12]
  - During testing, we found that applying ControlNet to Z-Image-Turbo caused the model to lose its acceleration capability and become blurry. We performed 8-step distillation on the version 2.1 model, and the distilled model demonstrates better performance when using 8-step prediction. Additionally, we have uploaded a tile model that can be used for super-resolution generation. [2025.12.22]
  - Due to a typo in version 2.0, `control_layers` was used instead of `control_noise_refiner` to process refiner latents during training. Although the model converged normally, inference was slow because the `control_layers` forward pass was performed twice. In version 2.1, we made an urgent fix and the speed has returned to normal. [2025.12.17]

  ## Model Card
+
+ ### a. 2601 Models
+ | Name | Description |
+ |--|--|
+ | Z-Image-Turbo-Fun-Controlnet-Union-2.1-2601-8steps.safetensors | Compared to the old version of the model, a more diverse variety of masks and a more reasonable training schedule have been adopted. This reduces bright spots/artifacts and mask information leakage. Additionally, the dataset has been restructured with multi-resolution control images (512~1536) instead of a single resolution (512) for better robustness. |
+ | Z-Image-Turbo-Fun-Controlnet-Tile-2.1-2601-8steps.safetensors | Compared to the old version of the model, a higher resolution was used for training, and a more reasonable training schedule was employed during distillation, which reduces bright spots/artifacts. |
+ | Z-Image-Turbo-Fun-Controlnet-Union-2.1-lite-2601-8steps.safetensors | Uses the same training scheme as the 2601 version, but compared to the large version of the model, fewer layers have control added, resulting in weaker control conditions. This makes it suitable for larger `control_context_scale` values, and the generation results appear more natural. It is also suitable for lower-spec machines. |
+ | Z-Image-Turbo-Fun-Controlnet-Tile-2.1-lite-2601-8steps.safetensors | Uses the same training scheme as the 2601 version, but compared to the large version of the model, fewer layers have control added, resulting in weaker control conditions. This makes it suitable for larger `control_context_scale` values, and the generation results appear more natural. It is also suitable for lower-spec machines. |
+
+ ### b. Models Before 2601
  | Name | Description |
  |--|--|
  | Z-Image-Turbo-Fun-Controlnet-Union-2.1-8steps.safetensors | Based on version 2.1, the model was distilled using an 8-step distillation algorithm. 8-step prediction is recommended. Compared to version 2.1, when using 8-step prediction, the images are clearer and the composition is more reasonable. |

  | Z-Image-Turbo-Fun-Controlnet-Union-2.0.safetensors | ControlNet weights for Z-Image-Turbo. Compared to version 1.0, it adds modifications to more layers and was trained for a longer time. However, due to a typo in the code, the layer blocks were forwarded twice, resulting in slower speed. The model supports multiple control conditions such as Canny, Depth, Pose, MLSD, etc. Additionally, the model lost some of its acceleration capability after training, thus requiring more steps. |
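The tile models above are typically applied to super-resolution by processing the image in overlapping tiles. As a minimal illustration (my own sketch, not the VideoX-Fun implementation; `tile` and `overlap` are hypothetical parameter names), the tile geometry can be computed like this:

```python
def tile_boxes(width, height, tile=1024, overlap=128):
    """Return overlapping (left, top, right, bottom) boxes that cover a
    width x height image; overlapping edges would later be blended."""
    stride = tile - overlap
    boxes = []
    for top in range(0, max(height - overlap, 1), stride):
        for left in range(0, max(width - overlap, 1), stride):
            boxes.append((left, top, min(left + tile, width), min(top + tile, height)))
    return boxes
```

Each box would be upscaled independently and the overlapping strips blended to hide seams; the overlap value trades memory for seam quality.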

  ## Model Features
+ - This ControlNet is added on 15 layer blocks and 2 refiner layer blocks (Lite models are added on 3 layer blocks and 2 refiner blocks). It supports multiple control conditions, including Canny, HED, Depth, Pose, and MLSD, and can be used like a standard ControlNet.
  - Inpainting mode is also supported.
  - Training Process:
  - 2.0: The model was trained from scratch for 70,000 steps on a dataset of 1 million high-quality images covering both general and human-centric content. Training was performed at 1328 resolution using BFloat16 precision, with a batch size of 64, a learning rate of 2e-5, and a text dropout ratio of 0.10.
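The "text dropout ratio of 0.10" above is the usual trick of blanking a fraction of captions during training so the model also learns unconditional prediction (which enables classifier-free guidance). A minimal sketch, assuming captions are plain strings:

```python
import random

def apply_text_dropout(captions, p=0.10, rng=None):
    """With probability p, replace each caption with an empty string so the
    model sees unconditional examples during training (illustrative sketch)."""
    rng = rng or random.Random()
    return [c if rng.random() >= p else "" for c in captions]
```

In a real training loop the empty caption would be tokenized like any other prompt; only the dropout ratio (0.10) comes from the text above.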
 
  - You can adjust `control_context_scale` for stronger control and better detail preservation. For better stability, we highly recommend using a detailed prompt. The optimal range for `control_context_scale` is 0.65 to 0.90.
  - During testing, in versions 2.0 and 2.1, we found that applying ControlNet to Z-Image-Turbo caused the model to lose its acceleration capability and produce blurry images. For details on strength and step-count testing, see [Scale Test Results](#scale-test-results); these results were generated with version 2.0.
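Conceptually, a scale like `control_context_scale` weights how strongly control-branch features are mixed into the backbone's hidden states. The sketch below is purely illustrative (the real injection happens inside the pipeline's transformer blocks; only the parameter name and the 0.65-0.90 guidance come from the text above):

```python
def apply_control(hidden, control, control_context_scale=0.8):
    """Blend control-branch features into backbone hidden states.
    Illustrative only: larger scale means stronger conditioning;
    the README recommends 0.65-0.90 for this model family."""
    return [h + control_context_scale * c for h, c in zip(hidden, control)]
```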

  ## Results
+ ### a. Difference between 2.1-8steps and 2.1-2601-8steps
+
+ The old 8-step model produced bright spots/artifacts when `control_context_scale` was too large, while the new version does not.
+
+ <table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
+ <tr>
+ <td>Z-Image-Turbo-Fun-Controlnet-Union-2.1-8steps</td>
+ <td>Z-Image-Turbo-Fun-Controlnet-Union-2.1-2601-8steps</td>
+ </tr>
+ <tr>
+ <td><img src="results/hed_2_1.png" width="100%" /></td>
+ <td><img src="results/hed_2_1_2601.png" width="100%" /></td>
+ </tr>
+ </table>
+
+ The old 8-step model sometimes learned the mask information and tended to completely fill the masked region during removal, while the new version does not.
+
+ <table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
+ <tr>
+ <td>Z-Image-Turbo-Fun-Controlnet-Union-2.1-8steps</td>
+ <td>Z-Image-Turbo-Fun-Controlnet-Union-2.1-2601-8steps</td>
+ </tr>
+ <tr>
+ <td><img src="results/mask_2_1.png" width="100%" /></td>
+ <td><img src="results/mask_2_1_2601.png" width="100%" /></td>
+ </tr>
+ </table>
+
+ ### b. Difference between 2.1 and 2.1-8steps

  8 steps results:
  <table border="0" style="width: 100%; text-align: left; margin-top: 20px;">

  </tr>
  </table>
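The 8-step results above rely on the distilled model being sampled with very few denoising steps. As a generic sketch (not the pipeline's actual scheduler), a few-step sampler just walks an evenly spaced, descending timestep schedule:

```python
def make_timesteps(num_steps=8, t_max=1000):
    """Evenly spaced, descending timesteps for few-step sampling.
    Generic illustration; the real scheduler may space steps differently."""
    return [round(t_max - i * t_max / num_steps) for i in range(num_steps)]
```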

+ ### c. Generation Results With 2.1-lite-2601-8steps
+
+ Uses the same training scheme as the 2601 version, but compared to the large version of the model, fewer layers have control added, resulting in weaker control conditions. This makes it suitable for larger `control_context_scale` values, and the generation results appear more natural. It is also suitable for lower-spec machines.
+
+ <table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
+ <tr>
+ <td>Pose</td>
+ <td>Output</td>
+ </tr>
+ <tr>
+ <td><img src="asset/pose.jpg" width="100%" /></td>
+ <td><img src="results/pose_lite.png" width="100%" /></td>
+ </tr>
+ </table>
+
+ <table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
+ <tr>
+ <td>Pose</td>
+ <td>Output</td>
+ </tr>
+ <tr>
+ <td><img src="asset/pose2.jpg" width="100%" /></td>
+ <td><img src="results/pose2_lite.png" width="100%" /></td>
+ </tr>
+ </table>
+
+ <table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
+ <tr>
+ <td>Canny</td>
+ <td>Output</td>
+ </tr>
+ <tr>
+ <td><img src="asset/canny.jpg" width="100%" /></td>
+ <td><img src="results/canny_lite.png" width="100%" /></td>
+ </tr>
+ </table>
+
+ <table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
+ <tr>
+ <td>Depth</td>
+ <td>Output</td>
+ </tr>
+ <tr>
+ <td><img src="asset/depth.jpg" width="100%" /></td>
+ <td><img src="results/depth_lite.png" width="100%" /></td>
+ </tr>
+ </table>
+
+ ### d. Generation Results With 2.1-2601-8steps
+
+ <table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
+ <tr>
+ <td>Pose + Inpaint</td>
+ <td>Output</td>
+ </tr>
+ <tr>
+ <td><img src="asset/inpaint.jpg" width="100%" /><img src="asset/mask.jpg" width="100%" /></td>
+ <td><img src="results/inpaint.png" width="100%" /></td>
+ </tr>
+ </table>
+
  <table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
  <tr>
  <td>Pose + Inpaint</td>
Z-Image-Turbo-Fun-Controlnet-Tile-2.1-2601-8steps.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4ca1f20f684be3f0c53b204b2e61a83f1ac28821c8c9a48ea7d8196ce395eb71
+ size 6712485600

Z-Image-Turbo-Fun-Controlnet-Tile-2.1-lite-2601-8steps.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:880bf452b060abfcbccedee56f8d3dbf8aed2cb0311b599210361b414fc8f2fd
+ size 2016627488

Z-Image-Turbo-Fun-Controlnet-Union-2.1-2601-8steps.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:53bac221dfae4279f14a3b1e6e311eac86ab39d57bf3d9a226e5aaf067a049bb
+ size 6712485600

Z-Image-Turbo-Fun-Controlnet-Union-2.1-lite-2601-8steps.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:aa428bc857b0095cddb52cd1acd7c0c6ada4c57658ec0ed39cd64280355b39cf
+ size 2016627488
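The four `.safetensors` additions above are stored as Git LFS pointer files: three `key value` lines (`version`, `oid`, `size`) checked into the repository in place of the real weights. A minimal parser for such a pointer:

```python
def parse_lfs_pointer(text):
    """Parse a v1 Git LFS pointer file ('version', 'oid', 'size' lines,
    each 'key value') into a dict with algorithm, digest, and byte size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    if fields.get("version") != "https://git-lfs.github.com/spec/v1":
        raise ValueError("not a v1 Git LFS pointer")
    algo, _, digest = fields["oid"].partition(":")
    return {"algorithm": algo, "digest": digest, "size": int(fields["size"])}
```

For example, feeding it the Tile-2.1-2601 pointer above yields `sha256`, the full digest, and a size of 6712485600 bytes (about 6.7 GB).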
results/canny.png CHANGED
  • before: SHA256 1a0537a1a887163841c851e06bfef32fdfa41f79312f43f0b136bfb3e649429f (2.38 MB)
  • after: SHA256 2b17e728050fcba9e11cf61148d95f93faf20c8e9a0a3aafcb6dea8f60a93016 (2.13 MB)
results/canny_lite.png ADDED
  • SHA256 dbb85fac7b54813439282f813df3c08f590eeee97bcf803eae485a4540532835 (2.12 MB)
results/depth.png CHANGED
  • before: SHA256 46e4276a3c415b2760e8b81609f6fc656f10f0a676c8c415a834013b366b9d59 (1.43 MB)
  • after: SHA256 87cd9831279058477f7acfd8e13b7779fb356ceca48bdd2678878d3d1763e804 (1.64 MB)
results/depth_lite.png ADDED
  • SHA256 f0c83515dc30dfaa465b1099f95632e89cc119c878838c1c7a343c3b5266abd0 (1.6 MB)
results/hed.png CHANGED
  • before: SHA256 816e11e44e8d659fd2c6cb2c862c37f942d329001cf787fef888fd2849f303ea (1.53 MB)
  • after: SHA256 eb8cea94cf7864a969a67a2251f54de9e7c5c68290c2ca10c61e480db6d2d3ef (1.76 MB)
results/hed_2_1.png ADDED
  • SHA256 303ff17502b56bffe979054cbbfe7e426d2b6acaa263a436be20f7e5f0c460c7 (1.7 MB)
results/hed_2_1_2601.png ADDED
  • SHA256 eb8cea94cf7864a969a67a2251f54de9e7c5c68290c2ca10c61e480db6d2d3ef (1.76 MB)
results/high_res.png CHANGED
  • before: SHA256 b77998bb0d60dc5cc758decbb51638242b12e64c4a9ec14133c28de98599c7ff (4.78 MB)
  • after: SHA256 8524e49d45c273e147f9aa3865efb8c9948cf35cf24937a7a2982e97ee103f05 (4.41 MB)
results/inpaint.png ADDED
  • SHA256 97c1f03a38c85fd4674d1348bb4552d9dbcdb27ef55a77bd07a01a5c743ca34b (1.86 MB)
results/mask_2_1.png ADDED
  • SHA256 3ecefcb9d95ed0832b1f59a5c23ea4a84223789004362c005d13f901059d3d1b (1.99 MB)
results/mask_2_1_2601.png ADDED
  • SHA256 2dddea0e0904440f62ce1dcb8569e80b0a59cfecda4ce140215c6280425fc376 (1.9 MB)
results/pose.png CHANGED
  • before: SHA256 866ecbb5c65e6ea7f7f540b205c4c7261e2e7685f6e29cac190ad56ee87ddb9b (1.79 MB)
  • after: SHA256 cb12ccd0c9c83047dac4a3094f2e2acb3925c2cd95e87dbb39ee074154e35d1c (1.84 MB)
results/pose2.png CHANGED
  • before: SHA256 70a170110ea25ae0cf1a6057dcdb31748c12e8ed1c98f7d51d98503a1a03ad54 (1.78 MB)
  • after: SHA256 2b931e684a504221813a645605519f55ea182b7485e4ce0562bc682098a0b182 (1.88 MB)
results/pose2_lite.png ADDED
  • SHA256 82c6280639ba289544df41ecdf5bcfb6c350852cec341e6915dabaa50e21743d (1.87 MB)
results/pose3.png CHANGED
  • before: SHA256 d2127491e33c0da4cc361de1dac9148dff237129828df89ffdeb63bdf385edf5 (2.13 MB)
  • after: SHA256 7278561cfb32a9418c82037c123444cdb3a956e68a83045a7397c177e0fcfbc2 (2.22 MB)
results/pose_inpaint.png CHANGED
  • before: SHA256 bfe302223041787d3b678f49ac5c459e2fdda74bd25fa6ef15714dc32c1d6cb0 (1.83 MB)
  • after: SHA256 bc22c696fcc531d4e54a3665daea97d3a9202e79aaf66dc481572e9e87d64868 (1.85 MB)
results/pose_lite.png ADDED
  • SHA256 a3c9036dd0c2a8ee13aeaf716b979aff1292904f8a8fc7f60a5a20f4a87f0954 (1.76 MB)