zyhuangnus committed on
Commit 4ba6b50 · verified · 1 Parent(s): 921f9b8

Upload folder using huggingface_hub

Files changed (5)
  1. .gitattributes +1 -0
  2. README.md +206 -0
  3. assets/0830-mingtok-fig1.jpg +3 -0
  4. config.json +34 -0
  5. model.safetensors +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+assets/0830-mingtok-fig1.jpg filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,206 @@
## MingTok: A Unified Tokenizer for Visual Understanding and Generation without Vector Quantization

<p align="center">📑 <a href="">Technical Report</a> | 📖 <a href="https://huggingface.co/inclusionAI/MingTok-Vision">Project Page</a> | 🤗 <a href="https://huggingface.co/inclusionAI/MingTok-Vision">Hugging Face</a> | 🤖 <a href="https://modelscope.cn/models/inclusionAI/MingTok-Vision">ModelScope</a></p>

## Key Features
- 🖼️ **First Continuous Unified Vision Tokenizer:** MingTok enables unified visual understanding and generation via a continuous latent space, eliminating quantization while preserving semantic and perceptual fidelity.
- 🎯 **High-Fidelity Image Reconstruction:** A three-stage architecture (encoding, expansion, reconstruction) captures fine details and global structure for accurate, high-quality image recovery.
- ⚡ **Accelerated Autoregressive Convergence:** Masked modeling with multi-level supervision shapes a compact, semantically rich latent space, enabling faster and more stable autoregressive training.

<div align="center">
<img src="assets/0830-mingtok-fig1.jpg" alt="Model Architecture" width="80%"/>
</div>

**Figure 1: Conceptual comparison and qualitative examples of MingTok.**
## Usage
```python
# Build MingTok and run image reconstruction.
import torch
import torchvision.transforms as T
from PIL import Image

from mingtok.modeling_mingtok import MingTok
# CenterCropProcessor is assumed to come from the mingtok package as well.

mingtok_model = MingTok.from_pretrained("inclusionAI/MingTok-Vision")
mingtok_model = mingtok_model.cuda().eval()

img_path = "mingtok/asset/mingtok.png"
save_path = "mingtok/asset/mingtok_recon.png"

# Load and normalize the original image.
image = Image.open(img_path).convert("RGB")
processor = CenterCropProcessor(image_size=512, mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
image = processor(image).cuda().unsqueeze(0)

# Perform reconstruction through the encoder-decoder path.
with torch.no_grad():
    image_recon = mingtok_model.forward_enc_dec(image)
    # Equivalently, step by step:
    # latent = mingtok_model.low_level_encoder(image)
    # semantic_feat = mingtok_model.semantic_decoder(latent)['x_norm_patchtokens']
    # image_recon = mingtok_model.forward_pixel_decoder(semantic_feat)

# De-normalize (mean = std = 0.5) and save the reconstruction.
output_mean = torch.tensor([0.5, 0.5, 0.5]).view(1, -1, 1, 1).cuda()
output_std = torch.tensor([0.5, 0.5, 0.5]).view(1, -1, 1, 1).cuda()
output_image = (image_recon * output_std + output_mean)[0].clamp(0, 1)
output_image = T.ToPILImage()(output_image)
output_image.save(save_path)
```
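As a sanity check on the pre/post-processing: with mean = std = 0.5 per channel, normalization maps pixel values from [0, 1] to [-1, 1], and the last lines of the snippet invert that mapping before saving. A minimal pure-Python sketch (the helper names are illustrative, not part of the mingtok API):

```python
# Illustrative helpers (not mingtok API): the normalization applied by the
# processor and its inverse, with mean = std = 0.5 per channel.
def normalize(pixel, mean=0.5, std=0.5):
    return (pixel - mean) / std

def denormalize(value, mean=0.5, std=0.5):
    return value * std + mean

print(normalize(0.0))                # -1.0: black maps to the range minimum
print(normalize(1.0))                # 1.0: white maps to the range maximum
print(denormalize(normalize(0.25)))  # 0.25: the round trip is lossless
```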

## Performance
### Image Reconstruction

<style>
table {
  width: 100%;
  border-collapse: collapse;
  font-size: 0.9em;
  text-align: center;
}
th, td {
  padding: 6px 8px;
  border: 1px solid #ccc;
}
th {
  background-color: #f7f7f7;
  font-weight: bold;
}
tr.italic td {
  font-style: italic;
  text-align: left;
}
.footnote {
  font-size: 0.9em;
  color: #555;
  margin-top: 8px;
}
</style>

<table>
<thead>
<tr>
<th>Tokenizer</th><th>Res.</th><th># Tokens</th><th>rFID ↓</th><th>PSNR ↑</th><th>SSIM ↑</th><th>LPIPS ↓</th>
</tr>
</thead>
<tbody>
<!-- Specialized tokenizers -->
<tr class="italic"><td colspan="7"><em>Specialized tokenizers</em></td></tr>
<tr><td>SD-VAE</td><td>256</td><td>1024</td><td>1.06</td><td>28.62</td><td>0.86</td><td>-</td></tr>
<tr><td>GigaTok</td><td>256</td><td>256</td><td>0.51</td><td>21.32</td><td>0.69</td><td>0.21</td></tr>
<tr><td>VA-VAE</td><td>256</td><td>256</td><td>0.26</td><td>28.59</td><td>0.80</td><td>0.09</td></tr>
<tr><td>HieraTok</td><td>256</td><td>256</td><td>1.04</td><td>23.90</td><td>0.72</td><td>0.09</td></tr>
<tr><td>DC-AE</td><td>512</td><td>64</td><td>0.22</td><td>26.15</td><td>0.71</td><td>0.08</td></tr>
<tr><td>MAE-Tok</td><td>512</td><td>128</td><td>0.62</td><td>-</td><td>-</td><td>-</td></tr>
<tr><td>TexTok</td><td>512</td><td>256</td><td>0.73</td><td>24.45</td><td>0.66</td><td>0.19</td></tr>
<!-- Unified tokenizers -->
<tr class="italic"><td colspan="7"><em>Unified tokenizers</em></td></tr>
<tr><td>UniTok</td><td>256</td><td>256</td><td>0.38</td><td>-</td><td>-</td><td>-</td></tr>
<tr><td>TokenFlow</td><td>384</td><td>729</td><td>0.63</td><td>22.77</td><td>0.73</td><td>-</td></tr>
<tr><td><strong>MingTok-Vision</strong></td><td>512</td><td>256</td><td>0.54</td><td>30.77</td><td>0.62</td><td>0.14</td></tr>
<tr><td><strong>MingTok-Vision</strong> †</td><td>512</td><td>256</td><td>0.38</td><td>31.09</td><td>0.64</td><td>0.12</td></tr>
</tbody>
</table>

<div class="footnote">
<strong>†</strong> denotes using the semantic decoder after joint pre-training.
</div>

## Reference
TBD.
assets/0830-mingtok-fig1.jpg ADDED

Git LFS Details

  • SHA256: c795bff4195f188bd714981b604b634121b0c4b1d61f7ffccee82b2c04b3cf7c
  • Pointer size: 131 Bytes
  • Size of remote file: 355 kB
config.json ADDED
@@ -0,0 +1,34 @@
{
  "architectures": [
    "MingTok"
  ],
  "low_level_encoder": {
    "depth": 12,
    "embed_dim": 768,
    "ffn_layer": "swiglufused",
    "img_size": 512,
    "out_dim": 32,
    "patch_size": 32
  },
  "mean": 1.46817409,
  "model_dtype": "bf16",
  "model_type": "mingtok",
  "pixel_decoder": {
    "decoder_depth": 24,
    "embed_dim": 1024,
    "loss_type": "L1-plain",
    "norm_pix_loss": true,
    "patch_size": 16
  },
  "pretrained_checkpoint": "/mnt/nativemm-hn/checkpoint/ziyuan/moe_mingtok/mingtok_moe_0830_recon_0830_bef_joint_training_resize_dec_p16d24c1024_s2noresize/gan_v2_20M/202509281648/checkpoint_1_hf.pth",
  "scaling_factor": 8.09449291,
  "semantic_decoder": {
    "decoder_depth": 24,
    "embed_dim": 1024,
    "ffn_layer": "swiglufused",
    "in_dim": 32,
    "patch_size": 32
  },
  "torch_dtype": "float32",
  "transformers_version": "4.52.4"
}
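The config also pins down the tokenizer's sequence length: with `img_size` 512 and the low-level encoder's `patch_size` 32, the encoder produces a 16 × 16 grid of continuous tokens, each with `out_dim` 32 channels. A short sketch of that arithmetic:

```python
# Token grid and latent size implied by the low_level_encoder config above.
img_size = 512     # low_level_encoder.img_size
patch_size = 32    # low_level_encoder.patch_size
out_dim = 32       # low_level_encoder.out_dim (continuous channels per token)

grid = img_size // patch_size
num_tokens = grid * grid
print(num_tokens)            # 256, matching "# Tokens" in the results table
print(num_tokens * out_dim)  # 8192 continuous latent values per image
```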
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fc76fb2056dff98ae7f3fee03b31a4b523816989682769cb19a73bd1813e6c95
size 2790962104
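`model.safetensors` is stored as a Git LFS pointer: a small text stub recording the object's hash and byte size rather than the ~2.8 GB weights themselves. A minimal parser for this pointer format (a sketch, not part of any official LFS tooling):

```python
# Parse a Git LFS pointer stub into its key/value fields.
def parse_lfs_pointer(text):
    return dict(line.split(" ", 1) for line in text.strip().splitlines())

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:fc76fb2056dff98ae7f3fee03b31a4b523816989682769cb19a73bd1813e6c95
size 2790962104
"""

info = parse_lfs_pointer(pointer)
print(info["size"])  # 2790962104 (bytes, matching the file listed above)
```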