zyhuangnus committed on
Commit 4ba6b50 · verified · 1 Parent(s): 921f9b8

Upload folder using huggingface_hub

Files changed (5)
  1. .gitattributes +1 -0
  2. README.md +206 -0
  3. assets/0830-mingtok-fig1.jpg +3 -0
  4. config.json +34 -0
  5. model.safetensors +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+assets/0830-mingtok-fig1.jpg filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,206 @@
## MingTok: A Unified Tokenizer for Visual Understanding and Generation without Vector Quantization

<p align="center">📑 <a href="">Technical Report</a> | 📖 <a href="https://huggingface.co/inclusionAI/MingTok-Vision">Project Page</a> | 🤗 <a href="https://huggingface.co/inclusionAI/MingTok-Vision">Hugging Face</a> | 🤖 <a href="https://modelscope.cn/models/inclusionAI/MingTok-Vision">ModelScope</a></p>

## Key Features
- 🖼️ **First Continuous Unified Vision Tokenizer:** MingTok enables unified visual understanding and generation via a continuous latent space, eliminating quantization while preserving semantic and perceptual fidelity.
- 🎯 **High-Fidelity Image Reconstruction:** A three-stage architecture (encoding, expansion, reconstruction) captures fine details and global structure for accurate, high-quality image recovery.
- ⚡ **Accelerated Autoregressive Convergence:** Masked modeling with multi-level supervision shapes a compact, semantically rich latent space, enabling faster and more stable autoregressive training.

<div align="center">
<img src="assets/0830-mingtok-fig1.jpg" alt="Model Architecture" width="80%"/>
</div>

**Figure 1: Conceptual comparison and qualitative examples of MingTok.**
## Usage
```python
# Build MingTok and run image reconstruction.
import torch
import torchvision.transforms as T
from PIL import Image

from mingtok.modeling_mingtok import MingTok
# CenterCropProcessor is assumed to come from the mingtok package as well.

mingtok_model = MingTok.from_pretrained("inclusionAI/MingTok-Vision")
mingtok_model = mingtok_model.cuda().eval()

img_path = "mingtok/asset/mingtok.png"
save_path = "mingtok/asset/mingtok_recon.png"

# Load and normalize the original image.
image = Image.open(img_path).convert("RGB")
processor = CenterCropProcessor(image_size=512, mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
image = processor(image).cuda().unsqueeze(0)

# Perform reconstruction through the encoder-decoder path.
with torch.no_grad():
    image_recon = mingtok_model.forward_enc_dec(image)
    # Equivalently, step by step:
    # latent = mingtok_model.low_level_encoder(image)
    # semantic_feat = mingtok_model.semantic_decoder(latent)['x_norm_patchtokens']
    # image_recon = mingtok_model.forward_pixel_decoder(semantic_feat)

# De-normalize (mean = std = 0.5) and save the reconstruction.
output_mean = torch.tensor([0.5, 0.5, 0.5]).view(1, -1, 1, 1).cuda()
output_std = torch.tensor([0.5, 0.5, 0.5]).view(1, -1, 1, 1).cuda()
output_image = (image_recon * output_std + output_mean)[0].clamp(0, 1)
output_image = T.ToPILImage()(output_image)
output_image.save(save_path)
```
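As a sanity check on the pre/post-processing: with mean = std = 0.5 per channel, normalization maps pixel values from [0, 1] to [-1, 1], and the last lines of the snippet invert that mapping before saving. A minimal pure-Python sketch (the helper names are illustrative, not part of the mingtok API):

```python
# Illustrative helpers (not mingtok API): the normalization applied by the
# processor and its inverse, with mean = std = 0.5 per channel.
def normalize(pixel, mean=0.5, std=0.5):
    return (pixel - mean) / std

def denormalize(value, mean=0.5, std=0.5):
    return value * std + mean

print(normalize(0.0))                # -1.0: black maps to the range minimum
print(normalize(1.0))                # 1.0: white maps to the range maximum
print(denormalize(normalize(0.25)))  # 0.25: the round trip is lossless
```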

## Performance
### Image Reconstruction

<style>
table {
  width: 100%;
  border-collapse: collapse;
  font-size: 0.9em;
  text-align: center;
}
th, td {
  padding: 6px 8px;
  border: 1px solid #ccc;
}
th {
  background-color: #f7f7f7;
  font-weight: bold;
}
tr.italic td {
  font-style: italic;
  text-align: left;
}
.footnote {
  font-size: 0.9em;
  color: #555;
  margin-top: 8px;
}
</style>

<table>
<thead>
<tr>
<th>Tokenizer</th><th>Res.</th><th># Tokens</th><th>rFID ↓</th><th>PSNR ↑</th><th>SSIM ↑</th><th>LPIPS ↓</th>
</tr>
</thead>
<tbody>
<!-- Specialized tokenizers -->
<tr class="italic"><td colspan="7"><em>Specialized tokenizers</em></td></tr>
<tr><td>SD-VAE</td><td>256</td><td>1024</td><td>1.06</td><td>28.62</td><td>0.86</td><td>-</td></tr>
<tr><td>GigaTok</td><td>256</td><td>256</td><td>0.51</td><td>21.32</td><td>0.69</td><td>0.21</td></tr>
<tr><td>VA-VAE</td><td>256</td><td>256</td><td>0.26</td><td>28.59</td><td>0.80</td><td>0.09</td></tr>
<tr><td>HieraTok</td><td>256</td><td>256</td><td>1.04</td><td>23.90</td><td>0.72</td><td>0.09</td></tr>
<tr><td>DC-AE</td><td>512</td><td>64</td><td>0.22</td><td>26.15</td><td>0.71</td><td>0.08</td></tr>
<tr><td>MAE-Tok</td><td>512</td><td>128</td><td>0.62</td><td>-</td><td>-</td><td>-</td></tr>
<tr><td>TexTok</td><td>512</td><td>256</td><td>0.73</td><td>24.45</td><td>0.66</td><td>0.19</td></tr>
<!-- Unified tokenizers -->
<tr class="italic"><td colspan="7"><em>Unified tokenizers</em></td></tr>
<tr><td>UniTok</td><td>256</td><td>256</td><td>0.38</td><td>-</td><td>-</td><td>-</td></tr>
<tr><td>TokenFlow</td><td>384</td><td>729</td><td>0.63</td><td>22.77</td><td>0.73</td><td>-</td></tr>
<tr><td><strong>MingTok-Vision</strong></td><td>512</td><td>256</td><td>0.54</td><td>30.77</td><td>0.62</td><td>0.14</td></tr>
<tr><td><strong>MingTok-Vision</strong> †</td><td>512</td><td>256</td><td>0.38</td><td>31.09</td><td>0.64</td><td>0.12</td></tr>
</tbody>
</table>

<div class="footnote">
<strong>†</strong> denotes using the semantic decoder after joint pre-training.
</div>

## Reference
TBD.
assets/0830-mingtok-fig1.jpg ADDED

Git LFS Details

  • SHA256: c795bff4195f188bd714981b604b634121b0c4b1d61f7ffccee82b2c04b3cf7c
  • Pointer size: 131 Bytes
  • Size of remote file: 355 kB
config.json ADDED
@@ -0,0 +1,34 @@
{
  "architectures": [
    "MingTok"
  ],
  "low_level_encoder": {
    "depth": 12,
    "embed_dim": 768,
    "ffn_layer": "swiglufused",
    "img_size": 512,
    "out_dim": 32,
    "patch_size": 32
  },
  "mean": 1.46817409,
  "model_dtype": "bf16",
  "model_type": "mingtok",
  "pixel_decoder": {
    "decoder_depth": 24,
    "embed_dim": 1024,
    "loss_type": "L1-plain",
    "norm_pix_loss": true,
    "patch_size": 16
  },
  "pretrained_checkpoint": "/mnt/nativemm-hn/checkpoint/ziyuan/moe_mingtok/mingtok_moe_0830_recon_0830_bef_joint_training_resize_dec_p16d24c1024_s2noresize/gan_v2_20M/202509281648/checkpoint_1_hf.pth",
  "scaling_factor": 8.09449291,
  "semantic_decoder": {
    "decoder_depth": 24,
    "embed_dim": 1024,
    "ffn_layer": "swiglufused",
    "in_dim": 32,
    "patch_size": 32
  },
  "torch_dtype": "float32",
  "transformers_version": "4.52.4"
}
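The config also pins down the tokenizer's sequence length: with `img_size` 512 and the low-level encoder's `patch_size` 32, the encoder produces a 16 × 16 grid of continuous tokens, each with `out_dim` 32 channels. A short sketch of that arithmetic:

```python
# Token grid and latent size implied by the low_level_encoder config above.
img_size = 512     # low_level_encoder.img_size
patch_size = 32    # low_level_encoder.patch_size
out_dim = 32       # low_level_encoder.out_dim (continuous channels per token)

grid = img_size // patch_size
num_tokens = grid * grid
print(num_tokens)            # 256, matching "# Tokens" in the results table
print(num_tokens * out_dim)  # 8192 continuous latent values per image
```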
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fc76fb2056dff98ae7f3fee03b31a4b523816989682769cb19a73bd1813e6c95
size 2790962104
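`model.safetensors` is stored as a Git LFS pointer: a small text stub recording the object's hash and byte size rather than the ~2.8 GB weights themselves. A minimal parser for this pointer format (a sketch, not part of any official LFS tooling):

```python
# Parse a Git LFS pointer stub into its key/value fields.
def parse_lfs_pointer(text):
    return dict(line.split(" ", 1) for line in text.strip().splitlines())

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:fc76fb2056dff98ae7f3fee03b31a4b523816989682769cb19a73bd1813e6c95
size 2790962104
"""

info = parse_lfs_pointer(pointer)
print(info["size"])  # 2790962104 (bytes, matching the file listed above)
```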