Reza2kn committed
Commit 4a29951 · verified · 1 parent: e139166

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,12 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ samples/anime.png filter=lfs diff=lfs merge=lfs -text
37
+ samples/barista.png filter=lfs diff=lfs merge=lfs -text
38
+ samples/city.png filter=lfs diff=lfs merge=lfs -text
39
+ samples/food.png filter=lfs diff=lfs merge=lfs -text
40
+ samples/panda.png filter=lfs diff=lfs merge=lfs -text
41
+ samples/portrait.png filter=lfs diff=lfs merge=lfs -text
42
+ samples/t2v_f16.png filter=lfs diff=lfs merge=lfs -text
43
+ samples/t2v_f8.png filter=lfs diff=lfs merge=lfs -text
44
+ text_tokenizer/tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,66 @@
1
+ ---
2
+ license: other
3
+ license_name: openmdw-1.1
4
+ license_link: https://openmdw.ai/license/1-1/
5
+ base_model: nvidia/Cosmos3-Nano
6
+ base_model_relation: quantized
7
+ library_name: mlx
8
+ pipeline_tag: text-to-image
9
+ tags: [cosmos, cosmos3, mlx, apple-silicon, 4-bit, quantization, text-to-image]
10
+ ---
11
+
12
+ # Cosmos3-Nano — MLX 4-bit (Apple Silicon)
13
+
14
+ A **4-bit MLX** build of [`nvidia/Cosmos3-Nano`](https://huggingface.co/nvidia/Cosmos3-Nano) that
15
+ **runs on Apple Silicon** — not just quantized weights, a working text2image model. The custom
16
+ Cosmos3 omni-MoT diffusion transformer was ported to MLX from scratch (no `mlx-vlm` support exists
17
+ for this architecture) and every block was validated against the reference torch implementation.
18
+
19
+ > Derivative of `nvidia/Cosmos3-Nano`. © NVIDIA. Distributed under **OpenMDW-1.1** (license + NVIDIA
20
+ > copyright/origin notices retained). Not affiliated with, nor endorsed by, NVIDIA.
21
+
22
+ ## Highlights
23
+ - **Transformer: 30.3 GB bf16 → 12.1 GB MLX-4bit** (468 attn+MLP linears quantized, group-64; embeddings/norms/lm_head kept bf16).
24
+ - **~11 GB peak memory**, so it fits a 16 GB Mac. ~12 s for a 256² image (M2 Ultra); higher resolutions take longer.
25
+ - **Validated:** every module matches torch — primitives ~1e-6, full decoder layer ~1e-3 (bf16), patchify bit-exact.
26
+
27
+ ## Usage
28
+ ```python
29
+ import sys, torch
30
+ from huggingface_hub import snapshot_download
31
+ from diffusers import Cosmos3OmniPipeline, AutoencoderKLWan, UniPCMultistepScheduler
32
+ from diffusers.models.autoencoders.autoencoder_cosmos3_audio import Cosmos3AVAEAudioTokenizer
33
+ from transformers import AutoTokenizer
34
+ repo = snapshot_download("Reza2kn/Cosmos3-Nano-MLX-4bit")
35
+ sys.path.insert(0, repo)  # mlx_pipeline.py ships inside the downloaded repo
36
+ from mlx_pipeline import MLXCosmos3Transformer
37
+ vae = AutoencoderKLWan.from_pretrained(repo, subfolder="vae", torch_dtype=torch.float32).eval()
38
+ sched = UniPCMultistepScheduler.from_pretrained(repo, subfolder="scheduler")
39
+ tok = AutoTokenizer.from_pretrained(repo, subfolder="text_tokenizer")
40
+ st = Cosmos3AVAEAudioTokenizer.from_pretrained(repo, subfolder="sound_tokenizer", torch_dtype=torch.float32).eval()
41
+ pipe = Cosmos3OmniPipeline(transformer=MLXCosmos3Transformer(repo + "/transformer"),
42
+ text_tokenizer=tok, vae=vae, scheduler=sched, sound_tokenizer=st, enable_safety_checker=False)
43
+ img = pipe("A red panda astronaut floating in a nebula", num_frames=1,
44
+ height=384, width=384, num_inference_steps=24).video[0][0]
45
+ img.save("out.png")
46
+ ```
47
+ **Requires:** `mlx`, `diffusers` (git main / ≥0.39 for Cosmos3), `transformers`, `torch` (for the VAE, scheduler, and tokenizers only). The
48
+ heavy 16B transformer runs in MLX on the GPU; the small VAE, scheduler, and tokenizers run in torch.
49
+
50
+ ## Quality (honest)
51
+ Same profile as any 4-bit build: **clean on typical content** (portraits, scenes, objects, food —
52
+ see `samples/`), but **4-bit defects appear on hard anatomy** — e.g. fused/mangled **hands**
53
+ (`samples/barista.png`) and broken limbs in complex poses (`samples/anime.png`). PickScore (mean
54
+ **21.42**, vs the CUDA builds' ~21.8) does **not** reliably catch these — eyeball the hard cases.
55
+ Use FP8/BF16 if you need hands/complex anatomy to hold up.
56
+
57
+ ## Status / honesty
58
+ - **text2image: working** (`samples/*.png`), with the 4-bit anatomy caveats above.
59
+ - **text2video: working** (`samples/t2v_waves.mp4`, `num_frames>1`).
60
+ - **image2video / audio:** not implemented yet (image-conditioning + sound paths).
61
+ - Quantization is 4-bit weight-only — near-original on typical content, with the usual 4-bit wobble on the
62
+ hardest cases (dense hands, on-image text), same as any 4-bit build.
63
+
64
+ ## How it was built
65
+ `mlx_cosmos3.py` (validated MLX modules), `mlx_pipeline.py` (torch wrapper routing the transformer forward to MLX
66
+ while reusing torch tokenizer/UniPC/VAE/CFG). Quantized with `mx.quantize` (group-64, 4-bit), streamed shard-by-shard.
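The "group-64, 4-bit" quantization named in the README can be illustrated with a small pure-Python sketch of group-wise affine quantization. This is not the repo's `mlx_quant.py` (that script is not part of this commit); the helper names `quantize_group`, `dequantize_group`, and `bits_per_weight` are hypothetical, and the min/max affine scheme is assumed to mirror what `mx.quantize` does by default. The last helper shows the storage arithmetic under the further assumption that each group stores one 16-bit scale and one 16-bit bias: 4 + 32/64 = 4.5 bits per weight.

```python
import random

BITS = 4
GROUP = 64
LEVELS = (1 << BITS) - 1  # 15 steps between the group's min and max


def quantize_group(weights):
    """Affine-quantize one group: w ~= scale * q + bias, with q in [0, 15]."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / LEVELS or 1.0  # guard against a constant group
    codes = [min(LEVELS, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Map 4-bit codes back to approximate float weights."""
    return [scale * q + bias for q in codes]


def bits_per_weight(bits=BITS, group=GROUP, meta_bits=16):
    # one scale + one bias per group, each assumed to be stored in 16 bits
    return bits + 2 * meta_bits / group


random.seed(0)
w = [random.gauss(0.0, 0.02) for _ in range(GROUP)]
codes, scale, bias = quantize_group(w)
w_hat = dequantize_group(codes, scale, bias)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
print(f"max abs error {max_err:.5f}, step {scale:.5f}, {bits_per_weight()} bits/weight")
```

The rounding error per weight is bounded by half a quantization step, which is why a single outlier in a group (a large `hi - lo`) degrades every other weight in that group.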
mlx_cosmos3.py ADDED
@@ -0,0 +1,193 @@
1
+ """MLX port of the Cosmos3-Nano omni transformer. Built module-by-module, each validated
2
+ against the torch reference (validate_primitives.py). Runs the MLX 4-bit weights produced by
3
+ mlx_quant.py. WIP — primitives + attention first, then full transformer + pipeline glue."""
4
+ import mlx.core as mx
5
+ import mlx.nn as nn
6
+ QGROUP = 64
7
+ QBITS = 4 # set by loader from mlx_quant_config.json
8
+
9
+
10
+ def rms_norm(x, weight, eps):
11
+     # matches diffusers RMSNorm: variance in float32, scale, then * weight
12
+     xf = x.astype(mx.float32)
13
+     var = mx.mean(xf * xf, axis=-1, keepdims=True)
14
+     xf = xf * mx.rsqrt(var + eps)
15
+     return (weight * xf.astype(x.dtype)) if weight is not None else xf.astype(x.dtype)
16
+
17
+
18
+ def silu(x):
19
+     return x * mx.sigmoid(x)
20
+
21
+
22
+ def swiglu_mlp(x, gate_w, up_w, down_w):
23
+     # down(silu(gate(x)) * up(x)); weights are [out,in] (torch Linear) -> x @ w.T
24
+     g = silu(x @ gate_w.T)
25
+     u = x @ up_w.T
26
+     return (g * u) @ down_w.T
27
+
28
+
29
+ def rotate_half(x):
30
+     half = x.shape[-1] // 2
31
+     return mx.concatenate([-x[..., half:], x[..., :half]], axis=-1)
32
+
33
+
34
+ def apply_rope(x, cos, sin):
35
+     # x: [N, heads, head_dim]; cos/sin: [N, head_dim] -> unsqueeze head axis
36
+     cos = mx.expand_dims(cos, 1)
37
+     sin = mx.expand_dims(sin, 1)
38
+     return x * cos + rotate_half(x) * sin
39
+
40
+
41
+ class RoPE3D:
42
+     """Cosmos3VLTextRotaryEmbedding: interleaved 3D mRoPE."""
43
+     def __init__(self, head_dim, rope_theta, rope_axes_dim):
44
+         self.inv_freq = 1.0 / (rope_theta ** (mx.arange(0, head_dim, 2).astype(mx.float32) / head_dim))
45
+         self.rope_axes_dim = rope_axes_dim # e.g. [24,20,20]
46
+
47
+     def _interleave(self, freqs):
48
+         # freqs: [3, N, head_dim//2] -> [N, head_dim//2] interleaving H,W into T grid
49
+         freqs_t = freqs[0]
50
+         for dim, offset in ((1, 1), (2, 2)): # (axis idx, start offset)
51
+             length = self.rope_axes_dim[dim] * 3
52
+             idx = mx.arange(offset, length, 3)
53
+             # assign freqs_t[..., idx] = freqs[dim][..., idx]
54
+             sel = freqs[dim][..., idx]
55
+             freqs_t[..., idx] = sel
56
+         return freqs_t
57
+
58
+     def __call__(self, position_ids):
59
+         # position_ids: [3, N]
60
+         pid = position_ids.astype(mx.float32) # [3, N]
61
+         inv = self.inv_freq[None, :, None] # [1, d/2, 1]
62
+         inv = mx.broadcast_to(inv, (3, inv.shape[1], 1)) # [3, d/2, 1]
63
+         pe = pid[:, None, :] # [3, 1, N]
64
+         freqs = mx.transpose(inv @ pe, (0, 2, 1)) # [3, N, d/2]
65
+         freqs = self._interleave(freqs) # [N, d/2]
66
+         emb = mx.concatenate([freqs, freqs], axis=-1) # [N, d]
67
+         return mx.cos(emb), mx.sin(emb)
68
+
69
+
70
+ def gqa_attention(q, k, v, n_heads, n_kv_heads, causal):
71
+     # q:[N,H,D] k,v:[M,Hkv,D]. expand kv groups, scaled-dot-product.
72
+     N, H, D = q.shape
73
+     M = k.shape[0]
74
+     rep = n_heads // n_kv_heads
75
+     k = mx.repeat(k, rep, axis=1) # [M, H, D]
76
+     v = mx.repeat(v, rep, axis=1)
77
+     q = mx.transpose(q, (1, 0, 2)) # [H, N, D]
78
+     k = mx.transpose(k, (1, 0, 2)) # [H, M, D]
79
+     v = mx.transpose(v, (1, 0, 2))
80
+     scale = 1.0 / (D ** 0.5)
81
+     scores = (q @ mx.transpose(k, (0, 2, 1))) * scale # [H, N, M]
82
+     if causal:
83
+         mask = mx.triu(mx.full((N, M), -1e9, dtype=scores.dtype), k=1)
84
+         scores = scores + mask
85
+     w = mx.softmax(scores.astype(mx.float32), axis=-1).astype(v.dtype)
86
+     out = w @ v # [H, N, D]
87
+     return mx.transpose(out, (1, 0, 2)) # [N, H, D]
88
+
89
+
90
+ def dual_attention(und_seq, gen_seq, rope, w, n_heads, n_kv_heads, head_dim, eps):
91
+     """Cosmos3AttnProcessor in MLX: und=causal self-attn, gen=full attn over [und+gen] kv."""
92
+     cos_u, sin_u, cos_g, sin_g = rope
93
+     q_u = (und_seq @ w['to_q'].T).reshape(-1, n_heads, head_dim)
94
+     k_u = (und_seq @ w['to_k'].T).reshape(-1, n_kv_heads, head_dim)
95
+     v_u = (und_seq @ w['to_v'].T).reshape(-1, n_kv_heads, head_dim)
96
+     q_g = (gen_seq @ w['add_q_proj'].T).reshape(-1, n_heads, head_dim)
97
+     k_g = (gen_seq @ w['add_k_proj'].T).reshape(-1, n_kv_heads, head_dim)
98
+     v_g = (gen_seq @ w['add_v_proj'].T).reshape(-1, n_kv_heads, head_dim)
99
+     q_u = rms_norm(q_u, w['norm_q'], eps); k_u = rms_norm(k_u, w['norm_k'], eps)
100
+     q_g = rms_norm(q_g, w['norm_added_q'], eps); k_g = rms_norm(k_g, w['norm_added_k'], eps)
101
+     q_u = apply_rope(q_u, cos_u, sin_u); k_u = apply_rope(k_u, cos_u, sin_u)
102
+     q_g = apply_rope(q_g, cos_g, sin_g); k_g = apply_rope(k_g, cos_g, sin_g)
103
+     causal_out = gqa_attention(q_u, k_u, v_u, n_heads, n_kv_heads, causal=True).reshape(-1, n_heads * head_dim)
104
+     all_k = mx.concatenate([k_u, k_g], axis=0); all_v = mx.concatenate([v_u, v_g], axis=0)
105
+     full_out = gqa_attention(q_g, all_k, all_v, n_heads, n_kv_heads, causal=False).reshape(-1, n_heads * head_dim)
106
+     return causal_out @ w['to_out'].T, full_out @ w['to_add_out'].T
107
+
108
+
109
+ # ---- timestep embedding (diffusers Timesteps + TimestepEmbedding) ----
110
+ def get_timestep_embedding(timesteps, dim=256, max_period=10000, downscale_freq_shift=0.0):
111
+     half = dim // 2
112
+     exponent = -mx.log(mx.array(float(max_period))) * mx.arange(half).astype(mx.float32)
113
+     exponent = exponent / (half - downscale_freq_shift)
114
+     emb = mx.exp(exponent)
115
+     emb = timesteps.astype(mx.float32)[:, None] * emb[None, :]
116
+     # flip_sin_to_cos=True -> [cos, sin]
117
+     return mx.concatenate([mx.cos(emb), mx.sin(emb)], axis=-1)
118
+
119
+
120
+ def timestep_embedder(t_emb, l1_w, l1_b, l2_w, l2_b):
121
+     h = silu(t_emb @ l1_w.T + l1_b)
122
+     return h @ l2_w.T + l2_b
123
+
124
+
125
+ # ---- linear that accepts bf16 weight (mx array) or quantized tuple (wq, scales, biases) ----
126
+ def linear(x, w, bias=None, group_size=None, bits=None):
127
+     if isinstance(w, tuple):
128
+         wq, scales, biases = w
129
+         out = mx.quantized_matmul(x, wq, scales, biases, transpose=True,
130
+                                   group_size=group_size or QGROUP, bits=bits or QBITS)
131
+     else:
132
+         out = x @ w.T
133
+     return out + bias if bias is not None else out
134
+
135
+
136
+ def decoder_layer(und, gen, rope, P, cfg):
137
+     """One Cosmos3VLTextMoTDecoderLayer in MLX. P = dict of this layer's params (mx arrays or
138
+     quantized tuples). cfg = (n_heads, n_kv, head_dim, eps)."""
139
+     NH, NKV, HD, EPS = cfg
140
+     und_n = rms_norm(und, P["input_layernorm.weight"], EPS)
141
+     gen_n = rms_norm(gen, P["input_layernorm_moe_gen.weight"], EPS)
142
+     cos_u, sin_u, cos_g, sin_g = rope
143
+
144
+     def proj(seq, name, nh):
145
+         return linear(seq, P[name]).reshape(-1, nh, HD)
146
+     q_u = proj(und_n, "self_attn.to_q.weight", NH); k_u = proj(und_n, "self_attn.to_k.weight", NKV); v_u = proj(und_n, "self_attn.to_v.weight", NKV)
147
+     q_g = proj(gen_n, "self_attn.add_q_proj.weight", NH); k_g = proj(gen_n, "self_attn.add_k_proj.weight", NKV); v_g = proj(gen_n, "self_attn.add_v_proj.weight", NKV)
148
+     q_u = rms_norm(q_u, P["self_attn.norm_q.weight"], EPS); k_u = rms_norm(k_u, P["self_attn.norm_k.weight"], EPS)
149
+     q_g = rms_norm(q_g, P["self_attn.norm_added_q.weight"], EPS); k_g = rms_norm(k_g, P["self_attn.norm_added_k.weight"], EPS)
150
+     q_u = apply_rope(q_u, cos_u, sin_u); k_u = apply_rope(k_u, cos_u, sin_u)
151
+     q_g = apply_rope(q_g, cos_g, sin_g); k_g = apply_rope(k_g, cos_g, sin_g)
152
+     co = gqa_attention(q_u, k_u, v_u, NH, NKV, True).reshape(-1, NH * HD)
153
+     ak = mx.concatenate([k_u, k_g], axis=0); av = mx.concatenate([v_u, v_g], axis=0)
154
+     fo = gqa_attention(q_g, ak, av, NH, NKV, False).reshape(-1, NH * HD)
155
+     und = und + linear(co, P["self_attn.to_out.weight"])
156
+     gen = gen + linear(fo, P["self_attn.to_add_out.weight"])
157
+     und_m = rms_norm(und, P["post_attention_layernorm.weight"], EPS)
158
+     gen_m = rms_norm(gen, P["post_attention_layernorm_moe_gen.weight"], EPS)
159
+     und = und + linear(silu(linear(und_m, P["mlp.gate_proj.weight"])) * linear(und_m, P["mlp.up_proj.weight"]), P["mlp.down_proj.weight"])
160
+     gen = gen + linear(silu(linear(gen_m, P["mlp_moe_gen.gate_proj.weight"])) * linear(gen_m, P["mlp_moe_gen.up_proj.weight"]), P["mlp_moe_gen.down_proj.weight"])
161
+     return und, gen
162
+
163
+
164
+ # ---- patchify / pack / unpatchify (pure-tensor glue; matches torch methods) ----
165
+ def patchify_pack(latent, p, C):
166
+     """latent [C,T,H,W] -> packed [num_patches, p*p*C], (T, hpat, wpat)."""
167
+     _, T, H, W = latent.shape
168
+     Hp = ((H + p - 1) // p) * p; Wp = ((W + p - 1) // p) * p
169
+     if Hp != H or Wp != W:
170
+         pad = mx.zeros((C, T, Hp, Wp), dtype=latent.dtype)
171
+         pad[:, :, :H, :W] = latent; latent = pad
172
+     hpat, wpat = Hp // p, Wp // p
173
+     latent = latent.reshape(C, T, hpat, p, wpat, p)
174
+     latent = mx.einsum("cthpwq->thwpqc", latent).reshape(-1, p * p * C)
175
+     return latent, (T, hpat, wpat)
176
+
177
+
178
+ def unpatchify(packed, token_shape, orig_hw, p, C):
179
+     """packed [num_patches, p*p*C] -> latent [C, T, H, W]."""
180
+     T, hpat, wpat = token_shape
181
+     H, W = orig_hw
182
+     x = packed.reshape(T, hpat, wpat, p, p, C)
183
+     x = mx.einsum("thwpqc->cthpwq", x).reshape(C, T, hpat * p, wpat * p)
184
+     return x[:, :, :H, :W]
185
+
186
+
187
+ def scatter_timestep_single(tokens, t_embed, n_noisy_tokens):
188
+     """t2i / all-noisy single-item case: add the (broadcast) timestep embed to the first
189
+     n_noisy_tokens rows. General multi-frame scatter handled in the pipeline layer."""
190
+     if t_embed.ndim == 1:
191
+         t_embed = mx.broadcast_to(t_embed[None, :], (n_noisy_tokens, tokens.shape[1]))
192
+     tokens[:n_noisy_tokens] = tokens[:n_noisy_tokens] + t_embed
193
+     return tokens
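As a cross-check for `get_timestep_embedding` in the file above, the same sinusoidal layout can be written in plain Python. This is a sketch that is not part of the commit: it assumes `downscale_freq_shift=0` and the `[cos, sin]` ordering used in the MLX code, and the name `timestep_embedding_ref` is hypothetical.

```python
import math


def timestep_embedding_ref(t, dim=256, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep, laid out as [cos..., sin...]."""
    half = dim // 2
    # frequencies decay geometrically from 1 down towards 1 / max_period
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    angles = [t * f for f in freqs]
    return [math.cos(a) for a in angles] + [math.sin(a) for a in angles]


# mlx_pipeline.py multiplies the scheduler timestep by TS_SCALE = 0.001 first
emb = timestep_embedding_ref(500 * 0.001)
print(len(emb), emb[0], emb[128])
```

Because the first frequency is exactly 1, the first cosine and first sine entries are simply `cos(t)` and `sin(t)` of the scaled timestep, which makes them convenient values to compare against the MLX output.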
mlx_pipeline.py ADDED
@@ -0,0 +1,113 @@
1
+ """End-to-end text2image with the MLX 4-bit transformer + torch pipeline orchestration.
2
+ A torch nn.Module wrapper routes the transformer forward to MLX; everything else (tokenizer,
3
+ UniPC scheduler, CFG, VAE decode) stays in torch (small, fits RAM). The 33GB torch transformer
4
+ is never loaded."""
5
+ import glob, json, os, sys, time
6
+ import numpy as np, torch
7
+ import mlx.core as mx
8
+ from types import SimpleNamespace
9
+ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))  # find mlx_cosmos3.py next to this file
10
+ import mlx_cosmos3 as M
11
+
12
+ NANO = "/Users/studio/cosmos_mlx/models/Cosmos3-Nano"
13
+ EXPORT = "/Users/studio/cosmos_mlx/export/Cosmos3-Nano-MLX-4bit/transformer"
14
+ HID, HD, NH, NKV, NL, EPS = 4096, 128, 32, 8, 36, 1e-6
15
+ P_PATCH, C_LAT, AXES, THETA, TS_SCALE = 2, 48, [24, 20, 20], 5e6, 0.001
16
+
17
+ def _t2m(t): # torch -> mlx
18
+     return mx.array(t.detach().to(torch.float32).cpu().numpy())
19
+ def _m2t(a, dtype=torch.bfloat16):
20
+     return torch.from_numpy(np.array(a.astype(mx.float32))).to(dtype)
21
+
22
+ class MLXCosmos3Transformer(torch.nn.Module):
23
+     def __init__(self, export_dir):
24
+         super().__init__()
25
+         self.W = {}
26
+         for f in sorted(glob.glob(export_dir + "/*.safetensors")):
27
+             self.W.update(mx.load(f))
28
+         cfgd = json.load(open(NANO + "/transformer/config.json"))
29
+         cfgd = {k: v for k, v in cfgd.items() if not k.startswith("_")}
30
+         self.config = SimpleNamespace(**cfgd) # real config -> all fields the pipeline reads
31
+         qc = json.load(open(export_dir + "/mlx_quant_config.json"))
32
+         M.QGROUP, M.QBITS = qc.get("group_size", 64), qc.get("bits", 4) # 4 or 8 bit
33
+         self._dbg = True
34
+         self._dtype = torch.bfloat16
35
+     @property
36
+     def dtype(self): return self._dtype
37
+     @property
38
+     def device(self): return torch.device("cpu")
39
+     def to(self, *a, **k): return self
40
+     def eval(self): return self
41
+
42
+     def _lp(self, i):
43
+         pre = f"layers.{i}."; P = {}
44
+         for k in self.W:
45
+             if not k.startswith(pre) or k.endswith(".scales") or k.endswith(".biases"): continue
46
+             n = k[len(pre):]
47
+             P[n] = (self.W[k], self.W[k + ".scales"], self.W[k + ".biases"]) if k + ".scales" in self.W else self.W[k]
48
+         return P
49
+     def _gv(self, n):
50
+         return (self.W[n], self.W[n + ".scales"], self.W[n + ".biases"]) if n + ".scales" in self.W else self.W[n]
51
+
52
+     @torch.no_grad()
53
+     def forward(self, input_ids, text_indexes, position_ids, und_len, sequence_length,
54
+                 vision_tokens, vision_token_shapes, vision_sequence_indexes, vision_mse_loss_indexes,
55
+                 vision_timesteps, vision_noisy_frame_indexes, **sound_kw):
56
+         W = self.W
57
+         ii = mx.array(input_ids.cpu().numpy().astype(np.int32))
58
+         ti = mx.array(text_indexes.cpu().numpy().astype(np.int32))
59
+         vsi = mx.array(vision_sequence_indexes.cpu().numpy().astype(np.int32))
60
+         vmi = mx.array(vision_mse_loss_indexes.cpu().numpy().astype(np.int32))
61
+         pid = mx.array(position_ids.cpu().numpy().astype(np.int32))
62
+         latent = _t2m(vision_tokens[0]).reshape(C_LAT, *vision_tokens[0].shape[-3:]) # [C,T,H,W]
63
+         H, Wd = int(latent.shape[-2]), int(latent.shape[-1])
64
+         if getattr(self, "_dbg", False):
65
+             print(f"[wrapper] vision_tokens[0].shape={tuple(vision_tokens[0].shape)} latent T={latent.shape[1]} "
66
+                   f"seq_len={sequence_length} und_len={und_len} mse_idx={vision_mse_loss_indexes.shape} "
67
+                   f"token_shapes={vision_token_shapes} noisy={[ (x.tolist() if hasattr(x,'tolist') else x) for x in vision_noisy_frame_indexes]}", flush=True)
68
+             self._dbg = False
69
+         tstep = float(vision_timesteps[0].item())
70
+
71
+         emb = W["embed_tokens.weight"][ii]
72
+         hidden = mx.zeros((sequence_length, HID), dtype=emb.dtype)
73
+         hidden[ti] = emb
74
+         packed, shape = M.patchify_pack(latent, P_PATCH, C_LAT)
75
+         packed = M.linear(packed.astype(emb.dtype), self._gv("proj_in.weight"), W["proj_in.bias"])
76
+         te = M.get_timestep_embedding(mx.array([tstep * TS_SCALE]))
77
+         te = M.timestep_embedder(te, W["time_embedder.linear_1.weight"], W["time_embedder.linear_1.bias"],
78
+                                  W["time_embedder.linear_2.weight"], W["time_embedder.linear_2.bias"])[0].astype(emb.dtype)
79
+         packed = M.scatter_timestep_single(packed, te, packed.shape[0]) # t2i: all vision tokens noisy
80
+         hidden[vsi] = packed
81
+         cos, sin = M.RoPE3D(HD, THETA, AXES)(pid)
82
+         cos = cos.astype(emb.dtype); sin = sin.astype(emb.dtype)
83
+         und, gen = hidden[:und_len], hidden[und_len:]
84
+         rope = (cos[:und_len], sin[:und_len], cos[und_len:], sin[und_len:])
85
+         for i in range(NL):
86
+             und, gen = M.decoder_layer(und, gen, rope, self._lp(i), (NH, NKV, HD, EPS)); mx.eval(und, gen)
87
+         und = M.rms_norm(und, W["norm.weight"], EPS); gen = M.rms_norm(gen, W["norm_moe_gen.weight"], EPS)
88
+         last = mx.concatenate([und, gen], axis=0)
89
+         preds = M.linear(last[vmi], self._gv("proj_out.weight"), W["proj_out.bias"])
90
+         out = M.unpatchify(preds, shape, (H, Wd), P_PATCH, C_LAT); mx.eval(out)
91
+         return [_m2t(out, vision_tokens[0].dtype).unsqueeze(0)], None
92
+
93
+
94
+ if __name__ == "__main__":
95
+     from diffusers import Cosmos3OmniPipeline, AutoencoderKLWan, UniPCMultistepScheduler
96
+     from diffusers.models.autoencoders.autoencoder_cosmos3_audio import Cosmos3AVAEAudioTokenizer
97
+     from transformers import AutoTokenizer
98
+     dev = "cpu"
99
+     print("loading components (no torch transformer)...")
100
+     vae = AutoencoderKLWan.from_pretrained(NANO, subfolder="vae", torch_dtype=torch.float32).to(dev).eval()
101
+     sched = UniPCMultistepScheduler.from_pretrained(NANO, subfolder="scheduler")
102
+     tok = AutoTokenizer.from_pretrained(NANO, subfolder="text_tokenizer")
103
+     st = Cosmos3AVAEAudioTokenizer.from_pretrained(NANO, subfolder="sound_tokenizer", torch_dtype=torch.float32).to(dev).eval()
104
+     tf = MLXCosmos3Transformer(EXPORT)
105
+     pipe = Cosmos3OmniPipeline(transformer=tf, text_tokenizer=tok, vae=vae, scheduler=sched,
106
+                                sound_tokenizer=st, enable_safety_checker=False)
107
+     print("generating (MLX 4-bit transformer)...")
108
+     t0 = time.time()
109
+     out = pipe(prompt="A red panda astronaut floating in a nebula, highly detailed", num_frames=1,
110
+                height=256, width=256, num_inference_steps=20, generator=torch.Generator().manual_seed(1))
111
+     img = out.video[0][0] if isinstance(out.video[0], list) else out.video[0]
112
+     img.save("/Users/studio/cosmos_mlx/work/mlx_t2i.png")
113
+     print(f"GENERATED in {time.time()-t0:.0f}s -> mlx_t2i.png ({img.size})")
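The `_lp` and `_gv` helpers in the file above rely on a naming convention: a quantized weight `name` has sibling entries `name.scales` and `name.biases`, while unquantized tensors stand alone. A minimal sketch of that grouping, with plain strings standing in for arrays; the function name `layer_params` and the toy keys are illustrative and not part of the repo:

```python
def layer_params(weights, layer_idx):
    """Collect one layer's tensors, bundling quantized ones as (wq, scales, biases)."""
    prefix = f"layers.{layer_idx}."
    params = {}
    for key, value in weights.items():
        if not key.startswith(prefix) or key.endswith((".scales", ".biases")):
            continue  # another layer, or metadata that gets attached below
        name = key[len(prefix):]
        if key + ".scales" in weights:
            params[name] = (value, weights[key + ".scales"], weights[key + ".biases"])
        else:
            params[name] = value
    return params


# toy checkpoint: strings stand in for mx arrays
toy = {
    "layers.0.mlp.up_proj.weight": "wq",
    "layers.0.mlp.up_proj.weight.scales": "s",
    "layers.0.mlp.up_proj.weight.biases": "b",
    "layers.0.input_layernorm.weight": "bf16",
    "layers.1.input_layernorm.weight": "other-layer",
}
print(layer_params(toy, 0))
```

Returning a tuple for quantized entries is what lets a single `linear` function dispatch on `isinstance(w, tuple)` without a separate flag per weight.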
model_index.json ADDED
@@ -0,0 +1,28 @@
1
+ {
2
+ "_class_name": "Cosmos3OmniDiffusersPipeline",
3
+ "_diffusers_version": "0.37.1",
4
+ "scheduler": [
5
+ "diffusers",
6
+ "UniPCMultistepScheduler"
7
+ ],
8
+ "text_tokenizer": [
9
+ "transformers",
10
+ "Qwen2TokenizerFast"
11
+ ],
12
+ "transformer": [
13
+ "diffusers",
14
+ "Cosmos3OmniTransformer"
15
+ ],
16
+ "vae": [
17
+ "diffusers",
18
+ "AutoencoderKLWan"
19
+ ],
20
+ "vision_encoder": [
21
+ "transformers",
22
+ "Qwen3VLVisionModel"
23
+ ],
24
+ "sound_tokenizer": [
25
+ "diffusers",
26
+ "Cosmos3AVAEAudioTokenizer"
27
+ ]
28
+ }
samples/anime.png ADDED

Git LFS Details

  • SHA256: 7a051daf2c86da6cf3d0f4770746a45270d29640905157c52c2e7a6902504d47
  • Pointer size: 131 Bytes
  • Size of remote file: 259 kB
samples/barista.png ADDED

Git LFS Details

  • SHA256: 82a672bb4c89e6c69c4f4d26e952a13e4d3030b878b91c63fd2a2917e2bd9a59
  • Pointer size: 131 Bytes
  • Size of remote file: 212 kB
samples/city.png ADDED

Git LFS Details

  • SHA256: 0db0015f32883f0bb27766515a25d108883ef4d3e112ffa0513ef220f1294403
  • Pointer size: 131 Bytes
  • Size of remote file: 284 kB
samples/food.png ADDED

Git LFS Details

  • SHA256: d9c2957b59170cb39bf3517bea7e524c1b10390acbee3a8245edf6aa0491b917
  • Pointer size: 131 Bytes
  • Size of remote file: 205 kB
samples/panda.png ADDED

Git LFS Details

  • SHA256: 8a81d1b9113d2aab697b686c2cc8539fa05c4e00c092c00ec82622c99d91ef21
  • Pointer size: 131 Bytes
  • Size of remote file: 206 kB
samples/portrait.png ADDED

Git LFS Details

  • SHA256: d83e3381364fc4d207bee74d4295cd5a530e4c6392e4a53b98d299bb055bda10
  • Pointer size: 131 Bytes
  • Size of remote file: 261 kB
samples/t2v_f0.png ADDED
samples/t2v_f16.png ADDED

Git LFS Details

  • SHA256: da05696339f74176ca26e3e8e2951134c1ecf82430d3b7e15beac3620fd64895
  • Pointer size: 131 Bytes
  • Size of remote file: 109 kB
samples/t2v_f8.png ADDED

Git LFS Details

  • SHA256: d2a00bbf8292e18e87e622e3f82f711743e3a4ba963451e722e949b46ba36574
  • Pointer size: 131 Bytes
  • Size of remote file: 121 kB
samples/t2v_waves.mp4 ADDED
Binary file (98.1 kB).
 
scheduler/scheduler_config.json ADDED
@@ -0,0 +1,33 @@
1
+ {
2
+ "_class_name": "UniPCMultistepScheduler",
3
+ "_diffusers_version": "0.37.1",
4
+ "beta_end": 0.02,
5
+ "beta_schedule": "linear",
6
+ "beta_start": 0.0001,
7
+ "disable_corrector": [],
8
+ "dynamic_thresholding_ratio": 0.995,
9
+ "final_sigmas_type": "zero",
10
+ "flow_shift": 1.0,
11
+ "lower_order_final": true,
12
+ "num_train_timesteps": 1000,
13
+ "predict_x0": true,
14
+ "prediction_type": "flow_prediction",
15
+ "rescale_betas_zero_snr": false,
16
+ "sample_max_value": 1.0,
17
+ "shift_terminal": null,
18
+ "sigma_max": 200.0,
19
+ "sigma_min": 0.147,
20
+ "solver_order": 2,
21
+ "solver_p": null,
22
+ "solver_type": "bh2",
23
+ "steps_offset": 0,
24
+ "thresholding": false,
25
+ "time_shift_type": "exponential",
26
+ "timestep_spacing": "linspace",
27
+ "trained_betas": null,
28
+ "use_beta_sigmas": false,
29
+ "use_dynamic_shifting": false,
30
+ "use_exponential_sigmas": false,
31
+ "use_flow_sigmas": true,
32
+ "use_karras_sigmas": true
33
+ }
sound_tokenizer/config.json ADDED
@@ -0,0 +1,64 @@
1
+ {
2
+ "model_type": "autoencoder_v2",
3
+ "sampling_rate": 48000,
4
+ "stereo": true,
5
+ "use_wav_as_input": true,
6
+ "normalize_volume": true,
7
+ "hop_size": 1920,
8
+ "input_channels": 1,
9
+ "enc_type": "spec_convnext",
10
+ "enc_dim": 192,
11
+ "enc_intermediate_dim": 768,
12
+ "enc_num_layers": 12,
13
+ "enc_num_blocks": 2,
14
+ "enc_n_fft": 64,
15
+ "enc_hop_length": 16,
16
+ "enc_latent_dim": 128,
17
+ "enc_c_mults": [
18
+ 1,
19
+ 2,
20
+ 4
21
+ ],
22
+ "enc_strides": [
23
+ 4,
24
+ 5,
25
+ 6
26
+ ],
27
+ "enc_identity_init": false,
28
+ "enc_use_snake": true,
29
+ "dec_type": "oobleck",
30
+ "dec_dim": 320,
31
+ "dec_c_mults": [
32
+ 1,
33
+ 2,
34
+ 4,
35
+ 8,
36
+ 16
37
+ ],
38
+ "dec_strides": [
39
+ 2,
40
+ 4,
41
+ 5,
42
+ 6,
43
+ 8
44
+ ],
45
+ "dec_use_snake": true,
46
+ "dec_final_tanh": false,
47
+ "dec_out_channels": 2,
48
+ "dec_anti_aliasing": false,
49
+ "dec_use_nearest_upsample": false,
50
+ "dec_use_tanh_at_final": false,
51
+ "bottleneck_type": "vae",
52
+ "bottleneck": {
53
+ "type": "vae"
54
+ },
55
+ "activation": "snakebeta",
56
+ "snake_logscale": true,
57
+ "anti_aliasing": false,
58
+ "use_cuda_kernel": false,
59
+ "causal": false,
60
+ "padding_mode": "zeros",
61
+ "vocoder_input_dim": 64,
62
+ "latent_mean": null,
63
+ "latent_std": null
64
+ }
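A quick arithmetic check on the sound tokenizer config above. The reading that `hop_size` is the number of audio samples per latent frame, and that it equals the product of the stride factors, is inferred from the numbers rather than stated anywhere in this commit, so treat it as a sketch:

```python
import math

# values copied from sound_tokenizer/config.json
sampling_rate = 48_000
hop_size = 1_920
enc_hop_length = 16
enc_strides = [4, 5, 6]
dec_strides = [2, 4, 5, 6, 8]

# encoder side: spectrogram hop times the strided downsampling stages
enc_total = enc_hop_length * math.prod(enc_strides)
# decoder side: product of the upsampling strides
dec_total = math.prod(dec_strides)
# latent frames per second of audio, under the interpretation above
latent_rate = sampling_rate / hop_size

print(enc_total, dec_total, latent_rate)
```

Both products come out to 1920, matching `hop_size`, which corresponds to 25 latent frames per second at 48 kHz.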
sound_tokenizer/diffusion_pytorch_model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9d4c61cde38acfb0cad9048a140c3533750277a8462b19dc08450d9fe1ad9879
3
+ size 1892409600
text_tokenizer/added_tokens.json ADDED
@@ -0,0 +1,28 @@
1
+ {
2
+ "</think>": 151668,
3
+ "</tool_call>": 151658,
4
+ "</tool_response>": 151666,
5
+ "<think>": 151667,
6
+ "<tool_call>": 151657,
7
+ "<tool_response>": 151665,
8
+ "<|box_end|>": 151649,
9
+ "<|box_start|>": 151648,
10
+ "<|endoftext|>": 151643,
11
+ "<|file_sep|>": 151664,
12
+ "<|fim_middle|>": 151660,
13
+ "<|fim_pad|>": 151662,
14
+ "<|fim_prefix|>": 151659,
15
+ "<|fim_suffix|>": 151661,
16
+ "<|im_end|>": 151645,
17
+ "<|im_start|>": 151644,
18
+ "<|image_pad|>": 151655,
19
+ "<|object_ref_end|>": 151647,
20
+ "<|object_ref_start|>": 151646,
21
+ "<|quad_end|>": 151651,
22
+ "<|quad_start|>": 151650,
23
+ "<|repo_name|>": 151663,
24
+ "<|video_pad|>": 151656,
25
+ "<|vision_end|>": 151653,
26
+ "<|vision_pad|>": 151654,
27
+ "<|vision_start|>": 151652
28
+ }
text_tokenizer/chat_template.jinja ADDED
@@ -0,0 +1,120 @@
1
+ {%- if tools %}
2
+ {{- '<|im_start|>system\n' }}
3
+ {%- if messages[0].role == 'system' %}
4
+ {%- if messages[0].content is string %}
5
+ {{- messages[0].content }}
6
+ {%- else %}
7
+ {%- for content in messages[0].content %}
8
+ {%- if 'text' in content %}
9
+ {{- content.text }}
10
+ {%- endif %}
11
+ {%- endfor %}
12
+ {%- endif %}
13
+ {{- '\n\n' }}
14
+ {%- endif %}
15
+ {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
16
+ {%- for tool in tools %}
17
+ {{- "\n" }}
18
+ {{- tool | tojson }}
19
+ {%- endfor %}
20
+ {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
21
+ {%- else %}
22
+ {%- if messages[0].role == 'system' %}
23
+ {{- '<|im_start|>system\n' }}
24
+ {%- if messages[0].content is string %}
25
+ {{- messages[0].content }}
26
+ {%- else %}
27
+ {%- for content in messages[0].content %}
28
+ {%- if 'text' in content %}
29
+ {{- content.text }}
30
+ {%- endif %}
31
+ {%- endfor %}
32
+ {%- endif %}
33
+ {{- '<|im_end|>\n' }}
34
+ {%- endif %}
35
+ {%- endif %}
36
+ {%- set image_count = namespace(value=0) %}
37
+ {%- set video_count = namespace(value=0) %}
38
+ {%- for message in messages %}
39
+ {%- if message.role == "user" %}
40
+ {{- '<|im_start|>' + message.role + '\n' }}
41
+ {%- if message.content is string %}
42
+ {{- message.content }}
43
+ {%- else %}
44
+ {%- for content in message.content %}
45
+ {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}
46
+ {%- set image_count.value = image_count.value + 1 %}
47
+ {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}
48
+ <|vision_start|><|image_pad|><|vision_end|>
49
+ {%- elif content.type == 'video' or 'video' in content %}
50
+ {%- set video_count.value = video_count.value + 1 %}
51
+ {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}
52
+ <|vision_start|><|video_pad|><|vision_end|>
53
+ {%- elif 'text' in content %}
54
+ {{- content.text }}
55
+ {%- endif %}
56
+ {%- endfor %}
57
+ {%- endif %}
58
+ {{- '<|im_end|>\n' }}
59
+ {%- elif message.role == "assistant" %}
60
+ {{- '<|im_start|>' + message.role + '\n' }}
61
+ {%- if message.content is string %}
62
+ {{- message.content }}
63
+ {%- else %}
64
+ {%- for content_item in message.content %}
65
+ {%- if 'text' in content_item %}
66
+ {{- content_item.text }}
67
+ {%- endif %}
68
+ {%- endfor %}
69
+ {%- endif %}
70
+ {%- if message.tool_calls %}
71
+ {%- for tool_call in message.tool_calls %}
72
+ {%- if (loop.first and message.content) or (not loop.first) %}
73
+ {{- '\n' }}
74
+ {%- endif %}
75
+ {%- if tool_call.function %}
76
+ {%- set tool_call = tool_call.function %}
77
+ {%- endif %}
78
+ {{- '<tool_call>\n{"name": "' }}
79
+ {{- tool_call.name }}
80
+ {{- '", "arguments": ' }}
81
+ {%- if tool_call.arguments is string %}
82
+ {{- tool_call.arguments }}
83
+ {%- else %}
84
+ {{- tool_call.arguments | tojson }}
85
+ {%- endif %}
86
+ {{- '}\n</tool_call>' }}
87
+ {%- endfor %}
88
+ {%- endif %}
89
+ {{- '<|im_end|>\n' }}
90
+ {%- elif message.role == "tool" %}
91
+ {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
92
+ {{- '<|im_start|>user' }}
93
+ {%- endif %}
94
+ {{- '\n<tool_response>\n' }}
95
+ {%- if message.content is string %}
96
+ {{- message.content }}
97
+ {%- else %}
98
+ {%- for content in message.content %}
99
+ {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}
100
+ {%- set image_count.value = image_count.value + 1 %}
101
+ {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}
102
+ <|vision_start|><|image_pad|><|vision_end|>
103
+ {%- elif content.type == 'video' or 'video' in content %}
104
+ {%- set video_count.value = video_count.value + 1 %}
105
+ {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}
106
+ <|vision_start|><|video_pad|><|vision_end|>
107
+ {%- elif 'text' in content %}
108
+ {{- content.text }}
109
+ {%- endif %}
110
+ {%- endfor %}
111
+ {%- endif %}
112
+ {{- '\n</tool_response>' }}
113
+ {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
114
+ {{- '<|im_end|>\n' }}
115
+ {%- endif %}
116
+ {%- endif %}
117
+ {%- endfor %}
118
+ {%- if add_generation_prompt %}
119
+ {{- '<|im_start|>assistant\n' }}
120
+ {%- endif %}
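The tool-calling branches of the template above can be sketched in plain Python. This is a minimal illustration, not the tokenizer's actual rendering path: `render_tool_call` and `render_tool_turn` are hypothetical helper names, `json.dumps` stands in for Jinja's `tojson` filter (which may escape or space its output differently), and message content is assumed to be a plain string.

```python
import json


def render_tool_call(name, arguments):
    """Mirror the template's <tool_call> block for a single call."""
    # String arguments are emitted verbatim; anything else is JSON-encoded.
    args = arguments if isinstance(arguments, str) else json.dumps(arguments)
    return '<tool_call>\n{"name": "' + name + '", "arguments": ' + args + '}\n</tool_call>'


def render_tool_turn(contents):
    """Mirror how consecutive tool messages share one user turn."""
    out = "<|im_start|>user"  # opened once, before the first tool message
    for content in contents:
        out += "\n<tool_response>\n" + content + "\n</tool_response>"
    return out + "<|im_end|>\n"  # closed once, after the last tool message


print(render_tool_call("get_weather", {"city": "Paris"}))
print(render_tool_turn(["sunny", "18 C"]))
```

The grouping in `render_tool_turn` reflects the `loop.first` / `loop.last` checks in the template: a run of tool messages is wrapped in a single `<|im_start|>user` ... `<|im_end|>` pair rather than one pair per message.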
text_tokenizer/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
text_tokenizer/special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|object_ref_start|>",
+     "<|object_ref_end|>",
+     "<|box_start|>",
+     "<|box_end|>",
+     "<|quad_start|>",
+     "<|quad_end|>",
+     "<|vision_start|>",
+     "<|vision_end|>",
+     "<|vision_pad|>",
+     "<|image_pad|>",
+     "<|video_pad|>"
+   ],
+   "eos_token": {
+     "content": "<|im_end|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
text_tokenizer/tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:aeb13307a71acd8fe81861d94ad54ab689df773318809eed3cbe794b4492dae4
+ size 11422654
text_tokenizer/tokenizer_config.json ADDED
@@ -0,0 +1,239 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_prefix_space": false,
4
+ "added_tokens_decoder": {
5
+ "151643": {
6
+ "content": "<|endoftext|>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "151644": {
14
+ "content": "<|im_start|>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "151645": {
22
+ "content": "<|im_end|>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "151646": {
30
+ "content": "<|object_ref_start|>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ },
37
+ "151647": {
38
+ "content": "<|object_ref_end|>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": true
44
+ },
45
+ "151648": {
46
+ "content": "<|box_start|>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false,
51
+ "special": true
52
+ },
53
+ "151649": {
54
+ "content": "<|box_end|>",
55
+ "lstrip": false,
56
+ "normalized": false,
57
+ "rstrip": false,
58
+ "single_word": false,
59
+ "special": true
60
+ },
61
+ "151650": {
62
+ "content": "<|quad_start|>",
63
+ "lstrip": false,
64
+ "normalized": false,
65
+ "rstrip": false,
66
+ "single_word": false,
67
+ "special": true
68
+ },
69
+ "151651": {
70
+ "content": "<|quad_end|>",
71
+ "lstrip": false,
72
+ "normalized": false,
73
+ "rstrip": false,
74
+ "single_word": false,
75
+ "special": true
76
+ },
77
+ "151652": {
78
+ "content": "<|vision_start|>",
79
+ "lstrip": false,
80
+ "normalized": false,
81
+ "rstrip": false,
82
+ "single_word": false,
83
+ "special": true
84
+ },
85
+ "151653": {
86
+ "content": "<|vision_end|>",
87
+ "lstrip": false,
88
+ "normalized": false,
89
+ "rstrip": false,
90
+ "single_word": false,
91
+ "special": true
92
+ },
93
+ "151654": {
94
+ "content": "<|vision_pad|>",
95
+ "lstrip": false,
96
+ "normalized": false,
97
+ "rstrip": false,
98
+ "single_word": false,
99
+ "special": true
100
+ },
101
+ "151655": {
102
+ "content": "<|image_pad|>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false,
107
+ "special": true
108
+ },
109
+ "151656": {
110
+ "content": "<|video_pad|>",
111
+ "lstrip": false,
112
+ "normalized": false,
113
+ "rstrip": false,
114
+ "single_word": false,
115
+ "special": true
116
+ },
117
+ "151657": {
118
+ "content": "<tool_call>",
119
+ "lstrip": false,
120
+ "normalized": false,
121
+ "rstrip": false,
122
+ "single_word": false,
123
+ "special": false
124
+ },
125
+ "151658": {
126
+ "content": "</tool_call>",
127
+ "lstrip": false,
128
+ "normalized": false,
129
+ "rstrip": false,
130
+ "single_word": false,
131
+ "special": false
132
+ },
133
+ "151659": {
134
+ "content": "<|fim_prefix|>",
135
+ "lstrip": false,
136
+ "normalized": false,
137
+ "rstrip": false,
138
+ "single_word": false,
139
+ "special": false
140
+ },
141
+ "151660": {
142
+ "content": "<|fim_middle|>",
143
+ "lstrip": false,
144
+ "normalized": false,
145
+ "rstrip": false,
146
+ "single_word": false,
147
+ "special": false
148
+ },
149
+ "151661": {
150
+ "content": "<|fim_suffix|>",
151
+ "lstrip": false,
152
+ "normalized": false,
153
+ "rstrip": false,
154
+ "single_word": false,
155
+ "special": false
156
+ },
157
+ "151662": {
158
+ "content": "<|fim_pad|>",
159
+ "lstrip": false,
160
+ "normalized": false,
161
+ "rstrip": false,
162
+ "single_word": false,
163
+ "special": false
164
+ },
165
+ "151663": {
166
+ "content": "<|repo_name|>",
167
+ "lstrip": false,
168
+ "normalized": false,
169
+ "rstrip": false,
170
+ "single_word": false,
171
+ "special": false
172
+ },
173
+ "151664": {
174
+ "content": "<|file_sep|>",
175
+ "lstrip": false,
176
+ "normalized": false,
177
+ "rstrip": false,
178
+ "single_word": false,
179
+ "special": false
180
+ },
181
+ "151665": {
182
+ "content": "<tool_response>",
183
+ "lstrip": false,
184
+ "normalized": false,
185
+ "rstrip": false,
186
+ "single_word": false,
187
+ "special": false
188
+ },
189
+ "151666": {
190
+ "content": "</tool_response>",
191
+ "lstrip": false,
192
+ "normalized": false,
193
+ "rstrip": false,
194
+ "single_word": false,
195
+ "special": false
196
+ },
197
+ "151667": {
198
+ "content": "<think>",
199
+ "lstrip": false,
200
+ "normalized": false,
201
+ "rstrip": false,
202
+ "single_word": false,
203
+ "special": false
204
+ },
205
+ "151668": {
206
+ "content": "</think>",
207
+ "lstrip": false,
208
+ "normalized": false,
209
+ "rstrip": false,
210
+ "single_word": false,
211
+ "special": false
212
+ }
213
+ },
214
+ "additional_special_tokens": [
215
+ "<|im_start|>",
216
+ "<|im_end|>",
217
+ "<|object_ref_start|>",
218
+ "<|object_ref_end|>",
219
+ "<|box_start|>",
220
+ "<|box_end|>",
221
+ "<|quad_start|>",
222
+ "<|quad_end|>",
223
+ "<|vision_start|>",
224
+ "<|vision_end|>",
225
+ "<|vision_pad|>",
226
+ "<|image_pad|>",
227
+ "<|video_pad|>"
228
+ ],
229
+ "bos_token": null,
230
+ "clean_up_tokenization_spaces": false,
231
+ "eos_token": "<|im_end|>",
232
+ "errors": "replace",
233
+ "extra_special_tokens": {},
234
+ "model_max_length": 262144,
235
+ "pad_token": "<|endoftext|>",
236
+ "split_special_tokens": false,
237
+ "tokenizer_class": "Qwen2Tokenizer",
238
+ "unk_token": null
239
+ }
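To make the relationship between `added_tokens_decoder`, `eos_token`, and `pad_token` concrete, here is a small sketch that resolves token strings to IDs. It embeds only a three-entry excerpt of the config above (the full file covers IDs 151643 through 151668), and the helper names `token_ids` and `special_tokens` are illustrative, not part of any tokenizer library.

```python
import json

# Excerpt of tokenizer_config.json; fields not needed here are omitted.
CONFIG_EXCERPT = json.loads("""
{
  "added_tokens_decoder": {
    "151643": {"content": "<|endoftext|>", "special": true},
    "151645": {"content": "<|im_end|>", "special": true},
    "151657": {"content": "<tool_call>", "special": false}
  },
  "eos_token": "<|im_end|>",
  "pad_token": "<|endoftext|>"
}
""")


def token_ids(config):
    """Map each added token's text to its integer ID."""
    decoder = config["added_tokens_decoder"]
    return {entry["content"]: int(idx) for idx, entry in decoder.items()}


def special_tokens(config):
    """Return the texts of tokens flagged "special": true, in ID order."""
    decoder = config["added_tokens_decoder"]
    return [decoder[k]["content"] for k in sorted(decoder, key=int) if decoder[k]["special"]]


ids = token_ids(CONFIG_EXCERPT)
eos_id = ids[CONFIG_EXCERPT["eos_token"]]
pad_id = ids[CONFIG_EXCERPT["pad_token"]]
print(eos_id, pad_id, special_tokens(CONFIG_EXCERPT))
```

Note that `<tool_call>`, `<think>`, and the FIM tokens carry `"special": false` in the config. In Hugging Face tokenizers this usually means they survive decoding even when special tokens are skipped, though that behavior is an assumption about the consuming library rather than something the config itself states.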
text_tokenizer/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
transformer/mlx_quant_config.json ADDED
@@ -0,0 +1,474 @@
1
+ {
2
+ "group_size": 64,
3
+ "bits": 4,
4
+ "quantized": [
5
+ "layers.0.mlp.down_proj.weight",
6
+ "layers.0.mlp.gate_proj.weight",
7
+ "layers.0.mlp.up_proj.weight",
8
+ "layers.0.mlp_moe_gen.down_proj.weight",
9
+ "layers.0.mlp_moe_gen.gate_proj.weight",
10
+ "layers.0.mlp_moe_gen.up_proj.weight",
11
+ "layers.0.self_attn.add_k_proj.weight",
12
+ "layers.0.self_attn.add_q_proj.weight",
13
+ "layers.0.self_attn.add_v_proj.weight",
14
+ "layers.0.self_attn.to_k.weight",
15
+ "layers.0.self_attn.to_out.weight",
16
+ "layers.0.self_attn.to_q.weight",
17
+ "layers.0.self_attn.to_v.weight",
18
+ "layers.1.mlp.down_proj.weight",
19
+ "layers.1.mlp.gate_proj.weight",
20
+ "layers.1.mlp.up_proj.weight",
21
+ "layers.1.mlp_moe_gen.down_proj.weight",
22
+ "layers.1.mlp_moe_gen.gate_proj.weight",
23
+ "layers.1.mlp_moe_gen.up_proj.weight",
24
+ "layers.1.self_attn.add_k_proj.weight",
25
+ "layers.1.self_attn.add_q_proj.weight",
26
+ "layers.1.self_attn.add_v_proj.weight",
27
+ "layers.1.self_attn.to_k.weight",
28
+ "layers.1.self_attn.to_out.weight",
29
+ "layers.1.self_attn.to_q.weight",
30
+ "layers.1.self_attn.to_v.weight",
31
+ "layers.2.mlp.down_proj.weight",
32
+ "layers.2.mlp.gate_proj.weight",
33
+ "layers.2.mlp.up_proj.weight",
34
+ "layers.2.mlp_moe_gen.down_proj.weight",
35
+ "layers.2.mlp_moe_gen.gate_proj.weight",
36
+ "layers.2.mlp_moe_gen.up_proj.weight",
37
+ "layers.2.self_attn.add_k_proj.weight",
38
+ "layers.2.self_attn.add_q_proj.weight",
39
+ "layers.2.self_attn.add_v_proj.weight",
40
+ "layers.2.self_attn.to_k.weight",
41
+ "layers.2.self_attn.to_out.weight",
42
+ "layers.2.self_attn.to_q.weight",
43
+ "layers.2.self_attn.to_v.weight",
44
+ "layers.3.mlp.down_proj.weight",
45
+ "layers.3.mlp.gate_proj.weight",
46
+ "layers.3.mlp.up_proj.weight",
47
+ "layers.3.mlp_moe_gen.down_proj.weight",
48
+ "layers.3.mlp_moe_gen.gate_proj.weight",
49
+ "layers.3.mlp_moe_gen.up_proj.weight",
50
+ "layers.3.self_attn.add_k_proj.weight",
51
+ "layers.3.self_attn.add_q_proj.weight",
52
+ "layers.3.self_attn.add_v_proj.weight",
53
+ "layers.3.self_attn.to_k.weight",
54
+ "layers.3.self_attn.to_out.weight",
55
+ "layers.3.self_attn.to_q.weight",
56
+ "layers.3.self_attn.to_v.weight",
57
+ "layers.4.mlp.down_proj.weight",
58
+ "layers.4.mlp.gate_proj.weight",
59
+ "layers.4.mlp.up_proj.weight",
60
+ "layers.4.mlp_moe_gen.gate_proj.weight",
61
+ "layers.4.self_attn.add_k_proj.weight",
62
+ "layers.4.self_attn.add_q_proj.weight",
63
+ "layers.4.self_attn.add_v_proj.weight",
64
+ "layers.4.self_attn.to_k.weight",
65
+ "layers.4.self_attn.to_out.weight",
66
+ "layers.4.self_attn.to_q.weight",
67
+ "layers.4.self_attn.to_v.weight",
68
+ "layers.10.mlp.down_proj.weight",
69
+ "layers.10.mlp.gate_proj.weight",
70
+ "layers.10.mlp.up_proj.weight",
71
+ "layers.10.mlp_moe_gen.down_proj.weight",
72
+ "layers.10.mlp_moe_gen.gate_proj.weight",
73
+ "layers.10.mlp_moe_gen.up_proj.weight",
74
+ "layers.10.self_attn.add_k_proj.weight",
75
+ "layers.10.self_attn.add_q_proj.weight",
76
+ "layers.10.self_attn.add_v_proj.weight",
77
+ "layers.10.self_attn.to_k.weight",
78
+ "layers.10.self_attn.to_out.weight",
79
+ "layers.10.self_attn.to_q.weight",
80
+ "layers.10.self_attn.to_v.weight",
81
+ "layers.11.self_attn.add_k_proj.weight",
82
+ "layers.11.self_attn.add_q_proj.weight",
83
+ "layers.11.self_attn.add_v_proj.weight",
84
+ "layers.11.self_attn.to_k.weight",
85
+ "layers.11.self_attn.to_out.weight",
86
+ "layers.11.self_attn.to_q.weight",
87
+ "layers.11.self_attn.to_v.weight",
88
+ "layers.4.mlp_moe_gen.down_proj.weight",
89
+ "layers.4.mlp_moe_gen.up_proj.weight",
90
+ "layers.5.mlp.down_proj.weight",
91
+ "layers.5.mlp.gate_proj.weight",
92
+ "layers.5.mlp.up_proj.weight",
93
+ "layers.5.mlp_moe_gen.down_proj.weight",
94
+ "layers.5.mlp_moe_gen.gate_proj.weight",
95
+ "layers.5.mlp_moe_gen.up_proj.weight",
96
+ "layers.5.self_attn.add_k_proj.weight",
97
+ "layers.5.self_attn.add_q_proj.weight",
98
+ "layers.5.self_attn.add_v_proj.weight",
99
+ "layers.5.self_attn.to_k.weight",
100
+ "layers.5.self_attn.to_out.weight",
101
+ "layers.5.self_attn.to_q.weight",
102
+ "layers.5.self_attn.to_v.weight",
103
+ "layers.6.mlp.down_proj.weight",
104
+ "layers.6.mlp.gate_proj.weight",
105
+ "layers.6.mlp.up_proj.weight",
106
+ "layers.6.mlp_moe_gen.down_proj.weight",
107
+ "layers.6.mlp_moe_gen.gate_proj.weight",
108
+ "layers.6.mlp_moe_gen.up_proj.weight",
109
+ "layers.6.self_attn.add_k_proj.weight",
110
+ "layers.6.self_attn.add_q_proj.weight",
111
+ "layers.6.self_attn.add_v_proj.weight",
112
+ "layers.6.self_attn.to_k.weight",
113
+ "layers.6.self_attn.to_out.weight",
114
+ "layers.6.self_attn.to_q.weight",
115
+ "layers.6.self_attn.to_v.weight",
116
+ "layers.7.mlp.down_proj.weight",
117
+ "layers.7.mlp.gate_proj.weight",
118
+ "layers.7.mlp.up_proj.weight",
119
+ "layers.7.mlp_moe_gen.down_proj.weight",
120
+ "layers.7.mlp_moe_gen.gate_proj.weight",
121
+ "layers.7.mlp_moe_gen.up_proj.weight",
122
+ "layers.7.self_attn.add_k_proj.weight",
123
+ "layers.7.self_attn.add_q_proj.weight",
124
+ "layers.7.self_attn.add_v_proj.weight",
125
+ "layers.7.self_attn.to_k.weight",
126
+ "layers.7.self_attn.to_out.weight",
127
+ "layers.7.self_attn.to_q.weight",
128
+ "layers.7.self_attn.to_v.weight",
129
+ "layers.8.mlp.down_proj.weight",
130
+ "layers.8.mlp.gate_proj.weight",
131
+ "layers.8.mlp.up_proj.weight",
132
+ "layers.8.mlp_moe_gen.down_proj.weight",
133
+ "layers.8.mlp_moe_gen.gate_proj.weight",
134
+ "layers.8.mlp_moe_gen.up_proj.weight",
135
+ "layers.8.self_attn.add_k_proj.weight",
136
+ "layers.8.self_attn.add_q_proj.weight",
137
+ "layers.8.self_attn.add_v_proj.weight",
138
+ "layers.8.self_attn.to_k.weight",
139
+ "layers.8.self_attn.to_out.weight",
140
+ "layers.8.self_attn.to_q.weight",
141
+ "layers.8.self_attn.to_v.weight",
142
+ "layers.9.mlp.down_proj.weight",
143
+ "layers.9.mlp.gate_proj.weight",
144
+ "layers.9.mlp.up_proj.weight",
145
+ "layers.9.mlp_moe_gen.down_proj.weight",
146
+ "layers.9.mlp_moe_gen.gate_proj.weight",
147
+ "layers.9.mlp_moe_gen.up_proj.weight",
148
+ "layers.9.self_attn.add_k_proj.weight",
149
+ "layers.9.self_attn.add_q_proj.weight",
150
+ "layers.9.self_attn.add_v_proj.weight",
151
+ "layers.9.self_attn.to_k.weight",
152
+ "layers.9.self_attn.to_out.weight",
153
+ "layers.9.self_attn.to_q.weight",
154
+ "layers.9.self_attn.to_v.weight",
155
+ "layers.11.mlp.down_proj.weight",
156
+ "layers.11.mlp.gate_proj.weight",
157
+ "layers.11.mlp.up_proj.weight",
158
+ "layers.11.mlp_moe_gen.down_proj.weight",
159
+ "layers.11.mlp_moe_gen.gate_proj.weight",
160
+ "layers.11.mlp_moe_gen.up_proj.weight",
161
+ "layers.12.mlp.down_proj.weight",
162
+ "layers.12.mlp.gate_proj.weight",
163
+ "layers.12.mlp.up_proj.weight",
164
+ "layers.12.mlp_moe_gen.down_proj.weight",
165
+ "layers.12.mlp_moe_gen.gate_proj.weight",
166
+ "layers.12.mlp_moe_gen.up_proj.weight",
167
+ "layers.12.self_attn.add_k_proj.weight",
168
+ "layers.12.self_attn.add_q_proj.weight",
169
+ "layers.12.self_attn.add_v_proj.weight",
170
+ "layers.12.self_attn.to_k.weight",
171
+ "layers.12.self_attn.to_out.weight",
172
+ "layers.12.self_attn.to_q.weight",
173
+ "layers.12.self_attn.to_v.weight",
174
+ "layers.13.mlp.down_proj.weight",
175
+ "layers.13.mlp.gate_proj.weight",
176
+ "layers.13.mlp.up_proj.weight",
177
+ "layers.13.mlp_moe_gen.down_proj.weight",
178
+ "layers.13.mlp_moe_gen.gate_proj.weight",
179
+ "layers.13.mlp_moe_gen.up_proj.weight",
180
+ "layers.13.self_attn.add_k_proj.weight",
181
+ "layers.13.self_attn.add_q_proj.weight",
182
+ "layers.13.self_attn.add_v_proj.weight",
183
+ "layers.13.self_attn.to_k.weight",
184
+ "layers.13.self_attn.to_out.weight",
185
+ "layers.13.self_attn.to_q.weight",
186
+ "layers.13.self_attn.to_v.weight",
187
+ "layers.14.mlp.down_proj.weight",
188
+ "layers.14.mlp.gate_proj.weight",
189
+ "layers.14.mlp.up_proj.weight",
190
+ "layers.14.mlp_moe_gen.down_proj.weight",
191
+ "layers.14.mlp_moe_gen.gate_proj.weight",
192
+ "layers.14.mlp_moe_gen.up_proj.weight",
193
+ "layers.14.self_attn.add_k_proj.weight",
194
+ "layers.14.self_attn.add_q_proj.weight",
195
+ "layers.14.self_attn.add_v_proj.weight",
196
+ "layers.14.self_attn.to_k.weight",
197
+ "layers.14.self_attn.to_out.weight",
198
+ "layers.14.self_attn.to_q.weight",
199
+ "layers.14.self_attn.to_v.weight",
200
+ "layers.15.mlp.down_proj.weight",
201
+ "layers.15.mlp.gate_proj.weight",
202
+ "layers.15.mlp.up_proj.weight",
203
+ "layers.15.mlp_moe_gen.down_proj.weight",
204
+ "layers.15.mlp_moe_gen.gate_proj.weight",
205
+ "layers.15.mlp_moe_gen.up_proj.weight",
206
+ "layers.15.self_attn.add_k_proj.weight",
207
+ "layers.15.self_attn.add_q_proj.weight",
208
+ "layers.15.self_attn.add_v_proj.weight",
209
+ "layers.15.self_attn.to_k.weight",
210
+ "layers.15.self_attn.to_out.weight",
211
+ "layers.15.self_attn.to_q.weight",
212
+ "layers.15.self_attn.to_v.weight",
213
+ "layers.16.mlp.down_proj.weight",
214
+ "layers.16.mlp.gate_proj.weight",
215
+ "layers.16.mlp.up_proj.weight",
216
+ "layers.16.mlp_moe_gen.down_proj.weight",
217
+ "layers.16.mlp_moe_gen.gate_proj.weight",
218
+ "layers.16.mlp_moe_gen.up_proj.weight",
219
+ "layers.16.self_attn.add_k_proj.weight",
220
+ "layers.16.self_attn.add_q_proj.weight",
221
+ "layers.16.self_attn.add_v_proj.weight",
222
+ "layers.16.self_attn.to_k.weight",
223
+ "layers.16.self_attn.to_out.weight",
224
+ "layers.16.self_attn.to_q.weight",
225
+ "layers.16.self_attn.to_v.weight",
226
+ "layers.17.mlp.down_proj.weight",
227
+ "layers.17.mlp.gate_proj.weight",
228
+ "layers.17.mlp.up_proj.weight",
229
+ "layers.17.self_attn.add_k_proj.weight",
230
+ "layers.17.self_attn.add_q_proj.weight",
231
+ "layers.17.self_attn.add_v_proj.weight",
232
+ "layers.17.self_attn.to_k.weight",
233
+ "layers.17.self_attn.to_out.weight",
234
+ "layers.17.self_attn.to_q.weight",
235
+ "layers.17.self_attn.to_v.weight",
236
+ "layers.17.mlp_moe_gen.down_proj.weight",
237
+ "layers.17.mlp_moe_gen.gate_proj.weight",
238
+ "layers.17.mlp_moe_gen.up_proj.weight",
239
+ "layers.18.mlp.down_proj.weight",
240
+ "layers.18.mlp.gate_proj.weight",
241
+ "layers.18.mlp.up_proj.weight",
242
+ "layers.18.mlp_moe_gen.down_proj.weight",
243
+ "layers.18.mlp_moe_gen.gate_proj.weight",
244
+ "layers.18.mlp_moe_gen.up_proj.weight",
245
+ "layers.18.self_attn.add_k_proj.weight",
246
+ "layers.18.self_attn.add_q_proj.weight",
247
+ "layers.18.self_attn.add_v_proj.weight",
248
+ "layers.18.self_attn.to_k.weight",
249
+ "layers.18.self_attn.to_out.weight",
250
+ "layers.18.self_attn.to_q.weight",
251
+ "layers.18.self_attn.to_v.weight",
252
+ "layers.19.mlp.down_proj.weight",
253
+ "layers.19.mlp.gate_proj.weight",
254
+ "layers.19.mlp.up_proj.weight",
255
+ "layers.19.mlp_moe_gen.down_proj.weight",
256
+ "layers.19.mlp_moe_gen.gate_proj.weight",
257
+ "layers.19.mlp_moe_gen.up_proj.weight",
258
+ "layers.19.self_attn.add_k_proj.weight",
259
+ "layers.19.self_attn.add_q_proj.weight",
260
+ "layers.19.self_attn.add_v_proj.weight",
261
+ "layers.19.self_attn.to_k.weight",
262
+ "layers.19.self_attn.to_out.weight",
263
+ "layers.19.self_attn.to_q.weight",
264
+ "layers.19.self_attn.to_v.weight",
265
+ "layers.20.mlp.down_proj.weight",
266
+ "layers.20.mlp.gate_proj.weight",
267
+ "layers.20.mlp.up_proj.weight",
268
+ "layers.20.mlp_moe_gen.down_proj.weight",
269
+ "layers.20.mlp_moe_gen.gate_proj.weight",
270
+ "layers.20.mlp_moe_gen.up_proj.weight",
271
+ "layers.20.self_attn.add_k_proj.weight",
272
+ "layers.20.self_attn.add_q_proj.weight",
273
+ "layers.20.self_attn.add_v_proj.weight",
274
+ "layers.20.self_attn.to_k.weight",
275
+ "layers.20.self_attn.to_out.weight",
276
+ "layers.20.self_attn.to_q.weight",
277
+ "layers.20.self_attn.to_v.weight",
278
+ "layers.21.mlp.down_proj.weight",
279
+ "layers.21.mlp.gate_proj.weight",
280
+ "layers.21.mlp.up_proj.weight",
281
+ "layers.21.mlp_moe_gen.down_proj.weight",
282
+ "layers.21.mlp_moe_gen.gate_proj.weight",
283
+ "layers.21.mlp_moe_gen.up_proj.weight",
284
+ "layers.21.self_attn.add_k_proj.weight",
285
+ "layers.21.self_attn.add_q_proj.weight",
286
+ "layers.21.self_attn.add_v_proj.weight",
287
+ "layers.21.self_attn.to_k.weight",
288
+ "layers.21.self_attn.to_out.weight",
289
+ "layers.21.self_attn.to_q.weight",
290
+ "layers.21.self_attn.to_v.weight",
291
+ "layers.22.mlp.down_proj.weight",
292
+ "layers.22.mlp.gate_proj.weight",
293
+ "layers.22.mlp.up_proj.weight",
294
+ "layers.22.mlp_moe_gen.down_proj.weight",
295
+ "layers.22.mlp_moe_gen.gate_proj.weight",
296
+ "layers.22.mlp_moe_gen.up_proj.weight",
297
+ "layers.22.self_attn.add_k_proj.weight",
298
+ "layers.22.self_attn.add_q_proj.weight",
299
+ "layers.22.self_attn.add_v_proj.weight",
300
+ "layers.22.self_attn.to_k.weight",
301
+ "layers.22.self_attn.to_out.weight",
302
+ "layers.22.self_attn.to_q.weight",
303
+ "layers.22.self_attn.to_v.weight",
304
+ "layers.23.mlp.down_proj.weight",
305
+ "layers.23.mlp.gate_proj.weight",
306
+ "layers.23.mlp.up_proj.weight",
307
+ "layers.23.mlp_moe_gen.down_proj.weight",
308
+ "layers.23.mlp_moe_gen.gate_proj.weight",
309
+ "layers.23.mlp_moe_gen.up_proj.weight",
310
+ "layers.23.self_attn.add_k_proj.weight",
311
+ "layers.23.self_attn.add_q_proj.weight",
312
+ "layers.23.self_attn.add_v_proj.weight",
313
+ "layers.23.self_attn.to_k.weight",
314
+ "layers.23.self_attn.to_out.weight",
315
+ "layers.23.self_attn.to_q.weight",
316
+ "layers.23.self_attn.to_v.weight",
317
+ "layers.24.self_attn.to_k.weight",
318
+ "layers.24.self_attn.to_q.weight",
319
+ "layers.24.self_attn.to_v.weight",
320
+ "layers.24.mlp.down_proj.weight",
321
+ "layers.24.mlp.gate_proj.weight",
322
+ "layers.24.mlp.up_proj.weight",
323
+ "layers.24.mlp_moe_gen.down_proj.weight",
324
+ "layers.24.mlp_moe_gen.gate_proj.weight",
325
+ "layers.24.mlp_moe_gen.up_proj.weight",
326
+ "layers.24.self_attn.add_k_proj.weight",
327
+ "layers.24.self_attn.add_q_proj.weight",
328
+ "layers.24.self_attn.add_v_proj.weight",
329
+ "layers.24.self_attn.to_out.weight",
330
+ "layers.25.mlp.down_proj.weight",
331
+ "layers.25.mlp.gate_proj.weight",
332
+ "layers.25.mlp.up_proj.weight",
333
+ "layers.25.mlp_moe_gen.down_proj.weight",
334
+ "layers.25.mlp_moe_gen.gate_proj.weight",
335
+ "layers.25.mlp_moe_gen.up_proj.weight",
336
+ "layers.25.self_attn.add_k_proj.weight",
337
+ "layers.25.self_attn.add_q_proj.weight",
338
+ "layers.25.self_attn.add_v_proj.weight",
339
+ "layers.25.self_attn.to_k.weight",
340
+ "layers.25.self_attn.to_out.weight",
341
+ "layers.25.self_attn.to_q.weight",
342
+ "layers.25.self_attn.to_v.weight",
343
+ "layers.26.mlp.down_proj.weight",
344
+ "layers.26.mlp.gate_proj.weight",
345
+ "layers.26.mlp.up_proj.weight",
346
+ "layers.26.mlp_moe_gen.down_proj.weight",
347
+ "layers.26.mlp_moe_gen.gate_proj.weight",
348
+ "layers.26.mlp_moe_gen.up_proj.weight",
349
+ "layers.26.self_attn.add_k_proj.weight",
350
+ "layers.26.self_attn.add_q_proj.weight",
351
+ "layers.26.self_attn.add_v_proj.weight",
352
+ "layers.26.self_attn.to_k.weight",
353
+ "layers.26.self_attn.to_out.weight",
354
+ "layers.26.self_attn.to_q.weight",
355
+ "layers.26.self_attn.to_v.weight",
356
+ "layers.27.mlp.down_proj.weight",
357
+ "layers.27.mlp.gate_proj.weight",
358
+ "layers.27.mlp.up_proj.weight",
359
+ "layers.27.mlp_moe_gen.down_proj.weight",
360
+ "layers.27.mlp_moe_gen.gate_proj.weight",
361
+ "layers.27.mlp_moe_gen.up_proj.weight",
362
+ "layers.27.self_attn.add_k_proj.weight",
363
+ "layers.27.self_attn.add_q_proj.weight",
364
+ "layers.27.self_attn.add_v_proj.weight",
365
+ "layers.27.self_attn.to_k.weight",
366
+ "layers.27.self_attn.to_out.weight",
367
+ "layers.27.self_attn.to_q.weight",
368
+ "layers.27.self_attn.to_v.weight",
369
+ "layers.28.mlp.down_proj.weight",
370
+ "layers.28.mlp.gate_proj.weight",
371
+ "layers.28.mlp.up_proj.weight",
372
+ "layers.28.mlp_moe_gen.down_proj.weight",
373
+ "layers.28.mlp_moe_gen.gate_proj.weight",
374
+ "layers.28.mlp_moe_gen.up_proj.weight",
375
+ "layers.28.self_attn.add_k_proj.weight",
376
+ "layers.28.self_attn.add_q_proj.weight",
377
+ "layers.28.self_attn.add_v_proj.weight",
378
+ "layers.28.self_attn.to_k.weight",
379
+ "layers.28.self_attn.to_out.weight",
380
+ "layers.28.self_attn.to_q.weight",
381
+ "layers.28.self_attn.to_v.weight",
382
+ "layers.29.mlp.down_proj.weight",
383
+ "layers.29.mlp.gate_proj.weight",
384
+ "layers.29.mlp.up_proj.weight",
385
+ "layers.29.mlp_moe_gen.down_proj.weight",
386
+ "layers.29.mlp_moe_gen.gate_proj.weight",
387
+ "layers.29.mlp_moe_gen.up_proj.weight",
388
+ "layers.29.self_attn.add_k_proj.weight",
389
+ "layers.29.self_attn.add_q_proj.weight",
390
+ "layers.29.self_attn.add_v_proj.weight",
391
+ "layers.29.self_attn.to_k.weight",
392
+ "layers.29.self_attn.to_out.weight",
393
+ "layers.29.self_attn.to_q.weight",
394
+ "layers.29.self_attn.to_v.weight",
395
+ "layers.30.mlp.gate_proj.weight",
396
+ "layers.30.mlp.up_proj.weight",
397
+ "layers.30.self_attn.add_k_proj.weight",
398
+ "layers.30.self_attn.add_q_proj.weight",
399
+ "layers.30.self_attn.add_v_proj.weight",
400
+ "layers.30.self_attn.to_k.weight",
401
+ "layers.30.self_attn.to_out.weight",
402
+ "layers.30.self_attn.to_q.weight",
403
+ "layers.30.self_attn.to_v.weight",
404
+ "layers.30.mlp.down_proj.weight",
405
+ "layers.30.mlp_moe_gen.down_proj.weight",
406
+ "layers.30.mlp_moe_gen.gate_proj.weight",
407
+ "layers.30.mlp_moe_gen.up_proj.weight",
408
+ "layers.31.mlp.down_proj.weight",
409
+ "layers.31.mlp.gate_proj.weight",
410
+ "layers.31.mlp.up_proj.weight",
411
+ "layers.31.mlp_moe_gen.down_proj.weight",
412
+ "layers.31.mlp_moe_gen.gate_proj.weight",
413
+ "layers.31.mlp_moe_gen.up_proj.weight",
414
+ "layers.31.self_attn.add_k_proj.weight",
415
+ "layers.31.self_attn.add_q_proj.weight",
416
+ "layers.31.self_attn.add_v_proj.weight",
417
+ "layers.31.self_attn.to_k.weight",
418
+ "layers.31.self_attn.to_out.weight",
419
+ "layers.31.self_attn.to_q.weight",
420
+ "layers.31.self_attn.to_v.weight",
421
+ "layers.32.mlp.down_proj.weight",
422
+ "layers.32.mlp.gate_proj.weight",
423
+ "layers.32.mlp.up_proj.weight",
424
+ "layers.32.mlp_moe_gen.down_proj.weight",
425
+ "layers.32.mlp_moe_gen.gate_proj.weight",
426
+ "layers.32.mlp_moe_gen.up_proj.weight",
427
+ "layers.32.self_attn.add_k_proj.weight",
428
+ "layers.32.self_attn.add_q_proj.weight",
429
+ "layers.32.self_attn.add_v_proj.weight",
430
+ "layers.32.self_attn.to_k.weight",
431
+ "layers.32.self_attn.to_out.weight",
432
+ "layers.32.self_attn.to_q.weight",
433
+ "layers.32.self_attn.to_v.weight",
434
+ "layers.33.mlp.down_proj.weight",
435
+ "layers.33.mlp.gate_proj.weight",
436
+ "layers.33.mlp.up_proj.weight",
437
+ "layers.33.mlp_moe_gen.down_proj.weight",
438
+ "layers.33.mlp_moe_gen.gate_proj.weight",
439
+ "layers.33.mlp_moe_gen.up_proj.weight",
440
+ "layers.33.self_attn.add_k_proj.weight",
441
+ "layers.33.self_attn.add_q_proj.weight",
442
+ "layers.33.self_attn.add_v_proj.weight",
443
+ "layers.33.self_attn.to_k.weight",
444
+ "layers.33.self_attn.to_out.weight",
445
+ "layers.33.self_attn.to_q.weight",
446
+ "layers.33.self_attn.to_v.weight",
447
+ "layers.34.mlp.down_proj.weight",
448
+ "layers.34.mlp.gate_proj.weight",
449
+ "layers.34.mlp.up_proj.weight",
450
+ "layers.34.mlp_moe_gen.down_proj.weight",
451
+ "layers.34.mlp_moe_gen.gate_proj.weight",
452
+ "layers.34.mlp_moe_gen.up_proj.weight",
453
+ "layers.34.self_attn.add_k_proj.weight",
454
+ "layers.34.self_attn.add_q_proj.weight",
455
+ "layers.34.self_attn.add_v_proj.weight",
456
+ "layers.34.self_attn.to_k.weight",
457
+ "layers.34.self_attn.to_out.weight",
458
+ "layers.34.self_attn.to_q.weight",
459
+ "layers.34.self_attn.to_v.weight",
460
+ "layers.35.mlp.down_proj.weight",
461
+ "layers.35.mlp.gate_proj.weight",
462
+ "layers.35.mlp.up_proj.weight",
463
+ "layers.35.mlp_moe_gen.down_proj.weight",
464
+ "layers.35.mlp_moe_gen.gate_proj.weight",
465
+ "layers.35.mlp_moe_gen.up_proj.weight",
466
+ "layers.35.self_attn.add_k_proj.weight",
467
+ "layers.35.self_attn.add_q_proj.weight",
468
+ "layers.35.self_attn.add_v_proj.weight",
469
+ "layers.35.self_attn.to_k.weight",
470
+ "layers.35.self_attn.to_out.weight",
471
+ "layers.35.self_attn.to_q.weight",
472
+ "layers.35.self_attn.to_v.weight"
473
+ ]
474
+ }
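The config above records `group_size: 64`, `bits: 4`, and the list of quantized weights. By line count the list holds 468 names, which matches 13 projections for each of layers 0 through 35, and the names are not sorted by layer, so a loader should probably treat the list as a set. The sketch below shows two things a consumer might compute from it. It assumes affine group quantization with one 16-bit scale and one 16-bit bias per group, which is common for MLX but is not stated in the config, and the helper names are hypothetical.

```python
import re
from collections import defaultdict

GROUP_SIZE = 64  # "group_size" from the config
BITS = 4         # "bits" from the config


def bits_per_weight(bits, group_size, scale_bits=16, bias_bits=16):
    """Effective storage per weight, assuming one scale and one bias per group."""
    return bits + (scale_bits + bias_bits) / group_size


def group_by_layer(names):
    """Group entries like 'layers.3.mlp.up_proj.weight' by layer index."""
    pattern = re.compile(r"^layers\.(\d+)\.(.+)\.weight$")
    layers = defaultdict(list)
    for name in names:
        match = pattern.match(name)
        if match:
            layers[int(match.group(1))].append(match.group(2))
    return dict(layers)


# A three-entry sample; the real list is in the "quantized" array above.
sample = [
    "layers.0.mlp.down_proj.weight",
    "layers.0.self_attn.to_q.weight",
    "layers.10.mlp_moe_gen.up_proj.weight",
]
print(bits_per_weight(BITS, GROUP_SIZE))
print(group_by_layer(sample))
```

Under those assumptions the per-group metadata adds half a bit per weight on top of the nominal 4 bits. Running `group_by_layer` over the full list is a quick way to confirm that every layer contributes the same set of projections.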
transformer/model-00001-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:33aa7326bc74dba9d1420041eb3b0f6e051befac05082053c80f8ecb1c22f90d
+ size 2503129397
transformer/model-00002-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dc369f9576c0cadfe691be1399ef580a5d34f9130fd9ec4d2cc765ac43220389
+ size 1724131896
transformer/model-00003-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:79e4356afb93a0a5e01d1844b2d7f6ae1e9250555d0ac4a2bfda1ae152959448
+ size 1680055054
transformer/model-00004-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:765416dc6fa5c34504b587c8a5da28789d9e62822d275519eb7dde3dbb53bfff
+ size 1695818020
transformer/model-00005-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fa8c7cfe7c709e5ac217e56959d45f2813d49a54f0e4d56459055c1da782049a
+ size 1708369224
transformer/model-00006-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:29d5b08926ee17dc43270744a77926b929482e9f19b8ba1206b1e18babc073e2
+ size 1447282090
transformer/model-00007-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f36f39ad47fd8b0cb2b43d7117cd9d03784d108fb77c9a13474da6184aa0bf08
+ size 1318361139
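Each shard above is stored as a Git LFS pointer: three `key value` lines giving the spec version, the SHA-256 object ID, and the size in bytes. As a rough sketch, the pointers can be parsed and the shard sizes summed to estimate the download footprint of the quantized transformer. The function name `parse_lfs_pointer` is illustrative, and the sizes are copied from the seven pointers in this diff.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict of its key/value lines."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    fields["size"] = int(fields["size"])  # size is given in bytes
    return fields


pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:f36f39ad47fd8b0cb2b43d7117cd9d03784d108fb77c9a13474da6184aa0bf08\n"
    "size 1318361139\n"
)

# Sizes of the seven transformer shards, copied from the pointers above.
SHARD_SIZES = [
    2503129397, 1724131896, 1680055054, 1695818020,
    1708369224, 1447282090, 1318361139,
]
total_gb = sum(SHARD_SIZES) / 1e9
print(f"{total_gb:.2f} GB across {len(SHARD_SIZES)} shards")
```

The total comes to roughly 12 GB for the transformer weights alone; the VAE and tokenizer files are additional.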
vae/config.json ADDED
@@ -0,0 +1,129 @@
1
+ {
2
+ "_class_name": "AutoencoderKLWan",
3
+ "_diffusers_version": "0.37.1",
4
+ "_name_or_path": "Wan-AI/Wan2.2-TI2V-5B-Diffusers",
5
+ "attn_scales": [],
6
+ "base_dim": 160,
7
+ "clip_output": false,
8
+ "decoder_base_dim": 256,
9
+ "dim_mult": [
10
+ 1,
11
+ 2,
12
+ 4,
13
+ 4
14
+ ],
15
+ "dropout": 0.0,
16
+ "in_channels": 12,
17
+ "is_residual": true,
18
+ "latents_mean": [
19
+ -0.2289,
20
+ -0.0052,
21
+ -0.1323,
22
+ -0.2339,
23
+ -0.2799,
24
+ 0.0174,
25
+ 0.1838,
26
+ 0.1557,
27
+ -0.1382,
28
+ 0.0542,
29
+ 0.2813,
30
+ 0.0891,
31
+ 0.157,
32
+ -0.0098,
33
+ 0.0375,
34
+ -0.1825,
35
+ -0.2246,
36
+ -0.1207,
37
+ -0.0698,
38
+ 0.5109,
39
+ 0.2665,
40
+ -0.2108,
41
+ -0.2158,
42
+ 0.2502,
43
+ -0.2055,
44
+ -0.0322,
45
+ 0.1109,
46
+ 0.1567,
47
+ -0.0729,
48
+ 0.0899,
49
+ -0.2799,
50
+ -0.123,
51
+ -0.0313,
52
+ -0.1649,
53
+ 0.0117,
54
+ 0.0723,
55
+ -0.2839,
56
+ -0.2083,
57
+ -0.052,
58
+ 0.3748,
59
+ 0.0152,
60
+ 0.1957,
61
+ 0.1433,
62
+ -0.2944,
63
+ 0.3573,
64
+ -0.0548,
65
+ -0.1681,
66
+ -0.0667
67
+ ],
68
+ "latents_std": [
69
+ 0.4765,
70
+ 1.0364,
71
+ 0.4514,
72
+ 1.1677,
73
+ 0.5313,
74
+ 0.499,
75
+ 0.4818,
76
+ 0.5013,
77
+ 0.8158,
78
+ 1.0344,
79
+ 0.5894,
80
+ 1.0901,
81
+ 0.6885,
82
+ 0.6165,
83
+ 0.8454,
84
+ 0.4978,
85
+ 0.5759,
86
+ 0.3523,
87
+ 0.7135,
88
+ 0.6804,
89
+ 0.5833,
90
+ 1.4146,
91
+ 0.8986,
92
+ 0.5659,
93
+ 0.7069,
94
+ 0.5338,
95
+ 0.4889,
96
+ 0.4917,
97
+ 0.4069,
98
+ 0.4999,
99
+ 0.6866,
100
+ 0.4093,
101
+ 0.5709,
102
+ 0.6065,
103
+ 0.6415,
104
+ 0.4944,
105
+ 0.5726,
106
+ 1.2042,
107
+ 0.5458,
108
+ 1.6887,
109
+ 0.3971,
110
+ 1.06,
111
+ 0.3943,
112
+ 0.5537,
113
+ 0.5444,
114
+ 0.4089,
115
+ 0.7468,
116
+ 0.7744
117
+ ],
118
+ "num_res_blocks": 2,
119
+ "out_channels": 12,
120
+ "patch_size": 2,
121
+ "scale_factor_spatial": 16,
122
+ "scale_factor_temporal": 4,
123
+ "temperal_downsample": [
124
+ false,
125
+ true,
126
+ true
127
+ ],
128
+ "z_dim": 48
129
+ }
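The VAE config above fixes `z_dim: 48`, `scale_factor_spatial: 16`, and `scale_factor_temporal: 4`, along with per-channel `latents_mean` and `latents_std`. The sketch below shows how those values are typically used. Two assumptions are made that the config does not state: the temporal length follows the causal video VAE convention `(frames - 1) // 4 + 1`, and latents are normalized as `(z - mean) / std` before the transformer and inverted before decoding. Only the first four channels of the statistics are reproduced, and all function names are hypothetical.

```python
Z_DIM = 48            # "z_dim"
SCALE_SPATIAL = 16    # "scale_factor_spatial"
SCALE_TEMPORAL = 4    # "scale_factor_temporal"

# First four of the 48 per-channel statistics from the config.
LATENTS_MEAN = [-0.2289, -0.0052, -0.1323, -0.2339]
LATENTS_STD = [0.4765, 1.0364, 0.4514, 1.1677]


def latent_shape(frames, height, width):
    """Latent (channels, frames, height, width) for a pixel-space clip."""
    # Assumed convention: the first frame is kept, later frames are grouped by 4.
    latent_frames = (frames - 1) // SCALE_TEMPORAL + 1
    return (Z_DIM, latent_frames, height // SCALE_SPATIAL, width // SCALE_SPATIAL)


def normalize(values):
    """Standardize one value per channel: (z - mean) / std."""
    return [(z - m) / s for z, m, s in zip(values, LATENTS_MEAN, LATENTS_STD)]


def denormalize(values):
    """Invert normalize(): z * std + mean."""
    return [z * s + m for z, m, s in zip(values, LATENTS_MEAN, LATENTS_STD)]


print(latent_shape(1, 512, 512))     # a single image
print(latent_shape(17, 480, 832))    # a short clip
print(denormalize(normalize([0.1, -0.2, 0.3, 0.0])))
```

A single 512x512 image therefore maps to a 32x32 latent grid with 48 channels under these assumptions, which is the tensor the quantized transformer would denoise before the VAE decodes it.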
vae/diffusion_pytorch_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:230496cb59ff85bc9c040487737c4062480cb61c71e697b197b4c30142f2a0da
+ size 1409400600