# Kimi-K2.6 Vision Weights

Vision-only weights extracted from moonshotai/Kimi-K2.6 for use with MLX-based inference.

## Contents

- `kimi_k26_vision.safetensors` — 335 tensors, ~899 MB (BF16)
  - `vision_tower.*` — 329 tensors (MoonViT encoder, 27 layers)
  - `mm_projector.*` — 6 tensors (PatchMergerMLP projector)
- `config.json` — vision config + projector metadata
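A quick sanity check on the split above is to group tensor names by their top-level prefix. The grouping logic below is a minimal sketch; the example names are illustrative, not the actual tensor list:

```python
from collections import Counter

def count_by_prefix(tensor_names):
    """Group tensor names by their first dotted component."""
    return Counter(name.split(".", 1)[0] for name in tensor_names)

# Illustrative names only; the real file holds 335 tensors.
names = [
    "vision_tower.blocks.0.attn.qkv.weight",
    "vision_tower.blocks.0.attn.proj.weight",
    "mm_projector.linear_1.weight",
    "mm_projector.linear_2.weight",
]
counts = count_by_prefix(names)
print(counts)  # Counter({'vision_tower': 2, 'mm_projector': 2})
```

On the real file, the two prefix counts should come out to 329 and 6.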

## Architecture

| Component | Details |
|---|---|
| Vision encoder | MoonViT: 27 layers, 1152 hidden, 16 heads, patch_size=14 |
| Patch merger | 2×2 spatial merge + temporal pool (no learned params) |
| Projector | LayerNorm(1152) → Linear(4608 → 4608) → GELU → Linear(4608 → 7168) |
| Total params | ~450M |

Kimi-K2.6 uses the same vision architecture as Kimi-K2.5 (and the same vision encoder as Kimi-VL-A3B). The projector output dimension is 7168 to match the K2.6 text backbone hidden size.
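The projector pipeline can be sketched in NumPy with random stand-in weights. Dimensions follow the table (1152-d patches, 2×2 merge to 4608, 7168-d output), but the function names and the exact merge ordering are assumptions, not the model's actual code; the LayerNorm's learned scale/shift are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
D, D_MERGED, D_TEXT = 1152, 4 * 1152, 7168  # 4 * 1152 == 4608

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Random stand-ins for the two Linear layers of the projector.
W1, b1 = rng.standard_normal((D_MERGED, D_MERGED)) * 0.01, np.zeros(D_MERGED)
W2, b2 = rng.standard_normal((D_MERGED, D_TEXT)) * 0.01, np.zeros(D_TEXT)

def project(patches):
    """patches: (H, W, 1152) grid of patch features, H and W even."""
    h, w, _ = patches.shape
    x = layer_norm(patches)
    # 2x2 spatial merge: concatenate each 2x2 neighborhood along channels.
    x = x.reshape(h // 2, 2, w // 2, 2, D).transpose(0, 2, 1, 3, 4)
    x = x.reshape((h // 2) * (w // 2), 4 * D)  # (N, 4608)
    return gelu(x @ W1 + b1) @ W2 + b2         # (N, 7168)

out = project(rng.standard_normal((4, 6, D)))
print(out.shape)  # (6, 7168)
```

Note how a 4×6 patch grid collapses to 6 merged positions, each projected to the 7168-d text hidden size.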

## Usage

These weights are designed to be loaded alongside text-only MLX ports of Kimi-K2.6 (e.g. `mlx-community/Kimi-K2.6-mlx-DQ3_K_M-q8`) to enable vision-language capabilities.

The vision encoder processes images into (N, 7168) embedding vectors that replace media placeholder tokens in the text embedding stream.
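Splicing those embeddings into the text stream amounts to overwriting the placeholder positions. Here is a minimal NumPy sketch; the placeholder token id (`-200`) and all names are hypothetical, not taken from the Kimi codebase:

```python
import numpy as np

def splice_image_embeddings(text_embeds, token_ids, image_embeds, media_id=-200):
    """Replace each media-placeholder position with the next image embedding row.

    text_embeds: (T, 7168), token_ids: (T,), image_embeds: (N, 7168),
    where N equals the number of placeholder tokens in token_ids.
    """
    out = text_embeds.copy()
    positions = np.flatnonzero(token_ids == media_id)
    assert len(positions) == len(image_embeds), "placeholder/embedding count mismatch"
    out[positions] = image_embeds
    return out

T, N, D = 8, 3, 7168
rng = np.random.default_rng(1)
text = rng.standard_normal((T, D))
ids = np.array([1, -200, 5, -200, 7, -200, 9, 2])
imgs = rng.standard_normal((N, D))
merged = splice_image_embeddings(text, ids, imgs)
print(merged.shape)  # (8, 7168)
```

The non-placeholder rows pass through unchanged, so the text backbone sees a single (T, 7168) embedding sequence.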

## Reproduction

Extracted from shards 63 and 64 of `moonshotai/Kimi-K2.6`. The vision tensors live entirely in those two shards (`mm_projector` in 63, `vision_tower` in 64). The weights are unmodified, with the original BF16 precision preserved.

See `extract_vision_weights.py` for the script.
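The extraction reduces to keeping tensors whose names start with the two vision prefixes. This is a pure-Python sketch over an in-memory state dict (the real script reads the shard files on disk, e.g. via `safetensors.safe_open`); the example keys are illustrative:

```python
import numpy as np

VISION_PREFIXES = ("vision_tower.", "mm_projector.")

def extract_vision(state_dict):
    """Keep only tensors belonging to the vision encoder or projector."""
    return {k: v for k, v in state_dict.items()
            if k.startswith(VISION_PREFIXES)}

# Illustrative shard contents; real shards 63/64 hold the full 335 tensors.
shard = {
    "model.layers.60.mlp.gate.weight": np.zeros((4, 4)),
    "mm_projector.linear_1.weight": np.zeros((8, 8)),
    "vision_tower.patch_embed.proj.weight": np.zeros((8, 3, 14, 14)),
}
vision = extract_vision(shard)
print(sorted(vision))
# ['mm_projector.linear_1.weight', 'vision_tower.patch_embed.proj.weight']
```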

## License

Same license as the source model: Kimi-K2.6 License
