# Kimi-K2.6 Vision Weights

Vision-only weights extracted from moonshotai/Kimi-K2.6 for use with MLX-based inference.

## Contents

- `kimi_k26_vision.safetensors` — 335 tensors, ~899 MB (BF16)
  - `vision_tower.*` — 329 tensors (MoonViT encoder, 27 layers)
  - `mm_projector.*` — 6 tensors (PatchMergerMLP projector)
- `config.json` — vision config + projector metadata
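A quick sanity check on the split above is to group tensor names by their top-level prefix. The grouping logic below is a minimal sketch; the example names are illustrative, not the actual tensor list:

```python
from collections import Counter

def count_by_prefix(tensor_names):
    """Group tensor names by their first dotted component."""
    return Counter(name.split(".", 1)[0] for name in tensor_names)

# Illustrative names only; the real file holds 335 tensors.
names = [
    "vision_tower.blocks.0.attn.qkv.weight",
    "vision_tower.blocks.0.attn.proj.weight",
    "mm_projector.linear_1.weight",
    "mm_projector.linear_2.weight",
]
counts = count_by_prefix(names)
print(counts)  # Counter({'vision_tower': 2, 'mm_projector': 2})
```

On the real file, the two prefix counts should come out to 329 and 6.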

## Architecture

| Component | Details |
|---|---|
| Vision encoder | MoonViT: 27 layers, 1152 hidden, 16 heads, patch_size=14 |
| Patch merger | 2×2 spatial merge + temporal pool (no learned params) |
| Projector | LayerNorm(1152) → Linear(4608 → 4608) → GELU → Linear(4608 → 7168) |
| Total params | ~450M |

Kimi-K2.6 uses the same vision architecture as Kimi-K2.5 (and the same vision encoder as Kimi-VL-A3B). The projector output dimension is 7168 to match the K2.6 text backbone hidden size.
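The projector pipeline can be sketched in NumPy with random stand-in weights. Dimensions follow the table (1152-d patches, 2×2 merge to 4608, 7168-d output), but the function names and the exact merge ordering are assumptions, not the model's actual code; the LayerNorm's learned scale/shift are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
D, D_MERGED, D_TEXT = 1152, 4 * 1152, 7168  # 4 * 1152 == 4608

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Random stand-ins for the two Linear layers of the projector.
W1, b1 = rng.standard_normal((D_MERGED, D_MERGED)) * 0.01, np.zeros(D_MERGED)
W2, b2 = rng.standard_normal((D_MERGED, D_TEXT)) * 0.01, np.zeros(D_TEXT)

def project(patches):
    """patches: (H, W, 1152) grid of patch features, H and W even."""
    h, w, _ = patches.shape
    x = layer_norm(patches)
    # 2x2 spatial merge: concatenate each 2x2 neighborhood along channels.
    x = x.reshape(h // 2, 2, w // 2, 2, D).transpose(0, 2, 1, 3, 4)
    x = x.reshape((h // 2) * (w // 2), 4 * D)  # (N, 4608)
    return gelu(x @ W1 + b1) @ W2 + b2         # (N, 7168)

out = project(rng.standard_normal((4, 6, D)))
print(out.shape)  # (6, 7168)
```

Note how a 4×6 patch grid collapses to 6 merged positions, each projected to the 7168-d text hidden size.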

## Usage

These weights are designed to be loaded alongside text-only MLX ports of Kimi-K2.6 (e.g. `mlx-community/Kimi-K2.6-mlx-DQ3_K_M-q8`) to enable vision-language capabilities.

The vision encoder processes images into (N, 7168) embedding vectors that replace media placeholder tokens in the text embedding stream.
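Splicing those embeddings into the text stream amounts to overwriting the placeholder positions. Here is a minimal NumPy sketch; the placeholder token id (`-200`) and all names are hypothetical, not taken from the Kimi codebase:

```python
import numpy as np

def splice_image_embeddings(text_embeds, token_ids, image_embeds, media_id=-200):
    """Replace each media-placeholder position with the next image embedding row.

    text_embeds: (T, 7168), token_ids: (T,), image_embeds: (N, 7168),
    where N equals the number of placeholder tokens in token_ids.
    """
    out = text_embeds.copy()
    positions = np.flatnonzero(token_ids == media_id)
    assert len(positions) == len(image_embeds), "placeholder/embedding count mismatch"
    out[positions] = image_embeds
    return out

T, N, D = 8, 3, 7168
rng = np.random.default_rng(1)
text = rng.standard_normal((T, D))
ids = np.array([1, -200, 5, -200, 7, -200, 9, 2])
imgs = rng.standard_normal((N, D))
merged = splice_image_embeddings(text, ids, imgs)
print(merged.shape)  # (8, 7168)
```

The non-placeholder rows pass through unchanged, so the text backbone sees a single (T, 7168) embedding sequence.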

## Reproduction

Extracted from shards 63 and 64 of `moonshotai/Kimi-K2.6`. The vision tensors live entirely in those two shards (`mm_projector` in 63, `vision_tower` in 64). The weights are unmodified, with the original BF16 precision preserved.

See `extract_vision_weights.py` for the script.
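The extraction reduces to keeping tensors whose names start with the two vision prefixes. This is a pure-Python sketch over an in-memory state dict (the real script reads the shard files on disk, e.g. via `safetensors.safe_open`); the example keys are illustrative:

```python
import numpy as np

VISION_PREFIXES = ("vision_tower.", "mm_projector.")

def extract_vision(state_dict):
    """Keep only tensors belonging to the vision encoder or projector."""
    return {k: v for k, v in state_dict.items()
            if k.startswith(VISION_PREFIXES)}

# Illustrative shard contents; real shards 63/64 hold the full 335 tensors.
shard = {
    "model.layers.60.mlp.gate.weight": np.zeros((4, 4)),
    "mm_projector.linear_1.weight": np.zeros((8, 8)),
    "vision_tower.patch_embed.proj.weight": np.zeros((8, 3, 14, 14)),
}
vision = extract_vision(shard)
print(sorted(vision))
# ['mm_projector.linear_1.weight', 'vision_tower.patch_embed.proj.weight']
```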

## License

Same license as the source model: Kimi-K2.6 License
