Kimi-K2.6 REAP Solidity - Q4 MLX

This is a proof-of-concept Solidity-specialized Kimi-K2.6 checkpoint for local MLX inference on Apple Silicon. Kimi-K2.6 is released by Moonshot AI in a native INT4 / compressed-tensors format; this checkpoint preserves that Q4-class release format in MLX affine Q4 while applying REAP-style Mixture-of-Experts pruning. The pruning keeps the routed experts most active for the target workload, removes the least relevant routed experts, and preserves the rest of the model structure.

The goal is practical local use: retain useful smart-contract coding, investigation, and fork-test behavior while reducing the expert footprint enough to run on a single high-memory Mac Studio.

This release is intentionally framed against Moonshot AI hosted Kimi-K2.6 inference, not just against smaller local baselines. Kimi-K2.6 is a tier-1 open-weight MoE model, and its native INT4 format is part of Moonshot's post-training quantization-aware release process, not an aftermarket quality-reducing quantization. This checkpoint shows that a domain-specialized REAP variant can run locally on a Mac Studio M3 Ultra 512 GB and complete demanding Solidity-agent tasks in the same target domain.

Built with mlx_fun, an MLX-native toolkit for MoE quantization, expert pruning, and expert-activation analysis.

Figure: REAP summed-rank heatmap for 60 MoE layers and 384 routed experts.

Each row is one of the 60 MoE layers; each column is one of the 384 routed experts in the base model. Lower summed rank indicates higher observed importance for the calibration workload. The released checkpoint keeps 256 of 384 routed experts per MoE layer.
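The selection rule described above can be sketched in a few lines. This is a minimal illustration, not the mlx_fun implementation: it assumes the summed rank is the sum of an expert's within-layer ranks over several score sets (for example, several metrics or calibration runs), with rank 0 as the most important. The function names and toy scores are hypothetical.

```python
from typing import List, Sequence


def ranks_descending(scores: Sequence[float]) -> List[int]:
    """Rank experts by score; the highest score gets rank 0 (ties broken by index)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    ranks = [0] * len(scores)
    for rank, expert in enumerate(order):
        ranks[expert] = rank
    return ranks


def summed_ranks(score_sets: Sequence[Sequence[float]]) -> List[int]:
    """Sum each expert's rank across several score sets for one layer."""
    totals = [0] * len(score_sets[0])
    for scores in score_sets:
        for expert, rank in enumerate(ranks_descending(scores)):
            totals[expert] += rank
    return totals


def keep_lowest(totals: Sequence[int], n_keep: int) -> List[int]:
    """Sorted indices of the n_keep experts with the lowest summed rank."""
    order = sorted(range(len(totals)), key=lambda i: (totals[i], i))
    return sorted(order[:n_keep])


# Toy layer with 6 experts and two hypothetical score sets; keep 4 of 6.
toy_scores = [
    [0.9, 0.1, 0.5, 0.7, 0.2, 0.3],
    [0.8, 0.2, 0.6, 0.4, 0.1, 0.5],
]
toy_totals = summed_ranks(toy_scores)
toy_kept = keep_lowest(toy_totals, n_keep=4)
```

For the released checkpoint the same rule would be applied once per MoE layer with `n_keep=256` over 384 experts, which is why the retained indices can differ from layer to layer.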

Model Details

| Field | Value |
| --- | --- |
| Base model | moonshotai/Kimi-K2.6 |
| Format | MLX safetensors with custom Kimi K2.5/K2.6 code files |
| Quantization | MLX affine Q4, preserving the native INT4/Q4-class Kimi-K2.6 release format |
| Pruning method | REAP per-layer expert pruning by saliency |
| Source routed experts | 384 per MoE layer |
| Experts kept | 256 / 384 routed experts per MoE layer |
| Expert reduction | 128 experts removed per MoE layer, ~33% reduction |
| Layers | 61 total text layers, including 1 dense layer and 60 MoE layers |
| Selected experts per token | 8 |
| Context length | 262,144 tokens |
| Disk size | ~401 GB |
| Shards | 180 safetensors files |
| Build and evaluation machine | Mac Studio M3 Ultra, 512 GB unified memory |
| Recommended hardware | Apple Silicon with >= ~440 GB unified memory |

The number of experts kept is uniform across MoE layers, but the specific expert indices differ per layer. The router and related expert-indexed tensors are sliced to match the retained experts.
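As an illustration of what slicing expert-indexed tensors means, here is a minimal sketch using plain Python lists in place of real tensors. It assumes a router weight with one row per expert and a per-expert vector (such as a bias or score correction); the tensor names, shapes, and values are hypothetical and do not come from the checkpoint.

```python
from typing import List, Sequence


def slice_rows(rows: Sequence[Sequence[float]], kept: Sequence[int]) -> List[List[float]]:
    """Keep only the per-expert rows whose original index is in `kept`, in order."""
    return [list(rows[i]) for i in kept]


def slice_vector(values: Sequence[float], kept: Sequence[int]) -> List[float]:
    """Keep only the per-expert entries whose original index is in `kept`."""
    return [values[i] for i in kept]


# Toy layer: 4 experts, hidden size 2; experts 0, 2, and 3 are retained.
router_weight = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
router_bias = [0.0, 0.1, 0.2, 0.3]
kept = [0, 2, 3]

pruned_weight = slice_rows(router_weight, kept)
pruned_bias = slice_vector(router_bias, kept)

# After slicing, position j in the pruned layer corresponds to original expert kept[j].
new_to_original = dict(enumerate(kept))
```

Because `kept` is chosen per layer, a mapping like `new_to_original` has to be tracked per layer as well; position 1 may be original expert 2 in one layer and a different expert in another.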

Intended Use

This checkpoint is intended for local, domain-focused Solidity work:

  • Smart-contract review assistance.
  • Solidity exploit investigation and vulnerability triage.
  • Foundry-style fork-test generation and debugging.
  • Research into MoE routing, expert saliency, and workload-specific pruning.

It is not intended to be a general-purpose replacement for the full Kimi-K2.6 model. Capability outside the Solidity and smart-contract engineering domain is expected to be weaker than that of the unpruned base model.

Build Summary

The checkpoint was produced from Kimi-K2.6 by:

  1. Collecting routed-expert saliency from a focused Solidity-oriented workload.
  2. Converting the native INT4/Q4-class base release to MLX affine Q4.
  3. Removing the least-salient 128 routed experts from each MoE layer.
  4. Preserving dense layers, top-k routing shape, shared experts, and router/tensor alignment.
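Steps 1 and 3 above can be sketched as follows. This is a hedged illustration only: it assumes the commonly described REAP criterion, in which an expert's saliency is the mean, over the tokens routed to it, of the router gate weight multiplied by the L2 norm of the expert's output. The class and method names are hypothetical and are not the mlx_fun API.

```python
import math
from typing import Dict, List, Sequence, Tuple

Key = Tuple[int, int]  # (layer index, expert index)


class SaliencyAccumulator:
    """Running mean of gate_weight * ||expert_output||_2 per (layer, expert)."""

    def __init__(self) -> None:
        self._total: Dict[Key, float] = {}
        self._count: Dict[Key, int] = {}

    def update(self, layer: int, expert: int, gate_weight: float,
               expert_output: Sequence[float]) -> None:
        """Record one token that the router sent to `expert` in `layer`."""
        norm = math.sqrt(sum(v * v for v in expert_output))
        key = (layer, expert)
        self._total[key] = self._total.get(key, 0.0) + gate_weight * norm
        self._count[key] = self._count.get(key, 0) + 1

    def saliency(self, layer: int, expert: int) -> float:
        """Mean contribution over routed tokens; 0.0 if the expert was never routed."""
        key = (layer, expert)
        count = self._count.get(key, 0)
        return self._total.get(key, 0.0) / count if count else 0.0

    def least_salient(self, layer: int, n_experts: int, n_remove: int) -> List[int]:
        """Sorted indices of the n_remove lowest-saliency experts in one layer."""
        order = sorted(range(n_experts), key=lambda e: (self.saliency(layer, e), e))
        return sorted(order[:n_remove])


# Toy calibration pass: one layer, three experts, three routed tokens.
acc = SaliencyAccumulator()
acc.update(layer=0, expert=0, gate_weight=0.5, expert_output=[3.0, 4.0])
acc.update(layer=0, expert=0, gate_weight=0.25, expert_output=[0.0, 4.0])
acc.update(layer=0, expert=1, gate_weight=0.5, expert_output=[0.0, 1.0])
to_remove = acc.least_salient(layer=0, n_experts=3, n_remove=1)
```

In this toy run, expert 2 is never routed and is selected for removal. For the released checkpoint the equivalent call would use `n_experts=384` and `n_remove=128` for each of the 60 MoE layers.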

All conversion, saliency collection, pruning, and local evaluation work for this release was completed on a Mac Studio M3 Ultra with 512 GB unified memory. That hardware point matters: the result is a local, reproducible workstation workflow for a tier-1 open-weight MoE model, not a cloud-only experiment.

Private prompts, internal task traces, target contracts, and operational pipeline details are intentionally omitted from this public model card. To apply the same method to another workload, collect your own routed-expert activation statistics with mlx_fun, then prune the target checkpoint using the resulting per-layer saliency.

Evaluation Snapshot

The following results are from a representative end-to-end smart-contract audit workload with multiple agent stages. They compare this pruned local Q4 checkpoint against a same-base local Q3 checkpoint and OpenRouter-routed Moonshot AI Kimi-K2.6 inference.

These are workload snapshots, not a formal benchmark suite, provider benchmark, or security certification. Hardware for the local runs was a Mac Studio M3 Ultra with 512 GB unified memory. The target details and internal pipeline names are intentionally abstracted in this public card.

Exploit-Investigation Stage

This stage asks the agent to reason through the exploit path and emit the confirmed vulnerability finding. "Fully cold" in the REAP row means the run received no external knowledge priming.

| Configuration | Outcome | Wall time | Input tokens | Output tokens | Total tokens | Tool calls |
| --- | --- | --- | --- | --- | --- | --- |
| OpenRouter-routed Moonshot AI Kimi-K2.6 | Critical canonical | 293 s | 293,191 | 7,799 | 300,990 | 19 |
| Local Q3 baseline | Critical canonical | 1,665 s | 659,769 | 13,721 | 673,490 | 21 |
| This model, local Q4 REAP, fully cold | Confirmed critical | 1,498 s | 267,829 | 9,613 | 277,442 | 17 |

On this exploit-investigation stage, the REAP checkpoint uses ~59% fewer total tokens and ~19% fewer tool calls than the local Q3 baseline while reaching the same canonical outcome. Compared with OpenRouter-routed Moonshot AI inference, it uses ~8% fewer total tokens (277,442 vs. 300,990) and two fewer tool calls (17 vs. 19), but wall-clock time is roughly 5x longer (1,498 s vs. 293 s) because local Apple Silicon prefill and generation throughput is lower.
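The relative figures for this stage follow directly from the table. A short worked check, using only the numbers reported above (the helper name is illustrative):

```python
def percent_fewer(candidate: float, baseline: float) -> float:
    """Percentage by which `candidate` is smaller than `baseline`."""
    return 100.0 * (baseline - candidate) / baseline


# Total tokens, tool calls, and wall time from the exploit-investigation table.
reap_tokens, q3_tokens, hosted_tokens = 277_442, 673_490, 300_990
reap_calls, q3_calls, hosted_calls = 17, 21, 19
reap_wall_s, hosted_wall_s = 1_498, 293

tokens_vs_q3 = percent_fewer(reap_tokens, q3_tokens)          # about 58.8
calls_vs_q3 = percent_fewer(reap_calls, q3_calls)             # about 19.0
tokens_vs_hosted = percent_fewer(reap_tokens, hosted_tokens)  # about 7.8
wall_ratio_vs_hosted = reap_wall_s / hosted_wall_s            # about 5.1
```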

Throughput Snapshot

| Configuration | Output throughput | Prefill throughput |
| --- | --- | --- |
| This model, local Q4 REAP | ~10-13 tok/s | ~300 tok/s |
| Moonshot AI hosted Kimi-K2.6 inference | ~23-31 tok/s | ~2,000 tok/s |

Moonshot AI hosted inference is faster wall-clock. The point of REAP here is different: keep a tier-1 open-weight model useful for a high-value domain while running locally, privately, and repeatably on one Mac Studio.

Quickstart

LM Studio

Place the directory under:

~/.lmstudio/models/<owner>/Kimi-K2.6-REAP-Solidity

Enable Trust Remote Code in LM Studio. The checkpoint ships the required Moonshot Kimi custom modeling files.

mlx-fun serve

mlx-fun serve \
  --model /path/to/Kimi-K2.6-REAP-Solidity \
  --trust-remote-code \
  --idle-timeout 7200 \
  --default-top-k 100 \
  --default-repetition-penalty 1.05 \
  --prompt-cache-size 1

The included chat_template.jinja handles Kimi tool calls and reasoning blocks.

Limitations

  • This is not a fine-tune. The model weights were not trained further; routed experts were removed based on observed workload saliency.
  • This is not a security-certified auditor. It can miss vulnerabilities, invent false positives, or generate unsafe assumptions. Human review is required for smart-contract findings.
  • The evaluation above is a small workload snapshot, not a broad benchmark corpus. Test on your own contracts, vulnerability classes, and agent setup before relying on the model.
  • The Q4 wording here refers to preserving Kimi-K2.6's native INT4/Q4-class release format in MLX. It should not be read as an additional aftermarket low-bit compression step from a BF16 checkpoint.
  • Saliency was collected from a lower-bit capture checkpoint and then applied to the Q4 target. Routing saliency is generally stable across quantization levels, but stricter reproductions should collect saliency directly on their target quantization.
  • Tooling that assumes expert index N has the same meaning across all layers will not apply cleanly. Expert retention is layer-specific.
  • Kimi-K2.6 is multimodal, but only the text path was exercised during calibration and evaluation. Vision inputs have not been measured after pruning.

License

This checkpoint inherits the base model's Modified MIT terms. See moonshotai/Kimi-K2.6 and the included LICENSE file for the original terms.

Links

If this model or workflow is useful, please star the mlx_fun repo.
