Upload folder using huggingface_hub

39c7ec2 verified 13 days ago

2.08 kB

license: other
license_name: kimi-k2-derivative
tags:
  - speculative-decoding
  - dflash
  - sglang
  - draft-model
  - kimi-k2
extra_gated_prompt: >-
  This is a research preview draft model. Access is granted manually. By
  requesting access you agree to use it for research/evaluation and understand
  it is an early, single-config checkpoint.
extra_gated_fields:
  Intended use: text
  Affiliation: text

Kimi-K2.7-coder-DFLASH-preview

A DFlash (block-diffusion) speculative-decoding draft model for an ablated Kimi-K2.7-Code target. Research preview.

Trained with SpecForge (online DFlash training).
Trained on only ~55k samples (single corpus, a few epochs) — deliberately small; this is a preview.
On our ablated Kimi-K2.7-Code target it already beats a misaligned EAGLE3 drafter (a K2.6 EAGLE3 head used on K2.7-Code): measured mean accept length ~2.5 vs ~1.9, ~1.3x single-stream decode speedup, with peaks of 4.6-5.2 on long free-form code bodies.
Architecture: DFlashDraftModel (5 layers, hidden 7168), train block size 16 / infer 8, target layer ids capture, mask token id 163838.

Serving (sglang)

Serve against the matching target with DFLASH speculative decoding:

python -m sglang.launch_server --model-path <kimi-k2.7-code-target> --tp 4 \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path Kimi-K2.7-coder-DFLASH-preview \
  --speculative-eagle-topk 1 --speculative-dflash-block-size 8 --speculative-num-draft-tokens 8

Production / structured outputs

Stock sglang DFLASH rejects grammar-constrained requests (JSON schema / regex / tool schemas). Support for that — so this drafter keeps its speedup on coding-CLI JSON (thin envelope around a large code/diff string) — is proposed upstream in sgl-project/sglang PR #28943.

Limitations

Early preview (small data, single config); acceptance plateaus and is not yet tuned.
Must be paired with the matching ablated Kimi-K2.7-Code target; not a standalone model.