Intern-S2-Preview-FP8 GGUF Q4_K_M + vision mmproj

GGUF conversion of internlm/Intern-S2-Preview-FP8 for llama.cpp/TurboQuant serving, with a verified local vision/mmproj path.

This repo is packaged to mirror the original multimodal Intern-S2 deployment: the quantized Intern-S2 language GGUF plus a compatible vision projector (mmproj) for image input. Intern-S2's HF checkpoint is a custom multimodal wrapper, so the llama.cpp language GGUF is produced by flattening the Qwen3.5 MoE language config into a CausalLM-compatible GGUF. That GGUF is then paired with the compatible Qwen3.6-35B-A3B qwen3vl_merger projector. The verified local runtime is the patched llama.cpp/TurboQuant build documented below.

Files

  • Intern-S2-Preview-FP8-text-Q4_K_M.gguf
    • size: 21,721,180,960 bytes / 20.22 GiB
    • quant: Q4_K_M
    • BPW: 4.89
    • params reported by llama.cpp: 35.52 B, active MoE type 35B.A3B
    • vocab: 251392
    • tensors: 753
  • mmproj-Intern-S2-Preview-FP8-F16.gguf
    • size: 899,283,584 bytes / 857.62 MiB
    • source local filename: mmproj-Qwen3.6-35B-A3B-F16.gguf
    • projector: qwen3vl_merger
    • projection dim: 2048, matching Intern-S2 text hidden size
  • docs/CONVERSION_AND_SERVING.md: full conversion notes, required converter patches, llama.cpp version, serving config, benchmarks
  • docs/VISION_STATUS.md: verified vision setup, runtime patch, API smoke result, caveats
  • bench/SUMMARY.md: generation and perplexity smoke summary
  • bench/vision_smoke.json: verified image-input API smoke result
  • patches/llama-cpp-intern-s2-converter.patch: local converter patch used with the TurboQuant llama.cpp fork
  • patches/llama-cpp-mtmd-embd-getrows-fix.patch: local runtime patch required for stable mtmd image embedding evaluation
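The BPW and MiB figures in the file list can be cross-checked from the byte counts. The sketch below assumes BPW means total bits divided by the parameter count llama.cpp reports (35.52 B, itself a rounded figure); the helper name bits_per_weight is illustrative.

```python
MIB = 2 ** 20

def bits_per_weight(size_bytes, n_params):
    """Average bits stored per parameter (assumed definition of BPW)."""
    return size_bytes * 8 / n_params

# Values copied from the file list above.
text_bytes = 21_721_180_960
text_params = 35.52e9          # rounded count as reported by llama.cpp
mmproj_bytes = 899_283_584

print(f"BPW: {bits_per_weight(text_bytes, text_params):.2f}")
print(f"mmproj: {mmproj_bytes / MIB:.2f} MiB")
```

Both printed values agree with the listed 4.89 BPW and 857.62 MiB.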

SHA-256:

cc28d91bb2cad0d591621e6879c8821e3e4481448da9d62e4b22e5c9c780b927  Intern-S2-Preview-FP8-text-Q4_K_M.gguf
71f3cbc1f7cc0f30d09d41cfa924c0060827ebc33bf15ace7e86661e856f0160  mmproj-Intern-S2-Preview-FP8-F16.gguf
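
To check a download against the digests above, a minimal sketch using Python's hashlib follows. The helper names (sha256_of, verify) and the assumption that both files sit in one directory are illustrative and not part of this repo.

```python
import hashlib
from pathlib import Path

# Expected digests copied from the SHA-256 list above.
EXPECTED = {
    "Intern-S2-Preview-FP8-text-Q4_K_M.gguf":
        "cc28d91bb2cad0d591621e6879c8821e3e4481448da9d62e4b22e5c9c780b927",
    "mmproj-Intern-S2-Preview-FP8-F16.gguf":
        "71f3cbc1f7cc0f30d09d41cfa924c0060827ebc33bf15ace7e86661e856f0160",
}

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so a ~20 GiB GGUF never has to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(directory="."):
    """Return {filename: True/False/None}; None means the file is missing."""
    results = {}
    for name, expected in EXPECTED.items():
        path = Path(directory) / name
        results[name] = (sha256_of(path) == expected) if path.exists() else None
    return results

if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "missing"}
    for name, ok in verify().items():
        print(f"{labels[ok]:8}  {name}")
```

Run it from the directory that holds the two GGUF files; any line other than OK means the file should be downloaded again.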

Upstream model

Base model: https://huggingface.co/internlm/Intern-S2-Preview-FP8

This quant is linked back to the upstream model through the model card base_model metadata and this README. Use the upstream repository for license, intended use, tokenizer provenance, and original model details.

Tested serving mode

Text/MTP alias

Local tested stack:

llama.cpp/TurboQuant fork: https://github.com/TheTom/llama-cpp-turboquant.git
commit: 03ddcd585f1c708dce946c8deb4b7c45133ae3c6
version description: b9020-171-g03ddcd585-dirty
runtime: llama-swap -> llama-server
front door: :9069
backend: llama-server

Example llama-server command:

llama-server \
  --model Intern-S2-Preview-FP8-text-Q4_K_M.gguf \
  --ctx-size 204800 \
  --threads 16 \
  --batch-size 256 \
  --ubatch-size 64 \
  --flash-attn on \
  --cont-batching \
  --parallel 1 \
  --host 0.0.0.0 \
  --split-mode layer \
  --tensor-split 0.50,0.50 \
  -ctk turbo3 \
  -ctv turbo3 \
  -ngl 999 \
  --temp 0 \
  --top-k 1 \
  --top-p 1 \
  --min-p 0 \
  --repeat-penalty 1.0 \
  --spec-type mtp \
  --spec-draft-n-max 1 \
  --jinja \
  --chat-template-kwargs '{"enable_thinking":false}' \
  --reasoning off

Quick smoke:

curl -s http://127.0.0.1:9069/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"intern-s2-mtp","messages":[{"role":"user","content":"Reply exactly: ok"}],"max_tokens":8,"temperature":0,"stream":false}'

A valid MTP smoke response should report nonzero timings.draft_n and timings.draft_n_accepted, which shows that draft tokens were both proposed and accepted.
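
The same check can be scripted. The sketch below resends the curl request with Python's standard library and inspects the two timings fields named above. The helper names and the 120-second timeout are illustrative choices; the sketch assumes the response carries a top-level "timings" object, as the field names above imply.

```python
import json
import urllib.request

def mtp_smoke_ok(response):
    """True when the response reports at least one drafted and one accepted token."""
    timings = response.get("timings") or {}
    drafted = timings.get("draft_n") or 0
    accepted = timings.get("draft_n_accepted") or 0
    return drafted > 0 and accepted > 0

def run_smoke(base_url="http://127.0.0.1:9069", model="intern-s2-mtp"):
    """Send the same request as the curl smoke and return the parsed JSON."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": "Reply exactly: ok"}],
        "max_tokens": 8,
        "temperature": 0,
        "stream": False,
    }
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print("MTP active:", mtp_smoke_ok(run_smoke()))
```

A False result with an otherwise valid completion suggests the server started without speculative decoding rather than a broken model.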

Vision alias

Vision is served as a separate conservative alias. It intentionally disables TurboKV/MTP/flash-attn during image bring-up.

llama-server \
  --model Intern-S2-Preview-FP8-text-Q4_K_M.gguf \
  --mmproj mmproj-Intern-S2-Preview-FP8-F16.gguf \
  --no-mmproj-offload \
  --ctx-size 4096 \
  --threads 16 \
  --batch-size 128 \
  --ubatch-size 32 \
  --flash-attn off \
  --parallel 1 \
  --split-mode layer \
  --tensor-split 0.50,0.50 \
  -ctk f16 \
  -ctv f16 \
  -ngl 999 \
  --temp 0 \
  --top-k 1 \
  --top-p 1 \
  --min-p 0 \
  --repeat-penalty 1.0 \
  --jinja \
  --chat-template-kwargs '{"enable_thinking":false}' \
  --reasoning off

Verified image smoke response:

A red square with the text "RED SQUARE" in black, centered on a white background.

See docs/VISION_STATUS.md for the exact API request and the local llama.cpp/mtmd runtime patch required to avoid the image-embedding crash.
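
For orientation only, an image request in the OpenAI-style chat format can be built as sketched below. The authoritative request is the one recorded in docs/VISION_STATUS.md. The alias name "intern-s2-vision", the prompt, the token limit, and the image filename here are hypothetical placeholders.

```python
import base64
import json

def build_vision_request(image_bytes, prompt, model="intern-s2-vision", mime="image/png"):
    """Build an OpenAI-style chat payload with the image inlined as a data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_uri = f"data:{mime};base64,{encoded}"
    return {
        "model": model,  # hypothetical alias; use the name from your own serving config
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_uri}},
            ],
        }],
        "max_tokens": 64,
        "temperature": 0,
        "stream": False,
    }

if __name__ == "__main__":
    with open("red_square.png", "rb") as fh:  # hypothetical test image
        payload = build_vision_request(fh.read(), "Describe this image.")
    # POST this JSON to /v1/chat/completions on the vision alias.
    print(json.dumps(payload)[:200])
```

Inlining the image as a base64 data URI keeps the request self-contained, which suits a small smoke image within the 4096-token context used by this alias.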

Benchmarks from local smoke

Generation via llama-swap/OpenAI API, deterministic settings (temperature=0, top_k=1, top_p=1):

prompt         completion tokens   tok/s    MTP accepted
smoke          3                   126.51   1/1
short_reason   93                  111.41   37/46
code           189                 147.03   90/94
longer         256                 111.09   101/127
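
The "MTP accepted" column can be turned into acceptance rates. The sketch below assumes each entry reads as accepted/drafted tokens, that is draft_n_accepted over draft_n; the helper name acceptance_rate is illustrative.

```python
# (prompt, accepted, drafted) copied from the "MTP accepted" column above.
RUNS = [
    ("smoke", 1, 1),
    ("short_reason", 37, 46),
    ("code", 90, 94),
    ("longer", 101, 127),
]

def acceptance_rate(accepted, drafted):
    """Fraction of drafted tokens that were accepted; 0.0 if nothing was drafted."""
    return accepted / drafted if drafted else 0.0

for name, accepted, drafted in RUNS:
    print(f"{name:13} {acceptance_rate(accepted, drafted):.1%}")

total_accepted = sum(a for _, a, _ in RUNS)
total_drafted = sum(d for _, _, d in RUNS)
print(f"{'overall':13} {acceptance_rate(total_accepted, total_drafted):.1%}")
```

Under that reading, the code prompt has the highest acceptance (about 95.7%) and also the highest tok/s, while short_reason and longer sit near 80%.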

Perplexity smoke on a small fixed local technical corpus:

PPL = 28.8177 +/- 9.76237
prompt eval: 192 tokens, 1743.28 tokens/s

This PPL is a smoke test only, not a standardized benchmark. Raw artifacts are under bench/.

Conversion warning

This model does not convert cleanly with stock llama.cpp as a top-level InternS2PreviewForConditionalGeneration model. Required handling:

  1. flatten config.json["text_config"] into a llama.cpp-compatible language GGUF shim
  2. force architectures = ["Qwen3_5MoeForCausalLM"]
  3. preserve quantization and tokenizer metadata
  4. ignore packed FP8 expert scale tensors during expert merge
  5. remap MTP layer tensor names and block IDs
  6. pad tokenizer metadata from raw config.json vocab size (251392)

See docs/CONVERSION_AND_SERVING.md for exact patch shapes and commands.
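
Steps 1 to 3 and step 6 can be illustrated with a small sketch. This is not the patch shipped under patches/. The key name "quantization_config", the "[PAD{i}]" placeholder naming, and both function names are assumptions made for illustration.

```python
import copy

def flatten_text_config(config, keep_keys=("quantization_config",)):
    """Promote config["text_config"] to the top level for a CausalLM-style converter."""
    flat = copy.deepcopy(config["text_config"])          # step 1: flatten
    flat["architectures"] = ["Qwen3_5MoeForCausalLM"]    # step 2: force architecture
    for key in keep_keys:                                # step 3: keep wrapper-level metadata
        if key in config and key not in flat:
            flat[key] = copy.deepcopy(config[key])
    return flat

def pad_tokens(tokens, vocab_size):
    """Step 6: pad the token list with placeholders up to the config vocab size."""
    if len(tokens) > vocab_size:
        raise ValueError("tokenizer has more tokens than config vocab_size")
    return list(tokens) + [f"[PAD{i}]" for i in range(len(tokens), vocab_size)]

if __name__ == "__main__":
    # Toy stand-in for the wrapper config.json; real files carry many more keys.
    wrapper = {
        "architectures": ["InternS2PreviewForConditionalGeneration"],
        "text_config": {"hidden_size": 2048, "vocab_size": 251392},
        "quantization_config": {"quant_method": "fp8"},
    }
    flat = flatten_text_config(wrapper)
    print(flat["architectures"], flat["vocab_size"])
    print(len(pad_tokens(["a", "b"], flat["vocab_size"])))
```

Working on a deep copy leaves the original wrapper config untouched, so the shim can be written to a separate directory next to the unmodified checkpoint.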

Limitations

  • This package is the multimodal Intern-S2 GGUF setup: language GGUF plus mmproj-Intern-S2-Preview-FP8-F16.gguf. Vision requires the patched llama.cpp/TurboQuant runtime documented in docs/VISION_STATUS.md.
  • Vision is verified with a conservative 4K-context alias. 65K/200K vision has not been revalidated.
  • 200K text-context serving (--ctx-size 204800) uses TurboKV and runs at 6.25x the original n_ctx_train = 32768; treat it as an operational smoke, not a full quality guarantee.
  • Deterministic serving is recommended until stochastic MTP behavior is separately validated.