Intern-S2-Preview-oQ4-mtp

4-bit Quantized Intern-S2-Preview via oMLX oQ — with time_series stripped for inference compatibility.

Model ID: saintlits/Intern-S2-Preview-oQ4-mtp
Base model: Intern-S2-Preview by Shanghai AI Laboratory
Quantization tool: oMLX (optimal quantization for MLX)
Original discussion: From oQ to Inference: Full-chain Quantization Post-mortem


Model Description

Intern-S2-Preview is a multimodal MoE (Mixture of Experts) model with vision + text capabilities, built on a Qwen3.5-MoE text backbone. This quantized variant reduces memory footprint from the original bfloat16 weights to ~20 GB via 4-bit quantization while retaining the Multi-Token Prediction (MTP) head.

Architecture Highlights

| Component | Detail |
| --- | --- |
| Text backbone | Qwen3.5-MoE, 256 experts, 8 experts/token, shared expert |
| Hidden size | 2048 |
| Layers | 40 |
| Attention heads | 16 (2 KV heads) |
| Head dim | 256 |
| Hybrid attention | Linear attention (efficient) × 3 + Full attention × 1, repeating every 4 layers |
| Context window | 262,144 tokens (262K) |
| Activation | SiLU, RMSNorm |
| RoPE | Multi-mode (mRoPE, 3 sections: 11/11/10), θ=10M, partial rotary factor=0.25 |
| MTP (Multi-Token Prediction) | 1 extra layer, shared embeddings with main model |
| Vision encoder | CLIP-style, 17 transformer layers, d_model=768, 8 attention heads |
| Time series | ❌ Stripped: 511 time-series parameters removed during quantization |
| Vocabulary | 251,392 tokens (Qwen-style tokenizer) |
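
The layer pattern and rotary dimensions listed above can be checked with a little arithmetic. The sketch below assumes the full-attention layer is the last one in each group of four (the table gives the ratio, not the position), and the constant names are illustrative rather than copied from config.json.

```python
# Figures copied from the architecture table.
NUM_LAYERS = 40
HEAD_DIM = 256
PARTIAL_ROTARY_FACTOR = 0.25
MROPE_SECTIONS = [11, 11, 10]


def layer_types(num_layers, period=4):
    """Assumed pattern: three linear-attention layers, then one full-attention layer."""
    return [
        "full_attention" if (i + 1) % period == 0 else "linear_attention"
        for i in range(num_layers)
    ]


types = layer_types(NUM_LAYERS)
num_full = types.count("full_attention")
num_linear = types.count("linear_attention")

# Only a fraction of each head is rotated, and each rotary frequency
# covers a pair of dimensions.
rotary_dims = int(HEAD_DIM * PARTIAL_ROTARY_FACTOR)
rotary_pairs = rotary_dims // 2

print(f"linear layers: {num_linear}, full-attention layers: {num_full}")
print(f"rotary dims per head: {rotary_dims}, frequency pairs: {rotary_pairs}")
print(f"mRoPE sections cover all pairs: {sum(MROPE_SECTIONS) == rotary_pairs}")
```

Under these assumptions, the three mRoPE sections (11 + 11 + 10) account for exactly the rotated frequency pairs of a 256-dim head at a 0.25 partial rotary factor.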

Quantization Scheme (Mixed-Precision)

Quantized with oMLX oQ (affine quantization, per-group):

| Weight Group | Bits | Group Size |
| --- | --- | --- |
| Most linear layers | 4 | 64 |
| Linear attention projections | 6 | 64 |
| Full attention Q/K projections | 6 | 64 |
| Shared expert MLPs | 6 | 128 |
| Shared expert gate | 8 | 64 |
| LM head | 8 | 64 |
| Vision encoder | 4 | 64 |

The lm_head and shared expert gate are retained at higher precision (8-bit) to preserve output quality and routing accuracy.
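
To make "affine quantization, per-group" concrete, here is a minimal pure-Python sketch of the idea: each group of weights is mapped to unsigned integer codes with its own scale and offset. It is not the oMLX or MLX kernel (which packs codes into integers and runs on the GPU), and the storage-cost helper assumes one 16-bit scale and one 16-bit bias per group, which is an assumption rather than something stated in this card.

```python
import math


def quantize_group(weights, bits):
    """Affine-quantize one group of floats to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    w_min, w_max = min(weights), max(weights)
    # Fall back to 1.0 so a constant group does not divide by zero.
    scale = (w_max - w_min) / levels or 1.0
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights from codes, scale, and offset."""
    return [c * scale + bias for c in codes]


def bits_per_weight(bits, group_size, overhead_bits=32):
    """Effective cost per weight, assuming a 16-bit scale plus 16-bit bias per group."""
    return bits + overhead_bits / group_size


# Toy group of 64 weights; real groups come from rows of a weight matrix.
group = [0.1 * math.sin(i) for i in range(64)]
for bits in (4, 6, 8):
    codes, scale, bias = quantize_group(group, bits)
    restored = dequantize_group(codes, scale, bias)
    max_err = max(abs(a - b) for a, b in zip(group, restored))
    print(f"{bits}-bit: max error {max_err:.5f}, "
          f"about {bits_per_weight(bits, 64):.2f} bits/weight")
```

The reconstruction error is bounded by half a quantization step, which shrinks as the bit width grows. That is the usual motivation for keeping sensitive tensors such as the LM head at 8 bits.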


Model Files

├── config.json                    # Full model configuration (4881 lines)
├── generation_config.json         # Generation parameters
├── chat_template.jinja            # Custom chat template (vision + text)
├── model.safetensors.index.json
├── model-00001-of-00012.safetensors (~2.1 GB each)
├── model-00002-of-00012.safetensors
├── ...
├── model-00012-of-00012.safetensors (84 MB)
├── modeling_interns2_preview.py   # Custom model code
├── configuration_interns2_preview.py
├── processing_interns2_preview.py # Image/text processing
├── preprocessor_config.json
├── tokenizer.json / tokenizer_config.json / vocab.json / merges.txt
└── special_tokens_map.json

Usage

oMLX / mlx-lm (Apple Silicon)

from mlx_lm import load, generate

# Downloads the weights from the Hub on first use, then loads them.
model, tokenizer = load("saintlits/Intern-S2-Preview-oQ4-mtp")
response = generate(model, tokenizer, prompt="Hello, what is the capital of France?", verbose=True)

HuggingFace Transformers

from transformers import AutoModel, AutoTokenizer, AutoConfig

model_id = "saintlits/Intern-S2-Preview-oQ4-mtp"

config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    config=config,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto"
)

⚠️ trust_remote_code=True is required since the model uses custom modeling code.

vLLM

pip install vllm
vllm serve "saintlits/Intern-S2-Preview-oQ4-mtp"

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "saintlits/Intern-S2-Preview-oQ4-mtp",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
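
The same request can be sent from Python. This is a minimal sketch using only the standard library; it assumes the server started above is listening on localhost:8000 and returns the usual OpenAI-style chat completion body. The helper names `build_request` and `chat` are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL_ID = "saintlits/Intern-S2-Preview-oQ4-mtp"


def build_request(prompt, url=BASE_URL):
    """Build the POST request for a single-turn chat completion."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, url=BASE_URL, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt, url), timeout=timeout) as resp:
        body = json.load(resp)
    # Assumes an OpenAI-compatible response: choices[0].message.content
    return body["choices"][0]["message"]["content"]


# Usage (requires the server to be running):
#   print(chat("Hello!"))
```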

Benchmarks

Summary Comparison

Benchmark oQ4-mtp (this model) oQ6-mtp
ARC_CHALLENGE 96.3% 96.7%
GSM8K 94.0%
CMMLU 90.3%
TRUTHFULQA 86.3%
LIVECODEBENCH (No Think) 44.0%
LIVECODEBENCH (Think) 68.0%

Detailed Results

oQ4-mtp (this model)

| Benchmark | Accuracy | Correct | Total | Time (s) | Think |
| --- | --- | --- | --- | --- | --- |
| ARC_CHALLENGE | 96.3% | 289 | 300 | 187.2 | No |
| GSM8K | 94.0% | 94 | 100 | 255.8 | No |
| LIVECODEBENCH | 44.0% | 44 | 100 | 3895.0 | No |

Benchmarks were run on an M4 Pro (48 GB) using mlx-lm, on sampled subsets of the standard evaluation sets (100 to 300 items per benchmark). Think mode enables chain-of-thought reasoning.
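
Because the subsets are small, the reported accuracies carry noticeable sampling noise. The sketch below recomputes accuracy from the correct/total counts and adds a binomial standard error. The error estimate is an illustration added here, not part of the original evaluation, and it assumes the items are independent.

```python
import math

# (correct, total) pairs copied from the detailed results.
results = {
    "ARC_CHALLENGE": (289, 300),
    "GSM8K": (94, 100),
    "LIVECODEBENCH": (44, 100),
}


def accuracy_with_stderr(correct, total):
    """Return (accuracy, binomial standard error) for one benchmark."""
    p = correct / total
    stderr = math.sqrt(p * (1 - p) / total)
    return p, stderr


for name, (correct, total) in results.items():
    p, se = accuracy_with_stderr(correct, total)
    print(f"{name}: {100 * p:.1f}% +/- {100 * se:.1f} points (1 standard error)")
```

On a 100-item subset, a difference of a few points between two quantization levels can fall within this noise.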


Inference Performance

oQ4-mtp — Single request vs Continuous batching (M4 Pro, 48 GB)

| Test | TTFT (ms) | TPOT (ms) | pp TPS | tg TPS | E2E (s) | Throughput | Peak Mem |
| --- | --- | --- | --- | --- | --- | --- | --- |
| pp1024/tg128 | 1310.7 | 15.48 | 781.3 | 64.3 | 3.302 | 348.9 tok/s | 20.02 GB |
| pp4096/tg128 | 4536.2 | 15.48 | 903.0 | 65.1 | 6.503 | 649.6 tok/s | 20.80 GB |
| pp8192/tg128 | 9302.1 | 15.67 | 880.7 | 64.3 | 11.292 | 736.8 tok/s | 21.14 GB |
| pp16384/tg128 | 19899.5 | 17.42 | 823.3 | 57.9 | 22.112 | 746.8 tok/s | 21.77 GB |
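
Several of these columns can be derived from two measurements, TTFT and E2E. The relations below are inferred from the reported numbers rather than taken from the benchmark tool's documentation, so treat them as a working assumption: prompt-processing speed is prompt tokens over TTFT, generation speed is generated tokens over the remaining time, and throughput is all tokens over E2E.

```python
def derived_metrics(prompt_tokens, gen_tokens, ttft_ms, e2e_s):
    """Recompute pp TPS, tg TPS, and overall throughput from TTFT and E2E."""
    ttft_s = ttft_ms / 1000.0
    return {
        "pp_tps": prompt_tokens / ttft_s,
        "tg_tps": gen_tokens / (e2e_s - ttft_s),
        "throughput": (prompt_tokens + gen_tokens) / e2e_s,
    }


# (prompt tokens, generated tokens, TTFT in ms, E2E in s) from the table.
rows = [
    (1024, 128, 1310.7, 3.302),
    (4096, 128, 4536.2, 6.503),
    (8192, 128, 9302.1, 11.292),
    (16384, 128, 19899.5, 22.112),
]

for pp, tg, ttft_ms, e2e_s in rows:
    m = derived_metrics(pp, tg, ttft_ms, e2e_s)
    print(f"pp{pp}/tg{tg}: pp TPS {m['pp_tps']:.1f}, "
          f"tg TPS {m['tg_tps']:.1f}, throughput {m['throughput']:.1f} tok/s")
```

The recomputed values agree with the table to within rounding, which also shows why throughput rises with prompt length: prompt processing runs at several hundred tokens per second, while generation stays near 60 to 65.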

Continuous Batching (pp1024/tg128)

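| Batch | tg TPS | Speedup | pp TPS | pp TPS/req | TTFT (ms) | E2E (s) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 64.3 | 1.00× | 781.3 | 781.3 | 1310.7 | 3.302 |
| 2 | 103.2 | 1.60× | 762.7 | 381.4 | 2575.3 | 5.165 |
| 4 | 129.7 | 2.02× | 763.5 | 190.9 | 5005.7 | 9.313 |
| 8 | 154.2 | 2.40× | 754.2 | 94.3 | 9935.3 | 17.502 |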

Benchmarked with oMLX on Apple Silicon (M4 Pro).
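
The batching numbers can be cross-checked the same way. In this sketch, the speedup is aggregate tg TPS relative to the single-request baseline, and the batch size for each row is recovered as pp TPS divided by pp TPS/req. The per-request generation speed is a derived figure added for illustration and is not reported in the table.

```python
BASELINE_TG_TPS = 64.3

# (aggregate tg TPS, aggregate pp TPS, pp TPS per request) from the table.
rows = [
    (64.3, 781.3, 781.3),
    (103.2, 762.7, 381.4),
    (129.7, 763.5, 190.9),
    (154.2, 754.2, 94.3),
]

summary = []
for tg_tps, pp_tps, pp_tps_per_req in rows:
    batch = round(pp_tps / pp_tps_per_req)   # implied number of concurrent requests
    speedup = tg_tps / BASELINE_TG_TPS       # aggregate generation speedup
    tg_per_request = tg_tps / batch          # what each request experiences
    summary.append((batch, speedup, tg_per_request))
    print(f"batch {batch}: speedup {speedup:.2f}x, "
          f"{tg_per_request:.1f} tok/s per request")
```

Aggregate generation speed grows sublinearly with batch size, so each individual request decodes more slowly as concurrency rises, and TTFT roughly doubles with each doubling of the batch.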


Generation Parameters (default)

{
    "do_sample": true,
    "temperature": 0.95,
    "top_k": 20,
    "top_p": 0.95,
    "bos_token_id": 248044,
    "eos_token_id": [248044, 248046],
    "pad_token_id": 248044
}
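
For readers unfamiliar with these sampling knobs, the sketch below shows one common way temperature, top_k, and top_p combine: scale the logits, keep the k most likely tokens, then keep the smallest prefix whose cumulative probability reaches top_p. It is a pure-Python illustration with hypothetical function names; real frameworks differ in details such as tie handling and the order of filters.

```python
import math
import random


def filter_distribution(logits, temperature=0.95, top_k=20, top_p=0.95):
    """Return {token_id: probability} after temperature, top-k, then top-p filtering."""
    scaled = [x / temperature for x in logits]
    # Top-k: keep the indices of the k largest scaled logits.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the peak for numerical stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token_id, p in zip(ranked, probs):
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {token_id: p / norm for token_id, p in kept}


def sample(logits, rng=random, **kwargs):
    """Draw one token id from the filtered distribution."""
    dist = filter_distribution(logits, **kwargs)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


toy_logits = [2.0, 1.0, 0.0, -5.0]
print(filter_distribution(toy_logits, top_k=3, top_p=0.9))
print(sample(toy_logits, rng=random.Random(0)))
```

With a temperature just below 1 and fairly permissive top_k and top_p, the defaults keep most of the distribution and trim only the low-probability tail.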

Quantization Notes

time_series Stripping

The original Intern-S2-Preview includes 511 time-series parameters that cause loading errors in standard frameworks (HF Transformers, MLX, llama.cpp). During the quantization pipeline, these parameters were identified and removed, and the model definition was patched:

  • Added TIME_SERIES_CONF = {"UNUSED": True} to config
  • Patched modeling_interns2_preview.py with a TIME_SERIES_CONF.get("UNUSED") guard that skips time_series loading when the flag is set
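
The two steps above can be sketched as follows. This is an illustration of the approach, not the actual patch: the helper names, the "time_series" substring match, and the toy parameter names are all hypothetical, and a plain dict stands in for the real weight map.

```python
# Flag described in the notes above.
TIME_SERIES_CONF = {"UNUSED": True}


def strip_time_series(state_dict, marker="time_series"):
    """Split a weight map into kept entries and the names of removed entries."""
    kept = {name: value for name, value in state_dict.items() if marker not in name}
    removed = sorted(name for name in state_dict if marker in name)
    return kept, removed


def should_load_time_series(conf):
    """Guard: only build and load the time-series branch when it is not marked unused."""
    return not conf.get("UNUSED", False)


# Toy weight map with hypothetical parameter names.
weights = {
    "language_model.layers.0.mlp.gate.weight": "tensor-a",
    "vision_model.encoder.layers.0.attn.qkv.weight": "tensor-b",
    "time_series.encoder.layers.0.weight": "tensor-c",
}

kept, removed = strip_time_series(weights)
print(f"kept {len(kept)} tensors, removed {len(removed)}: {removed}")
print(f"load time-series branch: {should_load_time_series(TIME_SERIES_CONF)}")
```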

Model Remapping

The internlm3 type (InternLM3ForCausalLM → Qwen2MoeForCausalLM) requires explicit model remapping in HF Transformers. When trust_remote_code=False is used and the custom type is not registered, the patch works as follows:

# Injected before _measure_sensitivity runs:
MODEL_REMAPPING["intern_s2_preview"] = "qwen3_5_moe"
# The loader then uses the remapped type and loads successfully.
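
The effect of such a remapping entry can be illustrated with a small stand-in loader. Everything here is a simplified sketch: `MODEL_REMAPPING` is a local dict rather than the framework's own table, and `SUPPORTED_TYPES`, `resolve_model_type`, and `load_architecture` are hypothetical names.

```python
# Local stand-ins; a real loader keeps its own tables.
MODEL_REMAPPING = {}
SUPPORTED_TYPES = {"qwen3_5_moe"}


def resolve_model_type(config):
    """Map a custom model_type onto a supported one if a remapping exists."""
    model_type = config["model_type"]
    return MODEL_REMAPPING.get(model_type, model_type)


def load_architecture(config):
    """Return the architecture name to instantiate, or fail on unknown types."""
    model_type = resolve_model_type(config)
    if model_type not in SUPPORTED_TYPES:
        raise ValueError(f"Model type {model_type!r} is not supported")
    return model_type


config = {"model_type": "intern_s2_preview"}

try:
    load_architecture(config)
except ValueError as err:
    print(f"before remapping: {err}")

# Inject the remapping, as in the patch described above.
MODEL_REMAPPING["intern_s2_preview"] = "qwen3_5_moe"
print(f"after remapping: {load_architecture(config)}")
```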

Other Patches

  • generation_config.py: Added r.any / r.all attribute conditions guard for generated config
  • MTP joint loss: Patched to ignore time_series loss component
  • Tokenizer: Reduced the default max length from 262K to avoid OOM in non-optimized environments; adjust via model.config.max_position_embeddings if needed

Limitations

  • Time series removed: The time-series modality was stripped — this variant cannot process time-series inputs
  • Not for production: This is a research/preview quantized model. Use at your own risk
  • Performance variance: Benchmark results may vary depending on hardware, quantization configuration, and evaluation settings
  • Custom code required: trust_remote_code=True is mandatory for HF Transformers usage

License & Acknowledgments

  • Apache 2.0 (this quantized variant)
  • Base model: Intern-S2-Preview, Apache 2.0
  • Special thanks to the oMLX team for the quantization pipeline and to the Apple MLX community for the Apple Silicon ML framework
  • Intern-S2-Preview by Shanghai AI Laboratory

Quantized and uploaded by saintlits. The full quantization post-mortem is in the discussion "From oQ to Inference: Full-chain Quantization Post-mortem".
