Intern-S2-Preview-oQ4-mtp

Intern-S2-Preview quantized to 4-bit via oMLX oQ, with time_series stripped for inference compatibility.

Model ID: saintlits/Intern-S2-Preview-oQ4-mtp
Base model: Intern-S2-Preview by Shanghai AI Laboratory
Quantization tool: oMLX (optimal quantization for MLX)
Original discussion: From oQ to Inference: Full-chain Quantization Post-mortem

Model Description

Intern-S2-Preview is a multimodal MoE (Mixture of Experts) model with vision + text capabilities, built on a Qwen3.5-MoE text backbone. This quantized variant reduces memory footprint from the original bfloat16 weights to ~20 GB via 4-bit quantization while retaining the Multi-Token Prediction (MTP) head.

Architecture Highlights

Component	Detail
Text backbone	Qwen3.5-MoE, 256 experts, 8 experts/token, shared expert
Hidden size	2048
Layers	40
Attention heads	16 (2 KV heads)
Head dim	256
Hybrid attention	Linear attention (efficient) × 3 + Full attention × 1, repeating every 4 layers
Context window	262,144 tokens (262K)
Activation	SiLU, RMSNorm
RoPE	Multimodal (mRoPE, 3 sections: 11/11/10), θ=10M, partial rotary factor=0.25
MTP (Multi-Token Prediction)	1 extra layer, shared embeddings with main model
Vision encoder	CLIP-style, 17 transformer layers, d_model=768, 8 attention heads
Time series	❌ Stripped — 511 time-series parameters removed during quantization
Vocabulary	251,392 tokens (Qwen-style tokenizer)
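
Several rows in the table above are tied together by simple arithmetic, and the sketch below checks that. It rests on assumptions not stated in this card: that the full-attention layer is the last one in each block of four, that the mRoPE sections count frequency pairs (half the rotary dimensions), and that only full-attention layers keep a per-token KV cache stored in 16-bit. The function and constant names are illustrative.

```python
# Values copied from the architecture table.
NUM_LAYERS = 40
NUM_KV_HEADS = 2
HEAD_DIM = 256
PARTIAL_ROTARY_FACTOR = 0.25
MROPE_SECTIONS = (11, 11, 10)
CONTEXT_WINDOW = 262_144


def layer_types(num_layers, period=4):
    # Assumption: the full-attention layer closes each block of `period` layers.
    return ["full" if (i + 1) % period == 0 else "linear" for i in range(num_layers)]


layers = layer_types(NUM_LAYERS)
num_full = layers.count("full")
num_linear = layers.count("linear")

# Only a fraction of each head is rotated; each frequency rotates a pair of dims.
rotary_dims = int(HEAD_DIM * PARTIAL_ROTARY_FACTOR)
freq_pairs = rotary_dims // 2

# Assumption: K and V are cached in 16-bit (2 bytes) for full-attention layers only.
kv_bytes_per_token = num_full * NUM_KV_HEADS * HEAD_DIM * 2 * 2
kv_gib_full_context = kv_bytes_per_token * CONTEXT_WINDOW / 2**30

print(f"{num_linear} linear + {num_full} full attention layers")
print(f"{rotary_dims} rotary dims, {freq_pairs} pairs, sections sum to {sum(MROPE_SECTIONS)}")
print(f"KV cache: {kv_bytes_per_token} bytes/token, {kv_gib_full_context:.1f} GiB at full context")
```

Under these assumptions the 11/11/10 split accounts for every rotary frequency pair, and the small number of KV heads keeps the full-context cache estimate modest.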

Quantization Scheme (Mixed-Precision)

Quantized with oMLX oQ (affine quantization, per-group):

Weight Group	Bits	Group Size
Most linear layers	4	64
Linear attention projections	6	64
Full attention Q/K projections	6	64
Shared expert MLPs	6	128
Shared expert gate	8	64
LM head	8	64
Vision encoder	4	64

The lm_head and shared expert gate are retained at higher precision (8-bit) to preserve output quality and routing accuracy.
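
To make the bits and group-size columns concrete, here is a minimal sketch of generic per-group affine quantization. It is not oMLX's implementation: the helper names are hypothetical, and the storage overhead assumes one 16-bit scale and one 16-bit bias per group.

```python
import random


def quantize_group(values, bits=4):
    """Affine-quantize one group so that value ≈ scale * code + bias."""
    levels = 2**bits - 1
    bias = min(values)
    # A constant group has zero range; any nonzero scale works in that case.
    scale = (max(values) - bias) / levels or 1.0
    codes = [round((v - bias) / scale) for v in values]
    return codes, scale, bias


def dequantize_group(codes, scale, bias):
    return [scale * c + bias for c in codes]


def effective_bits(bits, group_size, overhead_bits=32):
    # Assumption: a 16-bit scale plus a 16-bit bias is stored per group.
    return bits + overhead_bits / group_size


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(64)]  # one group of 64 weights
codes, scale, bias = quantize_group(weights, bits=4)
restored = dequantize_group(codes, scale, bias)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(f"max abs error {max_error:.5f} (bound: scale/2 = {scale / 2:.5f})")
for bits, group in [(4, 64), (6, 64), (6, 128), (8, 64)]:
    print(f"{bits}-bit, group {group}: {effective_bits(bits, group)} bits/weight")
```

The rounding error per weight is bounded by half the group's scale, which is why higher bit widths (a finer scale) are reserved for the layers the card identifies as sensitive.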

Model Files

├── config.json                    # Full model configuration (4881 lines)
├── generation_config.json         # Generation parameters
├── chat_template.jinja            # Custom chat template (vision + text)
├── model.safetensors.index.json
├── model-00001-of-00012.safetensors (~2.1 GB each)
├── model-00002-of-00012.safetensors
├── ...
├── model-00012-of-00012.safetensors (84 MB)
├── modeling_interns2_preview.py   # Custom model code
├── configuration_interns2_preview.py
├── processing_interns2_preview.py # Image/text processing
├── preprocessor_config.json
├── tokenizer.json / tokenizer_config.json / vocab.json / merges.txt
├── special_tokens_map.json

Usage

oMLX / mlx-lm (Apple Silicon)

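from mlx_lm import load, generate

# Fetch the quantized weights and tokenizer from the Hugging Face Hub.
model, tokenizer = load("saintlits/Intern-S2-Preview-oQ4-mtp")
# verbose=True prints the generated text and timing statistics.
response = generate(model, tokenizer, "Hello, what is the capital of France?", verbose=True)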

HuggingFace Transformers

from transformers import AutoModel, AutoTokenizer, AutoConfig

model_id = "saintlits/Intern-S2-Preview-oQ4-mtp"

config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    config=config,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto"
)

⚠️ trust_remote_code=True is required since the model uses custom modeling code.

vLLM

pip install vllm
vllm serve "saintlits/Intern-S2-Preview-oQ4-mtp"

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "saintlits/Intern-S2-Preview-oQ4-mtp",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
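
The same request can be sent from Python with only the standard library. This is a sketch that assumes the server from the vllm serve command above is listening on localhost:8000 and returns the usual OpenAI-style chat completion body; build_chat_request and chat are illustrative helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # same host and port as the curl example
MODEL_ID = "saintlits/Intern-S2-Preview-oQ4-mtp"


def build_chat_request(prompt, base_url=BASE_URL, model=MODEL_ID):
    """Build the POST request for the /v1/chat/completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    # Assumption: OpenAI-compatible response shape.
    return body["choices"][0]["message"]["content"]
```

Calling chat("Hello!") performs the network request, so it only works while the server is running.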

Benchmarks

Summary Comparison

Benchmark	oQ4-mtp (this model)	oQ6-mtp
ARC_CHALLENGE	96.3%	96.7%
GSM8K	94.0%	—
CMMLU	—	90.3%
TRUTHFULQA	—	86.3%
LIVECODEBENCH (No Think)	44.0%	—
LIVECODEBENCH (Think)	—	68.0%

Detailed Results

oQ4-mtp (this model)

Benchmark	Accuracy	Correct	Total	Time (s)	Think
ARC_CHALLENGE	96.3%	289	300	187.2	No
GSM8K	94.0%	94	100	255.8	No
LIVECODEBENCH	44.0%	44	100	3895.0	No

Benchmarks were run on an M4 Pro (48 GB) using mlx-lm, on sampled subsets of the standard evaluation sets (100 to 300 items each, see the Total column). Think mode enables chain-of-thought reasoning.
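
Because the scores come from small samples, it helps to see how much uncertainty the sample size alone introduces. The sketch below recomputes accuracy from the Correct and Total columns and adds a binomial standard error. The standard error is my addition, using a simple normal approximation, and is not part of the reported results.

```python
import math

# (correct, total) pairs copied from the detailed results table.
RESULTS = {
    "ARC_CHALLENGE": (289, 300),
    "GSM8K": (94, 100),
    "LIVECODEBENCH": (44, 100),
}


def accuracy_with_stderr(correct, total):
    """Return (accuracy, standard error) under a binomial model."""
    p = correct / total
    return p, math.sqrt(p * (1 - p) / total)


for name, (correct, total) in RESULTS.items():
    acc, se = accuracy_with_stderr(correct, total)
    print(f"{name}: {acc * 100:.1f}% ± {se * 100:.1f} points (n={total})")
```

With 100 items, a difference of a few points between two quantization levels is within sampling noise.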

Inference Performance

oQ4-mtp — Single request vs Continuous batching (M4 Pro, 48 GB)

Test	TTFT (ms)	TPOT (ms)	pp TPS	tg TPS	E2E (s)	Throughput	Peak Mem
pp1024/tg128	1310.7	15.48	781.3	64.3	3.302	348.9 tok/s	20.02 GB
pp4096/tg128	4536.2	15.48	903.0	65.1	6.503	649.6 tok/s	20.80 GB
pp8192/tg128	9302.1	15.67	880.7	64.3	11.292	736.8 tok/s	21.14 GB
pp16384/tg128	19899.5	17.42	823.3	57.9	22.112	746.8 tok/s	21.77 GB
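
The derived columns in this table appear to follow from the measured ones. The sketch below assumes, based only on the numbers matching, that pp TPS is prompt tokens divided by TTFT and that Throughput is total tokens divided by E2E time; the function names are illustrative.

```python
# (prompt_tokens, generated_tokens, ttft_ms, e2e_s) copied from the table.
ROWS = [
    (1024, 128, 1310.7, 3.302),
    (4096, 128, 4536.2, 6.503),
    (8192, 128, 9302.1, 11.292),
    (16384, 128, 19899.5, 22.112),
]


def prefill_tps(prompt_tokens, ttft_ms):
    """Prompt-processing speed: prompt tokens per second until the first token."""
    return prompt_tokens / (ttft_ms / 1000.0)


def total_throughput(prompt_tokens, generated_tokens, e2e_s):
    """All tokens handled (prompt + generated) per second of wall-clock time."""
    return (prompt_tokens + generated_tokens) / e2e_s


for prompt, generated, ttft_ms, e2e_s in ROWS:
    print(
        f"pp{prompt}/tg{generated}: "
        f"pp TPS ≈ {prefill_tps(prompt, ttft_ms):.1f}, "
        f"throughput ≈ {total_throughput(prompt, generated, e2e_s):.1f} tok/s"
    )
```

Small differences in the last digit against the table are expected, since the inputs here are already rounded.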

Continuous Batching (pp1024/tg128)

Batch	tg TPS	Speedup	pp TPS	pp TPS/req	TTFT (ms)	E2E (s)
1×	64.3	1.00×	781.3	781.3	1310.7	3.302
2×	103.2	1.60×	762.7	381.4	2575.3	5.165
4×	129.7	2.02×	763.5	190.9	5005.7	9.313
8×	154.2	2.40×	754.2	94.3	9935.3	17.502

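Benchmarked with oMLX on Apple Silicon (M4 Pro). pp = prompt processing (prefill), tg = token generation (decode), TTFT = time to first token, TPOT = time per output token.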
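
The Speedup and pp TPS/req columns can be reproduced from the raw rates. The sketch below does that and adds a per-request efficiency figure (speedup divided by batch size), which is my own derived metric and not part of the original table.

```python
BASELINE_TG_TPS = 64.3  # single-request generation rate from the 1× row

# (batch_size, tg_tps, pp_tps) copied from the continuous batching table.
BATCHES = [(1, 64.3, 781.3), (2, 103.2, 762.7), (4, 129.7, 763.5), (8, 154.2, 754.2)]


def batching_stats(batch, tg_tps, pp_tps, baseline=BASELINE_TG_TPS):
    speedup = tg_tps / baseline
    return {
        "speedup": speedup,
        "efficiency": speedup / batch,  # fraction of ideal linear scaling
        "pp_tps_per_request": pp_tps / batch,
    }


for batch, tg_tps, pp_tps in BATCHES:
    s = batching_stats(batch, tg_tps, pp_tps)
    print(
        f"{batch}x: speedup {s['speedup']:.2f}, "
        f"efficiency {s['efficiency']:.0%}, "
        f"pp TPS/req {s['pp_tps_per_request']:.1f}"
    )
```

Aggregate generation throughput keeps rising with batch size, but each added request contributes less, and prefill throughput is shared across the batch.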

Generation Parameters (default)

{
    "do_sample": true,
    "temperature": 0.95,
    "top_k": 20,
    "top_p": 0.95,
    "bos_token_id": 248044,
    "eos_token_id": [248044, 248046],
    "pad_token_id": 248044
}
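
To show how temperature, top_k, and top_p interact, here is a small pure-Python sketch of one common ordering: temperature scaling, then top-k, then top-p over the renormalized survivors. Inference frameworks differ in ordering and tie handling, so treat this as an illustration of the idea rather than what any particular runtime does; the function names are hypothetical.

```python
import math
import random

# Defaults copied from the generation config above.
TEMPERATURE, TOP_K, TOP_P = 0.95, 20, 0.95


def candidate_distribution(logits, temperature=TEMPERATURE, top_k=TOP_K, top_p=TOP_P):
    """Return {token_id: probability} after temperature, top-k, then top-p."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # stable softmax numerators
    # Top-k: keep the k highest-scoring tokens.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
    total = sum(weights[i] for i in ranked)
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += weights[i] / total
        if cumulative >= top_p:
            break
    norm = sum(weights[i] for i in kept)
    return {i: weights[i] / norm for i in kept}


def sample_token(logits, rng=random, **kwargs):
    """Draw one token id from the filtered distribution."""
    dist = candidate_distribution(logits, **kwargs)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]
```

When one token dominates, top-p alone collapses the candidate set to that token; when the distribution is flat, top_k=20 becomes the binding limit.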

Quantization Notes

time_series Stripping

The original Intern-S2-Preview includes 511 time-series parameters that cause loading errors in standard frameworks (HF Transformers, MLX, llama.cpp). During the quantization pipeline, these parameters were identified and removed, and the model definition was patched:

Added TIME_SERIES_CONF = {"UNUSED": True} to config
Patched modeling_interns2_preview.py with a TIME_SERIES_CONF.get("UNUSED") guard that skips time_series loading when the flag is set
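
The stripping step and the guard can be pictured roughly as follows. This is a sketch, not the actual pipeline code: it assumes the removed tensors can be identified by a "time_series" substring in their names, and the example key names and helper functions are hypothetical.

```python
TIME_SERIES_CONF = {"UNUSED": True}  # flag described in the list above


def split_time_series(weights, marker="time_series"):
    """Separate a name -> tensor mapping into kept weights and removed names."""
    # Assumption: time-series tensors carry `marker` somewhere in their name.
    kept = {name: value for name, value in weights.items() if marker not in name}
    removed = sorted(name for name in weights if marker in name)
    return kept, removed


def should_load_time_series(conf=TIME_SERIES_CONF):
    """Mirror of the guard: skip the time_series branch when it is marked unused."""
    return not conf.get("UNUSED", False)


# Hypothetical tensor names, for illustration only.
example = {
    "language_model.layers.0.mlp.gate.weight": "tensor-a",
    "time_series.encoder.proj.weight": "tensor-b",
    "vision_model.embeddings.weight": "tensor-c",
}
kept, removed = split_time_series(example)
print(f"kept {len(kept)} tensors, removed {removed}")
print("load time_series branch:", should_load_time_series())
```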

Model Remapping

The internlm3 type (InternLM3ForCausalLM → Qwen2MoeForCausalLM) requires explicit model remapping in HF Transformers. When trust_remote_code=False is used and the custom type is not registered, the patch works as follows:

# Injected before _measure_sensitivity runs:
MODEL_REMAPPING["intern_s2_preview"] = "qwen3_5_moe"
# → uses the remapped type and loads successfully
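
The effect of such a remapping entry can be sketched with a plain dictionary lookup. This mirrors the general pattern only; resolve_model_type is a hypothetical helper, and the real loader's internals may differ.

```python
# The entry shown in the snippet above.
MODEL_REMAPPING = {"intern_s2_preview": "qwen3_5_moe"}


def resolve_model_type(config, remapping=MODEL_REMAPPING):
    """Return the implementation name to load for a given config."""
    model_type = config["model_type"]
    # Unknown types fall through unchanged; remapped types use the target name.
    return remapping.get(model_type, model_type)


print(resolve_model_type({"model_type": "intern_s2_preview"}))
print(resolve_model_type({"model_type": "llama"}))
```

Without the entry, the lookup would return the custom type name, which the loader does not recognize.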

Other Patches

generation_config.py: Added r.any / r.all attribute conditions guard for generated config
MTP joint loss: Patched to ignore time_series loss component
Tokenizer: Shrunk default max length from 262k to avoid OOM in non-optimized environments; adjust via model.config.max_position_embeddings if needed

Limitations

Time series removed: The time-series modality was stripped — this variant cannot process time-series inputs
Not for production: This is a research/preview quantized model. Use at your own risk
Performance variance: Benchmark results may vary depending on hardware, quantization configuration, and evaluation settings
Custom code required: trust_remote_code=True is mandatory for HF Transformers usage

License & Acknowledgments

Apache 2.0 (this quantized variant)
Base model: Intern-S2-Preview — Apache 2.0
Special thanks to the oMLX team for the quantization pipeline and to the Apple MLX community for the Apple Silicon ML framework
Intern-S2-Preview by Shanghai AI Laboratory

Quantized and uploaded by saintlits. Full quantization post‑mortem available here.
