Intern-S2-Preview-oQ6-mtp

6-bit quantized Intern-S2-Preview, produced with oMLX oQ, with the time_series parameters stripped for inference compatibility.

Model ID: saintlits/Intern-S2-Preview-oQ6-mtp
Base model: Intern-S2-Preview by Shanghai AI Laboratory
Quantization tool: oMLX (optimal quantization for MLX)
Original discussion: From oQ to Inference: Full-chain Quantization Post-mortem

Model Description

Intern-S2-Preview is a multimodal MoE (Mixture of Experts) model with vision + text capabilities, built on a Qwen3.5-MoE text backbone. This quantized variant reduces memory footprint from the original bfloat16 weights to ~27 GB via 6-bit quantization while retaining the Multi-Token Prediction (MTP) head.

Architecture Highlights

Component	Detail
Text backbone	Qwen3.5-MoE, 256 experts, 8 experts/token, shared expert
Hidden size	2048
Layers	40
Attention heads	16 (2 KV heads)
Head dim	256
Hybrid attention	Linear attention (efficient) × 3 + Full attention × 1, repeating every 4 layers
Context window	262,144 tokens (262K)
Activation	SiLU, RMSNorm
RoPE	Multimodal (mRoPE, 3 sections: 11/11/10), θ=10M, partial rotary factor=0.25
MTP (Multi-Token Prediction)	1 extra layer, shared embeddings with main model
Vision encoder	CLIP-style, 17 transformer layers, d_model=768, 8 attention heads
Time series	❌ Stripped — 511 time-series parameters removed during quantization
Vocabulary	251,392 tokens (Qwen-style tokenizer)
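
The figures in the table can be cross-checked with a little arithmetic. The sketch below is illustrative only and rests on two assumptions that the table does not state: that the full-attention layer is the last one in each group of four, and that only full-attention layers keep a per-token KV cache stored in 16-bit values. The helper names are hypothetical.

```python
# Values copied from the architecture table.
NUM_LAYERS = 40
HEAD_DIM = 256
NUM_KV_HEADS = 2
PARTIAL_ROTARY_FACTOR = 0.25
MROPE_SECTIONS = (11, 11, 10)


def layer_types(num_layers=NUM_LAYERS, period=4):
    """ASSUMPTION: every `period`-th layer is full attention, the rest are linear."""
    return ["full" if (i + 1) % period == 0 else "linear" for i in range(num_layers)]


types = layer_types()
num_full = types.count("full")
num_linear = types.count("linear")

# Only a quarter of each head is rotated, and RoPE acts on pairs of dimensions.
rotary_dim = int(HEAD_DIM * PARTIAL_ROTARY_FACTOR)
rotary_pairs = rotary_dim // 2
# The three mRoPE sections together cover all rotated pairs.
sections_cover_pairs = sum(MROPE_SECTIONS) == rotary_pairs


def kv_cache_bytes(tokens, bytes_per_value=2):
    """ASSUMPTION: 16-bit K and V tensors, cached in full-attention layers only."""
    per_token = 2 * num_full * NUM_KV_HEADS * HEAD_DIM * bytes_per_value
    return tokens * per_token


print(num_linear, num_full, rotary_pairs, sections_cover_pairs)
print(kv_cache_bytes(262_144) / 1024**3, "GiB at the full context window")
```

Under these assumptions the 3:1 hybrid layout leaves only 10 of the 40 layers with a cache that grows with context length, which would help explain the small increase in peak memory between the pp1024 and pp8192 runs reported under Benchmarks.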

Key Differences from Q4 Variant

Aspect	Q4 (Intern-S2-Preview-oQ4-mtp)	Q6 (this model)
Weight precision	4-bit mixed	6-bit mixed
Memory footprint	~22 GB	~27 GB
Linear attn projections	6-bit / 5-bit	6-bit / 5-bit
Full attn Q/K projections	6-bit	8-bit
Shared expert gate	8-bit	8-bit
Peak memory (pp1024/tg128)	20.02 GB	27.62 GB
Inference speed (pp1024/tg128)	64.3 tok/s	59.1 tok/s

Quantization Scheme (Mixed-Precision)

Quantized with oMLX oQ (affine quantization, per-group):

Weight Group	Bits	Group Size
Most linear layers	6	64
Linear attention projections (in_proj_*)	6	64
Linear attention output projections	5	128
Full attention Q/K projections	8	64
Full attention O projection	5	64
Shared expert MLPs	8	128
Shared expert gate	8	64
LM head	8	64
Vision encoder	6	64

The lm_head and shared expert layers are kept at 8-bit to preserve output quality and routing accuracy. Compared to Q4, the Q6 variant uses 6-bit (vs 4-bit) for most linear layers and 8-bit (vs 6-bit) for the full attention Q/K projections, trading roughly 5 GB of additional memory for higher fidelity.
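
Group size affects the real storage cost because per-group affine quantization stores extra parameters for every group. The sketch below estimates effective bits per weight for the scheme in the table, assuming each group carries one 16-bit scale and one 16-bit bias. That overhead is an assumption about the storage layout, not a figure from this card, and `effective_bits` is a hypothetical helper.

```python
def effective_bits(bits, group_size, scale_bits=16, bias_bits=16):
    """Bits per weight including the per-group scale and bias (assumed 16-bit each)."""
    return bits + (scale_bits + bias_bits) / group_size


# (bits, group size) pairs copied from the quantization scheme table.
SCHEME = {
    "Most linear layers": (6, 64),
    "Linear attention output projections": (5, 128),
    "Full attention Q/K projections": (8, 64),
    "Full attention O projection": (5, 64),
    "Shared expert MLPs": (8, 128),
    "LM head": (8, 64),
}

for name, (bits, group_size) in SCHEME.items():
    print(f"{name}: {effective_bits(bits, group_size):.2f} bits/weight")
```

Under this assumption a group size of 64 adds 0.5 bits per weight and a group size of 128 adds 0.25, so the nominal "6-bit" layers cost about 6.5 bits per weight on disk.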

Model Files

├── config.json                    # Full model configuration (4881 lines)
├── generation_config.json         # Generation parameters
├── chat_template.jinja            # Custom chat template (vision + text)
├── model.safetensors.index.json
├── model-00001-of-00009.safetensors (2.4 GB)
├── model-00002-of-00009.safetensors (2.4 GB)
├── ...
├── model-00009-of-00009.safetensors (288 MB)
├── modeling_interns2_preview.py   # ⚙️ Custom model code
├── configuration_interns2_preview.py  # ⚙️ Custom config
├── processing_interns2_preview.py # ⚙️ Image/text processing
├── preprocessor_config.json       # Vision preprocessor config
├── tokenizer.json / tokenizer_config.json / vocab.json / merges.txt
└── special_tokens_map.json

Usage

oMLX / mlx-lm (Apple Silicon)

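from mlx_lm import load, generate

# Download (or reuse from the local cache) the quantized weights and tokenizer.
model, tokenizer = load("saintlits/Intern-S2-Preview-oQ6-mtp")

# verbose=True prints the generated text and generation statistics.
response = generate(model, tokenizer, prompt="Hello, what is the capital of France?", verbose=True)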

HuggingFace Transformers

from transformers import AutoModel, AutoTokenizer, AutoConfig

model_id = "saintlits/Intern-S2-Preview-oQ6-mtp"

config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    config=config,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto"
)

⚠️ trust_remote_code=True is required since the model uses custom modeling code (modeling_interns2_preview.py, etc.).

vLLM

# Install vLLM (if not already installed):
pip install vllm

# Start the vLLM server:
vllm serve "saintlits/Intern-S2-Preview-oQ6-mtp"

# Call the API (example):
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "saintlits/Intern-S2-Preview-oQ6-mtp",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
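
The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes the server started above is listening on localhost:8000. The helper names and the `max_tokens` value are illustrative choices, while `temperature` and `top_p` mirror the defaults listed under Generation Parameters.

```python
import json
import urllib.request

MODEL_ID = "saintlits/Intern-S2-Preview-oQ6-mtp"


def build_chat_request(prompt, model=MODEL_ID, temperature=1.0, top_p=0.95, max_tokens=256):
    """Build the JSON body for an OpenAI-style /v1/chat/completions call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,  # illustrative cap, not a value from this card
    }


def chat(prompt, base_url="http://localhost:8000/v1", **kwargs):
    """Send one chat request and return the assistant's reply text."""
    payload = json.dumps(build_chat_request(prompt, **kwargs)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello!"))` should return the model's reply. `build_chat_request` is kept separate so the payload can be inspected without a live server.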

Benchmarks

Summary Comparison

Benchmark	oQ4-mtp	oQ6-mtp (this model)
ARC_CHALLENGE	96.3%	96.7%
GSM8K	94.0%	—
CMMLU	—	90.3%
TRUTHFULQA	—	86.3%
LIVECODEBENCH (No Think)	44.0%	—
LIVECODEBENCH (Think)	—	68.0%

Detailed Results

oQ6-mtp (this model)

Benchmark	Accuracy	Correct	Total	Time (s)	Think
ARC_CHALLENGE	96.7%	290	300	172.2	No
CMMLU	90.3%	271	300	186.9	No
TRUTHFULQA	86.3%	259	300	184.8	No
MMLU	91.6%	458	500	11072.8	Yes
MMLU_PRO	85.3%	256	300	8557.3	Yes
JMMLU	89.0%	267	300	5293.6	Yes
LIVECODEBENCH	68.0%	68	100	12370.1	Yes
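
The accuracy column follows directly from the Correct and Total columns. The sketch below recomputes it and adds a binomial standard error to give a rough sense of sampling noise at these sample sizes. The standard error is an added illustration that treats questions as independent trials. It is not part of the original evaluation.

```python
import math

# (correct, total) pairs copied from the table above.
RESULTS = {
    "ARC_CHALLENGE": (290, 300),
    "CMMLU": (271, 300),
    "TRUTHFULQA": (259, 300),
    "MMLU": (458, 500),
    "MMLU_PRO": (256, 300),
    "JMMLU": (267, 300),
    "LIVECODEBENCH": (68, 100),
}


def accuracy_with_stderr(correct, total):
    """Return (accuracy, binomial standard error), both as fractions."""
    p = correct / total
    return p, math.sqrt(p * (1 - p) / total)


for name, (correct, total) in RESULTS.items():
    p, se = accuracy_with_stderr(correct, total)
    print(f"{name}: {100 * p:.1f}% ± {100 * se:.1f} pp")
```

With 300 questions the standard error near 96.7% is about one percentage point, so under this approximation the 0.4-point ARC_CHALLENGE gap between oQ4 and oQ6 in the summary table is within sampling noise.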

oQ6-mtp — Single request (M4 Pro, 48 GB)

Test	TTFT (ms)	TPOT (ms)	pp TPS	tg TPS	E2E (s)	Throughput	Peak Mem
pp1024/tg128	1440.7	17.06	710.8	59.1	3.607	319.4 tok/s	27.62 GB
pp4096/tg128	4965.4	17.52	824.9	57.5	7.190	587.5 tok/s	27.72 GB
pp8192/tg128	10056.7	18.30	814.6	55.1	12.380	672.0 tok/s	27.93 GB

Benchmarked with oMLX on Apple Silicon (M4 Pro).
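
The derived columns in the table appear to follow from TTFT and E2E alone. The formulas below are inferred from the reported numbers, not taken from oMLX documentation, so treat them as an assumption. They reproduce the pp1024/tg128 row to within rounding.

```python
def derive_metrics(prompt_tokens, gen_tokens, ttft_ms, e2e_s):
    """Recompute throughput columns from time-to-first-token and end-to-end time."""
    ttft_s = ttft_ms / 1000.0
    decode_s = e2e_s - ttft_s  # time spent after the first token
    return {
        "pp_tps": prompt_tokens / ttft_s,                    # prompt processing speed
        "tg_tps": gen_tokens / decode_s,                     # token generation speed
        "tpot_ms": 1000.0 * decode_s / (gen_tokens - 1),     # time per output token
        "throughput": (prompt_tokens + gen_tokens) / e2e_s,  # all tokens over E2E
    }


row = derive_metrics(prompt_tokens=1024, gen_tokens=128, ttft_ms=1440.7, e2e_s=3.607)
for key, value in row.items():
    print(f"{key}: {value:.2f}")
```

Note that the Throughput column counts prompt and generated tokens together, which is why it rises with longer prompts even as tg TPS falls.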

Generation Parameters (default)

{
    "do_sample": true,
    "temperature": 1.0,
    "top_k": 20,
    "top_p": 0.95,
    "bos_token_id": 248044,
    "eos_token_id": [248044, 248046],
    "pad_token_id": 248044
}
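
To illustrate how temperature, top_k and top_p interact, here is a minimal pure-Python sketch of the usual filtering order: temperature scaling, then top-k, then nucleus (top-p) truncation. It is a generic illustration of these sampling parameters, not the code path used by any particular runtime, and the function names are hypothetical.

```python
import math
import random


def filter_probs(logits, temperature=1.0, top_k=20, top_p=0.95):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most likely tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in ranked)

    # Top-p: keep the smallest prefix whose renormalized mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


def sample(logits, rng=random, **kwargs):
    """Draw one token id from the filtered distribution."""
    dist = filter_probs(logits, **kwargs)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]
```

With the defaults above, at most 20 candidate tokens survive, and fewer when a handful of tokens already carry 95% of the probability mass.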

Quantization Notes

time_series Stripping

The original Intern-S2-Preview includes 511 time-series parameters that cause loading errors in standard frameworks (HF Transformers, MLX, llama.cpp). During quantization, these parameters were identified and removed, and the model definition was patched:

Added TIME_SERIES_CONF = {"UNUSED": True} to config
Patched modeling_interns2_preview.py: TIME_SERIES_CONF.get("UNUSED") guard to skip time_series loading when unused
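
The guard described above can be pictured as a filter over the checkpoint's weight names. The sketch below is hypothetical: only `TIME_SERIES_CONF` and its `UNUSED` key come from the notes, while the `drop_time_series` helper, the `"time_series"` name marker, and the example keys are assumptions made for illustration and do not reflect the actual patch.

```python
TIME_SERIES_CONF = {"UNUSED": True}  # flag name and value as described in the notes


def drop_time_series(weights, conf=TIME_SERIES_CONF, marker="time_series"):
    """Return a copy of `weights` without time-series entries when they are unused.

    ASSUMPTION: time-series parameters can be recognized by `marker` in their name.
    """
    if not conf.get("UNUSED"):
        return dict(weights)
    return {name: value for name, value in weights.items() if marker not in name}


# Hypothetical weight names, used only to show the behavior.
example = {
    "language_model.layers.0.mlp.gate.weight": "...",
    "time_series.encoder.layers.0.weight": "...",
}
print(sorted(drop_time_series(example)))
```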

Model Remapping

The internlm3 type (InternLM3ForCausalLM → Qwen2MoeForCausalLM) requires explicit model remapping in HF Transformers. When trust_remote_code=False is used and the custom type is not registered, the patch works as follows:

Before _measure_sensitivity → injects
MODEL_REMAPPING["intern_s2_preview"] = "qwen3_5_moe"
# → uses the remapped type and loads successfully
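
The remapping step amounts to a dictionary lookup with a fallback to the original type. A minimal sketch, assuming the loader reads `model_type` from the config and consults a remapping table before choosing an implementation. Here `resolve_model_type` is a hypothetical helper and not a function from any library.

```python
# Entry taken from the note above; any other model type falls through unchanged.
MODEL_REMAPPING = {"intern_s2_preview": "qwen3_5_moe"}


def resolve_model_type(config, remapping=MODEL_REMAPPING):
    """Return the implementation name to load for this config."""
    model_type = config["model_type"]
    return remapping.get(model_type, model_type)


print(resolve_model_type({"model_type": "intern_s2_preview"}))
```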

Other Patches

generation_config.py: Added r.any / r.all attribute conditions guard for generated config
MTP joint loss: Patched to ignore time_series loss component
Tokenizer: Shrunk default max length from 262k to avoid OOM in non-optimized environments; adjust via model.config.max_position_embeddings if needed

Limitations

Time series removed: The time-series modality was stripped — this variant cannot process time-series inputs
Not for production: This is a research/preview quantized model. Use at your own risk
Performance variance: Benchmark results may vary depending on hardware, quantization configuration, and evaluation settings
Custom code required: trust_remote_code=True is mandatory for HF Transformers usage

License & Acknowledgments

Apache 2.0 (this quantized variant)
Base model: Intern-S2-Preview — Apache 2.0
Special thanks to the oMLX team for the quantization pipeline and to the Apple MLX community for the Apple Silicon ML framework
Intern-S2-Preview by Shanghai AI Laboratory

Quantized and uploaded by saintlits. Full quantization post‑mortem available here.
