Intern-S2-Preview-oQ6-mtp
6-bit Quantized Intern-S2-Preview via oMLX oQ — with time_series stripped for inference compatibility.
Model ID: saintlits/Intern-S2-Preview-oQ6-mtp
Base model: Intern-S2-Preview by Shanghai AI Laboratory
Quantization tool: oMLX — optimal quantization for MLX
Original discussion: From oQ to Inference: Full-chain Quantization Post-mortem
Model Description
Intern-S2-Preview is a multimodal MoE (Mixture of Experts) model with vision + text capabilities, built on a Qwen3.5-MoE text backbone. This quantized variant reduces memory footprint from the original bfloat16 weights to ~27 GB via 6-bit quantization while retaining the Multi-Token Prediction (MTP) head.
Architecture Highlights
| Component | Detail |
|---|---|
| Text backbone | Qwen3.5-MoE, 256 experts, 8 experts/token, shared expert |
| Hidden size | 2048 |
| Layers | 40 |
| Attention heads | 16 (2 KV heads) |
| Head dim | 256 |
| Hybrid attention | Linear attention (efficient) × 3 + Full attention × 1, repeating every 4 layers |
| Context window | 262,144 tokens (262K) |
| Activation | SiLU, RMSNorm |
| RoPE | Multi-mode (mRoPE, 3 sections: 11/11/10), θ=10M, partial rotary factor=0.25 |
| MTP (Multi-Token Prediction) | 1 extra layer, shared embeddings with main model |
| Vision encoder | CLIP-style, 17 transformer layers, d_model=768, 8 attention heads |
| Time series | ❌ Stripped — 511 time-series parameters removed during quantization |
| Vocabulary | 251,392 tokens (Qwen-style tokenizer) |
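The figures in the table can be cross-checked with a little arithmetic. The sketch below is illustrative only and rests on assumptions not stated in this card: that the full-attention layer is the last of each group of four, that only full-attention layers keep a per-token KV cache (linear-attention layers are assumed to hold a fixed-size state), and that the cache is stored in 16-bit values.

```python
# Values copied from the architecture table above.
NUM_LAYERS = 40
PATTERN_PERIOD = 4            # 3 linear-attention layers + 1 full-attention layer
NUM_KV_HEADS = 2
HEAD_DIM = 256
PARTIAL_ROTARY_FACTOR = 0.25
MROPE_SECTIONS = (11, 11, 10)
CONTEXT_TOKENS = 262_144
CACHE_BYTES_PER_VALUE = 2     # assumption: bf16/fp16 KV cache


def layer_kinds(num_layers=NUM_LAYERS, period=PATTERN_PERIOD):
    """Label each layer; assumes the full-attention layer closes each group."""
    return ["full" if (i + 1) % period == 0 else "linear" for i in range(num_layers)]


def full_attention_kv_bytes(tokens):
    """KV-cache bytes held by the full-attention layers for `tokens` tokens."""
    full_layers = layer_kinds().count("full")
    # K and V each store NUM_KV_HEADS * HEAD_DIM values per token per layer.
    per_token = full_layers * NUM_KV_HEADS * HEAD_DIM * 2 * CACHE_BYTES_PER_VALUE
    return tokens * per_token


# Only a quarter of each head's dimensions are rotated.
rotary_dims = int(HEAD_DIM * PARTIAL_ROTARY_FACTOR)

if __name__ == "__main__":
    kinds = layer_kinds()
    print("full layers:", kinds.count("full"), "linear layers:", kinds.count("linear"))
    print("rotary dims per head:", rotary_dims, "mRoPE pairs:", sum(MROPE_SECTIONS))
    print("KV cache at full context (GiB):", full_attention_kv_bytes(CONTEXT_TOKENS) / 1024**3)
```

Under these assumptions the mRoPE sections (11 + 11 + 10 = 32) account for exactly half of the 64 rotary dimensions, which is consistent with each section counting frequency pairs, and the KV cache at the full 262,144-token context adds about 5 GiB on top of the weights.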
Key Differences from Q4 Variant
| Aspect | Q4 (Intern-S2-Preview-oQ4-mtp) | Q6 (this model) |
|---|---|---|
| Weight precision | 4-bit mixed | 6-bit mixed |
| Memory footprint | ~22 GB | ~27 GB |
| Linear attn projections | 6-bit / 5-bit | 6-bit / 5-bit |
| Full attn Q/K projections | 6-bit | 8-bit |
| Shared expert gate | 8-bit | 8-bit |
| Peak memory (pp1024/tg128) | 20.02 GB | 27.62 GB |
| Inference speed (pp1024/tg128) | 64.3 tok/s | 59.1 tok/s |
Quantization Scheme (Mixed-Precision)
Quantized with oMLX oQ (affine quantization, per-group):
| Weight Group | Bits | Group Size |
|---|---|---|
| Most linear layers | 6 | 64 |
| Linear attention projections (in_proj_*) | 6 | 64 |
| Linear attention output projections | 5 | 128 |
| Full attention Q/K projections | 8 | 64 |
| Full attention O projection | 5 | 64 |
| Shared expert MLPs | 8 | 128 |
| Shared expert gate | 8 | 64 |
| LM head | 8 | 64 |
| Vision encoder | 6 | 64 |
The lm_head and shared expert layers are kept at 8-bit to preserve output quality and routing accuracy. Compared to Q4, the Q6 variant uses 6-bit (vs 4-bit) for most linear layers and 8-bit (vs 6-bit) for the full attention Q/K projections, trading roughly 5 GB of additional memory for higher fidelity.
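To make "affine quantization, per-group" concrete, here is a minimal pure-Python sketch of the general technique. It is not the oMLX implementation: the rounding rule and the assumption that each group stores one 16-bit scale and one 16-bit bias are illustrative, and `quantize_group`, `dequantize_group`, and `effective_bits` are hypothetical helper names.

```python
def quantize_group(weights, bits):
    """Affine-quantize one group of weights to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0  # constant groups: avoid division by zero
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo            # `lo` plays the role of the per-group bias


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights from codes plus the group's scale and bias."""
    return [c * scale + bias for c in codes]


def effective_bits(bits, group_size, scale_bits=16, bias_bits=16):
    """Average storage per weight, including per-group scale and bias overhead."""
    return bits + (scale_bits + bias_bits) / group_size


if __name__ == "__main__":
    # (bits, group size) pairs taken from the table above.
    for name, (bits, group) in {
        "most linear layers": (6, 64),
        "linear attention output projections": (5, 128),
        "full attention Q/K projections": (8, 64),
        "shared expert MLPs": (8, 128),
    }.items():
        print(f"{name}: {effective_bits(bits, group):.2f} bits/weight")
```

Under the 16-bit scale and bias assumption, a 6-bit group of 64 costs 6.5 bits per weight, while a 5-bit group of 128 costs 5.25. Larger groups lower the overhead but share one scale across more weights, so the reconstruction error per weight (at most half a scale step in this sketch) can grow.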
Model Files
├── config.json # Full model configuration (4881 lines)
├── generation_config.json # Generation parameters
├── chat_template.jinja # Custom chat template (vision + text)
├── model.safetensors.index.json
├── model-00001-of-00009.safetensors (2.4 GB)
├── model-00002-of-00009.safetensors (2.4 GB)
├── ...
├── model-00009-of-00009.safetensors (288 MB)
├── modeling_interns2_preview.py # ⚙️ Custom model code
├── configuration_interns2_preview.py # ⚙️ Custom config
├── processing_interns2_preview.py # ⚙️ Image/text processing
├── preprocessor_config.json # Vision preprocessor config
├── tokenizer.json / tokenizer_config.json / vocab.json / merges.txt
└── special_tokens_map.json
Usage
oMLX / mlx-lm (Apple Silicon)
import mlx.core as mx
from mlx_lm import load, generate
model, tokenizer = load("saintlits/Intern-S2-Preview-oQ6-mtp")
response = generate(model, tokenizer, "Hello, what is the capital of France?", verbose=True)
HuggingFace Transformers
from transformers import AutoModel, AutoTokenizer, AutoConfig
model_id = "saintlits/Intern-S2-Preview-oQ6-mtp"
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    config=config,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)
⚠️ trust_remote_code=True is required since the model uses custom modeling code (modeling_interns2_preview.py, etc.).
vLLM
# Install vLLM (if not already installed):
pip install vllm
# Start the vLLM server:
vllm serve "saintlits/Intern-S2-Preview-oQ6-mtp"
# Call the API (example):
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "saintlits/Intern-S2-Preview-oQ6-mtp",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
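The same request can be issued from Python with only the standard library. This is a sketch that mirrors the curl call above; it assumes the server is listening on localhost:8000 and returns the usual OpenAI-style `choices[0].message.content` field. `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"             # same endpoint as the curl example
MODEL_ID = "saintlits/Intern-S2-Preview-oQ6-mtp"


def build_chat_request(prompt, base_url=BASE_URL, model=MODEL_ID):
    """Build (but do not send) a chat-completions POST request."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=60):
    """Send the request and return the assistant text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    # Assumption: OpenAI-compatible response layout.
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Hello!")` only works once `vllm serve` is up; `build_chat_request` can be inspected offline to confirm the URL and payload.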
Benchmarks
Summary Comparison
| Benchmark | oQ4-mtp | oQ6-mtp (this model) |
|---|---|---|
| ARC_CHALLENGE | 96.3% | 96.7% |
| GSM8K | 94.0% | — |
| CMMLU | — | 90.3% |
| TRUTHFULQA | — | 86.3% |
| LIVECODEBENCH (No Think) | 44.0% | — |
| LIVECODEBENCH (Think) | — | 68.0% |
Detailed Results
oQ6-mtp (this model)
| Benchmark | Accuracy | Correct | Total | Time (s) | Think |
|---|---|---|---|---|---|
| ARC_CHALLENGE | 96.7% | 290 | 300 | 172.2 | No |
| CMMLU | 90.3% | 271 | 300 | 186.9 | No |
| TRUTHFULQA | 86.3% | 259 | 300 | 184.8 | No |
| MMLU | 91.6% | 458 | 500 | 11072.8 | Yes |
| MMLU_PRO | 85.3% | 256 | 300 | 8557.3 | Yes |
| JMMLU | 89.0% | 267 | 300 | 5293.6 | Yes |
| LIVECODEBENCH | 68.0% | 68 | 100 | 12370.1 | Yes |
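The accuracy column follows directly from Correct / Total. Because most runs use 300 samples, it can help to attach a rough uncertainty to each score. The sketch below uses a simple binomial standard error, which is an assumption of this example and not something the benchmark harness reports.

```python
import math

# (correct, total) pairs copied from the table above.
RESULTS = {
    "ARC_CHALLENGE": (290, 300),
    "CMMLU": (271, 300),
    "TRUTHFULQA": (259, 300),
    "MMLU": (458, 500),
    "MMLU_PRO": (256, 300),
    "JMMLU": (267, 300),
    "LIVECODEBENCH": (68, 100),
}


def accuracy_with_stderr(correct, total):
    """Return (accuracy, binomial standard error), both as fractions."""
    p = correct / total
    return p, math.sqrt(p * (1.0 - p) / total)


if __name__ == "__main__":
    for name, (correct, total) in RESULTS.items():
        acc, se = accuracy_with_stderr(correct, total)
        print(f"{name}: {acc * 100:.1f}% +/- {se * 100:.1f} pp")
```

On this estimate the ARC_CHALLENGE score carries a standard error of about one percentage point, so the 0.4-point gap between the Q4 and Q6 variants in the summary table is within sampling noise.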
oQ6-mtp — Single request (M4 Pro, 48 GB)
| Test | TTFT (ms) | TPOT (ms) | pp TPS | tg TPS | E2E (s) | Throughput | Peak Mem |
|---|---|---|---|---|---|---|---|
| pp1024/tg128 | 1440.7 | 17.06 | 710.8 | 59.1 | 3.607 | 319.4 tok/s | 27.62 GB |
| pp4096/tg128 | 4965.4 | 17.52 | 824.9 | 57.5 | 7.190 | 587.5 tok/s | 27.72 GB |
| pp8192/tg128 | 10056.7 | 18.30 | 814.6 | 55.1 | 12.380 | 672.0 tok/s | 27.93 GB |
Benchmarked with oMLX on Apple Silicon (M4 Pro).
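The columns in the single-request table are related to one another. The definitions below are inferred from the reported numbers (they reproduce each row to within rounding) and are not taken from oMLX documentation, so treat them as an assumption; `derive_metrics` is a hypothetical helper.

```python
def derive_metrics(prompt_tokens, gen_tokens, ttft_ms, e2e_s):
    """Recompute the derived columns from token counts, TTFT, and end-to-end time."""
    ttft_s = ttft_ms / 1000.0
    decode_s = e2e_s - ttft_s  # time spent after the first token
    return {
        "pp_tps": prompt_tokens / ttft_s,                     # prompt processing speed
        "tg_tps": gen_tokens / decode_s,                      # token generation speed
        "tpot_ms": decode_s / (gen_tokens - 1) * 1000.0,      # time per output token
        "throughput": (prompt_tokens + gen_tokens) / e2e_s,   # all tokens / total time
    }


if __name__ == "__main__":
    # pp1024/tg128 row: TTFT 1440.7 ms, E2E 3.607 s.
    for key, value in derive_metrics(1024, 128, 1440.7, 3.607).items():
        print(f"{key}: {value:.2f}")
```

Read this way, the "Throughput" column counts prompt and generated tokens together, which is why it rises with longer prompts even as generation speed (tg TPS) falls slightly.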
Generation Parameters (default)
{
"do_sample": true,
"temperature": 1.0,
"top_k": 20,
"top_p": 0.95,
"bos_token_id": 248044,
"eos_token_id": [248044, 248046],
"pad_token_id": 248044
}
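To show how temperature, top_k, and top_p interact, here is a small pure-Python sketch of the usual filtering order (temperature scaling, then top-k, then nucleus truncation on the renormalized top-k set). It illustrates the general technique only; the actual sampler in MLX or Transformers may differ in details, and `filter_probs` is a hypothetical helper.

```python
import math


def filter_probs(logits, temperature=1.0, top_k=20, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k, and top-p."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most likely tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_mass = sum(probs[i] for i in ranked)

    # Top-p: keep the smallest prefix whose renormalized mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_mass
        if cumulative >= top_p:
            break

    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


if __name__ == "__main__":
    toy_logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
    print(filter_probs(toy_logits, top_k=3, top_p=0.8))
```

With the defaults above (temperature 1.0, top_k 20, top_p 0.95), the logits are left unscaled, at most 20 candidates survive, and the tail of that set is trimmed once 95% of its probability mass is covered.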
Quantization Notes
time_series Stripping
The original Intern-S2-Preview includes 511 time-series parameters that cause loading errors in standard frameworks (HF Transformers, MLX, llama.cpp). During the quantization pipeline, these parameters were identified and removed, and the model definition was patched:
- Added TIME_SERIES_CONF = {"UNUSED": True} to config
- Patched modeling_interns2_preview.py: TIME_SERIES_CONF.get("UNUSED") guard to skip time_series loading when unused
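The two steps above can be sketched roughly as follows. This is not the code used in the quantization pipeline: the tensor names, the idea of matching on the substring "time_series", and the helper names `strip_time_series` and `should_load_time_series` are all assumptions made for illustration.

```python
# Mirrors the config flag described above.
TIME_SERIES_CONF = {"UNUSED": True}


def strip_time_series(weight_map, marker="time_series"):
    """Split a name -> shard mapping into kept entries and removed tensor names.

    Assumption: time-series tensors can be recognized by `marker` in their name.
    """
    kept = {name: shard for name, shard in weight_map.items() if marker not in name}
    removed = sorted(name for name in weight_map if marker in name)
    return kept, removed


def should_load_time_series(conf=TIME_SERIES_CONF):
    """Guard in the spirit of the patch: skip the branch when flagged unused."""
    return not conf.get("UNUSED", False)


if __name__ == "__main__":
    # Hypothetical tensor names, for demonstration only.
    example = {
        "model.layers.0.mlp.gate.weight": "model-00001-of-00009.safetensors",
        "model.time_series.encoder.weight": "model-00009-of-00009.safetensors",
    }
    kept, removed = strip_time_series(example)
    print("kept:", list(kept))
    print("removed:", removed)
    print("load time series?", should_load_time_series())
```

The same filter could be pointed at the weight_map inside model.safetensors.index.json to check that no time-series tensors remain in the published shards.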
Model Remapping
The internlm3 type (InternLM3ForCausalLM → Qwen2MoeForCausalLM) requires explicit model remapping in HF Transformers. When trust_remote_code=False is used and the custom type is not registered, the patch works as follows:
Before _measure_sensitivity → injects
MODEL_REMAPPING["intern_s2_preview"] = "qwen3_5_moe"
# → uses the remapped type and loads successfully
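The effect of such an entry can be shown with a stand-in dictionary. The sketch below does not import or modify any library: the local `MODEL_REMAPPING` dict and the `resolve_model_type` helper are hypothetical and only illustrate the lookup-with-fallback pattern that a remapping table implies.

```python
# Stand-in for a loader's remapping table (assumption: a plain dict lookup).
MODEL_REMAPPING = {}


def resolve_model_type(model_type, remapping=MODEL_REMAPPING):
    """Return the architecture name to load, falling back to the original type."""
    return remapping.get(model_type, model_type)


if __name__ == "__main__":
    print(resolve_model_type("intern_s2_preview"))  # unmapped: returned unchanged
    MODEL_REMAPPING["intern_s2_preview"] = "qwen3_5_moe"
    print(resolve_model_type("intern_s2_preview"))  # now resolves to the mapped type
```

Because the lookup falls back to the original name, injecting the entry only changes behavior for the one custom type and leaves every other model type untouched.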
Other Patches
- generation_config.py: Added r.any/r.all attribute conditions guard for generated config
- MTP joint loss: Patched to ignore time_series loss component
- Tokenizer: Shrunk default max length from 262k to avoid OOM in non-optimized environments; adjust via model.config.max_position_embeddings if needed
Limitations
- Time series removed: The time-series modality was stripped — this variant cannot process time-series inputs
- Not for production: This is a research/preview quantized model. Use at your own risk
- Performance variance: Benchmark results may vary depending on hardware, quantization configuration, and evaluation settings
- Custom code required: trust_remote_code=True is mandatory for HF Transformers usage
License & Acknowledgments
- Apache 2.0 (this quantized variant)
- Base model: Intern-S2-Preview — Apache 2.0
- Special thanks to the oMLX team for the quantization pipeline and to the Apple MLX community for the Apple Silicon ML framework
- Intern-S2-Preview by Shanghai AI Laboratory
Quantized and uploaded by saintlits. Full quantization post‑mortem available here.