Intern-S2-Preview-oQ4-mtp
4-bit Quantized Intern-S2-Preview via oMLX oQ — with time_series stripped for inference compatibility.
Model ID: saintlits/Intern-S2-Preview-oQ4-mtp
Base model: Intern-S2-Preview by Shanghai AI Laboratory
Quantization tool: oMLX — optimal quantization for MLX
Original discussion: From oQ to Inference: Full-chain Quantization Post-mortem
Model Description
Intern-S2-Preview is a multimodal MoE (Mixture of Experts) model with vision + text capabilities, built on a Qwen3.5-MoE text backbone. This quantized variant reduces memory footprint from the original bfloat16 weights to ~20 GB via 4-bit quantization while retaining the Multi-Token Prediction (MTP) head.
Architecture Highlights
| Component | Detail |
|---|---|
| Text backbone | Qwen3.5-MoE, 256 experts, 8 experts/token, shared expert |
| Hidden size | 2048 |
| Layers | 40 |
| Attention heads | 16 (2 KV heads) |
| Head dim | 256 |
| Hybrid attention | Linear attention (efficient) × 3 + Full attention × 1, repeating every 4 layers |
| Context window | 262,144 tokens (262K) |
| Activation | SiLU, RMSNorm |
| RoPE | Multi-mode (mRoPE, 3 sections: 11/11/10), θ=10M, partial rotary factor=0.25 |
| MTP (Multi-Token Prediction) | 1 extra layer, shared embeddings with main model |
| Vision encoder | CLIP-style, 17 transformer layers, d_model=768, 8 attention heads |
| Time series | ❌ Stripped — 511 time-series parameters removed during quantization |
| Vocabulary | 251,392 tokens (Qwen-style tokenizer) |
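The hybrid attention row can be made concrete with a short sketch. It assumes each block of four layers is ordered as three linear-attention layers followed by one full-attention layer, that only full-attention layers keep a per-token KV cache, and that the cache is stored in 16-bit values; none of these details are confirmed by the table, and the function names are illustrative.

```python
NUM_LAYERS = 40
PERIOD = 4  # assumed ordering: 3 linear-attention layers, then 1 full-attention layer


def layer_types(num_layers=NUM_LAYERS, period=PERIOD):
    """Return 'linear' or 'full' for each layer index under the assumed ordering."""
    return ["full" if (i + 1) % period == 0 else "linear" for i in range(num_layers)]


def kv_cache_bytes_per_token(num_full_layers, kv_heads=2, head_dim=256, bytes_per_value=2):
    """Per-token KV cache size; the factor 2 covers keys and values."""
    return num_full_layers * 2 * kv_heads * head_dim * bytes_per_value


types = layer_types()
full_layers = types.count("full")
per_token = kv_cache_bytes_per_token(full_layers)
print(f"{full_layers} full-attention layers, {per_token} KV-cache bytes per token")
```

Under these assumptions only a quarter of the layers contribute to KV-cache growth, which may partly explain why peak memory rises slowly with prompt length in the performance tables.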
Quantization Scheme (Mixed-Precision)
Quantized with oMLX oQ (affine quantization, per-group):
| Weight Group | Bits | Group Size |
|---|---|---|
| Most linear layers | 4 | 64 |
| Linear attention projections | 6 | 64 |
| Full attention Q/K projections | 6 | 64 |
| Shared expert MLPs | 6 | 128 |
| Shared expert gate | 8 | 64 |
| LM head | 8 | 64 |
| Vision encoder | 4 | 64 |
The lm_head and shared expert gate are retained at higher precision (8-bit) to preserve output quality and routing accuracy.
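To illustrate what per-group affine quantization means in general, here is a minimal pure-Python sketch. It is not the oMLX oQ implementation: the function names are made up for this example, and the storage estimate assumes one 16-bit scale and one 16-bit bias per group.

```python
def quantize_group(weights, bits):
    """Affine-quantize one group of floats to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo  # lo acts as the per-group bias


def dequantize_group(codes, scale, bias):
    """Map integer codes back to approximate float weights."""
    return [c * scale + bias for c in codes]


def effective_bits(bits, group_size, overhead_bits=32):
    """Bits per weight including the assumed per-group scale and bias."""
    return bits + overhead_bits / group_size


group = [i / 63 - 0.5 for i in range(64)]  # 64 evenly spaced toy weights
codes, scale, bias = quantize_group(group, bits=4)
restored = dequantize_group(codes, scale, bias)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(f"max reconstruction error: {max_err:.4f} (half a step is {scale / 2:.4f})")
print(f"4-bit, group 64: {effective_bits(4, 64)} bits per weight")
```

A smaller group size gives each scale and bias fewer weights to cover, which lowers the error at the cost of more per-group overhead.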
Model Files
├── config.json # Full model configuration (4881 lines)
├── generation_config.json # Generation parameters
├── chat_template.jinja # Custom chat template (vision + text)
├── model.safetensors.index.json
├── model-00001-of-00012.safetensors (~2.1 GB each)
├── model-00002-of-00012.safetensors
├── ...
├── model-00012-of-00012.safetensors (84 MB)
├── modeling_interns2_preview.py # Custom model code
├── configuration_interns2_preview.py
├── processing_interns2_preview.py # Image/text processing
├── preprocessor_config.json
├── tokenizer.json / tokenizer_config.json / vocab.json / merges.txt
├── special_tokens_map.json
Usage
oMLX / mlx-lm (Apple Silicon)
import mlx.core as mx
from mlx_lm import load, generate
model, tokenizer = load("saintlits/Intern-S2-Preview-oQ4-mtp")
response = generate(model, tokenizer, "Hello, what is the capital of France?", verbose=True)
HuggingFace Transformers
from transformers import AutoModel, AutoTokenizer, AutoConfig
model_id = "saintlits/Intern-S2-Preview-oQ4-mtp"
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    config=config,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)
⚠️ trust_remote_code=True is required because the model uses custom modeling code.
vLLM
pip install vllm
vllm serve "saintlits/Intern-S2-Preview-oQ4-mtp"
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "saintlits/Intern-S2-Preview-oQ4-mtp",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
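The same request can be sent from Python using only the standard library. This is a sketch that assumes the server started by the vllm serve command above is reachable at localhost:8000 and returns an OpenAI-style chat completion body; the helper names build_chat_request and chat are illustrative.

```python
import json
import urllib.request

MODEL_ID = "saintlits/Intern-S2-Preview-oQ4-mtp"


def build_chat_request(prompt, model=MODEL_ID, base_url="http://localhost:8000"):
    """Build a POST request mirroring the curl command above."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the request and return the assistant text (requires a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example, with the server running:
#   print(chat("Hello!"))
```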
Benchmarks
Summary Comparison
| Benchmark | oQ4-mtp (this model) | oQ6-mtp |
|---|---|---|
| ARC_CHALLENGE | 96.3% | 96.7% |
| GSM8K | 94.0% | — |
| CMMLU | — | 90.3% |
| TRUTHFULQA | — | 86.3% |
| LIVECODEBENCH (No Think) | 44.0% | — |
| LIVECODEBENCH (Think) | — | 68.0% |
Detailed Results
oQ4-mtp (this model)
| Benchmark | Accuracy | Correct | Total | Time (s) | Think |
|---|---|---|---|---|---|
| ARC_CHALLENGE | 96.3% | 289 | 300 | 187.2 | No |
| GSM8K | 94.0% | 94 | 100 | 255.8 | No |
| LIVECODEBENCH | 44.0% | 44 | 100 | 3895.0 | No |
Benchmarks were run on an M4 Pro (48 GB) using mlx-lm, on sampled subsets of the standard evaluation sets. Think mode enables chain-of-thought reasoning.
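Because these scores come from sampled subsets of 100 to 300 items, it can help to see how the accuracy figures follow from the Correct and Total columns and roughly how much sampling noise to expect. The sketch below uses a simple binomial standard error under a normal approximation, which is an assumption of this example and not part of the original evaluation.

```python
import math

# (correct, total) pairs copied from the detailed results table
results = {
    "ARC_CHALLENGE": (289, 300),
    "GSM8K": (94, 100),
    "LIVECODEBENCH": (44, 100),
}


def accuracy_pct(correct, total):
    """Accuracy as a percentage, rounded to one decimal like the table."""
    return round(100 * correct / total, 1)


def std_error_pct(correct, total):
    """Binomial standard error of the accuracy, in percentage points."""
    p = correct / total
    return 100 * math.sqrt(p * (1 - p) / total)


for name, (correct, total) in results.items():
    print(f"{name}: {accuracy_pct(correct, total)}% +/- {std_error_pct(correct, total):.1f} pp")
```

With subsets this small, differences of a fraction of a percentage point between variants are within the expected noise.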
Inference Performance
oQ4-mtp — Single request vs Continuous batching (M4 Pro, 48 GB)
| Test | TTFT (ms) | TPOT (ms) | pp TPS | tg TPS | E2E (s) | Throughput | Peak Mem |
|---|---|---|---|---|---|---|---|
| pp1024/tg128 | 1310.7 | 15.48 | 781.3 | 64.3 | 3.302 | 348.9 tok/s | 20.02 GB |
| pp4096/tg128 | 4536.2 | 15.48 | 903.0 | 65.1 | 6.503 | 649.6 tok/s | 20.80 GB |
| pp8192/tg128 | 9302.1 | 15.67 | 880.7 | 64.3 | 11.292 | 736.8 tok/s | 21.14 GB |
| pp16384/tg128 | 19899.5 | 17.42 | 823.3 | 57.9 | 22.112 | 746.8 tok/s | 21.77 GB |
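The pp TPS and Throughput columns appear consistent with two simple relations: prompt tokens divided by TTFT, and total tokens divided by end-to-end time. The sketch below recomputes both from the table rows; this reading of the columns is an inference from the numbers, not a documented oMLX definition.

```python
def prefill_tps(prompt_tokens, ttft_ms):
    """Prompt-processing speed: prompt tokens per second of time-to-first-token."""
    return prompt_tokens / (ttft_ms / 1000.0)


def overall_throughput(prompt_tokens, generated_tokens, e2e_s):
    """Total tokens (prompt + generated) per second of end-to-end time."""
    return (prompt_tokens + generated_tokens) / e2e_s


# (prompt tokens, generated tokens, TTFT in ms, E2E in s) from the table above
rows = [
    (1024, 128, 1310.7, 3.302),
    (4096, 128, 4536.2, 6.503),
    (8192, 128, 9302.1, 11.292),
    (16384, 128, 19899.5, 22.112),
]

for pp, tg, ttft_ms, e2e_s in rows:
    print(
        f"pp{pp}/tg{tg}: prefill {prefill_tps(pp, ttft_ms):.1f} tok/s, "
        f"overall {overall_throughput(pp, tg, e2e_s):.1f} tok/s"
    )
```

Overall throughput rises with prompt length mainly because prefill is much faster per token than generation, so longer prompts weight the average toward the prefill rate.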
Continuous Batching (pp1024/tg128)
| Batch | tg TPS | Speedup | pp TPS | pp TPS/req | TTFT (ms) | E2E (s) |
|---|---|---|---|---|---|---|
| 1× | 64.3 | 1.00× | 781.3 | 781.3 | 1310.7 | 3.302 |
| 2× | 103.2 | 1.60× | 762.7 | 381.4 | 2575.3 | 5.165 |
| 4× | 129.7 | 2.02× | 763.5 | 190.9 | 5005.7 | 9.313 |
| 8× | 154.2 | 2.40× | 754.2 | 94.3 | 9935.3 | 17.502 |
Benchmarked with oMLX on Apple Silicon (M4 Pro).
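The Speedup and pp TPS/req columns can be recomputed from the other columns, which also shows the trade-off batching makes. The sketch assumes tg TPS is the aggregate generation rate across all requests in the batch; the per-request generation rate it derives is therefore an interpretation, not a measured value.

```python
BASELINE_TG_TPS = 64.3  # single-request generation speed from the 1x row

# batch size -> (aggregate tg TPS, aggregate pp TPS) from the table above
batches = {
    1: (64.3, 781.3),
    2: (103.2, 762.7),
    4: (129.7, 763.5),
    8: (154.2, 754.2),
}


def speedup(tg_tps, baseline=BASELINE_TG_TPS):
    """Aggregate generation speed relative to a single request."""
    return tg_tps / baseline


def per_request(rate, batch_size):
    """Share of an aggregate rate that each request in the batch receives."""
    return rate / batch_size


for batch_size, (tg_tps, pp_tps) in batches.items():
    print(
        f"{batch_size}x: speedup {speedup(tg_tps):.2f}, "
        f"pp TPS/req {per_request(pp_tps, batch_size):.1f}, "
        f"tg TPS/req {per_request(tg_tps, batch_size):.1f}"
    )
```

Aggregate generation throughput grows with batch size while each individual request decodes more slowly and waits longer for its first token.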
Generation Parameters (default)
{
"do_sample": true,
"temperature": 0.95,
"top_k": 20,
"top_p": 0.95,
"bos_token_id": 248044,
"eos_token_id": [248044, 248046],
"pad_token_id": 248044
}
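To show how temperature, top_k, and top_p interact, here is a minimal pure-Python sketch of the usual filtering order: temperature scaling, then top-k, then nucleus (top-p) truncation. It illustrates the general technique and is not the exact sampler used by mlx-lm, Transformers, or vLLM; the function names are illustrative.

```python
import math
import random


def filter_distribution(logits, temperature=0.95, top_k=20, top_p=0.95):
    """Return {token_id: probability} after temperature, top-k, and top-p filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract max for numerical stability
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)

    # Top-k: keep the k most likely tokens and renormalize.
    top = ranked[:top_k]
    mass = sum(p for p, _ in top)
    top = [(p / mass, i) for p, i in top]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for p, i in top:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break

    norm = sum(p for p, _ in kept)
    return {i: p / norm for p, i in kept}


def sample(logits, **kwargs):
    """Draw one token id from the filtered distribution."""
    dist = filter_distribution(logits, **kwargs)
    return random.choices(list(dist), weights=list(dist.values()))[0]


peaked = filter_distribution([5.0, 1.0, 0.0, -1.0])
flat = filter_distribution([0.0] * 40, top_p=1.0)
print(len(peaked), len(flat))
```

With a sharply peaked distribution the nucleus step keeps very few candidates, while with a flat distribution the top_k cap is the binding limit.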
Quantization Notes
time_series Stripping
The original Intern-S2-Preview includes 511 time-series parameters that cause loading errors in standard frameworks (HF Transformers, MLX, llama.cpp). During the quantization pipeline, these parameters were identified and removed, and the model definition was patched:
- Added TIME_SERIES_CONF = {"UNUSED": True} to config
- Patched modeling_interns2_preview.py: TIME_SERIES_CONF.get("UNUSED") guard to skip time_series loading when unused
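The stripping step can be pictured as a filter over weight names plus a config flag check. The sketch below is hypothetical: it assumes the removed tensors can be recognized by a "time_series" substring in their names, and the toy keys and helper names are invented for illustration and do not come from the actual checkpoint or pipeline.

```python
TIME_SERIES_CONF = {"UNUSED": True}  # mirrors the config flag described above


def strip_time_series(state_dict, marker="time_series"):
    """Split a name -> tensor mapping into kept entries and removed key names."""
    kept = {name: value for name, value in state_dict.items() if marker not in name}
    removed = sorted(set(state_dict) - set(kept))
    return kept, removed


def should_load_time_series(conf=TIME_SERIES_CONF):
    """Guard in the spirit of the patched modeling code: skip loading when flagged unused."""
    return not conf.get("UNUSED")


# Toy checkpoint with hypothetical key names; values stand in for tensors.
toy_weights = {
    "language_model.layers.0.mlp.gate.weight": 0,
    "vision_model.encoder.layers.0.attn.weight": 1,
    "time_series.encoder.layers.0.weight": 2,
    "time_series.projector.bias": 3,
}

kept, removed = strip_time_series(toy_weights)
print(f"kept {len(kept)} tensors, removed {len(removed)}: {removed}")
```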
Model Remapping
The internlm3 type (InternLM3ForCausalLM → Qwen2MoeForCausalLM) requires explicit model remapping in HF Transformers. When trust_remote_code=False is used and the custom type is not registered, the patch works as follows:
Before _measure_sensitivity → injects
MODEL_REMAPPING["intern_s2_preview"] = "qwen3_5_moe"
# → uses the remapped type and loads successfully
Other Patches
- generation_config.py: Added r.any/r.all attribute conditions guard for generated config
- MTP joint loss: Patched to ignore the time_series loss component
- Tokenizer: Shrunk default max length from 262k to avoid OOM in non-optimized environments; adjust via model.config.max_position_embeddings if needed
Limitations
- Time series removed: The time-series modality was stripped — this variant cannot process time-series inputs
- Not for production: This is a research/preview quantized model. Use at your own risk
- Performance variance: Benchmark results may vary depending on hardware, quantization configuration, and evaluation settings
- Custom code required: trust_remote_code=True is mandatory for HF Transformers usage
License & Acknowledgments
- Apache 2.0 (this quantized variant)
- Base model: Intern-S2-Preview — Apache 2.0
- Special thanks to the oMLX team for the quantization pipeline and to the Apple MLX community for the Apple Silicon ML framework
- Intern-S2-Preview by Shanghai AI Laboratory
Quantized and uploaded by saintlits. Full quantization post‑mortem available here.