Qwen3-4B-Instruct - CoreML (8-Bit Quantised)

This is an 8-bit quantised CoreML conversion of the oceanicity/Qwen3-4B-Instruct-2507 model. It has been heavily optimised for fast, low-memory inference on Apple Silicon via the Apple Neural Engine (ANE).

Conversion and quantisation work was performed using a customised version of 0seba's coremlmodels tool.

Model Details

  • Architecture: Qwen3 (4 Billion Parameters)
  • Precision: 8-bit Weights (Linear Symmetric, Per-Channel) / 16-bit Activations
  • Context Length: 8,192 Tokens (KV Cache)
  • Format: CoreML .mlpackage chunks
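The exact quantiser used by the conversion tool isn't reproduced here, but "Linear Symmetric, Per-Channel" has a precise meaning: each output channel of a weight matrix gets its own scale, the zero-point is fixed at 0, and values are rounded into int8. A minimal numpy sketch (illustrative shapes, not the real model's):

```python
import numpy as np

def quantize_per_channel_symmetric(w: np.ndarray):
    """Quantise a (out_channels, in_features) weight matrix to int8 with
    one symmetric scale per output channel (zero-point fixed at 0)."""
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0  # per-channel scale
    scales = np.where(scales == 0, 1.0, scales)            # guard all-zero rows
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scales

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16)).astype(np.float32)
q, scales = quantize_per_channel_symmetric(w)
w_hat = dequantize(q, scales)
# Rounding to the nearest step bounds the error at half a step per channel
assert np.all(np.abs(w - w_hat) <= scales / 2 + 1e-6)
```

Activations stay in 16-bit float at runtime; only the stored weights are int8, which is where the ~2x size reduction over FP16 comes from.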

Optimisations Applied

  • Linear-to-Conv2d Patching: Transformer linear layers were patched into 1x1 convolutions to better align with the Neural Engine backend.
  • RMSNorm Fusion: RMSNorm layers were fused using coremltools graph passes to prevent FP16 overflow in the normalisation math.
  • Chunking: The model was split into 8 chunks to safely bypass the Neural Engine's hardware memory limits per segment.
  • Vocabulary Chunking: The massive LM head was exported as a standalone chunked model to bypass the ~16,384 dimension limit on Apple Silicon.
  • Pre-computed Position Embeddings: RoPE embeddings were computed statically during tracing to avoid precision loss and runtime math overhead.
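The Linear-to-Conv2d patch above works because a linear layer and a 1x1 convolution compute the same product, just over different memory layouts; the ANE backend prefers the channels-first convolution layout. A small numpy demonstration of the equivalence (hypothetical shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq, d_in, d_out = 2, 5, 16, 8
x = rng.standard_normal((batch, seq, d_in)).astype(np.float32)
w = rng.standard_normal((d_out, d_in)).astype(np.float32)

# Linear layer: (batch, seq, d_in) times the transposed weight
y_linear = x @ w.T

# Same weights applied as a 1x1 Conv2d over a (batch, channels, 1, seq)
# layout, the data format the Neural Engine backend prefers
x_nchw = x.transpose(0, 2, 1).reshape(batch, d_in, 1, seq)
y_conv = np.einsum("bihw,oi->bohw", x_nchw, w)   # 1x1 conv == channel matmul
y_conv_as_linear = y_conv.reshape(batch, d_out, seq).transpose(0, 2, 1)

assert np.allclose(y_linear, y_conv_as_linear, atol=1e-5)
```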

Files Included

  • chunk_0.mlpackage through chunk_7.mlpackage: The core transformer layers.
  • lm_head.mlpackage: The chunked vocabulary output head.
  • embeddings.npy: The standalone token embedding weights.
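Because the token embeddings ship as a standalone numpy array rather than inside a CoreML chunk, the lookup is a plain row-index operation performed before the first transformer chunk runs. A hedged sketch, using a synthetic table in place of the real embeddings.npy (sizes are illustrative, not the model's actual vocabulary or hidden dimension):

```python
import numpy as np

# Synthetic stand-in for: embeddings = np.load("embeddings.npy")
vocab_size, hidden_dim = 1000, 64   # illustrative sizes only
embeddings = np.random.default_rng(0).standard_normal(
    (vocab_size, hidden_dim)).astype(np.float16)

token_ids = np.array([[5, 42, 7]])       # (batch, seq) from the tokenizer
hidden_states = embeddings[token_ids]    # (batch, seq, hidden_dim)
assert hidden_states.shape == (1, 3, hidden_dim)
```

The resulting hidden states are then fed into chunk_0.mlpackage as its input.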

Usage

This model is ready to be used in CoreML inference pipelines that support multi-chunked stateful transformers. Your inference engine must run the chunks sequentially, feeding each chunk's hidden-state output into the next, and must route each chunk's KV-cache state back to that same chunk on every decode step.
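The exact runtime API depends on your engine, but the data flow can be sketched with stand-in functions: hidden states pass through the chunks in order, each chunk owns its own KV cache, and the chunked LM head runs last. Everything below (weights, cache format, chunk compute) is a hypothetical placeholder for the real compiled .mlpackage models:

```python
import numpy as np

HIDDEN, VOCAB, N_CHUNKS = 64, 1000, 8   # illustrative sizes only
rng = np.random.default_rng(0)

# Placeholders for the compiled chunk_0..chunk_7 models and lm_head
chunk_weights = [rng.standard_normal((HIDDEN, HIDDEN)).astype(np.float32) * 0.1
                 for _ in range(N_CHUNKS)]
lm_head_w = rng.standard_normal((VOCAB, HIDDEN)).astype(np.float32)

def run_chunk(i, hidden, kv_cache):
    """Stand-in for one chunk: transform hidden states, update its cache."""
    new_hidden = np.tanh(hidden @ chunk_weights[i])   # placeholder compute
    return new_hidden, kv_cache + [new_hidden]        # placeholder cache update

def forward(token_hidden, caches):
    h = token_hidden
    for i in range(N_CHUNKS):           # stitch the chunks sequentially
        h, caches[i] = run_chunk(i, h, caches[i])
    logits = h @ lm_head_w.T            # chunked LM head runs last
    return logits, caches

caches = [[] for _ in range(N_CHUNKS)]  # one KV cache per chunk
h0 = rng.standard_normal((1, HIDDEN)).astype(np.float32)
logits, caches = forward(h0, caches)
next_token = int(np.argmax(logits, axis=-1)[0])
```

In a real pipeline the placeholder functions would be `predict` calls on the loaded .mlpackage chunks, with the KV caches held as model state rather than Python lists.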
