---
base_model: oceanicity/Qwen3-4B-Instruct-2507
library_name: coreml
tags:
  - text-generation
  - coreml
  - apple-silicon
  - 8-bit
  - quantized
  - qwen
---

# Qwen3-4B-Instruct - CoreML (8-Bit Quantised)

This is an 8-bit quantised CoreML conversion of the oceanicity/Qwen3-4B-Instruct-2507 model, optimised for fast, low-memory inference on Apple Silicon via the Apple Neural Engine (ANE).

Conversion and quantisation work was performed using a customised version of 0seba's coremlmodels tool.

## Model Details

- **Architecture:** Qwen3 (4 billion parameters)
- **Precision:** 8-bit weights (linear symmetric, per-channel) / 16-bit activations
- **Context Length:** 8,192 tokens (KV cache)
- **Format:** CoreML `.mlpackage` chunks
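The linear symmetric, per-channel weight scheme above can be sketched in NumPy (a simplified illustration of the idea, not the actual coremltools quantisation pass):

```python
import numpy as np

def quantise_per_channel(w: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Linear symmetric 8-bit quantisation with one scale per output channel (axis 0)."""
    # Symmetric: the zero-point is 0; each channel's max |w| maps to 127.
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantise(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scales

w = np.random.randn(16, 64).astype(np.float32)
q, s = quantise_per_channel(w)
w_hat = dequantise(q, s)
# Rounding error is at most half a quantisation step per channel.
assert np.all(np.abs(w - w_hat) <= s / 2 + 1e-6)
```

Per-channel scales keep outlier rows from degrading the precision of every other row, which matters for transformer weight matrices.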

## Optimisations Applied

- **Linear-to-Conv2d Patching:** Transformer linear layers were patched into 1x1 convolutions to better align with the Neural Engine backend.
- **RMSNorm Fusion:** Layer normalisation layers were fused using CoreMLTools graph passes to prevent FP16 overflow.
- **Chunking:** The model was split into 8 chunks to stay within the Neural Engine's per-segment hardware memory limits.
- **Vocabulary Chunking:** The large LM head was exported as a standalone chunked model to bypass the ~16,384 dimension limit on Apple Silicon.
- **Pre-computed Position Embeddings:** RoPE embeddings were computed statically during tracing to avoid precision loss and runtime math overhead.
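The linear-to-Conv2d patch relies on a standard identity: a linear layer is equivalent to a 1x1 convolution applied to an NCHW tensor. A minimal NumPy check of that equivalence (illustrative only; the dimensions are made up and this is not the actual patching code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, seq = 8, 16, 5
w = rng.standard_normal((d_out, d_in)).astype(np.float32)
x = rng.standard_normal((seq, d_in)).astype(np.float32)

# Linear layer: (seq, d_in) @ (d_in, d_out) -> (seq, d_out)
y_linear = x @ w.T

# Same weights as a 1x1 conv kernel over an NCHW tensor of shape
# (1, d_in, 1, seq): each output channel is a dot product over input channels.
x_nchw = x.T[None, :, None, :]          # (1, d_in, 1, seq)
kernel = w[:, :, None, None]            # (d_out, d_in, 1, 1)
y_conv = np.einsum('bihw,oi->bohw', x_nchw, kernel[:, :, 0, 0])

# The conv output matches the linear output (transposed into NCHW layout).
assert np.allclose(y_conv[0, :, 0, :], y_linear.T, atol=1e-5)
```

The payoff is that the Neural Engine compiler handles 1x1 convolutions over channel-first tensors far better than generic matmuls, which is why the conversion tooling performs this rewrite.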

## Files Included

- `chunk_0.mlpackage` through `chunk_7.mlpackage`: The core transformer layers.
- `lm_head.mlpackage`: The chunked vocabulary output head.
- `embeddings.npy`: The standalone token embedding weights.
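Because the token embeddings ship as a plain NumPy array, the lookup step is a simple gather outside CoreML. A sketch, where the `(vocab_size, hidden_dim)` shape and the synthetic stand-in array are assumptions (in practice you would `np.load("embeddings.npy")`):

```python
import numpy as np

# Stand-in for np.load("embeddings.npy"); assumed shape (vocab_size, hidden_dim).
vocab_size, hidden_dim = 1000, 64
embeddings = np.random.randn(vocab_size, hidden_dim).astype(np.float16)

def embed(token_ids: list[int]) -> np.ndarray:
    """Gather one embedding row per token id -> (seq_len, hidden_dim)."""
    return embeddings[np.asarray(token_ids)]

hidden = embed([1, 5, 42])
assert hidden.shape == (3, hidden_dim)
```

The resulting hidden states are what gets fed into `chunk_0.mlpackage` as the start of the transformer stack.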

## Usage

This model is ready to be used in CoreML inference pipelines that support multi-chunked stateful transformers. Ensure that your inference engine stitches the chunks together sequentially and routes the KV cache states appropriately.
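The stitching logic can be sketched generically: treat each chunk as a callable that maps (hidden states, its KV cache) to updated values, and run them in order while routing each cache back to its own chunk. Dummy callables stand in for the eight `.mlpackage` chunks here; in a real pipeline each would wrap `coremltools.models.MLModel(...).predict(...)`, and the actual input/output names are not specified by this card:

```python
import numpy as np

def run_chunks(chunks, hidden, kv_caches):
    """Run transformer chunks sequentially for one decode step,
    routing each chunk's KV cache back to that chunk."""
    new_caches = []
    for chunk, cache in zip(chunks, kv_caches):
        hidden, cache = chunk(hidden, cache)
        new_caches.append(cache)
    return hidden, new_caches

# Dummy stand-ins for chunk_0..chunk_7: add 1 to the hidden states
# and count invocations in the "cache".
def make_chunk():
    def chunk(hidden, cache):
        return hidden + 1.0, cache + 1
    return chunk

chunks = [make_chunk() for _ in range(8)]
hidden = np.zeros((1, 4), dtype=np.float32)
caches = [0] * 8

hidden, caches = run_chunks(chunks, hidden, caches)
# After one pass, all 8 chunks have run once: hidden is all 8.0.
```

After the final chunk, the hidden states go through `lm_head.mlpackage` to produce logits over the vocabulary.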