---
base_model: oceanicity/Qwen3-4B-Instruct-2507
library_name: coreml
tags:
- text-generation
- coreml
- apple-silicon
- 8-bit
- quantized
- qwen
---

# Qwen3-4B-Instruct - CoreML (8-Bit Quantised)
|
|
This is an 8-bit quantised CoreML conversion of the [oceanicity/Qwen3-4B-Instruct-2507](https://huggingface.co/oceanicity/Qwen3-4B-Instruct-2507) model, optimised for fast, low-memory inference on Apple Silicon via the Apple Neural Engine (ANE).
|
|
Conversion and quantisation were performed with a customised version of [0seba's coremlmodels tool](https://github.com/0seba/coremlmodels).
|
|
## Model Details
- **Architecture:** Qwen3 (4 billion parameters)
- **Precision:** 8-bit weights (linear symmetric, per-channel) / 16-bit activations
- **Context Length:** 8,192 tokens (KV cache)
- **Format:** CoreML `.mlpackage` chunks
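The weight precision listed above — linear symmetric, per-channel — means each output channel of a weight matrix gets its own scale, and the largest-magnitude weight in that channel maps to ±127. A minimal NumPy sketch of the scheme (not the converter's actual code):

```python
import numpy as np

def quantize_per_channel(w):
    # One scale per output channel (row), symmetric around zero:
    # the largest-magnitude weight in each row maps to +/-127.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, scale = quantize_per_channel(w)
w_hat = dequantize(q, scale)
max_err = np.abs(w - w_hat).max()  # bounded by half a quantisation step
```

Because the scheme is symmetric with no zero-point, dequantisation is a single multiply per channel, which keeps the runtime cost negligible.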
|
|
## Optimisations Applied
- **Linear-to-Conv2d Patching:** Transformer linear layers were patched into 1x1 convolutions, which map more efficiently onto the Neural Engine backend.
- **RMSNorm Fusion:** RMSNorm layers were fused using coremltools graph passes to prevent FP16 overflow.
- **Chunking:** The model was split into 8 chunks so that each segment stays within the Neural Engine's memory limits.
- **Vocabulary Chunking:** The large LM head was exported as a standalone chunked model to work around the ~16,384 tensor-dimension limit on Apple Silicon.
- **Pre-computed Position Embeddings:** RoPE embeddings were computed statically during tracing to avoid precision loss and runtime math overhead.
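The Linear-to-Conv2d patching above is lossless: a 1x1 convolution mixes channels independently at each spatial position, which is exactly a linear layer applied per token. A NumPy sketch of the equivalence (illustrative only; the converter operates on the traced graph, not on NumPy arrays):

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d_in, d_out = 5, 8, 6
x = rng.standard_normal((seq, d_in)).astype(np.float32)
W = rng.standard_normal((d_out, d_in)).astype(np.float32)

# Linear layer: y = x @ W.T, one matmul over the feature axis.
y_linear = x @ W.T

# Same computation as a 1x1 Conv2d: lay the sequence out as a
# (batch, channels, 1, seq) "image" and treat W as an
# (out_channels, in_channels, 1, 1) kernel. The 1x1 conv reduces
# over the channel axis at each position, i.e. the same matmul.
x_conv = x.T[None, :, None, :]                 # (1, d_in, 1, seq)
y_conv = np.einsum('bchw,oc->bohw', x_conv, W) # 1x1 convolution
y_conv = y_conv[0, :, 0, :].T                  # back to (seq, d_out)

same = np.allclose(y_linear, y_conv, atol=1e-5)
```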
|
|
## Files Included
- `chunk_0.mlpackage` through `chunk_7.mlpackage`: The core transformer layers.
- `lm_head.mlpackage`: The chunked vocabulary output head.
- `embeddings.npy`: The standalone token embedding weights.
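Because `embeddings.npy` ships outside the CoreML packages, the token-embedding lookup is done host-side before the first chunk runs. It is a plain row lookup; a sketch with a dummy array standing in for the real file (real shape is vocab_size x hidden_dim):

```python
import numpy as np

# Dummy stand-in for embeddings.npy; in practice:
#   emb = np.load("embeddings.npy")
rng = np.random.default_rng(0)
emb = rng.standard_normal((1000, 64)).astype(np.float16)

token_ids = np.array([3, 42, 7])
hidden_states = emb[token_ids]  # row lookup, shape (3, 64)
```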
|
|
## Usage
This model is intended for CoreML inference pipelines that support multi-chunk stateful transformers. Your inference engine must run the chunks sequentially, feeding each chunk's hidden-state output into the next, and route each chunk's KV-cache state to that chunk.
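The stitching loop described above can be sketched as follows. The chunk interface here is hypothetical (plain callables standing in for `chunk_0.mlpackage`..`chunk_7.mlpackage`); a real engine would invoke each CoreML model's prediction API with its own state object:

```python
import numpy as np

def run_stitched(chunks, hidden, kv_caches):
    """Feed each chunk's hidden-state output into the next chunk,
    giving each chunk its own KV-cache state."""
    for i, chunk in enumerate(chunks):
        hidden, kv_caches[i] = chunk(hidden, kv_caches[i])
    return hidden, kv_caches

# Dummy chunks standing in for the 8 .mlpackage segments:
def make_chunk(shift):
    def chunk(hidden, cache):
        return hidden + shift, cache + 1  # pretend-transform + cache update
    return chunk

chunks = [make_chunk(i) for i in range(8)]
hidden = np.zeros(4)
caches = [np.zeros(1) for _ in range(8)]
hidden, caches = run_stitched(chunks, hidden, caches)
```

After the final chunk, the resulting hidden states go through `lm_head.mlpackage` to produce logits.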
|
|