---
base_model: oceanicity/Qwen3-4B-Instruct-2507
library_name: coreml
tags:
- text-generation
- coreml
- apple-silicon
- 8-bit
- quantized
- qwen
---

# Qwen3-4B-Instruct - CoreML (8-Bit Quantised)
|
|
This is an 8-bit quantised CoreML conversion of the [oceanicity/Qwen3-4B-Instruct-2507](https://huggingface.co/oceanicity/Qwen3-4B-Instruct-2507) model, optimised for fast, low-memory inference on Apple Silicon via the Apple Neural Engine (ANE).
|
|
Conversion and quantisation were performed with a customised version of [0seba's coremlmodels tool](https://github.com/0seba/coremlmodels).
|
|
## Model Details
- **Architecture:** Qwen3 (4 billion parameters)
- **Precision:** 8-bit weights (linear symmetric, per-channel) / 16-bit activations
- **Context Length:** 8,192 tokens (KV cache)
- **Format:** CoreML `.mlpackage` chunks
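The weight precision listed above — linear symmetric, per-channel — means each output channel of a weight matrix gets its own scale, and the largest-magnitude weight in that channel maps to ±127. A minimal NumPy sketch of the scheme (not the converter's actual code):

```python
import numpy as np

def quantize_per_channel(w):
    # One scale per output channel (row), symmetric around zero:
    # the largest-magnitude weight in each row maps to +/-127.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, scale = quantize_per_channel(w)
w_hat = dequantize(q, scale)
max_err = np.abs(w - w_hat).max()  # bounded by half a quantisation step
```

Because the scheme is symmetric with no zero-point, dequantisation is a single multiply per channel, which keeps the runtime cost negligible.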
|
|
## Optimisations Applied
- **Linear-to-Conv2d Patching:** Transformer linear layers were patched into 1x1 convolutions, which map more efficiently onto the Neural Engine backend.
- **RMSNorm Fusion:** RMSNorm layers were fused using coremltools graph passes to prevent FP16 overflow.
- **Chunking:** The model was split into 8 chunks so that each segment stays within the Neural Engine's memory limits.
- **Vocabulary Chunking:** The large LM head was exported as a standalone chunked model to work around the ~16,384 tensor-dimension limit on Apple Silicon.
- **Pre-computed Position Embeddings:** RoPE embeddings were computed statically during tracing to avoid precision loss and runtime math overhead.
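The Linear-to-Conv2d patching above is lossless: a 1x1 convolution mixes channels independently at each spatial position, which is exactly a linear layer applied per token. A NumPy sketch of the equivalence (illustrative only; the converter operates on the traced graph, not on NumPy arrays):

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d_in, d_out = 5, 8, 6
x = rng.standard_normal((seq, d_in)).astype(np.float32)
W = rng.standard_normal((d_out, d_in)).astype(np.float32)

# Linear layer: y = x @ W.T, one matmul over the feature axis.
y_linear = x @ W.T

# Same computation as a 1x1 Conv2d: lay the sequence out as a
# (batch, channels, 1, seq) "image" and treat W as an
# (out_channels, in_channels, 1, 1) kernel. The 1x1 conv reduces
# over the channel axis at each position, i.e. the same matmul.
x_conv = x.T[None, :, None, :]                 # (1, d_in, 1, seq)
y_conv = np.einsum('bchw,oc->bohw', x_conv, W) # 1x1 convolution
y_conv = y_conv[0, :, 0, :].T                  # back to (seq, d_out)

same = np.allclose(y_linear, y_conv, atol=1e-5)
```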
|
|
## Files Included
- `chunk_0.mlpackage` through `chunk_7.mlpackage`: The core transformer layers.
- `lm_head.mlpackage`: The chunked vocabulary output head.
- `embeddings.npy`: The standalone token embedding weights.
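Because `embeddings.npy` ships outside the CoreML packages, the token-embedding lookup is done host-side before the first chunk runs. It is a plain row lookup; a sketch with a dummy array standing in for the real file (real shape is vocab_size x hidden_dim):

```python
import numpy as np

# Dummy stand-in for embeddings.npy; in practice:
#   emb = np.load("embeddings.npy")
rng = np.random.default_rng(0)
emb = rng.standard_normal((1000, 64)).astype(np.float16)

token_ids = np.array([3, 42, 7])
hidden_states = emb[token_ids]  # row lookup, shape (3, 64)
```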
|
|
## Usage
This model is intended for CoreML inference pipelines that support multi-chunk stateful transformers. Your inference engine must run the chunks sequentially, feeding each chunk's hidden-state output into the next, and route each chunk's KV-cache state to that chunk.
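The stitching loop described above can be sketched as follows. The chunk interface here is hypothetical (plain callables standing in for `chunk_0.mlpackage`..`chunk_7.mlpackage`); a real engine would invoke each CoreML model's prediction API with its own state object:

```python
import numpy as np

def run_stitched(chunks, hidden, kv_caches):
    """Feed each chunk's hidden-state output into the next chunk,
    giving each chunk its own KV-cache state."""
    for i, chunk in enumerate(chunks):
        hidden, kv_caches[i] = chunk(hidden, kv_caches[i])
    return hidden, kv_caches

# Dummy chunks standing in for the 8 .mlpackage segments:
def make_chunk(shift):
    def chunk(hidden, cache):
        return hidden + shift, cache + 1  # pretend-transform + cache update
    return chunk

chunks = [make_chunk(i) for i in range(8)]
hidden = np.zeros(4)
caches = [np.zeros(1) for _ in range(8)]
hidden, caches = run_stitched(chunks, hidden, caches)
```

After the final chunk, the resulting hidden states go through `lm_head.mlpackage` to produce logits.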
|
|