Qwen2.5-Coder-0.5B-Instruct – q4f16_0 (MLC LLM)

Requantized version of Qwen2.5-Coder-0.5B-Instruct for on-device inference via MLC LLM.

What is this?

This is a q4f16_0 quantized build of Qwen2.5-Coder-0.5B-Instruct, compiled for Android (arm64-v8a) with MLC LLM's mlc_llm toolchain.

  • Quantization: q4f16_0 (4-bit weights, fp16 activations, KN memory layout)
  • Target: Android / Qualcomm Adreno GPUs (OpenCL)
  • Size: ~276 MB
  • Function calling: Enabled (use_function_calling: true; see the config excerpt below)
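
The quantization scheme and the function-calling flag live in the mlc-chat-config.json shipped alongside the weights. Below is an illustrative excerpt only; field names and nesting vary across MLC LLM releases, so treat it as a sketch rather than the verbatim file:

    {
      "model_type": "qwen2",
      "quantization": "q4f16_0",
      "conv_template": {
        "name": "qwen2",
        "use_function_calling": true
      }
    }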

Why q4f16_0 instead of q4f16_1?

The q4f16_0 (KN) layout is dramatically faster on Qualcomm Adreno GPUs than q4f16_1 (NK). Measured on a OnePlus CPH2551 (Snapdragon 8 Gen 1):

Format     Generation time (10 tokens)   Throughput
q4f16_1    ~14.5 s                       ~0.7 tok/s
q4f16_0    ~2.2 s                        ~4.5 tok/s

The q4f16_1 format requires weight transposition operations that are catastrophically slow on Adreno's OpenCL implementation.
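
To see why the layout matters, consider the decode-time matrix-vector products. The NumPy sketch below is purely conceptual (it is not MLC's actual quantized kernels, and the sizes are illustrative), but it shows where the transpose enters:

    import numpy as np

    # During token-by-token decoding, each linear layer computes y = x @ W,
    # where x is a single-row activation of shape (1, K).
    K, N = 896, 4864  # illustrative sizes, roughly Qwen2.5-0.5B scale

    x = np.random.rand(1, K).astype(np.float16)
    W = np.random.rand(K, N).astype(np.float16)

    W_kn = W                          # q4f16_0: weights stored KN, read as-is
    W_nk = np.ascontiguousarray(W.T)  # q4f16_1: weights stored NK (pre-transposed)

    y_kn = x @ W_kn    # KN layout: the GEMV consumes the weights as stored
    y_nk = x @ W_nk.T  # NK layout: a transpose (or strided reads) comes first;
                       # this is the step that is slow on Adreno's OpenCL stack

    assert np.allclose(y_kn, y_nk, rtol=1e-2)  # same math, different layout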

Source

Requantized from Qwen/Qwen2.5-Coder-0.5B-Instruct. Only the weight layout was changed (q4f16_1 → q4f16_0); the model architecture, tokenizer, and training are identical to the original.
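
For reference, a sketch of the requantization flow with the mlc_llm CLI, assuming a local checkout of the upstream model; exact flags may differ slightly between MLC LLM releases:

    # Convert the fp weights to q4f16_0 (instead of the default q4f16_1)
    mlc_llm convert_weight ./Qwen2.5-Coder-0.5B-Instruct \
        --quantization q4f16_0 \
        -o ./dist/Qwen2.5-Coder-0.5B-Instruct-q4f16_0-MLC

    # Generate mlc-chat-config.json and process the tokenizer
    mlc_llm gen_config ./Qwen2.5-Coder-0.5B-Instruct \
        --quantization q4f16_0 \
        --conv-template qwen2 \
        -o ./dist/Qwen2.5-Coder-0.5B-Instruct-q4f16_0-MLC

    # Compile the model library for Android
    mlc_llm compile ./dist/Qwen2.5-Coder-0.5B-Instruct-q4f16_0-MLC/mlc-chat-config.json \
        --device android \
        -o ./dist/libs/Qwen2.5-Coder-0.5B-Instruct-q4f16_0-android.tar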

Usage with MLC LLM

Download the model weights and place them in your MLC models directory. Use model_lib: qwen2_q4f16_0_ce81ef8767dfb3f843c79deb0b3f66fc when loading.
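
For the Android chat app, that corresponds to an entry in the app's model-list config. A sketch in the older app-config.json style (field names have shifted across MLC LLM releases, and estimated_vram_bytes is a rough guess, not a measured value):

    {
      "model_list": [
        {
          "model_url": "https://huggingface.co/alexandertaboriskiy/Qwen2.5-Coder-0.5B-Instruct-q4f16_0-MLC",
          "model_id": "Qwen2.5-Coder-0.5B-Instruct-q4f16_0",
          "model_lib": "qwen2_q4f16_0_ce81ef8767dfb3f843c79deb0b3f66fc",
          "estimated_vram_bytes": 800000000
        }
      ]
    }

To sanity-check the weights off-device, MLC LLM's Python engine can pull them straight from this repo and JIT-compile a model library for your local GPU (the Android build above is not involved). A minimal sketch:

    from mlc_llm import MLCEngine

    # Downloads the weights from this Hugging Face repo and JIT-compiles a lib.
    engine = MLCEngine("HF://alexandertaboriskiy/Qwen2.5-Coder-0.5B-Instruct-q4f16_0-MLC")

    # OpenAI-style streaming chat completion
    for response in engine.chat.completions.create(
        messages=[{"role": "user", "content": "Write FizzBuzz in Python."}],
        stream=True,
    ):
        for choice in response.choices:
            print(choice.delta.content or "", end="", flush=True)

    engine.terminate()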

License

Apache-2.0 (same as the original Qwen2.5-Coder-0.5B-Instruct model).
