# Qwen2.5-Coder-0.5B-Instruct – q4f16_0 (MLC LLM)

Requantized version of Qwen2.5-Coder-0.5B-Instruct for on-device inference via MLC LLM.
## What is this?

This is a q4f16_0 quantized build of Qwen2.5-Coder-0.5B-Instruct, compiled for Android (arm64-v8a) using MLC LLM's `mlc_llm package` toolchain.

- Quantization: q4f16_0 (4-bit weights, fp16 activations, KN memory layout)
- Target: Android / Qualcomm Adreno GPUs (OpenCL)
- Size: ~276 MB
- Function calling: enabled (`use_function_calling: true`)
## Why q4f16_0 instead of q4f16_1?

The q4f16_0 layout (KN) is dramatically faster than q4f16_1 (NK) on Qualcomm Adreno GPUs. Measured on a OnePlus CPH2551 (Snapdragon 8 Gen 1):
| Format | Generation time (10 tokens) |
|---|---|
| q4f16_1 | ~14.5s (0.7 tok/s) |
| q4f16_0 | ~2.2s (4.5 tok/s) |
The q4f16_1 format requires weight transposition operations that are catastrophically slow on Adreno's OpenCL implementation.
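The layout difference can be illustrated with a small NumPy sketch (illustrative only; the real MLC kernels operate on packed 4-bit weights with fp16 activations, and the sizes below are arbitrary). With a KN layout the weight matrix is stored the way `x @ W` consumes it; with an NK layout the same result requires a transpose, which is the access pattern that is slow on Adreno's OpenCL stack.

```python
import numpy as np

K, N = 256, 512                     # input dim x output dim, arbitrary sizes
x = np.random.rand(K).astype(np.float32)

# q4f16_0-style KN layout: stored as (K, N), consumed as-is by x @ W.
w_kn = np.random.rand(K, N).astype(np.float32)
y = x @ w_kn

# q4f16_1-style NK layout: stored as (N, K); producing the same output
# needs a transpose (or a transposed access pattern) on every GEMV.
w_nk = np.ascontiguousarray(w_kn.T)
y_from_nk = x @ w_nk.T

# Same math, same result; only the storage layout differs.
assert y.shape == (N,)
assert np.allclose(y, y_from_nk, rtol=1e-4)
```

The requantization changes nothing about the computed result, only which of the two storage orders the generated kernels read.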
## Source

- Original model: Qwen/Qwen2.5-Coder-0.5B-Instruct (Apache-2.0)
- Quantization source: mlc-ai/Qwen2.5-Coder-0.5B-Instruct-q4f16_1-MLC
- Requantized by: NavixMind using `mlc_llm package` with mobile overrides (`context_window_size=4096`, `prefill_chunk_size=1024`, `max_batch_size=1`)
Only the weight layout was changed (q4f16_1 → q4f16_0). The model architecture, tokenizer, and training are identical to the original.
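A requantization along these lines can be reproduced with the MLC LLM CLI. The commands below are a sketch, not the exact invocation used here: the local paths and the `qwen2` conversation-template name are assumptions, and flag spellings may differ between MLC LLM versions.

```shell
# Convert the original fp16 weights to the q4f16_0 layout.
mlc_llm convert_weight ./Qwen2.5-Coder-0.5B-Instruct \
    --quantization q4f16_0 \
    -o ./Qwen2.5-Coder-0.5B-Instruct-q4f16_0-MLC

# Generate the chat config with the mobile overrides listed above.
mlc_llm gen_config ./Qwen2.5-Coder-0.5B-Instruct \
    --quantization q4f16_0 \
    --conv-template qwen2 \
    --context-window-size 4096 \
    --prefill-chunk-size 1024 \
    --max-batch-size 1 \
    -o ./Qwen2.5-Coder-0.5B-Instruct-q4f16_0-MLC

# Bundle the weights and compiled model library for the Android target.
mlc_llm package
```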
## Usage with MLC LLM

Download the model weights and place them in your MLC models directory. Use `model_lib: qwen2_q4f16_0_ce81ef8767dfb3f843c79deb0b3f66fc` when loading.
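For the Android app flow, the weights and `model_lib` are typically declared in the packaging config (`mlc-package-config.json`). The entry below is a sketch: the `estimated_vram_bytes` value is a placeholder assumption, and exact field names may vary between MLC LLM versions, so check the packaging docs for your release.

```json
{
  "device": "android",
  "model_list": [
    {
      "model": "HF://alexandertaboriskiy/Qwen2.5-Coder-0.5B-Instruct-q4f16_0-MLC",
      "model_id": "Qwen2.5-Coder-0.5B-Instruct-q4f16_0",
      "estimated_vram_bytes": 800000000,
      "model_lib": "qwen2_q4f16_0_ce81ef8767dfb3f843c79deb0b3f66fc"
    }
  ]
}
```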
## License
Apache-2.0 (same as the original Qwen2.5-Coder-0.5B-Instruct model).
## Model tree

- Base model: Qwen/Qwen2.5-0.5B