# Qwen2.5-Coder-3B-Instruct → q4f16_0 (MLC LLM)
Requantized version of Qwen2.5-Coder-3B-Instruct for on-device inference via MLC LLM.
## What is this?
This is a q4f16_0 quantized build of Qwen2.5-Coder-3B-Instruct, compiled for Android (arm64-v8a) with MLC LLM's `mlc_llm package` toolchain.
- Quantization: q4f16_0 (4-bit weights, fp16 activations, KN memory layout)
- Target: Android / Qualcomm Adreno GPUs (OpenCL)
- Size: ~1.6 GB
- Function calling: enabled (`use_function_calling: true`)
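Because function calling is enabled, the model can be exercised with an OpenAI-style `tools` array when served through MLC LLM's OpenAI-compatible chat endpoint (e.g. via `mlc_llm serve`). Below is a minimal request-body sketch; the model id and the `get_weather` tool are illustrative placeholders, not part of this card:

```json
{
  "model": "Qwen2.5-Coder-3B-Instruct-q4f16_0-MLC",
  "messages": [
    { "role": "user", "content": "What is the weather in Berlin right now?" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {
            "city": { "type": "string", "description": "City name" }
          },
          "required": ["city"]
        }
      }
    }
  ]
}
```

When the model decides to use the tool, the response should carry a `tool_calls` entry instead of plain text content.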
## Why q4f16_0 instead of q4f16_1?
The q4f16_0 layout stores weights in KN order (reduction dimension outermost), which the matmul kernels can consume directly. The q4f16_1 layout (NK) instead requires weight transposition operations that are very slow in Adreno's OpenCL implementation, making q4f16_1 roughly 6-10x slower at inference on Qualcomm Adreno GPUs.
## Source
- Original model: Qwen/Qwen2.5-Coder-3B-Instruct (Qwen Research License)
- Quantization source: mlc-ai/Qwen2.5-Coder-3B-Instruct-q4f16_1-MLC
- Requantized by: NavixMind using `mlc_llm package` with mobile overrides (`context_window_size=2048`, `prefill_chunk_size=512`, `max_batch_size=1`); see the packaging sketch below
Only the weight layout was changed (q4f16_1 → q4f16_0). The model architecture, tokenizer, and training are identical to the original.
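For reference, a `mlc-package-config.json` along the lines below would apply these mobile overrides when running `mlc_llm package`. Only the `device` and the three overrides come from this card; the local model path, `model_id`, `estimated_vram_bytes`, and `bundle_weight` are illustrative assumptions (the path presumes weights already converted to q4f16_0, e.g. with `mlc_llm convert_weight --quantization q4f16_0`):

```json
{
  "device": "android",
  "model_list": [
    {
      "model": "./Qwen2.5-Coder-3B-Instruct-q4f16_0-MLC",
      "model_id": "Qwen2.5-Coder-3B-Instruct-q4f16_0-MLC",
      "estimated_vram_bytes": 3000000000,
      "bundle_weight": true,
      "overrides": {
        "context_window_size": 2048,
        "prefill_chunk_size": 512,
        "max_batch_size": 1
      }
    }
  ]
}
```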
## Usage with MLC LLM
Download the model weights and place them in your MLC models directory. Use `model_lib: qwen2_q4f16_0_ecc0cde57625a5817018e8d547361bb3` when loading.
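In the MLC Android app, that translates into a model-list entry of roughly this shape in the app's `mlc-app-config.json`; the `model_lib` value is the one quoted above, while `model_path`, `model_id`, and `estimated_vram_bytes` are illustrative placeholders whose exact field names may differ across MLC LLM versions:

```json
{
  "model_list": [
    {
      "model_path": "Qwen2.5-Coder-3B-Instruct-q4f16_0-MLC",
      "model_id": "Qwen2.5-Coder-3B-Instruct-q4f16_0-MLC",
      "model_lib": "qwen2_q4f16_0_ecc0cde57625a5817018e8d547361bb3",
      "estimated_vram_bytes": 3000000000
    }
  ]
}
```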
## License
Qwen Research License: non-commercial research use only. Same license as the original Qwen2.5-Coder-3B-Instruct model. See the LICENSE for full terms.