Qwen2.5-Coder-3B-Instruct – q4f16_0 (MLC LLM)

Requantized version of Qwen2.5-Coder-3B-Instruct for on-device inference via MLC LLM.

What is this?

This is a q4f16_0 quantized build of Qwen2.5-Coder-3B-Instruct, compiled for Android (arm64-v8a) using MLC LLM's mlc_llm package toolchain.

  • Quantization: q4f16_0 (4-bit weights, fp16 activations, KN memory layout)
  • Target: Android / Qualcomm Adreno GPUs (OpenCL)
  • Size: ~1.6 GB
  • Function calling: Enabled (use_function_calling: true)

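The settings above would surface in the build's mlc-chat-config.json roughly as follows. This is a hedged sketch, not the shipped file: the quantization and use_function_calling values come from this card, while the remaining fields and their values are illustrative assumptions about MLC LLM's config schema.

```json
{
  "model_type": "qwen2",
  "quantization": "q4f16_0",
  "conv_template": "qwen2",
  "use_function_calling": true,
  "context_window_size": 32768
}
```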
Why q4f16_0 instead of q4f16_1?

The q4f16_0 layout (KN) is substantially faster on Qualcomm Adreno GPUs than q4f16_1 (NK): the NK layout requires weight-transposition operations that are extremely slow in Adreno's OpenCL implementation, resulting in roughly 6-10x slower inference.
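To make the layout difference concrete, here is a minimal NumPy sketch with hypothetical dimensions (real MLC kernels operate on packed 4-bit weights, not fp16 arrays). It shows that KN and NK store the same matrix, so only the memory-access pattern differs between the two quantization layouts, never the results:

```python
import numpy as np

K, N = 4, 8  # hypothetical reduction / output dimensions

# q4f16_0-style storage: weights laid out as [K, N]
w_kn = np.arange(K * N, dtype=np.float16).reshape(K, N)
# q4f16_1-style storage: the same weights laid out as [N, K]
w_nk = np.ascontiguousarray(w_kn.T)

x = np.ones(K, dtype=np.float16)
y_kn = x @ w_kn   # KN: the GEMV walks rows of w_kn contiguously
y_nk = w_nk @ x   # NK: the kernel must effectively transpose its reads

# Both layouts produce identical outputs; the speed gap on Adreno
# comes purely from the transposed access pattern, not the math.
assert np.allclose(y_kn, y_nk)
```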

Source

Requantized from Qwen2.5-Coder-3B-Instruct; only the weight layout was changed (q4f16_1 → q4f16_0). The model architecture, tokenizer, and training are identical to the original.

Usage with MLC LLM

Download the model weights and place them in your MLC models directory, then set model_lib to qwen2_q4f16_0_ecc0cde57625a5817018e8d547361bb3 when loading the model.
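For the MLC LLM Android app, the model entry would look roughly like the following in mlc-package-config.json. This is a sketch under assumptions: only the repository name and model_lib hash come from this card, while model_id and estimated_vram_bytes are illustrative placeholders you should adjust for your device.

```json
{
  "device": "android",
  "model_list": [
    {
      "model": "HF://alexandertaboriskiy/Qwen2.5-Coder-3B-Instruct-q4f16_0-MLC",
      "model_id": "Qwen2.5-Coder-3B-Instruct-q4f16_0",
      "model_lib": "qwen2_q4f16_0_ecc0cde57625a5817018e8d547361bb3",
      "estimated_vram_bytes": 3000000000
    }
  ]
}
```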

License

Qwen Research License – non-commercial research use only. Same license as the original Qwen2.5-Coder-3B-Instruct model. See the LICENSE file for full terms.

Model tree for alexandertaboriskiy/Qwen2.5-Coder-3B-Instruct-q4f16_0-MLC

Base model: Qwen/Qwen2.5-3B, finetuned as Qwen2.5-Coder-3B-Instruct and requantized into this model.