Qwen2.5-Coder-0.5B-Instruct – q4f16_0 (MLC LLM)

Requantized version of Qwen2.5-Coder-0.5B-Instruct for on-device inference via MLC LLM.

What is this?

This is a q4f16_0 quantized build of Qwen2.5-Coder-0.5B-Instruct, compiled for Android (arm64-v8a) with MLC LLM's mlc_llm toolchain.

  • Quantization: q4f16_0 (4-bit weights, fp16 activations, KN memory layout)
  • Target: Android / Qualcomm Adreno GPUs (OpenCL)
  • Size: ~276 MB
  • Function calling: Enabled (use_function_calling: true; see the config excerpt below)
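
The quantization scheme and the function-calling flag live in the mlc-chat-config.json shipped alongside the weights. Below is an illustrative excerpt only; field names and nesting vary across MLC LLM releases, so treat it as a sketch rather than the verbatim file:

    {
      "model_type": "qwen2",
      "quantization": "q4f16_0",
      "conv_template": {
        "name": "qwen2",
        "use_function_calling": true
      }
    }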

Why q4f16_0 instead of q4f16_1?

The q4f16_0 (KN) layout is dramatically faster on Qualcomm Adreno GPUs than q4f16_1 (NK). Measured on a OnePlus CPH2551 (Snapdragon 8 Gen 1):

Format     Generation time (10 tokens)   Throughput
q4f16_1    ~14.5 s                       ~0.7 tok/s
q4f16_0    ~2.2 s                        ~4.5 tok/s

The q4f16_1 format requires weight transposition operations that are catastrophically slow on Adreno's OpenCL implementation.
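
To see why the layout matters, consider the decode-time matrix-vector products. The NumPy sketch below is purely conceptual (it is not MLC's actual quantized kernels, and the sizes are illustrative), but it shows where the transpose enters:

    import numpy as np

    # During token-by-token decoding, each linear layer computes y = x @ W,
    # where x is a single-row activation of shape (1, K).
    K, N = 896, 4864  # illustrative sizes, roughly Qwen2.5-0.5B scale

    x = np.random.rand(1, K).astype(np.float16)
    W = np.random.rand(K, N).astype(np.float16)

    W_kn = W                          # q4f16_0: weights stored KN, read as-is
    W_nk = np.ascontiguousarray(W.T)  # q4f16_1: weights stored NK (pre-transposed)

    y_kn = x @ W_kn    # KN layout: the GEMV consumes the weights as stored
    y_nk = x @ W_nk.T  # NK layout: a transpose (or strided reads) comes first;
                       # this is the step that is slow on Adreno's OpenCL stack

    assert np.allclose(y_kn, y_nk, rtol=1e-2)  # same math, different layout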

Source

Requantized from Qwen/Qwen2.5-Coder-0.5B-Instruct. Only the weight layout was changed (q4f16_1 → q4f16_0); the model architecture, tokenizer, and training are identical to the original.
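
For reference, a sketch of the requantization flow with the mlc_llm CLI, assuming a local checkout of the upstream model; exact flags may differ slightly between MLC LLM releases:

    # Convert the fp weights to q4f16_0 (instead of the default q4f16_1)
    mlc_llm convert_weight ./Qwen2.5-Coder-0.5B-Instruct \
        --quantization q4f16_0 \
        -o ./dist/Qwen2.5-Coder-0.5B-Instruct-q4f16_0-MLC

    # Generate mlc-chat-config.json and process the tokenizer
    mlc_llm gen_config ./Qwen2.5-Coder-0.5B-Instruct \
        --quantization q4f16_0 \
        --conv-template qwen2 \
        -o ./dist/Qwen2.5-Coder-0.5B-Instruct-q4f16_0-MLC

    # Compile the model library for Android
    mlc_llm compile ./dist/Qwen2.5-Coder-0.5B-Instruct-q4f16_0-MLC/mlc-chat-config.json \
        --device android \
        -o ./dist/libs/Qwen2.5-Coder-0.5B-Instruct-q4f16_0-android.tar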

Usage with MLC LLM

Download the model weights and place them in your MLC models directory. Use model_lib: qwen2_q4f16_0_ce81ef8767dfb3f843c79deb0b3f66fc when loading.
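
For the Android chat app, that corresponds to an entry in the app's model-list config. A sketch in the older app-config.json style (field names have shifted across MLC LLM releases, and estimated_vram_bytes is a rough guess, not a measured value):

    {
      "model_list": [
        {
          "model_url": "https://huggingface.co/alexandertaboriskiy/Qwen2.5-Coder-0.5B-Instruct-q4f16_0-MLC",
          "model_id": "Qwen2.5-Coder-0.5B-Instruct-q4f16_0",
          "model_lib": "qwen2_q4f16_0_ce81ef8767dfb3f843c79deb0b3f66fc",
          "estimated_vram_bytes": 800000000
        }
      ]
    }

To sanity-check the weights off-device, MLC LLM's Python engine can pull them straight from this repo and JIT-compile a model library for your local GPU (the Android build above is not involved). A minimal sketch:

    from mlc_llm import MLCEngine

    # Downloads the weights from this Hugging Face repo and JIT-compiles a lib.
    engine = MLCEngine("HF://alexandertaboriskiy/Qwen2.5-Coder-0.5B-Instruct-q4f16_0-MLC")

    # OpenAI-style streaming chat completion
    for response in engine.chat.completions.create(
        messages=[{"role": "user", "content": "Write FizzBuzz in Python."}],
        stream=True,
    ):
        for choice in response.choices:
            print(choice.delta.content or "", end="", flush=True)

    engine.terminate()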

License

Apache-2.0 (same as the original Qwen2.5-Coder-0.5B-Instruct model).
