Qwen2.5-Coder-3B-Instruct – q4f16_0 (MLC LLM)

Requantized version of Qwen2.5-Coder-3B-Instruct for on-device inference via MLC LLM.

What is this?

This is a q4f16_0 quantized build of Qwen2.5-Coder-3B-Instruct, compiled for Android (arm64-v8a) using MLC LLM's mlc_llm package toolchain.

  • Quantization: q4f16_0 (4-bit weights, fp16 activations, KN memory layout)
  • Target: Android / Qualcomm Adreno GPUs (OpenCL)
  • Size: ~1.6 GB
  • Function calling: Enabled (use_function_calling: true)

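The settings above would surface in the build's mlc-chat-config.json roughly as follows. This is a hedged sketch, not the shipped file: the quantization and use_function_calling values come from this card, while the remaining fields and their values are illustrative assumptions about MLC LLM's config schema.

```json
{
  "model_type": "qwen2",
  "quantization": "q4f16_0",
  "conv_template": "qwen2",
  "use_function_calling": true,
  "context_window_size": 32768
}
```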
Why q4f16_0 instead of q4f16_1?

The q4f16_0 layout (KN) is substantially faster on Qualcomm Adreno GPUs than q4f16_1 (NK): the NK layout requires weight-transposition operations that are extremely slow in Adreno's OpenCL implementation, resulting in roughly 6-10x slower inference.
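To make the layout difference concrete, here is a minimal NumPy sketch with hypothetical dimensions (real MLC kernels operate on packed 4-bit weights, not fp16 arrays). It shows that KN and NK store the same matrix, so only the memory-access pattern differs between the two quantization layouts, never the results:

```python
import numpy as np

K, N = 4, 8  # hypothetical reduction / output dimensions

# q4f16_0-style storage: weights laid out as [K, N]
w_kn = np.arange(K * N, dtype=np.float16).reshape(K, N)
# q4f16_1-style storage: the same weights laid out as [N, K]
w_nk = np.ascontiguousarray(w_kn.T)

x = np.ones(K, dtype=np.float16)
y_kn = x @ w_kn   # KN: the GEMV walks rows of w_kn contiguously
y_nk = w_nk @ x   # NK: the kernel must effectively transpose its reads

# Both layouts produce identical outputs; the speed gap on Adreno
# comes purely from the transposed access pattern, not the math.
assert np.allclose(y_kn, y_nk)
```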

Source

Requantized from Qwen2.5-Coder-3B-Instruct; only the weight layout was changed (q4f16_1 → q4f16_0). The model architecture, tokenizer, and training are identical to the original.

Usage with MLC LLM

Download the model weights and place them in your MLC models directory, then set model_lib to qwen2_q4f16_0_ecc0cde57625a5817018e8d547361bb3 when loading the model.
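For the MLC LLM Android app, the model entry would look roughly like the following in mlc-package-config.json. This is a sketch under assumptions: only the repository name and model_lib hash come from this card, while model_id and estimated_vram_bytes are illustrative placeholders you should adjust for your device.

```json
{
  "device": "android",
  "model_list": [
    {
      "model": "HF://alexandertaboriskiy/Qwen2.5-Coder-3B-Instruct-q4f16_0-MLC",
      "model_id": "Qwen2.5-Coder-3B-Instruct-q4f16_0",
      "model_lib": "qwen2_q4f16_0_ecc0cde57625a5817018e8d547361bb3",
      "estimated_vram_bytes": 3000000000
    }
  ]
}
```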

License

Qwen Research License – non-commercial research use only. Same license as the original Qwen2.5-Coder-3B-Instruct model. See the LICENSE file for full terms.

Model tree for alexandertaboriskiy/Qwen2.5-Coder-3B-Instruct-q4f16_0-MLC

Base model: Qwen/Qwen2.5-3B, finetuned as Qwen2.5-Coder-3B-Instruct and requantized into this model.