# Qwen3-1.7B-Q4 - Optimized 4-bit ONNX for Mobile
This repository contains a highly optimized 4-bit quantized version of Qwen3-1.7B, specifically tailored for local inference on mid-range Android hardware.
## Mobile Optimization Highlights
Unlike standard exports, this version was built to maximize the throughput of the XNNPACK execution provider on ARM-based chipsets (like the Dimensity 6300).
- **Memory Mapping (mmap):** Configured for zero-copy loading via `MappedByteBuffer`, allowing the 1.31 GB model to run on devices with limited RAM without triggering Android's OOM (Out of Memory) killer.
- **Tensor Alignment:** A quantization block size of 32 ensures efficient cache-line usage on Cortex-A76/A55 architectures.
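The zero-copy idea behind `MappedByteBuffer` can be illustrated in plain Python with the standard `mmap` module: the OS faults pages in on demand, so the file is never bulk-copied into the process heap. This is a minimal desktop-side sketch of the same technique (the file path and dummy bytes are illustrative, not a real model; on Android the equivalent is `FileChannel.map()` returning a `MappedByteBuffer`):

```python
import mmap
import os
import tempfile

def map_model(path: str) -> mmap.mmap:
    """Memory-map a file read-only. Pages are loaded lazily by the OS,
    so even a 1.31 GB model does not need a matching heap allocation."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Length 0 maps the entire file.
        return mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
    finally:
        os.close(fd)  # the mapping keeps its own reference to the file

# Demo with a small stand-in file (dummy bytes, not a real ONNX model).
with tempfile.NamedTemporaryFile(delete=False, suffix=".onnx") as f:
    f.write(b"\x08\x01" + b"\x00" * 30)
    tmp_path = f.name

mm = map_model(tmp_path)
size = len(mm)  # mapped region spans the whole file
mm.close()
os.remove(tmp_path)
print(size)  # → 32
```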
## Performance Benchmark (MediaTek Dimensity 6300)
| Metric | Performance |
|---|---|
| Quantization | 4-bit Integer (Q4) |
| Inference Engine | ONNX Runtime (GenAI API) |
| Avg. Speed | ~10.89 tokens/sec, surpassing Google's Edge Gallery on the same hardware (see the Technical Report) |
| RAM Footprint | Stable ~1.31 GB |
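Throughput figures like the ~10.89 tokens/sec above are conventionally computed as newly generated tokens divided by wall-clock generation time. A minimal sketch of that measurement, with a stub standing in for the real ONNX Runtime GenAI generation loop (the stub and its token count are hypothetical):

```python
import time

def tokens_per_second(generate, prompt_tokens: int = 0):
    """Time a generation callable and return (new_token_count, tokens/sec).
    `generate` is any callable returning the full list of token ids."""
    start = time.perf_counter()
    tokens = generate()
    elapsed = time.perf_counter() - start
    new_tokens = len(tokens) - prompt_tokens
    return new_tokens, new_tokens / elapsed

# Stub generator; a real benchmark would call the model here instead.
def fake_generate():
    time.sleep(0.05)        # simulate inference latency
    return list(range(10))  # pretend 10 tokens were produced

count, tps = tokens_per_second(fake_generate)
print(count, round(tps, 1))
```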
## Usage Example
Run `scripts/inference_example.py`. See the Ghost Assistant Technical Report for additional details and benchmarks.
## Creating this ONNX File
Reproducing this export is straightforward with the provided customized JSON config: run `olive run --config qwen_config.json` against the base model described in the Technical Report.
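For orientation, an Olive config for this kind of export generally declares an input model, a quantization/export pass, and an output directory. The fragment below is only an illustrative sketch of that shape; the exact field names and pass types vary by Olive version, and the values shown (model id, pass name, output path) are assumptions, not the contents of the repo's `qwen_config.json`:

```json
{
  "input_model": {
    "type": "HfModel",
    "model_path": "Qwen/Qwen3-1.7B"
  },
  "passes": {
    "builder": {
      "type": "ModelBuilder",
      "precision": "int4"
    }
  },
  "output_dir": "models/qwen3-1.7b-q4"
}
```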