Qwen3-1.7B-Q4 - Optimized 4-bit ONNX for Mobile

This repository contains a highly optimized 4-bit quantized version of Qwen3-1.7B, specifically tailored for local inference on mid-range Android hardware.

🚀 Mobile Optimization Highlights

Unlike standard exports, this version was built to maximize the throughput of the XNNPACK execution provider on ARM-based chipsets (like the Dimensity 6300).

  • Memory Mapping (mmap): Configured for zero-copy loading via MappedByteBuffer, so the 1.31 GB model can run on devices with limited RAM without triggering Android's out-of-memory (OOM) killer.
  • Tensor Alignment: Optimized for block size 32 to ensure efficient cache line usage on Cortex-A76/A55 architectures.
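The zero-copy idea behind the first bullet can be sketched with Python's stdlib `mmap` (the Android implementation uses `MappedByteBuffer`; the file below is a throwaway placeholder standing in for the real weights):

```python
import mmap
import os
import tempfile

# Stand-in for the 1.31 GB weights file (illustration only; on Android
# the app memory-maps the real file through a MappedByteBuffer).
path = os.path.join(tempfile.gettempdir(), "dummy_weights.bin")
with open(path, "wb") as f:
    f.write(bytes(range(256)) * 16)  # 4 KiB placeholder

# Zero-copy access: the OS pages data in on demand instead of copying
# the whole file into the process heap, which sidesteps the OOM killer.
with open(path, "rb") as f:
    weights = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

first_block = weights[:32]  # only the touched page becomes resident
print(len(first_block))     # 32

weights.close()
os.remove(path)
```

Reads through the mapping touch only the pages they need, so peak resident memory stays far below the file size.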

📊 Performance Benchmark (MediaTek Dimensity 6300)

| Metric | Value |
| --- | --- |
| Quantization | 4-bit integer (Q4) |
| Inference engine | ONNX Runtime (GenAI API) |
| Avg. speed | ~10.89 tokens/sec |
| RAM footprint | Stable ~1.31 GB |

According to the Technical Report, this throughput surpasses Google's Edge Gallery on the same hardware.
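A quick sanity check on what ~10.89 tokens/sec means in practice (numbers taken from the benchmark above):

```python
tokens_per_sec = 10.89

# Per-token latency in milliseconds.
latency_ms = 1000.0 / tokens_per_sec
print(f"{latency_ms:.1f} ms/token")            # ~91.8 ms/token

# Wall-clock time to stream a 100-token reply.
reply_seconds = 100 / tokens_per_sec
print(f"{reply_seconds:.1f} s for 100 tokens")  # ~9.2 s
```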

🛠 Usage Example

Run `scripts/inference_example.py`. See the Ghost Assistant Technical Report for more details and full benchmarks.
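For orientation, here is a minimal sketch of what such a script can look like with the onnxruntime-genai Python API (the exact generator calls vary across onnxruntime-genai versions, and the prompt helper assumes Qwen's ChatML-style chat template):

```python
import sys


def build_prompt(user_msg: str) -> str:
    # Qwen models use a ChatML-style template (assumption: this matches
    # the tokenizer configuration shipped with the export).
    return (
        "<|im_start|>user\n" + user_msg + "<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


def main(model_dir: str) -> None:
    import onnxruntime_genai as og  # pip install onnxruntime-genai

    model = og.Model(model_dir)       # reads genai_config.json + ONNX weights
    tokenizer = og.Tokenizer(model)
    params = og.GeneratorParams(model)
    params.set_search_options(max_length=256)

    generator = og.Generator(model, params)
    generator.append_tokens(tokenizer.encode(build_prompt("Hello!")))
    while not generator.is_done():
        generator.generate_next_token()
    print(tokenizer.decode(generator.get_sequence(0)))


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])  # pass the model directory as the first argument
```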

🛠 Creating This ONNX File

To make reproduction easier, a customized Olive JSON config is provided. With the base model prepared as described in the Technical Report, run: `olive run --config qwen_config.json`.
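For readers unfamiliar with Olive, a hedged sketch of what such a config can look like (illustrative only: the repository ships the actual `qwen_config.json`, and pass names and options vary across Olive versions):

```json
{
  "input_model": { "type": "HfModel", "model_path": "Qwen/Qwen3-1.7B" },
  "passes": {
    "builder": { "type": "ModelBuilder", "precision": "int4" }
  },
  "output_dir": "qwen3-1.7b-q4-onnx"
}
```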


Model tree

Base model: Qwen/Qwen3-1.7B (this repository is a 4-bit quantized export of it).