# Qwen3-1.7B-Q4 - Optimized 4-bit ONNX for Mobile
This repository contains a highly optimized 4-bit quantized version of Qwen3-1.7B, specifically tailored for local inference on mid-range Android hardware.
## Mobile Optimization Highlights
Unlike standard exports, this version was built to maximize the throughput of the XNNPACK execution provider on ARM-based chipsets (like the Dimensity 6300).
- **Memory Mapping (mmap):** Configured for zero-copy loading via `MappedByteBuffer`, allowing the 1.31 GB model to run on devices with limited RAM without triggering Android's OOM (Out of Memory) killer.
- **Tensor Alignment:** A quantization block size of 32 ensures efficient cache-line usage on Cortex-A76/A55 architectures.
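The zero-copy idea behind `MappedByteBuffer` can be illustrated in plain Python with the standard `mmap` module: the OS faults pages in on demand, so the file is never bulk-copied into the process heap. This is a minimal desktop-side sketch of the same technique (the file path and dummy bytes are illustrative, not a real model; on Android the equivalent is `FileChannel.map()` returning a `MappedByteBuffer`):

```python
import mmap
import os
import tempfile

def map_model(path: str) -> mmap.mmap:
    """Memory-map a file read-only. Pages are loaded lazily by the OS,
    so even a 1.31 GB model does not need a matching heap allocation."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Length 0 maps the entire file.
        return mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
    finally:
        os.close(fd)  # the mapping keeps its own reference to the file

# Demo with a small stand-in file (dummy bytes, not a real ONNX model).
with tempfile.NamedTemporaryFile(delete=False, suffix=".onnx") as f:
    f.write(b"\x08\x01" + b"\x00" * 30)
    tmp_path = f.name

mm = map_model(tmp_path)
size = len(mm)  # mapped region spans the whole file
mm.close()
os.remove(tmp_path)
print(size)  # → 32
```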
## Performance Benchmark (MediaTek Dimensity 6300)
| Metric | Performance |
|---|---|
| Quantization | 4-bit Integer (Q4) |
| Inference Engine | ONNX Runtime (GenAI API) |
| Avg. Speed | ~10.89 tokens/sec, surpassing Google's Edge Gallery on the same hardware (see the Technical Report) |
| RAM Footprint | Stable ~1.31 GB |
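Throughput figures like the ~10.89 tokens/sec above are conventionally computed as newly generated tokens divided by wall-clock generation time. A minimal sketch of that measurement, with a stub standing in for the real ONNX Runtime GenAI generation loop (the stub and its token count are hypothetical):

```python
import time

def tokens_per_second(generate, prompt_tokens: int = 0):
    """Time a generation callable and return (new_token_count, tokens/sec).
    `generate` is any callable returning the full list of token ids."""
    start = time.perf_counter()
    tokens = generate()
    elapsed = time.perf_counter() - start
    new_tokens = len(tokens) - prompt_tokens
    return new_tokens, new_tokens / elapsed

# Stub generator; a real benchmark would call the model here instead.
def fake_generate():
    time.sleep(0.05)        # simulate inference latency
    return list(range(10))  # pretend 10 tokens were produced

count, tps = tokens_per_second(fake_generate)
print(count, round(tps, 1))
```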
## Usage Example
Run `scripts/inference_example.py`. See the Ghost Assistant Technical Report for additional details and benchmarks.
## Creating this ONNX File
Reproducing this export is straightforward with the provided customized JSON config: run `olive run --config qwen_config.json` against the base model described in the Technical Report.
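For orientation, an Olive config for this kind of export generally declares an input model, a quantization/export pass, and an output directory. The fragment below is only an illustrative sketch of that shape; the exact field names and pass types vary by Olive version, and the values shown (model id, pass name, output path) are assumptions, not the contents of the repo's `qwen_config.json`:

```json
{
  "input_model": {
    "type": "HfModel",
    "model_path": "Qwen/Qwen3-1.7B"
  },
  "passes": {
    "builder": {
      "type": "ModelBuilder",
      "precision": "int4"
    }
  },
  "output_dir": "models/qwen3-1.7b-q4"
}
```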