---
license: apache-2.0
library_name: onnxruntime-genai
pipeline_tag: text-generation
base_model: Qwen/Qwen3-32B
created_at: '2026-01-17T00:00:00.000Z'
tags:
- onnx
- directml
- int4
- quantized
- qwen
- qwen3
- instruct
- text-generation
- windows
- csharp
- dotnet
- gpu
inference: false
language:
- en
- zh
---

# Qwen3-32B-Instruct – DirectML INT4 (ONNX Runtime)

This repository provides **Qwen3-32B-Instruct** converted to **INT4 ONNX** and optimized for **DirectML** using **Microsoft Olive** and **ONNX Runtime GenAI**.

It enables **native Windows GPU inference** (Intel Arc, AMD RDNA, NVIDIA RTX) without CUDA and without running a Python server, and is intended for use in **C# / .NET applications** via ONNX Runtime + DirectML.

---
## Model Details

- Base model: `Qwen/Qwen3-32B`
- Variant: Instruct
- Quantization: INT4 (MatMul NBits, per-channel)
- Format: ONNX
- Runtime: ONNX Runtime with `DmlExecutionProvider`
- Conversion toolchain: Microsoft Olive + onnxruntime-genai
- Target hardware:
  - Intel Arc (A770, 130V with large system RAM)
  - AMD RDNA2 / RDNA3
  - NVIDIA RTX (24 GB recommended, 16 GB possible with paging)

---
## Files

Core inference artifacts:

- `model.onnx`
- `model.onnx.data` – INT4 weights (≈ 18.6 GB)
- `genai_config.json`
- `tokenizer.json`, `vocab.json`, `merges.txt`
- `chat_template.jinja`
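With onnxruntime-genai, the execution provider is selected in `genai_config.json` rather than in application code. A sketch of what the DirectML provider entry typically looks like in that file is shown below (field layout follows the general onnxruntime-genai configuration format; the exact contents of this repository's file may differ):

```json
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          { "dml": {} }
        ]
      }
    }
  }
}
```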

---
## Hardware & Memory Notes

Although INT4 quantization greatly reduces VRAM usage, the 32B model still requires:

- ≥ 16 GB VRAM (with host memory fallback via DirectML)
- ≥ 64 GB system RAM strongly recommended
- Fast NVMe storage for paging

This model is intended for:

- Advanced reasoning
- Tool orchestration
- Structured document analysis
- Multi-step planning in local Windows applications

---
|
## Usage in C# (DirectML)

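The example below assumes the DirectML-enabled GenAI package has been added to the project (package ID as published on NuGet):

```shell
dotnet add package Microsoft.ML.OnnxRuntimeGenAI.DirectML
```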
```csharp
using Microsoft.ML.OnnxRuntimeGenAI;

var modelPath = @"Qwen3-32B-Instruct-DirectML-INT4";

// The DirectML execution provider is configured in genai_config.json,
// so only the model folder needs to be passed here.
using var model = new Model(modelPath);
using var tokenizer = new Tokenizer(model);

var sequences = tokenizer.Encode(
    "Determine which legal document templates are required for a Dutch mortgage transaction.");

using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 2048);
generatorParams.SetSearchOption("do_sample", true);
generatorParams.SetSearchOption("temperature", 0.6);

using var generator = new Generator(model, generatorParams);
generator.AppendTokenSequences(sequences);

// Generate token by token until EOS or max_length is reached.
while (!generator.IsDone())
{
    generator.GenerateNextToken();
}

string output = tokenizer.Decode(generator.GetSequence(0));
Console.WriteLine(output);
```
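For interactive applications, output can be printed token by token instead of waiting for the full sequence. A sketch of the streaming pattern using `TokenizerStream` from the same onnxruntime-genai C# API (paths and prompt are illustrative):

```csharp
using Microsoft.ML.OnnxRuntimeGenAI;

using var model = new Model(@"Qwen3-32B-Instruct-DirectML-INT4");
using var tokenizer = new Tokenizer(model);
using var tokenizerStream = tokenizer.CreateStream();

var sequences = tokenizer.Encode("Summarize the key steps in a Dutch mortgage transaction.");

using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 1024);

using var generator = new Generator(model, generatorParams);
generator.AppendTokenSequences(sequences);

while (!generator.IsDone())
{
    generator.GenerateNextToken();
    // Decode and print each new token as soon as it is produced.
    var newToken = generator.GetSequence(0)[^1];
    Console.Write(tokenizerStream.Decode(newToken));
}
```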
## Prompt Format

The model supports chat-style prompts and function-calling / tool-routing patterns when used with structured system prompts (e.g. Hermes-style schemas).

The provided `chat_template.jinja` can be used for consistent role formatting.
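When not rendering `chat_template.jinja` through a template engine, a chat prompt can also be assembled by hand. A minimal sketch, assuming the ChatML-style role markers used by the Qwen family (verify against the template file in this repository):

```csharp
using System.Text;

// Build a single-turn ChatML prompt: <|im_start|>role\n...<|im_end|>
static string BuildChatPrompt(string systemPrompt, string userMessage)
{
    var sb = new StringBuilder();
    sb.Append("<|im_start|>system\n").Append(systemPrompt).Append("<|im_end|>\n");
    sb.Append("<|im_start|>user\n").Append(userMessage).Append("<|im_end|>\n");
    // Leave the assistant turn open so the model completes it.
    sb.Append("<|im_start|>assistant\n");
    return sb.ToString();
}

var prompt = BuildChatPrompt(
    "You are a helpful assistant for legal document workflows.",
    "Which templates are required for a Dutch mortgage transaction?");
Console.WriteLine(prompt);
```

The resulting string is what gets passed to `tokenizer.Encode(...)` in the usage example above.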
## Performance Characteristics

- Much stronger reasoning and instruction following than the 14B variant
- Higher latency, but better long-context coherence
- Ideal when the model must:
  - Infer document structures
  - Select templates
  - Extract structured fields from natural language
## License & Attribution

**Base model:** Qwen3-32B by Alibaba (see the original model card for license terms).

**Conversion:** ONNX + INT4 DirectML optimization performed by Wekkel using Microsoft Olive. This is an independent community conversion, with no affiliation with Alibaba or the Qwen team.
## Related Models

Smaller & faster: [wekkel/Qwen3-14B-Instruct-DirectML-INT4](https://huggingface.co/wekkel/Qwen3-14B-Instruct-DirectML-INT4)