---
license: apache-2.0
library_name: onnxruntime-genai
pipeline_tag: text-generation
base_model: Qwen/Qwen3-32B
created_at: '2026-01-17T00:00:00.000Z'
tags:
- onnx
- directml
- int4
- quantized
- qwen
- qwen3
- instruct
- text-generation
- windows
- csharp
- dotnet
- gpu
inference: false
language:
- en
- zh
---

# Qwen3-32B-Instruct – DirectML INT4 (ONNX Runtime)

This repository provides **Qwen3-32B-Instruct** converted to **INT4 ONNX** and optimized for **DirectML** using **Microsoft Olive** and **ONNX Runtime GenAI**.

It enables **native Windows GPU inference** (Intel Arc, AMD RDNA, NVIDIA RTX) without CUDA and without running a Python server, and is intended for use in **C# / .NET applications** via ONNX Runtime + DirectML.

---

## Model Details

- Base model: `Qwen/Qwen3-32B`
- Variant: Instruct
- Quantization: INT4 (MatMul NBits, per-channel)
- Format: ONNX
- Runtime: ONNX Runtime with `DmlExecutionProvider`
- Conversion toolchain: Microsoft Olive + onnxruntime-genai
- Target hardware:
  - Intel Arc (A770, 130V with large system RAM)
  - AMD RDNA2 / RDNA3
  - NVIDIA RTX (24 GB recommended, 16 GB possible with paging)

---

## Files

Core inference artifacts:

- `model.onnx`
- `model.onnx.data` ← INT4 weights (≈ 18.6 GB)
- `genai_config.json`
- `tokenizer.json`, `vocab.json`, `merges.txt`
- `chat_template.jinja`

---

## Hardware & Memory Notes

Although INT4 quantization greatly reduces VRAM usage, the 32B model still requires:

- ≥ 16 GB VRAM (with host-memory fallback via DirectML)
- ≥ 64 GB system RAM strongly recommended
- Fast NVMe storage for paging

This model is intended for:

- Advanced reasoning
- Tool orchestration
- Structured document analysis
- Multi-step planning in local Windows applications

---

## Usage in C# (DirectML)

```csharp
using Microsoft.ML.OnnxRuntimeGenAI;

var modelPath = @"Qwen3-32B-Instruct-DirectML-INT4";

// The DirectML execution provider is selected via genai_config.json,
// not through a constructor option.
using var model = new Model(modelPath);
using var tokenizer = new Tokenizer(model);

var tokens = tokenizer.Encode(
    "Determine which legal document templates are required for a Dutch mortgage transaction.");

using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 2048);
generatorParams.SetSearchOption("temperature", 0.6);

using var generator = new Generator(model, generatorParams);
generator.AppendTokenSequences(tokens);

// Generate token by token until the model finishes.
while (!generator.IsDone())
{
    generator.GenerateNextToken();
}

string output = tokenizer.Decode(generator.GetSequence(0));
Console.WriteLine(output);
```

---

## Prompt Format

The model supports chat-style prompts and function-calling / tool-routing patterns when used with structured system prompts (e.g. Hermes-style schemas). The provided `chat_template.jinja` can be used for consistent role formatting.

---

## Performance Characteristics

- Much stronger reasoning and instruction following than the 14B variant
- Higher latency, but better long-context coherence
- Ideal when the model must:
  - Infer document structures
  - Select templates
  - Extract structured fields from natural language

---

## License & Attribution

- Base model: Qwen3-32B by Alibaba (see the original model card for license terms)
- Conversion: ONNX + INT4 DirectML optimization performed by Wekkel using Microsoft Olive
- Independent community conversion; no affiliation with Alibaba or the Qwen team

---

## Related Models

- Smaller & faster: https://huggingface.co/wekkel/Qwen3-14B-Instruct-DirectML-INT4
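
As a sketch of the role formatting mentioned under Prompt Format: Qwen-family instruct models typically use ChatML-style role markers, so a rendered prompt would look roughly like the fragment below. Treat this as illustrative only and prefer rendering prompts through the bundled `chat_template.jinja`, which is authoritative for this model.

```text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Determine which legal document templates are required for a Dutch mortgage transaction.<|im_end|>
<|im_start|>assistant
```

The generation request ends with an open `<|im_start|>assistant` turn, which the model then completes.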
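
For completeness: with ONNX Runtime GenAI, DirectML is selected inside the bundled `genai_config.json` rather than in application code. A minimal illustrative fragment is shown below; the exact field layout is an assumption based on common onnxruntime-genai configurations and may differ between versions, so check the `genai_config.json` shipped in this repository.

```json
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          { "dml": {} }
        ]
      }
    }
  }
}
```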