---
license: apache-2.0
library_name: onnxruntime-genai
pipeline_tag: text-generation
base_model: Qwen/Qwen3-32B
created_at: '2026-01-17T00:00:00.000Z'
tags:
- onnx
- directml
- int4
- quantized
- qwen
- qwen3
- instruct
- text-generation
- windows
- csharp
- dotnet
- gpu
inference: false
language:
- en
- zh
---

# Qwen3-32B-Instruct – DirectML INT4 (ONNX Runtime)

This repository provides **Qwen3-32B-Instruct** converted to **INT4 ONNX** and optimized for **DirectML** using **Microsoft Olive** and **ONNX Runtime GenAI**.

It enables **native Windows GPU inference** (Intel Arc, AMD RDNA, NVIDIA RTX) without CUDA and without running a Python server, and is intended for use in **C# / .NET applications** via ONNX Runtime + DirectML.

---

## Model Details

- Base model: `Qwen/Qwen3-32B`
- Variant: Instruct
- Quantization: INT4 (MatMul NBits, per-channel)
- Format: ONNX
- Runtime: ONNX Runtime with `DmlExecutionProvider`
- Conversion toolchain: Microsoft Olive + onnxruntime-genai
- Target hardware:
  - Intel Arc (A770, 130V with large system RAM)
  - AMD RDNA2 / RDNA3
  - NVIDIA RTX (24 GB recommended, 16 GB possible with paging)

---

## Files

Core inference artifacts:

- `model.onnx`
- `model.onnx.data` ← INT4 weights (≈ 18.6 GB)
- `genai_config.json`
- `tokenizer.json`, `vocab.json`, `merges.txt`
- `chat_template.jinja`

---

## Hardware & Memory Notes

Although INT4 quantization greatly reduces VRAM usage, the 32B model still requires:

- ≥ 16 GB VRAM (with host-memory fallback via DirectML)
- ≥ 64 GB system RAM strongly recommended
- Fast NVMe storage for paging

This model is intended for:

- Advanced reasoning
- Tool orchestration
- Structured document analysis
- Multi-step planning in local Windows applications

---

## Usage in C# (DirectML)

```csharp
using Microsoft.ML.OnnxRuntimeGenAI;

var modelPath = @"Qwen3-32B-Instruct-DirectML-INT4";

// The DirectML execution provider is selected via genai_config.json,
// not through a constructor option.
using var model = new Model(modelPath);
using var tokenizer = new Tokenizer(model);

var tokens = tokenizer.Encode(
    "Determine which legal document templates are required for a Dutch mortgage transaction.");

using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 2048);
generatorParams.SetSearchOption("temperature", 0.6);

using var generator = new Generator(model, generatorParams);
generator.AppendTokenSequences(tokens);

// Generate token by token until the model finishes.
while (!generator.IsDone())
{
    generator.GenerateNextToken();
}

string output = tokenizer.Decode(generator.GetSequence(0));
Console.WriteLine(output);
```

---

## Prompt Format

The model supports chat-style prompts and function-calling / tool-routing patterns when used with structured system prompts (e.g. Hermes-style schemas). The provided `chat_template.jinja` can be used for consistent role formatting.

---

## Performance Characteristics

- Much stronger reasoning and instruction following than the 14B variant
- Higher latency, but better long-context coherence
- Ideal when the model must:
  - Infer document structures
  - Select templates
  - Extract structured fields from natural language

---

## License & Attribution

- Base model: Qwen3-32B by Alibaba (see the original model card for license terms)
- Conversion: ONNX + INT4 DirectML optimization performed by Wekkel using Microsoft Olive
- Independent community conversion; no affiliation with Alibaba or the Qwen team

---

## Related Models

- Smaller & faster: https://huggingface.co/wekkel/Qwen3-14B-Instruct-DirectML-INT4
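
As a sketch of the role formatting mentioned under Prompt Format: Qwen-family instruct models typically use ChatML-style role markers, so a rendered prompt would look roughly like the fragment below. Treat this as illustrative only and prefer rendering prompts through the bundled `chat_template.jinja`, which is authoritative for this model.

```text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Determine which legal document templates are required for a Dutch mortgage transaction.<|im_end|>
<|im_start|>assistant
```

The generation request ends with an open `<|im_start|>assistant` turn, which the model then completes.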
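
For completeness: with ONNX Runtime GenAI, DirectML is selected inside the bundled `genai_config.json` rather than in application code. A minimal illustrative fragment is shown below; the exact field layout is an assumption based on common onnxruntime-genai configurations and may differ between versions, so check the `genai_config.json` shipped in this repository.

```json
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          { "dml": {} }
        ]
      }
    }
  }
}
```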