---
license: apache-2.0
library_name: onnxruntime-genai
pipeline_tag: text-generation
base_model: Qwen/Qwen3-32B
created_at: '2026-01-17T00:00:00.000Z'
tags:
- onnx
- directml
- int4
- quantized
- qwen
- qwen3
- instruct
- text-generation
- windows
- csharp
- dotnet
- gpu
inference: false
language:
- en
- zh
---
# Qwen3-32B-Instruct – DirectML INT4 (ONNX Runtime)
This repository provides **Qwen3-32B-Instruct** converted to **INT4 ONNX** and optimized for **DirectML** using **Microsoft Olive** and **ONNX Runtime GenAI**.
It enables **native Windows GPU inference** (Intel Arc, AMD RDNA, NVIDIA RTX) without CUDA and without running a Python server, and is intended for use in **C# / .NET applications** via ONNX Runtime + DirectML.
---
## Model Details
- Base model: `Qwen/Qwen3-32B`
- Variant: Instruct
- Quantization: INT4 (MatMulNBits, per-channel)
- Format: ONNX
- Runtime: ONNX Runtime with `DmlExecutionProvider`
- Conversion toolchain: Microsoft Olive + onnxruntime-genai
- Target hardware:
- Intel Arc (A770, 130V with large system RAM)
- AMD RDNA2 / RDNA3
- NVIDIA RTX (24 GB recommended, 16 GB possible with paging)
---
## Files
Core inference artifacts:
- `model.onnx`
- `model.onnx.data` ← INT4 weights (≈ 18.6 GB)
- `genai_config.json` (runtime configuration, including execution-provider selection; see the excerpt below)
- `tokenizer.json`, `vocab.json`, `merges.txt`
- `chat_template.jinja`
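
With ONNX Runtime GenAI, the execution provider is chosen by `genai_config.json`, not by application code. Below is an abridged sketch of the relevant portion; field names follow the onnxruntime-genai config schema, and the exact contents of this repository's file may differ:
```json
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          { "dml": {} }
        ]
      }
    }
  }
}
```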
---
## Hardware & Memory Notes
Although INT4 quantization greatly reduces VRAM usage, the 32B model still requires:
- ≥ 16 GB VRAM (with host memory fallback via DirectML)
- ≥ 64 GB system RAM strongly recommended
- Fast NVMe storage for paging
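
For a rough sense of runtime memory on top of the ≈ 18.6 GB of weights: assuming Qwen3-32B's published configuration (64 layers, 8 grouped-query KV heads, head dimension 128) and an fp16 KV cache, each token of context costs about 2 × 64 × 8 × 128 × 2 bytes ≈ 256 KiB, so a 2,048-token context adds roughly 0.5 GB.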
This model is intended for:
- Advanced reasoning
- Tool orchestration
- Structured document analysis
- Multi-step planning in local Windows applications
---
## Usage in C# (DirectML)
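The example below assumes the DirectML-enabled GenAI package from NuGet (pin a version as appropriate for your project):
```
dotnet add package Microsoft.ML.OnnxRuntimeGenAI.DirectML
```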
```csharp
using System;
using Microsoft.ML.OnnxRuntimeGenAI;

// Folder containing model.onnx, model.onnx.data, genai_config.json
// and the tokenizer files. DirectML is selected via genai_config.json.
var modelPath = @"Qwen3-32B-Instruct-DirectML-INT4";

using var model = new Model(modelPath);
using var tokenizer = new Tokenizer(model);

var sequences = tokenizer.Encode(
    "Determine which legal document templates are required for a Dutch mortgage transaction.");

using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 2048);
generatorParams.SetSearchOption("do_sample", true); // required for temperature to take effect
generatorParams.SetSearchOption("temperature", 0.6);

using var generator = new Generator(model, generatorParams);
generator.AppendTokenSequences(sequences);

// Generate until an end-of-sequence token or max_length is reached.
while (!generator.IsDone())
{
    generator.GenerateNextToken();
}

string output = tokenizer.Decode(generator.GetSequence(0));
Console.WriteLine(output);
```
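For interactive applications, tokens can also be decoded as they arrive. A sketch of the streaming pattern with `TokenizerStream` (replace the generate loop above with this):
```csharp
using var stream = tokenizer.CreateStream();
while (!generator.IsDone())
{
    generator.GenerateNextToken();
    // Decode and print only the newest token.
    Console.Write(stream.Decode(generator.GetSequence(0)[^1]));
}
```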
---
## Prompt Format
The model supports chat-style prompts and function-calling / tool-routing patterns when used with structured system prompts (e.g. Hermes-style schemas).
The provided `chat_template.jinja` can be used for consistent role formatting.
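Qwen3 follows the ChatML convention (`<|im_start|>` / `<|im_end|>` role markers). A minimal manual rendering for illustration; `chat_template.jinja` remains the authoritative template, and the helper name here is hypothetical:
```csharp
// Hypothetical helper: builds a single-turn ChatML prompt by hand.
static string BuildPrompt(string system, string user) =>
    $"<|im_start|>system\n{system}<|im_end|>\n" +
    $"<|im_start|>user\n{user}<|im_end|>\n" +
    "<|im_start|>assistant\n";
```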
---
## Performance Characteristics
- Much stronger reasoning and instruction following than the 14B variant
- Higher latency, but better long-context coherence
- Ideal when the model must:
  - Infer document structures
  - Select templates
  - Extract structured fields from natural language
---
## License & Attribution
Base model:
- Qwen3-32B by Alibaba (see the original model card for license terms)

Conversion:
- ONNX + INT4 DirectML optimization performed by Wekkel using Microsoft Olive
- Independent community conversion
- No affiliation with Alibaba or the Qwen team
---
## Related Models
Smaller & faster:
- https://huggingface.co/wekkel/Qwen3-14B-Instruct-DirectML-INT4