---
license: apache-2.0
library_name: onnxruntime-genai
pipeline_tag: text-generation
base_model: Qwen/Qwen3-32B
created_at: '2026-01-17T00:00:00.000Z'
tags:
- onnx
- directml
- int4
- quantized
- qwen
- qwen3
- instruct
- text-generation
- windows
- csharp
- dotnet
- gpu
inference: false
language:
- en
- zh
---
# Qwen3-32B-Instruct – DirectML INT4 (ONNX Runtime)
This repository provides **Qwen3-32B-Instruct** converted to **INT4 ONNX** and optimized for **DirectML** using **Microsoft Olive** and **ONNX Runtime GenAI**.
It enables **native Windows GPU inference** (Intel Arc, AMD RDNA, NVIDIA RTX) without CUDA and without running a Python server, and is intended for use in **C# / .NET applications** via ONNX Runtime + DirectML.
---
## Model Details
- Base model: `Qwen/Qwen3-32B`
- Variant: Instruct
- Quantization: INT4 (MatMulNBits, per-channel)
- Format: ONNX
- Runtime: ONNX Runtime with `DmlExecutionProvider`
- Conversion toolchain: Microsoft Olive + onnxruntime-genai
- Target hardware:
- Intel Arc (A770, 130V with large system RAM)
- AMD RDNA2 / RDNA3
- NVIDIA RTX (24 GB recommended, 16 GB possible with paging)
---
## Files
Core inference artifacts:
- `model.onnx`
- `model.onnx.data` ← INT4 weights (≈ 18.6 GB)
- `genai_config.json` (runtime configuration, including execution-provider selection; see the excerpt below)
- `tokenizer.json`, `vocab.json`, `merges.txt`
- `chat_template.jinja`
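
With ONNX Runtime GenAI, the execution provider is chosen by `genai_config.json`, not by application code. Below is an abridged sketch of the relevant portion; field names follow the onnxruntime-genai config schema, and the exact contents of this repository's file may differ:
```json
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          { "dml": {} }
        ]
      }
    }
  }
}
```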
---
## Hardware & Memory Notes
Although INT4 quantization greatly reduces VRAM usage, the 32B model still requires:
- ≥ 16 GB VRAM (with host memory fallback via DirectML)
- ≥ 64 GB system RAM strongly recommended
- Fast NVMe storage for paging
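
For a rough sense of runtime memory on top of the ≈ 18.6 GB of weights: assuming Qwen3-32B's published configuration (64 layers, 8 grouped-query KV heads, head dimension 128) and an fp16 KV cache, each token of context costs about 2 × 64 × 8 × 128 × 2 bytes ≈ 256 KiB, so a 2,048-token context adds roughly 0.5 GB.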
This model is intended for:
- Advanced reasoning
- Tool orchestration
- Structured document analysis
- Multi-step planning in local Windows applications
---
## Usage in C# (DirectML)
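The example below assumes the DirectML-enabled GenAI package from NuGet (pin a version as appropriate for your project):
```
dotnet add package Microsoft.ML.OnnxRuntimeGenAI.DirectML
```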
```csharp
using System;
using Microsoft.ML.OnnxRuntimeGenAI;

// Folder containing model.onnx, model.onnx.data, genai_config.json
// and the tokenizer files. DirectML is selected via genai_config.json.
var modelPath = @"Qwen3-32B-Instruct-DirectML-INT4";

using var model = new Model(modelPath);
using var tokenizer = new Tokenizer(model);

var sequences = tokenizer.Encode(
    "Determine which legal document templates are required for a Dutch mortgage transaction.");

using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 2048);
generatorParams.SetSearchOption("do_sample", true); // required for temperature to take effect
generatorParams.SetSearchOption("temperature", 0.6);

using var generator = new Generator(model, generatorParams);
generator.AppendTokenSequences(sequences);

// Generate until an end-of-sequence token or max_length is reached.
while (!generator.IsDone())
{
    generator.GenerateNextToken();
}

string output = tokenizer.Decode(generator.GetSequence(0));
Console.WriteLine(output);
```
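For interactive applications, tokens can also be decoded as they arrive. A sketch of the streaming pattern with `TokenizerStream` (replace the generate loop above with this):
```csharp
using var stream = tokenizer.CreateStream();
while (!generator.IsDone())
{
    generator.GenerateNextToken();
    // Decode and print only the newest token.
    Console.Write(stream.Decode(generator.GetSequence(0)[^1]));
}
```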
---
## Prompt Format
The model supports chat-style prompts and function-calling / tool-routing patterns when used with structured system prompts (e.g. Hermes-style schemas).
The provided `chat_template.jinja` can be used for consistent role formatting.
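Qwen3 follows the ChatML convention (`<|im_start|>` / `<|im_end|>` role markers). A minimal manual rendering for illustration; `chat_template.jinja` remains the authoritative template, and the helper name here is hypothetical:
```csharp
// Hypothetical helper: builds a single-turn ChatML prompt by hand.
static string BuildPrompt(string system, string user) =>
    $"<|im_start|>system\n{system}<|im_end|>\n" +
    $"<|im_start|>user\n{user}<|im_end|>\n" +
    "<|im_start|>assistant\n";
```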
---
## Performance Characteristics
- Much stronger reasoning and instruction following than the 14B variant
- Higher latency, but better long-context coherence
- Ideal when the model must:
  - Infer document structures
  - Select templates
  - Extract structured fields from natural language
---
## License & Attribution
Base model:
- Qwen3-32B by Alibaba (see the original model card for license terms)

Conversion:
- ONNX + INT4 DirectML optimization performed by Wekkel using Microsoft Olive
- Independent community conversion
- No affiliation with Alibaba or the Qwen team
---
## Related Models
Smaller & faster:
- https://huggingface.co/wekkel/Qwen3-14B-Instruct-DirectML-INT4