# Qwen3Guard-0.6B-ONNX-Quantized
This repository provides a web-optimized, Q4 quantized ONNX version of Qwen/Qwen3Guard-Gen-0.6B. It is specifically designed for high-performance, client-side safety moderation in browsers using transformers.js.
## Key Features
- Multilingual Support: Safety assessment across 119 languages.
- Web-Ready: Fully compatible with `transformers.js` (v3+/v4).
- Hardware Accelerated: Optimized for WebGPU, providing near-instant inference on Apple Silicon (M1/M2/M3) and modern GPUs.
- Optimized Footprint: Q4 dynamic quantization reduces the model size from **1.5GB** to **940MB** while maintaining high accuracy for safety guardrails.
## Model Details
| Attribute | Specification |
|---|---|
| Original Model | Qwen/Qwen3Guard-Gen-0.6B |
| Architecture | Causal Language Model (Qwen3) |
| Format | ONNX (Open Neural Network Exchange) |
| Quantization | Q4 (4-bit), optimized for CPU/WebGPU |
## Installation & Usage

### 1. Requirements

This model targets transformers.js v3+. The latest v4 development release is recommended for full 4-bit WebGPU acceleration support:

```bash
npm install @huggingface/transformers@4.0.0-next.3
```

Or via CDN:

```js
import { pipeline, env } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@4.0.0-next.3';
```
### 2. Implementation (JavaScript/TypeScript)

It is highly recommended to detect WebGPU support before initialization, falling back to CPU (WASM) if necessary.

```js
async function isWebGPUSupported() {
  if (!navigator.gpu) return false;
  try {
    const adapter = await navigator.gpu.requestAdapter();
    return !!adapter;
  } catch (e) {
    return false;
  }
}

async function runSafetyCheck(text) {
  const hasWebGPU = await isWebGPUSupported();
  console.log(hasWebGPU ? "WebGPU accelerated" : "WebGPU unavailable, falling back to CPU");
  try {
    const classifier = await pipeline('text-generation', 'rogerdeng/Qwen3Guard-0.6B-ONNX-Quantized', {
      model_file_name: 'model_quantized', // Maps to onnx/model_quantized.onnx
      device: hasWebGPU ? 'webgpu' : 'cpu'
    });
    const output = await classifier(text, {
      max_new_tokens: 15,
      temperature: 0,
      use_cache: true, // Crucial for sub-second inference speed
    });
    const result = output[0].generated_text.toLowerCase();
    console.log("Safety assessment:", result);
    return result.includes('unsafe') ? 'Blocked' : 'Allowed';
  } catch (error) {
    console.error("Inference failed:", error);
  }
}
```
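The substring check above is a blunt instrument. If you want the structured verdict, a small parser can extract it from the generated text. Note the assumed output format (`Safety: ...` / `Categories: ...` lines) is based on the upstream Qwen3Guard-Gen convention; verify it against the original model card before relying on it:

```javascript
// Minimal parser for Qwen3Guard-style generations. The "Safety:" /
// "Categories:" line format is an assumption, not guaranteed by this repo.
function parseGuardOutput(generatedText) {
  const text = generatedText.toLowerCase();
  // Match the safety verdict line, e.g. "safety: unsafe"
  const safetyMatch = text.match(/safety:\s*(safe|unsafe|controversial)/);
  const verdict = safetyMatch ? safetyMatch[1] : 'unknown';
  // Optionally capture the violated category line, e.g. "categories: violent"
  const categoryMatch = text.match(/categories:\s*([^\n]+)/);
  return {
    verdict,
    categories: categoryMatch ? categoryMatch[1].trim() : null,
    blocked: verdict === 'unsafe',
  };
}
```

This keeps the blocking decision explicit (`blocked`) while still surfacing the category for logging or appeals.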
## Performance Benchmark (Apple M3 Max)
| Quantization | Execution Device | Inference Latency | Status |
|---|---|---|---|
| INT8 | WebGPU | ~10.0 - 12.0s | Slow (kernel fallback) |
| Q4 (current) | WebGPU | < 1.0s | Highly recommended |
| Q4 (current) | CPU (WASM) | ~15.0s+ | Backup mode |
Note: These results are based on real-world tests on a MacBook Pro M3 Max using Chrome with WebGPU enabled.
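If you want to reproduce these numbers in your own environment, a simple timing wrapper around the inference call is enough. The `runOnce` callback below is a placeholder for one call to the pipeline created earlier; the helper itself is self-contained:

```javascript
// Measure end-to-end latency of an async inference call with performance.now().
// `runOnce` is any async function performing one inference; `runs` repeats it
// so warm-up overhead (model load, shader compilation) doesn't dominate.
async function measureLatency(runOnce, runs = 3) {
  const timings = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    await runOnce();
    timings.push(performance.now() - start);
  }
  // Report the best run, which is closest to steady-state performance.
  return Math.min(...timings);
}
```

Reporting the minimum rather than the mean filters out first-run compilation spikes, which are especially large on WebGPU.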
## Limitations & Safety Disclaimer
This model is a quantized version of the original Qwen3Guard. While quantization improves speed and reduces size, users should be aware of:
- Nuance Loss: In extremely complex linguistic contexts, quantization may slightly alter sensitivity compared to the FP16/BF16 versions.
- Scope: This model is designed to detect policy violations. Use it as one layer of a multi-layered moderation system rather than as the sole decision-maker in safety-critical applications.
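As a sketch of such layering, a hypothetical `moderate` helper might run a cheap deterministic blocklist first and only consult the model when that passes. Both the blocklist pattern and the `checkWithModel` callback are illustrative names, not part of this repository:

```javascript
// Hypothetical multi-layered moderation: a fast deterministic blocklist
// runs first; the (slower) model verdict is only consulted when it passes.
// `checkWithModel` is a placeholder for a real classifier call returning
// 'Blocked' or 'Allowed'.
async function moderate(text, checkWithModel) {
  const blocklist = [/\bexample-banned-term\b/i]; // illustrative pattern only
  if (blocklist.some((re) => re.test(text))) {
    return { decision: 'Blocked', layer: 'blocklist' };
  }
  const verdict = await checkWithModel(text);
  return { decision: verdict, layer: 'model' };
}
```

Recording which layer made the decision makes audits and threshold tuning much easier later.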
## License & Attribution
This model is a derivative work of Qwen3Guard-Gen-0.6B.
- Original Authors: Qwen Team, Alibaba Group.
- License: Please refer to the Qwen Research License for usage restrictions.
- Citation: If you use this model, please cite the original Qwen3 paper and acknowledge the Qwen team.