Qwen3Guard-0.6B-ONNX-Quantized

This repository provides a web-optimized, Q4 quantized ONNX version of Qwen/Qwen3Guard-Gen-0.6B. It is specifically designed for high-performance, client-side safety moderation in browsers using transformers.js.


πŸš€ Key Features

  • Multilingual Support: Supports safety assessment for 119 languages.
  • Web-Ready: Fully compatible with transformers.js (v3 and later).
  • Hardware Accelerated: Optimized for WebGPU, providing near-instant inference on Apple Silicon (M1/M2/M3) and modern GPUs.
  • Optimized Footprint: Q4 Dynamic Quantization reduces the model size from 1.5GB to **940MB** while maintaining high accuracy for safety guardrails.

πŸ›  Model Details

| Attribute | Specification |
| --- | --- |
| Original Model | Qwen/Qwen3Guard-Gen-0.6B |
| Architecture | Causal Language Model (Qwen3) |
| Format | ONNX (Open Neural Network Exchange) |
| Quantization | Q4 (4-bit), optimized for CPU/WebGPU |

πŸ’» Installation & Usage

1. Requirements

This model is optimized for transformers.js (v3/v4). It is recommended to use the latest development version for full 4-bit WebGPU acceleration support:

    npm install @huggingface/transformers@4.0.0-next.3

Or via CDN:

    import { pipeline, env } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@4.0.0-next.3';
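
Because the quantized weights are large (~940MB), it is worth making sure the browser caches them so the download only happens once. A minimal sketch using the standard transformers.js `env` settings (defaults may vary by version, so treat this as optional configuration rather than required setup):

```javascript
import { env } from '@huggingface/transformers';

// Persist downloaded weights in the browser's Cache API so repeat
// visits skip the ~940MB download.
env.useBrowserCache = true;

// Fetch weights from the Hugging Face Hub instead of probing for
// locally hosted model files first.
env.allowLocalModels = false;
```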

2. Implementation (JavaScript/TypeScript)

It is highly recommended to detect WebGPU support before initialization, falling back to CPU (WASM) if necessary.

    import { pipeline } from '@huggingface/transformers';
    
    // Returns true only when the browser exposes a usable WebGPU adapter.
    async function isWebGPUSupported() {
        if (!navigator.gpu) return false;
        try {
            const adapter = await navigator.gpu.requestAdapter();
            return !!adapter;
        } catch (e) {
            return false;
        }
    }
    
    async function runSafetyCheck(text) {
        const hasWebGPU = await isWebGPUSupported();
        console.log(hasWebGPU ? "πŸš€ WebGPU Accelerated" : "⚠️ WebGPU Unavailable, falling back to CPU");
        
        try {
            const classifier = await pipeline('text-generation', 'rogerdeng/Qwen3Guard-0.6B-ONNX-Quantized', {
                model_file_name: 'model_quantized', // Maps to onnx/model_quantized.onnx
                device: hasWebGPU ? 'webgpu' : 'cpu'
            });
    
            const output = await classifier(text, {
                max_new_tokens: 15,
                temperature: 0,
                use_cache: true, // Crucial for < 1s inference speed
            });
    
            const result = output[0].generated_text.toLowerCase();
            console.log("Safety Assessment:", result);
            
            return result.includes('unsafe') ? 'Blocked' : 'Allowed';
        } catch (error) {
            console.error("Inference failed:", error);
            return 'Blocked'; // Fail closed if inference errors out
        }
    }
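
The `includes('unsafe')` check above can be factored into a small, testable helper. A sketch (the helper name `parseVerdict` is ours, and the exact label format emitted by Qwen3Guard may differ from what is assumed here):

```javascript
// Map raw guard-model output to a moderation decision. Assumes the
// generated text contains a label such as "Safe", "Unsafe", or
// "Controversial"; adjust if the model's output format differs.
function parseVerdict(generatedText) {
    const text = generatedText.toLowerCase();
    if (text.includes('unsafe')) return 'Blocked';       // check before 'safe'
    if (text.includes('controversial')) return 'Review'; // borderline content
    return 'Allowed';
}
```

Checking for `'unsafe'` before `'safe'` matters, since `"unsafe"` contains `"safe"` as a substring.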

Performance Benchmark (Apple M3 Max)

| Quantization | Execution Device | Inference Latency | Status |
| --- | --- | --- | --- |
| INT8 | WebGPU | ~10.0 - 12.0s | ⚠️ Slow (Kernel Fallback) |
| Q4 (Current) | WebGPU | < 1.0s | ✅ Highly Recommended |
| Q4 (Current) | CPU (WASM) | ~15.0s+ | ℹ️ Backup Mode |

Note: These results are based on real-world tests on a MacBook Pro M3 Max using Chrome with WebGPU enabled.

⚠️ Limitations & Safety Disclaimer

This model is a quantized version of the original Qwen3Guard. While quantization improves speed and reduces size, users should be aware of:

  • Nuance Loss: In extremely complex linguistic contexts, quantization may slightly alter sensitivity compared to the FP16/BF16 versions.
  • Scope: This model is designed to detect policy violations. It should be used as one layer of a multi-layered moderation system, not as the sole decision-maker for critical safety applications.
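
A minimal sketch of what such layering can look like: a cheap pre-filter runs first, the model verdict second, and anything unexpected fails closed. All names here (`BLOCKLIST`, `moderationDecision`) are illustrative, not part of any library:

```javascript
// Illustrative blocklist; a real deployment would maintain this elsewhere.
const BLOCKLIST = ['example-banned-term'];

// Combine a keyword pre-filter with the model's verdict ('Blocked' or
// 'Allowed'). Unknown verdicts (e.g. from an upstream error) fail closed.
function moderationDecision(text, modelVerdict) {
    const lower = text.toLowerCase();
    if (BLOCKLIST.some(term => lower.includes(term))) return 'Blocked';
    if (modelVerdict === 'Blocked') return 'Blocked';
    if (modelVerdict === 'Allowed') return 'Allowed';
    return 'Blocked'; // fail closed on unknown or missing verdicts
}
```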

πŸ“œ License & Attribution

This model is a derivative work of Qwen3Guard-Gen-0.6B.

  • Original Authors: Qwen Team, Alibaba Group.
  • License: Please refer to the Qwen Research License for usage restrictions.
  • Citation: If you use this model, please cite the original Qwen3 paper and acknowledge the Qwen team.