Qwen3Guard-0.6B-ONNX-Quantized

This repository provides a web-optimized, Q4 quantized ONNX version of Qwen/Qwen3Guard-Gen-0.6B. It is specifically designed for high-performance, client-side safety moderation in browsers using transformers.js.


πŸš€ Key Features

  • Multilingual Support: Supports safety assessment for 119 languages.
  • Web-Ready: Fully compatible with transformers.js (v3 and later).
  • Hardware Accelerated: Optimized for WebGPU, providing near-instant inference on Apple Silicon (M1/M2/M3) and modern GPUs.
  • Optimized Footprint: Q4 Dynamic Quantization reduces the model size from 1.5GB to **940MB** while maintaining high accuracy for safety guardrails.

πŸ›  Model Details

| Attribute | Specification |
| --- | --- |
| Original Model | Qwen/Qwen3Guard-Gen-0.6B |
| Architecture | Causal Language Model (Qwen3) |
| Format | ONNX (Open Neural Network Exchange) |
| Quantization | Q4 (4-bit), optimized for CPU/WebGPU |

πŸ’» Installation & Usage

1. Requirements

This model is optimized for transformers.js (v3/v4). It is recommended to use the latest development version for full 4-bit WebGPU acceleration support:

    npm install @huggingface/transformers@4.0.0-next.3

Or via CDN:

    import { pipeline, env } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@4.0.0-next.3';
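
Because the quantized weights are large (~940MB), it is worth making sure the browser caches them so the download only happens once. A minimal sketch using the standard transformers.js `env` settings (defaults may vary by version, so treat this as optional configuration rather than required setup):

```javascript
import { env } from '@huggingface/transformers';

// Persist downloaded weights in the browser's Cache API so repeat
// visits skip the ~940MB download.
env.useBrowserCache = true;

// Fetch weights from the Hugging Face Hub instead of probing for
// locally hosted model files first.
env.allowLocalModels = false;
```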

2. Implementation (JavaScript/TypeScript)

It is highly recommended to detect WebGPU support before initialization, falling back to CPU (WASM) if necessary.

    import { pipeline } from '@huggingface/transformers';
    
    // Returns true only when the browser exposes a usable WebGPU adapter.
    async function isWebGPUSupported() {
        if (!navigator.gpu) return false;
        try {
            const adapter = await navigator.gpu.requestAdapter();
            return !!adapter;
        } catch (e) {
            return false;
        }
    }
    
    async function runSafetyCheck(text) {
        const hasWebGPU = await isWebGPUSupported();
        console.log(hasWebGPU ? "πŸš€ WebGPU Accelerated" : "⚠️ WebGPU Unavailable, falling back to CPU");
        
        try {
            const classifier = await pipeline('text-generation', 'rogerdeng/Qwen3Guard-0.6B-ONNX-Quantized', {
                model_file_name: 'model_quantized', // Maps to onnx/model_quantized.onnx
                device: hasWebGPU ? 'webgpu' : 'cpu'
            });
    
            const output = await classifier(text, {
                max_new_tokens: 15,
                temperature: 0,
                use_cache: true, // Crucial for < 1s inference speed
            });
    
            const result = output[0].generated_text.toLowerCase();
            console.log("Safety Assessment:", result);
            
            return result.includes('unsafe') ? 'Blocked' : 'Allowed';
        } catch (error) {
            console.error("Inference failed:", error);
            return 'Blocked'; // Fail closed if inference errors out
        }
    }
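
The `includes('unsafe')` check above can be factored into a small, testable helper. A sketch (the helper name `parseVerdict` is ours, and the exact label format emitted by Qwen3Guard may differ from what is assumed here):

```javascript
// Map raw guard-model output to a moderation decision. Assumes the
// generated text contains a label such as "Safe", "Unsafe", or
// "Controversial"; adjust if the model's output format differs.
function parseVerdict(generatedText) {
    const text = generatedText.toLowerCase();
    if (text.includes('unsafe')) return 'Blocked';       // check before 'safe'
    if (text.includes('controversial')) return 'Review'; // borderline content
    return 'Allowed';
}
```

Checking for `'unsafe'` before `'safe'` matters, since `"unsafe"` contains `"safe"` as a substring.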

Performance Benchmark (Apple M3 Max)

| Quantization | Execution Device | Inference Latency | Status |
| --- | --- | --- | --- |
| INT8 | WebGPU | ~10.0 - 12.0s | ⚠️ Slow (Kernel Fallback) |
| Q4 (Current) | WebGPU | < 1.0s | ✅ Highly Recommended |
| Q4 (Current) | CPU (WASM) | ~15.0s+ | ℹ️ Backup Mode |

Note: These results are based on real-world tests on a MacBook Pro M3 Max using Chrome with WebGPU enabled.

⚠️ Limitations & Safety Disclaimer

This model is a quantized version of the original Qwen3Guard. While quantization improves speed and reduces size, users should be aware of:

  • Nuance Loss: In extremely complex linguistic contexts, quantization may slightly alter sensitivity compared to the FP16/BF16 versions.
  • Scope: This model is designed to detect policy violations. It should be used as one layer of a multi-layered moderation system, not as the sole decision-maker for critical safety applications.
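
A minimal sketch of what such layering can look like: a cheap pre-filter runs first, the model verdict second, and anything unexpected fails closed. All names here (`BLOCKLIST`, `moderationDecision`) are illustrative, not part of any library:

```javascript
// Illustrative blocklist; a real deployment would maintain this elsewhere.
const BLOCKLIST = ['example-banned-term'];

// Combine a keyword pre-filter with the model's verdict ('Blocked' or
// 'Allowed'). Unknown verdicts (e.g. from an upstream error) fail closed.
function moderationDecision(text, modelVerdict) {
    const lower = text.toLowerCase();
    if (BLOCKLIST.some(term => lower.includes(term))) return 'Blocked';
    if (modelVerdict === 'Blocked') return 'Blocked';
    if (modelVerdict === 'Allowed') return 'Allowed';
    return 'Blocked'; // fail closed on unknown or missing verdicts
}
```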

πŸ“œ License & Attribution

This model is a derivative work of Qwen3Guard-Gen-0.6B.

  • Original Authors: Qwen Team, Alibaba Group.
  • License: Please refer to the Qwen Research License for usage restrictions.
  • Citation: If you use this model, please cite the original Qwen3 paper and acknowledge the Qwen team.