---
license: mit
tags:
- ncnn
- vulkan
- denoising
- image-restoration
- drunet
- dpir
library_name: ncnn
pipeline_tag: image-to-image
---

# DRUNet (color): ncnn port

ncnn-compatible weights for **DRUNet**, the plug-and-play color denoiser from [cszn/DPIR][dpir] (Zhang et al., *Plug-and-Play Image Restoration with Deep Denoiser Prior*, IEEE TPAMI 2021). One model handles every noise level σ ∈ [0, 50] because the noise level is passed in as a 4th input channel, so no per-σ retraining is needed.

Converted from the official PyTorch checkpoint published at [deepinv/drunet][hf-pth]. To my knowledge no other ncnn port of DRUNet existed on the Hub, so I'm uploading this one so the next person doesn't have to spend the hour I did re-running pnnx.

## Files

| File | Size | Purpose |
|------|------|---------|
| `drunet_color.ncnn.param` | ~11 KB | Network topology (text format, 125 layers) |
| `drunet_color.ncnn.bin` | ~65 MB | fp16-quantized weights |

The fp16 quantization halves the on-disk footprint relative to the original 130 MB fp32 `.pth`. On a real image at σ ≤ 50, any difference from the fp32 output stays within the noise and is not visually perceptible.

## Usage (ncnn C++)

```cpp
#include "net.h"

ncnn::Net net;
net.opt.use_vulkan_compute = true;  // ~5× faster than CPU on a real GPU
net.load_param("drunet_color.ncnn.param");
net.load_model("drunet_color.ncnn.bin");

// Input layout: 4-channel float, (1, 4, H, W)
//   ch0..2 = RGB normalized to [0, 1]
//   ch3    = σ/255 broadcast as a constant plane
// H and W must be multiples of 8 (3 stride-2 downscale stages in the UNet).
// Pad with cv::BORDER_REPLICATE and crop the result back.
ncnn::Mat in(W, H, 4);
// ...fill RGB and σ plane...

ncnn::Extractor ex = net.create_extractor();
ex.input("in0", in);
ncnn::Mat out;
ex.extract("out0", out);  // 3-channel float RGB in [0, 1]
```

A full C++ wrapper with tile-aware inference, replicate-padding, and Vulkan auto-detection lives in [mlc-ncnn-img2img/src/denoise.cpp][shim].

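
The input layout described in the usage comments (three RGB planes in [0, 1], a constant σ/255 plane, and sides rounded up to a multiple of 8) can be sketched without ncnn or OpenCV. This is only an illustration in pure Python: `padded_size`, `pack_drunet_input`, and `crop_planes` are hypothetical helper names, and padding on the bottom and right edges is an assumption, not necessarily what the companion wrapper does.

```python
def padded_size(size: int, multiple: int = 8) -> int:
    """Round a side length up to the next multiple (8 for this model)."""
    return (size + multiple - 1) // multiple * multiple


def pack_drunet_input(rgb, sigma, multiple=8):
    """Build 4 planar channels from an H x W grid of 8-bit (r, g, b) tuples.

    Assumption: padding is added on the bottom/right by repeating the last
    row/column, which mimics replicate padding on those two edges.
    Returns (planes, (H, W)) so the caller can crop the output back.
    """
    h, w = len(rgb), len(rgb[0])
    ph, pw = padded_size(h, multiple), padded_size(w, multiple)
    planes = []
    for c in range(3):
        # Clamp indices to the last valid row/column to replicate the edge.
        planes.append([
            [rgb[min(y, h - 1)][min(x, w - 1)][c] / 255.0 for x in range(pw)]
            for y in range(ph)
        ])
    # Channel 3: the noise level as a constant plane, scaled like the pixels.
    planes.append([[sigma / 255.0] * pw for _ in range(ph)])
    return planes, (h, w)


def crop_planes(planes, size):
    """Crop each plane back to the original (H, W)."""
    h, w = size
    return [[row[:w] for row in plane[:h]] for plane in planes]


# Tiny 3 x 5 image, sigma = 20: padded to 8 x 8, then cropped back.
rgb = [[(255, 128, 0)] * 5 for _ in range(3)]
planes, size = pack_drunet_input(rgb, sigma=20)
restored = crop_planes(planes[:3], size)
```

In a real pipeline the four planes would be copied into the `ncnn::Mat` channels in the same order, and the crop would be applied to the three output channels.
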
## How this was produced

```python
# python3, in a venv with torch + pnnx + opencv-python
import sys, torch
sys.path.insert(0, "DPIR")  # clone of github.com/cszn/DPIR
from models.network_unet import UNetRes

model = UNetRes(in_nc=4, out_nc=3, nc=[64,128,256,512], nb=4,
                act_mode="R", downsample_mode="strideconv",
                upsample_mode="convtranspose")
model.load_state_dict(torch.load("drunet_color.pth", weights_only=True))
model.eval()

x = torch.randn(1, 4, 256, 256)
torch.jit.trace(model, x, check_trace=False).save("drunet_color.pt")

import pnnx
pnnx.convert("drunet_color.pt", inputs=x, fp16=True)
# → drunet_color.ncnn.param + drunet_color.ncnn.bin
```

The complete driver script is [`convert_drunet.py`][driver] in the sibling repo.

## Performance

Eagle 1024×1024, σ=20 (single tile, no batching, fp16 weights):

| Backend | Wall time | Notes |
|------------------------------------------|-----------|----------------------------------------------------------|
| **Vulkan, Apple M2 Ultra (MoltenVK)** | **1.3 s** | warm; first run ~44 s (shader JIT) |
| **Vulkan, NVIDIA RTX 3060 (Windows)** | **3.66 s** | warm avg of 3; cold 3.49 s (5.7× over same-box CPU) |
| **CPU, Apple M2 Ultra (4 threads)** | **3.1 s** | native arm64, AppleClang + libomp |
| CPU, AMD Ryzen 7 2700X (4 threads, AVX2) | 21 s | RTX 3060 box, MLC_NCNN_CPU=1 forced |
| CPU, Intel Xeon (4 threads, AVX2) | 23 s | Linux box without hardware Vulkan |
| Vulkan, Mesa llvmpipe (software) | 127 s | **5× slower than CPU; filter this out** |

Notable: M2 Ultra Vulkan (1.3 s) beats RTX 3060 Vulkan (3.66 s) by ~2.8×. Likely a combination of M2 Ultra's unified memory (no per-tile PCIe round-trip) and its high FP16 throughput; the default tile size of 256 px favours architectures with cheap small-batch dispatches. A power-user knob to bump tile size on discrete GPUs with plenty of VRAM is a worthwhile follow-up. The M2 Ultra Vulkan path is ~17× faster than the Xeon CPU baseline.

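
The ratios quoted in this section can be rechecked from the warm wall times in the table. The snippet below is a small sketch for doing that; the dictionary keys and the `speedup` helper are hypothetical names, and the timings are copied from the table as rounded there, so the results are only as precise as those figures.

```python
# Warm wall times in seconds, copied from the performance table.
timings = {
    "vulkan_m2_ultra": 1.3,
    "vulkan_rtx_3060": 3.66,
    "cpu_m2_ultra": 3.1,
    "cpu_ryzen_2700x": 21.0,
    "cpu_xeon": 23.0,
    "vulkan_llvmpipe": 127.0,
}


def speedup(slow: str, fast: str) -> float:
    """How many times faster the `fast` backend is than the `slow` one."""
    return timings[slow] / timings[fast]


pairs = [
    ("vulkan_rtx_3060", "vulkan_m2_ultra"),  # M2 Ultra GPU vs RTX 3060
    ("cpu_xeon", "vulkan_m2_ultra"),         # M2 Ultra GPU vs Xeon CPU
    ("cpu_ryzen_2700x", "vulkan_rtx_3060"),  # RTX 3060 vs same-box CPU
    ("vulkan_llvmpipe", "cpu_xeon"),         # Xeon CPU vs software Vulkan
]
for slow, fast in pairs:
    print(f"{fast} is {speedup(slow, fast):.1f}x faster than {slow}")
```
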
First-call latency on MoltenVK is dominated by Metal-shader JIT compilation (~40 s); subsequent invocations from the same process run at the warm time. For one-shot CLI invocations on Apple Silicon the difference matters; caching the binary's shader compile output across runs would close it (TODO).

If `ncnn::get_gpu_info()` only reports `llvmpipe` (Mesa software Vulkan on headless Linux), prefer CPU: software Vulkan is a slowdown for this model size, not a speedup. The companion C++ wrapper [auto-detects this][shim] and falls back to CPU.

## License & citation

MIT, inherited from the original [cszn/DPIR][dpir] repository.

```
@article{zhang2021plug,
  title={Plug-and-Play Image Restoration with Deep Denoiser Prior},
  author={Zhang, Kai and Li, Yawei and Zuo, Wangmeng and Zhang, Lei and Van Gool, Luc and Timofte, Radu},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2021}
}
```

## Companion

This model ships as the `denoise` Tool Plugin in [mlc OpticScript][mlcos]: JS scripts can call `Engine.tool('denoise').apply(img, {strength: 20})` and the runtime spawns the bundled C++ binary that loads these weights.

[dpir]: https://github.com/cszn/DPIR
[hf-pth]: https://huggingface.co/deepinv/drunet
[shim]: https://github.com/mlc911/mlc-ncnn-img2img/blob/main/src/denoise.cpp
[driver]: https://github.com/mlc911/mlc-ncnn-img2img/blob/main/conversion/drunet/convert_drunet.py
[mlcos]: https://mlcgo.eu/products/mlc-opticscript