| --- |
| license: mit |
| tags: |
| - ncnn |
| - vulkan |
| - denoising |
| - image-restoration |
| - drunet |
| - dpir |
| library_name: ncnn |
| pipeline_tag: image-to-image |
| --- |
| |
| # DRUNet (color): ncnn port |
|
|
| ncnn-compatible weights for **DRUNet**, the plug-and-play color |
| denoiser from [cszn/DPIR][dpir] (Zhang et al., *Plug-and-Play Image |
| Restoration with Deep Denoiser Prior*, IEEE TPAMI 2021). One model |
| handles every noise level σ ∈ [0, 50] because the noise level is |
| passed in as a 4th input channel, so no per-σ retraining is needed. |
|
|
| Converted from the official PyTorch checkpoint published at |
| [deepinv/drunet][hf-pth]. To my knowledge no other ncnn port of |
| DRUNet existed on the Hub, so I am uploading this one so the next |
| person doesn't have to spend the hour I did re-running pnnx. |
|
|
| ## Files |
|
|
| | File | Size | Purpose | |
| |------|------|---------| |
| | `drunet_color.ncnn.param` | ~11 KB | Network topology (text format, 125 layers) | |
| | `drunet_color.ncnn.bin` | ~65 MB | fp16-quantized weights | |
|
|
| The fp16 quantization halves the on-disk footprint vs the original |
| 130 MB fp32 `.pth`; any differences from the fp32 output are lost |
| in the noise on a real image at σ ≤ 50. |
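The size and precision figures above can be sanity-checked with a few lines of Python. This is a minimal sketch, assuming the 130 MB checkpoint is essentially all fp32 weights; it round-trips values through the standard library's half-precision `struct` format. The sample values are illustrative and are not read from the checkpoint.

```python
import struct

FP32_CHECKPOINT_MB = 130  # size of the original .pth, from the text above
BYTES_FP32, BYTES_FP16 = 4, 2


def fp16_size_mb(fp32_size_mb: float) -> float:
    """Estimated size after storing every weight as fp16 instead of fp32."""
    return fp32_size_mb * BYTES_FP16 / BYTES_FP32


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def relative_error(x: float) -> float:
    """Relative error introduced by rounding x to half precision."""
    return abs(to_fp16(x) - x) / abs(x)


# Illustrative weight magnitudes (assumed, not taken from the checkpoint).
samples = [0.0123, -0.37, 1.5e-3, 0.9991, -2.75]
worst = max(relative_error(w) for w in samples)

print(f"estimated fp16 size: {fp16_size_mb(FP32_CHECKPOINT_MB):.0f} MB")
print(f"worst relative rounding error in samples: {worst:.2e}")
```

For values in fp16's normal range the relative rounding error is bounded by 2^-11 (about 0.05%), which is small next to the noise levels the model is meant to remove.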
|
|
| ## Usage (ncnn C++) |
|
|
| ```cpp |
| #include "net.h" |
| |
| ncnn::Net net; |
| net.opt.use_vulkan_compute = true;  // ~5× faster than CPU on a real GPU |
| net.load_param("drunet_color.ncnn.param"); |
| net.load_model("drunet_color.ncnn.bin"); |
| |
| // Input layout: 4-channel float; (1, 4, H, W) in PyTorch terms, |
| // ncnn::Mat(W, H, 4) here (ncnn has no batch dimension). |
| //   ch0..2 = RGB normalized to [0, 1] |
| //   ch3    = σ/255 broadcast as a constant plane |
| // H and W must be multiples of 8 (3 stride-2 downscale stages in the UNet). |
| // Pad with cv::BORDER_REPLICATE and crop the result back. |
| |
| ncnn::Mat in(W, H, 4); |
| // ...fill RGB and σ plane... |
| |
| ncnn::Extractor ex = net.create_extractor(); |
| ex.input("in0", in); |
| ncnn::Mat out; |
| ex.extract("out0", out);  // 3-channel float RGB, nominally in [0, 1]; clamp before 8-bit conversion |
| ``` |
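The input layout described in the comments above can be sketched in plain Python. This is an illustration only: the helper names (`padded_size`, `replicate_pad`, `build_input`, `crop`) are hypothetical, and padding only on the bottom and right edges is an assumption rather than the wrapper's documented behaviour.

```python
def padded_size(n: int, multiple: int = 8) -> int:
    """Smallest size >= n that is a multiple of `multiple`."""
    return -(-n // multiple) * multiple


def replicate_pad(plane, new_h, new_w):
    """Pad a 2-D list on the bottom/right by repeating the edge pixels."""
    rows = [row + [row[-1]] * (new_w - len(row)) for row in plane]
    rows += [list(rows[-1]) for _ in range(new_h - len(rows))]
    return rows


def build_input(rgb_u8, sigma):
    """rgb_u8: H x W nested list of (r, g, b) tuples in 0..255.

    Returns (planes, (H, W)): four padded 2-D planes (R, G, B scaled to
    [0, 1], then a constant sigma/255 plane) and the original size.
    """
    h, w = len(rgb_u8), len(rgb_u8[0])
    ph, pw = padded_size(h), padded_size(w)
    planes = []
    for c in range(3):
        plane = [[px[c] / 255.0 for px in row] for row in rgb_u8]
        planes.append(replicate_pad(plane, ph, pw))
    planes.append([[sigma / 255.0] * pw for _ in range(ph)])
    return planes, (h, w)


def crop(plane, h, w):
    """Undo the padding on one output plane."""
    return [row[:w] for row in plane[:h]]


# Tiny 5 x 6 gradient image as a stand-in for real pixel data.
image = [[(10 * x, 20 * y, 255) for x in range(6)] for y in range(5)]
planes, (h, w) = build_input(image, sigma=20)
print(len(planes), len(planes[0]), len(planes[0][0]))
```

In the C++ code each plane would be written through `in.channel(c)`, because ncnn pads the stride between channels and the four planes are not guaranteed to be contiguous.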
|
|
| A full C++ wrapper with tile-aware inference, replicate-padding, and |
| Vulkan auto-detection lives in [mlc-ncnn-img2img/src/denoise.cpp][shim]. |
|
|
| ## How this was produced |
|
|
| ```python |
| # python3, in a venv with torch + pnnx + opencv-python |
| import sys, torch |
| sys.path.insert(0, "DPIR") # clone of github.com/cszn/DPIR |
| from models.network_unet import UNetRes |
| |
| model = UNetRes(in_nc=4, out_nc=3, nc=[64,128,256,512], nb=4, |
| act_mode="R", downsample_mode="strideconv", |
| upsample_mode="convtranspose") |
| model.load_state_dict(torch.load("drunet_color.pth", weights_only=True)) |
| model.eval() |
| |
| x = torch.randn(1, 4, 256, 256) |
| torch.jit.trace(model, x, check_trace=False).save("drunet_color.pt") |
| |
| import pnnx |
| pnnx.convert("drunet_color.pt", inputs=x, fp16=True) |
| # → drunet_color.ncnn.param + drunet_color.ncnn.bin |
| ``` |
|
|
| The complete driver script is [`convert_drunet.py`][driver] in the |
| sibling repo. |
|
|
| ## Performance |
|
|
| Eagle 1024×1024, σ=20 (single tile, no batching, fp16 weights): |
|
|
| | Backend | Wall time | Notes | |
| |------------------------------------------|-----------|----------------------------------------------------------| |
| | **Vulkan, Apple M2 Ultra (MoltenVK)** | **1.3 s** | warm; first run ~44 s (shader JIT) | |
| | **Vulkan, NVIDIA RTX 3060 (Windows)**    | **3.66 s**| warm avg of 3; cold 3.49 s (5.7× over same-box CPU)      | |
| | **CPU, Apple M2 Ultra (4 threads)** | **3.1 s** | native arm64, AppleClang + libomp | |
| | CPU, AMD Ryzen 7 2700X (4 threads, AVX2) | 21 s | RTX 3060 box, MLC_NCNN_CPU=1 forced | |
| | CPU, Intel Xeon (4 threads, AVX2) | 23 s | Linux box without hardware Vulkan | |
| | Vulkan, Mesa llvmpipe (software)         | 127 s     | **5× slower than CPU; filter this out**                  | |
|
|
| Notable: M2 Ultra Vulkan (1.3 s) beats RTX 3060 Vulkan (3.66 s) by |
| ~2.8×. Likely a combination of M2 Ultra's unified memory (no |
| per-tile PCIe round-trip) and its high FP16 throughput; the |
| default tile size of 256 px favours architectures with cheap |
| small-batch dispatches. A power-user knob to bump tile size on |
| discrete GPUs with plenty of VRAM is a worthwhile follow-up. |
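To make the tile-size trade-off concrete, the sketch below counts dispatches for a few tile sizes. It assumes non-overlapping square tiles; the wrapper's actual tiling (overlap, padding, and whether a given image is split at all) is not specified here, so `tile_grid` is a hypothetical helper, not its implementation.

```python
import math


def tile_grid(width: int, height: int, tile: int):
    """Tiles per axis and total tile count for non-overlapping square tiles."""
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    return cols, rows, cols * rows


for tile in (256, 512, 1024):
    cols, rows, total = tile_grid(1024, 1024, tile)
    print(f"tile {tile:>4} px: {cols} x {rows} = {total} dispatches")
```

Under these assumptions, doubling the tile edge cuts the number of dispatches (and host-to-device transfers on a discrete GPU) by four, at the cost of more VRAM per dispatch.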
|
|
| The M2 Ultra Vulkan path is ~17× faster than the Xeon CPU baseline. |
| First-call latency on MoltenVK is dominated by Metal-shader JIT |
| compilation (~40 s); subsequent invocations from the same process |
| run at the warm number. For one-shot CLI invocations on Apple |
| Silicon, the difference matters; caching the binary's shader |
| compile output across runs would close the gap (TODO). |
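The ratios quoted in this section, and the point at which the cold start pays for itself, follow from the table by simple arithmetic. The sketch below is a rough model only: it assumes a single cold run followed by warm runs in the same process, and that the CPU path has no comparable start-up cost.

```python
# Timings in seconds, copied from the table above.
M2_VULKAN_WARM = 1.3
M2_VULKAN_COLD = 44.0  # first run, including shader JIT
M2_CPU = 3.1
RTX3060_VULKAN = 3.66
RYZEN_CPU = 21.0
XEON_CPU = 23.0


def speedup(slow: float, fast: float) -> float:
    """How many times faster `fast` is than `slow`."""
    return slow / fast


def break_even_runs(cold: float, warm: float, cpu: float) -> int:
    """Smallest number of runs in one process at which the Vulkan total
    (one cold run plus warm runs) drops below the CPU total."""
    if warm >= cpu:
        raise ValueError("Vulkan never catches up if warm runs are not faster")
    runs = 1
    while cold + (runs - 1) * warm >= runs * cpu:
        runs += 1
    return runs


print(f"M2 Vulkan vs Xeon CPU:      {speedup(XEON_CPU, M2_VULKAN_WARM):.1f}x")
print(f"M2 Vulkan vs RTX 3060:      {speedup(RTX3060_VULKAN, M2_VULKAN_WARM):.1f}x")
print(f"RTX 3060 vs same-box CPU:   {speedup(RYZEN_CPU, RTX3060_VULKAN):.1f}x")
print(f"M2 break-even (same process): "
      f"{break_even_runs(M2_VULKAN_COLD, M2_VULKAN_WARM, M2_CPU)} runs")
```

With these numbers a process on the M2 Ultra would need to denoise a couple of dozen images before Vulkan beats the CPU path overall, which is why the cold start dominates one-shot CLI use.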
|
|
| If `ncnn::get_gpu_info()` only reports `llvmpipe` (Mesa software |
| Vulkan on headless Linux), prefer CPU: software Vulkan is a |
| slowdown for this model size, not a speedup. The companion C++ |
| wrapper [auto-detects this][shim] and falls back to CPU. |
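The fallback rule can be expressed as a small decision function. The sketch below is not the wrapper's code: `choose_backend` and the list of software renderer names are assumptions for illustration. In C++ the device names would presumably come from `ncnn::get_gpu_info(i).device_name()` for each `i` below `ncnn::get_gpu_count()`.

```python
# Substrings that identify software Vulkan implementations (assumed list).
SOFTWARE_RENDERERS = ("llvmpipe", "lavapipe", "swiftshader")


def is_software_device(name: str) -> bool:
    """True if the Vulkan device name looks like a CPU-backed renderer."""
    lowered = name.lower()
    return any(tag in lowered for tag in SOFTWARE_RENDERERS)


def choose_backend(device_names):
    """Return ("vulkan", index) for the first hardware device found,
    or ("cpu", None) when there is none or only software devices."""
    for index, name in enumerate(device_names):
        if not is_software_device(name):
            return ("vulkan", index)
    return ("cpu", None)


print(choose_backend(["llvmpipe (LLVM 15.0.7, 256 bits)"]))
print(choose_backend(["llvmpipe (LLVM 15.0.7, 256 bits)",
                      "NVIDIA GeForce RTX 3060"]))
```

Matching on the device name is a heuristic; a stricter check could also look at the Vulkan physical device type, which reports software implementations as CPU devices.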
|
|
| ## License & citation |
|
|
| MIT, inherited from the original [cszn/DPIR][dpir] repository. |
|
|
| ``` |
| @article{zhang2021plug, |
| title={Plug-and-Play Image Restoration with Deep Denoiser Prior}, |
| author={Zhang, Kai and Li, Yawei and Zuo, Wangmeng and Zhang, Lei and |
| Van Gool, Luc and Timofte, Radu}, |
| journal={IEEE Transactions on Pattern Analysis and Machine Intelligence}, |
| year={2021} |
| } |
| ``` |
|
|
| ## Companion |
|
|
| This model ships as the `denoise` Tool Plugin in |
| [mlc OpticScript][mlcos]: JS scripts can call |
| `Engine.tool('denoise').apply(img, {strength: 20})` and the runtime |
| spawns the bundled C++ binary that loads these weights. |
|
|
| [dpir]: https://github.com/cszn/DPIR |
| [hf-pth]: https://huggingface.co/deepinv/drunet |
| [shim]: https://github.com/mlc911/mlc-ncnn-img2img/blob/main/src/denoise.cpp |
| [driver]: https://github.com/mlc911/mlc-ncnn-img2img/blob/main/conversion/drunet/convert_drunet.py |
| [mlcos]: https://mlcgo.eu/products/mlc-opticscript |
|
|