---
license: mit
tags:
- ncnn
- vulkan
- denoising
- image-restoration
- drunet
- dpir
library_name: ncnn
pipeline_tag: image-to-image
---
# DRUNet (color): ncnn port
ncnn-compatible weights for **DRUNet**, the plug-and-play color
denoiser from [cszn/DPIR][dpir] (Zhang et al., *Plug-and-Play Image
Restoration with Deep Denoiser Prior*, IEEE TPAMI 2021). One model
handles every noise level σ ∈ [0, 50] because the noise level is
passed in as a 4th input channel, so no per-σ retraining is needed.
Converted from the official PyTorch checkpoint published at
[deepinv/drunet][hf-pth]. To my knowledge, no other ncnn port of
DRUNet existed on the Hub; I am uploading this one so the next person
doesn't have to spend the hour I did re-running pnnx.
## Files
| File | Size | Purpose |
|------|------|---------|
| `drunet_color.ncnn.param` | ~11 KB | Network topology (text format, 125 layers) |
| `drunet_color.ncnn.bin` | ~65 MB | fp16-quantized weights |

The fp16 quantization halves the on-disk footprint relative to the
original 130 MB fp32 `.pth`; differences from the fp32 output stay
within the noise on a real image at σ ≤ 50.
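
To get a feel for the rounding involved, the sketch below round-trips a few values through IEEE 754 half precision using only the Python standard library. It illustrates the storage format rather than ncnn's actual loader; `fp16_roundtrip` and the sample values are made up for this example.

```python
import struct

def fp16_roundtrip(x: float) -> float:
    """Store x as an IEEE 754 half-precision value and read it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Half precision keeps 10 mantissa bits, so round-to-nearest gives a
# relative error of at most 2**-11 for values in the normal range.
for value in (0.1, 0.5, 0.7071):
    stored = fp16_roundtrip(value)
    rel_err = abs(stored - value) / value
    print(f"{value:.4f} -> {stored:.7f} (relative error {rel_err:.1e})")

# On-disk size: 2 bytes per weight instead of 4.
fp32_mb = 130
print(f"fp16 estimate: ~{fp32_mb / 2:.0f} MB")
```

The size estimate matches the ~65 MB `.bin` listed in the table above.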
## Usage (ncnn C++)
```cpp
#include "net.h"
ncnn::Net net;
net.opt.use_vulkan_compute = true; // ~5× faster than CPU on a real GPU
net.load_param("drunet_color.ncnn.param");
net.load_model("drunet_color.ncnn.bin");
// Input layout: 4-channel float, (1, 4, H, W)
//   ch0..2 = RGB normalized to [0, 1]
//   ch3    = σ/255 broadcast as a constant plane
// H and W must be multiples of 8 (3 stride-2 downscale stages in the UNet).
// Pad with cv::BORDER_REPLICATE and crop the result back.
ncnn::Mat in(W, H, 4);
// ...fill RGB and σ plane...
ncnn::Extractor ex = net.create_extractor();
ex.input("in0", in);
ncnn::Mat out;
ex.extract("out0", out); // 3-channel float RGB in [0, 1]
```
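
The input layout described in the comments above can be sketched in plain Python. This is an illustration of the channel arithmetic only, assuming interleaved 8-bit RGB as the source; `pack_drunet_input` is a hypothetical helper, not part of ncnn or the companion wrapper. In C++, note that `ncnn::Mat` may pad each channel for alignment, so it is safer to write plane by plane through `in.channel(c)` than to copy one flat buffer.

```python
def pack_drunet_input(rgb_hwc: bytes, width: int, height: int,
                      sigma: float) -> list[float]:
    """Convert interleaved 8-bit RGB to planar floats plus a sigma plane.

    Returns 4 * height * width floats: the R plane, the G plane, the
    B plane, then a constant plane holding sigma / 255.
    """
    n = width * height
    if len(rgb_hwc) != 3 * n:
        raise ValueError("expected width * height * 3 bytes of RGB data")
    planes = [0.0] * (4 * n)
    for i in range(n):
        for c in range(3):
            # Pixel i, channel c moves from HWC order to CHW order.
            planes[c * n + i] = rgb_hwc[3 * i + c] / 255.0
    noise = sigma / 255.0
    for i in range(n):
        planes[3 * n + i] = noise
    return planes

# A 2x1 image: one pure-red pixel, one pure-green pixel, sigma = 20.
packed = pack_drunet_input(bytes([255, 0, 0, 0, 255, 0]), 2, 1, sigma=20)
print(packed[0:2], packed[2:4], packed[4:6])  # R, G and B planes
```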
A full C++ wrapper with tile-aware inference, replicate-padding, and
Vulkan auto-detection lives in [mlc-ncnn-img2img/src/denoise.cpp][shim].
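
As a rough illustration of the padding and tiling arithmetic, here is a minimal sketch. It is not taken from `denoise.cpp`: the helper names are hypothetical, and the 16 px overlap is an assumed value chosen for the example (only the 256 px default tile size comes from this README).

```python
def padded_size(size: int, multiple: int = 8) -> int:
    """Round size up to the next multiple (DRUNet needs multiples of 8)."""
    return -(-size // multiple) * multiple

def tile_origins(size: int, tile: int = 256, overlap: int = 16) -> list[int]:
    """Start offsets of tiles that cover [0, size) along one axis.

    The overlap value is an assumption for this sketch, not the
    wrapper's actual setting.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be in [0, tile)")
    if size <= tile:
        return [0]  # one tile; pad it up to a multiple of 8 if needed
    step = tile - overlap
    origins = list(range(0, size - tile, step))
    origins.append(size - tile)  # last tile sits flush with the far edge
    return origins

width = padded_size(1021)      # pad first (replicate the border pixels)
print(width, tile_origins(width))
```

After inference, the padded rows and columns are cropped away so the output matches the original image size.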
## How this was produced
```python
# python3, in a venv with torch + pnnx + opencv-python
import sys, torch
sys.path.insert(0, "DPIR") # clone of github.com/cszn/DPIR
from models.network_unet import UNetRes
model = UNetRes(in_nc=4, out_nc=3, nc=[64,128,256,512], nb=4,
                act_mode="R", downsample_mode="strideconv",
                upsample_mode="convtranspose")
model.load_state_dict(torch.load("drunet_color.pth", weights_only=True))
model.eval()
x = torch.randn(1, 4, 256, 256)
torch.jit.trace(model, x, check_trace=False).save("drunet_color.pt")
import pnnx
pnnx.convert("drunet_color.pt", inputs=x, fp16=True)
# → drunet_color.ncnn.param + drunet_color.ncnn.bin
```
The complete driver script is [`convert_drunet.py`][driver] in the
sibling repo.
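
A quick sanity check after conversion is to read the header of the text `.param` file, which starts with ncnn's magic number followed by the layer and blob counts. The sketch below parses that header from a string; `read_param_header` is a hypothetical helper, and the blob count in the sample text is invented for the example (only the 125-layer figure comes from this README).

```python
NCNN_PARAM_MAGIC = "7767517"

def read_param_header(text: str) -> tuple[int, int]:
    """Return (layer_count, blob_count) from an ncnn text .param file."""
    lines = text.splitlines()
    if len(lines) < 2 or lines[0].strip() != NCNN_PARAM_MAGIC:
        raise ValueError("not an ncnn text .param file")
    layer_count, blob_count = (int(tok) for tok in lines[1].split())
    return layer_count, blob_count

# Sample header; the blob count (141) is made up for illustration.
sample = "7767517\n125 141\nInput in0 0 1 in0\n"
layers, blobs = read_param_header(sample)
print(f"{layers} layers, {blobs} blobs")
```

For the real file, pass in the contents of `drunet_color.ncnn.param` and compare the layer count against the table in the Files section.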
## Performance
Eagle 1024×1024, σ=20 (single tile, no batching, fp16 weights):

| Backend                                  | Wall time  | Notes                                               |
|------------------------------------------|------------|-----------------------------------------------------|
| **Vulkan, Apple M2 Ultra (MoltenVK)**    | **1.3 s**  | warm; first run ~44 s (shader JIT)                  |
| **Vulkan, NVIDIA RTX 3060 (Windows)**    | **3.66 s** | warm avg of 3; cold 3.49 s (5.7× over same-box CPU) |
| **CPU, Apple M2 Ultra (4 threads)**      | **3.1 s**  | native arm64, AppleClang + libomp                   |
| CPU, AMD Ryzen 7 2700X (4 threads, AVX2) | 21 s       | RTX 3060 box, MLC_NCNN_CPU=1 forced                 |
| CPU, Intel Xeon (4 threads, AVX2)        | 23 s       | Linux box without hardware Vulkan                   |
| Vulkan, Mesa llvmpipe (software)         | 127 s      | **5× slower than CPU; filter this out**             |

Notable: M2 Ultra Vulkan (1.3 s) beats RTX 3060 Vulkan (3.66 s) by
~2.8×. This is likely a combination of the M2 Ultra's unified memory
(no per-tile PCIe round-trip) and its high FP16 throughput; the
default tile size of 256 px favours architectures with cheap
small-batch dispatches. A power-user knob to raise the tile size on
discrete GPUs with plenty of VRAM would be a worthwhile follow-up.
The M2 Ultra Vulkan path is ~17.7× faster than the Xeon CPU baseline (23 s vs 1.3 s).
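
The ratios quoted in this section follow directly from the warm wall times in the table. The short script below recomputes them; the dictionary keys are shorthand labels invented for this example.

```python
# Warm wall times in seconds, copied from the table above.
warm_seconds = {
    "Vulkan M2 Ultra": 1.3,
    "Vulkan RTX 3060": 3.66,
    "CPU M2 Ultra": 3.1,
    "CPU Ryzen 7 2700X": 21.0,
    "CPU Xeon": 23.0,
    "Vulkan llvmpipe": 127.0,
}

def speedup(slow: str, fast: str) -> float:
    """How many times faster `fast` is than `slow`."""
    return warm_seconds[slow] / warm_seconds[fast]

pairs = [
    ("Vulkan RTX 3060", "Vulkan M2 Ultra"),    # 2.8x
    ("CPU Ryzen 7 2700X", "Vulkan RTX 3060"),  # 5.7x
    ("CPU Xeon", "Vulkan M2 Ultra"),           # 17.7x
    ("Vulkan llvmpipe", "CPU Xeon"),           # 5.5x
]
for slow, fast in pairs:
    print(f"{fast} vs {slow}: {speedup(slow, fast):.1f}x")
```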
First-call latency on MoltenVK is dominated by Metal-shader JIT
compilation (~40 s of the ~44 s first run); subsequent calls in the
same process run at the warm number. For one-shot CLI invocations on
Apple Silicon the difference matters; caching the compiled shaders
across runs would close the gap (TODO).
If `ncnn::get_gpu_info()` only reports `llvmpipe` (Mesa software
Vulkan on headless Linux), prefer CPU: software Vulkan is a
slowdown for this model size, not a speedup. The companion C++
wrapper [auto-detects this][shim] and falls back to CPU.
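
The fallback rule can be sketched as a small decision function over the reported device names. This is an assumption about the shape of the check, not the wrapper's actual code: `choose_backend` is hypothetical, and the device-name strings are illustrative examples.

```python
def choose_backend(device_names: list[str]) -> str:
    """Return "vulkan" if any reported device is a hardware GPU, else "cpu".

    Devices whose name contains "llvmpipe" are Mesa's software
    rasterizer and are treated as unusable for this model.
    """
    hardware = [name for name in device_names
                if "llvmpipe" not in name.lower()]
    return "vulkan" if hardware else "cpu"

print(choose_backend(["llvmpipe (LLVM 15.0.7, 256 bits)"]))
print(choose_backend(["NVIDIA GeForce RTX 3060"]))
print(choose_backend([]))
```

An empty device list (no Vulkan driver at all) falls back to CPU as well.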
## License & citation
MIT, inherited from the original [cszn/DPIR][dpir] repository.
```
@article{zhang2021plug,
title={Plug-and-Play Image Restoration with Deep Denoiser Prior},
author={Zhang, Kai and Li, Yawei and Zuo, Wangmeng and Zhang, Lei and
Van Gool, Luc and Timofte, Radu},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2021}
}
```
## Companion
This model ships as the `denoise` Tool Plugin in
[mlc OpticScript][mlcos]. JS scripts can call
`Engine.tool('denoise').apply(img, {strength: 20})`, and the runtime
spawns the bundled C++ binary that loads these weights.
[dpir]: https://github.com/cszn/DPIR
[hf-pth]: https://huggingface.co/deepinv/drunet
[shim]: https://github.com/mlc911/mlc-ncnn-img2img/blob/main/src/denoise.cpp
[driver]: https://github.com/mlc911/mlc-ncnn-img2img/blob/main/conversion/drunet/convert_drunet.py
[mlcos]: https://mlcgo.eu/products/mlc-opticscript