	---
	license: mit
	tags:
	- ncnn
	- vulkan
	- denoising
	- image-restoration
	- drunet
	- dpir
	library_name: ncnn
	pipeline_tag: image-to-image
	---

	# DRUNet (color) — ncnn port

	ncnn-compatible weights for DRUNet, the plug-and-play color
	denoiser from [cszn/DPIR][dpir] (Zhang et al., *Plug-and-Play Image
	Restoration with Deep Denoiser Prior*, IEEE TPAMI 2021). One model
	handles every noise level σ ∈ [0, 50] because the noise level is
	passed in as a 4th input channel — no per-σ retraining needed.

	Converted from the official PyTorch checkpoint published at
	[deepinv/drunet][hf-pth]. To my knowledge no other ncnn port of
	DRUNet existed on the Hub — uploading so the next person doesn't
	have to spend the hour I did re-running pnnx.

	## Files

	| File | Size | Purpose |
	|------|------|---------|
	| `drunet_color.ncnn.param` | ~11 KB | Network topology (text format, 125 layers) |
	| `drunet_color.ncnn.bin` | ~65 MB | fp16-quantized weights |

	Storing the weights as fp16 halves the on-disk footprint relative to
	the original 130 MB fp32 `.pth`. On a real image at σ ≤ 50, any
	differences from the fp32 output are within the noise and not
	visually perceptible.
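
As a rough illustration of what fp16 storage costs in precision, the sketch below round-trips a few values through IEEE 754 half precision with Python's standard `struct` module and redoes the size arithmetic. The sample values are made up for illustration, not read from the checkpoint, and the 130 MB figure is the approximate size quoted above.

```python
import struct

def fp16_roundtrip(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value and widen it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Hypothetical weight values (illustrative only, not taken from the checkpoint).
samples = [0.1234567, -0.0312, 1.5, 0.00071]

rel_errors = [abs(fp16_roundtrip(w) - w) / abs(w) for w in samples]
for w, err in zip(samples, rel_errors):
    print(f"{w:+.7f} -> {fp16_roundtrip(w):+.7f}  relative error {err:.1e}")

# fp16 keeps 10 explicit mantissa bits, so values in its normal range
# round with a relative error of at most 2**-11 (about 4.9e-4).
max_rel_error = max(rel_errors)

# Size arithmetic: same parameter count, half the bytes per parameter.
fp32_bytes = 130e6                              # approximate .pth size
n_params = fp32_bytes / struct.calcsize("<f")   # 4 bytes per fp32 value
fp16_bytes = n_params * struct.calcsize("<e")   # 2 bytes per fp16 value
print(f"~{n_params / 1e6:.1f} M parameters, ~{fp16_bytes / 1e6:.0f} MB as fp16")
```

The per-weight rounding error says nothing by itself about output quality; it only shows the scale of the perturbation the network has to tolerate.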

	## Usage (ncnn C++)

	```cpp
	#include "net.h"

	ncnn::Net net;
	net.opt.use_vulkan_compute = true; // 2.4× to 5.7× faster than same-box CPU on a hardware GPU (see Performance)
	net.load_param("drunet_color.ncnn.param");
	net.load_model("drunet_color.ncnn.bin");

	// Input layout: 4-channel float, (1, 4, H, W)
	// ch0..2 = RGB normalized to [0, 1]
	// ch3 = σ/255 broadcast as a constant plane
	// H and W must be multiples of 8 (3 stride-2 downscale stages in the UNet).
	// Pad with cv::BORDER_REPLICATE and crop the result back.

	ncnn::Mat in(W, H, 4);
	// ...fill RGB and σ plane...

	ncnn::Extractor ex = net.create_extractor();
	ex.input("in0", in);
	ncnn::Mat out;
	ex.extract("out0", out); // 3-channel float RGB, nominally in [0, 1]; clamp before converting to 8-bit
	```
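
The layout and padding rules in the comments above can be prototyped without ncnn or OpenCV. The pure-Python sketch below builds the four input planes from an 8-bit RGB image, replicate-padding on the bottom and right edges up to the next multiple of 8, and crops a plane back afterwards. The helper names (`padded_size`, `build_input_planes`, `crop`) and the choice to pad only bottom/right are assumptions for illustration; the C++ wrapper may pad differently.

```python
def padded_size(n: int, multiple: int = 8) -> int:
    """Smallest multiple of `multiple` that is >= n."""
    return -(-n // multiple) * multiple

def build_input_planes(rgb, sigma: float):
    """rgb: H x W nested list of (r, g, b) tuples with 8-bit values.

    Returns (planes, h, w): four padded planes (R, G, B, sigma) as nested
    lists of floats, plus the original height and width for cropping.
    """
    h, w = len(rgb), len(rgb[0])
    hp, wp = padded_size(h), padded_size(w)
    planes = []
    for c in range(3):
        plane = []
        for y in range(hp):
            src_row = rgb[min(y, h - 1)]  # replicate the last row below the image
            # replicate the last column to the right of the image
            plane.append([src_row[min(x, w - 1)][c] / 255.0 for x in range(wp)])
        planes.append(plane)
    # ch3: the noise level sigma/255 as a constant plane
    planes.append([[sigma / 255.0] * wp for _ in range(hp)])
    return planes, h, w

def crop(plane, h: int, w: int):
    """Undo the padding on one plane."""
    return [row[:w] for row in plane[:h]]

# Hypothetical 10x5 test image; a real caller would decode an image file.
img = [[(x * 20, y * 20, 128) for x in range(10)] for y in range(5)]
planes, h, w = build_input_planes(img, sigma=20)
print(len(planes), "planes of", len(planes[0]), "x", len(planes[0][0]))
```

The same plane order (R, G, B, σ) is what would be copied into the 4-channel `ncnn::Mat`, and `crop` would be applied to each of the three output planes.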

	A full C++ wrapper with tile-aware inference, replicate-padding, and
	Vulkan auto-detection lives in [mlc-ncnn-img2img/src/denoise.cpp][shim].

	## How this was produced

	```python
	# python3, in a venv with torch + pnnx + opencv-python
	import sys, torch
	sys.path.insert(0, "DPIR") # clone of github.com/cszn/DPIR
	from models.network_unet import UNetRes

	model = UNetRes(in_nc=4, out_nc=3, nc=[64, 128, 256, 512], nb=4,
	                act_mode="R", downsample_mode="strideconv",
	                upsample_mode="convtranspose")
	model.load_state_dict(torch.load("drunet_color.pth", weights_only=True))
	model.eval()

	x = torch.randn(1, 4, 256, 256)
	torch.jit.trace(model, x, check_trace=False).save("drunet_color.pt")

	import pnnx
	pnnx.convert("drunet_color.pt", inputs=x, fp16=True)
	# → drunet_color.ncnn.param + drunet_color.ncnn.bin
	```

	The complete driver script is [`convert_drunet.py`][driver] in the
	sibling repo.

	## Performance

	Eagle 1024×1024, σ=20 (single tile, no batching, fp16 weights):

	| Backend | Wall time | Notes |
	|---------|-----------|-------|
	| Vulkan, Apple M2 Ultra (MoltenVK) | 1.3 s | warm; first run ~44 s (shader JIT) |
	| Vulkan, NVIDIA RTX 3060 (Windows) | 3.66 s | warm avg of 3; cold 3.49 s; 5.7× faster than same-box CPU |
	| CPU, Apple M2 Ultra (4 threads) | 3.1 s | native arm64, AppleClang + libomp |
	| CPU, AMD Ryzen 7 2700X (4 threads, AVX2) | 21 s | RTX 3060 box, forced with `MLC_NCNN_CPU=1` |
	| CPU, Intel Xeon (4 threads, AVX2) | 23 s | Linux box without hardware Vulkan |
	| Vulkan, Mesa llvmpipe (software) | 127 s | 5× slower than CPU; filter this device out |
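
The speedup figures quoted in this section follow from the warm wall times in the table; the short script below recomputes them. The dictionary keys are just labels made up for this example.

```python
# Warm wall times in seconds, copied from the table above.
wall_time = {
    "vulkan_m2_ultra": 1.3,
    "vulkan_rtx_3060": 3.66,
    "cpu_m2_ultra": 3.1,
    "cpu_ryzen_2700x": 21.0,
    "cpu_xeon": 23.0,
}

def speedup(slow: str, fast: str) -> float:
    """How many times faster `fast` is than `slow`."""
    return wall_time[slow] / wall_time[fast]

comparisons = [
    ("cpu_ryzen_2700x", "vulkan_rtx_3060"),  # GPU vs CPU on the RTX 3060 box
    ("cpu_m2_ultra", "vulkan_m2_ultra"),     # GPU vs CPU on the M2 Ultra
    ("vulkan_rtx_3060", "vulkan_m2_ultra"),  # M2 Ultra GPU vs RTX 3060
    ("cpu_xeon", "vulkan_m2_ultra"),         # fastest GPU vs slowest CPU
]
for slow, fast in comparisons:
    print(f"{fast} vs {slow}: {speedup(slow, fast):.1f}x")
```

These are single-image numbers on four different machines, so the cross-machine ratios are only indicative.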

	Notable: M2 Ultra Vulkan (1.3 s) beats RTX 3060 Vulkan (3.66 s) by
	~2.8×. Likely a combination of M2 Ultra's unified memory (no
	per-tile PCIe round-trip) and its high FP16 throughput; the
	default tile size of 256 px favours architectures with cheap
	small-batch dispatches. A power-user knob to bump tile size on
	discrete GPUs with plenty of VRAM is a worthwhile follow-up.
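
To see why tile size could matter, it helps to count inference calls. The sketch below assumes a simple non-overlapping tile grid, which may not match what the C++ wrapper actually does (it could, for example, overlap tiles to hide seams); `tile_grid` is a hypothetical helper, not part of ncnn or the wrapper.

```python
def tile_grid(width: int, height: int, tile: int):
    """Return (x, y, w, h) rectangles of a non-overlapping grid covering the image."""
    return [
        (x, y, min(tile, width - x), min(tile, height - y))
        for y in range(0, height, tile)
        for x in range(0, width, tile)
    ]

for tile in (256, 512, 1024):
    n = len(tile_grid(1024, 1024, tile))
    print(f"tile {tile:4d} px -> {n:2d} inference calls for a 1024x1024 image")
```

Under these assumptions a 256 px tile needs 16 calls, a 512 px tile needs 4, and a 1024 px tile needs 1, so a larger tile means fewer upload, dispatch, and download cycles at the cost of more VRAM per call.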

	The M2 Ultra Vulkan path is ~17× faster than the Xeon CPU baseline
	(1.3 s vs 23 s). First-call latency on MoltenVK is dominated by
	Metal shader JIT compilation (~40 s of the ~44 s first run);
	subsequent invocations from the same process run at the warm number.
	For one-shot CLI invocations on Apple Silicon the difference
	matters; caching the compiled shaders across runs would close the
	gap (TODO).

	If `ncnn::get_gpu_info()` only reports `llvmpipe` (Mesa software
	Vulkan on headless Linux), prefer CPU — software Vulkan is a
	slowdown for this model size, not a speedup. The companion C++
	wrapper [auto-detects this][shim] and falls back to CPU.
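
The fallback rule can be expressed as a small pure function over the device names the Vulkan runtime reports. This is a sketch of the idea only, not the wrapper's actual logic; the function names and the example device strings are hypothetical, and only the `llvmpipe` marker comes from the text above.

```python
# Substrings that identify a software Vulkan implementation.
# Only "llvmpipe" is taken from this README; extend the tuple as needed.
SOFTWARE_VULKAN_MARKERS = ("llvmpipe",)

def is_software_vulkan(device_name: str) -> bool:
    """True if the device name looks like a CPU-backed Vulkan driver."""
    name = device_name.lower()
    return any(marker in name for marker in SOFTWARE_VULKAN_MARKERS)

def choose_backend(device_names) -> str:
    """Return 'vulkan' if at least one hardware device is present, else 'cpu'."""
    has_hardware = any(not is_software_vulkan(n) for n in device_names)
    return "vulkan" if has_hardware else "cpu"

# Illustrative device-name strings; real ones depend on the driver.
for names in (["llvmpipe (LLVM 15.0.7, 256 bits)"], ["NVIDIA GeForce RTX 3060"], []):
    print(names, "->", choose_backend(names))
```

An empty device list also resolves to CPU, which covers machines with no Vulkan loader at all.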

	## License & citation

	MIT, inherited from the original [cszn/DPIR][dpir] repository.

	```bibtex
	@article{zhang2021plug,
	  title={Plug-and-Play Image Restoration with Deep Denoiser Prior},
	  author={Zhang, Kai and Li, Yawei and Zuo, Wangmeng and Zhang, Lei and Van Gool, Luc and Timofte, Radu},
	  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
	  year={2021}
	}
	```

	## Companion

	This model ships as the `denoise` Tool Plugin in
	[mlc OpticScript][mlcos] — JS scripts can call
	`Engine.tool('denoise').apply(img, {strength: 20})` and the runtime
	spawns the bundled C++ binary that loads these weights.

	[dpir]: https://github.com/cszn/DPIR
	[hf-pth]: https://huggingface.co/deepinv/drunet
	[shim]: https://github.com/mlc911/mlc-ncnn-img2img/blob/main/src/denoise.cpp
	[driver]: https://github.com/mlc911/mlc-ncnn-img2img/blob/main/conversion/drunet/convert_drunet.py
	[mlcos]: https://mlcgo.eu/products/mlc-opticscript