Bilateral Reference for High-Resolution Dichotomous Image Segmentation
Paper: arXiv 2401.03407
Optimized BiRefNet (full Swin-L backbone, 220M parameters) for Intel Xeon W-2145 CPU inference using OpenVINO with NNCF INT8 post-training quantization.
| Path | Format | Size | Resolution | Best For |
|---|---|---|---|---|
| `openvino_int8/int8_1024x1024.xml` | INT8 (NNCF PTQ) | 238 MB | 1024×1024 | Maximum quality; model size reduction |
| `openvino_int8/int8_512x512.xml` | INT8 (NNCF PTQ) | 238 MB | 512×512 | ~4× lower latency vs 1024² |
| `openvino_fp16/fp16_1024x1024.xml` | FP16 weights | 442 MB | both | Minimal latency change on Skylake-SP |
| `openvino_fp32/fp32_1024x1024.xml` | FP32 | 442 MB | both | Reference accuracy |
| `openvino_int8wo/int8wo_1024x1024.xml` | INT8 weight-only | 237 MB | both | Alternative compression |
| CPU Architecture | Recommended Variant | Why |
|---|---|---|
| Skylake-SP (Xeon W-2145) | OpenVINO FP32/FP16/INT8 (any) | ~3× faster than PyTorch; all variants perform similarly because the bottleneck is memory bandwidth, not compute |
| Cascade Lake+ (VNNI) | INT8 (NNCF) | ~6× faster than PyTorch; VNNI INT8 dot products double throughput over Skylake-SP |
| Sapphire Rapids (AMX) | INT8 or FP16 | ~12-24× faster; AMX tile engines give massive INT8/BF16 throughput |
| AMD Zen 4 (AVX-512) | OpenVINO FP32/FP16 | No VNNI, so INT8 gains are limited by memory bandwidth; graph fusion is the main win |
| Any CPU with <16GB RAM | INT8 | ~2× model size reduction (442 MB → 238 MB) |
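The table above can be encoded as a small selection helper. This is a sketch: `pick_variant` and the ISA feature names are illustrative, not part of this repo, and in practice you would read the flags from `/proc/cpuinfo` or query the OpenVINO CPU device properties.

```python
def pick_variant(isa_features: set, ram_gb: float) -> str:
    """Map detected CPU features to the recommended OpenVINO variant
    per the architecture table above (names are illustrative)."""
    if ram_gb < 16:
        return "openvino_int8/int8_1024x1024.xml"   # ~2x smaller weights
    if "amx_int8" in isa_features or "avx512_vnni" in isa_features:
        return "openvino_int8/int8_1024x1024.xml"   # VNNI/AMX double INT8 throughput
    # Skylake-SP / Zen 4: memory-bound, all precisions perform similarly
    return "openvino_fp32/fp32_1024x1024.xml"

print(pick_variant({"avx512_vnni"}, ram_gb=32))
```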
| Model | Backbone | Params | OpenVINO vs PyTorch | Why the difference |
|---|---|---|---|---|
| BiRefNet Full | Swin-L | 220M | ~3× | Backbone is memory-bound (197M params stream through 11MB L3); decoder is identical but proportionally smaller |
| BiRefNet Lite | Swin-T | 44M | ~7× | Smaller backbone fits cache better; decoder optimizations dominate |
Both models use the same decoder; the speedup difference comes entirely from the encoder size.
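A back-of-envelope check of the memory-bound claim, using the figures from the table above (197M backbone parameters, 11 MB L3 on the Xeon W-2145):

```python
# Why the Swin-L backbone is memory-bound on Skylake-SP.
params_l = 197e6       # Swin-L backbone parameters (from the table above)
l3_bytes = 11e6        # Xeon W-2145 L3 cache: 11 MB

fp32_mb = params_l * 4 / 1e6
print(f"Swin-L FP32 weights: {fp32_mb:.0f} MB vs {l3_bytes / 1e6:.0f} MB L3")
# The weights are ~70x larger than L3, so every inference streams them from
# DRAM; INT8 halves that traffic relative to FP16 (quarters it vs FP32).
```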
```python
import openvino as ov
import numpy as np
from PIL import Image
from torchvision import transforms

# Load the optimized model (on Skylake-SP: INT8 for size, FP32/FP16 for speed)
core = ov.Core()
model = core.read_model("openvino_int8/int8_1024x1024.xml")

# Optimal config for Xeon W-2145 (8 physical cores)
config = {
    "PERFORMANCE_HINT": "LATENCY",
    "NUM_STREAMS": "1",
    "INFERENCE_NUM_THREADS": "8",
}
compiled = core.compile_model(model, "CPU", config)
infer_req = compiled.create_infer_request()

# Preprocess: resize, normalize with ImageNet statistics
transform = transforms.Compose([
    transforms.Resize((1024, 1024)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
image = Image.open("input.jpg").convert("RGB")
input_tensor = transform(image).unsqueeze(0).numpy()

# Inference (index 0 = the model's single input / single output)
result = infer_req.infer({0: input_tensor})
mask = 1 / (1 + np.exp(-result[0]))  # sigmoid over the raw logits
mask = (mask[0, 0] * 255).astype(np.uint8)

# Save the mask at the original image size
Image.fromarray(mask).resize(image.size).save("mask.png")
```
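The script above saves the soft sigmoid mask. If you need a hard cutout instead, a small post-processing helper can threshold the logits directly; `logits_to_mask` below is illustrative, not part of this repo.

```python
import numpy as np

def logits_to_mask(logits: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Convert raw model logits of shape (N, 1, H, W) to a binary uint8 mask (H, W)."""
    prob = 1.0 / (1.0 + np.exp(-logits[0, 0]))    # same sigmoid as above
    return np.where(prob >= threshold, 255, 0).astype(np.uint8)

toy_logits = np.array([[[[-4.0, 0.0], [2.0, 6.0]]]])  # toy (1, 1, 2, 2) output
print(logits_to_mask(toy_logits))
```

Hard-thresholding trades the soft alpha edges for a clean binary cutout; keep the sigmoid output directly when you want anti-aliased mattes.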
```bash
export OMP_NUM_THREADS=8  # physical cores only
export KMP_AFFINITY="granularity=fine,compact,1,0"
export KMP_BLOCKTIME=1
export OMP_WAIT_POLICY=ACTIVE
```
```bash
pip install openvino torchvision pillow
```
OpenVINO gains on the Xeon W-2145 and the effect of INT8 quantization are summarized in the tables above.

Resolution: 1024×1024 for maximum quality; 512×512 for ~4× lower latency with minor quality loss.
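The ~4× figure follows directly from the pixel count, since compute in the convolutional and windowed-attention layers scales roughly linearly with resolution area:

```python
# Halving each spatial dimension quarters the number of pixels processed.
pixels_hi = 1024 * 1024
pixels_lo = 512 * 512
print(pixels_hi / pixels_lo)  # 4.0
```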
Backbone: Swin-L (`swin_v1_l`), 197M params, window size 12

Base model: [ZhengPeng7/BiRefNet](https://huggingface.co/ZhengPeng7/BiRefNet)