# BiRefNet (Full): Optimized for Intel Xeon W-2145 (Skylake-SP)

Optimized BiRefNet (full Swin-L backbone, 220M parameters) for Intel Xeon W-2145 CPU inference using OpenVINO with NNCF INT8 post-training quantization.

## Model Variants

| Path | Format | Size | Resolution | Best For |
|---|---|---|---|---|
| `openvino_int8/int8_1024x1024.xml` | INT8 (NNCF PTQ) | 238 MB | 1024×1024 | Maximum quality; model size reduction |
| `openvino_int8/int8_512x512.xml` | INT8 (NNCF PTQ) | 238 MB | 512×512 | ~4× lower latency vs 1024×1024 |
| `openvino_fp16/fp16_1024x1024.xml` | FP16 weights | 442 MB | both | Minimal latency change on Skylake-SP |
| `openvino_fp32/fp32_1024x1024.xml` | FP32 | 442 MB | both | Reference accuracy |
| `openvino_int8wo/int8wo_1024x1024.xml` | INT8 weight-only | 237 MB | both | Alternative compression |

## Which variant to use

| CPU Architecture | Recommended Variant | Why |
|---|---|---|
| Skylake-SP (Xeon W-2145) | OpenVINO FP32/FP16/INT8 (any) | ~3× faster than PyTorch; all variants perform similarly because the bottleneck is memory bandwidth, not compute |
| Cascade Lake+ (VNNI) | INT8 (NNCF) | ~6× faster than PyTorch; VNNI INT8 dot-products double throughput over Skylake-SP |
| Sapphire Rapids (AMX) | INT8 or FP16 | ~12-24× faster; AMX tile engines give massive INT8/BF16 throughput |
| AMD Zen 4 (AVX-512) | OpenVINO FP32/FP16 | No VNNI → INT8 gains limited to memory bandwidth; graph fusion is the main win |
| Any CPU with <16 GB RAM | INT8 | ~2× model size reduction (442 MB → 238 MB) |
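The table above can be turned into a small selection heuristic based on CPU feature flags. A minimal sketch, assuming Linux `/proc/cpuinfo` flag names (`avx512_vnni`, `amx_int8`); the helper names are illustrative, not part of this repository:

```python
def read_cpu_flags(path: str = "/proc/cpuinfo") -> set:
    """Collect CPU feature flags on Linux (returns an empty set elsewhere)."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()

def pick_variant(flags: set, ram_gb: float = 64.0) -> str:
    """Map the recommendation table above to a model path."""
    int8 = "openvino_int8/int8_1024x1024.xml"
    if ram_gb < 16:
        return int8                      # smallest memory footprint
    if "amx_int8" in flags or "avx512_vnni" in flags:
        return int8                      # AMX / VNNI: real INT8 compute wins
    # Skylake-SP and Zen 4: variants perform similarly; use the reference
    return "openvino_fp32/fp32_1024x1024.xml"

print(pick_variant(read_cpu_flags()))
```

On a machine without VNNI or AMX this falls back to FP32, matching the "all variants perform similarly" row for Skylake-SP.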

## Why Full BiRefNet gains less than Lite on Skylake-SP

| Model | Backbone | Params | OpenVINO vs PyTorch | Why the difference |
|---|---|---|---|---|
| BiRefNet Full | Swin-L | 220M | ~3× | Backbone is memory-bound (197M params stream through 11 MB L3); decoder is identical but proportionally smaller |
| BiRefNet Lite | Swin-T | 44M | ~7× | Smaller backbone fits cache better; decoder optimizations dominate |

Both models use the same decoder; the speedup difference comes entirely from the encoder size.
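The memory-bound claim can be sanity-checked with back-of-the-envelope arithmetic: the encoder's FP32 weights exceed the 11 MB L3 by a large factor, so they stream from DRAM on every inference pass. A quick check using the numbers in this card:

```python
# Rough arithmetic behind the memory-bound claim (parameter counts from
# this card; 11 MB is the Xeon W-2145's shared L3).
encoder_params = 197e6               # Swin-L backbone parameters
l3_bytes = 11 * 1024**2              # shared L3 capacity
fp32_bytes = encoder_params * 4      # weight bytes streamed per pass, FP32
int8_bytes = encoder_params * 1      # same traffic with INT8 weights

# Weights overflow the L3 by ~68x, so they cannot stay cache-resident;
# INT8 cuts the streamed weight traffic 4x relative to FP32.
assert fp32_bytes / l3_bytes > 60
print(f"FP32 weights: {fp32_bytes / 1e6:.0f} MB, INT8: {int8_bytes / 1e6:.0f} MB")
# prints: FP32 weights: 788 MB, INT8: 197 MB
```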

## Quick Start

```python
import openvino as ov
import numpy as np
from PIL import Image
from torchvision import transforms

# Load optimized model (on Skylake-SP: INT8 for size, FP32/FP16 for speed)
core = ov.Core()
model = core.read_model("openvino_int8/int8_1024x1024.xml")

# Optimal config for Xeon W-2145 (8 physical cores)
config = {
    "PERFORMANCE_HINT": "LATENCY",
    "NUM_STREAMS": "1",
    "INFERENCE_NUM_THREADS": "8",
}
compiled = core.compile_model(model, "CPU", config)
infer_req = compiled.create_infer_request()

# Preprocess: resize, scale to [0, 1], ImageNet-normalize
transform = transforms.Compose([
    transforms.Resize((1024, 1024)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

image = Image.open("input.jpg").convert("RGB")
input_tensor = transform(image).unsqueeze(0).numpy()

# Inference (input and output referenced by port index 0)
result = infer_req.infer({0: input_tensor})
mask = 1 / (1 + np.exp(-result[0]))  # sigmoid over raw logits
mask = (mask[0, 0] * 255).astype(np.uint8)

# Save the mask at the original image resolution
Image.fromarray(mask).resize(image.size).save("mask.png")
```
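The inline `1 / (1 + np.exp(-result[0]))` above can overflow for large negative logits (a harmless RuntimeWarning, but easy to avoid). A sketch of a numerically stable post-processing helper; the function names are illustrative, not part of any package:

```python
import numpy as np

def stable_sigmoid(x: np.ndarray) -> np.ndarray:
    """Numerically stable sigmoid: exp is only called on non-positive arguments."""
    out = np.empty_like(x, dtype=np.float64)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out

def logits_to_mask(logits: np.ndarray) -> np.ndarray:
    """Raw model output [1, 1, H, W] -> uint8 mask [H, W]."""
    return (stable_sigmoid(logits)[0, 0] * 255).astype(np.uint8)
```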

## Environment Setup

```bash
export OMP_NUM_THREADS=8          # physical cores only
export KMP_AFFINITY="granularity=fine,compact,1,0"
export KMP_BLOCKTIME=1
export OMP_WAIT_POLICY=ACTIVE

pip install openvino torchvision pillow
```

## Optimization Notes

OpenVINO gains on Xeon W-2145:

1. AVX-512 SIMD vectorization: 16 FP32 lanes per 512-bit instruction
2. Graph-level fusion: Conv+BN+Act and attention patterns merged into single kernels
3. nChw16c memory layout: aligned to AVX-512 registers, maximizes L3 cache efficiency
4. Static shape compilation: no dynamic dispatch overhead
5. Thread affinity: avoids hyperthreading contention on the 11 MB shared L3
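The nChw16c layout mentioned above blocks the channel dimension in groups of 16 so that one AVX-512 register load pulls 16 contiguous FP32 channel values. A numpy illustration of the re-blocking (layout math only, not OpenVINO internals):

```python
import numpy as np

def to_nChw16c(x: np.ndarray) -> np.ndarray:
    """Re-block NCHW -> nChw16c: channels split into groups of 16,
    with the 16-channel block innermost (one AVX-512 FP32 register)."""
    n, c, h, w = x.shape
    assert c % 16 == 0, "channel count must be a multiple of 16"
    return x.reshape(n, c // 16, 16, h, w).transpose(0, 1, 3, 4, 2).copy()

x = np.arange(1 * 32 * 2 * 2, dtype=np.float32).reshape(1, 32, 2, 2)
blocked = to_nChw16c(x)
# The innermost axis now holds 16 consecutive channels for one (h, w) position.
assert blocked.shape == (1, 2, 2, 2, 16)
```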

INT8 quantization:

- NNCF mixed-precision PTQ with 20 synthetic calibration images
- On Skylake-SP (no VNNI): gains come primarily from reduced memory traffic, not compute throughput, so all OpenVINO variants perform similarly
- On Cascade Lake+ (VNNI): INT8 dot-products give an additional 2× throughput
- On Sapphire Rapids (AMX): tile engines give an additional 4-8× throughput
- Model compression: 839 MB (PyTorch FP32) → 238 MB (OpenVINO INT8)

Resolution: 1024×1024 for maximum quality; 512×512 for ~4× lower latency with minor quality loss.
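The ~4× figure follows from pixel-count scaling: the convolutions and Swin's fixed-window attention scale roughly linearly in the number of pixels/tokens, while weight traffic is resolution-independent. A quick check of the arithmetic (a rough model, not a benchmark):

```python
# Expected latency ratio of 1024x1024 vs 512x512, assuming compute scales
# ~linearly with pixel count (holds for convs and fixed-window attention).
speedup = (1024 * 1024) / (512 * 512)
assert speedup == 4.0
```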

## Architecture

- Backbone: Swin Transformer Large (`swin_v1_l`), 197M params, window size 12
- Decoder: ASPP with deformable convolutions + bilateral reference blocks
- Total: 220M parameters

## See Also

- BiRefNet Lite optimized: 44M params, 3.5× smaller, ~7× faster on Skylake-SP, still excellent quality for most use cases