# BiRefNet (Full): Optimized for Intel Xeon W-2145 (Skylake-SP)

Optimized BiRefNet (full Swin-L backbone, 220M parameters) for Intel Xeon W-2145 CPU inference using OpenVINO with NNCF INT8 post-training quantization.

## Model Variants

| Path | Format | Size | Resolution | Best For |
|---|---|---|---|---|
| `openvino_int8/int8_1024x1024.xml` | INT8 (NNCF PTQ) | 238 MB | 1024×1024 | Maximum quality; model size reduction |
| `openvino_int8/int8_512x512.xml` | INT8 (NNCF PTQ) | 238 MB | 512×512 | ~4× lower latency vs 1024×1024 |
| `openvino_fp16/fp16_1024x1024.xml` | FP16 weights | 442 MB | both | Minimal latency change on Skylake-SP |
| `openvino_fp32/fp32_1024x1024.xml` | FP32 | 442 MB | both | Reference accuracy |
| `openvino_int8wo/int8wo_1024x1024.xml` | INT8 weight-only | 237 MB | both | Alternative compression |

## Which variant to use

| CPU Architecture | Recommended Variant | Why |
|---|---|---|
| Skylake-SP (Xeon W-2145) | OpenVINO FP32/FP16/INT8 (any) | ~3× faster than PyTorch; all variants perform similarly because the bottleneck is memory bandwidth, not compute |
| Cascade Lake+ (VNNI) | INT8 (NNCF) | ~6× faster than PyTorch; VNNI INT8 dot-products double throughput over Skylake-SP |
| Sapphire Rapids (AMX) | INT8 or FP16 | ~12-24× faster; AMX tile engines give massive INT8/BF16 throughput |
| AMD Zen 4 (AVX-512) | OpenVINO FP32/FP16 | No VNNI → INT8 gains limited to memory bandwidth; graph fusion is the main win |
| Any CPU with <16 GB RAM | INT8 | ~2× model size reduction (442 MB → 238 MB) |
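The table above can be turned into a small selection heuristic based on CPU feature flags. A minimal sketch, assuming Linux `/proc/cpuinfo` flag names (`avx512_vnni`, `amx_int8`); the helper names are illustrative, not part of this repository:

```python
def read_cpu_flags(path: str = "/proc/cpuinfo") -> set:
    """Collect CPU feature flags on Linux (returns an empty set elsewhere)."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()

def pick_variant(flags: set, ram_gb: float = 64.0) -> str:
    """Map the recommendation table above to a model path."""
    int8 = "openvino_int8/int8_1024x1024.xml"
    if ram_gb < 16:
        return int8                      # smallest memory footprint
    if "amx_int8" in flags or "avx512_vnni" in flags:
        return int8                      # AMX / VNNI: real INT8 compute wins
    # Skylake-SP and Zen 4: variants perform similarly; use the reference
    return "openvino_fp32/fp32_1024x1024.xml"

print(pick_variant(read_cpu_flags()))
```

On a machine without VNNI or AMX this falls back to FP32, matching the "all variants perform similarly" row for Skylake-SP.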

## Why Full BiRefNet gains less than Lite on Skylake-SP

| Model | Backbone | Params | OpenVINO vs PyTorch | Why the difference |
|---|---|---|---|---|
| BiRefNet Full | Swin-L | 220M | ~3× | Backbone is memory-bound (197M params stream through 11 MB L3); decoder is identical but proportionally smaller |
| BiRefNet Lite | Swin-T | 44M | ~7× | Smaller backbone fits cache better; decoder optimizations dominate |

Both models use the same decoder; the speedup difference comes entirely from the encoder size.
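The memory-bound claim can be sanity-checked with back-of-the-envelope arithmetic: the encoder's FP32 weights exceed the 11 MB L3 by a large factor, so they stream from DRAM on every inference pass. A quick check using the numbers in this card:

```python
# Rough arithmetic behind the memory-bound claim (parameter counts from
# this card; 11 MB is the Xeon W-2145's shared L3).
encoder_params = 197e6               # Swin-L backbone parameters
l3_bytes = 11 * 1024**2              # shared L3 capacity
fp32_bytes = encoder_params * 4      # weight bytes streamed per pass, FP32
int8_bytes = encoder_params * 1      # same traffic with INT8 weights

# Weights overflow the L3 by ~68x, so they cannot stay cache-resident;
# INT8 cuts the streamed weight traffic 4x relative to FP32.
assert fp32_bytes / l3_bytes > 60
print(f"FP32 weights: {fp32_bytes / 1e6:.0f} MB, INT8: {int8_bytes / 1e6:.0f} MB")
# prints: FP32 weights: 788 MB, INT8: 197 MB
```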

## Quick Start

```python
import openvino as ov
import numpy as np
from PIL import Image
from torchvision import transforms

# Load optimized model (on Skylake-SP: INT8 for size, FP32/FP16 for speed)
core = ov.Core()
model = core.read_model("openvino_int8/int8_1024x1024.xml")

# Optimal config for Xeon W-2145 (8 physical cores)
config = {
    "PERFORMANCE_HINT": "LATENCY",
    "NUM_STREAMS": "1",
    "INFERENCE_NUM_THREADS": "8",
}
compiled = core.compile_model(model, "CPU", config)
infer_req = compiled.create_infer_request()

# Preprocess: resize, scale to [0, 1], ImageNet-normalize
transform = transforms.Compose([
    transforms.Resize((1024, 1024)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

image = Image.open("input.jpg").convert("RGB")
input_tensor = transform(image).unsqueeze(0).numpy()

# Inference (input and output referenced by port index 0)
result = infer_req.infer({0: input_tensor})
mask = 1 / (1 + np.exp(-result[0]))  # sigmoid over raw logits
mask = (mask[0, 0] * 255).astype(np.uint8)

# Save the mask at the original image resolution
Image.fromarray(mask).resize(image.size).save("mask.png")
```
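The inline `1 / (1 + np.exp(-result[0]))` above can overflow for large negative logits (a harmless RuntimeWarning, but easy to avoid). A sketch of a numerically stable post-processing helper; the function names are illustrative, not part of any package:

```python
import numpy as np

def stable_sigmoid(x: np.ndarray) -> np.ndarray:
    """Numerically stable sigmoid: exp is only called on non-positive arguments."""
    out = np.empty_like(x, dtype=np.float64)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out

def logits_to_mask(logits: np.ndarray) -> np.ndarray:
    """Raw model output [1, 1, H, W] -> uint8 mask [H, W]."""
    return (stable_sigmoid(logits)[0, 0] * 255).astype(np.uint8)
```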

## Environment Setup

```bash
export OMP_NUM_THREADS=8          # physical cores only
export KMP_AFFINITY="granularity=fine,compact,1,0"
export KMP_BLOCKTIME=1
export OMP_WAIT_POLICY=ACTIVE

pip install openvino torchvision pillow
```

## Optimization Notes

OpenVINO gains on Xeon W-2145:

1. AVX-512 SIMD vectorization: 16 FP32 lanes per 512-bit instruction
2. Graph-level fusion: Conv+BN+Act and attention patterns merged into single kernels
3. nChw16c memory layout: aligned to AVX-512 registers, maximizes L3 cache efficiency
4. Static shape compilation: no dynamic dispatch overhead
5. Thread affinity: avoids hyperthreading contention on the 11 MB shared L3
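The nChw16c layout mentioned above blocks the channel dimension in groups of 16 so that one AVX-512 register load pulls 16 contiguous FP32 channel values. A numpy illustration of the re-blocking (layout math only, not OpenVINO internals):

```python
import numpy as np

def to_nChw16c(x: np.ndarray) -> np.ndarray:
    """Re-block NCHW -> nChw16c: channels split into groups of 16,
    with the 16-channel block innermost (one AVX-512 FP32 register)."""
    n, c, h, w = x.shape
    assert c % 16 == 0, "channel count must be a multiple of 16"
    return x.reshape(n, c // 16, 16, h, w).transpose(0, 1, 3, 4, 2).copy()

x = np.arange(1 * 32 * 2 * 2, dtype=np.float32).reshape(1, 32, 2, 2)
blocked = to_nChw16c(x)
# The innermost axis now holds 16 consecutive channels for one (h, w) position.
assert blocked.shape == (1, 2, 2, 2, 16)
```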

INT8 quantization:

- NNCF mixed-precision PTQ with 20 synthetic calibration images
- On Skylake-SP (no VNNI): gains come primarily from reduced memory traffic, not compute throughput, so all OpenVINO variants perform similarly
- On Cascade Lake+ (VNNI): INT8 dot-products give an additional 2× throughput
- On Sapphire Rapids (AMX): tile engines give an additional 4-8× throughput
- Model compression: 839 MB (PyTorch FP32) → 238 MB (OpenVINO INT8)

Resolution: 1024×1024 for maximum quality; 512×512 for ~4× lower latency with minor quality loss.
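The ~4× figure follows from pixel-count scaling: the convolutions and Swin's fixed-window attention scale roughly linearly in the number of pixels/tokens, while weight traffic is resolution-independent. A quick check of the arithmetic (a rough model, not a benchmark):

```python
# Expected latency ratio of 1024x1024 vs 512x512, assuming compute scales
# ~linearly with pixel count (holds for convs and fixed-window attention).
speedup = (1024 * 1024) / (512 * 512)
assert speedup == 4.0
```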

## Architecture

- Backbone: Swin Transformer Large (`swin_v1_l`), 197M params, window size 12
- Decoder: ASPP with deformable convolutions + bilateral reference blocks
- Total: 220M parameters

## See Also

- BiRefNet Lite optimized: 44M params, 3.5× smaller, ~7× faster on Skylake-SP, still excellent quality for most use cases