SCRFD det_500m: buffalo_s weights, dynamic batch (640×640)

Drop-in replacement for the InsightFace buffalo_s face detector (det_500m.onnx), with the fixed batch=1 input converted to a dynamic batch axis. Pass N frames in a single forward pass; each frame's output is bit-identical to running the model on that frame individually.


Background

PR deepinsight/insightface#1781 fixed the SCRFD export script to produce dynamic batch axes, but the distributed buffalo_s model pack was never re-exported. Calling the original session with N > 1 frames crashes:

[ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Got invalid dimensions for input: input.1
index: 0  Got: 2  Expected: 1

This model is the fixed version. Tracked in deepinsight/insightface#2878.

Note on existing workarounds: alonsorobots/scrfd_320_batched is a re-export at 320×320. This model uses the original buffalo_s weights at 640×640, which preserves small-face detection quality.


How the fix works

Every output path in the original SCRFD graph ends with:

Conv [N,C,H,W] → Transpose perm=[2,3,0,1] → Reshape [-1, K]

The Transpose moves the batch dim between the spatial and channel dims; the Reshape then flattens everything together, interleaving rows from different frames. Three targeted ONNX graph edits fix this:
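The interleaving is easy to reproduce in NumPy with toy shapes (a sketch with N=2 frames, C=4 channels, and a 2×2 spatial grid; names are illustrative):

```python
import numpy as np

# Toy Conv output: 2 frames, 4 channels, 2x2 spatial grid
x = np.arange(2 * 4 * 2 * 2).reshape(2, 4, 2, 2)

# Original graph: perm=[2,3,0,1] puts the batch dim between spatial and
# channel dims, so the flattening Reshape alternates rows of frame 0 / frame 1
bad = x.transpose(2, 3, 0, 1).reshape(-1, 4)

# Fixed graph: batch stays outermost, each frame flattens on its own,
# and frame 0's rows match the single-frame (batch=1) result exactly
good = x.transpose(0, 2, 3, 1).reshape(2, -1, 4)
single = x[0:1].transpose(2, 3, 0, 1).reshape(-1, 4)
assert np.array_equal(good[0], single)
```

In `bad`, rows belonging to frame 1 (values ≥ 16 here) alternate with frame 0's rows, which is exactly why single-frame post-processing breaks on the original graph with N > 1.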

  1. Input batch dim → dynamic (1 → "batch")
  2. 9 output Transpose nodes: perm [2,3,0,1] → perm [0,2,3,1] (batch stays outermost)
  3. 3 Reshape initializers: [-1, K] → [0, -1, K] (0 copies the batch dim → [N, anchors, K])

Source + surgery script: ceyxasm/insightface-det-batch-fix


Usage

import cv2, numpy as np, onnxruntime as ort
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(repo_id="ceyxprime/scrfd_640_batched", filename="det_500m_fixed.onnx")

sess = ort.InferenceSession(model_path, providers=["CUDAExecutionProvider"])
inp_name = sess.get_inputs()[0].name

def preprocess(path):
    img = cv2.imread(path)
    return cv2.dnn.blobFromImage(
        cv2.resize(img, (640, 640)), 1.0 / 128.0, (640, 640),
        (127.5, 127.5, 127.5), swapRB=True,
    )[0]  # (3, 640, 640)

# N images → one forward pass
batch = np.stack([preprocess(p) for p in image_paths])   # (N, 3, 640, 640)
outputs = sess.run(None, {inp_name: batch})

# outputs: 9 tensors, 3 scales (stride 8/16/32) × 3 heads (cls, reg, kps)
# outputs[i][n] = frame n's result for head i
# Anchor counts at 640×640: 12800 / 3200 / 800 per scale
for n in range(len(image_paths)):
    cls_s8 = outputs[0][n]   # (12800, 1)  scores, stride 8
    reg_s8 = outputs[3][n]   # (12800, 4)  box deltas, stride 8
    kps_s8 = outputs[6][n]   # (12800, 10) keypoints, stride 8
    # run your anchor decoding + NMS per frame as usual
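If you don't already have a decoder, per-frame decoding for one stride level looks roughly like this. A hedged sketch: the anchor layout (2 anchors per cell, centers on the stride grid) and the delta scaling follow the common SCRFD reference decoding, so verify against your own pipeline before relying on it:

```python
import numpy as np

def decode_stride(scores, deltas, stride, input_size=640, thresh=0.5):
    """Decode one stride level: scores (A, 1), deltas (A, 4),
    with A = (input_size/stride)^2 * 2 anchors.
    Returns (M, 5) rows of [x1, y1, x2, y2, score], pre-NMS."""
    fm = input_size // stride
    # Anchor centers on the stride grid, 2 anchors per cell
    xv, yv = np.meshgrid(np.arange(fm), np.arange(fm))
    centers = np.stack([xv, yv], axis=-1).reshape(-1, 2).astype(np.float32)
    centers = np.repeat(centers * stride, 2, axis=0)        # (A, 2)
    d = deltas * stride                                     # deltas are in stride units
    x1 = centers[:, 0] - d[:, 0]
    y1 = centers[:, 1] - d[:, 1]
    x2 = centers[:, 0] + d[:, 2]
    y2 = centers[:, 1] + d[:, 3]
    keep = scores[:, 0] >= thresh
    return np.stack([x1, y1, x2, y2, scores[:, 0]], axis=1)[keep]
```

For stride 8 at 640×640 this yields (640/8)² × 2 = 12800 anchors, matching the counts above; feed the kept boxes of all three strides into your NMS.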

Or reproduce from the original model:

pip install onnx onnxruntime insightface
python -c "import insightface; insightface.app.FaceAnalysis(name='buffalo_s').prepare(ctx_id=-1)"
# then:
git clone https://github.com/ceyxasm/insightface-det-batch-fix
cd insightface-det-batch-fix
python fix_det_batch.py \
    --model ~/.insightface/models/buffalo_s/det_500m.onnx \
    --out   det_500m_fixed.onnx

Validation

Same image duplicated 5× in one forward pass (validate_5frames.py):

Comparison                                    Max diff
ORT B=1 vs onnx2torch B=1                     ≤ 2e-5 (float backend noise)
ORT B=1 vs each of 5 batched frames           ≤ 2e-5 (same as above)
onnx2torch B=1 vs each of 5 batched frames    0.000000
All cross-frame pairs                         0.000000

All 9 output heads pass.


Model details

Property      Value
Architecture  SCRFD-500MF
Input         [N, 3, 640, 640], float32, normalised (x - 127.5) / 128.0
Outputs       9 tensors: cls×3, reg×3, kps×3 (strides 8, 16, 32)
Weights       buffalo_s (not identical to public SCRFD500.pth)
Framework     ONNX (opset 11)
Fixed from    ~/.insightface/models/buffalo_s/det_500m.onnx