# SCRFD det_500m – buffalo_s weights, dynamic batch (640×640)
Drop-in replacement for the InsightFace buffalo_s face detector (`det_500m.onnx`), with the
input axis changed from fixed batch=1 to dynamic batch. Pass N frames in a single forward
pass; each frame's output is bit-identical to running the model N times individually.
## Background
PR deepinsight/insightface#1781
fixed the SCRFD export script to produce dynamic batch axes, but the distributed
buffalo_s model pack was never re-exported. Calling the original session with N > 1 frames
crashes:
```
[ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Got invalid dimensions for input: input.1
 index: 0 Got: 2 Expected: 1
```
This model is the fixed version. Tracked in deepinsight/insightface#2878.
Note on existing workarounds: alonsorobots/scrfd_320_batched is a re-export at 320×320. This model uses the original buffalo_s weights at 640×640, which preserves small-face detection quality.
## How the fix works
Every output path in the original SCRFD graph ends with:

```
Conv [N,C,H,W] → Transpose perm=[2,3,0,1] → Reshape [-1, K]
```

The Transpose moves the batch dim between the spatial and channel dims; the Reshape then flattens everything together, interleaving frames. Three targeted ONNX graph edits fix it:
- Input batch dim → dynamic (`1` → `"batch"`)
- 9 output Transpose nodes: `perm [2,3,0,1]` → `perm [0,2,3,1]` (batch stays outermost)
- 3 Reshape initializers: `[-1, K]` → `[0, -1, K]` (`0` copies the batch dim → `[N, anchors, K]`)
Source + surgery script: ceyxasm/insightface-det-batch-fix
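The layout effect of these three edits can be reproduced in plain NumPy (a toy-sized sketch of the tensor shapes, not the actual graph surgery; the sizes are illustrative):

```python
import numpy as np

N, C, H, W, K = 3, 8, 2, 2, 4  # toy sizes; the real heads reshape to K in {1, 4, 10}

x = np.arange(N * C * H * W, dtype=np.float32).reshape(N, C, H, W)  # Conv output

# Original graph: perm=[2,3,0,1] puts the batch dim between the spatial and
# channel dims, so the flattening Reshape mixes rows from different frames.
broken = x.transpose(2, 3, 0, 1).reshape(-1, K)

# Fixed graph: perm=[0,2,3,1] keeps batch outermost, and Reshape [0, -1, K]
# copies it through, yielding one (anchors, K) block per frame.
fixed = x.transpose(0, 2, 3, 1).reshape(N, -1, K)

# Reference: each frame pushed through the single-image path on its own.
ref = np.stack([f.transpose(1, 2, 0).reshape(-1, K) for f in x])

assert np.array_equal(fixed, ref)                         # per-frame results intact
assert not np.array_equal(broken.reshape(N, -1, K), ref)  # old layout interleaves frames
```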
## Usage
```python
import cv2, numpy as np, onnxruntime as ort
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(repo_id="ceyxprime/scrfd_640_batched", filename="det_500m_fixed.onnx")
sess = ort.InferenceSession(model_path, providers=["CUDAExecutionProvider"])
inp_name = sess.get_inputs()[0].name

def preprocess(path):
    img = cv2.imread(path)
    return cv2.dnn.blobFromImage(
        cv2.resize(img, (640, 640)), 1.0 / 128.0, (640, 640),
        (127.5, 127.5, 127.5), swapRB=True,
    )[0]  # (3, 640, 640)

image_paths = ["frame0.jpg", "frame1.jpg"]  # your input frames

# N images → one forward pass
batch = np.stack([preprocess(p) for p in image_paths])  # (N, 3, 640, 640)
outputs = sess.run(None, {inp_name: batch})

# outputs: 9 tensors, 3 scales (stride 8/16/32) × 3 heads (cls, reg, kps)
# outputs[i][n] = frame n's result for head i
# Anchor counts at 640×640: 12800 / 3200 / 800 per scale
for n in range(len(image_paths)):
    cls_s8 = outputs[0][n]  # (12800, 1) scores, stride 8
    reg_s8 = outputs[3][n]  # (12800, 4) box deltas, stride 8
    kps_s8 = outputs[6][n]  # (12800, 10) keypoints, stride 8
    # run your anchor decoding + NMS per frame as usual
```
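The per-frame decoding step can be sketched as follows (a minimal NumPy version assuming SCRFD's usual layout of two anchors per grid cell and box distances expressed in stride units; `decode_stride` and its defaults are illustrative helpers, not part of this model's API):

```python
import numpy as np

def decode_stride(scores, deltas, stride, input_size=640, thresh=0.5):
    """Turn one (anchors, 1) score head and (anchors, 4) box head into x1y1x2y2 boxes."""
    side = input_size // stride                      # e.g. 80 cells per side at stride 8
    xs, ys = np.meshgrid(np.arange(side), np.arange(side))
    centers = np.stack([xs, ys], axis=-1).reshape(-1, 2).astype(np.float32) * stride
    centers = np.repeat(centers, 2, axis=0)          # 2 anchors per cell -> 12800 at stride 8
    keep = scores.ravel() >= thresh
    c, d = centers[keep], deltas[keep] * stride      # distances: left/top/right/bottom
    boxes = np.stack([c[:, 0] - d[:, 0], c[:, 1] - d[:, 1],
                      c[:, 0] + d[:, 2], c[:, 1] + d[:, 3]], axis=-1)
    return boxes, scores.ravel()[keep]

# e.g. for frame n at stride 8:
#   boxes8, scores8 = decode_stride(outputs[0][n], outputs[3][n], 8)
# Repeat for strides 16/32 (outputs[1]/[4] and outputs[2]/[5]), concatenate, then NMS.
```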
Or reproduce from the original model:
```bash
pip install onnx onnxruntime insightface
python -c "import insightface; insightface.app.FaceAnalysis(name='buffalo_s').prepare(ctx_id=-1)"
# then:
git clone https://github.com/ceyxasm/insightface-det-batch-fix
cd insightface-det-batch-fix
python fix_det_batch.py \
    --model ~/.insightface/models/buffalo_s/det_500m.onnx \
    --out det_500m_fixed.onnx
```
## Validation
Same image duplicated 5× in one forward pass (`validate_5frames.py`):
| Comparison | Max diff |
|---|---|
| ORT B=1 vs onnx2torch B=1 | ≤ 2e-5 (float backend noise) |
| ORT B=1 vs each of 5 batched frames | identical to above |
| onnx2torch B=1 vs each of 5 batched frames | 0.000000 |
| All cross-frame pairs | 0.000000 |
All 9 output heads pass.
## Model details
| Property | Value |
|---|---|
| Architecture | SCRFD-500MF |
| Input | [N, 3, 640, 640], float32, normalised (x - 127.5) / 128.0 |
| Outputs | 9 tensors: cls×3, reg×3, kps×3 (strides 8, 16, 32) |
| Weights | buffalo_s (not identical to public SCRFD500.pth) |
| Framework | ONNX (opset 11) |
| Fixed from | ~/.insightface/models/buffalo_s/det_500m.onnx |