# SCRFD det_500m – buffalo_s weights, dynamic batch (640×640)
Drop-in replacement for the InsightFace buffalo_s face detector (`det_500m.onnx`), with the
input axis changed from fixed batch=1 to dynamic batch. Pass N frames in a single forward
pass; each frame's output is bit-identical to running the model N times individually.
## Background
PR deepinsight/insightface#1781
fixed the SCRFD export script to produce dynamic batch axes, but the distributed
buffalo_s model pack was never re-exported. Calling the original session with N > 1 frames
crashes:
```
[ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Got invalid dimensions for input: input.1
 index: 0 Got: 2 Expected: 1
```
This model is the fixed version. Tracked in deepinsight/insightface#2878.
Note on existing workarounds: alonsorobots/scrfd_320_batched is a re-export at 320×320. This model uses the original buffalo_s weights at 640×640, which preserves small-face detection quality.
## How the fix works
Every output path in the original SCRFD graph ends with:

```
Conv [N,C,H,W] → Transpose perm=[2,3,0,1] → Reshape [-1, K]
```

The Transpose moves the batch dim between the spatial and channel dims; the Reshape then flattens everything together, interleaving frames. Three targeted ONNX graph edits fix it:
- Input batch dim → dynamic (`1` → `"batch"`)
- 9 output Transpose nodes: `perm [2,3,0,1]` → `perm [0,2,3,1]` (batch stays outermost)
- 3 Reshape initializers: `[-1, K]` → `[0, -1, K]` (`0` copies the batch dim → `[N, anchors, K]`)
Source + surgery script: ceyxasm/insightface-det-batch-fix
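The layout effect of these three edits can be reproduced in plain NumPy (a toy-sized sketch of the tensor shapes, not the actual graph surgery; the sizes are illustrative):

```python
import numpy as np

N, C, H, W, K = 3, 8, 2, 2, 4  # toy sizes; the real heads reshape to K in {1, 4, 10}

x = np.arange(N * C * H * W, dtype=np.float32).reshape(N, C, H, W)  # Conv output

# Original graph: perm=[2,3,0,1] puts the batch dim between the spatial and
# channel dims, so the flattening Reshape mixes rows from different frames.
broken = x.transpose(2, 3, 0, 1).reshape(-1, K)

# Fixed graph: perm=[0,2,3,1] keeps batch outermost, and Reshape [0, -1, K]
# copies it through, yielding one (anchors, K) block per frame.
fixed = x.transpose(0, 2, 3, 1).reshape(N, -1, K)

# Reference: each frame pushed through the single-image path on its own.
ref = np.stack([f.transpose(1, 2, 0).reshape(-1, K) for f in x])

assert np.array_equal(fixed, ref)                         # per-frame results intact
assert not np.array_equal(broken.reshape(N, -1, K), ref)  # old layout interleaves frames
```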
## Usage
```python
import cv2, numpy as np, onnxruntime as ort
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(repo_id="ceyxprime/scrfd_640_batched", filename="det_500m_fixed.onnx")
sess = ort.InferenceSession(model_path, providers=["CUDAExecutionProvider"])
inp_name = sess.get_inputs()[0].name

def preprocess(path):
    img = cv2.imread(path)
    return cv2.dnn.blobFromImage(
        cv2.resize(img, (640, 640)), 1.0 / 128.0, (640, 640),
        (127.5, 127.5, 127.5), swapRB=True,
    )[0]  # (3, 640, 640)

image_paths = ["frame0.jpg", "frame1.jpg"]  # your input frames

# N images → one forward pass
batch = np.stack([preprocess(p) for p in image_paths])  # (N, 3, 640, 640)
outputs = sess.run(None, {inp_name: batch})

# outputs: 9 tensors, 3 scales (stride 8/16/32) × 3 heads (cls, reg, kps)
# outputs[i][n] = frame n's result for head i
# Anchor counts at 640×640: 12800 / 3200 / 800 per scale
for n in range(len(image_paths)):
    cls_s8 = outputs[0][n]  # (12800, 1) scores, stride 8
    reg_s8 = outputs[3][n]  # (12800, 4) box deltas, stride 8
    kps_s8 = outputs[6][n]  # (12800, 10) keypoints, stride 8
    # run your anchor decoding + NMS per frame as usual
```
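The per-frame decoding step can be sketched as follows (a minimal NumPy version assuming SCRFD's usual layout of two anchors per grid cell and box distances expressed in stride units; `decode_stride` and its defaults are illustrative helpers, not part of this model's API):

```python
import numpy as np

def decode_stride(scores, deltas, stride, input_size=640, thresh=0.5):
    """Turn one (anchors, 1) score head and (anchors, 4) box head into x1y1x2y2 boxes."""
    side = input_size // stride                      # e.g. 80 cells per side at stride 8
    xs, ys = np.meshgrid(np.arange(side), np.arange(side))
    centers = np.stack([xs, ys], axis=-1).reshape(-1, 2).astype(np.float32) * stride
    centers = np.repeat(centers, 2, axis=0)          # 2 anchors per cell -> 12800 at stride 8
    keep = scores.ravel() >= thresh
    c, d = centers[keep], deltas[keep] * stride      # distances: left/top/right/bottom
    boxes = np.stack([c[:, 0] - d[:, 0], c[:, 1] - d[:, 1],
                      c[:, 0] + d[:, 2], c[:, 1] + d[:, 3]], axis=-1)
    return boxes, scores.ravel()[keep]

# e.g. for frame n at stride 8:
#   boxes8, scores8 = decode_stride(outputs[0][n], outputs[3][n], 8)
# Repeat for strides 16/32 (outputs[1]/[4] and outputs[2]/[5]), concatenate, then NMS.
```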
Or reproduce from the original model:
```bash
pip install onnx onnxruntime insightface
python -c "import insightface; insightface.app.FaceAnalysis(name='buffalo_s').prepare(ctx_id=-1)"
# then:
git clone https://github.com/ceyxasm/insightface-det-batch-fix
cd insightface-det-batch-fix
python fix_det_batch.py \
    --model ~/.insightface/models/buffalo_s/det_500m.onnx \
    --out det_500m_fixed.onnx
```
## Validation
Same image duplicated 5× in one forward pass (`validate_5frames.py`):
| Comparison | Max diff |
|---|---|
| ORT B=1 vs onnx2torch B=1 | ≤ 2e-5 (float backend noise) |
| ORT B=1 vs each of 5 batched frames | identical to above |
| onnx2torch B=1 vs each of 5 batched frames | 0.000000 |
| All cross-frame pairs | 0.000000 |
All 9 output heads pass.
## Model details
| Property | Value |
|---|---|
| Architecture | SCRFD-500MF |
| Input | [N, 3, 640, 640], float32, normalised (x - 127.5) / 128.0 |
| Outputs | 9 tensors: cls×3, reg×3, kps×3 (strides 8, 16, 32) |
| Weights | buffalo_s (not identical to public SCRFD500.pth) |
| Framework | ONNX (opset 11) |
| Fixed from | ~/.insightface/models/buffalo_s/det_500m.onnx |