---

license: other
license_name: insightface-non-commercial
license_link: https://github.com/deepinsight/insightface#license
tags:
  - face-detection
  - face-recognition
  - scrfd
  - arcface
  - onnx
  - batch-inference
  - tensorrt
library_name: onnx
pipeline_tag: object-detection
---


# InsightFace Batch-Optimized Models (Max Batch 64)

Re-exported InsightFace models with **proper dynamic batch support** and **no cross-frame contamination**.

## ⚠️ Version Difference

| Repository | Max Batch | Best For |
|------------|-----------|----------|
| [alonsorobots/scrfd_320_batched](https://huggingface.co/alonsorobots/scrfd_320_batched) | 1-32 | Standard use, tested extensively |
| **This repo** | **1-64** | Experimentation with larger batches |

**Recommendation:** Use max batch=32 for optimal performance. Batch=64 provides similar throughput but uses more VRAM.

## Why These Models?

The original InsightFace ONNX models have issues with batch inference:

- `buffalo_l` detection model: hardcoded batch=1
- `buffalo_l_batch` detection model: **broken** - has cross-frame contamination due to reshape operations that flatten the batch dimension

These re-exports fix the `dynamic_axes` in the ONNX graph for **true batch inference**.

## Models

| Model | Task | Input Shape | Output | Batch | Speedup |
|-------|------|-------------|--------|-------|---------|
| `scrfd_10g_320_batch64.onnx` | Face Detection | `[N, 3, 320, 320]` | boxes, landmarks | 1-64 | **6×** |
| `arcface_w600k_r50_batch64.onnx` | Face Embedding | `[N, 3, 112, 112]` | 512-dim vectors | 1-64 | **10×** |
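The 512-dim ArcFace vectors are typically L2-normalized and compared with cosine similarity. A self-contained sketch, with random vectors standing in for real model outputs:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Normalize each embedding to unit length, then take the dot product.
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

# Hypothetical embeddings standing in for the model's 512-dim outputs.
emb_a = np.random.randn(512).astype(np.float32)
emb_b = np.random.randn(512).astype(np.float32)

print(cosine_similarity(emb_a, emb_a))  # 1.0 for identical embeddings
```

The matching threshold (often somewhere around 0.3-0.5 cosine similarity) depends on your data and false-accept tolerance, so calibrate it on your own validation set.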

## Performance (TensorRT FP16, RTX 5090)

### Batch Size Comparison (Full Video, 12,263 frames)

| Batch Size | FPS | Relative |
|------------|-----|----------|
| 16 | 2,007 | 1.00× |
| **32** | **2,097** | **1.05×** ✅ Optimal |
| 64 | 2,034 | 1.01× |

**Key Finding:** Batch=32 is optimal. Batch=64 provides no additional benefit due to GPU memory bandwidth saturation.

### With Pipelined Preprocessing (4 workers)

| Configuration | FPS | Speedup |
|---------------|-----|---------|
| Sequential batch=16 | 1,211 | baseline |
| **Pipelined batch=32** | **2,097** | **1.73×** |
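The pipelined numbers come from overlapping CPU preprocessing with GPU inference. A minimal sketch of that pattern with `concurrent.futures`; the `preprocess` and `infer` functions here are numpy stand-ins (not the real decode/resize path or `sess.run`), used only to keep the example self-contained:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def preprocess(frame: np.ndarray) -> np.ndarray:
    # Stand-in for decode/resize/normalize work done on CPU workers.
    return frame.astype(np.float32) / 255.0

def infer(batch: np.ndarray) -> np.ndarray:
    # Stand-in for sess.run(None, {"input.1": batch}) on the GPU.
    return batch.mean(axis=(1, 2, 3))

frames = [np.random.randint(0, 255, (3, 320, 320), dtype=np.uint8) for _ in range(96)]
batch_size = 32
batches = [frames[i:i + batch_size] for i in range(0, len(frames), batch_size)]

results = []
with ThreadPoolExecutor(max_workers=4) as pool:
    # Worker threads preprocess upcoming batches while the main thread
    # consumes finished ones and runs inference.
    pending = pool.map(lambda b: np.stack([preprocess(f) for f in b]), batches)
    for batch in pending:
        results.append(infer(batch))

scores = np.concatenate(results)
print(scores.shape)  # (96,)
```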

## Usage

```python
import numpy as np
import onnxruntime as ort

# Load model
sess = ort.InferenceSession(
    "scrfd_10g_320_batch64.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"],
)

# Batch inference (any size from 1-64)
batch = np.random.randn(32, 3, 320, 320).astype(np.float32)
outputs = sess.run(None, {"input.1": batch})

# outputs[0-2]: scores per FPN level (stride 8, 16, 32)
# outputs[3-5]: bboxes per FPN level
# outputs[6-8]: keypoints per FPN level
```

## TensorRT Configuration

When using TensorRT, set profile shapes to support your desired batch range:

```python
providers = [
    ("TensorrtExecutionProvider", {
        "trt_fp16_enable": True,
        "trt_engine_cache_enable": True,
        "trt_profile_min_shapes": "input.1:1x3x320x320",
        "trt_profile_opt_shapes": "input.1:32x3x320x320",  # Optimize for batch=32
        "trt_profile_max_shapes": "input.1:64x3x320x320",  # Support up to 64
    }),
    "CUDAExecutionProvider",
]
```

## Verified: No Batch Contamination

```python
# Same frame processed alone vs. in a batch must give identical results.
frame = np.random.randn(3, 320, 320).astype(np.float32)   # e.g. a preprocessed frame
batch = np.random.randn(32, 3, 320, 320).astype(np.float32)

single_output = sess.run(None, {"input.1": frame[np.newaxis, ...]})
batch[7] = frame
batch_output = sess.run(None, {"input.1": batch})

max_diff = np.max(np.abs(single_output[0] - batch_output[0][7]))
# max_diff < 1e-5 ✓
```

## Re-export Process

These models were re-exported from InsightFace's PyTorch source using MMDetection with proper `dynamic_axes`:

```python
dynamic_axes = {
    "input.1": {0: "batch"},
    "score_8": {0: "batch"},
    "score_16": {0: "batch"},
    # ... all outputs
}
```

## License

**Non-commercial research purposes only** - per [InsightFace license](https://github.com/deepinsight/insightface#license).

For commercial licensing, contact: `recognition-oss-pack@insightface.ai`

## Credits

- Original models: [InsightFace](https://github.com/deepinsight/insightface) by Jia Guo et al.
- SCRFD paper: [Sample and Computation Redistribution for Efficient Face Detection](https://arxiv.org/abs/2105.04714)
- ArcFace paper: [ArcFace: Additive Angular Margin Loss for Deep Face Recognition](https://arxiv.org/abs/1801.07698)