	---
	license: other
	license_name: insightface-non-commercial
	license_link: https://github.com/deepinsight/insightface#license
	tags:
	- face-detection
	- face-recognition
	- scrfd
	- arcface
	- onnx
	- batch-inference
	- tensorrt
	library_name: onnx
	pipeline_tag: image-classification
	---

	# InsightFace Batch-Optimized Models (Max Batch 64)

	Re-exported InsightFace models with proper dynamic batch support and no cross-frame contamination.

	## ⚠️ Version Difference

	| Repository | Max Batch | Best For |
	|------------|-----------|----------|
	| [alonsorobots/scrfd_320_batched](https://huggingface.co/alonsorobots/scrfd_320_batched) | 1-32 | Standard use, tested extensively |
	| This repo | 1-64 | Experimentation with larger batches |

	Recommendation: Use max batch=32 for optimal performance. Batch=64 provides similar throughput but uses more VRAM.

	## Why These Models?

	The original InsightFace ONNX models have issues with batch inference:

	- `buffalo_l` detection model: hardcoded batch=1
	- `buffalo_l_batch` detection model: broken - has cross-frame contamination due to reshape operations that flatten the batch dimension

	These re-exports set `dynamic_axes` at export time so the batch dimension stays dynamic throughout the ONNX graph, which allows true batch inference.
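The contamination described above can be illustrated with a toy NumPy example. This is only a sketch of the general failure mode, not the actual graph of `buffalo_l_batch`: the array sizes and variable names are made up for illustration. It shows how a reshape that flattens the batch axis leaves a batch=1 style consumer unable to tell which frame a row came from.

```python
import numpy as np

N, ANCHORS = 4, 6  # toy sizes (assumption): 4 frames, 6 anchors per frame

# Fill frame i with the value i so each row's origin stays visible.
scores = np.repeat(np.arange(N, dtype=np.float32), ANCHORS).reshape(N, ANCHORS, 1)

# Batch axis flattened away: all frames share one (N * ANCHORS, 1) array.
flat = scores.reshape(-1, 1)

# Batch axis preserved: one (ANCHORS, 1) slice per frame.
kept = scores.reshape(N, -1, 1)

# A consumer written for batch=1 thresholds every row as if it came from one frame,
# so hits from frames 2 and 3 show up as anchor indices beyond ANCHORS.
mixed_rows = np.flatnonzero(flat[:, 0] >= 2)

# With the batch axis intact, hits are counted separately for each frame.
per_frame_hits = (kept[:, :, 0] >= 2).sum(axis=1)

print(flat.shape, kept.shape)
print(mixed_rows)
print(per_frame_hits)
```

Keeping the leading dimension symbolic in every reshape is what lets downstream code index results by frame.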

	## Models

	| Model | Task | Input Shape | Output | Batch | Speedup |
	|-------|------|-------------|--------|-------|---------|
	| `scrfd_10g_320_batch64.onnx` | Face Detection | `[N, 3, 320, 320]` | boxes, landmarks | 1-64 | 6× |
	| `arcface_w600k_r50_batch64.onnx` | Face Embedding | `[N, 3, 112, 112]` | 512-dim vectors | 1-64 | 10× |

	## Performance (TensorRT FP16, RTX 5090)

	### Batch Size Comparison (Full Video, 12,263 frames)

	| Batch Size | FPS | Relative |
	|------------|-----|----------|
	| 16 | 2,007 | 1.00× |
	| 32 | 2,097 | 1.04× ✅ Optimal |
	| 64 | 2,034 | 1.01× |

	Key Finding: Batch=32 is optimal, though all three batch sizes land within about 5% of each other. Batch=64 provides no additional benefit, likely due to GPU memory bandwidth saturation.

	### With Pipelined Preprocessing (4 workers)

	| Configuration | FPS | Speedup |
	|---------------|-----|---------|
	| Sequential batch=16 | 1,211 | baseline |
	| Pipelined batch=32 | 2,097 | 1.73× |
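The pipelining behind these numbers is not shown in this card, so the following is only a minimal sketch of one way to overlap preprocessing with inference using four worker threads. The helper names (`preprocess`, `chunks`, `run_pipelined`) are hypothetical, the frames are assumed to be already resized to 320×320, and the `(x - 127.5) / 128` normalization is an assumption based on upstream InsightFace SCRFD code. `infer` is any callable that takes an NCHW batch, for example a wrapper around `sess.run`.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, List

import numpy as np


def preprocess(frames: List[np.ndarray]) -> np.ndarray:
    """Stack HxWx3 uint8 frames into a normalized float32 NCHW batch."""
    batch = np.stack(frames).astype(np.float32)
    batch = (batch - 127.5) / 128.0  # assumption: upstream SCRFD normalization
    return np.ascontiguousarray(batch.transpose(0, 3, 1, 2))


def chunks(frames: Iterable[np.ndarray], size: int) -> Iterator[List[np.ndarray]]:
    """Yield lists of up to `size` frames; the last list may be shorter."""
    buf: List[np.ndarray] = []
    for frame in frames:
        buf.append(frame)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:
        yield buf


def run_pipelined(frames: Iterable[np.ndarray], infer: Callable[[np.ndarray], object],
                  batch_size: int = 32, workers: int = 4) -> list:
    """Preprocess batches in worker threads while the caller's thread runs inference."""
    results = []
    pending = deque()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for chunk in chunks(frames, batch_size):
            pending.append(pool.submit(preprocess, chunk))
            # Keep at most `workers` batches in flight so memory stays bounded.
            if len(pending) >= workers:
                results.append(infer(pending.popleft().result()))
        while pending:  # drain what is left, in submission order
            results.append(infer(pending.popleft().result()))
    return results


# Stand-in for real inference: report the batch size that was received.
frames = (np.full((320, 320, 3), i % 256, dtype=np.uint8) for i in range(70))
sizes = run_pipelined(frames, infer=lambda batch: batch.shape[0])
print(sizes)  # [32, 32, 6]
```

Results come back in submission order, so frame indices can be recovered from the batch position. Threads are used here because large NumPy operations release the GIL; heavier decoding may call for a different worker model.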

	## Usage

	```python
	import numpy as np
	import onnxruntime as ort

	# Load model
	sess = ort.InferenceSession(
	    "scrfd_10g_320_batch64.onnx",
	    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"],
	)

	# Batch inference (any size from 1-64)
	batch = np.random.randn(32, 3, 320, 320).astype(np.float32)
	outputs = sess.run(None, {"input.1": batch})

	# outputs[0:3]: scores per FPN level (strides 8, 16, 32)
	# outputs[3:6]: bboxes per FPN level
	# outputs[6:9]: keypoints per FPN level
	```
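The raw outputs still need decoding into per-frame boxes. The sketch below shows one way to do that, under assumptions this card does not confirm: each output is shaped `[N, anchors, C]`, there are two anchors per feature-map cell, scores are already probabilities, and box outputs are distances from the anchor centre in units of the stride (as in upstream InsightFace SCRFD post-processing). The function names are hypothetical, and NMS and keypoint decoding are left out.

```python
import numpy as np

STRIDES = (8, 16, 32)
NUM_ANCHORS = 2  # assumption: two anchors per feature-map cell


def anchor_centers(size: int, stride: int) -> np.ndarray:
    """Anchor centres in pixels as (x, y), row-major, shape (cells * NUM_ANCHORS, 2)."""
    n = size // stride
    ys, xs = np.mgrid[:n, :n]
    centers = np.stack([xs, ys], axis=-1).reshape(-1, 2).astype(np.float32) * stride
    return np.repeat(centers, NUM_ANCHORS, axis=0)


def decode_batch(outputs, size: int = 320, threshold: float = 0.5):
    """Return one (boxes, scores) pair per frame; boxes are [x1, y1, x2, y2]."""
    per_frame = []
    for i in range(outputs[0].shape[0]):
        boxes, scores = [], []
        for level, stride in enumerate(STRIDES):
            s = outputs[level][i].reshape(-1)                   # scores for frame i
            d = outputs[level + 3][i].reshape(-1, 4) * stride   # distances in pixels
            keep = s >= threshold
            c = anchor_centers(size, stride)[keep]
            boxes.append(np.concatenate([c - d[keep, :2], c + d[keep, 2:]], axis=1))
            scores.append(s[keep])
        per_frame.append((np.concatenate(boxes), np.concatenate(scores)))
    return per_frame


def fake_outputs(batch_size: int, size: int = 320):
    """All-zero stand-in for sess.run(): 3 score, 3 bbox, and 3 keypoint arrays."""
    return [
        np.zeros((batch_size, (size // stride) ** 2 * NUM_ANCHORS, channels), np.float32)
        for channels in (1, 4, 10)
        for stride in STRIDES
    ]


# Plant one detection in frame 3, stride-8 level, cell at row 5 and column 10.
outs = fake_outputs(4)
idx = (5 * 40 + 10) * NUM_ANCHORS
outs[0][3, idx, 0] = 0.9
outs[3][3, idx] = [1, 2, 3, 4]
detections = decode_batch(outs)
print(detections[3][0])  # [[ 72.  24. 104.  72.]]
```

Because every output keeps its leading batch axis, frame `i` is decoded from `outputs[k][i]` alone and detections never cross frames.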

	## TensorRT Configuration

	When using TensorRT, set profile shapes to support your desired batch range:

	```python
	providers = [
	    ("TensorrtExecutionProvider", {
	        "trt_fp16_enable": True,
	        "trt_engine_cache_enable": True,
	        "trt_profile_min_shapes": "input.1:1x3x320x320",
	        "trt_profile_opt_shapes": "input.1:32x3x320x320",  # Optimize for batch=32
	        "trt_profile_max_shapes": "input.1:64x3x320x320",  # Support up to 64
	    }),
	    "CUDAExecutionProvider",
	]
	```
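To avoid hand-editing three shape strings whenever the batch range changes, the options can be generated. The sketch below is a hypothetical helper (`trt_profile_options` is not part of ONNX Runtime); it only builds the same option keys and `name:NxCxHxW` strings used in the configuration above.

```python
def trt_profile_options(input_name: str, min_batch: int, opt_batch: int, max_batch: int,
                        channels: int = 3, height: int = 320, width: int = 320) -> dict:
    """Build TensorRT provider options for a dynamic batch range (hypothetical helper)."""
    if not 1 <= min_batch <= opt_batch <= max_batch:
        raise ValueError("expected 1 <= min_batch <= opt_batch <= max_batch")

    def shape(batch: int) -> str:
        return f"{input_name}:{batch}x{channels}x{height}x{width}"

    return {
        "trt_fp16_enable": True,
        "trt_engine_cache_enable": True,
        "trt_profile_min_shapes": shape(min_batch),
        "trt_profile_opt_shapes": shape(opt_batch),
        "trt_profile_max_shapes": shape(max_batch),
    }


options = trt_profile_options("input.1", 1, 32, 64)
providers = [("TensorrtExecutionProvider", options), "CUDAExecutionProvider"]
print(options["trt_profile_opt_shapes"])  # input.1:32x3x320x320
```

For the embedding model, pass `height=112, width=112` along with that model's actual input name, which is not listed in this card.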

	## Verified: No Batch Contamination

	```python
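	# Same frame processed alone vs. in a batch should give identical results.
	# Reuses `sess` and `batch` from the Usage section; `frame` is a stand-in input.
	frame = np.random.randn(3, 320, 320).astype(np.float32)
	single_output = sess.run(None, {"input.1": frame[np.newaxis, ...]})
	batch[7] = frame
	batch_output = sess.run(None, {"input.1": batch})

	# Compare the first output (stride-8 scores) for the frame at batch index 7
	max_diff = np.max(np.abs(single_output[0][0] - batch_output[0][7]))
	# max_diff < 1e-5 ✓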
	```
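The same check can be wrapped in a reusable function that compares every output rather than only the first. This is a sketch with a hypothetical helper name (`max_batch_deviation`); `run` is any callable that maps a batch to a list of arrays with a leading batch axis, for example `lambda x: sess.run(None, {"input.1": x})`. Two stand-in functions replace the real model here so the behaviour is visible without a GPU.

```python
import numpy as np


def max_batch_deviation(run, frame: np.ndarray, batch_size: int = 32,
                        slot: int = 7, seed: int = 0) -> float:
    """Largest absolute difference, over all outputs, between running `frame`
    alone and running it at position `slot` inside a batch of random frames."""
    rng = np.random.default_rng(seed)
    batch = rng.standard_normal((batch_size, *frame.shape)).astype(frame.dtype)
    batch[slot] = frame
    single = run(frame[np.newaxis])
    batched = run(batch)
    return max(float(np.max(np.abs(s[0] - b[slot]))) for s, b in zip(single, batched))


# Stand-in "models" (assumptions for illustration only):
def independent(x):
    return [x * 2.0]                              # each frame handled on its own


def contaminated(x):
    return [x - x.mean(axis=0, keepdims=True)]    # mixes statistics across frames


frame = np.random.default_rng(1).standard_normal((3, 8, 8)).astype(np.float32)
clean_dev = max_batch_deviation(independent, frame)
mixed_dev = max_batch_deviation(contaminated, frame)
print(clean_dev, mixed_dev)
```

With a real session, a small nonzero deviation can still appear from FP16 arithmetic or kernel selection, so compare against a tolerance rather than zero.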

	## Re-export Process

	These models were re-exported from InsightFace's PyTorch source using MMDetection with proper `dynamic_axes`:

	```python
	dynamic_axes = {
	    "input.1": {0: "batch"},
	    "score_8": {0: "batch"},
	    "score_16": {0: "batch"},
	    # ... all outputs
	}
	```

	## License

	For non-commercial research purposes only, per the [InsightFace license](https://github.com/deepinsight/insightface#license).

	For commercial licensing, contact: `recognition-oss-pack@insightface.ai`

	## Credits

	- Original models: [InsightFace](https://github.com/deepinsight/insightface) by Jia Guo et al.
	- SCRFD paper: [Sample and Computation Redistribution for Efficient Face Detection](https://arxiv.org/abs/2105.04714)
	- ArcFace paper: [ArcFace: Additive Angular Margin Loss for Deep Face Recognition](https://arxiv.org/abs/1801.07698)