# High-level comparison

## Inference engine

The inference engine performs the computation only; it doesn't manage the communication part (HTTP/GRPC API, etc.).

!!! summary

    * don't use PyTorch in production for inference
    * ONNX Runtime is a good enough API for most inference jobs
    * if you need the best performance, use TensorRT

| | Nvidia TensorRT | :material-microsoft: Microsoft ONNX Runtime | :material-facebook: Meta PyTorch | comments |
|:---|:---|:---|:---|:---|
| :octicons-rocket-16: transformer-deploy support | :material-check: | :material-check: | :material-cancel: | |
| :material-license: Licence | Apache 2, optimization engine is closed source | MIT | Modified BSD | |
| :material-api: ease of use (API) | :fontawesome-regular-angry: | :material-check-all: | :material-check-all: | Nvidia has chosen not to hide technical details, and each optimized model is specific to a single `hardware + model + data shapes` combination |
| :material-file-document-edit: ease of use (documentation) | :material-spider-thread: <br/> (spread out, incomplete) | :material-check: <br/> (improving) | :material-check-all: <br/> (strong community) | |
| :octicons-cpu-16: Hardware support | :material-check: <br/> GPU + Jetson | :material-check-all: <br/> CPU + GPU + IoT + Edge + Mobile | :material-check: <br/> CPU + GPU | |
| :octicons-stopwatch-16: Performance | :material-speedometer: | :material-speedometer-medium: | :material-speedometer-slow: | TensorRT is usually 5 to 10X faster than PyTorch when you use quantization, etc. |
| :material-target: Accuracy | :material-speedometer-medium: | :material-speedometer: | :material-speedometer: | TensorRT optimizations may be a bit too aggressive and decrease model accuracy. Recovering it requires manual modifications. |
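
Since aggressive optimizations may shift model outputs, it can help to compare the optimized engine against the reference model on the same inputs. The snippet below is only a minimal sketch of such a check in plain Python: `max_abs_diff`, `outputs_match` and the tolerance value are hypothetical names and numbers chosen for illustration, and the hard-coded lists stand in for outputs that would really come from PyTorch and from the optimized engine.

```python
from typing import Sequence


def max_abs_diff(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two output vectors."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same length")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def outputs_match(
    reference: Sequence[float], candidate: Sequence[float], atol: float = 0.1
) -> bool:
    """True when the optimized outputs stay within `atol` of the reference.

    The default tolerance is an arbitrary assumption; pick one that fits the task.
    """
    return max_abs_diff(reference, candidate) <= atol


# Placeholder values standing in for real model outputs.
reference_output = [0.0, 1.0, 2.0]  # e.g. from the original model
optimized_output = [0.25, 1.0, 1.5]  # e.g. from the optimized engine

print(max_abs_diff(reference_output, optimized_output))
print(outputs_match(reference_output, optimized_output))
```

With these placeholder values the largest gap exceeds the tolerance, which is the kind of signal that would prompt a closer look at the optimization settings.
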

## Inference HTTP/GRPC server

| | Nvidia Triton | :material-facebook: Meta TorchServe | FastAPI | comments |
|:---|:---|:---|:---|:---|
| :octicons-rocket-16: transformer-deploy support | :material-check: | :material-cancel: | :material-cancel: | |
| :material-license: Licence | Modified BSD | Apache 2 | MIT | |
| :material-api: ease of use (API) | :material-check: | :material-check: | :material-check-all: | As a classic HTTP server, FastAPI may appear easier to use |
| :material-file-document-edit: ease of use (documentation) | :material-check: | :material-check: | :material-check-all: | FastAPI has some of the most beautiful documentation ever! |
| :octicons-stopwatch-16: Performance | :material-speedometer: | :material-speedometer-medium: | :material-speedometer-slow: | FastAPI is 6-10X slower than Triton at handling user queries |
| **Support** | | | | |
| :octicons-cpu-16: CPU | :material-check: | :material-check: | :material-check: | |
| :octicons-cpu-16: GPU | :material-check: | :material-check: | :material-check: | |
| dynamic batching | :material-check: | :material-check: | :material-cancel: | combine individual inference requests together to improve inference throughput |
| concurrent model execution | :material-check: | :material-check: | :material-cancel: | run multiple models (or multiple instances of the same model) |
| pipeline | :material-check: | :material-cancel: | :material-cancel: | one or more models and the connection of input and output tensors between those models |
| native multiple backends* support | :material-check: | :material-cancel: | :material-check: | *backends: Microsoft ONNX Runtime, Nvidia TensorRT, Meta PyTorch |
| REST API | :material-check: | :material-check: | :material-check: | |
| GRPC API | :material-check: | :material-check: | :material-cancel: | |
| Inference metrics | :material-check: | :material-check: | :material-cancel: | GPU utilization, server throughput, and server latency |
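
To make the dynamic batching row more concrete, here is a toy sketch in plain Python. It is not how Triton or TorchServe implement their schedulers (real servers also bound how long a request may wait in the queue, which is left out here). `dynamic_batches`, `serve` and `toy_model` are hypothetical names used only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Request = List[float]


def dynamic_batches(pending: Sequence[Request], max_batch_size: int) -> List[List[Request]]:
    """Group queued requests, in arrival order, into batches of at most `max_batch_size`."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [
        list(pending[i : i + max_batch_size])
        for i in range(0, len(pending), max_batch_size)
    ]


def serve(
    pending: Sequence[Request],
    model_fn: Callable[[List[Request]], List[float]],
    max_batch_size: int,
) -> Tuple[List[float], int]:
    """Run the model once per batch; return per-request results and the number of model calls."""
    results: List[float] = []
    calls = 0
    for batch in dynamic_batches(pending, max_batch_size):
        results.extend(model_fn(batch))  # a single model call covers the whole batch
        calls += 1
    return results, calls


def toy_model(batch: List[Request]) -> List[float]:
    """Stand-in for a real model: returns the sum of each request's features."""
    return [sum(request) for request in batch]


queue = [[1.0], [2.0, 3.0], [4.0], [5.0, 5.0], [6.0]]
outputs, model_calls = serve(queue, toy_model, max_batch_size=2)
print(outputs, model_calls)
```

Five individual requests are answered with three model calls instead of five; on a GPU, fewer and larger calls are typically where the throughput gain comes from.
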

--8<-- "resources/abbreviations.md"