<!--
Copyright (c) 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Binding Configuration
Additional configuration for binding a model served through the Triton Inference Server can be provided in the
`config` argument of the `bind` method. This section describes the available configuration options.

The model configuration can be adjusted by overriding the defaults of the `ModelConfig` object.
```python
from pytriton.model_config.common import DynamicBatcher

class ModelConfig:
    batching: bool = True                       # enable batching for the model
    max_batch_size: int = 4                     # maximal number of samples in a single batch
    batcher: DynamicBatcher = DynamicBatcher()  # dynamic batching configuration
    response_cache: bool = False                # enable the Triton response cache for the model
```
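For example, a minimal sketch of overriding selected defaults when binding a model (the model name, pass-through
`infer_fn`, and tensor specifications below are placeholders for illustration only):

<!--pytest.mark.skip-->
```python
import numpy as np

from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton


@batch
def _identity(**inputs):
    # Illustrative pass-through inference function; replace with your model call.
    (data,) = inputs.values()
    return {"output": data}


with Triton() as triton:
    triton.bind(
        model_name="Identity",
        infer_func=_identity,
        inputs=[Tensor(name="input", shape=(-1,), dtype=np.float32)],
        outputs=[Tensor(name="output", shape=(-1,), dtype=np.float32)],
        config=ModelConfig(batching=True, max_batch_size=16),  # override selected defaults
    )
    ...
```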
## Batching
The batching feature collects one or more samples and passes them to the model together. The model processes
multiple samples at the same time and returns the outputs for all of them together.

Batching can significantly improve throughput, because processing multiple samples at the same time makes better
use of the GPU during inference.

The Triton Inference Server is responsible for collecting multiple incoming requests into a single batch. The batch is
passed to the model, which improves inference performance (throughput and latency). This feature is called
`dynamic batching`: it collects samples from multiple clients into a single batch processed by the model.

On the PyTriton side, the `infer_fn` obtains the batch already assembled by the Triton Inference Server, so its only
responsibility is to perform the computation and return the outputs.

By default, batching is enabled for the model, and Triton's default behavior is to have dynamic batching enabled.
If your model does not support batching, use `batching=False` to disable it in Triton.
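As a minimal sketch (the surrounding `bind` call is omitted), a model that expects a single sample per request can be
bound with batching disabled:

```python
from pytriton.model_config import ModelConfig

# Disable batching for a model that processes a single sample per request;
# the config object is passed to `bind` as in the examples above.
config = ModelConfig(batching=False)
```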
## Maximal batch size
The maximal batch size defines the number of samples that can be processed by the model at the same time. This
configuration has an impact not only on throughput but also on memory usage, as a larger batch means more data loaded
into memory at the same time.

The `max_batch_size` has to be a value greater than or equal to 1.
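As a rough, illustrative calculation (the float32, image-like per-sample shape below is an assumption, not a PyTriton
requirement), the memory held by a single input batch grows linearly with `max_batch_size`:

```python
import numpy as np

from pytriton.model_config import ModelConfig

# max_batch_size must be >= 1; larger values can raise throughput but also peak memory usage.
config = ModelConfig(max_batch_size=16)

# Hypothetical per-sample input of shape (3, 224, 224) in float32.
sample_shape = (3, 224, 224)
bytes_per_batch = config.max_batch_size * int(np.prod(sample_shape)) * np.dtype(np.float32).itemsize
print(f"{bytes_per_batch / 1024 ** 2:.2f} MiB per full input batch")  # ~9.19 MiB
```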
## Dynamic batching

Dynamic batching is a Triton Inference Server feature and can be configured by defining the `DynamicBatcher`
object:
```python
from typing import Dict, Optional

from pytriton.model_config.common import QueuePolicy

class DynamicBatcher:
    max_queue_delay_microseconds: int = 0    # max time a request waits for others to form a batch
    preferred_batch_size: Optional[list] = None    # batch sizes the batcher should try to create
    preserve_ordering: bool = False    # return responses in the order requests were received
    priority_levels: int = 0    # number of priority levels allowed by the scheduler
    default_priority_level: int = 0    # priority used for requests that do not set one
    default_queue_policy: Optional[QueuePolicy] = None    # queue policy for levels without a dedicated policy
    priority_queue_policy: Optional[Dict[int, QueuePolicy]] = None    # per-priority-level queue policies
```
More about dynamic batching can be found in
the [Triton Inference Server documentation](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#dynamic-batcher)
and the [API spec](api.md).
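A minimal sketch of passing a customized `DynamicBatcher` to the model configuration (the delay and preferred batch
sizes below are illustrative values, not recommendations):

```python
from pytriton.model_config import ModelConfig
from pytriton.model_config.common import DynamicBatcher

# Let Triton wait up to 100 microseconds to collect a larger batch,
# and prefer forming batches of 4 or 8 samples.
batcher = DynamicBatcher(
    max_queue_delay_microseconds=100,
    preferred_batch_size=[4, 8],
)
config = ModelConfig(max_batch_size=8, batcher=batcher)
```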
## Response cache

The Triton Inference Server provides functionality to use a cached response for the model. To use the response cache:

- provide the `cache_config` in `TritonConfig`
- set `response_cache=True` in `ModelConfig`

More about the response cache can be found on the [Triton Response Cache](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/response_cache.md) page.
Example:

<!--pytest.mark.skip-->
```python
import numpy as np

from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton, TritonConfig

triton_config = TritonConfig(
    cache_config=[f"local,size={1024 * 1024}"],  # 1MB
)


@batch
def _add_sub(**inputs):
    a_batch, b_batch = inputs.values()
    add_batch = a_batch + b_batch
    sub_batch = a_batch - b_batch
    return {"add": add_batch, "sub": sub_batch}


with Triton(config=triton_config) as triton:
    triton.bind(
        model_name="AddSub",
        infer_func=_add_sub,
        inputs=[Tensor(shape=(1,), dtype=np.float32), Tensor(shape=(1,), dtype=np.float32)],
        outputs=[Tensor(shape=(1,), dtype=np.float32), Tensor(shape=(1,), dtype=np.float32)],
        config=ModelConfig(max_batch_size=8, response_cache=True),
    )
    ...
```