	<meta charset="utf-8" /><meta name="hf:doc:metadata" content="{"title":"Introduction","local":"introduction","sections":[],"depth":2}">
	<link href="/docs/optimum.neuron/main/en/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload">
	<link rel="modulepreload" href="/docs/optimum.neuron/main/en/_app/immutable/entry/start.e7cdb183.js">
	<link rel="modulepreload" href="/docs/optimum.neuron/main/en/_app/immutable/chunks/scheduler.56725da7.js">
	<link rel="modulepreload" href="/docs/optimum.neuron/main/en/_app/immutable/chunks/singletons.635e76a3.js">
	<link rel="modulepreload" href="/docs/optimum.neuron/main/en/_app/immutable/chunks/paths.ed3a4dd8.js">
	<link rel="modulepreload" href="/docs/optimum.neuron/main/en/_app/immutable/entry/app.c5810efa.js">
	<link rel="modulepreload" href="/docs/optimum.neuron/main/en/_app/immutable/chunks/preload-helper.ec99a452.js">
	<link rel="modulepreload" href="/docs/optimum.neuron/main/en/_app/immutable/chunks/index.18a26576.js">
	<link rel="modulepreload" href="/docs/optimum.neuron/main/en/_app/immutable/nodes/0.f24306d7.js">
	<link rel="modulepreload" href="/docs/optimum.neuron/main/en/_app/immutable/chunks/each.e59479a4.js">
	<link rel="modulepreload" href="/docs/optimum.neuron/main/en/_app/immutable/nodes/9.56ea9f07.js">
	<link rel="modulepreload" href="/docs/optimum.neuron/main/en/_app/immutable/chunks/CodeBlock.bd8b9965.js">
	<link rel="modulepreload" href="/docs/optimum.neuron/main/en/_app/immutable/chunks/MermaidChart.svelte_svelte_type_style_lang.47599cff.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{"title":"Introduction","local":"introduction","sections":[],"depth":2}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h2 class="relative group"><a id="introduction" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#introduction"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Introduction</span></h2> <p data-svelte-h="svelte-1qtnios">In today’s world, almost every AI engineer is familiar with running inference by simply making an API call, but how does the backend serve that request optimally? How does the model provider or service you are using ensure that latency and throughput requirements are met?</p> <p data-svelte-h="svelte-1jmxo1j">In this blog, I will cover how to serve a model using Optimum Neuron on AWS Inferentia2 with the Hugging Face vLLM container. 
I’ll also delve into how to optimize for latency and throughput and what decisions we can make to influence our priorities.</p> <h2 class="relative group"><a id="understanding-the-tools" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#understanding-the-tools"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Understanding the Tools</span></h2> <ul data-svelte-h="svelte-ub85c8"><li>Inferentia2 chips: Inferentia2 is the second generation AWS purpose-built Machine Learning inference accelerator.</li> <li>Optimum Neuron: The interface between the 🤗 Transformers library and AWS Accelerators including AWS Trainium and AWS Inferentia.</li> <li>vLLM container: vLLM is a toolkit for deploying and serving Large Language Models (LLMs).</li> <li>GuideLLM: A tool for evaluating and optimizing the deployment of large language models (LLMs).</li></ul> <p data-svelte-h="svelte-yvuj">The instance I am using for this experiment will be <code>inf2.48xlarge</code>. 
I can check instance type as well as see each device by running <code>neuron-ls</code> which gives the following output:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->instance-type: inf2.48xlarge
	+--------+--------+--------+-----------+---------+
	|<span class="hljs-string"> NEURON </span>|<span class="hljs-string"> NEURON </span>|<span class="hljs-string"> NEURON </span>|<span class="hljs-string"> CONNECTED </span>|<span class="hljs-string"> PCI     </span>|
	|<span class="hljs-string"> DEVICE </span>|<span class="hljs-string"> CORES  </span>|<span class="hljs-string"> MEMORY </span>|<span class="hljs-string"> DEVICES   </span>|<span class="hljs-string"> BDF     </span>|
	+--------+--------+--------+-----------+---------+
	|<span class="hljs-string"> 0      </span>|<span class="hljs-string"> 2      </span>|<span class="hljs-string"> 32 GB  </span>|<span class="hljs-string"> 11, 1     </span>|<span class="hljs-string"> 80:1e.0 </span>|
	|<span class="hljs-string"> 1      </span>|<span class="hljs-string"> 2      </span>|<span class="hljs-string"> 32 GB  </span>|<span class="hljs-string"> 0, 2      </span>|<span class="hljs-string"> 90:1e.0 </span>|
	|<span class="hljs-string"> 2      </span>|<span class="hljs-string"> 2      </span>|<span class="hljs-string"> 32 GB  </span>|<span class="hljs-string"> 1, 3      </span>|<span class="hljs-string"> 80:1d.0 </span>|
	|<span class="hljs-string"> 3      </span>|<span class="hljs-string"> 2      </span>|<span class="hljs-string"> 32 GB  </span>|<span class="hljs-string"> 2, 4      </span>|<span class="hljs-string"> 90:1f.0 </span>|
	|<span class="hljs-string"> 4      </span>|<span class="hljs-string"> 2      </span>|<span class="hljs-string"> 32 GB  </span>|<span class="hljs-string"> 3, 5      </span>|<span class="hljs-string"> 80:1f.0 </span>|
	|<span class="hljs-string"> 5      </span>|<span class="hljs-string"> 2      </span>|<span class="hljs-string"> 32 GB  </span>|<span class="hljs-string"> 4, 6      </span>|<span class="hljs-string"> 90:1d.0 </span>|
	|<span class="hljs-string"> 6      </span>|<span class="hljs-string"> 2      </span>|<span class="hljs-string"> 32 GB  </span>|<span class="hljs-string"> 5, 7      </span>|<span class="hljs-string"> 20:1e.0 </span>|
	|<span class="hljs-string"> 7      </span>|<span class="hljs-string"> 2      </span>|<span class="hljs-string"> 32 GB  </span>|<span class="hljs-string"> 6, 8      </span>|<span class="hljs-string"> 20:1f.0 </span>|
	|<span class="hljs-string"> 8      </span>|<span class="hljs-string"> 2      </span>|<span class="hljs-string"> 32 GB  </span>|<span class="hljs-string"> 7, 9      </span>|<span class="hljs-string"> 10:1e.0 </span>|
	|<span class="hljs-string"> 9      </span>|<span class="hljs-string"> 2      </span>|<span class="hljs-string"> 32 GB  </span>|<span class="hljs-string"> 8, 10     </span>|<span class="hljs-string"> 10:1f.0 </span>|
	|<span class="hljs-string"> 10     </span>|<span class="hljs-string"> 2      </span>|<span class="hljs-string"> 32 GB  </span>|<span class="hljs-string"> 9, 11     </span>|<span class="hljs-string"> 10:1d.0 </span>|
	|<span class="hljs-string"> 11     </span>|<span class="hljs-string"> 2      </span>|<span class="hljs-string"> 32 GB  </span>|<span class="hljs-string"> 10, 0     </span>|<span class="hljs-string"> 20:1d.0 </span>|
	+--------+--------+--------+-----------+---------+<!-- HTML_TAG_END --></pre></div> <h2 class="relative group"><a id="setup-and-installation" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#setup-and-installation"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Setup and Installation</span></h2> <p data-svelte-h="svelte-1nwqic8">First, I ran the following commands to install the necessary dependencies and pull the container needed to compile the model, as well as serve the compiled model for benchmarking.</p> <p data-svelte-h="svelte-10i0lsm"><code>!pip install hf_transfer guidellm==0.1.0</code> <code>!git clone https://github.com/huggingface/optimum-neuron.git</code> <code>!docker pull ghcr.io/huggingface/text-generation-inference:latest-neuron</code></p> <p data-svelte-h="svelte-1aem5lq">Depending on the model, you may also need to set your HF_TOKEN like so:</p> <p data-svelte-h="svelte-1l2mb9r"><code>!export HF_TOKEN=YOUR_HF_TOKEN</code></p> <h2 class="relative group"><a id="model-compilation-and-deployment" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 
with-hover:group-hover:opacity-100 with-hover:right-full" href="#model-compilation-and-deployment"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Model Compilation and Deployment</span></h2> <p data-svelte-h="svelte-paeoq">For my use case, I needed to compile my model with specific parameters that were unique. It is important to mention that compilation is not always needed. For example, in the event that the already cached configuration would have worked for me, optimum would use that by default.</p> <p data-svelte-h="svelte-p9wslc">From the docs: “The Neuron Model Cache is a remote cache for compiled Neuron models in the <code>neff</code> format. 
It is integrated into the <code>NeuronTrainer</code> and <code>NeuronModelForCausalLM</code> classes to enable loading pretrained models from the cache instead of compiling them locally.”</p> <p data-svelte-h="svelte-ppy9hy">Now I compile the model I have selected, <code>meta-llama-3.1-8b-instruct</code> with the following command:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->!docker run -p 8080:80 -e HF_TOKEN=YOUR_TOKEN \
	-v $(<span class="hljs-built_in">pwd</span>):/data \
	--device=/dev/neuron0 \
	--device=/dev/neuron1 \
	--device=/dev/neuron2 \
	--device=/dev/neuron3 \
	--device=/dev/neuron4 \
	--device=/dev/neuron5 \
	--device=/dev/neuron6 \
	--device=/dev/neuron7 \
	--device=/dev/neuron8 \
	--device=/dev/neuron9 \
	--device=/dev/neuron10 \
	--device=/dev/neuron11 \
	-ti \
	--entrypoint <span class="hljs-string">"optimum-cli"</span> ghcr.io/huggingface/text-generation-inference:latest-neuron \
	<span class="hljs-built_in">export</span> neuron --model <span class="hljs-string">"meta-llama/Meta-Llama-3.1-8B-Instruct"</span> \
	--sequence_length 16512 \
	--batch_size 8 \
	--num_cores 8 \
	/data/exportedmodel/<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1qtk29o">Note that for my use case I chose a batch size of 8 and a tensor parallel degree of 8. Since an inf2.48xlarge has 24 cores (12 devices with 2 cores each), I can use a data parallel degree of 3, which means 3 copies of the model run across the instance.</p> <h2 class="relative group"><a id="optimizing-batch-size-for-maximum-throughput" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#optimizing-batch-size-for-maximum-throughput"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Optimizing Batch Size for Maximum Throughput</span></h2> <p data-svelte-h="svelte-1scbccj">When optimizing hardware utilization for cost-efficiency, particularly for the inf2.48xlarge instance at $12.98 per hour on-demand, the roofline model is a valuable framework.</p> <p data-svelte-h="svelte-1vzdjbj">The roofline model defines theoretical performance bounds. On one extreme, memory-bound workloads are limited by memory bandwidth: the accelerator spends most of its time on read/write operations rather than on computation. 
On the other, compute-bound workloads fully utilize the accelerator’s compute capabilities, maximizing on-device data processing.
	Batch size is a key lever for controlling this balance. Larger batch sizes tend to shift workloads towards being compute-bound, while smaller batch sizes may result in more memory-bound operations.
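</p> <p>To make the two regimes concrete, here is a minimal roofline sketch in bash. The peak compute and memory bandwidth figures are hypothetical placeholders, not Inferentia2 specifications, and the rule that arithmetic intensity roughly equals the batch size is a simplification for decoding with 16-bit weights that ignores the KV cache and activations.</p>

```shell
#!/bin/bash
# Roofline sketch with HYPOTHETICAL hardware numbers (not Inferentia2 specs).
peak_gflops=100000   # assumed peak compute, in GFLOP/s
bandwidth_gbs=400    # assumed memory bandwidth, in GB/s

# Attainable performance = min(peak compute, bandwidth * arithmetic intensity).
attainable_gflops() {
  local intensity=$1   # FLOPs performed per byte moved
  local memory_roof=$(( bandwidth_gbs * intensity ))
  if (( memory_roof < peak_gflops )); then
    echo "$memory_roof"    # memory-bound: limited by bandwidth
  else
    echo "$peak_gflops"    # compute-bound: limited by peak compute
  fi
}

# Simplifying assumption: while decoding with 16-bit weights, each weight
# (2 bytes) is read once per step and used for 2 FLOPs per sequence in the
# batch, so the intensity is roughly equal to the batch size.
for batch_size in 1 8 64 512; do
  echo "batch=${batch_size} attainable=$(attainable_gflops "$batch_size") GFLOP/s"
done
```

<p>With these assumed numbers the ridge point sits at 250 FLOPs per byte, so batch sizes 1, 8 and 64 stay on the memory-bound slope and only 512 reaches the compute roof.</p> <p>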
	That said, maximizing batch size is not always viable. Larger batches also increase per-request latency, so the practical upper limit is the largest batch size that still fits the latency budget (the time we are willing to take to return a response).
	For more information on this topic, check out this resource:
	<a href="https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-features/neuroncore-batching.html" rel="nofollow">https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-features/neuroncore-batching.html</a></p> <h2 class="relative group"><a id="creating-files-for-serving" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#creating-files-for-serving"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Creating Files for Serving</span></h2> <p data-svelte-h="svelte-1u02a8">Several files are needed to ensure our configuration is setup properly, and that the model I compiled is used rather than the cached configuration.</p> <p data-svelte-h="svelte-18oj44j">First I’ll need to create my .env file, which specifies my batch size, precision, etc. 
It is important to note, that since I compiled my model, I needed to change the model_id from the usual huggingface repo designation, to the container volume location I specified within the compilation command.</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-attr">MODEL_ID</span>=<span class="hljs-string">'/data/exportedmodel'</span>
	<span class="hljs-attr">MAX_BATCH_SIZE</span>=<span class="hljs-number">8</span>
	<span class="hljs-attr">MAX_TOTAL_TOKENS</span>=<span class="hljs-number">16512</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-ihngmv">Next, I create the benchmark.sh script with my desired settings:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-meta">#!/bin/bash</span>

	model=<span class="hljs-variable">${1:-meta-llama/Meta-Llama-3.1-8B-Instruct}</span>

	date_str=$(<span class="hljs-built_in">date</span> <span class="hljs-string">'+%Y-%m-%d-%H-%M-%S'</span>)
	output_path=<span class="hljs-string">"<span class="hljs-variable">${model//\//_}</span>#<span class="hljs-variable">${date_str}</span>_guidellm_report.json"</span>

	<span class="hljs-built_in">export</span> HF_TOKEN=YOUR_TOKEN

	<span class="hljs-built_in">export</span> GUIDELLM__NUM_SWEEP_PROFILES=1
	<span class="hljs-built_in">export</span> GUIDELLM__MAX_CONCURRENCY=128
	<span class="hljs-built_in">export</span> GUIDELLM__REQUEST_TIMEOUT=60

	guidellm \
	--target <span class="hljs-string">"http://localhost:8080/v1"</span> \
	--model <span class="hljs-variable">${model}</span> \
	--data-type emulated \
	--data <span class="hljs-string">"prompt_tokens=15900,prompt_tokens_variance=100,generated_tokens=450,generated_tokens_variance=50"</span> \
	--output-path <span class="hljs-variable">${output_path}</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1yjwwi3">Take note of the parameters passed via the <code>--data</code> flag. As my use case involves long prompts and long generations, I have set <code>prompt_tokens</code> and <code>generated_tokens</code> accordingly. Remember to set these according to your use case and the input / output token load you expect.
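</p> <p>Before running the benchmark, it can help to check that these request sizes fit the compiled sequence length. The sketch below is a hypothetical pre-flight check, not part of GuideLLM. It assumes that the mean plus the variance value is a reasonable upper bound for each side, and it reuses the <code>MAX_TOTAL_TOKENS</code> value from the .env file.</p>

```shell
#!/bin/bash
# Values copied from the .env file and the --data flag above.
max_total_tokens=16512
prompt_tokens=15900
prompt_tokens_variance=100
generated_tokens=450
generated_tokens_variance=50

# Assumption: treat mean + variance as a rough upper bound for each side.
# The sizes are sampled from a distribution, so outliers may still exceed it.
worst_case_tokens=$(( prompt_tokens + prompt_tokens_variance + generated_tokens + generated_tokens_variance ))
headroom=$(( max_total_tokens - worst_case_tokens ))

echo "worst case: ${worst_case_tokens} tokens, headroom: ${headroom} tokens"
if (( headroom < 0 )); then
  echo "Requests may exceed the compiled sequence length" >&2
fi
```

<p>For the values used here, the rough upper bound is 16500 tokens, which leaves 12 tokens of headroom under the 16512-token sequence length chosen at compilation time.</p> <p>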
	Based on these numbers, GuideLLM will generate prompts of random sizes in a normal distribution of around 15900 tokens, and ask for a random number of generated tokens in a normal distribution of around 450 tokens.</p> <p data-svelte-h="svelte-1eszkgs">The docker compose file is important for defining your data parallel, by specifying the number of devices I wish to allocate to each container. This is also where I specify the load balancer.</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-attribute">version</span><span class="hljs-punctuation">:</span> <span class="hljs-string">'3.7'</span>

<span class="hljs-attribute">services</span><span class="hljs-punctuation">:</span>
  <span class="hljs-attribute">vllm-1</span><span class="hljs-punctuation">:</span>
    <span class="hljs-attribute">image</span><span class="hljs-punctuation">:</span> <span class="hljs-string">ghcr.io/huggingface/optimum-neuron-vllm:latest</span>
    <span class="hljs-attribute">ports</span><span class="hljs-punctuation">:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"8081:8081"</span>
    <span class="hljs-attribute">volumes</span><span class="hljs-punctuation">:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">$PWD:/data</span>
    <span class="hljs-attribute">environment</span><span class="hljs-punctuation">:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">PORT=8081</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">SM_ON_MODEL=${MODEL_ID}</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">SM_ON_TENSOR_PARALLEL_SIZE=8</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">SM_ON_BATCH_SIZE=${MAX_BATCH_SIZE}</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">HF_TOKEN=YOUR_TOKEN</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">SM_ON_SEQUENCE_LENGTH=${MAX_TOTAL_TOKENS}</span>
    <span class="hljs-attribute">devices</span><span class="hljs-punctuation">:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"/dev/neuron0"</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"/dev/neuron1"</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"/dev/neuron2"</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"/dev/neuron3"</span>

  <span class="hljs-attribute">vllm-2</span><span class="hljs-punctuation">:</span>
    <span class="hljs-attribute">image</span><span class="hljs-punctuation">:</span> <span class="hljs-string">ghcr.io/huggingface/optimum-neuron-vllm:latest</span>
    <span class="hljs-attribute">ports</span><span class="hljs-punctuation">:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"8082:8082"</span>
    <span class="hljs-attribute">volumes</span><span class="hljs-punctuation">:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">$PWD:/data</span>
    <span class="hljs-attribute">environment</span><span class="hljs-punctuation">:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">PORT=8082</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">SM_ON_MODEL=${MODEL_ID}</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">SM_ON_TENSOR_PARALLEL_SIZE=8</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">SM_ON_BATCH_SIZE=${MAX_BATCH_SIZE}</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">HF_TOKEN=YOUR_TOKEN</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">SM_ON_SEQUENCE_LENGTH=${MAX_TOTAL_TOKENS}</span>
    <span class="hljs-attribute">devices</span><span class="hljs-punctuation">:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"/dev/neuron4"</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"/dev/neuron5"</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"/dev/neuron6"</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"/dev/neuron7"</span>

  <span class="hljs-attribute">vllm-3</span><span class="hljs-punctuation">:</span>
    <span class="hljs-attribute">image</span><span class="hljs-punctuation">:</span> <span class="hljs-string">ghcr.io/huggingface/optimum-neuron-vllm:latest</span>
    <span class="hljs-attribute">ports</span><span class="hljs-punctuation">:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"8083:8083"</span>
    <span class="hljs-attribute">volumes</span><span class="hljs-punctuation">:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">$PWD:/data</span>
    <span class="hljs-attribute">environment</span><span class="hljs-punctuation">:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">PORT=8083</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">SM_ON_MODEL=${MODEL_ID}</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">SM_ON_TENSOR_PARALLEL_SIZE=8</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">SM_ON_BATCH_SIZE=${MAX_BATCH_SIZE}</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">HF_TOKEN=YOUR_TOKEN</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">SM_ON_SEQUENCE_LENGTH=${MAX_TOTAL_TOKENS}</span>
    <span class="hljs-attribute">devices</span><span class="hljs-punctuation">:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"/dev/neuron8"</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"/dev/neuron9"</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"/dev/neuron10"</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"/dev/neuron11"</span>

  <span class="hljs-attribute">loadbalancer</span><span class="hljs-punctuation">:</span>
    <span class="hljs-attribute">image</span><span class="hljs-punctuation">:</span> <span class="hljs-string">nginx:alpine</span>
    <span class="hljs-attribute">ports</span><span class="hljs-punctuation">:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"8080:80"</span>
    <span class="hljs-attribute">volumes</span><span class="hljs-punctuation">:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">./nginx.conf:/etc/nginx/nginx.conf:ro</span>
    <span class="hljs-attribute">depends_on</span><span class="hljs-punctuation">:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">vllm-1</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">vllm-2</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">vllm-3</span>
    <span class="hljs-attribute">deploy</span><span class="hljs-punctuation">:</span>
      <span class="hljs-attribute">placement</span><span class="hljs-punctuation">:</span>
        <span class="hljs-attribute">constraints</span><span class="hljs-punctuation">:</span> <span class="hljs-string">[node.role == manager]</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1t3vgji">Lastly, I define the nginx.conf for the load balancer:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-comment">### Nginx vllm Load Balancer</span>
	<span class="hljs-section">events</span> {}
	<span class="hljs-section">http</span> {
	<span class="hljs-section">upstream</span> vllmcluster {
	<span class="hljs-attribute">server</span> vllm-<span class="hljs-number">1</span>:<span class="hljs-number">8081</span>;
	<span class="hljs-attribute">server</span> vllm-<span class="hljs-number">2</span>:<span class="hljs-number">8082</span>;
	<span class="hljs-attribute">server</span> vllm-<span class="hljs-number">3</span>:<span class="hljs-number">8083</span>;
	}
	<span class="hljs-section">server</span> {
	<span class="hljs-attribute">listen</span> <span class="hljs-number">80</span>;
	<span class="hljs-section">location</span> / {
	<span class="hljs-attribute">proxy_pass</span> http://vllmcluster;
	}
	}
	}<!-- HTML_TAG_END --></pre></div> <h2 class="relative group"><a id="benchmarking-with-guidellm" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#benchmarking-with-guidellm"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Benchmarking with GuideLLM</span></h2> <p data-svelte-h="svelte-1bljfye">Now that I have defined the necessary files, I can start serving my optimum-neuron model with the vLLM backend.</p> <p data-svelte-h="svelte-15pqef9"><code>!docker compose -f docker-compose.yaml --env-file .env up</code></p> <p data-svelte-h="svelte-1fewlsy">As a sanity check, I can watch the output of the above command to ensure that each container, as well as the load balancer, starts properly.
	Once the containers have started successfully, I can begin benchmarking using the previously defined benchmarking script.</p> <p data-svelte-h="svelte-tuypim"><code>!benchmark.sh "meta-llama/Meta-Llama-3.1-8B-Instruct"</code></p> <p data-svelte-h="svelte-zl3b8i">Colorful output will begin to populate the terminal as guidellm starts testing your model serving setup.</p> <h2 class="relative group"><a id="performance-analysis" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#performance-analysis"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Performance Analysis</span></h2> <p data-svelte-h="svelte-xy8qv7">After approximately 15-20 minutes, benchmarking completes and guidellm prints the following detailed breakdown in the terminal:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" 
role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->╭─ <span class="hljs-selector-tag">Benchmarks</span> ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
	│ <span class="hljs-selector-attr">[15:02:17]</span> <span class="hljs-number">100%</span> <span class="hljs-selector-tag">synchronous</span> (<span class="hljs-number">0.10</span> req/sec avg)│
	│ <span class="hljs-selector-attr">[15:04:17]</span> <span class="hljs-number">100%</span> <span class="hljs-selector-tag">throughput</span> (<span class="hljs-number">0.85</span> req/sec avg)│
	│ <span class="hljs-selector-attr">[15:05:25]</span> <span class="hljs-number">100%</span> <span class="hljs-selector-tag">constant</span>@<span class="hljs-number">0.85</span> <span class="hljs-selector-tag">req</span>/<span class="hljs-selector-tag">s</span> (<span class="hljs-number">0.77</span> req/sec avg) │
	╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
	<span class="hljs-selector-tag">Generating</span> <span class="hljs-selector-tag">report</span>... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ (<span class="hljs-number">3</span>/<span class="hljs-number">3</span>) <span class="hljs-selector-attr">[ 0:05:04 < 0:00:00 ]</span>
	╭─ <span class="hljs-selector-tag">GuideLLM</span> <span class="hljs-selector-tag">Benchmarks</span> <span class="hljs-selector-tag">Report</span> (meta-llama_Meta-Llama-<span class="hljs-number">3.1</span>-<span class="hljs-number">8</span>B-Instruct<span class="hljs-number">#2025</span>-<span class="hljs-number">05</span>-<span class="hljs-number">27</span>-<span class="hljs-number">15</span>-<span class="hljs-number">02</span>-<span class="hljs-number">11</span>_guidellm_report.json) ──────────────────────────────────────────────────────────────────────────────────╮
	│ ╭─ <span class="hljs-selector-tag">Benchmark</span> <span class="hljs-selector-tag">Report</span> <span class="hljs-number">1</span> ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ │
	│ │ <span class="hljs-selector-tag">Backend</span>(type=openai_server, target=<span class="hljs-attribute">http</span>:<span class="hljs-comment">//localhost:8080/v1, model=meta-llama/Meta-Llama-3.1-8B-Instruct) │ │</span>
	│ │ <span class="hljs-built_in">Data</span>(type=emulated, source=prompt_tokens=<span class="hljs-number">15900</span>,prompt_tokens_variance=<span class="hljs-number">100</span>,generated_tokens=<span class="hljs-number">450</span>,generated_tokens_variance=<span class="hljs-number">50</span>, tokenizer=meta-llama/Meta-Llama-<span class="hljs-number">3.1</span>-<span class="hljs-number">8</span>B-Instruct) │ │
	│ │ <span class="hljs-built_in">Rate</span>(type=sweep, rate=None) │ │
	│ │ <span class="hljs-built_in">Limits</span>(max_number=None requests, max_duration=<span class="hljs-number">120</span> sec) │ │
	│ │ │ │
	│ │ │ │
	│ │ Requests Data by Benchmark │ │
	│ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┓ │ │
	│ │ ┃ Benchmark ┃ Requests Completed ┃ Request Failed ┃ Duration ┃ Start Time ┃ End Time ┃ │ │
	│ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━┩ │ │
	│ │ │ synchronous │ <span class="hljs-number">11</span>/<span class="hljs-number">11</span> │ <span class="hljs-number">0</span>/<span class="hljs-number">11</span> │ <span class="hljs-number">113.56</span> sec │ <span class="hljs-number">15</span>:<span class="hljs-number">02</span>:<span class="hljs-number">17</span> │ <span class="hljs-number">15</span>:<span class="hljs-number">04</span>:<span class="hljs-number">11</span> │ │ │
	│ │ │ asynchronous<span class="hljs-variable">@0</span>.<span class="hljs-number">85</span> req/sec │ <span class="hljs-number">88</span>/<span class="hljs-number">88</span> │ <span class="hljs-number">0</span>/<span class="hljs-number">88</span> │ <span class="hljs-number">114.59</span> sec │ <span class="hljs-number">15</span>:<span class="hljs-number">05</span>:<span class="hljs-number">25</span> │ <span class="hljs-number">15</span>:<span class="hljs-number">07</span>:<span class="hljs-number">19</span> │ │ │
	│ │ │ throughput │ <span class="hljs-number">55</span>/<span class="hljs-number">55</span> │ <span class="hljs-number">0</span>/<span class="hljs-number">55</span> │ <span class="hljs-number">64.83</span> sec │ <span class="hljs-number">15</span>:<span class="hljs-number">04</span>:<span class="hljs-number">17</span> │ <span class="hljs-number">15</span>:<span class="hljs-number">05</span>:<span class="hljs-number">22</span> │ │ │
	│ │ └───────────────────────────┴────────────────────┴────────────────┴────────────┴────────────┴──────────┘ │ │
	│ │ │ │
	│ │ Tokens Data by Benchmark │ │
	│ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ │ │
	│ │ ┃ Benchmark ┃ Prompt ┃ Prompt (<span class="hljs-number">1%</span>, <span class="hljs-number">5%</span>, <span class="hljs-number">50%</span>, <span class="hljs-number">95%</span>, <span class="hljs-number">99%</span>) ┃ Output ┃ Output (<span class="hljs-number">1%</span>, <span class="hljs-number">5%</span>, <span class="hljs-number">50%</span>, <span class="hljs-number">95%</span>, <span class="hljs-number">99%</span>) ┃ │ │
	│ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ │
	│ │ │ synchronous │ <span class="hljs-number">15902.82</span> │ <span class="hljs-number">15896.0</span>, <span class="hljs-number">15896.0</span>, <span class="hljs-number">15902.0</span>, <span class="hljs-number">15913.0</span>, <span class="hljs-number">15914.6</span> │ <span class="hljs-number">293.09</span> │ <span class="hljs-number">70.3</span>, <span class="hljs-number">119.5</span>, <span class="hljs-number">315.0</span>, <span class="hljs-number">423.5</span>, <span class="hljs-number">443.1</span> │ │ │
	│ │ │ asynchronous<span class="hljs-variable">@0</span>.<span class="hljs-number">85</span> req/sec │ <span class="hljs-number">15899.06</span> │ <span class="hljs-number">15877.4</span>, <span class="hljs-number">15879.4</span>, <span class="hljs-number">15898.5</span>, <span class="hljs-number">15918.0</span>, <span class="hljs-number">15919.8</span> │ <span class="hljs-number">288.75</span> │ <span class="hljs-number">24.6</span>, <span class="hljs-number">74.1</span>, <span class="hljs-number">298.5</span>, <span class="hljs-number">452.6</span>, <span class="hljs-number">459.1</span> │ │ │
	│ │ │ throughput │ <span class="hljs-number">15899.22</span> │ <span class="hljs-number">15879.5</span>, <span class="hljs-number">15883.7</span>, <span class="hljs-number">15898.0</span>, <span class="hljs-number">15914.6</span>, <span class="hljs-number">15920.5</span> │ <span class="hljs-number">294.24</span> │ <span class="hljs-number">59.1</span>, <span class="hljs-number">114.9</span>, <span class="hljs-number">285.0</span>, <span class="hljs-number">452.9</span>, <span class="hljs-number">456.4</span> │ │ │
	│ │ └───────────────────────────┴──────────┴─────────────────────────────────────────────┴────────┴──────────────────────────────────┘ │ │
	│ │ │ │
	│ │ Performance Stats by Benchmark │ │
	│ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ │ │
	│ │ ┃ ┃ Request Latency [<span class="hljs-number">1%</span>, <span class="hljs-number">5%</span>, <span class="hljs-number">10%</span>, <span class="hljs-number">50%</span>, <span class="hljs-number">90%</span>, <span class="hljs-number">95%</span>, <span class="hljs-number">99%</span>] ┃ Time to First Token [<span class="hljs-number">1%</span>, <span class="hljs-number">5%</span>, <span class="hljs-number">10%</span>, <span class="hljs-number">50%</span>, <span class="hljs-number">90%</span>, <span class="hljs-number">95%</span>, ┃ Inter Token Latency [<span class="hljs-number">1%</span>, <span class="hljs-number">5%</span>, <span class="hljs-number">10%</span>, <span class="hljs-number">50%</span>, <span class="hljs-number">90%</span> <span class="hljs-number">95%</span>, ┃ │ │
	│ │ ┃ Benchmark ┃ (sec) ┃ <span class="hljs-number">99%</span>] (ms) ┃ <span class="hljs-number">99%</span>] (ms) ┃ │ │
	│ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ │
	│ │ │ synchronous │ <span class="hljs-number">3.68</span>, <span class="hljs-number">5.13</span>, <span class="hljs-number">6.94</span>, <span class="hljs-number">10.91</span>, <span class="hljs-number">13.51</span>, <span class="hljs-number">14.26</span>, <span class="hljs-number">14.87</span> │ <span class="hljs-number">1563.3</span>, <span class="hljs-number">1569.2</span>, <span class="hljs-number">1576.5</span>, <span class="hljs-number">1589.4</span>, <span class="hljs-number">1594.0</span>, <span class="hljs-number">1595.3</span>, │ <span class="hljs-number">23.2</span>, <span class="hljs-number">28.2</span>, <span class="hljs-number">29.4</span>, <span class="hljs-number">29.8</span>, <span class="hljs-number">30.3</span>, <span class="hljs-number">31.7</span>, <span class="hljs-number">36.5</span> │ │ │
	│ │ │ │ │ <span class="hljs-number">1596.4</span> │ │ │ │
	│ │ │ asynchronous<span class="hljs-variable">@0</span>.<span class="hljs-number">85</span> req/sec │ <span class="hljs-number">2.62</span>, <span class="hljs-number">6.55</span>, <span class="hljs-number">9.40</span>, <span class="hljs-number">20.66</span>, <span class="hljs-number">30.60</span>, <span class="hljs-number">32.78</span>, <span class="hljs-number">35.07</span> │ <span class="hljs-number">1594.1</span>, <span class="hljs-number">1602.5</span>, <span class="hljs-number">1605.7</span>, <span class="hljs-number">1629.7</span>, <span class="hljs-number">4650.1</span>, <span class="hljs-number">4924.1</span>, │ <span class="hljs-number">0.2</span>, <span class="hljs-number">0.2</span>, <span class="hljs-number">0.2</span>, <span class="hljs-number">34.3</span>, <span class="hljs-number">44.9</span>, <span class="hljs-number">54.5</span>, <span class="hljs-number">1613.9</span> │ │ │
	│ │ │ │ │ <span class="hljs-number">5345.6</span> │ │ │ │
	│ │ │ throughput │ <span class="hljs-number">18.29</span>, <span class="hljs-number">21.24</span>, <span class="hljs-number">23.81</span>, <span class="hljs-number">44.60</span>, <span class="hljs-number">61.50</span>, <span class="hljs-number">62.80</span>, <span class="hljs-number">63.72</span> │ <span class="hljs-number">2157.6</span>, <span class="hljs-number">9185.1</span>, <span class="hljs-number">12220.5</span>, <span class="hljs-number">23333.5</span>, <span class="hljs-number">44214.1</span>, │ <span class="hljs-number">28.2</span>, <span class="hljs-number">31.5</span>, <span class="hljs-number">33.1</span>, <span class="hljs-number">39.1</span>, <span class="hljs-number">59.0</span>, <span class="hljs-number">65.2</span>, <span class="hljs-number">1604.6</span> │ │ │
	│ │ │ │ │ <span class="hljs-number">45329.8</span>, <span class="hljs-number">51276.9</span> │ │ │ │
	│ │ └───────────────────────────┴───────────────────────────────────────────────────┴───────────────────────────────────────────────────┴────────────────────────────────────────────────────┘ │ │
	│ │ │ │
	│ │ Performance Summary by Benchmark │ │
	│ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┓ │ │
	│ │ ┃ Benchmark ┃ Requests per Second ┃ Request Latency ┃ Time to First Token ┃ Inter Token Latency ┃ Output Token Throughput ┃ │ │
	│ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ │
	│ │ │ synchronous │ <span class="hljs-number">0.10</span> req/sec │ <span class="hljs-number">10.32</span> sec │ <span class="hljs-number">1585.08</span> ms │ <span class="hljs-number">29.81</span> ms │ <span class="hljs-number">28.39</span> tokens/sec │ │ │
	│ │ │ asynchronous<span class="hljs-variable">@0</span>.<span class="hljs-number">85</span> req/sec │ <span class="hljs-number">0.77</span> req/sec │ <span class="hljs-number">20.77</span> sec │ <span class="hljs-number">2401.32</span> ms │ <span class="hljs-number">63.69</span> ms │ <span class="hljs-number">221.75</span> tokens/sec │ │ │
	│ │ │ throughput │ <span class="hljs-number">0.85</span> req/sec │ <span class="hljs-number">43.78</span> sec │ <span class="hljs-number">24624.46</span> ms │ <span class="hljs-number">65.18</span> ms │ <span class="hljs-number">249.64</span> tokens/sec │ │ │
	│ │ └───────────────────────────┴─────────────────────┴─────────────────┴─────────────────────┴─────────────────────┴─────────────────────────┘ │ │
	│ ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
	╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-bf06av">The report gives us quite a few useful data points. Under the hood, guidellm benchmarks the system under three separate load profiles:</p> <ol data-svelte-h="svelte-j7gq2a"><li>Synchronous - Serving one request at a time</li> <li>Asynchronous - Serving multiple requests concurrently at a fixed request rate (0.85 req/sec in this case)</li> <li>Throughput - Serving the maximum number of requests that the system can sustain</li></ol> <p data-svelte-h="svelte-xkujic">For each profile, the report lists how many requests completed versus how many failed, the prompt and output token counts, the request latency, the time to first token, the inter-token latency, and the output token throughput.
	For my experiment, I can see that under maximum load the system serves about 0.85 requests per second, at an average request latency of just under 44 seconds (the 99th percentile is closer to 64 seconds). Depending on my latency budget, the next step would be to increase my batch size if I can tolerate longer response times and want more throughput. Alternatively, I could lower my batch size to decrease latency, at the cost of potentially reducing throughput.</p> <p data-svelte-h="svelte-11j4vv6">Lastly, the large prompt and output sizes required by my workload directly affect the benchmark results. In particular, under load most of the request latency comes from the time needed to encode my input context: in the throughput run, time to first token averages about 24.6 of the 43.8 seconds per request.</p> <h2 class="relative group"><a id="conclusion" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#conclusion"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Conclusion</span></h2> <p data-svelte-h="svelte-82p2i1">In this blog post, I took you through how to compile and load an Optimum Neuron model, how to serve it with the HuggingFace vLLM container, and how to benchmark your settings to optimize for your workload.</p> <h2 class="relative group"><a 
id="references" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#references"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>References</span></h2> <p data-svelte-h="svelte-18fl1af"><a href="https://huggingface.co/docs/optimum-neuron/en/guides/cache_system" rel="nofollow">https://huggingface.co/docs/optimum-neuron/en/guides/cache_system</a> <a href="https://github.com/huggingface/optimum-neuron/tree/main/benchmark/vllm/performance" rel="nofollow">https://github.com/huggingface/optimum-neuron/tree/main/benchmark/vllm/performance</a> <a href="https://github.com/vllm-project/guidellm" rel="nofollow">https://github.com/vllm-project/guidellm</a></p> <p></p>
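<p>As a cross-check on the Performance Summary table in the Performance Analysis section, the sketch below recomputes requests per second and output token throughput from the "Requests Data" and "Tokens Data" tables of the same report. It assumes that throughput is simply completed requests divided by duration, multiplied by the mean output tokens per request; that relationship is inferred from the numbers, not taken from the guidellm source. The dictionary layout and the <code>summarize</code> helper are hypothetical names used only for illustration.</p>

```python
# Figures copied from the "Requests Data by Benchmark" and "Tokens Data by Benchmark"
# tables of the report. The structure below is illustrative, not a guidellm format.
RUNS = {
    "synchronous": {"completed": 11, "duration_s": 113.56, "mean_output_tokens": 293.09},
    "asynchronous@0.85": {"completed": 88, "duration_s": 114.59, "mean_output_tokens": 288.75},
    "throughput": {"completed": 55, "duration_s": 64.83, "mean_output_tokens": 294.24},
}


def summarize(run):
    """Return (requests per second, output tokens per second) for one benchmark run.

    Assumption: token throughput = request rate * mean output tokens per request.
    """
    requests_per_sec = run["completed"] / run["duration_s"]
    output_tokens_per_sec = requests_per_sec * run["mean_output_tokens"]
    return requests_per_sec, output_tokens_per_sec


if __name__ == "__main__":
    for name, run in RUNS.items():
        rps, tps = summarize(run)
        print(f"{name:<20} {rps:.2f} req/sec  {tps:.2f} output tokens/sec")
```

<p>Under these assumptions the recomputed values agree with the reported summary to within a few hundredths, which is a quick way to confirm that you are reading the right rows (note that the report lists the asynchronous run before the throughput run, even though it executed last).</p>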

	<script>
	{
	__sveltekit_89oqon = {
	assets: "/docs/optimum.neuron/main/en",
	base: "/docs/optimum.neuron/main/en",
	env: {}
	};

	const element = document.currentScript.parentElement;

	const data = [null,null];

	Promise.all([
	import("/docs/optimum.neuron/main/en/_app/immutable/entry/start.e7cdb183.js"),
	import("/docs/optimum.neuron/main/en/_app/immutable/entry/app.c5810efa.js")
	]).then(([kit, app]) => {
	kit.start(app, element, {
	node_ids: [0, 9],
	data,
	form: null,
	error: null
	});
	});
	}
	</script>
