<meta charset="utf-8" /><meta name="hf:doc:metadata" content="{&quot;title&quot;:&quot;Neuron Model Inference&quot;,&quot;local&quot;:&quot;neuron-model-inference&quot;,&quot;sections&quot;:[{&quot;title&quot;:&quot;Switching from Transformers to Optimum&quot;,&quot;local&quot;:&quot;switching-from-transformers-to-optimum&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Discriminative NLP models&quot;,&quot;local&quot;:&quot;discriminative-nlp-models&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Generative NLP models&quot;,&quot;local&quot;:&quot;generative-nlp-models&quot;,&quot;sections&quot;:[{&quot;title&quot;:&quot;Configuring the export of a generative model&quot;,&quot;local&quot;:&quot;configuring-the-export-of-a-generative-model&quot;,&quot;sections&quot;:[],&quot;depth&quot;:3},{&quot;title&quot;:&quot;Text generation inference&quot;,&quot;local&quot;:&quot;text-generation-inference&quot;,&quot;sections&quot;:[],&quot;depth&quot;:3}],&quot;depth&quot;:2}],&quot;depth&quot;:1}">
<link href="/docs/optimum.neuron/main/en/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload">
<link rel="modulepreload" href="/docs/optimum.neuron/main/en/_app/immutable/entry/start.52fb68c1.js">
<link rel="modulepreload" href="/docs/optimum.neuron/main/en/_app/immutable/chunks/scheduler.a2b4ca8e.js">
<link rel="modulepreload" href="/docs/optimum.neuron/main/en/_app/immutable/chunks/singletons.129c0188.js">
<link rel="modulepreload" href="/docs/optimum.neuron/main/en/_app/immutable/chunks/paths.e08b37f2.js">
<link rel="modulepreload" href="/docs/optimum.neuron/main/en/_app/immutable/entry/app.30230995.js">
<link rel="modulepreload" href="/docs/optimum.neuron/main/en/_app/immutable/chunks/index.d2f673cc.js">
<link rel="modulepreload" href="/docs/optimum.neuron/main/en/_app/immutable/nodes/0.2aaf5a61.js">
<link rel="modulepreload" href="/docs/optimum.neuron/main/en/_app/immutable/chunks/each.e59479a4.js">
<link rel="modulepreload" href="/docs/optimum.neuron/main/en/_app/immutable/nodes/12.c96f6459.js">
<link rel="modulepreload" href="/docs/optimum.neuron/main/en/_app/immutable/chunks/Tip.a902c250.js">
<link rel="modulepreload" href="/docs/optimum.neuron/main/en/_app/immutable/chunks/CodeBlock.792343a6.js">
<link rel="modulepreload" href="/docs/optimum.neuron/main/en/_app/immutable/chunks/Heading.675d4c1e.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{&quot;title&quot;:&quot;Neuron Model Inference&quot;,&quot;local&quot;:&quot;neuron-model-inference&quot;,&quot;sections&quot;:[{&quot;title&quot;:&quot;Switching from Transformers to Optimum&quot;,&quot;local&quot;:&quot;switching-from-transformers-to-optimum&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Discriminative NLP models&quot;,&quot;local&quot;:&quot;discriminative-nlp-models&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Generative NLP models&quot;,&quot;local&quot;:&quot;generative-nlp-models&quot;,&quot;sections&quot;:[{&quot;title&quot;:&quot;Configuring the export of a generative model&quot;,&quot;local&quot;:&quot;configuring-the-export-of-a-generative-model&quot;,&quot;sections&quot;:[],&quot;depth&quot;:3},{&quot;title&quot;:&quot;Text generation inference&quot;,&quot;local&quot;:&quot;text-generation-inference&quot;,&quot;sections&quot;:[],&quot;depth&quot;:3}],&quot;depth&quot;:2}],&quot;depth&quot;:1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h1 class="relative group"><a id="neuron-model-inference" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#neuron-model-inference"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 
43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Neuron Model Inference</span></h1> <p data-svelte-h="svelte-1eizqkl"><em>The APIs presented in the following documentation are relevant for the inference on <a href="https://aws.amazon.com/ec2/instance-types/inf2/" rel="nofollow">inf2</a>,
<a href="https://aws.amazon.com/ec2/instance-types/trn1/" rel="nofollow">trn1</a> and <a href="https://aws.amazon.com/ec2/instance-types/inf1/" rel="nofollow">inf1</a>.</em></p> <p data-svelte-h="svelte-1pwj84v"><code>NeuronModelForXXX</code> classes load models from the <a href="https://hf.co/models" rel="nofollow">Hugging Face Hub</a> and compile them to a serialized format optimized for
neuron devices. You will then be able to load the model and run inference with the acceleration powered by AWS Neuron devices.</p> <h2 class="relative group"><a id="switching-from-transformers-to-optimum" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#switching-from-transformers-to-optimum"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Switching from Transformers to Optimum</span></h2> <p data-svelte-h="svelte-11gy3bo">The <code>optimum.neuron.NeuronModelForXXX</code> model classes are APIs compatible with Hugging Face Transformers models. This means seamless integration
with Hugging Face’s ecosystem. You can just replace your <code>AutoModelForXXX</code> class with the corresponding <code>NeuronModelForXXX</code> class in <code>optimum.neuron</code>.</p> <p data-svelte-h="svelte-1nb8pat">If you already use Transformers, you will be able to reuse your code just by replacing model classes:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->from transformers import AutoTokenizer
<span class="hljs-deletion">-from transformers import AutoModelForSequenceClassification</span>
<span class="hljs-addition">+from optimum.neuron import NeuronModelForSequenceClassification</span>
# PyTorch checkpoint
<span class="hljs-deletion">-model = AutoModelForSequenceClassification.from_pretrained(&quot;distilbert-base-uncased-finetuned-sst-2-english&quot;)</span>
<span class="hljs-addition">+model = NeuronModelForSequenceClassification.from_pretrained(&quot;distilbert-base-uncased-finetuned-sst-2-english&quot;,</span>
<span class="hljs-addition">+    export=True, **neuron_kwargs)</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-qud127">As shown above, when you use <code>NeuronModelForXXX</code> for the first time, you need to set <code>export=True</code> to compile your model from PyTorch to a neuron-compatible format.</p> <p data-svelte-h="svelte-126hjbj">You also need to pass Neuron-specific parameters to configure the export. Each model architecture has its own set of parameters, as detailed in the following sections.</p> <p data-svelte-h="svelte-1l6u6ld">Once your model has been exported, you can save it either locally or on the <a href="https://hf.co/models" rel="nofollow">Hugging Face Model Hub</a>:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-comment"># Save the neuron model</span>
<span class="hljs-meta">&gt;&gt;&gt; </span>model.save_pretrained(<span class="hljs-string">&quot;a_local_path_for_compiled_neuron_model&quot;</span>)
<span class="hljs-comment"># Push the neuron model to HF Hub</span>
<span class="hljs-meta">&gt;&gt;&gt; </span>model.push_to_hub(
<span class="hljs-meta">... </span> <span class="hljs-string">&quot;a_local_path_for_compiled_neuron_model&quot;</span>, repository_id=<span class="hljs-string">&quot;my-neuron-repo&quot;</span>, use_auth_token=<span class="hljs-literal">True</span>
<span class="hljs-meta">... </span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-x3o861">The next time you want to run inference, just load your compiled model, which saves you the compilation time:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-meta">&gt;&gt;&gt; </span><span class="hljs-keyword">from</span> optimum.neuron <span class="hljs-keyword">import</span> NeuronModelForSequenceClassification
<span class="hljs-meta">&gt;&gt;&gt; </span>model = NeuronModelForSequenceClassification.from_pretrained(<span class="hljs-string">&quot;my-neuron-repo&quot;</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1ugid4e">As you see, there is no need to pass the neuron arguments used during the export as they are
saved in a <code>config.json</code> file, and will be restored automatically by the <code>NeuronModelForXXX</code> class.</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-1p17snc">The first time you run inference, the model goes through a warmup phase. Expect this first run to have 3x-4x higher latency than a regular run.</p></div> <h2 class="relative group"><a id="discriminative-nlp-models" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#discriminative-nlp-models"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Discriminative NLP models</span></h2> <p data-svelte-h="svelte-10onjii">As explained in the previous section, you need only a few modifications to your Transformers code to export and run NLP models:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer 
focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->from transformers import AutoTokenizer
<span class="hljs-deletion">-from transformers import AutoModelForSequenceClassification</span>
<span class="hljs-addition">+from optimum.neuron import NeuronModelForSequenceClassification</span>
# PyTorch checkpoint
<span class="hljs-deletion">-model = AutoModelForSequenceClassification.from_pretrained(&quot;distilbert-base-uncased-finetuned-sst-2-english&quot;)</span>
# Compile the model the first time you load it
<span class="hljs-addition">+compiler_args = {&quot;auto_cast&quot;: &quot;matmul&quot;, &quot;auto_cast_type&quot;: &quot;bf16&quot;}</span>
<span class="hljs-addition">+input_shapes = {&quot;batch_size&quot;: 1, &quot;sequence_length&quot;: 64}</span>
<span class="hljs-addition">+model = NeuronModelForSequenceClassification.from_pretrained(</span>
<span class="hljs-addition">+ &quot;distilbert-base-uncased-finetuned-sst-2-english&quot;, export=True, **compiler_args, **input_shapes,</span>
<span class="hljs-addition">+)</span>
tokenizer = AutoTokenizer.from_pretrained(&quot;distilbert-base-uncased-finetuned-sst-2-english&quot;)
inputs = tokenizer(&quot;Hamilton is considered to be the best musical of human history.&quot;, return_tensors=&quot;pt&quot;)
logits = model(**inputs).logits
print(model.config.id2label[logits.argmax().item()])
# &#x27;POSITIVE&#x27;<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1f1bjg7"><code>compiler_args</code> are optional arguments for the compiler. They usually control how the compiler trades off inference performance (latency and throughput) against accuracy. Here, <code>auto_cast=&quot;matmul&quot;</code> together with <code>auto_cast_type=&quot;bf16&quot;</code> casts FP32 matrix-multiplication operations to BF16.</p> <p data-svelte-h="svelte-qvhyy7"><code>input_shapes</code> is the mandatory static shape information that you must pass to the Neuron compiler. Wondering which shapes are mandatory for your model? Check it out
with the following code:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-meta">&gt;&gt;&gt; </span><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> AutoModelForSequenceClassification
<span class="hljs-meta">&gt;&gt;&gt; </span><span class="hljs-keyword">from</span> optimum.exporters <span class="hljs-keyword">import</span> TasksManager
<span class="hljs-meta">&gt;&gt;&gt; </span>model = AutoModelForSequenceClassification.from_pretrained(<span class="hljs-string">&quot;distilbert-base-uncased-finetuned-sst-2-english&quot;</span>)
<span class="hljs-comment"># Infer the task name if you don&#x27;t know</span>
<span class="hljs-meta">&gt;&gt;&gt; </span>task = TasksManager.infer_task_from_model(model) <span class="hljs-comment"># &#x27;text-classification&#x27;</span>
<span class="hljs-meta">&gt;&gt;&gt; </span>neuron_config_constructor = TasksManager.get_exporter_config_constructor(
<span class="hljs-meta">... </span> model=model, exporter=<span class="hljs-string">&quot;neuron&quot;</span>, task=<span class="hljs-string">&#x27;text-classification&#x27;</span>
<span class="hljs-meta">... </span>)
<span class="hljs-meta">&gt;&gt;&gt; </span><span class="hljs-built_in">print</span>(neuron_config_constructor.func.get_mandatory_axes_for_task(task))
<span class="hljs-comment"># (&#x27;batch_size&#x27;, &#x27;sequence_length&#x27;)</span><!-- HTML_TAG_END --></pre></div> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-urc7mx">Be careful: the input shapes used for compilation must not be smaller than the inputs that you will feed into the model during inference.</p></div> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><ul data-svelte-h="svelte-6yr52r"><li>What if input sizes are smaller than compilation input shapes?</li></ul> <p data-svelte-h="svelte-1i2c71c">No worries, the <code>NeuronModelForXXX</code> class will pad your inputs to an eligible shape. You can also set <code>dynamic_batch_size=True</code> in the <code>from_pretrained</code> method to enable dynamic batching, which lets your inputs have a variable batch size.</p> <p data-svelte-h="svelte-1aeohpg"><em>(Just keep in mind: dynamic batching and padding bring not only flexibility but also a performance drop. 
Fair enough!)</em></p></div> <h2 class="relative group"><a id="generative-nlp-models" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#generative-nlp-models"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Generative NLP models</span></h2> <p data-svelte-h="svelte-njr3h4">As explained before, you will need only a few modifications to your Transformers code to export and run NLP models:</p> <h3 class="relative group"><a id="configuring-the-export-of-a-generative-model" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#configuring-the-export-of-a-generative-model"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 
28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Configuring the export of a generative model</span></h3> <p data-svelte-h="svelte-18n8q64">As for non-generative models, two sets of parameters can be passed to the <code>from_pretrained()</code> method to configure how a transformers checkpoint is exported to
a neuron-optimized model:</p> <ul data-svelte-h="svelte-ezyrr9"><li><p><code>compiler_args = { num_cores, auto_cast_type }</code> are optional arguments for the compiler. They usually control how the compiler trades off inference performance (latency and throughput) against accuracy.</p></li> <li><p><code>input_shapes = { batch_size, sequence_length }</code> correspond to the static shape of the model input and the KV-cache (attention keys and values for past tokens).</p></li> <li><p><code>num_cores</code> is the number of neuron cores used when instantiating the model. Each neuron core has 16 GB of memory, which means that
bigger models need to be split across multiple cores. Defaults to 1.</p></li> <li><p><code>auto_cast_type</code> specifies the format used to encode the weights. It can be one of <code>fp32</code> (<code>float32</code>), <code>fp16</code> (<code>float16</code>) or <code>bf16</code> (<code>bfloat16</code>). Defaults to <code>fp32</code>.</p></li> <li><p><code>batch_size</code> is the number of input sequences that the model will accept. Defaults to 1.</p></li> <li><p><code>sequence_length</code> is the maximum number of tokens in an input sequence. Defaults to <code>max_position_embeddings</code> (<code>n_positions</code> for older models).</p></li></ul> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->from transformers import AutoTokenizer
<span class="hljs-deletion">-from transformers import AutoModelForCausalLM</span>
<span class="hljs-addition">+from optimum.neuron import NeuronModelForCausalLM</span>
# Instantiate and convert to Neuron a PyTorch checkpoint
<span class="hljs-addition">+compiler_args = {&quot;num_cores&quot;: 1, &quot;auto_cast_type&quot;: &#x27;fp32&#x27;}</span>
<span class="hljs-addition">+input_shapes = {&quot;batch_size&quot;: 1, &quot;sequence_length&quot;: 512}</span>
<span class="hljs-deletion">-model = AutoModelForCausalLM.from_pretrained(&quot;gpt2&quot;)</span>
<span class="hljs-addition">+model = NeuronModelForCausalLM.from_pretrained(&quot;gpt2&quot;, export=True, **compiler_args, **input_shapes)</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1d0ot4r">As explained before, these parameters can only be configured during export.
This means in particular that during inference:</p> <ul data-svelte-h="svelte-d4fneu"><li>the <code>batch_size</code> of the inputs should be equal to the <code>batch_size</code> used during export,</li> <li>the <code>length</code> of the input sequences should be lower than the <code>sequence_length</code> used during export,</li> <li>the maximum number of tokens (input + generated) cannot exceed the <code>sequence_length</code> used during export.</li></ul> <h3 class="relative group"><a id="text-generation-inference" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#text-generation-inference"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Text generation inference</span></h3> <p data-svelte-h="svelte-qc8n5z">As with the original transformers models, use <code>generate()</code> instead of <code>forward()</code> to generate text sequences.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" 
type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->from transformers import AutoTokenizer
<span class="hljs-deletion">-from transformers import AutoModelForCausalLM</span>
<span class="hljs-addition">+from optimum.neuron import NeuronModelForCausalLM</span>
import torch
# Instantiate and convert to Neuron a PyTorch checkpoint
<span class="hljs-deletion">-model = AutoModelForCausalLM.from_pretrained(&quot;gpt2&quot;)</span>
<span class="hljs-addition">+model = NeuronModelForCausalLM.from_pretrained(&quot;gpt2&quot;, export=True)</span>
tokenizer = AutoTokenizer.from_pretrained(&quot;gpt2&quot;)
tokenizer.pad_token_id = tokenizer.eos_token_id
tokens = tokenizer(&quot;I really wish &quot;, return_tensors=&quot;pt&quot;)
with torch.inference_mode():
    sample_output = model.generate(
        **tokens,
        do_sample=True,
        min_length=128,
        max_length=256,
        temperature=0.7,
    )
outputs = [tokenizer.decode(tok) for tok in sample_output]
print(outputs)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1266wn0">Generation is highly configurable. Please refer to <a href="https://huggingface.co/docs/transformers/generation_strategies" rel="nofollow">https://huggingface.co/docs/transformers/generation_strategies</a> for details.</p> <p data-svelte-h="svelte-1g2jw6w">Please be aware that:</p> <ul data-svelte-h="svelte-14ajc9a"><li>for each model architecture, default values are provided for all parameters, but values passed to the <code>generate</code> method will take precedence,</li> <li>the generation parameters can be stored in a <code>generation_config.json</code> file. When such a file is present in the model directory,
it will be parsed to set the default parameters (the values passed to the <code>generate</code> method still take precedence).</li></ul> <p data-svelte-h="svelte-12v26si">Happy inference with Neuron! 🚀</p> <p></p>
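<p>The precedence rules for generation parameters described above can be illustrated with a small standard-library sketch. It is not the library's implementation: <code>resolve_generation_params</code> is a hypothetical helper, and the parameter values are made up for illustration. It only shows the order in which the three sources are assumed to be merged (architecture defaults, then <code>generation_config.json</code>, then arguments passed to <code>generate</code>).</p>

```python
import json
import tempfile
from pathlib import Path


def resolve_generation_params(model_defaults, model_dir, **generate_kwargs):
    """Hypothetical helper: merge generation parameters, lowest priority first."""
    # 1. Start from the per-architecture defaults.
    params = dict(model_defaults)
    # 2. Values stored in generation_config.json override the defaults.
    config_file = Path(model_dir) / "generation_config.json"
    if config_file.is_file():
        params.update(json.loads(config_file.read_text()))
    # 3. Arguments passed explicitly to generate() always win.
    params.update(generate_kwargs)
    return params


with tempfile.TemporaryDirectory() as model_dir:
    # Illustrative file content, not taken from a real checkpoint.
    (Path(model_dir) / "generation_config.json").write_text(
        json.dumps({"do_sample": True, "temperature": 0.7})
    )
    resolved = resolve_generation_params(
        {"do_sample": False, "temperature": 1.0, "max_length": 20},
        model_dir,
        max_length=256,
    )
    print(resolved)
    # {'do_sample': True, 'temperature': 0.7, 'max_length': 256}
```

<p>In this sketch, <code>do_sample</code> and <code>temperature</code> come from the file, while <code>max_length</code> comes from the explicit argument, which matches the rule that values passed to <code>generate</code> still take precedence.</p>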
<script>
{
__sveltekit_zl08ia = {
assets: "/docs/optimum.neuron/main/en",
base: "/docs/optimum.neuron/main/en",
env: {}
};
const element = document.currentScript.parentElement;
const data = [null,null];
Promise.all([
import("/docs/optimum.neuron/main/en/_app/immutable/entry/start.52fb68c1.js"),
import("/docs/optimum.neuron/main/en/_app/immutable/entry/app.30230995.js")
]).then(([kit, app]) => {
kit.start(app, element, {
node_ids: [0, 12],
data,
form: null,
error: null
});
});
}
</script>
