Buckets:

rtrm's picture
download
raw
32 kB
<meta charset="utf-8" /><meta name="hf:doc:metadata" content="{&quot;title&quot;:&quot;AWQ&quot;,&quot;local&quot;:&quot;awq&quot;,&quot;sections&quot;:[{&quot;title&quot;:&quot;퓨즈된 모듈&quot;,&quot;local&quot;:&quot;fused-modules&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;ExLlama-v2 서포트&quot;,&quot;local&quot;:&quot;exllama-v2-support&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2}],&quot;depth&quot;:1}">
<link href="/docs/transformers/main/ko/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload">
<link rel="modulepreload" href="/docs/transformers/main/ko/_app/immutable/entry/start.9aa88961.js">
<link rel="modulepreload" href="/docs/transformers/main/ko/_app/immutable/chunks/scheduler.9bc65507.js">
<link rel="modulepreload" href="/docs/transformers/main/ko/_app/immutable/chunks/singletons.9eec45c3.js">
<link rel="modulepreload" href="/docs/transformers/main/ko/_app/immutable/chunks/index.3b203c72.js">
<link rel="modulepreload" href="/docs/transformers/main/ko/_app/immutable/chunks/paths.566078f7.js">
<link rel="modulepreload" href="/docs/transformers/main/ko/_app/immutable/entry/app.84fb67c3.js">
<link rel="modulepreload" href="/docs/transformers/main/ko/_app/immutable/chunks/index.707bf1b6.js">
<link rel="modulepreload" href="/docs/transformers/main/ko/_app/immutable/nodes/0.1c99376b.js">
<link rel="modulepreload" href="/docs/transformers/main/ko/_app/immutable/chunks/each.e59479a4.js">
<link rel="modulepreload" href="/docs/transformers/main/ko/_app/immutable/nodes/51.4a999ea4.js">
<link rel="modulepreload" href="/docs/transformers/main/ko/_app/immutable/chunks/Tip.c2ecdbf4.js">
<link rel="modulepreload" href="/docs/transformers/main/ko/_app/immutable/chunks/CodeBlock.54a9f38d.js">
<link rel="modulepreload" href="/docs/transformers/main/ko/_app/immutable/chunks/EditOnGithub.922df6ba.js">
<link rel="modulepreload" href="/docs/transformers/main/ko/_app/immutable/chunks/HfOption.6d864328.js">
<link rel="modulepreload" href="/docs/transformers/main/ko/_app/immutable/chunks/stores.c16bc1a5.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{&quot;title&quot;:&quot;AWQ&quot;,&quot;local&quot;:&quot;awq&quot;,&quot;sections&quot;:[{&quot;title&quot;:&quot;퓨즈된 모듈&quot;,&quot;local&quot;:&quot;fused-modules&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;ExLlama-v2 서포트&quot;,&quot;local&quot;:&quot;exllama-v2-support&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2}],&quot;depth&quot;:1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h1 class="relative group"><a id="awq" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#awq"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>AWQ</span></h1> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-cfteqc"><a href="https://colab.research.google.com/drive/1HzZH89yAXJaZgwJDhQj9LqSBux932BvY" rel="nofollow">노트북</a> 으로 AWQ 양자화를 실습해보세요 !</p></div> <p data-svelte-h="svelte-50rz6s"><a href="https://hf.co/papers/2306.00978" rel="nofollow">Activation-aware Weight Quantization (AWQ)</a>은 모델의 모든 가중치를 양자화하지 않고, LLM 성능에 중요한 가중치를 유지합니다. 이로써 4비트 정밀도로 모델을 실행해도 성능 저하 없이 양자화 손실을 크게 줄일 수 있습니다.</p> <p data-svelte-h="svelte-ibfq68">AWQ 알고리즘을 사용하여 모델을 양자화할 수 있는 여러 라이브러리가 있습니다. 예를 들어 <a href="https://github.com/mit-han-lab/llm-awq" rel="nofollow">llm-awq</a>, <a href="https://github.com/casper-hansen/AutoAWQ" rel="nofollow">autoawq</a> , <a href="https://huggingface.co/docs/optimum/main/en/intel/optimization_inc" rel="nofollow">optimum-intel</a> 등이 있습니다. Transformers는 llm-awq, autoawq 라이브러리를 이용해 양자화된 모델을 가져올 수 있도록 지원합니다. 이 가이드에서는 autoawq로 양자화된 모델을 가져오는 방법을 보여드리나, llm-awq로 양자화된 모델의 경우도 유사한 절차를 따릅니다.</p> <p data-svelte-h="svelte-13tyifb">autoawq가 설치되어 있는지 확인하세요:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->pip install autoawq<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-127ay4z">AWQ 양자화된 모델은 해당 모델의 <a href="https://huggingface.co/TheBloke/zephyr-7B-alpha-AWQ/blob/main/config.json" rel="nofollow">config.json</a> 파일의 <code>quantization_config</code> 속성을 통해 식별할 수 있습니다.:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-punctuation">{</span>
<span class="hljs-attr">&quot;_name_or_path&quot;</span><span class="hljs-punctuation">:</span> <span class="hljs-string">&quot;/workspace/process/huggingfaceh4_zephyr-7b-alpha/source&quot;</span><span class="hljs-punctuation">,</span>
<span class="hljs-attr">&quot;architectures&quot;</span><span class="hljs-punctuation">:</span> <span class="hljs-punctuation">[</span>
<span class="hljs-string">&quot;MistralForCausalLM&quot;</span>
<span class="hljs-punctuation">]</span><span class="hljs-punctuation">,</span>
...
...
...
<span class="hljs-attr">&quot;quantization_config&quot;</span><span class="hljs-punctuation">:</span> <span class="hljs-punctuation">{</span>
<span class="hljs-attr">&quot;quant_method&quot;</span><span class="hljs-punctuation">:</span> <span class="hljs-string">&quot;awq&quot;</span><span class="hljs-punctuation">,</span>
<span class="hljs-attr">&quot;zero_point&quot;</span><span class="hljs-punctuation">:</span> <span class="hljs-literal"><span class="hljs-keyword">true</span></span><span class="hljs-punctuation">,</span>
<span class="hljs-attr">&quot;group_size&quot;</span><span class="hljs-punctuation">:</span> <span class="hljs-number">128</span><span class="hljs-punctuation">,</span>
<span class="hljs-attr">&quot;bits&quot;</span><span class="hljs-punctuation">:</span> <span class="hljs-number">4</span><span class="hljs-punctuation">,</span>
<span class="hljs-attr">&quot;version&quot;</span><span class="hljs-punctuation">:</span> <span class="hljs-string">&quot;gemm&quot;</span>
<span class="hljs-punctuation">}</span>
<span class="hljs-punctuation">}</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-dq1mc">양자화된 모델은 <code>from_pretrained()</code> 메서드를 사용하여 가져옵니다. 모델을 CPU에 가져왔다면, 먼저 모델을 GPU 장치로 옮겨야 합니다. <code>device_map</code> 파라미터를 사용하여 모델을 배치할 위치를 지정하세요:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> AutoModelForCausalLM, AutoTokenizer
model_id = <span class="hljs-string">&quot;TheBloke/zephyr-7B-alpha-AWQ&quot;</span>
model = AutoModelForCausalLM.from_pretrained(model_id, device_map=<span class="hljs-string">&quot;cuda:0&quot;</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-pmm406">AWQ 양자화 모델을 가져오면 자동으로 성능상의 이유로 인해 가중치들의 기본값이 fp16으로 설정됩니다. 가중치를 다른 형식으로 가져오려면, <code>torch_dtype</code> 파라미터를 사용하세요:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> AutoModelForCausalLM, AutoTokenizer
model_id = <span class="hljs-string">&quot;TheBloke/zephyr-7B-alpha-AWQ&quot;</span>
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-111qm6v">추론을 더욱 가속화하기 위해 AWQ 양자화와 <a href="../perf_infer_gpu_one#flashattention-2">FlashAttention-2</a> 를 결합 할 수 있습니다:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(<span class="hljs-string">&quot;TheBloke/zephyr-7B-alpha-AWQ&quot;</span>, attn_implementation=<span class="hljs-string">&quot;flash_attention_2&quot;</span>, device_map=<span class="hljs-string">&quot;cuda:0&quot;</span>)<!-- HTML_TAG_END --></pre></div> <h2 class="relative group"><a id="fused-modules" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#fused-modules"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>퓨즈된 모듈</span></h2> <p data-svelte-h="svelte-g2x0tg">퓨즈된 모듈은 정확도와 성능을 개선합니다. 퓨즈된 모듈은 <a href="https://huggingface.co/meta-llama" rel="nofollow">Llama</a> 아키텍처와 <a href="https://huggingface.co/mistralai/Mistral-7B-v0.1" rel="nofollow">Mistral</a> 아키텍처의 AWQ모듈에 기본적으로 지원됩니다. 그러나 지원되지 않는 아키텍처에 대해서도 AWQ 모듈을 퓨즈할 수 있습니다.</p> <div class="course-tip course-tip-orange bg-gradient-to-br dark:bg-gradient-to-r before:border-orange-500 dark:before:border-orange-800 from-orange-50 dark:from-gray-900 to-white dark:to-gray-950 border border-orange-50 text-orange-700 dark:text-gray-400"><p data-svelte-h="svelte-epevub">퓨즈된 모듈은 FlashAttention-2와 같은 다른 최적화 기술과 결합할 수 없습니다.</p></div> <div class="flex space-x-2 items-center my-1.5 mr-8 h-7 !pl-0 -mx-3 md:mx-0"><div class="flex items-center border rounded-lg px-1.5 py-1 leading-none select-none text-smd border-gray-800 bg-black dark:bg-gray-700 text-white">supported architectures </div><div class="flex items-center border rounded-lg px-1.5 py-1 leading-none select-none text-smd text-gray-500 cursor-pointer opacity-90 hover:text-gray-700 dark:hover:text-gray-200 hover:shadow-sm">unsupported architectures </div></div> <div class="language-select"><p data-svelte-h="svelte-1imy52d">지원되는 아키텍처에서 퓨즈된 모듈을 활성화하려면, <code>AwqConfig</code> 를 생성하고 매개변수 <code>fuse_max_seq_len</code><code>do_fuse=True</code>를 설정해야 합니다. <code>fuse_max_seq_len</code> 매개변수는 전체 시퀀스 길이로, 컨텍스트 길이와 예상 생성 길이를 포함해야 합니다. 안전하게 사용하기 위해 더 큰 값으로 설정할 수 있습니다.</p> <p data-svelte-h="svelte-16m2a83">예를 들어, <a href="https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-AWQ" rel="nofollow">TheBloke/Mistral-7B-OpenOrca-AWQ</a> 모델의 AWQ 모듈을 퓨즈해보겠습니다.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> AwqConfig, AutoModelForCausalLM
model_id = <span class="hljs-string">&quot;TheBloke/Mistral-7B-OpenOrca-AWQ&quot;</span>
quantization_config = AwqConfig(
bits=<span class="hljs-number">4</span>,
fuse_max_seq_len=<span class="hljs-number">512</span>,
do_fuse=<span class="hljs-literal">True</span>,
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config).to(<span class="hljs-number">0</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1utzrt0"><a href="https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-AWQ" rel="nofollow">TheBloke/Mistral-7B-OpenOrca-AWQ</a> 모델은 퓨즈된 모듈이 있는 경우와 없는 경우 모두 <code>batch_size=1</code> 로 성능 평가되었습니다.</p> <figcaption class="text-center text-gray-500 text-lg" data-svelte-h="svelte-uubs04">퓨즈되지 않은 모듈</figcaption> <table data-svelte-h="svelte-ty84pu"><thead><tr><th align="right">배치 크기</th> <th align="right">프리필 길이</th> <th align="right">디코드 길이</th> <th align="right">프리필 토큰/초</th> <th align="right">디코드 토큰/초</th> <th align="left">메모리 (VRAM)</th></tr></thead> <tbody><tr><td align="right">1</td> <td align="right">32</td> <td align="right">32</td> <td align="right">60.0984</td> <td align="right">38.4537</td> <td align="left">4.50 GB (5.68%)</td></tr> <tr><td align="right">1</td> <td align="right">64</td> <td align="right">64</td> <td align="right">1333.67</td> <td align="right">31.6604</td> <td align="left">4.50 GB (5.68%)</td></tr> <tr><td align="right">1</td> <td align="right">128</td> <td align="right">128</td> <td align="right">2434.06</td> <td align="right">31.6272</td> <td align="left">4.50 GB (5.68%)</td></tr> <tr><td align="right">1</td> <td align="right">256</td> <td align="right">256</td> <td align="right">3072.26</td> <td align="right">38.1731</td> <td align="left">4.50 GB (5.68%)</td></tr> <tr><td align="right">1</td> <td align="right">512</td> <td align="right">512</td> <td align="right">3184.74</td> <td align="right">31.6819</td> <td align="left">4.59 GB (5.80%)</td></tr> <tr><td align="right">1</td> <td align="right">1024</td> <td align="right">1024</td> <td align="right">3148.18</td> <td align="right">36.8031</td> <td align="left">4.81 GB (6.07%)</td></tr> <tr><td align="right">1</td> <td align="right">2048</td> <td align="right">2048</td> <td align="right">2927.33</td> <td align="right">35.2676</td> <td align="left">5.73 GB (7.23%)</td></tr></tbody></table> <figcaption class="text-center text-gray-500 text-lg" data-svelte-h="svelte-1fro5y6">퓨즈된 모듈</figcaption> <table data-svelte-h="svelte-1irqixs"><thead><tr><th align="right">배치 크기</th> <th align="right">프리필 길이</th> <th align="right">디코드 길이</th> <th align="right">프리필 토큰/초</th> <th align="right">디코드 토큰/초</th> <th align="left">메모리 (VRAM)</th></tr></thead> <tbody><tr><td align="right">1</td> <td align="right">32</td> <td align="right">32</td> <td align="right">81.4899</td> <td align="right">80.2569</td> <td align="left">4.00 GB (5.05%)</td></tr> <tr><td align="right">1</td> <td align="right">64</td> <td align="right">64</td> <td align="right">1756.1</td> <td align="right">106.26</td> <td align="left">4.00 GB (5.05%)</td></tr> <tr><td align="right">1</td> <td align="right">128</td> <td align="right">128</td> <td align="right">2479.32</td> <td align="right">105.631</td> <td align="left">4.00 GB (5.06%)</td></tr> <tr><td align="right">1</td> <td align="right">256</td> <td align="right">256</td> <td align="right">1813.6</td> <td align="right">85.7485</td> <td align="left">4.01 GB (5.06%)</td></tr> <tr><td align="right">1</td> <td align="right">512</td> <td align="right">512</td> <td align="right">2848.9</td> <td align="right">97.701</td> <td align="left">4.11 GB (5.19%)</td></tr> <tr><td align="right">1</td> <td align="right">1024</td> <td align="right">1024</td> <td align="right">3044.35</td> <td align="right">87.7323</td> <td align="left">4.41 GB (5.57%)</td></tr> <tr><td align="right">1</td> <td align="right">2048</td> <td align="right">2048</td> <td align="right">2715.11</td> <td align="right">89.4709</td> <td align="left">5.57 GB (7.04%)</td></tr></tbody></table> <p data-svelte-h="svelte-1fux1xr">퓨즈된 모듈 및 퓨즈되지 않은 모듈의 속도와 처리량은 <a href="https://github.com/huggingface/optimum-benchmark" rel="nofollow">optimum-benchmark</a>라이브러리를 사용하여 테스트 되었습니다.</p> <div class="flex gap-4" data-svelte-h="svelte-f84qab"><div><img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/quantization/fused_forward_memory_plot.png" alt="generate throughput per batch size"> <figcaption class="mt-2 text-center text-sm text-gray-500">포워드 피크 메모리 (forward peak memory)/배치 크기</figcaption></div> <div><img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/quantization/fused_generate_throughput_plot.png" alt="forward latency per batch size"> <figcaption class="mt-2 text-center text-sm text-gray-500">생성 처리량/배치크기</figcaption></div></div> </div> <h2 class="relative group"><a id="exllama-v2-support" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#exllama-v2-support"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>ExLlama-v2 서포트</span></h2> <p data-svelte-h="svelte-189uhby">최신 버전 <code>autoawq</code>는 빠른 프리필과 디코딩을 위해 ExLlama-v2 커널을 지원합니다. 시작하기 위해 먼저 최신 버전 <code>autoawq</code> 를 설치하세요 :</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->pip install git+https://github.com/casper-hansen/AutoAWQ.git<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1up0kbc">매개변수를 <code>version=&quot;exllama&quot;</code>로 설정해 <code>AwqConfig()</code>를 생성하고 모델에 넘겨주세요.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> AutoModelForCausalLM, AutoTokenizer, AwqConfig
quantization_config = AwqConfig(version=<span class="hljs-string">&quot;exllama&quot;</span>)
model = AutoModelForCausalLM.from_pretrained(
<span class="hljs-string">&quot;TheBloke/Mistral-7B-Instruct-v0.1-AWQ&quot;</span>,
quantization_config=quantization_config,
device_map=<span class="hljs-string">&quot;auto&quot;</span>,
)
input_ids = torch.randint(<span class="hljs-number">0</span>, <span class="hljs-number">100</span>, (<span class="hljs-number">1</span>, <span class="hljs-number">128</span>), dtype=torch.long, device=<span class="hljs-string">&quot;cuda&quot;</span>)
output = model(input_ids)
<span class="hljs-built_in">print</span>(output.logits)
tokenizer = AutoTokenizer.from_pretrained(<span class="hljs-string">&quot;TheBloke/Mistral-7B-Instruct-v0.1-AWQ&quot;</span>)
input_ids = tokenizer.encode(<span class="hljs-string">&quot;How to make a cake&quot;</span>, return_tensors=<span class="hljs-string">&quot;pt&quot;</span>).to(model.device)
output = model.generate(input_ids, do_sample=<span class="hljs-literal">True</span>, max_length=<span class="hljs-number">50</span>, pad_token_id=<span class="hljs-number">50256</span>)
<span class="hljs-built_in">print</span>(tokenizer.decode(output[<span class="hljs-number">0</span>], skip_special_tokens=<span class="hljs-literal">True</span>))<!-- HTML_TAG_END --></pre></div> <div class="course-tip course-tip-orange bg-gradient-to-br dark:bg-gradient-to-r before:border-orange-500 dark:before:border-orange-800 from-orange-50 dark:from-gray-900 to-white dark:to-gray-950 border border-orange-50 text-orange-700 dark:text-gray-400"><p data-svelte-h="svelte-1y7b24l">이 기능은 AMD GPUs에서 지원됩니다.</p></div> <a class="!text-gray-400 !no-underline text-sm flex items-center not-prose mt-4" href="https://github.com/huggingface/transformers/blob/main/docs/source/ko/quantization/awq.md" target="_blank"><span data-svelte-h="svelte-1kd6by1">&lt;</span> <span data-svelte-h="svelte-x0xyl0">&gt;</span> <span data-svelte-h="svelte-1dajgef"><span class="underline ml-1.5">Update</span> on GitHub</span></a> <p></p>
<script>
{
__sveltekit_1hrx8 = {
assets: "/docs/transformers/main/ko",
base: "/docs/transformers/main/ko",
env: {}
};
const element = document.currentScript.parentElement;
const data = [null,null];
Promise.all([
import("/docs/transformers/main/ko/_app/immutable/entry/start.9aa88961.js"),
import("/docs/transformers/main/ko/_app/immutable/entry/app.84fb67c3.js")
]).then(([kit, app]) => {
kit.start(app, element, {
node_ids: [0, 51],
data,
form: null,
error: null
});
});
}
</script>

Xet Storage Details

Size:
32 kB
·
Xet hash:
4e18d7fa5c715435aa9a95ceefac2afe575bc1d55b32a0e3313e34fd64a11a7c

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.