Buckets:
| <meta charset="utf-8" /><meta name="hf:doc:metadata" content="{"title":"Image-text-to-text","local":"image-text-to-text","sections":[{"title":"Streaming","local":"streaming","sections":[],"depth":2},{"title":"Fit models in smaller hardware","local":"fit-models-in-smaller-hardware","sections":[],"depth":2},{"title":"Further Reading","local":"further-reading","sections":[],"depth":2}],"depth":1}"> | |
| <link href="/docs/transformers/main/en/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload"> | |
| <link rel="modulepreload" href="/docs/transformers/main/en/_app/immutable/entry/start.2135b7e6.js"> | |
| <link rel="modulepreload" href="/docs/transformers/main/en/_app/immutable/chunks/scheduler.25b97de1.js"> | |
| <link rel="modulepreload" href="/docs/transformers/main/en/_app/immutable/chunks/singletons.0f2b7d5f.js"> | |
| <link rel="modulepreload" href="/docs/transformers/main/en/_app/immutable/chunks/index.e188933d.js"> | |
| <link rel="modulepreload" href="/docs/transformers/main/en/_app/immutable/chunks/paths.3d04d2c6.js"> | |
| <link rel="modulepreload" href="/docs/transformers/main/en/_app/immutable/entry/app.24372c84.js"> | |
| <link rel="modulepreload" href="/docs/transformers/main/en/_app/immutable/chunks/index.d9030fc9.js"> | |
| <link rel="modulepreload" href="/docs/transformers/main/en/_app/immutable/nodes/0.026d2fdd.js"> | |
| <link rel="modulepreload" href="/docs/transformers/main/en/_app/immutable/chunks/each.e59479a4.js"> | |
| <link rel="modulepreload" href="/docs/transformers/main/en/_app/immutable/nodes/406.467fe3ec.js"> | |
| <link rel="modulepreload" href="/docs/transformers/main/en/_app/immutable/chunks/CodeBlock.e6cd0d95.js"> | |
| <link rel="modulepreload" href="/docs/transformers/main/en/_app/immutable/chunks/DocNotebookDropdown.5ea6cb78.js"> | |
| <link rel="modulepreload" href="/docs/transformers/main/en/_app/immutable/chunks/globals.7f7f1b26.js"> | |
| <link rel="modulepreload" href="/docs/transformers/main/en/_app/immutable/chunks/EditOnGithub.91d95064.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{"title":"Image-text-to-text","local":"image-text-to-text","sections":[{"title":"Streaming","local":"streaming","sections":[],"depth":2},{"title":"Fit models in smaller hardware","local":"fit-models-in-smaller-hardware","sections":[],"depth":2},{"title":"Further Reading","local":"further-reading","sections":[],"depth":2}],"depth":1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h1 class="relative group"><a id="image-text-to-text" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#image-text-to-text"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Image-text-to-text</span></h1> <div class="flex space-x-1 absolute z-10 right-0 top-0"> <div class="relative colab-dropdown "> <button class=" " type="button"> <img alt="Open In Colab" class="!m-0" src="https://colab.research.google.com/assets/colab-badge.svg"> </button> </div> <div class="relative colab-dropdown "> <button class=" " type="button"> <img alt="Open In Studio Lab" class="!m-0" src="https://studiolab.sagemaker.aws/studiolab.svg"> </button> </div></div> <p data-svelte-h="svelte-16zti1">Image-text-to-text models, also known as vision language models (VLMs), are language models that take an image input. These models can tackle various tasks, from visual question answering to image segmentation. This task shares many similarities with image-to-text, but with some overlapping use cases like image captioning. Image-to-text models only take image inputs and often accomplish a specific task, whereas VLMs take open-ended text and image inputs and are more generalist models.</p> <p data-svelte-h="svelte-1fk5u1j">In this guide, we provide a brief overview of VLMs and show how to use them with Transformers for inference.</p> <p data-svelte-h="svelte-ql8pa">To begin with, there are multiple types of VLMs:</p> <ul data-svelte-h="svelte-1vpzkb0"><li>base models used for fine-tuning</li> <li>chat fine-tuned models for conversation</li> <li>instruction fine-tuned models</li></ul> <p data-svelte-h="svelte-1g8e8bl">This guide focuses on inference with an instruction-tuned model.</p> <p data-svelte-h="svelte-5jp6fp">Let’s begin installing the dependencies.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->pip install -q transformers accelerate flash_attn <!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1a1j2kh">Let’s initialize the model and the processor.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> AutoProcessor, Idefics2ForConditionalGeneration | |
| <span class="hljs-keyword">import</span> torch | |
| device = torch.device(<span class="hljs-string">"cuda"</span>) | |
| model = Idefics2ForConditionalGeneration.from_pretrained( | |
| <span class="hljs-string">"HuggingFaceM4/idefics2-8b"</span>, | |
| torch_dtype=torch.bfloat16, | |
| attn_implementation=<span class="hljs-string">"flash_attention_2"</span>, | |
| ).to(device) | |
| processor = AutoProcessor.from_pretrained(<span class="hljs-string">"HuggingFaceM4/idefics2-8b"</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-14g5m6d">This model has a <a href="./chat_templating">chat template</a> that helps user parse chat outputs. Moreover, the model can also accept multiple images as input in a single conversation or message. We will now prepare the inputs.</p> <p data-svelte-h="svelte-5oehs9">The image inputs look like the following.</p> <div class="flex justify-center" data-svelte-h="svelte-z5b734"><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png" alt="Two cats sitting on a net"></div> <div class="flex justify-center" data-svelte-h="svelte-vw53ct"><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg" alt="A bee on a pink flower"></div> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> PIL <span class="hljs-keyword">import</span> Image | |
| <span class="hljs-keyword">import</span> requests | |
| img_urls =[<span class="hljs-string">"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png"</span>, | |
| <span class="hljs-string">"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"</span>] | |
| images = [Image.<span class="hljs-built_in">open</span>(requests.get(img_urls[<span class="hljs-number">0</span>], stream=<span class="hljs-literal">True</span>).raw), | |
| Image.<span class="hljs-built_in">open</span>(requests.get(img_urls[<span class="hljs-number">1</span>], stream=<span class="hljs-literal">True</span>).raw)]<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1wl2kpw">Below is an example of the chat template. We can feed conversation turns and the last message as an input by appending it at the end of the template.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->messages = [ | |
| { | |
| <span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, | |
| <span class="hljs-string">"content"</span>: [ | |
| {<span class="hljs-string">"type"</span>: <span class="hljs-string">"image"</span>}, | |
| {<span class="hljs-string">"type"</span>: <span class="hljs-string">"text"</span>, <span class="hljs-string">"text"</span>: <span class="hljs-string">"What do we see in this image?"</span>}, | |
| ] | |
| }, | |
| { | |
| <span class="hljs-string">"role"</span>: <span class="hljs-string">"assistant"</span>, | |
| <span class="hljs-string">"content"</span>: [ | |
| {<span class="hljs-string">"type"</span>: <span class="hljs-string">"text"</span>, <span class="hljs-string">"text"</span>: <span class="hljs-string">"In this image we can see two cats on the nets."</span>}, | |
| ] | |
| }, | |
| { | |
| <span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, | |
| <span class="hljs-string">"content"</span>: [ | |
| {<span class="hljs-string">"type"</span>: <span class="hljs-string">"image"</span>}, | |
| {<span class="hljs-string">"type"</span>: <span class="hljs-string">"text"</span>, <span class="hljs-string">"text"</span>: <span class="hljs-string">"And how about this image?"</span>}, | |
| ] | |
| }, | |
| ]<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-dclbpo">We will now call the processors’ <a href="/docs/transformers/main/en/main_classes/processors#transformers.ProcessorMixin.apply_chat_template">apply_chat_template()</a> method to preprocess its output along with the image inputs.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->prompt = processor.apply_chat_template(messages, add_generation_prompt=<span class="hljs-literal">True</span>) | |
| inputs = processor(text=prompt, images=[images[<span class="hljs-number">0</span>], images[<span class="hljs-number">1</span>]], return_tensors=<span class="hljs-string">"pt"</span>).to(device)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1c12q4j">We can now pass the preprocessed inputs to the model.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">with</span> torch.no_grad(): | |
| generated_ids = model.generate(**inputs, max_new_tokens=<span class="hljs-number">500</span>) | |
| generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=<span class="hljs-literal">True</span>) | |
| <span class="hljs-built_in">print</span>(generated_texts) | |
| <span class="hljs-comment">## ['User: What do we see in this image? \nAssistant: In this image we can see two cats on the nets. \nUser: And how about this image? \nAssistant: In this image we can see flowers, plants and insect.']</span><!-- HTML_TAG_END --></pre></div> <h2 class="relative group"><a id="streaming" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#streaming"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Streaming</span></h2> <p data-svelte-h="svelte-ycqsyr">We can use <a href="./generation_strategies#streaming">text streaming</a> for a better generation experience. Transformers supports streaming with the <a href="/docs/transformers/main/en/internal/generation_utils#transformers.TextStreamer">TextStreamer</a> or <a href="/docs/transformers/main/en/internal/generation_utils#transformers.TextIteratorStreamer">TextIteratorStreamer</a> classes. We will use the <a href="/docs/transformers/main/en/internal/generation_utils#transformers.TextIteratorStreamer">TextIteratorStreamer</a> with IDEFICS-8B.</p> <p data-svelte-h="svelte-opb7dg">Assume we have an application that keeps chat history and takes in the new user input. We will preprocess the inputs as usual and initialize <a href="/docs/transformers/main/en/internal/generation_utils#transformers.TextIteratorStreamer">TextIteratorStreamer</a> to handle the generation in a separate thread. This allows you to stream the generated text tokens in real-time. Any generation arguments can be passed to <a href="/docs/transformers/main/en/internal/generation_utils#transformers.TextIteratorStreamer">TextIteratorStreamer</a>.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> time | |
| <span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> TextIteratorStreamer | |
| <span class="hljs-keyword">from</span> threading <span class="hljs-keyword">import</span> Thread | |
| <span class="hljs-keyword">def</span> <span class="hljs-title function_">model_inference</span>(<span class="hljs-params"> | |
| user_prompt, | |
| chat_history, | |
| max_new_tokens, | |
| images | |
| </span>): | |
| user_prompt = { | |
| <span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, | |
| <span class="hljs-string">"content"</span>: [ | |
| {<span class="hljs-string">"type"</span>: <span class="hljs-string">"image"</span>}, | |
| {<span class="hljs-string">"type"</span>: <span class="hljs-string">"text"</span>, <span class="hljs-string">"text"</span>: user_prompt}, | |
| ] | |
| } | |
| chat_history.append(user_prompt) | |
| streamer = TextIteratorStreamer( | |
| processor.tokenizer, | |
| skip_prompt=<span class="hljs-literal">True</span>, | |
| timeout=<span class="hljs-number">5.0</span>, | |
| ) | |
| generation_args = { | |
| <span class="hljs-string">"max_new_tokens"</span>: max_new_tokens, | |
| <span class="hljs-string">"streamer"</span>: streamer, | |
| <span class="hljs-string">"do_sample"</span>: <span class="hljs-literal">False</span> | |
| } | |
| <span class="hljs-comment"># add_generation_prompt=True makes model generate bot response</span> | |
| prompt = processor.apply_chat_template(chat_history, add_generation_prompt=<span class="hljs-literal">True</span>) | |
| inputs = processor( | |
| text=prompt, | |
| images=images, | |
| return_tensors=<span class="hljs-string">"pt"</span>, | |
| ).to(device) | |
| generation_args.update(inputs) | |
| thread = Thread( | |
| target=model.generate, | |
| kwargs=generation_args, | |
| ) | |
| thread.start() | |
| acc_text = <span class="hljs-string">""</span> | |
| <span class="hljs-keyword">for</span> text_token <span class="hljs-keyword">in</span> streamer: | |
| time.sleep(<span class="hljs-number">0.04</span>) | |
| acc_text += text_token | |
| <span class="hljs-keyword">if</span> acc_text.endswith(<span class="hljs-string">"<end_of_utterance>"</span>): | |
| acc_text = acc_text[:-<span class="hljs-number">18</span>] | |
| <span class="hljs-keyword">yield</span> acc_text | |
| thread.join()<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1jlbesg">Now let’s call the <code>model_inference</code> function we created and stream the values.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->generator = model_inference( | |
| user_prompt=<span class="hljs-string">"And what is in this image?"</span>, | |
| chat_history=messages, | |
| max_new_tokens=<span class="hljs-number">100</span>, | |
| images=images | |
| ) | |
| <span class="hljs-keyword">for</span> value <span class="hljs-keyword">in</span> generator: | |
| <span class="hljs-built_in">print</span>(value) | |
| <span class="hljs-comment"># In</span> | |
| <span class="hljs-comment"># In this</span> | |
| <span class="hljs-comment"># In this image ...</span><!-- HTML_TAG_END --></pre></div> <h2 class="relative group"><a id="fit-models-in-smaller-hardware" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#fit-models-in-smaller-hardware"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Fit models in smaller hardware</span></h2> <p data-svelte-h="svelte-l93gl1">VLMs are often large and need to be optimized to fit on smaller hardware. Transformers supports many model quantization libraries, and here we will only show int8 quantization with <a href="./quantization/quanto#quanto">Quanto</a>. int8 quantization offers memory improvements up to 75 percent (if all weights are quantized). However it is no free lunch, since 8-bit is not a CUDA-native precision, the weights are quantized back and forth on the fly, which adds up to latency.</p> <p data-svelte-h="svelte-1kehp8k">First, install dependencies.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->pip install -U quanto bitsandbytes<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-hu48ru">To quantize a model during loading, we need to first create <a href="/docs/transformers/main/en/main_classes/quantization#transformers.QuantoConfig">QuantoConfig</a>. Then load the model as usual, but pass <code>quantization_config</code> during model initialization.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> Idefics2ForConditionalGeneration, AutoTokenizer, QuantoConfig | |
| model_id = <span class="hljs-string">"HuggingFaceM4/idefics2-8b"</span> | |
| quantization_config = QuantoConfig(weights=<span class="hljs-string">"int8"</span>) | |
| quantized_model = Idefics2ForConditionalGeneration.from_pretrained(model_id, device_map=<span class="hljs-string">"cuda"</span>, quantization_config=quantization_config)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-rt04rk">And that’s it, we can use the model the same way with no changes.</p> <h2 class="relative group"><a id="further-reading" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#further-reading"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Further Reading</span></h2> <p data-svelte-h="svelte-n9r3vs">Here are some more resources for the image-text-to-text task.</p> <ul data-svelte-h="svelte-yyt3p8"><li><a href="https://huggingface.co/tasks/image-text-to-text" rel="nofollow">Image-text-to-text task page</a> covers model types, use cases, datasets, and more.</li> <li><a href="https://huggingface.co/blog/vlms" rel="nofollow">Vision Language Models Explained</a> is a blog post that covers everything about vision language models and supervised fine-tuning using <a href="https://huggingface.co/docs/trl/en/index" rel="nofollow">TRL</a>.</li></ul> <a class="!text-gray-400 !no-underline text-sm flex items-center not-prose mt-4" href="https://github.com/huggingface/transformers/blob/main/docs/source/en/tasks/image_text_to_text.md" target="_blank"><span data-svelte-h="svelte-1kd6by1"><</span> <span data-svelte-h="svelte-x0xyl0">></span> <span data-svelte-h="svelte-1dajgef"><span class="underline ml-1.5">Update</span> on GitHub</span></a> <p></p> | |
| <script> | |
| { | |
| __sveltekit_1xexzbk = { | |
| assets: "/docs/transformers/main/en", | |
| base: "/docs/transformers/main/en", | |
| env: {} | |
| }; | |
| const element = document.currentScript.parentElement; | |
| const data = [null,null]; | |
| Promise.all([ | |
| import("/docs/transformers/main/en/_app/immutable/entry/start.2135b7e6.js"), | |
| import("/docs/transformers/main/en/_app/immutable/entry/app.24372c84.js") | |
| ]).then(([kit, app]) => { | |
| kit.start(app, element, { | |
| node_ids: [0, 406], | |
| data, | |
| form: null, | |
| error: null | |
| }); | |
| }); | |
| } | |
| </script> | |
Xet Storage Details
- Size:
- 33.9 kB
- Xet hash:
- b8541d3c05d19991f7360e6d1f125a62a386c1f0c08ca9765d199f1202b10913
·
Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.