| <meta charset="utf-8" /><meta name="hf:doc:metadata" content="{"title":"Video-text-to-text","local":"video-text-to-text","sections":[],"depth":1}"> | |
| <link href="/docs/transformers/main/en/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload"> | |
| <link rel="modulepreload" href="/docs/transformers/main/en/_app/immutable/entry/start.2135b7e6.js"> | |
| <link rel="modulepreload" href="/docs/transformers/main/en/_app/immutable/chunks/scheduler.25b97de1.js"> | |
| <link rel="modulepreload" href="/docs/transformers/main/en/_app/immutable/chunks/singletons.0f2b7d5f.js"> | |
| <link rel="modulepreload" href="/docs/transformers/main/en/_app/immutable/chunks/index.e188933d.js"> | |
| <link rel="modulepreload" href="/docs/transformers/main/en/_app/immutable/chunks/paths.3d04d2c6.js"> | |
| <link rel="modulepreload" href="/docs/transformers/main/en/_app/immutable/entry/app.24372c84.js"> | |
| <link rel="modulepreload" href="/docs/transformers/main/en/_app/immutable/chunks/index.d9030fc9.js"> | |
| <link rel="modulepreload" href="/docs/transformers/main/en/_app/immutable/nodes/0.026d2fdd.js"> | |
| <link rel="modulepreload" href="/docs/transformers/main/en/_app/immutable/chunks/each.e59479a4.js"> | |
| <link rel="modulepreload" href="/docs/transformers/main/en/_app/immutable/nodes/424.2ea51dd1.js"> | |
| <link rel="modulepreload" href="/docs/transformers/main/en/_app/immutable/chunks/CodeBlock.e6cd0d95.js"> | |
| <link rel="modulepreload" href="/docs/transformers/main/en/_app/immutable/chunks/DocNotebookDropdown.5ea6cb78.js"> | |
| <link rel="modulepreload" href="/docs/transformers/main/en/_app/immutable/chunks/globals.7f7f1b26.js"> | |
| <link rel="modulepreload" href="/docs/transformers/main/en/_app/immutable/chunks/EditOnGithub.91d95064.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{"title":"Video-text-to-text","local":"video-text-to-text","sections":[],"depth":1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h1 class="relative group"><a id="video-text-to-text" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#video-text-to-text"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Video-text-to-text</span></h1> <div class="flex space-x-1 absolute z-10 right-0 top-0"> <div class="relative colab-dropdown "> <button class=" " type="button"> <img alt="Open In Colab" class="!m-0" src="https://colab.research.google.com/assets/colab-badge.svg"> </button> </div> <div class="relative colab-dropdown "> <button class=" " type="button"> <img alt="Open In Studio Lab" class="!m-0" src="https://studiolab.sagemaker.aws/studiolab.svg"> </button> </div></div> <p data-svelte-h="svelte-lima51">Video-text-to-text models, also known as video language models or vision language models with video input, are language models that take a video input. 
These models can tackle various tasks, from video question answering to video captioning.</p> <p data-svelte-h="svelte-1vltrkp">These models have nearly the same architecture as <a href="../image_text_to_text.md">image-text-to-text</a> models except for some changes to accept video data, since video data is essentially image frames with temporal dependencies. Some image-text-to-text models take in multiple images, but this alone is inadequate for a model to accept videos. Moreover, video-text-to-text models are often trained with all vision modalities. Each training example might contain a video, multiple videos, an image, or multiple images. Some of these models can also take interleaved inputs. For example, you can refer to a specific video inside a string of text by adding a video token to the text, as in “What is happening in this video? <code><video></code>”.</p> <p data-svelte-h="svelte-srvvgy">In this guide, we provide a brief overview of video LMs and show how to use them with Transformers for inference.</p> <p data-svelte-h="svelte-jc4sjd">To begin with, there are multiple types of video LMs:</p> <ul data-svelte-h="svelte-1vpzkb0"><li>base models used for fine-tuning</li> <li>chat fine-tuned models for conversation</li> <li>instruction fine-tuned models</li></ul> <p data-svelte-h="svelte-1ds11b0">This guide focuses on inference with an instruction-tuned model, <a href="https://huggingface.co/llava-hf/llava-interleave-qwen-7b-hf" rel="nofollow">llava-hf/llava-interleave-qwen-7b-hf</a>, which can take in interleaved data. 
Alternatively, you can try <a href="https://huggingface.co/llava-hf/llava-interleave-qwen-0.5b-hf" rel="nofollow">llava-interleave-qwen-0.5b-hf</a> if your hardware doesn’t allow running a 7B model.</p> <p data-svelte-h="svelte-5jp6fp">Let’s begin installing the dependencies.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->pip install -q transformers accelerate flash_attn <!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1a1j2kh">Let’s initialize the model and the processor.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" 
aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> LlavaProcessor, LlavaForConditionalGeneration | |
| <span class="hljs-keyword">import</span> torch | |
| model_id = <span class="hljs-string">"llava-hf/llava-interleave-qwen-0.5b-hf"</span> | |
| processor = LlavaProcessor.from_pretrained(model_id) | |
| model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16) | |
| model.to(<span class="hljs-string">"cuda"</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1jecxg9">Some models directly consume the <code><video></code> token, and others accept <code><image></code> tokens equal to the number of sampled frames. This model handles videos in the latter fashion. We will write a simple utility to handle image tokens, and another utility to get a video from a url and sample frames from it.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> uuid | |
| <span class="hljs-keyword">import</span> requests | |
| <span class="hljs-keyword">import</span> cv2 | |
| <span class="hljs-keyword">from</span> PIL <span class="hljs-keyword">import</span> Image | |
| <span class="hljs-keyword">def</span> <span class="hljs-title function_">replace_video_with_images</span>(<span class="hljs-params">text, frames</span>): | |
|     <span class="hljs-keyword">return</span> text.replace(<span class="hljs-string">"<video>"</span>, <span class="hljs-string">"<image>"</span> * frames) | |
| <span class="hljs-keyword">def</span> <span class="hljs-title function_">sample_frames</span>(<span class="hljs-params">url, num_frames</span>): | |
|     response = requests.get(url) | |
|     path_id = <span class="hljs-built_in">str</span>(uuid.uuid4()) | |
|     path = <span class="hljs-string">f"./<span class="hljs-subst">{path_id}</span>.mp4"</span> | |
|     <span class="hljs-keyword">with</span> <span class="hljs-built_in">open</span>(path, <span class="hljs-string">"wb"</span>) <span class="hljs-keyword">as</span> f: | |
|         f.write(response.content) | |
|     video = cv2.VideoCapture(path) | |
|     total_frames = <span class="hljs-built_in">int</span>(video.get(cv2.CAP_PROP_FRAME_COUNT)) | |
|     <span class="hljs-comment"># Guard against a zero interval when the video has fewer frames than requested</span> | |
|     interval = <span class="hljs-built_in">max</span>(total_frames // num_frames, <span class="hljs-number">1</span>) | |
|     frames = [] | |
|     <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(total_frames): | |
|         ret, frame = video.read() | |
|         <span class="hljs-comment"># Skip unreadable frames before converting; frame is None when ret is False</span> | |
|         <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> ret: | |
|             <span class="hljs-keyword">continue</span> | |
|         <span class="hljs-comment"># Keep at most num_frames frames so the count matches the image tokens in the prompt</span> | |
|         <span class="hljs-keyword">if</span> i % interval == <span class="hljs-number">0</span> <span class="hljs-keyword">and</span> <span class="hljs-built_in">len</span>(frames) < num_frames: | |
|             pil_img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)) | |
|             frames.append(pil_img) | |
|     video.release() | |
|     <span class="hljs-keyword">return</span> frames<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1cbhwz2">Let’s get our inputs. We will sample frames and concatenate them.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->video_1 = <span class="hljs-string">"https://huggingface.co/spaces/merve/llava-interleave/resolve/main/cats_1.mp4"</span> | |
| video_2 = <span class="hljs-string">"https://huggingface.co/spaces/merve/llava-interleave/resolve/main/cats_2.mp4"</span> | |
| video_1 = sample_frames(video_1, <span class="hljs-number">6</span>) | |
| video_2 = sample_frames(video_2, <span class="hljs-number">6</span>) | |
| videos = video_1 + video_2 | |
| videos | |
| <span class="hljs-comment"># [<PIL.Image.Image image mode=RGB size=1920x1080>,</span> | |
| <span class="hljs-comment"># <PIL.Image.Image image mode=RGB size=1920x1080>,</span> | |
| <span class="hljs-comment"># <PIL.Image.Image image mode=RGB size=1920x1080>, ...]</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1wvt5vc">Both videos have cats.</p> <div class="container" data-svelte-h="svelte-1a6pc9g"><div class="video-container"><video width="400" controls=""><source src="https://huggingface.co/spaces/merve/llava-interleave/resolve/main/cats_1.mp4" type="video/mp4"></video></div></div> <div class="video-container" data-svelte-h="svelte-1jly6uy"><video width="400" controls=""><source src="https://huggingface.co/spaces/merve/llava-interleave/resolve/main/cats_2.mp4" type="video/mp4"></video></div> <p data-svelte-h="svelte-1y3zqy2">Now we can preprocess the inputs.</p> <p data-svelte-h="svelte-sdf91z">This model has a prompt template that looks like the following. First, we’ll put all the sampled frames into one list. Since we sampled six frames from each video, we will insert 12 <code><image></code> tokens into our prompt. Add <code>assistant</code> at the end of the prompt to trigger the model to give answers. 
Then we can preprocess.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->user_prompt = <span class="hljs-string">"Are these two cats in these two videos doing the same thing?"</span> | |
| toks = <span class="hljs-string">"<image>"</span> * <span class="hljs-number">12</span> | |
| prompt = <span class="hljs-string">"<|im_start|>user"</span> + toks + <span class="hljs-string">f"\n<span class="hljs-subst">{user_prompt}</span><|im_end|><|im_start|>assistant"</span> | |
| inputs = processor(prompt, images=videos).to(model.device, model.dtype)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-xm2r0y">We can now call <a href="/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationMixin.generate">generate()</a> for inference. The decoded output repeats the question from our input before the answer, so we keep only the text that follows the prompt and the <code>assistant</code> marker.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->output = model.generate(**inputs, max_new_tokens=<span class="hljs-number">100</span>, do_sample=<span class="hljs-literal">False</span>) | |
| <span class="hljs-built_in">print</span>(processor.decode(output[<span class="hljs-number">0</span>][<span class="hljs-number">2</span>:], skip_special_tokens=<span class="hljs-literal">True</span>)[<span class="hljs-built_in">len</span>(user_prompt)+<span class="hljs-number">10</span>:]) | |
| <span class="hljs-comment"># The first cat is shown in a relaxed state, with its eyes closed and a content expression, while the second cat is shown in a more active state, with its mouth open wide, possibly in a yawn or a vocalization.</span> | |
| <!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1ui1acl">And voila!</p> <p data-svelte-h="svelte-1elywkp">To learn more about chat templates and token streaming for video-text-to-text models, refer to the <a href="../image_text_to_text">image-text-to-text</a> task guide because these models work similarly.</p> <p></p> | |
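<p>As a minimal sketch of the frame-sampling and prompt-building steps in this guide, the helpers below pick evenly spaced frame indices and build a prompt with one <code><image></code> token per sampled frame. The names <code>evenly_spaced_indices</code> and <code>build_prompt</code> are hypothetical and not part of Transformers; the prompt string assumes the same template used in this guide.</p>

```python
def evenly_spaced_indices(total_frames: int, num_frames: int) -> list[int]:
    """Return at most num_frames indices spread evenly over range(total_frames).

    Hypothetical helper: an alternative to the modulo-based sampling in the
    guide that always yields exactly min(num_frames, total_frames) indices.
    """
    if total_frames <= 0 or num_frames <= 0:
        return []
    # Never ask for more frames than the video contains.
    num_frames = min(num_frames, total_frames)
    # Integer arithmetic avoids floating-point rounding surprises.
    return [(i * total_frames) // num_frames for i in range(num_frames)]


def build_prompt(user_prompt: str, num_images: int) -> str:
    """Build the chat prompt with one <image> placeholder per frame.

    Assumes the template shown in this guide: image tokens follow the
    user marker, then a newline and the question, then the assistant marker.
    """
    image_tokens = "<image>" * num_images
    return (
        f"<|im_start|>user{image_tokens}\n"
        f"{user_prompt}<|im_end|><|im_start|>assistant"
    )


if __name__ == "__main__":
    # Six indices from a 100-frame video.
    print(evenly_spaced_indices(total_frames=100, num_frames=6))
    # Two videos with six frames each need 12 image tokens.
    prompt = build_prompt("Are these two cats doing the same thing?", num_images=12)
    print(prompt.count("<image>"))
```

<p>Deriving the token count from the number of sampled frames, for example <code>build_prompt(user_prompt, len(videos))</code>, keeps the prompt and the image list in sync if you change how many frames you sample.</p>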
| <script> | |
| { | |
| __sveltekit_1xexzbk = { | |
| assets: "/docs/transformers/main/en", | |
| base: "/docs/transformers/main/en", | |
| env: {} | |
| }; | |
| const element = document.currentScript.parentElement; | |
| const data = [null,null]; | |
| Promise.all([ | |
| import("/docs/transformers/main/en/_app/immutable/entry/start.2135b7e6.js"), | |
| import("/docs/transformers/main/en/_app/immutable/entry/app.24372c84.js") | |
| ]).then(([kit, app]) => { | |
| kit.start(app, element, { | |
| node_ids: [0, 424], | |
| data, | |
| form: null, | |
| error: null | |
| }); | |
| }); | |
| } | |
| </script> | |