
	<meta charset="utf-8" /><meta name="hf:doc:metadata" content="{"title":"MobileViT v2","local":"mobilevit-v2","sections":[{"title":"MobileViT Architecture","local":"mobilevit-architecture","sections":[],"depth":2},{"title":"MobileViT Block","local":"mobilevit-block","sections":[],"depth":2}],"depth":1}">
	<link href="/docs/computer-vision-course/pr_397/en/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload">
	<link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/entry/start.7f209408.js">
	<link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/scheduler.7bc62968.js">
	<link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/singletons.b15acae1.js">
	<link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/paths.11cdc4b4.js">
	<link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/entry/app.32e8338e.js">
	<link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/index.2f8492b0.js">
	<link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/nodes/0.e37092e8.js">
	<link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/nodes/45.ad440689.js">
	<link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/Tip.016f38d9.js">
	<link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/CodeBlock.bb61a5a9.js">
	<link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/index.514d62da.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{"title":"MobileViT v2","local":"mobilevit-v2","sections":[{"title":"MobileViT Architecture","local":"mobilevit-architecture","sections":[],"depth":2},{"title":"MobileViT Block","local":"mobilevit-block","sections":[],"depth":2}],"depth":1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h1 class="relative group"><a id="mobilevit-v2" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#mobilevit-v2"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>MobileViT v2</span></h1> <p data-svelte-h="svelte-a5sr6t">The previously discussed Vision Transformer architectures are computationally intensive and hard to run on mobile devices. The previous state-of-the-art architecture used CNNs for mobile vision tasks. 
However, CNNs struggle to learn global representations, and as a result they often perform worse than their transformer counterparts.</p> <p data-svelte-h="svelte-10z2vs1">The MobileViT architecture aims to meet the requirements of mobile vision tasks, such as low latency and a lightweight architecture, while providing the advantages of both transformers and CNNs. MobileViT was developed by Apple and builds on MobileNet from Google’s research team, adding the MobileViT Block and separable self-attention. These two features allow for low latency, fewer parameters, reduced computational complexity, and the deployment of vision ML models on resource-constrained devices.</p> <h2 class="relative group"><a id="mobilevit-architecture" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#mobilevit-architecture"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>MobileViT Architecture</span></h2> <p data-svelte-h="svelte-bmj0ue">The architecture of MobileViT presented in the paper “MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision 
Transformer” by Sachin Mehta and Mohammad Rastegari is as follows:
	<img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/MobileViT-Architecture.png" alt="MobileViT Architecture"></p> <p data-svelte-h="svelte-13qcall">Some of this should look similar to the previous chapter. The MobileNet blocks, nxn convolutions, downsampling, global pooling, and the final linear layer.</p> <p data-svelte-h="svelte-1ev9nks">As seen by the global pooling layer and the linear layer, the model shown here is for classification. However, the same blocks introduced in this paper can be used for a variety of vision applications.</p> <h2 class="relative group"><a id="mobilevit-block" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#mobilevit-block"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>MobileViT Block</span></h2> <p data-svelte-h="svelte-16bokny">The MobileViT block combines CNN’s local processing and global processing, as seen in transformers. It uses a combination of convolutions and a transformer layer, allowing it to capture spatially local information and global dependencies in the data.</p> <p data-svelte-h="svelte-1ty6jj8">A diagram of the MobileViT Block is shown below:
	<img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/MobileViT-MobileViTBlock.png" alt="MobileViT Block"></p> <p data-svelte-h="svelte-18vonjn">Okay, that’s a lot to take in. Let’s break that down.</p> <ul data-svelte-h="svelte-85swkx"><li>The block takes in an image with multiple channels. For an RGB image, for example, the block takes in a three-channel image.</li> <li>It then performs an N by N convolution on the channels, appending the results to the existing channels.</li> <li>The block then creates a linear combination of these channels and adds it to the existing stack of channels.</li> <li>For each channel, the image is unfolded into flattened patches.</li> <li>These flattened patches are passed through a transformer, which projects them into new patches.</li> <li>The new patches are then folded back together to create an image with d dimensions.</li> <li>Afterwards, a pointwise convolution is applied to the stitched image.</li> <li>Finally, the stitched image is recombined with the original input (the RGB image in this example).</li></ul> <p data-svelte-h="svelte-1ks87bx">This approach allows for a receptive field of H x W (the entire input size) while modeling both non-local and local dependencies by retaining each patch’s location information. This can be seen in the unfolding and refolding of the patches.</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400">A receptive field is the region of the input space that affects a particular feature in a given layer.</div> <p data-svelte-h="svelte-10ehg4e">This combined approach allows MobileViT to use fewer parameters than traditional CNNs while achieving even better accuracy!
	<img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/MobileViT-CNNPreformance.png" alt="MobileViT CNNPreformance"></p> <p data-svelte-h="svelte-1jrpgtq">The main efficiency bottleneck in the original MobileViT architecture is the multi-head self-attention in Transformers, which requires O(k^2) time complexity concerning the input tokens.</p> <p data-svelte-h="svelte-1w3wo1k">Multi-head self-attention also requires costly operations like batch-wise matrix multiplications, which can impact latency on resource-constrained devices.</p> <p data-svelte-h="svelte-svxq7o">These same authors wrote another paper on exactly how to make attention operate faster. They’ve called it separable self-attention.</p> <h1 class="relative group"><a id="separable-self-attention" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#separable-self-attention"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Separable Self-attention</span></h1> <p data-svelte-h="svelte-1cmet3s">In traditional multihead attention, the big O concerning input tokens is quadratic (O(k^2)). 
Separable self-attention, introduced in this follow-up paper, has O(k) complexity with respect to the number of input tokens.</p> <p data-svelte-h="svelte-zdrtnh">In addition, the attention method does not use any batch-wise matrix multiplications, which helps reduce latency on resource-constrained devices like mobile phones.</p> <p data-svelte-h="svelte-15m63ew">This is a massive improvement!</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400">Many other attention variants exist, with complexities such as O(k), O(k sqrt(k)), and O(k log(k)).
	<p data-svelte-h="svelte-1a5ry1c">Separable self-attention was not the first method to reach O(k) complexity: <a href="https://arxiv.org/abs/2006.04768" rel="nofollow">Linformer</a> achieved O(k) attention earlier.</p> <p data-svelte-h="svelte-1ajshxs">However, Linformer still uses costly operations like batch-wise matrix multiplications.</p></div> <p data-svelte-h="svelte-3skqhs">A comparison between the attention mechanisms in Transformer, Linformer, and MobileViT is shown below:
	<img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/MobileViT-Attention.png" alt="Attention Comparison"></p> <p data-svelte-h="svelte-17i0vc">The image above gives a comparison of each of the individual types of attention between the Transformer, Linformer, and MobileViT v2 architectures.</p> <p data-svelte-h="svelte-1pddr69">For example, in both the transformer and Linformer architectures, the attention computations perform two batch-wise matrix multiplications.</p> <p data-svelte-h="svelte-1gc2js8">However, in the case of separable self-attention, these two batch-wise multiplications are replaced by two separate linear computations. This allows for further boosted inference speed.</p> <h1 class="relative group"><a id="conclusion" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#conclusion"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Conclusion</span></h1> <p data-svelte-h="svelte-1lfv2sg">MobileViT blocks retain spatially local information while developing global representations, combining the strengths of Transformers and CNNs. 
They provide a receptive field that encompasses the entire image.</p> <p data-svelte-h="svelte-5csdhs">Introducing separable self-attention into this existing architecture further boosted both accuracy and inference speed.
	<img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/MobileViT-Inference.png" alt="Inference Tests"></p> <p data-svelte-h="svelte-pkj3bq">Tests performed with different architectures on the iPhone 12s exhibited a large jump in performance with the introduction of separable attention, as shown above!</p> <p data-svelte-h="svelte-qdo4d3">Overall, the MobileViT Architecture is an extraordinarily powerful architecture for resource-limited vision tasks that provides fast inference and high accuracy.</p> <h1 class="relative group"><a id="transformers-library" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#transformers-library"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Transformers Library</span></h1> <p data-svelte-h="svelte-2kyg2y">If you want to try out MobileViTv2 locally, you can use it from HuggingFace’s <code>transformers</code> library, here’s how:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 
text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->pip install transformers<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-qgp6vb">Below is a short snippet on how to use MobileViT model to classify an image.</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight 
rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> AutoImageProcessor, MobileViTV2ForImageClassification
	<span class="hljs-keyword">import</span> requests
	<span class="hljs-keyword">from</span> PIL <span class="hljs-keyword">import</span> Image

	url = <span class="hljs-string">"http://images.cocodataset.org/val2017/000000039769.jpg"</span>
	image = Image.<span class="hljs-built_in">open</span>(requests.get(url, stream=<span class="hljs-literal">True</span>).raw)

	image_processor = AutoImageProcessor.from_pretrained(
	<span class="hljs-string">"apple/mobilevitv2-1.0-imagenet1k-256"</span>
	)
	model = MobileViTV2ForImageClassification.from_pretrained(
	<span class="hljs-string">"apple/mobilevitv2-1.0-imagenet1k-256"</span>
	)

	inputs = image_processor(image, return_tensors=<span class="hljs-string">"pt"</span>)

	logits = model(**inputs).logits

	<span class="hljs-comment"># model predicts one of the 1000 ImageNet classes</span>
	predicted_label = logits.argmax(-<span class="hljs-number">1</span>).item()
	<span class="hljs-built_in">print</span>(model.config.id2label[predicted_label])<!-- HTML_TAG_END --></pre></div> <h1 class="relative group"><a id="inference-api" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#inference-api"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Inference API</span></h1> <p data-svelte-h="svelte-1iqg6dj">For an even lighter computer vision setup, you can use the Hugging Face Inference API with MobileViTv2.
	The Inference API lets you interact with many models available on the Hugging Face Hub.
	We can query Inference API like following through Python.</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> json
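	<span class="hljs-keyword">import</span> os
	<span class="hljs-keyword">import</span> requests

	<span class="hljs-comment"># Assumption: your Hugging Face token is exported as the HF_API_TOKEN environment variable</span>
	API_TOKEN = os.environ[<span class="hljs-string">"HF_API_TOKEN"</span>]
	headers = {<span class="hljs-string">"Authorization"</span>: <span class="hljs-string">f"Bearer <span class="hljs-subst">{API_TOKEN}</span>"</span>}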
	API_URL = (
	<span class="hljs-string">"https://api-inference.huggingface.co/models/apple/mobilevitv2-1.0-imagenet1k-256"</span>
	)


	<span class="hljs-keyword">def</span> <span class="hljs-title function_">query</span>(<span class="hljs-params">filename</span>):
	    <span class="hljs-keyword">with</span> <span class="hljs-built_in">open</span>(filename, <span class="hljs-string">"rb"</span>) <span class="hljs-keyword">as</span> f:
	        data = f.read()
	    response = requests.request(<span class="hljs-string">"POST"</span>, API_URL, headers=headers, data=data)
	    <span class="hljs-keyword">return</span> json.loads(response.content.decode(<span class="hljs-string">"utf-8"</span>))


	data = query(<span class="hljs-string">"cats.jpg"</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1fv3ued">We can do the same with javascript like following.</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> fetch <span class="hljs-keyword">from</span> <span class="hljs-string">"node-fetch"</span>;
	<span class="hljs-keyword">import</span> fs <span class="hljs-keyword">from</span> <span class="hljs-string">"fs"</span>;
	<span class="hljs-keyword">async</span> <span class="hljs-keyword">function</span> <span class="hljs-title function_">query</span>(<span class="hljs-params">filename</span>) {
	<span class="hljs-keyword">const</span> data = fs.<span class="hljs-title function_">readFileSync</span>(filename);
	<span class="hljs-keyword">const</span> response = <span class="hljs-keyword">await</span> <span class="hljs-title function_">fetch</span>(
	<span class="hljs-string">"https://api-inference.huggingface.co/models/apple/mobilevitv2-1.0-imagenet1k-256"</span>,
	{
	<span class="hljs-attr">headers</span>: { <span class="hljs-title class_">Authorization</span>: <span class="hljs-string">`Bearer <span class="hljs-subst">${process.env.HF_API_TOKEN}</span>`</span> },
	<span class="hljs-attr">method</span>: <span class="hljs-string">"POST"</span>,
	<span class="hljs-attr">body</span>: data,
	}
	);
	<span class="hljs-keyword">const</span> result = <span class="hljs-keyword">await</span> response.<span class="hljs-title function_">json</span>();
	<span class="hljs-keyword">return</span> result;
	}
	<span class="hljs-title function_">query</span>(<span class="hljs-string">"cats.jpg"</span>).<span class="hljs-title function_">then</span>(<span class="hljs-function">(<span class="hljs-params">response</span>) =></span> {
	<span class="hljs-variable language_">console</span>.<span class="hljs-title function_">log</span>(<span class="hljs-title class_">JSON</span>.<span class="hljs-title function_">stringify</span>(response));
	});<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-6diubf">Finally, we can query inference API through curl.</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->curl https://api-inference.huggingface.co/models/apple/mobilevitv2-1.0-imagenet1k-256 \
	-X POST \
	--data-binary <span class="hljs-string">'@cats.jpg'</span> \
	-H <span class="hljs-string">"Authorization: Bearer <span class="hljs-variable">${HF_API_TOKEN}</span>"</span><!-- HTML_TAG_END --></pre></div> <p></p>
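<p>The unfold and fold steps from the MobileViT Block walkthrough above can be illustrated with a small pure-Python sketch. It is only an illustration under stated assumptions: a single channel, a 4 by 4 grid, non-overlapping 2 by 2 patches, and dimensions that divide evenly. The helper names <code>unfold</code> and <code>fold</code> are hypothetical, and this is not the authors' implementation. The transformer step that would sit between the two calls is omitted.</p>

```python
def unfold(feature_map, patch_h, patch_w):
    """Split an H x W grid into non-overlapping patches, each flattened row by row.

    Assumes H and W are exact multiples of the patch size.
    """
    height, width = len(feature_map), len(feature_map[0])
    patches = []
    for top in range(0, height, patch_h):
        for left in range(0, width, patch_w):
            patches.append([
                feature_map[top + i][left + j]
                for i in range(patch_h)
                for j in range(patch_w)
            ])
    return patches


def fold(patches, height, width, patch_h, patch_w):
    """Inverse of unfold: write each flattened patch back at its grid position."""
    feature_map = [[None] * width for _ in range(height)]
    patches_per_row = width // patch_w
    for index, patch in enumerate(patches):
        top = (index // patches_per_row) * patch_h
        left = (index % patches_per_row) * patch_w
        for offset, value in enumerate(patch):
            row = top + offset // patch_w
            col = left + offset % patch_w
            feature_map[row][col] = value
    return feature_map


# Toy single-channel "image" whose values encode their own position.
grid = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = unfold(grid, 2, 2)
restored = fold(patches, 4, 4, 2, 2)

print(patches[0])        # top-left 2 x 2 patch, flattened
print(restored == grid)  # folding undoes unfolding
```

<p>Because each patch keeps its index, folding returns every value to its original position, which is how patch location information survives the round trip.</p>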

	<script>
	{
	__sveltekit_1p6gie1 = {
	assets: "/docs/computer-vision-course/pr_397/en",
	base: "/docs/computer-vision-course/pr_397/en",
	env: {}
	};

	const element = document.currentScript.parentElement;

	const data = [null,null];

	Promise.all([
	import("/docs/computer-vision-course/pr_397/en/_app/immutable/entry/start.7f209408.js"),
	import("/docs/computer-vision-course/pr_397/en/_app/immutable/entry/app.32e8338e.js")
	]).then(([kit, app]) => {
	kit.start(app, element, {
	node_ids: [0, 45],
	data,
	form: null,
	error: null
	});
	});
	}
	</script>
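<p>To make the linear-cost claim about separable self-attention more concrete, here is a minimal pure-Python sketch. It reflects our reading of the mechanism: each token gets one scalar score, a softmax over those scores weights the key projections into a single context vector, and that vector is multiplied element-wise into every value projection before an output projection. The ReLU on the values, the weight names <code>w_i</code>, <code>w_k</code>, <code>w_v</code>, <code>w_o</code>, and the random weights are assumptions made for illustration; this is neither the authors' code nor the <code>transformers</code> implementation, and a real model would use a tensor library.</p>

```python
import math
import random


def linear(vec, weight):
    """Multiply a length-d vector by a d x m weight matrix given as a list of rows."""
    return [
        sum(v * row[j] for v, row in zip(vec, weight))
        for j in range(len(weight[0]))
    ]


def separable_self_attention(tokens, w_i, w_k, w_v, w_o):
    """Illustrative separable self-attention over k tokens of dimension d."""
    # 1. One scalar score per token, normalised with a softmax across tokens.
    scores = [linear(token, w_i)[0] for token in tokens]
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    context_scores = [e / total for e in exps]

    # 2. Context vector: score-weighted sum of the key projections.
    d = len(w_k[0])
    context = [0.0] * d
    for score, token in zip(context_scores, tokens):
        key = linear(token, w_k)
        for j in range(d):
            context[j] += score * key[j]

    # 3. Mix the shared context vector into every ReLU-activated value
    #    projection element-wise, then apply the output projection.
    outputs = []
    for token in tokens:
        value = [max(0.0, v) for v in linear(token, w_v)]
        mixed = [c * v for c, v in zip(context, value)]
        outputs.append(linear(mixed, w_o))
    return context_scores, outputs


def random_matrix(rows, cols):
    """Hypothetical helper: a rows x cols matrix of uniform random weights."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
k, d = 6, 4  # six tokens, four features each
tokens = random_matrix(k, d)
context_scores, outputs = separable_self_attention(
    tokens,
    random_matrix(d, 1),
    random_matrix(d, d),
    random_matrix(d, d),
    random_matrix(d, d),
)
print(len(outputs), len(outputs[0]))
```

<p>Every loop above visits each token a fixed number of times and never forms a k by k score matrix, so for a fixed feature dimension the work grows linearly with the number of tokens.</p>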
