| <meta charset="utf-8" /><meta name="hf:doc:metadata" content="{"title":"Image Segmentation","local":"image-segmentation","sections":[{"title":"Modern Approach: Vision Transformer-based Segmentation","local":"modern-approach-vision-transformer-based-segmentation","sections":[],"depth":3},{"title":"How to Evaluate a Segmentation Model?","local":"how-to-evaluate-a-segmentation-model","sections":[],"depth":3},{"title":"Resources and Further Reading","local":"resources-and-further-reading","sections":[],"depth":2}],"depth":1}"> | |
| <link href="/docs/computer-vision-course/pr_397/en/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/entry/start.7f209408.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/scheduler.7bc62968.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/singletons.b15acae1.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/paths.11cdc4b4.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/entry/app.32e8338e.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/index.2f8492b0.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/nodes/0.e37092e8.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/nodes/73.b1f20b2f.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/CodeBlock.bb61a5a9.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/index.514d62da.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{"title":"Image Segmentation","local":"image-segmentation","sections":[{"title":"Modern Approach: Vision Transformer-based Segmentation","local":"modern-approach-vision-transformer-based-segmentation","sections":[],"depth":3},{"title":"How to Evaluate a Segmentation Model?","local":"how-to-evaluate-a-segmentation-model","sections":[],"depth":3},{"title":"Resources and Further Reading","local":"resources-and-further-reading","sections":[],"depth":2}],"depth":1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h1 class="relative group"><a id="image-segmentation" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#image-segmentation"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Image Segmentation</span></h1> <p data-svelte-h="svelte-exkdtj">Image segmentation is dividing an image into meaningful segments. It’s all about creating masks that spotlight each object in the picture. | |
| The intuition behind this task is <em>that it can be viewed as a classification for each pixel of the image</em>. | |
| Segmentation models are | |
| core components in many industries, such as agriculture and autonomous driving. In farming, these models are used | |
| to identify different land sections and assess the growth stage of crops. They are also essential for self-driving cars, where | |
| they are used to identify lanes, sidewalks, and other road users.</p> <p data-svelte-h="svelte-1bft94o"><img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/segmentation-example.png" alt="Image segmentation"></p> <p data-svelte-h="svelte-1w5fm6j">Different types of segmentations can be applied depending on the context and the intended goal. | |
| The most commonly defined segmentations are the following.</p> <ul data-svelte-h="svelte-n0x2m3"><li><strong>Semantic Segmentation</strong>: This involves assigning the most probable class to each pixel. For example, in semantic segmentation, | |
| the model does not distinguish between two individual cats but rather focuses on the pixel class. It’s all about classification of | |
| each pixel.</li> <li><strong>Instance Segmentation</strong>: This type involves identifying each instance of an object with a unique mask. It combines aspects of | |
| object detection and segmentation to differentiate between individual objects of the same class.</li> <li><strong>Panoptic Segmentation</strong>: A hybrid approach that combines elements of semantic and instance segmentation. It assigns a class and | |
| an instance to each pixel, effectively integrating the <em>what</em> and <em>where</em> aspects of the image.</li></ul> <p data-svelte-h="svelte-1beerg7"><img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/segmentation-types.png" alt="Comparison of segmentation types"></p> <p data-svelte-h="svelte-17sd42v">Choosing the right segmentation type depends on the context and the intended goal. One cool thing is that recent models allow you to achieve the three | |
| segmentation types with a single model. We recommend checking out this <a href="https://huggingface.co/blog/mask2former" rel="nofollow">article</a>, which introduces Mask2Former, | |
| a new model by Meta that achieves the three segmentation types with only a Panoptic dataset.</p> <h3 class="relative group"><a id="modern-approach-vision-transformer-based-segmentation" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#modern-approach-vision-transformer-based-segmentation"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Modern Approach: Vision Transformer-based Segmentation</span></h3> <p data-svelte-h="svelte-19zwezj">You’ve probably heard of U-Net, a popular network used for image segmentation. It’s designed with several convolutional layers and works | |
| in two main phases: the downsampling phase, which compresses the image to understand its features, and the upsampling phase, which expands | |
| the image back to its original size for detailed segmentation.</p> <p data-svelte-h="svelte-1334w5c">Computer vision was once dominated by convolutional models, but it has recently shifted towards the vision transformer approach. | |
| One example is the <em><a href="https://arxiv.org/abs/2304.02643" rel="nofollow">Segment Anything Model (SAM)</a></em>, a popular prompt-based model introduced | |
| in April 2023 by <em>Meta AI Research, FAIR</em>. The model is based on the Vision Transformer (ViT) and focuses on creating a promptable | |
| (i.e. you can provide prompts such as points or bounding boxes to indicate what you would like to segment in the image) segmentation model capable of | |
| zero-shot transfer to new images. The strength of the model comes from its training on SA-1B, the largest segmentation dataset available at the time of its release, which includes over | |
| 1 billion masks on 11 million images. We recommend you play with <a href="https://segment-anything.com/" rel="nofollow">Meta’s demo</a> on a few images and even | |
| better you can play with the <a href="https://huggingface.co/ybelkada/segment-anything" rel="nofollow">model</a> in transformers.</p> <p data-svelte-h="svelte-1bdv4ej">Here is an example of how to use the model in transformers. First, we will initialize the <code>mask-generation</code> pipeline. | |
| Then, we will pass the image in pipeline for inference.</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> pipeline | |
| <span class="hljs-keyword">from</span> PIL <span class="hljs-keyword">import</span> Image | |
| pipe = pipeline(<span class="hljs-string">"mask-generation"</span>, model=<span class="hljs-string">"facebook/sam-vit-base"</span>, device=<span class="hljs-number">0</span>) | |
| raw_image = Image.<span class="hljs-built_in">open</span>(<span class="hljs-string">"path/to/image"</span>).convert(<span class="hljs-string">"RGB"</span>) | |
| masks = pipe(raw_image)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1kwqdq1">More details on how to use the model can be found in the <a href="https://huggingface.co/docs/transformers/main/en/model_doc/sam" rel="nofollow">documentation</a>.</p> <h3 class="relative group"><a id="how-to-evaluate-a-segmentation-model" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#how-to-evaluate-a-segmentation-model"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>How to Evaluate a Segmentation Model?</span></h3> <p data-svelte-h="svelte-17c2odr">You have now seen how to use a segmentation model, but how can you evaluate it? As demonstrated in the previous section, segmentation is | |
| primarily a supervised learning task. This means that the dataset is composed of images and their corresponding masks, which serve as the | |
| ground truth. A few metrics can be used to evaluate your model. The most common ones are:</p> <ul data-svelte-h="svelte-1ltwdtc"><li><strong>The Intersection over Union (IoU) or Jaccard index</strong> metric is the ratio between the intersection and the union of the predicted mask and the ground truth. | |
| IoU is arguably the most common metric used in segmentation tasks. Its advantage is that it is less sensitive to class imbalance than pixel accuracy, making | |
| it often a good choice when you begin modeling.</li></ul> <p data-svelte-h="svelte-wgpkoc"><img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/iou.png" alt="IoU"></p> <ul data-svelte-h="svelte-li3e7e"><li><strong>Pixel accuracy</strong>: Pixel accuracy is calculated as the ratio of the number of correctly classified pixels to the total number of pixels. | |
| Although it is an intuitive metric, it can be misleading because it is sensitive to class imbalance: a model that predicts only the dominant class can still score highly.</li></ul> <p data-svelte-h="svelte-rdneb"><img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/pixel-accuracy.png" alt="Pixel accuracy"></p> <ul data-svelte-h="svelte-1gypdy2"><li><strong>Dice coefficient</strong>: It’s the ratio of twice the intersection to the sum of the sizes of the predicted mask and the ground truth. | |
| The Dice coefficient measures the overlap between the prediction and the ground truth, ranging from 0 (no overlap) to 1 (perfect overlap). It’s a good metric to use when | |
| you need sensitivity to small differences in the overlap.</li></ul> <p data-svelte-h="svelte-138bear"><img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/dice-coefficient.png" alt="Dice coefficient"></p> <h2 class="relative group"><a id="resources-and-further-reading" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#resources-and-further-reading"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Resources and Further Reading</span></h2> <ul data-svelte-h="svelte-1g8ssyy"><li><a href="https://arxiv.org/abs/2304.02643" rel="nofollow">Segment Anything Paper</a></li> <li><a href="https://huggingface.co/blog/fine-tune-segformer" rel="nofollow">Fine-tuning Segformer blog post</a></li> <li><a href="https://huggingface.co/blog/mask2former" rel="nofollow">Mask2former blog post</a></li> <li><a href="https://huggingface.co/docs/transformers/main/tasks/semantic_segmentation" rel="nofollow">Hugging Face’s documentation on segmentation tasks</a></li> <li>If you want to go deeper into the topic, we recommend checking out Stanford’s <a href="https://www.youtube.com/watch?v=nDPWywWRIRo" 
rel="nofollow">lecture on segmentation</a>.</li></ul> <p></p> | |
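<p>To make the three metrics from the evaluation section concrete, here is a minimal sketch in plain Python. It assumes binary masks stored as nested lists of 0/1 values with identical, non-empty shapes. The helper names (<code>iou</code>, <code>pixel_accuracy</code>, <code>dice</code>) and the toy masks are illustrative assumptions, not part of any library. Returning 1.0 when both masks are empty is one possible convention, not the only one.</p>

```python
from itertools import chain


def _flatten(mask):
    """Flatten a 2D binary mask (list of rows) into a flat list."""
    return list(chain.from_iterable(mask))


def iou(pred, target):
    """Intersection over Union (Jaccard index) of two binary masks."""
    p, t = _flatten(pred), _flatten(target)
    intersection = sum(1 for a, b in zip(p, t) if a and b)
    union = sum(1 for a, b in zip(p, t) if a or b)
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label equals the ground truth."""
    p, t = _flatten(pred), _flatten(target)
    correct = sum(1 for a, b in zip(p, t) if a == b)
    return correct / len(p)


def dice(pred, target):
    """Dice coefficient: 2 * |A and B| / (|A| + |B|)."""
    p, t = _flatten(pred), _flatten(target)
    intersection = sum(1 for a, b in zip(p, t) if a and b)
    total = sum(p) + sum(t)
    # Assumed convention: two empty masks count as a perfect match.
    return 2 * intersection / total if total else 1.0


# Hypothetical 4x4 masks: the prediction misses one column of the object.
pred = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
target = [
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]

print(f"IoU:            {iou(pred, target):.3f}")             # 0.667
print(f"Pixel accuracy: {pixel_accuracy(pred, target):.3f}")  # 0.875
print(f"Dice:           {dice(pred, target):.3f}")            # 0.800

# Class imbalance: an all-background prediction against a tiny object.
empty = [[0] * 4 for _ in range(4)]
tiny = [[0] * 4 for _ in range(4)]
tiny[1][1] = 1

print(f"Imbalanced pixel accuracy: {pixel_accuracy(empty, tiny):.4f}")  # 0.9375
print(f"Imbalanced IoU:            {iou(empty, tiny):.4f}")             # 0.0000
```

<p>The last two lines illustrate why pixel accuracy can mislead: a prediction that finds nothing still scores about 94% accuracy, while its IoU is 0. For real masks you would more likely use array operations (for example with NumPy or PyTorch) than Python loops.</p>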
| <script> | |
| { | |
| __sveltekit_1p6gie1 = { | |
| assets: "/docs/computer-vision-course/pr_397/en", | |
| base: "/docs/computer-vision-course/pr_397/en", | |
| env: {} | |
| }; | |
| const element = document.currentScript.parentElement; | |
| const data = [null,null]; | |
| Promise.all([ | |
| import("/docs/computer-vision-course/pr_397/en/_app/immutable/entry/start.7f209408.js"), | |
| import("/docs/computer-vision-course/pr_397/en/_app/immutable/entry/app.32e8338e.js") | |
| ]).then(([kit, app]) => { | |
| kit.start(app, element, { | |
| node_ids: [0, 73], | |
| data, | |
| form: null, | |
| error: null | |
| }); | |
| }); | |
| } | |
| </script> | |