<meta charset="utf-8" /><meta name="hf:doc:metadata" content="{&quot;title&quot;:&quot;Image Segmentation&quot;,&quot;local&quot;:&quot;image-segmentation&quot;,&quot;sections&quot;:[{&quot;title&quot;:&quot;Modern Approach: Vision Transformer-based Segmentation&quot;,&quot;local&quot;:&quot;modern-approach-vision-transformer-based-segmentation&quot;,&quot;sections&quot;:[],&quot;depth&quot;:3},{&quot;title&quot;:&quot;How to Evaluate a Segmentation Model?&quot;,&quot;local&quot;:&quot;how-to-evaluate-a-segmentation-model&quot;,&quot;sections&quot;:[],&quot;depth&quot;:3},{&quot;title&quot;:&quot;Resources and Further Reading&quot;,&quot;local&quot;:&quot;resources-and-further-reading&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2}],&quot;depth&quot;:1}">
<link href="/docs/computer-vision-course/pr_397/en/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload">
<link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/entry/start.7f209408.js">
<link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/scheduler.7bc62968.js">
<link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/singletons.b15acae1.js">
<link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/paths.11cdc4b4.js">
<link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/entry/app.32e8338e.js">
<link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/index.2f8492b0.js">
<link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/nodes/0.e37092e8.js">
<link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/nodes/73.b1f20b2f.js">
<link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/CodeBlock.bb61a5a9.js">
<link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/index.514d62da.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{&quot;title&quot;:&quot;Image Segmentation&quot;,&quot;local&quot;:&quot;image-segmentation&quot;,&quot;sections&quot;:[{&quot;title&quot;:&quot;Modern Approach: Vision Transformer-based Segmentation&quot;,&quot;local&quot;:&quot;modern-approach-vision-transformer-based-segmentation&quot;,&quot;sections&quot;:[],&quot;depth&quot;:3},{&quot;title&quot;:&quot;How to Evaluate a Segmentation Model?&quot;,&quot;local&quot;:&quot;how-to-evaluate-a-segmentation-model&quot;,&quot;sections&quot;:[],&quot;depth&quot;:3},{&quot;title&quot;:&quot;Resources and Further Reading&quot;,&quot;local&quot;:&quot;resources-and-further-reading&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2}],&quot;depth&quot;:1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h1 class="relative group"><a id="image-segmentation" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#image-segmentation"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Image Segmentation</span></h1> <p 
data-svelte-h="svelte-exkdtj">Image segmentation is the task of dividing an image into meaningful segments by creating masks that outline each object in the picture.
The intuition behind this task is <em>that it can be viewed as a classification of each pixel of the image</em>.
Segmentation models are core components in several industries, including agriculture and autonomous driving. In farming, these models are used
to identify different land sections and assess the growth stage of crops. They are also key to self-driving cars, where
they are used to identify lanes, sidewalks, and other road users.</p> <p data-svelte-h="svelte-1bft94o"><img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/segmentation-example.png" alt="Image segmentation"></p> <p data-svelte-h="svelte-1w5fm6j">Different types of segmentations can be applied depending on the context and the intended goal.
The most common types are the following.</p> <ul data-svelte-h="svelte-n0x2m3"><li><strong>Semantic Segmentation</strong>: This involves assigning the most probable class to each pixel. For example, in semantic segmentation,
the model does not distinguish between two individual cats; both are simply labeled as cat pixels. It is purely a classification of
each pixel.</li> <li><strong>Instance Segmentation</strong>: This type involves identifying each instance of an object with a unique mask. It combines aspects of
object detection and segmentation to differentiate between individual objects of the same class.</li> <li><strong>Panoptic Segmentation</strong>: A hybrid approach that combines elements of semantic and instance segmentation. It assigns a class and
an instance to each pixel, effectively integrating the <em>what</em> and <em>where</em> aspects of the image.</li></ul> <p data-svelte-h="svelte-1beerg7"><img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/segmentation-types.png" alt="Comparison of segmentation types"></p> <p data-svelte-h="svelte-17sd42v">Conveniently, recent models allow you to achieve all three
segmentation types with a single model. We recommend checking out this <a href="https://huggingface.co/blog/mask2former" rel="nofollow">article</a>, which introduces Mask2Former,
a new model by Meta that achieves the three segmentation types with only a Panoptic dataset.</p> <h3 class="relative group"><a id="modern-approach-vision-transformer-based-segmentation" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#modern-approach-vision-transformer-based-segmentation"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Modern Approach: Vision Transformer-based Segmentation</span></h3> <p data-svelte-h="svelte-19zwezj">You’ve probably heard of U-Net, a popular network used for image segmentation. It’s designed with several convolutional layers and works
in two main phases: the downsampling phase, which compresses the image to understand its features, and the upsampling phase, which expands
the image back to its original size for detailed segmentation.</p> <p data-svelte-h="svelte-1334w5c">Computer vision was once dominated by convolutional models, but it has recently shifted towards the vision transformer approach.
An example is the <em><a href="https://arxiv.org/abs/2304.02643" rel="nofollow">Segment Anything Model (SAM)</a></em>, a popular prompt-based model introduced
in April 2023 by <em>Meta AI Research, FAIR</em>. The model is based on the Vision Transformer (ViT) and focuses on creating a promptable
(i.e. you can provide prompts such as points or bounding boxes to indicate what you would like to segment in the image) segmentation model capable of
zero-shot transfer to new images. The strength of the model comes from its training on SA-1B, the largest segmentation dataset available at the time of its release, which includes over
1 billion masks on 11 million images. We recommend trying <a href="https://segment-anything.com/" rel="nofollow">Meta’s demo</a> on a few images or, even
better, using the <a href="https://huggingface.co/ybelkada/segment-anything" rel="nofollow">model</a> in transformers.</p> <p data-svelte-h="svelte-1bdv4ej">Here is an example of how to use the model in transformers. First, we will initialize the <code>mask-generation</code> pipeline.
Then, we will pass the image to the pipeline for inference.</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> pipeline
<span class="hljs-keyword">from</span> PIL <span class="hljs-keyword">import</span> Image
<span class="hljs-comment"># device=0 selects the first GPU; omit it to run on the CPU</span>
pipe = pipeline(<span class="hljs-string">&quot;mask-generation&quot;</span>, model=<span class="hljs-string">&quot;facebook/sam-vit-base&quot;</span>, device=<span class="hljs-number">0</span>)
raw_image = Image.<span class="hljs-built_in">open</span>(<span class="hljs-string">&quot;path/to/image&quot;</span>).convert(<span class="hljs-string">&quot;RGB&quot;</span>)
masks = pipe(raw_image)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1kwqdq1">More details on how to use the model can be found in the <a href="https://huggingface.co/docs/transformers/main/en/model_doc/sam" rel="nofollow">documentation</a>.</p> <h3 class="relative group"><a id="how-to-evaluate-a-segmentation-model" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#how-to-evaluate-a-segmentation-model"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>How to Evaluate a Segmentation Model?</span></h3> <p data-svelte-h="svelte-17c2odr">You have now seen how to use a segmentation model, but how can you evaluate it? As demonstrated in the previous section, segmentation is
primarily a supervised learning task. This means that the dataset is composed of images and their corresponding masks, which serve as the
ground truth. A few metrics can be used to evaluate your model. The most common ones are:</p> <ul data-svelte-h="svelte-1ltwdtc"><li><strong>The Intersection over Union (IoU) or Jaccard index</strong> metric is the ratio between the intersection and the union of the predicted mask and the ground truth.
IoU is arguably the most common metric used in segmentation tasks. Its advantage lies in being less sensitive to class imbalance, making
it often a good choice when you begin modeling.</li></ul> <p data-svelte-h="svelte-wgpkoc"><img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/iou.png" alt="IoU"></p> <ul data-svelte-h="svelte-li3e7e"><li><strong>Pixel accuracy</strong>: Pixel accuracy is calculated as the ratio of the number of correctly classified pixels to the total number of pixels.
While intuitive, it can be misleading because of its sensitivity to class imbalance: a model that predicts only the dominant class can still score highly.</li></ul> <p data-svelte-h="svelte-rdneb"><img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/pixel-accuracy.png" alt="Pixel accuracy"></p> <ul data-svelte-h="svelte-1gypdy2"><li><strong>Dice coefficient</strong>: It is the ratio between twice the intersection and the sum of the sizes of the predicted mask and the ground truth.
The Dice coefficient measures the overlap between the prediction and the ground truth, ranging from 0 (no overlap) to 1 (perfect overlap). It’s a good metric to use when
you need sensitivity to small differences in overlap.</li></ul> <p data-svelte-h="svelte-138bear"><img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/dice-coefficient.png" alt="Dice coefficient"></p> <h2 class="relative group"><a id="resources-and-further-reading" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#resources-and-further-reading"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Resources and Further Reading</span></h2> <ul data-svelte-h="svelte-1g8ssyy"><li><a href="https://arxiv.org/abs/2304.02643" rel="nofollow">Segment Anything Paper</a></li> <li><a href="https://huggingface.co/blog/fine-tune-segformer" rel="nofollow">Fine-tuning SegFormer blog post</a></li> <li><a href="https://huggingface.co/blog/mask2former" rel="nofollow">Mask2Former blog post</a></li> <li><a href="https://huggingface.co/docs/transformers/main/tasks/semantic_segmentation" rel="nofollow">Hugging Face’s documentation on segmentation tasks</a></li> <li>If you want to go deeper into the topic, we recommend checking out Stanford’s <a href="https://www.youtube.com/watch?v=nDPWywWRIRo" rel="nofollow">lecture 
on segmentation</a>.</li></ul> <p></p>
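<p>To make the evaluation metrics from the section above more concrete, here is a minimal NumPy sketch of IoU, the Dice coefficient, and pixel accuracy for binary masks. The function names and the toy 4x4 masks are illustrative assumptions, not part of any library, and treating two empty masks as a perfect match is just one possible convention.</p>

```python
import numpy as np


def iou(pred: np.ndarray, target: np.ndarray) -> float:
    """Intersection over Union (Jaccard index) of two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return float(np.logical_and(pred, target).sum() / union)


def dice(pred: np.ndarray, target: np.ndarray) -> float:
    """Dice coefficient: twice the intersection over the sum of mask sizes."""
    pred, target = pred.astype(bool), target.astype(bool)
    total = pred.sum() + target.sum()
    if total == 0:
        # Same assumed convention as in iou().
        return 1.0
    return float(2 * np.logical_and(pred, target).sum() / total)


def pixel_accuracy(pred: np.ndarray, target: np.ndarray) -> float:
    """Fraction of pixels whose predicted label equals the ground truth."""
    return float((pred == target).mean())


# Toy ground truth: a 2x2 object (4 foreground pixels) on a 4x4 image.
target = np.zeros((4, 4), dtype=np.uint8)
target[1:3, 1:3] = 1

# Prediction that covers the object plus one extra column (6 pixels).
pred = np.zeros((4, 4), dtype=np.uint8)
pred[1:3, 1:4] = 1

# Degenerate prediction: everything is background.
empty = np.zeros_like(target)

print("IoU:", iou(pred, target))                        # 4 / 6
print("Dice:", dice(pred, target))                      # 8 / 10
print("Pixel accuracy:", pixel_accuracy(pred, target))  # 14 / 16
print("All-background accuracy:", pixel_accuracy(empty, target))  # 12 / 16
print("All-background IoU:", iou(empty, target))        # 0 / 4
```

<p>In this toy case the all-background prediction still reaches a pixel accuracy of 0.75 while its IoU is 0, which illustrates why pixel accuracy can be misleading when classes are imbalanced.</p>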
<script>
{
__sveltekit_1p6gie1 = {
assets: "/docs/computer-vision-course/pr_397/en",
base: "/docs/computer-vision-course/pr_397/en",
env: {}
};
const element = document.currentScript.parentElement;
const data = [null,null];
Promise.all([
import("/docs/computer-vision-course/pr_397/en/_app/immutable/entry/start.7f209408.js"),
import("/docs/computer-vision-course/pr_397/en/_app/immutable/entry/app.32e8338e.js")
]).then(([kit, app]) => {
kit.start(app, element, {
node_ids: [0, 73],
data,
form: null,
error: null
});
});
}
</script>
