# Image Segmentation
Image segmentation is dividing an image into meaningful segments. It's all about creating masks that spotlight each object in the picture.
The intuition behind this task is that *it can be viewed as a classification of each pixel of the image*.
Segmentation models are core components in many industries, from agriculture to autonomous driving. In the farming world, these models are used
for identifying different land sections and assessing the growth stage of crops. They're also key players for self-driving cars, where
they are used to identify lanes, sidewalks, and other road users.

Different types of segmentation can be applied depending on the context and the intended goal.
The most commonly defined types are the following.
- **Semantic Segmentation**: This involves assigning the most probable class to each pixel. For example, a semantic segmentation
model does not distinguish between two individual cats but only predicts the class of each pixel. It's all about classification of
each pixel.
- **Instance Segmentation**: This type involves identifying each instance of an object with a unique mask. It combines aspects of
object detection and segmentation to differentiate between individual objects of the same class.
- **Panoptic Segmentation**: A hybrid approach that combines elements of semantic and instance segmentation. It assigns a class and
an instance to each pixel, effectively integrating the *what* and *where* aspects of the image.

One cool thing is that recent models allow you to achieve the three
segmentation types with a single model. We recommend checking out this [article](https://huggingface.co/blog/mask2former), which introduces Mask2Former,
a new model by Meta that achieves the three segmentation types with only a panoptic dataset.
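
To make the difference between the three segmentation types concrete, here is a minimal NumPy sketch on a hypothetical 4x6 image containing two cats on a background. The scores, the class ids, and the `class_id * 1000 + instance_id` encoding are illustrative assumptions, not the output format of any particular model.

```python
import numpy as np

# Hypothetical per-pixel class scores for a 4x6 image: (num_classes, H, W).
# Class 0 = background, class 1 = cat. Values are made up for illustration.
H, W = 4, 6
logits = np.zeros((2, H, W))
logits[0] = 1.0                # background wins by default
logits[1, 1:3, 0:2] = 5.0      # first cat (left)
logits[1, 1:3, 4:6] = 5.0      # second cat (right)

# Semantic segmentation: most probable class per pixel.
# Both cats receive the same label, so they cannot be told apart.
semantic = logits.argmax(axis=0)

# Instance segmentation: one binary mask per object (written by hand here).
cat_a = np.zeros((H, W), dtype=bool)
cat_a[1:3, 0:2] = True
cat_b = np.zeros((H, W), dtype=bool)
cat_b[1:3, 4:6] = True
instance_masks = [cat_a, cat_b]

# Panoptic segmentation: every pixel gets a class AND an instance id,
# encoded here as class_id * 1000 + instance_id (an arbitrary convention).
panoptic = semantic * 1000
for instance_id, mask in enumerate(instance_masks, start=1):
    panoptic[mask] += instance_id

print("semantic labels:", np.unique(semantic).tolist())
print("panoptic labels:", np.unique(panoptic).tolist())
```

The semantic map contains only the labels 0 and 1, while the panoptic map separates the two cats into 1001 and 1002 and keeps the background at 0.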
### Modern Approach: Vision Transformer-based Segmentation
You've probably heard of U-Net, a popular network used for image segmentation. It's designed with several convolutional layers and works
in two main phases: the downsampling phase, which compresses the image to understand its features, and the upsampling phase, which expands
the image back to its original size for detailed segmentation.
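
The two phases can be followed by looking at array shapes alone. The snippet below is a simplified illustration in NumPy: it uses 2x2 max pooling for downsampling and nearest-neighbor repetition for upsampling, whereas a real U-Net applies learned convolutions at every stage. The helper names `downsample` and `upsample` are hypothetical. The concatenation step mimics the skip connections that U-Net uses to pass encoder features to the decoder.

```python
import numpy as np

def downsample(x):
    """2x2 max pooling on a (channels, H, W) array; H and W must be even."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample(x):
    """Nearest-neighbor upsampling by a factor of 2 on a (channels, H, W) array."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

rng = np.random.default_rng(0)
image = rng.random((3, 64, 64))    # toy input image

# Downsampling phase: the spatial size shrinks at each step.
enc1 = downsample(image)           # (3, 32, 32)
enc2 = downsample(enc1)            # (3, 16, 16)

# Upsampling phase: the spatial size grows back. Encoder features of the
# same resolution are concatenated along the channel axis (skip connection).
dec1 = np.concatenate([upsample(enc2), enc1], axis=0)     # (6, 32, 32)
dec0 = np.concatenate([upsample(dec1), image], axis=0)    # (9, 64, 64)

print(enc2.shape, dec0.shape)
```

The final feature map has the same height and width as the input, which is what lets a last layer predict one class per pixel.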

Computer vision was once dominated by convolutional models, but it has recently shifted towards the vision transformer approach.
An example is the *[Segment Anything Model (SAM)](https://arxiv.org/abs/2304.02643)*, a popular prompt-based model introduced
in April 2023 by *Meta AI Research, FAIR*. The model is based on the Vision Transformer (ViT) and focuses on creating a promptable
segmentation model (i.e. you can provide prompts such as points or bounding boxes to indicate what you would like to segment in the image) capable of
zero-shot transfer to new images. The strength of the model comes from its training on SA-1B, the largest segmentation dataset at the time of its release, which includes over
1 billion masks on 11 million images. We recommend you play with [Meta's demo](https://segment-anything.com/) on a few images and, even
better, with the [model](https://huggingface.co/ybelkada/segment-anything) in `transformers`.

Here is an example of how to use the model in `transformers`. First, we will initialize the `mask-generation` pipeline.
Then, we will pass the image to the pipeline for inference.
```python
from PIL import Image
from transformers import pipeline

# device=0 runs on the first GPU; use device=-1 to run on CPU instead
pipe = pipeline("mask-generation", model="facebook/sam-vit-base", device=0)
raw_image = Image.open("path/to/image").convert("RGB")
# Run automatic mask generation on the whole image
masks = pipe(raw_image)
```
More details on how to use the model can be found in the [documentation](https://huggingface.co/docs/transformers/main/en/model_doc/sam).
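
Once you have the masks, a common next step is to combine them into a single label map for visualization. The sketch below assumes the pipeline output exposes a list of boolean arrays of shape `(height, width)`, for example under a `"masks"` key; please verify this against the documentation for your `transformers` version. It uses synthetic masks so that it runs without downloading the model, and `masks_to_label_map` is a hypothetical helper, not part of the library.

```python
import numpy as np

def masks_to_label_map(masks):
    """Merge binary masks into one integer map: 0 = no mask, k = k-th mask.

    Larger masks are painted first so that smaller ones stay visible on top.
    Expects a non-empty list of masks that all share the same shape.
    """
    masks = [np.asarray(m, dtype=bool) for m in masks]
    label_map = np.zeros(masks[0].shape, dtype=np.int32)
    order = sorted(range(len(masks)), key=lambda i: masks[i].sum(), reverse=True)
    for i in order:
        label_map[masks[i]] = i + 1
    return label_map

# Synthetic stand-ins for generated masks (assumed format: boolean H x W arrays).
big = np.zeros((4, 4), dtype=bool)
big[:, :3] = True
small = np.zeros((4, 4), dtype=bool)
small[1:3, 1:3] = True

label_map = masks_to_label_map([small, big])
print(label_map)
```

With the real pipeline output, the call would look something like `masks_to_label_map(masks["masks"])`, assuming that key is present.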
### How to Evaluate a Segmentation Model?
You have now seen how to use a segmentation model, but how can you evaluate it? As demonstrated in the previous section, segmentation is
primarily a supervised learning task. This means that the dataset is composed of images and their corresponding masks, which serve as the
ground truth. A few metrics can be used to evaluate your model. The most common ones are:
- **Intersection over Union (IoU) or Jaccard index**: It's the ratio between the intersection and the union of the predicted mask and the ground truth.
IoU is arguably the most common metric used in segmentation tasks. Its advantage lies in being less sensitive to class imbalance than pixel accuracy, making
it often a good choice when you begin modeling.

- **Pixel accuracy**: Pixel accuracy is calculated as the ratio of the number of correctly classified pixels to the total number of pixels.
While being an intuitive metric, it can be misleading due to its sensitivity to class imbalance.

- **Dice coefficient**: It's the ratio between twice the intersection and the sum of the sizes of the predicted mask and the ground truth.
The Dice coefficient measures the overlap between the prediction and the ground truth on a scale from 0 to 1. It's a good metric to use when
you need sensitivity to small differences in overlap.

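
The three metrics can be sketched in a few lines of NumPy for binary masks. This is a minimal illustration, assuming both masks are boolean arrays of the same shape; the function names are hypothetical, and returning 1.0 when both masks are empty is a convention chosen here. The toy example at the end shows the class imbalance issue on an image where the object covers only 4% of the pixels.

```python
import numpy as np

def iou(pred, target):
    """Intersection over Union (Jaccard index) of two boolean masks."""
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return 1.0 if union == 0 else float(intersection / union)

def pixel_accuracy(pred, target):
    """Fraction of pixels where prediction and ground truth agree."""
    return float((pred == target).mean())

def dice(pred, target):
    """Dice coefficient: 2 * intersection / (size of pred + size of target)."""
    intersection = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    return 1.0 if total == 0 else float(2 * intersection / total)

# Ground truth: a small 2x2 object in a 10x10 image (4% of the pixels).
target = np.zeros((10, 10), dtype=bool)
target[4:6, 4:6] = True

# Prediction A misses the object entirely; prediction B finds half of it.
pred_a = np.zeros((10, 10), dtype=bool)
pred_b = np.zeros((10, 10), dtype=bool)
pred_b[4:6, 4:5] = True

for name, pred in [("A", pred_a), ("B", pred_b)]:
    print(name, pixel_accuracy(pred, target), iou(pred, target), dice(pred, target))
```

Prediction A misses the object completely yet still reaches a pixel accuracy of 0.96, while its IoU and Dice are both 0. Prediction B scores 0.98 in pixel accuracy, 0.5 in IoU, and about 0.67 in Dice.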
## Resources and Further Reading
- [Segment Anything Paper](https://arxiv.org/abs/2304.02643)
- [Fine-tuning Segformer blog post](https://huggingface.co/blog/fine-tune-segformer)
- [Mask2former blog post](https://huggingface.co/blog/mask2former)
- [Hugging Face's documentation on segmentation tasks](https://huggingface.co/docs/transformers/main/tasks/semantic_segmentation)
- If you want to go deeper into the topic, we recommend checking out Stanford's [lecture on segmentation](https://www.youtube.com/watch?v=nDPWywWRIRo).