Buckets:

HuggingFaceDocBuilder's picture
|
download
raw
6.17 kB
# Image Segmentation
Image segmentation is dividing an image into meaningful segments. It's all about creating masks that spotlight each object in the picture.
The intuition behind this task is *that it can be viewed as a classification for each pixel of the image*.
Segmentation models are the
core models in various industries. They can be found in agriculture and autonomous driving. In the farming world, these models are used
for identifying different land sections and assessing the growth stage of crops. They're also key players for self-driving cars, where
they are used to identify lanes, sidewalks, and other road users.
![Image segmentation](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/segmentation-example.png)
Different types of segmentations can be applied depending on the context and the intended goal.
The most commonly defined segmentations are the following.
- **Semantic Segmentation**: This involves assigning the most probable class to each pixel. For example, in semantic segmentation,
the model does not distinguish between two individual cats but rather focuses on the pixel class. It's all about classification of
each pixel.
- **Instance Segmentation**: This type involves identifying each instance of an object with a unique mask. It combines aspects of
object detection and segmentation to differentiate between individual objects of the same class.
- **Panoptic Segmentation**: A hybrid approach that combines elements of semantic and instance segmentation. It assigns a class and
an instance to each pixel, effectively integrating the *what* and *where* aspects of the image.
![Comparison of segmentation types](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/segmentation-types.png)
Choosing the right segmentation type depends on the context and the intended goal. One cool thing is that recent models allow you to achieve the three
segmentation types with a single model. We recommend you to check out this [article](https://huggingface.co/blog/mask2former), which introduces Mask2former,
a new model by Meta that achieves the three segmentation types with only a Panoptic dataset.
### Modern Approach: Vision Transformer-based Segmentation
You've probably heard of U-Net, a popular network used for image segmentation. It's designed with several convolutional layers and works
in two main phases: the downsampling phase, which compresses the image to understand its features, and the upsampling phase, which expands
the image back to its original size for detailed segmentation.
Computer vision was once dominated by convolutional models, but it has recently shifted towards the vision transformer approach.
An example is *[Segment anything model (SAM)](https://arxiv.org/abs/2304.02643)* that is a popular prompt based model introduced
in April 2023 by *Meta AI Research, FAIR*. The model is based on the Vision Transformer (ViT) model and focuses on creating a promptable
(i.e. you can provide words to describe what you would like to segment in the image) segmentation model capable of
zero-shot transfer on new images. The strength of the model comes from its training on the largest dataset available, which includes over
1 billion masks on 11 million images. I recommend you play with [Meta's demo](https://segment-anything.com/) on a few images and even
better you can play with the [model](https://huggingface.co/ybelkada/segment-anything) in transformers.
Here is an example of how to use the model in transformers. First, we will initialize the `mask-generation` pipeline.
Then, we will pass the image in pipeline for inference.
```python
from transformers import pipeline
pipe = pipeline("mask-generation", model="facebook/sam-vit-base", device=0)
raw_image = Image.open("path/to/image").convert("RGB")
masks = pipe(raw_image)
```
More details on how to use the model can be found in the [documentation](https://huggingface.co/docs/transformers/main/en/model_doc/sam).
### How to Evaluate a Segmentation Model?
You have now seen how to use a segmentation model, but how can you evaluate it? As demonstrated in the previous section, segmentation is
primarily a supervised learning task. This means that the dataset is composed of images and their corresponding masks, which serve as the
ground truth. A few metrics can be used to evaluate your model. The most common ones are:
- **The Intersection over Union (IoU) or Jaccard index** metric is the ratio between the intersection and the union of the predicted mask and the ground truth.
IoU is arguably the most common metric used in segmentation tasks. Its advantage lies in being less sensitive to class imbalance, making
it often a good choice when you begin modeling.
![IoU](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/iou.png)
- **Pixel accuracy**: Pixel accuracy is calculated as the ratio of the number of correctly classified pixels to the total number of pixels.
While being an intuitive metric, it can be misleading due to its sensitivity to class imbalance.
![Pixel accuracy](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/pixel-accuracy.png)
- **Dice coefficient**: It's the ratio between the double of the intersection and the sum of the predicted mask and the ground truth.
The dice coefficient is simply the percentage of overlap between the prediction and the ground truth. It's a good metric to use when
you need sensibility to small differences between the overlap.
![Dice coefficient](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/dice-coefficient.png)
## Resources and Further Reading
- [Segment Anything Paper](https://arxiv.org/abs/2304.02643)
- [Fine-tuning Segformer blog post](https://huggingface.co/blog/fine-tune-segformer)
- [Mask2former blog post](https://huggingface.co/blog/mask2former)
- [Hugging Face's documentation on segmentation tasks](https://huggingface.co/docs/transformers/main/tasks/semantic_segmentation)
- If you want to go deeper into the topic, we recommend you to check out Stanford's [lecture on segmentation](https://www.youtube.com/watch?v=nDPWywWRIRo).

Xet Storage Details

Size:
6.17 kB
·
Xet hash:
24eb79cc96245eb64787a0f310b2470cb735a017ff3d6f51f2e871baa229711f

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.