<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Mask Generation[[mask-generation]]

Mask generation is the task of generating semantically meaningful masks for an image.
This task is very similar to [image segmentation](semantic_segmentation), but many differences exist. Image segmentation models are trained on labeled datasets and are limited to the classes they have seen during training; given an image, they return a set of masks and their corresponding classes.

Mask generation models, on the other hand, are trained on large amounts of data and operate in two modes.
- Prompting mode: in this mode, the model takes in an image and a prompt, where the prompt can be a 2D point location (XY coordinates) on an object in the image, or a bounding box surrounding an object. In prompting mode, the model only returns the mask for the object that the prompt is pointing at.
- Segment Everything mode: in this mode, the model generates every mask in the given image. To do so, a grid of points is generated and overlaid on the image for inference.
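The point grid described above can be sketched in a few lines. Below is a minimal, illustrative example with NumPy; the helper name `build_point_grid` and the 32Γ—32 grid size are assumptions for illustration, not SAM's exact internals (though 32 points per side is a common default).

```python
import numpy as np

def build_point_grid(points_per_side: int, width: int, height: int) -> np.ndarray:
    """Build an evenly spaced grid of XY points to overlay on an image.

    Points are placed at cell centers so none fall exactly on the image border.
    """
    # Normalized 1D coordinates at cell centers: 0.5/n, 1.5/n, ...
    offsets = (np.arange(points_per_side) + 0.5) / points_per_side
    xs, ys = np.meshgrid(offsets * width, offsets * height)
    # Pair up the coordinates into (num_points, 2) rows of (x, y)
    return np.stack([xs.ravel(), ys.ravel()], axis=-1)

grid = build_point_grid(32, 2048, 1536)
print(grid.shape)  # (1024, 2) -> 32 * 32 points, each an (x, y) pair
```

Each of these points is then fed to the model as a point prompt, and the resulting masks are collected.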

마슀크 생성 μž‘μ—…μ€ [전체 λΆ„ν•  λͺ¨λ“œ(Segment Anything Model, SAM)](model_doc/sam)에 μ˜ν•΄ μ§€μ›λ©λ‹ˆλ‹€. SAM은 Vision Transformer 기반 이미지 인코더, ν”„λ‘¬ν”„νŠΈ 인코더, 그리고 μ–‘λ°©ν–₯ 트랜슀포머 마슀크 λ””μ½”λ”λ‘œ κ΅¬μ„±λœ κ°•λ ₯ν•œ λͺ¨λΈμž…λ‹ˆλ‹€. 이미지와 ν”„λ‘¬ν”„νŠΈλŠ” μΈμ½”λ”©λ˜κ³ , λ””μ½”λ”λŠ” μ΄λŸ¬ν•œ μž„λ² λ”©μ„ λ°›μ•„ μœ νš¨ν•œ 마슀크λ₯Ό μƒμ„±ν•©λ‹ˆλ‹€.

<div class="flex justify-center">
     <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/sam.png" alt="SAM Architecture"/>
</div>

SAM serves as a powerful foundation model for segmentation thanks to its large data coverage. It was trained on [SA-1B](https://ai.meta.com/datasets/segment-anything/), a dataset of 1 million images and 1.1 billion masks.

이 κ°€μ΄λ“œμ—μ„œλŠ” λ‹€μŒκ³Ό 같은 λ‚΄μš©μ„ 배우게 λ©λ‹ˆλ‹€:
- 배치 μ²˜λ¦¬μ™€ ν•¨κ»˜ 전체 λΆ„ν•  λͺ¨λ“œμ—μ„œ μΆ”λ‘ ν•˜λŠ” 방법
- 포인트 ν”„λ‘¬ν”„νŒ… λͺ¨λ“œμ—μ„œ μΆ”λ‘ ν•˜λŠ” 방법
- λ°•μŠ€ ν”„λ‘¬ν”„νŒ… λͺ¨λ“œμ—μ„œ μΆ”λ‘ ν•˜λŠ” 방법

First, let's install `transformers`:

```bash
pip install -q transformers
```

## Mask Generation Pipeline[[mask-generation-pipeline]]

마슀크 생성 λͺ¨λΈλ‘œ μΆ”λ‘ ν•˜λŠ” κ°€μž₯ μ‰¬μš΄ 방법은 `mask-generation` νŒŒμ΄ν”„λΌμΈμ„ μ‚¬μš©ν•˜λŠ” κ²ƒμž…λ‹ˆλ‹€.

```python
>>> from transformers import pipeline

>>> checkpoint = "facebook/sam-vit-base"
>>> mask_generator = pipeline(model=checkpoint, task="mask-generation")
```

Let's see an example image.

```python
from PIL import Image
import requests

img_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
```

<div class="flex justify-center">
     <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg" alt="Example Image"/>
</div>

μ „μ²΄μ μœΌλ‘œ λΆ„ν• ν•΄λ΄…μ‹œλ‹€. `points-per-batch`λŠ” 전체 λΆ„ν•  λͺ¨λ“œμ—μ„œ μ λ“€μ˜ 병렬 좔둠을 κ°€λŠ₯ν•˜κ²Œ ν•©λ‹ˆλ‹€. 이λ₯Ό 톡해 μΆ”λ‘  속도가 λΉ¨λΌμ§€μ§€λ§Œ, 더 λ§Žμ€ λ©”λͺ¨λ¦¬λ₯Ό μ†Œλͺ¨ν•˜κ²Œ λ©λ‹ˆλ‹€. λ˜ν•œ, SAM은 이미지가 μ•„λ‹Œ 점듀에 λŒ€ν•΄μ„œλ§Œ 배치 처리λ₯Ό μ§€μ›ν•©λ‹ˆλ‹€. `pred_iou_thresh`λŠ” IoU μ‹ λ’° μž„κ³„κ°’μœΌλ‘œ, 이 μž„κ³„κ°’μ„ μ΄ˆκ³Όν•˜λŠ” 마슀크만 λ°˜ν™˜λ©λ‹ˆλ‹€.

```python
masks = mask_generator(image, points_per_batch=128, pred_iou_thresh=0.88)
```

`masks` λŠ” λ‹€μŒκ³Ό 같이 μƒκ²ΌμŠ΅λ‹ˆλ‹€:

```bash
{'masks': [array([[False, False, False, ...,  True,  True,  True],
         [False, False, False, ...,  True,  True,  True],
         [False, False, False, ...,  True,  True,  True],
         ...,
         [False, False, False, ..., False, False, False],
         [False, False, False, ..., False, False, False],
         [False, False, False, ..., False, False, False]]),
  array([[False, False, False, ..., False, False, False],
         [False, False, False, ..., False, False, False],
         [False, False, False, ..., False, False, False],
         ...,
'scores': tensor([0.9972, 0.9917,
        ...,
}
```
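Each entry in `masks` is paired with an entry in `scores`, so you can apply your own confidence cutoff on top of `pred_iou_thresh`. Below is a minimal sketch of this pairing; the arrays are dummy stand-ins for real pipeline output (the pipeline returns `scores` as a tensor, shown here as a NumPy array for simplicity).

```python
import numpy as np

# Dummy stand-ins for pipeline output: three boolean masks and their scores
output = {
    "masks": [np.zeros((4, 4), dtype=bool) for _ in range(3)],
    "scores": np.array([0.9972, 0.9917, 0.8421]),
}

# Keep only masks whose score clears a stricter threshold
threshold = 0.99
kept = [m for m, s in zip(output["masks"], output["scores"]) if s > threshold]
print(len(kept))  # 2 -> the third mask falls below 0.99 and is dropped
```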

μœ„ λ‚΄μš©μ„ μ•„λž˜μ™€ 같이 μ‹œκ°ν™”ν•  수 μžˆμŠ΅λ‹ˆλ‹€:

```python
import matplotlib.pyplot as plt

plt.imshow(image, cmap='gray')

for i, mask in enumerate(masks["masks"]):
    plt.imshow(mask, cmap='viridis', alpha=0.1, vmin=0, vmax=1)

plt.axis('off')
plt.show()
```

Below is the original image with colorful maps overlaid. Very impressive.

<div class="flex justify-center">
     <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee_segmented.png" alt="Visualized"/>
</div>

## λͺ¨λΈ μΆ”λ‘ [[model-inference]]

### Point Prompting[[point-prompting]]

νŒŒμ΄ν”„λΌμΈ 없이도 λͺ¨λΈμ„ μ‚¬μš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€. 이λ₯Ό μœ„ν•΄ λͺ¨λΈκ³Ό ν”„λ‘œμ„Έμ„œλ₯Ό μ΄ˆκΈ°ν™”ν•΄μ•Ό ν•©λ‹ˆλ‹€.

```python
from transformers import SamModel, SamProcessor
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = SamModel.from_pretrained("facebook/sam-vit-base").to(device)
processor = SamProcessor.from_pretrained("facebook/sam-vit-base")
```

포인트 ν”„λ‘¬ν”„νŒ…μ„ ν•˜κΈ° μœ„ν•΄, μž…λ ₯ 포인트λ₯Ό ν”„λ‘œμ„Έμ„œμ— μ „λ‹¬ν•œ λ‹€μŒ, ν”„λ‘œμ„Έμ„œ 좜λ ₯을 λ°›μ•„ λͺ¨λΈμ— μ „λ‹¬ν•˜μ—¬ μΆ”λ‘ ν•©λ‹ˆλ‹€. λͺ¨λΈ 좜λ ₯을 ν›„μ²˜λ¦¬ν•˜λ €λ©΄, 좜λ ₯κ³Ό ν•¨κ»˜ ν”„λ‘œμ„Έμ„œμ˜ 초기 좜λ ₯μ—μ„œ κ°€μ Έμ˜¨ `original_sizes`와 `reshaped_input_sizes`λ₯Ό 전달해야 ν•©λ‹ˆλ‹€. μ™œλƒν•˜λ©΄, ν”„λ‘œμ„Έμ„œκ°€ 이미지 크기λ₯Ό μ‘°μ •ν•˜κ³  좜λ ₯을 μΆ”μ •ν•΄μ•Ό ν•˜κΈ° λ•Œλ¬Έμž…λ‹ˆλ‹€.

```python
input_points = [[[2592, 1728]]] # point location of the bee

inputs = processor(image, input_points=input_points, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)
masks = processor.image_processor.post_process_masks(outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu())
```

We can visualize the three masks in the `masks` output.

```python
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(1, 4, figsize=(15, 5))

axes[0].imshow(image)
axes[0].set_title('Original Image')
mask_list = [masks[0][0][0].numpy(), masks[0][0][1].numpy(), masks[0][0][2].numpy()]

for i, mask in enumerate(mask_list, start=1):
    overlayed_image = np.array(image).copy()

    overlayed_image[:,:,0] = np.where(mask == 1, 255, overlayed_image[:,:,0])
    overlayed_image[:,:,1] = np.where(mask == 1, 0, overlayed_image[:,:,1])
    overlayed_image[:,:,2] = np.where(mask == 1, 0, overlayed_image[:,:,2])

    axes[i].imshow(overlayed_image)
    axes[i].set_title(f'Mask {i}')
for ax in axes:
    ax.axis('off')

plt.show()
```

<div class="flex justify-center">
     <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/masks.png" alt="Visualized"/>
</div>
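The model also returns a predicted IoU score for each of these candidate masks (in `outputs.iou_scores`), which you can use to keep only the best one instead of plotting all three. Below is a minimal sketch of that selection, with dummy NumPy arrays standing in for the real post-processed masks and score tensor.

```python
import numpy as np

# Dummy stand-ins: three candidate masks and their predicted IoU scores
candidate_masks = np.stack([np.eye(4, dtype=bool)] * 3)  # shape (3, H, W)
iou_scores = np.array([0.71, 0.93, 0.88])

# Pick the mask the model is most confident about
best_idx = int(iou_scores.argmax())
best_mask = candidate_masks[best_idx]
print(best_idx)  # 1 -> the second candidate has the highest predicted IoU
```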

### Box Prompting[[box-prompting]]

λ°•μŠ€ ν”„λ‘¬ν”„νŒ…λ„ 포인트 ν”„λ‘¬ν”„νŒ…κ³Ό μœ μ‚¬ν•œ λ°©μ‹μœΌλ‘œ ν•  수 μžˆμŠ΅λ‹ˆλ‹€. μž…λ ₯ λ°•μŠ€λ₯Ό `[x_min, y_min, x_max, y_max]` ν˜•μ‹μ˜ 리슀트둜 μž‘μ„±ν•˜μ—¬ 이미지와 ν•¨κ»˜ `processor`에 전달할 수 μžˆμŠ΅λ‹ˆλ‹€. ν”„λ‘œμ„Έμ„œ 좜λ ₯을 λ°›μ•„ λͺ¨λΈμ— 직접 μ „λ‹¬ν•œ ν›„, λ‹€μ‹œ 좜λ ₯을 ν›„μ²˜λ¦¬ν•΄μ•Ό ν•©λ‹ˆλ‹€.

```python
# 벌 μ£Όμœ„μ˜ λ°”μš΄λ”© λ°•μŠ€
box = [2350, 1600, 2850, 2100]

inputs = processor(
        image,
        input_boxes=[[[box]]],
        return_tensors="pt"
    ).to(device)

with torch.no_grad():
    outputs = model(**inputs)

mask = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu()
)[0][0][0].numpy()
```

You can visualize the bounding box around the bee as shown below.

```python
import matplotlib.patches as patches

fig, ax = plt.subplots()
ax.imshow(image)

rectangle = patches.Rectangle((box[0], box[1]), box[2] - box[0], box[3] - box[1], linewidth=2, edgecolor='r', facecolor='none')
ax.add_patch(rectangle)
ax.axis("off")
plt.show()
```

<div class="flex justify-center">
     <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/bbox.png" alt="Visualized Bbox"/>
</div>

μ•„λž˜μ—μ„œ μΆ”λ‘  κ²°κ³Όλ₯Ό 확인할 수 μžˆμŠ΅λ‹ˆλ‹€.

```python
fig, ax = plt.subplots()
ax.imshow(image)
ax.imshow(mask, cmap='viridis', alpha=0.4)

ax.axis("off")
plt.show()
```

<div class="flex justify-center">
     <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/box_inference.png" alt="Visualized Inference"/>
</div>