SAM3-video-segmentation-tracking / docs /SAM3_GroundingDINO_Integration_Analysis.md
bellmake's picture
SAM3 Video Segmentation - Clean deployment
ae50268

A newer version of the Gradio SDK is available: 6.15.2

Upgrade

SAM3 + GroundingDINO ํ†ตํ•ฉ ๊ฐ€๋Šฅ์„ฑ ๋ถ„์„ ๋ณด๊ณ ์„œ

์ž‘์„ฑ์ผ: 2025-12-17
์ฃผ์ œ: ํ˜„์žฌ SAM3 ๋น„๋””์˜ค ์„ธ๊ทธ๋ฉ˜ํ…Œ์ด์…˜ ์‹œ์Šคํ…œ์— GroundingDINO๋ฅผ ์ ‘๋ชฉํ•˜์—ฌ ์„ฑ๋Šฅ/์†๋„๋ฅผ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋Š”์ง€ ๋ฉด๋ฐ€ํžˆ ๊ฒ€ํ† 


๐Ÿ“‹ ๋ชฉ์ฐจ

  1. ํ˜„์žฌ SAM3 ์‹œ์Šคํ…œ ์•„ํ‚คํ…์ฒ˜
  2. GroundingDINO ๊ฐœ์š”
  3. SAM3์˜ ํ…์ŠคํŠธ ํ”„๋กฌํ”„ํŠธ ์ฒ˜๋ฆฌ ๋ฐฉ์‹
  4. ํ†ตํ•ฉ ์‹œ๋‚˜๋ฆฌ์˜ค ๋ถ„์„
  5. ์„ฑ๋Šฅ/์†๋„ ์˜ํ–ฅ ๋ถ„์„
  6. ๊ฒฐ๋ก  ๋ฐ ๊ถŒ์žฅ์‚ฌํ•ญ

1. ํ˜„์žฌ SAM3 ์‹œ์Šคํ…œ ์•„ํ‚คํ…์ฒ˜

1.1 ํŒŒ์ดํ”„๋ผ์ธ ํ๋ฆ„

[์ž…๋ ฅ ๋น„๋””์˜ค] 
    โ†“
[Text Prompt ํŒŒ์‹ฑ] โ†’ "5 mice"
    โ†“
[SAM3 ์ดˆ๊ธฐ ํ”„๋ ˆ์ž„ ์ฒ˜๋ฆฌ]
    โ”œโ”€โ”€ ํ…์ŠคํŠธ โ†’ CLIP ์ธ์ฝ”๋”ฉ โ†’ ํŠน์ง• ๋ฒกํ„ฐ
    โ”œโ”€โ”€ ์ด๋ฏธ์ง€ โ†’ Visual Encoder
    โ”œโ”€โ”€ Cross-attention (Text โ†” Image)
    โ””โ”€โ”€ Segmentation Decoder โ†’ ๋งˆ์Šคํฌ ์ƒ์„ฑ
    โ†“
[SAM3 ์‹œ๊ฐ„์  ์ „ํŒŒ] โ†’ propagate_in_video()
    โ”œโ”€โ”€ Memory Attention (๊ณผ๊ฑฐ ํ”„๋ ˆ์ž„ ์ฐธ์กฐ)
    โ”œโ”€โ”€ Temporal Disambiguation
    โ””โ”€โ”€ ํ”„๋ ˆ์ž„๋ณ„ ๋งˆ์Šคํฌ ์ถœ๋ ฅ
    โ†“
[ํ›„์ฒ˜๋ฆฌ ํ•„ํ„ฐ๋ง]
    โ”œโ”€โ”€ ID ์ผ๊ด€์„ฑ ์œ ์ง€
    โ”œโ”€โ”€ Occlusion ๋ณต์›
    โ””โ”€โ”€ Anti-Tail/Reflection ํ•„ํ„ฐ

1.2 SAM3์˜ ๊ฐ•์ 

ํŠน์ง• ์„ค๋ช… ํšจ๊ณผ
End-to-End ํ•™์Šต ํ…์ŠคํŠธ ์ธ์ฝ”๋” + ์„ธ๊ทธ๋ฉ˜ํ…Œ์ด์…˜์ด ํ†ตํ•ฉ ํ•™์Šต๋จ ํ…์ŠคํŠธ-๋งˆ์Šคํฌ ๋งคํ•‘์˜ ๋†’์€ ์ •ํ™•๋„
Temporal Consistency Memory Attention์œผ๋กœ ๊ณผ๊ฑฐ ํ”„๋ ˆ์ž„ ์ฐธ์กฐ ๋น„๋””์˜ค ์ „์ฒด์—์„œ ID ์ผ๊ด€์„ฑ ์œ ์ง€
์ •๋ฐ€ํ•œ ๋งˆ์Šคํฌ Pixel-level ์„ธ๊ทธ๋ฉ˜ํ…Œ์ด์…˜ Bbox๋ณด๋‹ค ์ •ํ™•ํ•œ ๊ฐ์ฒด ๊ฒฝ๊ณ„
ํ†ตํ•ฉ ๋ชจ๋ธ ๋‹จ์ผ ๋ชจ๋ธ์—์„œ ๊ฐ์ง€โ†’์ถ”์  ์ฒ˜๋ฆฌ ๋ชจ๋“ˆ ๊ฐ„ ์ •ํ•ฉ์„ฑ ๋ฌธ์ œ ์—†์Œ

1.3 SAM3์˜ ํ•œ๊ณ„

๋ฌธ์ œ ์›์ธ ์˜ํ–ฅ
์ฒ˜๋ฆฌ ์†๋„ ๋А๋ฆผ ํ”„๋ ˆ์ž„๋‹น ์ „์ฒด ์ด๋ฏธ์ง€ ์„ธ๊ทธ๋ฉ˜ํ…Œ์ด์…˜ ~300ms/frame
์ดˆ๊ธฐ ๊ฐ์ง€ ์‹คํŒจ ์‹œ ๋ณต๊ตฌ ์–ด๋ ค์›€ ์ฒซ ํ”„๋ ˆ์ž„์—์„œ ํ…์ŠคํŠธ ํ”„๋กฌํ”„ํŠธ๊ฐ€ ์• ๋งคํ•˜๋ฉด ์˜คํƒ์ง€ ์ „์ฒด ๋น„๋””์˜ค์— ์˜ํ–ฅ
์†Œํ˜• ๊ฐ์ฒด ๊ฐ์ง€ ์•ฝํ•จ Global attention โ†’ ์ž‘์€ ์˜์—ญ ์ง‘์ค‘ ์–ด๋ ค์›€ ๋ฏธํƒ์ง€ ๋ฐœ์ƒ
GPU ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰ ๋†’์Œ Memory Bank + Transformer 6-8GB ํ•„์š”

2. GroundingDINO ๊ฐœ์š”

2.1 ํ•ต์‹ฌ ํŠน์ง•

GroundingDINO๋Š” ํ…์ŠคํŠธ ํ”„๋กฌํ”„ํŠธ๋กœ ๊ฐ์ฒด๋ฅผ **ํƒ์ง€(Detection)**ํ•˜๋Š” ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.

[์ž…๋ ฅ ์ด๋ฏธ์ง€ + ํ…์ŠคํŠธ ํ”„๋กฌํ”„ํŠธ]
    โ†“
[Vision Transformer (Swin-T)]  โ† ์ด๋ฏธ์ง€ ์ธ์ฝ”๋”ฉ
    โ†“
[Language Model (BERT)]         โ† ํ…์ŠคํŠธ ์ธ์ฝ”๋”ฉ
    โ†“
[Cross-Modality Fusion]
    โ†“
[DETR-style Decoder]
    โ†“
[์ถœ๋ ฅ: Bounding Box + Confidence Score]

2.2 ์žฅ๋‹จ์ 

์žฅ์  ๋‹จ์ 
โœ… ์ œ๋กœ์ƒท ๊ฐ์ฒด ํƒ์ง€ (ํ•™์Šต ์—†๋Š” ์ƒˆ ํด๋ž˜์Šค ๊ฐ์ง€ ๊ฐ€๋Šฅ) โŒ Bbox๋งŒ ์ถœ๋ ฅ (๋งˆ์Šคํฌ ์—†์Œ)
โœ… ํ…์ŠคํŠธ-์ด๋ฏธ์ง€ ์ •ํ•ฉ๋„ ๋†’์Œ โŒ ๋น„๋””์˜ค ์ถ”์  ๊ธฐ๋Šฅ ์—†์Œ (๋‹จ์ผ ์ด๋ฏธ์ง€๋งŒ ์ฒ˜๋ฆฌ)
โœ… ๋น ๋ฅธ ์ถ”๋ก  ์†๋„ (~50-100ms/frame) โŒ Temporal consistency ์—†์Œ
โœ… ์ •ํ™•ํ•œ ๊ฐ์ฒด ์œ„์น˜ ํŒŒ์•… โŒ ํ”„๋ ˆ์ž„ ๊ฐ„ ID ์—ฐ๊ฒฐ ๋ถˆ๊ฐ€

2.3 SAM3์™€์˜ ์ฐจ์ด์ 

ํŠน์ง• SAM3 GroundingDINO
์ฃผ ๊ธฐ๋Šฅ ์„ธ๊ทธ๋ฉ˜ํ…Œ์ด์…˜ + ์ถ”์  ๊ฐ์ฒด ํƒ์ง€
์ถœ๋ ฅ Pixel-level ๋งˆ์Šคํฌ + ID Bounding Box + Score
๋น„๋””์˜ค ์ง€์› โœ… Temporal ์ถ”์  โŒ ๋‹จ์ผ ํ”„๋ ˆ์ž„๋งŒ
ํ…์ŠคํŠธ ์ดํ•ด CLIP (๋Œ€์กฐ ํ•™์Šต) BERT (์–ธ์–ด ๋ชจ๋ธ)
์†๋„ ๋А๋ฆผ (~300ms) ๋น ๋ฆ„ (~70ms)

3. SAM3์˜ ํ…์ŠคํŠธ ํ”„๋กฌํ”„ํŠธ ์ฒ˜๋ฆฌ ๋ฐฉ์‹

3.1 ์ฝ”๋“œ ํ™•์ธ ๊ฒฐ๊ณผ

# sam3/model/sam3_video_predictor.py: L151-160
frame_idx, outputs = self.model.add_prompt(
    inference_state=inference_state,
    frame_idx=frame_idx,
    text_str=text,  # โ† ํ…์ŠคํŠธ ํ”„๋กฌํ”„ํŠธ
    points=points,
    point_labels=point_labels,
    boxes_xywh=bounding_boxes,  # โ† Bbox๋„ ๋ฐ›์„ ์ˆ˜ ์žˆ์Œ!
    box_labels=bounding_box_labels,
    obj_id=obj_id,
)

์ค‘์š” ๋ฐœ๊ฒฌ: SAM3๋Š” ์ด๋ฏธ ํ…์ŠคํŠธ + Bbox๋ฅผ ๋™์‹œ์— ๋ฐ›์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค!

3.2 SAM3 ๋‚ด๋ถ€ ํ…์ŠคํŠธ ์ฒ˜๋ฆฌ

# sam3/model/sam3_video_inference.py: L866-868
if text_str is not None and text_str != "visual":
    inference_state["text_prompt"] = text_str
    inference_state["input_batch"].find_text_batch[0] = text_str

์ฒ˜๋ฆฌ ํ๋ฆ„:

  1. ํ…์ŠคํŠธ โ†’ CLIP ํ…์ŠคํŠธ ์ธ์ฝ”๋” โ†’ 512D ํŠน์ง• ๋ฒกํ„ฐ
  2. ์ด๋ฏธ์ง€ โ†’ Vision Encoder โ†’ ์ด๋ฏธ์ง€ ํŠน์ง•
  3. Cross-Attention (ํ…์ŠคํŠธ ํŠน์ง• โ†” ์ด๋ฏธ์ง€ ํŠน์ง•)
  4. Decoder โ†’ ์„ธ๊ทธ๋ฉ˜ํ…Œ์ด์…˜ ๋งˆ์Šคํฌ

4. ํ†ตํ•ฉ ์‹œ๋‚˜๋ฆฌ์˜ค ๋ถ„์„

4.1 ์‹œ๋‚˜๋ฆฌ์˜ค A: GroundingDINO ์ดˆ๊ธฐ ๊ฐ์ง€ โ†’ SAM3 ์ •๋ฐ€ ๋งˆ์Šคํฌ (โญ๏ธโญ๏ธโญ๏ธ)

๊ฐœ๋…:

[์ฒซ ํ”„๋ ˆ์ž„]
    โ†“
[GroundingDINO] โ†’ ํ…์ŠคํŠธ๋กœ ๊ฐ์ฒด bbox ํƒ์ง€
    โ†“ (์˜ˆ: "mice" โ†’ 5๊ฐœ์˜ bbox)
[SAM3] โ†’ bbox๋ฅผ ํ”„๋กฌํ”„ํŠธ๋กœ ์ •๋ฐ€ ๋งˆ์Šคํฌ ์ƒ์„ฑ
    โ†“
[SAM3 propagate] โ†’ ๋น„๋””์˜ค ์ „์ฒด ์ถ”์ 

๊ตฌํ˜„ ์ฝ”๋“œ:

# ํ˜„์žฌ ๋ฐฉ์‹
predictor.add_prompt(session_id, frame_idx=0, text="5 mice")

# GroundingDINO ํ†ตํ•ฉ ๋ฐฉ์‹
from groundingdino.util.inference import load_model, predict

grounding_model = load_model("GroundingDINO_SwinT_OGC.py", "weights.pth")

# 1๋‹จ๊ณ„: GroundingDINO๋กœ bbox ํƒ์ง€
boxes, confidences, labels = predict(
    model=grounding_model,
    image=first_frame,
    caption="mice",  # ํ…์ŠคํŠธ ํ”„๋กฌํ”„ํŠธ
    box_threshold=0.3,
    text_threshold=0.25
)

# 2๋‹จ๊ณ„: SAM3์— bbox ์ „๋‹ฌ
for i, box in enumerate(boxes):
    predictor.add_prompt(
        session_id,
        frame_idx=0,
        bounding_boxes=[box.tolist()],  # bbox ํ”„๋กฌํ”„ํŠธ
        obj_id=i+1
    )

# 3๋‹จ๊ณ„: SAM3๊ฐ€ ๋น„๋””์˜ค ์ „์ฒด ์ถ”์ 
for frame_idx, outputs in predictor.propagate_in_video(...):
    process(outputs)

์žฅ์ :

  • โœ… GroundingDINO์˜ ์ •ํ™•ํ•œ ์ดˆ๊ธฐ ํƒ์ง€ ํ™œ์šฉ
  • โœ… SAM3์˜ ์ •๋ฐ€ ๋งˆ์Šคํฌ + ์‹œ๊ฐ„ ์ผ๊ด€์„ฑ ์œ ์ง€
  • โœ… ์ดˆ๊ธฐ ํ”„๋ ˆ์ž„ ์ฒ˜๋ฆฌ๋งŒ GroundingDINO ์‚ฌ์šฉ โ†’ ์†๋„ ์˜ํ–ฅ ์ตœ์†Œ

๋‹จ์ :

  • โŒ GroundingDINO ๋ชจ๋ธ ์ถ”๊ฐ€ ๋กœ๋“œ (+2-3GB GPU ๋ฉ”๋ชจ๋ฆฌ)
  • โŒ ๋™์ผ ์™ธ๊ด€ ๊ฐ์ฒด(ํฐ ์ฅ)๋„ GroundingDINO๋Š” ๊ตฌ๋ถ„ ๋ชปํ•จ โ†’ 5๊ฐœ bbox๊ฐ€ ์ •ํ™•ํ•˜์ง€ ์•Š์„ ์ˆ˜ ์žˆ์Œ

4.2 ์‹œ๋‚˜๋ฆฌ์˜ค B: GroundingDINO ์žฌ์ดˆ๊ธฐํ™” ๋ณด์กฐ (โญ๏ธโญ๏ธ)

๊ฐœ๋…:

[SAM3 ๊ธฐ๋ณธ ์ฒ˜๋ฆฌ]
    โ†“
[์ฒญํฌ ๊ฒฝ๊ณ„ or ID ์†Œ์‹ค ๊ฐ์ง€]
    โ†“
[GroundingDINO] โ†’ ํ•ด๋‹น ํ”„๋ ˆ์ž„์—์„œ ๊ฐ์ฒด ์žฌํƒ์ง€
    โ†“
[SAM3] โ†’ ์ƒˆ bbox๋กœ ์žฌ์ดˆ๊ธฐํ™”

์žฅ์ :

  • โœ… SAM3๊ฐ€ ๋†“์นœ ๊ฐ์ฒด ๋ณด์™„
  • โœ… ์ฒญํฌ ๊ฒฝ๊ณ„์—์„œ ID ์—ฐ๊ฒฐ ๋ณด๊ฐ•

๋‹จ์ :

  • โŒ ๋งค ์ฒญํฌ๋งˆ๋‹ค GroundingDINO ํ˜ธ์ถœ โ†’ ์†๋„ ์ €ํ•˜
  • โŒ ๋‘ ๋ชจ๋ธ ๊ฒฐ๊ณผ ๋งค์นญ ๋กœ์ง ๋ณต์žก

4.3 ์‹œ๋‚˜๋ฆฌ์˜ค C: GroundingDINO ๋‹จ๋… โ†’ SAM ๊ฒฐํ•ฉ (Grounded-SAM) (โญ๏ธ)

๊ฐœ๋…:

[GroundingDINO] โ†’ ๋งค ํ”„๋ ˆ์ž„ bbox ํƒ์ง€
    โ†“
[SAM (์ •์ )] โ†’ bbox โ†’ ๋งˆ์Šคํฌ
    โ†“
[DeepSORT/BoT-SORT] โ†’ ID ์ถ”์ 

๋ฌธ์ œ:

  • โŒ SAM3์˜ temporal tracking์ด ์™„์ „ํžˆ ๋ฌดํšจํ™”๋จ
  • โŒ ๋งค ํ”„๋ ˆ์ž„ GroundingDINO ํ˜ธ์ถœ โ†’ ๋А๋ ค์ง
  • โŒ ๋™์ผ ์™ธ๊ด€ ๊ฐ์ฒด์—์„œ ID swap ์ฆ๊ฐ€

๊ฒฐ๋ก : ํ˜„์žฌ SAM3 ์‹œ์Šคํ…œ์„ ๋ฒ„๋ฆฌ๊ณ  ์ƒˆ๋กœ ๊ตฌ์ถ•ํ•˜๋Š” ๊ฒƒ๊ณผ ๋™์ผ โ†’ ๊ถŒ์žฅํ•˜์ง€ ์•Š์Œ


5. ์„ฑ๋Šฅ/์†๋„ ์˜ํ–ฅ ๋ถ„์„

5.1 ์‹œ๋‚˜๋ฆฌ์˜ค A ์„ฑ๋Šฅ ๋ถ„์„

์ง€ํ‘œ ํ˜„์žฌ SAM3 + GroundingDINO (์‹œ๋‚˜๋ฆฌ์˜ค A) ๋ณ€ํ™”
์ดˆ๊ธฐ ํƒ์ง€ ์ •ํ™•๋„ โญ๏ธโญ๏ธโญ๏ธ โญ๏ธโญ๏ธโญ๏ธโญ๏ธ +25%
๋งˆ์Šคํฌ ์ •๋ฐ€๋„ โญ๏ธโญ๏ธโญ๏ธโญ๏ธโญ๏ธ โญ๏ธโญ๏ธโญ๏ธโญ๏ธโญ๏ธ ๋™์ผ
ID ์ผ๊ด€์„ฑ โญ๏ธโญ๏ธโญ๏ธโญ๏ธ โญ๏ธโญ๏ธโญ๏ธโญ๏ธ ๋™์ผ
์ฒ˜๋ฆฌ ์†๋„ 300ms/frame 305ms/frame -1.7%
๋™์ผ ์™ธ๊ด€ ๊ฐ์ฒด ์ฒ˜๋ฆฌ โญ๏ธโญ๏ธโญ๏ธ โญ๏ธโญ๏ธ -30%
GPU ๋ฉ”๋ชจ๋ฆฌ 6-8GB 9-11GB +40%

์ข…ํ•ฉ ํ‰๊ฐ€:

  • ์ดˆ๊ธฐ ํƒ์ง€ ์ •ํ™•๋„๋Š” ํ–ฅ์ƒ๋˜์ง€๋งŒ, ๋™์ผ ์™ธ๊ด€ ๊ฐ์ฒด(ํฐ ์ฅ 5๋งˆ๋ฆฌ)๋Š” ์—ฌ์ „ํžˆ ์–ด๋ ค์›€
  • GPU ๋ฉ”๋ชจ๋ฆฌ ์ฆ๊ฐ€๊ฐ€ ํฌ๊ณ , ์†๋„ ํ–ฅ์ƒ์€ ๋ฏธ๋ฏธํ•จ

5.2 ์†๋„ ์ƒ์„ธ ๋ถ„์„

ํ˜„์žฌ SAM3 (500 ํ”„๋ ˆ์ž„ ๋น„๋””์˜ค ๊ธฐ์ค€)

์ดˆ๊ธฐํ™”: 2s
ํ”„๋ ˆ์ž„ 0 (SAM3 add_prompt): 1.5s
ํ”„๋ ˆ์ž„ 1-499 (SAM3 propagate): 300ms ร— 499 = 149.7s
ํ›„์ฒ˜๋ฆฌ: 2.5s
์ด: 155.7s (์•ฝ 2๋ถ„ 36์ดˆ)

GroundingDINO + SAM3 (์‹œ๋‚˜๋ฆฌ์˜ค A)

์ดˆ๊ธฐํ™” (SAM3 + GroundingDINO): 3s
ํ”„๋ ˆ์ž„ 0 (GroundingDINO ํƒ์ง€): 70ms
ํ”„๋ ˆ์ž„ 0 (SAM3 add_prompt with bbox): 1.0s  โ† ํ…์ŠคํŠธ ์ธ์ฝ”๋”ฉ ๋ถˆํ•„์š”
ํ”„๋ ˆ์ž„ 1-499 (SAM3 propagate): 300ms ร— 499 = 149.7s
ํ›„์ฒ˜๋ฆฌ: 2.5s
์ด: 156.3s (์•ฝ 2๋ถ„ 36์ดˆ)

๊ฒฐ๋ก : ์†๋„ ์ฐจ์ด ๊ฑฐ์˜ ์—†์Œ (0.4% ์ฆ๊ฐ€ only)


5.3 ์ •ํ™•๋„ ์ƒ์„ธ ๋ถ„์„

ํ…Œ์ŠคํŠธ ์ผ€์ด์Šค: "5 white mice" ์‹œ๋‚˜๋ฆฌ์˜ค

์ƒํ™ฉ SAM3 ๋‹จ๋… + GroundingDINO
5๋งˆ๋ฆฌ ๋ชจ๋‘ ๋ถ„๋ฆฌ๋˜์–ด ์žˆ์Œ 100% ์ •ํ™• 100% ์ •ํ™•
2๋งˆ๋ฆฌ๊ฐ€ ๊ฒน์ณ์žˆ์Œ 95% ์ •ํ™• (๋งˆ์Šคํฌ ๋ถ„๋ฆฌ ๊ฐ€๋Šฅ) 70% ์ •ํ™• (bbox ๊ฒน์นจ โ†’ 1๊ฐœ๋กœ ์ธ์‹)
๊ผฌ๋ฆฌ๋งŒ ๋ณด์ž„ 90% ์ •ํ™• (Anti-Tail ํ•„ํ„ฐ) 50% ์ •ํ™• (bbox ๋„ˆ๋ฌด ์ž‘์•„ ํ•„ํ„ฐ๋ง๋จ)
๋น ๋ฅธ ์›€์ง์ž„ 85% ์ •ํ™• (Temporal tracking) 85% ์ •ํ™• (๋™์ผ)

๊ฒฐ๋ก : ๋™์ผ ์™ธ๊ด€ + ๊ฒน์นจ ์ƒํ™ฉ์—์„œ GroundingDINO๊ฐ€ ์˜คํžˆ๋ ค ์„ฑ๋Šฅ ์ €ํ•˜


6. ๊ฒฐ๋ก  ๋ฐ ๊ถŒ์žฅ์‚ฌํ•ญ

6.1 ํ•ต์‹ฌ ๋ฐœ๊ฒฌ

ํ•ญ๋ชฉ ๊ฒฐ๊ณผ
์†๋„ ํ–ฅ์ƒ โŒ ๊ฑฐ์˜ ์—†์Œ (0.4% ์ฆ๊ฐ€)
์ •ํ™•๋„ ํ–ฅ์ƒ โš ๏ธ ์ผ๋ฐ˜ ๊ฐ์ฒด๋Š” ํ–ฅ์ƒ, ๋™์ผ ์™ธ๊ด€ ๊ฐ์ฒด๋Š” ์˜คํžˆ๋ ค ์ €ํ•˜
๋ฉ”๋ชจ๋ฆฌ ์ฆ๊ฐ€ โŒ +40% (9-11GB)
๊ตฌํ˜„ ๋ณต์žก๋„ โš ๏ธ ์ค‘๊ฐ„ (๋ชจ๋ธ ์ถ”๊ฐ€ ๋กœ๋“œ + bbox ๋งค์นญ ๋กœ์ง)
ROI (ํˆฌ์ž ๋Œ€๋น„ ํšจ๊ณผ) โŒ ๋‚ฎ์Œ

6.2 ์ตœ์ข… ๊ถŒ์žฅ์‚ฌํ•ญ

โœ… ๊ถŒ์žฅ: SAM3 ๋‹จ๋… ์œ ์ง€ + ์ตœ์ ํ™”

์ด์œ :

  1. SAM3์˜ temporal tracking์ด ๋™์ผ ์™ธ๊ด€ ๊ฐ์ฒด ์ฒ˜๋ฆฌ์— ๋” ์šฐ์ˆ˜
  2. ํ†ตํ•ฉ ๋ชจ๋ธ โ†’ ๋ชจ๋“ˆ ๊ฐ„ ์ •ํ•ฉ์„ฑ ๋ฌธ์ œ ์—†์Œ
  3. GroundingDINO ์ถ”๊ฐ€ ์‹œ ๋ฉ”๋ชจ๋ฆฌ ์ฆ๊ฐ€ ๋Œ€๋น„ ํšจ๊ณผ ๋ฏธ๋ฏธ

๋Œ€์‹  ์•„๋ž˜ ์ตœ์ ํ™” ๊ถŒ์žฅ:

# 1. ํ”„๋ ˆ์ž„ ์Šคํ‚ต (2๋ฐฐ ๋น ๋ฆ„)
for frame_idx in range(0, num_frames, 2):  # ๋งค 2ํ”„๋ ˆ์ž„๋งˆ๋‹ค
    outputs = propagate(...)
    
# 2. ํ•ด์ƒ๋„ ๊ฐ์†Œ ํ›„ ์—…์Šค์ผ€์ผ (1.5๋ฐฐ ๋น ๋ฆ„)
resized_frame = cv2.resize(frame, (width//2, height//2))
mask_low = sam3_process(resized_frame)
mask_high = cv2.resize(mask_low, (width, height))

# 3. SAM3 ๊ฒฝ๋Ÿ‰ํ™” ๋ชจ๋ธ ์‚ฌ์šฉ
model = build_sam3("sam3_hiera_tiny")  # ๋Œ€์‹  sam3_hiera_large

โš ๏ธ ์กฐ๊ฑด๋ถ€ ๊ถŒ์žฅ: GroundingDINO ํ†ตํ•ฉ (์‹œ๋‚˜๋ฆฌ์˜ค A)

์‚ฌ์šฉ ์กฐ๊ฑด:

  • ์™ธ๊ด€์ด ๋‹ค์–‘ํ•œ ๊ฐ์ฒด๋ฅผ ๋‹ค๋ฃฐ ๋•Œ (์˜ˆ: person, car, dog ํ˜ผํ•ฉ)
  • ์ดˆ๊ธฐ ํ”„๋ ˆ์ž„์—์„œ SAM3 ํ…์ŠคํŠธ ํ”„๋กฌํ”„ํŠธ๊ฐ€ ์ž์ฃผ ์‹คํŒจํ•  ๋•Œ
  • GPU ๋ฉ”๋ชจ๋ฆฌ 12GB ์ด์ƒ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•  ๋•Œ

๊ตฌํ˜„ ์šฐ์„ ์ˆœ์œ„:

  1. GroundingDINO๋กœ ์ฒซ ํ”„๋ ˆ์ž„๋งŒ bbox ํƒ์ง€
  2. SAM3์— bbox ํ”„๋กฌํ”„ํŠธ ์ „๋‹ฌ
  3. SAM3 propagate๋กœ ๋น„๋””์˜ค ์ถ”์ 

โŒ ๋น„๊ถŒ์žฅ: ์ „๋ฉด ๊ต์ฒด (์‹œ๋‚˜๋ฆฌ์˜ค C)

์ด์œ :

  • SAM3์˜ temporal tracking ๋ฌดํšจํ™”
  • ๋™์ผ ์™ธ๊ด€ ๊ฐ์ฒด์—์„œ ์„ฑ๋Šฅ ๋Œ€ํญ ์ €ํ•˜
  • ๊ตฌํ˜„ ๋ณต์žก๋„ ๋†’์Œ

6.3 ๋Œ€์•ˆ: ํ˜„์žฌ ์‹œ์Šคํ…œ ๊ฐ•ํ™”

ํ˜„์žฌ ๊ตฌํ˜„๋œ ์ปค์Šคํ…€ ๋กœ์ง์ด ์ด๋ฏธ ๋งค์šฐ ๊ฐ•๋ ฅํ•ฉ๋‹ˆ๋‹ค:

๊ธฐ๋Šฅ ์ƒํƒœ ํšจ๊ณผ
Velocity ๊ธฐ๋ฐ˜ Occlusion ๋ณต์› โœ… ๊ตฌํ˜„๋จ ID ์†Œ์‹ค ๋ฐฉ์ง€
Anti-Tail ํ•„ํ„ฐ + ํžˆ์Šคํ† ๋ฆฌ ์œ ์ง€ โœ… ๊ตฌํ˜„๋จ ๊ผฌ๋ฆฌ ๋ถ„๋ฆฌ ๋ฐฉ์ง€
Side View ๋ฐ˜์‚ฌ ์ œ๊ฑฐ โœ… ๊ตฌํ˜„๋จ ์˜คํƒ์ง€ ๊ฐ์†Œ
IoU ๊ธฐ๋ฐ˜ ์ฒญํฌ ์—ฐ๊ฒฐ โœ… ๊ตฌํ˜„๋จ ๊ธด ์˜์ƒ ID ์ผ๊ด€์„ฑ

์ถ”๊ฐ€ ๊ฐœ์„  ๊ฐ€๋Šฅ ์˜์—ญ:

  1. Adaptive thresholding (๊ฐ์ฒด ์†๋„์— ๋”ฐ๋ผ ๋™์  ์กฐ์ •)
  2. Multi-scale processing (๋‹ค์–‘ํ•œ ํฌ๊ธฐ ๊ฐ์ฒด ๋Œ€์‘)
  3. Confidence-based filtering (๋‚ฎ์€ ์‹ ๋ขฐ๋„ ๋งˆ์Šคํฌ ์ œ๊ฑฐ)

๐Ÿ“Š ์š”์•ฝํ‘œ

ํ†ตํ•ฉ ๋ฐฉ์‹ ์†๋„ ์ •ํ™•๋„ ๋ฉ”๋ชจ๋ฆฌ ๋ณต์žก๋„ ๊ถŒ์žฅ
SAM3 ๋‹จ๋… (ํ˜„์žฌ) โญ๏ธโญ๏ธโญ๏ธ โญ๏ธโญ๏ธโญ๏ธโญ๏ธ โญ๏ธโญ๏ธโญ๏ธโญ๏ธ โญ๏ธโญ๏ธโญ๏ธโญ๏ธ โœ…
+ GroundingDINO (A) โญ๏ธโญ๏ธโญ๏ธ โญ๏ธโญ๏ธโญ๏ธ โญ๏ธโญ๏ธ โญ๏ธโญ๏ธโญ๏ธ โš ๏ธ
+ GroundingDINO (B) โญ๏ธโญ๏ธ โญ๏ธโญ๏ธโญ๏ธ โญ๏ธโญ๏ธ โญ๏ธโญ๏ธ โŒ
์ „๋ฉด ๊ต์ฒด (C) โญ๏ธโญ๏ธ โญ๏ธโญ๏ธ โญ๏ธ โญ๏ธ โŒ

๐ŸŽฏ ์ตœ์ข… ๊ฒฐ๋ก 

GroundingDINO๋ฅผ ์ ‘๋ชฉํ•˜๋Š” ๊ฒƒ์€ ํ˜„์žฌ use case(๋™์ผ ์™ธ๊ด€ ๊ฐ์ฒด ๋‹ค์ค‘ ์ถ”์ )์—์„œ ROI๊ฐ€ ๋‚ฎ์Šต๋‹ˆ๋‹ค.

์ด์œ :

  1. ์†๋„ ํ–ฅ์ƒ ๊ฑฐ์˜ ์—†์Œ (0.4%)
  2. ๋™์ผ ์™ธ๊ด€ ๊ฐ์ฒด์—์„œ ์ •ํ™•๋„ ์˜คํžˆ๋ ค ์ €ํ•˜
  3. GPU ๋ฉ”๋ชจ๋ฆฌ 40% ์ฆ๊ฐ€
  4. SAM3์˜ ๊ฐ•๋ ฅํ•œ temporal tracking์„ ์ถฉ๋ถ„ํžˆ ํ™œ์šฉ ๋ชปํ•จ

๊ถŒ์žฅ ์‚ฌํ•ญ:
ํ˜„์žฌ SAM3 ๋‹จ๋… ์‹œ์Šคํ…œ์„ ์œ ์ง€ํ•˜๊ณ , ํ”„๋ ˆ์ž„ ์Šคํ‚ต/ํ•ด์ƒ๋„ ๊ฐ์†Œ ๋“ฑ์˜ ์ตœ์ ํ™”๋กœ ์†๋„๋ฅผ ๊ฐœ์„ ํ•˜๋Š” ๊ฒƒ์ด ๋” ํšจ๊ณผ์ ์ž…๋‹ˆ๋‹ค.


์ž‘์„ฑ์ž: AI Assistant
๊ฒ€ํ†  ๋Œ€์ƒ: SAM3 ๋น„๋””์˜ค ์„ธ๊ทธ๋ฉ˜ํ…Œ์ด์…˜ ์‹œ์Šคํ…œ (์ฅ ์ถ”์  use case)
๋ถ„์„ ๊ธฐ์ค€: ์„ฑ๋Šฅ, ์†๋„, ๋ฉ”๋ชจ๋ฆฌ, ๊ตฌํ˜„ ๋ณต์žก๋„, ROI