
Grounding DINO[[grounding-dino]]

PyTorch

Overview[[overview]]

The Grounding DINO model was proposed in Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection by Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding DINO extends a closed-set object detection model with a text encoder, which enables open-set object detection. The model achieves strong results, such as 52.5 AP on COCO zero-shot.

The abstract from the paper is the following:

In this paper, we present an open-set object detector, called Grounding DINO, by marrying the Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects with human inputs such as category names or referring expressions. The key solution of open-set object detection is introducing language to a closed-set detector for open-set concept generalization. To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for cross-modality fusion. While previous works mainly evaluate open-set object detection on novel categories, we propose to also perform evaluations on referring expression comprehension for objects specified with attributes. Grounding DINO performs remarkably well on all three settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. Grounding DINO achieves a 52.5 AP (Average Precision) on the COCO detection zero-shot transfer benchmark, i.e., without any training data from COCO. It sets a new record on the ODinW zero-shot benchmark with a mean 26.1 AP.

Grounding DINO overview. Taken from the original paper.

This model was contributed by EduardoPacheco and nielsr. The original code can be found here.

Usage tips[[usage-tips]]

  • Use [GroundingDinoProcessor] to prepare image-text pairs for the model.
  • Separate classes in the text with a period, e.g. "a cat. a dog."
  • When using multiple classes (e.g. "a cat. a dog."), post-process the outputs with post_process_grounded_object_detection from [GroundingDinoProcessor]. The labels returned by post_process_object_detection are only the indices along the model dimension where prob > threshold, not the class names.
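As a minimal sketch of the period-separated format described in the tips above, the helper below joins a list of class names into one prompt string. The name `build_text_prompt` is hypothetical and not part of the library; it assumes you want to pass a single string rather than a list of labels.

```python
def build_text_prompt(class_names):
    """Join class names into one period-separated prompt string.

    Hypothetical helper (not part of transformers): each class name is
    trimmed, stripped of any trailing period, and then terminated with
    exactly one period.
    """
    cleaned = (name.strip().rstrip(".").strip() for name in class_names)
    # Skip names that are empty after cleaning.
    return " ".join(f"{name}." for name in cleaned if name)


prompt = build_text_prompt(["a cat", "a dog. "])
print(prompt)  # a cat. a dog.
```

Normalizing the trailing period in one place avoids prompts such as "a cat.. a dog" when class names come from mixed sources.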

Here's how to use the model for zero-shot object detection:

>>> import requests

>>> import torch
>>> from PIL import Image
>>> from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

>>> model_id = "IDEA-Research/grounding-dino-tiny"
>>> device = "cuda"

>>> processor = AutoProcessor.from_pretrained(model_id)
>>> model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)

>>> image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(image_url, stream=True).raw)
>>> # Check for cats and remote controls
>>> text_labels = [["a cat", "a remote control"]]

>>> inputs = processor(images=image, text=text_labels, return_tensors="pt").to(device)
>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> results = processor.post_process_grounded_object_detection(
...     outputs,
...     inputs.input_ids,
...     box_threshold=0.4,
...     text_threshold=0.3,
...     target_sizes=[image.size[::-1]]
... )

>>> # Retrieve the first image result
>>> result = results[0]
>>> for box, score, labels in zip(result["boxes"], result["scores"], result["labels"]):
...     box = [round(x, 2) for x in box.tolist()]
...     print(f"Detected {labels} with confidence {round(score.item(), 3)} at location {box}")
Detected a cat with confidence 0.468 at location [344.78, 22.9, 637.3, 373.62]
Detected a cat with confidence 0.426 at location [11.74, 51.55, 316.51, 473.22]
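The loop above prints one line per detection. When several phrases are queried at once, it can be convenient to group detections by label first. The sketch below assumes the boxes and scores have already been converted to plain Python lists (for example with `.tolist()`); `group_by_label` and its `min_score` parameter are hypothetical names used only for illustration.

```python
from collections import defaultdict


def group_by_label(boxes, scores, labels, min_score=0.0):
    """Group (score, box) pairs by label, keeping scores >= min_score.

    Hypothetical helper: expects plain Python lists, not tensors.
    """
    grouped = defaultdict(list)
    for box, score, label in zip(boxes, scores, labels):
        if score >= min_score:
            grouped[label].append((round(score, 3), [round(x, 2) for x in box]))
    return dict(grouped)


# Values copied from the example output above.
detections = group_by_label(
    boxes=[[344.78, 22.9, 637.3, 373.62], [11.74, 51.55, 316.51, 473.22]],
    scores=[0.468, 0.426],
    labels=["a cat", "a cat"],
)
for label, items in detections.items():
    print(label, len(items))  # a cat 2
```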

Grounded SAM[[grounded-sam]]

Grounding DINO can be combined with the Segment Anything model for text-based mask generation, as introduced in Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks. See this demo notebook 🌍 for details.

Grounded SAM overview. Taken from the original repository.
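As a rough sketch of the hand-off between the two models, the detected boxes can be reshaped into per-image box prompts before being passed to a segmentation model. This example makes several assumptions: that the downstream processor accepts a nested list of boxes per image in `[x_min, y_min, x_max, y_max]` pixel format (check the SAM documentation for the exact input it expects), that boxes have been converted to plain lists, and that the image size is `(480, 640)`. The helper `to_sam_input_boxes` is hypothetical.

```python
def _clip(value, upper):
    """Clamp a coordinate to the range [0, upper] and return it as a float."""
    return float(min(max(value, 0.0), upper))


def to_sam_input_boxes(results, image_sizes):
    """Build a nested list of box prompts: one list of boxes per image.

    Hypothetical helper. `results` is a list of dicts whose "boxes" entry is
    a list of [x_min, y_min, x_max, y_max]; `image_sizes` is a list of
    (height, width) pairs in the same order.
    """
    batch = []
    for result, (height, width) in zip(results, image_sizes):
        image_boxes = []
        for x_min, y_min, x_max, y_max in result["boxes"]:
            # Clip each corner so box prompts never fall outside the image.
            image_boxes.append(
                [_clip(x_min, width), _clip(y_min, height),
                 _clip(x_max, width), _clip(y_max, height)]
            )
        batch.append(image_boxes)
    return batch


# Boxes taken from the detection example; the image size is an assumption.
results = [{"boxes": [[344.78, 22.9, 637.3, 373.62], [11.74, 51.55, 316.51, 473.22]]}]
input_boxes = to_sam_input_boxes(results, image_sizes=[(480, 640)])
print(len(input_boxes), len(input_boxes[0]))  # 1 2
```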

Resources[[resources]]

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Grounding DINO. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

  • Demo notebooks for running inference with Grounding DINO and for combining it with SAM can be found here. 🌎

GroundingDinoImageProcessor

[[autodoc]] GroundingDinoImageProcessor - preprocess

GroundingDinoImageProcessorFast

[[autodoc]] GroundingDinoImageProcessorFast - preprocess - post_process_object_detection

GroundingDinoProcessor

[[autodoc]] GroundingDinoProcessor - post_process_grounded_object_detection

GroundingDinoConfig

[[autodoc]] GroundingDinoConfig

GroundingDinoModel

[[autodoc]] GroundingDinoModel - forward

GroundingDinoForObjectDetection

[[autodoc]] GroundingDinoForObjectDetection - forward