
Swin Transformer [[swin-transformer]]

Overview [[overview]]

The Swin Transformer was proposed in the paper Swin Transformer: Hierarchical Vision Transformer using Shifted Windows by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo.

The abstract from the paper is the following:

This paper presents a new vision Transformer, called Swin Transformer, that can serve as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted Windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state of the art by +2.7 box AP and +2.6 mask AP on COCO, and by +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures.
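To make the linear-complexity claim concrete, the paper compares global multi-head self-attention (MSA) with window-based self-attention (W-MSA) on a feature map of $h \times w$ patches with channel dimension $C$ and window size $M$. These symbols follow the paper's notation and do not appear elsewhere on this page:

```latex
\Omega(\mathrm{MSA}) = 4hwC^{2} + 2(hw)^{2}C
\Omega(\text{W-MSA}) = 4hwC^{2} + 2M^{2}hwC
```

The first expression is quadratic in the number of patches $hw$, while the second is linear in $hw$ when $M$ is held fixed (the paper uses $M = 7$ by default).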


Swin Transformer architecture. Taken from the original paper.

This model was contributed by novice03. The TensorFlow version was contributed by amyeroberts. The original code can be found here.

Usage tips [[usage-tips]]

  • Swin pads its inputs, so it supports any input height and width (as long as they are divisible by 32).
  • Swin can be used as a backbone. When output_hidden_states = True, it outputs both hidden_states and reshaped_hidden_states. The reshaped_hidden_states have shape (batch, num_channels, height, width) rather than (batch_size, sequence_length, num_channels).
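The relationship between the two layouts, and where the factor of 32 comes from, can be sketched with plain shape arithmetic. This is only an illustration under stated assumptions: the values patch_size=4, embed_dim=96, and four stages are taken from the Swin-T configuration in the paper, not from this page, and stage_shapes is a hypothetical helper, not part of the library. The sketch also does not say which of these tensors the library actually returns in the hidden_states tuple.

```python
# Shape arithmetic sketch (hypothetical helper, not a transformers API).
# Assumed values: patch_size=4, embed_dim=96, num_stages=4 (Swin-T in the paper).

def stage_shapes(height, width, batch=1, patch_size=4, embed_dim=96, num_stages=4):
    """Return one (sequence_layout, spatial_layout) shape pair per stage."""
    # Patch embedding downsamples by patch_size; each later stage halves again,
    # so the last stage has a total stride of 4 * 2**3 = 32 with the defaults.
    total_stride = patch_size * 2 ** (num_stages - 1)
    if height % total_stride or width % total_stride:
        raise ValueError(
            f"this sketch only handles sizes divisible by {total_stride}"
        )

    shapes = []
    for stage in range(num_stages):
        stride = patch_size * 2 ** stage
        h, w = height // stride, width // stride
        channels = embed_dim * 2 ** stage  # channels double at every stage
        sequence_layout = (batch, h * w, channels)  # (batch_size, sequence_length, num_channels)
        spatial_layout = (batch, channels, h, w)    # (batch, num_channels, height, width)
        shapes.append((sequence_layout, spatial_layout))
    return shapes


for seq_shape, spatial_shape in stage_shapes(224, 224):
    print(seq_shape, "<->", spatial_shape)
```

In both layouts the same values are present; sequence_length is simply height * width of the feature map, which is why the reshaped form is convenient for detection or segmentation heads that expect image-like tensors.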

Resources [[resources]]

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Swin Transformer.

In addition:

If you would like to add a new resource, feel free to open a Pull Request and we will review it! The resource should demonstrate something new rather than duplicate an existing one.

SwinConfig [[transformers.SwinConfig]]

[[autodoc]] SwinConfig

SwinModel [[transformers.SwinModel]]

[[autodoc]] SwinModel - forward

SwinForMaskedImageModeling [[transformers.SwinForMaskedImageModeling]]

[[autodoc]] SwinForMaskedImageModeling - forward

SwinForImageClassification [[transformers.SwinForImageClassification]]

[[autodoc]] transformers.SwinForImageClassification - forward

TFSwinModel [[transformers.TFSwinModel]]

[[autodoc]] TFSwinModel - call

TFSwinForMaskedImageModeling [[transformers.TFSwinForMaskedImageModeling]]

[[autodoc]] TFSwinForMaskedImageModeling - call

TFSwinForImageClassification [[transformers.TFSwinForImageClassification]]

[[autodoc]] transformers.TFSwinForImageClassification - call