
Swin Transformer V2 [[swin-transformer-v2]]

Overview [[overview]]

Swin Transformer V2 was proposed in the paper Swin Transformer V2: Scaling Up Capacity and Resolution by Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, and Baining Guo.

The abstract from the paper is the following:

Large-scale NLP models have been shown to significantly improve performance on language tasks with no signs of saturation. They also demonstrate few-shot learning capabilities similar to those of human beings. This paper aims to explore large-scale models in computer vision. We tackle three major issues in training and applying large vision models: training instability, resolution gaps between pre-training and fine-tuning, and the high demand for labeled data. Three main techniques are proposed: 1) a residual-post-norm method combined with cosine attention to improve training stability; 2) a log-spaced continuous position bias method that allows models pre-trained on low-resolution images to be transferred to high-resolution inputs; 3) a self-supervised pre-training method, SimMIM, to reduce the need for vast amounts of labeled images. Through these techniques, we successfully trained a 3 billion-parameter Swin Transformer V2 model, which is the largest dense vision model to date and can handle images of up to 1,536×1,536 resolution. It set new performance records on four representative vision tasks: ImageNet-V2 image classification, COCO object detection, ADE20K semantic segmentation, and Kinetics-400 video action classification. Our training is also much more efficient than that of Google's billion-level visual models, using 40 times less labeled data and 40 times less training time.
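Two of the techniques named in the abstract reduce to short formulas in the paper: the log-spaced coordinate transform, sign(Δx) · log(1 + |Δx|), and scaled cosine attention, cos(q, k) / τ + B with τ kept at or above 0.01. The sketch below writes both out in plain Python to show why they help. The function names `log_spaced` and `cosine_attention_logit` are hypothetical, chosen for illustration, and the Transformers implementation may differ in details such as coordinate normalization.

```python
import math


def log_spaced(delta: float) -> float:
    """Map a linear relative offset to log space: sign(d) * log(1 + |d|)."""
    return math.copysign(math.log1p(abs(delta)), delta)


def cosine_attention_logit(q, k, tau=0.1, bias=0.0):
    """Scaled cosine attention logit for one query/key pair.

    tau is a learnable temperature in the paper; here it is a plain float
    (an assumption for illustration) clamped to be at least 0.01.
    """
    dot = sum(a * b for a, b in zip(q, k))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_k = math.sqrt(sum(b * b for b in k))
    return dot / (norm_q * norm_k) / max(tau, 0.01) + bias


# Growing the window from 8x8 to 16x16 widens the offset range
# from [-7, 7] to [-15, 15].
linear_ratio = 15 / 7                          # range growth in linear space
log_ratio = log_spaced(15) / log_spaced(7)     # range growth in log space

# Cosine similarity ignores vector magnitude, so scaling q leaves the logit unchanged.
small = cosine_attention_logit([1.0, 2.0], [3.0, 4.0])
large = cosine_attention_logit([10.0, 20.0], [3.0, 4.0])
```

In log space the coordinate range grows by about 1.33× instead of about 2.14×, so the position-bias network has to extrapolate much less when the window size changes. The magnitude invariance of the cosine logit is what keeps attention values bounded as activations grow in deeper layers.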

This model was contributed by nandwalritik. The original code can be found here.

Resources [[resources]]

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Swin Transformer v2.

In addition:

If you would like to add a new resource, feel free to open a Pull Request and we will review it. The resource should demonstrate something new rather than duplicate an existing resource.

Swinv2Config [[transformers.Swinv2Config]]

[[autodoc]] Swinv2Config
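As a minimal sketch of how the configuration might be used, the example below builds a deliberately tiny `Swinv2Config`. The sizes are arbitrary values chosen so the example stays small; they do not correspond to any released checkpoint.

```python
from transformers import Swinv2Config

# Tiny, illustrative settings (assumed values, not a published model size).
config = Swinv2Config(
    image_size=64,
    patch_size=4,
    embed_dim=16,
    depths=[2, 2],                 # two stages with two blocks each
    num_heads=[2, 4],              # one entry per stage
    window_size=4,
    pretrained_window_sizes=[0, 0],
)

# The channel dimension doubles at every stage after the first,
# so the final hidden size is embed_dim * 2 ** (number of stages - 1).
print(config.num_layers, config.hidden_size)
```

`depths`, `num_heads`, and `pretrained_window_sizes` are per-stage lists, so they should all have the same length.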

Swinv2Model [[transformers.Swinv2Model]]

[[autodoc]] Swinv2Model - forward
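The following sketch runs a forward pass through a randomly initialized `Swinv2Model` to show the shape of its outputs. It assumes PyTorch is installed and uses a tiny, made-up configuration and a random tensor in place of a preprocessed image, so the values themselves are meaningless.

```python
import torch
from transformers import Swinv2Config, Swinv2Model

# Tiny, illustrative configuration (assumed values, not a released checkpoint).
config = Swinv2Config(
    image_size=64,
    patch_size=4,
    embed_dim=16,
    depths=[2, 2],
    num_heads=[2, 4],
    window_size=4,
    pretrained_window_sizes=[0, 0],
)
model = Swinv2Model(config)  # random weights
model.eval()

# Random stand-in for a batch of one preprocessed RGB image.
pixel_values = torch.randn(1, config.num_channels, 64, 64)

with torch.no_grad():
    outputs = model(pixel_values=pixel_values)

# 64 / 4 = 16 patches per side, halved once by patch merging -> 8 * 8 = 64 tokens.
print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden_size)
print(outputs.pooler_output.shape)      # (batch, hidden_size)
```

For real use, a pretrained checkpoint would be loaded with `from_pretrained` and the image prepared with the matching image processor.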

Swinv2ForMaskedImageModeling [[transformers.Swinv2ForMaskedImageModeling]]

[[autodoc]] Swinv2ForMaskedImageModeling - forward
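A minimal sketch of masked image modeling with a randomly initialized `Swinv2ForMaskedImageModeling` follows. The configuration is a tiny, made-up one, and `encoder_stride` is set to 8 here on the assumption that it should equal the total downsampling factor of this two-stage setup (patch size 4, then one 2× merge), so the decoder reconstructs the input resolution.

```python
import torch
from transformers import Swinv2Config, Swinv2ForMaskedImageModeling

# Tiny, illustrative configuration (assumed values, not a released checkpoint).
config = Swinv2Config(
    image_size=64,
    patch_size=4,
    embed_dim=16,
    depths=[2, 2],
    num_heads=[2, 4],
    window_size=4,
    pretrained_window_sizes=[0, 0],
    encoder_stride=8,  # 4 (patch size) * 2 (one patch-merging step)
)
model = Swinv2ForMaskedImageModeling(config)  # random weights
model.eval()

pixel_values = torch.randn(1, config.num_channels, 64, 64)

# One boolean per patch: True means the patch is replaced by the mask token.
num_patches = (config.image_size // config.patch_size) ** 2
bool_masked_pos = torch.randint(0, 2, (1, num_patches)).bool()

with torch.no_grad():
    outputs = model(pixel_values=pixel_values, bool_masked_pos=bool_masked_pos)

print(outputs.loss)                  # L1 reconstruction loss over masked patches
print(outputs.reconstruction.shape)  # same shape as pixel_values
```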

Swinv2ForImageClassification [[transformers.Swinv2ForImageClassification]]

[[autodoc]] transformers.Swinv2ForImageClassification - forward
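The sketch below shows the shape of a classification forward pass with a randomly initialized `Swinv2ForImageClassification`. The tiny configuration and the three-class label count are assumptions made for illustration, and the prediction is arbitrary because the weights are untrained.

```python
import torch
from transformers import Swinv2Config, Swinv2ForImageClassification

# Tiny, illustrative configuration with a hypothetical 3-class head.
config = Swinv2Config(
    image_size=64,
    patch_size=4,
    embed_dim=16,
    depths=[2, 2],
    num_heads=[2, 4],
    window_size=4,
    pretrained_window_sizes=[0, 0],
    num_labels=3,
)
model = Swinv2ForImageClassification(config)  # random weights
model.eval()

# Random stand-in for a batch of one preprocessed RGB image.
pixel_values = torch.randn(1, config.num_channels, 64, 64)

with torch.no_grad():
    logits = model(pixel_values=pixel_values).logits

predicted_class = logits.argmax(dim=-1).item()
print(logits.shape, predicted_class)
```

With a pretrained checkpoint, `model.config.id2label[predicted_class]` would map the index to a human-readable label.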