
Vision Transformer (ViT) [[vision-transformer-vit]]

๊ฐœ์š” [[overview]]

Vision Transformer (ViT) ๋ชจ๋ธ์€ Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby๊ฐ€ ์ œ์•ˆํ•œ ๋…ผ๋ฌธ An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale์—์„œ ์†Œ๊ฐœ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” Transformer ์ธ์ฝ”๋”๋ฅผ ImageNet์—์„œ ์„ฑ๊ณต์ ์œผ๋กœ ํ›ˆ๋ จ์‹œํ‚จ ์ฒซ ๋ฒˆ์งธ ๋…ผ๋ฌธ์œผ๋กœ, ๊ธฐ์กด์˜ ์ž˜ ์•Œ๋ ค์ง„ ํ•ฉ์„ฑ๊ณฑ ์‹ ๊ฒฝ๋ง(CNN) ๊ตฌ์กฐ์™€ ๋น„๊ตํ•ด ๋งค์šฐ ์šฐ์ˆ˜ํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋‹ฌ์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค.

๋…ผ๋ฌธ์˜ ์ดˆ๋ก์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

Transformer ์•„ํ‚คํ…์ฒ˜๋Š” ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ ์ž‘์—…์—์„œ ์‚ฌ์‹ค์ƒ ํ‘œ์ค€์œผ๋กœ ์ž๋ฆฌ ์žก์•˜์œผ๋‚˜, ์ปดํ“จํ„ฐ ๋น„์ „ ๋ถ„์•ผ์—์„œ์˜ ์ ์šฉ์€ ์—ฌ์ „ํžˆ ์ œํ•œ์ ์ž…๋‹ˆ๋‹ค. ๋น„์ „์—์„œ ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜์€ ์ข…์ข… ํ•ฉ์„ฑ๊ณฑ ์‹ ๊ฒฝ๋ง(CNN)๊ณผ ๊ฒฐํ•ฉํ•˜์—ฌ ์‚ฌ์šฉ๋˜๊ฑฐ๋‚˜, ์ „์ฒด ๊ตฌ์กฐ๋ฅผ ์œ ์ง€ํ•˜๋ฉด์„œ ํ•ฉ์„ฑ๊ณฑ ์‹ ๊ฒฝ๋ง์˜ ํŠน์ • ๊ตฌ์„ฑ ์š”์†Œ๋ฅผ ๋Œ€์ฒดํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” ์ด๋Ÿฌํ•œ CNN ์˜์กด์„ฑ์ด ํ•„์š”ํ•˜์ง€ ์•Š์œผ๋ฉฐ, ์ด๋ฏธ์ง€ ํŒจ์น˜๋ฅผ ์ˆœ์ฐจ์ ์œผ๋กœ ์ž…๋ ฅ๋ฐ›๋Š” ์ˆœ์ˆ˜ํ•œ Transformer๊ฐ€ ์ด๋ฏธ์ง€ ๋ถ„๋ฅ˜ ์ž‘์—…์—์„œ ๋งค์šฐ ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋ฐœํœ˜ํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ๋กœ ์‚ฌ์ „ ํ•™์Šต๋œ ํ›„, ImageNet, CIFAR-100, VTAB ๋“ฑ ๋‹ค์–‘ํ•œ ์ค‘์†Œํ˜• ์ด๋ฏธ์ง€ ์ธ์‹ ๋ฒค์น˜๋งˆํฌ์— ์ ์šฉํ•˜๋ฉด Vision Transformer(ViT)๋Š” ์ตœ์‹  ํ•ฉ์„ฑ๊ณฑ ์‹ ๊ฒฝ๋ง๊ณผ ๋น„๊ตํ•ด ๋งค์šฐ ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋ฐœํœ˜ํ•˜๋ฉด์„œ๋„ ํ›ˆ๋ จ์— ํ•„์š”ํ•œ ๊ณ„์‚ฐ ์ž์›์„ ์ƒ๋‹นํžˆ ์ค„์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ViT architecture. Taken from the original paper.

์›๋ž˜์˜ Vision Transformer์— ์ด์–ด, ์—ฌ๋Ÿฌ ํ›„์† ์—ฐ๊ตฌ๋“ค์ด ์ง„ํ–‰๋˜์—ˆ์Šต๋‹ˆ๋‹ค:

  • DeiT (Data-efficient Image Transformers) by Facebook AI. DeiT models are distilled vision transformers. The authors of DeiT also released more efficiently trained ViT models, which you can plug directly into [ViTModel] or [ViTForImageClassification]. There are 4 variants available (in 3 different sizes): facebook/deit-tiny-patch16-224, facebook/deit-small-patch16-224, facebook/deit-base-patch16-224 and facebook/deit-base-patch16-384. Note that one should use [DeiTImageProcessor] in order to prepare images for the model.

  • BEiT (BERT pre-training of Image Transformers) by Microsoft Research. BEiT models outperform supervised pre-trained vision transformers using a self-supervised method inspired by BERT (masked image modeling) and based on a VQ-VAE.

  • DINO (Vision Transformers์˜ self-supervised ํ›ˆ๋ จ์„ ์œ„ํ•œ ๋ฐฉ๋ฒ•) (Facebook AI ๊ฐœ๋ฐœ). DINO ๋ฐฉ๋ฒ•์œผ๋กœ ํ›ˆ๋ จ๋œ Vision Transformer๋Š” ํ•™์Šต๋˜์ง€ ์•Š์€ ์ƒํƒœ์—์„œ๋„ ๊ฐ์ฒด๋ฅผ ๋ถ„ํ• ํ•  ์ˆ˜ ์žˆ๋Š” ํ•ฉ์„ฑ๊ณฑ ์‹ ๊ฒฝ๋ง์—์„œ๋Š” ๋ณผ ์ˆ˜ ์—†๋Š” ๋งค์šฐ ํฅ๋ฏธ๋กœ์šด ๋Šฅ๋ ฅ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. DINO ์ฒดํฌํฌ์ธํŠธ๋Š” hub์—์„œ ์ฐพ์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

  • MAE (Masked Autoencoders) by Facebook AI. By pre-training Vision Transformers to reconstruct the pixel values of a high portion (75%) of masked patches, using an asymmetric encoder-decoder architecture, the authors show that this simple method outperforms supervised pre-training after fine-tuning.

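As a concrete illustration of the DeiT note above, the efficiently trained DeiT weights can be loaded straight into the plain ViT classes, while image preparation goes through DeiTImageProcessor. A minimal sketch, using the smallest listed checkpoint and a blank stand-in image instead of a real photo:

```python
import torch
from PIL import Image
from transformers import DeiTImageProcessor, ViTForImageClassification

# The non-distilled DeiT checkpoints use the plain ViT architecture,
# so ViTForImageClassification can load them directly.
processor = DeiTImageProcessor.from_pretrained("facebook/deit-tiny-patch16-224")
model = ViTForImageClassification.from_pretrained("facebook/deit-tiny-patch16-224")

image = Image.new("RGB", (224, 224), color="white")  # stand-in for a real image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # one logit per ImageNet-1k class

print(logits.shape)  # torch.Size([1, 1000])
```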
์ด ๋ชจ๋ธ์€ nielsr์— ์˜ํ•ด ๊ธฐ์—ฌ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์›๋ณธ ์ฝ”๋“œ(JAX๋กœ ์ž‘์„ฑ๋จ)์€ ์—ฌ๊ธฐ์—์„œ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ฐธ๊ณ ๋กœ, ์šฐ๋ฆฌ๋Š” Ross Wightman์˜ timm ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์—์„œ JAX์—์„œ PyTorch๋กœ ๋ณ€ํ™˜๋œ ๊ฐ€์ค‘์น˜๋ฅผ ๋‹ค์‹œ ๋ณ€ํ™˜ํ–ˆ์Šต๋‹ˆ๋‹ค. ๋ชจ๋“  ๊ณต๋กœ๋Š” ๊ทธ์—๊ฒŒ ๋Œ๋ฆฝ๋‹ˆ๋‹ค!

์‚ฌ์šฉ ํŒ [[usage-tips]]

  • Transformer ์ธ์ฝ”๋”์— ์ด๋ฏธ์ง€๋ฅผ ์ž…๋ ฅํ•˜๊ธฐ ์œ„ํ•ด, ๊ฐ ์ด๋ฏธ์ง€๋Š” ๊ณ ์ • ํฌ๊ธฐ์˜ ๊ฒน์น˜์ง€ ์•Š๋Š” ํŒจ์น˜๋“ค๋กœ ๋ถ„ํ• ๋œ ํ›„ ์„ ํ˜• ์ž„๋ฒ ๋”ฉ๋ฉ๋‹ˆ๋‹ค. ์ „์ฒด ์ด๋ฏธ์ง€๋ฅผ ๋Œ€ํ‘œํ•˜๋Š” [CLS] ํ† ํฐ์ด ์ถ”๊ฐ€๋˜์–ด, ๋ถ„๋ฅ˜์— ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ €์ž๋“ค์€ ๋˜ํ•œ ์ ˆ๋Œ€ ์œ„์น˜ ์ž„๋ฒ ๋”ฉ์„ ์ถ”๊ฐ€ํ•˜์—ฌ, ๊ฒฐ๊ณผ์ ์œผ๋กœ ์ƒ์„ฑ๋œ ๋ฒกํ„ฐ ์‹œํ€€์Šค๋ฅผ ํ‘œ์ค€ Transformer ์ธ์ฝ”๋”์— ์ž…๋ ฅํ•ฉ๋‹ˆ๋‹ค.
  • Vision Transformer๋Š” ๋ชจ๋“  ์ด๋ฏธ์ง€๊ฐ€ ๋™์ผํ•œ ํฌ๊ธฐ(ํ•ด์ƒ๋„)์—ฌ์•ผ ํ•˜๋ฏ€๋กœ, [ViTImageProcessor]๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ด๋ฏธ์ง€๋ฅผ ๋ชจ๋ธ์— ๋งž๊ฒŒ ๋ฆฌ์‚ฌ์ด์ฆˆ(๋˜๋Š” ๋ฆฌ์Šค์ผ€์ผ)ํ•˜๊ณ  ์ •๊ทœํ™”ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์‚ฌ์ „ ํ•™์Šต์ด๋‚˜ ๋ฏธ์„ธ ์กฐ์ • ์‹œ ์‚ฌ์šฉ๋œ ํŒจ์น˜ ํ•ด์ƒ๋„์™€ ์ด๋ฏธ์ง€ ํ•ด์ƒ๋„๋Š” ๊ฐ ์ฒดํฌํฌ์ธํŠธ์˜ ์ด๋ฆ„์— ๋ฐ˜์˜๋ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, google/vit-base-patch16-224๋Š” ํŒจ์น˜ ํ•ด์ƒ๋„๊ฐ€ 16x16์ด๊ณ  ๋ฏธ์„ธ ์กฐ์ • ํ•ด์ƒ๋„๊ฐ€ 224x224์ธ ๊ธฐ๋ณธ ํฌ๊ธฐ ์•„ํ‚คํ…์ฒ˜๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. ๋ชจ๋“  ์ฒดํฌํฌ์ธํŠธ๋Š” hub์—์„œ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ์ฒดํฌํฌ์ธํŠธ๋Š” (1) ImageNet-21k (1,400๋งŒ ๊ฐœ์˜ ์ด๋ฏธ์ง€์™€ 21,000๊ฐœ์˜ ํด๋ž˜์Šค)์—์„œ๋งŒ ์‚ฌ์ „ ํ•™์Šต๋˜์—ˆ๊ฑฐ๋‚˜, ๋˜๋Š” (2) ImageNet (ILSVRC 2012, 130๋งŒ ๊ฐœ์˜ ์ด๋ฏธ์ง€์™€ 1,000๊ฐœ์˜ ํด๋ž˜์Šค)์—์„œ ์ถ”๊ฐ€๋กœ ๋ฏธ์„ธ ์กฐ์ •๋œ ๊ฒฝ์šฐ์ž…๋‹ˆ๋‹ค.
  • Vision Transformer๋Š” 224x224 ํ•ด์ƒ๋„๋กœ ์‚ฌ์ „ ํ•™์Šต๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ๋ฏธ์„ธ ์กฐ์ • ์‹œ, ์‚ฌ์ „ ํ•™์Šต๋ณด๋‹ค ๋” ๋†’์€ ํ•ด์ƒ๋„๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด ์œ ๋ฆฌํ•œ ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค ((Touvron et al., 2019), (Kolesnikovet al., 2020). ๋” ๋†’์€ ํ•ด์ƒ๋„๋กœ ๋ฏธ์„ธ ์กฐ์ •ํ•˜๊ธฐ ์œ„ํ•ด, ์ €์ž๋“ค์€ ์›๋ณธ ์ด๋ฏธ์ง€์—์„œ์˜ ์œ„์น˜์— ๋”ฐ๋ผ ์‚ฌ์ „ ํ•™์Šต๋œ ์œ„์น˜ ์ž„๋ฒ ๋”ฉ์˜ 2D ๋ณด๊ฐ„(interpolation)์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.
  • ์ตœ๊ณ ์˜ ๊ฒฐ๊ณผ๋Š” supervised ๋ฐฉ์‹์˜ ์‚ฌ์ „ ํ•™์Šต์—์„œ ์–ป์–ด์กŒ์œผ๋ฉฐ, ์ด๋Š” NLP์—์„œ๋Š” ํ•ด๋‹น๋˜์ง€ ์•Š๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค. ์ €์ž๋“ค์€ ๋งˆ์Šคํฌ๋œ ํŒจ์น˜ ์˜ˆ์ธก(๋งˆ์Šคํฌ๋œ ์–ธ์–ด ๋ชจ๋ธ๋ง์—์„œ ์˜๊ฐ์„ ๋ฐ›์€ self-supervised ์‚ฌ์ „ ํ•™์Šต ๋ชฉํ‘œ)์„ ์‚ฌ์šฉํ•œ ์‹คํ—˜๋„ ์ˆ˜ํ–‰ํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด ์ ‘๊ทผ ๋ฐฉ์‹์œผ๋กœ ๋” ์ž‘์€ ViT-B/16 ๋ชจ๋ธ์€ ImageNet์—์„œ 79.9%์˜ ์ •ํ™•๋„๋ฅผ ๋‹ฌ์„ฑํ•˜์˜€์œผ๋ฉฐ, ์ด๋Š” ์ฒ˜์Œ๋ถ€ํ„ฐ ํ•™์Šตํ•œ ๊ฒƒ๋ณด๋‹ค 2% ๊ฐœ์„ ๋œ ๊ฒฐ๊ณผ์ด์ง€๋งŒ, ์—ฌ์ „ํžˆ supervised ์‚ฌ์ „ ํ•™์Šต๋ณด๋‹ค 4% ๋‚ฎ์Šต๋‹ˆ๋‹ค.
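The patch arithmetic behind the tips above can be sketched in a few lines; the numbers simply follow from the checkpoint naming scheme (e.g. google/vit-base-patch16-224):

```python
# How an image turns into a token sequence for the Transformer encoder.
# Numbers follow google/vit-base-patch16-224: 224x224 input, 16x16 patches.
def vit_seq_len(image_size: int, patch_size: int) -> int:
    num_patches = (image_size // patch_size) ** 2  # non-overlapping patch grid
    return num_patches + 1                         # +1 for the prepended [CLS] token

print(vit_seq_len(224, 16))  # 197 tokens (196 patches + [CLS])
print(vit_seq_len(384, 16))  # 577 tokens when fine-tuning at a higher resolution
```

This also shows why fine-tuning at a higher resolution requires interpolating the position embeddings: the sequence length grows with the patch grid.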

Scaled Dot Product Attention (SDPA) ์‚ฌ์šฉํ•˜๊ธฐ [[using-scaled-dot-product-attention-sdpa]]

PyTorch๋Š” torch.nn.functional์˜ ์ผ๋ถ€๋กœ์„œ native scaled dot-product attention (SDPA) ์—ฐ์‚ฐ์ž๋ฅผ ํฌํ•จํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ํ•จ์ˆ˜๋Š” ์ž…๋ ฅ ๋ฐ ์‚ฌ์šฉ ์ค‘์ธ ํ•˜๋“œ์›จ์–ด์— ๋”ฐ๋ผ ์—ฌ๋Ÿฌ ๊ตฌํ˜„ ๋ฐฉ์‹์„ ์ ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.์ž์„ธํ•œ ๋‚ด์šฉ์€ ๊ณต์‹ ๋ฌธ์„œ๋‚˜ GPU ์ถ”๋ก  ํŽ˜์ด์ง€๋ฅผ ์ฐธ์กฐํ•˜์‹ญ์‹œ์˜ค.

SDPA๋Š” torch>=2.1.1์—์„œ ๊ตฌํ˜„์ด ๊ฐ€๋Šฅํ•œ ๊ฒฝ์šฐ ๊ธฐ๋ณธ์ ์œผ๋กœ ์‚ฌ์šฉ๋˜์ง€๋งŒ, from_pretrained()์—์„œ attn_implementation="sdpa"๋กœ ์„ค์ •ํ•˜์—ฌ SDPA๋ฅผ ๋ช…์‹œ์ ์œผ๋กœ ์š”์ฒญํ•  ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค.

```python
import torch
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224", attn_implementation="sdpa", torch_dtype=torch.float16)
...
```

์ตœ์ ์˜ ์†๋„ ํ–ฅ์ƒ์„ ์œ„ํ•ด ๋ชจ๋ธ์„ ๋ฐ˜์ •๋ฐ€๋„(์˜ˆ: torch.float16 ๋˜๋Š” torch.bfloat16)๋กœ ๋กœ๋“œํ•˜๋Š” ๊ฒƒ์„ ๊ถŒ์žฅํ•ฉ๋‹ˆ๋‹ค.

๋กœ์ปฌ ๋ฒค์น˜๋งˆํฌ(A100-40GB, PyTorch 2.3.0, OS Ubuntu 22.04)์—์„œ float32์™€ google/vit-base-patch16-224 ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•œ ์ถ”๋ก  ์‹œ, ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์†๋„ ํ–ฅ์ƒ์„ ํ™•์ธํ–ˆ์Šต๋‹ˆ๋‹ค.

| Batch size | Average inference time (ms), eager mode | Average inference time (ms), sdpa model | Speed up, Sdpa / Eager (x) |
|------------|------------------------------------------|------------------------------------------|-----------------------------|
| 1          | 7                                        | 6                                        | 1.17                        |
| 2          | 8                                        | 6                                        | 1.33                        |
| 4          | 8                                        | 6                                        | 1.33                        |
| 8          | 8                                        | 6                                        | 1.33                        |
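The speedup comes from replacing the manual ("eager") attention computation with a fused kernel; both paths compute the same quantity. A minimal numerical sketch, with shapes chosen to mirror ViT-Base (12 heads, 197 tokens, head dimension 64) purely for illustration, not tied to the benchmark figures:

```python
import torch
import torch.nn.functional as F

# Manual softmax(QK^T / sqrt(d)) V versus the fused SDPA operator.
torch.manual_seed(0)
q, k, v = (torch.randn(1, 12, 197, 64) for _ in range(3))

scale = q.shape[-1] ** -0.5
eager = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1) @ v
sdpa = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(eager, sdpa, atol=1e-5))  # True: both paths agree numerically
```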

๋ฆฌ์†Œ์Šค [[resources]]

ViT์˜ ์ถ”๋ก  ๋ฐ ์ปค์Šคํ…€ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ๋ฏธ์„ธ ์กฐ์ •๊ณผ ๊ด€๋ จ๋œ ๋ฐ๋ชจ ๋…ธํŠธ๋ถ์€ ์—ฌ๊ธฐ์—์„œ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. Hugging Face์—์„œ ๊ณต์‹์ ์œผ๋กœ ์ œ๊ณตํ•˜๋Š” ์ž๋ฃŒ์™€ ์ปค๋ฎค๋‹ˆํ‹ฐ(๐ŸŒŽ๋กœ ํ‘œ์‹œ๋œ) ์ž๋ฃŒ ๋ชฉ๋ก์€ ViT๋ฅผ ์‹œ์ž‘ํ•˜๋Š” ๋ฐ ๋„์›€์ด ๋  ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ด ๋ชฉ๋ก์— ํฌํ•จ๋  ์ž๋ฃŒ๋ฅผ ์ œ์ถœํ•˜๊ณ  ์‹ถ๋‹ค๋ฉด Pull Request๋ฅผ ์—ด์–ด ์ฃผ์‹œ๋ฉด ๊ฒ€ํ† ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค. ์ƒˆ๋กœ์šด ๋‚ด์šฉ์„ ์„ค๋ช…ํ•˜๋Š” ์ž๋ฃŒ๊ฐ€ ๊ฐ€์žฅ ์ด์ƒ์ ์ด๋ฉฐ, ๊ธฐ์กด ์ž๋ฃŒ๋ฅผ ์ค‘๋ณตํ•˜์ง€ ์•Š๋„๋ก ํ•ด์ฃผ์‹ญ์‹œ์˜ค.

ViTForImageClassification ์€ ๋‹ค์Œ์—์„œ ์ง€์›๋ฉ๋‹ˆ๋‹ค:

โš—๏ธ ์ตœ์ ํ™”

โšก๏ธ ์ถ”๋ก 

๐Ÿš€ ๋ฐฐํฌ

ViTConfig [[transformers.ViTConfig]]

[[autodoc]] ViTConfig

ViTFeatureExtractor [[transformers.ViTFeatureExtractor]]

[[autodoc]] ViTFeatureExtractor - __call__

ViTImageProcessor [[transformers.ViTImageProcessor]]

[[autodoc]] ViTImageProcessor - preprocess

ViTImageProcessorFast [[transformers.ViTImageProcessorFast]]

[[autodoc]] ViTImageProcessorFast - preprocess

ViTModel [[transformers.ViTModel]]

[[autodoc]] ViTModel - forward

ViTForMaskedImageModeling [[transformers.ViTForMaskedImageModeling]]

[[autodoc]] ViTForMaskedImageModeling - forward

ViTForImageClassification [[transformers.ViTForImageClassification]]

[[autodoc]] ViTForImageClassification - forward

TFViTModel [[transformers.TFViTModel]]

[[autodoc]] TFViTModel - call

TFViTForImageClassification [[transformers.TFViTForImageClassification]]

[[autodoc]] TFViTForImageClassification - call

FlaxViTModel [[transformers.FlaxViTModel]]

[[autodoc]] FlaxViTModel - __call__

FlaxViTForImageClassification [[transformers.FlaxViTForImageClassification]]

[[autodoc]] FlaxViTForImageClassification - __call__