Video Vision Transformer (ViViT) [[video-vision-transformer-vivit]]

Overview [[overview]]

The ViViT model was proposed in the paper ViViT: A Video Vision Transformer by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. The paper introduces one of the first successful families of pure-transformer models for video understanding.

The abstract from the paper is the following:

We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers. In order to handle the long sequences of tokens encountered in video, we propose several efficient variants of our model which factorise the spatial and temporal dimensions of the input. Although transformer-based models are known to only be effective when large training datasets are available, we show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple video classification benchmarks including Kinetics 400 and 600, Epic Kitchens, Something-Something v2 and Moments in Time, outperforming prior methods based on deep 3D convolutional networks.
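To make the "long sequences of tokens" in the abstract concrete, here is a minimal sketch that counts the spatio-temporal tokens a clip produces and roughly compares joint attention with a factorised scheme. The numbers used (32 frames, 224×224 resolution, 2×16×16 tubelets) are assumed for illustration, the helper name `tubelet_grid` is hypothetical, and the pair counts are a back-of-the-envelope estimate rather than a measurement of any particular implementation.

```python
def tubelet_grid(num_frames, image_size, tubelet_size):
    """Return how many non-overlapping tubelets fit along (time, height, width)."""
    t, h, w = tubelet_size
    if num_frames % t or image_size % h or image_size % w:
        raise ValueError("clip dimensions must be divisible by the tubelet size")
    return num_frames // t, image_size // h, image_size // w


# Assumed example values: 32 frames of 224x224 pixels, tubelets of 2x16x16.
n_t, n_h, n_w = tubelet_grid(num_frames=32, image_size=224, tubelet_size=(2, 16, 16))
num_tokens = n_t * n_h * n_w  # one token per tubelet (class token not counted)

# Joint spatio-temporal attention: every token attends to every other token.
joint_pairs = num_tokens ** 2

# Factorised attention (rough estimate): each token attends to the tokens in its
# own frame group (n_h * n_w) and then to the tokens at its own spatial
# location across time (n_t).
factorised_pairs = num_tokens * (n_h * n_w + n_t)

print(f"tokens per clip:        {num_tokens}")
print(f"joint attention pairs:  {joint_pairs}")
print(f"factorised pairs:       {factorised_pairs}")
```

Under these assumptions, the factorised scheme evaluates roughly an order of magnitude fewer query-key pairs, which is the motivation for separating the spatial and temporal dimensions.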

This model was contributed by jegormeister. The original code (written in JAX) can be found here.

VivitConfig [[transformers.VivitConfig]]

[[autodoc]] VivitConfig

VivitImageProcessor [[transformers.VivitImageProcessor]]

[[autodoc]] VivitImageProcessor - preprocess

VivitModel [[transformers.VivitModel]]

[[autodoc]] VivitModel - forward

VivitForVideoClassification [[transformers.VivitForVideoClassification]]

[[autodoc]] transformers.VivitForVideoClassification - forward