<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
# Video classification [[video-classification]]

[[open-in-colab]]

Video classification is the task of assigning a label or class to an entire video. Each video is expected to contain only one class. A video classification model takes a video as input and returns a prediction about which class the video belongs to, so these models can be used to categorize what a video is about. A real-world application of video classification is action or activity recognition, which is useful for fitness apps. Video classification can also assist vision-impaired individuals, especially when they are commuting.

This guide will show you how to:

1. Fine-tune [VideoMAE](https://huggingface.co/docs/transformers/main/en/model_doc/videomae) on a subset of the [UCF101](https://www.crcv.ucf.edu/data/UCF101.php) dataset.
2. Use your fine-tuned model for inference.
<Tip>

The task illustrated in this tutorial is supported by the following model architectures:

<!--This tip is automatically generated by `make fix-copies`, do not fill manually!-->

[TimeSformer](../model_doc/timesformer), [VideoMAE](../model_doc/videomae)

<!--End of the generated tip-->

</Tip>

Before you begin, make sure you have all the necessary libraries installed:
```bash
pip install -q pytorchvideo transformers evaluate
```
You will use [PyTorchVideo](https://pytorchvideo.org/) (referred to as `pytorchvideo` below) to process and prepare the videos.

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:
```py
>>> from huggingface_hub import notebook_login

>>> notebook_login()
```
## Load UCF101 dataset [[load-ufc101-dataset]]

Start by loading a subset of the [UCF-101 dataset](https://www.crcv.ucf.edu/data/UCF101.php). This gives you a chance to experiment and make sure everything works before committing more time to training on the full dataset.
```py
>>> from huggingface_hub import hf_hub_download

>>> hf_dataset_identifier = "sayakpaul/ucf101-subset"
>>> filename = "UCF101_subset.tar.gz"
>>> file_path = hf_hub_download(repo_id=hf_dataset_identifier, filename=filename, repo_type="dataset")
```
After the subset has been downloaded, you need to extract the compressed archive:
```py
>>> import tarfile

>>> with tarfile.open(file_path) as t:
...     t.extractall(".")
```
At a high level, the dataset is organized like so:
```bash
UCF101_subset/
    train/
        BandMarching/
            video_1.mp4
            video_2.mp4
            ...
        Archery/
            video_1.mp4
            video_2.mp4
            ...
        ...
    val/
        BandMarching/
            video_1.mp4
            video_2.mp4
            ...
        Archery/
            video_1.mp4
            video_2.mp4
            ...
        ...
    test/
        BandMarching/
            video_1.mp4
            video_2.mp4
            ...
        Archery/
            video_1.mp4
            video_2.mp4
            ...
        ...
```
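The later snippets refer to the extracted dataset root and to a flat list of all the video paths, so set those up first. A minimal way to do that with `pathlib` (the names `dataset_root_path` and `all_video_file_paths` are what the cells below expect):

```py
>>> import os
>>> import pathlib

>>> dataset_root_path = pathlib.Path("UCF101_subset")

>>> # Collect every video path across the three splits; the label-extraction
>>> # and dataset-creation cells below rely on these paths.
>>> all_video_file_paths = (
...     list(dataset_root_path.glob("train/*/*.avi"))
...     + list(dataset_root_path.glob("val/*/*.avi"))
...     + list(dataset_root_path.glob("test/*/*.avi"))
... )
```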
The (sorted) video paths look like this:
```bash
...
'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g07_c04.avi',
'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g07_c06.avi',
'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi',
'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g09_c02.avi',
'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g09_c06.avi'
...
```
Video clips belonging to the same group / scene share a group identifier, denoted by `g` in the file paths. `v_ApplyEyeMakeup_g07_c04.avi` and `v_ApplyEyeMakeup_g07_c06.avi`, for example, belong to the same group.

For the validation and evaluation splits, you don't want to have video clips from the same group / scene, in order to prevent [data leakage](https://www.kaggle.com/code/alexisbcook/data-leakage). The subset used in this tutorial takes this information into account, as the optional check below illustrates.
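To make the grouping concrete, here is a small, optional check (not part of the original pipeline; it assumes `dataset_root_path` from above) that parses the group token out of each file name and prints any groups shared across splits, which should come back empty for this subset:

```py
>>> def group_of(path):
...     # 'v_ApplyEyeMakeup_g07_c04.avi' -> 'v_ApplyEyeMakeup_g07' (drop the clip id)
...     return pathlib.Path(path).stem.rsplit("_", 1)[0]

>>> groups = {
...     split: {group_of(p) for p in dataset_root_path.glob(f"{split}/*/*.avi")}
...     for split in ("train", "val", "test")
... }
>>> print("train/val overlap:", groups["train"] & groups["val"])
>>> print("train/test overlap:", groups["train"] & groups["test"])
```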
Next, derive the set of labels present in the dataset. Also, create two dictionaries that are helpful when initializing the model:

* `label2id`: maps the class names to integers.
* `id2label`: maps the integers to class names.
```py
>>> class_labels = sorted({str(path).split("/")[2] for path in all_video_file_paths})
>>> label2id = {label: i for i, label in enumerate(class_labels)}
>>> id2label = {i: label for label, i in label2id.items()}

>>> print(f"Unique classes: {list(label2id.keys())}.")

# Unique classes: ['ApplyEyeMakeup', 'ApplyLipstick', 'Archery', 'BabyCrawling', 'BalanceBeam', 'BandMarching', 'BaseballPitch', 'Basketball', 'BasketballDunk', 'BenchPress'].
```
There are 10 unique classes in the dataset. For each class, there are 30 videos in the training set.
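If you'd like to verify those counts yourself, a quick tally over the training split (an optional check; it reuses `dataset_root_path` from above) could look like this:

```py
>>> from collections import Counter

>>> # path.parts[2] is the class folder name, e.g. 'ApplyEyeMakeup'
>>> train_counts = Counter(path.parts[2] for path in dataset_root_path.glob("train/*/*.avi"))
>>> print(train_counts)  # expect each of the 10 classes to appear 30 times
```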
## Load a model to fine-tune [[load-a-model-to-fine-tune]]

Instantiate a video classification model from a pretrained checkpoint and its associated image processor. The model's encoder comes with pretrained parameters, while the classification head (the final layer that classifies the data) is randomly initialized. The image processor comes in handy when writing the preprocessing pipeline for the dataset.
```py
>>> from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

>>> model_ckpt = "MCG-NJU/videomae-base"
>>> image_processor = VideoMAEImageProcessor.from_pretrained(model_ckpt)
>>> model = VideoMAEForVideoClassification.from_pretrained(
...     model_ckpt,
...     label2id=label2id,
...     id2label=id2label,
...     ignore_mismatched_sizes=True,  # provide this in case you're planning to fine-tune an already fine-tuned checkpoint
... )
```
While the model is loading, you might notice the following warning:
```bash
Some weights of the model checkpoint at MCG-NJU/videomae-base were not used when initializing VideoMAEForVideoClassification: [..., 'decoder.decoder_layers.1.attention.output.dense.bias', 'decoder.decoder_layers.2.attention.attention.key.weight']
- This IS expected if you are initializing VideoMAEForVideoClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing VideoMAEForVideoClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of VideoMAEForVideoClassification were not initialized from the model checkpoint at MCG-NJU/videomae-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
```
The warning is telling us that we are throwing away some weights (e.g. the weights and bias of the `classifier` layer) and randomly initializing some others (the weights and bias of a new `classifier` layer). This is expected in this case, because we are adding a new head for which we don't have pretrained weights, so the library warns us that we should fine-tune this model before using it for inference, which is exactly what we are going to do.

**Note** that [this checkpoint](https://huggingface.co/MCG-NJU/videomae-base-finetuned-kinetics) leads to better performance on this task, since it was obtained by fine-tuning on a similar downstream task with considerable domain overlap. You can check out [this checkpoint](https://huggingface.co/sayakpaul/videomae-base-finetuned-kinetics-finetuned-ucf101-subset), which was obtained by fine-tuning `MCG-NJU/videomae-base-finetuned-kinetics`.
## Prepare the datasets for training [[prepare-the-datasets-for-training]]

For preprocessing the videos, you will leverage the [PyTorchVideo library](https://pytorchvideo.org/). Start by importing the dependencies you need.
```py
>>> import pytorchvideo.data

>>> from pytorchvideo.transforms import (
...     ApplyTransformToKey,
...     Normalize,
...     RandomShortSideScale,
...     RemoveKey,
...     ShortSideScale,
...     UniformTemporalSubsample,
... )

>>> from torchvision.transforms import (
...     Compose,
...     Lambda,
...     RandomCrop,
...     RandomHorizontalFlip,
...     Resize,
... )
```
For the training dataset transformations, use a combination of uniform temporal subsampling, pixel normalization, random cropping, and random horizontal flipping. For the validation and evaluation dataset transformations, keep the same transformation chain except for the random cropping and horizontal flipping. To learn more about the details of these transformations, check out the [official PyTorchVideo documentation](https://pytorchvideo.org).

Use the image processor associated with the pretrained model to obtain the following information:

* The image mean and standard deviation with which the video frame pixels will be normalized.
* The spatial resolution to which the video frames will be resized.

Start by defining some constants.
```py
>>> mean = image_processor.image_mean
>>> std = image_processor.image_std
>>> if "shortest_edge" in image_processor.size:
...     height = width = image_processor.size["shortest_edge"]
... else:
...     height = image_processor.size["height"]
...     width = image_processor.size["width"]
>>> resize_to = (height, width)

>>> num_frames_to_sample = model.config.num_frames
>>> sample_rate = 4
>>> fps = 30
>>> clip_duration = num_frames_to_sample * sample_rate / fps
```
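To put numbers on this: for the `MCG-NJU/videomae-base` checkpoint, `model.config.num_frames` is 16, so each sampled clip spans 16 * 4 / 30 ≈ 2.13 seconds of the source video.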
Now define the dataset-specific transformations and the datasets themselves, starting with the training set:
```py
>>> train_transform = Compose(
...     [
...         ApplyTransformToKey(
...             key="video",
...             transform=Compose(
...                 [
...                     UniformTemporalSubsample(num_frames_to_sample),
...                     Lambda(lambda x: x / 255.0),
...                     Normalize(mean, std),
...                     RandomShortSideScale(min_size=256, max_size=320),
...                     RandomCrop(resize_to),
...                     RandomHorizontalFlip(p=0.5),
...                 ]
...             ),
...         ),
...     ]
... )

>>> train_dataset = pytorchvideo.data.Ucf101(
...     data_path=os.path.join(dataset_root_path, "train"),
...     clip_sampler=pytorchvideo.data.make_clip_sampler("random", clip_duration),
...     decode_audio=False,
...     transform=train_transform,
... )
```
The same workflow can be applied to the validation and evaluation sets:
```py
>>> val_transform = Compose(
...     [
...         ApplyTransformToKey(
...             key="video",
...             transform=Compose(
...                 [
...                     UniformTemporalSubsample(num_frames_to_sample),
...                     Lambda(lambda x: x / 255.0),
...                     Normalize(mean, std),
...                     Resize(resize_to),
...                 ]
...             ),
...         ),
...     ]
... )

>>> val_dataset = pytorchvideo.data.Ucf101(
...     data_path=os.path.join(dataset_root_path, "val"),
...     clip_sampler=pytorchvideo.data.make_clip_sampler("uniform", clip_duration),
...     decode_audio=False,
...     transform=val_transform,
... )

>>> test_dataset = pytorchvideo.data.Ucf101(
...     data_path=os.path.join(dataset_root_path, "test"),
...     clip_sampler=pytorchvideo.data.make_clip_sampler("uniform", clip_duration),
...     decode_audio=False,
...     transform=val_transform,
... )
```
**Note**: The above dataset pipelines are taken from the [official PyTorchVideo example](https://pytorchvideo.org/docs/tutorial_classification#dataset). We're using the [`pytorchvideo.data.Ucf101()`](https://pytorchvideo.readthedocs.io/en/latest/api/data/data.html#pytorchvideo.data.Ucf101) function because it's tailored for the UCF-101 dataset. Under the hood, it returns a [`pytorchvideo.data.labeled_video_dataset.LabeledVideoDataset`](https://pytorchvideo.readthedocs.io/en/latest/api/data/data.html#pytorchvideo.data.LabeledVideoDataset) object. The `LabeledVideoDataset` class is the base class for all things video in the PyTorchVideo dataset. So, if you want to use a custom dataset not supported off-the-shelf by PyTorchVideo, you can extend the `LabeledVideoDataset` class accordingly; a sketch follows below. Refer to the `data` API [documentation](https://pytorchvideo.readthedocs.io/en/latest/api/data/data.html) to learn more. Also, if your dataset follows a similar structure (as shown above), then using `pytorchvideo.data.Ucf101()` should work just fine.
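For illustration, here is a minimal, hypothetical sketch of building a `LabeledVideoDataset` directly from `(video_path, annotation_dict)` pairs, which is roughly what `pytorchvideo.data.Ucf101()` assembles internally; treat the exact constructor arguments as an assumption to check against your installed pytorchvideo version:

```py
>>> from pytorchvideo.data import LabeledVideoDataset, make_clip_sampler

>>> # Pair every training video with its integer label; the extra dict keys
>>> # are merged into each example the dataset yields.
>>> labeled_video_paths = [
...     (str(path), {"label": label2id[path.parts[2]]})
...     for path in dataset_root_path.glob("train/*/*.avi")
... ]

>>> custom_train_dataset = LabeledVideoDataset(
...     labeled_video_paths=labeled_video_paths,
...     clip_sampler=make_clip_sampler("random", clip_duration),
...     decode_audio=False,
...     transform=train_transform,
... )
```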
You can access the `num_videos` argument to know the number of videos in each dataset.
```py
>>> print(train_dataset.num_videos, val_dataset.num_videos, test_dataset.num_videos)
# (300, 30, 75)
```
## Visualize the preprocessed video for better debugging [[visualize-the-preprocessed-video-for-better-debugging]]
```py
>>> import imageio
>>> import numpy as np
>>> from IPython.display import Image


>>> def unnormalize_img(img):
...     """Un-normalizes the image pixels."""
...     img = (img * std) + mean
...     img = (img * 255).astype("uint8")
...     return img.clip(0, 255)


>>> def create_gif(video_tensor, filename="sample.gif"):
...     """Prepares a GIF from a video tensor.
...
...     The video tensor is expected to have the following shape:
...     (num_frames, num_channels, height, width).
...     """
...     frames = []
...     for video_frame in video_tensor:
...         frame_unnormalized = unnormalize_img(video_frame.permute(1, 2, 0).numpy())
...         frames.append(frame_unnormalized)
...     kargs = {"duration": 0.25}
...     imageio.mimsave(filename, frames, "GIF", **kargs)
...     return filename


>>> def display_gif(video_tensor, gif_name="sample.gif"):
...     """Prepares and displays a GIF from a video tensor."""
...     video_tensor = video_tensor.permute(1, 0, 2, 3)
...     gif_filename = create_gif(video_tensor, gif_name)
...     return Image(filename=gif_filename)


>>> sample_video = next(iter(train_dataset))
>>> video_tensor = sample_video["video"]
>>> display_gif(video_tensor)
```
<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/sample_gif.gif" alt="Person playing basketball"/>
</div>
## Train the model [[train-the-model]]

Leverage [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer) from 🤗 Transformers for training the model. To instantiate a `Trainer`, you need to define the training configuration and an evaluation metric. The most important one is [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), a class that contains all the attributes to configure the training. It requires an output folder name, which is used to save the checkpoints of the model, and it also helps sync all the information in the model repository on the 🤗 Hub.

Most of the training arguments are self-explanatory, but one that is quite important here is `remove_unused_columns=False`. This one drops any feature columns not used by the model's call function. By default it's `True`, because usually it's ideal to drop unused feature columns, making it easier to unpack inputs into the model's call function. But in this case, you need the unused features (`'video'` in particular) in order to create `pixel_values`, which is a mandatory key the model expects in its inputs. So set `remove_unused_columns` to `False`.
```py
>>> from transformers import TrainingArguments, Trainer

>>> model_name = model_ckpt.split("/")[-1]
>>> new_model_name = f"{model_name}-finetuned-ucf101-subset"
>>> num_epochs = 4
>>> batch_size = 8  # batch size for training and evaluation

>>> args = TrainingArguments(
...     new_model_name,
...     remove_unused_columns=False,
...     evaluation_strategy="epoch",
...     save_strategy="epoch",
...     learning_rate=5e-5,
...     per_device_train_batch_size=batch_size,
...     per_device_eval_batch_size=batch_size,
...     warmup_ratio=0.1,
...     logging_steps=10,
...     load_best_model_at_end=True,
...     metric_for_best_model="accuracy",
...     push_to_hub=True,
...     max_steps=(train_dataset.num_videos // batch_size) * num_epochs,
... )
```
The dataset returned by `pytorchvideo.data.Ucf101()` doesn't implement the `__len__` method, so you must define `max_steps` when instantiating `TrainingArguments`.
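Plugging in the numbers from this subset (300 training videos, a batch size of 8, and 4 epochs), `max_steps` works out to (300 // 8) * 4 = 148 steps.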
Next, load the metric and define a function to compute the metrics from the predictions. The only preprocessing you have to do is to take the argmax of the predicted logits:
```py
import evaluate

metric = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)
```
**A note on evaluation**:

In the [VideoMAE paper](https://arxiv.org/abs/2203.12602), the authors use the following evaluation strategy: they evaluate the model on several clips from the test videos, apply different crops to those clips, and report the aggregate score. However, in the interest of simplicity and brevity, we don't consider that strategy in this tutorial.

Also, define a `collate_fn`, which is used to batch examples together. Each batch consists of 2 keys, namely `pixel_values` and `labels`.
```py
>>> import torch


>>> def collate_fn(examples):
...     # permute to (num_frames, num_channels, height, width)
...     pixel_values = torch.stack(
...         [example["video"].permute(1, 0, 2, 3) for example in examples]
...     )
...     labels = torch.tensor([example["label"] for example in examples])
...     return {"pixel_values": pixel_values, "labels": labels}
```
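As a quick optional sanity check, you can run the sample drawn earlier through the collator and inspect the batch; with the defaults above, `pixel_values` should come out as `(batch, num_frames, channels, height, width)`:

```py
>>> batch = collate_fn([sample_video])
>>> print(batch["pixel_values"].shape)  # e.g. torch.Size([1, 16, 3, 224, 224])
>>> print(batch["labels"])
```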
Then you just pass all of this, along with the datasets, to `Trainer`:
```py
>>> trainer = Trainer(
...     model,
...     args,
...     train_dataset=train_dataset,
...     eval_dataset=val_dataset,
...     tokenizer=image_processor,
...     compute_metrics=compute_metrics,
...     data_collator=collate_fn,
... )
```
You might wonder why you pass the `image_processor` as a tokenizer when you preprocessed the data already. This is only to make sure the image processor configuration file (stored as JSON) will also be uploaded to the repo on the Hub.

Now fine-tune the model by calling the `train` method:
```py
>>> train_results = trainer.train()
```
Once training is completed, share your model to the Hub with the [`~transformers.Trainer.push_to_hub`] method so everyone can use your model:
```py
>>> trainer.push_to_hub()
```
## Inference [[inference]]

Great, now that you have a fine-tuned model, you can use it for inference!

Load a video for inference:
```py
>>> sample_test_video = next(iter(test_dataset))
```
<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/sample_gif_two.gif" alt="Teams playing basketball"/>
</div>

The simplest way to try out your fine-tuned model for inference is to use it in a [`pipeline`](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.VideoClassificationPipeline). Instantiate a `pipeline` for video classification with your model, and pass your video to it:
```py
>>> from transformers import pipeline

>>> video_cls = pipeline(model="my_awesome_video_cls_model")
>>> video_cls("https://huggingface.co/datasets/sayakpaul/ucf101-subset/resolve/main/v_BasketballDunk_g14_c06.avi")
[{'score': 0.9272987842559814, 'label': 'BasketballDunk'},
 {'score': 0.017777055501937866, 'label': 'BabyCrawling'},
 {'score': 0.01663011871278286, 'label': 'BalanceBeam'},
 {'score': 0.009560945443809032, 'label': 'BandMarching'},
 {'score': 0.0068979403004050255, 'label': 'BaseballPitch'}]
```
You can also manually replicate the results of the `pipeline` if you'd like:
```py
>>> def run_inference(model, video):
...     # (num_frames, num_channels, height, width)
...     permuted_sample_test_video = video.permute(1, 0, 2, 3)
...     inputs = {
...         "pixel_values": permuted_sample_test_video.unsqueeze(0),
...         "labels": torch.tensor(
...             [sample_test_video["label"]]
...         ),  # this can be skipped if you don't have labels available.
...     }
...     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
...     inputs = {k: v.to(device) for k, v in inputs.items()}
...     model = model.to(device)
...     # forward pass
...     with torch.no_grad():
...         outputs = model(**inputs)
...         logits = outputs.logits
...     return logits
```
Now, pass your input to the model and return the `logits`:
```py
>>> logits = run_inference(trainer.model, sample_test_video["video"])
```
Decoding the `logits`, we get:
```py
>>> predicted_class_idx = logits.argmax(-1).item()
>>> print("Predicted class:", model.config.id2label[predicted_class_idx])
# Predicted class: BasketballDunk
```
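If you also want probability scores comparable to the `pipeline` output above, you can softmax the logits; a small optional extension:

```py
>>> # Convert raw logits to per-class probabilities and show the top 5.
>>> probs = logits.softmax(-1)[0]
>>> top5 = probs.topk(5)
>>> for score, idx in zip(top5.values, top5.indices):
...     print(f"{model.config.id2label[idx.item()]}: {score.item():.4f}")
```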