| <meta charset="utf-8" /><meta name="hf:doc:metadata" content="{"title":"CNN Based Video Models","local":"cnn-based-video-models","sections":[{"title":"General Trend:","local":"general-trend","sections":[],"depth":3}],"depth":1}"> | |
| <link href="/docs/computer-vision-course/pr_397/en/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/entry/start.7f209408.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/scheduler.7bc62968.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/singletons.b15acae1.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/paths.11cdc4b4.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/entry/app.32e8338e.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/index.2f8492b0.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/nodes/0.e37092e8.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/nodes/74.87793a49.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/index.514d62da.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{"title":"CNN Based Video Models","local":"cnn-based-video-models","sections":[{"title":"General Trend:","local":"general-trend","sections":[],"depth":3}],"depth":1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h1 class="relative group"><a id="cnn-based-video-models" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#cnn-based-video-models"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>CNN Based Video Models</span></h1> <h3 class="relative group"><a id="general-trend" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#general-trend"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 
1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>General Trend:</span></h3> <p data-svelte-h="svelte-1rkwn6u">The success of Deep Learning, particularly CNNs trained on massive datasets like ImageNet, revolutionized image recognition. This trend continues in video processing. However, video data adds a dimension that static images lack: time. This seemingly simple change introduces a new set of challenges that CNNs trained on static images were not built to handle.</p> <h1 class="relative group"><a id="previous-sota-models-in-video-processing" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#previous-sota-models-in-video-processing"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Previous 
SOTA Models in Video Processing</span></h1> <h2 class="relative group"><a id="two-stream-network2014" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#two-stream-network2014"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Two-Stream Network(2014)</span></h2> <div class="flex justify-center" data-svelte-h="svelte-mgjpl6"><img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/previous sota models/SOTA Models Two-Stream architecture for video classification.png" alt="Two-Stream architecture for video classification"></div> <p data-svelte-h="svelte-1ad76us">This paper extended Deep Convolutional Networks(ConvNets) to perform action-recognition in video data.</p> <p data-svelte-h="svelte-dprwzq">The proposed architecture is called Two-Stream Network. 
It uses two separate pathways within a neural network:</p> <ul data-svelte-h="svelte-1o607gt"><li><strong>Spatial Stream:</strong> A standard 2D CNN processes individual frames to capture appearance information.</li> <li><strong>Temporal Stream:</strong> A second 2D CNN processes stacked optical flow fields computed between consecutive frames to capture motion information.</li> <li><strong>Fusion:</strong> The outputs of the two streams are combined so that the final prediction uses both appearance and motion cues for tasks like action recognition.</li></ul> <h2 class="relative group"><a id="3d-resnets2017" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#3d-resnets2017"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>3D ResNets(2017)</span></h2> <div class="flex justify-center" data-svelte-h="svelte-un2g90"><img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/previous sota models/SOTA Models Residual block. Shortcut connections bypass a signal from the top of the block to the tail. Signals are summed at the tail..png" alt="Residual block. Shortcut connections bypass a signal from the top of the block to the tail. 
Signals are summed at the tail."></div> <p data-svelte-h="svelte-o58140">Standard 3D CNNs extend 2D convolution to capture spatial and temporal information simultaneously, using 3D kernels (two spatial dimensions plus time). A drawback is that the large number of parameters makes training more computationally intensive, and hence slower, than for the 2D version. As a result, 3D ConvNets typically have fewer layers than deep 2D CNN architectures.</p> <p data-svelte-h="svelte-okfwgr">In this paper, the authors applied the ResNet architecture to 3D CNNs. This approach enables deeper 3D CNNs and achieves higher accuracy.</p> <p data-svelte-h="svelte-1eszh10">Experiments showed that 3D ResNets (especially deeper ones such as ResNet-34) outperform models like <a href="https://arxiv.org/abs/1412.0767" rel="nofollow">C3D</a>, particularly on larger datasets. Pretrained models like Sports-1M C3D can help mitigate overfitting on smaller datasets. 
Overall, 3D ResNets effectively leverage deeper architectures to capture complex spatiotemporal patterns in the video data.</p> <table data-svelte-h="svelte-vd3gi2"><thead><tr><th>Method</th> <th colspan="3">Validation set</th> <th colspan="3">Testing set</th></tr></thead> <tbody><tr><td></td> <td>Top-1</td> <td>Top-5</td> <td>Average</td> <td>Top-1</td> <td>Top-5</td> <td>Average</td></tr> <tr><td>3D ResNet-34</td> <td>58.0</td> <td>81.3</td> <td><strong>69.7</strong></td> <td>-</td> <td>-</td> <td><strong>68.9</strong></td></tr> <tr><td>C3D*</td> <td>55.6</td> <td>79.1</td> <td>67.4</td> <td>56.1</td> <td>79.5</td> <td>67.8</td></tr> <tr><td>C3D w/ BN</td> <td>56.1</td> <td>79.5</td> <td>67.8</td> <td>-</td> <td>-</td> <td>-</td></tr> <tr><td>RGB-I3D w/o ImageNet</td> <td>-</td> <td>-</td> <td>-</td> <td>68.4</td> <td>88.0</td> <td><strong>78.2</strong></td></tr></tbody></table> <h2 class="relative group"><a id="21d-resnets2017" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#21d-resnets2017"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>(2+1)D ResNets(2017)</span></h2> <div class="flex justify-center" 
data-svelte-h="svelte-3jw7no"><img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/previous sota models/SOTA Models 3D vs (2+1)D convolution..png" alt="3D vs (2+1)D convolution."></div> <p data-svelte-h="svelte-1ykrspd">(2+1)D ResNets are inspired by 3D ResNets. The key difference lies in how the layers are structured: each 3D convolution is replaced by a 2D convolution followed by a 1D convolution:</p> <ul data-svelte-h="svelte-1641fjx"><li>The 2D convolution captures the spatial features within a frame.</li> <li>The 1D convolution captures the motion information across consecutive frames.</li></ul> <p data-svelte-h="svelte-zsre7z">This model can learn spatiotemporal features directly from video data, potentially leading to better performance in video analysis tasks like action recognition.</p> <ul data-svelte-h="svelte-1xocyot"><li>Benefits:<ul><li>The additional nonlinear rectification (ReLU) between the two operations doubles the number of nonlinearities compared with a network that uses full 3D convolutions with the same number of parameters, allowing the model to represent more complex functions.</li> <li>The decomposition facilitates optimization, yielding lower training and test loss in practice.</li></ul></li></ul> <table data-svelte-h="svelte-h1jeox"><thead><tr><th>Method</th> <th>Clip@1 Accuracy</th> <th>Video@1 Accuracy</th> <th>Video@5 Accuracy</th></tr></thead> <tbody><tr><td>DeepVideo</td> <td>41.9</td> <td>60.9</td> <td>80.2</td></tr> <tr><td>C3D</td> <td>46.1</td> <td>61.1</td> <td>85.2</td></tr> <tr><td>2D ResNet-152</td> <td>46.5</td> <td>64.6</td> <td>86.4</td></tr> <tr><td>Conv pooling</td> <td>-</td> <td>71.7</td> <td>90.4</td></tr> <tr><td>P3D</td> <td>47.9</td> <td>66.4</td> <td>87.4</td></tr> <tr><td>R3D-RGB-8frame</td> <td>53.8</td> <td>-</td> <td>-</td></tr> <tr><td>R(2+1)D-RGB-8frame</td> <td>56.1</td> <td>72.0</td> <td>91.2</td></tr> <tr><td>R(2+1)D-Flow-8frame</td> <td>44.5</td> 
<td>65.5</td> <td>87.2</td></tr> <tr><td>R(2+1)D-Two-Stream-8frame</td> <td>-</td> <td>72.2</td> <td>91.4</td></tr> <tr><td>R(2+1)D-RGB-32frame</td> <td><strong>57.0</strong></td> <td><strong>73.0</strong></td> <td><strong>91.5</strong></td></tr> <tr><td>R(2+1)D-Flow-32frame</td> <td>46.4</td> <td>68.4</td> <td>88.7</td></tr> <tr><td>R(2+1)D-Two-Stream-32frame</td> <td>-</td> <td><strong>73.3</strong></td> <td><strong>91.9</strong></td></tr></tbody></table> <h1 class="relative group"><a id="current-research" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#current-research"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Current Research</span></h1> <p data-svelte-h="svelte-27hz17">Currently, researchers are exploring deeper 3D CNN architectures. Another promising approach is combining 3D CNNs with other techniques like attention mechanisms. Alongside that, there is a push for developing larger video datasets like <a href="https://github.com/google-deepmind/kinetics-i3d" rel="nofollow">Kinetics</a>. | |
| The Kinetics dataset is a large-scale, high-quality video dataset commonly used for human action recognition research. It contains hundreds of thousands of video clips that cover a wide range of human activities.</p> <h3 class="relative group"><a id="self-supervised-learning-moco-momentum-contrast" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#self-supervised-learning-moco-momentum-contrast"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 
0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Self-Supervised Learning: MoCo (Momentum Contrast)</span></h3> <div class="flex justify-center" data-svelte-h="svelte-1610om4"><img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/unit7 CNN based model/Self-Supervised Learning_MoCo.png" alt="Self-Supervised Learning: MoCo (Momentum Contrast)"></div> <p data-svelte-h="svelte-9pdvmw"><strong>Overview</strong></p> <p data-svelte-h="svelte-ddqe35"><a href="https://arxiv.org/abs/1911.05722" rel="nofollow">MoCo</a> is a prominent method in the self-supervised learning domain, using a contrastive learning approach to extract features from unlabeled video clips. By pairing a queue of encoded samples with a momentum-updated encoder, it can learn effectively from large-scale video datasets, making it well suited to tasks such as action recognition and event detection.</p> <p data-svelte-h="svelte-1i4nuav"><strong>Key Features</strong></p> <ul data-svelte-h="svelte-1jdd9qh"><li><strong>Momentum Encoder</strong>: Uses a momentum-updated encoder to keep representations consistent over training, which improves training stability.</li> <li><strong>Dynamic Dictionary</strong>: Employs a queue-based dictionary that provides a large and consistent set of negative samples for contrastive learning.</li> <li><strong>Contrastive Loss Function</strong>: Uses a contrastive loss to learn invariant features by comparing positive and negative pairs.</li></ul> <h3 class="relative group"><a id="efficient-video-models-x3d-expanded-3d-networks" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" 
href="#efficient-video-models-x3d-expanded-3d-networks"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Efficient Video Models: X3D (Expanded 3D Networks)</span></h3> <div class="flex justify-center" data-svelte-h="svelte-u161mf"><img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/unit7 CNN based model/Efficient Video Models X3D (Expanded 3D Networks).png" alt="Efficient Video Models: X3D (Expanded 3D Networks)"></div> <p data-svelte-h="svelte-9pdvmw"><strong>Overview</strong></p> <p data-svelte-h="svelte-1sworc4"><a href="https://arxiv.org/abs/2004.04730" rel="nofollow">X3D</a> is a lightweight 3D ConvNet model designed for video recognition tasks. It builds on the concept of 3D CNNs but optimizes for fewer parameters and lower computational cost while maintaining high performance. 
This makes it suitable for real-time video analysis and deployment on mobile or edge devices.</p> <p data-svelte-h="svelte-1i4nuav"><strong>Key Features</strong></p> <ul data-svelte-h="svelte-4p9xmb"><li><strong>Efficiency</strong>: Achieves high accuracy with significantly fewer parameters and reduced computational cost.</li> <li><strong>Progressive Expansion</strong>: Utilizes a systematic approach to expand network dimensions (e.g., depth, width) for optimal performance.</li> <li><strong>Deployment-Friendly</strong>: Designed for easy deployment on devices with limited computational resources.</li></ul> <h3 class="relative group"><a id="real-time-video-processing-st-gcn-spatial-temporal-graph-convolutional-networks" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#real-time-video-processing-st-gcn-spatial-temporal-graph-convolutional-networks"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Real-time Video Processing: ST-GCN (Spatial-Temporal Graph Convolutional Networks)</span></h3> <div class="flex justify-center" data-svelte-h="svelte-u161mf"><img 
src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/unit7 CNN based model/Efficient Video Models X3D (Expanded 3D Networks).png" alt="3D vs (2+1)D convolution."></div> <p data-svelte-h="svelte-9pdvmw"><strong>Overview</strong></p> <p data-svelte-h="svelte-1rg33dn"><a href="https://arxiv.org/abs/1801.07455" rel="nofollow">ST-GCN</a> is a model tailored for real-time action recognition, particularly in analyzing human movements in video sequences. It models spatio-temporal data using a graph structure, effectively capturing human joint positions and movements. This model is widely used in applications like surveillance and sports analysis for real-time action detection.</p> <p data-svelte-h="svelte-1i4nuav"><strong>Key Features</strong></p> <ul data-svelte-h="svelte-spi2cv"><li><strong>Graph-Based Modeling</strong>: Represents human skeletal data as graphs, allowing for natural modeling of joint connections.</li> <li><strong>Spatio-Temporal Convolutions</strong>: Integrates spatial and temporal graph convolutions to capture dynamic movement patterns.</li> <li><strong>Real-Time Performance</strong>: Optimized for fast computation, making it suitable for real-time applications.</li></ul> <p data-svelte-h="svelte-3m389y">These cutting-edge models are playing a crucial role in advancing video processing, excelling in areas such as video classification, action recognition, and real-time processing.</p> <h1 class="relative group"><a id="conclusion" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#conclusion"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 
8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Conclusion</span></h1> <p data-svelte-h="svelte-123m8qy">The evolution of video analysis models has been fascinating to witness. These models were heavily influenced by other SOTA models: for example, the Two-Stream Network was motivated by ConvNets, and (2+1)D ResNets were inspired by 3D ResNets. As research progresses, we can expect even more advanced architectures and techniques to emerge.</p> <p></p> | |
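<p>As a supplement to the (2+1)D ResNets section, the claim that the decomposition can keep "the same number of parameters" as a full 3D convolution can be checked with simple arithmetic. The sketch below is illustrative only: the function names and the example layer sizes are hypothetical, bias terms are ignored, and the intermediate channel count is obtained simply by equating the two weight counts and rounding down.</p>

```python
# Hypothetical helper names; a back-of-the-envelope sketch, not code from any paper or library.
# A full 3D convolution with a t x d x d kernel is compared with a (2+1)D block made of
# a 1 x d x d spatial convolution into m channels followed by a t x 1 x 1 temporal convolution.

def params_3d(n_in, n_out, t, d):
    """Weight count of a full t x d x d 3D convolution (bias ignored)."""
    return n_in * n_out * t * d * d


def matched_mid_channels(n_in, n_out, t, d):
    """Intermediate channels m solving d*d*n_in*m + t*m*n_out = t*d*d*n_in*n_out, rounded down."""
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)


def params_2plus1d(n_in, n_out, t, d):
    """Weight count of the spatial (2D) plus temporal (1D) pair (bias ignored)."""
    m = matched_mid_channels(n_in, n_out, t, d)
    spatial = n_in * m * d * d   # 1 x d x d convolution: n_in -> m channels
    temporal = m * n_out * t     # t x 1 x 1 convolution: m -> n_out channels
    return spatial + temporal


if __name__ == "__main__":
    # Assumed example layer: 64 -> 64 channels with a 3 x 3 x 3 kernel.
    print("3D conv weights:     ", params_3d(64, 64, 3, 3))
    print("intermediate channels:", matched_mid_channels(64, 64, 3, 3))
    print("(2+1)D block weights: ", params_2plus1d(64, 64, 3, 3))
```

<p>Because the intermediate width is rounded down, the (2+1)D block never exceeds the 3D weight budget, while a ReLU can still be placed between its two convolutions.</p>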
| <script> | |
| { | |
| __sveltekit_1p6gie1 = { | |
| assets: "/docs/computer-vision-course/pr_397/en", | |
| base: "/docs/computer-vision-course/pr_397/en", | |
| env: {} | |
| }; | |
| const element = document.currentScript.parentElement; | |
| const data = [null,null]; | |
| Promise.all([ | |
| import("/docs/computer-vision-course/pr_397/en/_app/immutable/entry/start.7f209408.js"), | |
| import("/docs/computer-vision-course/pr_397/en/_app/immutable/entry/app.32e8338e.js") | |
| ]).then(([kit, app]) => { | |
| kit.start(app, element, { | |
| node_ids: [0, 74], | |
| data, | |
| form: null, | |
| error: null | |
| }); | |
| }); | |
| } | |
| </script> | |