| <meta charset="utf-8" /><meta name="hf:doc:metadata" content="{"title":"Introduction","local":"introduction","sections":[{"title":"Definition","local":"definition","sections":[],"depth":2},{"title":"Evaluation of generative models in computer vision","local":"evaluation-of-generative-models-in-computer-vision","sections":[],"depth":2}],"depth":1}"> | |
| <link href="/docs/computer-vision-course/pr_397/en/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/entry/start.7f209408.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/scheduler.7bc62968.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/singletons.b15acae1.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/paths.11cdc4b4.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/entry/app.32e8338e.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/index.2f8492b0.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/nodes/0.e37092e8.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/nodes/67.db589436.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/index.514d62da.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{"title":"Introduction","local":"introduction","sections":[{"title":"Definition","local":"definition","sections":[],"depth":2},{"title":"Evaluation of generative models in computer vision","local":"evaluation-of-generative-models-in-computer-vision","sections":[],"depth":2}],"depth":1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h1 class="relative group"><a id="introduction" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#introduction"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Introduction</span></h1> <p data-svelte-h="svelte-5t1j91">In the last unit we have learned about multimodality and especially about how to fuse vision and language models to harness the best of the two worlds and outperform simple vision models in tasks like Zero-Shot Image Classification. | |
| Another area where multimodal models have had a significant impact is generative vision models. In this unit, we will take a deeper look at these types of neural networks.</p> <h2 class="relative group"><a id="definition" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#definition"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Definition</span></h2> <p data-svelte-h="svelte-a9gcyb">What are generative vision models and how do they differ from other models?</p> <p data-svelte-h="svelte-11rkjc">Mathematical models can generally be separated into two large families: generative models and discriminative models. | |
| The main difference between the two is that discriminative models learn the boundaries that separate different classes, while generative models learn the distribution of the data within each class.</p> <p data-svelte-h="svelte-12t5as0">Discriminative models can be applied to standard computer vision tasks such as classification and regression. | |
| These tasks can be extended to more complex ones such as semantic segmentation or object detection.</p> <p data-svelte-h="svelte-11051fc">For brevity, this chapter only considers generative models that solve the following tasks:</p> <ul data-svelte-h="svelte-1g76a1s"><li>noise to image (DCGAN)</li> <li>text to image (diffusion models)</li> <li>image to image (StyleGAN, CycleGAN, diffusion models)</li></ul> <p data-svelte-h="svelte-wz5sfs">This section will cover two kinds of generative models: GAN-based models and diffusion-based models.</p> <h2 class="relative group"><a id="evaluation-of-generative-models-in-computer-vision" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#evaluation-of-generative-models-in-computer-vision"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Evaluation of generative models in computer vision</span></h2> <p data-svelte-h="svelte-j77g8n">It is generally hard to come up with meaningful metrics for evaluating generative models, because there is often no solid “ground truth” and the quality of an image is difficult to quantify. 
FID is the most commonly used metric, but it is not perfect.</p> <p data-svelte-h="svelte-1uidrbk">Let’s quickly go over FID. FID stands for Fréchet Inception Distance. It is an improvement on the Inception Score and was introduced in <a href="https://arxiv.org/pdf/1706.08500.pdf" rel="nofollow">GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium</a>. FID is considered to be robust to noise and certain artifacts that can be present in generated images. The lower the FID, the better.</p> <p data-svelte-h="svelte-1e0dguq">FID is calculated by fitting two Gaussian distributions to Inception-v3 features: the first to the features of the training data, and the second to the features of the generated images. The Fréchet distance between these two distributions is the FID score. The lower this score, the better the perceived quality of the generated images. Here is a <a href="https://www.youtube.com/watch?v=9zTwSzXxNDo&t=398s" rel="nofollow">short explanation</a> of FID.</p> <p data-svelte-h="svelte-xc2bb6">Some other metrics you might come across are SSIM, PSNR, IS (Inception Score), and the recently introduced CLIP Score.</p> <ul data-svelte-h="svelte-1kh2lr0"><li><p>PSNR (peak signal-to-noise ratio) is a logarithmic function of the mean squared error, so it can be interpreted much like MSE, except that higher is better. Generally, values in the range [25, 34] dB are okay results, while 34+ dB is very good.</p></li> <li><p>SSIM (Structural Similarity Index) is a metric usually reported in the range [0, 1], where 1 is a perfect match. The final index is calculated from 3 components: luminance, contrast, and structure. <a href="https://arxiv.org/pdf/2006.13846.pdf" rel="nofollow">This paper</a> analyzes SSIM and its components if you’re really interested.</p></li> <li><p>The Inception Score was introduced in <a href="https://arxiv.org/pdf/1606.03498.pdf" rel="nofollow">Improved Techniques for Training GANs</a>. It is calculated from the class predictions of the Inception-v3 model. 
The higher the better. It is a mathematically very interesting metric, but it has recently fallen out of favor.</p></li> <li><p>CLIP Score was introduced in <a href="https://arxiv.org/pdf/2104.08718.pdf" rel="nofollow">CLIPScore: A Reference-free Evaluation Metric for Image Captioning</a> and is used to evaluate the quality of text-to-image models. It is calculated by using the CLIP model to compute the cosine similarity between the embeddings of the generated image and the text prompt. When the similarity is scaled by 100, as in common implementations, its range is [0, 100]; the higher the better.</p> <p>If you’re <em>really curious</em> about FID, <a href="https://arxiv.org/pdf/2203.06026.pdf" rel="nofollow">The Role of ImageNet Classes in Fréchet Inception Distance</a> analyzes what FID considers important in an image and how features pretrained on ImageNet affect the FID score. It is a very interesting read.</p></li></ul> <p></p> | |
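<p>To make the link between PSNR and mean squared error from the list above concrete, here is a minimal sketch in plain Python. It assumes 8-bit images flattened to lists of pixel values, so the peak value is 255. The function names <code>mse</code> and <code>psnr</code> and the tiny example images are hypothetical and only for illustration. In practice you would more likely use an existing library implementation.</p>

```python
import math


def mse(reference, candidate):
    """Mean squared error between two equally sized flat lists of pixel values."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("images must be non-empty and have the same number of pixels")
    return sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)


def psnr(reference, candidate, max_value=255.0):
    """PSNR in dB, defined as 10 * log10(max_value^2 / MSE).

    max_value=255 is an assumption that holds for 8-bit images.
    Identical images have zero error, so the PSNR is infinite.
    """
    error = mse(reference, candidate)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / error)


# Hypothetical 2x2 grayscale images, flattened to lists of 8-bit pixel values.
reference = [52, 55, 61, 59]
slightly_noisy = [54, 53, 60, 62]
very_noisy = [90, 20, 100, 10]

# A smaller MSE gives a larger PSNR, which is why higher is better.
print("slightly noisy:", round(psnr(reference, slightly_noisy), 2), "dB")
print("very noisy:", round(psnr(reference, very_noisy), 2), "dB")
```

<p>Because PSNR only depends on the pixel-wise error, it needs a reference image to compare against. This is one reason it is mostly used for image-to-image tasks rather than for noise-to-image or text-to-image generation.</p>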
| <script> | |
| { | |
| __sveltekit_1p6gie1 = { | |
| assets: "/docs/computer-vision-course/pr_397/en", | |
| base: "/docs/computer-vision-course/pr_397/en", | |
| env: {} | |
| }; | |
| const element = document.currentScript.parentElement; | |
| const data = [null,null]; | |
| Promise.all([ | |
| import("/docs/computer-vision-course/pr_397/en/_app/immutable/entry/start.7f209408.js"), | |
| import("/docs/computer-vision-course/pr_397/en/_app/immutable/entry/app.32e8338e.js") | |
| ]).then(([kit, app]) => { | |
| kit.start(app, element, { | |
| node_ids: [0, 67], | |
| data, | |
| form: null, | |
| error: null | |
| }); | |
| }); | |
| } | |
| </script> | |