| <meta charset="utf-8" /><meta name="hf:doc:metadata" content="{"title":"Introduction","local":"introduction","sections":[{"title":"What is synthetic data?","local":"what-is-synthetic-data","sections":[],"depth":2},{"title":"Why would you use synthetic data?","local":"why-would-you-use-synthetic-data","sections":[],"depth":2},{"title":"How to generate synthetic data?","local":"how-to-generate-synthetic-data","sections":[],"depth":2},{"title":"Challenges with synthetic data","local":"challenges-with-synthetic-data","sections":[],"depth":2},{"title":"Resources","local":"resources","sections":[],"depth":2}],"depth":1}"> | |
| <link href="/docs/computer-vision-course/pr_397/en/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/entry/start.7f209408.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/scheduler.7bc62968.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/singletons.b15acae1.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/paths.11cdc4b4.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/entry/app.32e8338e.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/index.2f8492b0.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/nodes/0.e37092e8.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/nodes/17.236e393c.js"> | |
| <link rel="modulepreload" href="/docs/computer-vision-course/pr_397/en/_app/immutable/chunks/index.514d62da.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{"title":"Introduction","local":"introduction","sections":[{"title":"What is synthetic data?","local":"what-is-synthetic-data","sections":[],"depth":2},{"title":"Why would you use synthetic data?","local":"why-would-you-use-synthetic-data","sections":[],"depth":2},{"title":"How to generate synthetic data?","local":"how-to-generate-synthetic-data","sections":[],"depth":2},{"title":"Challenges with synthetic data","local":"challenges-with-synthetic-data","sections":[],"depth":2},{"title":"Resources","local":"resources","sections":[],"depth":2}],"depth":1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h1 class="relative group"><a id="introduction" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#introduction"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Introduction</span></h1> <p data-svelte-h="svelte-o1332z">Have you ever tried to get hold of some data for your problem, be it a machine learning problem or some other development-related problem, 
and you just couldn’t find enough data? Either the data is closed-source and unavailable to you, or it is prohibitively costly or time-consuming to acquire. How do we deal with such a situation?</p> <p data-svelte-h="svelte-nw5bt9">Well, one solution is synthetic data. Synthetic data is data generated by a model, to be used in place of real data or alongside it. Here, by model, we don’t mean only machine learning or deep learning models; it can also be a simple mathematical or statistical model, such as a set of (stochastic) differential equations modeling a physical or economic <a href="https://link.springer.com/book/10.1007/978-3-319-56436-4" rel="nofollow">system</a>. Feeling excited yet? Let’s dive into the details of synthetic data: what it is, how it is generated, and its benefits. You might be able to answer the last one a little by now ;)</p> <h2 class="relative group"><a id="what-is-synthetic-data" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#what-is-synthetic-data"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>What is synthetic data?</span></h2> <p data-svelte-h="svelte-18jz4hw">As <a 
href="https://arxiv.org/abs/2205.03257" rel="nofollow">the Royal Society</a> defines it, synthetic data is data generated using a purpose-built mathematical model or algorithm to solve a (set of) data science task(s). Keep in mind that synthetic data only mimics real data and is not generated by real events. Ideally, synthetic data should have the same statistical properties as the real data it supplements. It has many uses, such as improving AI models, protecting sensitive data, and mitigating bias.</p> <h2 class="relative group"><a id="why-would-you-use-synthetic-data" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#why-would-you-use-synthetic-data"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Why would you use synthetic data?</span></h2> <p data-svelte-h="svelte-1k2ejb3">Before answering this question, let’s talk a little bit about why real data is not always sufficient. 
Some of the problems with real data (the list is not exhaustive) are:</p> <ul data-svelte-h="svelte-eey89z"><li>It can be messy and very hard to deal with.</li> <li>Inter-company data sharing might not be possible due to privacy issues.</li> <li>Medical data is confidential and hence cannot be shared openly.</li> <li>It can be biased.</li> <li>Data collection and annotation can be expensive.</li></ul> <p data-svelte-h="svelte-18sbrq7">Most of the above-mentioned problems can potentially be solved by synthetic data:</p> <ul data-svelte-h="svelte-10bnq1j"><li>Synthetic data is generated in a structured form and is hence easier to deal with.</li> <li>Companies can train synthetic data generation models that learn the distribution of the original data while aiming not to reveal anything about individual data points in the original data, which helps maintain privacy. A similar approach can be taken for medical data.</li> <li>We can train the data generator model to generate de-biased data.</li> <li>Real data can be augmented with synthetic data to make models or applications more robust.</li></ul> <h2 class="relative group"><a id="how-to-generate-synthetic-data" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#how-to-generate-synthetic-data"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 
56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>How to generate synthetic data?</span></h2> <p data-svelte-h="svelte-p5z7gv">Here are some of the ways to generate synthetic data:</p> <ul data-svelte-h="svelte-rfai8v"><li>CAD &amp; Blender: Allow the creation of photorealistic image datasets of 3D scenes while controlling the generation parameters. Because those parameters serve as ground truth, metrics can be computed by comparing the synthesized data against them. This is a very robust method, but it is limited in generation quality, diversity, and quantity. Use cases include <a href="https://amazon-berkeley-objects.s3.amazonaws.com/static_html/ABO_CVPR2022.pdf" rel="nofollow">commercial applications</a>, generating <a href="https://arxiv.org/abs/2109.15102" rel="nofollow">synthetic faces</a>, and <a href="https://openaccess.thecvf.com/content_CVPR_2020/papers/Mu_Learning_From_Synthetic_Animals_CVPR_2020_paper.pdf" rel="nofollow">monitoring wildlife</a>.</li> <li>Deep generative models (Transformers/GANs/Diffusion models): Allow expanding a dataset, tackling data imbalance, and addressing privacy issues. They are very convenient and powerful but can create datasets with biases, incoherence, and repetitiveness, which creates a significant risk of overfitting and leads to a restricted set of predictions. 
Use cases include <a href="https://rdcu.be/dokei" rel="nofollow">medical image generation</a>, <a href="https://www.mdpi.com/2073-4395/12/10/2395" rel="nofollow">efficient plant disease identification</a>, <a href="https://arxiv.org/abs/2303.14828" rel="nofollow">industrial waste sorting</a>, <a href="https://arxiv.org/abs/2101.04927" rel="nofollow">traffic sign recognition</a>, and <a href="https://computer-vision-in-the-wild.github.io/eccv-2022/static/eccv2022/camera_ready/ECCV_2022_cvinw_Domain_Compatible_Synthetic_Data_Generation.pdf" rel="nofollow">detection of emergency vehicles for an autonomous driving car application</a>.</li></ul> <p data-svelte-h="svelte-wy9mdv">In this unit, we will introduce the following methods to generate synthetic data: physically-based rendering, point clouds, and GANs.</p> <h2 class="relative group"><a id="challenges-with-synthetic-data" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#challenges-with-synthetic-data"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Challenges with synthetic data</span></h2> <p data-svelte-h="svelte-3g69fp">Now that we have seen the power and uses of 
synthetic data, let’s take some time out to discuss its challenges:</p> <ul data-svelte-h="svelte-qdi0lm"><li>Synthetic data is not inherently private: Synthetic data can also leak information about the data it was derived from and is vulnerable to privacy attacks. Significant care is required to generate private synthetic data.</li> <li>Outliers can be hard to capture privately: Outliers and low probability events, as are often found in real data, are particularly difficult to capture and to be privately included in a synthetic dataset.</li> <li>Empirically evaluating the privacy of a single dataset can be problematic: Rigorous notions of privacy (e.g., differential privacy) are a requirement on the mechanism that generated a synthetic dataset rather than on the dataset itself.</li> <li>Black box models can be particularly opaque when it comes to generating synthetic data: Overparameterised generative models excel in producing high-dimensional synthetic data, but the levels of accuracy and privacy of these datasets are hard to estimate and can vary significantly across produced data points.</li></ul> <h2 class="relative group"><a id="resources" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#resources"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 
56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Resources</span></h2> <ul data-svelte-h="svelte-1cesuwc"><li><a href="https://arxiv.org/abs/2302.04062" rel="nofollow">Machine Learning for Synthetic Data Generation: A Review</a></li> <li><a href="https://arxiv.org/abs/2205.03257" rel="nofollow">Synthetic Data — what, why and how?</a></li> <li>One very interesting application of synthetic data: <a href="https://www.thispersondoesnotexist.com/" rel="nofollow">this person does not exist</a></li></ul> <p></p> | |
| <script> | |
| { | |
| __sveltekit_1p6gie1 = { | |
| assets: "/docs/computer-vision-course/pr_397/en", | |
| base: "/docs/computer-vision-course/pr_397/en", | |
| env: {} | |
| }; | |
| const element = document.currentScript.parentElement; | |
| const data = [null,null]; | |
| Promise.all([ | |
| import("/docs/computer-vision-course/pr_397/en/_app/immutable/entry/start.7f209408.js"), | |
| import("/docs/computer-vision-course/pr_397/en/_app/immutable/entry/app.32e8338e.js") | |
| ]).then(([kit, app]) => { | |
| kit.start(app, element, { | |
| node_ids: [0, 17], | |
| data, | |
| form: null, | |
| error: null | |
| }); | |
| }); | |
| } | |
| </script> | |