| <meta charset="utf-8" /><meta name="hf:doc:metadata" content="{"title":"Optimizations","local":"optimizations","sections":[{"title":"Lazy vs Eager","local":"lazy-vs-eager","sections":[],"depth":2},{"title":"Example","local":"example","sections":[{"title":"Eager","local":"eager","sections":[],"depth":3},{"title":"Lazy","local":"lazy","sections":[],"depth":3},{"title":"Timings","local":"timings","sections":[],"depth":3}],"depth":2}],"depth":1}"> | |
| <link href="/docs/hub/main/en/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload"> | |
| <link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/entry/start.d0cd5065.js"> | |
| <link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/chunks/scheduler.d6170356.js"> | |
| <link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/chunks/singletons.d032f1eb.js"> | |
| <link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/chunks/paths.752f1c6b.js"> | |
| <link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/entry/app.b6abe3c1.js"> | |
| <link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/chunks/index.fcd4cc08.js"> | |
| <link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/nodes/0.f045427f.js"> | |
| <link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/nodes/39.2c0a9a8f.js"> | |
| <link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/chunks/CodeBlock.7b16bdef.js"> | |
| <link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/chunks/EditOnGithub.da2b595c.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{"title":"Optimizations","local":"optimizations","sections":[{"title":"Lazy vs Eager","local":"lazy-vs-eager","sections":[],"depth":2},{"title":"Example","local":"example","sections":[{"title":"Eager","local":"eager","sections":[],"depth":3},{"title":"Lazy","local":"lazy","sections":[],"depth":3},{"title":"Timings","local":"timings","sections":[],"depth":3}],"depth":2}],"depth":1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h1 class="relative group"><a id="optimizations" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#optimizations"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Optimizations</span></h1> <p data-svelte-h="svelte-o4igm">We briefly touched upon the difference between lazy and eager evaluation. 
On this page we will show how the lazy API can be used to get huge performance benefits.</p> <h2 class="relative group"><a id="lazy-vs-eager" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#lazy-vs-eager"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Lazy vs Eager</span></h2> <p data-svelte-h="svelte-16m8tgd">Polars supports two modes of operation: lazy and eager. In the eager API the query is executed immediately while in the lazy API the query is only evaluated once it’s ‘needed’. 
Deferring the execution to the last minute can have significant performance advantages and is why the lazy API is preferred in most non-interactive cases.</p> <h2 class="relative group"><a id="example" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#example"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Example</span></h2> <p data-svelte-h="svelte-1qs4ke0">We will be using the example from the previous page to show the performance benefits of using the lazy API. 
The code below computes, for each top-level domain (TLD) in the Common Crawl statistics dataset, the number of scrapes, the average number of domains, and the average page growth rate, and returns the ten TLDs with the highest average number of domains.</p> <h3 class="relative group"><a id="eager" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#eager"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Eager</span></h3> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 
top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> polars <span class="hljs-keyword">as</span> pl | |
| <span class="hljs-keyword">import</span> datetime | |
| df = pl.read_csv(<span class="hljs-string">"hf://datasets/commoncrawl/statistics/tlds.csv"</span>, try_parse_dates=<span class="hljs-literal">True</span>) | |
| df = df.select(<span class="hljs-string">"suffix"</span>, <span class="hljs-string">"crawl"</span>, <span class="hljs-string">"date"</span>, <span class="hljs-string">"tld"</span>, <span class="hljs-string">"pages"</span>, <span class="hljs-string">"domains"</span>) | |
| df = df.<span class="hljs-built_in">filter</span>( | |
| (pl.col(<span class="hljs-string">"date"</span>) >= datetime.date(<span class="hljs-number">2020</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>)) | | |
| pl.col(<span class="hljs-string">"crawl"</span>).<span class="hljs-built_in">str</span>.contains(<span class="hljs-string">"CC"</span>) | |
| ) | |
| df = df.with_columns( | |
| (pl.col(<span class="hljs-string">"pages"</span>) / pl.col(<span class="hljs-string">"domains"</span>)).alias(<span class="hljs-string">"pages_per_domain"</span>) | |
| ) | |
| df = df.group_by(<span class="hljs-string">"tld"</span>, <span class="hljs-string">"date"</span>).agg( | |
| pl.col(<span class="hljs-string">"pages"</span>).<span class="hljs-built_in">sum</span>(), | |
| pl.col(<span class="hljs-string">"domains"</span>).<span class="hljs-built_in">sum</span>(), | |
| ) | |
| df = df.group_by(<span class="hljs-string">"tld"</span>).agg( | |
| pl.col(<span class="hljs-string">"date"</span>).unique().count().alias(<span class="hljs-string">"number_of_scrapes"</span>), | |
| pl.col(<span class="hljs-string">"domains"</span>).mean().alias(<span class="hljs-string">"avg_number_of_domains"</span>), | |
| pl.col(<span class="hljs-string">"pages"</span>).sort_by(<span class="hljs-string">"date"</span>).pct_change().mean().alias(<span class="hljs-string">"avg_page_growth_rate"</span>), | |
| ).sort(<span class="hljs-string">"avg_number_of_domains"</span>, descending=<span class="hljs-literal">True</span>).head(<span class="hljs-number">10</span>)<!-- HTML_TAG_END --></pre></div> <h3 class="relative group"><a id="lazy" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#lazy"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Lazy</span></h3> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none 
transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> polars <span class="hljs-keyword">as</span> pl | |
| <span class="hljs-keyword">import</span> datetime | |
| lf = ( | |
| pl.scan_csv(<span class="hljs-string">"hf://datasets/commoncrawl/statistics/tlds.csv"</span>, try_parse_dates=<span class="hljs-literal">True</span>) | |
| .<span class="hljs-built_in">filter</span>( | |
| (pl.col(<span class="hljs-string">"date"</span>) >= datetime.date(<span class="hljs-number">2020</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>)) | | |
| pl.col(<span class="hljs-string">"crawl"</span>).<span class="hljs-built_in">str</span>.contains(<span class="hljs-string">"CC"</span>) | |
| ).with_columns( | |
| (pl.col(<span class="hljs-string">"pages"</span>) / pl.col(<span class="hljs-string">"domains"</span>)).alias(<span class="hljs-string">"pages_per_domain"</span>) | |
| ).group_by(<span class="hljs-string">"tld"</span>, <span class="hljs-string">"date"</span>).agg( | |
| pl.col(<span class="hljs-string">"pages"</span>).<span class="hljs-built_in">sum</span>(), | |
| pl.col(<span class="hljs-string">"domains"</span>).<span class="hljs-built_in">sum</span>(), | |
| ).group_by(<span class="hljs-string">"tld"</span>).agg( | |
| pl.col(<span class="hljs-string">"date"</span>).unique().count().alias(<span class="hljs-string">"number_of_scrapes"</span>), | |
| pl.col(<span class="hljs-string">"domains"</span>).mean().alias(<span class="hljs-string">"avg_number_of_domains"</span>), | |
| pl.col(<span class="hljs-string">"pages"</span>).sort_by(<span class="hljs-string">"date"</span>).pct_change().mean().alias(<span class="hljs-string">"avg_page_growth_rate"</span>), | |
| ).sort(<span class="hljs-string">"avg_number_of_domains"</span>, descending=<span class="hljs-literal">True</span>).head(<span class="hljs-number">10</span>) | |
| ) | |
| df = lf.collect()<!-- HTML_TAG_END --></pre></div> <h3 class="relative group"><a id="timings" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#timings"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Timings</span></h3> <p data-svelte-h="svelte-7wcll">Running both queries leads to the following run times on a regular laptop with a household internet connection:</p> <ul data-svelte-h="svelte-1ptfoh2"><li>Eager: <code>1.96</code> seconds</li> <li>Lazy: <code>410</code> milliseconds</li></ul> <p data-svelte-h="svelte-13w92en">The lazy query is ~5 times faster than the eager one. The reason for this is the query optimizer: if we delay <code>collect</code>-ing our dataset until the end, Polars can reason about which columns and rows are required and apply filters as early as possible when reading the data. For file formats such as Parquet that contain metadata (e.g. 
min, max in a certain group of rows), the difference can be even bigger, as Polars can skip entire row groups based on the filters and the metadata without sending the data over the wire.</p> <a class="!text-gray-400 !no-underline text-sm flex items-center not-prose mt-4" href="https://github.com/huggingface/hub-docs/blob/main/docs/hub/datasets-polars-optimizations.md" target="_blank"><span data-svelte-h="svelte-1kd6by1"><</span> <span data-svelte-h="svelte-x0xyl0">></span> <span data-svelte-h="svelte-1dajgef"><span class="underline ml-1.5">Update</span> on GitHub</span></a> <p></p> | |
| <script> | |
| { | |
| __sveltekit_1vatp3t = { | |
| assets: "/docs/hub/main/en", | |
| base: "/docs/hub/main/en", | |
| env: {} | |
| }; | |
| const element = document.currentScript.parentElement; | |
| const data = [null,null]; | |
| Promise.all([ | |
| import("/docs/hub/main/en/_app/immutable/entry/start.d0cd5065.js"), | |
| import("/docs/hub/main/en/_app/immutable/entry/app.b6abe3c1.js") | |
| ]).then(([kit, app]) => { | |
| kit.start(app, element, { | |
| node_ids: [0, 39], | |
| data, | |
| form: null, | |
| error: null | |
| }); | |
| }); | |
| } | |
| </script> | |