Buckets:
| <meta charset="utf-8" /><meta name="hf:doc:metadata" content="{"title":"Spark","local":"spark","sections":[{"title":"Installation","local":"installation","sections":[],"depth":2},{"title":"Authentication","local":"authentication","sections":[],"depth":2},{"title":"Read","local":"read","sections":[],"depth":2},{"title":"Write","local":"write","sections":[],"depth":2},{"title":"Run in JupyterLab on Hugging Face Spaces","local":"run-in-jupyterlab-on-hugging-face-spaces","sections":[],"depth":2}],"depth":1}"> | |
| <link href="/docs/hub/main/en/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload"> | |
| <link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/entry/start.d0cd5065.js"> | |
| <link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/chunks/scheduler.d6170356.js"> | |
| <link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/chunks/singletons.d032f1eb.js"> | |
| <link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/chunks/paths.752f1c6b.js"> | |
| <link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/entry/app.b6abe3c1.js"> | |
| <link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/chunks/index.fcd4cc08.js"> | |
| <link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/nodes/0.f045427f.js"> | |
| <link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/nodes/40.7c2a8617.js"> | |
| <link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/chunks/CodeBlock.7b16bdef.js"> | |
| <link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/chunks/EditOnGithub.da2b595c.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{"title":"Spark","local":"spark","sections":[{"title":"Installation","local":"installation","sections":[],"depth":2},{"title":"Authentication","local":"authentication","sections":[],"depth":2},{"title":"Read","local":"read","sections":[],"depth":2},{"title":"Write","local":"write","sections":[],"depth":2},{"title":"Run in JupyterLab on Hugging Face Spaces","local":"run-in-jupyterlab-on-hugging-face-spaces","sections":[],"depth":2}],"depth":1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h1 class="relative group"><a id="spark" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#spark"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Spark</span></h1> <p data-svelte-h="svelte-s0r53o">Spark enables real-time, large-scale data processing in a distributed environment.</p> <p data-svelte-h="svelte-12z8vci">In particular, you can use <code>huggingface_hub</code> to access Hugging Face dataset repositories in PySpark.</p> <h2 class="relative group"><a id="installation" 
class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#installation"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Installation</span></h2> <p data-svelte-h="svelte-13h0lcm">To be able to read and write to Hugging Face URLs (e.g. 
<code>hf://datasets/username/dataset/data.parquet</code>), you need to install the <code>huggingface_hub</code> library:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->pip <span class="hljs-keyword">install</span> huggingface_hub<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-ml7hwl">You also need to install <code>pyarrow</code> to read/write Parquet / JSON / CSV / etc. 
files using the filesystem API provided by <code>huggingface_hub</code>:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->pip <span class="hljs-keyword">install</span> pyarrow<!-- HTML_TAG_END --></pre></div> <h2 class="relative group"><a id="authentication" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#authentication"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 
1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Authentication</span></h2> <p data-svelte-h="svelte-aqw7kt">You need to authenticate to Hugging Face to read private/gated dataset repositories or to write to your dataset repositories.</p> <p data-svelte-h="svelte-pdivv8">You can use the CLI for example:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->huggingface-<span class="hljs-keyword">cli</span> login<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1pzllua">It’s also possible 
to provide your Hugging Face token with the <code>HF_TOKEN</code> environment variable or by passing the <code>storage_options</code> parameter to the helper functions below:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->storage_options = {<span class="hljs-string">"token"</span>: <span class="hljs-string">"hf_xxx"</span>}<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-ygg3x4">For more details about authentication, check out <a href="https://huggingface.co/docs/huggingface_hub/quick-start#authentication" rel="nofollow">this guide</a>.</p> <h2 class="relative group"><a id="read" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#read"><span><svg class="" xmlns="http://www.w3.org/2000/svg" 
xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Read</span></h2> <p data-svelte-h="svelte-7gpw7a">PySpark doesn’t officially support Hugging Face paths, so we provide a helper function to read datasets in a distributed manner.</p> <p data-svelte-h="svelte-4fzrqm">For example, you can read Parquet files from Hugging Face in an optimized way using PyArrow by defining this <code>read_parquet</code> helper function:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full 
transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> functools <span class="hljs-keyword">import</span> partial | |
| <span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> Iterator, <span class="hljs-type">Optional</span>, <span class="hljs-type">Union</span> | |
| <span class="hljs-keyword">import</span> pyarrow <span class="hljs-keyword">as</span> pa | |
| <span class="hljs-keyword">import</span> pyarrow.parquet <span class="hljs-keyword">as</span> pq | |
| <span class="hljs-keyword">from</span> huggingface_hub <span class="hljs-keyword">import</span> HfFileSystem | |
| <span class="hljs-keyword">from</span> pyspark.sql.dataframe <span class="hljs-keyword">import</span> DataFrame | |
| <span class="hljs-keyword">from</span> pyspark.sql.pandas.types <span class="hljs-keyword">import</span> from_arrow_schema | |
| <span class="hljs-keyword">def</span> <span class="hljs-title function_">_read</span>(<span class="hljs-params">iterator: Iterator[pa.RecordBatch], columns: <span class="hljs-type">Optional</span>[<span class="hljs-built_in">list</span>[<span class="hljs-built_in">str</span>]], filters: <span class="hljs-type">Optional</span>[<span class="hljs-type">Union</span>[<span class="hljs-built_in">list</span>[<span class="hljs-built_in">tuple</span>], <span class="hljs-built_in">list</span>[<span class="hljs-built_in">list</span>[<span class="hljs-built_in">tuple</span>]]]], **kwargs</span>) -> Iterator[pa.RecordBatch]: | |
|     <span class="hljs-keyword">for</span> batch <span class="hljs-keyword">in</span> iterator: | |
|         paths = batch[<span class="hljs-number">0</span>].to_pylist() | |
|         ds = pq.ParquetDataset(paths, **kwargs) | |
|         <span class="hljs-keyword">yield</span> <span class="hljs-keyword">from</span> ds._dataset.to_batches(columns=columns, <span class="hljs-built_in">filter</span>=pq.filters_to_expression(filters) <span class="hljs-keyword">if</span> filters <span class="hljs-keyword">else</span> <span class="hljs-literal">None</span>) | |
| <span class="hljs-keyword">def</span> <span class="hljs-title function_">read_parquet</span>(<span class="hljs-params"> | |
| path: <span class="hljs-built_in">str</span>, | |
| columns: <span class="hljs-type">Optional</span>[<span class="hljs-built_in">list</span>[<span class="hljs-built_in">str</span>]] = <span class="hljs-literal">None</span>, | |
| filters: <span class="hljs-type">Optional</span>[<span class="hljs-type">Union</span>[<span class="hljs-built_in">list</span>[<span class="hljs-built_in">tuple</span>], <span class="hljs-built_in">list</span>[<span class="hljs-built_in">list</span>[<span class="hljs-built_in">tuple</span>]]]] = <span class="hljs-literal">None</span>, | |
| **kwargs, | |
| </span>) -> DataFrame: | |
|     <span class="hljs-string">""" | |
|     Loads Parquet files from Hugging Face using PyArrow, returning a PySpark `DataFrame`. | |
| It reads Parquet files in a distributed manner. | |
| Access private or gated repositories using `huggingface-cli login` or by passing a token | |
| using the `storage_options` argument: `storage_options={"token": "hf_xxx"}` | |
| Parameters | |
| ---------- | |
| path : str | |
| Path to the file. Prefix with a protocol like `hf://` to read from Hugging Face. | |
| You can read from multiple files if you pass a globstring. | |
| columns : list, default None | |
| If not None, only these columns will be read from the file. | |
| filters : List[Tuple] or List[List[Tuple]], default None | |
| To filter out data. | |
| Filter syntax: [[(column, op, val), ...],...] | |
| where op is [==, =, >, >=, <, <=, !=, in, not in] | |
| The innermost tuples are transposed into a set of filters applied | |
| through an `AND` operation. | |
| The outer list combines these sets of filters through an `OR` | |
| operation. | |
| A single list of tuples can also be used, meaning that no `OR` | |
| operation between sets of filters is to be conducted. | |
| **kwargs | |
| Any additional kwargs are passed to pyarrow.parquet.ParquetDataset. | |
| Returns | |
| ------- | |
| DataFrame | |
| DataFrame based on parquet file. | |
| Examples | |
| -------- | |
| >>> path = "hf://datasets/username/dataset/data.parquet" | |
| >>> pd.DataFrame({"foo": range(5), "bar": range(5, 10)}).to_parquet(path) | |
| >>> read_parquet(path).show() | |
| +---+---+ | |
| |foo|bar| | |
| +---+---+ | |
| | 0| 5| | |
| | 1| 6| | |
| | 2| 7| | |
| | 3| 8| | |
| | 4| 9| | |
| +---+---+ | |
| >>> read_parquet(path, columns=["bar"]).show() | |
| +---+ | |
| |bar| | |
| +---+ | |
| | 5| | |
| | 6| | |
| | 7| | |
| | 8| | |
| | 9| | |
| +---+ | |
| >>> sel = [("foo", ">", 2)] | |
| >>> read_parquet(path, filters=sel).show() | |
| +---+---+ | |
| |foo|bar| | |
| +---+---+ | |
| | 3| 8| | |
| | 4| 9| | |
| +---+---+ | |
| """</span> | |
|     filesystem: HfFileSystem = kwargs.pop(<span class="hljs-string">"filesystem"</span>) <span class="hljs-keyword">if</span> <span class="hljs-string">"filesystem"</span> <span class="hljs-keyword">in</span> kwargs <span class="hljs-keyword">else</span> HfFileSystem(**kwargs.pop(<span class="hljs-string">"storage_options"</span>, {})) | |
|     paths = filesystem.glob(path) | |
|     <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> paths: | |
|         <span class="hljs-keyword">raise</span> FileNotFoundError(<span class="hljs-string">f"Couldn't find any file at <span class="hljs-subst">{path}</span>"</span>) | |
|     rdd = spark.sparkContext.parallelize([{<span class="hljs-string">"path"</span>: path} <span class="hljs-keyword">for</span> path <span class="hljs-keyword">in</span> paths], <span class="hljs-built_in">len</span>(paths)) | |
|     df = spark.createDataFrame(rdd) | |
|     arrow_schema = pq.read_schema(filesystem.<span class="hljs-built_in">open</span>(paths[<span class="hljs-number">0</span>])) | |
|     schema = pa.schema([field <span class="hljs-keyword">for</span> field <span class="hljs-keyword">in</span> arrow_schema <span class="hljs-keyword">if</span> (columns <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span> <span class="hljs-keyword">or</span> field.name <span class="hljs-keyword">in</span> columns)], metadata=arrow_schema.metadata) | |
|     <span class="hljs-keyword">return</span> df.mapInArrow( | |
|         partial(_read, columns=columns, filters=filters, filesystem=filesystem, schema=arrow_schema, **kwargs), | |
|         from_arrow_schema(schema), | |
| )<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1vx9tic">Here is how we can use this on the <a href="https://huggingface.co/datasets/BAAI/Infinity-Instruct" rel="nofollow">BAAI/Infinity-Instruct</a> dataset. | |
| It is a gated repository, so users have to accept the terms of use before accessing it.</p> <div class="flex justify-center" data-svelte-h="svelte-1ct11n9"><img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datasets-spark-infinity-instruct-7M-min.png"> <img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datasets-spark-infinity-instruct-7M-dark-min.png"></div> <p data-svelte-h="svelte-1p4gwvr">We use the <code>read_parquet</code> function to read data from the dataset, compute the number of dialogues per language, and filter the dataset.</p> <p data-svelte-h="svelte-19d7o0j">After logging in to access the gated repository, we can run:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span 
class="hljs-meta">>>> </span><span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession | |
| <span class="hljs-meta">>>> </span>spark = SparkSession.builder.appName(<span class="hljs-string">"demo"</span>).getOrCreate() | |
| <span class="hljs-meta">>>> </span>df = read_parquet(<span class="hljs-string">"hf://datasets/BAAI/Infinity-Instruct/7M/*.parquet"</span>) | |
| <span class="hljs-meta">>>> </span>df.show() | |
| +---+----------------------------+-----+----------+--------------------+ | |
| | <span class="hljs-built_in">id</span>| conversations|label|langdetect| source| | |
| +---+----------------------------+-----+----------+--------------------+ | |
| | <span class="hljs-number">0</span>| [{human, <span class="hljs-keyword">def</span> <span class="hljs-title function_">exti</span>...| | en| code_exercises| | |
| | <span class="hljs-number">1</span>| [{human, See the ...| | en| flan| | |
| | <span class="hljs-number">2</span>| [{human, This <span class="hljs-keyword">is</span> ...| | en| flan| | |
| | <span class="hljs-number">3</span>| [{human, If you d...| | en| flan| | |
| | <span class="hljs-number">4</span>| [{human, In a Uni...| | en| flan| | |
| | <span class="hljs-number">5</span>| [{human, Read the...| | en| flan| | |
| | <span class="hljs-number">6</span>| [{human, You are ...| | en| code_bagel| | |
| | <span class="hljs-number">7</span>| [{human, I want y...| | en| Subjective| | |
| | <span class="hljs-number">8</span>| [{human, Given th...| | en| flan| | |
| | <span class="hljs-number">9</span>|[{human, 因果联系原则是法...| | zh-cn| Subjective| | |
| | <span class="hljs-number">10</span>| [{human, Provide ...| | en|self-oss-instruct...| | |
| | <span class="hljs-number">11</span>| [{human, The univ...| | en| flan| | |
| | <span class="hljs-number">12</span>| [{human, Q: I am ...| | en| flan| | |
| | <span class="hljs-number">13</span>| [{human, What <span class="hljs-keyword">is</span> ...| | en| OpenHermes-<span class="hljs-number">2.5</span>| | |
| | <span class="hljs-number">14</span>| [{human, In react...| | en| flan| | |
| | <span class="hljs-number">15</span>| [{human, Write Py...| | en| code_exercises| | |
| | <span class="hljs-number">16</span>| [{human, Find the...| | en| MetaMath| | |
| | <span class="hljs-number">17</span>| [{human, Three of...| | en| MetaMath| | |
| | <span class="hljs-number">18</span>| [{human, Chandra ...| | en| MetaMath| | |
| | <span class="hljs-number">19</span>|[{human, 用经济学知识分析...| | zh-cn| Subjective| | |
| +---+----------------------------+-----+----------+--------------------+<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-q1u30o">To compute the number of dialogues per language we run this code. | |
| The <code>columns</code> argument is useful to load only the columns we need, since PySpark doesn’t push column selection or filters down to the Parquet reader in this case. | |
| There is also a <code>filters</code> argument to only load data with values within a certain range.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-meta">>>> </span>df_langdetect_only = read_parquet(<span class="hljs-string">"hf://datasets/BAAI/Infinity-Instruct/7M/*.parquet"</span>, columns=[<span class="hljs-string">"langdetect"</span>]) | |
| <span class="hljs-meta">>>> </span>df_langdetect_only.groupBy(<span class="hljs-string">"langdetect"</span>).count().show() | |
| +----------+-------+ | |
| |langdetect| count| | |
| +----------+-------+ | |
| | en|<span class="hljs-number">6697793</span>| | |
| | zh-cn| <span class="hljs-number">751313</span>| | |
| +----------+-------+<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-e2gcj5">To filter the dataset and only keep dialogues in Chinese:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-meta">>>> </span>criteria = [(<span class="hljs-string">"langdetect"</span>, <span class="hljs-string">"="</span>, <span class="hljs-string">"zh-cn"</span>)] | |
| <span class="hljs-meta">>>> </span>df_chinese_only = read_parquet(<span class="hljs-string">"hf://datasets/BAAI/Infinity-Instruct/7M/*.parquet"</span>, filters=criteria) | |
| <span class="hljs-meta">>>> </span>df_chinese_only.show() | |
| +---+----------------------------+-----+----------+----------+ | |
| | <span class="hljs-built_in">id</span>| conversations|label|langdetect| source| | |
| +---+----------------------------+-----+----------+----------+ | |
| | <span class="hljs-number">9</span>|[{human, 因果联系原则是法...| | zh-cn|Subjective| | |
| | <span class="hljs-number">19</span>|[{human, 用经济学知识分析...| | zh-cn|Subjective| | |
| | <span class="hljs-number">38</span>| [{human, 某个考试共有A、...| | zh-cn|Subjective| | |
| | <span class="hljs-number">39</span>|[{human, 撰写一篇关于斐波...| | zh-cn|Subjective| | |
| | <span class="hljs-number">57</span>|[{human, 总结世界历史上的...| | zh-cn|Subjective| | |
| | <span class="hljs-number">61</span>|[{human, 生成一则广告词。...| | zh-cn|Subjective| | |
| | <span class="hljs-number">66</span>|[{human, 描述一个有效的团...| | zh-cn|Subjective| | |
| | <span class="hljs-number">94</span>|[{human, 如果比利和蒂芙尼...| | zh-cn|Subjective| | |
| |<span class="hljs-number">102</span>|[{human, 生成一句英文名言...| | zh-cn|Subjective| | |
| |<span class="hljs-number">106</span>|[{human, 写一封感谢信,感...| | zh-cn|Subjective| | |
| |<span class="hljs-number">118</span>| [{human, 生成一个故事。}...| | zh-cn|Subjective| | |
| |<span class="hljs-number">174</span>|[{human, 高胆固醇水平的后...| | zh-cn|Subjective| | |
| |<span class="hljs-number">180</span>|[{human, 基于以下角色信息...| | zh-cn|Subjective| | |
| |<span class="hljs-number">192</span>|[{human, 请写一篇文章,概...| | zh-cn|Subjective| | |
| |<span class="hljs-number">221</span>|[{human, 以诗歌形式表达对...| | zh-cn|Subjective| | |
| |<span class="hljs-number">228</span>|[{human, 根据给定的指令,...| | zh-cn|Subjective| | |
| |<span class="hljs-number">236</span>|[{human, 打开一个新的生成...| | zh-cn|Subjective| | |
| |<span class="hljs-number">260</span>|[{human, 生成一个有关未来...| | zh-cn|Subjective| | |
| |<span class="hljs-number">268</span>|[{human, 如果有一定数量的...| | zh-cn|Subjective| | |
| |<span class="hljs-number">273</span>| [{human, 题目:小明有<span class="hljs-number">5</span>个...| | zh-cn|Subjective| | |
| +---+----------------------------+-----+----------+----------+<!-- HTML_TAG_END --></pre></div> <h2 class="relative group"><a id="write" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#write"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Write</span></h2> <p data-svelte-h="svelte-nrb4bc">We also provide a helper function to write datasets in a distributed manner to a Hugging Face repository.</p> <p data-svelte-h="svelte-104rxy3">You can write a PySpark DataFrame to Hugging Face using this <code>write_parquet</code> helper function based on the <code>huggingface_hub</code> API. | |
| In particular it uses the <code>preupload_lfs_files</code> utility to upload Parquet files in parallel in a distributed manner, and only commits the files once they’re all uploaded:</p> <div class="code-block relative"> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> math | |
| <span class="hljs-keyword">import</span> pickle | |
| <span class="hljs-keyword">import</span> tempfile | |
| <span class="hljs-keyword">from</span> functools <span class="hljs-keyword">import</span> partial | |
| <span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> Iterator, <span class="hljs-type">Optional</span> | |
| <span class="hljs-keyword">import</span> pyarrow <span class="hljs-keyword">as</span> pa | |
| <span class="hljs-keyword">import</span> pyarrow.parquet <span class="hljs-keyword">as</span> pq | |
| <span class="hljs-keyword">from</span> huggingface_hub <span class="hljs-keyword">import</span> CommitOperationAdd, HfFileSystem | |
| <span class="hljs-keyword">from</span> pyspark.sql.dataframe <span class="hljs-keyword">import</span> DataFrame | |
| <span class="hljs-keyword">from</span> pyspark.sql.pandas.types <span class="hljs-keyword">import</span> from_arrow_schema, to_arrow_schema | |
| <span class="hljs-keyword">def</span> <span class="hljs-title function_">_preupload</span>(<span class="hljs-params">iterator: Iterator[pa.RecordBatch], path: <span class="hljs-built_in">str</span>, schema: pa.Schema, filesystem: HfFileSystem, row_group_size: <span class="hljs-type">Optional</span>[<span class="hljs-built_in">int</span>] = <span class="hljs-literal">None</span>, **kwargs</span>) -> Iterator[pa.RecordBatch]: | |
|     resolved_path = filesystem.resolve_path(path) | |
|     <span class="hljs-comment"># Write this partition's batches to a local temporary Parquet file</span> | |
|     <span class="hljs-keyword">with</span> tempfile.NamedTemporaryFile(suffix=<span class="hljs-string">".parquet"</span>) <span class="hljs-keyword">as</span> temp_file: | |
|         <span class="hljs-keyword">with</span> pq.ParquetWriter(temp_file.name, schema=schema, **kwargs) <span class="hljs-keyword">as</span> writer: | |
|             <span class="hljs-keyword">for</span> batch <span class="hljs-keyword">in</span> iterator: | |
|                 writer.write_batch(batch, row_group_size=row_group_size) | |
|         <span class="hljs-comment"># Upload the file content now; path_in_repo is a placeholder that _commit replaces</span> | |
|         addition = CommitOperationAdd(path_in_repo=temp_file.name, path_or_fileobj=temp_file.name) | |
|         filesystem._api.preupload_lfs_files(repo_id=resolved_path.repo_id, additions=[addition], repo_type=resolved_path.repo_type, revision=resolved_path.revision) | |
|     <span class="hljs-comment"># Pass the pickled operation on to the commit step</span> | |
|     <span class="hljs-keyword">yield</span> pa.record_batch({<span class="hljs-string">"addition"</span>: [pickle.dumps(addition)]}, schema=pa.schema({<span class="hljs-string">"addition"</span>: pa.binary()})) | |
| <span class="hljs-keyword">def</span> <span class="hljs-title function_">_commit</span>(<span class="hljs-params">iterator: Iterator[pa.RecordBatch], path: <span class="hljs-built_in">str</span>, filesystem: HfFileSystem, max_operations_per_commit=<span class="hljs-number">50</span></span>) -> Iterator[pa.RecordBatch]: | |
|     resolved_path = filesystem.resolve_path(path) | |
|     <span class="hljs-comment"># Gather the pickled operations produced by every _preupload task</span> | |
|     additions: <span class="hljs-built_in">list</span>[CommitOperationAdd] = [pickle.loads(addition) <span class="hljs-keyword">for</span> addition <span class="hljs-keyword">in</span> pa.Table.from_batches(iterator, schema=pa.schema({<span class="hljs-string">"addition"</span>: pa.binary()}))[<span class="hljs-number">0</span>].to_pylist()] | |
|     num_commits = math.ceil(<span class="hljs-built_in">len</span>(additions) / max_operations_per_commit) | |
|     <span class="hljs-comment"># Give each shard its final file name in the repository</span> | |
|     <span class="hljs-keyword">for</span> shard_idx, addition <span class="hljs-keyword">in</span> <span class="hljs-built_in">enumerate</span>(additions): | |
|         addition.path_in_repo = resolved_path.path_in_repo.replace(<span class="hljs-string">"{shard_idx:05d}"</span>, <span class="hljs-string">f"<span class="hljs-subst">{shard_idx:05d}</span>"</span>) | |
|     <span class="hljs-comment"># Commit in batches of at most max_operations_per_commit files</span> | |
|     <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(<span class="hljs-number">0</span>, num_commits): | |
|         operations = additions[i * max_operations_per_commit : (i + <span class="hljs-number">1</span>) * max_operations_per_commit] | |
|         commit_message = <span class="hljs-string">"Upload using PySpark"</span> + (<span class="hljs-string">f" (part <span class="hljs-subst">{i:05d}</span>-of-<span class="hljs-subst">{num_commits:05d}</span>)"</span> <span class="hljs-keyword">if</span> num_commits > <span class="hljs-number">1</span> <span class="hljs-keyword">else</span> <span class="hljs-string">""</span>) | |
|         filesystem._api.create_commit(repo_id=resolved_path.repo_id, repo_type=resolved_path.repo_type, revision=resolved_path.revision, operations=operations, commit_message=commit_message) | |
|         <span class="hljs-keyword">yield</span> pa.record_batch({<span class="hljs-string">"path"</span>: [addition.path_in_repo <span class="hljs-keyword">for</span> addition <span class="hljs-keyword">in</span> operations]}, schema=pa.schema({<span class="hljs-string">"path"</span>: pa.string()})) | |
| <span class="hljs-keyword">def</span> <span class="hljs-title function_">write_parquet</span>(<span class="hljs-params">df: DataFrame, path: <span class="hljs-built_in">str</span>, **kwargs</span>) -> <span class="hljs-literal">None</span>: | |
|     <span class="hljs-string">""" | |
|     Write Parquet files to Hugging Face using PyArrow. | |
|     It uploads Parquet files in a distributed manner in two steps: | |
|     1. Preupload the Parquet files in parallel in a distributed manner | |
|     2. Commit the preuploaded files | |
|     Authenticate using `huggingface-cli login` or by passing a token | |
|     using the `storage_options` argument: `storage_options={"token": "hf_xxx"}` | |
|     Parameters | |
|     ---------- | |
|     df : DataFrame | |
|         PySpark DataFrame to write. | |
|     path : str | |
|         Path of the file or directory. Prefix with a protocol like `hf://` to write to Hugging Face. | |
|         It writes Parquet files in the form "part-xxxxx.parquet", or to a single file if `path` ends with ".parquet" or ".pq". | |
|     **kwargs | |
|         Any additional kwargs are passed to pyarrow.parquet.ParquetWriter. | |
|     Returns | |
|     ------- | |
|     None | |
|     Examples | |
|     -------- | |
|     >>> df = spark.createDataFrame(pd.DataFrame({"foo": range(5), "bar": range(5, 10)})) | |
|     >>> # Save to one file | |
|     >>> write_parquet(df, "hf://datasets/username/dataset/data.parquet") | |
|     >>> # OR save to a directory (possibly in many files) | |
|     >>> write_parquet(df, "hf://datasets/username/dataset") | |
|     """</span> | |
|     filesystem: HfFileSystem = kwargs.pop(<span class="hljs-string">"filesystem"</span>, HfFileSystem(**kwargs.pop(<span class="hljs-string">"storage_options"</span>, {}))) | |
|     <span class="hljs-keyword">if</span> path.endswith(<span class="hljs-string">".parquet"</span>) <span class="hljs-keyword">or</span> path.endswith(<span class="hljs-string">".pq"</span>): | |
|         <span class="hljs-comment"># A single output file needs a single partition</span> | |
|         df = df.coalesce(<span class="hljs-number">1</span>) | |
|     <span class="hljs-keyword">else</span>: | |
|         path += <span class="hljs-string">"/part-{shard_idx:05d}.parquet"</span> | |
|     <span class="hljs-comment"># Preupload on every partition, then commit from a single task</span> | |
|     df.mapInArrow( | |
|         partial(_preupload, path=path, schema=to_arrow_schema(df.schema), filesystem=filesystem, **kwargs), | |
|         from_arrow_schema(pa.schema({<span class="hljs-string">"addition"</span>: pa.binary()})), | |
|     ).repartition(<span class="hljs-number">1</span>).mapInArrow( | |
|         partial(_commit, path=path, filesystem=filesystem), | |
|         from_arrow_schema(pa.schema({<span class="hljs-string">"path"</span>: pa.string()})), | |
|     ).collect()<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1d35p3o">Here is how we can use this function to write the filtered version of the <a href="https://huggingface.co/datasets/BAAI/Infinity-Instruct" rel="nofollow">BAAI/Infinity-Instruct</a> dataset back to Hugging Face.</p> <p data-svelte-h="svelte-wkzo7">First you need to <a href="https://huggingface.co/new-dataset" rel="nofollow">create a dataset repository</a>, e.g. <code>username/Infinity-Instruct-Chinese-Only</code> (you can set it to private if you want). | |
| Then, make sure you are authenticated and you can run:</p> <div class="code-block relative"> <pre class=""><!-- HTML_TAG_START --><span class="hljs-meta">>>> </span>write_parquet(df_chinese_only, <span class="hljs-string">"hf://datasets/username/Infinity-Instruct-Chinese-Only"</span>) | |
| tmph9jwu9py.parquet: <span class="hljs-number">100</span>%|██████████| <span class="hljs-number">50.5</span>M/<span class="hljs-number">50.5</span>M [<span class="hljs-number">00</span>:03<<span class="hljs-number">00</span>:<span class="hljs-number">00</span>, <span class="hljs-number">14.6</span>MB/s] | |
| tmp0oqt99nc.parquet: <span class="hljs-number">100</span>%|██████████| <span class="hljs-number">50.8</span>M/<span class="hljs-number">50.8</span>M [<span class="hljs-number">00</span>:02<<span class="hljs-number">00</span>:<span class="hljs-number">00</span>, <span class="hljs-number">17.9</span>MB/s] | |
| tmpgnizkwqp.parquet: <span class="hljs-number">100</span>%|██████████| <span class="hljs-number">50.5</span>M/<span class="hljs-number">50.5</span>M [<span class="hljs-number">00</span>:02<<span class="hljs-number">00</span>:<span class="hljs-number">00</span>, <span class="hljs-number">19.6</span>MB/s] | |
| tmpanm04k4n.parquet: <span class="hljs-number">100</span>%|██████████| <span class="hljs-number">51.4</span>M/<span class="hljs-number">51.4</span>M [<span class="hljs-number">00</span>:02<<span class="hljs-number">00</span>:<span class="hljs-number">00</span>, <span class="hljs-number">22.9</span>MB/s] | |
| tmp14uy9oqb.parquet: <span class="hljs-number">100</span>%|██████████| <span class="hljs-number">50.4</span>M/<span class="hljs-number">50.4</span>M [<span class="hljs-number">00</span>:02<<span class="hljs-number">00</span>:<span class="hljs-number">00</span>, <span class="hljs-number">23.0</span>MB/s] | |
| tmpcp8t_qdl.parquet: <span class="hljs-number">100</span>%|██████████| <span class="hljs-number">50.4</span>M/<span class="hljs-number">50.4</span>M [<span class="hljs-number">00</span>:02<<span class="hljs-number">00</span>:<span class="hljs-number">00</span>, <span class="hljs-number">23.5</span>MB/s] | |
| tmpjui5mns8.parquet: <span class="hljs-number">100</span>%|██████████| <span class="hljs-number">50.3</span>M/<span class="hljs-number">50.3</span>M [<span class="hljs-number">00</span>:02<<span class="hljs-number">00</span>:<span class="hljs-number">00</span>, <span class="hljs-number">24.1</span>MB/s] | |
| tmpydqh6od1.parquet: <span class="hljs-number">100</span>%|██████████| <span class="hljs-number">50.9</span>M/<span class="hljs-number">50.9</span>M [<span class="hljs-number">00</span>:02<<span class="hljs-number">00</span>:<span class="hljs-number">00</span>, <span class="hljs-number">23.8</span>MB/s] | |
| tmp52f2t8tu.parquet: <span class="hljs-number">100</span>%|██████████| <span class="hljs-number">50.5</span>M/<span class="hljs-number">50.5</span>M [<span class="hljs-number">00</span>:02<<span class="hljs-number">00</span>:<span class="hljs-number">00</span>, <span class="hljs-number">23.7</span>MB/s] | |
| tmpg7egv3ye.parquet: <span class="hljs-number">100</span>%|██████████| <span class="hljs-number">50.1</span>M/<span class="hljs-number">50.1</span>M [<span class="hljs-number">00</span>:06<<span class="hljs-number">00</span>:<span class="hljs-number">00</span>, <span class="hljs-number">7.68</span>MB/s] | |
| tmp2s0fq2hm.parquet: <span class="hljs-number">100</span>%|██████████| <span class="hljs-number">50.8</span>M/<span class="hljs-number">50.8</span>M [<span class="hljs-number">00</span>:02<<span class="hljs-number">00</span>:<span class="hljs-number">00</span>, <span class="hljs-number">18.1</span>MB/s] | |
| tmpmj97ab30.parquet: <span class="hljs-number">100</span>%|██████████| <span class="hljs-number">71.3</span>M/<span class="hljs-number">71.3</span>M [<span class="hljs-number">00</span>:02<<span class="hljs-number">00</span>:<span class="hljs-number">00</span>, <span class="hljs-number">23.9</span>MB/s]<!-- HTML_TAG_END --></pre></div> <div class="flex justify-center" data-svelte-h="svelte-tmflad"><img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datasets-spark-infinity-instruct-chinese-only-min.png"> <img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datasets-spark-infinity-instruct-chinese-only-dark-min.png"></div> <h2 class="relative group"><a id="run-in-jupyterlab-on-hugging-face-spaces" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#run-in-jupyterlab-on-hugging-face-spaces"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Run in JupyterLab on Hugging Face Spaces</span></h2> <p data-svelte-h="svelte-172t2p0">You can duplicate the <a 
href="https://huggingface.co/spaces/lhoestq/Spark-on-HF-JupyterLab" rel="nofollow">Spark on HF JupyterLab</a> Space to get a Notebook with PySpark and those helper functions pre-installed.</p> <p data-svelte-h="svelte-1rabvow">Click on “Duplicate Space”, choose a name for your Space, select your hardware and you are ready:</p> <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/spark-on-hf-jupyterlab-screenshot-min.png"> <p></p> | |
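<p>A minimal sketch of how the <code>_commit</code> step of <code>write_parquet</code> in the Write section names shards and groups them into commits, assuming the same <code>{shard_idx:05d}</code> placeholder and the same default of 50 operations per commit. <code>plan_commits</code> is a hypothetical name introduced here for illustration; it only computes the plan and uploads nothing.</p>

```python
import math


def plan_commits(num_files, path_template, max_operations_per_commit=50):
    """Return a list of (commit_message, paths) mirroring the batching in `_commit`.

    Hypothetical helper for illustration only: it performs no upload.
    """
    # Name each shard by substituting its index into the literal placeholder.
    # A single-file path has no placeholder, so it is left unchanged.
    paths = [
        path_template.replace("{shard_idx:05d}", f"{shard_idx:05d}")
        for shard_idx in range(num_files)
    ]
    num_commits = math.ceil(len(paths) / max_operations_per_commit)
    plan = []
    for i in range(num_commits):
        # Slice out the files that belong to commit number i.
        chunk = paths[i * max_operations_per_commit : (i + 1) * max_operations_per_commit]
        # Only add a "part" suffix when more than one commit is needed.
        message = "Upload using PySpark" + (
            f" (part {i:05d}-of-{num_commits:05d})" if num_commits > 1 else ""
        )
        plan.append((message, chunk))
    return plan


if __name__ == "__main__":
    for message, chunk in plan_commits(120, "part-{shard_idx:05d}.parquet"):
        print(message, len(chunk), chunk[0], chunk[-1])
```

<p>With 120 shards, for instance, this plan contains three commits of 50, 50 and 20 files, so a large DataFrame ends up as several commits rather than one.</p>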