<meta charset="utf-8" /><meta name="hf:doc:metadata" content="{&quot;title&quot;:&quot;Spark&quot;,&quot;local&quot;:&quot;spark&quot;,&quot;sections&quot;:[{&quot;title&quot;:&quot;Installation&quot;,&quot;local&quot;:&quot;installation&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Authentication&quot;,&quot;local&quot;:&quot;authentication&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Read&quot;,&quot;local&quot;:&quot;read&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Write&quot;,&quot;local&quot;:&quot;write&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Run in JupyterLab on Hugging Face Spaces&quot;,&quot;local&quot;:&quot;run-in-jupyterlab-on-hugging-face-spaces&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2}],&quot;depth&quot;:1}">
<link href="/docs/hub/main/en/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload">
<link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/entry/start.d0cd5065.js">
<link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/chunks/scheduler.d6170356.js">
<link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/chunks/singletons.d032f1eb.js">
<link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/chunks/paths.752f1c6b.js">
<link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/entry/app.b6abe3c1.js">
<link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/chunks/index.fcd4cc08.js">
<link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/nodes/0.f045427f.js">
<link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/nodes/40.7c2a8617.js">
<link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/chunks/CodeBlock.7b16bdef.js">
<link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/chunks/EditOnGithub.da2b595c.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{&quot;title&quot;:&quot;Spark&quot;,&quot;local&quot;:&quot;spark&quot;,&quot;sections&quot;:[{&quot;title&quot;:&quot;Installation&quot;,&quot;local&quot;:&quot;installation&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Authentication&quot;,&quot;local&quot;:&quot;authentication&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Read&quot;,&quot;local&quot;:&quot;read&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Write&quot;,&quot;local&quot;:&quot;write&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Run in JupyterLab on Hugging Face Spaces&quot;,&quot;local&quot;:&quot;run-in-jupyterlab-on-hugging-face-spaces&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2}],&quot;depth&quot;:1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h1 class="relative group"><a id="spark" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#spark"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" 
fill="currentColor"></path></svg></span></a> <span>Spark</span></h1> <p data-svelte-h="svelte-s0r53o">Spark enables real-time, large-scale data processing in a distributed environment.</p> <p data-svelte-h="svelte-12z8vci">In particular, you can use <code>huggingface_hub</code> to access Hugging Face dataset repositories in PySpark.</p> <h2 class="relative group"><a id="installation" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#installation"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Installation</span></h2> <p data-svelte-h="svelte-13h0lcm">To read from and write to Hugging Face URLs (e.g.
<code>hf://datasets/username/dataset/data.parquet</code>), you need to install the <code>huggingface_hub</code> library:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->pip <span class="hljs-keyword">install</span> huggingface_hub<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-ml7hwl">You also need to install <code>pyarrow</code> to read/write Parquet / JSON / CSV / etc. 
files using the filesystem API provided by <code>huggingface_hub</code>:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->pip <span class="hljs-keyword">install</span> pyarrow<!-- HTML_TAG_END --></pre></div> <h2 class="relative group"><a id="authentication" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#authentication"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 
1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Authentication</span></h2> <p data-svelte-h="svelte-aqw7kt">You need to authenticate to Hugging Face to read private/gated dataset repositories or to write to your dataset repositories.</p> <p data-svelte-h="svelte-pdivv8">You can use the CLI for example:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->huggingface-<span class="hljs-keyword">cli</span> login<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1pzllua">It’s also possible 
to provide your Hugging Face token with the <code>HF_TOKEN</code> environment variable or by passing the <code>storage_options</code> parameter to the helper functions below:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->storage_options = {<span class="hljs-string">&quot;token&quot;</span>: <span class="hljs-string">&quot;hf_xxx&quot;</span>}<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-ygg3x4">For more details about authentication, check out <a href="https://huggingface.co/docs/huggingface_hub/quick-start#authentication" rel="nofollow">this guide</a>.</p> <h2 class="relative group"><a id="read" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#read"><span><svg class="" 
xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Read</span></h2> <p data-svelte-h="svelte-7gpw7a">PySpark doesn’t officially support Hugging Face paths, so we provide a helper function to read datasets in a distributed manner.</p> <p data-svelte-h="svelte-4fzrqm">For example, you can read Parquet files from Hugging Face in an optimized way using PyArrow by defining this <code>read_parquet</code> helper function:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded 
font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> functools <span class="hljs-keyword">import</span> partial
<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> Iterator, <span class="hljs-type">Optional</span>, <span class="hljs-type">Union</span>
<span class="hljs-keyword">import</span> pyarrow <span class="hljs-keyword">as</span> pa
<span class="hljs-keyword">import</span> pyarrow.parquet <span class="hljs-keyword">as</span> pq
<span class="hljs-keyword">from</span> huggingface_hub <span class="hljs-keyword">import</span> HfFileSystem
<span class="hljs-keyword">from</span> pyspark.sql.dataframe <span class="hljs-keyword">import</span> DataFrame
<span class="hljs-keyword">from</span> pyspark.sql.pandas.types <span class="hljs-keyword">import</span> from_arrow_schema
<span class="hljs-keyword">def</span> <span class="hljs-title function_">_read</span>(<span class="hljs-params">iterator: Iterator[pa.RecordBatch], columns: <span class="hljs-type">Optional</span>[<span class="hljs-built_in">list</span>[<span class="hljs-built_in">str</span>]], filters: <span class="hljs-type">Optional</span>[<span class="hljs-type">Union</span>[<span class="hljs-built_in">list</span>[<span class="hljs-built_in">tuple</span>], <span class="hljs-built_in">list</span>[<span class="hljs-built_in">list</span>[<span class="hljs-built_in">tuple</span>]]]], **kwargs</span>) -&gt; Iterator[pa.RecordBatch]:
    <span class="hljs-keyword">for</span> batch <span class="hljs-keyword">in</span> iterator:
        paths = batch[<span class="hljs-number">0</span>].to_pylist()
        ds = pq.ParquetDataset(paths, **kwargs)
        <span class="hljs-keyword">yield</span> <span class="hljs-keyword">from</span> ds._dataset.to_batches(columns=columns, <span class="hljs-built_in">filter</span>=pq.filters_to_expression(filters) <span class="hljs-keyword">if</span> filters <span class="hljs-keyword">else</span> <span class="hljs-literal">None</span>)


<span class="hljs-keyword">def</span> <span class="hljs-title function_">read_parquet</span>(<span class="hljs-params">
    path: <span class="hljs-built_in">str</span>,
    columns: <span class="hljs-type">Optional</span>[<span class="hljs-built_in">list</span>[<span class="hljs-built_in">str</span>]] = <span class="hljs-literal">None</span>,
    filters: <span class="hljs-type">Optional</span>[<span class="hljs-type">Union</span>[<span class="hljs-built_in">list</span>[<span class="hljs-built_in">tuple</span>], <span class="hljs-built_in">list</span>[<span class="hljs-built_in">list</span>[<span class="hljs-built_in">tuple</span>]]]] = <span class="hljs-literal">None</span>,
    **kwargs,
</span>) -&gt; DataFrame:
    <span class="hljs-string">&quot;&quot;&quot;
    Loads Parquet files from Hugging Face using PyArrow, returning a PySpark `DataFrame`.

    It reads Parquet files in a distributed manner.

    Access private or gated repositories using `huggingface-cli login` or by passing a token
    using the `storage_options` argument: `storage_options={&quot;token&quot;: &quot;hf_xxx&quot;}`

    Parameters
    ----------
    path : str
        Path to the file. Prefix with a protocol like `hf://` to read from Hugging Face.
        You can read from multiple files if you pass a globstring.
    columns : list, default None
        If not None, only these columns will be read from the file.
    filters : List[Tuple] or List[List[Tuple]], default None
        To filter out data.
        Filter syntax: [[(column, op, val), ...],...]
        where op is [==, =, &gt;, &gt;=, &lt;, &lt;=, !=, in, not in]
        The innermost tuples are transposed into a set of filters applied
        through an `AND` operation.
        The outer list combines these sets of filters through an `OR`
        operation.
        A single list of tuples can also be used, meaning that no `OR`
        operation between sets of filters is to be conducted.
    **kwargs
        Any additional kwargs are passed to pyarrow.parquet.ParquetDataset.

    Returns
    -------
    DataFrame
        DataFrame based on parquet file.

    Examples
    --------
    &gt;&gt;&gt; path = &quot;hf://datasets/username/dataset/data.parquet&quot;
    &gt;&gt;&gt; pd.DataFrame({&quot;foo&quot;: range(5), &quot;bar&quot;: range(5, 10)}).to_parquet(path)
    &gt;&gt;&gt; read_parquet(path).show()
    +---+---+
    |foo|bar|
    +---+---+
    |  0|  5|
    |  1|  6|
    |  2|  7|
    |  3|  8|
    |  4|  9|
    +---+---+
    &gt;&gt;&gt; read_parquet(path, columns=[&quot;bar&quot;]).show()
    +---+
    |bar|
    +---+
    |  5|
    |  6|
    |  7|
    |  8|
    |  9|
    +---+
    &gt;&gt;&gt; sel = [(&quot;foo&quot;, &quot;&gt;&quot;, 2)]
    &gt;&gt;&gt; read_parquet(path, filters=sel).show()
    +---+---+
    |foo|bar|
    +---+---+
    |  3|  8|
    |  4|  9|
    +---+---+
    &quot;&quot;&quot;</span>
    filesystem: HfFileSystem = kwargs.pop(<span class="hljs-string">&quot;filesystem&quot;</span>) <span class="hljs-keyword">if</span> <span class="hljs-string">&quot;filesystem&quot;</span> <span class="hljs-keyword">in</span> kwargs <span class="hljs-keyword">else</span> HfFileSystem(**kwargs.pop(<span class="hljs-string">&quot;storage_options&quot;</span>, {}))
    paths = filesystem.glob(path)
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> paths:
        <span class="hljs-keyword">raise</span> FileNotFoundError(<span class="hljs-string">f&quot;Couldn&#x27;t find any file at <span class="hljs-subst">{path}</span>&quot;</span>)
    <span class="hljs-comment"># `spark` must be an active SparkSession defined in the calling scope</span>
    rdd = spark.sparkContext.parallelize([{<span class="hljs-string">&quot;path&quot;</span>: path} <span class="hljs-keyword">for</span> path <span class="hljs-keyword">in</span> paths], <span class="hljs-built_in">len</span>(paths))
    df = spark.createDataFrame(rdd)
    arrow_schema = pq.read_schema(filesystem.<span class="hljs-built_in">open</span>(paths[<span class="hljs-number">0</span>]))
    schema = pa.schema([field <span class="hljs-keyword">for</span> field <span class="hljs-keyword">in</span> arrow_schema <span class="hljs-keyword">if</span> (columns <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span> <span class="hljs-keyword">or</span> field.name <span class="hljs-keyword">in</span> columns)], metadata=arrow_schema.metadata)
    <span class="hljs-keyword">return</span> df.mapInArrow(
        partial(_read, columns=columns, filters=filters, filesystem=filesystem, schema=arrow_schema, **kwargs),
        from_arrow_schema(schema),
    )<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1vx9tic">Here is how we can use this on the <a href="https://huggingface.co/datasets/BAAI/Infinity-Instruct" rel="nofollow">BAAI/Infinity-Instruct</a> dataset.
It is a gated repository, so users have to accept the terms of use before accessing it.</p> <div class="flex justify-center" data-svelte-h="svelte-1ct11n9"><img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datasets-spark-infinity-instruct-7M-min.png"> <img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datasets-spark-infinity-instruct-7M-dark-min.png"></div> <p data-svelte-h="svelte-1p4gwvr">We use the <code>read_parquet</code> function to read data from the dataset, compute the number of dialogues per language, and filter the dataset.</p> <p data-svelte-h="svelte-19d7o0j">After logging in to access the gated repository, we can run:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span 
class="hljs-meta">&gt;&gt;&gt; </span><span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession
<span class="hljs-meta">&gt;&gt;&gt; </span>spark = SparkSession.builder.appName(<span class="hljs-string">&quot;demo&quot;</span>).getOrCreate()
<span class="hljs-meta">&gt;&gt;&gt; </span>df = read_parquet(<span class="hljs-string">&quot;hf://datasets/BAAI/Infinity-Instruct/7M/*.parquet&quot;</span>)
<span class="hljs-meta">&gt;&gt;&gt; </span>df.show()
+---+----------------------------+-----+----------+--------------------+
| <span class="hljs-built_in">id</span>|               conversations|label|langdetect|              source|
+---+----------------------------+-----+----------+--------------------+
|  <span class="hljs-number">0</span>|        [{human, <span class="hljs-keyword">def</span> <span class="hljs-title function_">exti</span>...|     |        en|      code_exercises|
|  <span class="hljs-number">1</span>|        [{human, See the ...|     |        en|                flan|
|  <span class="hljs-number">2</span>|        [{human, This <span class="hljs-keyword">is</span> ...|     |        en|                flan|
|  <span class="hljs-number">3</span>|        [{human, If you d...|     |        en|                flan|
|  <span class="hljs-number">4</span>|        [{human, In a Uni...|     |        en|                flan|
|  <span class="hljs-number">5</span>|        [{human, Read the...|     |        en|                flan|
|  <span class="hljs-number">6</span>|        [{human, You are ...|     |        en|          code_bagel|
|  <span class="hljs-number">7</span>|        [{human, I want y...|     |        en|          Subjective|
|  <span class="hljs-number">8</span>|        [{human, Given th...|     |        en|                flan|
|  <span class="hljs-number">9</span>|[{human, 因果联系原则是法...|     |     zh-cn|          Subjective|
| <span class="hljs-number">10</span>|        [{human, Provide ...|     |        en|self-oss-instruct...|
| <span class="hljs-number">11</span>|        [{human, The univ...|     |        en|                flan|
| <span class="hljs-number">12</span>|        [{human, Q: I am ...|     |        en|                flan|
| <span class="hljs-number">13</span>|        [{human, What <span class="hljs-keyword">is</span> ...|     |        en|      OpenHermes-<span class="hljs-number">2.5</span>|
| <span class="hljs-number">14</span>|        [{human, In react...|     |        en|                flan|
| <span class="hljs-number">15</span>|        [{human, Write Py...|     |        en|      code_exercises|
| <span class="hljs-number">16</span>|        [{human, Find the...|     |        en|            MetaMath|
| <span class="hljs-number">17</span>|        [{human, Three of...|     |        en|            MetaMath|
| <span class="hljs-number">18</span>|        [{human, Chandra ...|     |        en|            MetaMath|
| <span class="hljs-number">19</span>|[{human, 用经济学知识分析...|     |     zh-cn|          Subjective|
+---+----------------------------+-----+----------+--------------------+<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-q1u30o">To compute the number of dialogues per language we run this code.
The <code>columns</code> argument is useful for loading only the columns we need, since PySpark doesn’t push column selection or filters down to the Parquet reader in this case.
There is also a <code>filters</code> argument to only load data with values within a certain range.</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-meta">&gt;&gt;&gt; </span>df_langdetect_only = read_parquet(<span class="hljs-string">&quot;hf://datasets/BAAI/Infinity-Instruct/7M/*.parquet&quot;</span>, columns=[<span class="hljs-string">&quot;langdetect&quot;</span>])
<span class="hljs-meta">&gt;&gt;&gt; </span>df_langdetect_only.groupBy(<span class="hljs-string">&quot;langdetect&quot;</span>).count().show()
+----------+-------+
|langdetect|  count|
+----------+-------+
|        en|<span class="hljs-number">6697793</span>|
|     zh-cn| <span class="hljs-number">751313</span>|
+----------+-------+<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-e2gcj5">To filter the dataset and only keep dialogues in Chinese:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-meta">&gt;&gt;&gt; </span>criteria = [(<span class="hljs-string">&quot;langdetect&quot;</span>, <span class="hljs-string">&quot;=&quot;</span>, <span class="hljs-string">&quot;zh-cn&quot;</span>)]
<span class="hljs-meta">&gt;&gt;&gt; </span>df_chinese_only = read_parquet(<span class="hljs-string">&quot;hf://datasets/BAAI/Infinity-Instruct/7M/*.parquet&quot;</span>, filters=criteria)
<span class="hljs-meta">&gt;&gt;&gt; </span>df_chinese_only.show()
+---+----------------------------+-----+----------+----------+
| <span class="hljs-built_in">id</span>|               conversations|label|langdetect|    source|
+---+----------------------------+-----+----------+----------+
|  <span class="hljs-number">9</span>|[{human, 因果联系原则是法...|     |     zh-cn|Subjective|
| <span class="hljs-number">19</span>|[{human, 用经济学知识分析...|     |     zh-cn|Subjective|
| <span class="hljs-number">38</span>| [{human, 某个考试共有A、...|     |     zh-cn|Subjective|
| <span class="hljs-number">39</span>|[{human, 撰写一篇关于斐波...|     |     zh-cn|Subjective|
| <span class="hljs-number">57</span>|[{human, 总结世界历史上的...|     |     zh-cn|Subjective|
| <span class="hljs-number">61</span>|[{human, 生成一则广告词。...|     |     zh-cn|Subjective|
| <span class="hljs-number">66</span>|[{human, 描述一个有效的团...|     |     zh-cn|Subjective|
| <span class="hljs-number">94</span>|[{human, 如果比利和蒂芙尼...|     |     zh-cn|Subjective|
|<span class="hljs-number">102</span>|[{human, 生成一句英文名言...|     |     zh-cn|Subjective|
|<span class="hljs-number">106</span>|[{human, 写一封感谢信,感...|     |     zh-cn|Subjective|
|<span class="hljs-number">118</span>| [{human, 生成一个故事。}...|     |     zh-cn|Subjective|
|<span class="hljs-number">174</span>|[{human, 高胆固醇水平的后...|     |     zh-cn|Subjective|
|<span class="hljs-number">180</span>|[{human, 基于以下角色信息...|     |     zh-cn|Subjective|
|<span class="hljs-number">192</span>|[{human, 请写一篇文章,概...|     |     zh-cn|Subjective|
|<span class="hljs-number">221</span>|[{human, 以诗歌形式表达对...|     |     zh-cn|Subjective|
|<span class="hljs-number">228</span>|[{human, 根据给定的指令,...|     |     zh-cn|Subjective|
|<span class="hljs-number">236</span>|[{human, 打开一个新的生成...|     |     zh-cn|Subjective|
|<span class="hljs-number">260</span>|[{human, 生成一个有关未来...|     |     zh-cn|Subjective|
|<span class="hljs-number">268</span>|[{human, 如果有一定数量的...|     |     zh-cn|Subjective|
|<span class="hljs-number">273</span>| [{human, 题目:小明有<span class="hljs-number">5</span>个...|     |     zh-cn|Subjective|
+---+----------------------------+-----+----------+----------+<!-- HTML_TAG_END --></pre></div> <h2 class="relative group"><a id="write" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#write"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Write</span></h2> <p data-svelte-h="svelte-nrb4bc">We also provide a helper function to write datasets in a distributed manner to a Hugging Face repository.</p> <p data-svelte-h="svelte-104rxy3">You can write a PySpark DataFrame to Hugging Face using this <code>write_parquet</code> helper function based on the <code>huggingface_hub</code> API.
In particular it uses the <code>preupload_lfs_files</code> utility to upload Parquet files in parallel in a distributed manner, and only commits the files once they’re all uploaded:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> math
<span class="hljs-keyword">import</span> pickle
<span class="hljs-keyword">import</span> tempfile
<span class="hljs-keyword">from</span> functools <span class="hljs-keyword">import</span> partial
<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> Iterator, <span class="hljs-type">Optional</span>
<span class="hljs-keyword">import</span> pyarrow <span class="hljs-keyword">as</span> pa
<span class="hljs-keyword">import</span> pyarrow.parquet <span class="hljs-keyword">as</span> pq
<span class="hljs-keyword">from</span> huggingface_hub <span class="hljs-keyword">import</span> CommitOperationAdd, HfFileSystem
<span class="hljs-keyword">from</span> pyspark.sql.dataframe <span class="hljs-keyword">import</span> DataFrame
<span class="hljs-keyword">from</span> pyspark.sql.pandas.types <span class="hljs-keyword">import</span> from_arrow_schema, to_arrow_schema
<span class="hljs-keyword">def</span> <span class="hljs-title function_">_preupload</span>(<span class="hljs-params">iterator: Iterator[pa.RecordBatch], path: <span class="hljs-built_in">str</span>, schema: pa.Schema, filesystem: HfFileSystem, row_group_size: <span class="hljs-type">Optional</span>[<span class="hljs-built_in">int</span>] = <span class="hljs-literal">None</span>, **kwargs</span>) -&gt; Iterator[pa.RecordBatch]:
    resolved_path = filesystem.resolve_path(path)
    <span class="hljs-comment"># Write this partition to a temporary Parquet file, then preupload it (no commit yet)</span>
    <span class="hljs-keyword">with</span> tempfile.NamedTemporaryFile(suffix=<span class="hljs-string">&quot;.parquet&quot;</span>) <span class="hljs-keyword">as</span> temp_file:
        <span class="hljs-keyword">with</span> pq.ParquetWriter(temp_file.name, schema=schema, **kwargs) <span class="hljs-keyword">as</span> writer:
            <span class="hljs-keyword">for</span> batch <span class="hljs-keyword">in</span> iterator:
                writer.write_batch(batch, row_group_size=row_group_size)
        addition = CommitOperationAdd(path_in_repo=temp_file.name, path_or_fileobj=temp_file.name)
        filesystem._api.preupload_lfs_files(repo_id=resolved_path.repo_id, additions=[addition], repo_type=resolved_path.repo_type, revision=resolved_path.revision)
    <span class="hljs-comment"># Pass the pickled operation on to the commit step</span>
    <span class="hljs-keyword">yield</span> pa.record_batch({<span class="hljs-string">&quot;addition&quot;</span>: [pickle.dumps(addition)]}, schema=pa.schema({<span class="hljs-string">&quot;addition&quot;</span>: pa.binary()}))
<span class="hljs-keyword">def</span> <span class="hljs-title function_">_commit</span>(<span class="hljs-params">iterator: Iterator[pa.RecordBatch], path: <span class="hljs-built_in">str</span>, filesystem: HfFileSystem, max_operations_per_commit=<span class="hljs-number">50</span></span>) -&gt; Iterator[pa.RecordBatch]:
    resolved_path = filesystem.resolve_path(path)
    additions: <span class="hljs-built_in">list</span>[CommitOperationAdd] = [pickle.loads(addition) <span class="hljs-keyword">for</span> addition <span class="hljs-keyword">in</span> pa.Table.from_batches(iterator, schema=pa.schema({<span class="hljs-string">&quot;addition&quot;</span>: pa.binary()}))[<span class="hljs-number">0</span>].to_pylist()]
    num_commits = math.ceil(<span class="hljs-built_in">len</span>(additions) / max_operations_per_commit)
    <span class="hljs-comment"># Give each preuploaded file its final path in the repository</span>
    <span class="hljs-keyword">for</span> shard_idx, addition <span class="hljs-keyword">in</span> <span class="hljs-built_in">enumerate</span>(additions):
        addition.path_in_repo = resolved_path.path_in_repo.replace(<span class="hljs-string">&quot;{shard_idx:05d}&quot;</span>, <span class="hljs-string">f&quot;<span class="hljs-subst">{shard_idx:05d}</span>&quot;</span>)
    <span class="hljs-comment"># Commit the files in batches of at most max_operations_per_commit operations</span>
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(<span class="hljs-number">0</span>, num_commits):
        operations = additions[i * max_operations_per_commit : (i + <span class="hljs-number">1</span>) * max_operations_per_commit]
        commit_message = <span class="hljs-string">&quot;Upload using PySpark&quot;</span> + (<span class="hljs-string">f&quot; (part <span class="hljs-subst">{i:05d}</span>-of-<span class="hljs-subst">{num_commits:05d}</span>)&quot;</span> <span class="hljs-keyword">if</span> num_commits &gt; <span class="hljs-number">1</span> <span class="hljs-keyword">else</span> <span class="hljs-string">&quot;&quot;</span>)
        filesystem._api.create_commit(repo_id=resolved_path.repo_id, repo_type=resolved_path.repo_type, revision=resolved_path.revision, operations=operations, commit_message=commit_message)
        <span class="hljs-keyword">yield</span> pa.record_batch({<span class="hljs-string">&quot;path&quot;</span>: [addition.path_in_repo <span class="hljs-keyword">for</span> addition <span class="hljs-keyword">in</span> operations]}, schema=pa.schema({<span class="hljs-string">&quot;path&quot;</span>: pa.string()}))
<span class="hljs-keyword">def</span> <span class="hljs-title function_">write_parquet</span>(<span class="hljs-params">df: DataFrame, path: <span class="hljs-built_in">str</span>, **kwargs</span>) -&gt; <span class="hljs-literal">None</span>:
    <span class="hljs-string">&quot;&quot;&quot;
    Write Parquet files to Hugging Face using PyArrow.

    It uploads Parquet files in a distributed manner in two steps:

    1. Preupload the Parquet files in parallel in a distributed manner
    2. Commit the preuploaded files

    Authenticate using `huggingface-cli login` or by passing a token
    using the `storage_options` argument: `storage_options={&quot;token&quot;: &quot;hf_xxx&quot;}`

    Parameters
    ----------
    df : DataFrame
        PySpark DataFrame to write.
    path : str
        Path of the file or directory. Prefix with a protocol like `hf://` to write to Hugging Face.
        It writes Parquet files in the form &quot;part-xxxxx.parquet&quot;, or to a single file if `path` ends with &quot;.parquet&quot;.
    **kwargs
        Any additional kwargs are passed to pyarrow.parquet.ParquetWriter.

    Returns
    -------
    None

    Examples
    --------
    &gt;&gt;&gt; df = spark.createDataFrame(pd.DataFrame({&quot;foo&quot;: range(5), &quot;bar&quot;: range(5, 10)}))
    &gt;&gt;&gt; # Save to one file
    &gt;&gt;&gt; write_parquet(df, &quot;hf://datasets/username/dataset/data.parquet&quot;)
    &gt;&gt;&gt; # OR save to a directory (possibly in many files)
    &gt;&gt;&gt; write_parquet(df, &quot;hf://datasets/username/dataset&quot;)
    &quot;&quot;&quot;</span>
    filesystem: HfFileSystem = kwargs.pop(<span class="hljs-string">&quot;filesystem&quot;</span>, HfFileSystem(**kwargs.pop(<span class="hljs-string">&quot;storage_options&quot;</span>, {})))
    <span class="hljs-keyword">if</span> path.endswith(<span class="hljs-string">&quot;.parquet&quot;</span>) <span class="hljs-keyword">or</span> path.endswith(<span class="hljs-string">&quot;.pq&quot;</span>):
        df = df.coalesce(<span class="hljs-number">1</span>)
    <span class="hljs-keyword">else</span>:
        path += <span class="hljs-string">&quot;/part-{shard_idx:05d}.parquet&quot;</span>
    df.mapInArrow(
        partial(_preupload, path=path, schema=to_arrow_schema(df.schema), filesystem=filesystem, **kwargs),
        from_arrow_schema(pa.schema({<span class="hljs-string">&quot;addition&quot;</span>: pa.binary()})),
    ).repartition(<span class="hljs-number">1</span>).mapInArrow(
        partial(_commit, path=path, filesystem=filesystem),
        from_arrow_schema(pa.schema({<span class="hljs-string">&quot;path&quot;</span>: pa.string()})),
).collect()<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1d35p3o">Here is how we can use this function to write the filtered version of the <a href="https://huggingface.co/datasets/BAAI/Infinity-Instruct" rel="nofollow">BAAI/Infinity-Instruct</a> dataset back to Hugging Face.</p> <p data-svelte-h="svelte-wkzo7">First you need to <a href="https://huggingface.co/new-dataset" rel="nofollow">create a dataset repository</a>, e.g. <code>username/Infinity-Instruct-Chinese-Only</code> (you can set it to private if you want).
Then, make sure you are authenticated and you can run:</p> <div class="code-block relative"><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-meta">&gt;&gt;&gt; </span>write_parquet(df_chinese_only, <span class="hljs-string">&quot;hf://datasets/username/Infinity-Instruct-Chinese-Only&quot;</span>)
tmph9jwu9py.parquet: <span class="hljs-number">100</span>%|██████████| <span class="hljs-number">50.5</span>M/<span class="hljs-number">50.5</span>M [<span class="hljs-number">00</span>:03&lt;<span class="hljs-number">00</span>:<span class="hljs-number">00</span>, <span class="hljs-number">14.6</span>MB/s]
tmp0oqt99nc.parquet: <span class="hljs-number">100</span>%|██████████| <span class="hljs-number">50.8</span>M/<span class="hljs-number">50.8</span>M [<span class="hljs-number">00</span>:02&lt;<span class="hljs-number">00</span>:<span class="hljs-number">00</span>, <span class="hljs-number">17.9</span>MB/s]
tmpgnizkwqp.parquet: <span class="hljs-number">100</span>%|██████████| <span class="hljs-number">50.5</span>M/<span class="hljs-number">50.5</span>M [<span class="hljs-number">00</span>:02&lt;<span class="hljs-number">00</span>:<span class="hljs-number">00</span>, <span class="hljs-number">19.6</span>MB/s]
tmpanm04k4n.parquet: <span class="hljs-number">100</span>%|██████████| <span class="hljs-number">51.4</span>M/<span class="hljs-number">51.4</span>M [<span class="hljs-number">00</span>:02&lt;<span class="hljs-number">00</span>:<span class="hljs-number">00</span>, <span class="hljs-number">22.9</span>MB/s]
tmp14uy9oqb.parquet: <span class="hljs-number">100</span>%|██████████| <span class="hljs-number">50.4</span>M/<span class="hljs-number">50.4</span>M [<span class="hljs-number">00</span>:02&lt;<span class="hljs-number">00</span>:<span class="hljs-number">00</span>, <span class="hljs-number">23.0</span>MB/s]
tmpcp8t_qdl.parquet: <span class="hljs-number">100</span>%|██████████| <span class="hljs-number">50.4</span>M/<span class="hljs-number">50.4</span>M [<span class="hljs-number">00</span>:02&lt;<span class="hljs-number">00</span>:<span class="hljs-number">00</span>, <span class="hljs-number">23.5</span>MB/s]
tmpjui5mns8.parquet: <span class="hljs-number">100</span>%|██████████| <span class="hljs-number">50.3</span>M/<span class="hljs-number">50.3</span>M [<span class="hljs-number">00</span>:02&lt;<span class="hljs-number">00</span>:<span class="hljs-number">00</span>, <span class="hljs-number">24.1</span>MB/s]
tmpydqh6od1.parquet: <span class="hljs-number">100</span>%|██████████| <span class="hljs-number">50.9</span>M/<span class="hljs-number">50.9</span>M [<span class="hljs-number">00</span>:02&lt;<span class="hljs-number">00</span>:<span class="hljs-number">00</span>, <span class="hljs-number">23.8</span>MB/s]
tmp52f2t8tu.parquet: <span class="hljs-number">100</span>%|██████████| <span class="hljs-number">50.5</span>M/<span class="hljs-number">50.5</span>M [<span class="hljs-number">00</span>:02&lt;<span class="hljs-number">00</span>:<span class="hljs-number">00</span>, <span class="hljs-number">23.7</span>MB/s]
tmpg7egv3ye.parquet: <span class="hljs-number">100</span>%|██████████| <span class="hljs-number">50.1</span>M/<span class="hljs-number">50.1</span>M [<span class="hljs-number">00</span>:06&lt;<span class="hljs-number">00</span>:<span class="hljs-number">00</span>, <span class="hljs-number">7.68</span>MB/s]
tmp2s0fq2hm.parquet: <span class="hljs-number">100</span>%|██████████| <span class="hljs-number">50.8</span>M/<span class="hljs-number">50.8</span>M [<span class="hljs-number">00</span>:02&lt;<span class="hljs-number">00</span>:<span class="hljs-number">00</span>, <span class="hljs-number">18.1</span>MB/s]
tmpmj97ab30.parquet: <span class="hljs-number">100</span>%|██████████| <span class="hljs-number">71.3</span>M/<span class="hljs-number">71.3</span>M [<span class="hljs-number">00</span>:02&lt;<span class="hljs-number">00</span>:<span class="hljs-number">00</span>, <span class="hljs-number">23.9</span>MB/s]<!-- HTML_TAG_END --></pre></div> <div class="flex justify-center" data-svelte-h="svelte-tmflad"><img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datasets-spark-infinity-instruct-chinese-only-min.png"> <img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datasets-spark-infinity-instruct-chinese-only-dark-min.png"></div> <h2 class="relative group"><a id="run-in-jupyterlab-on-hugging-face-spaces" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#run-in-jupyterlab-on-hugging-face-spaces"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Run in JupyterLab on Hugging Face Spaces</span></h2> <p data-svelte-h="svelte-172t2p0">You can duplicate the <a 
href="https://huggingface.co/spaces/lhoestq/Spark-on-HF-JupyterLab" rel="nofollow">Spark on HF JupyterLab</a> Space to get a Notebook with PySpark and those helper functions pre-installed.</p> <p data-svelte-h="svelte-1rabvow">Click on “Duplicate Space”, choose a name for your Space, select your hardware and you are ready:</p> <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/spark-on-hf-jupyterlab-screenshot-min.png"> <p></p>
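<p>To illustrate the commit step of the <code>write_parquet</code> helper from the Write section, here is a minimal sketch that runs without Spark or network access. It only mirrors the arithmetic of <code>_commit</code>: filling in the <code>{shard_idx:05d}</code> placeholder and grouping files into commits of at most <code>max_operations_per_commit</code> operations. The name <code>plan_commits</code> and its return shape are hypothetical and are not part of <code>huggingface_hub</code> or PySpark.</p>

```python
import math


def plan_commits(num_files: int, path_template: str, max_operations_per_commit: int = 50):
    """Hypothetical helper: return a list of (commit_message, paths) tuples.

    It reproduces the naming and batching logic of `_commit` without uploading anything.
    """
    # Fill in the shard index placeholder for each preuploaded file.
    paths = [path_template.replace("{shard_idx:05d}", f"{i:05d}") for i in range(num_files)]
    num_commits = math.ceil(len(paths) / max_operations_per_commit)
    plan = []
    for i in range(num_commits):
        # Each commit takes the next slice of at most max_operations_per_commit files.
        batch = paths[i * max_operations_per_commit : (i + 1) * max_operations_per_commit]
        message = "Upload using PySpark" + (f" (part {i:05d}-of-{num_commits:05d})" if num_commits > 1 else "")
        plan.append((message, batch))
    return plan


# Assumed example: 120 partitions written to a directory path.
plan = plan_commits(120, "part-{shard_idx:05d}.parquet")
for message, batch in plan:
    print(message, len(batch), batch[0], batch[-1])
```

<p>With the default of 50 operations per commit, 120 partitions would be split into three commits, while a single-file write (a path ending in <code>.parquet</code>) would produce one commit without the "part" suffix in its message.</p>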
<script>
{
__sveltekit_1vatp3t = {
assets: "/docs/hub/main/en",
base: "/docs/hub/main/en",
env: {}
};
const element = document.currentScript.parentElement;
const data = [null,null];
Promise.all([
import("/docs/hub/main/en/_app/immutable/entry/start.d0cd5065.js"),
import("/docs/hub/main/en/_app/immutable/entry/app.b6abe3c1.js")
]).then(([kit, app]) => {
kit.start(app, element, {
node_ids: [0, 40],
data,
form: null,
error: null
});
});
}
</script>
