Buckets:
| <meta charset="utf-8" /><meta name="hf:doc:metadata" content="{"title":"Données massives ? 🤗 <i> Datasets </i> à la rescousse !","local":"données-massives---i-datasets-i-à-la-rescousse-","sections":[{"title":"Qu’est-ce que <i> The Pile </i> ?","local":"quest-ce-que-i-the-pile-i-","sections":[],"depth":2},{"title":"La magie du <i> memory mapping </i>","local":"la-magie-du-i-memory-mapping-i","sections":[],"depth":2},{"title":"Jeux de données en continu","local":"jeux-de-données-en-continu","sections":[],"depth":2}],"depth":1}"> | |
| <link href="/docs/course/pr_1069/fr/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/fr/_app/immutable/entry/start.cea6db46.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/fr/_app/immutable/chunks/scheduler.37c15a92.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/fr/_app/immutable/chunks/singletons.2b29b91f.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/fr/_app/immutable/chunks/index.18351ede.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/fr/_app/immutable/chunks/paths.f6fdf97f.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/fr/_app/immutable/entry/app.3f6640b1.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/fr/_app/immutable/chunks/index.2bf4358c.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/fr/_app/immutable/nodes/0.b777de11.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/fr/_app/immutable/chunks/each.e59479a4.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/fr/_app/immutable/nodes/37.e4720ea5.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/fr/_app/immutable/chunks/Tip.363c041f.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/fr/_app/immutable/chunks/Youtube.1e50a667.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/fr/_app/immutable/chunks/CodeBlock.4e987730.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/fr/_app/immutable/chunks/CourseFloatingBanner.6add7356.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/fr/_app/immutable/chunks/getInferenceSnippets.24b50994.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{"title":"Données massives ? 🤗 <i> Datasets </i> à la rescousse !","local":"données-massives---i-datasets-i-à-la-rescousse-","sections":[{"title":"Qu’est-ce que <i> The Pile </i> ?","local":"quest-ce-que-i-the-pile-i-","sections":[],"depth":2},{"title":"La magie du <i> memory mapping </i>","local":"la-magie-du-i-memory-mapping-i","sections":[],"depth":2},{"title":"Jeux de données en continu","local":"jeux-de-données-en-continu","sections":[],"depth":2}],"depth":1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h1 class="relative group"><a id="données-massives---i-datasets-i-à-la-rescousse-" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#données-massives---i-datasets-i-à-la-rescousse-"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Données massives ? 🤗 <i> Datasets </i> à la rescousse !</span></h1> <div class="flex space-x-1 absolute z-10 right-0 top-0"><a href="https://discuss.huggingface.co/t/chapter-5-questions" target="_blank"><img alt="Ask a Question" class="!m-0" src="https://img.shields.io/badge/Ask%20a%20question-ffcb4c.svg?logo=data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgLTEgMTA0IDEwNiI+PGRlZnM+PHN0eWxlPi5jbHMtMXtmaWxsOiMyMzFmMjA7fS5jbHMtMntmaWxsOiNmZmY5YWU7fS5jbHMtM3tmaWxsOiMwMGFlZWY7fS5jbHMtNHtmaWxsOiMwMGE5NGY7fS5jbHMtNXtmaWxsOiNmMTVkMjI7fS5jbHMtNntmaWxsOiNlMzFiMjM7fTwvc3R5bGU+PC9kZWZzPjx0aXRsZT5EaXNjb3Vyc2VfbG9nbzwvdGl0bGU+PGcgaWQ9IkxheWVyXzIiPjxnIGlkPSJMYXllcl8zIj48cGF0aCBjbGFzcz0iY2xzLTEiIGQ9Ik01MS44NywwQzIzLjcxLDAsMCwyMi44MywwLDUxYzAsLjkxLDAsNTIuODEsMCw1Mi44MWw1MS44Ni0uMDVjMjguMTYsMCw1MS0yMy43MSw1MS01MS44N1M4MCwwLDUxLjg3LDBaIi8+PHBhdGggY2xhc3M9ImNscy0yIiBkPSJNNTIuMzcsMTkuNzRBMzEuNjIsMzEuNjIsMCwwLDAsMjQuNTgsNjYuNDFsLTUuNzIsMTguNEwzOS40LDgwLjE3YTMxLjYxLDMxLjYxLDAsMSwwLDEzLTYwLjQzWiIvPjxwYXRoIGNsYXNzPSJjbHMtMyIgZD0iTTc3LjQ1LDMyLjEyYTMxLjYsMzEuNiwwLDAsMS0zOC4wNSw0OEwxOC44Niw4NC44MmwyMC45MS0yLjQ3QTMxLjYsMzEuNiwwLDAsMCw3Ny40NSwzMi4xMloiLz48cGF0aCBjbGFzcz0iY2xzLTQiIGQ9Ik03MS42MywyNi4yOUEzMS42LDMxLjYsMCwwLDEsMzguOCw3OEwxOC44Niw4NC44MiwzOS40LDgwLjE3QTMxLjYsMzEuNiwwLDAsMCw3MS42MywyNi4yOVoiLz48cGF0aCBjbGFzcz0iY2xzLTUiIGQ9Ik0yNi40Nyw2Ny4xMWEzMS42MSwzMS42MSwwLDAsMSw1MS0zNUEzMS42MSwzMS42MSwwLDAsMCwyNC41OCw2Ni40MWwtNS43MiwxOC40WiIvPjxwYXRoIGNsYXNzPSJjbHMtNiIgZD0iTTI0LjU4LDY2LjQxQTMxLjYxLDMxLjYxLDAsMCwxLDcxLjYzLDI2LjI5YTMxLjYxLDMxLjYxLDAsMCwwLTQ5LDM5LjYzbC0zLjc2LDE4LjlaIi8+PC9nPjwvZz48L3N2Zz4="></a> <div class="relative colab-dropdown "> <button class=" " type="button"> <img alt="Open In Colab" class="!m-0" src="https://colab.research.google.com/assets/colab-badge.svg"> </button> </div> <div class="relative colab-dropdown "> <button class=" " type="button"> <img alt="Open In Studio Lab" class="!m-0" src="https://studiolab.sagemaker.aws/studiolab.svg"> </button> </div></div> <p data-svelte-h="svelte-1n0aq50">De nos jours, il n’est pas rare de travailler avec des jeux de données de plusieurs gigaoctets surtout si vous envisagez de pré-entraîner un <em>transformer</em> comme BERT ou GPT-2 à partir de zéro. Dans ces cas, même <em>charger</em> les données peut être un défi. Par exemple, le corpus WebText utilisé pour pré-entraîner GPT-2 se compose de plus de 8 millions de documents et de 40 Go de texte. Le charger dans la RAM de votre ordinateur portable est susceptible de lui donner une crise cardiaque !</p> <p data-svelte-h="svelte-z4dff3">Heureusement, 🤗 <em>Datasets</em> a été conçu pour surmonter ces limitations. Il vous libère des problèmes de gestion de la mémoire en traitant les jeux de données comme des fichiers <em>mappés en mémoire</em>, ainsi que des limites du disque dur en faisant du <em>streaming</em> sur les entrées dans un corpus.</p> <iframe class="w-full xl:w-4/6 h-80" src="https://www.youtube-nocookie.com/embed/JwISwTCPPWo" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> <p data-svelte-h="svelte-1ey6jlq">Dans cette section, nous allons explorer ces fonctionnalités de 🤗 <em>Datasets</em> avec un énorme corpus de 825 Go connu sous le nom de <a href="https://pile.eleuther.ai" rel="nofollow"><em>The Pile</em></a>. Commençons !</p> <h2 class="relative group"><a id="quest-ce-que-i-the-pile-i-" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#quest-ce-que-i-the-pile-i-"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Qu’est-ce que <i> The Pile </i> ?</span></h2> <p data-svelte-h="svelte-xwkwxb"><em>The Pile</em> est un corpus de texte en anglais créé par <a href="https://www.eleuther.ai" rel="nofollow">EleutherAI</a> pour entraîner des modèles de langage à grande échelle. Il comprend une gamme variée de jeux de données, couvrant des articles scientifiques, des référentiels de code GitHub et du texte Web filtré. Le corpus d’entraînement est disponible en <a href="https://the-eye.eu/public/AI/pile/" rel="nofollow">morceaux de 14 Go</a> et vous pouvez aussi télécharger plusieurs des <a href="https://the-eye.eu/public/AI/pile_preliminary_components/" rel="nofollow">composants individuels</a>. Commençons par jeter un coup d’œil au jeu de données <em>PubMed Abstracts</em>, qui est un corpus de résumés de 15 millions de publications biomédicales sur <a href="https://pubmed.ncbi.nlm.nih.gov/" rel="nofollow">PubMed</a>. Le jeu de données est au <a href="https://jsonlines.org" rel="nofollow">format JSON Lines</a> et est compressé à l’aide de la bibliothèque <code>zstandard</code>. Nous devons donc d’abord installer cette bibliothèque :</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->!pip install zstandard<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1djvo8q">Ensuite, nous pouvons charger le jeu de données en utilisant la méthode pour les fichiers distants que nous avons apprise dans <a href="/course/fr/chapter5/2">section 2</a> :</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> datasets <span class="hljs-keyword">import</span> load_dataset | |
| <span class="hljs-comment"># Cela prend quelques minutes à exécuter, alors allez prendre un thé ou un café en attendant :)</span> | |
| data_files = <span class="hljs-string">"https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"</span> | |
| pubmed_dataset = load_dataset(<span class="hljs-string">"json"</span>, data_files=data_files, split=<span class="hljs-string">"train"</span>) | |
| pubmed_dataset<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->Dataset({ | |
| features: [<span class="hljs-string">'meta'</span>, <span class="hljs-string">'text'</span>], | |
| num_rows: <span class="hljs-number">15518009</span> | |
| })<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-jgiig2">Nous pouvons voir qu’il y a 15 518 009 lignes et 2 colonnes dans notre jeu de données. C’est beaucoup !</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-1v02m">✎ Par défaut, 🤗 <em>Datasets</em> décompresse les fichiers nécessaires pour charger un jeu de données. Si vous souhaitez conserver de l’espace sur le disque dur, vous pouvez passer <code>DownloadConfig(delete_extracted=True)</code> à l’argument <code>download_config</code> de <code>load_dataset()</code>. Voir la <a href="https://huggingface.co/docs/datasets/package_reference/builder_classes#datasets.DownloadConfig" rel="nofollow">documentation</a> pour plus de détails.</p></div> <p data-svelte-h="svelte-1jodsbq">Inspectons le contenu du premier exemple :</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->pubmed_dataset[<span class="hljs-number">0</span>]<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->{<span class="hljs-string">'meta'</span>: {<span class="hljs-string">'pmid'</span>: <span class="hljs-number">11409574</span>, <span class="hljs-string">'language'</span>: <span class="hljs-string">'eng'</span>}, | |
| <span class="hljs-string">'text'</span>: <span class="hljs-string">'Epidemiology of hypoxaemia in children with acute lower respiratory infection.\nTo determine the prevalence of hypoxaemia in children aged under 5 years suffering acute lower respiratory infections (ALRI), the risk factors for hypoxaemia in children under 5 years of age with ALRI, and the association of hypoxaemia with an increased risk of dying in children of the same age ...'</span> | |
| <span class="hljs-comment"># Épidémiologie de l'hypoxémie chez les enfants souffrant d'une infection aiguë des voies respiratoires inférieures. Déterminer la prévalence de l'hypoxémie chez les enfants de moins de 5 ans souffrant d'une infection aiguë des voies respiratoires inférieures (IAVI), les facteurs de risque de l'hypoxémie chez les enfants de moins de 5 ans souffrant d'une IAVI, et l'association de l'hypoxémie à un risque accru de décès chez les enfants du même âge ...</span> | |
| }<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-qvjcta">Cela ressemble au résumé d’un article médical. Voyons maintenant combien de RAM nous avons utilisé pour charger le jeu de données !</p> <h2 class="relative group"><a id="la-magie-du-i-memory-mapping-i" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#la-magie-du-i-memory-mapping-i"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>La magie du <i> memory mapping </i></span></h2> <p data-svelte-h="svelte-1e4dhll">Un moyen simple de mesurer l’utilisation de la mémoire dans Python consiste à utiliser la bibliothèque <a href="https://psutil.readthedocs.io/en/latest/" rel="nofollow"><code>psutil</code></a> qui peut être installée avec <code>pip</code> comme suit :</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->!pip install psutil<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-e6epbe">Elle fournit une classe <code>Process</code> qui permet de vérifier l’utilisation de la mémoire du processus en cours :</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> psutil | |
| <span class="hljs-comment"># Process.memory_info est exprimé en octets, donc convertir en mégaoctets</span> | |
| <span class="hljs-built_in">print</span>(<span class="hljs-string">f"RAM used: <span class="hljs-subst">{psutil.Process().memory_info().rss / (<span class="hljs-number">1024</span> * <span class="hljs-number">1024</span>):<span class="hljs-number">.2</span>f}</span> MB"</span>)<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->RAM used: <span class="hljs-number">5678.33</span> MB<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-13cvybs">Ici, l’attribut <code>rss</code> fait référence à la <em>taille de l’ensemble résident</em>, qui est la fraction de mémoire qu’un processus occupe dans la RAM. Cette mesure inclut également la mémoire utilisée par l’interpréteur Python et les bibliothèques que nous avons chargées, de sorte que la quantité réelle de mémoire utilisée pour charger le jeu de données est un peu plus petite. À titre de comparaison, voyons la taille du jeu de données sur le disque en utilisant l’attribut <code>dataset_size</code>. Comme le résultat est exprimé en octets comme précédemment, nous devons le convertir manuellement en gigaoctets :</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-built_in">print</span>(<span class="hljs-string">f"Number of files in dataset : <span class="hljs-subst">{pubmed_dataset.dataset_size}</span>"</span>) | |
| size_gb = pubmed_dataset.dataset_size / (<span class="hljs-number">1024</span>**<span class="hljs-number">3</span>) | |
| <span class="hljs-built_in">print</span>(<span class="hljs-string">f"Dataset size (cache file) : <span class="hljs-subst">{size_gb:<span class="hljs-number">.2</span>f}</span> GB"</span>)<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->Number of files <span class="hljs-keyword">in</span> dataset : <span class="hljs-number">20979437051</span> | |
| Dataset size (cache file) : <span class="hljs-number">19.54</span> GB<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1807tcn">Malgré sa taille de près de 20 Go, nous pouvons charger et accéder au jeu de données avec beaucoup moins de RAM !</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-1osdilj">✏️ <strong>Essayez !</strong> Choisissez l’un des <a href="https://the-eye.eu/public/AI/pile_preliminary_components/" rel="nofollow">sous-ensembles</a> de <em>The Pile</em> qui est plus grand que la RAM de votre ordinateur portable ou de bureau. Chargez-le avec 🤗 <em>Datasets</em> et mesurez la quantité de RAM utilisée. Notez que pour obtenir une mesure précise, vous devrez le faire dans un nouveau processus. Vous pouvez trouver les tailles décompressées de chaque sous-ensemble dans le tableau 1 du papier de <a href="https://arxiv.org/abs/2101.00027" rel="nofollow"><em>The Pile</em></a>.</p></div> <p data-svelte-h="svelte-17s56pe">Si vous êtes familier avec Pandas, ce résultat pourrait surprendre en raison de la célèbre <a href="https://wesmckinney.com/blog/apache-arrow-pandas-internals/" rel="nofollow">règle d’or</a> de Wes Kinney selon laquelle vous avez généralement besoin de 5 à 10 fois plus de RAM que la taille de votre jeu de données. Alors, comment 🤗 <em>Datasets</em> résout-il ce problème de gestion de la mémoire ? 🤗 <em>Datasets</em> traite chaque jeu de données comme un <a href="https://en.wikipedia.org/wiki/Memory-mapped_file" rel="nofollow">fichier mappé en mémoire</a>. Cela fournit un mappage entre la RAM et le stockage du système de fichiers permettant à la bibliothèque d’accéder et d’opérer sur des éléments du jeu de données sans avoir besoin de le charger entièrement en mémoire.</p> <p data-svelte-h="svelte-q5ckcr">Les fichiers mappés en mémoire peuvent également être partagés entre plusieurs processus ce qui permet de paralléliser des méthodes telles que <code>Dataset.map()</code> sans avoir à déplacer ou copier le jeu de données. Sous le capot, ces capacités sont toutes réalisées par le format de mémoire <a href="https://arrow.apache.org" rel="nofollow">Apache Arrow</a> et <a href="https://arrow.apache.org/docs/python/index.html" rel="nofollow"><code>pyarrow</code></a>, qui accélèrent le chargement et le traitement des données. (Pour plus de détails sur Apache Arrow et les comparaisons avec Pandas, consultez <a href="https://towardsdatascience.com/apache-arrow-read-dataframe-with-zero-memory-69634092b1a" rel="nofollow">l’article de blog de Dejan Simic</a>). Pour voir ceci en action, effectuons un petit test de vitesse en itérant sur tous les éléments du jeu de données <em>PubMed Abstracts</em> :</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> timeit | |
| code_snippet = <span class="hljs-string">"""batch_size = 1000 | |
| for idx in range(0, len(pubmed_dataset), batch_size): | |
| _ = pubmed_dataset[idx:idx + batch_size] | |
| """</span> | |
| time = timeit.timeit(stmt=code_snippet, number=<span class="hljs-number">1</span>, <span class="hljs-built_in">globals</span>=<span class="hljs-built_in">globals</span>()) | |
| <span class="hljs-built_in">print</span>( | |
| <span class="hljs-string">f"Iterated over <span class="hljs-subst">{<span class="hljs-built_in">len</span>(pubmed_dataset)}</span> examples (about <span class="hljs-subst">{size_gb:<span class="hljs-number">.1</span>f}</span> GB) in "</span> | |
| <span class="hljs-string">f"<span class="hljs-subst">{time:<span class="hljs-number">.1</span>f}</span>s, i.e. <span class="hljs-subst">{size_gb/time:<span class="hljs-number">.3</span>f}</span> GB/s"</span> | |
| )<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-string">'Iterated over 15518009 examples (about 19.5 GB) in 64.2s, i.e. 0.304 GB/s'</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1qf5v39">Ici, nous avons utilisé le module <code>timeit</code> de Python pour mesurer le temps d’exécution pris par <code>code_snippet</code>. Vous pourrez généralement itérer sur un jeu de données à une vitesse de quelques dixièmes de Go/s à plusieurs Go/s. Cela fonctionne très bien pour la grande majorité des applications, mais vous devrez parfois travailler avec un jeu de données trop volumineux pour être même stocké sur le disque dur de votre ordinateur portable. Par exemple, si nous essayions de télécharger <em>The Pile</em> dans son intégralité, nous aurions besoin de 825 Go d’espace disque libre ! Pour gérer ces cas, 🤗 <em>Datasets</em> fournit une fonctionnalité de streaming qui nous permet de télécharger et d’accéder aux éléments à la volée, sans avoir besoin de télécharger l’intégralité du jeu de données. Voyons comment cela fonctionne.</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-38kpfm">💡 Dans les <em>notebooks</em> Jupyter, vous pouvez également chronométrer les cellules à l’aide de la fonction magique <a href="https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-timeit" rel="nofollow"><code>%%timeit</code></a>.</p></div> <h2 class="relative group"><a id="jeux-de-données-en-continu" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#jeux-de-données-en-continu"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Jeux de données en continu</span></h2> <p data-svelte-h="svelte-qgureo">Pour activer le streaming du jeu de données, il vous suffit de passer l’argument <code>streaming=True</code> à la fonction <code>load_dataset()</code>. Par exemple, chargeons à nouveau le jeu de données <em>PubMed Abstracts</em> mais en mode streaming :</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->pubmed_dataset_streamed = load_dataset( | |
| <span class="hljs-string">"json"</span>, data_files=data_files, split=<span class="hljs-string">"train"</span>, streaming=<span class="hljs-literal">True</span> | |
| )<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-ooenp6">Au lieu du familier <code>Dataset</code> que nous avons rencontré ailleurs dans ce chapitre, l’objet retourné avec <code>streaming=True</code> est un <code>IterableDataset</code>. Comme son nom l’indique, pour accéder aux éléments d’un <code>IterableDataset</code>, nous devons parcourir celui-ci. Nous pouvons accéder au premier élément de notre jeu de données diffusé comme suit :</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-built_in">next</span>(<span class="hljs-built_in">iter</span>(pubmed_dataset_streamed))<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->{<span class="hljs-string">'meta'</span>: {<span class="hljs-string">'pmid'</span>: <span class="hljs-number">11409574</span>, <span class="hljs-string">'language'</span>: <span class="hljs-string">'eng'</span>}, | |
| <span class="hljs-string">'text'</span>: <span class="hljs-string">'Epidemiology of hypoxaemia in children with acute lower respiratory infection.\nTo determine the prevalence of hypoxaemia in children aged under 5 years suffering acute lower respiratory infections (ALRI), the risk factors for hypoxaemia in children under 5 years of age with ALRI, and the association of hypoxaemia with an increased risk of dying in children of the same age ...'</span>}<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1aulibg">Les éléments d’un jeu de données diffusé en continu peuvent être traités à la volée à l’aide de <code>IterableDataset.map()</code>, ce qui est utile pendant l’entraînement si vous avez besoin de tokeniser les entrées. Le processus est exactement le même que celui que nous avons utilisé pour tokeniser notre jeu de données dans le <a href="/course/fr/chapter3">chapitre 3</a>, à la seule différence que les sorties sont renvoyées une par une :</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> AutoTokenizer | |
| tokenizer = AutoTokenizer.from_pretrained(<span class="hljs-string">"distilbert-base-uncased"</span>) | |
| tokenized_dataset = pubmed_dataset_streamed.<span class="hljs-built_in">map</span>(<span class="hljs-keyword">lambda</span> x: tokenizer(x[<span class="hljs-string">"text"</span>])) | |
| <span class="hljs-built_in">next</span>(<span class="hljs-built_in">iter</span>(tokenized_dataset))<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->{<span class="hljs-string">'input_ids'</span>: [<span class="hljs-number">101</span>, <span class="hljs-number">4958</span>, <span class="hljs-number">5178</span>, <span class="hljs-number">4328</span>, <span class="hljs-number">6779</span>, ...], <span class="hljs-string">'attention_mask'</span>: [<span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, ...]}<!-- HTML_TAG_END --></pre></div> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-1ej2rqj">💡 Pour accélérer la tokenisation avec le streaming, vous pouvez passer <code>batched=True</code>, comme nous l’avons vu dans la dernière section. Il traitera les exemples batch par batch. La taille de batch par défaut est de 1 000 et peut être spécifiée avec l’argument <code>batch_size</code>.</p></div> <p data-svelte-h="svelte-1936jdb">Vous pouvez également mélanger un jeu de données diffusé en continu à l’aide de <code>IterableDataset.shuffle()</code>, mais contrairement à <code>Dataset.shuffle()</code>, cela ne mélange que les éléments dans un <code>buffer_size</code> prédéfini :</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->shuffled_dataset = pubmed_dataset_streamed.shuffle(buffer_size=<span class="hljs-number">10_000</span>, seed=<span class="hljs-number">42</span>) | |
| <span class="hljs-built_in">next</span>(<span class="hljs-built_in">iter</span>(shuffled_dataset))<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->{<span class="hljs-string">'meta'</span>: {<span class="hljs-string">'pmid'</span>: <span class="hljs-number">11410799</span>, <span class="hljs-string">'language'</span>: <span class="hljs-string">'eng'</span>}, | |
| <span class="hljs-string">'text'</span>: <span class="hljs-string">'Randomized study of dose or schedule modification of granulocyte colony-stimulating factor in platinum-based chemotherapy for elderly patients with lung cancer ...'</span> | |
| <span class="hljs-comment"># Étude randomisée sur la modification de la dose ou du calendrier d'administration du facteur de stimulation des colonies de granulocytes dans le cadre d'une chimiothérapie à base de platine chez les patients âgés atteints de cancer du poumon ...</span> | |
| }<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-13iarxl">Dans cet exemple, nous avons sélectionné un exemple aléatoire parmi les 10 000 premiers exemples du tampon. Une fois qu’un exemple est accédé, sa place dans le tampon est remplie avec l’exemple suivant dans le corpus (c’est-à-dire le 10 001e exemple dans le cas ci-dessus). Vous pouvez également sélectionner des éléments d’un jeu de données diffusé en continu à l’aide des fonctions <code>IterableDataset.take()</code> et <code>IterableDataset.skip()</code>, qui agissent de la même manière que <code>Dataset.select()</code>. Par exemple, pour sélectionner les 5 premiers exemples dans le jeu de données <em>PubMed Abstracts</em>, nous pouvons procéder comme suit :</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->dataset_head = pubmed_dataset_streamed.take(<span class="hljs-number">5</span>) | |
| <span class="hljs-built_in">list</span>(dataset_head)<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->[{<span class="hljs-string">'meta'</span>: {<span class="hljs-string">'pmid'</span>: <span class="hljs-number">11409574</span>, <span class="hljs-string">'language'</span>: <span class="hljs-string">'eng'</span>}, | |
| <span class="hljs-string">'text'</span>: <span class="hljs-string">'Epidemiology of hypoxaemia in children with acute lower respiratory infection ...'</span> | |
| <span class="hljs-comment"># Épidémiologie de l'hypoxémie chez les enfants atteints d'une infection aiguë des voies respiratoires inférieures ...},</span> | |
| {<span class="hljs-string">'meta'</span>: {<span class="hljs-string">'pmid'</span>: <span class="hljs-number">11409575</span>, <span class="hljs-string">'language'</span>: <span class="hljs-string">'eng'</span>}, | |
| <span class="hljs-string">'text'</span>: <span class="hljs-string">'Clinical signs of hypoxaemia in children with acute lower respiratory infection: indicators of oxygen therapy ...'</span> | |
| <span class="hljs-comment"># Signes cliniques d'hypoxémie chez les enfants atteints d'une infection aiguë des voies respiratoires inférieures : indicateurs de l'oxygénothérapie ...},</span> | |
| {<span class="hljs-string">'meta'</span>: {<span class="hljs-string">'pmid'</span>: <span class="hljs-number">11409576</span>, <span class="hljs-string">'language'</span>: <span class="hljs-string">'eng'</span>}, | |
| <span class="hljs-string">'text'</span>: <span class="hljs-string">"Hypoxaemia in children with severe pneumonia in Papua New Guinea ..."</span> | |
| <span class="hljs-comment"># Hypoxémie chez les enfants atteints de pneumonie grave en Papouasie-Nouvelle-Guinée ...},</span> | |
| {<span class="hljs-string">'meta'</span>: {<span class="hljs-string">'pmid'</span>: <span class="hljs-number">11409577</span>, <span class="hljs-string">'language'</span>: <span class="hljs-string">'eng'</span>}, | |
| <span class="hljs-string">'text'</span>: <span class="hljs-string">'Oxygen concentrators and cylinders ...'</span> | |
| <span class="hljs-comment"># Concentrateurs et bouteilles d'oxygène...},</span> | |
| {<span class="hljs-string">'meta'</span>: {<span class="hljs-string">'pmid'</span>: <span class="hljs-number">11409578</span>, <span class="hljs-string">'language'</span>: <span class="hljs-string">'eng'</span>}, | |
| <span class="hljs-string">'text'</span>: <span class="hljs-string">'Oxygen supply in rural africa: a personal experience ...'</span> | |
| <span class="hljs-comment"># L'approvisionnement en oxygène dans les zones rurales africaines : une expérience personnelle ...}]</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1axizdx">De même, vous pouvez utiliser la fonction <code>IterableDataset.skip()</code> pour créer des fractionnements d’entraînement et de validation à partir d’un jeu de données mélangé comme suit :</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-comment"># Ignorer les 1 000 premiers exemples et inclure le reste dans l'ensemble d'apprentissage.</span> | |
| train_dataset = shuffled_dataset.skip(<span class="hljs-number">1000</span>) | |
| <span class="hljs-comment"># Prendre les 1 000 premiers exemples pour l'ensemble de validation.</span> | |
| validation_dataset = shuffled_dataset.take(<span class="hljs-number">1000</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1303a2x">Terminons notre exploration du streaming des jeux de données avec une application commune : combiner plusieurs jeux de données pour créer un seul corpus. 🤗 <em>Datasets</em> fournit une fonction <code>interleave_datasets()</code> qui convertit une liste d’objets <code>IterableDataset</code> en un seul <code>IterableDataset</code>, où les éléments du nouveau jeu de données sont obtenus en alternant entre les exemples source. Cette fonction est particulièrement utile lorsque vous essayez de combiner de grands jeux de données. Par exemple, streamons FreeLaw, un sous-ensemble de <em>The Pile</em> et qui est un jeu de données de 51 Go d’avis juridiques de tribunaux américains :</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->law_dataset_streamed = load_dataset( | |
| <span class="hljs-string">"json"</span>, | |
| data_files=<span class="hljs-string">"https://the-eye.eu/public/AI/pile_preliminary_components/FreeLaw_Opinions.jsonl.zst"</span>, | |
| split=<span class="hljs-string">"train"</span>, | |
| streaming=<span class="hljs-literal">True</span>, | |
| ) | |
| <span class="hljs-built_in">next</span>(<span class="hljs-built_in">iter</span>(law_dataset_streamed))<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->{<span class="hljs-string">'meta'</span>: {<span class="hljs-string">'case_ID'</span>: <span class="hljs-string">'110921.json'</span>, | |
| <span class="hljs-string">'case_jurisdiction'</span>: <span class="hljs-string">'scotus.tar.gz'</span>, | |
| <span class="hljs-string">'date_created'</span>: <span class="hljs-string">'2010-04-28T17:12:49Z'</span>}, | |
| <span class="hljs-string">'text'</span>: <span class="hljs-string">'\n461 U.S. 238 (1983)\nOLIM ET AL.\nv.\nWAKINEKONA\nNo. 81-1581.\nSupreme Court of United States.\nArgued January 19, 1983.\nDecided April 26, 1983.\nCERTIORARI TO THE UNITED STATES COURT OF APPEALS FOR THE NINTH CIRCUIT\n*239 Michael A. Lilly, First Deputy Attorney General of Hawaii, argued the cause for petitioners. With him on the brief was James H. Dannenberg, Deputy Attorney General...'</span>}<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-o783i4">Ce jeu de données est suffisamment volumineux pour solliciter la RAM de la plupart des ordinateurs portables, mais nous avons pu le charger et y accéder sans transpirer ! Combinons maintenant les jeux de données FreeLaw et <em>PubMed Abstracts</em> avec la fonction <code>interleave_datasets()</code> :</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> itertools <span class="hljs-keyword">import</span> islice | |
| <span class="hljs-keyword">from</span> datasets <span class="hljs-keyword">import</span> interleave_datasets | |
| combined_dataset = interleave_datasets([pubmed_dataset_streamed, law_dataset_streamed]) | |
| <span class="hljs-built_in">list</span>(islice(combined_dataset, <span class="hljs-number">2</span>))<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->[{<span class="hljs-string">'meta'</span>: {<span class="hljs-string">'pmid'</span>: <span class="hljs-number">11409574</span>, <span class="hljs-string">'language'</span>: <span class="hljs-string">'eng'</span>}, | |
| <span class="hljs-string">'text'</span>: <span class="hljs-string">'Epidemiology of hypoxaemia in children with acute lower respiratory infection ...'</span>}, | |
| {<span class="hljs-string">'meta'</span>: {<span class="hljs-string">'case_ID'</span>: <span class="hljs-string">'110921.json'</span>, | |
| <span class="hljs-string">'case_jurisdiction'</span>: <span class="hljs-string">'scotus.tar.gz'</span>, | |
| <span class="hljs-string">'date_created'</span>: <span class="hljs-string">'2010-04-28T17:12:49Z'</span>}, | |
| <span class="hljs-string">'text'</span>: <span class="hljs-string">'\n461 U.S. 238 (1983)\nOLIM ET AL.\nv.\nWAKINEKONA\nNo. 81-1581.\nSupreme Court of United States.\nArgued January 19, 1983.\nDecided April 26, 1983.\nCERTIORARI TO THE UNITED STATES COURT OF APPEALS FOR THE NINTH CIRCUIT\n*239 Michael A. Lilly, First Deputy Attorney General of Hawaii, argued the cause for petitioners. With him on the brief was James H. Dannenberg, Deputy Attorney General...'</span>}]<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-tlls1x">Ici, nous avons utilisé la fonction <code>islice()</code> du module <code>itertools</code> de Python pour sélectionner les deux premiers exemples du jeu de données combiné. Nous pouvons voir qu’ils correspondent aux premiers exemples de chacun des deux jeux de données source.</p> <p data-svelte-h="svelte-1lk2l6b">Enfin, si vous souhaitez streamer <em>The Pile</em> dans son intégralité de 825 Go, vous pouvez récupérer tous les fichiers préparés comme suit :</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->base_url = <span class="hljs-string">"https://the-eye.eu/public/AI/pile/"</span> | |
| data_files = { | |
| <span class="hljs-string">"train"</span>: [base_url + <span class="hljs-string">"train/"</span> + <span class="hljs-string">f"<span class="hljs-subst">{idx:02d}</span>.jsonl.zst"</span> <span class="hljs-keyword">for</span> idx <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(<span class="hljs-number">30</span>)], | |
| <span class="hljs-string">"validation"</span>: base_url + <span class="hljs-string">"val.jsonl.zst"</span>, | |
| <span class="hljs-string">"test"</span>: base_url + <span class="hljs-string">"test.jsonl.zst"</span>, | |
| } | |
| pile_dataset = load_dataset(<span class="hljs-string">"json"</span>, data_files=data_files, streaming=<span class="hljs-literal">True</span>) | |
| <span class="hljs-built_in">next</span>(<span class="hljs-built_in">iter</span>(pile_dataset[<span class="hljs-string">"train"</span>]))<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->{<span class="hljs-string">'meta'</span>: {<span class="hljs-string">'pile_set_name'</span>: <span class="hljs-string">'Pile-CC'</span>}, | |
| <span class="hljs-string">'text'</span>: <span class="hljs-string">'It is done, and submitted. You can play “Survival of the Tastiest” on Android, and on the web...'</span>}<!-- HTML_TAG_END --></pre></div> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-of2k18">✏️ <strong>Essayez !</strong> Utilisez l’un des grands corpus Common Crawl comme <a href="https://huggingface.co/datasets/mc4" rel="nofollow"><code>mc4</code></a> ou <a href="https://huggingface.co/datasets/oscar" rel="nofollow"><code>oscar</code></a> pour créer en streaming un jeu de données multilingue représentant les proportions de langues parlées dans un pays de votre choix. Par exemple, les quatre langues nationales en Suisse sont l’allemand, le français, l’italien et le romanche. Vous pouvez donc essayer de créer un corpus suisse en échantillonnant les sous-ensembles Oscar en fonction de leur proportion parlée.</p></div> <p data-svelte-h="svelte-9kylg4">Vous disposez maintenant de tous les outils dont vous avez besoin pour charger et traiter des jeux de données de toutes formes et tailles. Cependant à moins que vous ne soyez exceptionnellement chanceux, il arrivera un moment dans votre cheminement en traitement du langage naturel où vous devrez réellement créer un jeu de données pour résoudre un problème donné. C’est le sujet de la section suivante !</p> <a class="!text-gray-400 !no-underline text-sm flex items-center not-prose mt-4" href="https://github.com/huggingface/course/blob/main/chapters/fr/chapter5/4.mdx" target="_blank"><span data-svelte-h="svelte-1kd6by1"><</span> <span data-svelte-h="svelte-x0xyl0">></span> <span data-svelte-h="svelte-1dajgef"><span class="underline ml-1.5">Update</span> on GitHub</span></a> <p></p> | |
| <script> | |
| { | |
| __sveltekit_1sfisyd = { | |
| assets: "/docs/course/pr_1069/fr", | |
| base: "/docs/course/pr_1069/fr", | |
| env: {} | |
| }; | |
| const element = document.currentScript.parentElement; | |
| const data = [null,null]; | |
| Promise.all([ | |
| import("/docs/course/pr_1069/fr/_app/immutable/entry/start.cea6db46.js"), | |
| import("/docs/course/pr_1069/fr/_app/immutable/entry/app.3f6640b1.js") | |
| ]).then(([kit, app]) => { | |
| kit.start(app, element, { | |
| node_ids: [0, 37], | |
| data, | |
| form: null, | |
| error: null | |
| }); | |
| }); | |
| } | |
| </script> | |
Xet Storage Details
- Size:
- 76.4 kB
- Xet hash:
- ef70e2f9220e26eb607109d934c77cea508e971b6dab926d8362637f03bc44fb
·
Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.